## Name: Sample weights based on difficulty
### Date: 12/6/2025
### Status: Better scores than RF on 20/33 datasets, worse on 10/33, and the same on the remaining 3, using the same learning params. Considerably more run time needed, though.
### Idea: 

1) First fit an RF classifier on the train data (let's say with fixed n_estimators=n=100, max_depth=d=5). This classifier contains n trees, each of which has at most 2**d leaves.

2) Then use this RF classifier to embed each training (and, at inference, test) sample by passing the sample down each tree and recording which leaf node the sample lands on. For each tree this gives a number from 1 to 2**d identifying the leaf (under some ordering), so in total each sample's embedding consists of n such numbers (or an n x 2**d matrix if we one-hot encode them, but I don't think that is needed given the ordering proposed in step 3).

3) To fix a specific order of the leaves in each tree, we sort them by the class composition of the training samples that fall in each leaf (based on fitting on the train set): leaf 1 is the one with the highest fraction of class 0 samples, leaf 2 the one with the second highest, and so on, while leaf 2**d is the one most populated by class 1 samples. With this order in place, the leaf index a sample is assigned to has a relative meaning: 1 means most probably class 0, while 2**d means most probably class 1, according to the tree at hand. Obviously this process must be done for each tree separately.

4) With steps 2 and 3 in place, each train sample is embedded as a 1xn vector with values ranging from 1 to 2**d, where each entry indicates how probable class 1 is according to the corresponding tree.

5) On top of that embedding I want an NN: take a SOTA MLP architecture that receives the embedded samples as input (possibly concatenated with the original features) and fit it using torch.
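
Steps 1 to 4 could be sketched roughly as follows with scikit-learn. This is a minimal sketch under assumptions, not the `TreeNetClassifier` implementation: the helper names `fit_leaf_rank_maps` and `embed` are hypothetical, the data is synthetic, leaves are ordered by the class 1 fraction stored in each fitted tree, and a tree with fewer than 2**d leaves only uses ranks up to its own leaf count.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_leaf_rank_maps(forest):
    """Hypothetical helper: per tree, map leaf node id -> rank.

    Rank 1 is the leaf with the lowest class 1 fraction (most probably class 0);
    the highest rank is the leaf with the highest class 1 fraction.
    """
    rank_maps = []
    for est in forest.estimators_:
        tree = est.tree_
        leaf_ids = np.flatnonzero(tree.children_left == -1)  # -1 marks a leaf
        # tree.value holds per-class counts or fractions depending on the
        # scikit-learn version; normalising covers both cases.
        class_stats = tree.value[leaf_ids, 0, :]
        p1 = class_stats[:, 1] / class_stats.sum(axis=1)
        order = np.argsort(p1, kind="stable")
        ranks = np.empty(len(leaf_ids), dtype=int)
        ranks[order] = np.arange(1, len(leaf_ids) + 1)
        rank_maps.append(dict(zip(leaf_ids.tolist(), ranks.tolist())))
    return rank_maps


def embed(forest, rank_maps, X):
    """Hypothetical helper: embed samples as a (n_samples, n_trees) rank matrix."""
    leaves = forest.apply(X)  # leaf node id reached in every tree
    emb = np.empty(leaves.shape, dtype=np.int64)
    for t, mapping in enumerate(rank_maps):
        emb[:, t] = [mapping[leaf] for leaf in leaves[:, t]]
    return emb


# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0).fit(X, y)
rank_maps = fit_leaf_rank_maps(rf)  # built once, from the fitted forest
Z = embed(rf, rank_maps, X)         # the same maps are reused for test samples
print(Z.shape)
```

The rank maps are derived from the fitted forest only, so test samples are embedded with the ordering learned on the train set. Step 5 (the MLP on top of `Z`, optionally concatenated with `X`) is not covered here.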


### Results:
Run versus RF with the same number of trees (100) and max_depth fixed to None.

Tree_NN with concatenation was better on 20/33 datasets, worse on 10/33, and the same on the remaining 3.

So we have **better performance on average, at roughly 7 times the run time though (about 3.4 seconds vs 23.0 seconds on average)**



### Comments on results:





In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import cached_path
from pmlb import fetch_data
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.model_selection import StratifiedKFold, cross_val_predict
import time
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
from scipy.special import softmax
from sklearn.base import clone
from tree_embedding_nn import TreeNetClassifier
import pytorch_lightning as pl


random_state = 42
pl.seed_everything(random_state)



path_to_data_summary = "https://raw.githubusercontent.com/EpistasisLab/pmlb/master/pmlb/all_summary_stats.tsv"
dataset_df = pd.read_csv(cached_path.cached_path(path_to_data_summary), sep="\t")

classification_datasets = dataset_df[
    # (dataset_df["n_binary_features"] == dataset_df["n_features"])
    (dataset_df["task"] == "classification")
    & (dataset_df["n_classes"] == 2)
    & (dataset_df["n_features"] <= 150)
    # & (dataset_df["n_features"] >= 10)
    & (dataset_df["n_instances"] <= 1000)
]["dataset"][:]

print(len(classification_datasets))

models = {
    "Baseline": {},
    "Tree_NN": {'concat':False},
    "Tree_NN_Concat": {'concat':True},

}


number_of_cv_folds = 5
num_estimators = 100
max_depth = 10

cv = StratifiedKFold(number_of_cv_folds, random_state=random_state, shuffle=True)
base_class = RandomForestClassifier(n_estimators=num_estimators, max_depth=max_depth, random_state=random_state)
# Alternative baseline: DecisionTreeClassifier(max_depth=None, random_state=random_state)

res = [] 
for dataset_index, classification_dataset in enumerate(classification_datasets[::-1][:2]):
    
    print(f"{classification_dataset} ({dataset_index + 1}/{len(classification_datasets) + 1})")
    if 'deprecated' in classification_dataset:
        print(f"Skipping {classification_dataset} as deprecated from PMLB...")
        continue
    try:
        X, y = fetch_data(classification_dataset, return_X_y=True)
    except ValueError as e:
        print(f'Probably not found dataset {classification_dataset} in PMLB and skipping...\n {e}')
        continue
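    if y.max() != 1 or y.min() != 0:
        # Remap labels to 0..k-1 in a single step; relabeling in place one value
        # at a time can merge classes (e.g. labels {-1, 0} would all end up as 1).
        y = np.unique(y, return_inverse=True)[1]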
        
    imb_ratio = np.bincount(y).max() / np.bincount(y).min()
    print(f"{X.shape} with ratio : {imb_ratio:.4f}\n")
    

    for model_name, model_kwargs in models.items():
        y_pred = np.empty_like(y)
        sample_weights = None
        time_s = time.time()
        for train_indices, test_indices in cv.split(X,y):
            X_train, y_train = X[train_indices], y[train_indices]
            X_test, y_test = X[test_indices], y[test_indices]
            
            X_train_filtered = X_train.copy()
            y_train_filtered = y_train.copy()
            if model_name.startswith("Tree_NN"):
                clf = TreeNetClassifier(
                    n_estimators=num_estimators,
                    max_depth=max_depth,
                    mlp_hidden_dims=[32, 16],
                    lr=0.005,
                    epochs=100,
                    patience=3,
                    batch_size=256,
                    check_val_every_n_epoch = 5,
                    concat_original_features=model_kwargs['concat'],  # Try with False to see the difference
                    device="auto",
                )
            else:
                clf = clone(base_class)
            #print(model_name, X_train_filtered.shape[0])
            clf.fit(X_train_filtered , y_train_filtered)
            y_pred_cur = clf.predict(X_test)

            y_pred[test_indices] = y_pred_cur
            #print(f'TRUE', y_test)
            
        
        
        acc = accuracy_score(y, y_pred)
        (prec, rec, f1, sup) = precision_recall_fscore_support(
            y, y_pred, average="binary"
        )
            
        
        print(model_name)    
        print(classification_report(y, y_pred))
        time_end = time.time() - time_s

        res.append((classification_dataset, imb_ratio, model_name, time_end, acc, prec, rec, f1))
        
res = pd.DataFrame(res, columns=['dataset', 'dataset_class_imb', 'model', 'time', 'acc', 'pr', 'rec', 'f1'])

# Step 2: Sort each group by 'f1'
sorted_df = res.groupby('dataset').apply(lambda x: x.sort_values(by='f1', ascending=False)).reset_index(drop=True)

# Step 3: Assign ranks within each group
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Step 4: Calculate mean rank for each model across all datasets
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

print(mean_ranks)
            

Seed set to 42


73
xd6 (1/74)
(973, 9) with ratio : 2.0217



Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Baseline
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       651
           1       1.00      1.00      1.00       322

    accuracy                           1.00       973
   macro avg       1.00      1.00      1.00       973
weighted avg       1.00      1.00      1.00       973

--- Fitting Random Forest ---
--- Creating Leaf Probability Embeddings ---
--- Transforming Data with Forest ---
--- Fitting MLP (Input Dim: 100) ---


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]



In [9]:
import pandas as pd
import numpy as np

res = pd.read_csv("./results/tree_emb_res_depth_None.csv")
print(f'# of unique datasets: {res["dataset"].unique().shape[0]}')
# Step 2: Sort each group by 'f1'
sorted_df = res.groupby('dataset').apply(lambda x: x.sort_values(by='f1', ascending=False)).reset_index(drop=True)

# Step 3: Assign ranks within each group
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Step 4: Calculate mean rank for each model across all datasets
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

print(mean_ranks)

# of unique datasets: 33
            model      rank
2  Tree_NN_Concat  1.636364
0        Baseline  2.060606
1         Tree_NN  2.303030
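
Because `cumcount` hands out consecutive ranks in sort order, models with identical F1 on a dataset still receive different ranks. A tie-aware variant could look like the sketch below, which uses pandas' `GroupBy.rank` with `method="average"`. The `toy` frame and its values are hypothetical and only mimic the column layout of `res`.

```python
import pandas as pd

# Hypothetical toy results; the F1 values are made up for illustration only.
toy = pd.DataFrame({
    "dataset": ["a", "a", "a", "b", "b", "b"],
    "model": ["Baseline", "Tree_NN", "Tree_NN_Concat"] * 2,
    "f1": [1.0, 1.0, 0.9, 0.7, 0.8, 0.9],
})

# Rank within each dataset, best F1 first; tied models share the average rank.
toy["rank"] = toy.groupby("dataset")["f1"].rank(ascending=False, method="average")

# Mean rank per model across datasets, best (lowest) first.
mean_ranks = toy.groupby("model")["rank"].mean().sort_values()
print(mean_ranks)
```

With this variant the two models tied on dataset `a` both get the same rank, so neither is favoured by row order.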




In [10]:
res.groupby('model')['time'].agg('mean')

model
Baseline           3.399176
Tree_NN           19.656898
Tree_NN_Concat    23.027129
Name: time, dtype: float64

In [11]:
model_names = res['model'].unique()
wins_score = np.zeros((len(model_names), len(model_names)))
metric_to_score = 'f1'
for classification_dataset in res['dataset'].unique():
    cur_df = res[res['dataset'] == classification_dataset]
    # print(classification_dataset)
    # print(cur_df.sort_values('f1', ascending=False)[['model', 'time', 'acc', 'f1']])
    # print()
    cur_df = cur_df.set_index('model')
    score_metric = cur_df[metric_to_score]
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names[i:]):
            if cur_df.loc[m1][metric_to_score] > cur_df.loc[m2][metric_to_score]:
                wins_score[i, j+i] += 1
            elif cur_df.loc[m1][metric_to_score] < cur_df.loc[m2][metric_to_score]:
                wins_score[j+i, i] += 1
            else:
                pass
order_of_models = wins_score.mean(axis=1).argsort()[::-1]
wins_score = wins_score[order_of_models, :][:, order_of_models]
# Uncomment this for percentage wins
# wins_score /= res['dataset'].unique().shape[0]
print('WINS')
print(pd.DataFrame(wins_score, columns = model_names[order_of_models], index=model_names[order_of_models]))

WINS
                Tree_NN_Concat  Baseline  Tree_NN
Tree_NN_Concat             0.0      20.0     25.0
Baseline                  10.0       0.0     17.0
Tree_NN                    7.0      15.0      0.0
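
The matrix only counts strict wins, so ties have to be inferred. A small sketch of that arithmetic, assuming every one of the 33 datasets has a result for all three models: ties for a pair are the datasets left over after subtracting wins in both directions. The numbers are copied from the printed matrix.

```python
import numpy as np
import pandas as pd

names = ["Tree_NN_Concat", "Baseline", "Tree_NN"]
# Strict wins of the row model over the column model (from the output above)
wins = np.array([
    [0, 20, 25],
    [10, 0, 17],
    [7, 15, 0],
])
n_datasets = 33  # assumption: all models were scored on every dataset

# Ties per pair = datasets - wins(row over col) - wins(col over row)
ties = n_datasets - wins - wins.T
np.fill_diagonal(ties, 0)  # a model compared with itself is not a tie
ties = pd.DataFrame(ties, index=names, columns=names)
print(ties)
```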
