## Name: Sample weights based on difficulty
### Date: 10/02/2025
### Status: Pending results
### Idea: 
If we have a way of scoring the samples from "easy-to-learn" to "hard-to-learn", can we use this with a learning strategy to improve generalization?

The implicit assumption is probably that some of the "hard" to classify samples are either noise or would "bend the model out of shape" to accommodate them (I'm not sure either is true).

There are 2 major components in this:
1. A scoring mechanism on the difficulty of the samples.
2. A strategy that, given the score of each sample, leads to a better learner.


I've worked out the following for the above points:

##### Difficulty scoring

A simple solution is to fit decision trees with incrementally increasing depth and check which samples are predicted correctly at each depth.
The final score for each sample is the lowest depth at which it was correctly classified (i.e. a low depth means a simple model suffices, so a low score).
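
A minimal sketch of this scoring, assuming scikit-learn's `DecisionTreeClassifier` and a synthetic dataset from `make_classification`. The function name `depth_difficulty_scores` and the `max_depth=30` cap are illustrative choices, not the settings used in the experiment:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def depth_difficulty_scores(X, y, max_depth=30, random_state=42):
    """Return, per sample, the lowest tree depth that classifies it correctly.

    Samples never classified correctly up to max_depth get max_depth + 1.
    """
    scores = np.full(len(X), max_depth + 1, dtype=int)
    for depth in range(1, max_depth + 1):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=random_state)
        correct = clf.fit(X, y).predict(X) == y
        # Only fill in samples that were not already solved at a lower depth.
        newly_solved = correct & (scores > max_depth)
        scores[newly_solved] = depth
        if (scores <= max_depth).all():
            break  # every sample has a score, deeper trees add nothing
    return scores


# Toy data (assumption: any binary classification dataset would do).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scores = depth_difficulty_scores(X, y)
print(np.bincount(scores))  # number of samples first solved at each depth
```

The histogram printed at the end gives a quick view of how the samples spread from easy (low depth) to hard (high depth).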


Interestingly enough, I've tried fitting a DT with arbitrary depth and then counting the path length for each sample, and it is somewhat inversely correlated with the above score
(i.e. a small path length is usually found on the previous method's high-depth samples).
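
One way to obtain the per-sample path length is scikit-learn's `decision_path`, sketched here on synthetic data. The dataset and variable names are assumptions for illustration; this is not necessarily how the original check was done:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Unrestricted tree: grows until the training data is fit.
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X, y)

# decision_path returns a sparse (n_samples, n_nodes) indicator matrix;
# each row marks the nodes a sample visits, root and leaf included.
node_indicator = tree.decision_path(X)

# Count edges rather than nodes, hence the "- 1".
path_lengths = np.asarray(node_indicator.sum(axis=1)).ravel() - 1

print(path_lengths.min(), path_lengths.max(), tree.get_depth())
```

The resulting `path_lengths` array can then be compared against the depth-based score, for example with a rank correlation.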


##### Score-based strategy

Tried out a few things by hand. Focused on the following:

1. *Curriculum learning - Hard (CRH)*: Keep only the easiest k% of the training data (according to the score). This is closest to the assumption that difficult samples are mainly noise.
2. *Curriculum learning - Soft (CRS)*: Sample k% of the dataset with probability proportional to the inverse score (i.e. high probability for low scores, the easy samples). This is the relaxed version of the above.
3. *Weighted learning - Easy (WLE)*: Weight the samples during learning according to the inverse of their scores (i.e. high weights for low scores, the easy samples). The idea is to focus on the easy samples.
4. *Weighted learning - Hard (WLH)*: Weight the samples during learning according to their scores (i.e. high weights for high scores, the hard samples). The idea is to focus on the hard samples.
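
The four strategies can be sketched on a toy vector. This assumes the difficulty score has already been flipped into an "easiness" value in (0, 1), where higher means easier; the values, the seeded generator and `k = 0.8` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

# Hypothetical easiness values in (0, 1): higher means easier to learn.
easiness = np.array([0.9, 0.8, 0.8, 0.6, 0.5, 0.5, 0.3, 0.2, 0.1, 0.05])
k = 0.8
n_keep = int(k * len(easiness))

# CRH: keep the n_keep easiest samples deterministically.
crh_idx = np.argsort(easiness)[::-1][:n_keep]

# CRS: draw n_keep samples without replacement, easy samples being more likely.
p = easiness / easiness.sum()
crs_idx = rng.choice(len(easiness), size=n_keep, replace=False, p=p)

# WLE: weights proportional to easiness (emphasis on easy samples).
wle_weights = p

# WLH: complement of the normalized easiness (emphasis on hard samples).
wlh_weights = 1.0 - p

print(sorted(crh_idx.tolist()))
print(wle_weights.max() / wle_weights.min())
print(wlh_weights.max() / wlh_weights.min())
```

One thing this makes visible: because `p` sums to 1, every `1 - p` stays close to 1, so WLH weights defined as a complement are much flatter than WLE weights. In this toy case the max/min weight ratio is about 1.2 for WLH against 18 for WLE, and the gap widens as the number of samples grows.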


So the overall workflow is:
1. Generate train split
2. Fit DTs of incrementally increasing depth on the train data and keep track of the shallowest DT that correctly classifies each sample.
3. Use the selected strategy to sample the dataset or generate the corresponding weights for learning
4. Fit a classifier on top 


### Results:

**Best against the baseline: WLH (`WLH_sq`), which beats it on 57% of the datasets. Best mean rank: `CRH_pc99` (4.89), with `WLH_sq` second (5.16).**


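**Best methods vs. the baseline (RF)**, counted as the number of the 63 datasets on which each method gets a better F1 than the baseline:

| Method   | Win rate | Datasets won |
|----------|----------|--------------|
| CRH_pc99 | 49.20%   | 31/63        |
| WLH_sq   | 57.14%   | 36/63        |
| CRS_pc95 | 52.38%   | 33/63        |
| WLH_sm   | 52.38%   | 33/63        |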
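
The percentages are the wins over the baseline divided by the number of datasets. A quick check, assuming 63 datasets (the count printed at the start of the run log, and the denominator that reproduces the reported percentages) and taking the win counts listed above:

```python
n_datasets = 63  # assumption: the dataset count printed by the notebook
wins_vs_baseline = {"CRH_pc99": 31, "WLH_sq": 36, "CRS_pc95": 33, "WLH_sm": 33}

# Win rate in percent: datasets won against the baseline over all datasets.
win_rates = {m: 100 * w / n_datasets for m, w in wins_vs_baseline.items()}

for model, rate in win_rates.items():
    print(f"{model:10s} {rate:.2f}%")
```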


Rankings and pairwise wins are shown below.



Comments on results:
1. The top curriculum learning methods seem to perform better than the weighted learning methods (this was not expected)
2. No clear winner on the scaling of scores
3. In CR methods, lowering the cutoff means training on less data, so below some point the results drop dramatically. Within the tested range [0.8, 0.9, 0.95, 0.99] no cutoff wins on every criterion: 0.99 ranks best overall and 0.90 second, while 0.95 does better against the baseline.

Details:
1. Tried different variants of the above.
   1. For all models there was a variant in how the score of each sample is calculated. Starting from the default scores, we could additionally:
      1. Square the scores (pushing small values towards zero and keeping mainly the large scores, i.e. the easy samples)
      2. Apply a softmax (same as above but smoother)
      3. Square + softmax
   2. For *CRH* and *CRS* specifically, tweak the percentile cutoff, keeping the top 90% or 80% easiest samples.
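
To see what each rescaling does to the shape of the scores, here is a small NumPy sketch. The easiness values are hypothetical and the local `softmax` helper stands in for `scipy.special.softmax`:

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


easiness = np.array([0.9, 0.6, 0.3, 0.1])  # hypothetical values, higher = easier

plain = easiness / easiness.sum()
squared = easiness**2 / (easiness**2).sum()
soft = softmax(easiness)
squared_soft = softmax(easiness**2)

variants = [("plain", plain), ("square", squared),
            ("softmax", soft), ("square+softmax", squared_soft)]
for name, w in variants:
    print(f"{name:15s}", np.round(w, 3))
```

For inputs bounded in [0, 1], squaring concentrates the mass on the easiest samples, whereas softmax gives a flatter distribution than plain normalization, since `exp` maps that range to [1, e].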
  




In [None]:
import pandas as pd
import cached_path
from pmlb import fetch_data
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import time
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
from scipy.special import softmax
from sklearn.base import clone

path_to_data_summary = "https://raw.githubusercontent.com/EpistasisLab/pmlb/master/pmlb/all_summary_stats.tsv"
dataset_df = pd.read_csv(cached_path.cached_path(path_to_data_summary), sep="\t")

classification_datasets = dataset_df[
    # (dataset_df["n_binary_features"] == dataset_df["n_features"])
    (dataset_df["task"] == "classification")
    & (dataset_df["n_classes"] == 2)
    & (dataset_df["n_features"] <= 150)
    & (dataset_df["n_instances"] <= 1000)
]["dataset"][:]

print(len(classification_datasets))

models = {
    "Baseline": {},
    # "CRH_pc80":{"top_k_to_keep":0.8},
    # "CRH_pc90":{"top_k_to_keep":0.9},
    "CRH_pc95":{"top_k_to_keep":0.95},
    "CRH_pc99":{"top_k_to_keep":0.99},
    
    "CRS_pc95":{"top_k_to_keep":0.95},
    "CRS_pc99":{"top_k_to_keep":0.99},
    "CRS_sm":{"rescale_score":'softmax'},
    
    "WLH_sq":{"rescale_score":'square'},
    "WLH_sm":{"rescale_score":'softmax'},
    
    "WLE_sq":{"rescale_score":'square'},
    "WLE_sm":{"rescale_score":'softmax'},
   
    # "CRS_sq":{"rescale_score":'square'},
    # "CRS_sq_pc90":{"rescale_score":'square', "top_k_to_keep":0.9},
    # "CRS_sm":{"rescale_score":'softmax'},
    # "CRS_sqsm":{"rescale_score":'square+softmax'},
    # "WLE":{"rescale_score":None},
    # "WLE_sq":{"rescale_score":'square'},
    # "WLE_sm":{"rescale_score":'softmax'},
    # "WLE_sqsm":{"rescale_score":'square+softmax'},
}


def change_scale_of_scores(scores, scaling_strategy=None):
    """Rescale the per-sample scores; None or "" returns them unchanged."""
    if scaling_strategy is None or scaling_strategy == "":
        new_scores = scores
    elif scaling_strategy == "square":
        new_scores = scores**2 / (scores**2).sum()
    elif scaling_strategy == "softmax":
        new_scores = softmax(scores)
    elif scaling_strategy == "square+softmax":
        new_scores = softmax(scores**2)
    else:
        raise ValueError(f"Unknown scaling strategy: {scaling_strategy!r}")
    return new_scores


number_of_cv_folds = 5
random_state = 42
max_depth = 100

cv = StratifiedKFold(number_of_cv_folds, random_state=random_state, shuffle=True)
base_class = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42) #DecisionTreeClassifier(max_depth=None, random_state=42)#

res = [] 
for dataset_index, classification_dataset in enumerate(classification_datasets[::-1][:]):
    
    print(f"{classification_dataset} ({dataset_index + 1}/{len(classification_datasets)})")
    X, y = fetch_data(classification_dataset, return_X_y=True)
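    # Relabel classes to 0..k-1. np.unique's inverse indices avoid the collisions
    # that an in-place, value-by-value remap can cause (e.g. labels {-1, 0}).
    if y.max() != 1 or y.min() != 0:
        _, y = np.unique(y, return_inverse=True)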
        
    imb_ratio = np.bincount(y).max() / np.bincount(y).min()
    print(f"{X.shape} with ratio : {imb_ratio:.4f}\n")
    
    
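    # Difficulty score: lowest tree depth at which each sample is classified correctly.
    # Samples never classified correctly keep the sentinel value max_depth + 1.
    # Note: the trees are fit on the full dataset, before the CV split below.
    scores_data = np.full(len(X), max_depth + 1, dtype=float)
    for depth in range(1, max_depth):
        clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
        clf.fit(X, y)
        y_pred = clf.predict(X)
        # depth where the prediction is correct, sentinel where it is wrong
        found_flag = np.where(y_pred == y, depth, max_depth + 1)
        scores_data = np.minimum(scores_data, found_flag)
        if not (scores_data >= max_depth).any():
            break  # every sample has been classified correctly at some depth

    # Flip and normalize so that a high score means an easy sample.
    # (Dividing by scores_data.sum() would be an alternative normalization.)
    scores_data = 1 - (scores_data / (scores_data.max() + 1))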
    
    
    
    for model_name, model_kwargs in models.items():
        y_pred = np.empty_like(y)
        sample_weights = None
        time_s = time.time()
        for train_indices, test_indices in cv.split(X,y):
            X_train, y_train = X[train_indices], y[train_indices]
            X_test, y_test = X[test_indices], y[test_indices]
            scores_train = scores_data[train_indices].copy()
            
            X_train_filtered = X_train.copy()
            y_train_filtered = y_train.copy()
            
            if 'rescale_score' in model_kwargs:
                scores_train = change_scale_of_scores(scores_train, model_kwargs['rescale_score'])

            if "CR" in model_name:
                # Curriculum strategies: train on a subset biased towards easy samples.
                # Default cutoff of 0.99 when the model config does not set one.
                top_k_to_keep = model_kwargs.get('top_k_to_keep', 0.99)
                num_samples_to_keep = int(top_k_to_keep * len(scores_train))
                if 'CRH' in model_name:
                    # Hard cutoff: keep the highest-scoring (easiest) samples.
                    selected_indices = np.argsort(scores_train)[::-1][:num_samples_to_keep]
                    X_train_filtered, y_train_filtered = X_train[selected_indices], y_train[selected_indices]
                elif "CRS" in model_name:
                    # Soft cutoff: sample without replacement, easy samples being more likely.
                    # np.random.choice is unseeded here, so CRS results vary between runs.
                    probabilities = scores_train / scores_train.sum()  # Normalize to sum to 1
                    sampled_indices = np.random.choice(len(scores_train), size=num_samples_to_keep, p=probabilities, replace=False)
                    X_train_filtered, y_train_filtered = X_train[sampled_indices], y_train[sampled_indices]
                else:
                    raise NotImplementedError(f"{model_name} not understood")
            elif "WLE" in model_name:
                # Weight easy samples more.
                sample_weights = scores_train / scores_train.sum()
            elif "WLH" in model_name:
                # Weight hard samples more (complement of the normalized scores).
                sample_weights = 1 - (scores_train / scores_train.sum())
            
            clf = clone(base_class)
            #print(model_name, X_train_filtered.shape[0])
            clf.fit(X_train_filtered , y_train_filtered, sample_weight=sample_weights)
            y_pred_cur = clf.predict(X_test)

            y_pred[test_indices] = y_pred_cur
            
        
        
        acc = accuracy_score(y, y_pred)
        (prec, rec, f1, sup) = precision_recall_fscore_support(
            y, y_pred, average="binary"
        )
            
        
        print(model_name)    
        print(classification_report(y, y_pred))
        time_end = time.time() - time_s

        res.append((classification_dataset, imb_ratio, model_name, time_end, acc, prec, rec, f1))
        
res = pd.DataFrame(res, columns=['dataset', 'dataset_class_imb', 'model', 'time', 'acc', 'pr', 'rec', 'f1'])

# Sort within each dataset by 'f1', best first
sorted_df = res.sort_values(by=['dataset', 'f1'], ascending=[True, False]).reset_index(drop=True)

# Assign ranks within each dataset (1 = best f1; ties get consecutive ranks in sort order)
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Calculate the mean rank of each model across all datasets
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

print(mean_ranks)
            

63
xd6 (1/64)
(973, 9) with ratio : 2.0217

Baseline
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       651
           1       1.00      1.00      1.00       322

    accuracy                           1.00       973
   macro avg       1.00      1.00      1.00       973
weighted avg       1.00      1.00      1.00       973

CRH_pc95
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       651
           1       1.00      0.89      0.94       322

    accuracy                           0.96       973
   macro avg       0.97      0.95      0.96       973
weighted avg       0.97      0.96      0.96       973

CRH_pc99
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       651
           1       1.00      1.00      1.00       322

    accuracy                           1.00       973
   macro avg       1.00      1.00      1.00       973
wei

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Baseline
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



CRH_pc95
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



CRH_pc99
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



CRS_pc95
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



CRS_pc99
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



CRS_sm
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



WLH_sq
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180



WLH_sm
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.80       180

WLE_sq
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.79       180

WLE_sm
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       155
           1       0.00      0.00      0.00        25

    accuracy                           0.86       180
   macro avg       0.43      0.50      0.46       180
weighted avg       0.74      0.86      0.79       180



In [61]:
model_names = res['model'].unique()
wins_score = np.zeros((len(model_names), len(model_names)))
metric_to_score = 'f1'
for classification_dataset in res['dataset'].unique():
    cur_df = res[res['dataset'] == classification_dataset]
    # print(classification_dataset)
    # print(cur_df.sort_values('f1', ascending=False)[['model', 'time', 'acc', 'f1']])
    # print()
    cur_df = cur_df.set_index('model')
    score_metric = cur_df[metric_to_score]
    # wins_score[a, b] counts the datasets where model a beats model b; ties count for neither.
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names[i + 1:], start=i + 1):
            if score_metric[m1] > score_metric[m2]:
                wins_score[i, j] += 1
            elif score_metric[m1] < score_metric[m2]:
                wins_score[j, i] += 1
order_of_models = wins_score.mean(axis=1).argsort()[::-1]
wins_score = wins_score[order_of_models, :][:, order_of_models]
# Uncomment this for percentage wins
#wins_score /= res['dataset'].unique().shape[0]
print('WINS')
print(pd.DataFrame(wins_score, columns = model_names[order_of_models], index=model_names[order_of_models]))

WINS
          CRH_pc99  WLH_sq  CRS_pc95  WLH_sm  WLE_sq  WLE_sm  CRS_pc99  \
CRH_pc99       0.0    29.0      34.0    27.0    30.0    26.0      30.0   
WLH_sq        29.0     0.0      26.0    24.0    28.0    28.0      28.0   
CRS_pc95      21.0    30.0       0.0    27.0    26.0    29.0      31.0   
WLH_sm        28.0    23.0      31.0     0.0    30.0    24.0      25.0   
WLE_sq        24.0    26.0      30.0    24.0     0.0    25.0      30.0   
WLE_sm        30.0    21.0      28.0    26.0    25.0     0.0      29.0   
CRS_pc99      25.0    24.0      24.0    29.0    27.0    25.0       0.0   
CRS_sm        26.0    22.0      26.0    25.0    26.0    25.0      26.0   
CRH_pc95      28.0    25.0      21.0    23.0    27.0    26.0      23.0   
Baseline      23.0    15.0      23.0    13.0    26.0    21.0      29.0   

          CRS_sm  CRH_pc95  Baseline  
CRH_pc99    29.0      33.0      31.0  
WLH_sq      33.0      34.0      36.0  
CRS_pc95    31.0      37.0      33.0  
WLH_sm      32.0      37

In [66]:
print(mean_ranks.reset_index(drop=True).to_string())

      model      rank
0  CRH_pc99  4.888889
1    WLH_sq  5.158730
2  CRS_pc95  5.174603
3  Baseline  5.380952
4  CRS_pc99  5.444444
5    WLH_sm  5.476190
6    CRS_sm  5.777778
7    WLE_sq  5.793651
8  CRH_pc95  5.888889
9    WLE_sm  6.015873
