## Name: Sample weights based on difficulty
### Date: 10/02/2025
### Status: Pending results
### Idea: 
If we have a way of scoring the samples from "easy-to-learn" to "hard-to-learn", can we use this with a learning strategy to improve generalization?

The implicit assumption is probably that some of the "hard"-to-classify samples are either noise or would "bend the model out of shape" to accommodate them (not sure either is true).

There are 2 major components in this:
1. A scoring mechanism on the difficulty of the samples.
2. A strategy that, given the score of each sample, leads to a better learner.


I've worked out the following for the above points:

##### Difficulty scoring

A simple solution is to fit decision trees with incrementally increasing depth and check which samples are predicted correctly at each depth.
The final score for each sample is the lowest depth at which it was correctly classified (i.e. low depth implies a simple model, so a low score).
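
A minimal sketch of this scoring, for illustration only. The function name `depth_difficulty_scores`, the default `max_depth=10` and the synthetic data are assumptions, not values taken from the experiments; samples that are never classified correctly get the sentinel score `max_depth + 1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def depth_difficulty_scores(X, y, max_depth=10, random_state=42):
    """Score each sample by the shallowest tree depth that classifies it correctly."""
    # Sentinel: max_depth + 1 means "never classified correctly".
    scores = np.full(len(X), max_depth + 1, dtype=int)
    for depth in range(1, max_depth + 1):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=random_state)
        tree.fit(X, y)
        correct = tree.predict(X) == y
        # Record the depth only the first time a sample is classified correctly.
        newly_solved = correct & (scores > depth)
        scores[newly_solved] = depth
        if (scores <= max_depth).all():
            break  # every sample already has a score
    return scores


# Synthetic data, used only to show the call.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
difficulty = depth_difficulty_scores(X_demo, y_demo, max_depth=10)
print(np.bincount(difficulty))  # number of samples per score
```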


Interestingly enough, I've tried fitting a DT with arbitrary depth and then counting the path length for each sample, and it is somewhat inversely correlated with the above score
(i.e. small path lengths are usually found on the samples that the previous method scores as high-depth).
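
The path-length variant can be sketched with scikit-learn's `decision_path`, which returns a sparse indicator of the nodes each sample visits. This is an illustrative sketch on synthetic data, not the exact code used for the observation above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to show the call.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Unrestricted tree: grows until the training data is fitted.
full_tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_demo, y_demo)

# decision_path gives a sparse (n_samples x n_nodes) indicator matrix;
# each row marks the nodes a sample passes through, root and leaf included.
node_indicator = full_tree.decision_path(X_demo)

# Number of edges from the root to the leaf that holds each sample.
path_lengths = np.asarray(node_indicator.sum(axis=1)).ravel() - 1
print(np.bincount(path_lengths))  # number of samples per path length
```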


##### Score-based strategy

Tried out a few things by hand. Focused on the following:

1. *Curriculum learning - Hard (CRH)*: Keep only the easiest k% of the training data (according to the score). This is closest to the assumption that difficult samples are mainly noise.
2. *Curriculum learning - Soft (CRS)*: Sample k% of the dataset with probability proportional to the inverse score (i.e. high probability on low scores - easy samples). This is the relaxed version of CRH.
3. *Weighted learning - Easy (WLE)*: Weight the samples during learning according to the inverse of their scores (i.e. high weights on low scores - easy samples). The idea is to focus on the easy samples.
4. *Weighted learning - Hard (WLH)*: Weight the samples during learning according to their scores (i.e. high weights on high scores - hard samples). The idea is to focus on the hard samples.
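
The four strategies can be sketched as small helpers over a vector of difficulty scores. The function names are hypothetical, and two details are assumptions not stated in these notes: CRS samples without replacement, and the weights are normalised to mean 1.

```python
import numpy as np


def crh_indices(scores, keep_frac=0.9):
    """CRH: indices of the easiest keep_frac of the samples (lowest scores)."""
    n_keep = max(1, int(round(keep_frac * len(scores))))
    return np.argsort(scores, kind="stable")[:n_keep]


def crs_indices(scores, keep_frac=0.9, seed=None):
    """CRS: sample keep_frac of the data with probability proportional to 1 / score."""
    rng = np.random.default_rng(seed)
    inv = 1.0 / np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(keep_frac * len(scores))))
    # Assumption: sampling without replacement.
    return rng.choice(len(scores), size=n_keep, replace=False, p=inv / inv.sum())


def wle_weights(scores):
    """WLE: weights proportional to 1 / score (easy samples count more)."""
    inv = 1.0 / np.asarray(scores, dtype=float)
    return inv / inv.mean()  # assumption: normalise to mean 1


def wlh_weights(scores):
    """WLH: weights proportional to the score (hard samples count more)."""
    s = np.asarray(scores, dtype=float)
    return s / s.mean()  # assumption: normalise to mean 1


toy_scores = np.array([1, 1, 2, 5, 9, 3, 1, 2, 4, 10])
kept_hard = crh_indices(toy_scores, keep_frac=0.5)
kept_soft = crs_indices(toy_scores, keep_frac=0.8, seed=0)
print(kept_hard, kept_soft)
```

The index arrays can be used to subset `X_train` and `y_train`, while the weight vectors can be passed as `sample_weight` to any scikit-learn estimator whose `fit` accepts it.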


So the overall workflow is:
1. Generate train split
2. Fit DTs of incrementally increasing depth on the train data and keep track of the shallowest DT that correctly classifies each sample.
3. Use the selected strategy to sample the dataset or to generate the corresponding weights for learning
4. Fit a classifier on top 
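
An end-to-end sketch of these four steps, using the WLH strategy with a Random Forest on the breast-cancer dataset. The choices here (`max_depth = 10`, an 80/20 split, mean-1 weights, F1 as the metric on a single split) are assumptions for illustration and do not reproduce the benchmark setup.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: train split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: shallowest depth that classifies each training sample correctly
max_depth = 10  # assumed value
scores = np.full(len(X_train), max_depth + 1, dtype=float)
for depth in range(1, max_depth + 1):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    solved = (tree.predict(X_train) == y_train) & (scores > depth)
    scores[solved] = depth

# Step 3: WLH weights (hard samples get larger weights), normalised to mean 1
weights = scores / scores.mean()

# Step 4: fit the classifier with and without the weights
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
weighted = RandomForestClassifier(random_state=42).fit(X_train, y_train, sample_weight=weights)

f1_baseline = f1_score(y_test, baseline.predict(X_test))
f1_weighted = f1_score(y_test, weighted.predict(X_test))
print(f"Baseline F1: {f1_baseline:.4f} | WLH F1: {f1_weighted:.4f}")
```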


### Results:

**Best one being WLH with 57.30% better scores than the baseline and avg. rank of 3 over 90 datasets (<=150 features binary classification)** 


With **best methods over Baseline (RF)** scoring better on X/90 datasets:

          model      rank
    4  CRS_pc99  3.584270
    6    WLH_sq  3.662921


Ranking and wins shown below



Comments on results:
1. The top Curriculum learning methods seem to perform better than the Weighted learning methods (this was not expected).
2. Pre-processing takes <= 1 sec. on average and 10 seconds at most (for 48K samples).
3. No clear winner on the scaling of scores.
4. In CR methods, decreasing the cutoff makes them use less data, so below some point the results drop dramatically. In the tested range [0.8, 0.9, 0.95, 0.99] no cutoff dominated on every measure: 0.99 has the best rank overall, with 0.90 being second best in ranking but 0.95 doing better against the baseline.

Details:
1. Tried out different variants of the above.
   1. For all models there was a variant regarding how the score of each sample is calculated. Starting from the default scores we could additionally:
      1. Square the scores (pushing small probas towards zero and keeping mainly the big ones, i.e. the easy samples)
      2. Softmax them (same effect as above but smoother)
      3. Square + softmax
   2. Specifically for *CRH* and *CRS*, tweak the percentile cutoff, e.g. keeping the easiest 90% or 80% of the samples.
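
A sketch of what these scaling variants could look like. The starting point (inverse score, so that easy samples get large values), the order of operations in "square + softmax", and the mapping of the keys `sq`, `sm`, `sq_sm` to the model-name suffixes are all guesses, since the notes do not spell them out.

```python
import numpy as np
from scipy.special import softmax

toy_scores = np.array([1, 1, 2, 3, 5, 8], dtype=float)
ease = 1.0 / toy_scores  # assumed starting point: high for easy samples


def normalise(values):
    """Rescale non-negative values so they sum to 1."""
    return values / values.sum()


variants = {
    "default": normalise(ease),
    "sq": normalise(ease ** 2),   # sharper: small values shrink towards zero
    "sm": softmax(ease),          # smoother redistribution
    "sq_sm": softmax(ease ** 2),  # assumed order: square first, then softmax
}

for name, probs in variants.items():
    print(name, np.round(probs, 3))
```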
  




In [222]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.neighbors import BallTree
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

class LocalizedBoundaryClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, k=0.5, use_pca=False, random_state=None, print_=False):
        self.k = k
        self.use_pca = use_pca
        self.random_state = random_state
        self.scaler = StandardScaler()
        self.local_model = LogisticRegression(max_iter=10000, random_state=self.random_state, class_weight='balanced') # Or LogisticRegression
        self.pos_tree = None
        self.neg_tree = None
        self.feature_thresholds = None
        self.important_features = None
        self.print_ = print_
        self.classes_ = None #required for sklearn classifiers

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        X = self.scaler.fit_transform(X)
        
        self.k = min(self.k, X.shape[0])
        
        if self.use_pca:
            true_components = int(np.sqrt(X.shape[1]))
            self.pca = PCA(n_components=true_components, random_state=self.random_state)
            X = self.pca.fit_transform(X)
             

        X_pos = X[y == 1]
        X_neg = X[y == 0]

        self.pos_tree = BallTree(X_pos)
        self.neg_tree = BallTree(X_neg)
        
        self.X_combined = np.vstack([self.pos_tree.data, self.neg_tree.data])
        self.y_combined = np.concatenate([np.ones(len(self.pos_tree.data)), np.zeros(len(self.neg_tree.data))])
        
        return self

    def _find_important_features(self, X_query, X_pos_neighbors, X_neg_neighbors):
        pos_diffs = X_pos_neighbors - X_query
        neg_diffs = X_neg_neighbors - X_query

        
        important_features = np.array([], dtype=int)
        for aggr_name, aggr_function in zip(['median', 'mean', 'max', 'min'], [np.median, np.mean, np.max, np.min]):
            
            pos_diff = aggr_function(pos_diffs, axis=0)
            neg_diff = aggr_function(neg_diffs, axis=0)
            for aggr_in_sign in ['prod', 'sum']:
                if aggr_in_sign == 'prod':
                    sign_diffs = np.sign(pos_diff) * np.sign(neg_diff)
                else:
                    sign_diffs = np.sign(pos_diff) + np.sign(neg_diff)
                cur_important_features = np.where(sign_diffs < 0)[0]
                important_features = np.concatenate((important_features, cur_important_features))
                # Cases where equal feats
                if sign_diffs[0] == sign_diffs.sum()/len(sign_diffs):
                    cur_important_features = np.concatenate((np.where(pos_diff != 0)[0], np.where(neg_diff != 0)[0]))
                    important_features = np.concatenate((important_features, cur_important_features))
                if len(important_features) != 0 :
                    break
            if len(important_features) != 0 :
                    break
        # Extreme cases where the median is zero and we do have changes per feat:
        # if len(important_features) == 0 :
        #     print('Did not find important features for this sample')
            
        
        
        feature_thresholds = {}

        for feature_idx in important_features:
            # Find a threshold where the sign changes
            #threshold = (pos_diff[feature_idx] + neg_diff[feature_idx]) / 2 + X_query[feature_idx] #simple average for the threshold
            feature_thresholds[feature_idx] = (min(pos_diffs[:, feature_idx].min(), neg_diffs[:, feature_idx].min()), max(pos_diffs[:, feature_idx].max(), neg_diffs[:, feature_idx].max()))

        return important_features, feature_thresholds

    def predict(self, X):
        X = self.scaler.transform(X)
        if self.use_pca:
            X = self.pca.transform(X)

        predictions = []
        for x_query in X:
            predictions.append(self._predict_one(x_query.reshape(1, -1))) #make sure it is 2d
        return np.array(predictions)


    def _predict_one(self, x_query):
        #print('k pos', min(self.k, self.pos_tree.data.base.shape[0]), int(0.5*self.pos_tree.data.shape[0]))
        #print('k neg', min(self.k, self.neg_tree.data.base.shape[0]), int(0.5*self.neg_tree.data.shape[0]))
        dist_pos, ind_pos = self.pos_tree.query(x_query, k=max(1, int(self.k*self.pos_tree.data.base.shape[0])))
        dist_pos = dist_pos[0]
        ind_pos = ind_pos[0]
        
        dist_neg, ind_neg = self.neg_tree.query(x_query, k=max(1, int(self.k*self.neg_tree.data.base.shape[0])))
        dist_neg = dist_neg[0]
        ind_neg = ind_neg[0]
        
        min_dist = min(dist_pos.min(), dist_neg.min()) + 1e-6
        
        
        dist_pos = (dist_pos - min_dist).clip(min=1e-6) / np.abs(min_dist)
        pos_std_dists = np.bincount(np.round(dist_pos,0).astype(int))
        # print('dists', dist_pos)
        # print('stds', np.round(dist_pos,0).astype(int))
        
        try:
            pos_cumsum = pos_std_dists.cumsum()
            index_to_stop = (pos_cumsum > 0).argmax()
            mask_to_keep = np.zeros_like(dist_pos)
            mask_to_keep[:pos_cumsum[index_to_stop]] = 1#np.abs(dist_pos) <= self.factor_of_distance_away
        except ValueError:
            mask_to_keep = np.ones_like(dist_pos)
        mask_to_keep = mask_to_keep.astype(bool)

        # mask_to_keep = (dist_pos - (dist_pos.mean() + 2*dist_pos.std()) > 0) | (dist_pos - (dist_pos.mean() - 2*dist_pos.std()) > 0)
        ind_pos = ind_pos[mask_to_keep]
        dist_pos = dist_pos[mask_to_keep]
        
        
        
        
        
        dist_neg = (dist_neg - min_dist).clip(min=1e-6) / np.abs(min_dist)
       
        neg_std_dists = np.bincount(np.round(dist_neg,0).astype(int))
        
        try:
            neg_cumsum = neg_std_dists.cumsum()
            index_to_stop = (neg_cumsum > 0).argmax()
            # first_non_zero_diff = (np.diff(pos_cumsum) > 0).argmax()
            # index_to_stop = np.where(pos_cumsum>0)[0][first_non_zero_diff]
            # print('will stop at index:', index_to_stop, 'based on', first_non_zero_diff, 'of cumsum: ', np.where(pos_cumsum>0)[0])
            mask_to_keep = np.zeros_like(dist_neg)
            mask_to_keep[:neg_cumsum[index_to_stop]] = 1#np.abs(dist_pos) <= self.factor_of_distance_away
        except ValueError:
            mask_to_keep = np.ones_like(dist_neg)
            
        mask_to_keep = mask_to_keep.astype(bool)
       
       
        ind_neg = ind_neg[mask_to_keep]
        dist_neg = dist_neg[mask_to_keep]
        
        #print('pos', dist_pos, '\n neg', dist_neg)
        
        if len(ind_neg) == 0 or len(ind_pos) == 1:
            if self.print_:
                print(f"Only kept positives")
            return 1
        if len(ind_pos) == 0 or len(ind_neg) == 1:
            if self.print_:
                print(f"Only kept negatives")
            return 0

        X_pos_neighbors = self.pos_tree.data.base[ind_pos]  # Access data correctly
        X_neg_neighbors = self.neg_tree.data.base[ind_neg]
        self.important_features, self.feature_thresholds = self._find_important_features(x_query[0], X_pos_neighbors, X_neg_neighbors)

        if len(self.important_features) == 0:
            # Fallback: predict the class whose kept neighbours have the smaller summed distance
            if self.print_:
                print('Did not find important features for this sample')
            least_k = min(dist_pos.shape[0], dist_neg.shape[0])

            return self.classes_[np.argmin([sum(dist_neg[:least_k]), sum(dist_pos[:least_k])])]
        # Get data for local model
        X_local, y_local = self._get_local_data()


        if len(np.unique(y_local)) < 2:
            try:
                # Just return the one class that is found
                return self.classes_[np.bincount(y_local).argmax()]
            except ValueError:
                # If this is empty, then no samples satisfying the local criteria are found
                if self.print_:
                    print('Did not find local samples for this sample')
                least_k = min(dist_pos.shape[0], dist_neg.shape[0])
                return self.classes_[np.argmin([sum(dist_neg[:least_k]), sum(dist_pos[:least_k])])]

        clf = clone(self.local_model)
        # Fit and predict with the local model
        clf.fit(X_local, y_local)
        return clf.predict(x_query)[0] #return the prediction not the array

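    def _get_local_data(self):
        # Keep only the training samples that fall inside the bounds of every important feature.
        # Start from an all-True mask and AND in the bounds of each feature.
        mask = np.ones(self.X_combined.shape[0], dtype=bool)
        for feature_idx, (lower_bound, upper_bound) in self.feature_thresholds.items():
            mask &= (self.X_combined[:, feature_idx] >= lower_bound) & (self.X_combined[:, feature_idx] <= upper_bound)
        # print(mask.sum()/X_combined.shape[0])

        return self.X_combined[mask], self.y_combined[mask].astype(int)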
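# Load data and split
from pmlb import fetch_data

# The breast-cancer data is loaded first but immediately replaced by the PMLB 'xd6' dataset
data = load_breast_cancer()
X, y = data.data, data.target
X, y = fetch_data('xd6', return_X_y=True)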
if y.max() != 1 or y.min() != 0:
    for wanted, actual in enumerate(np.unique(y)):
        y[y==actual] = wanted


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate and train our classifier
lbc = LocalizedBoundaryClassifier(k=0.1, use_pca=False, random_state=42)
lbc.fit(X_train, y_train)
y_pred_lbc = lbc.predict(X_test)

# Instantiate and train the baseline (currently logistic regression; the `rf` names and the printed label date from the Random Forest version)
rf = LogisticRegression(max_iter=10000, class_weight='balanced', random_state=42)#RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluate
print("Localized Boundary Classifier:")
print(classification_report(y_test, y_pred_lbc))
print("\nRandom Forest Classifier:")
print(classification_report(y_test, y_pred_rf))

Localized Boundary Classifier:
              precision    recall  f1-score   support

           0       0.95      0.46      0.62       127
           1       0.49      0.96      0.64        68

    accuracy                           0.63       195
   macro avg       0.72      0.71      0.63       195
weighted avg       0.79      0.63      0.63       195


Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.93      0.63      0.75       127
           1       0.57      0.91      0.70        68

    accuracy                           0.73       195
   macro avg       0.75      0.77      0.73       195
weighted avg       0.80      0.73      0.73       195



In [206]:
import pandas as pd
import cached_path
from pmlb import fetch_data
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import time
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
from scipy.special import softmax
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression  # used for base_class below

path_to_data_summary = "https://raw.githubusercontent.com/EpistasisLab/pmlb/master/pmlb/all_summary_stats.tsv"
dataset_df = pd.read_csv(cached_path.cached_path(path_to_data_summary), sep="\t")

classification_datasets = dataset_df[
    # (dataset_df["n_binary_features"] == dataset_df["n_features"])
    (dataset_df["task"] == "classification")
    & (dataset_df["n_classes"] == 2)
    & (dataset_df["n_features"] <= 150)
    & (dataset_df["n_instances"] <= 100)
]["dataset"][:]

print(len(classification_datasets))


models = {
    "Baseline": {},
    
    # "Localised_10_2":{"k":10, "factor_of_distance_away":2},
    # "Localised_10_1.5":{"k":10, "factor_of_distance_away":1.5},
    "Localised_.1":{"k":0.1},
    "Localised_.2":{"k":0.2},
    "Localised_.5":{"k":0.5},
    "Localised_.9":{"k":0.9},
    "Localised_1":{"k":1.0},

    "Localised_.2_pca":{"k":0.2, "use_pca":True},
    "Localised_.5_pca":{"k":0.5,  "use_pca":True},

}



number_of_cv_folds = 5
random_state = 42

cv = StratifiedKFold(number_of_cv_folds, random_state=random_state, shuffle=True)
base_class = LogisticRegression(random_state=random_state, class_weight='balanced')#RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42) #DecisionTreeClassifier(max_depth=None, random_state=42)#

res = [] 
for dataset_index, classification_dataset in enumerate(classification_datasets[::-1][:]):
    
    # Note: the "+ 1" makes the printed total one larger than the actual number of datasets
    print(f"{classification_dataset} ({dataset_index + 1}/{len(classification_datasets) + 1})")
    X, y = fetch_data(classification_dataset, return_X_y=True)
    if y.max() != 1 or y.min() != 0:
        for wanted, actual in enumerate(np.unique(y)):
            y[y==actual] = wanted
        
    imb_ratio = np.bincount(y).max() / np.bincount(y).min()
    print(f"{X.shape} with ratio : {imb_ratio:.4f}\n")
    
    
    
    for model_name, model_kwargs in models.items():
        y_pred = np.empty_like(y)
        sample_weights = None
        time_s = time.time()
        for train_indices, test_indices in cv.split(X,y):
            X_train, y_train = X[train_indices], y[train_indices]
            X_test, y_test = X[test_indices], y[test_indices]
            
            if "Localised" in model_name:
                clf = LocalizedBoundaryClassifier(**model_kwargs)
            else:
                clf = clone(base_class)
            
            clf.fit(X_train, y_train)
            
            
            y_pred_cur = clf.predict(X_test)

            y_pred[test_indices] = y_pred_cur
            
        
        
        acc = accuracy_score(y, y_pred)
        (prec, rec, f1, sup) = precision_recall_fscore_support(
            y, y_pred, average="binary"
        )
            
        
        print(model_name)    
        print(classification_report(y, y_pred))
        time_end = time.time() - time_s

        res.append((classification_dataset, imb_ratio, model_name, time_end, acc, prec, rec, f1))
        
res = pd.DataFrame(res, columns=['dataset', 'dataset_class_imb', 'model', 'time', 'acc', 'pr', 'rec', 'f1'])

# Step 2: Sort the rows of each dataset by 'f1' (descending); a plain sort avoids groupby.apply
sorted_df = res.sort_values(by=['dataset', 'f1'], ascending=[True, False]).reset_index(drop=True)

# Step 3: Assign ranks within each group
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Step 4: Calculate mean rank for each model across all datasets
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

print(mean_ranks)
            

12
postoperative_patient_data (1/13)
(88, 8) with ratio : 2.6667

Baseline
              precision    recall  f1-score   support

           0       0.74      0.55      0.63        64
           1       0.29      0.50      0.37        24

    accuracy                           0.53        88
   macro avg       0.52      0.52      0.50        88
weighted avg       0.62      0.53      0.56        88

Localised_.1
              precision    recall  f1-score   support

           0       0.73      0.56      0.64        64
           1       0.28      0.46      0.35        24

    accuracy                           0.53        88
   macro avg       0.51      0.51      0.49        88
weighted avg       0.61      0.53      0.56        88

Localised_.2
              precision    recall  f1-score   support

           0       0.71      0.55      0.62        64
           1       0.26      0.42      0.32        24

    accuracy                           0.51        88
   macro avg       0.49    

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Localised_.2
              precision    recall  f1-score   support

           0       0.79      0.62      0.69        73
           1       0.20      0.37      0.26        19

    accuracy                           0.57        92
   macro avg       0.49      0.49      0.48        92
weighted avg       0.67      0.57      0.60        92

Localised_.5
              precision    recall  f1-score   support

           0       0.76      0.56      0.65        73
           1       0.16      0.32      0.21        19

    accuracy                           0.51        92
   macro avg       0.46      0.44      0.43        92
weighted avg       0.64      0.51      0.56        92

Localised_.9
              precision    recall  f1-score   support

           0       0.76      0.58      0.66        73
           1       0.16      0.32      0.21        19

    accuracy                           0.52        92
   macro avg       0.46      0.45      0.44        92
weighted avg       0.64      0.52  


Localised_.1
              precision    recall  f1-score   support

           0       0.80      0.70      0.74        73
           1       0.33      0.46      0.39        24

    accuracy                           0.64        97
   macro avg       0.57      0.58      0.57        97
weighted avg       0.68      0.64      0.66        97

Localised_.2
              precision    recall  f1-score   support

           0       0.89      0.75      0.81        73
           1       0.49      0.71      0.58        24

    accuracy                           0.74        97
   macro avg       0.69      0.73      0.70        97
weighted avg       0.79      0.74      0.76        97

Localised_.5
              precision    recall  f1-score   support

           0       0.87      0.74      0.80        73
           1       0.46      0.67      0.54        24

    accuracy                           0.72        97
   macro avg       0.66      0.70      0.67        97
weighted avg       0.77      0.72  



In [203]:
model_names = res['model'].unique()
wins_score = np.zeros((len(model_names), len(model_names)))
metric_to_score = 'f1'
for classification_dataset in res['dataset'].unique():
    cur_df = res[res['dataset'] == classification_dataset]
    # print(classification_dataset)
    # print(cur_df.sort_values('f1', ascending=False)[['model', 'time', 'acc', 'f1']])
    # print()
    cur_df = cur_df.set_index('model')
    score_metric = cur_df[metric_to_score]
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names[i:]):
            if cur_df.loc[m1][metric_to_score] > cur_df.loc[m2][metric_to_score]:
                wins_score[i, j+i] += 1
            elif cur_df.loc[m1][metric_to_score] < cur_df.loc[m2][metric_to_score]:
                wins_score[j+i, i] += 1
            else:
                pass
order_of_models = wins_score.mean(axis=1).argsort()[::-1]
wins_score = wins_score[order_of_models, :][:, order_of_models]
# Uncomment this for percentage wins
# wins_score /= res['dataset'].unique().shape[0]
print('WINS')
print(pd.DataFrame(wins_score, columns = model_names[order_of_models], index=model_names[order_of_models]))

WINS
                  Baseline  Localised_.5  Localised_.9  Localised_.2  \
Baseline               0.0           9.0           9.0           9.0   
Localised_.5           3.0           0.0           3.0           6.0   
Localised_.9           3.0           4.0           0.0           6.0   
Localised_.2           3.0           4.0           4.0           0.0   
Localised_1            3.0           4.0           1.0           6.0   
Localised_.5_pca       3.0           5.0           5.0           5.0   
Localised_.1           3.0           3.0           2.0           3.0   
Localised_.2_pca       2.0           4.0           4.0           4.0   

                  Localised_1  Localised_.5_pca  Localised_.1  \
Baseline                  9.0               9.0           9.0   
Localised_.5              4.0               7.0           9.0   
Localised_.9              1.0               7.0          10.0   
Localised_.2              4.0               7.0           9.0   
Localised_1          

In [79]:
print(mean_ranks.reset_index(drop=True).to_string())

      model      rank
0  CRS_pc99  3.584270
1    WLH_sq  3.662921
2  Baseline  3.764045
3  CRH_pc99  3.921348
4  CRS_pc95  4.191011
5    WLE_sm  4.314607
6  CRH_pc95  4.561798


## Plot performance over number of samples and imbalance ratio for baseline vs best model

#### **No clear trend** when plotting against imbalance, number of samples or number of features

In [95]:
import plotly.express as px

dataset_details = dataset_df[
    # (dataset_df["n_binary_features"] == dataset_df["n_features"])
    (dataset_df["task"] == "classification")
    & (dataset_df["n_classes"] == 2)
    & (dataset_df["n_features"] <= 150)
]


# .copy() gives an independent frame, so adding columns does not trigger SettingWithCopyWarning
res_to_keep = res[res['model'].isin(["Baseline", "CRH_pc99"])].copy()
res_to_keep['num_samples'] = res_to_keep['dataset'].map(dataset_details.set_index('dataset')['n_instances'])
res_to_keep['num_feats'] = res_to_keep['dataset'].map(dataset_details.set_index('dataset')['n_features'])
# res = pd.DataFrame(res, columns=['dataset', 'dataset_class_imb', 'model', 'time', 'acc', 'pr', 'rec', 'f1'])


fig = px.scatter(res_to_keep, x="dataset_class_imb", y="f1", color="model")
fig.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy





## Average time taken

### The overhead is less than a second on average or 10 seconds at most (with 48K samples)

In [109]:
## Average time taken

times = []
for dataset_index, classification_dataset in enumerate(classification_datasets[::-1][:]):
    
    # Note: the "+ 1" makes the printed total one larger than the actual number of datasets
    print(f"{classification_dataset} ({dataset_index + 1}/{len(classification_datasets) + 1})")
    X, y = fetch_data(classification_dataset, return_X_y=True)
    if y.max() != 1 or y.min() != 0:
        for wanted, actual in enumerate(np.unique(y)):
            y[y==actual] = wanted
        
    imb_ratio = np.bincount(y).max() / np.bincount(y).min()
    print(f"{X.shape} with ratio : {imb_ratio:.4f}\n")
    
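    time_s = time.time()
    # max_depth is assumed to be defined in an earlier cell (its value is not shown in this notebook).
    # Every sample starts at max_depth + 1, meaning "not yet classified correctly".
    scores_data =  np.zeros(len(X)) + max_depth + 1
    for depth in range(1, max_depth):
        clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
        clf.fit(X, y)
        y_pred = clf.predict(X)
        # Correct samples get the current depth, incorrect ones keep the sentinel max_depth + 1
        found_flag = (y_pred == y).astype(int)*depth
        found_flag[found_flag == 0] = max_depth + 1
        # Keep the shallowest depth seen so far for each sample
        scores_data = np.minimum(scores_data, found_flag)
        # Stop early once every sample has been classified correctly at some depth
        if (scores_data >= max_depth).sum() == 0:
            break
    time_taken = time.time() - time_s
    times.append(time_taken)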

xd6 (1/90)
(973, 9) with ratio : 2.0217

wdbc (2/90)
(569, 30) with ratio : 1.6840

vote (3/90)
(435, 16) with ratio : 1.5893

twonorm (4/90)
(7400, 20) with ratio : 1.0016

tokyo1 (5/90)
(959, 44) with ratio : 1.7717

titanic (6/90)
(2099, 8) with ratio : 2.0822

tic_tac_toe (7/90)
(958, 9) with ratio : 1.8855

threeOf9 (8/90)
(512, 9) with ratio : 1.1513

spectf (9/90)
(349, 44) with ratio : 2.6737

spect (10/90)
(267, 22) with ratio : 3.8545

spambase (11/90)
(4601, 57) with ratio : 1.5378

sonar (12/90)
(208, 60) with ratio : 1.1443

saheart (13/90)
(462, 9) with ratio : 1.8875

ring (14/90)
(7400, 20) with ratio : 1.0197

profb (15/90)
(672, 9) with ratio : 2.0000

prnn_synth (16/90)
(250, 2) with ratio : 1.0000

prnn_crabs (17/90)
(200, 7) with ratio : 1.0000

postoperative_patient_data (18/90)
(88, 8) with ratio : 2.6667

pima (19/90)
(768, 8) with ratio : 1.8657

phoneme (20/90)
(5404, 5) with ratio : 2.4073

parity5+5 (21/90)
(1124, 10) with ratio : 1.0180

parity5 (22/90)
(32

In [113]:
print(f"Mean time taken (seconds): {np.mean(times):.2f} +- {2*np.std(times):.2f} (mean +- 2*std)")

Mean time taken (seconds): 0.64 +- 4.02 (mean +- 2*std)
