## Name: Local / Boosted Normal Equation models for classification
### Date: 18/7/2024
### Status: Works as a proof of concept across many datasets.
### Idea: (on binary classification)
The idea is to fit normal equation (NEQ) models iteratively and analytically on subsets of the training samples, in two ways:
1. **Boosting** way:
   1. Begin by treating the whole train dataset as wrongly classified.
   2. Fit a NEQ on the whole dataset. The leftovers of this model (the samples it misclassifies) are used to fit the next NEQ, and so forth, until no errors are made.
   3. At inference time, use all the NEQ weight matrices to predict all the test samples.
   4. The final prediction depends on the chosen *strategy*:
      1. For "min_residual_selection", use the NEQ with the least residual loss for each sample (in the code below, the per-sample score is the norm of each NEQ's output).
      2. For "weighted_residual_voting", use a mean of the NEQ predictions, weighted by a softmax over those per-sample scores.
2. **Local** way:
   1. Again, start by fitting a NEQ on the data.
   2. The leftovers are then grouped into a number of clusters, and one NEQ is fitted per cluster.
   3. At inference time:
      1. For each sample, we use the main NEQ to generate the prediction. We also fetch the k most similar train samples, together with whether the NEQ predicted each of those k samples correctly.
      2. If the NEQ was more than 50% accurate on these "close" train samples, the prediction is left as is.
      3. Where the NEQ under-performed, we change the prediction based on the *strategy*:
         1. For the "selection" strategy, find the cluster the sample falls into and use that cluster's NEQ to predict the final values.
         2. For the "weighted_voting" strategy, use a weighted aggregation of the "selection" prediction and the original NEQ prediction, where the "selection" prediction is weighted by the fraction of incorrect predictions of the original NEQ in the query's local neighborhood.
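
As a minimal sketch of the building block shared by both variants, the snippet below fits a single NEQ on toy data. It assumes that "normal equation" means the closed-form least-squares solution of $(X^\top X) W = X^\top Y$ with one-hot targets $Y$. The toy data and the helper name `fit_neq_closed_form` are illustrative and not taken from the experiments. When $X^\top X$ is invertible, `np.linalg.lstsq` (used in the cells below) returns the same weights.

```python
import numpy as np

def fit_neq_closed_form(X, Y):
    """Solve the normal equations (X^T X) W = X^T Y for W."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))                  # 50 samples, 3 features (toy data)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)
Y_toy = np.eye(2)[y_toy]                          # one-hot targets, shape (50, 2)

W_closed = fit_neq_closed_form(X_toy, Y_toy)
W_lstsq, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

# Predicted class = column with the largest linear output.
pred = (X_toy @ W_closed).argmax(axis=1)
train_acc = (pred == y_toy).mean()
# Misclassified samples: these are the "leftovers" handed to the next stage.
leftovers = np.flatnonzero(pred != y_toy)
```

The `leftovers` index array is what either variant would pass on: to the next boosting round, or to the clustering step of the Local variant.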


Some points:
1. All models standard-scale the features before fitting the NEQ.
2. All models one-hot encode (OHE) the class labels, so each NEQ is essentially a multi-output least-squares (LSQ) solution.
3. For the **Local** variant, the number of clusters is searched from 1 up to roughly np.sqrt(num_leftovers_from_NEQ), keeping the value with the best mean per-cluster train accuracy.
4. For the **Local** variant, the number **k** of train samples used to check the local goodness of fit is selected by fitting a KNeighborsClassifier on the train set for multiple k and keeping the best-performing one.
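
One caveat on point 4: when a k-NN classifier is scored on the same samples it was fitted on, k = 1 reaches perfect accuracy whenever no two identical points carry different labels, because each sample is its own nearest neighbor. The numpy-only sketch below illustrates this on toy data and shows a held-out split as one possible alternative criterion. The helper `knn_accuracy`, the split, and the candidate values of k are assumptions made for illustration, not the procedure used in the experiments.

```python
import numpy as np

def knn_accuracy(X_fit, y_fit, X_eval, y_eval, k):
    """Accuracy of a majority-vote k-NN classifier (binary labels 0/1)."""
    # Pairwise squared Euclidean distances, shape (n_eval, n_fit).
    d2 = ((X_eval[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    nn_idx = np.argsort(d2, axis=1)[:, :k]        # indices of the k nearest points
    votes = y_fit[nn_idx].mean(axis=1)            # fraction of neighbors in class 1
    return float(((votes > 0.5).astype(int) == y_eval).mean())

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
# Labels depend on the first feature plus noise (toy data).
y_toy = ((X_toy[:, 0] + rng.normal(scale=1.0, size=200)) > 0).astype(int)

X_tr, y_tr = X_toy[:150], y_toy[:150]
X_val, y_val = X_toy[150:], y_toy[150:]

candidate_ks = [1, 3, 5, 7, 9]                    # odd values avoid tied votes
train_scores = {k: knn_accuracy(X_tr, y_tr, X_tr, y_tr, k) for k in candidate_ks}
val_scores = {k: knn_accuracy(X_tr, y_tr, X_val, y_val, k) for k in candidate_ks}

best_k_train = max(candidate_ks, key=train_scores.get)  # train scoring picks k = 1
best_k_val = max(candidate_ks, key=val_scores.get)      # held-out scoring may differ
```

Here `train_scores[1]` is exactly 1.0, so selecting k by train accuracy always returns 1 on this data, whereas `best_k_val` depends on how well each k generalizes.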


### Results:
1. This seems to work. Across 74 datasets, the best NEQ Local model (selection strategy, 5 neighbors) has a better mean rank than DT (3.86 vs 3.99). Head to head, DT is a bit better (40 wins vs 33).
2. Among the Local variants, the selection strategy seems better, as does fixing the number of neighbors (rather than relying on KNeighborsClassifier tuning).
3. NEQ Boost is worse on average. Strategy-wise, min_residual_selection seems better.

In [53]:
import pandas as pd
from pmlb import fetch_data
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.decomposition import PCA
import numpy as np

In [54]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)

y_ohe = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()
X_sc = StandardScaler().fit_transform(X)

In [65]:
from sklearn.base import BaseEstimator
from scipy.special import softmax

class NEQBoost(BaseEstimator):
    
    def __init__(self, strategy = 'min_residual_selection'):
        self.ohe = OneHotEncoder()
        self.sc = StandardScaler()
        self.strategy = strategy
        self.W = []
    
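    def fit(self, X, y):
        
        X_sc = self.sc.fit_transform(X)
        y_ohe = self.ohe.fit_transform(y.reshape(-1,1)).toarray()
        # Reset the model list so that refitting does not keep old NEQs
        self.W = []
        # Indices (into the train set) of the samples that are still misclassified
        wrong = np.arange(X_sc.shape[0])
        while wrong.size > 0:
            X_leftover, y_leftover = X_sc[wrong], y_ohe[wrong]
            W_leftover, _, _, _ = np.linalg.lstsq(X_leftover, y_leftover, rcond=None)
            preds_ohe = X_leftover@W_leftover
            # Boolean mask over the current leftovers, not over the full train set
            still_wrong = preds_ohe.argmax(axis=1) != y_leftover.argmax(axis=1)
            self.W.append(W_leftover)
            if still_wrong.all():
                # No leftover was fixed in this round, so stop instead of looping forever
                break
            wrong = wrong[still_wrong]
        self.num_neqs = len(self.W)
        # print(f"Rounds of boosting: {self.num_neqs}")
        return self
    
    def predict(self, X):
        probas = self.predict_proba(X)
        # Map the winning column back to the original class label
        return self.ohe.categories_[0][probas.argmax(axis=1)]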
    
    def predict_proba(self, X):
        X_sc = self.sc.transform(X)
        y_probas_all = []
        dists = []
        for W in self.W:
            y_logit = X_sc @ W
            dist = np.linalg.norm(y_logit, axis=1)
            y_proba = softmax(y_logit, axis=1)
            y_probas_all.append(y_proba)
            dists.append(dist)
        dists = np.vstack(dists).T
        dists = softmax(dists, axis=1)
        y_probas_all = np.array(y_probas_all).reshape(self.num_neqs, X.shape[0], -1)
        #print(y_probas_all.shape)
        if self.strategy == 'weighted_residual_voting':
            y_probas = np.einsum('mic, im-> ic', y_probas_all, dists)
        elif self.strategy == 'min_residual_selection':
            neq_to_use = dists.argmin(axis=1)
            y_probas = y_probas_all[neq_to_use, np.arange(X.shape[0]), :]
        else:
            raise NotImplementedError(f"Can't understand {self.strategy}")
        return y_probas

clf = NEQBoost(strategy='min_residual_selection')  
clf.fit(X,y)
y_pred = clf.predict(X)
print(f"Full data fit acc: {accuracy_score(y, y_pred):.4f}")

Full data fit acc: 0.9297


In [58]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

def fit_subsamples(X_leftovers, labels_leftovers, max_to_check='sqrt'):
    if max_to_check == 'sqrt':
        max_num_clusters =  int(np.round(np.sqrt(X_leftovers.shape[0]),0)) + 1
    elif max_to_check == 'full':
        max_num_clusters =  X_leftovers.shape[0] + 1
    elif isinstance(max_to_check, int):
        max_num_clusters = max_to_check
    else:
        raise NotImplementedError()
        
    res = []
    cluster_dict = {}
    for k in range(1, max_num_clusters):
        cl = KMeans(n_clusters=k, n_init='auto')
        cl.fit(X_leftovers)
        cluster_dict[k] = {'cl':cl, 'W': []}
        rr = 0
        acc = 0
        for unq_label in np.unique(cl.labels_):
            current_cluster = np.where(cl.labels_ == unq_label)[0]
            X_cur, y_cur_ohe = X_leftovers[current_cluster], labels_leftovers[current_cluster]
            W_ohe_cur, residual_sums, _, _= np.linalg.lstsq(X_cur, y_cur_ohe, rcond=None)
            cluster_dict[k]['W'].append(W_ohe_cur)
            cur_acc = accuracy_score(y_cur_ohe.argmax(axis=1), (X_cur @ W_ohe_cur).argmax(axis=1))
            rr += residual_sums.sum()
            acc += cur_acc
            
        #print(rr)
        res.append((int(k), cl.inertia_, rr, acc/k))
        #break
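    res_df = pd.DataFrame(res, columns=['k', 'inertia', 'ss', 'acc'])
    # Best accuracy first; ties go to the smaller k. Cast to int because the
    # selected row is a float Series.
    wanted_k = int(res_df.sort_values(['acc', 'k'], ascending=[False,True]).iloc[0]['k'])
    return cluster_dict[wanted_k]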

In [60]:
from sklearn.base import BaseEstimator
from scipy.special import softmax
from sklearn.neighbors import BallTree
from sklearn.neighbors import KNeighborsClassifier

class NEQ_Local(BaseEstimator):
    
    def __init__(self, strategy = 'selection', k="sqrt"):
        self.ohe = OneHotEncoder()
        self.sc = StandardScaler()
        self.k = k
        if isinstance(self.k, int):
            self.num_neigh = self.k
        self.strategy = strategy
        self.W = None
        self.subsamples_dict = {}
        self.neighbor_tree = None
    
    def fit(self, X, y):
        
        X_sc = self.sc.fit_transform(X)
        
        if self.k == 'auto':
            # NOTE: scoring on the train data favors k=1, because each sample
            # is its own nearest neighbor; a held-out score would be a fairer
            # criterion.
            accs = []
            for k in range(1, int(np.sqrt(X.shape[0]))):
                kn = KNeighborsClassifier(n_neighbors=k,)
                kn.fit(X_sc, y)
                accs.append(kn.score(X_sc, y))
            self.num_neigh = int(np.argmax(accs)) + 1
                
        elif self.k == 'sqrt':
            self.num_neigh = int(np.sqrt(X.shape[0]))
        
        # Use an odd number of neighbors so the local majority vote cannot tie
        self.num_neigh = self.num_neigh + 1 if self.num_neigh % 2 == 0 else self.num_neigh
            
        
        self.neighbor_tree = BallTree(X_sc)
        
        
        y_ohe = self.ohe.fit_transform(y.reshape(-1,1)).toarray()
        
        self.W, _, _, _= np.linalg.lstsq(X_sc, y_ohe, rcond=None)
        preds_ohe = X_sc@self.W
        # Compare column indices so that labels need not be 0..n_classes-1
        correct = (preds_ohe.argmax(axis=1) == y_ohe.argmax(axis=1)).astype(int)
        self.neq_train_labels = correct
        X_leftovers = X_sc[~correct.astype(bool)]
        labels_leftovers = y_ohe[~correct.astype(bool)]
        self.subsamples_dict = fit_subsamples(X_leftovers, labels_leftovers)
        return self
    
    def predict(self, X):
        probas = self.predict_proba(X)
        # Map the winning column back to the original class label
        return self.ohe.categories_[0][probas.argmax(axis=1)]
    
    def predict_proba(self, X):
        X_sc = self.sc.transform(X)
        dist, indices = self.neighbor_tree.query(X_sc, k=self.num_neigh)
        # num_test X 1
        neq_percentage_correct = self.neq_train_labels[indices].mean(axis=1)
        y_neq = X_sc @ self.W
        
        if self.strategy == 'selection':
            keep_neq = (neq_percentage_correct > 0.5)
            if (~keep_neq).sum():
                X_leftovers = X_sc[~keep_neq]
                local_neq_to_use = self.subsamples_dict['cl'].predict(X_leftovers)
                y_locals = []
                for sample_index, neq_index in enumerate(local_neq_to_use):
                    cur_X = X_leftovers[sample_index,:]
                    cur_W = self.subsamples_dict['W'][neq_index]
                    y_locals.append(cur_X @ cur_W)
                y_locals = np.vstack(y_locals)
                y_neq[~keep_neq] = y_locals
            y_probas = softmax(y_neq, axis=1)
        elif self.strategy == 'weighted_voting':
            keep_neq = (neq_percentage_correct > 0.5)
            # Convert the main NEQ outputs to probabilities first, so that the
            # mixed rows below are not passed through softmax a second time
            y_probas = softmax(y_neq, axis=1)
            if (~keep_neq).sum():
                X_leftovers = X_sc[~keep_neq]
                perc_nec = neq_percentage_correct[~keep_neq]
                local_neq_to_use = self.subsamples_dict['cl'].predict(X_leftovers)
                y_locals = []
                for sample_index, neq_index in enumerate(local_neq_to_use):
                    cur_X = X_leftovers[sample_index,:]
                    cur_W = self.subsamples_dict['W'][neq_index]
                    y_locals.append(cur_X @ cur_W)
                y_locals = np.vstack(y_locals)
                # Main NEQ weighted by its local accuracy, local NEQ by the remainder
                y_probas[~keep_neq] = perc_nec.reshape(-1,1) * y_probas[~keep_neq] + (1-perc_nec).reshape(-1,1) * softmax(y_locals, axis=1)
        else:
            raise NotImplementedError(f"Can't understand {self.strategy}")
    
        return y_probas

clf = NEQ_Local(k=3, strategy='weighted_voting')  
clf.fit(X,y)
y_pred = clf.predict(X)
print(f"Full data fit acc: {accuracy_score(y, y_pred):.4f}")

Full data fit acc: 0.9772


In [26]:
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)


clf = DecisionTreeClassifier(random_state=42, max_depth=None)
# y_pred = cross_val_predict(clf, X, y, cv=cv)

y_pred_all = []
y_true_all = []
for train, test in cv.split(X,y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    clf = DecisionTreeClassifier(random_state=42, max_depth=3)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_all.extend(y_pred.tolist())
    y_true_all.extend(y_test.tolist())

print(classification_report(y_true_all, y_pred_all))
print(confusion_matrix(y_true_all, y_pred_all))

              precision    recall  f1-score   support

           0       0.84      0.85      0.85       100
           1       0.85      0.84      0.84       100

    accuracy                           0.84       200
   macro avg       0.85      0.84      0.84       200
weighted avg       0.85      0.84      0.84       200

[[85 15]
 [16 84]]


In [39]:
import pandas as pd
# Step 1: Load the per-dataset results
res = pd.read_csv("./results/neq_boost_results.csv")
# Step 2: Sort within each dataset by 'f1' (best first)
sorted_df = res.sort_values(by=['dataset', 'f1'], ascending=[True, False]).reset_index(drop=True)

# Step 3: Assign ranks within each group
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Step 4: Calculate mean rank for each model across all datasets
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

print(mean_ranks)
            

                      model      rank
4     NEQ_Local_selection_5  3.864865
3                       NEQ  3.905405
2                        DT  3.986486
6  NEQ_Local_selection_sqrt  4.297297
7   NEQ_Local_weighted_auto  4.310811
8   NEQ_Local_weighted_sqrt  4.554054
5  NEQ_Local_selection_auto  4.675676
1                Boost_Mean  7.486486
0                 Boost_Max  7.918919



In [51]:
models = mean_ranks.model[:3].values
wins_score = np.zeros((len(models), len(models)))

metric_to_score = 'f1'
res_local = res[res['model'].isin(models)]
for classification_dataset in res_local['dataset'].unique():
    cur_df = res_local[res_local['dataset'] == classification_dataset]
    cur_df = cur_df.set_index('model')
    score_metric = cur_df[metric_to_score]
    for i, m1 in enumerate(models):
        for j, m2 in enumerate(models[i:]):
            if cur_df.loc[m1][metric_to_score] > cur_df.loc[m2][metric_to_score]:
                wins_score[i, j+i] += 1
            elif cur_df.loc[m1][metric_to_score] < cur_df.loc[m2][metric_to_score]:
                wins_score[j+i, i] += 1
            else:
                pass
order_of_models = wins_score.mean(axis=1).argsort()[::-1]
wins_score = wins_score[order_of_models, :][:, order_of_models]
print('WINS')
print(pd.DataFrame(wins_score, columns = np.array(models)[order_of_models], index=np.array(models)[order_of_models]))

WINS
                         DT  NEQ_Local_selection_5   NEQ
DT                      0.0                   40.0  37.0
NEQ_Local_selection_5  33.0                    0.0  41.0
NEQ                    36.0                   25.0   0.0
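
The pairwise comparison above can also be written with a pivot table. The following is a sketch on made-up scores (the dataset names, model names, and F1 values are placeholders, not results from the experiments), assuming a long-format frame with `dataset`, `model`, and `f1` columns like the one read from the CSV.

```python
import pandas as pd

# Placeholder scores for three hypothetical models on three hypothetical datasets.
toy = pd.DataFrame({
    "dataset": ["d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3", "d3"],
    "model":   ["A", "B", "C"] * 3,
    "f1":      [0.90, 0.80, 0.70, 0.60, 0.75, 0.75, 0.85, 0.95, 0.80],
})

# One row per dataset, one column per model.
scores = toy.pivot(index="dataset", columns="model", values="f1")
models = list(scores.columns)

# wins.loc[a, b] = number of datasets where model a strictly beats model b.
# Ties count for neither side, as in the loop above.
wins = pd.DataFrame(0, index=models, columns=models)
for a in models:
    for b in models:
        if a != b:
            wins.loc[a, b] = int((scores[a] > scores[b]).sum())

print(wins)
```

As in the WINS table, each row lists how often that model beats the model named in each column.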
