## Name: Adaptive-collective classification
### Date: 18/7/2024
### Status: Need to work on the idea more.
### Idea: (on binary classification)
The idea was to fit LRs/simple DTs on subsets of the training data that are easily distinguishable from each other. Afterwards, we would only have to select the correct classifier for each sample at inference time.

Conceptually this is related to other ideas/questions as well:
1. Can we split a given training dataset into multiple subsets that are each linearly separable? This seems too PAC-related and I was not confident enough to tackle it.
2. For finding the correct classifier adaptively for each sample, I thought of the following:
   1. First I thought of checking whether taking a step (partial fit) on the query sample would change the original weights/model.
      1. The problem with this approach is that we need the label of the query sample, which is obviously unknown.
      2. To tackle this I created the **MDL** classifier. We essentially re-train 2 classifiers: one on the full training set + (query sample, pos) and the other on the full training set + (query sample, neg).
      3. Then we count how much the model was changed:
         1. One way is to measure the differences in model weights/`feature_importances_` (did not work well) -- MODEL BASED.
         2. Better yet was to count the number of changes against the original predictions (worked ok) -- PERF BASED.
         3. Better yet is to count the change in accuracy (because the original predictions could be wrong) (worked ok) -- PERF BASED.
         4. Also tried checking whether it changed the max_depth (with max_depth = None) (works ok) -- MODEL BASED.
      4. **DOES NOT WORK**
   2. Then we could do something distance-based on the original training samples.
      1. Besides purely distance-based approaches, we can think of things related to kNN graphs and label smoothness on the graph.
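
A minimal sketch of the distance-based selection from item 2.2, under stated assumptions: the classifiers are trained in a cascade (each one on the samples the previous one got wrong), every training sample is "owned" by the stage that classified it correctly, and a query is routed to the stage that owns most of its nearest training neighbours. The names `fit_cascade`, `predict_routed`, `owner` and the choice of `LogisticRegression` with `k=5` are hypothetical and only for illustration; this has not been evaluated in these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def fit_cascade(X, y, max_stages=5):
    """Fit each stage on the samples still misclassified by the previous stages.

    owner[i] is the index of the stage that classified training sample i correctly.
    """
    remaining = np.arange(len(y))
    owner = np.full(len(y), -1, dtype=int)
    clfs = []
    for stage in range(max_stages):
        # Stop when nothing is left or only one class remains.
        if len(remaining) == 0 or np.unique(y[remaining]).size < 2:
            break
        clf = LogisticRegression(max_iter=1000).fit(X[remaining], y[remaining])
        correct = clf.predict(X[remaining]) == y[remaining]
        owner[remaining[correct]] = stage
        clfs.append(clf)
        remaining = remaining[~correct]
    # Assumption: samples no stage got right are assigned to the last stage.
    owner[owner < 0] = len(clfs) - 1
    return clfs, owner


def predict_routed(clfs, owner, nn, X_query):
    """Route each query to the stage owning most of its nearest training neighbours."""
    _, idx = nn.kneighbors(X_query)
    preds = np.empty(len(X_query), dtype=int)
    for i, neigh in enumerate(idx):
        stage = np.bincount(owner[neigh], minlength=len(clfs)).argmax()
        preds[i] = clfs[stage].predict(X_query[i:i + 1])[0]
    return preds


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clfs, owner = fit_cascade(X, y)
nn = NearestNeighbors(n_neighbors=5).fit(X)
preds = predict_routed(clfs, owner, nn, X[:20])
print(len(clfs), preds)
```

The kNN-graph/label-smoothness variant would replace the majority vote over `owner` with something graph-based; that part is left open here.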

A surrogate solution I started working on was:

1. Iteratively do the following during training (start with the full dataset):
   1. On the remaining samples, fit a simple classifier (LR/bounded DT).
   2. Remove the correctly classified samples and repeat the process, creating a new classifier.
   3. Break the loop when all samples can be correctly classified by their respective classifier (or only a very small number is left misclassified).
2. At inference time for each sample:
   1. We need to find the correct classifier to make the prediction.
      1. Iterate over the classifiers, keeping their weights and updating them with a partial fit on the query sample.
      2. Whichever classifier does not change its previous predictions (on the training samples) is the most suitable and is used for prediction.
      3. **THIS IS INCOMPLETE**
      4. As a surrogate we can take the mean prediction of the classifiers, in a bagging fashion.
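
The selection step in 2.1 could look roughly like the sketch below. Since the label of the query is unknown, it assumes each classifier's own prediction is used as a pseudo-label for the partial fit; that choice, the helper name `select_stable_classifier`, and the two toy classifiers are assumptions for illustration, not something validated in these notes.

```python
import copy

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def select_stable_classifier(clfs, X_train, x):
    """Pick the classifier whose training predictions change least after one
    partial_fit step on the query x (pseudo-labelled with its own prediction)."""
    x = x.reshape(1, -1)
    n_changed = []
    for clf in clfs:
        before = clf.predict(X_train)
        trial = copy.deepcopy(clf)        # leave the fitted classifier untouched
        pseudo_label = clf.predict(x)     # assumption: the true label is unknown
        trial.partial_fit(x, pseudo_label)
        n_changed.append(int((trial.predict(X_train) != before).sum()))
    return int(np.argmin(n_changed)), n_changed


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Two illustrative classifiers fitted on different halves of the data.
clfs = [
    SGDClassifier(random_state=42).fit(X[:100], y[:100]),
    SGDClassifier(random_state=42).fit(X[100:], y[100:]),
]
best, n_changed = select_stable_classifier(clfs, X, X[0])
print(best, n_changed)
```

Working on a deep copy avoids the manual backup/restore of `coef_`, `intercept_` and `classes_`. Ties (several classifiers with zero changes) are resolved in favour of the earliest classifier by `np.argmin`.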


### Results:
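1. Created a (probably wrong) implementation of the Normal Equation for classification (**NE_CLF**). Need to check the logit/expit functions: the least-squares fit targets the raw 0/1 labels, yet `predict_proba` applies `expit` to the linear output, so the 0.5 threshold in `predict` corresponds to a linear output of 0 rather than 0.5.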
2. The **MDL** classifier:
   1. Fit a bounded DT on the whole dataset.
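   2. At inference time, re-fit two DTs similar to the original, using the full training set + the query sample with a pos and a neg label respectively.
   3. Count the changes this created. If the change for the positive label is bigger, infer negative, and vice versa.
   4. If the two changes are equal, keep the original prediction.
   5. **IS NOT BETTER** than the plain DT baseline (0.89 vs. 0.92 accuracy in the 10-fold runs below, although the two runs use different `max_depth` values).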
3. The **Cascader** is the incomplete version of the main idea.
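
A small check for the **NE_CLF** doubt, as a sketch rather than a fix: it fits the same least-squares model on the 0/1 labels and compares thresholding the raw linear output at 0.5 with thresholding after `expit`. It also verifies that `expit` and `logit` from `scipy.special` are inverses of each other. The variable names are illustrative only.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
A = np.hstack([X, np.ones((X.shape[0], 1))])     # append a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares solution of A w ~ y
scores = A @ w                                   # linear outputs, on the scale of the 0/1 labels

pred_raw = (scores > 0.5).astype(int)            # threshold on the label scale
pred_expit = (expit(scores) > 0.5).astype(int)   # same as thresholding the scores at 0

acc_raw = (pred_raw == y).mean()
acc_expit = (pred_expit == y).mean()
print(f"threshold 0.5 on raw scores: {acc_raw:.3f}")
print(f"threshold 0.5 after expit:   {acc_expit:.3f}")

# expit and logit are inverse functions on (0, 1).
p = np.array([0.1, 0.5, 0.9])
print(np.allclose(expit(logit(p)), p))
```

These are training-set accuracies only, so they indicate which decision rule is consistent with the fit, not how well the model generalises.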

In [1]:
import pandas as pd
from pmlb import fetch_data
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.decomposition import PCA
import numpy as np

In [2]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [36]:
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)


clf = DecisionTreeClassifier(random_state=42, max_depth=None)
# y_pred = cross_val_predict(clf, X, y, cv=cv)

y_pred_all = []
y_true_all = []
for train, test in cv.split(X,y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    clf = DecisionTreeClassifier(random_state=42, max_depth=3)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_all.extend(y_pred.tolist())
    y_true_all.extend(y_test.tolist())

print(classification_report(y_true_all, y_pred_all))
print(confusion_matrix(y_true_all, y_pred_all))

              precision    recall  f1-score   support

           0       0.91      0.88      0.89       212
           1       0.93      0.95      0.94       357

    accuracy                           0.92       569
   macro avg       0.92      0.91      0.92       569
weighted avg       0.92      0.92      0.92       569

[[186  26]
 [ 18 339]]


In [66]:
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.metrics import accuracy_score, f1_score

class MDL(BaseEstimator):
    
    def __init__(self, clf=DecisionTreeClassifier(max_depth=3, random_state=42)):
        self.base_clf = clf
        self.clf = []
        self.X = []
        self.y = []
        self.orig_pred = []
        self.metric_to_use = accuracy_score
    
    def fit(self, X, y):
        self.X = X
        self.y = y
        self.clf = clone(self.base_clf)
        self.clf.fit(X, y)
        self.orig_pred = self.clf.predict(X)
        self.orig_score = self.metric_to_use(y, self.orig_pred)
        return self
    
    def predict(self, X, y=None):
        preds = []
        count_changed = {'pos->neg':0, 'neg->pos':0, 'pos->pos':0, 'neg->neg':0, 'same':0}
        preds_test = self.clf.predict(X)
        res = []
        
        for i_x, x in enumerate(X):
            cur_pred = preds_test[i_x]
            
            X_aug = np.vstack((self.X, x.reshape(1,-1)))
            y_aug_neg = np.concatenate((self.y, np.array([0])))
            y_aug_pos = np.concatenate((self.y, np.array([1])))
            
            clf_pos = clone(self.base_clf)
            clf_pos.fit(X_aug, y_aug_pos)
            #counts_diff_pos = ((clf_pos.feature_importances_-self.clf.feature_importances_)**2).sum()
            y_pos_new = clf_pos.predict(self.X)
            #counts_diff_pos = (y_pos_new != self.orig_pred).sum()
            counts_diff_pos = int((1 - f1_score(self.y, y_pos_new))*len(self.y))
            #diff_acc_pos = self.metric_to_use(self.y, y_pos_new) - self.orig_score #self.clf.tree_.max_depth - clf_pos.tree_.max_depth #
            
            clf_neg = clone(self.base_clf)
            clf_neg.fit(X_aug, y_aug_neg)
            #counts_diff_neg = ((clf_neg.feature_importances_-self.clf.feature_importances_)**2).sum()
            y_neg_new = clf_neg.predict(self.X)
            #counts_diff_neg = (y_neg_new != self.orig_pred).sum()
            counts_diff_neg = int((1 - f1_score(self.y, y_neg_new))*len(self.y))
            #diff_acc_neg = self.metric_to_use(self.y, y_neg_new) - self.orig_score# self.clf.tree_.max_depth - clf_neg.tree_.max_depth #
            change = ""
            #if diff_acc_pos < diff_acc_neg:
            if counts_diff_pos > counts_diff_neg:
                if cur_pred == 1:
                    count_changed['pos->neg'] += 1
                    change = "pos->neg"
                else:
                    count_changed['neg->neg'] += 1
                    change = "neg->neg"
                cur_pred = 0
                #print(f'POS CHANGED: {counts_diff_pos} > NEG CHANGED: {counts_diff_neg} -> NEG')
                
             
            #elif diff_acc_pos > diff_acc_neg:
            elif counts_diff_pos < counts_diff_neg:
                if cur_pred == 0:
                    count_changed['neg->pos'] += 1
                    change = "neg->pos"
                else:
                    count_changed['pos->pos'] += 1
                    change = 'pos->pos'
                cur_pred = 1
                #print(f'POS CHANGED: {counts_diff_pos} < NEG CHANGED: {counts_diff_neg} -> POS')
            else:
                count_changed['same'] += 1
                change = 'same'
            preds.append(cur_pred)
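            # y is optional (defaults to None), so only index it when labels were passed in.
            res.append((i_x, y[i_x] if y is not None else None, preds_test[i_x], cur_pred, change))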
        print(count_changed)
            
        return np.array(preds), res
            

# clf = MDL(clf=DecisionTreeClassifier(max_depth=2, random_state=42))#DecisionTreeClassifier(max_depth=3, random_state=42))
# clf.fit(X,y)
# y_pred = clf.predict(X)

y_pred_all = []
y_true_all = []
res_all = []
for train, test in cv.split(X,y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    clf = MDL(clf=DecisionTreeClassifier(max_depth=5, random_state=42))
    clf.fit(X_train, y_train)
    y_pred, res = clf.predict(X_test, y_test)
    y_pred_all.extend(y_pred.tolist())
    y_true_all.extend(y_test.tolist())
    res_all.extend(res)
    
# y_pred = cross_val_predict(clf, X, y, cv=5, n_jobs=1)

print(classification_report(y_true_all, y_pred_all))
print(confusion_matrix(y_true_all, y_pred_all))         
            

{'pos->neg': 2, 'neg->pos': 2, 'pos->pos': 7, 'neg->neg': 1, 'same': 45}
{'pos->neg': 1, 'neg->pos': 0, 'pos->pos': 1, 'neg->neg': 0, 'same': 55}
{'pos->neg': 3, 'neg->pos': 1, 'pos->pos': 1, 'neg->neg': 1, 'same': 51}
{'pos->neg': 4, 'neg->pos': 2, 'pos->pos': 3, 'neg->neg': 1, 'same': 47}
{'pos->neg': 5, 'neg->pos': 0, 'pos->pos': 2, 'neg->neg': 5, 'same': 45}
{'pos->neg': 3, 'neg->pos': 1, 'pos->pos': 1, 'neg->neg': 1, 'same': 51}
{'pos->neg': 5, 'neg->pos': 2, 'pos->pos': 1, 'neg->neg': 0, 'same': 49}
{'pos->neg': 6, 'neg->pos': 3, 'pos->pos': 3, 'neg->neg': 5, 'same': 40}
{'pos->neg': 1, 'neg->pos': 0, 'pos->pos': 14, 'neg->neg': 10, 'same': 32}
{'pos->neg': 0, 'neg->pos': 0, 'pos->pos': 4, 'neg->neg': 10, 'same': 42}
              precision    recall  f1-score   support

           0       0.83      0.88      0.85       212
           1       0.92      0.89      0.91       357

    accuracy                           0.89       569
   macro avg       0.88      0.89      0.88      

In [71]:
res_all_df = pd.DataFrame(res_all, columns=['index', 'label', 'orig', 'final', 'status'])
res_all_df[(res_all_df['label'] !=res_all_df['orig']) & (res_all_df['label'] != res_all_df['final'])]['status'].value_counts()

status
neg->pos    4
pos->neg    4
Name: count, dtype: int64

In [37]:
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import SGDClassifier, LogisticRegression
from scipy.special import expit, logit


class Dummy(BaseEstimator):
    
    def __init__(self, label):
        self.label = label
        
    def fit(self, X, y):
        return self
    
    def predict_proba(self, X):
        return np.array([self.label for x in X])
    
    def decision_function(self, X):
        return logit(self.predict_proba(X))
    
    def predict(self, X):
        return self.predict_proba(X)

class NE_CLF(BaseEstimator):
    def __init__(self):
        pass
        
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        A = np.hstack([X, np.ones(X.shape[0]).reshape(-1,1)])
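        # rcond=None selects the machine-precision cutoff explicitly and avoids the FutureWarning of older NumPy versions.
        all_coefs = np.linalg.lstsq(A, y, rcond=None)[0]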
        self.coef_ = all_coefs[:-1]
        self.intercept_ = all_coefs[-1]
        return self
    
    def predict_proba(self,X):
        probas = expit(X @ self.coef_.T + self.intercept_)
        return probas
    
    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)
        

class Cascader(BaseEstimator):
    
    def __init__(self, 
                 clf=SGDClassifier(random_state=42, n_jobs=1),
                 acceptable_miss_ratio=0.01#loss="perceptron", eta0=1, learning_rate="constant", penalty=None
                 ):
        self.clf = clf
        self.clfs = []
        self.orig_preds = []
        self.acceptable_miss_ratio = acceptable_miss_ratio
        
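    def fit(self, X, y):
        self.X = X
        # Reset state so that re-fitting does not accumulate classifiers.
        self.clfs = []
        self.orig_preds = []
        # Indices (into the full training set) of the samples not yet classified correctly.
        missed_samples = np.arange(X.shape[0])
        acceptable_number_of_missed_training_samples = int(self.acceptable_miss_ratio*X.shape[0])
        while len(missed_samples) > acceptable_number_of_missed_training_samples:
            y_missed = y[missed_samples]
            if np.unique(y_missed).shape[0] == 1:
                # Only one class is left: a constant classifier covers all remaining samples.
                cur_clf = Dummy(label=y_missed[0])
                pred = y_missed
                still_missed = missed_samples[:0]
            else:
                cur_clf = clone(self.clf)
                cur_clf.fit(X[missed_samples], y_missed)
                pred = cur_clf.predict(X[missed_samples])
                # Map the subset-relative mask back to indices of the full training set.
                still_missed = missed_samples[pred != y_missed]
            self.clfs.append(cur_clf)
            self.orig_preds.append(pred)
            if len(still_missed) == len(missed_samples):
                # No progress was made: stop instead of looping forever.
                break
            missed_samples = still_missed
        print(f'Fitted {len(self.clfs)} clfs...')
        return self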
    
    def predict_proba(self, X):
        probas = []
        # for clf in self.clfs:
        #     try:
        #         cur_probas = expit(clf.decision_function(X))
        #     except AttributeError:
        #         cur_probas = clf.predict_proba(X)[:,1]
        #     probas.append(cur_probas)
            
        # probas = np.array(probas).T
        # probas = probas.mean(axis=1)
        
        # return probas
        
        for x in X:
            cur_proba = -1
            # for clf_index, clf in enumerate(self.clfs):
            #     clf_backup = clone(self.clf)
                
            #     clf_backup.coef_ = clf.coef_
            #     clf_backup.intercept_ = clf.intercept_
            #     clf_backup.classes_ = clf.classes_
                
            #     orig_preds = self.orig_preds[clf_index]
                
            #     clf.partial_fit(x.reshape(1,-1))
            #     new_preds = clf.predict(self.X)
                
            #     self.clfs[clf_index] = clf_backup
            #     if all(orig_preds == new_preds): 
            #         cur_proba = clf_backup.decision_function(x.reshape(1,-1))[0]
            #         break
            if cur_proba == -1:
                cur_proba = np.mean([expit(clf.decision_function(x.reshape(1,-1))[0]) for clf in self.clfs])
            probas.append(cur_proba)
        return np.array(probas)
    
    def predict(self, X):
        probas = self.predict_proba(X)
        preds = (probas > 0.5).astype(int)
        #print(preds)
        return preds

# clf = Cascader(clf=DecisionTreeClassifier(max_depth=3, random_state=42))#DecisionTreeClassifier(max_depth=3, random_state=42))
# clf.fit(X,y)
# clf.predict(X)


y_pred_all = []
y_true_all = []
for train, test in cv.split(X,y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    clf= Cascader(
        clf=SGDClassifier(random_state=42),
        acceptable_miss_ratio=0.005
        )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_all.extend(y_pred.tolist())
    y_true_all.extend(y_test.tolist())
    
# y_pred = cross_val_predict(clf, X, y, cv=5, n_jobs=1)

print(classification_report(y_true_all, y_pred_all))
print(confusion_matrix(y_true_all, y_pred_all))   

#y_pred = cross_val_predict(clf, X, y, cv=5, n_jobs=1)
            # no classifier could fit it

    

Fitted 3 clfs...
Fitted 3 clfs...
Fitted 3 clfs...
Fitted 2 clfs...
Fitted 3 clfs...
Fitted 2 clfs...


Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x7f0ba4661910>>
Traceback (most recent call last):
  File "/home/kbougatiotis/miniconda3/envs/prime/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 770, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(
KeyboardInterrupt: 


In [35]:
clf = DecisionTreeClassifier(random_state=42, max_depth=3)
clf.fit(X, y)
missed_samples = (clf.predict(X) != y)
missed_samples.sum()

12

In [58]:
clf.feature_importances_[clf.feature_importances_.argsort()[::-1][:6]]

array([0.45331715, 0.25352713, 0.14166468, 0.07909693, 0.04688412,
       0.02550999])

In [59]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X,y)
np.exp(nb.feature_log_prob_).T[:,1].argsort()[::-1][:6]

array([4407, 6139, 9199, 2636, 9964,  551])

In [84]:
from sklearn.decomposition import PCA, NMF, TruncatedSVD # NMF works a bit, TruncatedSVD works a bit, PCA/FastICA/FactorAnalysis/SparsePCA/DictionaryLearning not
clf = NMF(n_components=2, max_iter=5000, random_state=42) # TruncatedSVD(n_components=2) #
X_tr = clf.fit_transform(np.einsum('ji,i->ji', X.T, y)) #fit_transform(np.hstack((X, y.reshape(-1,1))).T)
#X_tr = clf.fit_transform(X.T)

In [85]:
import plotly.express as px
%matplotlib inline

df = pd.DataFrame(X_tr, columns=['pca_0', 'pca_1'])
df['name'] = [f'feat_{i}' for i in range(X_tr.shape[0])]
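# Derive the count from the data instead of hard-coding 10000 features;
# still assumes the first two and the last two features are the class-specific ones.
df['color'] = ['class A']*2 + ['irrelevant']*(X_tr.shape[0]-4) + ['class B']*2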
df['size'] = np.sqrt(df['pca_0']**2 + df['pca_1']**2)
df['size'] = (df['size'] - df['size'].min()) / (df['size'].max() - df['size'].min())
px.scatter(df, x="pca_0", y="pca_1", color='color', symbol='color', size='size', hover_data=['pca_0', 'pca_1', 'color', 'name'], opacity=0.5)