## Name: What is the effect of the Random Projections in terms of accuracy
### Date: 28/8/2024
### Status: It seems to work with kNN based things.. It did not work with DT as a classifier. For kNN, in dataset 1 we have the same performance while with dataset 2 we have a very small increase in average F1 score.

### Idea: 
Following [Johnson-Lindenstrauss](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma), check whether the reduced dimensionality of the random (or Gaussian projections) help.
i.e. transform  NxF to NxF' with F' << F and check a classifier on the transformed data.

### Results:
Tried with 2 different datasets from UCI.
1. TCGA RNA sequences for cancer types with 4 classes, 801 x 20.5K features
2. Farm Ads with precomputed BoW represenentations with 2 classes, 4K x 55K features

The results are (with eps=0.1):
1. 20.5K features -> 5.7K (72% reduction) features but **accuracy drops from 0.97 avg to 0.93**
2. 55K features -> 7K  features (87% reduction) features but **accuracy drops from 0.86 avg to 0.78**

Not much difference when using Gauss or Sparse. 

Also, changing eps=0.5 did not improve greatly results. For the 2nd dataset the change was: 55K features -> 27.5K  features (50% reduction) features but **accuracy drops from 0.86 avg to 0.80**.

In [1]:
import pandas as pd
df = pd.read_csv("../data/Johnson_Lindenstrauss/TCGA-PANCAN-HiSeq-801x20531/data.csv", index_col=0)
labels = pd.read_csv("../data/Johnson_Lindenstrauss/TCGA-PANCAN-HiSeq-801x20531/labels.csv", index_col=0)
labels = labels.values.ravel()
X = df.values

In [2]:
from collections import Counter
Counter(labels)

Counter({'BRCA': 300, 'KIRC': 146, 'LUAD': 141, 'PRAD': 136, 'COAD': 78})

In [3]:
# from sklearn.datasets import load_svmlight_file
# import numpy as np

# X, labels = load_svmlight_file("../data/Johnson_Lindenstrauss/Farm_Ads/farm-ads-vect")
# print(X.shape, np.bincount((labels + 1 /2).astype(int)))

In [4]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder(sparse_output=False)
labels_ohe = ohe.fit_transform(labels.reshape(-1,1))

In [5]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

johnson_lindenstrauss_min_dim(1000, eps=[0.05, 0.1, 0.5, 0.99])

array([22867,  5920,   331,   165])

In [6]:
labels.shape, labels_ohe.shape

((801,), (801, 5))

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

random_state = 42
number_of_cv_folds = 5

max_depth = None

print(X.shape)

cv = StratifiedKFold(number_of_cv_folds, random_state=random_state, shuffle=True)
clf = KNeighborsClassifier()#DecisionTreeClassifier(random_state=random_state, max_depth=max_depth)
y_pred = cross_val_predict(clf, X, labels, cv=cv)

(801, 20531)


In [8]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(labels, y_pred))
print(confusion_matrix(labels, y_pred))

              precision    recall  f1-score   support

        BRCA       0.99      1.00      1.00       300
        COAD       1.00      1.00      1.00        78
        KIRC       1.00      1.00      1.00       146
        LUAD       1.00      0.99      0.99       141
        PRAD       1.00      1.00      1.00       136

    accuracy                           1.00       801
   macro avg       1.00      1.00      1.00       801
weighted avg       1.00      1.00      1.00       801

[[300   0   0   0   0]
 [  0  78   0   0   0]
 [  0   0 146   0   0]
 [  2   0   0 139   0]
 [  0   0   0   0 136]]


In [9]:
from sklearn.random_projection import SparseRandomProjection, GaussianRandomProjection

sp = GaussianRandomProjection(eps=0.1)#SparseRandomProjection(eps=0.1) #GaussianRandomProjection(eps=0.1)#SparseRandomProjection(eps=0.1)
X_tr = sp.fit_transform(X)
print(X_tr.shape)

(801, 5730)


In [10]:
from sklearn.metrics import classification_report, confusion_matrix

# clf = DecisionTreeClassifier(random_state=random_state, max_depth=max_depth)
y_pred = cross_val_predict(clf, X_tr, labels, cv=cv)

print(classification_report(labels, y_pred))
print(confusion_matrix(labels, y_pred))

              precision    recall  f1-score   support

        BRCA       0.99      1.00      1.00       300
        COAD       1.00      1.00      1.00        78
        KIRC       1.00      1.00      1.00       146
        LUAD       1.00      0.99      0.99       141
        PRAD       1.00      1.00      1.00       136

    accuracy                           1.00       801
   macro avg       1.00      1.00      1.00       801
weighted avg       1.00      1.00      1.00       801

[[300   0   0   0   0]
 [  0  78   0   0   0]
 [  0   0 146   0   0]
 [  2   0   0 139   0]
 [  0   0   0   0 136]]


In [11]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class AMPClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, tau=None, alpha=1.0, n_iter=100):
        """
        AMP Classifier model for multi-label classification
        
        Parameters:
        tau : float, optional
            Threshold parameter for AMP (used in the denoising function). If not provided, 
            will be calculated based on sparsity.
        alpha : float, default=1.0
            Regularization parameter or step size in the AMP updates.
        n_iter : int, default=100
            Maximum number of iterations for AMP.
        """
        self.tau = tau
        self.alpha = alpha
        self.n_iter = n_iter
        self.w_ = None
    
    def fit(self, X, y):
        """
        Fit the AMP model using the training data.
        
        Parameters:
        X : array-like of shape (n_samples, n_features)
            The input data.
        y : array-like of shape (n_samples, num_labels)
            The target labels, where each sample has multiple possible labels.
        
        Returns:
        self : object
            Fitted model.
        """
        N, F = X.shape  # N: samples, F: features
        try:
            num_labels = y.shape[1]
        except IndexError:
            self.ohe = OneHotEncoder(sparse_output=True)
            y = self.ohe.fit_transform(y)
            num_labels = y.shape[1]
        
        if self.tau is None:
            self.tau = np.sqrt(2 * np.log10(F / np.sum(y == 1)))  # rough estimate of tau if not provided
        
        # Initialize a weight vector for each label
        self.w_ = np.zeros((F, num_labels))
        
        # Fit a model for each label (each column in y)
        for label in range(num_labels):
            z = y[:, label].copy()  # residuals initialized to current label
            w_label = np.zeros(F)  # weight vector for current label
            
            def eta(x, beta):
                """ Soft-thresholding function (denoiser) """
                return np.sign(x) * np.maximum(np.abs(x) - beta, 0)
            
            # AMP iterations for each label
            for _ in range(self.n_iter):
                sigma = np.linalg.norm(z, 2) / np.sqrt(N)
                w_label = eta(w_label + self.alpha * np.dot(X.T, z), self.tau * sigma)
                tmp = np.sum(np.abs(w_label) > 0)
                z = y[:, label] - np.dot(X, w_label) + (tmp / N) * z
            
            # Store the learned weights for this label
            self.w_[:, label] = w_label
        
        return self

    def predict(self, X):
        """
        Generate predictions using the learned weight vectors.
        
        Parameters:
        X : array-like of shape (n_samples, n_features)
            The input data.
        
        Returns:
        y_pred : array-like of shape (n_samples, num_labels)
            Predicted labels for each sample and each class.
        """
        if self.w_ is None:
            raise ValueError("The model hasn't been fitted yet. Call 'fit' first.")
        
        # Predict for each label: dot product of features and weight vectors
        y_pred = np.dot(X, self.w_)
        
        # Convert the raw scores into binary predictions (multi-label classification)
        return y_pred

    
clf = AMPClassifier()
clf.fit(X_tr, labels_ohe)
y_pred = clf.predict(X_tr)
# y_pred = cross_val_predict(clf, X_tr, labels, cv=cv)

print(classification_report(labels, y_pred))
print(confusion_matrix(labels, y_pred))

  return x.astype(dtype, copy=copy, casting=casting)


ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets

In [72]:
from amp_git import amp, opt_tuning_param
x = np.zeros(X.shape[1])
# opt tuning param needs eps = (num_of_important_feats/ num_feats) ratio essentially. We don't know it, so we approximate
alpha = opt_tuning_param(np.sqrt(X_tr.shape[1]) / X_tr.shape[1])
print(alpha)
amp(labels_ohe[:,1], X, x, labels_ohe[:,1], alpha)

1.8549502274769487


(array([  1.15369285, 267.61490118, 252.03338899, ..., 728.92657365,
        281.47053641,   2.53523689]),
 array([-88575312.2677997 , -87686369.84175786, -84373283.46993884,
        -87188346.02393809, -88531106.9072353 , -89058297.28556229,
        -88277831.38255385, -88665547.8816896 , -88280182.94745857,
        -86470703.22650869, -89068670.3158362 , -88559922.58457267,
        -89737698.72558923, -86435388.07338746, -86607723.79332666,
        -88524937.66440934, -89450529.69394025, -88927268.33438596,
        -88656126.03028932, -88733767.39472853, -86035167.86968236,
        -87521744.7832485 , -88705389.64801903, -85369588.45955047,
        -86469816.82697816, -90346467.9579527 , -88977780.75086026,
        -87241083.49768144, -89383127.24808352, -87834080.51246548,
        -87432119.43620768, -85550987.61031185, -88102034.67967495,
        -88828102.47490762, -88903387.64785887, -86532356.48268251,
        -88405443.72629297, -90102986.42972265, -88576940.65114602,
        -

In [12]:
from amp_git import soft_thresh, opt_tuning_param
from numpy import linalg as LA
import numpy as np


def avg_frob_vector_norm(x_new, x):
    return np.sqrt(((x_new - x)**2).sum() / len(x))


def amp_loc(y, A, x, z, alpha):
    '''Approximate message passing (AMP) iteration 
       with soft-thresholding denoiser.
    
    Inputs
        y: measurement vector (length M 1d np.array)
        A: sensing matrix     (M-by-N 2d np.array)
        x: signal estimate    (length N 1d np.array)
        z: residual           (length M 1d np.array)
        alpha: threshold tuning parameter
        
    Outputs
        x: signal estimate
        z: residual
        
    Note 
        Need to initialise AMP iteration with 
        x = np.zeros(N)
        z = y
    '''
    
    M = len(y)
    
    # Estimate vector
    theta = alpha*np.sqrt(LA.norm(z)**2/M) # alpha*tau
    # Calculate residual with the Onsager term
    b = LA.norm(x,0)/M
    z = y - A@x + b*z
    x  = soft_thresh_loc(x + A.T@z, theta)
    

    
    
    
    # L = theta*(1 - b) # The last L is the actual lambda of the LASSO we're minimizing

    return (x, z)


def soft_thresh_loc(x, L):
    '''Soft-thresholding function.

    x is the signal, L is the threshold lambda
    x = sign(x)(abs(x)-lambda)_+
    ()_+ is the element-wise plus operator which equals
    the +ve part of x if x>0, otherwise = 0.'''
    diff = np.absolute(x)-L
    mask = diff>0
    diff[~mask] = 0
    return diff

def oamp(y, H, sigma_w=1, n_iter=1, strategy='lmmse'):
    M, N = H.shape
    # Nx1
    x = np.zeros(N)
    # Nx1
    u = np.ones_like(x)
    eye = np.eye(N=N)
    
    # W_hat is MxN
    if strategy == 'llmse' or (strategy == 'pinv' and M >=N):
        augm = H.T@H
    elif strategy == 'pinv':
        if M < N:
            W_hat = H.T@np.linalg.pinv(H@H.T)
        else:
            W_hat = np.linalg.pinv(H.T@H) @ H.T
    
    elif strategy == 'MF':
        W_hat = H.T
    for i in range(n_iter):
        if (strategy == 'llmse'):
            W_hat = LA.pinv(augm + eye*(sigma_w/u)) @ H.T
        # scalar
        tau_val = N/np.trace(W_hat @ H)
        tau_sq = u * (tau_val - 1)
        diff = (y - H @ x)
        resid = x + tau_val * W_hat @ diff
        mse = np.abs(diff).sum()
        print(f'MSE: {mse:.4f}')
        x_mmse = soft_thresh_loc(resid, np.sqrt(tau_sq))
        # conditional variance 
        u_mmse = np.sum(np.square(x_mmse - resid))/len(x_mmse - 1)
        u_hat = 1/(1/u_mmse - 1/tau_sq)
        x_hat = u_hat * (x_mmse/u_mmse - resid/tau_sq)
        f = avg_frob_vector_norm(x_hat, x)
        print(f)
        # if f <= 1e-6:
        #     break
        x = x_hat
    return x_hat


def amp(y, H, n_iter=1, alpha=1, strategy='onsager'):
    M, N = H.shape
    # Nx1
    x = np.zeros(N)
    # Nx1
    r = y.copy()
    for i in range(n_iter):
        if (strategy == 'onsager'):
            onsager_term = 0 
        else:
            onsager_term = 0
        diff = (y - H @ x)
        mse = np.abs(diff).sum()
        print(f'MSE: {mse:.4f}')
        r_new = diff + onsager_term
        u = x + H.T@r_new
        x_new = soft_thresh_loc(u, alpha)
        f = avg_frob_vector_norm(x_new, x)
        print(f'F: {f}')
        # if f <= 1e-6:
        #     break
        x = x_new
    return x_new

cur_y = labels_ohe[:, 1]
z = cur_y.copy()
x = np.zeros(X_tr.shape[1])#np.random.random(X_tr.shape[1]) #
n_iter = 10


w = amp(cur_y, X_tr, strategy='pinv', n_iter=10, alpha=1)

# alpha = opt_tuning_param(np.sqrt(X_tr.shape[1]) / X_tr.shape[1])
# alpha = 2
# print('a: ', alpha)

# for n in range(n_iter):
#     (x_new, z_new) = amp_loc(cur_y, X_tr, x, z, alpha)
#     f = avg_frob_vector_norm(x_new, x)
#     x = x_new
#     z = z_new
#     print(x_new[0], f)

MSE: 78.0000
F: 1114.4428631736528
MSE: 1361377892.7352
F: 19560319851.92451
MSE: 22415565626531308.0000
F: 3.221069389072009e+17
MSE: 369179859682820584636416.0000
F: 5.305044669418644e+24
MSE: 6080244326068743996957901455360.0000
F: 8.737196176926924e+31
MSE: 100139196059712648329824470211888152576.0000
F: 1.4389813210045208e+39
MSE: 1649252566495405438679869078818703840200622080.0000
F: 2.3699447670856223e+46
MSE: 27162531108531707306498053044844918763854520121294848.0000
F: 3.903204382956379e+53
MSE: 447356039462941187953522984687498685625022780199115471978496.0000
F: 6.428421736534364e+60
MSE: 7367775309463332144461656359893684565101016425588171869009425203200.0000
F: 1.058735386832278e+68


In [30]:
x

array([0., 0., 0., ..., 0., 0., 0.])

In [28]:
X_tr@x

array([-4.23034724e-18,  1.27723718e-19,  1.14242742e-19,  8.20938294e-19,
        2.95397364e-19,  2.39996942e-19,  1.23485410e-18,  9.80221434e-19,
        2.12106480e-19, -8.39483270e-19,  1.67088623e-19,  1.01209601e-18,
        1.76190711e-19, -3.44193162e-19,  5.43417958e-19,  3.13809163e-19,
       -8.26327376e-19, -1.73330321e-19,  4.38758931e-19,  3.31613812e-19,
        2.54988970e-19,  2.11285917e-19,  4.12978193e-20,  8.25207787e-20,
       -4.02516995e-19,  4.06724707e-21, -2.00995984e-06,  1.53357135e-18,
       -8.33806329e-20, -9.68305897e-20,  2.56712505e-19, -1.72249299e-19,
        3.85606786e-19,  5.53922324e-19,  2.13889881e-19,  2.91470737e-19,
        7.17729315e-19,  2.58438107e-20,  4.59612981e-19,  3.34515976e-19,
        2.15626133e-19,  8.23234961e-20,  2.71477782e-19,  3.61889450e-19,
        2.77702730e-19,  5.10022402e-19,  3.01118558e-19, -2.00995984e-06,
        1.59139624e-19,  1.13861411e-21,  1.71308278e-19,  1.77080344e-19,
        2.35568631e-19,  

1090.80570721447