## Name: What is the effect of random projections on classification accuracy?
### Date: 28/8/2024
### Status: It seems to work with kNN-based classifiers; it did not work with a decision tree (DT) as the classifier. For kNN, performance is unchanged on dataset 1, while on dataset 2 there is a very small increase in average F1 score.

### Idea: 
Following the [Johnson-Lindenstrauss](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma) lemma, check whether reducing the dimensionality with random projections (sparse or Gaussian) helps,
i.e. transform the N x F data matrix to N x F' with F' << F and evaluate a classifier on the transformed data.
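The property the idea relies on is that a random projection approximately preserves pairwise distances. A minimal sketch of this, assuming synthetic Gaussian data (`X_demo` is a hypothetical stand-in, not one of the UCI datasets) and an arbitrarily chosen `n_components=1000`:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection

# Synthetic N x F matrix (assumption: stand-in for real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5000))

# Project N x F -> N x F' with F' << F
proj = GaussianRandomProjection(n_components=1000, random_state=0)
X_demo_tr = proj.fit_transform(X_demo)

# Compare every pairwise Euclidean distance before and after projection
upper = np.triu_indices(X_demo.shape[0], k=1)
d_orig = pairwise_distances(X_demo)[upper]
d_proj = pairwise_distances(X_demo_tr)[upper]
ratio = d_proj / d_orig

print(X_demo_tr.shape)
print(f"distance ratio: min={ratio.min():.3f}, max={ratio.max():.3f}")
```

Ratios close to 1 mean distance-based classifiers such as kNN see nearly the same neighbourhood structure after projection.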

### Results:
Tried with 2 different datasets from UCI.
1. TCGA RNA sequences for cancer types with 5 classes, 801 samples x 20.5K features
2. Farm Ads with precomputed BoW representations with 2 classes, 4K samples x 55K features

The results are (with eps=0.1):
1. 20.5K features -> 5.7K features (72% reduction), but **accuracy drops from 0.97 avg to 0.93**
2. 55K features -> 7K features (87% reduction), but **accuracy drops from 0.86 avg to 0.78**

Using Gaussian instead of sparse random projections made little difference.

Also, changing to eps=0.05 did not greatly improve the results. For the 2nd dataset the change was: 55K features -> 27.5K features (50% reduction), but **accuracy drops from 0.86 avg to 0.80**.
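The reduced dimensionalities quoted above come from the Johnson-Lindenstrauss bound, which scikit-learn applies when a projection is created with the default `n_components='auto'`. The sketch below recomputes that bound with `johnson_lindenstrauss_min_dim`, using the sample counts of the two datasets; the `dims` dictionary is just an illustrative container.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Sample counts of the two datasets used in this notebook
datasets = {"TCGA": 801, "Farm Ads": 4143}

dims = {}
for name, n_samples in datasets.items():
    for eps in (0.05, 0.1, 0.5):
        # Minimum number of components that guarantees distortion <= eps
        k = int(johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=eps))
        dims[(name, eps)] = k
        print(f"{name}: n_samples={n_samples}, eps={eps} -> {k} components")
```

The bound depends only on the number of samples and on `eps`, not on the original number of features, so a smaller `eps` (tighter distance preservation) yields more components. For Farm Ads with `eps=0.05` it gives the 27572 components printed by the projection cell in this notebook.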

In [91]:
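import pandas as pd

# Dataset 1: TCGA PANCAN RNA-Seq (801 samples x 20531 features)
df = pd.read_csv("../data/Johnson_Lindenstrauss/TCGA-PANCAN-HiSeq-801x20531/data.csv", index_col=0)
labels = pd.read_csv("../data/Johnson_Lindenstrauss/TCGA-PANCAN-HiSeq-801x20531/labels.csv", index_col=0)
# Keep the single label column as a pandas Series so that value_counts() works in the next cell
labels = labels.squeeze("columns")
X = df.values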

In [62]:
labels.value_counts(), X.shape

(Class
 BRCA     300
 KIRC     146
 LUAD     141
 PRAD     136
 COAD      78
 Name: count, dtype: int64,
 (801, 20531))

In [96]:
from sklearn.datasets import load_svmlight_file
import numpy as np

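# Dataset 2: Farm Ads in svmlight format; labels are -1 / +1
X, labels = load_svmlight_file("../data/Johnson_Lindenstrauss/Farm_Ads/farm-ads-vect")
# Map labels {-1, +1} -> {0, 1} to count the samples per class
print(X.shape, np.bincount(((labels + 1) / 2).astype(int)))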

(4143, 54877) [1933 2210]


In [97]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

random_state = 42
number_of_cv_folds = 5

max_depth = None

print(X.shape)

cv = StratifiedKFold(number_of_cv_folds, random_state=random_state, shuffle=True)
# kNN classifier; the decision tree alternative was:
# DecisionTreeClassifier(random_state=random_state, max_depth=max_depth)
clf = KNeighborsClassifier()
y_pred = cross_val_predict(clf, X, labels, cv=cv)

(4143, 54877)


In [98]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(labels, y_pred))
print(confusion_matrix(labels, y_pred))

              precision    recall  f1-score   support

        -1.0       0.67      0.95      0.79      1933
         1.0       0.93      0.59      0.72      2210

    accuracy                           0.76      4143
   macro avg       0.80      0.77      0.75      4143
weighted avg       0.81      0.76      0.75      4143

[[1832  101]
 [ 900 1310]]


In [99]:
from sklearn.random_projection import SparseRandomProjection, GaussianRandomProjection

# Sparse random projection with eps=0.05; alternatives tried:
# GaussianRandomProjection(eps=0.1), SparseRandomProjection(eps=0.1)
sp = SparseRandomProjection(eps=0.05)
X_tr = sp.fit_transform(X)
print(X_tr.shape)

(4143, 27572)
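To illustrate how the two projection types differ internally, here is a small sketch on synthetic data (`X_demo` and `n_components=500` are assumptions made for the illustration). With the default `density='auto'`, `SparseRandomProjection` uses a density of 1 / sqrt(n_features), so most entries of its projection matrix are zero, while `GaussianRandomProjection` uses a dense matrix.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Synthetic data (assumption: stand-in for the real datasets)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10_000))

sparse_rp = SparseRandomProjection(n_components=500, random_state=0).fit(X_demo)
gauss_rp = GaussianRandomProjection(n_components=500, random_state=0).fit(X_demo)

# density="auto" resolves to 1 / sqrt(n_features) = 1 / sqrt(10000)
print("resolved density:", sparse_rp.density_)

# Fraction of non-zero entries actually stored in the sparse projection matrix
n_entries = sparse_rp.components_.shape[0] * sparse_rp.components_.shape[1]
nonzero_fraction = sparse_rp.components_.nnz / n_entries
print("non-zero fraction:", nonzero_fraction)

# Both projections map to the same output shape
shape_sparse = sparse_rp.transform(X_demo).shape
shape_gauss = gauss_rp.transform(X_demo).shape
print(shape_sparse, shape_gauss)
```

The sparse variant is cheaper in memory and compute, which matters mainly for high-dimensional inputs such as the bag-of-words matrix.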


In [100]:
from sklearn.metrics import classification_report, confusion_matrix

# clf = DecisionTreeClassifier(random_state=random_state, max_depth=max_depth)
y_pred = cross_val_predict(clf, X_tr, labels, cv=cv)

print(classification_report(labels, y_pred))
print(confusion_matrix(labels, y_pred))

              precision    recall  f1-score   support

        -1.0       0.67      0.95      0.78      1933
         1.0       0.93      0.59      0.72      2210

    accuracy                           0.76      4143
   macro avg       0.80      0.77      0.75      4143
weighted avg       0.81      0.76      0.75      4143

[[1835   98]
 [ 911 1299]]
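The with/without-projection comparison done in the cells above could also be written as a single loop, with the projection wrapped in a pipeline so that each model goes through the same cross-validation splits. This is only a sketch under assumptions: the data come from `make_classification` (synthetic, not the UCI datasets), and `n_components=200` is set explicitly because the automatic Johnson-Lindenstrauss bound would exceed the feature count of such a small example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import SparseRandomProjection

# Synthetic high-dimensional binary problem (assumption: illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2000, n_informative=20, random_state=42
)

cv = StratifiedKFold(5, shuffle=True, random_state=42)
models = {
    "kNN": KNeighborsClassifier(),
    "projection + kNN": make_pipeline(
        SparseRandomProjection(n_components=200, random_state=42),
        KNeighborsClassifier(),
    ),
}

# Macro-averaged F1 from out-of-fold predictions, one score per model
scores = {}
for name, model in models.items():
    y_demo_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)
    scores[name] = f1_score(y_demo, y_demo_pred, average="macro")
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```

Fixing `random_state` on both the splitter and the projection keeps the two runs comparable and repeatable.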
