<a href="https://colab.research.google.com/github/nebojsa55/Computational-Genomics_MidTerm-Project/blob/master/notebooks/2.Better_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description
In this notebook, we try a different approach to feature selection. In the first notebook, PCA was used (295 components in total), and the resulting models were not accurate enough. We now use the **Select K Best** method and tune the parameter K to find the best model.

The Random Forest and Support Vector regressors keep the 'optimal' parameters determined previously.

The K best features are ranked with **f_regression** from the sklearn.feature_selection module, a scoring function that plugs into the SelectKBest class and returns an F-score for every feature.
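To make the selection step concrete, here is a minimal sketch of how SelectKBest and f_regression work together. The data are synthetic and the names `X_demo`, `y_demo` and `selector` are assumptions made for illustration; they are not part of the project data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 200 samples, 20 features,
# where only features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Score every feature with a univariate F-test and keep the 2 highest scoring
selector = SelectKBest(score_func=f_regression, k=2)
X_demo_best = selector.fit_transform(X_demo, y_demo)

# scores_ holds one F-score per input feature;
# get_support() marks which columns were kept
selected = np.flatnonzero(selector.get_support())
print(X_demo_best.shape)
print(selected)
```

Each feature is scored on its own, so two strongly correlated features can both be selected even though the second one adds little new information.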

Also, after carefully reviewing the data, I found that the samples come from different sets and that some samples from set **'GSE113966'** appear to be outliers. These samples (32 in total) are excluded from further analysis.


In [1]:
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the folder containing our data
%cd 'drive/MyDrive/ETF/Master/Computational-Genomics/Project/data'

Mounted at /content/drive
/content/drive/MyDrive/ETF/Master/Computational-Genomics/Project/data


In [2]:
anno = pd.read_csv('anoSC1_v11_nokey.csv', delimiter = ',', index_col = 0)
HTA20_RMA = pd.read_csv('HTA20_RMA.csv', delimiter = ',', index_col = 0).transpose()

# Sync the X and y data by sorting both on the sample labels

df1 = anno.sort_index()
df2 = HTA20_RMA.sort_index()

# Keep only the samples with a non-missing GA value, using one mask for both
has_ga = df1['GA'].notna().to_numpy()
X = df2.iloc[has_ga, :]
y = df1.loc[has_ga, ['GA', 'Batch']]

# Check to see if the indexes are the same
(X.index == y.index).all()

True

In [3]:
# Drop the first 32 rows after sorting: the 32 outlier samples (Sample_X)

X = X.iloc[32:,:]
y = y.iloc[32:,:]

In [4]:
from sklearn.preprocessing import StandardScaler

# Standardize the features separately within each batch (batches 1 to 10)
XX = np.zeros(X.shape)
for i in range(1, 11):
    # Boolean mask for batch i (np.bool8 is deprecated in recent NumPy)
    indices = (y['Batch'] == i).to_numpy()
    XX[indices, :] = StandardScaler().fit_transform(X.iloc[indices, :])

# Keep only the target column (this drops the Batch column)
yy = y['GA']
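The per-batch standardization can also be written without an explicit loop. The sketch below is one possible alternative, not the code used for the results in this notebook; the helper name `standardize_per_batch` and the toy frame `demo` are hypothetical. Unlike StandardScaler, it does not guard against zero-variance columns, which would yield NaN here.

```python
import pandas as pd

def standardize_per_batch(features: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Z-score every column within each batch (population std, as StandardScaler uses)."""
    # Group rows positionally by their batch label
    grouped = features.groupby(batch.to_numpy())
    # Subtract the batch mean and divide by the batch std for every column
    return grouped.transform(lambda grp: (grp - grp.mean()) / grp.std(ddof=0))

# Toy example (hypothetical values): two features, two batches of two samples
demo = pd.DataFrame({"g1": [1.0, 3.0, 10.0, 14.0], "g2": [2.0, 4.0, 0.0, 8.0]})
demo_batch = pd.Series([1, 1, 2, 2])
scaled = standardize_per_batch(demo, demo_batch)
print(scaled)
```

After scaling, every batch has zero mean and unit variance per feature, which removes batch-level offsets before the features are scored.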

In [5]:
from sklearn.feature_selection import f_regression,SelectKBest
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer
from sklearn.model_selection import KFold

# Create our models
rf_model = RandomForestRegressor(n_estimators = 400,
                                 max_depth = None,
                                 min_samples_leaf = 1,
                                 min_samples_split = 4)

svr_model = SVR(kernel = 'rbf', C = 1.0, gamma = 0.01)

# Create the scorer for cross_val_score
scorer = make_scorer(mean_squared_error, greater_is_better = False)

# Create KFold 
kfold = KFold(n_splits = 10, shuffle = True, random_state = 42)

# Create lists where cv score will be saved
cv_rf = []
cv_svr = []

for ktemp in [500,1000,2000,3000,5000,10000]:
  X_best = SelectKBest(f_regression,k = ktemp).fit_transform(XX,yy)
  
  # Random forest
  cv_rf.append(cross_val_score(rf_model, 
                          X_best, 
                          yy,
                          cv = kfold, 
                          scoring=scorer, 
                          verbose = 3))
  
  # SVR
  cv_svr.append(cross_val_score(svr_model, 
                          X_best, 
                          yy,
                          cv = kfold, 
                          scoring=scorer, 
                          verbose = 3))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................
[CV] .................................. , score=-32.132, total=  20.0s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   20.0s remaining:    0.0s


[CV] .................................. , score=-51.710, total=  20.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   40.2s remaining:    0.0s


[CV] .................................. , score=-35.595, total=  20.1s
[CV]  ................................................................
[CV] .................................. , score=-27.029, total=  20.0s
[CV]  ................................................................
[CV] .................................. , score=-46.322, total=  20.1s
[CV]  ................................................................
[CV] .................................. , score=-39.304, total=  20.1s
[CV]  ................................................................
[CV] .................................. , score=-23.180, total=  20.0s
[CV]  ................................................................
[CV] .................................. , score=-61.370, total=  20.5s
[CV]  ................................................................
[CV] .................................. , score=-29.397, total=  20.2s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  3.4min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV] .................................. , score=-61.582, total=   0.1s
[CV]  ................................................................
[CV] .................................. , score=-63.346, total=   0.1s
[CV]  ................................................................
[CV] .................................. , score=-62.859, total=   0.1s
[CV]  ................................................................
[CV] .................................. , score=-68.552, total=   0.1s
[CV]  ................................................................
[CV] .................................. , score=-52.261, total=   0.1s
[CV]  ................................................................
[CV] ................................. , score=-101.017, total=   0.1s
[CV]  ................................................................
[CV] .................................. , score=-55.077, total=   0.1s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-32.733, total=  39.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.1s remaining:    0.0s


[CV] .................................. , score=-51.268, total=  39.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV] .................................. , score=-36.509, total=  39.3s
[CV]  ................................................................
[CV] .................................. , score=-28.043, total=  39.8s
[CV]  ................................................................
[CV] .................................. , score=-45.612, total=  39.3s
[CV]  ................................................................
[CV] .................................. , score=-39.198, total=  39.3s
[CV]  ................................................................
[CV] .................................. , score=-22.974, total=  38.9s
[CV]  ................................................................
[CV] .................................. , score=-61.600, total=  40.0s
[CV]  ................................................................
[CV] .................................. , score=-27.382, total=  39.4s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] .................................. , score=-69.721, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-62.090, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s


[CV] .................................. , score=-63.899, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-63.248, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-69.283, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-52.992, total=   0.2s
[CV]  ................................................................
[CV] ................................. , score=-101.981, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-55.846, total=   0.2s
[CV]  ................................................................
[CV] .................................. , score=-55.954, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-32.476, total= 1.3min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.3min remaining:    0.0s


[CV] .................................. , score=-49.260, total= 1.3min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.6min remaining:    0.0s


[CV] .................................. , score=-37.136, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-29.965, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-47.070, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-39.296, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-24.656, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-61.233, total= 1.3min
[CV]  ................................................................
[CV] .................................. , score=-29.951, total= 1.3min
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 13.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-70.171, total=   0.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s


[CV] .................................. , score=-69.732, total=   0.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.7s remaining:    0.0s


[CV] .................................. , score=-62.098, total=   0.3s
[CV]  ................................................................
[CV] .................................. , score=-63.871, total=   0.3s
[CV]  ................................................................
[CV] .................................. , score=-63.209, total=   0.3s
[CV]  ................................................................
[CV] .................................. , score=-69.299, total=   0.3s
[CV]  ................................................................
[CV] .................................. , score=-53.001, total=   0.3s
[CV]  ................................................................
[CV] ................................. , score=-101.995, total=   0.3s
[CV]  ................................................................
[CV] .................................. , score=-55.853, total=   0.3s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-33.416, total= 1.9min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.9min remaining:    0.0s


[CV] .................................. , score=-50.721, total= 2.0min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.9min remaining:    0.0s


[CV] .................................. , score=-37.406, total= 2.0min
[CV]  ................................................................
[CV] .................................. , score=-31.431, total= 2.0min
[CV]  ................................................................
[CV] .................................. , score=-48.034, total= 2.0min
[CV]  ................................................................
[CV] .................................. , score=-40.694, total= 2.0min
[CV]  ................................................................
[CV] .................................. , score=-25.672, total= 1.9min
[CV]  ................................................................
[CV] .................................. , score=-61.499, total= 2.0min
[CV]  ................................................................
[CV] .................................. , score=-31.401, total= 2.0min
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 19.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-70.171, total=   0.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV] .................................. , score=-69.732, total=   0.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


[CV] .................................. , score=-62.098, total=   0.5s
[CV]  ................................................................
[CV] .................................. , score=-63.870, total=   0.5s
[CV]  ................................................................
[CV] .................................. , score=-63.208, total=   0.5s
[CV]  ................................................................
[CV] .................................. , score=-69.299, total=   0.5s
[CV]  ................................................................
[CV] .................................. , score=-53.001, total=   0.5s
[CV]  ................................................................
[CV] ................................. , score=-101.995, total=   0.5s
[CV]  ................................................................
[CV] .................................. , score=-55.853, total=   0.5s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-33.187, total= 3.3min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.3min remaining:    0.0s


[CV] .................................. , score=-52.673, total= 3.3min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  6.6min remaining:    0.0s


[CV] .................................. , score=-38.618, total= 3.3min
[CV]  ................................................................
[CV] .................................. , score=-31.974, total= 3.4min
[CV]  ................................................................
[CV] .................................. , score=-49.639, total= 3.3min
[CV]  ................................................................
[CV] .................................. , score=-41.879, total= 3.3min
[CV]  ................................................................
[CV] .................................. , score=-26.027, total= 3.3min
[CV]  ................................................................
[CV] .................................. , score=-62.155, total= 3.3min
[CV]  ................................................................
[CV] .................................. , score=-31.755, total= 3.3min
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 33.1min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-70.171, total=   0.9s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s


[CV] .................................. , score=-69.732, total=   0.8s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.7s remaining:    0.0s


[CV] .................................. , score=-62.098, total=   0.9s
[CV]  ................................................................
[CV] .................................. , score=-63.870, total=   0.9s
[CV]  ................................................................
[CV] .................................. , score=-63.208, total=   0.9s
[CV]  ................................................................
[CV] .................................. , score=-69.299, total=   0.8s
[CV]  ................................................................
[CV] .................................. , score=-53.001, total=   0.9s
[CV]  ................................................................
[CV] ................................. , score=-101.995, total=   0.8s
[CV]  ................................................................
[CV] .................................. , score=-55.853, total=   0.9s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    8.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-36.291, total= 6.8min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  6.8min remaining:    0.0s


[CV] .................................. , score=-52.439, total= 6.7min
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 13.4min remaining:    0.0s


[CV] .................................. , score=-40.151, total= 6.7min
[CV]  ................................................................
[CV] .................................. , score=-33.714, total= 6.7min
[CV]  ................................................................
[CV] .................................. , score=-48.438, total= 6.7min
[CV]  ................................................................
[CV] .................................. , score=-42.835, total= 6.7min
[CV]  ................................................................
[CV] .................................. , score=-28.028, total= 6.7min
[CV]  ................................................................
[CV] .................................. , score=-66.899, total= 6.8min
[CV]  ................................................................
[CV] .................................. , score=-32.550, total= 6.8min
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 67.4min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-70.171, total=   1.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV] .................................. , score=-69.732, total=   1.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.4s remaining:    0.0s


[CV] .................................. , score=-62.098, total=   1.7s
[CV]  ................................................................
[CV] .................................. , score=-63.870, total=   1.7s
[CV]  ................................................................
[CV] .................................. , score=-63.208, total=   1.7s
[CV]  ................................................................
[CV] .................................. , score=-69.299, total=   1.7s
[CV]  ................................................................
[CV] .................................. , score=-53.001, total=   1.7s
[CV]  ................................................................
[CV] ................................. , score=-101.995, total=   1.7s
[CV]  ................................................................
[CV] .................................. , score=-55.853, total=   1.7s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   17.1s finished


In [10]:
# Scores are negated MSEs: take the absolute value, then average the per-fold RMSE
rmse_rf = [np.mean(np.sqrt(np.abs(x))) for x in cv_rf]
rmse_rf

[6.034167006409765,
 6.063969214754349,
 6.1216962106212875,
 6.1982268247399235,
 6.258539036337153,
 6.382534412607984]
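The values above are the mean of the per-fold RMSE, which is not the same as the square root of the mean MSE. The following sketch shows the difference on made-up fold scores; the numbers in `fold_scores` are invented for illustration and are not taken from the runs above.

```python
import numpy as np

# Hypothetical per-fold scores, as returned by cross_val_score with a
# negated-MSE scorer (values invented for illustration)
fold_scores = np.array([-36.0, -64.0, -25.0, -49.0])

# Per-fold RMSE: undo the sign flip, then take the square root
fold_rmse = np.sqrt(-fold_scores)

# Aggregation used in this notebook: average the per-fold RMSE values
mean_of_rmse = fold_rmse.mean()

# Alternative aggregation: square root of the average MSE
rmse_of_mean = np.sqrt(-fold_scores.mean())

print(mean_of_rmse, rmse_of_mean)
```

The first aggregation is never larger than the second, so the two should not be mixed when comparing models.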

In [11]:
rmse_svr = [np.mean(np.sqrt(np.abs(x))) for x in cv_svr]
rmse_svr

[8.075687868159163,
 8.120651240652332,
 8.12092800076083,
 8.120916923902438,
 8.120916806265408,
 8.12091680626173]

We can observe that the RF model gives much better results than SVR (RMSE of about 6.0 to 6.4 versus about 8.1). For RF, the lowest error is obtained with the 500 best features, and the error then rises slightly but steadily as K increases.
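One caveat: in the loop above, SelectKBest is fitted on all samples before cross-validation, so the held-out folds also influence which features are chosen, and the scores may be somewhat optimistic. A possible remedy, sketched here on synthetic data, is to put the selector inside a Pipeline so that it is refitted on the training folds only. The names `X_demo`, `y_demo`, `kfold_demo` and `rmse_by_k`, as well as the small forest size, are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for XX and yy (an assumption, not the project data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 50))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=120)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

rmse_by_k = {}
for k in (5, 10, 20):
    # The selector is a pipeline step, so it is fitted on each training fold only
    pipe = Pipeline([
        ("select", SelectKBest(f_regression, k=k)),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold_demo,
                             scoring="neg_root_mean_squared_error")
    # Scores are negated RMSE values; flip the sign and average over folds
    rmse_by_k[k] = -scores.mean()

print(rmse_by_k)
```

With this setup, K can be tuned in the same way as before, and the comparison between values of K stays free of selection leakage.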

Let's try to decrease the K parameter.


In [14]:
cv_rf_v2 = []

for ktemp in [200,300,400,500]:
  X_best = SelectKBest(f_regression,k = ktemp).fit_transform(XX,yy)
  
  # Random forest
  cv_rf_v2.append(cross_val_score(rf_model, 
                          X_best, 
                          yy,
                          cv = kfold, 
                          scoring=scorer, 
                          verbose = 3))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................
[CV] .................................. , score=-31.326, total=   8.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.6s remaining:    0.0s


[CV] .................................. , score=-50.413, total=   8.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   17.1s remaining:    0.0s


[CV] .................................. , score=-35.596, total=   8.5s
[CV]  ................................................................
[CV] .................................. , score=-26.888, total=   8.6s
[CV]  ................................................................
[CV] .................................. , score=-43.719, total=   8.6s
[CV]  ................................................................
[CV] .................................. , score=-38.133, total=   8.5s
[CV]  ................................................................
[CV] .................................. , score=-22.023, total=   8.5s
[CV]  ................................................................
[CV] .................................. , score=-57.010, total=   8.7s
[CV]  ................................................................
[CV] .................................. , score=-26.652, total=   8.6s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.4min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-30.003, total=  12.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.5s remaining:    0.0s


[CV] .................................. , score=-52.076, total=  12.5s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   25.0s remaining:    0.0s


[CV] .................................. , score=-35.832, total=  12.4s
[CV]  ................................................................
[CV] .................................. , score=-26.785, total=  12.7s
[CV]  ................................................................
[CV] .................................. , score=-45.280, total=  12.5s
[CV]  ................................................................
[CV] .................................. , score=-38.356, total=  12.4s
[CV]  ................................................................
[CV] .................................. , score=-22.185, total=  12.4s
[CV]  ................................................................
[CV] .................................. , score=-59.738, total=  12.8s
[CV]  ................................................................
[CV] .................................. , score=-26.794, total=  12.6s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  2.1min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-32.320, total=  16.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.1s remaining:    0.0s


[CV] .................................. , score=-50.232, total=  16.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   32.5s remaining:    0.0s


[CV] .................................. , score=-37.067, total=  16.2s
[CV]  ................................................................
[CV] .................................. , score=-27.217, total=  16.5s
[CV]  ................................................................
[CV] .................................. , score=-46.709, total=  16.5s
[CV]  ................................................................
[CV] .................................. , score=-37.885, total=  16.3s
[CV]  ................................................................
[CV] .................................. , score=-22.919, total=  16.3s
[CV]  ................................................................
[CV] .................................. , score=-62.118, total=  16.7s
[CV]  ................................................................
[CV] .................................. , score=-28.142, total=  16.5s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  2.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................. , score=-32.150, total=  20.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   20.3s remaining:    0.0s


[CV] .................................. , score=-50.226, total=  20.4s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   40.7s remaining:    0.0s


[CV] .................................. , score=-35.109, total=  20.1s
[CV]  ................................................................
[CV] .................................. , score=-28.281, total=  20.5s
[CV]  ................................................................
[CV] .................................. , score=-46.592, total=  20.0s
[CV]  ................................................................
[CV] .................................. , score=-39.294, total=  20.0s
[CV]  ................................................................
[CV] .................................. , score=-21.716, total=  19.9s
[CV]  ................................................................
[CV] .................................. , score=-61.703, total=  20.5s
[CV]  ................................................................
[CV] .................................. , score=-28.453, total=  20.1s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  3.4min finished


In [16]:
rmse_rf_v2 = [np.mean(np.sqrt(np.abs(x))) for x in cv_rf_v2]
rmse_rf_v2  

[5.932461934311158, 5.986688128705596, 6.045040161180783, 6.035397854068846]

K = **200** gives the best result, with RMSE = **5.932**. The PCA approach in the first notebook used more than 200 components (295 in total). So what happened?

**f_regression** is a univariate linear test: it scores each feature by the strength of its linear relationship with the target. Since features chosen this way work well, the data appear to be well suited for a simple linear regressor. As a last attempt, in notebook #3 we will implement **linear regression** to see whether the RMSE can be lowered even further.