# Feature selection:
**Selecting molecular descriptors following the Ash & Fourches (2017) procedure** (we assume these steps were applied independently to each set of descriptors):
1. **Low variance filter:** Features in the lowest variance quartile were discarded.
    - This means that at least 25% of the features are removed.
2. **Correlation filter:** For any pair of descriptors with $|r| > 0.9$, the descriptor with the largest mean $|r|$ was removed.
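
The correlation filter in step 2 can be sketched on a toy table as follows. This is only an illustration under stated assumptions, not the code used by Ash & Fourches: the function name `correlation_filter`, the toy columns `a`, `b`, `c`, and the choice to recompute the mean $|r|$ over the remaining descriptors after each removal are all assumptions made here.

```python
import numpy as np
import pandas as pd

def correlation_filter(df, threshold=0.9):
    """Greedy sketch: drop descriptors until no pair has |r| > threshold."""
    corr = df.corr().abs()
    keep = list(corr.columns)
    while len(keep) > 1:
        sub = corr.loc[keep, keep]
        # Mask the diagonal so self-correlations (r = 1) are ignored
        off_diag = sub.where(~np.eye(len(keep), dtype=bool))
        # Descriptors that still belong to at least one offending pair
        in_pair = (off_diag > threshold).any().to_numpy()
        if not in_pair.any():
            break
        # Among those, remove the one with the largest mean |r|
        mean_abs_r = off_diag.mean()
        worst = mean_abs_r[in_pair].idxmax()
        keep.remove(worst)
    return df[keep]

# Hypothetical toy descriptors
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 3.1, 3.9, 5.0, 6.1],   # nearly identical to "a"
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
filtered = correlation_filter(toy)
print(list(filtered.columns))
```

Only one of the two nearly collinear columns survives, while the weakly correlated column `c` is kept.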

#### Additionally, they performed the following analyses (not necessarily to drop features):
1. **Pearson correlation between each descriptor and pKi values.**  
2. **Paired t-test between active and inactive ligands using each set of descriptors.**
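
As a rough illustration of the first analysis, the Pearson correlation between every descriptor and the pKi values can be computed in one call with pandas. The descriptor table and the pKi values below are hypothetical toy data, and the variable names are assumptions, not names taken from the paper or from this notebook.

```python
import pandas as pd

# Hypothetical toy data: three descriptors for five ligands
descriptors = pd.DataFrame(
    {"d1": [1.0, 2.0, 3.0, 4.0, 5.0],
     "d2": [5.0, 4.0, 3.0, 2.0, 1.0],
     "d3": [1.0, 3.0, 2.0, 5.0, 4.0]},
    index=["lig1", "lig2", "lig3", "lig4", "lig5"],
)
# Hypothetical pKi values for the same ligands
pki = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0], index=descriptors.index)

# Pearson r between each descriptor column and pKi, aligned on the ligand index
r_with_pki = descriptors.corrwith(pki)
print(r_with_pki.sort_values(key=abs, ascending=False))
```

Sorting by absolute value puts the descriptors most strongly related to pKi first, regardless of the sign of the correlation.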

In [1]:
import pandas as pd
import numpy as np
import pickle

### Load the data

In [2]:
file_ = './main_table_of_Fourches_ligs_ERK2.pkl'
with open(file_, 'rb') as f:
    df_erk2_mols = pickle.load(f)
df_erk2_mols = df_erk2_mols.set_index('Name')



# MACCS Keys

In [3]:
# Extract the MACCS keys as a DataFrame (one row per molecule, one column per bit)
# `np.float` was removed in NumPy 1.24; the builtin `float` is equivalent here
s = df_erk2_mols.maccs.map(lambda x: list(map(float, x)))
df_maccs_all = pd.DataFrame.from_dict(dict(zip(s.index, s.values))).T
# The RDKit MACCS keys include a dummy bit at position 0 (the keys are numbered from 1),
# so we drop it
df_maccs_all = df_maccs_all.drop([0], axis=1)
df_maccs_all.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,157,158,159,160,161,162,163,164,165,166
CSAR_erk2_18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
CSAR_erk2_20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
CSAR_erk2_17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
CSAR_erk2_16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
CSAR_erk2_15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


### Variance Threshold

Ash and Fourches dropped all the features/bits in the lower quartile of the feature variances. This means they would remove $166/4 \approx 42$ variables; however, they actually removed 45 variables (see Table 1).
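
A minimal sketch of a lower-quartile variance filter, on a toy table with hypothetical columns `c0` to `c7`. It is not the authors' code; the function name and the use of a strict `>` comparison are assumptions. It only illustrates that when several variances tie at the quartile value, the number of removed features need not be exactly one quarter of the total.

```python
import numpy as np
import pandas as pd

def drop_lower_variance_quartile(df):
    """Keep the columns whose variance is strictly above the first quartile."""
    variances = df.var(ddof=0)                     # population variance per column
    q1 = np.quantile(variances.to_numpy(), 0.25)   # first quartile of the variances
    return df.loc[:, variances > q1]

# Hypothetical binary descriptors: 4 molecules x 8 bits
toy = pd.DataFrame({
    "c0": [0, 0, 0, 0],
    "c1": [1, 1, 1, 1],
    "c2": [0, 0, 0, 0],
    "c3": [1, 0, 0, 0],
    "c4": [1, 1, 0, 0],
    "c5": [0, 1, 1, 0],
    "c6": [1, 1, 1, 0],
    "c7": [0, 0, 1, 1],
})
filtered = drop_lower_variance_quartile(toy)
n_removed = toy.shape[1] - filtered.shape[1]
print(n_removed, "of", toy.shape[1], "columns removed")
```

In this toy case three variances tie at zero, so 3 of 8 columns (37.5%) are removed rather than exactly 25%.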

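##### In our results, how many bits are set in exactly one molecule, and how many are never set?

In [4]:
# How many bits are set in exactly one molecule (column sum == 1)?
(df_maccs_all.sum().values == 1).sum()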

5

In [5]:
# How many bits have only zeros
(df_maccs_all.sum().values == 0).sum()

37

We can start by dropping the 37 bits that are zero for every molecule, which leaves 166 - 37 = 129 bits. If we also drop the 5 bits that are set in a single molecule, 124 bits remain.
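
The different kinds of near-constant bits are easy to confuse, so here is a small sketch on hypothetical toy data (the column names are made up for illustration). Note that a bit that is always on has a column sum equal to the number of molecules, not to 1.

```python
import pandas as pd

# Hypothetical fingerprint table: 4 molecules x 4 bits
toy_bits = pd.DataFrame({
    "never_set":  [0, 0, 0, 0],
    "always_set": [1, 1, 1, 1],
    "set_once":   [0, 1, 0, 0],
    "set_twice":  [1, 0, 1, 0],
})
n_mols = len(toy_bits)
col_sums = toy_bits.sum()

n_never_set = int((col_sums == 0).sum())           # all zeros
n_always_set = int((col_sums == n_mols).sum())     # all ones
n_set_once = int((col_sums == 1).sum())            # on in exactly one molecule
n_constant = int((toy_bits.nunique() == 1).sum())  # zero-variance columns
print(n_never_set, n_always_set, n_set_once, n_constant)
```

Only the never-set and always-set bits have zero variance; a bit set in a single molecule has a small but non-zero variance.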

##### Use the VarianceThreshold class from *sklearn*:

In [6]:
from sklearn.feature_selection import VarianceThreshold

In [7]:
# threshold=0 removes only the constant (zero-variance) bits
sel_var = VarianceThreshold(threshold=0)
df_maccs_filt1 = sel_var.fit_transform(df_maccs_all)
df_maccs_filt1.shape

(86, 120)
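
`fit_transform` returns a plain NumPy array, so the bit numbers of the surviving columns are lost. One way to keep them, sketched here on hypothetical toy data, is to fit the selector and then index the original DataFrame with the boolean mask from `get_support()`.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical fingerprint table: 4 molecules x 4 bits
toy_bits = pd.DataFrame({
    "bit_1": [0, 0, 0, 0],   # constant, will be removed
    "bit_2": [1, 0, 1, 0],
    "bit_3": [1, 1, 1, 1],   # constant, will be removed
    "bit_4": [0, 1, 1, 1],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy_bits)

# Boolean mask of the retained columns, used to slice the labelled DataFrame
kept = toy_bits.loc[:, selector.get_support()]
print(list(kept.columns))
```

The result is still a DataFrame, with the molecule names as index and the retained bit labels as columns.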

In [8]:
np.array(sorted(sel_var.variances_))

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.0114927 , 0.0114927 , 0.0114927 , 0.0114927 ,
       0.0114927 , 0.02271498, 0.02271498, 0.02271498, 0.02271498,
       0.02271498, 0.02271498, 0.02271498, 0.02271498, 0.03366685,
       0.03366685, 0.03366685, 0.03366685, 0.03366685, 0.03366685,
       0.0443483 , 0.0443483 , 0.0443483 , 0.0443483 , 0.0443483 ,
       0.0443483 , 0.0443483 , 0.0443483 , 0.0443483 , 0.04434

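##### Boolean features (Bernoulli random variables)
Assuming each descriptor is a Bernoulli random variable with $p = n_a/N$, where $n_a$ is the number of molecules in which the bit is set and $N$ is the total number of molecules:
> $\mathrm{Var}(x) = p(1 - p)$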
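
As a quick sanity check of this formula, the sketch below compares $p(1 - p)$ with NumPy's population variance for a bit that is set in exactly one of 86 molecules. The value 86 is taken from the array shapes printed in this notebook; the variable names are assumptions.

```python
import numpy as np

n_molecules = 86            # N, the number of rows in the fingerprint table
bit = np.zeros(n_molecules)
bit[0] = 1.0                # bit set in exactly one molecule (n_a = 1)

p = bit.sum() / n_molecules
var_formula = p * (1 - p)   # Bernoulli variance
var_numpy = np.var(bit)     # population variance (ddof=0)
print(round(var_formula, 7), round(var_numpy, 7))
```

Both values agree, and they match the smallest non-zero entry in the `sel_var.variances_` output, which suggests that entry corresponds to bits set in a single molecule.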

In [9]:
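# NOTE: n is the number of MACCS bits (166), not the number of molecules (N = 86),
# so p here is not the n_a/N defined above, and var is additionally scaled by n
n = 166
p = df_maccs_all.sum()/n
var = p*(1-p)*n

# Keep only the bits whose variance is above the first quartile
q = np.quantile(var.values, 0.25)
df_maccs_filt2 = df_maccs_all.iloc[:, var.values > q]
df_maccs_filt2.shape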

(86, 124)

In [10]:
# Sorted copy of the variances (avoids reordering `var` in place)
np.sort(var.values)

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.9939759 ,  0.9939759 ,  0.9939759 ,
        0.9939759 ,  0.9939759 ,  1.97590361,  1.97590361,  1.97590361,
        1.97590361,  1.97590361,  1.97590361,  2.94578313,  2.94578313,
        2.94578313,  2.94578313,  3.90361446,  3.90361446,  3.90361446,
        3.90361446,  3.90361446,  3.90361446,  3.90361446,  4.84939759,
        4.84939759,  4.84939759,  4.84939759,  4.84939759,  4.84939759,
        4.84939759,  4.84939759,  4.84939759,  4.84939759,  5.78

## RDKit 2D Descriptors

In [11]:
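# Load the RDKit 2D descriptor table stored under the knime/ folder
rdk2d_knime = pd.read_csv('knime/2d_rdki_knime.csv')
# Normalise the file names (lowercase, text before the first dot) and use them as the index
rdk2d_knime['Filename'] = rdk2d_knime['Filename'].apply(lambda x: x.lower().split('.')[0])
rdk2d_knime = rdk2d_knime.set_index('Filename')
# Keep only the columns from position 6 onwards
rdk2d_knime = rdk2d_knime.iloc[:, 6:]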

In [12]:
from rdkit.Chem import Descriptors
# Get the descriptor names but omit the fragment descriptors (their names start with 'fr_')
names_of_all_rdkit_descriptors = [x[0] for x in Descriptors._descList if not x[0].startswith('fr_')]
#np.array(names_of_all_rdkit_descriptors)

In [13]:
s = df_erk2_mols['2d_rdkit']
df_rdkit_all = pd.DataFrame.from_dict(dict(zip(s.index, s.values))).T
df_rdkit_all.columns = names_of_all_rdkit_descriptors
#df_rdkit_all
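
The `from_dict(...).T` pattern used above works, but a Series of equal-length lists can also be expanded directly. The sketch below compares both routes on hypothetical toy data; the ligand and descriptor names are made up for illustration.

```python
import pandas as pd

# Hypothetical Series holding one list of descriptor values per ligand
s = pd.Series({"lig1": [1.0, 2.0, 3.0], "lig2": [4.0, 5.0, 6.0]})
names = ["desc_a", "desc_b", "desc_c"]

# Route used in the notebook: dict of lists, then transpose
via_dict = pd.DataFrame.from_dict(dict(zip(s.index, s.values))).T
via_dict.columns = names

# Direct route: one row per ligand, columns named at construction time
direct = pd.DataFrame(s.tolist(), index=s.index, columns=names)
print(direct)
```

Both routes give the same values. The direct one avoids the transpose and sets the column names in a single step.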