# Feature selection:
**Selecting molecular descriptors following the Ash & Fourches (2017) procedure** (we assume these steps were applied independently to each set of descriptors):
1. **Low variance filter:** Features in the lowest variance quartile were discarded.
    - This means that at least 25% of the features are removed.
    
    
2. **Correlation filter:** For any pair of descriptors with $|r| > 0.9$, the descriptor with the larger mean $|r|$ (averaged over all other descriptors) was removed.
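
The two filters above can be sketched as follows. This is a minimal illustration on made-up data, not the authors' implementation: the function names `drop_low_variance` and `drop_correlated` are hypothetical, and computing the mean $|r|$ once on the full correlation matrix (rather than updating it after each removal) is an assumption.

```python
import numpy as np
import pandas as pd


def drop_low_variance(df: pd.DataFrame, quantile: float = 0.25) -> pd.DataFrame:
    """Keep only the features whose variance lies above the given variance quantile."""
    variances = df.var(ddof=0)
    cutoff = variances.quantile(quantile)
    return df.loc[:, variances > cutoff]


def drop_correlated(df: pd.DataFrame, r_max: float = 0.9) -> pd.DataFrame:
    """For each pair with |r| > r_max, drop the member with the larger mean |r|."""
    corr = df.corr().abs()
    mean_abs_r = corr.mean()  # assumption: computed once, on the full matrix
    cols = list(corr.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > r_max:
                to_drop.add(a if mean_abs_r[a] >= mean_abs_r[b] else b)
    return df.drop(columns=sorted(to_drop))


# Made-up data: 'b' is almost a copy of 'a', 'const' has zero variance
rng = np.random.default_rng(0)
x = rng.normal(size=50)
demo = pd.DataFrame({
    'a': x,
    'b': x + 0.01 * rng.normal(size=50),
    'c': rng.normal(size=50),
    'const': 0.0,
})
filtered = drop_correlated(drop_low_variance(demo))
```

Here `filtered` keeps `c` and only one of the near-duplicates `a` and `b`; `const` is removed by the variance step.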

#### Additionally, they performed the following analyses (not necessarily to drop features):
1. **Pearson correlation between each descriptor and pKi values.**  
2. **Paired t-test between active and inactive ligands using each set of descriptors.**
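
The first of these analyses can be sketched with `DataFrame.corrwith`, which computes the Pearson correlation between each column and a target series. The descriptor names and pKi values below are made up for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: 20 ligands, 3 descriptors, and synthetic pKi values
descriptors = pd.DataFrame(rng.normal(size=(20, 3)), columns=['d1', 'd2', 'd3'])
pki = pd.Series(2.0 * descriptors['d1'] + 0.1 * rng.normal(size=20), name='pKi')

# Pearson r between every descriptor column and pKi
r_with_pki = descriptors.corrwith(pki)

# Rank the descriptors by the absolute value of r
ranking = r_with_pki.abs().sort_values(ascending=False)
```

In this synthetic example `d1` ranks first because pKi was constructed from it.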

In [24]:
import pandas as pd
import numpy as np
import pickle

### Load the data

In [30]:
file_ = './main_table_of_Fourches_ligs_ERK2.pkl'
with open(file_, 'rb') as f:
    df_erk2_mols = pickle.load(f)
df_erk2_mols = df_erk2_mols.set_index('Name')

# MACCS Keys

In [31]:
# Extract the MACCS keys as a DataFrame (one row per molecule, one column per bit)
s = df_erk2_mols.maccs.map(lambda x: list(map(float, x)))  # np.float was removed in NumPy 1.24
df_maccs_all = pd.DataFrame.from_dict(dict(zip(s.index, s.values))).T
# RDKit MACCS keys include an unused bit at position 0 (because of 0-based indexing),
# so we drop it
df_maccs_all = df_maccs_all.drop([0], axis=1)
df_maccs_all.head()

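| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CSAR_erk2_18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| CSAR_erk2_20 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| CSAR_erk2_17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| CSAR_erk2_16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| CSAR_erk2_15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |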
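
As a toy version of the conversion above, the following sketch turns fingerprints into a labelled bit table and drops the unused position 0. The 5-bit strings and molecule names are hypothetical; real MACCS fingerprints from RDKit have 167 positions, and the storage format in the pickle may differ.

```python
import pandas as pd

# Hypothetical 5-bit fingerprints stored as bit strings; position 0 is the unused bit
fps = pd.Series({'mol_a': '00110', 'mol_b': '01010'})

# One list of floats per molecule
bits = fps.map(lambda bitstring: [float(b) for b in bitstring])

# orient='index' builds one row per molecule directly, so no transpose is needed
df_bits = pd.DataFrame.from_dict(dict(zip(bits.index, bits.values)), orient='index')

# Drop position 0 so that the remaining column labels match the key numbers (1..4 here)
df_bits = df_bits.drop(columns=[0])
```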


### Variance Threshold

Ash and Fourches dropped all the features/bits in the lowest quartile of the feature variances. This means they would remove 166/4 ≈ 42 variables; however, they actually removed 45 variables (see Table 1).
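
One possible reason for a count that differs from a strict quarter is ties at the cut-off: several features can share exactly the quartile variance, so the number removed depends on whether the comparison is strict or inclusive. This is only a guess at the cause, illustrated on made-up variances; the paper's exact rule is not reproduced here.

```python
import numpy as np

# Hypothetical variances for 8 features; four of them tie at the cut-off value
variances = np.array([0.0, 0.1, 0.1, 0.1, 0.1, 0.4, 0.5, 0.6])
q1 = np.quantile(variances, 0.25)

n_expected = len(variances) // 4                      # naive "one quarter" count
n_dropped_strict = int((variances < q1).sum())        # drop only values below q1
n_dropped_inclusive = int((variances <= q1).sum())    # drop values at or below q1
```

With these numbers the naive count is 2, while the strict rule removes 1 feature and the inclusive rule removes 5.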

##### In our results, how many bits are set in exactly one molecule, and how many are never set?

In [17]:
(df_maccs_all.sum().values == 1).sum()

5

In [18]:
# How many bits have only zeros
(df_maccs_all.sum().values == 0).sum()

37

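We can start by dropping the 37 bits that are never set. Note that the 5 bits counted above are set in exactly one molecule (their column sum is 1); they are not bits with only ones, which would sum to 87, the number of molecules. Dropping all 42 of these bits would leave 166 − 42 = 124 bits.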
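
To separate the three cases (never set, always set, set exactly once), the column sum can be compared with 0, with the number of rows, and with 1. A minimal sketch on a made-up table; the column names are hypothetical:

```python
import pandas as pd

# Hypothetical binary table: 4 molecules x 4 bits
bits = pd.DataFrame(
    {'always_off': [0, 0, 0, 0],
     'always_on': [1, 1, 1, 1],
     'on_once': [0, 1, 0, 0],
     'mixed': [1, 0, 1, 0]},
    index=['m1', 'm2', 'm3', 'm4'])

col_sums = bits.sum()
n_all_zero = int((col_sums == 0).sum())
n_all_one = int((col_sums == len(bits)).sum())   # sum equals the number of rows
n_on_once = int((col_sums == 1).sum())           # set in exactly one molecule

# Constant columns (all zeros or all ones) have a single unique value
constant_cols = bits.columns[bits.nunique() == 1].tolist()
```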

##### Use the VarianceThreshold class from *sklearn*:

In [19]:
from sklearn.feature_selection import VarianceThreshold

In [20]:
sel_var = VarianceThreshold(0)
df_maccs_filt1 = sel_var.fit_transform(df_maccs_all)
df_maccs_filt1.shape

(87, 120)
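
`fit_transform` returns a plain NumPy array, so the bit labels are lost. The selector's `get_support()` method exposes the boolean mask of retained columns; an equivalent pandas-only sketch for a zero threshold, assuming population variance (`ddof=0`) as the criterion, is shown below on a made-up table.

```python
import pandas as pd

# Hypothetical binary table with two constant bits
bits = pd.DataFrame({'b1': [0, 0, 0], 'b2': [1, 0, 1], 'b3': [1, 1, 1]})

variances = bits.var(ddof=0)          # population variance per column
kept = bits.loc[:, variances > 0]     # still a DataFrame, labels preserved
```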

In [21]:
np.array(sorted(sel_var.variances_))

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01136214, 0.01136214, 0.01136214, 0.01136214,
       0.01136214, 0.02246003, 0.02246003, 0.02246003, 0.02246003,
       0.02246003, 0.02246003, 0.02246003, 0.02246003, 0.0332937 ,
       0.0332937 , 0.0332937 , 0.0332937 , 0.0332937 , 0.0332937 ,
       0.04386313, 0.04386313, 0.04386313, 0.04386313, 0.04386313,
       0.04386313, 0.04386313, 0.04386313, 0.04386313, 0.04386

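##### Boolean features (Bernoulli rvs)
Assuming each descriptor is a Bernoulli random variable with $p = n_a/N$, where $n_a$ is the number of molecules in which the bit is set and $N$ is the total number of molecules:
> $\mathrm{Var}(x) = p(1 - p)$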

In [22]:
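n = 166  # number of MACCS bits; note that N in the formula above is the number of molecules (87)
p = df_maccs_all.sum()/n
var = p*(1-p)*n  # binomial-style variance n*p*(1-p)

# Keep only the bits whose variance lies above the first quartile
q = np.quantile(var.values, 0.25)
df_maccs_filt2 = df_maccs_all.iloc[:, var.values > q]
df_maccs_filt2.shape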

(87, 124)
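
For a 0/1 column, $p(1 - p)$ with $p$ taken over the molecules equals the population variance of that column. The sketch below checks this on a made-up table and also shows what changes when the column sum is divided by the number of bits instead, as with `n = 166` above: a bit that is always set no longer gets zero variance. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical table: 3 molecules x 4 bits
bits = pd.DataFrame({'b1': [0, 0, 0], 'b2': [1, 0, 1], 'b3': [1, 1, 1], 'b4': [1, 0, 0]})

n_molecules, n_bits = bits.shape

# p over molecules: the Bernoulli variance matches the population variance
p = bits.sum() / n_molecules
bernoulli_var = p * (1 - p)
matches_population_var = np.allclose(bernoulli_var, bits.var(ddof=0))

# p over the number of bits: the always-set bit 'b3' gets a non-zero value
p_by_bits = bits.sum() / n_bits
var_by_bits = p_by_bits * (1 - p_by_bits)
```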

## RDKit 2D Descriptors

In [None]:
### KNIME

In [45]:
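# Load the 2D descriptors exported from KNIME
rdk2d_knime = pd.read_csv('knime/2d_rdki_knime.csv')
# Strip the file extension so that 'Filename' can be matched against the ligand names
rdk2d_knime['Filename'] = rdk2d_knime['Filename'].apply(lambda x: x.split('.')[0])
rdk2d_knime = rdk2d_knime.set_index('Filename')
# Skip the first six columns (assumed here to be non-descriptor columns)
rdk2d_knime = rdk2d_knime.iloc[:, 6:]

# Align the rows with df_rdkit_all, which is built in cell In [35] further down
rdk2d_knime = rdk2d_knime.reindex(df_rdkit_all.index)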
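
`reindex` reorders the rows to follow the given index and fills rows for missing labels with NaN, so it is worth checking for unmatched ligands afterwards. A small sketch with made-up ligand names:

```python
import pandas as pd

# Hypothetical tables: 'other' lacks 'lig_b' and is in a different row order
reference = pd.DataFrame({'x': [1, 2, 3]}, index=['lig_a', 'lig_b', 'lig_c'])
other = pd.DataFrame({'y': [10.0, 30.0]}, index=['lig_c', 'lig_a'])

# Rows now follow reference.index; labels absent from 'other' become NaN rows
aligned = other.reindex(reference.index)

# Labels that had no match in 'other'
missing = aligned.index[aligned.isna().all(axis=1)].tolist()
```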

In [None]:
### RDKit

In [29]:
from rdkit.Chem import Descriptors
# MQN descriptors are numbered from 1 to 42
names_MQN = ['MQN' + str(i) for i in range(1, 43)]

# Final list of descriptors: every RDKit descriptor except the fragment counts ('fr_*')
names_of_all_rdkit_descriptors = [x[0] for x in Descriptors._descList if not x[0].startswith('fr_')]
FINAL_names_of_all_rdkit_descriptors = names_of_all_rdkit_descriptors + ['CalcNumAtomStereoCenters',
                                                                        'CalcNumUnspecifiedAtomStereoCenters',
                                                                        'GetNumAtoms'] + names_MQN

In [35]:
s = df_erk2_mols['2d_rdkit']  # one sequence of descriptor values per molecule
df_rdkit_all = pd.DataFrame.from_dict(dict(zip(s.index, s.values))).T
df_rdkit_all.columns = FINAL_names_of_all_rdkit_descriptors
df_rdkit_all

| | MaxEStateIndex | MinEStateIndex | MaxAbsEStateIndex | MinAbsEStateIndex | qed | MolWt | HeavyAtomMolWt | ExactMolWt | NumValenceElectrons | NumRadicalElectrons | ... | MQN33 | MQN34 | MQN35 | MQN36 | MQN37 | MQN38 | MQN39 | MQN40 | MQN41 | MQN42 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CSAR_erk2_18 | 12.715656 | -0.475203 | 12.715656 | 0.182531 | 0.492792 | 393.491 | 366.275 | 393.216475 | 152.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| CSAR_erk2_20 | 12.732601 | -0.584261 | 12.732601 | 0.017665 | 0.346052 | 443.935 | 417.727 | 443.172417 | 164.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| CSAR_erk2_17 | 12.673357 | -3.485752 | 12.673357 | 0.055118 | 0.318735 | 464.935 | 443.767 | 464.103352 | 164.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| CSAR_erk2_16 | 12.773795 | -0.577508 | 12.773795 | 0.254042 | 0.319141 | 468.970 | 447.802 | 468.113523 | 164.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| CSAR_erk2_15 | 12.672667 | -0.471951 | 12.672667 | 0.180730 | 0.505335 | 379.464 | 354.264 | 379.200825 | 146.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| erk2_05 | 12.390456 | -0.323510 | 12.390456 | 0.047547 | 0.779551 | 356.382 | 336.222 | 356.148455 | 136.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| erk2_04 | 12.554039 | -3.563305 | 12.554039 | 0.020421 | 0.629308 | 434.474 | 412.298 | 434.126005 | 160.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| erk2_03 | 12.497493 | -0.602134 | 12.497493 | 0.004419 | 0.606162 | 447.879 | 425.703 | 447.130946 | 164.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 3.0 |
| erk2_02 | 14.174431 | -0.549608 | 14.174431 | 0.002430 | 0.506804 | 422.387 | 406.259 | 422.107813 | 156.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 |


In [49]:
df_rdkit_all.columns

Index(['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex',
       'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt',
       'NumValenceElectrons', 'NumRadicalElectrons',
       ...
       'MQN33', 'MQN34', 'MQN35', 'MQN36', 'MQN37', 'MQN38', 'MQN39', 'MQN40',
       'MQN41', 'MQN42'],
      dtype='object', length=160)

In [50]:
rdk2d_knime.columns

Index(['SMR', 'LabuteASA', 'TPSA', 'AMW', 'ExactMW', 'NumLipinskiHBA',
       'NumLipinskiHBD', 'NumRotatableBonds', 'NumHBD', 'NumHBA',
       ...
       'MQN33', 'MQN34', 'MQN35', 'MQN36', 'MQN37', 'MQN38', 'MQN39', 'MQN40',
       'MQN41', 'MQN42'],
      dtype='object', length=118)
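
The two column sets differ in size (160 vs. 118), so a natural next step is to see which descriptor names are shared and which appear in only one table. A sketch using `Index.intersection` and `Index.difference` on short, made-up column lists; matching by exact name is an assumption, since the two tools may name the same descriptor differently (for example `ExactMolWt` and `ExactMW` above).

```python
import pandas as pd

# Hypothetical, shortened column sets standing in for the two tables
cols_rdkit = pd.Index(['MolWt', 'TPSA', 'MQN1', 'qed'])
cols_knime = pd.Index(['TPSA', 'AMW', 'MQN1'])

shared = cols_rdkit.intersection(cols_knime)     # names present in both
only_rdkit = cols_rdkit.difference(cols_knime)   # names only in the RDKit table
only_knime = cols_knime.difference(cols_rdkit)   # names only in the KNIME table
```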