# Create datasets for G1
## Scaling and Feature Selection

The original dataset and/or the balanced ones are first split into separate training and test files using a **seed**. All scaling and feature selection is fitted **only on the training set**; the fitted transformation is then applied unchanged to the test set:
- *Dataset split*: train and test sets; the train set will later be divided into train and validation subsets during the hyperparameter search for the best model of each ML method;
- *Scaling* of the train set using centering, standardization, etc.;
- *Reduction* of the train set dimension (after scaling): decrease the number of features by using fewer dimensions/derived features;
- *Feature selection* using the train set (after scaling): decrease the number of features by keeping only the most important ones for the classification.

Two CSV files will be created for each type of scaling, reduction or feature selection: *tr* - training and *ts* - test.
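The workflow described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the arrays `X_demo` and `y_demo` and the name `demo_scaler` are made up for this example and are not part of the notebook's datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: 200 rows, 5 features)
rng = np.random.default_rng(44)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 5))
y_demo = np.array([0, 1] * 100)

# Split first, with a fixed seed and stratification
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=44, stratify=y_demo)

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)  # statistics are learned from the training set only
X_ts_s = demo_scaler.transform(X_ts)      # the test set reuses the training statistics

print('train column means:', X_tr_s.mean(axis=0).round(6))
print('test column means :', X_ts_s.mean(axis=0).round(3))
```

The training columns are centered by construction, while the test columns are only approximately centered, because the test set never contributes to the fitted statistics.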

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split # for dataset split

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

Let's define the name of the original dataset, the working folder and the prefix characters for each scaling, dimension reduction or feature selection. Each transformation adds a prefix to the previous name of the file.

**You can use the original dataset, which may be unbalanced, or one of the balanced datasets obtained with the previous scripts (one file only)!**
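As a rough illustration of the naming convention, the helper below builds file names by stacking prefixes in the order the transformations are applied. The function `prefixed_name` is hypothetical and is not used elsewhere in this notebook; it only mirrors the pattern of names such as `s.ds_MA_tr.csv`.

```python
def prefixed_name(base_file, prefixes, subset):
    """Build an output file name by stacking transformation prefixes.

    base_file: name of the original CSV file
    prefixes : prefixes in the order the transformations are applied
    subset   : suffix of the subset, e.g. 'tr' or 'ts'
    """
    stem = base_file[:-4] if base_file.endswith('.csv') else base_file
    for p in prefixes:
        # each new transformation is written in front of the previous name
        stem = p.rstrip('.') + '.' + stem
    return stem + '_' + subset + '.csv'

print(prefixed_name('ds_MA.csv', ['s'], 'tr'))            # s.ds_MA_tr.csv
print(prefixed_name('ds_MA.csv', ['s', 'fs-rf.'], 'ts'))  # fs-rf.s.ds_MA_ts.csv
```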

In [2]:
# Create scaled datasets using normalized MA dataset
# Two CSV files will be created for each type of scaling, reduction or feature selection
WorkingFolder = './datasets/'

# change this to a balanced dataset such as upsampl.ds_MA.csv or downsampl.ds_MA.csv
# if you want to run all files, you should modify the entire script by looping all
# transformations over a list of input files [original, undersampled, upsampled]
sOrigDataSet  = 'ds_MA.csv'
sOrigDataSet_G1  = 'ds.G1_MA.csv'
sOrigDataSet_G1_det  = 'ds.G1_details.csv'

# Split details
seed = 44          # for reproducibility

test_size = 0.25  # train size = 1 - test_size
outVar = 'Lij'    # output variable

# Scalers: the files as prefix + original name
# =================================================
# Original (no scaling!), StandardScaler, MinMaxScaler, RobustScaler,
# QuantileTransformer (normal), QuantileTransformer(uniform)

# scaler prefix for file name
#scalerPrefix = ['o', 's', 'm', 'r', 'pyj', 'qn', 'qu']
# scalerPrefix = ['o', 's', 'm', 'r']
scalerPrefix = ['s']

# sklearn scalers
#scalerList   = [None, StandardScaler(), MinMaxScaler(),
#                RobustScaler(quantile_range=(25, 75)),
#                PowerTransformer(method='yeo-johnson'),
#                QuantileTransformer(output_distribution='normal'),
#                QuantileTransformer(output_distribution='uniform')]

# sklearn scalers
# scalerList   = [None, StandardScaler(), MinMaxScaler(), RobustScaler()]
scalerList   = [StandardScaler()]

# Dimension Reductions
# ===================
# PCA
reductionPrefix = 'pca'

# Feature selection
# =================
# RF feature selection, Univariate feature selection using chi-squared test,
# Univariate feature selection with mutual information

# prefix to add to the processed files for each FS method
#FSprefix = ['fs.rf.',
#            'fs.univchi.',
#            'fs.univmi.']

FSprefix = ['fs-rf.']

# number of total features for reduction and selection if we are not limited by experiment
noSelFeatures = 50

Start by reading the original dataset:

In [3]:
print('-> Reading source dataset:',sOrigDataSet,'...')
df = pd.read_csv(os.path.join(WorkingFolder, sOrigDataSet))
print('Columns:',len(df.columns),'Rows:',len(df))
print('Done')

print('-> Reading source dataset G1:',sOrigDataSet_G1,'...')
df_G1 = pd.read_csv(os.path.join(WorkingFolder, sOrigDataSet_G1))
print('Columns:',len(df_G1.columns),'Rows:',len(df_G1))
print('Done')

print('-> Reading source dataset G1 details:',sOrigDataSet_G1_det,'...')
df_G1_det = pd.read_csv(os.path.join(WorkingFolder, sOrigDataSet_G1_det))
print('Columns:',len(df_G1_det.columns),'Rows:',len(df_G1_det))
print('Done')

-> Reading source dataset: ds_MA.csv ...
Columns: 1519 Rows: 12906
Done
-> Reading source dataset G1: ds.G1_MA.csv ...
Columns: 1519 Rows: 1384
Done
-> Reading source dataset G1 details: ds.G1_details.csv ...
Columns: 3307 Rows: 1384
Done


## Dataset split

First, split the dataset using stratification, which matters for unbalanced datasets: the ratio between the classes is the same in the training and test sets.

In [4]:
# Get features and output as dataframes
print('--> Split of dataset in training and test ...')
X = df.drop(outVar, axis = 1) # remove output variable from input features
y = df[outVar]                # get only the output variable

# get only the values for features and output (as arrays)
Xdata = X.values # get values of features
Ydata = y.values # get output values

# split data in training and test sets (X = input features, y = output variable)
# using a seed, test size (defined above) and stratification for unbalanced classes
X_train, X_test, y_train, y_test = train_test_split(Xdata, Ydata,
                                                    test_size=test_size,
                                                    random_state=seed,
                                                    stratify=Ydata)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

print('Done!')

--> Split of dataset in training and test ...
X_train: (9679, 1518)
y_train: (9679,)
X_test: (3227, 1518)
y_test: (3227,)
Done!
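To see what stratification does, the following sketch compares the class ratio before and after a split on toy labels. The arrays `X_toy` and `y_toy` and the helper `class_ratio` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy unbalanced labels (assumption: 90% class 0, 10% class 1)
y_toy = np.array([0] * 180 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr_toy, y_ts_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=44, stratify=y_toy)

def class_ratio(labels):
    """Fraction of samples that belong to class 1."""
    return float(np.mean(labels == 1))

print('all  :', class_ratio(y_toy))
print('train:', class_ratio(y_tr_toy))
print('test :', class_ratio(y_ts_toy))
```

With `stratify` set, the three ratios should agree up to rounding; without it, the minority class can end up over- or under-represented in the test set.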


In [7]:
MAs = [col for col in df_G1.columns if ('MA-' in col)]
len(MAs)

1518

In [9]:
# Get features and output as dataframes for G1
print('--> Split of dataset in training and test ...')

X_G1 = df_G1[MAs] # keep only the MA feature columns (this excludes the output variable)
#X_G1 = df_G1.drop(outVar, axis = 1) # remove output variable from input features
y_G1 = df_G1[outVar]                # get only the output variable

# get only the values for features and output (as arrays)
Xdata_G1 = X_G1.values # get values of features
Ydata_G1 = y_G1.values # get output values

print('Xdata_G1:', Xdata_G1.shape)
print('Ydata_G1:', Ydata_G1.shape)

--> Split of dataset in training and test ...
Xdata_G1: (1384, 1518)
Ydata_G1: (1384,)


## Dataset scaling

Two files (training and test sets) will be saved for each scaling, including the non-scaled dataset; the scaled G1 dataset is saved as a third file for later predictions.

In [10]:
# Scale dataset
print('-> Scaling dataset train and test:')

for prefix, scaler in zip(scalerPrefix, scalerList): # each scaler with its file prefix
    
    # new file name; we will add tr and ts + csv
    newFile = prefix+'.'+sOrigDataSet[:-4]
    
    # decide to scale or not
    if scaler is None: # if it is the original dataset, do not scale!
        print('--> Original (no scaler!) ...')
        X_train_transf = X_train  # do not modify train set
        X_test_transf  = X_test   # do not modify test set
        X_G1_transf    = Xdata_G1 # do not modify G1 (it is saved below in both cases)
        
    else:              # if it is not the original dataset, apply scalers
        print('--> Scaler:', str(scaler), '...')
        X_train_transf = scaler.fit_transform(X_train) # fit the scaler on the train set only
        X_test_transf  = scaler.transform(X_test)      # use the same transformation for test set
        X_G1_transf    = scaler.transform(Xdata_G1)    # use the same transformation for G1

    # Save the training scaled dataset
    df_tr_scaler = pd.DataFrame(X_train_transf, columns=X.columns)
    df_tr_scaler[outVar]= y_train
    newFile_tr = newFile +'_tr.csv'

    print('---> Saving training:', newFile_tr, ' ...')
    df_tr_scaler.to_csv(os.path.join(WorkingFolder, newFile_tr), index=False)

    # Save the test scaled dataset
    df_ts_scaler = pd.DataFrame(X_test_transf, columns=X.columns)
    df_ts_scaler[outVar]= y_test
    newFile_ts = newFile +'_ts.csv'

    print('---> Saving test:', newFile_ts, ' ...')
    df_ts_scaler.to_csv(os.path.join(WorkingFolder, newFile_ts), index=False)
    
    # Save G1 scaled dataset for future predictions
    df_G1_scaler = pd.DataFrame(X_G1_transf, columns=X.columns)
    df_G1_scaler[outVar]= Ydata_G1
    newFile_G1 = newFile +'_G1.csv'

    print('---> Saving G1 scaled:', newFile_G1, ' ...')
    df_G1_scaler.to_csv(os.path.join(WorkingFolder, newFile_G1), index=False)


print('Done!')

-> Scaling dataset train and test:
--> Scaler: StandardScaler(copy=True, with_mean=True, with_std=True) ...
---> Saving training: s.ds_MA_tr.csv  ...
---> Saving test: s.ds_MA_ts.csv  ...
---> Saving G1 scaled: s.ds_MA_G1.csv  ...
Done!


In [11]:
# save scaler as file
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
scaler_filename = os.path.join(WorkingFolder, "scaler.save") 
joblib.dump(scaler, scaler_filename)

['./datasets/scaler.save']

In [12]:
# means of the scaling
scaler.mean_

array([ 9.38012462e-01, -3.61046970e-02, -3.38916796e+02, ...,
       -4.24350032e-16,  2.21888679e-15, -2.26174103e-15])

In [14]:
# variances of the scaling
scaler.var_

array([1.28891675e+06, 1.47732542e+02, 9.85131671e+08, ...,
       3.55586841e-32, 1.68402045e-30, 9.06104020e-31])

In [15]:
# standard deviations (scale factors) of the scaling
scaler.scale_

array([1.13530469e+03, 1.21545276e+01, 3.13868073e+04, ...,
       1.88570104e-16, 1.29769814e-15, 9.51894962e-16])

Save the means, variances and scale values of the StandardScaler to files (we need these values for the G1 prediction!):

In [16]:
np.savetxt(os.path.join(WorkingFolder, 'StandardScaler_mean.csv'), scaler.mean_.reshape((-1, 1)).T, delimiter=',')
np.savetxt(os.path.join(WorkingFolder, 'StandardScaler_var.csv'), scaler.var_.reshape((-1, 1)).T, delimiter=',')
np.savetxt(os.path.join(WorkingFolder, 'StandardScaler_scale.csv'), scaler.scale_.reshape((-1, 1)).T, delimiter=',')
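The saved statistics are enough to standardize new data without the scaler object, since standardization is `(x - mean) / scale`. The sketch below checks this on toy arrays written to a temporary folder; `X_fit`, `X_new` and `toy_scaler` are assumptions for this example and do not touch the real files in `./datasets/`.

```python
import os
import tempfile
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 4))  # toy "training" data
X_new = rng.normal(size=(10, 4))  # toy data to be scaled later (the role G1 plays here)

toy_scaler = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    mean_file = os.path.join(tmp, 'StandardScaler_mean.csv')
    scale_file = os.path.join(tmp, 'StandardScaler_scale.csv')
    # one row per file, one value per feature
    np.savetxt(mean_file, toy_scaler.mean_.reshape(1, -1), delimiter=',')
    np.savetxt(scale_file, toy_scaler.scale_.reshape(1, -1), delimiter=',')

    mean_loaded = np.loadtxt(mean_file, delimiter=',')
    scale_loaded = np.loadtxt(scale_file, delimiter=',')

# manual standardization with the values read back from disk
X_manual = (X_new - mean_loaded) / scale_loaded
print(np.allclose(X_manual, toy_scaler.transform(X_new)))
```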

### G1 scaling

In [45]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
scaler_filename = os.path.join(WorkingFolder, "scaler.save") 

# load the scaler
scaler = joblib.load(scaler_filename)
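A quick way to gain confidence in a reloaded scaler is a round trip: dump it, load it, and compare the transforms. This is a minimal sketch on a tiny made-up array, written to a temporary folder so that the real `scaler.save` file is left alone; the names `fitted` and `restored` are illustrative.

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
fitted = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scaler.save')
    joblib.dump(fitted, path)      # persist the fitted scaler
    restored = joblib.load(path)   # read it back as a new object

same = np.allclose(fitted.transform(X_fit), restored.transform(X_fit))
print('Restored scaler reproduces the transform:', same)
```

Files written by joblib are generally best loaded with the same scikit-learn version that created them.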

In [47]:
WorkingFolder = './datasets/'
fG1_MAs = "ds.G1_MA.csv"
print('-> Reading source dataset G1:',fG1_MAs,'...')
df_G1 = pd.read_csv(os.path.join(WorkingFolder, fG1_MAs))
print('Columns:',len(df_G1.columns),'Rows:',len(df_G1))
print('Done')

-> Reading source dataset G1: ds.G1_MA.csv ...
Columns: 1519 Rows: 1384
Done


In [48]:
X_G1 = df_G1.drop(outVar, axis = 1) # remove output variable from input features
y_G1 = df_G1['Lij']                # get only the output variable

# get only the values for features and output (as arrays)
Xdata_G1 = X_G1.values # get values of features
Ydata_G1 = y_G1.values # get output values

print('Xdata_G1:', Xdata_G1.shape)
print('Ydata_G1:', Ydata_G1.shape)

Xdata_G1: (1384, 1518)
Ydata_G1: (1384,)


In [49]:
X_G1_transf = scaler.transform(Xdata_G1)    # use the same transformation for G1

In [52]:
# Save G1 scaled dataset for future predictions
df_G1_scaler = pd.DataFrame(X_G1_transf, columns=X_G1.columns)
df_G1_scaler['Lij']= Ydata_G1
newFile_tr = 's.ds_MA_G1.csv'

In [54]:
print('---> Saving G1 scaled:',newFile_tr, ' ...')
df_G1_scaler.to_csv(os.path.join(WorkingFolder, newFile_tr), index=False)
print('Done!')

---> Saving G1 scaled: s.ds_MA_G1.csv  ...
Done!


### Selection of the same features for standardized G1 

For G1, keep only the features selected for the best model, so that the file can be used later for predictions:

In [55]:
# read G1 MAs
print('-> Reading source dataset:','s.ds_MA_G1.csv','...')
df_G1 = pd.read_csv(os.path.join(WorkingFolder, 's.ds_MA_G1.csv'))
print('Columns:',len(df_G1.columns),'Rows:',len(df_G1))
print('Done')
print(list(df_G1.columns))

-> Reading source dataset: s.ds_MA_G1.csv ...
Columns: 1519 Rows: 1384
Done
['MA-ALogP-STANDARD_TYPE_UNITSj', 'MA-ALogp2-STANDARD_TYPE_UNITSj', 'MA-AMR-STANDARD_TYPE_UNITSj', 'MA-apol-STANDARD_TYPE_UNITSj', 'MA-naAromAtom-STANDARD_TYPE_UNITSj', 'MA-nAromBond-STANDARD_TYPE_UNITSj', 'MA-nAtom-STANDARD_TYPE_UNITSj', 'MA-ATSc1-STANDARD_TYPE_UNITSj', 'MA-ATSc2-STANDARD_TYPE_UNITSj', 'MA-ATSc3-STANDARD_TYPE_UNITSj', 'MA-ATSc4-STANDARD_TYPE_UNITSj', 'MA-ATSc5-STANDARD_TYPE_UNITSj', 'MA-ATSm1-STANDARD_TYPE_UNITSj', 'MA-ATSm2-STANDARD_TYPE_UNITSj', 'MA-ATSm3-STANDARD_TYPE_UNITSj', 'MA-ATSm4-STANDARD_TYPE_UNITSj', 'MA-ATSm5-STANDARD_TYPE_UNITSj', 'MA-ATSp1-STANDARD_TYPE_UNITSj', 'MA-ATSp2-STANDARD_TYPE_UNITSj', 'MA-ATSp3-STANDARD_TYPE_UNITSj', 'MA-ATSp4-STANDARD_TYPE_UNITSj', 'MA-ATSp5-STANDARD_TYPE_UNITSj', 'MA-bpol-STANDARD_TYPE_UNITSj', 'MA-nB-STANDARD_TYPE_UNITSj', 'MA-C1SP1-STANDARD_TYPE_UNITSj', 'MA-C2SP1-STANDARD_TYPE_UNITSj', 'MA-C1SP2-STANDARD_TYPE_UNITSj', 'MA-C2SP2-STANDARD_TYPE_UNITS

In [56]:
# get selected feature names from fs-rf.s.ds_MA_ts.csv
print('-> Reading:','fs-rf.s.ds_MA_ts.csv','...')
df_sel = pd.read_csv(os.path.join(WorkingFolder, 'fs-rf.s.ds_MA_ts.csv'))
print('Columns:',len(df_sel.columns),'Rows:',len(df_sel))
print('Done')
print(list(df_sel.columns))

-> Reading: fs-rf.s.ds_MA_ts.csv ...
Columns: 121 Rows: 3227
Done
['MA-C1SP1-STANDARD_TYPE_UNITSj', 'MA-khs.sBr-STANDARD_TYPE_UNITSj', 'MA-khs.tN-STANDARD_TYPE_UNITSj', 'MA-khs.dsN-STANDARD_TYPE_UNITSj', 'MA-C4SP3-STANDARD_TYPE_UNITSj', 'MA-khs.ddssS-STANDARD_TYPE_UNITSj', 'MA-khs.aaS-STANDARD_TYPE_UNITSj', 'MA-khs.ssssC-STANDARD_TYPE_UNITSj', 'MA-MDEC.44-STANDARD_TYPE_UNITSj', 'MA-khs.tsC-STANDARD_TYPE_UNITSj', 'MA-khs.dsN-ASSAY_CHEMBLID', 'MA-khs.aaS-ASSAY_CHEMBLID', 'MA-C3SP3-ASSAY_CHEMBLID', 'MA-MDEN.11-ASSAY_CHEMBLID', 'MA-khs.dsCH-ASSAY_CHEMBLID', 'MA-MDEO.12-ASSAY_CHEMBLID', 'MA-WPATH-ASSAY_CHEMBLID', 'MA-khs.aasN-ASSAY_CHEMBLID', 'MA-LipinskiFailures-ASSAY_CHEMBLID', 'MA-PetitjeanNumber-ASSAY_CHEMBLID', 'MA-MDEC.23-ASSAY_TYPE', 'MA-MDEC.33-ASSAY_TYPE', 'MA-MDEN.11-ASSAY_TYPE', 'MA-khs.tsC-ASSAY_TYPE', 'MA-SP.1-ASSAY_TYPE', 'MA-SP.2-ASSAY_TYPE', 'MA-SP.4-ASSAY_TYPE', 'MA-SP.5-ASSAY_TYPE', 'MA-VP.1-ASSAY_TYPE', 'MA-VP.5-ASSAY_TYPE', 'MA-naAromAtom-ASSAY_ORGANISM', 'MA-ATSm1-ASSAY

In [57]:
# check repeated column names
l = list(df_sel.columns)
set([x for x in l if l.count(x) > 1])

set()
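The same check can also be written with pandas' own `Index.duplicated`, which avoids counting every name against the full list. The column names below are hypothetical and only serve to show a non-empty result.

```python
import pandas as pd

cols = pd.Index(['MA-a', 'MA-b', 'MA-a', 'Lij'])  # hypothetical column names

# duplicated() flags every repeat after the first occurrence
repeated = sorted(set(cols[cols.duplicated()]))
print(repeated)  # ['MA-a']
```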

In [58]:
df_G1_sel = df_G1[df_sel.columns]
print('Sel Columns:',len(df_G1_sel.columns),'Sel Rows:',len(df_G1_sel))

Sel Columns: 121 Sel Rows: 1384
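Selecting by a list of names fails with a `KeyError` if any name is absent, so it can help to report missing columns explicitly before selecting. A small sketch with made-up frames (`full` and `wanted` are assumptions, not the notebook's data):

```python
import pandas as pd

# toy stand-ins for the scaled G1 table and the selected feature names
full = pd.DataFrame({'MA-a': [0.1, 0.2],
                     'MA-b': [1.0, 2.0],
                     'MA-c': [5.0, 6.0],
                     'Lij': [1, -1]})
wanted = ['MA-c', 'MA-a', 'Lij']

# report missing names up front instead of failing inside the selection
missing = [c for c in wanted if c not in full.columns]
if missing:
    raise KeyError(f'Columns missing from the dataset: {missing}')

subset = full[wanted]  # keeps the order given in `wanted`
print(list(subset.columns))
```

The selection follows the order of the name list, so the resulting file has its columns in the same order as the file the names were read from.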


In [41]:
df_G1_sel.head()

Unnamed: 0,MA-C1SP1-STANDARD_TYPE_UNITSj,MA-khs.sBr-STANDARD_TYPE_UNITSj,MA-khs.tN-STANDARD_TYPE_UNITSj,MA-khs.dsN-STANDARD_TYPE_UNITSj,MA-C4SP3-STANDARD_TYPE_UNITSj,MA-khs.ddssS-STANDARD_TYPE_UNITSj,MA-khs.aaS-STANDARD_TYPE_UNITSj,MA-khs.ssssC-STANDARD_TYPE_UNITSj,MA-MDEC.44-STANDARD_TYPE_UNITSj,MA-khs.tsC-STANDARD_TYPE_UNITSj,...,MA-CHOC760101.lag27-TARGET_CHEMBLID,MA-CHAM810101.lag10-TARGET_CHEMBLID,MA-CHAM810101.lag21-TARGET_CHEMBLID,MA-CHAM810101.lag20-TARGET_CHEMBLID,MA-CHAM810101.lag19-TARGET_CHEMBLID,MA-CHAM810101.lag18-TARGET_CHEMBLID,MA-CHAM810101.lag17-TARGET_CHEMBLID,MA-CHAM810101.lag16-TARGET_CHEMBLID,MA-CHAM810101.lag15-TARGET_CHEMBLID,Lij
0,-0.24857,11.182644,-0.240237,-0.427112,-0.180841,-0.581455,-0.377548,-0.424674,-0.080655,-0.24857,...,-1.460794,-1.116396,-1.981279,1.403685,-0.935244,-0.455293,2.108246,-0.423082,-0.146818,-1
1,-0.24857,11.182644,-0.240237,-0.427112,-0.180841,-0.581455,-0.377548,-0.424674,-0.080655,-0.24857,...,-1.460794,-1.116396,-1.981279,1.403685,-0.935244,-0.455293,2.108246,-0.423082,-0.146818,-1
2,-0.24857,11.182644,-0.240237,-0.427112,-0.180841,-0.581455,-0.377548,-0.424674,-0.080655,-0.24857,...,-1.460794,-1.116396,-1.981279,1.403685,-0.935244,-0.455293,2.108246,-0.423082,-0.146818,-1
3,-0.144075,11.211137,-0.139239,-0.101706,-0.324011,-0.15154,-0.263791,-0.444773,-0.102556,-0.144075,...,-1.460794,-1.116396,-1.981279,1.403685,-0.935244,-0.455293,2.108246,-0.423082,-0.146818,-1
4,-0.144075,11.211137,-0.139239,-0.101706,-0.324011,-0.15154,-0.263791,-0.444773,-0.102556,-0.144075,...,-1.460794,-1.116396,-1.981279,1.403685,-0.935244,-0.455293,2.108246,-0.423082,-0.146818,-1


In [59]:
# save the fs-rf.s.ds_MA_G1.csv
print('-> Saving:','fs-rf.s.ds_MA_G1.csv','...')
df_G1_sel.to_csv(os.path.join(WorkingFolder, 'fs-rf.s.ds_MA_G1.csv'), index=False)
print('Done!')

-> Saving: fs-rf.s.ds_MA_G1.csv ...
Done!
