# Regressor Training

## Introduction

This notebook walks through the training of a random forest machine learning regressor. The data is split into a training set and a test set, by default at a ratio of 9:1. The expression values are scaled to zero mean and unit variance based on the training set. Positions that are non-informative because no alternative nucleotides have been tested are removed. Performance is evaluated with the R^2 score from sklearn, and the correlation between measured and predicted expression values is plotted. The feature importances from the random forest regression represent the contribution of each nucleotide position to the prediction. They are extracted and visualized as a logo plot.
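To make the evaluation metric concrete, the following is a minimal sketch of the coefficient of determination, R^2 = 1 - SS_res / SS_tot, which is the quantity reported by sklearn's `r2_score`. The function name `r2_score_sketch` and the numbers are made up for illustration and are not part of the notebook.

```python
def r2_score_sketch(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_measured = sum(measured) / len(measured)
    # residual sum of squares: deviation of the predictions from the measurements
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    # total sum of squares: deviation of the measurements from their mean
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


measured = [1.0, 2.0, 3.0, 4.0]        # made-up expression values
perfect = [1.0, 2.0, 3.0, 4.0]         # predictions identical to the measurements
mean_only = [2.5, 2.5, 2.5, 2.5]       # always predicting the mean

print(r2_score_sketch(measured, perfect))    # 1.0
print(r2_score_sketch(measured, mean_only))  # 0.0
```

A score of 1 means perfect prediction, 0 means the model is no better than predicting the mean, and negative values are possible for worse predictions. The sketch does not handle the case where all measured values are identical (SS_tot = 0).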

If you generate a new regressor with a new train-test split, set the `ML_Date` parameter in your `config.txt` to the current date (format `YYYYMMDD`).

If you generate a new regressor based on an existing train-test split, set the `ML_Date` parameter in your `config.txt` to the date of that existing split.

Make sure the `ML_Regressor` parameter in the `config.txt` file is set to your preferred ML approach and that the parameter `Data_Standard` is set to `True` for SVR.

## System initiation

Loading all necessary libraries.

In [1]:
import os
import time
import timeit
import joblib
import pickle
from Exp2Ipynb import init_Exp2, Data_Src_Load, make_DataDir, split_train_test, ExpressionScaler, Sequence_Conserved_Adjusted, MyRF, MySV, MyGB
from sklearn.model_selection import GroupShuffleSplit

### Variable setting

We load the naming conventions and parameters for statistical analysis and regression from the configuration file (here `config_Pput.txt`).

In [2]:
Name_Dict = init_Exp2('config_Pput.txt')

ML_Date = Name_Dict['ML_Date']
File_Base = Name_Dict['Data_File'].split('.')[0]
Data_Folder = 'data-{}'.format(File_Base) 
Measure_Numb = int(Name_Dict['Library_Expression'])
ML_Regressor = Name_Dict['ML_Regressor'][:-1]
ML_Type = Name_Dict['ML_Regressor'][-1]
Y_Col_Name = eval(Name_Dict['Y_Col_Name'])
Response_Value = eval(Name_Dict['Response_Value'])


Already existent data directory  data-Example1-Pput .


## Data loading

General information on the data source CSV file is stored in the 'config.txt' file generated in the '0-Workflow' notebook. The sequence and expression data are stored in a CSV file with an identifier column (not used for anything; 'Strain ID' in the example output below), the DNA sequence in column 'Sequence', and the expression strength in an activity column ('Promoter Activity' in the example output below). While loading, the sequence is converted to a label-encoded sequence, with ['A','C','G','T'] replaced by [0,1,2,3], and to a one-hot encoding. Note that the corresponding data frame columns use the suffix '-encrypted' for these encodings.

In [3]:
SeqDat = Data_Src_Load(Name_Dict)
SeqDat.head(3)

Following outliers were detected: ID: ['BGSPL14g_19_a'], Value: [[50.13234789]]
Categorization of expression.
The expression values were sorted into the following bins: [ 0.2178722  16.5660553  24.76999231 35.15239853]


| | level_0 | index | Strain ID | Sequence | Promoter Activity | Sequence_label-encrypted | Sequence_letter-encrypted | GC-content | Promoter Activity_ML |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | BG14g | [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0,... | 35.057893 | [2, 1, 1, 1, 0, 3, 3, 2, 0, 1, 0, 0, 2, 2, 1, ... | GCCCATTGACAAGGCTCTCGCGGCCAGGTATAATTGCACG | 0.575 | 2 |
| 1 | 12 | 12 | BGSPL14g_01_a | [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0,... | 2.185554 | [2, 1, 1, 1, 0, 0, 3, 2, 0, 1, 0, 0, 2, 2, 1, ... | GCCCAATGACAAGGCTCTCGCGGCCAGGTATAATTGCACG | 0.575 | 0 |
| 2 | 22 | 22 | BGSPL14g_01_c | [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0,... | 1.383577 | [2, 1, 1, 1, 0, 1, 3, 2, 0, 1, 0, 0, 2, 2, 1, ... | GCCCACTGACAAGGCTCTCGCGGCCAGGTATAATTGCACG | 0.6 | 0 |
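The two encodings produced while loading can be sketched as follows. This is an illustration of the mapping described above (['A','C','G','T'] to [0,1,2,3] and the matching one-hot rows), not the implementation inside `Data_Src_Load`; the function names `label_encode` and `one_hot_encode` are hypothetical.

```python
NUCLEOTIDES = ['A', 'C', 'G', 'T']
# map each nucleotide to its integer label: A -> 0, C -> 1, G -> 2, T -> 3
LABEL_OF = {base: idx for idx, base in enumerate(NUCLEOTIDES)}


def label_encode(sequence):
    """Replace each nucleotide by its integer label."""
    return [LABEL_OF[base] for base in sequence.upper()]


def one_hot_encode(sequence):
    """One row per position with a single 1 in the column of the nucleotide."""
    encoded = []
    for label in label_encode(sequence):
        row = [0] * len(NUCLEOTIDES)
        row[label] = 1
        encoded.append(row)
    return encoded


print(label_encode('GCCCATTG'))  # [2, 1, 1, 1, 0, 3, 3, 2]
print(one_hot_encode('GCC'))     # [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
```

Both printed results agree with the first entries shown for strain BG14g in the data frame preview.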


## Data manipulation

For machine learning, the data is first separated into training and test sets. The training set is used to fit a standard scaler that standardizes the expression values to zero mean and unit variance. At each position, the entropy is calculated to assess how much nucleotide diversity has been sampled. If the entropy at any position is zero, i.e. only one nucleotide is present in all samples, that position is removed because it is non-informative for further analysis (position entropy analysis).
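The position entropy analysis can be illustrated with a short sketch. It assumes Shannon entropy in bits (log base 2) and removes positions whose entropy is less than or equal to the cutoff; the actual log base and comparison used inside `Sequence_Conserved_Adjusted` are not shown in this notebook, and the helper names below are hypothetical.

```python
import math
from collections import Counter


def position_entropy(sequences):
    """Shannon entropy (in bits) of the nucleotide distribution at each position."""
    entropies = []
    for column in zip(*sequences):          # iterate over alignment columns
        total = len(column)
        counts = Counter(column)
        entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(entropy)
    return entropies


def non_informative_positions(sequences, entropy_cutoff=0.0):
    """Indices of positions whose entropy does not exceed the cutoff."""
    return [pos for pos, h in enumerate(position_entropy(sequences)) if h <= entropy_cutoff]


# made-up library: position 0 is always 'G', positions 1 and 2 each sample two nucleotides
sequences = ['GCA', 'GCT', 'GTA', 'GTT']
print(position_entropy(sequences))           # [0.0, 1.0, 1.0]
print(non_informative_positions(sequences))  # [0]
```

Position 0 carries no information because every sample has the same nucleotide there, so it would be dropped from the regressor input.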

### Split data to train and test set

In [4]:
# You can generate a new train-test split or load an existing one from file.
# A new split is generated only if ML_Date is the current date and no split file for that date exists yet.
# Otherwise, the existing train-test split identified by ML_Date is loaded.
# The identifier is based on the date of the corresponding machine learning training.
TrainTest_File = os.path.join(Data_Folder, '{}_{}_{}_TrainTest-Data.pkl'.format(ML_Date, File_Base, Response_Value))
# True generates a new split, False loads an existing split
GenSplit = ML_Date == time.strftime('%Y%m%d') and not os.path.isfile(TrainTest_File)
if GenSplit:
    print('new train-test')
    Measure_Name = ['{}_ML'.format(MeasName) for MeasName in Y_Col_Name]

    train_size = 1 - eval(Name_Dict['TestRatio'])
    # split number '1' because we only use one final test set. Cross validation comes later
    gss = GroupShuffleSplit(n_splits=1, train_size=train_size)
    X = SeqDat['Sequence']
#     if ML_Type=='C':
#         y = SeqDat['ExprCat']
#     else:
    y = SeqDat[Measure_Name]
    groups = SeqDat['Sequence_letter-encrypted'].str.upper()
    Train_Idx, Test_Idx = list(gss.split(X, y, groups))[0]
    SeqTest = SeqDat.iloc[Test_Idx].reset_index(drop=True)
    SeqTrain = SeqDat.iloc[Train_Idx].reset_index(drop=True)

    TrainTest_Data = {'Train': SeqTrain, 'Test': SeqTest}
    TrainTest_File = os.path.join(Data_Folder, '{}_{}_{}_TrainTest-Data.pkl'.format(time.strftime('%Y%m%d'), File_Base, Response_Value))
    with open(TrainTest_File, 'wb') as file_handle:
        pickle.dump(TrainTest_Data, file_handle)

    print('Train and test data stored as: {}'.format(TrainTest_File))
else:
    with open(TrainTest_File, 'rb') as file_handle:
        TrainTest_Data = pickle.load(file_handle)
    SeqTrain, SeqTest = TrainTest_Data['Train'], TrainTest_Data['Test']
    print('Load existing Train-Test split {}.'.format(TrainTest_File))

Load existing Train-Test split data-Example1-Pput/20210430_Example1-Pput_3_TrainTest-Data.pkl.
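The split above uses the sequence as the group label, so repeated measurements of the same sequence never end up in both the training and the test set. The idea can be sketched in plain Python as follows. This is a simplified stand-in for illustration, not sklearn's `GroupShuffleSplit` implementation; the function `group_shuffle_split` and the example sequences are hypothetical. As in sklearn, the ratio is applied to the number of groups rather than the number of samples.

```python
import random


def group_shuffle_split(groups, test_ratio=0.1, seed=0):
    """Split sample indices so that all members of a group share one set."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # the test ratio refers to groups, not to individual samples
    n_test = max(1, round(test_ratio * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# made-up sequences; 'GCCA' and 'GACA' were each measured twice
sequences = ['GCCA', 'GCCA', 'GTCA', 'GACA', 'GACA', 'GGCA',
             'GCTA', 'GCGA', 'GCAA', 'GTTA', 'GATA']
train_idx, test_idx = group_shuffle_split(sequences, test_ratio=0.1)

train_groups = {sequences[i] for i in train_idx}
test_groups = {sequences[i] for i in test_idx}
print(train_groups.isdisjoint(test_groups))  # True
```

Keeping identical sequences together avoids an overly optimistic test score that would result from the regressor having already seen a test sequence during training.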


### Data engineering

Normalization of the data improves training for kernel-based and artificial-neural-network-based strategies. However, omit this step for classification and regression tree (CART) approaches.

If the entropy at any position is below a threshold (`Entropy_cutoff`) because too few nucleotides are sampled, that position is removed because it is non-informative for further analysis (position entropy analysis).

In [5]:
# Standardization step; in this cell it runs only if 'Response_Value' is 0.
# 'Data_Standard' records whether the standardization was performed.
# Targets for classification cannot be normalized.
# The standardization is performed on the original data, not on data from a previous standardization.
# Previous standardization results are overwritten.
Data_Standard = False
if Response_Value == 0:
    print('New standardization of observed expression values.')
    Data_Standard = True
    SeqTrain, Expr_Scaler = ExpressionScaler(SeqTrain, Name_Dict)
    # storing scaler
    Scaler_File = os.path.join(Data_Folder, '{}_{}_{}-Scaler.pkl'.format(time.strftime('%Y%m%d'), File_Base, Name_Dict['ML_Regressor']))
    with open(Scaler_File, 'wb') as file_handle:
        pickle.dump(Expr_Scaler, file_handle)

# removing non-informative positions without base diversity, based on the one-hot encoding
SeqTrain_Hadj, Positions_removed, PSEntropy = Sequence_Conserved_Adjusted(SeqTrain, Name_Dict, Entropy_cutoff=float(Name_Dict['Entropy_cutoff']))

print('Normalization: {}'.format(Data_Standard))

Normalization: False
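In the run shown here no standardization took place, because it is only performed when `Response_Value` is 0. For the case where it does run, the following sketch illustrates the principle of fitting the scaler on the training set only and reusing its mean and standard deviation for other data. The class `TrainScaler` and the numbers are hypothetical and do not reproduce `ExpressionScaler`; the population standard deviation is used, as in sklearn's `StandardScaler`.

```python
import statistics


class TrainScaler:
    """Standardize values with the mean and standard deviation of the training set."""

    def fit(self, values):
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.pstdev(values)   # population standard deviation
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, scaled):
        return [s * self.std_ + self.mean_ for s in scaled]


train_expr = [2.0, 4.0, 6.0, 8.0]   # made-up training expression values
test_expr = [5.0, 10.0]             # made-up test expression values

scaler = TrainScaler().fit(train_expr)      # statistics come from the training set only
train_scaled = scaler.transform(train_expr)
test_scaled = scaler.transform(test_expr)   # the test set reuses the training statistics
restored = scaler.inverse_transform(test_scaled)
print(train_scaled, test_scaled, restored)
```

Fitting on the training set alone keeps information about the test set out of the preprocessing, and the inverse transform converts predictions back to the original expression scale.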


## Regression with grid search on shuffle split

You can either start a new training of a random forest regressor or load an existing regressor. If you load an existing random forest regressor, the parameters of the standard scaler are loaded based on names in the config file. For the estimation, the training set is dynamically separated into a new training and test set at a 9:1 ratio (parameter 'test_ratio') with 100 random shuffle splits (parameter 'split_number'). The training takes about 5 minutes on 16 CPU cores.

**User input:** <br>
 * Decision whether a new random-forest training is started or an existing regressor is loaded.
 
*Example:*<br>
 Start new random-forest training by setting:<br>
 RFR_File = 0<br>
 otherwise, insert the file address:<br>
 RFR_File = 'data-Example1-Pput\\20191106_Example1-Pput_RFR_ML-File.pkl'

In [6]:
# Container for the best estimators
Regressor_Best = dict()
# ML training for each independent promoter library measurement
for Meas_Idx in range(Measure_Numb):
    # starting the machine learning
    # This can take a while
    start_time = timeit.default_timer()
    test_ratio = .1
    split_number = 100
    # If the data is standardized we have to extract it from the separate column, otherwise we use the original data column
    # If Categorization is targeted, we take the category data frame column
    Measure_Name = '{}_ML'.format(Y_Col_Name[Meas_Idx])
#     if eval(Name_Dict['Data_Standard']):
#         Meas_Name = '{}_scaled'.format(Y_Col_Name[Meas_Idx]) 
#     elif ML_Type == 'C':
#         Meas_Name = '{}_Cat'.format(Y_Col_Name[Meas_Idx])
#     else: 
#         Meas_Name = Y_Col_Name[Meas_Idx]
    AddFeat = eval(Name_Dict['Add_Feat'])
    MLType = 'Classification' if Response_Value > 1 else 'Regression'
    print('Starting new {}-ML training with {} for {}'.format(ML_Regressor, MLType, Measure_Name))
    # starting the training with the approach defined in the config file
    # RF: random forest
    if ML_Regressor == 'RF':
        MyML = MyRF(SeqTrain_Hadj, test_ratio, split_number, Measure_Name, Response_Value, AddFeat)
    # SV: support vector
    elif ML_Regressor == 'SV':
        MyML = MySV(SeqTrain_Hadj, test_ratio, split_number, Measure_Name, Response_Value, AddFeat)
    # GB: gradient boosting
    elif ML_Regressor == 'GB':
        MyML = MyGB(SeqTrain_Hadj, test_ratio, split_number, Measure_Name, Response_Value, AddFeat)
#     elif ML_Type == 'TPOT':
#         print('TPOT')
    else:
        print('Machine Learning type not recognized, choose:')
        print('"RF<R/C>": random forest')
        print('"GB<R/C>": gradient boosting')
        print('"SV<R/C>": support vector')
#         print('"TPOT": automated tree regression pipeline (requires TPOT installation)')
        break
        
    run_time = timeit.default_timer() - start_time
    print('Training run time: {:.0f} sec'.format(run_time))
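    # get_params is referenced without parentheses, so the bound method is printed; its repr includes the best estimator.
    # Call get_params() instead to obtain the full parameter dictionary.
    print('Estimator hyperparameter: ', MyML.best_estimator_.get_params)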
    
    # getting the best estimator
    ML_Best = MyML.best_estimator_

    # saving the best estimator
    Regressor_File = os.path.join(Data_Folder, '{}_{}_{}_{}{}-Regressor.pkl'.format(time.strftime('%Y%m%d'), File_Base, Measure_Name.replace(' ','-'), ML_Regressor, Response_Value))
    joblib.dump(ML_Best, Regressor_File)
    # conserved positions not used as input for the regressor
    Data_Prep_Params = {'Positions_removed': Positions_removed}
    # Mean and standard deviation of training set expression used for normalizing
    if eval(Name_Dict['Response_Value'])==0:
        # The standard scaler default name is the name of the expression measurement column with suffix: '_Scaler'    
        Scaler_DictName = '{}_Scaler'.format(Measure_Name)
        Data_Prep_Params[Scaler_DictName] = Expr_Scaler[Scaler_DictName]

    Parameter_File = os.path.join(Data_Folder, '{}_{}_{}_{}{}-Params.pkl'.format(time.strftime('%Y%m%d'), File_Base, Measure_Name.replace(' ','-'), ML_Regressor, Response_Value))
    with open(Parameter_File, 'wb') as file_handle:
        pickle.dump(Data_Prep_Params, file_handle)

Starting new SV-ML training with Classification for Promoter Activity_ML
Training run time: 111 sec
Estimator hyperparameter:  <bound method BaseEstimator.get_params of SVC(C=1000.0, gamma=0.20691380811147903)>
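The hyperparameter selection described in this section, a grid search scored on repeated random shuffle splits of the training set, can be sketched in plain Python. The internals of `MyRF`, `MySV` and `MyGB` are not shown in this notebook, so this is only an illustration under stated assumptions: a simple k-nearest-neighbour regressor stands in for the real estimators, the data is synthetic, and all function names are hypothetical.

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Mean target of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def r2(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def grid_search_shuffle_split(x, y, k_grid, test_ratio=0.1, split_number=100, seed=0):
    """Average the R^2 of every grid value over repeated random splits."""
    rng = random.Random(seed)
    n_val = max(2, round(test_ratio * len(x)))
    scores = {k: 0.0 for k in k_grid}
    for _ in range(split_number):
        idx = list(range(len(x)))
        rng.shuffle(idx)                       # new random split in every round
        val, fit = idx[:n_val], idx[n_val:]
        fit_x, fit_y = [x[i] for i in fit], [y[i] for i in fit]
        val_y = [y[i] for i in val]
        for k in k_grid:
            pred = [knn_predict(fit_x, fit_y, x[i], k) for i in val]
            scores[k] += r2(val_y, pred) / split_number
    best_k = max(scores, key=scores.get)       # grid value with the best mean score
    return best_k, scores


# synthetic data: noisy sine curve
data_rng = random.Random(1)
x = [i / 10 for i in range(60)]
y = [math.sin(v) + data_rng.gauss(0, 0.1) for v in x]

k_grid = [1, 3, 5, 25]
best_k, scores = grid_search_shuffle_split(x, y, k_grid, test_ratio=0.1, split_number=100)
print('best k:', best_k)
print('mean R^2 per k:', scores)
```

Averaging over many random splits makes the choice of hyperparameters less dependent on one particular partition of the training data, at the cost of a training time that grows with `split_number` and the size of the grid.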
