# The ExpressionExpert.ipynb

## Introduction

The ExpressionExpert.ipynb is a Jupyter notebook for the analysis of promoter libraries. The user uploads one or several sequence libraries with associated expression values. The notebook guides the user through:
 1. the statistical analysis, 
 2. the training of a random forest regressor,
 3. the evaluation of the regressor performance,
 4. the set-up of a synthetic promoter library within the experimental exploration region, and 
 5. the selection of new promoters with defined expression.
 
--- 
## Workflow initialization
The workflow is distributed across several notebooks. However, many functions read the same file and store their results in a common directory. In the following, a project directory is generated in combination with the file name of the promoter library. This information is stored in the 'config.txt' file for all other notebooks. The config file is evaluated in dedicated cells at the start of each notebook. Once initialized, you can also rename plots or output files by changing the names in the config file and then re-running notebooks 1-5.

The code cell below generates a `config.txt` file. The relevant user input is enclosed between the two lines of '#' signs. Once you are satisfied with a config file, you can rename it and use the new file name at the appropriate positions where the config file is loaded in the other notebooks.

**User Input:** <br>
 * **Provide information starting from `Data_File` until `Synth_Seq_MaxNumber`.**

*Example: <br>
An example file for a promoter library in* Pseudomonas putida *KT2440 is provided.* <br>
Data_File = 'liu_bacillus.csv'

---
**Author: Ulf Liebal** <br>
**Institution: Institute of Applied Microbiology, RWTH Aachen** <br>
**Contact: ulf.liebal@rwth-aachen.de** <br>
**Date: 2020/07/01** <br>

---

In [None]:
import os
import time
#####################
# start User Input
# File name of the library, in csv format
# example input: 'Example1-Pput.csv', 'PromLib_EcolPtai.csv'
Data_File = 'mutalik12_Ecol.csv' #'liu_bacillus.csv'  
# Sequence column name
Sequence = 'Short sequence'#'Sequence(5-3)'
# column name of expression activity
# example input: ['Promoter Activity'], ['Ecol Promoter Activity', 'Ptai Promoter Activity'],
# Make sure all entries are within square brackets and quotation marks
Y_Col_Name = ['sE o/exp']#['GFP relative activity']
# Promoter ID column name
# 'Strain ID'
ID_Col_Name = 'Promoter'#'Name'
# expression unit for plots
# example: '$\mu$M(GFP)/(gCDW*h)' 'GFP intensity'
Expression_Unit = 'Expression'#'GFP Activity'
# Response value engineering
# The regressor can be trained on the original expression values, on standardized values with zero mean and unit variance, or on categorized expression values
# Response_Value = 0: standardized expression with zero mean and unit variance
# Response_Value = 1: expression as measured in the input
# Response_Value = >1: categorized expression with bin number according to the value
Response_Value = 3
# The availability of replicates improves regressor training. If your data is based on statistical summaries, you can generate random samples with the same statistical properties.
# For this, provide the column names for the standard deviation; the mean is assumed to be in Y_Col_Name.
# Stats_Samples contains the column names with the number of replicates
Stats2Samples = False
Stats_Std = ['Error']
Stats_Samples = ['Samples']
# column for additional features, this particular feature is generated by the script
Add_Feat = ['GC-content']
# Reference Sequence for sequence diversity histogram
# for X-Host 'GCCCATTGACAAGGCTCTCGCGGCCAGGTATAATTGCACG'
RefSeq = ''
# Machine learning approach for regression
# choose RFR: random forest regression, GBR: gradient boosting regression, SVR: support vector regression
ML_Regressor = 'RFR'
# Kernel- and function-based machine learning approaches, like SVM and artificial neural networks, require data standardization. Classification and regression trees, including random forests, do not benefit from data standardization.
# Set variable with Boolean for data standardization
Data_Standard = False
# if previously a synthetic library was generated it is stored under a generic name with only the date as variable
# give synthetic library date to load file with default naming
SynLib_Date = time.strftime('%Y%m%d')
# Date for the random forest regression. 
ML_Date = time.strftime('%Y%m%d')
# Test set cut-off, fraction of data removed from the original set to be used for quality assessment
TestRatio = .1
# Parameters for the synthetic promoter library
# the 'Sequence_Distance_cutoff' determines how far sequences are allowed to differ from the reference sequence (the most common sequence)
# The statistical analysis gives a histogram of sequence distances, take an appropriate value from there
# in Example1-Pput: 0.11
Sequence_Distance_cutoff = .9
# not all positions might be sampled with all nucleotides, the parameter 'Entropy_cutoff' determines the minimum position diversity
# The position diversity entropy-bargraph gives an indication on the right parameter choice.
# in Example1-Pput: 0.15
Entropy_cutoff = .2
# decide on how many synthetic promoters you want to simulate
Synth_Seq_MaxNumber = 10000

# end User Input
#####################


# extract the filename for naming of newly generated files
# File_Base = Data_File.split('.')[0]
# # the generated files will be stored in a subfolder with custom name
# Data_Folder = 'data-{}'.format(File_Base)
# try:
#     os.mkdir(Data_Folder)
#     print('Data directory ', Data_Folder, 'created.')
# except FileExistsError:
#     print('Already existent data directory ', Data_Folder, '.')

# Definition of names
# make specific changes preferably in the user input space above.
Name_Dict = {
    'Data_File': Data_File,
    # column name of sequence library
    # if distinct sequence libraries were used for different expression measurement, then put names in a vector ['sequence1', 'sequence2']
    'Sequence_column': Sequence,
    # column name for the expression strength
    # if more library expression measurements are conducted, then put Y_Col_Name in a vector ['name1', 'name2']
    'Y_Col_Name': Y_Col_Name, # ['Promoter Activity'], ['Ecol Promoter Activity', 'Ptai Promoter Activity'],
    # Promoter ID column name
    'ID_Col_Name': ID_Col_Name,
    # engineering of target value 
    'Response_Value': Response_Value,
    # Decision whether to remove outlier from the data set
    'Revome_Outlier': False,
    # information for generating samples from statistics
    'Stats2Samples': Stats2Samples,
    'Stats_Std': Stats_Std,
    'Stats_Samples': Stats_Samples,
    # number of separate promoter libraries expression measurements
    'Library_Expression': len(Y_Col_Name),
    # column name for additional features input
    'Add_Feat': Add_Feat,
    # set the desired figure file type here, e.g. svg, png, pdf
    'Figure_Type': 'png',
    'HM_File': 'Plot_PositionNucleotideStats',
    'SampSeqDist_File': 'Plot_SampleSequenceDistance',
    'Entropy_File': 'Plot_PositionEntropy',
    'SamplingDiv_File': 'Plot_SamplingDiversity',
    'ExprHist_File': 'Plot_ExpressionHist',
    'LogoPlot_File': 'Plot_Logo-FI',
    'CorrPlot_File': 'Plot_Corr-Ytrue-VS-Ypred',
    'Csv_ID': 'Synth-Library',
    'X_Expr': 'Plot_CrossExpr',
    'RefSeq': RefSeq,
    'ML_Regressor': ML_Regressor,
    'Data_Standard': Data_Standard,
    'SynLib_Date': SynLib_Date,
    'ML_Date': ML_Date,
    # Machine learning files
    'TestRatio': TestRatio,
    'Expression_Unit': Expression_Unit,
    # Parameters for the synthetic library
    'Sequence_Distance_cutoff': Sequence_Distance_cutoff,
    'Entropy_cutoff': Entropy_cutoff,
    'Synth_Seq_MaxNumber': Synth_Seq_MaxNumber,
    'Figure_Font_Size': 18
}

# constructing the config.txt file
with open('config.txt', 'w') as f:
    print('# {}'.format(time.strftime('%Y%m%d')), file=f)
    print('# This file contains the naming conventions for all output files. It is automatically generated when going through step "0-Workflow".', file=f)
    for key, value in Name_Dict.items():
        print('{}: {}'.format(key, value), file=f)
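
The other notebooks evaluate this file in a dedicated cell. How they parse it is not shown here, so the following is only a minimal sketch of one way to read `key: value` lines back into a dictionary. The function name `read_config` is hypothetical, and the demo writes a small stand-in file to a temporary directory so that an existing `config.txt` is left untouched.

```python
import ast
import os
import tempfile


def read_config(path):
    """Parse 'key: value' lines into a dict; lines starting with '#' are comments."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            key, _, raw = line.partition(':')
            raw = raw.strip()
            try:
                # lists, numbers and booleans were written with str(), so they
                # can be recovered as Python literals
                value = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                # plain strings were written without quotes; keep them as text
                value = raw
            config[key.strip()] = value
    return config


# Stand-in for a few lines of the generated file (values taken from the cell above)
demo_lines = [
    '# 20200701',
    'Data_File: mutalik12_Ecol.csv',
    'Sequence_column: Short sequence',
    "Y_Col_Name: ['sE o/exp']",
    'Revome_Outlier: False',
    'RefSeq: ',
    'TestRatio: 0.1',
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'config.txt')
    with open(demo_path, 'w') as f:
        f.write('\n'.join(demo_lines) + '\n')
    config = read_config(demo_path)

print(config)
```

Note that with this approach purely numeric strings such as the dates in `SynLib_Date` and `ML_Date` come back as integers, so a real loader may want to treat those keys separately.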


---
 
## Statistical analysis 

The statistical analysis provides an overview of important properties of the promoter library. It visualizes the distribution of measured expression strength with a histogram. This is particularly informative if several promoter libraries are analysed, because it allows comparing their coverage of expression strengths. It is important to delineate the sequence exploration space that is spanned by the promoter library. The sequence exploration space defines the allowed sequence input for the prediction of expression strength and limits the sequences for the generation of the synthetic promoter library. The exploration space is identified by two properties: 
 * the promoter library diversity, calculated by the percentage of nucleotide changes over the full sequence for all samples in the library, and 
 * the position specific diversity, i.e. how many different nucleotides are sampled on each position, determined by the entropy on each position.

The promoter diversity is visualized with a histogram of the frequency of sequences with a given number of nucleotide exchanges. The histogram shows how many nucleotides need to be changed to convert sequences into one another, and what the maximum and average nucleotide differences are. The entropy measurement of the position diversity informs about how many nucleotides have been sampled at each position.
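
Both diversity measures can be sketched with the standard library alone. The function names and the toy library below are hypothetical, and the use of base-2 logarithms (entropy in bits) is an assumption; the notebook may use a different base or normalization.

```python
import math
from collections import Counter
from itertools import combinations


def sequence_distance(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError('sequences must have the same length')
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def position_entropy(sequences):
    """Shannon entropy (in bits) of the nucleotide distribution at each position."""
    entropies = []
    for column in zip(*sequences):
        freqs = [n / len(column) for n in Counter(column).values()]
        entropies.append(sum(p * math.log2(1 / p) for p in freqs))
    return entropies


# Hypothetical toy library of four promoters with four positions each
library = ['ACGT', 'ACGA', 'ACTA', 'AGTA']

# Library diversity: distances between all pairs of sequences
distances = [sequence_distance(a, b) for a, b in combinations(library, 2)]
print('maximum distance:', max(distances))
print('average distance:', sum(distances) / len(distances))

# Position diversity: 0 means the position was never varied
entropies = position_entropy(library)
print('entropy per position:', entropies)
```

In this toy library the first position is identical in all sequences and therefore has zero entropy, while the third position is split evenly between two nucleotides and reaches one bit.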

The promoter library with associated expression activities contains information about how each nucleotide position contributes to expression. This information is extracted by calculating the average and the variance of expression for each nucleotide at each position. The output is a heat map that shows which positions are on average associated with higher or lower expression.
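
The grouping behind such a heat map can be sketched as follows. This is an illustration with hypothetical names and toy data, not the notebook's implementation; in particular, the choice of the population variance is an assumption.

```python
from collections import defaultdict
from statistics import mean, pvariance


def position_nucleotide_stats(sequences, expression):
    """Return {position: {nucleotide: (mean, variance)}} of expression values."""
    grouped = defaultdict(lambda: defaultdict(list))
    for seq, value in zip(sequences, expression):
        for pos, nt in enumerate(seq):
            # collect the expression of every promoter carrying nt at pos
            grouped[pos][nt].append(value)
    return {
        pos: {nt: (mean(values), pvariance(values)) for nt, values in by_nt.items()}
        for pos, by_nt in grouped.items()
    }


# Hypothetical toy library with one expression value per promoter
library = ['ACGT', 'ACGA', 'ACTA', 'AGTA']
expression = [1.0, 2.0, 3.0, 6.0]

stats = position_nucleotide_stats(library, expression)
for pos in sorted(stats):
    print(pos, stats[pos])
```

Each `(position, nucleotide)` pair corresponds to one cell of the heat map. Here, a G at the second position (index 1) is associated with higher expression than a C, although a single sequence supports that cell, which is why the variance and the number of observations matter for interpretation.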

[1-Statistical-Analysis.ipynb](./1-Statistical-Analysis.ipynb)

---

## Regressor training
For random forest regression, the data is split into a training and a test set, by default with a ratio of 9:1. The expression values are scaled to zero mean and unit variance based on the training set. Positions that are non-informative because no alternative nucleotides have been tested are deleted. The performance evaluation is based on the R^2 score from sklearn. The correlation of measured and predicted expression values is plotted. The feature importances from the random forest regression represent the contribution of each nucleotide position to the prediction. They are extracted and visualized with a logo plot.
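
The steps of this paragraph can be sketched with scikit-learn as shown below. The one-hot encoding, the helper `one_hot`, the simulated library, and all hyperparameters are assumptions made for illustration; the notebook may encode sequences and tune the regressor differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

NUCLEOTIDES = 'ACGT'
SEQ_LENGTH = 10


def one_hot(sequences):
    """Encode equal-length sequences as a (n_samples, 4 * length) 0/1 matrix."""
    X = np.zeros((len(sequences), 4 * len(sequences[0])))
    for i, seq in enumerate(sequences):
        for pos, nt in enumerate(seq):
            X[i, 4 * pos + NUCLEOTIDES.index(nt)] = 1.0
    return X


# Simulated library (hypothetical): position 0 is always 'A' and thus
# non-informative; expression is high whenever position 3 carries a 'G'
rng = np.random.RandomState(0)
sequences = ['A' + ''.join(rng.choice(list(NUCLEOTIDES), SEQ_LENGTH - 1))
             for _ in range(200)]
expression = np.array([2.0 * (seq[3] == 'G') + rng.normal(scale=0.1)
                       for seq in sequences])

# Delete positions at which no alternative nucleotide was tested
informative = [pos for pos in range(SEQ_LENGTH)
               if len({seq[pos] for seq in sequences}) > 1]
columns = [4 * pos + k for pos in informative for k in range(4)]
X = one_hot(sequences)[:, columns]

# 9:1 split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, expression, test_size=0.1, random_state=0)

# Scale expression to zero mean and unit variance using the training set only
scaler = StandardScaler().fit(y_train.reshape(-1, 1))
y_train_s = scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_s = scaler.transform(y_test.reshape(-1, 1)).ravel()

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train_s)
score = r2_score(y_test_s, model.predict(X_test))
print('R^2 on the test set:', score)

# Sum the four nucleotide importances of each position for a per-position view
importance = model.feature_importances_.reshape(len(informative), 4).sum(axis=1)
print('most important position:', informative[int(importance.argmax())])
```

Fitting the scaler on the training set alone keeps information from the test set out of the training procedure, which is the reason the text specifies scaling "based on the training set".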

[2-Regressor-Training.ipynb](./2-Random-Forest-Training.ipynb)

---

## Evaluation of regressor performance
The quality of a regressor is evaluated using the coefficient of determination, R^2, and the root mean squared error. Moreover, the feature importances of the trained random forest are visualized as a sequence logo to allow comparison with known regulatory elements.
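
For reference, the two scores can be written out in a few lines. This is a plain-Python sketch of the standard definitions with hypothetical function names and made-up values; in practice `sklearn.metrics.r2_score` provides the R^2 score.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the expression values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up measured and predicted expression values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

r2 = r_squared(y_true, y_pred)
error = rmse(y_true, y_pred)
print('R^2:', r2)
print('RMSE:', error)
```

R^2 is unitless and compares the regressor with a constant prediction of the mean, whereas the RMSE states the typical prediction error in expression units, so the two scores complement each other.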

[3-Regressor-Performance.ipynb](./3-Regressor-Performance.ipynb)

---

## Generation of random sequences in the exploration space
The generation of random sequences within the range covered by the experimentally tested sequences serves to delineate the whole expected range of expression strength and to select novel promoters with defined activity. The sequence exploration space is defined by the distance in nucleotide exchanges to a reference sequence and by the information content of each position. Only if a position along the sequence has sufficient information content can we conclude that enough input diversity was present to learn a correct output mapping. The information content is determined with the position-specific entropy, and the user decides on the entropy threshold above which a position is used for random sequence construction. The reference sequence can be pre-determined by the user or automatically assembled from the most common nucleotides at each position. The user has to identify the appropriate distance by which the random sequences are allowed to differ from the reference. If the number of available sequences within the allowed exploration space is smaller than a threshold, all sequences will be generated; otherwise, random sequences that satisfy the exploration space conditions will be generated.
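
The following sketch illustrates this logic under several assumptions: entropy is measured in bits, only nucleotides observed in the library are allowed at a position, and the random branch draws sequences by simple rejection sampling, which is not uniform over the exploration space. All function names and the toy library are hypothetical.

```python
import itertools
import math
import random
from collections import Counter


def position_entropy(sequences):
    """Shannon entropy (in bits) at each position."""
    entropies = []
    for column in zip(*sequences):
        freqs = [n / len(column) for n in Counter(column).values()]
        entropies.append(sum(p * math.log2(1 / p) for p in freqs))
    return entropies


def consensus(sequences):
    """Reference assembled from the most common nucleotide at each position."""
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))


def alternatives(sequences, reference, entropy_cutoff):
    """Map each sufficiently diverse position to its observed non-reference nucleotides."""
    alts = {}
    columns = zip(zip(*sequences), position_entropy(sequences))
    for pos, (column, entropy) in enumerate(columns):
        others = sorted(set(column) - {reference[pos]})
        if entropy >= entropy_cutoff and others:
            alts[pos] = others
    return alts


def count_space(alts, max_changes):
    """Number of sequences with at most max_changes substitutions."""
    ways = [1] + [0] * max_changes  # ways[k]: variants with exactly k changes
    for options in alts.values():
        for k in range(max_changes, 0, -1):
            ways[k] += ways[k - 1] * len(options)
    return sum(ways)


def mutate(reference, substitutions):
    seq = list(reference)
    for pos, nt in substitutions:
        seq[pos] = nt
    return ''.join(seq)


def explore(reference, alts, max_changes, max_number, seed=0):
    """Enumerate the space if it is small enough, otherwise sample from it."""
    max_changes = min(max_changes, len(alts))
    if count_space(alts, max_changes) <= max_number:
        result = []
        for k in range(max_changes + 1):
            for positions in itertools.combinations(sorted(alts), k):
                for choice in itertools.product(*(alts[p] for p in positions)):
                    result.append(mutate(reference, zip(positions, choice)))
        return result
    rng = random.Random(seed)
    found = set()
    while len(found) < max_number:
        positions = rng.sample(sorted(alts), rng.randint(0, max_changes))
        found.add(mutate(reference, [(p, rng.choice(alts[p])) for p in positions]))
    return sorted(found)


library = ['ACGT', 'ACGA', 'ACTA', 'AGTA', 'ACGA']  # hypothetical toy library
reference = consensus(library)
alts = alternatives(library, reference, entropy_cutoff=0.5)
max_changes = int(0.5 * len(reference))  # distance cutoff given as a fraction
all_seqs = explore(reference, alts, max_changes, max_number=100)
few_seqs = explore(reference, alts, max_changes, max_number=3)
print(reference, alts, all_seqs, few_seqs)
```

In the toy library, the first position is never varied and is excluded by the entropy cutoff, so only three positions can change. With at most two substitutions the space contains seven sequences, which are all enumerated when the limit is 100 and sampled when the limit is 3.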

[4-Exploration-Space.ipynb](./4-Exploration-Space.ipynb)


## Software dependencies

CPython 3.7.6<br>
IPython 7.12.0<br>
<br>
ipywidgets 7.5.1<br>
matplotlib 3.1.3<br>
numpy 1.18.1<br>
pandas 1.0.1<br>
sklearn 0.22.1<br>
scipy 1.4.1<br>
logomaker 0.8<br>
joblib 0.14.1<br>
<br>
compiler   : GCC 7.3..<br>
system     : Linux<br>
release    : 5.3.0-53-generic<br>
machine    : x86_64<br>
processor  : x86_64<br>
CPU cores  : 8<br>
interpreter: 64bit<br>

%load_ext watermark
%watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn,scipy,logomaker