# Script to generate predictions of KO strain design performance

## Detailed in "Integrated knowledge mining, genome-scale modeling, and machine learning for predicting *Yarrowia lipolytica* bioproduction".

### Description
The script takes a target compound and a list of reactions or genes to test for knockout (KO) or overexpression (OE), and generates a predicted titer for each design.

The default prediction conditions are growth on glucose with no prior genetic engineering. Predictions come from the machine-learning model described in the publication.

#### Procedure:
1. Read in data and constructs to screen from the "Supplemental Excel File 6- CSD Template.xlsx" spreadsheet.
2. Generate FBA features for the WT and each strain construct.
3. Predict the titer of each strain.
4. Output the results.

#### Inputs:
1. Supplemental Excel File 6- CSD Template.xlsx: Spreadsheet where the product, the environmental conditions to test, and the list of KO targets to screen are entered.
      Supplemental Excel File 6- CSD Template.xlsx
2. Data Encoding File 
      Supplemental Data File 2- DataCharateristics & Encoding.xlsx

#### Output:
1. titerPredictionsKO.csv: Spreadsheet containing a prediction of the WT strain titer, each KO strain titer, and the FBA predicted product yield and biomass growth rate.


#### Additional required scripts:
1. FBA_function_.py:
    Performs FBA feature generation and extraction
2. encodingFunction_.py:
    Encodes the data for input to the ML model
3. FBA_functionOE_.py:
    Performs Gene OE feature generation and extraction

### Libraries to import

In [1]:
import pandas as pd
import pickle
from collections import defaultdict
import warnings
import numpy as np
import os




# from FBA_function_ import FBA_FeatureExtraction
from encodingFunction_ import encodeTransform

### Ensure the spreadsheet is within the directory.

In [2]:
dir_path = os.path.dirname(os.path.realpath('Supplemental Excel File 6- CSD Template.xlsx'))
file_path = os.path.join(dir_path,'Supplemental Excel File 6- CSD Template.xlsx')
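
Because `os.path.realpath` resolves a relative file name against the current working directory, the cell above builds a path whether or not the spreadsheet is actually there, and a missing file only surfaces later at `pd.read_excel`. A minimal sketch of an explicit check follows; `find_template` and `TEMPLATE_NAME` are hypothetical names introduced here for illustration and are not part of the published scripts.

```python
from pathlib import Path

# File name taken from the Inputs list; the helper below is illustrative only.
TEMPLATE_NAME = "Supplemental Excel File 6- CSD Template.xlsx"


def find_template(directory=".", name=TEMPLATE_NAME):
    """Return the absolute path to the template, or raise if it is missing."""
    candidate = Path(directory).resolve() / name
    if not candidate.is_file():
        raise FileNotFoundError(f"Expected '{name}' in {candidate.parent}")
    return candidate
```

Calling `find_template()` from the notebook's directory would give a path that can be passed to `pd.read_excel` in place of `file_path`.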

In [3]:
# reads in whether you will be performing knockouts or overexpressions
inputKOorOE = pd.read_excel(file_path,sheet_name='KO_or_OE')


from FBA_combined import FBA_FeatureExtraction

In [4]:
# if inputKOorOE['Specify'][0]=='KO':  

# # if perform_knockouts==1:
#     #custom functions provided in the directory.
#     from FBA_function_cs import FBA_FeatureExtraction #KO
# else:
#     #custom functions provided in the directory for OE.
#     from FBA_functionOE_cs import FBA_FeatureExtraction #OE


In [5]:
# from FBA_function_combined import FBA_FeatureExtraction as fbaCombined

In [6]:
#reads in the information from the datasheets
raw_construct = pd.read_excel(file_path,sheet_name='predictions',skiprows=range(2))
optKnockRxns = pd.read_excel(file_path,sheet_name='targetRxns')

optOERxns = optKnockRxns ## delete


# consolidate meta-information into usable features
data = raw_construct
optData = optKnockRxns
data['number_genes_mod'] = data.genes_modified_updated.apply(lambda x: x.count(';')+1 if isinstance(x,str) else 0)
data['number_genes_deleted'] = data.gene_deletion.apply(lambda x: x.replace(';','').count('1') if isinstance(x,str) else 0)
data['number_total_genes_overexp'] = data.gene_overexpression.apply(lambda x: x.replace(';','').count('1') if isinstance(x,str) else 0)
data['number_genes_het'] = data.heterologous_gene.apply(lambda x: x.replace(';','').count('1') if isinstance(x,str) else 0)

# hettemp1 = data.heterologous_gene#.apply(lambda x: x if isinstance(x,str) else 'NA')
hettemp1 = data.heterologous_gene.apply(lambda x: x if isinstance(x,str) else 'NA')
data.heterologous_gene
hettemp2 = hettemp1.str.split(';',expand=True)

# overexpressTemp1 = data.gene_overexpression.fillna('2')
overexpressTemp1 = data.gene_overexpression.apply(lambda x: x if isinstance(x,str) else 'NA')
overexpressTemp2 = overexpressTemp1.str.split(';',expand=True)
nativeGenes = overexpressTemp2[hettemp2=='0']

data['number_native_genes_overexp'] = nativeGenes.count(axis=1)


In [7]:
data.columns

Index(['paper_number', 'blank1', 'cs1', 'cs_conc1',
       'cs1_heatCombustion(kJ/mol)', 'cs2', 'cs_conc2',
       'cs2_heatCombustion(kJ/mol)', 'reactor_type', 'rxt_volume', 'temp',
       'oxygen', 'foldCarbonFed', 'FermentationTime', 'blank2', 'product_name',
       'product_deltaGo', 'mw', 'central_carbon_precursor',
       'ccm_precursor_deltaGo', 'Pathway_enzymatic_steps',
       'precursor_required', 'atp_cost', 'nadh_nadph_cost', 'media', 'pH',
       'genes_modified_updated', 'gene_deletion', 'gene_overexpression',
       'heterologous_gene', 'Unnamed: 30', 'Unnamed: 31', 'Unnamed: 32',
       'Unnamed: 33', 'Unnamed: 34', 'number_genes_mod',
       'number_genes_deleted', 'number_total_genes_overexp',
       'number_genes_het', 'number_native_genes_overexp'],
      dtype='object')
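
The five `number_*` columns at the end of this index are derived from semicolon-separated flag strings. The sketch below replays the same expressions on two made-up rows so the counting rules are easier to see. The gene names and flag values are invented for illustration; real values come from the `predictions` sheet, and `count_flags` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy rows: one construct with three modified genes, one unmodified strain.
toy = pd.DataFrame({
    "genes_modified_updated": ["geneA;geneB;geneC", None],
    "gene_deletion":          ["1;0;0", None],
    "gene_overexpression":    ["0;1;1", None],
    "heterologous_gene":      ["0;0;1", None],
})


def count_flags(value):
    """Count '1' flags in a semicolon-separated string; non-strings count as 0."""
    return value.replace(";", "").count("1") if isinstance(value, str) else 0


def as_text(value):
    """Replace missing entries with the placeholder 'NA', as the notebook does."""
    return value if isinstance(value, str) else "NA"


toy["number_genes_mod"] = toy["genes_modified_updated"].apply(
    lambda x: x.count(";") + 1 if isinstance(x, str) else 0
)
toy["number_genes_deleted"] = toy["gene_deletion"].apply(count_flags)
toy["number_total_genes_overexp"] = toy["gene_overexpression"].apply(count_flags)
toy["number_genes_het"] = toy["heterologous_gene"].apply(count_flags)

# One column per gene position, for both flag strings.
het = toy["heterologous_gene"].apply(as_text).str.split(";", expand=True)
oe = toy["gene_overexpression"].apply(as_text).str.split(";", expand=True)

# Keep overexpression entries only at positions whose heterologous flag is '0',
# then count the non-missing entries per row.
native = oe[het == "0"]
toy["number_native_genes_overexp"] = native.count(axis=1)

print(toy.filter(like="number_"))
```

For the first toy row the counts are 3 modified, 1 deleted, 2 overexpressed, and 1 heterologous. The last column is 2 for that row: `count` tallies every non-missing entry at a position whose heterologous flag is `'0'`, whatever the overexpression flag at that position is, so it is worth checking that this matches the intended definition before changing the input encoding.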

## FBA Modeling

In [14]:
##FBA modeling
#GSM to use, default is 'iYLI647'
FBA_models=['iYLI647']
KOorOE = inputKOorOE['Specify'].values[0]

output, errors = FBA_FeatureExtraction(data,optKnockRxns,optKnockRxns,FBA_models,KOorOE) # OE

1.1398166174414033 iYLI647
0 OE failures
0 Prod failures
0 0 0 0 0 0 failure cases 1-6


In [16]:
output

Unnamed: 0,paper_number,blank1,cs1,cs_conc1,cs1_heatCombustion(kJ/mol),cs2,cs_conc2,cs2_heatCombustion(kJ/mol),reactor_type,rxt_volume,...,EMP_iYLI647,PPP_iYLI647,TCA_iYLI647,NADPH_iYLI647,ATP_iYLI647,PrdtFlux_iYLI647,PrdtYield_iYLI647,Biomass_iYLI647,O2Uptake_iYLI647,GlcUptake_iYLI647
0,1,,1,20,2626,0,0,0,1,0.05,...,6.667322,7.343305,1.215066,15.982695,56.429296,0.21023,0.125476,0.8548625,-13.207577,-10.0
1,1,,1,20,2626,0,0,0,1,0.05,...,10.0,0.0,1.349693,1.349693,70.143149,0.777096,0.46381,4.651987e-16,-11.820041,-10.0
2,1,,1,20,2626,0,0,0,1,0.05,...,0.0,39.914334,2.178296,80.103564,75.828688,0.07764,0.046339,0.3837742,-40.031247,-10.0
3,1,,1,20,2626,0,0,0,1,0.05,...,9.488372,3.069767,0.0,6.139535,66.604651,0.790698,0.471928,2.763127e-15,-10.976744,-10.0


In [10]:
errors

[]

## Encode data

In [17]:
#encode data, using output from FBA modeling section
encodedData = encodeTransform(output)

In [18]:
encodedData

Unnamed: 0,paper_number,blank1,cs1,cs_conc1,cs1_heatCombustion(kJ/mol),cs2,cs_conc2,cs2_heatCombustion(kJ/mol),reactor_type,rxt_volume,...,O2Uptake_iYLI647,GlcUptake_iYLI647,product_name2,carbonSourceOneMolecularWeight,carbonSourceTwoMolecularWeight,inputThermo(kJ/L),precursorsRequiredEncoded,totalThermBarrier,averageThermBarrier,csConcTotal
0,1,,1,20,2626,0,0,0,2,2,...,13.207577,10.0,Astaxanthin,180.156,0,291.525123,24,4342,271,20
1,1,,1,20,2626,0,0,0,2,2,...,11.820041,10.0,Astaxanthin,180.156,0,291.525123,24,4342,271,20
2,1,,1,20,2626,0,0,0,2,2,...,40.031247,10.0,Astaxanthin,180.156,0,291.525123,24,4342,271,20
3,1,,1,20,2626,0,0,0,2,2,...,10.976744,10.0,Astaxanthin,180.156,0,291.525123,24,4342,271,20


### Predict titers with the trained machine-learning model

In [19]:
from ModulizeMLPredictions import perform_MLPrediction

In [22]:
MLOutput = perform_MLPrediction(encodedData,output)
MLOutput

Unnamed: 0,TiterPrediction(g/L),% of Original Strain Production,FBA predicted Biomass,FBA predicted Yield,Input Reaction Tested
0,0.011786,100.0,0.8548625,0.125476,a;b;c
1,0.021033,178.458189,4.651987e-16,0.46381,GND
2,0.024403,207.053906,0.3837742,0.046339,GAPD
3,0.027227,231.011941,2.763127e-15,0.471928,CSm
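
The `% of Original Strain Production` column appears to be each predicted titer divided by the titer in row 0, times 100. That reading is an assumption based on the printed values, not on the source of `perform_MLPrediction`. The sketch below recomputes it from the rounded titers shown above; `percent_of_reference` is a hypothetical helper name.

```python
import pandas as pd

# Rounded titers (g/L) copied from the table above; row 0 is taken as the reference strain.
titers = pd.Series(
    [0.011786, 0.021033, 0.024403, 0.027227],
    index=["a;b;c", "GND", "GAPD", "CSm"],
)


def percent_of_reference(values, reference_position=0):
    """Express each value as a percentage of the value at reference_position."""
    reference = values.iloc[reference_position]
    return 100.0 * values / reference


pct = percent_of_reference(titers)
print(pct.round(2))
```

Because the inputs are already rounded to six decimals, the recomputed percentages agree with the printed column only to about two decimal places.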
