# Corona Vaccine Recombinant Expression Simulation
***

## Background

The COVID-19 pandemic is affecting societies worldwide, and new vaccines are needed to provide lasting immunity. Vaccination strategies based on protein subunits have traditionally been popular ([Jeyanathan et al., 2020](https://doi.org/10.1038/s41577-020-00434-6)), and these subunits are produced effectively by bacterial recombinant expression ([He et al., 2017](https://doi.org/10.1139/cjm-2016-0528), [Strizova et al., 2021](https://doi.org/10.1159/000514225)). You are a member of a small biotech company that specializes in recombinant expression systems, and your task is to develop a bacterial host that produces the highest possible amount of viral protein subunits. Expression of the viral protein is controlled by key sequences in the promoter, and the final vaccine production rate depends on reaching a high biomass by cultivating at the optimal temperature. Alas, you have only a limited amount of money: you have to acquire the starting equipment, and each experiment costs resources.

<figure>
    <img src="../Figures/Jupyter/Jeyanathan_Covid19-Vaccines_20_Fig1.png" width="500">
    <figcaption>The largest share of Corona vaccines use recombinantly expressed protein subunits (<a href='https://doi.org/10.1038/s41577-020-00434-6'>Jeyanathan et al., 2020</a>). </figcaption>
</figure>

This project simulates selected steps of a biotechnological project for recombinant expression. The experimental work in your biotech company is highly automated: you set the parameters of the experiments and focus on computational data analysis. The goal is to optimize the production rate of the protein subunits so that you stay competitive with your peers. To achieve this, you perform virtual experiments to optimize the growth conditions and the promoter sequence for production. Finally, you compare how the GC content of the promoter affects promoter activity. The data analysis can be performed either separately in Excel or with guided coding steps within this notebook.

You start with a budget of 10.000 EUR. The initial laboratory setup and each subsequent experiment cost money. Optimizing your initial host-promoter combination towards more effective production may require 4-6 strains. At the start, you decide how much to spend on laboratory equipment; investing too little results in a higher failure rate of experiments. Some steps are difficult to perform: the exact parameters for effective cloning are unknown and depend on various complex factors.

|Initial Budget: 10.000 EUR|
| --- |


| Experiment | Cost in EUR |
| --- | --- |
| Equipment | 10-20% of budget |
| Temperature growth | 100 |
| Cloning | 200 |
| Promoter Strength | 100 |
| Production run | 500 |
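
Before spending anything, it can help to check whether a planned series of experiments fits the budget. The following is a minimal sketch based only on the costs in the table above; the function `remaining_budget` and the example plan are hypothetical and not part of the simulation package.

```python
# Costs in EUR per experiment, taken from the table above.
COSTS = {
    "temperature_growth": 100,
    "cloning": 200,
    "promoter_strength": 100,
    "production_run": 500,
}

def remaining_budget(budget, equipment_invest, planned):
    """Return the budget left after equipment and the planned experiments.

    `planned` maps an experiment type to the number of runs (a hypothetical plan).
    """
    spent = equipment_invest + sum(COSTS[name] * runs for name, runs in planned.items())
    return budget - spent

# Hypothetical plan: 9 cultivations, 5 cloning attempts,
# 4 promoter strength measurements and 2 production runs.
plan = {"temperature_growth": 9, "cloning": 5, "promoter_strength": 4, "production_run": 2}
print(remaining_budget(10000, 1500, plan))
```

Failed experiments still cost money, so a real plan should leave some reserve for repeats.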


## Workflow

**1 Set-up of simulation environment**

**2 Lab setup**
 
 * *2.1 Choose your host organism*
 * *2.2 Choose Equipment investment*
 
**3 Culture characterization**
 
 * *3.1 Experiment set-up*     
 * *3.2 Data analysis growth experiment*    

**4 Promoter sequence selection**
 * *4.1 Promoter and expression experiments*
 * *4.2 Data analysis of promoter strength*

**5 Evaluation by cross-group integration**

## 1 Set-up of simulation environment
Loading libraries and fixing visualization. No user input necessary.

In [2]:
# Adapting environment to GitHub or Google Colab
# In GitHub, the full repository is available with all directories, and libraries can be installed with pip on requirements.txt
# In Google Colab only the selected notebook is loaded and the requirements file must be downloaded from GitHub before installation.
# 

# file system and path operations
import os
import sys
if 'google.colab' in sys.modules:
    IN_COLAB = True
    # Download the requirements file
    os.system('wget https://raw.githubusercontent.com/biolabsim/BioLabSim/refs/heads/master/requirements.txt')
    os.system('pip install -r requirements.txt')
else:
    IN_COLAB = False
    os.system('pip install -r ../requirements.txt')



Collecting biopython==1.79
  Downloading biopython-1.79-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 5.7 MB/s eta 0:00:00
Collecting IPython==7.20
  Using cached ipython-7.20.0-py3-none-any.whl (784 kB)
Collecting plotly
  Downloading plotly-6.0.0-py3-none-any.whl (14.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 5.8 MB/s eta 0:00:00
Collecting scikit-learn==1.2.2
  Downloading scikit_learn-1.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 6.0 MB/s eta 0:00:00
Collecting bokeh~=3.1.1
  Downloading bokeh-3.1.1-py3-none-any.whl (8.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.3/8.3 MB 5.5 MB/s eta 0:00:00
Collecting biocircuits==0.1.11
  Using cached biocircuits-0.1.11-py2.py3-none-any.whl (65 kB)
Collecting wget
  Using cached wget-3.2-py3-none-any.whl
Collecting joblib>=0.14
  Downloading joblib-1.4.2-py3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipykernel 6.17.0 requires ipython>=7.23.1, but you have ipython 7.20.0 which is incompatible.
pydna 5.2.0 requires biopython>=1.80, but you have biopython 1.79 which is incompatible.
panel 0.14.4 requires bokeh<2.5.0,>=2.4.0, but you have bokeh 3.1.1 which is incompatible.
escher 1.7.3 requires ipywidgets<8,>=7.4.0, but you have ipywidgets 8.0.4 which is incompatible.
escher 1.7.3 requires Jinja2<3,>=2.7.3, but you have jinja2 3.0.3 which is incompatible.
escher 1.7.3 requires jsonschema<4,>=3.0.1, but you have jsonschema 4.17.3 which is incompatible.


In [2]:
# Allow imports from the current directory.
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.options.display.width = 100 # Make pandas output wider.

# Release only. Disable warnings.
import warnings
warnings.filterwarnings('ignore') # Comment this line to show warnings.

# Release only. Disable the traceback when exceptions occur (like the *_or_abort methods)
ipython = get_ipython()
def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))
ipython.showtraceback = hide_traceback  # Comment this line to show traceback on exceptions.

from silvio.catalog.RecExpSim import RecExperiment, combine_data
from silvio.extensions.records.gene.crafted_gene import CraftedGene
from Bio.Seq import Seq

# additional parameter for experiment
bpar = 10000
wpar = 0.005

print('System ready')

System ready


## 2 Lab setup

In this stage, you decide which host organism to use for your recombinant expression system and how much to invest in laboratory equipment. You can choose between two organisms for recombinant expression, namely *E. coli* (abbr. `ecol`) and *P. putida* (abbr. `pput`). A high investment in laboratory equipment increases the probability of successful experiments.

Two of the most popular prokaryotic organisms for biotechnology are *E. coli* and *P. putida*. *E. coli* is an intensely studied model organism that allows easy manipulation and grows fast. *P. putida* is a soil bacterium with high metabolic versatility, including xenobiotic degradation. It is highly robust against extreme environmental conditions such as high temperature, extreme pH, or the presence of toxins or inhibiting solvents.

**Resource cost:**
* **Free** 

**Input:** 
* **`mySeed`: Number, e.g., Student-ID (integer)**
* **`myInvest`: 1000-2000 EUR (10-20% of total budget, integer)**
* **`myMutant`: 'ecol' or 'pput' (string)**
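
The constraints on the three inputs can be checked before the experiment is initialized. This is a small sketch under the assumption that the listed ranges are the only requirements; `check_lab_setup` is a hypothetical helper and not part of the simulation package.

```python
def check_lab_setup(seed, invest, mutant, budget=10000):
    """Return a list of problems with the lab setup inputs (empty if none found)."""
    problems = []
    if not isinstance(seed, int):
        problems.append("seed must be an integer")
    # 10-20% of the budget, computed with integer arithmetic.
    if not isinstance(invest, int) or not budget // 10 <= invest <= budget // 5:
        problems.append("invest must be an integer within 10-20% of the budget")
    if mutant not in ("ecol", "pput"):
        problems.append("mutant must be 'ecol' or 'pput'")
    return problems

print(check_lab_setup(12345, 1500, "ecol"))   # no problems expected for these values
print(check_lab_setup(12345, 500, "yeast"))   # two problems: invest and mutant
```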


In [None]:
# User input is required in the following code lines:

# Initialize the random seed with a number, e.g., your student ID
mySeed = None
# Enter the investment for the equipment here; a higher investment results in fewer experiment failures
myInvest = None
# To choose the host organism, replace None with its abbreviation ('ecol' or 'pput'):
myMutant = None

# SILVIO experiment initialization
exp = RecExperiment( mySeed, myInvest, bpar )
host = exp.create_host(myMutant, name='origin')

# host organism and remaining resources are displayed:
exp.print_status()

## 3 Culture characterization

### 3.1  Experiment set-up
You have to identify the optimal growth temperature, the corresponding maximum growth rate, and the maximum biomass of your strain by cultivating the cells at different temperatures. Each time the program starts, the optimal temperature and the maximum biomass are randomly initialized. The optimal temperature is sampled from the range for **mesophilic microorganisms**, 20-40°C (see page 23, [Biotechnology: An illustrated primer](https://application.wiley-vch.de/books/sample/3527335153_c01.pdf)). Occasionally, a culture will not grow at all, so it can be helpful to measure replicates at the same temperature. However, be aware that each cultivation costs resources.

The results of the temperature experiments are stored in a comma-separated values (csv) file in the `Data` subfolder as `Growth_Simulation.csv`. You can find the csv file in the left navigation panel within the `biolabsim` folder. For quick inspection, double-click the csv file; to download it for subsequent analysis in Excel, right-click the file name and choose 'Download'.

The code cell below always writes to `Growth_Simulation.csv`. If you want to run another set of experiments afterwards, or repeat individual experiments, download or rename the existing file first, or change the file name in `filepath`; otherwise results already generated may be overwritten.

Example: `temperatures = [22,24,26,28,30,32,34,36,38]`

**Resource cost:**
* **100 EUR for each experiment**

**Input:**
* **`temperatures`: temperature array (integer list)**
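
A temperature list with replicates can be generated rather than typed by hand, which also makes the cost of the run explicit. The sketch below assumes that a repeated temperature in the list is treated as a replicate cultivation; `temperature_plan` is a hypothetical helper, not part of the simulation package.

```python
def temperature_plan(start, stop, step, replicates=1):
    """Build a list of integer temperatures from start to stop (inclusive),
    each repeated `replicates` times."""
    return [t for t in range(start, stop + 1, step) for _ in range(replicates)]

COST_PER_CULTIVATION = 100  # EUR, from the resource cost above

plan = temperature_plan(22, 38, 4, replicates=2)
print(plan)
print("cost in EUR:", len(plan) * COST_PER_CULTIVATION)
```

A coarse first scan followed by a finer scan around the best temperature is one way to spend fewer cultivations than a single dense scan.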

In [None]:
# User input is required in the following code lines:
temperatures = None

# SILVIO simulation
growth_out = exp.simulate_growth('origin', temperatures, wpar)
filepath = os.path.join('..','Data','Growth_Simulation.csv')
growth_out.export_data(filepath)

exp.print_status()

### 3.2 Data analysis growth experiment

The data from the experiment was stored in a comma-separated values (`csv`) file on the local file system. It has to be analysed to extract the optimal temperature, the growth rate, and the maximum biomass. You can either analyse the data in a spreadsheet application on your local computer, e.g. Excel, or with a programming approach in Python.

In Excel, you have to import the csv file to get the data into separate columns. You then apply the natural logarithm to the biomass data and plot the values versus time. From the plot you can identify the temperature that supports the fastest growth (steepest slope) and the time up to which the logarithmic biomass increases linearly, i.e. the phase of exponential growth. Then you apply a linear regression to the linear section of the fastest logarithmic biomass increase; the regression slope is the growth rate. You determine the maximum biomass by averaging over several measurements on the plateau of the measured biomass (real values, not the logarithm).
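
The regression step rests on the standard exponential growth model. As a short derivation, assuming a constant specific growth rate during the exponential phase (the symbols $X$, $X_0$ and $\mu$ are notation introduced here, not names used by the simulation):

$$
X(t) = X_0 \, e^{\mu t} \quad\Longrightarrow\quad \ln X(t) = \ln X_0 + \mu \, t
$$

Plotting $\ln X$ against $t$ therefore gives a straight line during exponential growth, and the slope of the regression line is the growth rate $\mu$, in h$^{-1}$ when time is measured in hours.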

For the Python approach, the solution is provided as scrambled lines of code that you have to put in the correct order. Even without programming experience, this should take about as long as the Excel approach. Please record the time it takes you to conduct the data analysis, then proceed to the item `Promoter sequence selection`.

#### 3.2.2 Python based growth analysis

In the following, the data analysis can be performed in Python. The procedure has two steps: 1. visualizing the results, 2. extracting the growth rate and maximum biomass at the optimal temperature. For each step, the corresponding lines of code are provided in the cell, but you have to arrange them in the right order.

##### 3.2.2.1 Visual analysis of exponential growth

The visual indication of exponential growth is a linear slope in the plot of the logarithm of biomass versus time. The following commands are involved (scrambled order):
 * assign variables for time and biomass data
 * loading of the csv-file into a numpy data array ([genfromtxt](https://stackoverflow.com/questions/3518778/how-do-i-read-csv-data-into-a-record-array-in-numpy))
 * plotting biomass versus time ([plt.scatter](https://matplotlib.org/2.0.2/users/pyplot_tutorial.html))
 * store natural logarithm of the biomass in new variable ([np.log](https://numpy.org/doc/stable/reference/generated/numpy.log.html))

In [None]:
# Insert the correct code sequence for plotting in this cell.
# %load Snippets/rev_GrowthPlot.py 
Time, Biomass = my_data[:,0], my_data[:,1:]
DataFile = os.path.join('..','Data','Growth_Simulation.csv')
LnBiomass = np.log(Biomass)
[plt.scatter(Time, X, label=Exp) for Exp,X in enumerate(LnBiomass.T)]
plt.legend([r'{}:{}$^\circ$C'.format(Idx, T) for Idx, T in enumerate(temperatures)], bbox_to_anchor=(1.05, 1), loc='upper left'); plt.xlabel('time, h'); plt.ylabel('ln(Biomass)')
my_data = np.genfromtxt(DataFile, delimiter=',', skip_header=1)

##### 3.2.2.2 Determine maximum biomass and growth rate

After gaining this graphical insight, select the optimal temperature with the fastest growth using the experiment index number in Python. Be careful, Python starts counting at zero! Thus, the first experiment has the index `0`. Also, read from the semi-logarithmic plot the time range of linear growth as an integer (e.g. `10` h). This value is used to estimate the growth rate via linear regression. The process is described in the following scrambled lines:
 * extract the mean biomass after the linear slope as max biomass ([np.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html))
 * identify the latest time (integer number) of linear slope
 * conduct linear regression in the linear region ([np.polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html))
 
As in the previous coding example, you have to put the code in the right order.

**Input:**
* **1st None: Index of Optimum Temperature (integer)**
* **2nd None: max time in h for linear slope (integer)**

In [None]:
# Rearrange the correct code sequence for calculating the growth and biomass parameters here.
# For None enter the corresponding values of experiment index with fastest growth and latest time of linear growth (integer number).
# %load Snippets/snip_GrowthPars.py

Idx_optT, Linear_optT = None, None
MB = np.mean(Biomass[Linear_optT:,Idx_optT])
print('max biomass: {:.0f}\nmax growth rate: {:.2f}'.format(MB,GR[0]))
GR = np.polyfit(Time[:Linear_optT],LnBiomass[:Linear_optT,Idx_optT],1)


## 4 Promoter sequence selection
In bacteria, the initiation of transcription at promoters requires a sigma factor that binds the RNA polymerase core to form the holoenzyme. Sigma factors recognize and open the promoter DNA and perform the initial steps in RNA synthesis. An important bacterial sigma factor is sigma70, which directs the bulk of transcription during active growth ([Paget, 2015](https://doi.org/10.3390/biom5031245)). Two recognition sites in the promoter are particularly important for the transcription strength: the -35 box and the -10 box. The boxes are conserved sequences located a defined number of nucleotides upstream of the transcription start site. A common nucleotide sequence for the -35 box is `TTGACA`, and for the -10 box `TATAAT`.

In this simulation, the total length of each promoter must be 40 nt. The following template is a starting point for deriving promoters. Replace each **X** with a nucleotide and identify the best sequence (hint: the six **X** at the start and at the end represent the -35 and -10 box, respectively):

 * GCCCA**XXXXXX**A**X**GC**XXX**C**X**CGT**XXX**GG**XXXXXX**TGCACG
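
Filling the template by hand is error-prone, so a small helper can build candidate promoters and check their length. The sketch below is an illustration only: `fill_template` and `gc_content` are hypothetical helpers, and the example fill (consensus boxes at both ends, arbitrary nucleotides in between) is not claimed to be the best sequence.

```python
TEMPLATE = "GCCCAXXXXXXAXGCXXXCXCGTXXXGGXXXXXXTGCACG"

def fill_template(template, fill):
    """Replace each X in `template`, left to right, with the next nucleotide from `fill`."""
    if len(fill) != template.count("X"):
        raise ValueError("need exactly {} nucleotides".format(template.count("X")))
    if set(fill) - set("ACGT"):
        raise ValueError("only A, C, G and T are allowed")
    nucleotides = iter(fill)
    return "".join(next(nucleotides) if base == "X" else base for base in template)

def gc_content(seq):
    """Fraction of G and C nucleotides in `seq`."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical fill: -35 box, four inner segments (1, 3, 1 and 3 nt), -10 box.
promoter = fill_template(TEMPLATE, "TTGACA" + "T" + "ATT" + "A" + "TAT" + "TATAAT")
print(promoter, len(promoter), gc_content(promoter))
```

The GC content computed here is the quantity that is later compared with the expression rate.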

### 4.1 Promoter and expression experiments

#### 4.1.1 Promoter choice and cloning
To introduce the promoter into your organism, you have to perform a cloning step and design a suitable primer for each promoter. First, create a primer that is complementary to your promoter sequence. The primer should always start at the first nucleotide of the promoter sequence. You have to identify the optimal primer length, which can vary from 15 to 30 nucleotides. Because the primer still works if it deviates by 5 nucleotides, you can test the primer length in broad steps.

Calculate the melting temperature of the primer according to the following formula ([genelink manual](https://www.genelink.com/Literature/ps/R26-6400-MW.pdf)), where A, T, C and G are the numbers of the respective nucleotides in the primer:
 * Tm = 2(A+T) + 4(C+G)
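
The primer design and the melting temperature formula can be sketched in a few lines. This is an illustration under one loud assumption: "complementary" is taken to mean the base-wise complement read in the same direction as the promoter, starting at its first nucleotide. Whether the simulation expects this or the reverse complement is not stated here. All function names and the example promoter are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-wise complement, read in the same direction as `seq` (assumption, see above)."""
    return "".join(COMPLEMENT[base] for base in seq)

def melting_temperature(primer):
    """Tm = 2(A+T) + 4(C+G), with A, T, C, G the nucleotide counts in the primer."""
    at = primer.count("A") + primer.count("T")
    cg = primer.count("C") + primer.count("G")
    return 2 * at + 4 * cg

def make_primer(promoter, length):
    """Primer complementary to the first `length` nucleotides of the promoter."""
    return complement(promoter[:length])

promoter = "GCCCATTGACAATGCATTCACGTTATGGTATAATTGCACG"  # hypothetical 40 nt promoter
primer = make_primer(promoter, 20)
print(primer, melting_temperature(primer))
```

Changing `length` in steps of 5 gives the broad scan over primer lengths described above.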

**Important note/warning:** 
The success of cloning is a probabilistic event: even with the right conditions, it might fail in a first trial but succeed in a second identical trial. Note the meaning of the following failure reports:
 * Cloning failed: 
   * promoter deviates from 40nt?
   * primer fully complementary to promoter and starts with the first nucleotide?
   * primer length incorrect?
   * melting temperature incorrect?
   * just try two more executions to rule out random failure
 * Bad equipment:
   * Cloning failed because of bad equipment, just try more executions
   
**Resource cost:** 
* **200 EUR for each cloning experiment**

**Input:**
* **`mySequenceID`: short identifier for the sequence (string)**
* **`myPromoter`: promoter, 40 nt from [ACGT] (string)**
* **`myPrimer`: primer, 15-30 nt, complementary to the promoter (string)**
* **`myTm`: melting temperature (integer)**

In [None]:
# For each cloning, the sequence ID, the promoter sequence, the corresponding primer and the melting temperature must be given.

# The following lines of code are an example of these variables.
# To clone further sequences, define another set of all variables, including the sequence ID, for each cloning.
# For example, uncomment the second block below, which stores its results in newGene2, newHost2 and clone_outcome2.
mySequenceID = None
myPromoter = None
myPrimer = None
myTm = None

newGene1 = CraftedGene( name=mySequenceID, prom=Seq(myPromoter), orf="GGGGGGGGGG" )
newHost1, clone_outcome1 = exp.clone_with_recombination( 'origin', Seq(myPrimer), gene=newGene1, tm=myTm )

# mySequenceID = None
# myPromoter = None
# myPrimer =   None
# myTm =  None

# newGene2 = CraftedGene( name=mySequenceID, prom=Seq(myPromoter), orf="GGGGGGGGGG" )
# newHost2, clone_outcome2 = exp.clone_with_recombination( host, Seq(myPrimer), gene=newGene2, tm=myTm )

print("Clone Outcome: " + clone_outcome1)

exp.print_status()

#### 4.1.2 Measurement of the promoter strength
The promoter strength represents the expression per cell. Later, it is multiplied by the growth rate and the biomass concentration to determine the expression rate. The following experiment determines the promoter strength.
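
The product described above can be written out directly. This is a minimal sketch of that relationship only; the simulation may apply further factors internally, the function name is hypothetical, and the numbers are made up for illustration.

```python
def expression_rate(promoter_strength, growth_rate, biomass):
    """Expression rate as the product of promoter strength, growth rate and biomass.

    The resulting unit depends on the units of the three inputs.
    """
    return promoter_strength * growth_rate * biomass

# Made-up numbers for illustration only, not simulation results.
print(expression_rate(promoter_strength=2.0, growth_rate=0.5, biomass=40))
```

Because the three factors multiply, a strong promoter cannot compensate for a culture that barely grows, and vice versa.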

**Resource cost:**
* **100 EUR for each experiment**

**Input:**
* **`newHost`: host ID generated in the cloning experiment, e.g. 'origin.1' (string)**


In [None]:
newHost = None
promoter_act1 = exp.measure_promoter_strength( newHost, mySequenceID )
promoter_data = combine_data([promoter_act1])
promoter_data.display_data()
exp.print_status()

#### 4.1.3 Measurement of the final vaccine expression rate
Now that you have tested some promoter sequences, perform the production experiments with the chosen promoter sequence (identified by its sequence ID), using the optimal growth temperature, the maximum growth rate and the maximum biomass from the strain growth characterization task.

Finally, the experimental results are exported to a csv file for further data analysis. The function works without further user input. The output file is stored in the `Data` subfolder as `Production_Simulation.csv`.

**Resource cost:**
* **500 EUR for each experiment**

**Input:**
* **in `simulate_vaccine_production`: host names (list of strings), gene name (string), optimal temperature (integer), optimal growth rate (float), optimal biomass (integer)**

In [None]:
# To perform the production experiment, replace None with the host names of your clones, the gene name (sequence ID) of your best performing clone, the optimal growth temperature, the corresponding maximum growth rate and the maximum biomass (in this order).
vacProd1 = exp.simulate_vaccine_production(host_names=[None, None], gene_name=None, cult_temp=None, growth_rate=None, biomass=None)

if vacProd1.error:
    print('Experiment failed.')
else:
    filepath = os.path.join('..','Data','Production_Simulation.csv')
    # add variable names for additional vaccine production experiment results here:
    vacProdAll = combine_data([vacProd1]) # e.g. [vacProd1, vacProd2]
    vacProdAll.export_data(filepath)
    print('Clone Outcome production stored in: {}'.format(filepath))
    
exp.print_status()

### 4.2 Data analysis of promoter strength

The biotechnological goal is to construct a host strain with high productivity. Scientifically, we would also like to investigate the relationship between promoter strength and GC content. First, you examine your own data for a correlation in a plot. Subsequently, all groups enter their results into a shared online plot of species-specific promoter strength versus GC content. The online plot shows the results of all groups and allows a more solid conclusion.
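
Besides inspecting a plot, the strength of a linear relationship can be summarized with the Pearson correlation coefficient. The following sketch uses plain Python and made-up values that say nothing about the real relationship; `pearson_r` is a hypothetical helper.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only, not simulation results.
gc_values = [0.40, 0.45, 0.50, 0.55, 0.60]
expression = [1.2, 0.9, 1.1, 0.7, 0.8]
print(round(pearson_r(gc_values, expression), 2))
```

With only a handful of promoters per group the coefficient is very uncertain, which is why pooling the results of all groups gives a more solid conclusion.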


#### 4.2.1 Visualization of the results

Summarize your results from the laboratory workflow in a graph by plotting the final relative expression rates against the GC contents of the respective promoter sequences. You can perform the data analysis in Excel by importing the data file `Production_Simulation.csv` and generating a scatter plot of relative expression rate versus GC content. Alternatively, you can use the Python code with scrambled lines (interactive website: [in this link](http://parsons.problemsolving.io/puzzle/7cd06ab7e8c94c7eba31bc50ac07a0a7)).

Enter the correct code sequence in the cell below. The line that extracts the data columns for GC content and relative expression rate is missing the column names of the corresponding data. Enter these column names, as given in the header of the csv file, in the fields labeled `None`.

**Input:**
* **`my_data`: GC-content column name (string)**
* **`my_data`: expression rate column name (string)**

In [None]:
# %load Snippets/rev_ExprPlot.py
# Insert the correct code sequence for plotting in this cell.
# %load Snippets/snip_ExprPlot.py 

my_data = pd.read_csv(DataFile)
DataFile = os.path.join('..','Data','Production_Simulation.csv')
GCcont, Express = my_data[None].values, my_data[None].values
plt.gca().set(xlabel='GC-cont', ylabel='rel. expression')
plt.savefig('RelExpress_Vs_GCcont_allProm.png', format='png')
plt.plot(GCcont,Express, linestyle = '--', marker = 'x', color = 'grey')


#### Package dependencies

In [None]:
%load_ext watermark
%watermark -v -m -p IPython,ipywidgets,matplotlib,numpy,pandas,openpyxl,sklearn,scipy,joblib,watermark

### Loading of libraries if not automatic

In [None]:
%pip install pandas
%pip install silvio
%pip install scikit-learn