# Developing a simulation of the PCR step in the HR-SIP pipeline

* Main goal is to add in template saturation, which should flatten out the abundance distributions of the abundant taxa


## Method

Based on:
> Suzuki MT, Giovannoni SJ. (1996). Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol 62:625–630.


* User input
  * Molarity of DNA used for PCRs
    * specified as a distribution
  * Molarity of primers used for PCRs
    * specified as a distribution
  * Number of PCR cycles
    * default: 30
  * Starting efficiency
* Workflow
  * OTU_table --> PCR_sim --> subsample
    * PCR_sim input/output = OTU_table

# Setting variables

In [10]:
# dirs
workDir = '/home/nick/notebook/SIPSim/dev/bac_genome3/PCR_sim/'
genomeDir = '/home/nick/notebook/SIPSim/dev/bac_genome3/genomes/'
R_dir = '/home/nick/notebook/SIPSim/lib/R/'

# input files
genomeIndexFile = '/home/nick/notebook/SIPSim/dev/bac_genome3/genomes/genome_index.txt'
fragBD_file = '/home/nick/notebook/SIPSim/dev/bac_genome3/validation/ampFrags_real_kde_dif.pkl'

# params
nprocs = 24

# Init

In [7]:
import os,sys
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [8]:
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)

news(Version == "1.0.1", package = "ggplot2")



In [12]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)

# Making a community file

In [14]:
commFile = 'comm_n1.txt'

!cd $workDir; \
    SIPSim communities \
    $genomeIndexFile \
    > $commFile
    
!cd $workDir; \
    head $commFile

library	taxon_name	rel_abund_perc	rank
1	Streptomyces_pratensis_ATCC_33331	69.936537915	1
1	Escherichia_coli_1303	20.426193477	2
1	Clostridium_ljungdahlii_DSM_13528	9.637268608	3


# Making an incorp config file

In [20]:
incorpConfigFile = 'PT0_PI0.config'

!cd $workDir; \
    SIPSim incorpConfigExample \
    --percTaxa 0 \
    --percIncorpUnif 0 \
    > $incorpConfigFile
    
!cd $workDir; \
    head $incorpConfigFile


[1]
    # baseline: no incorporation
    
    [[intraPopDist 1]]
        distribution = uniform
        
        [[[start]]]
            
            [[[[interPopDist 1]]]]


# Adding isotope incorporation to BD distribution

In [23]:
wIncorpFile = os.path.splitext(fragBD_file)[0] + '_incorp.pkl'

!cd $workDir; \
    SIPSim isotope_incorp \
    $fragBD_file \
    $incorpConfigFile \
    --comm $commFile \
    --np $nprocs \
    > $wIncorpFile
    
!cd $workDir; \
    ls -thlc $wIncorpFile

Processing: Clostridium_ljungdahlii_DSM_13528
Processing: Streptomyces_pratensis_ATCC_33331
Processing: Escherichia_coli_1303
Processing: Clostridium_ljungdahlii_DSM_13528
Processing: Streptomyces_pratensis_ATCC_33331
Processing: Escherichia_coli_1303
-rw-rw-r-- 1 nick nick 4.6M Sep  5 11:17 /home/nick/notebook/SIPSim/dev/bac_genome3/validation/ampFrags_real_kde_dif_incorp.pkl


# Simulating gradient fractions

In [16]:
fracsFile = 'fracs.txt'

!cd $workDir; \
    SIPSim gradient_fractions \
    $commFile \
    > $fracsFile
    
!cd $workDir; \
    head -n 4 $fracsFile

library	fraction	BD_min	BD_max	fraction_size
1	1	1.66	1.665	0.005
1	2	1.665	1.67	0.005
1	3	1.67	1.673	0.003


# Simulating an OTU table

In [24]:
otuTableFile = 'OTU_abs1e9.txt'

!cd $workDir; \
    SIPSim OTU_table \
    $wIncorpFile \
    $commFile \
    $fracsFile \
    --abs 1e9 \
    --np $nprocs \
    > $otuTableFile
    
!cd $workDir; \
    head -n 4 $otuTableFile

Loading files...
Simulating OTUs...
Processing library: "1"
  Processing taxon: "Streptomyces_pratensis_ATCC_33331"
   taxon abs-abundance:  699365379
  Processing taxon: "Escherichia_coli_1303"
   taxon abs-abundance:  204261935
  Processing taxon: "Clostridium_ljungdahlii_DSM_13528"
   taxon abs-abundance:  96372686
library	taxon	fraction	BD_min	BD_mid	BD_max	count
1	Clostridium_ljungdahlii_DSM_13528	1.660-1.665	1.66	1.663	1.665	0
1	Clostridium_ljungdahlii_DSM_13528	1.665-1.670	1.665	1.667	1.67	2954
1	Clostridium_ljungdahlii_DSM_13528	1.670-1.673	1.67	1.671	1.673	6756


# Simulating PCR

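* Assuming 5 ng of DNA in a 30 ul rxn
  * Avg. MW of dNTPs = 499.5 g/mol
  * Molarity of DNA in rxn = `5 * 1e-9 * (1 / 499.5) * (1 / (30 * 1e-6))` = ~3.3e-7 M
    * Dividing by a per-nucleotide MW gives the molarity of nucleotides, not of template molecules
* Molarity of primers:
  * 1 uM (each)
* Number of PCR cycles:
  * 30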
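
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the notes; the variable names are illustrative and are not SIPSim parameters.

```python
# Values taken from the notes above; variable names are illustrative only.
dna_mass_g = 5 * 1e-9          # 5 ng of DNA
mw_g_per_mol = 499.5           # avg. MW used in the notes
rxn_vol_l = 30 * 1e-6          # 30 ul reaction volume

# mass -> moles -> moles per liter
dna_molarity = dna_mass_g * (1 / mw_g_per_mol) * (1 / rxn_vol_l)

primer_molarity = 1e-6         # 1 uM (each primer)
n_cycles = 30

print('DNA molarity: {:.3e} M'.format(dna_molarity))
print('Primer molarity: {:.3e} M'.format(primer_molarity))
```

With these values the primers are in roughly three-fold molar excess over the starting DNA (as nucleotides).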
  
## Model from Suzuki MT & Giovannoni SJ (1996)
  
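* Equation for modeling the increase in polymerized DNA conc.
  * $M = M_0 \cdot e^{f \cdot n}$
* Equation for modeling rxn efficiency
  * $f(n) = f_0 \cdot \left(\frac{P(n)}{k \cdot M(n) + P(n)}\right)$
* Symbols
  * $M_0$ = starting DNA conc., $n$ = cycle number, $f_0$ = starting efficiency
  * $P(n)$, $M(n)$ = primer and DNA conc. at cycle $n$; $k$ = constant of the model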
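
To see how the efficiency term behaves, here is a minimal sketch of $f(n)$ as a Python function. The function name and the values of `f0` and `k` are placeholders chosen for illustration; they are not taken from the paper or from SIPSim.

```python
import math


def pcr_efficiency(f0, P, M, k):
    """Efficiency for one cycle: f0 * P / (k * M + P).

    f0 : starting efficiency
    P  : primer concentration at this cycle
    M  : template (DNA) concentration at this cycle
    k  : model constant
    """
    return f0 * P / (k * M + P)


# Placeholder values (assumptions, not fitted parameters)
f0 = math.log(2)   # with M = M_0 * e^(f*n), f = ln(2) means doubling per cycle
k = 5.0            # arbitrary value for illustration
P = 1e-6           # 1 uM primer

# Efficiency drops as template concentration rises relative to primer
for M in [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]:
    f = pcr_efficiency(f0, P, M, k)
    print('M = {:.0e} M  ->  f = {:.4f}'.format(M, f))
```

When `M` is negligible relative to `P`, the efficiency stays close to `f0`; once `k * M` dominates, it falls toward zero, which is the template saturation effect.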
  
  
## Algorithm

* deepcopy of OTU table object
  * perc_rel_abund = 0 (to be filled in later)
  * post_PCR_conc = 0 
* foreach column (fraction community) in OTU table:
  * select total start DNA from distribution
  * foreach taxon:
    * taxon_DNA_conc = rel_abund_perc / 100 * DNA_molarity
      * (starting DNA conc. for taxon) = M_0
    * foreach PCR cycle:
      * calculate efficiency (f)
      * calculate new concentration (M)
    * add to dataframe: post_PCR_conc
  * Normalize all post_PCR_conc by total sample conc
    * this is the new perc_rel_abund
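
The outline above could be prototyped roughly as follows. This is a sketch under several assumptions, not the SIPSim implementation: the OTU table is a plain dict (`fraction -> {taxon: count}`) instead of the OTU table object, the exponential growth equation is applied one cycle at a time with that cycle's efficiency, each taxon is amplified independently with its own primer pool, and every new strand is assumed to consume one primer. `amplify`, `pcr_sim`, `draw_dna_molarity`, and the values of `f0` and `k` are all hypothetical.

```python
import math


def amplify(M0, P0, n_cycles, f0, k):
    """Post-PCR concentration of one taxon starting at concentration M0."""
    M, P = M0, P0
    for _ in range(n_cycles):
        if M <= 0 or P <= 0:
            break
        f = f0 * P / (k * M + P)              # efficiency for this cycle
        made = min(M * math.exp(f) - M, P)    # new product, capped by primer left
        M += made
        P -= made                             # assumption: 1 primer per new strand
    return M


def pcr_sim(otu_counts, draw_dna_molarity, primer_molarity=1e-6,
            n_cycles=30, f0=math.log(2), k=5.0):
    """Return post-PCR percent relative abundances per fraction."""
    result = {}
    for fraction, counts in otu_counts.items():
        total = float(sum(counts.values()))
        if total == 0:
            # empty fraction: nothing to amplify
            result[fraction] = {taxon: 0.0 for taxon in counts}
            continue
        dna_molarity = draw_dna_molarity()    # total starting DNA for this fraction
        post_PCR_conc = {}
        for taxon, count in counts.items():
            rel_abund_perc = count / total * 100
            M0 = rel_abund_perc / 100 * dna_molarity
            post_PCR_conc[taxon] = amplify(M0, primer_molarity, n_cycles, f0, k)
        post_total = sum(post_PCR_conc.values())
        result[fraction] = {taxon: conc / post_total * 100
                            for taxon, conc in post_PCR_conc.items()}
    return result


# Made-up counts for a single fraction (illustration only)
otu_counts = {
    '1.670-1.673': {
        'Streptomyces_pratensis_ATCC_33331': 70,
        'Escherichia_coli_1303': 20,
        'Clostridium_ljungdahlii_DSM_13528': 10,
    }
}

# Fixed starting molarity here; a random draw could be passed instead
sim = pcr_sim(otu_counts, draw_dna_molarity=lambda: 3.34e-7)
for taxon, perc in sorted(sim['1.670-1.673'].items()):
    print('{}\t{:.2f}'.format(taxon, perc))
```

Passing `draw_dna_molarity` as a callable keeps the "select total start DNA from distribution" step pluggable: a fixed value makes runs reproducible, while a function wrapping a random draw would give per-fraction variation.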