# Simulations Lab

In [None]:
%matplotlib inline
import numpy as np
import scipy.stats
from matplotlib import pyplot as plt

## Practice

### Optional exercises to review basic Python

+ If you want to build a collection of things, keeping track of their order, but do not know how many things you will have, should you use a `dict`, `list`, or `array`?  

+ If you want to do fast math operations on many numbers at once, should you use a `tuple`, `array` or `list`?

+ Use an appropriate data type that allows you to look up the common name of an organism given its scientific name.


| scientific | common |
|------------|--------|
| X. laevis  | frog   |
| M. musculus| mouse  |
| H. sapiens | human  |

+ Use a `for` loop to print out every 3rd value between -97 and 33.

+ You have an instrument that is going to spit out 1000 decimal numbers.  The following loop simulates this using the `time.time()` function, which returns the time in seconds since January 1, 1970 at 00:00 UTC (the Unix epoch).  **Capture the output using a numpy array (no lists allowed) and determine the mean.**

In [None]:
import time

for i in range(1000):
    instrument_spew = time.time()
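
One way to approach the capture step is sketched below. It assumes the number of readings is known ahead of time, so the array can be preallocated and filled by index. The names `n_readings`, `readings`, and `mean_reading` are illustrative and not part of the original exercise.

```python
import time

import numpy as np

n_readings = 1000

# Preallocate a float array; each reading is written into its own slot,
# so no intermediate list is needed.
readings = np.zeros(n_readings, dtype=float)

for i in range(n_readings):
    instrument_spew = time.time()
    readings[i] = instrument_spew

# Mean of all captured readings (seconds since the epoch).
mean_reading = np.mean(readings)
print(mean_reading)
```

Preallocating works here because the loop length is fixed. If the number of readings were unknown, a different strategy would be needed.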

### Optional exercises to review plotting

Reference material is [here](https://aclarke.uoregon.edu:8000/user/[YOUR_USER_NAME_HERE]/notebooks/pythonic-science/chapters/00_inductive-python/plotting_reference.ipynb).

+ On the same graph, plot: $y = x$, $y = x^2$, $y = x^{3}$ ... $y = x^{10}$ for $x \in [-10,10]$.


+ Plot $y = e^{x}$ for $x \in [-5,5]$.  Make two different graphs (in the same cell!), one using a linear $x$ scale, one using a $\ln(x)$ $x$ scale.  (Hint: use `plt.show()`.)


+ Plot a histogram of 1000 random numbers sampled from a Poisson distribution with $\lambda = 3$.

# Lab

## Coins in a row

You place 100 coins heads up in a row and number them by position, with the coin all the way on the left No. 1 and the one on the rightmost edge No. 100. Next, for every number N, from 1 to 100, you flip over every coin whose position is a multiple of N. For example, first you’ll flip over all the coins, because every number is a multiple of 1. Then you’ll flip over all the even-numbered coins, because they’re multiples of 2. Then you’ll flip coins No. 3, 6, 9, 12 … And so on.

What do the coins look like when you’re done? Specifically, which coins are heads down?

Source: [fivethirtyeight](https://fivethirtyeight.com/features/can-you-survive-this-deadly-board-game/)
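
The flipping procedure can be simulated directly. The sketch below is one possible approach, not the only one: it stores the coins in a boolean numpy array (`True` for heads up) and uses strided slices to reach every multiple of N. The variable names are illustrative.

```python
import numpy as np

n_coins = 100

# True means heads up; index 0 holds coin No. 1.
heads_up = np.ones(n_coins, dtype=bool)

for n in range(1, n_coins + 1):
    # Coins at positions n, 2n, 3n, ... sit at indices n-1, 2n-1, 3n-1, ...
    heads_up[n - 1::n] = ~heads_up[n - 1::n]

# Convert zero-based indices back to one-based coin numbers.
heads_down = np.flatnonzero(~heads_up) + 1
print(heads_down)
```

Each coin is flipped once per divisor of its position, so the printed list can be checked by hand against the positions that have an odd number of divisors.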

## Mutant Screen

You are doing a classic blue/white lacZ mutant screen, where bacterial mutants of interest have white colonies rather than blue colonies.  You screen a library containing 10,000 mutants.  You expect that 1/1,000 mutants will be white.  Assuming no bias in the library, how many colonies do you need to examine to have less than a 2% chance of missing a mutant? (You can report your sampling to within a factor of ten.)
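
A minimal simulation sketch for this question follows. It rests on two assumptions that the problem statement leaves open: each colony examined is an independent draw that is white with probability 1/1,000, and "missing a mutant" means seeing no white colony at all. The sample sizes step by factors of ten, and the names `results`, `n_trials`, and `max_miss` are illustrative.

```python
import numpy as np

p_white = 1 / 1000       # expected fraction of white colonies (from the text)
max_miss = 0.02          # acceptable chance of seeing no white colony

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_trials = 20000                 # simulated screens per sample size

results = {}
for n_colonies in [10, 100, 1000, 10000, 100000]:
    # Each simulated screen counts the white colonies among n_colonies draws.
    n_white = rng.binomial(n_colonies, p_white, size=n_trials)
    simulated_miss = np.mean(n_white == 0)
    # Closed form under the same independence assumption.
    exact_miss = (1 - p_white) ** n_colonies
    results[n_colonies] = (simulated_miss, exact_miss)
    print(n_colonies, simulated_miss, exact_miss, exact_miss < max_miss)
```

Comparing the simulated and closed-form columns is a quick check that the simulation is doing what was intended.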

## Dubious Paper

You read a paper that makes a big deal out of the following result.  

<div style="margin:auto">
    <img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/labs/01_simulation/paper-result.png" />
</div>

The authors claim that the difference between treatment 1 and 2 is significant and important.  You are skeptical and want to test the claim.  Fortunately, these scientists published their response data for each treatment condition in the supplement.  These are copied below. 

In [None]:
treat_1 =   [57.26977195, 46.18382224, 49.53778012, 41.48839620, 60.208242,
             50.52545917, 46.35328597, 45.74836944, 48.44702572, 52.524908,
             55.10329891, 46.61524479, 52.13253421, 54.72779465, 42.324008,
             50.33964928, 52.18085508, 53.24086389, 43.14439906, 45.148827]

treat_2 =   [60.41763564, 48.83220035, 51.12384165, 42.96237314, 62.606467,
             51.96334172, 50.68015860, 48.75041835, 51.08492900, 55.163020,
             58.30618134, 51.06279668, 54.75658646, 57.74810245, 46.318017,
             51.32863816, 54.85243237, 55.94919523, 46.42182621, 48.367620]

+ What is the difference between the means of these samples?

+ You then apply a t-test to these samples, assuming each measurement is independent.  What is the p-value?  Do you believe the researchers' conclusions? 

+ Before you write to the authors, you decide to carefully reread their methods.  You discover that the values in `treat_1` and `treat_2` are **paired** rather than independent.  (This means that you can compare the first sample in `treat_1` to the first sample in `treat_2`, the second sample to the second, and so on.)  What is the p-value for a paired t-test?  Does this alter your conclusion above?
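
The three questions above map onto standard `scipy.stats` calls. The helper below is a sketch, and `compare_treatments` is a hypothetical name rather than anything from the paper or the lab. It uses `scipy.stats.ttest_ind` with its default equal-variance setting for the independent test and `scipy.stats.ttest_rel` for the paired test.

```python
import numpy as np
import scipy.stats


def compare_treatments(sample_1, sample_2):
    """Return (difference in means, independent p-value, paired p-value)."""
    a = np.asarray(sample_1, dtype=float)
    b = np.asarray(sample_2, dtype=float)

    # Difference between the sample means (second minus first).
    mean_difference = np.mean(b) - np.mean(a)

    # Two-sample t-test that treats every measurement as independent.
    independent = scipy.stats.ttest_ind(a, b)

    # Paired t-test: compares a[i] with b[i], so both must be the same length.
    paired = scipy.stats.ttest_rel(a, b)

    return mean_difference, independent.pvalue, paired.pvalue
```

Calling `compare_treatments(treat_1, treat_2)` with the lists from the cell above returns all three numbers at once, which makes it easy to see how much the pairing assumption changes the p-value.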

## Growing bacteria 

You are studying the growth of bacteria from a small inoculum to saturation in a flask.  The following information  will help you answer the questions below.

Given your growth conditions, the number of bacteria at time $t$ will grow according to:

$$N(t) = \frac{N_{c}}{1 + \exp(\lambda - kt)}$$

where $N_{c}$, $\lambda$, and $k$ are constants. $N_{c}$ is the maximum number of bacteria that can be supported by the environment, $\lambda$ captures how long it takes for the bacteria to start dividing post inoculation, and $k$ is the instantaneous growth rate. (We're ignoring the fact that bacteria eventually start to die after they run out of food.)  For your wildtype bacteria, these constants are: $N_{c} = 1 \times 10^{10}\ cells \cdot mL^{-1}$, $\lambda = 12$ and $k = 0.05\  min^{-1}$.

  

You can measure bacterial growth by following the turbidity of your cultures using a spectrophotometer.  By careful calibration, you know that the observed $OD_{600}$ is related to the number of cells by:
$$OD_{600} = \frac{gN}{N + K}$$

where $g = 3.5$ and $K = 2 \times 10^{9}\ cells \cdot mL^{-1}$.
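
The two equations above translate almost term by term into numpy. The sketch below uses the wildtype constants from the text as default arguments. The function names `growth` and `od600`, the parameter name `lam` (since `lambda` is a reserved word in Python), and the 0 to 500 minute time grid are choices made for illustration.

```python
import numpy as np


def growth(t, N_c=1e10, lam=12.0, k=0.05):
    """Cells per mL at time t (minutes): N_c / (1 + exp(lam - k*t))."""
    t = np.asarray(t, dtype=float)
    return N_c / (1 + np.exp(lam - k * t))


def od600(N, g=3.5, K=2e9):
    """Observed OD600 for N cells per mL: g*N / (N + K)."""
    N = np.asarray(N, dtype=float)
    return g * N / (N + K)


# Evaluate both functions on a whole array of time points at once.
t = np.linspace(0, 500, 501)
N = growth(t)
od = od600(N)
```

Two quick sanity checks follow from the formulas: $N(t) = N_{c}/2$ when $t = \lambda/k$, and $OD_{600} = g/2$ when $N = K$.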


### Questions

+ Write two functions, one describing bacterial growth, the other describing the spectrophotometer.  (Hint: use numpy arrays, not lists!).  

+ Create three separate graphs: 
  + $N(t)$ vs. $t$.
  + $OD_{600}$ vs. $N$
  + $OD_{600}$ vs. $t$

+ You introduce a mutation that you expect may increase the carrying capacity by a factor of 5. Would you rather measure this effect at 150 or 300 minutes?

+ You send your research assistant out to measure five biological replicates of the wildtype and mutant.  Unfortunately, they do not record the exact starting time of each biological replicate.  For each replicate, they guess the start time is 0 ± 25 min (normally distributed, with a standard deviation of 25 min).  Can you still measure a difference between the wildtype and mutant at 200 minutes?  What about 300 minutes?
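
The last question can be explored by simulating the noisy experiment many times. The sketch below makes several assumptions that are not in the original text: the mutant differs from wildtype only by a five-fold larger $N_{c}$, the start-time error is the only source of noise, and "measuring a difference" means an independent two-sample t-test gives $p < 0.05$. The function `fraction_detected` and its parameters are hypothetical names.

```python
import numpy as np
import scipy.stats


def growth(t, N_c=1e10, lam=12.0, k=0.05):
    return N_c / (1 + np.exp(lam - k * np.asarray(t, dtype=float)))


def od600(N, g=3.5, K=2e9):
    return g * N / (N + K)


def fraction_detected(t_nominal, n_experiments=2000, n_replicates=5,
                      sd_start=25.0, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which a t-test gives p < alpha."""
    rng = np.random.default_rng(seed)
    shape = (n_experiments, n_replicates)

    # True elapsed time differs from the nominal time by the start-time error.
    t_wt = t_nominal + rng.normal(0.0, sd_start, size=shape)
    t_mut = t_nominal + rng.normal(0.0, sd_start, size=shape)

    od_wt = od600(growth(t_wt))
    od_mut = od600(growth(t_mut, N_c=5e10))  # assumed: 5x carrying capacity

    # One t-test per simulated experiment (each row holds five replicates).
    p_values = scipy.stats.ttest_ind(od_wt, od_mut, axis=1).pvalue
    return np.mean(p_values < alpha)


for t_nominal in (200, 300):
    print(t_nominal, fraction_detected(t_nominal))
```

Changing `n_replicates` or `sd_start` shows how sensitive the answer is to the number of replicates and to the quality of the time records.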


## PCR

You have a tube containing 10 molecules of DNA drawn from a population containing four different species $A$, $B$, $C$, and $D$.  You want to estimate the frequencies of $A$, $B$, $C$, and $D$ in the original population.  If you were omniscient, you would know that $A$, $B$, $C$, and $D$ have the following actual frequencies:

| species | frequency |
|:-------:|:---------:|
| A       |  0.5      |
| B       |  0.2      |
| C       |  0.2      |
| D       |  0.1      |

Since you aren't omniscient, you make a measurement. You use a Polymerase Chain Reaction (PCR) to amplify those 10 molecules, then use high-throughput sequencing to measure the relative frequencies of A-D in the final pool.  

**What are the mean and standard deviation of your estimates of each frequency ($\hat{f}_{A}$, $\hat{f}_{B}$, $\hat{f}_{C}$, $\hat{f}_{D}$)?**

Some information about the experiment:

+ The PCR reaction takes every sequence present in the solution and doubles its number.
+ Your reaction is 90% efficient.  This means that a random 90% of the molecules in the population are doubled each round.
+ You run the reaction for 15 rounds. 
+ Your high-throughput sequencing reaction is perfect and does not introduce any error.
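
The experiment described above can be sketched as a simulation. Two modeling assumptions are made here that go beyond the text: the 10 starting molecules are a multinomial draw from the true frequencies, and 90% efficiency is treated as each molecule being copied independently with probability 0.9 per round (rather than exactly 90% of the pool). The function `simulate_pcr` and its arguments are hypothetical names.

```python
import numpy as np


def simulate_pcr(n_trials=5000, n_molecules=10, n_rounds=15,
                 efficiency=0.9, freqs=(0.5, 0.2, 0.2, 0.1), seed=0):
    """Return an (n_trials, 4) array of estimated frequencies for A-D."""
    rng = np.random.default_rng(seed)

    # Draw the starting molecules for every trial: shape (n_trials, 4).
    counts = rng.multinomial(n_molecules, freqs, size=n_trials)

    for _ in range(n_rounds):
        # Assumption: each molecule is copied independently with
        # probability `efficiency` in every round.
        counts = counts + rng.binomial(counts, efficiency)

    # Perfect sequencing: the estimate is the exact fraction in the final pool.
    return counts / counts.sum(axis=1, keepdims=True)


estimates = simulate_pcr()
print(estimates.mean(axis=0))   # mean estimate for A, B, C, D
print(estimates.std(axis=0))    # standard deviation for A, B, C, D
```

Setting `n_rounds=0` isolates the noise that comes from drawing only 10 molecules, which helps separate sampling error from amplification error.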
