In [69]:
import os
import glob
import pandas as pd
import numpy as np

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

%load_ext blackcellmagic



# Exercise 7.1: Computing things!

We have looked at a data set from Harvey and Orbidans on the cross-sectional area of C. elegans eggs. Recall, we loaded the data and converted everything to Numpy arrays like this:

    df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')

    xa_high = df.loc[df['food']=='high', 'area (sq. um)'].values
    xa_low = df.loc[df['food']=='low', 'area (sq. um)'].values

Now we would like to compute the diameter of the egg from the cross-sectional area. Write a function that takes in an array of cross-sectional areas and returns an array of diameters. Recall that the diameter $d$ and cross-sectional area $A$ are related by $A=\pi d^2/4$. There should be no `for` loops in your function! The call signature is

    xa_to_diameter(xa)
    
Use your function to compute the diameters of the eggs.

In [2]:
df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')

xa_high = df.loc[df['food']=='high', 'area (sq. um)'].values
xa_low = df.loc[df['food']=='low', 'area (sq. um)'].values

In [3]:
def xa_to_diameter(xa):
    """Takes in an array of cross-sectional areas and returns an array of 
    diameters."""
    
    return np.sqrt(4 * xa / np.pi)

In [4]:
diameter_high = xa_to_diameter(xa_high)

diameter_high

array([46.29105911, 51.22642581, 47.76657057, 48.5596503 , 51.59790585,
       47.61973991, 49.33998388, 47.89966242, 47.21697198, 46.94654036,
       49.08125119, 49.84064959, 47.9926071 , 46.29105911, 47.69988539,
       48.40207395, 48.15152345, 49.3141717 , 49.57168871, 47.87307365,
       48.30991705, 46.29105911, 46.12573337, 46.24978308, 46.41466697,
       47.87307365, 48.15152345, 48.95137203, 45.72372833, 47.18999856,
       46.68817945, 45.98750791, 46.53794651, 52.2111661 , 48.70364742,
       47.23045291, 47.06842687, 46.81073869, 45.97366251, 49.57168871,
       50.8397116 , 48.54653847, 52.08909166, 48.24398292])

In [5]:
diameter_low = xa_to_diameter(xa_low)

diameter_low

array([48.40207395, 51.58556628, 52.55146594, 50.31103472, 53.06982074,
       54.57203767, 50.32368681, 52.24773281, 53.99739399, 49.44309786,
       53.87936676, 47.9926071 , 52.41804019, 47.87307365, 52.11352942,
       51.21399674, 52.44232467, 50.47526453, 50.8397116 , 51.56087828,
       49.84064959, 55.96578669, 50.72688754, 50.58864976, 52.18677405,
       52.44232467, 51.78264653, 52.57568879, 51.86863366, 52.67246879,
       49.05530287, 52.67246879, 50.72688754, 50.07003758, 52.32078957,
       49.18490759, 53.72554372, 46.67454189, 49.19784929, 51.88090591,
       51.85635852, 54.8280819 , 52.07686848, 51.22642581, 51.96673046,
       48.29673743, 53.04582353, 52.07686848, 52.35727972, 50.57606396,
       51.70882946, 53.54750652, 52.23554675, 53.54750652, 53.18964437,
       51.96673046, 55.38261517])
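
A quick way to gain confidence in a conversion like `xa_to_diameter()` is a round trip: convert areas to diameters, convert back with $A = \pi d^2/4$, and confirm the original values are recovered. The sketch below does this on made-up areas; the values in `xa_test` are illustrative and are not taken from the data set.

```python
import numpy as np


def xa_to_diameter(xa):
    """Convert cross-sectional areas to diameters, d = sqrt(4 A / pi)."""
    return np.sqrt(4 * xa / np.pi)


# Illustrative areas in square microns (not taken from the data set)
xa_test = np.array([1700.0, 1900.0, 2100.0])

d_test = xa_to_diameter(xa_test)

# Invert the relation A = pi d^2 / 4 and compare with the input
xa_roundtrip = np.pi * d_test**2 / 4
print(np.allclose(xa_roundtrip, xa_test))
```

Because the function uses only array arithmetic and `np.sqrt()`, it works on arrays of any length without a `for` loop.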

# Exercise 7.2: Working with two-dimensional arrays

NumPy enables you to do matrix calculations on two-dimensional arrays. In this exercise, you will practice doing matrix calculations on arrays. We’ll start by making a matrix and a vector to practice with. You can copy and paste the code below.

    A = np.array(
        [
            [6.7, 1.3, 0.6, 0.7],
            [0.1, 5.5, 0.4, 2.4],
            [1.1, 0.8, 4.5, 1.7],
            [0.0, 1.5, 3.4, 7.5],
        ]
    )

    b = np.array([1.1, 2.3, 3.3, 3.9])

In [6]:
A = np.array(
    [
        [6.7, 1.3, 0.6, 0.7],
        [0.1, 5.5, 0.4, 2.4],
        [1.1, 0.8, 4.5, 1.7],
        [0.0, 1.5, 3.4, 7.5],
    ]
)

b = np.array([1.1, 2.3, 3.3, 3.9])

a) First, let’s practice slicing.

1. Print row 1 (remember, indexing starts at zero) of `A`.
1. Print columns 1 and 3 of `A`.
1. Print the values of every entry in `A` that is greater than 2.
1. Print the diagonal of `A` using the `np.diag()` function.


In [206]:
# 1

A[1, :]

array([0.1, 5.5, 0.4, 2.4])

In [208]:
# 2

A[:, [1, 3]]

array([[1.3, 0.7],
       [5.5, 2.4],
       [0.8, 1.7],
       [1.5, 7.5]])

In [9]:
# 3

A[A > 2]

array([6.7, 5.5, 2.4, 4.5, 3.4, 7.5])

In [10]:
# 4

np.diag(A)

array([6.7, 5.5, 4.5, 7.5])

b) The `np.linalg` module has some powerful linear algebra tools.

1. First, we’ll solve the linear system $A \cdot x = b$. Try it out: use `np.linalg.solve()`. Store your answer in the Numpy array `x`.
1. Now do `np.dot(A, x)` to verify that $A \cdot x = b$.
1. Use `np.transpose()` to compute the transpose of `A`.
1. Use `np.linalg.inv()` to compute the inverse of `A`.

In [11]:
# 1

x = np.linalg.solve(A, b)

x

array([0.03401232, 0.29137092, 0.60186517, 0.18888027])

In [12]:
# 2

# Compare with a tolerance; exact == on floats can fail from round-off
np.isclose(np.dot(A, x), b)

array([ True,  True,  True,  True])
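
Results of floating-point linear algebra generally carry round-off error, so a tolerance-based comparison is a safer habit than exact equality. The sketch below uses `np.allclose()`, which returns a single Boolean, to check both the solution of the linear system and the inverse.

```python
import numpy as np

A = np.array(
    [
        [6.7, 1.3, 0.6, 0.7],
        [0.1, 5.5, 0.4, 2.4],
        [1.1, 0.8, 4.5, 1.7],
        [0.0, 1.5, 3.4, 7.5],
    ]
)
b = np.array([1.1, 2.3, 3.3, 3.9])

x = np.linalg.solve(A, b)

# Tolerance-based check that A · x reproduces b
solves_system = np.allclose(A @ x, b)
print(solves_system)

# The same idea verifies an inverse: A · A^{-1} should be close to the identity
A_inv = np.linalg.inv(A)
is_inverse = np.allclose(A @ A_inv, np.eye(4))
print(is_inverse)
```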

In [13]:
# 3

np.transpose(A)

array([[6.7, 0.1, 1.1, 0. ],
       [1.3, 5.5, 0.8, 1.5],
       [0.6, 0.4, 4.5, 3.4],
       [0.7, 2.4, 1.7, 7.5]])

In [14]:
# 4

np.linalg.inv(A)

array([[ 0.15267508, -0.03365026, -0.01778   ,  0.00054854],
       [-0.00906001,  0.19788853,  0.03719385, -0.07090934],
       [-0.04391535, -0.0144834 ,  0.26880108, -0.05219479],
       [ 0.02172029, -0.0330119 , -0.12929526,  0.17117684]])

c) Sometimes you want to convert a two-dimensional array to a one-dimensional array. This can be done with `np.ravel()`.

1. See what happens when you do `B = np.ravel(A)`.
1. Look up the documentation for `np.reshape()`. Then, reshape `B` to make it look like `A` again.


In [15]:
# 1

B = np.ravel(A)

B

array([6.7, 1.3, 0.6, 0.7, 0.1, 5.5, 0.4, 2.4, 1.1, 0.8, 4.5, 1.7, 0. ,
       1.5, 3.4, 7.5])

In [16]:
A

array([[6.7, 1.3, 0.6, 0.7],
       [0.1, 5.5, 0.4, 2.4],
       [1.1, 0.8, 4.5, 1.7],
       [0. , 1.5, 3.4, 7.5]])

In [17]:
# 2

np.reshape(B, (4,4))

array([[6.7, 1.3, 0.6, 0.7],
       [0.1, 5.5, 0.4, 2.4],
       [1.1, 0.8, 4.5, 1.7],
       [0. , 1.5, 3.4, 7.5]])
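
Two details of `np.ravel()` and reshaping are worth checking directly. In the sketch below, passing `-1` as one dimension asks NumPy to infer it from the array's size, and `np.shares_memory()` shows whether the raveled array is a view of the original or a copy.

```python
import numpy as np

A = np.array(
    [
        [6.7, 1.3, 0.6, 0.7],
        [0.1, 5.5, 0.4, 2.4],
        [1.1, 0.8, 4.5, 1.7],
        [0.0, 1.5, 3.4, 7.5],
    ]
)

B = np.ravel(A)

# -1 asks NumPy to infer that dimension from the array's size
A_again = B.reshape((4, -1))
print(np.array_equal(A_again, A))

# For a contiguous array like A, ravel returns a view rather than a copy,
# so modifying B in place would also modify A
print(np.shares_memory(A, B))
```

If an independent copy is needed, `A.flatten()` always returns one.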

# Exercise 7.3: Understanding and building ECDFs

As a reminder, the empirical cumulative distribution function for a set of data points, evaluated at $x$, is

    ECDF(x) = fraction of data points ≤ x.


Write a function with call signature

    ecdf_vals(data)

which takes a one-dimensional NumPy array of data and returns the `x` and `y` values for plotting a "dot-style" ECDF. That is, each dot has a `y` value given by the ECDF evaluated at `x`.

In [213]:
def ecdf_vals(data):
    """Computes the ECDF for a one-dimensional NumPy array of data.
    Returns the x and y values for plotting a 'dot-style' ECDF."""
    
    # Sort the data array
    x = np.sort(data)
    
    # Compute ECDF
    y = (1 + np.arange(len(data))) / len(data)
    
    return x, y
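
A tiny worked example can make the construction concrete. The sketch below applies the same function to four made-up values: after sorting, the $k$-th smallest point gets the `y` value $k/n$.

```python
import numpy as np


def ecdf_vals(data):
    """Return sorted data and the ECDF evaluated at each sorted point."""
    x = np.sort(data)
    y = (1 + np.arange(len(data))) / len(data)
    return x, y


# Four illustrative data points
x_small, y_small = ecdf_vals(np.array([3.0, 1.0, 2.0, 5.0]))

print(x_small)  # [1. 2. 3. 5.]
print(y_small)  # [0.25 0.5  0.75 1.  ]
```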

In [214]:
# Dummy data set
rg = np.random.default_rng()
data = rg.normal(0, 1, size=100)

data

array([-0.78984585, -1.02862384, -0.94691183,  0.64565024,  1.94391978,
       -0.09510624, -1.36515065, -0.4148116 , -2.01215498,  0.18338547,
       -0.40339108, -1.08256148, -0.32764384, -0.59081842, -0.48814965,
       -0.12561053,  1.03185849, -0.45344707, -0.16346276, -0.25145224,
       -2.10595981, -1.43115957, -0.5799471 , -0.36197938, -1.03384511,
       -0.37097766, -0.49646029, -0.97110494, -0.62689164, -1.98281948,
       -1.05922596,  1.36532501, -0.5388963 , -0.37794264,  2.396229  ,
        0.03247631, -0.33961641, -0.32025818, -0.78020127,  0.71099197,
       -1.11754801, -3.04020301,  0.77270824, -1.11970336,  0.17440944,
        0.82101106,  0.14461258,  0.19323151,  0.22254713, -0.40321403,
       -0.25000085,  0.79627892,  1.44241537,  2.20550369,  0.48741652,
        0.04129688, -0.93886185, -1.34575963, -2.30895117,  0.54733632,
        0.37553072, -0.79388191, -0.19281764, -0.12018362, -0.00974653,
       -1.21235532,  0.9525258 ,  2.1151609 ,  1.70428909, -1.93

In [215]:
x, y = ecdf_vals(data)

In [216]:
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label='x',
    y_axis_label='ECDF',
)

p.circle(x, y)

bokeh.io.show(p)

# Exercise 7.4: Data collapse

a) Load in the three data sets. They are in the files `~/git/data/wt_lac.csv`, `~/git/data/q18m_lac.csv`, and `~/git/data/q18a_lac.csv`. You should put them in a single `DataFrame` with an added column for genotype. This can be accomplished, for example, with `pd.concat()`.

In [22]:
# Get list of files

files = glob.glob('data/*_lac.csv')

files

['data/q18a_lac.csv', 'data/q18m_lac.csv', 'data/wt_lac.csv']

In [23]:
!head data/q18a_lac.csv

# Data digitized from Fig. 14a of Phillips, Annu. Rev. Condens. Matter Phys. 2015. 6:85–111.
# Data are for Q18A mutant lac repressor.
[IPTG] (mM),fold change
9.4897068845641e-06,0.213514210553464
9.270247940633647e-05,0.25671154209865743
0.00019456274242902937,0.28426952769037683
0.00038672649807078616,0.34389711318581195
0.0007675898269530733,0.4183387493337461
0.0009368758732336757,0.4801739997592891
0.0015202885789386653,0.515001461460429


In [24]:
# Load in data sets

df_dict = {}

for file_name in files:
    # Extract genotype
    genotype = file_name[file_name.find('/')+1:file_name.rfind('_')]
    
    # Load in data set from file
    df_dict[genotype] = pd.read_csv(file_name, comment='#')

In [25]:
# Check one of them

df_dict['q18a'].head()

|  | [IPTG] (mM) | fold change |
|---|---|---|
| 0 | 9e-06 | 0.213514 |
| 1 | 9.3e-05 | 0.256712 |
| 2 | 0.000195 | 0.28427 |
| 3 | 0.000387 | 0.343897 |
| 4 | 0.000768 | 0.418339 |


In [26]:
# Add column for genotype

for genotype in df_dict:
    df_dict[genotype]['genotype'] = genotype.upper()
    
# Check one of them

df_dict['q18a'].head()

|  | [IPTG] (mM) | fold change | genotype |
|---|---|---|---|
| 0 | 9e-06 | 0.213514 | Q18A |
| 1 | 9.3e-05 | 0.256712 | Q18A |
| 2 | 0.000195 | 0.28427 | Q18A |
| 3 | 0.000387 | 0.343897 | Q18A |
| 4 | 0.000768 | 0.418339 | Q18A |


In [27]:
# Combine into a single DataFrame

df_lac = pd.concat(df_dict, ignore_index=True)

df_lac.head()

|  | [IPTG] (mM) | fold change | genotype |
|---|---|---|---|
| 0 | 9e-06 | 0.213514 | Q18A |
| 1 | 9.3e-05 | 0.256712 | Q18A |
| 2 | 0.000195 | 0.28427 | Q18A |
| 3 | 0.000387 | 0.343897 | Q18A |
| 4 | 0.000768 | 0.418339 | Q18A |


In [28]:
df_lac.tail()

|  | [IPTG] (mM) | fold change | genotype |
|---|---|---|---|
| 60 | 1.480551 | 0.723629 | WT |
| 61 | 3.1081 | 0.748718 | WT |
| 62 | 6.044332 | 0.768828 | WT |
| 63 | 11.494881 | 0.754358 | WT |
| 64 | 24.165452 | 0.764633 | WT |
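
The load-label-concatenate pattern can also be written more compactly. The sketch below is one possible variant, not the approach used above: it pulls the genotype out of each file name with `os.path.basename()` and adds the column with `DataFrame.assign()` before concatenating. The small data frames in `frames` are hypothetical stand-ins for the CSV files so that the example runs on its own.

```python
import os

import pandas as pd

# Hypothetical stand-ins for the contents of the CSV files
frames = {
    "data/q18a_lac.csv": pd.DataFrame(
        {"[IPTG] (mM)": [1e-5, 1e-4], "fold change": [0.21, 0.26]}
    ),
    "data/wt_lac.csv": pd.DataFrame(
        {"[IPTG] (mM)": [1e-5, 1e-4], "fold change": [0.05, 0.06]}
    ),
}

labeled = []
for file_name, df in frames.items():
    # 'data/q18a_lac.csv' -> 'q18a_lac.csv' -> 'q18a'
    genotype = os.path.basename(file_name).split("_")[0]

    # assign() returns a new DataFrame with the extra column
    labeled.append(df.assign(genotype=genotype.upper()))

df_all = pd.concat(labeled, ignore_index=True)
print(df_all)
```

Using `os.path.basename()` avoids depending on where the first `/` appears in the path, which matters if the files live in a nested directory.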


b) Make a plot of fold change vs. IPTG concentration for each of the three mutants. Think: should any of the axes have a logarithmic scale?

In [29]:
import colorcet

def scatter(data, 
            cat, 
            x, 
            y, 
            legend_location='top_left', 
            x_axis_type='linear', 
            y_axis_type='linear',
            palette=colorcet.b_glasbey_category10
           ):
    """Generates a scatter plot from x and y in data.
    Colors the glyphs according to cat."""
    
    p = bokeh.plotting.figure(
        frame_width=400,
        frame_height=300,
        x_axis_label=x,
        y_axis_label=y,
        x_axis_type=x_axis_type,
        y_axis_type=y_axis_type,
    )

    for i, val in enumerate(np.unique(data[cat])):
        p.circle(
            source=data.loc[data[cat]==val],
            x=x,
            y=y,
            legend_label=str(val),
            color=palette[i % len(palette)],
        )
    
    p.legend.title = cat
    p.legend.location = legend_location
    p.legend.click_policy = 'hide'

    bokeh.io.show(p)

In [30]:
scatter(data=df_lac,
        cat='genotype', 
        x='[IPTG] (mM)', 
        y='fold change',
        x_axis_type='log',
        legend_location='top_left',
       )

c) Write a function with the signature

    fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8)
    
to compute the theoretical fold change. It should allow `c`, the concentration of IPTG, to be passed in as a NumPy array or scalar, and `RK`, the $R/K$ ratio, must be a scalar. Remember, with NumPy arrays, you don’t have to write `for` loops to do operations to each element of the array.

The theoretical expression for the fold change as a function of IPTG concentration, $c$, is

$$\textrm{fold change} = \left[1 + \frac{\frac{R}{K}(1+c/K^\textrm{A}_\textrm{d})^2}{(1+c/K^\textrm{A}_\textrm{d})^2 + K_{\textrm{switch}}(1+c/K^\textrm{I}_\textrm{d})^2} \right]^{-1}$$

In [31]:
def fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Computes the theoretical fold change.
    Inputs:
    c: concentration of IPTG (mM)
    RK: ratio of number of repressors in cell to dissoc. const. for 
    active repressor binding operator
    KdA: dissoc. const. for active repressor binding IPTG (mM)
    KdI: dissoc. const. for inactive repressor binding IPTG (mM)
    Kswitch: equil. const. for switching active/inactive
    
    RK must be scalar.
    c can be passed in as a NumPy array or scalar."""
    
    return (1 + RK*(1 + c/KdA)**2 / ((1 + c/KdA)**2 + Kswitch*(1 + c/KdI)**2))**(-1)
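
A few quick checks can confirm that the function behaves as intended. The sketch below evaluates it for a scalar and for an array, compares the $c = 0$ value with the closed form $[1 + (R/K)/(1 + K_\textrm{switch})]^{-1}$ that the expression reduces to, and confirms that the fold change rises with IPTG concentration for the default parameters. The test concentrations are arbitrary, and $R/K = 141.5$ is the wild-type value used later in this exercise.

```python
import numpy as np


def fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Theoretical fold change as a function of IPTG concentration c (mM)."""
    active = (1 + c / KdA) ** 2
    inactive = Kswitch * (1 + c / KdI) ** 2
    return 1 / (1 + RK * active / (active + inactive))


# Scalar input: at c = 0 the expression reduces to 1 / (1 + RK / (1 + Kswitch))
fc_scalar = fold_change(0.0, RK=141.5)
print(np.isclose(fc_scalar, 1 / (1 + 141.5 / (1 + 5.8))))

# Array input: the same code is applied elementwise, no loop needed
c_test = np.array([0.0, 1e-3, 1e-1, 10.0])
fc_array = fold_change(c_test, RK=141.5)
print(fc_array)

# Fold change should increase with IPTG concentration for these parameters
print(np.all(np.diff(fc_array) > 0))
```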

d) You will now plot a smooth curve showing the theoretical fold change for each mutant. 
1. Make an array of closely spaced points for the IPTG concentration. Hint: The function `np.logspace()` will be useful. 
2. Compute the theoretical fold change based on the given parameters using the function you wrote in part (c). 
3. Plot the smooth curves on the same plot with the data.

In [32]:
# 1

# np.logspace() takes base-10 exponents, so use log10 for the upper limit
c = np.logspace(-5, np.log10(25), num=400)
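
Because `np.logspace()` interprets its first two arguments as base-10 exponents, an upper concentration of 25 mM corresponds to an exponent of `np.log10(25)`. The sketch below checks the endpoints and shows `np.geomspace()`, which takes the endpoints themselves, as an equivalent alternative. The name `c_log` is used only for illustration.

```python
import numpy as np

# Exponents are base 10: the range runs from 10**-5 to 10**log10(25) = 25
c_log = np.logspace(-5, np.log10(25), num=400)
print(c_log[0], c_log[-1])

# np.geomspace takes the endpoints directly rather than their exponents
c_geom = np.geomspace(1e-5, 25, num=400)
print(np.allclose(c_log, c_geom))
```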

In [33]:
# 2

RK_dict = dict(WT=141.5, Q18A=16.56, Q18M=1332)

fold_changes = {}

for genotype in RK_dict:
    fold_changes[genotype] = fold_change(c=c, RK=RK_dict[genotype])

In [94]:
# 3

# Use the hex-string palette; colorcet.glasbey_category10 holds lists of
# RGB floats, which Bokeh does not accept as colors
palette = colorcet.b_glasbey_category10

p_lac = bokeh.plotting.figure(
    frame_width=400,
    frame_height=300,
    x_axis_label='[IPTG] (mM)',
    y_axis_label='fold change',
    x_axis_type='log',
)

for i, val in enumerate(np.unique(df_lac['genotype'])):
    color = palette[i % len(palette)]

    # Rows belonging to this genotype
    inds = df_lac['genotype'] == val

    # Experimental data
    p_lac.circle(
        x=df_lac.loc[inds, '[IPTG] (mM)'],
        y=df_lac.loc[inds, 'fold change'],
        legend_label=str(val),
        color=color,
    )

    # Theoretical curve
    p_lac.line(
        x=c,
        y=fold_changes[val],
        legend_label=str(val)+', theoretical',
        color=color,
    )

p_lac.legend.title = 'genotype'
p_lac.legend.location = 'top_left'
p_lac.legend.click_policy = 'hide'

bokeh.io.show(p_lac)

In [222]:
import colorcet


def scatter_p(
    data,
    cat,
    x,
    y,
    legend_location="top_left",
    x_axis_type="linear",
    y_axis_type="linear",
    palette=colorcet.b_glasbey_category10,
):
    """Generates a Bokeh figure p from x and y in data.
    Colors the glyphs according to cat."""

    p = bokeh.plotting.figure(
        frame_width=400,
        frame_height=300,
        x_axis_label=x,
        y_axis_label=y,
        x_axis_type=x_axis_type,
        y_axis_type=y_axis_type,
    )

    for i, val in enumerate(np.unique(data[cat])):
        p.circle(
            source=data.loc[data[cat] == val],
            x=x,
            y=y,
            legend_label=str(val),
            color=palette[i % len(palette)],
        )

    p.legend.title = cat
    p.legend.location = legend_location
    p.legend.click_policy = "hide"

    return p


p = scatter_p(
    data=df_lac,
    cat="genotype",
    x="[IPTG] (mM)",
    y="fold change",
    x_axis_type="log",
    legend_location="top_left",
)

data = df_lac
cat = "genotype"
palette = colorcet.b_glasbey_category10


def scatter_add_theor_line(
    data, cat, x, y, p, palette=colorcet.b_glasbey_category10,
):
    """Adds theoretical lines to Bokeh figure p.

    x is the array of IPTG concentrations and y is a dictionary mapping
    each value of cat to its array of theoretical fold changes."""

    for i, val in enumerate(np.unique(data[cat])):
        p.line(
            x=x,
            y=y[val],
            legend_label=str(val) + ", theoretical",
            color=palette[i % len(palette)],
        )

    return p


p = scatter_add_theor_line(data=df_lac, cat="genotype", x=c, y=fold_changes, p=p)
p.legend.background_fill_alpha = 0

bokeh.io.show(p)

e) If we look at the functional form of the fold change and at the parameters we are given, we see that only $R/K$ varies from mutant to mutant. Daber, Sochol, and Lewis assumed that the binding to IPTG would be unaltered and the binding to DNA would be altered based on the position of the mutation in the lac repressor protein. Now, if this is true, then $R/K$ should be the only thing that varies. We can check this by seeing if the data collapse onto a single curve. To see how this works, we define the **Bohr parameter**, $F(c)$, as

$$ F(c) = -\ln(R/K) - \ln \left( \frac{(1 + c/K^\textrm{A}_\textrm{d})^2}{(1 + c/K^\textrm{A}_\textrm{d})^2 + K_\textrm{switch}(1 + c/K^\textrm{I}_\textrm{d})^2} \right) $$

The second term in the Bohr parameter is independent of the identity of the mutant, and the first term depends entirely upon it. Then, the fold change can be written as

$$ \textrm{fold change} = \frac{1}{1 + e^{-F(c)}}.$$

So, if we make our $x$-axis be the Bohr parameter, all data should fall on the same curve. Hence the term, data collapse.
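
To make the connection explicit, here is a short check that uses only the definitions above. Exponentiating $-F(c)$ gives

$$ e^{-F(c)} = \frac{R}{K}\,\frac{(1 + c/K^\textrm{A}_\textrm{d})^2}{(1 + c/K^\textrm{A}_\textrm{d})^2 + K_\textrm{switch}(1 + c/K^\textrm{I}_\textrm{d})^2}, $$

which is exactly the second term inside the brackets of the fold change expression from part (c). Substituting it there yields $\textrm{fold change} = \left[1 + e^{-F(c)}\right]^{-1}$.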

Now, we will plot the theoretical curve of fold change versus Bohr parameter.

1. Write a function with call signature `bohr_parameter(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8)` that computes the Bohr parameter.
1. Write a function with call signature `fold_change_bohr(bohr_parameter)` that gives the fold change as a function of the Bohr parameter.
1. Generate values of the Bohr parameter ranging from $-6$ to $6$ in order to make a smooth plot.
1. Compute the theoretical fold change as a function of the Bohr parameter and plot it as a gray line.

In [37]:
# 1

def bohr_parameter(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Computes the Bohr parameter."""
    
    return - np.log(RK) - np.log( (1 + c/KdA)**2 / ((1 + c/KdA)**2 + Kswitch*(1 + c/KdI)**2))

In [38]:
# 2

def fold_change_bohr(bohr_parameter):
    """Computes the fold change as a function of the Bohr parameter."""
    
    return 1 / (1 + np.exp(- bohr_parameter))
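
Since the Bohr parameter is just a change of variables, composing the two functions should reproduce the direct fold change calculation from part (c). The sketch below is a self-contained consistency check under that assumption; it redefines the three functions so it can run on its own and uses the $R/K$ values given in this exercise.

```python
import numpy as np


def fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Theoretical fold change as a function of IPTG concentration."""
    active = (1 + c / KdA) ** 2
    inactive = Kswitch * (1 + c / KdI) ** 2
    return 1 / (1 + RK * active / (active + inactive))


def bohr_parameter(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Bohr parameter F(c)."""
    active = (1 + c / KdA) ** 2
    inactive = Kswitch * (1 + c / KdI) ** 2
    return -np.log(RK) - np.log(active / (active + inactive))


def fold_change_bohr(bohr_parameter):
    """Fold change as a function of the Bohr parameter."""
    return 1 / (1 + np.exp(-bohr_parameter))


c_check = np.logspace(-5, np.log10(25), num=50)

# The two routes to the fold change should agree for every genotype
results = {}
for genotype, RK in dict(WT=141.5, Q18A=16.56, Q18M=1332).items():
    direct = fold_change(c_check, RK)
    via_bohr = fold_change_bohr(bohr_parameter(c_check, RK))
    results[genotype] = np.allclose(direct, via_bohr)

print(results)
```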

In [55]:
# 3

bp = np.linspace(-6, 6, 400)

In [58]:
# 4

fc = fold_change_bohr(bp)

p_bp = bokeh.plotting.figure(
    frame_width=400,
    frame_height=300,
    x_axis_label='Bohr parameter',
    y_axis_label='Fold change',
)

p_bp.line(
    x=bp,
    y=fc,
    color='gray',
    legend_label='theoretical',
)

p_bp.legend.location = 'top_left'

bokeh.io.show(p_bp)

f) Now, for each experimental curve:

1. Convert the IPTG concentration to a Bohr parameter using the given parameters.
1. Plot the experimental fold change versus the Bohr parameter you just calculated. Plot the data as dots on the same plot as the universal gray curve, making sure to appropriately annotate your plot.

In [41]:
# 1

bp_dict = {}

for genotype in RK_dict:
    # IPTG concentrations for this genotype
    c_iptg = df_lac.loc[df_lac['genotype'] == genotype, '[IPTG] (mM)']
    bp_dict[genotype] = bohr_parameter(c_iptg, RK_dict[genotype])

In [98]:
# 2

for i, val in enumerate(np.unique(df_lac["genotype"])):
    color = palette[i % len(palette)]
    p_bp.circle(
        x=bp_dict[val],
        y=df_lac.loc[df_lac["genotype"] == val]["fold change"],
        legend_label=str(val),
        color=color,
    )

p_bp.legend.location = "top_left"

bokeh.io.show(p_bp)

The data do seem to collapse onto a single curve, which is consistent with only operator binding (described by $R/K$) changing from mutant to mutant.

# Exercise 7.5: Monte Carlo simulation of transcriptional pausing

In this exercise, we will put random number generation to use and do a Monte Carlo simulation. The term Monte Carlo simulation is a broad term describing techniques in which a large number of random numbers are generated to (approximately) calculate properties of probability distributions. In many cases the analytical form of these distributions is not known, so Monte Carlo methods are a great way to learn about them.

We seek the probability distribution of backtrack times, $P(t_{bt})$, where $t_{bt}$ is the time spent in the backtrack. We could solve this analytically, which requires some sophisticated mathematics. But, because we know how to draw random numbers, we can just compute this distribution directly using Monte Carlo simulation!

We start at $x = 0$ at time $t = 0$. We “flip a coin,” or choose a random number to decide whether we step left or right. We do this again and again, keeping track of how many steps we take and what the $x$ position is. As soon as $x$ becomes positive, we have exited the backtrack. The total time for a backtrack is then $\tau n_{\textrm{steps}}$, where $\tau$ is the time it takes to make a step. Depken et al. report that $\tau \approx 0.5$ seconds.

a) Write a function, `backtrack_steps()`, that computes the number of steps it takes for a random walker (i.e., polymerase) starting at position $x=0$ to get to position $x=+1$. It should return the number of steps the walk takes.

In [265]:
def backtrack_steps():
    """Computes the number of steps it takes for a 1D random walker starting
    at position x = 0 to get to position x = +1.
    
    Returns the number of steps the walk takes."""
    
    rg = np.random.default_rng()
    
    # Track only the current position and the step count. Storing every step
    # and re-summing the whole array on each iteration makes long walks slow.
    x = 0
    n_steps = 0
    
    while x < 1:
        x += rg.choice([-1, 1])
        n_steps += 1
        
    return n_steps
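
For long walks, drawing one step per loop iteration is slow in pure Python. One possible alternative, sketched below and not the approach used above, draws steps in batches and uses a cumulative sum to find the first time the walker reaches $x = +1$. The name `backtrack_steps_chunked`, the `chunk_size` and `max_steps` parameters, and the choice to return `None` when the cap is reached are all assumptions made for illustration.

```python
import numpy as np


def backtrack_steps_chunked(rg, chunk_size=10_000, max_steps=10**7):
    """Number of steps for a 1D random walker to go from x = 0 to x = +1.

    Returns None if the walker has not exited after about max_steps steps
    (an illustrative cap, since the step count has a very heavy tail)."""
    x = 0
    n_steps = 0

    while n_steps < max_steps:
        # Draw a batch of +/-1 steps
        steps = 2 * rg.integers(0, 2, size=chunk_size) - 1

        # Positions after each step in this batch
        positions = x + np.cumsum(steps)

        # Indices in the batch at which the walker sits at x = +1
        hits = np.flatnonzero(positions == 1)
        if hits.size > 0:
            return n_steps + int(hits[0]) + 1

        # No exit yet: carry the final position into the next batch
        x = positions[-1]
        n_steps += chunk_size

    return None


rg = np.random.default_rng(3252)
samples = [backtrack_steps_chunked(rg) for _ in range(100)]
print(samples[:10])
```

Every returned step count should be odd, because reaching $x = +1$ from $x = 0$ requires one more step to the right than to the left.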

b) Generate 10,000 of these backtracks in order to get enough samples out of $P(t_{\textrm{bt}})$. Some of these samples may take a very long time to acquire. (If you are interested in a way to really speed up this calculation, ask me about Numba. If you do use Numba, note that you must use the standard Mersenne Twister RNG for Numba; that is using `np.random.....`)

In [266]:
# Note: this draws 1,000 samples, not the 10,000 the exercise asks for
num_samps = 1000

array_backtracks = np.empty(num_samps, dtype=int)

for i in range(num_samps):
    array_backtracks[i] = backtrack_steps()

c) Generate an ECDF of your samples and plot the ECDF with the $x$ axis on a logarithmic scale.

In [267]:
bt, ECDF = ecdf_vals(array_backtracks)

In [268]:
p = bokeh.plotting.figure(
    frame_height = 300,
    frame_width = 400,
    x_axis_type = 'log',
    x_axis_label = 'number of steps',
    y_axis_label = 'ECDF'
)

p.circle(
    bt,
    ECDF,
)

bokeh.io.show(p)

d) Complementary cumulative distribution (CCDF)

In [276]:
CCDF = 1 - ECDF
tbt = 0.5 * bt

p = bokeh.plotting.figure(
    frame_height = 300,
    x_axis_type = 'log',
    y_axis_type = 'log', 
    x_axis_label = 't_bt',
    y_axis_label = 'CCDF',
)

p.circle(
    tbt,
    CCDF,
)

bokeh.io.show(p)

e) By doing some mathematical heavy lifting, we know that, in the limit of large $t_{bt}$,

$$P(t_{bt}) \propto t_{bt}^{-3/2},$$

so the plot you did in part (d) should have a slope of $-1/2$ on a log-log plot. Is this what you see?
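
The exponent follows from integrating the density. As a one-line check, valid in the large-$t_{bt}$ limit where the power law holds,

$$\textrm{CCDF}(t_{bt}) = \int_{t_{bt}}^\infty P(t')\,\mathrm{d}t' \propto \int_{t_{bt}}^\infty t'^{-3/2}\,\mathrm{d}t' = 2\,t_{bt}^{-1/2},$$

so $\log \textrm{CCDF} = \textrm{constant} - \tfrac{1}{2}\log t_{bt}$, which is a straight line of slope $-1/2$ on log-log axes.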

In [277]:
# Guide line with slope -1/2: five decades in x, 2.5 decades in y
x = [0.5, 0.5 * 10**5]
y = [1, 10**-2.5]

p.line(
    x,
    y,
)

bokeh.io.show(p)
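
Beyond comparing by eye with a guide line, the tail slope can be estimated numerically. The sketch below shows one crude way to do so, a least-squares line through the log-log CCDF computed with `np.polyfit()`. It uses synthetic samples whose CCDF is exactly $t^{-1/2}$ as stand-in data, not the simulated backtrack times, so that the expected answer is known. A straight-line fit like this is only a rough estimator for power-law tails.

```python
import numpy as np

rg = np.random.default_rng(12345)

# Synthetic heavy-tailed samples with CCDF(t) = t^(-1/2) for t >= 1
# (stand-in data, not the simulated backtrack times)
t = np.sort(rg.pareto(0.5, size=20_000) + 1)

# Empirical CCDF at each sorted sample
ccdf = 1 - np.arange(1, len(t) + 1) / len(t)

# Drop the last point, where the CCDF is zero and its log is undefined
log_t = np.log10(t[:-1])
log_ccdf = np.log10(ccdf[:-1])

# Slope of the best-fit line on log-log axes; it should be near -1/2
slope, intercept = np.polyfit(log_t, log_ccdf, 1)
print(slope)
```

The same steps could be applied to `tbt` and `CCDF` from part (d), restricted to large $t_{bt}$ where the power law is expected to hold.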

In [210]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,colorcet,jupyterlab

CPython 3.7.7
IPython 7.13.0

numpy 1.18.1
pandas 0.24.2
bokeh 2.0.2
colorcet 2.0.2
jupyterlab 1.2.6
