In [1]:
import tfscreen
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

## tfscreen walkthrough

This notebook walks through the tfscreen simulation and analysis pipeline. It assumes that you have downloaded the tfscreen [repo](https://github.com/harmslab/tfscreen) and installed the library by running `pip install .` inside the base tfscreen directory.


### Simulation inputs

+ **run_config.yaml**: A text file that defines the simulation parameters. It has comments throughout that explain what the different parameters are.
+ **ddG.xlsx**: A spreadsheet with the effects of all mutations on each of the conformations in the ensemble selected in the run configuration. The path to this file (including its filename) is defined *in* run_config.yaml.
+ **calibration.json**: A JSON file that describes the linking function between fractional occupancy and growth rate under the conditions specified in run_config.yaml. The path to this file (including its filename) is defined *in* run_config.yaml.
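
To make the role of the per-conformation ddG values concrete, here is a generic thermodynamic sketch, not tfscreen's actual model. It assumes a hypothetical two-conformation ensemble (`bound` and `unbound`, with made-up free energies in kcal/mol) and shows how adding a mutation's ddG to one conformation shifts the Boltzmann populations. The function name `conformation_populations` and all numbers are illustrative assumptions.

```python
import math

RT = 0.593  # kcal/mol at roughly 298 K

def conformation_populations(dG, ddG=None):
    """Return Boltzmann populations for conformation free energies (kcal/mol).

    dG maps conformation name -> wildtype free energy.
    ddG maps conformation name -> mutational shift (missing names shift by 0).
    """
    ddG = ddG or {}
    weights = {name: math.exp(-(g + ddG.get(name, 0.0)) / RT)
               for name, g in dG.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical wildtype ensemble: the bound conformation is favored.
wt_dG = {"bound": 0.0, "unbound": 1.0}
wt_pop = conformation_populations(wt_dG)

# Hypothetical mutation that destabilizes only the bound conformation.
mut_pop = conformation_populations(wt_dG, ddG={"bound": 1.5})

print("wildtype:", wt_pop)
print("mutant:  ", mut_pop)
```

In this toy case the destabilizing mutation lowers the population of the bound conformation, which is the kind of effect the spreadsheet encodes for every mutation and conformation.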

### Run the simulation

Simulate transformation, growth, and sequencing. This yields six dataframes:
+ `growth_df`: This exactly matches the final output of a processed experiment and can be used as input in all downstream analyses. 
+ `sample_df`: The conditions of each multiplexed sample sent in for "sequencing".
+ `counts_df`: The counts for each genotype under each condition. This also has the true parameters used in the generating model (`dk_geno`, `theta`, and `ln_cfu0`) so we can check to see how well our model is extracting these parameters. 
+ `library_df`: The genotypes in the library and their true frequencies.
+ `phenotype_df`: The calculated phenotypes of each genotype.
+ `genotype_ddG_df`: The effect of mutations on the energy of each conformation in the ensemble. 
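
The simulation cell in this section sets `transformation_poisson_lambda` to 2.5. As a rough guide to what that number implies, the sketch below assumes the number of plasmids taken up by a cell follows a plain Poisson distribution with that mean. How tfscreen actually samples transformants is not shown here, so treat this as an illustration of the parameter's scale only.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.5  # same value as the override used in this notebook

p_zero = poisson_pmf(0, lam)  # cells that take up no plasmid
p_one = poisson_pmf(1, lam)   # cells that take up exactly one plasmid

# Among cells that took up at least one plasmid, what fraction carry
# exactly one versus more than one?
frac_single = p_one / (1.0 - p_zero)
frac_multi = 1.0 - frac_single

print(f"P(0 plasmids) = {p_zero:.3f}")
print(f"single among transformed = {frac_single:.3f}")
print(f"multiple among transformed = {frac_multi:.3f}")
```

Under this assumption most transformed cells carry more than one plasmid, which is why multi-transformation matters for the analysis.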

In [3]:

# Read the yaml file defining the library, thermodynamic, and simulation 
# parameters. The `override_keys` dictionary lets you override keys in the 
# .yaml file without having to edit the file every time. We're going to run
# the simulation with a multi-transformation lambda of 2.5.
override_keys = {"transformation_poisson_lambda":2.5,
                 "total_num_reads": 2e9}

cf = tfscreen.util.read_yaml("run_config.yaml",override_keys)

# Make a dataframe holding the library contents (library_df), a dataframe
# with phenotypes, and a dataframe of genotype ddG per conformation. 
library_df, phenotype_df, genotype_ddG_df = tfscreen.simulate.library_prediction(cf)

all_replicates = []
for i in range(2):
    print(f"Replicate {i+1}",flush=True)

    # Simulate a selection experiment. This generates a sample_df and counts_df
    # exactly equivalent to experimental outputs. 
    sample_df, counts_df = tfscreen.simulate.selection_experiment(cf,library_df,phenotype_df)
    
    # Build a growth_df from the sample and counts dataframe (just like an
    # experiment). This `growth_df` is the input to either an independent 
    # (maximum likelihood) analysis or a hierarchical (Bayesian) analysis. 
    growth_df = tfscreen.process_raw.counts_to_lncfu(sample_df,counts_df)

    growth_df["replicate"] = i + 1
    all_replicates.append(growth_df)

print("Assembling and writing total growth dataframe")
growth_df_combo = pd.concat(all_replicates,ignore_index=True)
growth_df_combo.to_csv("growth.csv",index=None)

print("Assembling and writing single mutant growth dataframe")
growth_df_singles = tfscreen.genetics.expand_genotype_columns(growth_df_combo)
growth_df_singles = growth_df_singles[growth_df_singles["num_muts"] < 2]
growth_df_singles = growth_df_singles.drop(columns=["wt_aa_1","wt_aa_2","wt_aa_3",
                                                    "mut_aa_1","mut_aa_2","mut_aa_3",
                                                    "resid_1","resid_2","resid_3"])
growth_df_singles.to_csv("growth_singles.csv",index=None)

print("Writing phenotype dataframe")
phenotype_df.to_csv("phenotype.csv",index=None)

print("Creating and writing binding dataframe")
binding = phenotype_df[phenotype_df["genotype"].isin(["wt","M42I","H74A","K84L"])]
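# .copy() makes this an independent dataframe, so adding a column below does
# not trigger a pandas SettingWithCopyWarning.
binding = binding[["genotype","titrant_name","titrant_conc","theta"]].copy()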
binding["theta_std"] = 0.02
binding.columns = ["genotype","titrant_name","titrant_conc","theta_obs","theta_std"]
binding.to_csv("binding.csv",index=None)



Initializing phenotype calculation... Done.


calculating theta using thermo model:   0%|          | 0/232214 [00:00<?, ?it/s]



Calculating growth rates and building final dataframe... Done.
Replicate 1
Setting up calculation.
Simulating growth and sequencing


replicate/library:   0%|          | 0/2 [00:00<?, ?it/s]

--> simulating transformation
--> simulating growth
--> simulating sequencing
--> simulating index hopping
--> simulating transformation
--> simulating growth
--> simulating sequencing
--> simulating index hopping
Generating final dataframe.
Simulation complete.
Replicate 2
Setting up calculation.
Simulating growth and sequencing


replicate/library:   0%|          | 0/2 [00:00<?, ?it/s]

--> simulating transformation
--> simulating growth
--> simulating sequencing
--> simulating index hopping
--> simulating transformation
--> simulating growth
--> simulating sequencing
--> simulating index hopping
Generating final dataframe.
Simulation complete.
Assembling and writing total growth dataframe
Assembling and writing single mutant growth dataframe
Writing phenotype dataframe
Creating and writing binding dataframe


#### Run the fit

This does a maximum-likelihood fit on each genotype individually. The outputs are:
+ `param_df`: holds all fit parameters
+ `pred_df`: holds all predicted ln_cfu values at all conditions
+ `results`: a dictionary that holds dataframes organized by parameter type (`theta`, `dk_geno`, `ln_cfu0`, and `pred`). 
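
As a simplified picture of what a per-genotype fit involves, the sketch below fits a straight line to ln(CFU) versus time for one genotype in one condition. It assumes exponential growth, so that `ln_cfu = ln_cfu0 + k * t`, and Gaussian noise, in which case least squares coincides with maximum likelihood. The helper `fit_line` and the data are hypothetical; this is not the implementation of `cfu_to_theta`, which also has to map growth rates onto theta using the calibration.

```python
def fit_line(t, y):
    """Ordinary least-squares fit of y = intercept + slope * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return slope, intercept

# Hypothetical, noise-free data for a single genotype and condition.
time_min = [0.0, 30.0, 60.0, 90.0]
true_k = 0.02        # growth rate (per minute)
true_ln_cfu0 = 10.0  # ln(CFU) at t = 0
ln_cfu = [true_ln_cfu0 + true_k * t for t in time_min]

k_est, ln_cfu0_est = fit_line(time_min, ln_cfu)
print(f"k_est = {k_est:.4f}, ln_cfu0_est = {ln_cfu0_est:.4f}")
```

With noise-free input the fit recovers the generating values; with real or simulated counts the estimates scatter around them, which is what the comparison in the next section quantifies.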

In [None]:
# Fit for theta. We use the same calibration file we used to simulate the 
# data, so whatever differences we see are because of the experiment and/or
# analysis, not an incorrect growth model. 
param_df, pred_df = tfscreen.analysis.independent.cfu_to_theta(growth_df,
                                                               calibration_data=cf["calibration_file"],
                                                               non_sel_conditions=["kanR-kan","pheS-4CP"])

# Helper function generates clean versions of output. Because our counts_df 
# has the known, real answers, these are also loaded into the results. 
results = tfscreen.analysis.independent.process_theta_fit(param_df,
                                                          pred_df,
                                                          counts_df,
                                                          sample_df)

print("dataframes generated:",list(results.keys()))
results["theta"]

#### Compare estimated to real parameter values.

Since we simulated the dataset, we know the underlying growth parameters. We can thus ask how good we are at extracting these known values. The following code generates plots to help us analyze the results and spits out some statistics on the fit quality. Some things to look for:

1. **Correlation**: Is there a strong correlation between the estimated and real parameter values? This can be evaluated in the plot on the left, as well as in the $R^2$ value in the stats output. 
2. **RMSE**: How wrong are we, on average? This is in the stats output. 
3. **Calibration**: Do our parameter uncertainty values capture the true error in our estimates? If yes, the red line on the right plot should closely match the gray histogram, and the test statistic "coverage probability" should be close to 0.95. This statistic measures the fraction of parameters whose real value lands within the inferred 95% confidence interval of its estimate. If the value is *lower* than 0.95, our estimated uncertainties are too small; if the value is *greater* than 0.95, our estimated uncertainties are too large. 
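
The RMSE and coverage probability described above can be sketched in a few lines. This is a minimal illustration on made-up numbers and assumes a normal approximation, where the 95% interval is `est ± 1.96 * std`. Whether `tfscreen.fitting.stats_test_suite` computes its statistics exactly this way is an assumption; the helper names `rmse` and `coverage_probability` are hypothetical.

```python
import math

def rmse(est, real):
    """Root-mean-square error between estimated and real values."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, real)) / len(est))

def coverage_probability(est, std, real, z=1.96):
    """Fraction of real values that fall within est +/- z * std."""
    hits = sum(abs(e - r) <= z * s for e, s, r in zip(est, std, real))
    return hits / len(est)

# Made-up example: four parameters with estimates and standard errors.
real = [0.10, 0.50, 0.90, 0.30]
est = [0.12, 0.45, 0.91, 0.40]
std = [0.02, 0.03, 0.02, 0.03]

print("RMSE:", rmse(est, real))
print("coverage:", coverage_probability(est, std, real))
```

Here three of the four real values fall inside their intervals, so the coverage is below 0.95. On a large dataset, a result like that would indicate that the reported uncertainties are too small.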

In [None]:

params = ["theta","ln_cfu0","dk_geno"]

for p in params:

    result_df = results[p]
    fig, ax = tfscreen.plot.est_v_real_summary(result_df[f"{p}_est"],
                                               result_df[f"{p}_std"],
                                               result_df[f"{p}_real"],
                                               axis_prefix=p)
    
    ax = tfscreen.plot.err_vs_mag(result_df[f"{p}_real"],
                                  result_df[f"{p}_est"],axis_name=p) 

    stats = tfscreen.fitting.stats_test_suite(result_df[f"{p}_est"],
                                              result_df[f"{p}_std"],
                                              result_df[f"{p}_real"])
    print(f"Fit statistics for {p}:")
    for s in stats:
        print(f"{s}: {stats[s]}")
    print()

In [None]:

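def plot_single_genotype(genotype,theta_df):
    """
    Plot estimated and real theta versus titrant concentration for one genotype.
    """

    this_output_df = theta_df.loc[theta_df["genotype"] == genotype,:].copy()

    # Nudge zero concentrations to a small positive value so they can be
    # drawn on the log-scale x-axis.
    this_output_df.loc[this_output_df["titrant_conc"] == 0,"titrant_conc"] = 1e-6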
    
    fig, ax = plt.subplots(1,figsize=(6,6))
    
    ax.errorbar(this_output_df["titrant_conc"],
                this_output_df["theta_est"],
                yerr=this_output_df["theta_std"],
                lw=0,
                elinewidth=1,
                capsize=5,
                color='firebrick')
    ax.scatter(this_output_df["titrant_conc"],
               this_output_df["theta_est"],
               s=50,
               edgecolor='firebrick',
               facecolor="none",
               label="fit")
    ax.scatter(this_output_df["titrant_conc"],
               this_output_df["theta_real"],
               s=50,
               color='royalblue',
               label="ground truth")
    
    ax.legend()

    fig.suptitle(genotype)
    ax.set_xscale("log")
    ax.set_ylim(-0.1,1.1)
    ax.set_xlabel("iptg")
    ax.set_ylabel("fractional operator occupancy")
    fig.tight_layout()
    
    return fig, ax


theta_df = results["theta"]


_ = plot_single_genotype("wt",theta_df)
_ = plot_single_genotype("H29A",theta_df)
_ = plot_single_genotype("L62G",theta_df)
_ = plot_single_genotype("L63Y",theta_df)