### Extended Figure 2 - Deeper Look at Reproducibility 

In this notebook we plot gene-by-gene comparisons between all three MTC-encapsulated biological replicates, along with the corresponding Lysate comparisons.

Note: this notebook uses bokeh's `export_svg` function to save each figure as an SVG for inclusion in Adobe Illustrator. Each figure is also rendered as an interactive preview in the Jupyter notebook. If SVG export is a problem for you, skip the export step (set `export_to_svg = False` in the plotting cell) and you should still be able to preview the interactive figures.
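As a minimal sketch of how the export step could be made optional automatically, the check below tests whether the `selenium` package is importable, since bokeh's `export_svg` depends on it. The helper name `svg_export_available` is hypothetical and not part of this notebook or of bokeh. The check is also deliberately incomplete: bokeh additionally needs a supported browser and driver, which this sketch does not verify.

```python
import importlib.util


def svg_export_available() -> bool:
    """Return True if the selenium package can be imported.

    Hypothetical helper: this only checks for selenium itself, not for
    the browser and driver that bokeh also needs to render an SVG.
    """
    return importlib.util.find_spec("selenium") is not None


# Mirrors the export_to_svg flag used in the plotting cell.
export_to_svg = svg_export_available()
print(f"SVG export enabled: {export_to_svg}")
```

Setting the flag by hand, as the plotting cell does, remains the simpler option when you already know your environment.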

In [1]:
import pandas as pd
import os
from bokeh.io import push_notebook, show, output_notebook, export_svg, curdoc
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper, ColorBar
import datetime
from scipy.stats import gaussian_kde
import numpy as np
from sklearn.metrics import r2_score
from bokeh.themes import Theme
curdoc().theme = Theme(filename="../../figure_theme.yaml")

output_notebook()

First we fetch the reads for this experiment. We load every sample in the experiment folder; each figure below then compares two samples at a time.

In [2]:
def normalization(gene_counts, name):
    '''Convert gene_counts into RPM after removing capsule reads and discarding genes with <= 100 reads.

    The 'Counts' column is renamed to `name` (the sample name) and rescaled to reads per million.
    '''
    sample_df = gene_counts.copy()
    sample_df.drop(sample_df[sample_df['Name'] == 'error'].index, inplace = True) # dropping rows named 'error'
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule'].index, inplace = True) # dropping capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule_rev'].index, inplace = True) # dropping reverse capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'lacI'].index, inplace = True) # dropping lacI from plot
    sample_df.drop(sample_df[sample_df['Counts'] <= 100].index, inplace = True) # dropping genes with 100 or fewer reads
    sample_df = sample_df.rename(columns = {'Counts': name})
    sample_df[name] = sample_df[name] / (sample_df[name].sum() / 1000000) # RPM, relative to the retained reads only

    return sample_df


df_dir = "../../Processed Sequencing Files/230724Li/"
df_dict = dict()
names = []
# Fetch all the relevant dataframe files and put them in a usable format:
for file in os.listdir(df_dir):
    if file.endswith('_dataframe.txt'):
        name = file[:file.find('_D')]  # sample name is everything before the first '_D'
        new_df = pd.read_csv(os.path.join(df_dir, file))
        new_df['Length'] = new_df['Stop'] - new_df['Start']
        df_dict[name] = normalization(new_df, name)
        names.append(name)
print(f'\nHere is the list of unique samples for which there are dataframe files: \n\n{names}\n')


Here is the list of unique samples for which there are dataframe files: 

['Lysate_Replicate_1', 'Lysate_Replicate_2', 'Lysate_Replicate_3', 'MTC_Replicate_1', 'MTC_Replicate_2', 'MTC_Replicate_3']
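To make the normalization step concrete, here is a small sketch on made-up counts. The gene names and numbers are invented for illustration and are not taken from the sequencing files. It follows the same order of operations as `normalization`: filter first, then scale by the total of the retained reads.

```python
import pandas as pd

# Invented toy counts; 'Capsule' stands in for the excluded features.
toy = pd.DataFrame({
    "Name": ["geneA", "geneB", "geneC", "Capsule"],
    "Counts": [300, 700, 50, 5000],
})

# Drop excluded features and genes with 100 or fewer reads.
kept = toy[(toy["Name"] != "Capsule") & (toy["Counts"] > 100)].copy()

# RPM = counts / (total retained counts / 1e6)
kept["RPM"] = kept["Counts"] / (kept["Counts"].sum() / 1_000_000)
print(kept)
```

Because filtering happens before scaling, the RPM values of the retained genes sum to one million, and the denominator excludes capsule and low-count reads. The same gene can therefore be scaled against a different total in each sample.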



#### Main Figure:
First we generate the main figure, which is simply a scatter plot comparing gene-by-gene RPM for a pair of replicates; the loop below repeats this for every MTC and Lysate pair. For each pair we also generate the inset, a histogram of the log-fold changes in RPM between the two samples, which was overlaid on the scatter plot to create the final published image. Finally, we calculate and print the Pearson coefficient (R) and the standard deviation of the log-fold changes.

In [5]:
pairs = [
    ('MTC_Replicate_1', 'MTC_Replicate_2'),
    ('MTC_Replicate_1', 'MTC_Replicate_3'), 
    ('MTC_Replicate_2', 'MTC_Replicate_3'),
    ('Lysate_Replicate_1', 'Lysate_Replicate_2'),
    ('Lysate_Replicate_1', 'Lysate_Replicate_3'),
    ('Lysate_Replicate_2', 'Lysate_Replicate_3'),
]

export_to_svg = True  # Set this to False if you don't have the libraries installed for exporting to svg.

for x, y in pairs:

    x_df = df_dict[x].copy()
    y_df = df_dict[y].copy()
    plot_df = x_df.merge(y_df)

    p = figure(
        y_axis_type = "log", x_axis_type = "log",
        aspect_scale = 1, width = 400, height = 400,
        output_backend = "svg", tooltips = [("Gene", "@Name")],
       # y_range = (10**1.2, 10**4.5), x_range = (10**1.2, 10**4.5)
    )

    p.xaxis.axis_label = f"mRNA level in {x.replace('_', ' ').replace('Replicate', 'Rep')} (RPM)"
    p.yaxis.axis_label = f"mRNA level in {y.replace('_', ' ').replace('Replicate', 'Rep')} (RPM)"
    p.axis.ticker = [10**0, 10**2, 10**4, 10**6]


    p.circle(x=x,
             y=y,
             source=plot_df,
             size=5,
             fill_alpha=0.8,
             line_alpha=0,
            )

    show(p)
    
    plot_df['Log_Dif'] = np.log10(plot_df[x]) - np.log10(plot_df[y])
    p_ins = figure(width = 170, height = 120,
          output_backend = "svg")

    # Histogram
    bins = np.linspace(-1, 1, 10)
    plt_hist, plt_edges = np.histogram(plot_df['Log_Dif'], density=False, bins=bins)
    p_ins.quad(top = plt_hist, bottom = 0, left = plt_edges[:-1], right = plt_edges[1:],
             fill_color="steelblue", line_color="white", fill_alpha = 1,
             )

    p_ins.yaxis.visible = False
    p_ins.yaxis.ticker = [1000, 2000]
    p_ins.xaxis.ticker = [-1, -.5, 0, .5, 1]
    p_ins.xaxis.axis_label ="log-10 ratio"

    p_ins.background_fill_color = None
    p_ins.grid.grid_line_color = None
    p_ins.grid.grid_line_width = 4
    p_ins.outline_line_color = None

    show(p_ins)
    
    r = np.corrcoef(plot_df[x], plot_df[y])[0,1]
    print(f"For {x} vs {y}, pearson Coefficient: {r}")
    print(f"For {x} vs {y}, stdev of log fold change dist: {np.std(plot_df['Log_Dif'])}")
    
    if export_to_svg: 
        p.axis.major_label_text_font = "arial"
        p.axis.major_label_text_font_size = "22px"
        p.axis.axis_label_text_font_style = "normal"
        p.axis.axis_label_text_font = "arial"
        p.axis.axis_label_text_font_size = "24px"

        p.background_fill_color = "white"
        p.grid.grid_line_color = None
        p.title.text_color= '#000000'
        p.axis.major_label_text_color = '#000000'
        p.axis.axis_label_text_color = '#000000'
        
        p_ins.axis.axis_label_text_font_style = "normal"
        p_ins.axis.major_label_text_font = "arial"
        p_ins.axis.major_label_text_font_size = "18px"
        p_ins.axis.axis_label_text_font = "arial"
        p_ins.axis.axis_label_text_font_size = "18px"
        p_ins.axis.major_label_text_color = '#000000'
        p_ins.axis.axis_label_text_color = '#000000'

        export_svg(p, filename = f'./Extended_Figure_3_Main_{x}_vs_{y}_{datetime.date.today()}.svg')
        export_svg(p_ins, filename = f'./Extended_Figure_3_Insert_{x}_vs_{y}_{datetime.date.today()}.svg')

For MTC_Replicate_1 vs MTC_Replicate_2, pearson Coefficient: 0.9927904221104147
For MTC_Replicate_1 vs MTC_Replicate_2, stdev of log fold change dist: 0.067455811785603


For MTC_Replicate_1 vs MTC_Replicate_3, pearson Coefficient: 0.9706138078503127
For MTC_Replicate_1 vs MTC_Replicate_3, stdev of log fold change dist: 0.12344977012408859


For MTC_Replicate_2 vs MTC_Replicate_3, pearson Coefficient: 0.9827738846151849
For MTC_Replicate_2 vs MTC_Replicate_3, stdev of log fold change dist: 0.10232880398864923


For Lysate_Replicate_1 vs Lysate_Replicate_2, pearson Coefficient: 0.9956559141972862
For Lysate_Replicate_1 vs Lysate_Replicate_2, stdev of log fold change dist: 0.05893576063138566


For Lysate_Replicate_1 vs Lysate_Replicate_3, pearson Coefficient: 0.9944416027508375
For Lysate_Replicate_1 vs Lysate_Replicate_3, stdev of log fold change dist: 0.0676073266996318


For Lysate_Replicate_2 vs Lysate_Replicate_3, pearson Coefficient: 0.9984016873420233
For Lysate_Replicate_2 vs Lysate_Replicate_3, stdev of log fold change dist: 0.04586713782505206
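
The two statistics printed above, and the histogram binning used for the inset, can be reproduced on synthetic data. The sketch below is illustrative only: the values are randomly generated and the noise level of 0.1 log10 units is an arbitrary assumption, not an estimate from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "replicates": y is x perturbed by log-normal noise.
x = 10 ** rng.uniform(1, 4, size=500)
y = x * 10 ** rng.normal(0, 0.1, size=500)

# Pearson R on the linear RPM scale, as in the plotting cell.
r = np.corrcoef(x, y)[0, 1]

# Per-gene log10 ratio and its spread (np.std defaults to ddof=0).
log_dif = np.log10(x) - np.log10(y)
spread = np.std(log_dif)

# 10 evenly spaced edges give 9 bins; values outside [-1, 1] are not counted.
bins = np.linspace(-1, 1, 10)
counts, edges = np.histogram(log_dif, bins=bins)

print(f"Pearson R: {r:.4f}, stdev of log10 ratio: {spread:.4f}")
print(f"{len(counts)} bins holding {counts.sum()} of {len(log_dif)} genes")
```

Note that R computed on linear-scale values is weighted toward the most highly expressed genes, whereas the standard deviation of the log10 ratio treats every gene equally. That is one reason to report both.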
