### Figure 2B - Comparison of genes mapping to E. coli and B. sub for MTC-encapsulated transcripts and lysate:

Here we compare encapsulated transcripts with general lysate transcripts for an experiment in which MTC-containing E. coli were mixed ~50:50 with non-MTC-containing B. sub. If the MTCs are effective at protecting encapsulated transcripts from contamination by general RNAs in the surrounding lysate, then the encapsulated transcript sample should map entirely to the E. coli genome.

Note - for all these samples, we have discarded all reads that do not map uniquely to either the E. coli genome or the B. sub genome. This was done at the alignment step; see the raw data processing scripts for this experiment for details.

Note - this notebook uses bokeh's export_svg functionality to save an SVG of each figure for inclusion in Adobe Illustrator. Each figure is also rendered as a preview in the Jupyter notebook. If SVG export is a problem for you, simply skip the cells that save the image to SVG; you should still be able to preview the interactive figures.

In [1]:
import pandas as pd
import os
from bokeh.io import push_notebook, show, output_notebook, export_svg
from bokeh.plotting import figure
from bokeh.layouts import column
import datetime

output_notebook()

First we will load the data from the appropriate pre-processed data folder.  We will load all of the data sets from this experiment:
- B_sub_Lys = the transcripts taken from a lysate sample of B. sub cells alone (before mixing). 
- E_coli_Lys =  the transcripts taken from a lysate sample of MTC-containing E. coli cells alone (before mixing). 
- Mix_Lys =  the transcripts taken from the mixed sample (~ 50:50 E. coli to B. sub).
- E_coli_Cap = the transcripts taken from the purified MTCs, extracted from the sample of MTC-containing E. coli cells alone (before mixing).
- Mix_Cap = the transcripts taken from the purified MTCs, extracted from the mixed sample (~ 50:50 E. coli to B. sub).

In [2]:
def normalization(gene_counts, name):
    '''Convert gene_counts into RPM after removing rows named error, Capsule, Capsule_rev and lacI.

    Note: no minimum read-count filter is applied in this function.'''
    sample_df = gene_counts.copy()
    sample_df.drop(sample_df[sample_df['Name'] == 'error'].index, inplace = True) # currently dropping any gene without reads in both
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule'].index, inplace = True) # currently dropping capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule_rev'].index, inplace = True) # currently dropping capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'lacI'].index, inplace = True) # currently dropping lacI from plot
    sample_df = sample_df.rename(columns = {'Counts':name})
    sample_df[name] = sample_df[name]/(sum(sample_df[name])/1000000) # RPM: reads per million retained reads
    
    return sample_df


df_dir = "../../Processed Sequencing Files/230712LiA_mixing/"
df_dict = dict()
names = []
# Fetch all the relevant dataframe files and put them in a useable format:
for file in os.listdir(df_dir):
    if file.endswith('_dataframe.txt'):
        name = file[:file.find('_D')]
        new_df = pd.read_csv(os.path.join(df_dir, file))
        new_df['Length'] = new_df['Stop'] - new_df['Start']
        df_dict[name] = normalization(new_df, name)
        names.append(name)
print(f'\nHere is the list of unique samples for which there are dataframe files: \n\n{names}\n')


Here is the list of unique samples for which there are dataframe files: 

['B_sub_Lys', 'E_coli_Cap', 'E_coli_Lys', 'Mix_Cap', 'Mix_Lys']
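
To make the normalization step concrete, here is a minimal sketch of the same RPM calculation on a hypothetical toy table. The gene names and counts are invented for illustration and are not taken from this experiment; the sketch only assumes the `Name` and `Counts` columns used by `normalization` above.

```python
import pandas as pd

# Hypothetical toy counts (NOT real data), using the same column names as the loader.
toy = pd.DataFrame({
    "Name": ["geneA", "geneB", "Capsule", "geneC"],
    "Counts": [100, 300, 5000, 600],
})

# Remove the excluded rows first, so they do not contribute to the total.
excluded = ["error", "Capsule", "Capsule_rev", "lacI"]
kept = toy[~toy["Name"].isin(excluded)].copy()

# RPM = counts / (total retained counts / 1,000,000)
kept["RPM"] = kept["Counts"] / (kept["Counts"].sum() / 1_000_000)

print(kept)
```

Because the excluded rows are dropped before dividing, the RPM values of the retained genes sum to one million regardless of how many capsule reads were present.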



In [3]:
def setAppearance(p):
    """Reconfigure the appearance of plots"""
    p.xaxis.visible = False
    p.yaxis.visible = False
    p.xaxis.ticker = [10000, 20000, 30000]
    p.yaxis.ticker = [10000, 20000, 30000]
    p.toolbar_location=None

    
    p.axis.major_label_text_font = "arial"
    p.axis.major_label_text_font_style = "normal"
    p.axis.major_label_text_font_size = "20px"
    p.axis.axis_label_text_font = "arial"
    p.axis.axis_label_text_font_style = "normal"
    p.axis.axis_label_text_font_size = "20px"
    p.legend.location = (35, 160)
    p.legend.label_text_font_size = '18pt'
    p.legend.glyph_width = 30
    p.legend.glyph_height = 30
    p.legend.label_text_font_style = "italic"
    
    p.grid.visible = False
    
    return p

In [4]:
x = 'B_sub_Lys' # B. sub lysate
y1 = 'Mix_Lys' # mixture - lysate
y2 = 'Mix_Cap' # mixture - capsule

y1_df = df_dict[y1]
y2_df = df_dict[y2]
x_df = df_dict[x]

plot1_df = x_df.merge(y1_df)
plot2_df = x_df.merge(y2_df)


# Plot one: B_sub Lys vs Mix Lys - breakdown of E coli vs B sub mapping genes. 
p1 = figure(
    aspect_scale = 1, width = 260, height = 260,
    output_backend = "svg", tooltips = [("Gene", "@Name"), ("Mix Lys", "@Mix_Lys"), ("B. sub Lys", "@B_sub_Lys")]
)
p1.circle(x=x, y=y1, source=plot1_df.loc[plot1_df['Genome'] == "NC_000964.3"],
          size=10, fill_alpha=0.8, line_alpha=0, color="coral", legend_label = "B. subtilis")
p1.circle(x=x, y=y1, source = plot1_df.loc[plot1_df['Genome'] == "NC_000913.2"],
          size=10, fill_alpha=0.8, line_alpha=0, fill_color="darkseagreen",
        legend_label = "E. coli")

# Plot two: B_sub Lys vs Mix Cap - breakdown of E coli vs B sub mapping genes. 
p2 = figure(
    aspect_scale = 1, width = 260, height = 260,
    output_backend = "svg", tooltips = [("Gene", "@Name"), ("Mix Cap", "@Mix_Cap"), ("B. sub Lys", "@B_sub_Lys")]
)
p2.circle(x=x, y=y2, source=plot2_df.loc[plot2_df['Genome'] == "NC_000964.3"],
          size=10, fill_alpha=0.8, line_alpha=0, color="coral", legend_label = "B. subtilis")
p2.circle(x=x, y=y2, source=plot2_df.loc[plot2_df['Genome'] == "NC_000913.2"],
          size=10, fill_alpha=0.8, line_alpha=0, fill_color="darkseagreen",
           legend_label = "E. coli")

p1 = setAppearance(p1)
p2 = setAppearance(p2)

p3 = column(p1, p2)


show(p3)

In [5]:

export_svg(p3, filename = f'./2C_{datetime.date.today()}.svg')

['./2C_2023-09-12.svg']

Now we will calculate the percentage of reads uniquely mapping to genes in each species, for both the Mix_Lys sample and the Mix_Cap sample.

In [6]:
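def calcPercentageSpecies(df, name):
    """Print the percentage of the RPM signal in column `name` mapping to each genome."""
    # Genome accessions: NC_000964.3 = B. subtilis, NC_000913.2 = E. coli
    b_sub = sum(df[df["Genome"] == "NC_000964.3"][name])
    e_coli = sum(df[df["Genome"] == "NC_000913.2"][name])
    
    print(f'Percent Uniquely mapping to B. sub: {100*b_sub/(b_sub + e_coli)}, percent E coli: {100*e_coli/(b_sub + e_coli)}')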

calcPercentageSpecies(df_dict['Mix_Lys'], 'Mix_Lys')
calcPercentageSpecies(df_dict['Mix_Cap'], 'Mix_Cap')

Percent Uniquely mapping to B. sub: 46.89605985271221, percent E coli: 53.10394014728779
Percent Uniquely mapping to B. sub: 0.026967645695902438, percent E coli: 99.97303235430411
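
One possible way to summarize the two lines above is the fold reduction in the B. sub share between the mixed lysate and the mixed capsule sample. This is only a sketch: the variable names are hypothetical, and the two percentages are copied directly from the printed output above rather than recomputed from the dataframes.

```python
# Percentages copied from the printed output above (B. sub share in each sample).
b_sub_pct_mix_lys = 46.89605985271221
b_sub_pct_mix_cap = 0.026967645695902438

# Ratio of the B. sub share in the lysate to the B. sub share in the capsule sample.
fold_reduction = b_sub_pct_mix_lys / b_sub_pct_mix_cap

print(f'Fold reduction in B. sub share (Mix_Lys -> Mix_Cap): {fold_reduction:.0f}')
```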
