## Figure 1C - Comparison of RNA Content between MTCs and Total Lysate

Here we plot the transcript-by-transcript RPM (Reads Per Million) of all genes with > 0 reads in both the MTC-protected sample and the total lysate sample.

### Main Figure - Transcript by Transcript Density Plot:

In [3]:
import pandas as pd
import os
from bokeh.io import push_notebook, show, output_notebook, export_svg
output_notebook()
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper, ColorBar
import datetime
import numpy as np
from scipy.stats import gaussian_kde

First we load our data. We will load all the datasets associated with the experiment, though we will only be plotting the 'E_coli_Lys' and 'E_coli_Cap' data here:

In [4]:
#df_dir = '../../Processed_Data_Files/dataframe_files/230117Li/'
#prefix = '230117Li_Mixing_'
df_dir = '../../Processed_Data_Files/dataframe_files/230712LiA/mixing/'
prefix = ''
df_dict = dict()
names = []
# Fetch all the relevant dataframe files and put them in a useable format:
for file in os.listdir(df_dir):
    if file.endswith('_dataframe.txt') and file.startswith(prefix):
        name = file[file.find(prefix)+len(prefix):]
        name = name[:name.find('_D')]
        new_df = pd.read_csv(os.path.join(df_dir, file))
        new_df['Length'] = new_df['Stop'] - new_df['Start']
        df_dict[name] = new_df
        names.append(name)
print(f'\nHere is the list of unique samples for which there are dataframe files: \n\n{names}\n')

# A name dictionary for converting between the dataframe name and a form nicer for plots:
names_dict = {'E_coli_Cap':'E. coli Capsule', 'Mix_Cap':'Mixture Capsule', 'E_coli_Lys':'E. coli Lysate',
              'Mix_Lys':'Mixture Lysate', 'B_sub_Lys':'B. subtilis Lysate'}


Here is the list of unique samples for which there are dataframe files: 

['B_sub_Lys', 'E_coli_Cap', 'E_coli_Lys', 'Mix_Cap', 'Mix_Lys']
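
The name extraction used in the loading loop can be checked in isolation. The following is a minimal sketch only: the file names are hypothetical (the actual file names in `df_dir` are not shown in this notebook), and `extract_sample_name` is an illustrative helper rather than part of the analysis. Unlike the loop above, it also guards against file names that do not contain `_D`.

```python
def extract_sample_name(file, prefix=''):
    """Strip the prefix, then cut the name at the first '_D' (as in the loading loop)."""
    name = file[file.find(prefix) + len(prefix):]
    cut = name.find('_D')
    # str.find returns -1 when '_D' is absent; slicing with -1 would silently
    # drop the last character, so return the remainder unchanged in that case.
    return name if cut == -1 else name[:cut]

# Hypothetical file names, for illustration only
print(extract_sample_name('E_coli_Cap_D1_dataframe.txt'))
print(extract_sample_name('230117Li_Mixing_Mix_Lys_D2_dataframe.txt',
                          prefix='230117Li_Mixing_'))
```

With these hypothetical inputs the helper returns `E_coli_Cap` and `Mix_Lys`, matching the style of the sample names listed above.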



Next we have the helper function to convert read counts into RPM:

In [5]:
def into_RPM_theshold_remove_cap(df_dict, name):
    '''This helper function gets the dataframe for a particular experiment and converts its reads into RPM'''
    sample_df = df_dict[name].copy()
    num_capsule = np.asarray(sample_df[sample_df['Name'] == 'Capsule']['Counts'])[0]
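    # Reference gene counts: the lookup uses gyrA, although the variable name and the last message below say cysG
    num_cysG = np.asarray(sample_df[sample_df['Name'] == 'gyrA']['Counts'])[0]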
    print(f'Percent of reads in the capsule for sample {name}: {num_capsule/sum(sample_df["Counts"])}')
    print(f'Percent of reads in gyrA for sample {name}: {num_cysG/sum(sample_df["Counts"])}')
    print(f'Normalized value of Capsule (normalized with cysG: {num_capsule/num_cysG}')
    sample_df.drop(sample_df[sample_df['Name'] == 'error'].index, inplace = True) # currently dropping any gene without reads in both
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule'].index, inplace = True) # currently dropping capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'Capsule_rev'].index, inplace = True) # currently dropping capsule reads from plot
    sample_df.drop(sample_df[sample_df['Name'] == 'lacI'].index, inplace = True) # currently dropping lacI from plot
    sample_df.drop(sample_df[sample_df['Counts'] <= 100].index, inplace = True)
    sample_df = sample_df.rename(columns = {'Counts':name})
    sample_df[name] = sample_df[name]/(sum(sample_df[name])/1000000) # RPM
    
    return sample_df
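
To make the normalization concrete, here is a minimal sketch of the same filter-then-normalize order on a toy table. The gene names and counts are made up for illustration and are not taken from the experiment.

```python
import pandas as pd

# Toy counts (made-up numbers, not from the experiment)
toy = pd.DataFrame({'Name': ['geneA', 'geneB', 'geneC', 'geneD'],
                    'Counts': [50, 1000, 3000, 6000]})

# Same order of operations as the helper: drop low-count genes first,
# then divide by the total of the remaining counts (in millions)
kept = toy[toy['Counts'] > 100].copy()
kept['RPM'] = kept['Counts'] / (kept['Counts'].sum() / 1_000_000)

print(kept)
```

Because the threshold is applied before dividing by the total, the RPM values sum to one million over the retained genes only, not over all reads in the library.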

Now we will take the data we need in order to make our comparison between the MTC-protected RNA and the general lysate RNA. We also include steps which affect the plot appearance. 

**Note** This notebook will output both an interactive plot below and an SVG image of the plot. In order for the SVG export to work, you need to have selenium and firefox/geckodriver installed via conda. Feel free to comment out these parts of the code if all you want is the interactive plot.
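
As an alternative to commenting code out by hand, the export could be made conditional on the dependencies being present. This is only a sketch: `svg_export_available` and `EXPORT_SVG` are hypothetical names, and the check covers only the selenium/geckodriver route mentioned in the note (other browser drivers are not considered here).

```python
import importlib.util
import shutil

def svg_export_available():
    """Return True if the selenium package and a geckodriver executable can both be found."""
    has_selenium = importlib.util.find_spec('selenium') is not None
    has_geckodriver = shutil.which('geckodriver') is not None
    return has_selenium and has_geckodriver

EXPORT_SVG = svg_export_available()
print(f'SVG export enabled: {EXPORT_SVG}')
```

Each `export_svg(...)` call could then be wrapped in `if EXPORT_SVG:` so the interactive plot still renders when the export dependencies are missing.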

In [6]:
x = 'B_sub_Lys' # Lysate sample (x-axis). For the E. coli lysate vs. MTC comparison described above, use 'E_coli_Lys'.
y = 'E_coli_Cap' # MTC-protected (capsule) RNA sample (y-axis)

x_df = into_RPM_theshold_remove_cap(df_dict, x)
y_df = into_RPM_theshold_remove_cap(df_dict, y)
plot_df = x_df.merge(y_df)
plot_df['Log_Dif'] = np.log10(plot_df[x]) - np.log10(plot_df[y])

#Now we will make a density plot based on the log RPM of each dataset:
xy = np.vstack([np.log(plot_df[x]), np.log(plot_df[y])])
plot_df['Density'] = gaussian_kde(xy)(xy)
color_mapper = LinearColorMapper(
    palette='Viridis256',
    low = min(plot_df['Density']),
    high = max(plot_df['Density']),
)

p = figure(
    y_axis_type = "log", x_axis_type = "log",
    aspect_scale = 1, width = 600, height = 600,
    output_backend = "svg", tooltips = [("Gene", "@Name")],
   # y_range = (10**1.2, 10**4.5), x_range = (10**1.2, 10**4.5)
)

p.xaxis.axis_label = "mRNA level in lysate (RPM)"
p.yaxis.axis_label = "mRNA level in MTC (RPM)"
p.axis.ticker = [10**0, 10**2, 10**4, 10**6]
p.axis.major_label_text_font = "arial"
p.axis.major_label_text_font_size = "20px"
p.axis.axis_label_text_font_style = "normal"
p.axis.axis_label_text_font = "arial"
p.axis.axis_label_text_font_size = "22px"

p.background_fill_color = "whitesmoke"
p.grid.grid_line_color = "white"
p.grid.grid_line_width = 4
#p.toolbar_location = None

p.circle(x=x,
         y=y,
         source=plot_df,
         size=5,
         fill_alpha=0.8,
         line_alpha=0,
         color={'field': 'Density', 'transform': color_mapper},
        )
#p.line(x=[10**(1.5), 10**4], y=[10**(1.5), 10**4], color = "black", width = 4)
colorbar = ColorBar(
    color_mapper = color_mapper,
    location = (2,4),
    title = "Gaussian Kernel Density",
    label_standoff = 3,
    width = 20,
    height = 500,
    bar_line_color='black',
    major_tick_line_color='black',
    major_label_text_font_size = "20px",
    title_text_font_size = "20px",
    title_text_font_style = "normal",
)
#p.add_layout(colorbar, 'right')

show(p)
export_svg(p, filename = f'./1C_Main_Plot_{datetime.date.today()}.svg')

Percent of reads in the capsule for sample B_sub_Lys: 0.0017084398350398304
Percent of reads in gyrA for sample B_sub_Lys: 1.9757601885507463e-07
Normalized value of Capsule (normalized with cysG: 8647.0
Percent of reads in the capsule for sample E_coli_Cap: 0.0032485604616313128
Percent of reads in gyrA for sample E_coli_Cap: 1.4277593540163993e-06
Normalized value of Capsule (normalized with cysG: 2275.285714285714


['./1C_Main_Plot_2023-08-14.svg']
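
The density coloring used in the main plot can be tried on its own. The sketch below uses synthetic, correlated values (not the experimental data) to show the shape of the input that `gaussian_kde` expects and what it returns when evaluated at the same points.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic, correlated log-scale values, for illustration only
log_x = rng.normal(loc=5.0, scale=1.0, size=500)
log_y = log_x + rng.normal(loc=0.0, scale=0.3, size=500)

xy = np.vstack([log_x, log_y])    # shape (2, n_points): one row per dimension
density = gaussian_kde(xy)(xy)    # one density estimate per point

print(xy.shape, density.shape)
```

Each point receives a single density value, which is what the color mapper consumes through the `Density` column in the plot above.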

### Log 10 Density Insert:

In [5]:
p = figure(width = 200, height = 150,
          output_backend = "svg")

# Histogram
bins = np.linspace(-.5, .5, 10)
plt_hist, plt_edges = np.histogram(plot_df['Log_Dif'], density=False, bins=bins)
p.quad(top = plt_hist, bottom = 0, left = plt_edges[:-1], right = plt_edges[1:],
         fill_color="steelblue", line_color="white", fill_alpha = 1,
         )
p.axis.axis_label_text_font_style = "normal"
p.axis.major_label_text_font = "arial"
p.axis.major_label_text_font_size = "20px"
p.yaxis.visible = False
p.yaxis.ticker = [1000, 2000]
p.xaxis.ticker = [-1, -.5, 0, .5, 1]
p.xaxis.axis_label ="log-10 ratio"
p.axis.axis_label_text_font = "arial"
p.axis.axis_label_text_font_size = "20px"

p.background_fill_color = None
p.grid.grid_line_color = None
p.grid.grid_line_width = 4
p.outline_line_color = None

show(p)
export_svg(p, filename = f'1C_Density_Insert_{datetime.date.today()}.svg')

['1C_Density_Insert_2023-08-09.svg']
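
One thing to keep in mind with fixed bin edges is that `np.histogram` silently ignores values outside the outermost edges. The sketch below uses made-up log10 ratios (not the experimental data) with the same `bins` definition to show how to count what falls outside the plotted range.

```python
import numpy as np

# Made-up log10 ratios, for illustration only
log_dif = np.array([-0.8, -0.3, -0.05, 0.0, 0.02, 0.1, 0.45, 0.7])
bins = np.linspace(-.5, .5, 10)   # 10 edges, so 9 bins between -0.5 and 0.5

hist, edges = np.histogram(log_dif, bins=bins)
n_outside = np.sum((log_dif < bins[0]) | (log_dif > bins[-1]))

print(f'{hist.sum()} values binned, {n_outside} outside the bin range')
```

Here six of the eight toy values are binned and two are dropped, so comparing `hist.sum()` with the number of rows is a quick way to see how much of the distribution the inset actually shows.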

Next, let us calculate the standard deviation of this log10 fold-change distribution.

In [6]:
std = np.std(plot_df['Log_Dif'])
print(f'The standard deviation of this distribution is : {std:.3f}')

The standard deviation of this distribution is : 0.130
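
Since the standard deviation is in log10 units, it can be easier to read as a multiplicative factor. A short worked conversion, using the value printed above:

```python
std_log10 = 0.130            # value printed by the cell above
fold = 10 ** std_log10       # multiplicative factor for one standard deviation

print(f'One standard deviation corresponds to a {fold:.2f}-fold change')
```

In other words, one standard deviation of the log10 ratio corresponds to roughly a 1.35-fold difference between the two samples.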
