# Biology analysis
This notebook provides a central place to integrate and analyze transcriptome and progeny production data.

## Notebook setup
Import Python modules:

In [1]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import pandas as pd

import plotnine as p9

import scipy

import statsmodels.stats.multitest

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration). For now, the values are hardcoded rather than read from the `snakemake` object:

In [2]:
# Hardcoded for now
viral_tag_by_cell_csv = 'results/viral_tags_bcs_in_cells/scProgenyProduction_trial3_cell_barcodes_with_viral_tags.csv.gz'
viral_bc_background_freq_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_bc_background_freq.csv.gz'
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial3_viral_bc_in_progeny_freq.csv.gz'
expt = 'scProgenyProduction_trial3'
plot = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_bc_correlations.pdf'
barcoded_viral_genes = ['fluHA', 'fluNA']
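
Once the notebook runs under Snakemake, the hardcoded values could be replaced by lookups on the injected `snakemake` object. The following is a minimal sketch, not the workflow's actual configuration: the helper `get_setting`, the `HARDCODED` dictionary, and the field names `viral_tag_by_cell_csv` and `expt` are hypothetical and assume the rule declares named inputs and params with those names.

```python
# Sketch only: the field names below are assumed rule fields,
# not names confirmed by the workflow.
HARDCODED = {
    'viral_tag_by_cell_csv': 'results/viral_tags_bcs_in_cells/scProgenyProduction_trial3_cell_barcodes_with_viral_tags.csv.gz',
    'expt': 'scProgenyProduction_trial3',
}


def get_setting(name, section='input'):
    """Return a value from `snakemake` if it is defined, else the hardcoded one."""
    # `snakemake` is injected by Snakemake; it is absent when run by hand.
    smk = globals().get('snakemake')
    if smk is None:
        return HARDCODED[name]
    return getattr(getattr(smk, section), name)


example_csv = get_setting('viral_tag_by_cell_csv')
example_expt = get_setting('expt', section='params')
```

Run by hand, both lookups fall back to the hardcoded values, so the rest of the notebook behaves the same either way.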

Set the plotnine theme:

In [3]:
p9.theme_set(p9.theme_classic())

## Correlation between transcripts and progeny
This section plots the correlation between viral barcode expression in the transcriptome and viral barcode fraction in the progeny datasets (supernatant or second infection).

### Organize data
First, read the cell barcodes and tags into a pandas dataframe. Only keep relevant columns.

In [4]:
all_cells = pd.read_csv(viral_tag_by_cell_csv)
all_cells = all_cells[['cell_barcode',
                       'infected',
                       'infecting_viral_tag']]
display(all_cells)

Unnamed: 0,cell_barcode,infected,infecting_viral_tag
0,AAACCCAGTAACAAGT,False,none
1,AAACCCATCATTGCTT,False,none
2,AAACGAAAGATGTTGA,False,none
3,AAACGAAGTACTTCCC,True,wt
4,AAACGAAGTAGACGTG,True,wt
...,...,...,...
3360,TTTGATCTCCCGTTCA,False,none
3361,TTTGATCTCGCATTGA,True,wt
3362,TTTGGAGAGTTGCCTA,False,none
3363,TTTGGAGGTATCGTTG,True,wt


Next, read the viral barcode frequencies from the transcriptome.

In [5]:
transcriptome_viral_bc_freqs = pd.read_csv(viral_bc_background_freq_csv)
assert set(transcriptome_viral_bc_freqs['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(transcriptome_viral_bc_freqs)

Unnamed: 0,cell_barcode,infected,infecting_viral_tag,gene,viral_barcode,frac_viral_bc_UMIs,reject_uninfected
0,AAACCCAGTAACAAGT,False,none,fluHA,,0.000000,False
1,AAACCCATCATTGCTT,False,none,fluHA,,0.000000,False
2,AAACGAAAGATGTTGA,False,none,fluHA,,0.000000,False
3,AAACGAAGTACTTCCC,True,wt,fluHA,,0.000000,False
4,AAACGAAGTAGACGTG,True,wt,fluHA,AAGTAAGCGACATGAG,0.001271,True
...,...,...,...,...,...,...,...
7706,TTTGATCTCCCGTTCA,False,none,fluNA,,0.000000,True
7707,TTTGATCTCGCATTGA,True,wt,fluNA,,0.000000,True
7708,TTTGGAGAGTTGCCTA,False,none,fluNA,,0.000000,True
7709,TTTGGAGGTATCGTTG,True,wt,fluNA,ACATCTTATTTACACG,0.000349,True


Read the viral barcode frequencies from the progeny. **For now, work only with the supernatant data and remove the `second_infection` frequencies.**

In [6]:
progeny_viral_bc_freqs = pd.read_csv(viral_bc_in_progeny_freq_csv)
assert set(progeny_viral_bc_freqs['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
progeny_viral_bc_freqs = (progeny_viral_bc_freqs
                          .rename(columns={'barcode': 'viral_barcode',
                                           'tag': 'infecting_viral_tag',
                                           'mean_freq': 'freq_progeny'}))
progeny_viral_bc_freqs = (progeny_viral_bc_freqs
                          .query('source == "supernatant"'))
progeny_viral_bc_freqs = progeny_viral_bc_freqs.drop(columns='source')
display(progeny_viral_bc_freqs)

Unnamed: 0,infecting_viral_tag,gene,viral_barcode,freq_progeny
0,syn,fluHA,AAAAAAGCACGAGCAG,2.507618e-07
1,syn,fluHA,AAAAAATCCTTCAGCA,2.574453e-07
2,syn,fluHA,AAAAAATGGCGACGCT,2.574453e-07
3,syn,fluHA,AAAAAATTGGTTTACT,2.574453e-07
4,syn,fluHA,AAAAACACTCACAAGT,2.574453e-07
...,...,...,...,...
57485,wt,fluNA,TTTTTCCCTTACATAT,4.799773e-07
57486,wt,fluNA,TTTTTCTTACGATCAC,4.799773e-07
57487,wt,fluNA,TTTTTCTTCGAGATAG,4.799773e-06
57488,wt,fluNA,TTTTTGGGATCATTGC,9.599545e-07
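
The merge further down joins on `infecting_viral_tag`, `gene`, and `viral_barcode`, so it may be worth confirming that these three columns identify each supernatant row uniquely. A minimal sketch on made-up rows (the `toy_progeny` frame and its values are illustrative, not taken from the data):

```python
import pandas as pd

# Toy stand-in for `progeny_viral_bc_freqs`; values are invented for illustration.
toy_progeny = pd.DataFrame({
    'infecting_viral_tag': ['syn', 'syn', 'wt'],
    'gene': ['fluHA', 'fluNA', 'fluHA'],
    'viral_barcode': ['AAAA', 'AAAA', 'AAAA'],
    'freq_progeny': [1e-6, 2e-6, 3e-6],
})

# The same barcode may recur across tags or genes, but not within one tag and gene.
key_cols = ['infecting_viral_tag', 'gene', 'viral_barcode']
n_duplicated = toy_progeny.duplicated(subset=key_cols).sum()
assert n_duplicated == 0, "Merge keys are not unique in progeny frequencies."
```

If the keys were duplicated, an outer merge would silently multiply rows, so a check of this kind is cheap insurance.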


Merge the dataframes into one in two steps. First, merge the transcriptome frequencies with the tag information for all cells. Then, merge the supernatant frequencies into the resulting dataframe.

The final dataframe should contain the following columns:
1. Cell barcode
2. Infected
3. Infecting viral tag
4. Gene
5. Viral barcode
6. Frequency in transcriptome (`frac_viral_bc_UMIs`)
7. Frequency in supernatant sequencing (`freq_progeny`)

In [9]:
viral_bc_freqs = pd.merge(
    left=pd.concat([all_cells.assign(gene=gene)
                    for gene in barcoded_viral_genes]),
    right=transcriptome_viral_bc_freqs,
    how='outer',
    on=['cell_barcode', 'gene', 'infected', 'infecting_viral_tag'],
    validate='one_to_many')

# Compare as sets so the check does not depend on row order
assert (set(viral_bc_freqs['cell_barcode']) ==
        set(all_cells['cell_barcode'])), \
       "Cell barcodes in merged dataframe don't " \
       "match barcodes in source data."
assert (viral_bc_freqs['viral_barcode'].nunique() ==
        transcriptome_viral_bc_freqs['viral_barcode'].nunique()), \
       "Number of viral barcodes in merged dataframe doesn't " \
       "match number of barcodes in source data."

viral_bc_freqs = pd.merge(
    left=viral_bc_freqs,
    right=progeny_viral_bc_freqs,
    how='outer',
    on=['viral_barcode', 'gene', 'infecting_viral_tag'])
# Need to think of what asserts to include here

display(viral_bc_freqs)

Unnamed: 0,cell_barcode,infected,infecting_viral_tag,gene,viral_barcode,frac_viral_bc_UMIs,reject_uninfected,freq_progeny
0,AAACCCAGTAACAAGT,False,none,fluHA,,0.0,False,
1,AAACCCATCATTGCTT,False,none,fluHA,,0.0,False,
2,AAACGAAAGATGTTGA,False,none,fluHA,,0.0,False,
3,AAACGAAGTGATAGAT,False,none,fluHA,,0.0,False,
4,AAACGCTCAAATGATG,False,none,fluHA,,0.0,False,
...,...,...,...,...,...,...,...,...
64902,,,wt,fluNA,TTTTTCCCTTACATAT,,,4.799773e-07
64903,,,wt,fluNA,TTTTTCTTACGATCAC,,,4.799773e-07
64904,,,wt,fluNA,TTTTTCTTCGAGATAG,,,4.799773e-06
64905,,,wt,fluNA,TTTTTGGGATCATTGC,,,9.599545e-07
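
One option for checking the second merge is to repeat it with `indicator=True` and count how rows fall into `left_only`, `right_only`, and `both`. The sketch below uses small made-up frames (`toy_cells`, `toy_progeny`) in place of the real ones. Which checks are appropriate is an assumption about intent, not something the notebook states, and `validate='many_to_one'` assumes each progeny key is unique.

```python
import pandas as pd

# Toy stand-ins; all values are invented for illustration.
toy_cells = pd.DataFrame({
    'cell_barcode': ['CELL1', 'CELL2'],
    'infecting_viral_tag': ['wt', 'wt'],
    'gene': ['fluHA', 'fluHA'],
    'viral_barcode': ['AAAA', 'CCCC'],
    'frac_viral_bc_UMIs': [0.01, 0.02],
})
toy_progeny = pd.DataFrame({
    'infecting_viral_tag': ['wt', 'wt'],
    'gene': ['fluHA', 'fluHA'],
    'viral_barcode': ['AAAA', 'GGGG'],
    'freq_progeny': [1e-6, 2e-6],
})

key_cols = ['viral_barcode', 'gene', 'infecting_viral_tag']
merged = pd.merge(
    left=toy_cells,
    right=toy_progeny,
    how='outer',
    on=key_cols,
    validate='many_to_one',  # several cells may share one progeny barcode
    indicator=True)

n_both = (merged['_merge'] == 'both').sum()
n_left_only = (merged['_merge'] == 'left_only').sum()
n_right_only = (merged['_merge'] == 'right_only').sum()

# Every transcriptome row should survive the outer merge exactly once.
assert n_both + n_left_only == len(toy_cells)
# Every progeny key should appear, either matched or as a progeny-only row.
matched_keys = merged.loc[merged['_merge'] == 'both', key_cols].drop_duplicates()
assert len(matched_keys) + n_right_only == len(toy_progeny)
# Progeny-only rows carry no cell barcode, as in the tail of the table above.
assert merged.loc[merged['_merge'] == 'right_only', 'cell_barcode'].isna().all()
```

The `_merge` column can be dropped once the checks pass.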
