# Correlations
This notebook examines how viral barcode frequencies correlate across the transcriptome, the supernatant, and the second infection.

## Notebook setup
Import Python modules:

In [1]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import pandas as pd

import plotnine as p9

import scipy

import statsmodels.stats.multitest

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [3]:
# Hardcoded for now
viral_tag_by_cell_csv = 'results/viral_tags_bcs_in_cells/scProgenyProduction_trial2_cell_barcodes_with_viral_tags.csv.gz'
viral_bc_by_cell_filtered_csv = 'results/viral_fastq10x/scProgenyProduction_trial2_viral_bc_by_cell_filtered.csv.gz'
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial2_viral_bc_in_progeny_freq.csv.gz'
expt = 'scProgenyProduction_trial2'
plot = 'results/viral_fastq10x/scProgenyProduction_trial2_viral_bc_correlations.pdf'
barcoded_viral_genes = ['fluHA', 'fluNA']
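
The hardcoded values above could later be swapped for the object that Snakemake injects into the notebook. The sketch below is one way to fall back to defaults when that object is absent. The helper `resolve_params`, the `DEFAULTS` dictionary, and the assumption that the rule passes `expt` and `barcoded_viral_genes` under `params` are all hypothetical and not taken from the actual workflow.

```python
from types import SimpleNamespace

# Hypothetical defaults, mirroring two of the hardcoded values above.
DEFAULTS = {
    'expt': 'scProgenyProduction_trial2',
    'barcoded_viral_genes': ['fluHA', 'fluNA'],
}


def resolve_params(snakemake_obj=None, defaults=DEFAULTS):
    """Return parameters from a snakemake-like object, else the defaults."""
    params = dict(defaults)
    if snakemake_obj is None:
        # Interactive use: no workflow object available, keep the defaults.
        return params
    for key in defaults:
        # Only override keys that the workflow actually defines under `params`.
        if hasattr(snakemake_obj.params, key):
            params[key] = getattr(snakemake_obj.params, key)
    return params


# Stand-in for the injected `snakemake` object (an assumption for this demo).
fake_snakemake = SimpleNamespace(params=SimpleNamespace(expt='another_expt'))

interactive = resolve_params()
from_workflow = resolve_params(fake_snakemake)
print(interactive['expt'], from_workflow['expt'])
```

Keeping the fallback in one helper means the notebook can still be run by hand while the workflow integration is unfinished.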

## Organize data
Read the cell barcodes and tags into a pandas dataframe. Only keep relevant columns.

In [4]:
all_cells = pd.read_csv(viral_tag_by_cell_csv)
all_cells = all_cells[['cell_barcode',
                       'infected',
                       'infecting_viral_tag']]
display(all_cells)

Unnamed: 0,cell_barcode,infected,infecting_viral_tag
0,AAACCCAAGTAGGTTA,True,syn
1,AAACCCACAAGGCCTC,True,syn
2,AAACCCACACACACGC,True,both
3,AAACCCATCGTGCATA,True,syn
4,AAACCCATCTACTGCC,False,none
...,...,...,...
7436,TTTGGTTGTTAAGCAA,False,none
7437,TTTGGTTTCGTCGCTT,False,none
7438,TTTGTTGCATGTGGTT,True,wt
7439,TTTGTTGTCGTCGGGT,False,none
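
As a variation on the cell above, the column selection can happen at read time with the `usecols` argument of `pd.read_csv`. The sketch uses a small in-memory CSV rather than the real file; the `extra_column` field and the uniqueness check are illustrative assumptions.

```python
import io

import pandas as pd

# Toy CSV mimicking the layout of the cell-tag file; `extra_column` is hypothetical.
toy_csv = io.StringIO(
    "cell_barcode,infected,infecting_viral_tag,extra_column\n"
    "AAACCCAAGTAGGTTA,True,syn,1\n"
    "AAACCCACACACACGC,True,both,2\n"
    "AAACCCATCTACTGCC,False,none,3\n"
)

keep_cols = ['cell_barcode', 'infected', 'infecting_viral_tag']

# Only the listed columns are parsed; everything else is skipped while reading.
toy_cells = pd.read_csv(toy_csv, usecols=keep_cols)

# Assumed invariant: each cell barcode appears exactly once in this table.
assert toy_cells['cell_barcode'].is_unique

print(toy_cells)
```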


Read the viral barcode frequencies from the transcriptome. Keep only the viral barcodes that passed the significance threshold (`reject_uninfected == True`).

In [8]:
transcriptome_viral_bc_freqs = pd.read_csv(viral_bc_by_cell_filtered_csv)
assert set(transcriptome_viral_bc_freqs['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
transcriptome_viral_bc_freqs = (transcriptome_viral_bc_freqs
                                .query('reject_uninfected == True'))
display(transcriptome_viral_bc_freqs)

Unnamed: 0,cell_barcode,gene,viral_barcode,frac_viral_bc_UMIs,reject_uninfected
1,AAACCCAAGTAGGTTA,fluHA,AGAATCGACACATGTC,0.002609,True
4,AAACCCAAGTAGGTTA,fluHA,CACGGATGGTGTACGA,0.002236,True
6,AAACCCAAGTAGGTTA,fluHA,CTTAACTGTATATTCG,0.004844,True
8,AAACCCAAGTAGGTTA,fluHA,TCTTAAGTATATCAGA,0.002422,True
12,AAACCCAAGTAGGTTA,fluHA,TGTAAATAGAGTTCGC,0.003726,True
...,...,...,...,...,...
58905,TTTGGTTCATCTGCGG,fluNA,TTGTATAAAAATACAG,0.007323,True
58906,TTTGGTTGTCGGTGTC,fluNA,AATAAGCGGCTCTTTG,0.003318,True
58913,TTTGTTGCATGTGGTT,fluNA,CACCCCGTTAGTGGGG,0.007294,True
58923,TTTGTTGTCTAGGAAA,fluNA,GAACCCGATGGGGAAT,0.009800,True
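
To illustrate the effect of this filter, here is a minimal sketch on made-up data with the same column names. It keeps the rows flagged by `reject_uninfected` and then counts how many distinct viral barcodes remain per cell and gene. The barcodes, fractions, and the `n_viral_bc` column name are invented for the example.

```python
import pandas as pd

# Toy stand-in for the transcriptome barcode table (values are invented).
toy_bcs = pd.DataFrame({
    'cell_barcode': ['cellA', 'cellA', 'cellB', 'cellB'],
    'gene': ['fluHA', 'fluHA', 'fluNA', 'fluNA'],
    'viral_barcode': ['AAAA', 'CCCC', 'GGGG', 'TTTT'],
    'frac_viral_bc_UMIs': [0.0026, 0.0001, 0.0073, 0.0002],
    'reject_uninfected': [True, False, True, False],
})

# Boolean indexing keeps only rows where the flag is True.
kept = toy_bcs[toy_bcs['reject_uninfected']]

# Number of distinct retained viral barcodes for each cell and gene.
n_per_cell_gene = (kept
                   .groupby(['cell_barcode', 'gene'])['viral_barcode']
                   .nunique()
                   .reset_index(name='n_viral_bc'))

print(n_per_cell_gene)
```

A per-cell count like this can be a quick sanity check that the filter leaves a plausible number of barcodes in each cell.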


Read the viral barcode frequencies from the progeny. **For now, work only with the supernatant data and drop the `second_infection` frequencies.**

In [13]:
progeny_viral_bc_freqs = pd.read_csv(viral_bc_in_progeny_freq_csv)
assert set(progeny_viral_bc_freqs['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
progeny_viral_bc_freqs = (progeny_viral_bc_freqs
                          .rename(columns={'barcode': 'viral_barcode'}))
progeny_viral_bc_freqs = (progeny_viral_bc_freqs
                          .query('source == "supernatant"'))
display(progeny_viral_bc_freqs)

Unnamed: 0,source,tag,gene,viral_barcode,mean_freq
90716,supernatant,syn,fluHA,AAAAAAATCTTAATGA,2.414539e-04
90717,supernatant,syn,fluHA,AAAAAACGAATAAATT,2.483091e-06
90718,supernatant,syn,fluHA,AAAAAACGAATGGATT,3.352175e-04
90719,supernatant,syn,fluHA,AAAAAACGAGAAGAAT,1.277534e-06
90720,supernatant,syn,fluHA,AAAAAAGTTGAGATTT,1.205557e-06
...,...,...,...,...,...
130646,supernatant,wt,fluNA,TTTTGTTAGCGTCCTG,3.067874e-04
130647,supernatant,wt,fluNA,TTTTTTAGAAAACGTA,8.987181e-07
130648,supernatant,wt,fluNA,TTTTTTAGAAAACGTC,1.680603e-04
130649,supernatant,wt,fluNA,TTTTTTCACTGCCATT,8.987181e-07


Merge the dataframes into one. The structure of the merged dataframe should be as follows:


In [None]:
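# Assumed join keys and join type; adjust once the target structure is settled
viral_bc_freqs = pd.merge(transcriptome_viral_bc_freqs, progeny_viral_bc_freqs, on=['gene', 'viral_barcode'], how='left')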
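
Since the target structure is not spelled out here, the following is only one possible arrangement, shown on toy data. It assumes each transcriptome barcode should be annotated with its cell's infection tag and then matched to the supernatant frequency for the same tag, gene, and barcode. The join keys, the left joins, and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy versions of the three tables, using the column names shown above.
toy_cells = pd.DataFrame({
    'cell_barcode': ['cellA', 'cellB'],
    'infected': [True, True],
    'infecting_viral_tag': ['syn', 'wt'],
})
toy_transcriptome = pd.DataFrame({
    'cell_barcode': ['cellA', 'cellA', 'cellB'],
    'gene': ['fluHA', 'fluNA', 'fluHA'],
    'viral_barcode': ['AAAA', 'CCCC', 'GGGG'],
    'frac_viral_bc_UMIs': [0.0026, 0.0048, 0.0073],
})
toy_progeny = pd.DataFrame({
    'source': ['supernatant', 'supernatant'],
    'tag': ['syn', 'wt'],
    'gene': ['fluHA', 'fluHA'],
    'viral_barcode': ['AAAA', 'GGGG'],
    'mean_freq': [2.4e-4, 3.1e-4],
})

# Step 1: attach per-cell annotations; many barcode rows map to one cell row.
cell_bcs = toy_transcriptome.merge(
    toy_cells, on='cell_barcode', how='left', validate='many_to_one')

# Step 2: attach supernatant frequencies. The left join keeps barcodes that
# are missing from the supernatant, leaving `mean_freq` as NaN for them.
toy_merged = cell_bcs.merge(
    toy_progeny.rename(columns={'tag': 'infecting_viral_tag'}),
    on=['infecting_viral_tag', 'gene', 'viral_barcode'],
    how='left',
    validate='many_to_one',
)

print(toy_merged)
```

The `validate` argument makes pandas raise an error if the right-hand table has duplicate keys, which guards against silently duplicating rows. Cells tagged `both` would not match any single progeny tag under this scheme and would need separate handling.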