# Integrate data
This notebook takes processed data and integrates it into a single dataframe.

The general structure of the dataframe is that each cell barcode is listed on a row, and features of that cell are listed in columns. Cells that have more than one valid viral barcode identified may have multiple rows--one for each valid viral barcode.

Import python modules:


In [1]:
from IPython.display import display

import pandas as pd

Variables:

In [2]:
# Input sources
cell_annotations_csv = 'results/viral_tags_bcs_in_cells/scProgenyProduction_trial3_cell_barcodes_with_viral_tags.csv.gz' #snakemake.input.cell_annotations
viral_genes_by_cell_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_genes_by_cell.csv.gz' #snakemake.input.viral_genes_by_cell_csv
viral_barcodes_valid_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_bc_by_cell_valid.csv.gz' #snakemake.input.viral_barcodes_valid_csv
filtered_progeny_viral_bc_csv = 'results/viral_progeny/scProgenyProduction_trial3_filtered_progeny_viral_bc.csv.gz' #snakemake.input.filtered_progeny_viral_bc_csv
contributes_progeny_by_cell_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_contributes_progeny_by_cell.csv.gz' #snakemake.output.contributes_progeny_by_cell_csv

# # Params and wildcards
barcoded_viral_genes = ['fluHA', 'fluNA'] #snakemake.params.barcoded_viral_genes
# expt = expt = #snakemake.wildcards.expt

# # Output
# integrated_data_csv = #snakemake.output.integrated_data_csv

## Load data and transform into wide format
Load data from the following sources:  
* `cell_annotations_csv` contains a list of all cell barcodes, their infection status/tag, total_UMIs, viral_UMIs, and frac_viral_UMIs  

* `viral_genes_by_cell_csv` contains every cell barcode, each viral gene, whether the viral gene is present, the fraction of UMIs from that gene, and the total number of viral genes detected in that cell  

* `viral_barcodes_valid_csv` contains each infected cell, the valid viral barcodes detected in that cell, viral_bc_UMIs, and frac_viral_bc_UMIs  

* `filtered_progeny_viral_bc_csv` contains the progeny source, the tag of the progeny source, the valid viral barcodes detected in that progeny source, and the frequency of the viral barcodes in that progeny source  

Some of these dataframes are in a long format, with many rows per cell barcode. I will transform them into a wide format, so that each cell barcode has only one row. The only exception is the viral barcodes, which will have one row per cell barcode-viral barcode pair.

Load cell barcodes, tags, and basic metrics for every cell in transcriptome:

In [3]:
cell_annotations = pd.read_csv(cell_annotations_csv)
cell_annotations = (
    cell_annotations
    [['cell_barcode',
      'infected',
      'infecting_viral_tag',
      'total_UMIs',
      'viral_UMIs',
      'frac_viral_UMIs']])

display(cell_annotations)

Unnamed: 0,cell_barcode,infected,infecting_viral_tag,total_UMIs,viral_UMIs,frac_viral_UMIs
0,AAACCCAGTAACAAGT,uninfected,none,47873,6,0.000125
1,AAACCCATCATTGCTT,uninfected,none,90114,10,0.000111
2,AAACGAAAGATGTTGA,uninfected,none,111630,18,0.000161
3,AAACGAAGTACTTCCC,infected,both,56828,24082,0.423770
4,AAACGAAGTAGACGTG,infected,wt,124341,4654,0.037429
...,...,...,...,...,...,...
3367,TTTGATCTCCCGTTCA,uninfected,none,63150,3,0.000048
3368,TTTGATCTCGCATTGA,infected,wt,170914,10415,0.060937
3369,TTTGGAGAGTTGCCTA,uninfected,none,65941,12,0.000182
3370,TTTGGAGGTATCGTTG,infected,wt,150130,3526,0.023486


Load viral genes detected in each cell:

In [4]:
viral_genes = pd.read_csv(viral_genes_by_cell_csv)
viral_genes = viral_genes.rename(columns={'frac_gene_UMIs': 'frac_UMIs',
                                          'gene_present': 'present'})
viral_genes = (viral_genes
               .pivot(
    index=['cell_barcode', 'n_viral_genes'],
    columns=['gene'],
    values=['frac_UMIs', 'present']))
viral_genes.columns = ['_'.join(col).strip() for col in viral_genes.columns.values]
viral_genes = viral_genes.reset_index()
display(viral_genes)

Unnamed: 0,cell_barcode,n_viral_genes,frac_UMIs_fluHA,frac_UMIs_fluM,frac_UMIs_fluNA,frac_UMIs_fluNP,frac_UMIs_fluNS,frac_UMIs_fluPA,frac_UMIs_fluPB1,frac_UMIs_fluPB2,present_fluHA,present_fluM,present_fluNA,present_fluNP,present_fluNS,present_fluPA,present_fluPB1,present_fluPB2
0,AAACCCAGTAACAAGT,0,4.17772e-05,6.26658e-05,0,0,2.08886e-05,0,0,0,False,False,False,False,False,False,False,False
1,AAACCCATCATTGCTT,1,0,5.54853e-05,0,3.32912e-05,1.10971e-05,0,0,1.10971e-05,False,False,False,False,False,False,False,True
2,AAACGAAAGATGTTGA,0,2.68745e-05,8.95817e-05,0,8.95817e-06,3.58327e-05,0,0,0,False,False,False,False,False,False,False,False
3,AAACGAAGTACTTCCC,7,3.51939e-05,0.229816,0.00739072,0.0728338,0.10157,0.00040473,0.00890406,0.00281551,False,True,True,True,True,True,True,True
4,AAACGAAGTAGACGTG,8,0.00256553,0.0127231,0.000249314,0.00488978,0.0165271,8.84664e-05,0.000361908,2.41272e-05,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3367,TTTGATCTCCCGTTCA,0,0,1.58353e-05,0,1.58353e-05,1.58353e-05,0,0,0,False,False,False,False,False,False,False,False
3368,TTTGATCTCGCATTGA,8,0.00360415,0.0283242,0.0078636,0.00723756,0.013147,8.77634e-05,0.000497326,0.000175527,True,True,True,True,True,True,True,True
3369,TTTGGAGAGTTGCCTA,1,0,9.09904e-05,1.51651e-05,1.51651e-05,6.06603e-05,0,0,0,False,False,True,False,False,False,False,False
3370,TTTGGAGGTATCGTTG,8,0.000619463,0.0111104,0.00029974,0.00278425,0.0081929,4.66263e-05,0.000379671,5.32872e-05,True,True,True,True,True,True,True,True


Load valid viral barcodes in each infected cell:

In [5]:
transcriptome_viral_barcodes = pd.read_csv(viral_barcodes_valid_csv)
transcriptome_viral_barcodes = transcriptome_viral_barcodes.drop(columns=['valid_viral_bc'])
transcriptome_viral_barcodes = transcriptome_viral_barcodes.rename(columns={'gene': 'barcoded_gene'})
assert set(transcriptome_viral_barcodes['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(transcriptome_viral_barcodes)

Unnamed: 0,cell_barcode,barcoded_gene,viral_barcode,viral_bc_UMIs,frac_viral_bc_UMIs
0,AAACGAAGTAGACGTG,fluHA,AAGTAAGCGACATGAG,251,0.002019
1,AAAGGATTCTGATGGT,fluHA,GTGGAGTCGCCAGTTC,114,0.001424
2,AAAGGGCCAGGCTACC,fluHA,AAAGTGATCCCCATAC,8,0.000395
3,AAAGGGCCAGGCTACC,fluHA,CATTTAACGCTGTGAG,15,0.000741
4,AAAGGGCCAGGCTACC,fluHA,CGTAGGATGTTGCGTC,31,0.001532
...,...,...,...,...,...
998,TTTACCAGTCGCTTAA,fluNA,TTGGAGGAGACCCGTG,7,0.000061
999,TTTAGTCCATCATCCC,fluNA,AGAAACCTCGACATAT,11,0.000463
1000,TTTAGTCCATCATCCC,fluNA,TTGGACGCATTGCAAA,18,0.000757
1001,TTTCACAAGCCAAGCA,fluNA,GGTATCAGTTATTGTT,186,0.002718


Load progeny viral barcode frequencies:

In [8]:
progeny_viral_barcodes = pd.read_csv(filtered_progeny_viral_bc_csv)
progeny_viral_barcodes = progeny_viral_barcodes.drop(columns=['Unnamed: 0'])
progeny_viral_barcodes = (progeny_viral_barcodes
                          .rename(columns={'tag': 'infecting_viral_tag',
                                           'average_freq': 'progeny_freq',
                                           'gene': 'barcoded_gene'}))
assert set(progeny_viral_barcodes['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
progeny_viral_barcodes = (
    progeny_viral_barcodes
    .pivot(index=['viral_barcode', 'infecting_viral_tag', 'barcoded_gene'],
           columns=['source'],
           values=['progeny_freq'])
    .reset_index())
progeny_viral_barcodes.columns = ['_'.join(col).strip() for col in viral_genes.columns.values]
progeny_viral_barcodes = progeny_viral_barcodes.reset_index()
display(progeny_viral_barcodes)

Unnamed: 0_level_0,viral_barcode,infecting_viral_tag,barcoded_gene,progeny_freq,progeny_freq
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,second_infection,supernatant
0,AAAAACACGTTCTATA,wt,fluNA,0.000010,0.000010
1,AAAACATGATGACGCC,wt,fluNA,0.000010,0.000010
2,AAAACTAGTTAGAGCA,wt,fluHA,0.000010,0.000010
3,AAAAGCCATTCGGAGA,wt,fluNA,0.000010,0.000010
4,AAAAGTTCTTGGATGT,wt,fluNA,0.000010,0.000010
...,...,...,...,...,...
862,TTTGCCAGAAAATCTT,wt,fluHA,0.000010,0.000010
863,TTTGTCGGCAGTCACT,wt,fluNA,0.000010,0.000010
864,TTTTAACGTTATACTA,wt,fluNA,0.000010,0.000010
865,TTTTACCTACGTAGTT,wt,fluNA,0.000010,0.000010


Load whether each infected cell contributes any progeny:

In [7]:
contributes_progeny = pd.read_csv(contributes_progeny_by_cell_csv)
contributes_progeny = contributes_progeny.rename(columns={'gene': 'barcoded_gene'})
assert set(transcriptome_viral_barcodes['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(contributes_progeny)

Unnamed: 0,cell_barcode,source,max_progeny_freq,contributes_progeny
0,AAACGAAGTAGACGTG,second_infection,0.000010,False
1,AAAGGATTCTGATGGT,second_infection,0.077933,True
2,AAAGGGCCAGGCTACC,second_infection,0.000010,False
3,AAAGGGCTCCGCACTT,second_infection,0.000124,True
4,AAAGTCCAGTAGAGTT,second_infection,0.000010,False
...,...,...,...,...
815,TTTAGTCCATCATCCC,supernatant,0.000010,False
816,TTTAGTCGTGCTCCGA,supernatant,0.000010,False
817,TTTCACAAGCCAAGCA,supernatant,0.000010,False
818,TTTGATCTCGCATTGA,supernatant,0.000010,False


## Integrate data into single dataframe

## Revise integrated dataframe

## Output integrated dataframe