# Integrate data
This notebook takes processed data and integrates it into a single dataframe.

In the dataframe, each row corresponds to a cell barcode and each column holds a feature of that cell. A cell with more than one valid viral barcode may have multiple rows, one for each valid viral barcode.
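
To make the row structure concrete, here is a minimal sketch on toy data. The barcodes and values are invented for illustration, and the left merge is only one plausible way to obtain this layout, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the experiment).
toy_cells = pd.DataFrame({
    'cell_barcode': ['AAAC', 'CCCG', 'GGGT'],
    'infected': ['uninfected', 'infected', 'infected'],
})
toy_viral_bcs = pd.DataFrame({
    'cell_barcode': ['CCCG', 'GGGT', 'GGGT'],
    'viral_barcode': ['ACGT', 'TTAA', 'GGCC'],
})

# A left merge keeps every cell and repeats a cell's row once per viral barcode.
# Cells without a viral barcode get a missing value in `viral_barcode`.
toy_integrated = toy_cells.merge(toy_viral_bcs, on='cell_barcode', how='left')
rows_per_cell = toy_integrated.groupby('cell_barcode').size()
print(toy_integrated)
```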

Import Python modules:


In [1]:
from IPython.display import display

import pandas as pd

Variables:

In [17]:
# Input sources
cell_annotations_csv = 'results/viral_tags_bcs_in_cells/scProgenyProduction_trial3_cell_barcodes_with_viral_tags.csv.gz' #snakemake.input.cell_annotations
viral_genes_by_cell_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_genes_by_cell.csv.gz' #snakemake.input.viral_genes_by_cell_csv
viral_barcodes_valid_csv = 'results/viral_fastq10x/scProgenyProduction_trial3_viral_bc_by_cell_valid.csv.gz' #snakemake.input.viral_barcodes_valid_csv
filtered_progeny_viral_bc_csv = 'results/viral_progeny/scProgenyProduction_trial3_filtered_progeny_viral_bc.csv.gz' #snakemake.input.filtered_progeny_viral_bc_csv

# Params and wildcards
barcoded_viral_genes = ['fluHA', 'fluNA'] #snakemake.params.barcoded_viral_genes
# expt = #snakemake.wildcards.expt

# Output
# integrated_data_csv = #snakemake.output.integrated_data_csv
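
The trailing comments in the cell above suggest that the hard-coded values stand in for attributes of a `snakemake` object during interactive work. One possible way to support both modes without editing the cell is sketched below. The helper `resolve_setting` is hypothetical and not part of this notebook or of Snakemake; it assumes only that a `snakemake` object exists in the namespace when the notebook is run through a workflow.

```python
def resolve_setting(attr_path, fallback):
    """Return snakemake.<attr_path> if `snakemake` is defined, else `fallback`.

    Hypothetical helper for illustration only.
    """
    try:
        obj = snakemake  # noqa: F821 (assumed to exist only under Snakemake)
    except NameError:
        return fallback
    # Walk dotted attribute names such as 'input.cell_annotations'.
    for name in attr_path.split('.'):
        obj = getattr(obj, name)
    return obj


# Interactive run: no `snakemake` object, so the hard-coded path is used.
cell_annotations_csv = resolve_setting(
    'input.cell_annotations',
    'results/viral_tags_bcs_in_cells/'
    'scProgenyProduction_trial3_cell_barcodes_with_viral_tags.csv.gz',
)
```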

## Load data
Load data from the following sources:

* `cell_annotations_csv` contains a list of all cell barcodes, their infection status/tag, `total_UMIs`, `viral_UMIs`, and `frac_viral_UMIs`

* `viral_genes_by_cell_csv` contains every cell barcode, each viral gene, whether the viral gene is present, the fraction of UMIs from that gene, and the total number of viral genes detected in that cell

* `viral_barcodes_valid_csv` contains each infected cell, the valid viral barcodes detected in that cell, `viral_bc_UMIs`, and `frac_viral_bc_UMIs`

* `filtered_progeny_viral_bc_csv` contains the progeny source, the tag of the progeny source, the valid viral barcodes detected in that progeny source, and the frequency of the viral barcodes in that progeny source
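
Because each source is expected to carry specific columns, a quick check after loading can catch a mismatched input file early. The sketch below is an optional addition rather than part of the original workflow. The helper name `missing_columns` and the toy dataframe are hypothetical.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Toy dataframe standing in for a loaded source; it lacks `frac_viral_UMIs`.
toy_annotations = pd.DataFrame({
    'cell_barcode': ['AAAC'],
    'total_UMIs': [100],
    'viral_UMIs': [5],
})
required = ['cell_barcode', 'total_UMIs', 'viral_UMIs', 'frac_viral_UMIs']
missing = missing_columns(toy_annotations, required)
print(missing)
```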

Load cell barcodes, tags, and basic metrics for every cell in the transcriptome:

In [7]:
cell_annotations = pd.read_csv(cell_annotations_csv)
display(cell_annotations)

Unnamed: 0,cell_barcode,total_UMIs,viral_UMIs,cellular_UMIs,frac_viral_UMIs,infected,infecting_viral_tag,viral_tag_doublet,viral_tag_major,viral_tag_minor,viral_tag_major_infected,viral_tag_minor_infected,viral_tag_major_counts,viral_tag_minor_counts
0,AAACCCAGTAACAAGT,47873,6,47867,0.000125,uninfected,none,False,syn,wt,False,False,1,1
1,AAACCCATCATTGCTT,90114,10,90104,0.000111,uninfected,none,False,wt,syn,False,False,5,2
2,AAACGAAAGATGTTGA,111630,18,111612,0.000161,uninfected,none,False,wt,syn,False,False,9,2
3,AAACGAAGTACTTCCC,56828,24082,32746,0.423770,infected,both,True,wt,syn,True,True,18769,28
4,AAACGAAGTAGACGTG,124341,4654,119687,0.037429,infected,wt,False,wt,syn,True,False,3694,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3367,TTTGATCTCCCGTTCA,63150,3,63147,0.000048,uninfected,none,False,wt,syn,False,False,3,0
3368,TTTGATCTCGCATTGA,170914,10415,160499,0.060937,infected,wt,False,wt,syn,True,False,6883,10
3369,TTTGGAGAGTTGCCTA,65941,12,65929,0.000182,uninfected,none,False,wt,syn,False,False,4,3
3370,TTTGGAGGTATCGTTG,150130,3526,146604,0.023486,infected,wt,False,wt,syn,True,False,2784,2


Load viral genes detected in each cell:

In [9]:
viral_genes = pd.read_csv(viral_genes_by_cell_csv)
display(viral_genes)

Unnamed: 0,cell_barcode,n_viral_genes,gene,frac_gene_UMIs,gene_present
0,AAACCCAGTAACAAGT,0,fluPB2,0.000000,False
1,AAACCCATCATTGCTT,1,fluPB2,0.000011,True
2,AAACGAAAGATGTTGA,0,fluPB2,0.000000,False
3,AAACGAAGTACTTCCC,7,fluPB2,0.002816,True
4,AAACGAAGTAGACGTG,8,fluPB2,0.000024,True
...,...,...,...,...,...
26971,TTTGATCTCCCGTTCA,0,fluNS,0.000016,False
26972,TTTGATCTCGCATTGA,8,fluNS,0.013147,True
26973,TTTGGAGAGTTGCCTA,1,fluNS,0.000061,False
26974,TTTGGAGGTATCGTTG,8,fluNS,0.008193,True


Load valid viral barcodes in each infected cell:

In [15]:
transcriptome_viral_barcodes = pd.read_csv(viral_barcodes_valid_csv)
transcriptome_viral_barcodes = transcriptome_viral_barcodes.drop(columns=['valid_viral_bc'])
assert set(transcriptome_viral_barcodes['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(transcriptome_viral_barcodes)

Unnamed: 0,cell_barcode,gene,viral_barcode,viral_bc_UMIs,frac_viral_bc_UMIs
0,AAACGAAGTAGACGTG,fluHA,AAGTAAGCGACATGAG,251,0.002019
1,AAAGGATTCTGATGGT,fluHA,GTGGAGTCGCCAGTTC,114,0.001424
2,AAAGGGCCAGGCTACC,fluHA,AAAGTGATCCCCATAC,8,0.000395
3,AAAGGGCCAGGCTACC,fluHA,CATTTAACGCTGTGAG,15,0.000741
4,AAAGGGCCAGGCTACC,fluHA,CGTAGGATGTTGCGTC,31,0.001532
...,...,...,...,...,...
998,TTTACCAGTCGCTTAA,fluNA,TTGGAGGAGACCCGTG,7,0.000061
999,TTTAGTCCATCATCCC,fluNA,AGAAACCTCGACATAT,11,0.000463
1000,TTTAGTCCATCATCCC,fluNA,TTGGACGCATTGCAAA,18,0.000757
1001,TTTCACAAGCCAAGCA,fluNA,GGTATCAGTTATTGTT,186,0.002718
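
The assertion above reports only that the gene sets differ. If a more specific message is wanted, the difference can be split into unexpected and missing genes, as in this sketch. The helper `gene_mismatch` is hypothetical, and the example inputs are chosen only to show the output shape.

```python
def gene_mismatch(observed, expected):
    """Summarize how the observed gene set differs from the expected one."""
    observed, expected = set(observed), set(expected)
    return {
        'unexpected': sorted(observed - expected),  # seen but not expected
        'missing': sorted(expected - observed),     # expected but not seen
    }


# Example: fluNS appears unexpectedly and fluNA is absent.
example = gene_mismatch(['fluHA', 'fluNS'], ['fluHA', 'fluNA'])
print(example)
```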


Load progeny viral barcode frequencies:

In [18]:
progeny_viral_barcodes = pd.read_csv(filtered_progeny_viral_bc_csv)
progeny_viral_barcodes = progeny_viral_barcodes.drop(columns=['Unnamed: 0'])
progeny_viral_barcodes = (progeny_viral_barcodes
                          .rename(columns={'tag': 'infecting_viral_tag',
                                           'average_freq': 'progeny_freq'}))
assert set(progeny_viral_barcodes['gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(progeny_viral_barcodes)

Unnamed: 0,source,infecting_viral_tag,gene,viral_barcode,progeny_freq
0,second_infection,syn,fluHA,AAAATTGTCCAATATA,0.000010
1,second_infection,syn,fluHA,AAACCGAGGAAATCCC,0.001310
2,second_infection,syn,fluHA,AAATATCGCACCAAGA,0.000010
3,second_infection,syn,fluHA,AACACACGATACAGAC,0.000010
4,second_infection,syn,fluHA,AACAGTACGAAGCCGA,0.000010
...,...,...,...,...,...
1729,supernatant,wt,fluNA,TTTCTTTACTCAGAAT,0.002060
1730,supernatant,wt,fluNA,TTTGTCGGCAGTCACT,0.000010
1731,supernatant,wt,fluNA,TTTTAACGTTATACTA,0.000010
1732,supernatant,wt,fluNA,TTTTACCTACGTAGTT,0.000010
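
An `Unnamed: 0` column like the one dropped when loading the progeny barcodes typically appears when a CSV was written with its index and then read back with default settings. Whether that is how these input files were produced is an assumption. The sketch below uses toy data to show the behavior and an alternative that reads the first column as the index instead.

```python
import io

import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({'viral_barcode': ['ACGT', 'TTAA'],
                    'progeny_freq': [0.5, 0.25]})

# to_csv writes the index by default, producing a first column with no header.
csv_text = toy.to_csv()

# Default read: the headerless first column is named 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```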


## Integrate data into single dataframe

## Revise integrated dataframe

## Output integrated dataframe