# Integrate data
This notebook takes processed data and integrates it into a single dataframe.

Each row of the dataframe corresponds to a cell barcode, and the features of that cell are listed in columns. Cells with more than one valid viral barcode have multiple rows, one for each valid viral barcode.

Import python modules:


In [None]:
from IPython.display import display

import pandas as pd

Variables:

In [None]:
# Input sources
cell_annotations_csv = snakemake.input.cell_annotations
viral_genes_by_cell_csv = snakemake.input.viral_genes_by_cell_csv
# Hard-coded path (not taken from snakemake.input, unlike the other sources)
pacbio_consensus_gene_csv = 'results/pacbio/scProgenyProduction_trial3_consensus_gene.csv.gz'  # snakemake.input.pacbio_consensus_gene_csv
viral_barcodes_valid_csv = snakemake.input.viral_barcodes_valid_csv
filtered_progeny_viral_bc_csv = snakemake.input.filtered_progeny_viral_bc_csv
contributes_progeny_by_cell_csv = snakemake.input.contributes_progeny_by_cell_csv

# Params and wildcards
barcoded_viral_genes = snakemake.params.barcoded_viral_genes
expt = snakemake.wildcards.expt

# Output
integrated_data_csv = snakemake.output.integrated_data_csv

## Load data and transform into wide format
Load data from the following sources:  
* `cell_annotations_csv` contains a list of all cell barcodes, their infection status/tag, total_UMIs, viral_UMIs, and frac_viral_UMIs  

* `viral_genes_by_cell_csv` contains every cell barcode, each viral gene, whether the viral gene is present, the fraction of UMIs from that gene, and the total number of viral genes detected in that cell  

* `viral_barcodes_valid_csv` contains each infected cell, the valid viral barcodes detected in that cell, viral_bc_UMIs, and frac_viral_bc_UMIs  

* `filtered_progeny_viral_bc_csv` contains the progeny source, the tag of the progeny source, the valid viral barcodes detected in that progeny source, and the frequency of the viral barcodes in that progeny source  

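* `pacbio_consensus_gene_csv` contains the cell barcode, each viral gene, and the consensus genotype of the viral gene in that cell  

* `contributes_progeny_by_cell_csv` contains each infected cell, the progeny source, the cell's `max_progeny_freq` in that source, and whether the cell contributes progeny (`contributes_progeny`)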

Some of these dataframes are in a long format, with many rows per cell barcode. I will transform them into a wide format, so that each cell barcode has only one row. The only exception is the viral barcodes, which will have one row per cell barcode-viral barcode pair.
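
To illustrate the long-to-wide transformation described above, here is a minimal sketch on a toy table. The barcodes, gene names (`fluHA`, `fluNA`), and values are made up for illustration and are not taken from the real inputs; only the `pivot` pattern mirrors what the loading cells below do.

```python
import pandas as pd

# Toy long-format table (hypothetical values): one row per cell-gene pair.
long_df = pd.DataFrame({
    'cell_barcode': ['AAAC', 'AAAC', 'TTTG', 'TTTG'],
    'gene': ['fluHA', 'fluNA', 'fluHA', 'fluNA'],
    'present': [True, False, True, True],
    'frac_UMIs': [0.10, 0.00, 0.05, 0.02],
})

# Pivot to wide format: one row per cell, one column per (value, gene) pair.
wide_df = long_df.pivot(index='cell_barcode',
                        columns='gene',
                        values=['frac_UMIs', 'present'])

# Flatten the two-level column index into names such as 'frac_UMIs_fluHA'.
wide_df.columns = ['_'.join(col) for col in wide_df.columns]
wide_df = wide_df.reset_index()
```

Each cell barcode ends up on a single row, with columns such as `frac_UMIs_fluHA` and `present_fluNA`.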

Load cell barcodes, tags, and basic metrics for every cell in the transcriptome:

In [None]:
cell_annotations = pd.read_csv(cell_annotations_csv)
cell_annotations = (
    cell_annotations
    [['cell_barcode',
      'infected',
      'infecting_viral_tag',
      'total_UMIs',
      'viral_UMIs',
      'frac_viral_UMIs']])

display(cell_annotations)

Load viral genes detected in each cell:

In [None]:
viral_genes = pd.read_csv(viral_genes_by_cell_csv)
viral_genes = viral_genes.rename(columns={'frac_gene_UMIs': 'frac_UMIs',
                                          'gene_present': 'present'})
viral_genes = (viral_genes
               .pivot(
    index=['cell_barcode', 'n_viral_genes'],
    columns=['gene'],
    values=['frac_UMIs', 'present']))
viral_genes.columns = ['_'.join(col).strip() for col in viral_genes.columns.values]
viral_genes = viral_genes.reset_index()
display(viral_genes)

Load viral gene genotypes:

In [None]:
viral_genotypes = pd.read_csv(pacbio_consensus_gene_csv)
viral_genotypes = viral_genotypes.rename(columns={'total_UMI': 'pacbio_UMIs',
                                                  'consensus_mutations': 'mutations'})
viral_genotypes = (
    viral_genotypes
    .pivot(index=['cell_barcode'],
           columns=['gene'],
           values=['mutations', 'pacbio_UMIs']))
viral_genotypes.columns = ['_'.join(col).strip() for col in viral_genotypes.columns.values]
viral_genotypes = viral_genotypes.reset_index()
display(viral_genotypes)

Load valid viral barcodes in each infected cell:

In [None]:
transcriptome_viral_barcodes = pd.read_csv(viral_barcodes_valid_csv)
transcriptome_viral_barcodes = transcriptome_viral_barcodes.drop(columns=['valid_viral_bc'])
transcriptome_viral_barcodes = transcriptome_viral_barcodes.rename(columns={'gene': 'barcoded_gene'})
assert set(transcriptome_viral_barcodes['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
display(transcriptome_viral_barcodes)

Load progeny viral barcode frequencies:

In [None]:
progeny_viral_barcodes = pd.read_csv(filtered_progeny_viral_bc_csv)
progeny_viral_barcodes = progeny_viral_barcodes.drop(columns=['Unnamed: 0'])
progeny_viral_barcodes = (progeny_viral_barcodes
                          .rename(columns={'tag': 'infecting_viral_tag',
                                           'average_freq': 'freq',
                                           'gene': 'barcoded_gene'}))
assert set(progeny_viral_barcodes['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
progeny_viral_barcodes = (
    progeny_viral_barcodes
    .pivot(index=['viral_barcode', 'infecting_viral_tag', 'barcoded_gene'],
           columns=['source'],
           values=['freq']))
progeny_viral_barcodes.columns = ['_'.join(col).strip() for col in progeny_viral_barcodes.columns.values]
progeny_viral_barcodes = progeny_viral_barcodes.reset_index()
display(progeny_viral_barcodes)

Load whether each infected cell contributes any progeny:

In [None]:
contributes_progeny = pd.read_csv(contributes_progeny_by_cell_csv)
contributes_progeny = contributes_progeny.rename(columns={'gene': 'barcoded_gene',
                                                          'max_progeny_freq': 'max_freq'})
assert set(contributes_progeny['barcoded_gene']) == set(barcoded_viral_genes), \
       "Barcoded genes in barcode counts do not match expectation."
contributes_progeny = (
    contributes_progeny
    .pivot(index=['cell_barcode'],
           columns=['source'],
           values=['max_freq', 'contributes_progeny']))
contributes_progeny.columns = ['_'.join(col).strip() for col in contributes_progeny.columns.values]
contributes_progeny = contributes_progeny.reset_index()
display(contributes_progeny)

## Integrate data into single dataframe

Start with the `cell_annotations` dataframe and bring in viral gene expression information:

In [None]:
integrated_df = pd.merge(
    left=cell_annotations,
    right=viral_genes,
    on='cell_barcode',
    how='outer',
    validate='one_to_one'
)
display(integrated_df)
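
The `validate` argument makes pandas check the key relationship before merging. As a rough sketch on made-up tables (`annotations`, `genes`, and `genes_dup` are hypothetical names, not objects from this notebook), a duplicated `cell_barcode` causes a `one_to_one` merge to raise `pandas.errors.MergeError` rather than silently duplicating rows:

```python
import pandas as pd

# Hypothetical toy tables keyed on cell_barcode.
annotations = pd.DataFrame({'cell_barcode': ['AAAC', 'TTTG'],
                            'infected': ['infected', 'uninfected']})
genes = pd.DataFrame({'cell_barcode': ['AAAC', 'TTTG'],
                      'n_viral_genes': [8, 0]})

# Keys are unique on both sides, so the one-to-one validation passes.
merged = pd.merge(annotations, genes, on='cell_barcode',
                  how='outer', validate='one_to_one')

# A duplicated key on the right side makes the same merge raise MergeError.
genes_dup = pd.concat([genes, genes.iloc[[0]]], ignore_index=True)
try:
    pd.merge(annotations, genes_dup, on='cell_barcode',
             how='outer', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```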

Bring in viral gene genotypes:

In [None]:
integrated_df = pd.merge(
    left=integrated_df,
    right=viral_genotypes,
    on='cell_barcode',
    how='left',
    validate='one_to_one'
)
display(integrated_df)

Bring in valid viral barcodes for cells that have them:

In [None]:
integrated_df = pd.merge(
    left=integrated_df,
    right=transcriptome_viral_barcodes,
    on='cell_barcode',
    how='left',
    validate='one_to_many'
)
display(integrated_df)
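
Because this merge is `one_to_many`, a cell with several valid viral barcodes expands into several rows, while a cell with none keeps a single row with a missing barcode. A small sketch with invented barcodes (the table names `cells` and `barcodes` are hypothetical) shows the effect:

```python
import pandas as pd

# Hypothetical toy data: AAAC has two viral barcodes, TTTG one, GGCA none.
cells = pd.DataFrame({'cell_barcode': ['AAAC', 'TTTG', 'GGCA']})
barcodes = pd.DataFrame({
    'cell_barcode': ['AAAC', 'AAAC', 'TTTG'],
    'viral_barcode': ['ACGTACGT', 'TTGGCCAA', 'GATTACAG'],
})

# The left join keeps every cell; one_to_many allows repeated keys on the right.
expanded = pd.merge(cells, barcodes, on='cell_barcode',
                    how='left', validate='one_to_many')

# Number of rows each cell occupies after the merge.
rows_per_cell = expanded['cell_barcode'].value_counts()
```

The row count grows, but the number of unique cell barcodes does not, which is why the later check on total cells uses `nunique()` rather than the number of rows.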

Bring in progeny frequencies:

In [None]:
integrated_df = pd.merge(
    left=integrated_df,
    right=progeny_viral_barcodes,
    on=['infecting_viral_tag', 'barcoded_gene', 'viral_barcode'],
    how='left',
    validate='many_to_one'
)
display(integrated_df)

Bring in annotation of whether each infected cell contributes any progeny:

In [None]:
integrated_df = pd.merge(
    left=integrated_df,
    right=contributes_progeny,
    on=['cell_barcode'],
    how='left',
    validate='many_to_one'
)
display(integrated_df)

## Check integrated dataframe

Check that the number of total cells has not changed.

In [None]:
assert integrated_df['cell_barcode'].nunique() == \
    len(cell_annotations['cell_barcode']), \
    "Total number of cells changed"

Check that every viral barcode has a `barcoded_gene`, `frac_viral_bc_UMIs`, `freq_supernatant`, and `freq_second_infection`.

In [None]:
barcode_cols = ['viral_barcode', 'barcoded_gene', 'frac_viral_bc_UMIs',
                'freq_supernatant', 'freq_second_infection']
# The number of non-null rows must be the same in every barcode-related column
assert integrated_df[barcode_cols].notnull().sum().nunique() == 1, \
    "Mismatch in viral barcode data."

## Output integrated dataframe

In [None]:
# save CSV
integrated_df.to_csv(integrated_data_csv, index=False)