# Count viral barcodes from aligned Illumina 10X reads
This Python Jupyter notebook counts the viral barcodes from aligned Illumina 10X data and outputs the counts of each viral barcode for each cell barcode and barcoded gene into a CSV.

## Parameters for notebook
First, set the parameters for the notebook.
That should be done in the next cell, which is tagged as a `parameters` cell to enable [papermill parameterization](https://papermill.readthedocs.io/en/latest/usage-parameterize.html):

In [1]:
# parameters cell; in order for notebook to run this cell must define:
#  - input_fastq10x_bam: BAM file with aligned FASTQ 10X reads
#  - input_fastq10x_bai: BAM index file for `input_fastq10x_bam`
#  - input_viralbc_locs: CSV file giving the location of the viral barcodes
#  - input_cellbarcodes: TSV file giving valid cell barcodes
#  - output_viralbc_counts: created CSV file with the counts of each viral barcode for each cell barcode and gene

In [2]:
# Parameters
input_fastq10x_bam = "results/aligned_fastq10x/wt_virus_pilot/Aligned.sortedByCoord.out.bam"
input_fastq10x_bai = "results/aligned_fastq10x/wt_virus_pilot/Aligned.sortedByCoord.out.bam.bai"
input_viralbc_locs = "results/viral_fastq10x/viralbc_locs.csv"
input_cellbarcodes = "results/aligned_fastq10x/wt_virus_pilot/Solo.out/Gene/filtered/barcodes.tsv"
output_viralbc_counts = "results/viral_fastq10x/wt_virus_pilot_viralbc_counts.csv"
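As an illustration of how these parameters might be supplied from outside the notebook, here is a minimal sketch using `papermill.execute_notebook`. The helper `run_notebook` and the notebook file names in the commented example call are hypothetical and not part of this pipeline. The parameter names and paths mirror the cell above.

```python
# Sketch: build the parameter dictionary that papermill would inject into
# the cell tagged `parameters`. Paths mirror the parameters cell above.
sample = "wt_virus_pilot"
aligned_dir = f"results/aligned_fastq10x/{sample}"

parameters = {
    "input_fastq10x_bam": f"{aligned_dir}/Aligned.sortedByCoord.out.bam",
    "input_fastq10x_bai": f"{aligned_dir}/Aligned.sortedByCoord.out.bam.bai",
    "input_viralbc_locs": "results/viral_fastq10x/viralbc_locs.csv",
    "input_cellbarcodes": f"{aligned_dir}/Solo.out/Gene/filtered/barcodes.tsv",
    "output_viralbc_counts": f"results/viral_fastq10x/{sample}_viralbc_counts.csv",
}


def run_notebook(input_nb, output_nb, parameters):
    """Hypothetical helper: execute `input_nb` with injected parameters."""
    # Imported lazily so the dictionary above can be built without papermill.
    import papermill as pm

    return pm.execute_notebook(input_nb, output_nb, parameters=parameters)


# Example call (the notebook names here are placeholders, not real files):
# run_notebook("count_viralbc.ipynb", "results/count_viralbc_out.ipynb", parameters)
```

Building the dictionary from a single `sample` name keeps the BAM file, its index, and the output CSV consistent with one another.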

## Import Python modules

In [3]:
import pandas as pd

from plotnine import *

from pymodules.tags_and_barcodes import extract_tags

import pysam

## Read viral barcode locations

Read the viral barcode locations:

In [4]:
print(f"Reading viral barcode locations from {input_viralbc_locs}")
viralbc_locs_df = pd.read_csv(input_viralbc_locs)
viralbc_locs_df

Reading viral barcode locations from results/viral_fastq10x/viralbc_locs.csv


| | gene | start | end |
|---|---|---|---|
| 0 | fluHA | 1828 | 1843 |
| 1 | fluNA | 1551 | 1566 |
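The sketch below shows how these coordinates relate to the 0-based, half-open intervals that `pysam` expects. It assumes that `start` and `end` are 1-based and inclusive, as the indexing comment in the counting cell further down suggests. The column names `start0`, `end0`, and `length` are introduced here only for illustration.

```python
import pandas as pd

# Toy copy of the locations table shown above.
viralbc_locs_df = pd.DataFrame(
    {"gene": ["fluHA", "fluNA"], "start": [1828, 1551], "end": [1843, 1566]}
)

# Assumption: `start`/`end` are 1-based and inclusive. A 0-based half-open
# interval keeps `end` unchanged and subtracts 1 from `start`.
locs = viralbc_locs_df.assign(
    start0=lambda df: df["start"] - 1,
    end0=lambda df: df["end"],
    length=lambda df: df["end"] - df["start"] + 1,
)

print(locs)
```

Under this assumption both barcodes span 16 nucleotides, which matches the length of the viral barcodes reported later in the notebook.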


Get names of the barcoded viral genes:

In [5]:
bc_viral_genes = viralbc_locs_df['gene'].unique()

assert len(bc_viral_genes) == len(viralbc_locs_df), 'currently only support one barcode per gene'

## Get set of valid cell barcodes

In [6]:
print(f"Reading valid cell barcodes from {input_cellbarcodes}")

cellbarcodes = set(pd.read_csv(input_cellbarcodes, header=None)[0])

print(f"Read {len(cellbarcodes)} valid barcodes.")

Reading valid cell barcodes from results/aligned_fastq10x/wt_virus_pilot/Solo.out/Gene/filtered/barcodes.tsv
Read 3607 valid barcodes.


## Count viral barcodes
For each cell barcode and each viral gene, we count the number of UMIs (unique reads) supporting each viral barcode.

For each viral barcode, we parse the barcode identity for all reads that cover that barcode.
The reads are grouped by UMI and cell barcode, and the viral barcode is labeled as `ambiguous` if any site in the viral barcode lacks a majority consensus (>50%) nucleotide.
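To make the consensus rule concrete, here is a minimal sketch of a per-site majority vote over reads grouped by cell barcode and UMI. It is not the implementation of `extract_tags`. The functions `consensus_barcode` and `consensus_by_umi` and the toy reads are hypothetical, and the sketch assumes all barcodes in a group have the same length.

```python
from collections import Counter


def consensus_barcode(barcodes):
    """Return the per-site majority consensus of equal-length barcodes.

    Returns "ambiguous" if any site lacks a nucleotide at >50% frequency.
    """
    if not barcodes:
        raise ValueError("need at least one barcode")
    if len({len(bc) for bc in barcodes}) != 1:
        raise ValueError("barcodes must all have the same length")
    consensus = []
    for site_nts in zip(*barcodes):  # iterate over sites (columns)
        nt, n = Counter(site_nts).most_common(1)[0]
        if n / len(site_nts) <= 0.5:  # no strict majority at this site
            return "ambiguous"
        consensus.append(nt)
    return "".join(consensus)


def consensus_by_umi(reads):
    """Group (cell_barcode, UMI, viral_barcode) tuples, then call consensus."""
    groups = {}
    for cell_barcode, umi, viral_barcode in reads:
        groups.setdefault((cell_barcode, umi), []).append(viral_barcode)
    return {key: consensus_barcode(bcs) for key, bcs in groups.items()}


# Toy reads: UMI1 has a 2-of-3 majority at the last site; UMI2 is tied 1-1
# at the first site and so has no majority there.
reads = [
    ("CELL1", "UMI1", "ACGT"),
    ("CELL1", "UMI1", "ACGT"),
    ("CELL1", "UMI1", "ACGA"),
    ("CELL1", "UMI2", "ACGT"),
    ("CELL1", "UMI2", "TCGT"),
]
result = consensus_by_umi(reads)
print(result)
```

In this toy input the first UMI resolves to `ACGT`, while the second is labeled `ambiguous` because a 1-to-1 tie is not a majority.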
    
The output of this process is the tidy data frame `viralbc_counts`:

In [7]:
print(f"Parsing reads from {input_fastq10x_bam} (index {input_fastq10x_bai}):\n")

with pysam.AlignmentFile(input_fastq10x_bam, index_filename=input_fastq10x_bai) as bamfile:
    
    assert len(viralbc_locs_df) == viralbc_locs_df['gene'].nunique()
    counts_by_gene = []  # one data frame of counts per barcoded gene
    for tup in viralbc_locs_df.itertuples():
        print(f"Processing viral barcodes for {tup.gene}...", end=' ')

        readiterator = bamfile.fetch(contig=tup.gene,
                                     start=tup.start - 1,  # convert 1- to 0-based indexing
                                     end=tup.end,
                                     )
        gene_counts_df = (
                extract_tags(readiterator, cellbarcodes, tup.start - 1, tup.end)
                .rename(columns={'tag': 'viral_barcode'})
                [['cell_barcode', 'UMI', 'viral_barcode']]
                )
        print(f"parsed viral barcodes for {len(gene_counts_df)} UMIs.")

        # aggregate viral barcode counts by gene and cell barcode
        counts_by_gene.append(
            gene_counts_df
            .groupby(['cell_barcode', 'viral_barcode'])
            .aggregate(count=pd.NamedAgg('UMI', 'count'))
            .reset_index()
            .assign(gene=tup.gene)
            [['gene', 'cell_barcode', 'viral_barcode', 'count']]
            )

    # `DataFrame.append` was removed in pandas 2.0, so combine with `pd.concat`
    viralbc_counts = pd.concat(counts_by_gene, ignore_index=True, sort=False)

Parsing reads from results/aligned_fastq10x/wt_virus_pilot/Aligned.sortedByCoord.out.bam (index results/aligned_fastq10x/wt_virus_pilot/Aligned.sortedByCoord.out.bam.bai):

Processing viral barcodes for fluHA... parsed viral barcodes for 289 UMIs.
Processing viral barcodes for fluNA... parsed viral barcodes for 178 UMIs.
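The aggregation step in the cell above can be illustrated on toy data. This sketch assumes a per-UMI data frame with the same three columns that the cell selects from the output of `extract_tags`. The cell barcodes, UMIs, and viral barcodes are made up.

```python
import pandas as pd

# Toy per-UMI table: one row per (cell barcode, UMI) with its viral barcode.
gene_counts_df = pd.DataFrame(
    {
        "cell_barcode": ["CELL1", "CELL1", "CELL1", "CELL2"],
        "UMI": ["UMI1", "UMI2", "UMI3", "UMI1"],
        "viral_barcode": ["AAAA", "AAAA", "CCCC", "AAAA"],
    }
)

# Count UMIs per (cell barcode, viral barcode), then label with the gene.
counts = (
    gene_counts_df
    .groupby(["cell_barcode", "viral_barcode"])
    .aggregate(count=pd.NamedAgg("UMI", "count"))
    .reset_index()
    .assign(gene="fluHA")
    [["gene", "cell_barcode", "viral_barcode", "count"]]
)

print(counts)
```

Each row of `counts` gives the number of UMIs supporting one viral barcode in one cell, so the `count` column sums to the number of input rows.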


The results are now in the data frame `viralbc_counts`:

In [8]:
viralbc_counts

| | gene | cell_barcode | viral_barcode | count |
|---|---|---|---|---|
| 0 | fluHA | AAACGCTCACCGAATT | GAGACGGAAGAGTCAA | 1 |
| 1 | fluHA | AAAGTGAAGGAGATAG | CGATACAGGGATCCCG | 3 |
| 2 | fluHA | AAAGTGAAGGAGATAG | GATACAGGGATCCCAT | 1 |
| 3 | fluHA | AAAGTGAGTAACATCC | ATGCTGATTCAACTCA | 1 |
| 4 | fluHA | AAAGTGAGTAACATCC | GTGGAACGGGCCCCAT | 1 |
| ... | ... | ... | ... | ... |
| 351 | fluNA | TTGTTTGCAGCGCGTT | AGAAGCTTCTCCCCAT | 1 |
| 352 | fluNA | TTTCACACAACGAGGT | AGGAGGCGTTGCGCAT | 1 |
| 353 | fluNA | TTTCACAGTCCGTACG | TAGGGGGGACTGGAAC | 1 |
| 354 | fluNA | TTTCACATCGCACGAC | ATATCGAGATAAAAAG | 1 |


**Recall that some viral barcodes may be labeled `ambiguous`:**

In [10]:
viralbc_counts.query('viral_barcode == "ambiguous"')

| | gene | cell_barcode | viral_barcode | count |
|---|---|---|---|---|
| 198 | fluHA | TTGAGTGGTTGTAGCT | ambiguous | 1 |
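One possible way to handle such rows downstream is to measure how many UMIs they account for and then set them aside. The sketch below does this on a made-up frame with the same columns as `viralbc_counts`. It is only an illustration and is not a step this notebook performs.

```python
import pandas as pd

# Toy frame with the same columns as `viralbc_counts` (values are made up).
viralbc_counts = pd.DataFrame(
    {
        "gene": ["fluHA", "fluHA", "fluHA", "fluNA"],
        "cell_barcode": ["CELL1", "CELL1", "CELL2", "CELL1"],
        "viral_barcode": ["AAAA", "ambiguous", "CCCC", "GGGG"],
        "count": [3, 1, 1, 2],
    }
)

# Boolean mask marking rows whose barcode could not be called.
is_ambiguous = viralbc_counts["viral_barcode"] == "ambiguous"

# Fraction of all UMIs that fall in ambiguous rows.
frac_ambiguous = (
    viralbc_counts.loc[is_ambiguous, "count"].sum() / viralbc_counts["count"].sum()
)

# Keep only rows with a called barcode.
valid_counts = viralbc_counts[~is_ambiguous].reset_index(drop=True)

print(f"Fraction of UMIs with ambiguous barcodes: {frac_ambiguous:.3f}")
print(valid_counts)
```

Reporting the ambiguous fraction before filtering makes it easier to notice if consensus calling is failing for an unexpectedly large share of UMIs.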
