# Extract viral barcodes from aligned 10x transcriptomic data
This Python Jupyter notebook uses the aligned 10x transcriptomic data to tally viral barcodes for each 10x cell barcode and UMI.
It does this only for the **valid** cell barcodes, and uses the error-corrected cell barcodes and UMIs reported in the BAM file created by `STARsolo`.

Import Python modules:

In [None]:
import pandas as pd

from pymodules.tags_and_barcodes import extract_tags

import pysam

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [None]:
bam = snakemake.input.bam
bai = snakemake.input.bai
cell_barcodes = snakemake.input.cell_barcodes
viral_bc_locs = snakemake.input.viral_bc_locs
viral_barcodes_csv = snakemake.output.viral_barcodes_csv
expt = snakemake.wildcards.expt

Read the viral tag locations:

In [None]:
print(f"Reading viral tag locations from {viral_bc_locs}")
bc_locs_df = pd.read_csv(viral_bc_locs)
display(bc_locs_df)

if len(bc_locs_df) != bc_locs_df['gene'].nunique():
    raise ValueError('code assumes at most one barcode per gene')
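The parsing cell further down subtracts 1 from `start` before querying the BAM file, which implies that `start` and `end` in this table are 1-based and inclusive. Below is a minimal sketch of that conversion on a made-up sequence. The helper `to_zero_based_half_open` and the example coordinates are hypothetical and not part of this notebook.

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    # Only the start shifts: the inclusive 1-based end equals the
    # exclusive 0-based end.
    return start - 1, end


# Made-up reference sequence; the barcode occupies 1-based positions 4 to 9.
reference = "AAACGTACGTTT"
start0, end0 = to_zero_based_half_open(4, 9)
barcode_region = reference[start0:end0]
print(barcode_region)
```

The slice has length `end - start + 1`, which is why only `start` needs adjusting when passing the coordinates to a 0-based, half-open interface.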

Get the set of valid cell barcodes:

In [None]:
print(f"Reading valid cell barcodes from {cell_barcodes}")
cell_barcode_set = set(pd.read_csv(cell_barcodes, header=None)[0])
print(f"Read {len(cell_barcode_set)} valid barcodes.")
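As a rough illustration of how such a set can be used to filter reads, the sketch below keeps only reads with a corrected cell barcode in the valid set and a usable UMI. It assumes the corrected cell barcode and UMI are stored in the `CB` and `UB` BAM tags and that `-` marks a barcode or UMI that could not be corrected. Both are assumptions about the `STARsolo` output, and this is not necessarily how `extract_tags` works. `get_valid_cell_and_umi` and `FakeRead` are hypothetical names.

```python
def get_valid_cell_and_umi(read, valid_cell_barcodes):
    """Return `(cell_barcode, UMI)` for `read`, or `None` to skip the read."""
    # Assumed tag names: CB = corrected cell barcode, UB = corrected UMI.
    if not (read.has_tag('CB') and read.has_tag('UB')):
        return None
    cell_barcode = read.get_tag('CB')
    umi = read.get_tag('UB')
    # Assumed placeholder: '-' (or an empty string) means "not corrected".
    if cell_barcode not in valid_cell_barcodes or umi in {'', '-'}:
        return None
    return cell_barcode, umi


class FakeRead:
    """Stand-in exposing the `has_tag` / `get_tag` methods of a pysam read."""

    def __init__(self, tags):
        self._tags = tags

    def has_tag(self, tag):
        return tag in self._tags

    def get_tag(self, tag):
        return self._tags[tag]


valid = {'AAAC', 'GGGT'}
kept = get_valid_cell_and_umi(FakeRead({'CB': 'AAAC', 'UB': 'TTGA'}), valid)
skipped = get_valid_cell_and_umi(FakeRead({'CB': 'CCCC', 'UB': 'TTGA'}), valid)
print(kept, skipped)
```

Using a `set` for the valid barcodes keeps each membership test at constant time, which matters when iterating over millions of reads.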

Now we get the viral barcodes.
Specifically, we parse the BAM file and, for each read that maps to a viral gene and has a valid cell barcode and UMI, try to determine the viral barcode identity.
The barcodes are grouped by UMI, gene, and cell barcode; the barcode is labeled `ambiguous` if no single barcode identity makes up more than 50% of the reads for a UMI / gene in a cell.

In [None]:
print(f"Parsing viral barcodes from {bam} (index {bai}):\n")

columns = ['gene', 'cell_barcode', 'UMI', 'viral_barcode']
gene_dfs = []  # one data frame of viral barcodes per gene

with pysam.AlignmentFile(bam, index_filename=bai) as bamfile:
    for tup in bc_locs_df.itertuples():
        # `end=''` keeps the next message on the same line
        print(f"Parsing viral barcodes for {tup.gene}... ", end='')
        readiterator = bamfile.fetch(contig=tup.gene,
                                     start=tup.start - 1,  # convert 1- to 0-based indexing
                                     end=tup.end,
                                     )
        gene_bcs_by_umi = (
            extract_tags(readiterator, cell_barcode_set, tup.start - 1, tup.end)
            .rename(columns={'tag': 'viral_barcode'})
            [['cell_barcode', 'UMI', 'viral_barcode']]
            )
        print(f"parsed viral barcodes for {len(gene_bcs_by_umi)} UMIs.")
        gene_dfs.append(gene_bcs_by_umi.assign(gene=tup.gene)[columns])

# `DataFrame.append` was removed in pandas 2.0, so concatenate once at the end
bcs_by_umi = (pd.concat(gene_dfs, ignore_index=True)
              if gene_dfs else pd.DataFrame(columns=columns))

if len(bcs_by_umi) != len(bcs_by_umi
                          [['gene', 'cell_barcode', 'UMI']]
                          .drop_duplicates()
                          ):
    raise ValueError('not unique viral barcodes for each gene / cell / UMI')
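The majority rule described before the parsing cell can be sketched as follows. This is only an illustration on made-up reads for a single gene, not the implementation of `extract_tags`, whose internals are not shown in this notebook. The function `call_majority_barcode` and the toy data are hypothetical.

```python
import pandas as pd


def call_majority_barcode(reads_df):
    """Call one viral barcode per cell barcode / UMI by strict majority.

    `reads_df` has one row per read with columns `cell_barcode`, `UMI`,
    and `viral_barcode`. A group is labeled 'ambiguous' unless a single
    barcode accounts for more than 50% of its reads.
    """
    group_cols = ['cell_barcode', 'UMI']
    counts = (reads_df
              .groupby(group_cols + ['viral_barcode'])
              .size()
              .reset_index(name='n_reads'))
    totals = counts.groupby(group_cols)['n_reads'].transform('sum')
    counts['frac'] = counts['n_reads'] / totals
    # Keep the most abundant barcode in each cell barcode / UMI group.
    top = (counts
           .sort_values('n_reads', ascending=False)
           .drop_duplicates(group_cols)
           .copy())
    top['viral_barcode'] = top['viral_barcode'].where(top['frac'] > 0.5,
                                                      'ambiguous')
    return (top[group_cols + ['viral_barcode']]
            .sort_values(group_cols)
            .reset_index(drop=True))


toy_reads = pd.DataFrame(
    [('cellA', 'umi1', 'AAA'), ('cellA', 'umi1', 'AAA'), ('cellA', 'umi1', 'CCC'),
     ('cellA', 'umi2', 'AAA'), ('cellA', 'umi2', 'CCC'),
     ('cellB', 'umi1', 'GGG')],
    columns=['cell_barcode', 'UMI', 'viral_barcode'])
calls = call_majority_barcode(toy_reads)
print(calls)
```

In this toy data `cellA` / `umi1` is called as `AAA` (2 of 3 reads), while `cellA` / `umi2` is `ambiguous` because an even 1:1 split does not exceed 50%. Each cell barcode / UMI appears once in the result, which is the property the uniqueness check above enforces per gene.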

Write the viral barcodes to the output CSV file:

In [None]:
print(f"Writing viral barcodes to {viral_barcodes_csv}")

bcs_by_umi.to_csv(viral_barcodes_csv,
                  index=False,
                  compression='gzip')