# Correct transcript viral barcodes within each cell
This Python Jupyter notebook corrects viral barcodes within each cell. Viral barcodes are input as UMI count data extracted from the transcriptome. Each viral barcode is associated with a cell barcode and a gene.

The notebook uses `umi_tools` to correct the viral barcodes within each cell. For each cell, `umi_tools` outputs groups of viral barcodes, each represented by a single corrected barcode. The notebook then associates each original viral barcode with its corrected viral barcode and aggregates the counts for each corrected viral barcode. Finally, the notebook outputs the corrected viral barcode counts.
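
The association and aggregation steps described above can be sketched in plain Python. This is a minimal illustration rather than the notebook's actual code: `correct_counts` is a hypothetical helper, the counts are a few rows for one cell from the input table, and `groups` is hand-written to mimic the shape of the clustering output (a list of groups of byte strings, assuming the first entry of each group is the corrected barcode).

```python
from collections import Counter


def correct_counts(original_counts, groups):
    """Map each original barcode to its corrected barcode and sum the counts.

    original_counts: dict of barcode (str) -> UMI count for one cell and gene
    groups: list of lists of bytes; the first entry of each group is
        assumed to be the corrected barcode for that group
    """
    # Build a lookup from every original barcode to its corrected barcode.
    lookup = {}
    for group in groups:
        corrected = group[0].decode()
        for barcode in group:
            lookup[barcode.decode()] = corrected

    # Aggregate the counts of all barcodes that share a corrected barcode.
    corrected_counts = Counter()
    for barcode, n in original_counts.items():
        corrected_counts[lookup[barcode]] += n
    return lookup, dict(corrected_counts)


# Illustrative input for a single cell (hypothetical grouping).
original_counts = {
    "AGAATCGACACATGTC": 14,
    "AGCCATAGTCTAAAGG": 7,
    "AGCCATAGTCTACAGG": 1,
}
groups = [
    [b"AGAATCGACACATGTC"],
    [b"AGCCATAGTCTAAAGG", b"AGCCATAGTCTACAGG"],
]

lookup, corrected = correct_counts(original_counts, groups)
print(corrected)
```

In this sketch the low-count barcode `AGCCATAGTCTACAGG` is folded into `AGCCATAGTCTAAAGG`, so the two counts are reported as a single corrected barcode.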

**Notes about `umi_tools`**
* The notebook uses the directional adjacency method, which has [been demonstrated on simulated data](https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/) to estimate the true number of UMIs more accurately than other heuristics.
* Sequences must be input as bytes. See the explanation of byte string literals here: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
* The corrected barcode is returned as the first barcode in each group list. See the `umi_tools` API documentation: https://umi-tools.readthedocs.io/en/latest/API.html
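
To make the first two notes concrete, the sketch below encodes barcodes as bytes and checks the directional adjacency criterion for pairs of barcodes. It is a simplified, assumption-laden illustration and not the `umi_tools` implementation: `hamming` and `directional_edge` are hypothetical helpers, and the rule used (connect a higher-count barcode to a lower-count one when they differ by at most `threshold` mismatches and `count_parent >= 2 * count_child - 1`) is the criterion as it is commonly stated for the directional method.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def directional_edge(parent: bytes, child: bytes, counts, threshold=1) -> bool:
    """True if `child` would be merged into `parent` under the
    directional criterion (simplified sketch)."""
    return (
        hamming(parent, child) <= threshold
        and counts[parent] >= 2 * counts[child] - 1
    )


# Barcodes start as ordinary strings and are encoded to bytes before use.
str_counts = {
    "GAACCCGATGGGGAAT": 25,
    "GAACCCGAAGGGGAAT": 1,
    "TAAGGTATAATTCTAG": 4,
    "TAAGGTAAAATAATAG": 1,
}
counts = {bc.encode(): n for bc, n in str_counts.items()}

# One mismatch and a much higher parent count: merged in this direction only.
merge_forward = directional_edge(b"GAACCCGATGGGGAAT", b"GAACCCGAAGGGGAAT", counts)
merge_reverse = directional_edge(b"GAACCCGAAGGGGAAT", b"GAACCCGATGGGGAAT", counts)

# Three mismatches exceed the threshold, so these stay separate.
merge_distant = directional_edge(b"TAAGGTATAATTCTAG", b"TAAGGTAAAATAATAG", counts)

print(merge_forward, merge_reverse, merge_distant)
```

The asymmetry of the count condition is what makes the method directional: a rare barcode can be absorbed by an abundant neighbor, but not the other way around.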

## Notebook setup

Import Python modules:

In [1]:
import gzip

from IPython.display import display

from dms_variants.constants import CBPALETTE

import numpy as np

import pandas as pd

import plotnine as p9

from umi_tools import UMIClusterer

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [2]:
viral_bc_by_cell = 'results/viral_fastq10x/scProgenyProduction_trial2_viral_bc_by_cell.csv.gz'
expt = 'scProgenyProduction_trial2'
viral_bc_in_transcripts_corrected_csv = 'scProgenyProduction_trial2_viral_bc_by_cell_corrected.csv.gz'

Set plot style:

In [3]:
p9.theme_set(p9.theme_classic())

Import viral barcode count data:

In [4]:
viral_bc_df = pd.read_csv(gzip.open(viral_bc_by_cell))
display(viral_bc_df)

| Unnamed: 0 | gene | cell_barcode | viral_barcode | count |
|---|---|---|---|---|
| 0 | fluHA | AAACCCAAGTAGGTTA | ACGTTATTGATTGAGA | 1 |
| 1 | fluHA | AAACCCAAGTAGGTTA | AGAATCGACACATGTC | 14 |
| 2 | fluHA | AAACCCAAGTAGGTTA | AGCCATAGTCTAAAGG | 7 |
| 3 | fluHA | AAACCCAAGTAGGTTA | AGCCATAGTCTACAGG | 1 |
| 4 | fluHA | AAACCCAAGTAGGTTA | AGGATGATTTTTTTAT | 1 |
| ... | ... | ... | ... | ... |
| 94893 | fluNA | TTTGTTGTCTAGGAAA | GAACCCGAAGGGGAAT | 1 |
| 94894 | fluNA | TTTGTTGTCTAGGAAA | GAACCCGATGGGGAAT | 25 |
| 94895 | fluNA | TTTGTTGTCTAGGAAA | TAAGGTAAAATAATAG | 1 |
| 94896 | fluNA | TTTGTTGTCTAGGAAA | TAAGGTATAATTCTAG | 4 |


## Data processing