# Correct viral barcodes to reduce replication, PCR, and sequencing errors
This Python Jupyter notebook uses the `UMI_tools` directional adjacency method to correct viral barcodes that likely arose from errors during replication or sequencing library preparation.

## Notes about UMI_tools
* This notebook uses the directional adjacency method, which has been shown on simulated data to give more accurate results than other heuristics.
* Barcode sequences must be passed as `bytes` rather than `str`. See this explanation of the `b` prefix: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
* The corrected barcode is returned as the first barcode in each group list. See the umi_tools API documentation: https://umi-tools.readthedocs.io/en/latest/API.html
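
To make the notes above concrete, here is a simplified pure-Python sketch of the directional adjacency idea. It is not the `UMI_tools` implementation. It assumes the commonly described rule that a more abundant barcode absorbs a neighbor within `threshold` mismatches when `count(parent) >= 2 * count(child) - 1`. The names `hamming`, `directional_groups`, and `toy_counts` are hypothetical and exist only for this illustration.

```python
from collections import deque


def hamming(a: bytes, b: bytes) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts: dict, threshold: int = 1) -> list:
    """Toy directional-adjacency clustering (illustrative only, O(n^2))."""
    # Visit barcodes from most to least abundant (ties broken by sequence).
    ordered = sorted(counts, key=lambda bc: (-counts[bc], bc))

    # Assumed rule: directed edge parent -> child when the two barcodes are
    # within `threshold` mismatches and count(parent) >= 2*count(child) - 1.
    children = {bc: [] for bc in ordered}
    for parent in ordered:
        for child in ordered:
            if (parent != child
                    and hamming(parent, child) <= threshold
                    and counts[parent] >= 2 * counts[child] - 1):
                children[parent].append(child)

    groups, assigned = [], set()
    for seed in ordered:
        if seed in assigned:
            continue
        # Breadth-first walk collects every barcode reachable from the seed.
        group, queue = [seed], deque([seed])
        assigned.add(seed)
        while queue:
            for child in children[queue.popleft()]:
                if child not in assigned:
                    assigned.add(child)
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups


# Keys are bytes, values are counts, mirroring the input described above.
toy_counts = {b"ATAT": 100.0, b"ATAA": 3.0, b"GGCC": 50.0, b"ATTA": 1.0}
toy_groups = directional_groups(toy_counts)
print(toy_groups)
```

On this toy input, `ATAA` is one mismatch from the abundant `ATAT`, and `ATTA` is one mismatch from `ATAA`, so all three fall into one group led by `ATAT`, while `GGCC` stays on its own. The first element of each group plays the role of the corrected barcode.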

Import Python modules:

In [1]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import gzip

import pandas as pd

import plotnine as p9

from umi_tools import UMIClusterer

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [2]:
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial2_viral_bc_in_progeny_freq.csv.gz'

Import barcode frequency data

In [3]:
viral_bc_df = pd.read_csv(gzip.open(viral_bc_in_progeny_freq_csv))
display(viral_bc_df)

Unnamed: 0,source,tag,gene,barcode,mean_count
0,second_infection,syn,fluHA,AAAAAAAGTAAATCTT,28.0
1,second_infection,syn,fluHA,AAAAAAATCTTAATAA,1.0
2,second_infection,syn,fluHA,AAAAAAATCTTAATGA,50.0
3,second_infection,syn,fluHA,AAAAAACAATGACTAA,0.5
4,second_infection,syn,fluHA,AAAAAACCCAATTATT,0.5
...,...,...,...,...,...
149148,supernatant,wt,fluNA,TTTTGTTAGCGTCCTG,145.5
149149,supernatant,wt,fluNA,TTTTTTAGAAAACGTA,0.5
149150,supernatant,wt,fluNA,TTTTTTAGAAAACGTC,102.0
149151,supernatant,wt,fluNA,TTTTTTCACTGCCATT,0.5


Cluster barcodes within each sample
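
As a small illustration of how grouped output becomes a lookup table, the sketch below uses plain Python on a hand-written toy result. It assumes, as stated in the notes, that each group is a list of `bytes` with the corrected barcode first. The barcodes are copied from this notebook's displayed output, but the names `toy_groups` and `lookup` and the grouping itself are hypothetical.

```python
# Toy clustering output: each group is a list of bytes,
# with the corrected barcode as the first element (assumed shape).
toy_groups = [
    [b"CATTTACGCTGAATTG", b"CATTTTCGCTGAATTG", b"CATTTACGCTGAATAT"],
    [b"GTCAATCAAGATAAGA"],
]

# Map every original barcode (decoded back to str) to its corrected barcode.
lookup = {
    original.decode("utf-8"): group[0].decode("utf-8")
    for group in toy_groups
    for original in group
}

n_original = len(lookup)
n_corrected = len(set(lookup.values()))
print(f"{n_original - n_corrected} barcodes were corrected.")
```

Each corrected barcode also maps to itself, which is why the lookup table has one row per original barcode rather than one row per group.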

In [22]:
clusterer = UMIClusterer(cluster_method="directional")

for (source, tag, gene), df in (viral_bc_df
                                .groupby(['source',
                                          'tag',
                                          'gene'])):
    n_unique_bc = df['barcode'].nunique()
    print(f"There are {n_unique_bc} unique barcodes in the raw data for {source} {tag} {gene}")
    
    # Build the {barcode: count} dict that umi_tools clustering requires.
    # Keys must be bytes, so encode each barcode string.
    byte_dict = {barcode.encode("utf-8"): float(count)
                 for barcode, count in zip(df['barcode'], df['mean_count'])}
    
    # Cluster barcodes
    bc_groups = clusterer(byte_dict, threshold=1)
    groups_df = pd.DataFrame(bc_groups)
    groups_df = groups_df.stack().str.decode('utf-8').unstack() # Convert bytes back to string
    groups_df = groups_df.rename(columns={0:'corrected_bc'})
    groups_df = groups_df.set_index('corrected_bc', drop=False)
    n_corrected_bc = groups_df['corrected_bc'].nunique()
    print(f"{n_unique_bc - n_corrected_bc} barcodes were corrected.")
    print(f"There are {n_corrected_bc} corrected barcodes for {source} {tag} {gene}\n")
    
    # Generate lookup table for this sample
    lookup_df = (groups_df.melt(ignore_index=False,
                                value_name='original_barcode')
                 ['original_barcode']
                 .dropna()
                 .reset_index())
    lookup_df['source'] = source
    lookup_df['tag'] = tag
    lookup_df['gene'] = gene
    display(lookup_df)

There are 23862 unique barcodes in the raw data for second_infection syn fluHA
7514 barcodes were corrected.
There are 16348 corrected barcodes for second_infection syn fluHA



Unnamed: 0,corrected_bc,original_barcode,source,tag,gene
0,GTCAATCAAGATAAGA,GTCAATCAAGATAAGA,second_infection,syn,fluHA
1,CATTTACGCTGAATTG,CATTTACGCTGAATTG,second_infection,syn,fluHA
2,AGCTTGGCATGAAAGA,AGCTTGGCATGAAAGA,second_infection,syn,fluHA
3,CTTAAAAAGATTCCAG,CTTAAAAAGATTCCAG,second_infection,syn,fluHA
4,GGTCACCGTGAGAAAT,GGTCACCGTGAGAAAT,second_infection,syn,fluHA
...,...,...,...,...,...
23857,CATTTACGCTGAATTG,CATTTTCGCTGAATTG,second_infection,syn,fluHA
23858,CATTTACGCTGAATTG,CATTTACGCTGAATAT,second_infection,syn,fluHA
23859,CATTTACGCTGAATTG,CATTGACGCTGAATTG,second_infection,syn,fluHA
23860,CATTTACGCTGAATTG,CATTTCCGCTGAATTG,second_infection,syn,fluHA


There are 24756 unique barcodes in the raw data for second_infection syn fluNA
5122 barcodes were corrected.
There are 19634 corrected barcodes for second_infection syn fluNA



Unnamed: 0,corrected_bc,original_barcode,source,tag,gene
0,TGTTATATTTGTATTG,TGTTATATTTGTATTG,second_infection,syn,fluNA
1,AAACACGGGTGAAATG,AAACACGGGTGAAATG,second_infection,syn,fluNA
2,ACAATTTCAGTATCAA,ACAATTTCAGTATCAA,second_infection,syn,fluNA
3,GTGTGGAGGTTTTTGA,GTGTGGAGGTTTTTGA,second_infection,syn,fluNA
4,GCATAGTGCGAACGTT,GCATAGTGCGAACGTT,second_infection,syn,fluNA
...,...,...,...,...,...
24751,TGTTATATTTGTATTG,TGTTATATTTGTATAT,second_infection,syn,fluNA
24752,TGTTATATTTGTATTG,GGTTATATTTGTAGTG,second_infection,syn,fluNA
24753,TGTTATATTTGTATTG,TGTTATATTTGGAGGG,second_infection,syn,fluNA
24754,TGTTATATTTGTATTG,TGTGATAGGTGTATTG,second_infection,syn,fluNA


There are 28801 unique barcodes in the raw data for second_infection wt fluHA


KeyboardInterrupt: 