# Correct viral barcodes to reduce replication, PCR, and sequencing errors
This Python Jupyter notebook uses the `UMI_tools` directional adjacency method to correct viral barcodes that likely arose from errors during viral replication or sequencing library preparation.

## Notes about UMI_tools
* The notebook uses the directional adjacency method, which has been shown on simulated data to give more accurate results than the other heuristics.
* Sequences must be passed as `bytes`, not `str`. For an explanation of the `b` prefix, see: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
* The corrected barcode is returned as the first barcode in each group list. See the `UMI_tools` API documentation: https://umi-tools.readthedocs.io/en/latest/API.html
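
The three notes above can be illustrated without calling `UMI_tools` itself. The following is a minimal pure-Python sketch, not the library's implementation. `hamming` and `directional_edges` are hypothetical helper names, and the toy barcodes, counts, and frequencies are invented. The edge rule (link barcode A to barcode B when they differ by at most `threshold` bases and count(A) >= 2 * count(B) - 1) follows the description in the UMI-tools publication. The `groups` list is written by hand in the representative-first layout that the API documentation describes.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of mismatched positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(counts, threshold=1):
    """Directed edges (a, b) under the directional adjacency rule (sketch only)."""
    return [
        (a, b)
        for a in counts
        for b in counts
        if a != b
        and hamming(a, b) <= threshold
        and counts[a] >= 2 * counts[b] - 1
    ]


# Note 2: barcodes are str in the dataframe, so encode them to bytes first.
toy_counts = {"AAAA": 100, "AAAT": 3, "TTTT": 50}  # invented toy data
byte_counts = {bc.encode("utf-8"): n for bc, n in toy_counts.items()}

# Note 1: only the abundant barcode points at its rare one-mismatch neighbour.
edges = directional_edges(byte_counts)
print(edges)

# The same rule applied to fractional inputs (invented frequencies).
toy_freqs = {b"AAAA": 0.0003, b"AAAT": 0.0000008}
freq_edges = directional_edges(toy_freqs)
print(freq_edges)

# Note 3: the first barcode of each group is taken as the corrected barcode.
# Hand-written groups in the shape the API documentation describes.
groups = [[b"AAAA", b"AAAT"], [b"TTTT"]]
raw_to_corrected = {
    member.decode("utf-8"): group[0].decode("utf-8")
    for group in groups
    for member in group
}
print(raw_to_corrected)
```

The rule was stated for integer read counts. With fractional inputs, `2 * count(B) - 1` is non-positive whenever count(B) is at most 0.5, so in this sketch two neighbouring barcodes are linked in both directions regardless of their relative abundance. It may be worth checking whether the library behaves the same way when given frequencies rather than counts.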

Import Python modules:

In [1]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import gzip

import pandas as pd

import plotnine as p9

from umi_tools import UMIClusterer

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [2]:
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial2_viral_bc_in_progeny_freq.csv.gz'
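
Under Snakemake's notebook integration, a `snakemake` object is injected into the notebook namespace. A minimal sketch of reading the input path from that object, with a fallback for interactive runs, might look like the following. The input name `viral_bc_in_progeny_freq_csv` on `snakemake.input` is an assumption and is not taken from the workflow's rule.

```python
default_csv = ('results/viral_progeny/'
               'scProgenyProduction_trial2_viral_bc_in_progeny_freq.csv.gz')

try:
    # `snakemake` exists only when the notebook is executed by Snakemake.
    # The named input below is hypothetical.
    viral_bc_in_progeny_freq_csv = snakemake.input.viral_bc_in_progeny_freq_csv
except NameError:
    # Interactive run outside Snakemake: fall back to the hard-coded path.
    viral_bc_in_progeny_freq_csv = default_csv

print(viral_bc_in_progeny_freq_csv)
```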

Import barcode frequency data:

In [3]:
# pandas infers gzip compression from the .gz extension and closes the file itself
viral_bc_df = pd.read_csv(viral_bc_in_progeny_freq_csv)
display(viral_bc_df)

Unnamed: 0,source,tag,gene,barcode,mean_freq
0,second_infection,syn,fluHA,AAAAAAAGTAAATCTT,4.967883e-05
1,second_infection,syn,fluHA,AAAAAAATCTTAATAA,1.780615e-06
2,second_infection,syn,fluHA,AAAAAAATCTTAATGA,8.840638e-05
3,second_infection,syn,fluHA,AAAAAACAATGACTAA,9.349045e-07
4,second_infection,syn,fluHA,AAAAAACCCAATTATT,9.349045e-07
...,...,...,...,...,...
149148,supernatant,wt,fluNA,TTTTGTTAGCGTCCTG,3.000532e-04
149149,supernatant,wt,fluNA,TTTTTTAGAAAACGTA,7.892971e-07
149150,supernatant,wt,fluNA,TTTTTTAGAAAACGTC,1.610166e-04
149151,supernatant,wt,fluNA,TTTTTTCACTGCCATT,7.892971e-07


Cluster barcodes within each sample:

In [4]:
clusterer = UMIClusterer(cluster_method="directional")

for (source, tag, gene), df in (viral_bc_df
                                .groupby(['source',
                                          'tag',
                                          'gene'])):
    n_unique_bc = df['barcode'].nunique()
    print(f"There are {n_unique_bc} unique barcodes in the raw data for {source} {tag} {gene}")
    
    # Convert the dataframe to a {barcode: mean_freq} dict, the input type
    # required for umi_tools clustering.
    viral_bc_dict = df.set_index('barcode')['mean_freq'].to_dict()

    # Encode barcode strings as bytes, the key type required for umi_tools clustering.
    byte_dict = {key.encode("utf-8"): float(value)
                 for key, value in viral_bc_dict.items()}
    
    # Cluster barcodes
    bc_groups = clusterer(byte_dict, threshold=1)
    groups_df = pd.DataFrame(bc_groups)
    groups_df = groups_df.stack().str.decode('utf-8').unstack() # Convert bytes back to string
    groups_df = groups_df.rename(columns={0:'corrected_bc'})
    groups_df = groups_df.set_index('corrected_bc', drop=False)
#     groups_df['raw_bc'] = groups_df.to_numpy().tolist()
#     groups_df = groups_df[['corrected_bc','raw_bc']]
    n_corrected_bc = groups_df['corrected_bc'].nunique()
    print(f"{n_unique_bc - n_corrected_bc} barcodes were corrected.")
    print(f"There are {n_corrected_bc} corrected barcodes for {source} {tag} {gene}\n")
    

There are 23862 unique barcodes in the raw data for second_infection syn fluHA
7528 barcodes were corrected.
There are 16334 corrected barcodes for second_infection syn fluHA

There are 24756 unique barcodes in the raw data for second_infection syn fluNA
5126 barcodes were corrected.
There are 19630 corrected barcodes for second_infection syn fluNA

There are 28801 unique barcodes in the raw data for second_infection wt fluHA
8573 barcodes were corrected.
There are 20228 corrected barcodes for second_infection wt fluHA

There are 22847 unique barcodes in the raw data for second_infection wt fluNA
5697 barcodes were corrected.
There are 17150 corrected barcodes for second_infection wt fluNA

There are 13478 unique barcodes in the raw data for supernatant syn fluHA
5598 barcodes were corrected.
There are 7880 corrected barcodes for supernatant syn fluHA

There are 11179 unique barcodes in the raw data for supernatant syn fluNA
5555 barcodes were corrected.
There are 5624 corrected barcod