# Correct viral barcodes to reduce replication, PCR, and sequencing errors
This Python Jupyter notebook uses the `UMI_tools` directional adjacency method to correct viral barcodes that likely arose from errors during viral replication or sequencing library preparation.

## Notes about UMI_tools
* This notebook uses the directional adjacency method, which has been shown on simulated data to give more accurate results than the other `UMI_tools` heuristics.
* Barcode sequences must be passed as `bytes` (for example `b'ACGT'`), not `str`. See the explanation of the `b` prefix here: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
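
To make the two notes above concrete, here is a minimal pure-Python sketch of the directional adjacency rule as it is usually described: a barcode is linked to a neighbour within the mismatch threshold when its count is at least `2 * neighbour_count - 1`. The functions `hamming` and `directional_clusters` and the toy counts are illustrative assumptions, not the `UMI_tools` implementation, which is more optimized and may differ in details such as tie-breaking.

```python
from collections import deque
from itertools import combinations


def hamming(a: bytes, b: bytes) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(counts: dict, threshold: int = 1) -> list:
    """Group barcodes with a simplified directional adjacency rule.

    An edge parent -> child exists when the two barcodes are within
    `threshold` mismatches and count(parent) >= 2 * count(child) - 1.
    """
    # Build the directed adjacency list
    adjacency = {bc: [] for bc in counts}
    for a, b in combinations(counts, 2):
        if hamming(a, b) > threshold:
            continue
        if counts[a] >= 2 * counts[b] - 1:
            adjacency[a].append(b)
        if counts[b] >= 2 * counts[a] - 1:
            adjacency[b].append(a)

    # Visit barcodes from most to least abundant; each unassigned barcode
    # seeds a cluster containing everything reachable from it.
    clusters, assigned = [], set()
    for seed in sorted(counts, key=lambda bc: (-counts[bc], bc)):
        if seed in assigned:
            continue
        cluster, queue = [seed], deque([seed])
        assigned.add(seed)
        while queue:
            for child in adjacency[queue.popleft()]:
                if child not in assigned:
                    assigned.add(child)
                    cluster.append(child)
                    queue.append(child)
        clusters.append(cluster)
    return clusters


# Keys are bytes: either encode a str or use a b'...' literal
toy_counts = {
    "ACGT".encode("utf-8"): 100,  # abundant parent barcode
    b"ACGA": 3,                   # one mismatch from ACGT, likely an error
    b"TTTT": 50,                  # unrelated barcode
    b"TTTA": 40,                  # one mismatch from TTTT, too abundant to merge
}
clusters = directional_clusters(toy_counts)
# clusters == [[b'ACGT', b'ACGA'], [b'TTTT'], [b'TTTA']]
```

The rule is written for read counts. In this sketch, values below 1 (such as frequencies) make `2 * count - 1` negative, so any two barcodes within the mismatch threshold are merged regardless of their relative abundance.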

Import Python modules:

In [6]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import gzip

import pandas as pd

import plotnine as p9

from umi_tools import UMIClusterer

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [4]:
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial1_viral_bc_in_progeny_freq.csv.gz'
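
When the notebook is run through Snakemake's notebook integration, a `snakemake` object is injected into the namespace. The sketch below shows one way to read the input path from it while still allowing interactive runs. The named input `viral_bc_in_progeny_freq_csv` and the constant `DEFAULT_CSV` are assumptions for illustration and would need to match the actual rule.

```python
# Fallback used when the notebook runs outside Snakemake (path from the cell above)
DEFAULT_CSV = 'results/viral_progeny/scProgenyProduction_trial1_viral_bc_in_progeny_freq.csv.gz'

try:
    # `snakemake` only exists when Snakemake executes the notebook;
    # the input name is hypothetical and must match the rule definition.
    viral_bc_in_progeny_freq_csv = snakemake.input.viral_bc_in_progeny_freq_csv
except NameError:
    viral_bc_in_progeny_freq_csv = DEFAULT_CSV
```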

Import barcode frequency data:

In [17]:
viral_bc_df = pd.read_csv(gzip.open(viral_bc_in_progeny_freq_csv))
display(viral_bc_df)

Unnamed: 0,source,tag,gene,barcode,mean_freq
0,second_infection,syn,fluHA,AAAAAACGAAGGGATT,7.869666e-07
1,second_infection,syn,fluHA,AAAAAACGAATGGATC,8.755249e-07
2,second_infection,syn,fluHA,AAAAAACGAATGGATT,3.381641e-04
3,second_infection,syn,fluHA,AAAAAATATTCATACG,7.869666e-07
4,second_infection,syn,fluHA,AAAAAATCGGTAGAGG,7.869666e-07
...,...,...,...,...,...
103158,supernatant,wt,fluNA,TTTTTACTTTACGAGC,8.823851e-07
103159,supernatant,wt,fluNA,TTTTTCGTAAAACTAT,1.182396e-04
103160,supernatant,wt,fluNA,TTTTTGACTCGAAGTA,4.362519e-04
103161,supernatant,wt,fluNA,TTTTTGGAATACGCAA,1.226515e-04


Cluster barcodes within each sample, where a sample is a unique combination of `source`, `tag`, and `gene`:

In [99]:
clusterer = UMIClusterer(cluster_method="directional")

for (source, tag, gene), df in viral_bc_df.groupby(['source', 'tag', 'gene']):
    # Map barcode -> mean frequency for the first 100 barcodes of this sample
    viral_bc_dict = (df[['barcode', 'mean_freq']]
                     .head(100)
                     .set_index('barcode')
                     ['mean_freq']
                     .to_dict())

    # UMIClusterer expects barcodes as bytes, so encode the string keys
    byte_dict = {barcode.encode('utf-8'): float(freq)
                 for barcode, freq in viral_bc_dict.items()}

    # Cluster barcodes that differ by at most one mismatch
    bc_groups = clusterer(byte_dict, threshold=1)

    # One row per cluster; shorter clusters are padded with NaN
    groups_df = pd.DataFrame(bc_groups)
    groups_df = groups_df.stack().str.decode('utf-8').unstack()  # bytes -> str
    groups_df = groups_df.rename(columns={0: 'corrected_bc'})
    groups_df = groups_df.set_index('corrected_bc')
    groups_df['original_bc'] = groups_df.values.tolist()
    display(groups_df['original_bc'])

corrected_bc
AAAACGTTGTTCAAAG    [AAAACGTTGTTTAAAG, AAAACGTTGTGCAAAG, AAAACGTTG...
AAACCAAGTAGGCAGA    [AAAACAAGTAGGCAGA, nan, nan, nan, nan, nan, na...
AAAACCAAAAGCTGTT    [AAAACCTAAAGCTGTT, AAAACCAAAGGCTGTT, AAAACTAAA...
AAAATCGTTTTGAATG    [AAAATCGTTTCGAATG, AAAATCGTTTTGAATT, nan, nan,...
AAAAAACGAATGGATT    [AAAAAACGAATGGATC, AAAAGACGAATGGATT, AAAAAACGA...
AAAAACCTGAACAATC    [AAAAACCTGATCAATC, AAAAACCTGAACAATT, AAAAATCTG...
AAAAACGGTTCATGGT    [AAAAATGGTTCATGGT, AAAAACGGTTTATGGT, AAAAACGGT...
AAAAAATCGGTAGTGG    [AAAAAATCGGTAGAGG, nan, nan, nan, nan, nan, na...
AAAACTACATTTTTTG    [AAAACTACATTTTTCG, AAAACTAAATTTTTTG, nan, nan,...
AAAACGCAGATGCACA    [AAAACGCGGATGCACA, AAAACGCAGATGTACA, AAAACGCAT...
AAAACTTATTATGGTT    [AAAACTTATTATGGTC, AAAACTTATTACGGTT, AAAACTTAT...
AAAACACGGTACTATA             [nan, nan, nan, nan, nan, nan, nan, nan]
AAACCAGGAGACTCTA             [nan, nan, nan, nan, nan, nan, nan, nan]
AAACAAACCTCCAATA             [nan, nan, nan, nan, nan, nan, nan, nan]
AAAATGA
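
The wide table above pads shorter clusters with `nan`. As a possible alternative, the sketch below turns a list of clusters into a lookup from each original barcode to its corrected barcode and sums the frequencies per corrected barcode. It treats the first entry of each cluster as the corrected barcode, as the cell above does. The function `summarize_clusters` and the toy values are illustrative assumptions, not part of `UMI_tools`.

```python
def summarize_clusters(clusters, freqs):
    """Return (original -> corrected map, summed frequency per corrected barcode)."""
    bc_map, corrected_freq = {}, {}
    for cluster in clusters:
        corrected = cluster[0].decode("utf-8")  # first entry taken as representative
        for member in cluster:
            original = member.decode("utf-8")
            bc_map[original] = corrected
            corrected_freq[corrected] = (
                corrected_freq.get(corrected, 0.0) + freqs[member]
            )
    return bc_map, corrected_freq


# Toy clusters and frequencies in the same bytes-keyed form passed to the clusterer
toy_clusters = [[b"ACGT", b"ACGA"], [b"TTTT"]]
toy_freqs = {b"ACGT": 0.5, b"ACGA": 0.25, b"TTTT": 0.25}

bc_map, corrected_freq = summarize_clusters(toy_clusters, toy_freqs)
# bc_map == {'ACGT': 'ACGT', 'ACGA': 'ACGT', 'TTTT': 'TTTT'}
# corrected_freq == {'ACGT': 0.75, 'TTTT': 0.25}
```

Both dictionaries can be passed to `pd.DataFrame` or `Series.map` if a tabular form is preferred.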
