# Correct viral barcodes to reduce replication, PCR, and sequencing errors
This Python Jupyter notebook uses the `UMI_tools` directional adjacency method to correct viral barcodes that likely arose from errors during viral replication, sequencing library preparation (including PCR), or sequencing.

## Notes about UMI_tools
* The directional adjacency method is used. It has been shown on simulated data to produce more accurate results than other heuristics.
* Sequences must be input as `bytes`. See the explanation of the `b` prefix here: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
* The corrected barcode is returned as the first barcode in each group's list. See the `UMI_tools` API documentation: https://umi-tools.readthedocs.io/en/latest/API.html
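
To make these notes concrete, here is a minimal pure-Python sketch of the directional adjacency idea. It is not the `UMI_tools` implementation. It assumes the rule described for the directional method, where barcode `a` absorbs barcode `b` when the two are within `threshold` mismatches and `count(a) >= 2 * count(b) - 1`. The names `hamming`, `directional_groups`, `toy_counts`, and `lookup` are hypothetical and exist only for illustration.

```python
from collections import deque


def hamming(a: bytes, b: bytes) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts: dict, threshold: int = 1) -> list:
    """Group barcodes; each group lists its highest-count barcode first."""
    # Visit barcodes from most to least abundant (ties broken by sequence).
    ordered = sorted(counts, key=lambda bc: (-counts[bc], bc))
    # Directed edge a -> b: b is within `threshold` mismatches of a and
    # a satisfies the assumed rule count(a) >= 2 * count(b) - 1.
    children = {
        a: [b for b in ordered
            if a != b
            and hamming(a, b) <= threshold
            and counts[a] >= 2 * counts[b] - 1]
        for a in ordered
    }
    seen, groups = set(), []
    for root in ordered:
        if root in seen:
            continue
        seen.add(root)
        group, queue = [root], deque([root])
        # Breadth-first walk along directed edges collects the whole group.
        while queue:
            for child in children[queue.popleft()]:
                if child not in seen:
                    seen.add(child)
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups


# Toy data (made up): keys are bytes, values are counts.
toy_counts = {
    b"AAAA": 100.0,
    b"AAAT": 10.0,
    b"AATT": 2.0,
    b"CCCC": 50.0,
    b"CCCG": 40.0,
}
toy_groups = directional_groups(toy_counts)

# The first barcode in each group is treated as the corrected barcode.
lookup = {bc.decode("utf-8"): group[0].decode("utf-8")
          for group in toy_groups for bc in group}
print(lookup)
```

In this toy example `AATT` ends up assigned to `AAAA` even though the two differ at two positions, because the groups are built by chaining one-mismatch steps (`AAAA` to `AAAT` to `AATT`). `CCCC` and `CCCG` stay separate because their counts are too similar for either to absorb the other.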

Import Python modules:

In [1]:
from IPython.display import display

from dms_variants.constants import CBPALETTE

import gzip

import pandas as pd

import plotnine as p9

from umi_tools import UMIClusterer

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [2]:
viral_bc_in_progeny_freq_csv = 'results/viral_progeny/scProgenyProduction_trial2_viral_bc_in_progeny_freq.csv.gz'

Import barcode frequency data:

In [3]:
viral_bc_df = pd.read_csv(gzip.open(viral_bc_in_progeny_freq_csv))
display(viral_bc_df)

Unnamed: 0,source,tag,gene,barcode,mean_count
0,second_infection,syn,fluHA,AAAAAAAGTAAATCTT,28.0
1,second_infection,syn,fluHA,AAAAAAATCTTAATAA,1.0
2,second_infection,syn,fluHA,AAAAAAATCTTAATGA,50.0
3,second_infection,syn,fluHA,AAAAAACAATGACTAA,0.5
4,second_infection,syn,fluHA,AAAAAACCCAATTATT,0.5
...,...,...,...,...,...
149148,supernatant,wt,fluNA,TTTTGTTAGCGTCCTG,145.5
149149,supernatant,wt,fluNA,TTTTTTAGAAAACGTA,0.5
149150,supernatant,wt,fluNA,TTTTTTAGAAAACGTC,102.0
149151,supernatant,wt,fluNA,TTTTTTCACTGCCATT,0.5


Cluster barcodes within each sample:

In [21]:
clusterer = UMIClusterer(cluster_method="directional")

lookup_list = []

for (source, tag, gene), df in (viral_bc_df
                                .groupby(['source',
                                          'tag',
                                          'gene'])):
    n_unique_bc = df['barcode'].nunique()
    print(f"There are {n_unique_bc} unique barcodes in the raw data for {source} {tag} {gene}")
    
    # Convert dataframe to a dictionary mapping barcode to count.
    # A dict is the required input type for umi_tools clustering.
    viral_bc_dict = df.set_index('barcode')['mean_count'].to_dict()
    
    # Convert barcode strings to bytes, the key type umi_tools requires.
    byte_dict = {key.encode("utf-8"): float(value)
                 for key, value in viral_bc_dict.items()}
    
    # Cluster barcodes
    bc_groups = clusterer(byte_dict, threshold=1)
    groups_df = pd.DataFrame(bc_groups)
    groups_df = groups_df.stack().str.decode('utf-8').unstack() # Convert bytes back to string
    groups_df = groups_df.rename(columns={0:'corrected_bc'})
    groups_df = groups_df.set_index('corrected_bc', drop=False)
    n_corrected_bc = groups_df['corrected_bc'].nunique()
    print(f"{n_unique_bc - n_corrected_bc} barcodes were corrected.")
    print(f"There are {n_corrected_bc} corrected barcodes for {source} {tag} {gene}\n")
    
    # Generate lookup table for this sample
    lookup_df = (groups_df.melt(ignore_index=False,
                                value_name='original_barcode')
                 ['original_barcode']
                 .dropna()
                 .reset_index())
    
    lookup_list.append(lookup_df)
    display(lookup_df)

There are 23862 unique barcodes in the raw data for second_infection syn fluHA
7514 barcodes were corrected.
There are 16348 corrected barcodes for second_infection syn fluHA



Unnamed: 0,corrected_bc,original_barcode
0,GTCAATCAAGATAAGA,GTCAATCAAGATAAGA
1,CATTTACGCTGAATTG,CATTTACGCTGAATTG
2,AGCTTGGCATGAAAGA,AGCTTGGCATGAAAGA
3,CTTAAAAAGATTCCAG,CTTAAAAAGATTCCAG
4,GGTCACCGTGAGAAAT,GGTCACCGTGAGAAAT
...,...,...
23857,CATTTACGCTGAATTG,CATTTTCGCTGAATTG
23858,CATTTACGCTGAATTG,CATTTACGCTGAATAT
23859,CATTTACGCTGAATTG,CATTGACGCTGAATTG
23860,CATTTACGCTGAATTG,CATTTCCGCTGAATTG


There are 24756 unique barcodes in the raw data for second_infection syn fluNA
5122 barcodes were corrected.
There are 19634 corrected barcodes for second_infection syn fluNA



Unnamed: 0,corrected_bc,original_barcode
0,TGTTATATTTGTATTG,TGTTATATTTGTATTG
1,AAACACGGGTGAAATG,AAACACGGGTGAAATG
2,ACAATTTCAGTATCAA,ACAATTTCAGTATCAA
3,GTGTGGAGGTTTTTGA,GTGTGGAGGTTTTTGA
4,GCATAGTGCGAACGTT,GCATAGTGCGAACGTT
...,...,...
24751,TGTTATATTTGTATTG,TGTTATATTTGTATAT
24752,TGTTATATTTGTATTG,GGTTATATTTGTAGTG
24753,TGTTATATTTGTATTG,TGTTATATTTGGAGGG
24754,TGTTATATTTGTATTG,TGTGATAGGTGTATTG


There are 28801 unique barcodes in the raw data for second_infection wt fluHA
8566 barcodes were corrected.
There are 20235 corrected barcodes for second_infection wt fluHA



Unnamed: 0,corrected_bc,original_barcode
0,TGCAACTTACGCAGAG,TGCAACTTACGCAGAG
1,CATTCTGCTATGATAG,CATTCTGCTATGATAG
2,GCCTGCTCGAACATAC,GCCTGCTCGAACATAC
3,GACAGATTCACGTTTT,GACAGATTCACGTTTT
4,ATCCCAGCAGTGAACC,ATCCCAGCAGTGAACC
...,...,...
28796,GCCTGCTCGAACATAC,GCCTGCTCGAACATGC
28797,CATTCTGCTATGATAG,CCTTCTGCTATGATAG
28798,CATTCTGCTATGATAG,CACTCTGCTATGATAG
28799,CATTCTGCTATGATAG,CATTCTGCTATGTTAG


There are 22847 unique barcodes in the raw data for second_infection wt fluNA
5690 barcodes were corrected.
There are 17157 corrected barcodes for second_infection wt fluNA



Unnamed: 0,corrected_bc,original_barcode
0,GTAACAAAATTAGCGC,GTAACAAAATTAGCGC
1,TTGTGAGCGAAGTGCG,TTGTGAGCGAAGTGCG
2,TGTAACCGCGTTAGAG,TGTAACCGCGTTAGAG
3,ACGTACCTTGTCAATC,ACGTACCTTGTCAATC
4,AAGATACAAAATGATC,AAGATACAAAATGATC
...,...,...
22842,GTAACAAAATTAGCGC,TGAACAAAATTAGCGC
22843,GTAACAAAATTAGCGC,GTAAAAAAAGTAGCGC
22844,GTAACAAAATTAGCGC,GTAACAAAATTCGCGC
22845,GTAACAAAATTAGCGC,ATAACAAAAGTAGCGC


There are 13478 unique barcodes in the raw data for supernatant syn fluHA
5578 barcodes were corrected.
There are 7900 corrected barcodes for supernatant syn fluHA



Unnamed: 0,corrected_bc,original_barcode
0,GTCAATCAAGATAAGA,GTCAATCAAGATAAGA
1,CCACAAAACGTCTGGG,CCACAAAACGTCTGGG
2,CATTTACGCTGAATTG,CATTTACGCTGAATTG
3,ATCGCTATATGAATCC,ATCGCTATATGAATCC
4,AGCCAACGAGGATTAT,AGCCAACGAGGATTAT
...,...,...
13473,CCACAAAACGTCTGGG,CCACTAAACGTCTGGG
13474,GTCAATCAAGATAAGA,GTCAGTCAAGATAAGA
13475,CCACAAAACGTCTGGG,CCACGAAACGTCTGGG
13476,CCACAAAACGTCTGGG,CCACAAAACGTCTGGA


There are 11179 unique barcodes in the raw data for supernatant syn fluNA
5535 barcodes were corrected.
There are 5644 corrected barcodes for supernatant syn fluNA



Unnamed: 0,corrected_bc,original_barcode
0,TGTTATATTTGTATTG,TGTTATATTTGTATTG
1,ACAATTTCAGTATCAA,ACAATTTCAGTATCAA
2,AAACACGGGTGAAATG,AAACACGGGTGAAATG
3,GTGTGGAGGTTTTTGA,GTGTGGAGGTTTTTGA
4,GTCGTAATTGTTTAAT,GTCGTAATTGTTTAAT
...,...,...
11174,TGTTATATTTGTATTG,TATTATATTTTTATTG
11175,TGTTATATTTGTATTG,TTGTATATTTGTATTG
11176,TGTTATATTTGTATTG,TGATATATTTGTATTG
11177,TGTTATATTTGTATTG,TGTTATATTAGTATTG


There are 15079 unique barcodes in the raw data for supernatant wt fluHA
5699 barcodes were corrected.
There are 9380 corrected barcodes for supernatant wt fluHA



Unnamed: 0,corrected_bc,original_barcode
0,TGCAACTTACGCAGAG,TGCAACTTACGCAGAG
1,GCCTGCTCGAACATAC,GCCTGCTCGAACATAC
2,CATTCTGCTATGATAG,CATTCTGCTATGATAG
3,AGTCCTGGCTGGTAGG,AGTCCTGGCTGGTAGG
4,TTCCTTGGCAACGAAA,TTCCTTGGCAACGAAA
...,...,...
15074,TGCAACTTACGCAGAG,TGCAACATACGCAGAG
15075,TGCAACTTACGCAGAG,TGCATCTTACGCAGAG
15076,TGCAACTTACGCAGAG,TGCAATTTACGCAGAG
15077,TGCAACTTACGCAGAG,TGCAACATACGCAGGG


There are 9151 unique barcodes in the raw data for supernatant wt fluNA
5103 barcodes were corrected.
There are 4048 corrected barcodes for supernatant wt fluNA



Unnamed: 0,corrected_bc,original_barcode
0,TTGTGAGCGAAGTGCG,TTGTGAGCGAAGTGCG
1,ACGTACCTTGTCAATC,ACGTACCTTGTCAATC
2,AAGCGAGGGCGAGGTG,AAGCGAGGGCGAGGTG
3,TGTAACCGCGTTAGAG,TGTAACCGCGTTAGAG
4,GTAACAAAATTAGCGC,GTAACAAAATTAGCGC
...,...,...
9146,TTGTGAGCGAAGTGCG,TTGAGAGCGAAGTGCG
9147,TTGTGAGCGAAGTGCG,TTGTGATCGAAGTGCG
9148,TTGTGAGCGAAGTGCG,TTGTGAGCGAAGTGCA
9149,TTGTGAGCGAAGTGCG,TTGTGAGAGAAGTGAG
