# parse_barcodes
This notebook attempts to parse cell barcodes, UMIs, and viral barcodes from *only VIRAL reads*. It uses Read1 data to do this.

In [11]:
# Import
import glob
import os
import gzip
import pandas as pd
from Bio import SeqIO

In [2]:
# Constants
sample_file = 'samples.csv'

out_folder = 'outs/'

In [3]:
# Load samples
samples = pd.read_csv(sample_file, comment='#')

samples

Unnamed: 0,sample,data_type,index,data_file
0,HUNDREDuM_primer,transcripts,A1,out/virus_hashing/outs/fastq_path/CL3WR/HUNDRE...
1,TENuM_primer,transcripts,A2,out/virus_hashing/outs/fastq_path/CL3WR/TENuM_...


## Split features based on position in read
Now, I must parse out the features I expect to be present in each of these reads. For this, I will simply split out the sequence based on position, since each feature should be some exact length. The lengths I anticipate are designated in a cell below.

The order of the features looks like this, depending on whether the adapter was appended in a way that retained the Cell Barcode and UMI or not:  
With Cell Barcode/UMI = `TruSeq Read 1 - Cell Barcode - UMI - PolyA`  
Without Cell Barcode/UMI = `TruSeq Read 1 - Viral Barcode - CDS`; an important note: if the molecule looks like this, the CDS will either be `GCGGCCGCCTATG (HA)` or `GCGGCCGCCTAT (NA)`.

I will parse the `TruSeq Read 1` feature as the first `22 bp`.  
I will parse the `Cell Barcode` OR `Viral Barcode` as the next `16 bp`.  
Finally, I will parse the `UMI` OR `CDS` as the final `12 bp`.


In [30]:
# Feature Lengths
truseq_len = 22
bc_len = 16
umi_cds_len = 12

#HA and NA CDS sequences
ha_constant = 'GCGGCCGCCTATG'
na_constant = 'GCGGCCGCCTAT'

In [32]:
record_dict = {'record_id': [],
               'primer': [],
               'truseq': [],
               'bc': [],
               'umi_cds': [],
               'has_cell_bc': []
              }
# Load Read 1 Files
for tup in samples.itertuples(index=False):
    print(f'Processing reads for sample "{tup.sample}"')
    r1files = glob.glob(os.path.join(tup.data_file, '*R1*.fastq.gz'))
    for file in r1files:
        print(f'Parsing file {file}')
        with gzip.open(file, "rt") as gunzip_file:
            for record in SeqIO.parse(gunzip_file, "fastq"):
                record_dict['record_id'].append(record.id)
                record_dict['primer'].append(tup.sample)
                record_dict['truseq'].append(str(record.seq[0:21]))
                record_dict['bc'].append(str(record.seq[22:37]))
                record_dict['umi_cds'].append(str(record.seq[37:49]))
                if (str(record.seq[37:49]) == ha_constant) or (str(record.seq[37:49]) == na_constant):
                    record_dict['has_cell_bc'].append(False)
                else:
                    record_dict['has_cell_bc'].append(True)
        print('Done.\n')

print('Done loading FASTQ files.\n')




Processing reads for sample "HUNDREDuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/HUNDREDuM_primer/HUNDREDuM_primer_S1_L001_R1_001.fastq.gz
Done.

Processing reads for sample "TENuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/TENuM_primer/TENuM_primer_S2_L001_R1_001.fastq.gz
Done.

Done loading FASTQ files.



In [33]:
print('Converting to dataframe.')
r1_reads = pd.DataFrame.from_dict(record_dict)
print('Dataframe head:')
r1_reads.head()

Converting to dataframe.
Dataframe head:


Unnamed: 0,record_id,primer,truseq,bc,umi_cds,has_cell_bc
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,CTACACGACGCTCTTCNGATC,ACGACCGCGTCCTCT,TCCTCTTCCCCG,True
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTACACGACGCTCTTCNGATC,CTCTCCTTCCTCCCT,CCCCTCTCCCCC,True
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,CTACACGACGCTCTTCNGATC,CTCTGCCTCCCGTGT,ACAAACATTTCT,True
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,CTACACGACGCTCTTCNGATC,TCAACGATCGCCATT,ATTCCAACCTGT,True
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTACACGACGCTCTTCNGATC,AACCGCCTCCCGTTT,CCCCCTCCCTCT,True


In [38]:
r1_reads.groupby('primer').count()

Unnamed: 0_level_0,record_id,truseq,bc,umi_cds,has_cell_bc
primer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
HUNDREDuM_primer,300313,300313,300313,300313,300313
TENuM_primer,237328,237328,237328,237328,237328
