# parse_barcodes
This notebook attempts to parse cell barcodes, UMIs, and viral barcodes from *viral reads only*, using Read 1 data.

In [1]:
# Import
import glob
import os
import gzip
import pandas as pd
from Bio import SeqIO

In [2]:
# Constants
sample_file = 'samples.csv'

out_folder = 'outs/'

In [3]:
# Load samples
samples = pd.read_csv(sample_file, comment='#')

samples

,sample,data_type,index,data_file
0,HUNDREDuM_primer,transcripts,A1,out/virus_hashing/outs/fastq_path/CL3WR/HUNDRE...
1,TENuM_primer,transcripts,A2,out/virus_hashing/outs/fastq_path/CL3WR/TENuM_...


## Handle Read 1 Data

### Split features based on position in Read 1
Now, I must parse out the features I expect to be present in each of these reads. To do this, I will simply split the sequence by position, since each feature should have a fixed length. The lengths I anticipate are defined in a cell below.

The order of the features looks like this, depending on whether the adapter was appended in a way that retained the Cell Barcode and UMI or not:  
With Cell Barcode/UMI = `TruSeq Read 1 - Cell Barcode - UMI - PolyA`  
Without Cell Barcode/UMI = `TruSeq Read 1 - Viral Barcode - CDS`; an important note: if the molecule looks like this, the CDS will be `GCGGCCGCCT`.

I will parse the `TruSeq Read 1` feature as the first `22 bp`.  
I will parse the `Cell Barcode` OR `Viral Barcode` as the next `16 bp`.  
Finally, I will parse the `UMI` OR `CDS` as the next `10 bp` (positions 38 to 48), which matches the length of the constant CDS sequence `GCGGCCGCCT`.
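The position-based split described above can be sketched on a single toy read. This is a minimal sketch, assuming the last segment is 10 bp long (the length of the constant `GCGGCCGCCT`, and the width of the slice used in the parsing cell below). The helper name `split_read1` and the toy read are hypothetical and exist only for illustration.

```python
# Hypothetical helper for illustration; not part of the original notebook.
TRUSEQ_LEN = 22
BC_LEN = 16
VIRUS_CONSTANT = 'GCGGCCGCCT'  # constant CDS sequence named in the notebook


def split_read1(seq):
    """Split a Read 1 sequence into (truseq, bc, umi_or_cds, has_cell_bc)."""
    bc_end = TRUSEQ_LEN + BC_LEN
    last_end = bc_end + len(VIRUS_CONSTANT)
    truseq = seq[:TRUSEQ_LEN]
    bc = seq[TRUSEQ_LEN:bc_end]
    umi_or_cds = seq[bc_end:last_end]
    # An exact match to the constant means the cell barcode/UMI is absent
    has_cell_bc = umi_or_cds != VIRUS_CONSTANT
    return truseq, bc, umi_or_cds, has_cell_bc


# Toy read: TruSeq + barcode + constant CDS, followed by arbitrary bases
toy_read = ('CTACACGACGCTCTTCNGATCT'
            + 'TAAGGCCCTGATATTC'
            + 'GCGGCCGCCT'
            + 'AAAA')
print(split_read1(toy_read))
```

Comparing with `!=` rather than a substring test means a truncated read (whose last segment is shorter than 10 bp) is not mistaken for a match to the constant.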


In [4]:
# Feature Lengths
truseq_len = 22
bc_len = 16
umi_cds_len = 10  # matches the 10 bp slice parsed below and len(virus_constant)

# HA and NA CDS sequence (constant region)
virus_constant = 'GCGGCCGCCT'

In [5]:
record_dict = {'record_id': [],
               'primer': [],
               'truseq': [],
               'bc': [],
               'umi_or_cds': [],
               'has_cell_bc': []
              }
# Feature boundaries within Read 1
bc_end = truseq_len + bc_len
umi_cds_end = bc_end + len(virus_constant)

# Load Read 1 Files
for tup in samples.itertuples(index=False):
    print(f'Processing reads for sample "{tup.sample}"')
    r1files = glob.glob(os.path.join(tup.data_file, '*R1*.fastq.gz'))
    for r1file in r1files:
        print(f'Parsing file {r1file}')
        with gzip.open(r1file, "rt") as gunzip_file:
            for record in SeqIO.parse(gunzip_file, "fastq"):
                # Start parsing features
                seq = str(record.seq)
                umi_or_cds = seq[bc_end:umi_cds_end]
                record_dict['record_id'].append(record.id)
                record_dict['primer'].append(tup.sample)
                record_dict['truseq'].append(seq[:truseq_len])
                record_dict['bc'].append(seq[truseq_len:bc_end])
                record_dict['umi_or_cds'].append(umi_or_cds)
                # An exact match to the constant CDS sequence means the cell
                # barcode and UMI were lost. Use equality rather than `in`,
                # which would also match truncated (or empty) slices.
                record_dict['has_cell_bc'].append(umi_or_cds != virus_constant)
        print('Done.\n')

print('Done loading FASTQ files.\n')

Processing reads for sample "HUNDREDuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/HUNDREDuM_primer/HUNDREDuM_primer_S1_L001_R1_001.fastq.gz
Done.

Processing reads for sample "TENuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/TENuM_primer/TENuM_primer_S2_L001_R1_001.fastq.gz
Done.

Done loading FASTQ files.



In [6]:
print('Converting to dataframe.')
r1_reads = pd.DataFrame.from_dict(record_dict)
print('Dataframe head:')
r1_reads.head()

Converting to dataframe.
Dataframe head:


,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,ACGACCGCGTCCTCTT,CCTCTTCCCC,True
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTCCTTCCTCCCTC,CCCTCTCCCC,True
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTGCCTCCCGTGTA,CAAACATTTC,True
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TCAACGATCGCCATTA,TTCCAACCTG,True
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,AACCGCCTCCCGTTTC,CCCCTCCCTC,True


As a check, I want to confirm that the reads called as missing a cell barcode carry the expected constant sequence:

In [7]:
r1_reads[~r1_reads['has_cell_bc']].head()

,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc
4676,M03100:474:000000000-CL3WR:1:2116:8072:7817,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TAAGGCCCTGATATTC,GCGGCCGCCT,False
8246,M03100:474:000000000-CL3WR:1:2116:17775:9854,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GATGGTGGTACCTGCC,GCGGCCGCCT,False
8947,M03100:474:000000000-CL3WR:1:2116:20841:10190,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GATGGTGGTACCTGCC,GCGGCCGCCT,False
9054,M03100:474:000000000-CL3WR:1:2116:15390:10235,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GAAATTTGCGGTTCCG,GCGGCCGCCT,False
9510,M03100:474:000000000-CL3WR:1:2116:24188:10470,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GATGGCTGTACCTGCC,GCGGCCGCCT,False


### How many molecules are missing their cell barcode?

Now, I want to ask how many reads from each sample are missing the cell barcode. Such reads are identified by the expected constant region from the virus CDS.

In [8]:
r1_reads.groupby(['primer','has_cell_bc']).count()

primer,has_cell_bc,record_id,truseq,bc,umi_or_cds
HUNDREDuM_primer,False,274,274,274,274
HUNDREDuM_primer,True,300039,300039,300039,300039
TENuM_primer,False,265,265,265,265
TENuM_primer,True,237063,237063,237063,237063


About 270 reads from each sample (274 and 265) appear to be missing a cell barcode. To know what proportion of the viral reads this is, we'll have to map each read to virus or host.
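Before mapping, the counts in the table already give the fraction of *all* reads (viral plus host) that lack a cell barcode, which is on the order of 0.1% in both samples. The sketch below does that arithmetic with the counts copied from the table; the helper name `fraction_missing` is hypothetical.

```python
# Read counts copied from the groupby table: (missing barcode, with barcode)
counts = {
    'HUNDREDuM_primer': (274, 300039),
    'TENuM_primer': (265, 237063),
}


def fraction_missing(missing, present):
    """Fraction of all reads in a sample that lack a cell barcode."""
    return missing / (missing + present)


fractions = {sample: fraction_missing(*pair) for sample, pair in counts.items()}
for sample, frac in fractions.items():
    print(f'{sample}: {frac:.4%} of all reads lack a cell barcode')
```

On the real dataframe, `1 - r1_reads.groupby('primer')['has_cell_bc'].mean()` should give the same fractions, since the mean of a boolean column is the fraction of `True` values. This is not yet the proportion of viral reads, which still requires the viral/host call.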

## Handle Read 2 Data

### Call a read as viral/not viral
The fastest way to figure out how many of our viral reads retain their cell barcode is to search for a snippet of Read 2 in the viral genome.

First, I will define the viral genome. I have directly copied a script that automatically generates FASTA files for this from another project (see [here](https://github.com/jbloomlab/pdmH1N1_flu_single_cell/tree/master/data/flu_sequences)).  
Then, I will trim Read 2 back to ~50 bp.  
Finally, I will check whether each trimmed read occurs in the viral genome. Matching reads will be called viral molecules, and all others will be assumed to be host.
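The trim-and-search idea can be sketched as follows. This is only an illustration under several assumptions that the notebook does not state: trimming keeps the first `trim_len` bases of Read 2, matching is exact, reads shorter than `trim_len` are not called viral, and both the snippet and its reverse complement are checked. The names `is_viral` and `reverse_complement` and the toy sequences are hypothetical.

```python
# Hypothetical sketch; the sequences below are toy data, not real flu segments.
_COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_viral(read2_seq, genome_seqs, trim_len=50):
    """Call a read viral if its first trim_len bases occur in any genome sequence.

    Assumption: reads shorter than trim_len are not called viral.
    Assumption: either strand counts as a match.
    """
    snippet = read2_seq[:trim_len]
    if len(snippet) < trim_len:
        return False
    rc_snippet = reverse_complement(snippet)
    return any(snippet in g or rc_snippet in g for g in genome_seqs)


toy_genome = ['ACGTTGCAAGGCTTAACCGGTTAACG']
forward_read = toy_genome[0][5:15] + 'TTTT'
reverse_read = reverse_complement(toy_genome[0][2:12])
unrelated_read = 'GGGGGGGGGG'

for name, read in [('forward', forward_read),
                   ('reverse', reverse_read),
                   ('unrelated', unrelated_read)]:
    print(name, is_viral(read, toy_genome, trim_len=8))
```

Exact matching is fast but calls any read with a sequencing error in the snippet as host, so the viral count it gives is a lower bound.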

In [10]:
flu_genome_folder = 'flu_sequences'
flu_files = ['flu-CA09.fasta', 'flu-CA09-dblSyn.fasta']

In [17]:
# Load the viral genomes into a list of sequences that can be searched
seqs = []

for file in flu_files:
    print(f'Reading in flu genome from file: {file}')
    file_path = os.path.join(flu_genome_folder, file)
    with open(file_path) as open_file:
        for record in SeqIO.parse(open_file, "fasta"):
            print(f"Reading in sequence for {record.id}.")
            seqs.append(str(record.seq))
    print('Done with file.\n')

print("Finished loading flu genomes.")
print(f"There were {len(seqs)} sequences loaded from {len(flu_files)} files.")

Reading in flu genome from file: flu-CA09.fasta
Reading in sequence for fluPB2.
Reading in sequence for fluPB1.
Reading in sequence for fluPA.
Reading in sequence for fluHA.
Reading in sequence for fluNP.
Reading in sequence for fluNA.
Reading in sequence for fluM.
Reading in sequence for fluNS.
Done with file.

Reading in flu genome from file: flu-CA09-dblSyn.fasta
Reading in sequence for fluPB2.
Reading in sequence for fluPB1.
Reading in sequence for fluPA.
Reading in sequence for fluHA.
Reading in sequence for fluNP.
Reading in sequence for fluNA.
Reading in sequence for fluM.
Reading in sequence for fluNS.
Done with file.

Finished loading flu genomes.
There were 16 sequences loaded from 2 files.
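The loaded sequences in `seqs` are kept as a list. If a single searchable string is preferred, one option is to join them with a separator character that never appears in a read, so that a match cannot span the boundary between two segments. The sketch below uses toy sequences; `join_sequences` and the `'|'` separator are assumptions for illustration, not something the notebook defines.

```python
def join_sequences(seqs, sep='|'):
    """Join sequences into one string, separated by a non-nucleotide character."""
    return sep.join(seqs)


toy_seqs = ['ACGTAC', 'GTTGCA']
searchable = join_sequences(toy_seqs)

# A snippet contained in one sequence is still found
print('GTAC' in searchable)
# A snippet that would only exist across the boundary is not found
print('TACGTT' in searchable)
```

Joining without a separator would make the second lookup succeed, because `'TACGTT'` appears in the plain concatenation `'ACGTACGTTGCA'`.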
