# parse_barcodes
This notebook parses cell barcodes, UMIs, and viral barcodes from Read 1, then uses Read 2 to call each read as viral or host so that the analysis can focus on *only VIRAL reads*.

In [1]:
# Import
import glob
import os
import gzip

import pandas as pd

#from ggplot import *

from IPython.display import display, HTML

from Bio import SeqIO
import pysam

In [2]:
# Constants
sample_file = 'samples.csv'

out_folder = 'outs/'

In [3]:
# Load samples
samples = pd.read_csv(sample_file, comment='#')

display(HTML(samples.to_html(index=False)))

sample,data_type,index,data_file
HUNDREDuM_primer,transcripts,A1,out/virus_hashing/outs/fastq_path/CL3WR/HUNDREDuM_primer/
TENuM_primer,transcripts,A2,out/virus_hashing/outs/fastq_path/CL3WR/TENuM_primer/


## Handle Read 1 Data

### Split features based on position in Read 1
Now, I must parse out the features I expect to be present in each of these reads. For this, I will simply split out the sequence based on position, since each feature should be some exact length. The lengths I anticipate are designated in a cell below.

The order of the features looks like this, depending on whether the adapter was appended in a way that retained the Cell Barcode and UMI or not:  
With Cell Barcode/UMI = `TruSeq Read 1 - Cell Barcode - UMI - PolyA`  
Without Cell Barcode/UMI = `TruSeq Read 1 - Viral Barcode - CDS`; an important note: if the molecule looks like this, the CDS will start with the 10 bp sequence `GCGGCCGCCT`.

I will parse the `TruSeq Read 1` feature as the first `22 bp`.  
I will parse the `Cell Barcode` OR `Viral Barcode` as the next `16 bp`.  
Finally, I will parse the `UMI` OR `CDS` as the `12 bp` that follow (positions 39-50 of the read).
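
As a minimal sketch of the positional split described above, the helper below slices a Read 1 sequence into the three features and checks whether the virus constant appears in the last window. The function name `split_read1` and the example read are made up for illustration; the lengths and the constant are the ones given in this notebook.

```python
# Feature lengths and constant as described in the text above
TRUSEQ_LEN = 22
BC_LEN = 16
UMI_CDS_LEN = 12
VIRUS_CONSTANT = 'GCGGCCGCCT'


def split_read1(seq):
    """Split a Read 1 sequence into features by position (hypothetical helper)."""
    bc_start = TRUSEQ_LEN
    umi_start = bc_start + BC_LEN
    umi_end = umi_start + UMI_CDS_LEN
    umi_or_cds = seq[umi_start:umi_end]
    return {
        'truseq': seq[:bc_start],
        'bc': seq[bc_start:umi_start],
        'umi_or_cds': umi_or_cds,
        # The constant marks a molecule that lost its cell barcode and UMI
        'has_cell_bc': VIRUS_CONSTANT not in umi_or_cds,
    }


# Made-up read in the "Without Cell Barcode/UMI" layout
example = 'CTACACGACGCTCTTCNGATCT' + 'TAAGGCCCTGATATTC' + 'GCGGCCGCCTGT'
features = split_read1(example)
print(features)
```

Using `in` rather than `startswith` for the constant is deliberate in this sketch: it tolerates a shift of a base or two in where the CDS begins.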


In [4]:
# Feature Lengths
truseq_len = 22
bc_len = 16
umi_cds_len = 12

#HA and NA CDS sequences
virus_constant = 'GCGGCCGCCT'

In [5]:
r1_dict = {'record_id': [],
               'primer': [],
               'truseq': [],
               'bc': [],
               'umi_or_cds': [],
               'has_cell_bc': []
              }

# Load Read 1 Files
for tup in samples.itertuples(index=False):
    print(f'Processing reads for sample "{tup.sample}"')
    r1files = glob.glob(os.path.join(tup.data_file, '*R1*.fastq.gz'))
    for file in r1files:
        print(f'Parsing file {file}')
        with gzip.open(file, "rt") as gunzip_file:
            for record in SeqIO.parse(gunzip_file, "fastq"):
                # Start parsing features
                r1_dict['record_id'].append(record.id)
                r1_dict['primer'].append(tup.sample)
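                # Slice features using the lengths defined above
                seq = str(record.seq)
                bc_start = truseq_len
                umi_start = bc_start + bc_len
                umi_end = umi_start + umi_cds_len
                umi_or_cds = seq[umi_start:umi_end]
                r1_dict['truseq'].append(seq[:bc_start])
                r1_dict['bc'].append(seq[bc_start:umi_start])
                r1_dict['umi_or_cds'].append(umi_or_cds)
                # The virus constant here means the cell barcode/UMI was lost
                r1_dict['has_cell_bc'].append(virus_constant not in umi_or_cds)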
        print('Done.\n')

print('Done loading FASTQ files.\n')

Processing reads for sample "HUNDREDuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/HUNDREDuM_primer/HUNDREDuM_primer_S1_L001_R1_001.fastq.gz
Done.

Processing reads for sample "TENuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/TENuM_primer/TENuM_primer_S2_L001_R1_001.fastq.gz
Done.

Done loading FASTQ files.



In [6]:
print('Converting to dataframe.')
r1_reads = pd.DataFrame.from_dict(r1_dict)

print(f"There are {len(r1_reads['record_id'])} reads in this data set.\n")
      
print('Dataframe head:')
r1_reads.head()

Converting to dataframe.
There are 537641 reads in this data set.

Dataframe head:


index,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,ACGACCGCGTCCTCTT,CCTCTTCCCCGG,True
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTCCTTCCTCCCTC,CCCTCTCCCCCT,True
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTGCCTCCCGTGTA,CAAACATTTCTC,True
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TCAACGATCGCCATTA,TTCCAACCTGTT,True
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,AACCGCCTCCCGTTTC,CCCCTCCCTCTT,True


As a check, I just want to see that the reads being called as missing a cell barcode have the appropriate constant sequence:

In [7]:
r1_reads[r1_reads['has_cell_bc']==False].head()

index,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc
1886,M03100:474:000000000-CL3WR:1:2116:21363:5878,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GGATTGCGGGCCTTCC,CGCGGCCGCCTA,False
4676,M03100:474:000000000-CL3WR:1:2116:8072:7817,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TAAGGCCCTGATATTC,GCGGCCGCCTGT,False
8246,M03100:474:000000000-CL3WR:1:2116:17775:9854,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GATGGTGGTACCTGCC,GCGGCCGCCTAT,False
8947,M03100:474:000000000-CL3WR:1:2116:20841:10190,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GATGGTGGTACCTGCC,GCGGCCGCCTAT,False
9054,M03100:474:000000000-CL3WR:1:2116:15390:10235,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,GAAATTTGCGGTTCCG,GCGGCCGCCTTG,False


### How many molecules are missing their cell barcode?

Now, I want to ask how many reads from each sample are missing the cell barcode. These reads are identified by the expected constant region from the virus CDS.

In [8]:
r1_reads.groupby(['primer','has_cell_bc']).count()

primer,has_cell_bc,record_id,truseq,bc,umi_or_cds
HUNDREDuM_primer,False,276,276,276,276
HUNDREDuM_primer,True,300037,300037,300037,300037
TENuM_primer,False,266,266,266,266
TENuM_primer,True,237062,237062,237062,237062


About 270 reads from each sample appear to be missing a cell barcode. To know what proportion of the viral reads this is, we'll have to map each read to virus or host.
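
For scale, the counts in the table above can be turned into per-sample fractions of all reads (host and virus together). This is only a small sketch using the numbers copied from that table; it works out to roughly 0.1% of reads in each sample.

```python
# Read counts copied from the table above: (sample, has_cell_bc) -> reads
counts = {
    ('HUNDREDuM_primer', False): 276,
    ('HUNDREDuM_primer', True): 300037,
    ('TENuM_primer', False): 266,
    ('TENuM_primer', True): 237062,
}

# Total reads per sample
totals = {}
for (sample, _), n in counts.items():
    totals[sample] = totals.get(sample, 0) + n

# Fraction of each sample's reads that lack a cell barcode
missing_fraction = {
    sample: counts[(sample, False)] / total
    for sample, total in totals.items()
}

for sample, frac in missing_fraction.items():
    print(f'{sample}: {frac:.4%} of reads lack a cell barcode')
```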

## Handle Read 2 Data

### Use Python to Call a read as viral/not viral
The fastest way to figure out how many of our viral reads retain their cell barcode is just to search for a snippet of read 2 in the viral genome.

First, I will define the viral genome. I have directly copied a script which automatically generates FASTA files for this from another project (see [here](https://github.com/jbloomlab/pdmH1N1_flu_single_cell/tree/master/data/flu_sequences)).  
Then, I will trim Read 2 back to 100 bp (`trim_len` below).  
Finally, I will check to see if each trimmed read is found in the viral genome. Reads that are found will be called viral molecules, and all others will be assumed to be host.

As a technical note, I checked to see if strings which span two viral segments (e.g. str(End of PB2 + End of PB1)) are returned as being valid parts of the viral genome, and they are not.
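
The check described in this technical note can be reproduced with toy sequences. The search target is the string form of a Python list, so neighbouring segments are separated by quotes, a comma, and a space, and a query that runs off the end of one segment into the next never matches. The sequences below are made up for illustration.

```python
# Toy stand-ins for two viral segments (made-up sequences)
seqs = ['AAAACCCC', 'GGGGTTTT']

# str() of a list keeps the quotes, comma and space between elements
genome_string = str(seqs)
print(genome_string)

within_segment = 'AACC'      # lies inside the first segment
spanning = 'CCCC' + 'GGGG'   # end of segment 1 joined to start of segment 2

print(within_segment in genome_string)
print(spanning in genome_string)
```

Note that this kind of exact search only considers the strand as written in the FASTA files.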

In [9]:
# Constants
flu_genome_folder = 'flu_sequences'
flu_files = ['flu-CA09.fasta', 'flu-CA09-dblSyn.fasta']

In [10]:
# Load in viral genomes and make into a single string that can be searched
seqs = list()

for file in flu_files:
    print(f'Reading in flu genome from file: {file}')
    file_path = flu_genome_folder + '/' + file
    with open(file_path) as open_file:
        for record in SeqIO.parse(open_file, "fasta"):
            print(f"Reading in sequence for {record.id}.")
            seqs.append(str(record.seq))
    print('Done with file.\n')

print("Finished loading flu genomes.")
print(f"There were {len(seqs)} sequences loaded from {len(flu_files)} files.")

Reading in flu genome from file: flu-CA09.fasta
Reading in sequence for fluPB2.
Reading in sequence for fluPB1.
Reading in sequence for fluPA.
Reading in sequence for fluHA.
Reading in sequence for fluNP.
Reading in sequence for fluNA.
Reading in sequence for fluM.
Reading in sequence for fluNS.
Done with file.

Reading in flu genome from file: flu-CA09-dblSyn.fasta
Reading in sequence for fluPB2.
Reading in sequence for fluPB1.
Reading in sequence for fluPA.
Reading in sequence for fluHA.
Reading in sequence for fluNP.
Reading in sequence for fluNA.
Reading in sequence for fluM.
Reading in sequence for fluNS.
Done with file.

Finished loading flu genomes.
There were 16 sequences loaded from 2 files.


In [11]:
# Read 2 trim length
trim_len = 100 #91 recommended as minimum by 10X

In [12]:
r2_simple_dict = {'record_id': [],
               'primer': [],
               'cds': [],
               'host_or_virus': []
              }

# Search target: string form of the list of viral sequences (built once)
genome_string = str(seqs)

# Load Read 2 Files
for tup in samples.itertuples(index=False):
    print(f'Processing reads for sample "{tup.sample}"')
    r2files = glob.glob(os.path.join(tup.data_file, '*R2*.fastq.gz'))
    for file in r2files:
        # Skip quality-filtered or trimmed copies of the FASTQ files
        if ('qualfiltered' in file) or ('trimmed' in file):
            continue
        print(f'Parsing file {file}')
        with gzip.open(file, "rt") as gunzip_file:
            count = 0
            for record in SeqIO.parse(gunzip_file, "fastq"):
                count += 1
                if count % 50000 == 0:
                    print(f"{count} sequences processed.")
                # Start parsing features
                trimmed_seq = str(record.seq[0:trim_len])
                r2_simple_dict['record_id'].append(record.id)
                r2_simple_dict['primer'].append(tup.sample)
                r2_simple_dict['cds'].append(trimmed_seq)
                # Exact substring match against the viral sequences
                if trimmed_seq in genome_string:
                    r2_simple_dict['host_or_virus'].append('Virus')
                else:
                    r2_simple_dict['host_or_virus'].append('Host')
        print('Done.\n')

print('Done loading FASTQ files.\n')

Processing reads for sample "HUNDREDuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/HUNDREDuM_primer/HUNDREDuM_primer_S1_L001_R2_001.fastq.gz
50000 sequences processed.
100000 sequences processed.
150000 sequences processed.
200000 sequences processed.
250000 sequences processed.
300000 sequences processed.
Done.

Processing reads for sample "TENuM_primer"
Parsing file out/virus_hashing/outs/fastq_path/CL3WR/TENuM_primer/TENuM_primer_S2_L001_R2_001.fastq.gz
50000 sequences processed.
100000 sequences processed.
150000 sequences processed.
200000 sequences processed.
Done.

Done loading FASTQ files.



In [13]:
print('Converting to dataframe.')
r2_simple_reads = pd.DataFrame.from_dict(r2_simple_dict)

print(f"There are {len(r2_simple_reads['record_id'])} reads in this data set.\n")

print('Dataframe head:')
r2_simple_reads.head()

Converting to dataframe.
There are 537641 reads in this data set.

Dataframe head:


index,record_id,primer,cds,host_or_virus
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,ACGCAGTCGTATCAACCCCCCGTACATCCCCCCCCTCCTCCCCCTC...,Host
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTCCCGCCGTCTCTCCCCCCCCTCCCTCCCCCCCCCCCCCCCCCCC...,Host
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,ACCCAGTCCTATCAACCCCCCCTACCTCCCCCCCCCCCCCCCTCCC...,Host
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,ATTCCTTACACTCACCCACTCGTCTCTCCTTTCCCCCCCTCTTCCC...,Host
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTTGACATCTCCATCCCACACCACCCCAACCTCCCCCCTCCCACTT...,Host


In [14]:
r2_simple_reads.groupby(['primer','host_or_virus']).count()

primer,host_or_virus,record_id,cds
HUNDREDuM_primer,Host,300164,300164
HUNDREDuM_primer,Virus,149,149
TENuM_primer,Host,237213,237213
TENuM_primer,Virus,115,115


In [15]:
merged_simple_reads = pd.merge(r1_reads, r2_simple_reads, on=["record_id","primer"])

print(f"There are {len(merged_simple_reads['record_id'])} reads in this data set.\n")

merged_simple_reads.head()

There are 537641 reads in this data set.



index,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc,cds,host_or_virus
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,ACGACCGCGTCCTCTT,CCTCTTCCCCGG,True,ACGCAGTCGTATCAACCCCCCGTACATCCCCCCCCTCCTCCCCCTC...,Host
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTCCTTCCTCCCTC,CCCTCTCCCCCT,True,CTCCCGCCGTCTCTCCCCCCCCTCCCTCCCCCCCCCCCCCCCCCCC...,Host
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTGCCTCCCGTGTA,CAAACATTTCTC,True,ACCCAGTCCTATCAACCCCCCCTACCTCCCCCCCCCCCCCCCTCCC...,Host
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TCAACGATCGCCATTA,TTCCAACCTGTT,True,ATTCCTTACACTCACCCACTCGTCTCTCCTTTCCCCCCCTCTTCCC...,Host
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,AACCGCCTCCCGTTTC,CCCCTCCCTCTT,True,CTTGACATCTCCATCCCACACCACCCCAACCTCCCCCCTCCCACTT...,Host


### How many viral reads are missing cell barcodes?

Next, I want to ask how many viral reads (as called by the exact-match search), across both samples, are missing their cell barcodes in Read 1. These reads may have to be thrown out, or they could possibly be parsed fully from Read 2 if the molecule is short enough.

In [16]:
merged_simple_reads.groupby(['host_or_virus','has_cell_bc']).count()

host_or_virus,has_cell_bc,record_id,primer,truseq,bc,umi_or_cds,cds
Host,False,533,533,533,533,533,533
Host,True,536844,536844,536844,536844,536844,536844
Virus,False,9,9,9,9,9,9
Virus,True,255,255,255,255,255,255


On the plus side, most reads that are called as viral have a cell barcode.

**Surprisingly, however, many reads that I cannot map back to the viral genome are missing cell barcodes (533 reads called "Host" carry the virus constant in Read 1). I would be surprised if these are really host reads. Perhaps they have mutations and were not exact matches?** I will print a few of these reads and BLAST them to see what they are.

In [17]:
for item in merged_simple_reads[(merged_simple_reads['host_or_virus'] == 'Host') & (merged_simple_reads['has_cell_bc'] == False)].iloc[0:10]['cds']:
    print(item)

GTTGTATAGAAGAGAGGAGCACCAGGAAGTGGGCGAACGGCGCGTTGCAATGCCGGACTAGCATAAAATGATGCACCATAGGCGGGCGCGAGGAGGCGAA
CTGTCCTCCACTCTTCCGCACTGTACAACATACCAAAAACTCCAACACTTATAGCCACGAAAGCTAAGTAGGGTTGGTAGAAGTGTGCACTGGGCGAGCC
AAGCAGTGGTATCAACGCAGAGTACATGGGGACCCCCGAGCCGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTCTGCTA
AAGCAGTGGTATCAACGCAGAGTACACGGGGGCCGGCCGCGCAGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTCTGCT
AGCAGTGGTATCAACGCAGAGTACATGGGGGCAGTAAGATCGCGAAAGCAGGAGTTTAAACCGAATCCAAACCAGAAAATAATAACCATTGGGTCAATCT
CAGCAGTCGCATCAACGCAGAGTACACGGGAGATTACTCGCAGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAAGACAAGTAGTTCTGCTA
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGTAGAGGCTGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTCTGCTAT
AAATGAGTGGACCGGCTCTAGCGGGAGTTTTGTTCAGCATCCAGAACTAACAGGGCTGGATTGTATAAGACCTTGCTTCTGGGTTGAACTAATCCGAGGG
AAGCAGTGGTATCAACGCAGAGTACATGGGAGGCTTGAGCAGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTCTGCTAT
CAGTGGTATCAACGCAGAGTACATGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAA

I then BLASTed the 10 sequences printed above. The results follow here:  
1 did not map to any sequence in the BLASTn database.  
2 mapped to a contemporary pdmH1N1 sequence, with mutations.  
3 mapped to several contemporary pdmH1N1 sequences, with mutations.  
4 mapped to a variety of historic H1N1 sequences (e.g. WSN) and several avian H3N8 sequences, all with mutations.  
5 mapped to several contemporary pdmH1N1 sequences, with mutations.  
6 mapped to a contemporary pdmH1N1 sequence, with mutations.  
7 mapped to several contemporary pdmH1N1 sequences, with mutations.  
8 mapped to several contemporary pdmH1N1 sequences, with mutations.  
9 mapped to several contemporary pdmH1N1 sequences, with mutations.  
10 mapped to several contemporary pdmH1N1 sequences, with mutations.

Based on these results, I conclude that **most of these sequences are truly virus sequences, but not exact matches.** For that reason, I will have to align these using bowtie instead to allow some flexibility.
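
To illustrate why exact substring search misses these reads, here is a small sketch with made-up sequences: a single substitution makes the `in` test fail, while a sliding-window Hamming comparison still locates the read. The helper `find_with_mismatches` is hypothetical and is not what this notebook uses; a real aligner is the better tool because it also handles gaps.

```python
def find_with_mismatches(query, reference, max_mismatches=2):
    """Return the first offset where query matches reference with at most
    max_mismatches substitutions, or -1 if there is none (hypothetical helper)."""
    for start in range(len(reference) - len(query) + 1):
        window = reference[start:start + len(query)]
        mismatches = sum(a != b for a, b in zip(query, window))
        if mismatches <= max_mismatches:
            return start
    return -1


reference = 'AGCAAAAGCAGGGGAAAACAAAAGCAACC'   # made-up reference fragment
exact_read = 'GCAGGGGAAAAC'
mutated_read = 'GCAGGGTAAAAC'                 # one substitution (G -> T)

print(exact_read in reference)       # exact search on the unmutated read
print(mutated_read in reference)     # exact search on the mutated read
print(find_with_mismatches(mutated_read, reference, max_mismatches=1))
```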

### Use Bowtie2 to call a read as viral/not viral

Since the simple search strategy I used above was not very successful -- it broke on mutations -- I have used bowtie2 to _locally_ align trimmed R2 reads to the influenza reference genome.

The following actions took place manually in the command line. This is not ideal, and I will eventually go back and automate these.

First, I concatenated the flu-CA09 and flu-CA09-dblSyn FASTA files into one `flu_genome.fasta`.  
Then, I used `bowtie2-build` to make a flu reference genome out of this.

I trimmed my R2 reads for quality (q score = 10+33), then trimmed them down to 100 bp using `cutadapt`.

Finally, I used `bowtie2 --local` to align the reads from each sample to the reference genome. Local alignment tolerates mismatches and deletions, and it can soft-clip read ends that do not match the reference.

I converted the SAM files to BAM files and sorted them. They are stored in the `out` folder.
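
Since these steps were run by hand, here is one possible sketch of how they could be scripted later. It only builds the command lines as Python lists and does not run anything. Every file name, the index prefix, and the specific option values are assumptions for illustration, not a record of the commands actually used. The options shown (`cutadapt -q`, `-l`, `-o`; `bowtie2-build`; `bowtie2 --local -x -U -S`; `samtools sort -o`) exist in those tools, but the real invocations may have differed.

```python
import os


def build_alignment_commands(sample, r2_fastq, out_dir='out',
                             genome_fasta='flu_genome.fasta',
                             index_prefix='flu_genome',
                             quality_cutoff=10, trim_len=100):
    """Build (not run) the commands for one sample; all paths are hypothetical."""
    trimmed = os.path.join(out_dir, f'{sample}_R2_trimmed.fastq.gz')
    sam = os.path.join(out_dir, f'{sample}_aligned.sam')
    bam = os.path.join(out_dir, f'{sample}_aligned.sorted.bam')
    return [
        # Build the bowtie2 index from the concatenated flu genome
        ['bowtie2-build', genome_fasta, index_prefix],
        # Quality-trim, then shorten reads to trim_len (assumed values)
        ['cutadapt', '-q', str(quality_cutoff), '-l', str(trim_len),
         '-o', trimmed, r2_fastq],
        # Local alignment of the trimmed reads
        ['bowtie2', '--local', '-x', index_prefix, '-U', trimmed, '-S', sam],
        # Sort the alignments and write them as BAM
        ['samtools', 'sort', '-o', bam, sam],
    ]


commands = build_alignment_commands('HUNDREDuM', 'HUNDREDuM_primer_R2.fastq.gz')
for cmd in commands:
    print(' '.join(cmd))
# Each list could later be passed to subprocess.run(cmd, check=True)
```

Keeping the commands as lists rather than shell strings avoids quoting problems if a path ever contains spaces.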

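Next, I will load the alignment output for each sample (the cell below reads the `_aligned.sam` files, which contain every read, mapped or not). Reads that aligned to a flu transcript are called "Virus" and unmapped reads are called "Host". I will extract the record ID and merge this on the R1 dataframe.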
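
The merge on record ID can be sketched with toy data as below. Passing `validate='one_to_one'` asks pandas to raise an error if a key appears more than once on either side, which is a cheap guard when each read should have exactly one Read 1 row and one alignment row. The frames and values here are made up.

```python
import pandas as pd

# Toy Read 1 and alignment tables (made-up values)
r1 = pd.DataFrame({
    'record_id': ['read1', 'read2', 'read3'],
    'primer': ['A', 'A', 'B'],
    'has_cell_bc': [True, False, True],
})
r2 = pd.DataFrame({
    'record_id': ['read1', 'read2', 'read3'],
    'primer': ['A', 'A', 'B'],
    'host_or_virus': ['Host', 'Virus', 'Virus'],
})

# validate raises pandas.errors.MergeError if the keys are not unique
merged = pd.merge(r1, r2, on=['record_id', 'primer'], validate='one_to_one')
print(merged)
```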

In [18]:
# Load alignment files (SAM output from bowtie2)
r2_bowtie_dict = {'record_id': [],
               'primer': [],
               'cds': [],
               'host_or_virus': [],
               'gene_name': []
              }

for tup in samples.itertuples(index=False):
    print(f'Processing reads for sample "{tup.sample}"')
    sample_prefix = tup.sample.split('_')
    samfile = os.path.join('out', sample_prefix[0] + '_aligned.sam')
    print(f'Reading in SAM file: {samfile}')
    # The context manager closes each file once its records are read
    with pysam.AlignmentFile(samfile, "r") as sam_object:
        for record in sam_object:
            # Start parsing features
            r2_bowtie_dict['record_id'].append(record.query_name)
            r2_bowtie_dict['primer'].append(tup.sample)
            r2_bowtie_dict['cds'].append(record.query_sequence)
            # Unmapped reads are assumed to be host
            if record.is_unmapped:
                r2_bowtie_dict['host_or_virus'].append('Host')
                r2_bowtie_dict['gene_name'].append('unmapped')
            else:
                r2_bowtie_dict['host_or_virus'].append('Virus')
                r2_bowtie_dict['gene_name'].append(record.reference_name)
    print('Done.\n')

Processing reads for sample "HUNDREDuM_primer"
Reading in SAM file: out/HUNDREDuM_aligned.sam
Done.

Processing reads for sample "TENuM_primer"
Reading in SAM file: out/TENuM_aligned.sam
Done.



In [19]:
print('Converting to dataframe.')
r2_bowtie_reads = pd.DataFrame.from_dict(r2_bowtie_dict)

print(f"There are {len(r2_bowtie_reads['record_id'])} reads in this data set.\n")

print('Dataframe head:')
r2_bowtie_reads.head()

Converting to dataframe.
There are 537641 reads in this data set.

Dataframe head:


index,record_id,primer,cds,host_or_virus,gene_name
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,ACGCAGTCGTATCAACCCCCCGTACATCCCCCCCCTCCTCCCCCTC...,Host,unmapped
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTCCCGCCGTCTCTCCCCCCCCTCCCTCCCCCCCCCCCCCCCCCCC...,Host,unmapped
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,ACCCAGTCCTATCAACCCCCCCTACCTCCCCCCCCCCCCCCCTCCC...,Host,unmapped
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,ATTCCTTACACTCACCCACTCGTCTCTCCTTTCCCCCCCTCTTCCC...,Host,unmapped
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTTGACATCTCCATCCCACACCACCCCAACCTCCCCCCTCCCACTT...,Host,unmapped


In [20]:
r2_bowtie_reads.groupby(['primer','host_or_virus']).count()

primer,host_or_virus,record_id,cds,gene_name
HUNDREDuM_primer,Host,299598,299598,299598
HUNDREDuM_primer,Virus,715,715,715
TENuM_primer,Host,236807,236807,236807
TENuM_primer,Virus,521,521,521


In [21]:
merged_bowtie_reads = pd.merge(r1_reads, r2_bowtie_reads, on=["record_id","primer"])

print(f"There are {len(merged_bowtie_reads['record_id'])} reads in this data set.\n")

merged_bowtie_reads.head()

There are 537641 reads in this data set.



index,record_id,primer,truseq,bc,umi_or_cds,has_cell_bc,cds,host_or_virus,gene_name
0,M03100:474:000000000-CL3WR:1:2116:16081:1664,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,ACGACCGCGTCCTCTT,CCTCTTCCCCGG,True,ACGCAGTCGTATCAACCCCCCGTACATCCCCCCCCTCCTCCCCCTC...,Host,unmapped
1,M03100:474:000000000-CL3WR:1:2116:21218:1667,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTCCTTCCTCCCTC,CCCTCTCCCCCT,True,CTCCCGCCGTCTCTCCCCCCCCTCCCTCCCCCCCCCCCCCCCCCCC...,Host,unmapped
2,M03100:474:000000000-CL3WR:1:2116:17297:1672,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,CTCTGCCTCCCGTGTA,CAAACATTTCTC,True,ACCCAGTCCTATCAACCCCCCCTACCTCCCCCCCCCCCCCCCTCCC...,Host,unmapped
3,M03100:474:000000000-CL3WR:1:2116:21080:1736,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,TCAACGATCGCCATTA,TTCCAACCTGTT,True,ATTCCTTACACTCACCCACTCGTCTCTCCTTTCCCCCCCTCTTCCC...,Host,unmapped
4,M03100:474:000000000-CL3WR:1:2116:15840:1750,HUNDREDuM_primer,CTACACGACGCTCTTCNGATCT,AACCGCCTCCCGTTTC,CCCCTCCCTCTT,True,CTTGACATCTCCATCCCACACCACCCCAACCTCCCCCCTCCCACTT...,Host,unmapped


### How many viral reads are missing cell barcodes?

Next, I want to repeat the same question with the bowtie2 calls: how many viral reads, across both samples, are missing their cell barcodes in Read 1? These reads may have to be thrown out, or they could possibly be parsed fully from Read 2 if the molecule is short enough.

In [22]:
merged_bowtie_reads.groupby(['host_or_virus','has_cell_bc']).count()

host_or_virus,has_cell_bc,record_id,primer,truseq,bc,umi_or_cds,cds,gene_name
Host,False,222,222,222,222,222,222,222
Host,True,536183,536183,536183,536183,536183,536183,536183
Virus,False,320,320,320,320,320,320,320
Virus,True,916,916,916,916,916,916,916


**Based on these results, it looks like there are approximately `1200 virus reads` in this dataset. About `3/4 (~900)` of these have a cell barcode! This is promising and suggests that we can use this data to study multiple infection!**
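
The figures quoted above follow directly from the "Virus" rows of the table. A short worked calculation, using only the counts copied from that table:

```python
# Counts copied from the table above (bowtie2 calls)
virus_with_bc = 916
virus_without_bc = 320

total_virus = virus_with_bc + virus_without_bc
fraction_with_bc = virus_with_bc / total_virus

print(f'Total viral reads: {total_virus}')
print(f'Fraction with a cell barcode: {fraction_with_bc:.2f}')
```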

There are also some reads that fail to map to the virus genome (and are therefore called "Host" by default) but have the virus constant sequence at the expected position in Read 1. These are a small minority of "Host" reads, so they could simply be thrown out. However, if they are actually "Virus" reads that failed to map, it would be nice to figure out how to map them successfully so we can include them in our analysis. As I did before, I will BLAST them.

In [23]:
for item in merged_bowtie_reads[(merged_bowtie_reads['host_or_virus'] == 'Host') & (merged_bowtie_reads['has_cell_bc'] == False)].iloc[0:10]['cds']:
    print(item)

GTTGTATAGAAGAGAGGAGCACCAGGAAGTGGGCGAACGGCGCGTTGCAATGCCGGACTAGCATAAAATGATGCACCATAGGCGGGCGCGAGGAGGCGAA
CTGTCCTCCACTCTTCCGCACTGTACAACATACCAAAAACTCCAACACTTATAGCCACGAAAGCTAAGTAGGGTTGGTAGAAGTGTGCACTGGGCGAGCC
AGCAGTGGTATCAACGCAGAGTACATGGGGGCAGTAAGATCGCGAAAGCAGGAGTTTAAACCGAATCCAAACCAGAAAATAATAACCATTGGGTCAATCT
CAGCAGTCGCATCAACGCAGAGTACACGGGAGATTACTCGCAGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAAGACAAGTAGTTCTGCTA
AGTGGTATCAACGCAGAGTACATGGGTAGTGTGAAACGCGAAAGCAGGAGTTTAAACCGAATCCAAACCAGAAAATAATAACCATTGGGTCAATCTGTAT
AAGCAGTGGTATCAACGCAGAGTACATGGGGACACTCACGGCGCGAAAGCAGGAGTTTAAACCGAATCCAAACCAGAAAATAATAACCATTGGGTCAATC
AAGCAGTGGTATCAACGCAGAGTACATGGGGGCCTCCTGACGGGCGAACGCATGAGTATAGACCGAATCCAAACCAGAAAATAATAACCACTGGGTCTAT
AAGCAGTGGTATCAACGCAGAGTACATGGGGACGCGAGGACGCGAAAGCAGGAGTTTAAACCGAATCCAAACCAGAAAATAATAACCATTGGGTCAATCT
AAGCAGTGGTATCAACGCAGAGTACATGGGCTTCCATTCAGTAGCAAAAGCAGGGGAAACCAAAAGCAACCAAATTGAAAGCAATACTAGTAGTTGTGCT
AAGCAGTGGTATCAACGCACAGTACATGGGGGCCTAGGTGCCCGGCAAAAGCAGGGGAAAACAAAAGCAACCAAATTGAAAGCAATACTAA

BLAST results:

Read 1 mapped to a random smattering of bacterial sequences.  
Read 2 mapped to WSN and avian influenza sequences.  
Read 3 mapped to modern pdmH1N1 sequences.  
Read 4 mapped to WSN and other historic H1N1 (not pdm) sequences.  
Read 5 mapped to WSN and avian influenza sequences.  
Read 6 mapped to modern pdmH1N1 sequences.


Based on these results, I think most of these are viral reads that bowtie2 failed to align. I may have to relax the alignment settings a bit to make them less conservative.

### Do the barcoded viral segments -- HA and NA -- have cell barcodes?

Many of the viral reads have a cell barcode. However, at least 1/4 do not. I want to check that these are not somehow exclusively transcripts from the barcoded segments: HA and NA.

In [32]:
print('HA reads')
merged_bowtie_reads[merged_bowtie_reads['gene_name']=='fluHA'].groupby(['host_or_virus','has_cell_bc']).count()

HA reads


host_or_virus,has_cell_bc,record_id,primer,truseq,bc,umi_or_cds,cds,gene_name
Virus,False,305,305,305,305,305,305,305
Virus,True,170,170,170,170,170,170,170


In [33]:
print('NA reads')
merged_bowtie_reads[merged_bowtie_reads['gene_name']=='fluNA'].groupby(['host_or_virus','has_cell_bc']).count()

NA reads


host_or_virus,has_cell_bc,record_id,primer,truseq,bc,umi_or_cds,cds,gene_name
Virus,False,15,15,15,15,15,15,15
Virus,True,29,29,29,29,29,29,29


For the HA reads, most molecules have lost their cell barcode. For the NA reads, most molecules retain their cell barcodes. There are fewer NA reads than HA reads, but ultimately, we can use information from both segments to estimate how many virions entered a cell. While these numbers are not ideal, with greater sequencing depth, they may provide reliable estimates of virion entry for many cells.
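
To put numbers on this comparison, the per-segment retention can be computed from the two tables above. This is a small worked example using the counts as reported; it gives roughly 36% for HA and 66% for NA.

```python
# Counts copied from the HA and NA tables above: (with barcode, without barcode)
segment_counts = {
    'fluHA': (170, 305),
    'fluNA': (29, 15),
}

retention = {}
for segment, (with_bc, without_bc) in segment_counts.items():
    retention[segment] = with_bc / (with_bc + without_bc)
    print(f'{segment}: {retention[segment]:.1%} of reads keep a cell barcode')
```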

**I still think these data give me enough information to order more sequencing! At least half the virus reads in my data (maybe more) will have cell barcodes. This proportion is lower for HA (170 of 475 reads, about 36%) and somewhat lower for NA (29 of 44 reads, about 66%) than for viral reads overall (916 of 1236, about 74%), but still workable. It is likely that the R2 sequences will contain the virus barcodes for any molecule that has a cell barcode. Therefore, with this sequencing scheme, we should be able to study multiple infection in individual cells.**

In [24]:
#Save to HTML for easy viewing on GitHub
!jupyter nbconvert parse_barcodes.ipynb --to html --output parse_barcodes.html

[NbConvertApp] Converting notebook parse_barcodes.ipynb to html
[NbConvertApp] Writing 330398 bytes to parse_barcodes.html
