# Homework 4: Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

The _actual sequences_ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
    
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
    
The end of NA is:

    ...CACGATAGATAAATAATAGTGCACCAT
    
The end of HA is:

    ...CCGGATTTGCATATAATGATGCACCAT
    
The _sequencing reads_ (located in the FASTQ file) are read from the opposite strand, starting at the 3' end of these actual sequences. Each read therefore begins with the reverse complement of the barcode, followed by the reverse complement of the constant sequence, and then the reverse complement of the end of the gene.
The reads can originate from **either** HA or NA, and the two are distinguished by the 3'-most part of the read.
In our in-class exercise, we did not distinguish between reads matching HA and NA, because we did not look far enough into the read to identify the gene.
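
To make the read orientation concrete, here is a minimal sketch in plain Python. The barcode `toy_barcode` and the helper `revcomp` are hypothetical names invented for illustration; only the HA end and the constant sequence come from the description above.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

ha_end = "CCGGATTTGCATATAATGATGCACCAT"   # end of HA, from the text above
constant = "AGGCGGCCGC"                  # constant sequence, from the text above
toy_barcode = "AAAACCCCGGGGTTTA"         # hypothetical 16-nt barcode

# The "actual sequence" is gene end + constant sequence + barcode.
actual = ha_end + constant + toy_barcode

# An idealized read is the reverse complement of the actual sequence.
read = revcomp(actual)

# The read starts with the reverse complement of the barcode,
# then the reverse complement of the constant sequence.
print(read[:16] == revcomp(toy_barcode))
print(read[16:26] == revcomp(constant))

# Reverse complementing the read recovers the actual sequence.
print(revcomp(read) == actual)
```

Real reads may be longer than this toy read and may contain sequencing errors, so treat this only as a picture of the layout.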

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so, which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
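
The four steps above can be sketched on a handful of made-up reads. This is one possible layout, not the required solution: the names `GENE_ENDS`, `PATTERNS`, `classify_read`, and `toy_reads` are hypothetical, and the toy reads are built by hand rather than parsed from the FASTQ file.

```python
import re
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement; characters other than A, C, G, T, N are left as-is."""
    return seq.upper().translate(COMPLEMENT)[::-1]

GENE_ENDS = {
    "HA": "CCGGATTTGCATATAATGATGCACCAT",
    "NA": "CACGATAGATAAATAATAGTGCACCAT",
}
CONSTANT = "AGGCGGCCGC"

# One compiled pattern per gene: gene end, constant sequence, 16-nt barcode at the end.
PATTERNS = {
    gene: re.compile(end + CONSTANT + r"(?P<barcode>[ACGTN]{16})$")
    for gene, end in GENE_ENDS.items()
}

def classify_read(read):
    """Return (gene, barcode) if the read matches HA or NA, else None."""
    rc = revcomp(read)                        # step 1
    for gene, pattern in PATTERNS.items():    # step 2
        match = pattern.search(rc)
        if match:
            return gene, match.group("barcode")
    return None

# Hypothetical toy reads: two HA reads sharing a barcode, one NA read, one junk read.
toy_reads = [
    revcomp(GENE_ENDS["HA"] + CONSTANT + "AAAACCCCGGGGTTTA"),
    revcomp(GENE_ENDS["HA"] + CONSTANT + "AAAACCCCGGGGTTTA"),
    revcomp(GENE_ENDS["NA"] + CONSTANT + "TTTTGGGGCCCCAAAC"),
    "ACGTACGT",
]

counts = {"HA": Counter(), "NA": Counter()}
for read in toy_reads:
    result = classify_read(read)
    if result is not None:                    # step 3
        gene, barcode = result
        counts[gene][barcode] += 1

# Step 4: reads per gene, and how many barcodes were seen k times.
for gene, barcode_counts in counts.items():
    total_reads = sum(barcode_counts.values())
    distribution = Counter(barcode_counts.values())
    print(gene, total_reads, dict(distribution))
```

Keeping the per-gene counts in one dictionary of `Counter` objects is a design choice for this sketch; two separate dictionaries, as in lecture 9, work equally well.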

Please include code that addresses each of the following questions, with comments explaining what the code is meant to accomplish. Don't forget to cite the sources you used to obtain your answers, including your classmates (if you are working in groups).

1. How many reads map to HA, and how many reads map to NA?

In [33]:
import re
import Bio.SeqIO

seqreads = list(Bio.SeqIO.parse('barcodes_R1.fastq', 'fastq'))

seqreads_str = []
for seqrecord in seqreads:
    seqreads_str.append(str(seqrecord.seq))

def reverse_complement(seq, unk_partner = 'N'):
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    for base in seq:
        if base in base_partner:
            # look up the complementary base in the dictionary and add in reverse order
            rseq = base_partner[base] + rseq
        else:
            rseq = unk_partner + rseq
    return rseq

def read_barcode(seqread, bclen, proteinseq, upstream='AGGCGGCCGC'):
    seqread = seqread.upper() #make each seq all uppercase
    reverse = reverse_complement(seqread) # get the reverse complement of the read

    # compile the barcode search pattern
    barcode_pattern = re.compile(proteinseq + upstream + f"(?P<barcode>[ATCGN]{{{bclen}}})$")

    # search for the barcode pattern
    match = barcode_pattern.search(reverse)

    if match:
        barcode = match.group('barcode')
        return barcode
    else:
        return None

# dictionaries mapping each barcode to the number of reads it appears in
barcode_counts_ha = {}
barcode_counts_na = {}

for seq in seqreads_str: # iterate through all reads
    barcode_ha = read_barcode(seq, bclen = 16, proteinseq='CCGGATTTGCATATAATGATGCACCAT')
    barcode_na = read_barcode(seq, bclen = 16, proteinseq='CACGATAGATAAATAATAGTGCACCAT')
    if barcode_ha: # if there is a valid barcode, add it to the dictionary
        if barcode_ha in barcode_counts_ha:
            barcode_counts_ha[barcode_ha] += 1
        else:
            barcode_counts_ha[barcode_ha] = 1
    if barcode_na:
        if barcode_na in barcode_counts_na:
            barcode_counts_na[barcode_na] += 1
        else:
            barcode_counts_na[barcode_na] = 1

ha_counts = sum(barcode_counts_ha.values())
na_counts = sum(barcode_counts_na.values())

print(f"The number of reads that mapped to HA was {ha_counts}.")
print(f"The number of reads that mapped to NA was {na_counts}.")

The number of reads that mapped to HA was 5246.
The number of reads that mapped to NA was 3910.


2. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._
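
As an illustration of the hint, here are three common ways to find the key with the largest value. The dictionary `toy_counts` is made up for this sketch and is not taken from the FASTQ data.

```python
from collections import Counter

# Hypothetical barcode counts, for illustration only.
toy_counts = {
    "AAAACCCCGGGGTTTA": 3,
    "TTTTGGGGCCCCAAAC": 7,
    "ACGTTGCAACGTTGCA": 1,
}

# Option 1: max over the keys, ranking each key by its value.
top_barcode = max(toy_counts, key=toy_counts.get)
top_count = toy_counts[top_barcode]

# Option 2: max over (key, value) pairs, ranking by the value.
top_item = max(toy_counts.items(), key=lambda item: item[1])

# Option 3: Counter.most_common returns (key, count) pairs, largest count first.
top_common = Counter(toy_counts).most_common(1)[0]

print(top_barcode, top_count)
print(top_item)
print(top_common)
```

All three give the same barcode here. If several barcodes are tied for the maximum, each option reports only one of them, so check for ties separately if that matters for your answer.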

In [40]:
max_ha_barcode = max(barcode_counts_ha, key=barcode_counts_ha.get)
max_ha_count = max(barcode_counts_ha.values())
max_na_barcode = max(barcode_counts_na, key=barcode_counts_na.get)
max_na_count = max(barcode_counts_na.values())
print(f"The HA barcode with the most counts was {max_ha_barcode} with the total counts for this barcode being {max_ha_count}.")
print(f"The NA barcode with the most counts was {max_na_barcode} with the total counts for this barcode being {max_na_count}.")

The HA barcode with the most counts was CCCGACCCGACATTAA with the total counts for this barcode being 155.
The NA barcode with the most counts was ACCAGTTCTCCCCGGG with the total counts for this barcode being 152.
