# __Homework 4:__ Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ start from the reverse end of the molecules (in 5'>3' orientation), so the sequencing reads are as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'

---   
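
To make the orientation concrete, here is a minimal sketch that builds one toy HA molecule and the read a sequencer would report for it. The barcode is made up for illustration, and the `reverse_complement` helper is a plain-Python stand-in for Biopython's `Seq.reverse_complement()`.

```python
# Complement table for DNA; N stays N
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

ha_end = "CCGGATTTGCATATAATGATGCACCAT"  # end of HA, from the text above
linker = "AGGCGGCCGC"                   # constant sequence before the barcode
barcode = "ACGTTGCAACGTTGCA"            # hypothetical 16-nt barcode

# The actual molecule, 5' to 3'
molecule = ha_end + linker + barcode

# The read comes from the other end, so it is the reverse complement
read = reverse_complement(molecule)
print(read)

# Reverse-complementing the read recovers the original molecule
print(reverse_complement(read) == molecule)
```

Reverse-complementing each read first means the patterns can be written in the same orientation as the actual sequences shown above.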
    
The reads can originate from **either** HA or NA, and the two are distinguished by the 3'-most end of the read.
But in our example exercise in class, we did not distinguish between reads matching HA and NA, as we didn't even look far enough into the read to tell the identity.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
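
One possible way to cover steps 2 and 3 in a single pass is a regular expression with named groups. This is only a sketch, not the approach from lecture: the names `READ_RE` and `classify` are hypothetical, and it assumes the input is already the reverse complement of a read. Unlike a two-step check, it does not separate "gene found but barcode invalid" from "no gene found".

```python
import re

HA_END = "CCGGATTTGCATATAATGATGCACCAT"
NA_END = "CACGATAGATAAATAATAGTGCACCAT"
LINKER = "AGGCGGCCGC"

# Gene end, then the linker, then exactly 16 unambiguous bases at the 3' end
READ_RE = re.compile(
    rf"(?P<gene>{HA_END}|{NA_END}){LINKER}(?P<barcode>[ACGT]{{16}})$"
)

def classify(rc_read):
    """Return (gene, barcode) for a reverse-complemented read,
    or (None, None) if the read does not fit the expected pattern."""
    match = READ_RE.search(rc_read)
    if match is None:
        return None, None
    gene = "HA" if match.group("gene") == HA_END else "NA"
    return gene, match.group("barcode")

# Toy inputs with made-up barcodes
print(classify(HA_END + LINKER + "ACGTACGTACGTACGT"))
print(classify(NA_END + LINKER + "TTTTCCCCGGGGAAAA"))
print(classify(NA_END + LINKER + "TTTTCCCCGGGGAAAN"))  # N makes the barcode invalid
```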

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  

1. How many reads map to HA, and how many reads map to NA?

In [61]:
# load necessary packages
import re
import Bio.SeqIO
from Bio.Seq import Seq

# Read the FASTQ file into a list of SeqRecord objects
seqreads = list(Bio.SeqIO.parse('barcodes_R1.fastq', 'fastq'))

# Keep only the sequence (Seq object) of each read
seqreads_Seq = [seqrecord.seq for seqrecord in seqreads]

In [75]:
# Define the patterns for HA and NA sequences
# Reference sources: Google, GPT, and a friend who majors in CS

def extract_barcode_and_gene(seq): 
    
    ha_pattern = re.compile(r'CCGGATTTGCATATAATGATGCACCAT')  # Pattern for HA gene
    na_pattern = re.compile(r'CACGATAGATAAATAATAGTGCACCAT')  # Pattern for NA gene

    # Convert reverse complement of the seq into string
    reverse_seq = str(seq.reverse_complement())

    # Search for the gene patterns in the reverse complement
    ha_match = ha_pattern.search(reverse_seq)
    na_match = na_pattern.search(reverse_seq)

    # Check if the seq matches the HA or NA gene. check_barcode returns the
    # barcode when the read ends in a valid one, and None otherwise.
    if ha_match:
        return 'HA', check_barcode(reverse_seq)
    elif na_match:
        return 'NA', check_barcode(reverse_seq)
    else:
        return None, None  # Return None if neither HA nor NA gene is found

# Define a function to extract the barcode
def extract_barcode(seq, bclen=16):
    barcode = seq[-bclen:]  # Extract the barcode
    return barcode

# Define a function to check the barcode: returns the barcode if the
# sequence ends with upstream + a valid barcode, otherwise None
def check_barcode(seq,  bclen=16, upstream='AGGCGGCCGC'):
    barcode_re = re.compile(upstream + "(?P<barcode>[ATCG]{" + str(bclen) + "})$")
    
    match = barcode_re.search(seq)

    if match:
        barcode = match.group("barcode")
    else:
        barcode = None
    
    return barcode

# Create dictionaries for HA and NA barcodes
ha_barcodes = {}
na_barcodes = {}

# Count the barcode for HA or NA
valid_ha, valid_na = 0, 0
invalid_ha, invalid_na = 0, 0
invalid_all = 0

for seq in seqreads_Seq: #read seq
    gene, barcode = extract_barcode_and_gene(seq)
    if gene == 'HA':    #Count HA
        if barcode is None:
            invalid_ha += 1
        elif barcode in ha_barcodes:
            ha_barcodes[barcode] += 1
            valid_ha += 1
        else:
            ha_barcodes[barcode] = 1
            valid_ha += 1
    elif gene == 'NA':  #Count NA
        if barcode is None:
            invalid_na += 1
        elif barcode in na_barcodes:
            na_barcodes[barcode] += 1
            valid_na += 1
        else:
            na_barcodes[barcode] = 1
            valid_na += 1
    else:
        invalid_all += 1    #seq without HA nor NA

print(valid_ha + valid_na + invalid_ha + invalid_na + invalid_all)  # Check that every read was counted


10000
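
The if/else bookkeeping in the loop above could also be written with `collections.Counter`, which starts every missing key at zero. The sketch below uses a handful of made-up `(gene, barcode)` results rather than the real reads, and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical per-read results in the form (gene, barcode)
results = [
    ("HA", "AAAACCCCGGGGTTTT"),
    ("HA", "AAAACCCCGGGGTTTT"),
    ("NA", "TTTTGGGGCCCCAAAA"),
    ("HA", None),    # HA read without a valid barcode
    (None, None),    # read matching neither gene
]

barcodes = {"HA": Counter(), "NA": Counter()}  # barcode counts per gene
invalid = Counter()  # reads without a valid barcode; key None = neither gene

for gene, barcode in results:
    if gene is None or barcode is None:
        invalid[gene] += 1
    else:
        barcodes[gene][barcode] += 1

# Every read ends up in exactly one tally
total = sum(sum(c.values()) for c in barcodes.values()) + sum(invalid.values())
print(total == len(results))
```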


In [70]:
# your code here...

total_ha = valid_ha + invalid_ha
total_na = valid_na + invalid_na
print(f"Reads mapping to HA: {total_ha}")
print(f"Reads mapping to NA: {total_na}")



Reads mapping to HA: 5409
Reads mapping to NA: 4122


2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.

In [71]:
# your code here...
print(f"Invalid barcode counts for HA gene: {invalid_ha}")
print(f"Invalid barcode counts for NA gene: {invalid_na}")

Invalid barcode counts for HA gene: 164
Invalid barcode counts for NA gene: 215
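
As a quick consistency check, the counts printed so far can be combined by hand. The sketch below simply re-enters the numbers from the outputs above. It derives the valid reads per gene, the reads matching neither gene, and the fraction of each gene's reads lacking a valid barcode.

```python
# Numbers copied from the outputs above
total_reads = 10000
total_ha, total_na = 5409, 4122
invalid_ha, invalid_na = 164, 215

# Reads with a valid barcode are the remainder for each gene
valid_ha = total_ha - invalid_ha
valid_na = total_na - invalid_na

# Reads that matched neither HA nor NA
unmatched = total_reads - total_ha - total_na

print(f"HA: {valid_ha} valid, {invalid_ha / total_ha:.1%} invalid")
print(f"NA: {valid_na} valid, {invalid_na / total_na:.1%} invalid")
print(f"Reads matching neither gene: {unmatched}")
```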


3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._

In [73]:
# your code here...
# Reference: Google search for Python's max() function

ha_max_barcode = max(ha_barcodes, key=ha_barcodes.get)
ha_max_count = ha_barcodes[ha_max_barcode]

print(f"The HA barcode '{ha_max_barcode}' has the highest count of {ha_max_count}.")

na_max_barcode = max(na_barcodes, key=na_barcodes.get)
na_max_count = na_barcodes[na_max_barcode]

print(f"The NA barcode '{na_max_barcode}' has the highest count of {na_max_count}.")

The HA barcode 'CCCGACCCGACATTAA' has the highest count of 155.
The NA barcode 'ACCAGTTCTCCCCGGG' has the highest count of 152.
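
The task list also asks for the distribution of barcodes. One way to summarize it, sketched here on a made-up dictionary rather than the real `ha_barcodes`, is to count how many barcodes were observed exactly once, twice, and so on. `Counter.most_common` gives an alternative to `max` for finding the top barcode.

```python
from collections import Counter

# Hypothetical barcode -> read count dictionary
toy_barcodes = {
    "AAAACCCCGGGGTTTT": 5,
    "TTTTGGGGCCCCAAAA": 1,
    "ACGTACGTACGTACGT": 1,
    "GGGGAAAATTTTCCCC": 2,
}

# Top barcode and its count
top_barcode, top_count = Counter(toy_barcodes).most_common(1)[0]
print(top_barcode, top_count)

# Distribution: number of distinct barcodes seen exactly k times
distribution = Counter(toy_barcodes.values())
for k in sorted(distribution):
    print(f"{distribution[k]} barcode(s) observed {k} time(s)")
```

Running the same two steps on `ha_barcodes` and `na_barcodes` would give the per-gene distributions.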
