# __Homework 4:__ Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ are read from the reverse end of the molecules (in 5'>3' orientation), so they look as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'
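
The read layout above follows directly from reverse-complementing the molecule. As a minimal sketch (assuming a plain-Python `reverse_complement` helper as a stand-in for Biopython's `Seq.reverse_complement`, and a made-up `toy_barcode` that is not taken from the data), the three parts of an HA read can be reconstructed like this:

```python
# Plain-Python stand-in for Bio.Seq.Seq.reverse_complement (assumes only A/C/G/T/N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


ha_end = "CCGGATTTGCATATAATGATGCACCAT"  # end of HA, as given in the text
linker = "AGGCGGCCGC"                   # constant sequence between gene end and barcode
toy_barcode = "AAAAAAAACCCCCCCC"        # hypothetical 16-nt barcode, illustration only

# Molecule in 5'>3' orientation: [end of HA]-[linker]-[barcode]
molecule = ha_end + linker + toy_barcode

# The read comes from the other end, so it is the reverse complement
read = reverse_complement(molecule)

print(read[:16])    # reverse complement of the barcode
print(read[16:26])  # reverse complement of the linker
print(read[26:])    # reverse complement of the end of HA
```

The middle slice is `GCGGCCGCCT`, the constant sequence shown in the read layout, and only the last slice differs between HA and NA.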

---   
    
The reads can originate from **either** HA or NA, and that will be distinguished by the most 3' end of the read.
In our in-class exercise, however, we did not distinguish between reads matching HA and NA, because we did not look far enough into the read to tell them apart.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
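
The steps above can be sketched on a few made-up reads. This is only an illustration under stated assumptions: `reverse_complement`, `classify_read`, and `toy_reads` are hypothetical names, the helper is a plain-Python stand-in for Biopython, and the sketch matches each read in its sequenced orientation against a reverse-complemented pattern (equivalent to reverse-complementing the read and matching the forward pattern, except that the captured barcode stays in read orientation).

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    # Plain-Python stand-in for Biopython's Seq.reverse_complement
    return seq.translate(COMPLEMENT)[::-1]


ANCHOR = "GCGGCCGCCT"
GENE_ENDS = {
    "HA": "CCGGATTTGCATATAATGATGCACCAT",
    "NA": "CACGATAGATAAATAATAGTGCACCAT",
}

# One pattern per gene: 16-nt barcode, anchor, reverse complement of the gene end
PATTERNS = {
    gene: re.compile(r"(?P<barcode>[ACGT]{16})" + ANCHOR + reverse_complement(end))
    for gene, end in GENE_ENDS.items()
}


def classify_read(read):
    """Return (gene, barcode) for the first matching pattern, else (None, None)."""
    for gene, pattern in PATTERNS.items():
        match = pattern.search(read)
        if match:
            return gene, match.group("barcode")
    return None, None


# Hypothetical reads built from the layout described above
toy_reads = [
    "A" * 16 + ANCHOR + reverse_complement(GENE_ENDS["HA"]),
    "A" * 16 + ANCHOR + reverse_complement(GENE_ENDS["HA"]),
    "C" * 16 + ANCHOR + reverse_complement(GENE_ENDS["NA"]),
    "N" * 16 + ANCHOR + reverse_complement(GENE_ENDS["NA"]),  # ambiguous barcode: no match
]

# Separate barcode -> count dictionaries for each gene
counts = {"HA": {}, "NA": {}}
for read in toy_reads:
    gene, barcode = classify_read(read)
    if gene is not None:
        counts[gene][barcode] = counts[gene].get(barcode, 0) + 1

print(counts)
```

The last toy read carries the NA end but has `N` in every barcode position, so it matches neither pattern and is left out of both dictionaries.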

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  

In [43]:
# import modules
import re
from Bio.Seq import Seq
from Bio import SeqIO

# initialize empty dictionaries
HA_barcodes = {}
NA_barcodes = {}

#initialize invalid count for #2
invalid_HA_count = 0
invalid_NA_count = 0

# Define the HA and NA end sequences (as given, 5'>3') and the anchor as it appears in the reads
HA_endseq = "CCGGATTTGCATATAATGATGCACCAT" 
NA_endseq = "CACGATAGATAAATAATAGTGCACCAT"
anchor = "GCGGCCGCCT"

# Get reverse complements of HA and NA end sequences to match 3' end
HA_reverse_complement = str(Seq(HA_endseq).reverse_complement())
NA_reverse_complement = str(Seq(NA_endseq).reverse_complement())

# Source: Veronica and I worked on the barcode together here because we had discrepancies in our answers; we also asked Lucas about this
barcode_search_ha = re.compile(r'(?P<barcode>[ATCG]{16})' + anchor + HA_reverse_complement)
barcode_search_na = re.compile(r'(?P<barcode>[ATCG]{16})' + anchor + NA_reverse_complement)

# Parse the FASTQ file (and count the barcodes for each gene, used for #3)
for record in SeqIO.parse('/workspaces/tfcb_2024/homeworks/homework04/barcodes_R1.fastq', format='fastq'):  
    # Convert Seq object to string
    seq = str(record.seq)

    # Check for HA match
    ha_match = barcode_search_ha.search(seq) #this function searches seq for a barcode that matches HA-end sequence
    if ha_match:
        # if the seq was found, the barcode is extracted from ha_match and its count is updated in the HA_barcodes dictionary
        barcode = ha_match.group("barcode")
        HA_barcodes[barcode] = HA_barcodes.get(barcode, 0) + 1
        continue  # Skip to the next read if matched to HA
        
    # Check for NA match if HA was not matched
    na_match = barcode_search_na.search(seq)
    if na_match:
        # Extract the barcode and update count for NA
        barcode = na_match.group("barcode")
        NA_barcodes[barcode] = NA_barcodes.get(barcode, 0) + 1
        continue  # Skip to the next read if matched to NA

    # If neither HA nor NA matched, increment invalid counters for answer to #2
    if HA_reverse_complement in seq:
        invalid_HA_count += 1
    elif NA_reverse_complement in seq:
        invalid_NA_count += 1

1. How many reads map to HA, and how many reads map to NA?

In [47]:
# Calculate the total number of barcodes for HA and NA
total_HA_barcodes = sum(HA_barcodes.values())
total_NA_barcodes = sum(NA_barcodes.values())

# Print the results
print("Total number of valid HA barcodes:", total_HA_barcodes)
print("Total number of valid NA barcodes:", total_NA_barcodes)

Total number of valid HA barcodes: 5249
Total number of valid NA barcodes: 3909


2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.

In [45]:
# print answers using f-strings; I used this website and ChatGPT to learn about them: https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/
print(f"Total HA sequences with invalid barcode: {invalid_HA_count}")
print(f"Total NA sequences with invalid barcode: {invalid_NA_count}")

Total HA sequences with invalid barcode: 160
Total NA sequences with invalid barcode: 213


3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._
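
Two of the "many ways" the hint alludes to can be sketched as follows, using a made-up `toy_counts` dictionary (hypothetical values, not taken from the FASTQ file):

```python
from collections import Counter

# Hypothetical barcode counts, for illustration only
toy_counts = {
    "AAAACCCCGGGGTTTT": 3,
    "ACGTACGTACGTACGT": 7,
    "TTTTGGGGCCCCAAAA": 5,
}

# Option 1: take the max over the keys, comparing them by their counts
top_barcode = max(toy_counts, key=toy_counts.get)
print(top_barcode, toy_counts[top_barcode])

# Option 2: Counter.most_common(1) returns a list holding the top (key, count) pair
top_barcode_2, top_count_2 = Counter(toy_counts).most_common(1)[0]
print(top_barcode_2, top_count_2)
```

If several barcodes tie for the highest count, `max` returns only the first one it encounters, so a tie would need to be checked separately.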

In [46]:
# define a function that finds the most common barcode
# used this website- https://datagy.io/python-get-dictionary-key-with-max-value/
def most_common_barcode(barcodes):
    # barcodes.items() yields (barcode, count) pairs
    # max() returns the pair with the highest count
    # key=lambda item: item[1] sets the comparison key to be the count
    # learned about lambda here- https://www.w3schools.com/python/python_lambda.asp
    return max(barcodes.items(), key=lambda item: item[1])

# run the function on both the HA and NA sets of barcodes
most_common_ha = most_common_barcode(HA_barcodes)
most_common_na = most_common_barcode(NA_barcodes)

# print answers using f-strings; I used this website and ChatGPT to learn about them: https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/
print(f"The most common HA barcode is {most_common_ha[0]}, {most_common_ha[1]} counts")
print(f"The most common NA barcode is {most_common_na[0]}, {most_common_na[1]} counts")

The most common HA barcode is TTAATGTCGGGTCGGG, 155 counts
The most common NA barcode is CCCGGGGAGAACTGGT, 152 counts
