# __Homework 4:__ Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ are taken from the reverse end of the molecules (in 5'>3' orientation), so the reads are as follows:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'

---   
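
The relationship between the molecule and the read can be checked with a short sketch. It uses a plain-Python helper, `revcomp`, and a made-up 16-nt barcode; both are assumptions for illustration and are not part of the lecture code.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string (N pairs with N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

HA_END = "CCGGATTTGCATATAATGATGCACCAT"
CONSTANT = "AGGCGGCCGC"
barcode = "AACCGGTTACGTAAAA"  # made-up barcode, for illustration only

# 5'-[end of HA]-AGGCGGCCGC-[barcode]-3'
molecule = HA_END + CONSTANT + barcode

# The read is the reverse complement of the molecule
read = revcomp(molecule)

print(read[:16])    # reverse complement of the barcode
print(read[16:26])  # reverse complement of the constant region
print(read[26:])    # reverse complement of the end of HA
```

Taking the reverse complement of the read recovers the original molecule, which is why step 1 of the analysis is to reverse complement each read.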
    
The reads can originate from **either** HA or NA, and the two are distinguished by the 3'-most end of the read.
In our example exercise in class, however, we did not distinguish between reads matching HA and NA, because we did not look far enough into the read to identify the gene.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.
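
The four steps above can be sketched on toy data as follows. This is a minimal sketch, not the lecture 9 code: the helper names (`revcomp`, `classify`), the made-up barcodes, and the assumption that the barcode runs to the very end of the reverse-complemented read (the `$` anchor) are all illustrative choices.

```python
import re
from collections import Counter

def revcomp(seq):
    """Reverse complement of an uppercase DNA string (N pairs with N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

GENE_ENDS = {"HA": "CCGGATTTGCATATAATGATGCACCAT",
             "NA": "CACGATAGATAAATAATAGTGCACCAT"}
CONSTANT = "AGGCGGCCGC"

# One pattern per gene: gene end, constant region, then a 16-nt barcode.
# Assumption: the barcode is the last thing in the reverse-complemented read.
PATTERNS = {gene: re.compile(end + CONSTANT + r"(?P<barcode>[ACGT]{16})$")
            for gene, end in GENE_ENDS.items()}

def classify(read):
    """Return (gene, barcode), or (None, None) if the read matches neither gene."""
    rc = revcomp(read)                              # step 1
    for gene, pattern in PATTERNS.items():          # step 2
        match = pattern.search(rc)
        if match:
            return gene, match.group("barcode")     # step 3
    return None, None

# Toy reads built from made-up barcodes
toy_reads = [revcomp(GENE_ENDS["HA"] + CONSTANT + "AACCGGTTACGTAAAA"),
             revcomp(GENE_ENDS["NA"] + CONSTANT + "TTTTGGGGCCCCAAAC"),
             revcomp(GENE_ENDS["HA"] + CONSTANT + "AACCGGTTACGTAAAA"),
             "ACGT" * 13]  # matches neither gene

counts = {"HA": Counter(), "NA": Counter()}
for read in toy_reads:
    gene, barcode = classify(read)
    if gene is not None:
        counts[gene][barcode] += 1

for gene, ctr in counts.items():                    # step 4
    print(gene, sum(ctr.values()), "reads,", len(ctr), "unique barcodes")
```

Keeping one `Counter` per gene means the number and distribution of barcodes can be reported separately for HA and NA.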

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  

In [179]:
# import regular expressions module
import re

1. How many reads map to HA, and how many reads map to NA?

In [180]:
#Import Bio.Seq package 
import Bio.SeqIO
import Bio.Seq 

In [181]:
# Open R1.fastq file 
reads = Bio.SeqIO.parse('barcodes_R1.fastq', format='fastq')
# make list of seq reads
seqreads = list(reads)

In [182]:
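# make a list of just the sequences
seqreads_Seq = []
for seqrecord in seqreads:
    sequence = seqrecord.seq # isolate the sequence from the seqrecord
    seqreads_Seq.append(sequence) # add the Seq object to the list
    
# Reverse complement every read.
# reverse_complement() returns a new Seq object; it does not modify the read in place.
seqreads_Seq_rev = [s.reverse_complement() for s in seqreads_Seq]
# Print the last read in its original orientation (not reverse complemented)
print(sequence)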

CGTAGGATTGAATTAGGCGGCCGCCTATGGTGCACTATTATTTATCTATCGTGAAAGGGAGTTCTGCTCCATCAGGCCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAACACAAAAAACATAATTCTGCCACATGTAGATGCATGGCAAAGATCAATACTATGCAAAATTTACACATATATCTGCAGACAAATAATATAA


In [183]:
def reverse_complement(seq, unk_partner='N'):
    """Returns the reverse complement of a nucleic acid sequence
    
    Uses unk_partner as the partner of unrecognized letters
    """
    base_partner = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
    rseq = ''
    # iterate through all bases in the sequence
    for a in seq:
        # check if the base is in the dictionary
        if a in base_partner:
            # look up the complementary base in the dictionary
            pair = base_partner[a]
            # add the complementary base to the beginning of the string (reverse comp)
            rseq = pair + rseq
        else:
            rseq = unk_partner + rseq
    return rseq


In [184]:
#Define known sequences 
HA_END = "CCGGATTTGCATATAATGATGCACCAT".upper()
NA_END = "CACGATAGATAAATAATAGTGCACCAT".upper()

reverse_complement(HA_END)

'ATGGTGCATCATTATATGCAAATCCGG'

In [185]:
#Define known sequences 
HA_END = "CCGGATTTGCATATAATGATGCACCAT".upper()
NA_END = "CACGATAGATAAATAATAGTGCACCAT".upper()
print(len(HA_END))  # length of the HA end (NA_END has the same length)

#Constant sequence 
constant_region = "AGGCGGCCGC"

#Lists 
seqs = ['CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CCAGCatagcaGATCA',  # does not have expected 5' sequence
        'CTAGCtacagGATCA',   # barcode too short
        'CTAGCgaccatGATCA',  # has barcode GACCAT
        'CTAGCatcgatGATCA',  # has barcode ATCGAT
        'CTAGCatcgatGGTCA',  # does not have expected 3' sequence
        ]

27


In [186]:
# Number of reads that map to HA & NA
HA_COUNT = 0
NA_COUNT = 0

In [187]:
# isolate just the first sequence for learning purposes
seq = seqreads_Seq[0]

# get reverse complement of read
rev_comp = seq.reverse_complement()

# define constant sequence
constant_region = "AGGCGGCCGC"
# define constant sequence in reverse direction
constant_region_reverse = "GCGGCCGCCT"

if (constant_region in rev_comp): 
    print("constant region")

# define the sequence order to search for
HA_pattern = HA_END + constant_region
print(HA_pattern)
NA_pattern = NA_END + constant_region
print(NA_pattern)

# define correct read length (barcode length + constant length + HA/NA end)
read_length = 16 + len(constant_region) + len(HA_END)

# Number of reads that map to HA & NA
HA_COUNT = 0
NA_COUNT = 0

# check if pattern is in read
if (HA_pattern in rev_comp):
    print("Maps to HA")
    HA_COUNT += 1
    print(HA_COUNT)
    
if (NA_pattern in rev_comp):
    print("Maps to NA")
    NA_COUNT += 1
    print(NA_COUNT)
    

# if one of the NA or HA + constant sequence is in the read, the read maps to that gene
# get barcode:
# find the position the HA_pattern is at
# get the 16 bases after the pattern
# use the position of the HA_pattern to know where to get the 16 from
# that's the barcode

constant region
CCGGATTTGCATATAATGATGCACCATAGGCGGCCGC
CACGATAGATAAATAATAGTGCACCATAGGCGGCCGC
Maps to NA
1


In [188]:
# Reads that map to NA: 122

# Reads that map to HA: 5,409

In [198]:
## Applying the analysis to the whole FASTQ file
# Define known sequences
HA_END = "CCGGATTTGCATATAATGATGCACCAT"
NA_END = "CACGATAGATAAATAATAGTGCACCAT"
constant_region = "AGGCGGCCGC"

# Initialize counters and barcode dictionaries
HA_COUNT = 0
NA_COUNT = 0
HA_invalid_count = 0
NA_invalid_count = 0
HA_barcodes = {}
NA_barcodes = {}

def identify_barcode(sequence, upstream_sequence, barcode_length=16):
    """
    Identifies a barcode of specified length directly after a known upstream sequence.
    Args:
        sequence (str): The sequence in which to search.
        upstream_sequence (str): The known sequence immediately 5' of the barcode.
        barcode_length (int): The length of the barcode to identify.
    Returns:
        str or None: Returns the barcode if found, otherwise None.
    """
    # In the reverse-complemented read, the barcode follows the constant region
    pattern = re.escape(upstream_sequence) + rf"([ACGT]{{{barcode_length}}})"
    match = re.search(pattern, str(sequence))
    return match.group(1) if match else None

# Check the reverse complement of every read for the HA or NA pattern
for read in seqreads_Seq:
    rev_comp = str(read.reverse_complement())
    if HA_END + constant_region in rev_comp:
        HA_COUNT += 1
        barcode = identify_barcode(rev_comp, HA_END + constant_region)
        if barcode:
            HA_barcodes[barcode] = HA_barcodes.get(barcode, 0) + 1
        else:
            HA_invalid_count += 1  # Increment invalid count if no barcode
    elif NA_END + constant_region in rev_comp:
        NA_COUNT += 1
        barcode = identify_barcode(rev_comp, NA_END + constant_region)
        if barcode:
            NA_barcodes[barcode] = NA_barcodes.get(barcode, 0) + 1
        else:
            NA_invalid_count += 1  # Increment invalid count if no barcode

# Print final results
print(f"Total HA reads: {HA_COUNT}")
print(f"Total NA reads: {NA_COUNT}")


2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.
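
One possible way to separate valid from invalid barcodes is sketched below. It assumes that "valid" means exactly 16 unambiguous bases (A, C, G, T) between the constant region and the end of the reverse-complemented read; that definition, the helper name `barcode_status`, and the example sequences are assumptions for illustration.

```python
import re

HA_END = "CCGGATTTGCATATAATGATGCACCAT"
CONSTANT = "AGGCGGCCGC"
VALID_BARCODE = re.compile(r"[ACGT]{16}")  # assumed definition of a valid barcode

def barcode_status(rev_comp_read, gene_end):
    """Return 'no match', 'invalid', or 'valid' for a reverse-complemented read."""
    anchor = gene_end + CONSTANT
    pos = rev_comp_read.find(anchor)
    if pos == -1:
        return "no match"          # read does not map to this gene
    tail = rev_comp_read[pos + len(anchor):]  # everything 3' of the constant region
    return "valid" if VALID_BARCODE.fullmatch(tail) else "invalid"

# Made-up examples in reverse-complement orientation
examples = [HA_END + CONSTANT + "AACCGGTTACGTAAAA",  # 16 unambiguous bases
            HA_END + CONSTANT + "AACCGGTTNCGTAAAA",  # contains an N
            HA_END + CONSTANT + "AACCGGTT",          # too short
            "ACGTACGT"]                              # does not map to HA

for example in examples:
    print(barcode_status(example, HA_END))
```

Counting the `'invalid'` results per gene would give the number of reads that map to that gene but lack a usable barcode.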

In [190]:
# your code here...

3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._
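
Two common ways to find the key with the maximum value are sketched below on a made-up dictionary of barcode counts (the barcodes and counts are illustrative, not results from the FASTQ file).

```python
from collections import Counter

# Made-up barcode counts, for illustration only
barcode_counts = {"AACCGGTTACGTAAAA": 3,
                  "TTTTGGGGCCCCAAAC": 7,
                  "ACGTACGTACGTACGT": 1}

# Option 1: max() over the keys, ranked by their values
top_barcode = max(barcode_counts, key=barcode_counts.get)
top_count = barcode_counts[top_barcode]
print(top_barcode, top_count)

# Option 2: Counter.most_common(1) returns a list with one (key, count) pair
top_barcode_2, top_count_2 = Counter(barcode_counts).most_common(1)[0]
print(top_barcode_2, top_count_2)
```

If several barcodes tie for the maximum, both options return only one of them, so ties may need to be checked separately.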

In [191]:
# your code here...