# HW1 Answer

## 1. 
Write a Python function that takes as input a FASTA file and returns a sequence string [2pt].
The FASTA format consists of a single-line description, followed by line(s) of sequence data. The
first character of the description line is a greater-than (">") symbol. Below is a sample sequence
in FASTA format:
<blockquote>
>seq0
<br>
CTGAAAGGTGGTTTCATGGACATCTCTCTGGGAAAGAAGCAGAGAAATTATTAACTGAAAAAGGAAAAGTGGTAGTTTTCTTGTACGAGAGA GAGCCAGAGCCACCCTGGAGATTTTGTTCTTTCTGTGCGCACTGGAG
</blockquote>

In [1]:
#note that the following function can only handle a FASTA file with a single entry
def readFASTA(filename):
    seq = ""
    with open(filename, "r") as fin: #open the file object as fin
        for line in fin: #iterate over each line in the file
            if line.startswith(">"):
                #a line that starts with ">" is the description line, so skip it
                continue

            #you can also skip the first line and read the lines afterwards
            seq += line.strip() #append this line to the sequence string with "\n" trimmed

    return seq

In [2]:
seq = readFASTA("test.fasta")
print(seq)

CTGAAAGGTGGTTTCATGGACATCTCTCTGGGAAAGAAGCAGAGAAATTATTAACTGAAAAAGGAAAAGTGGTAGTTTTCTTGTACGAGAGAGAGCCAGAGCCACCCTGGAGATTTTGTTCTTTCTGTGCGCACTGGAG
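
As the comment in `readFASTA` notes, that function handles a single entry only. A minimal sketch of a multi-entry reader follows. The name `read_fasta_records` and the choice to return a dict keyed by the description line are assumptions made for illustration; they are not part of the assignment.

```python
def read_fasta_records(filename):
    """Return {description: sequence} for every record in a FASTA file.

    Hypothetical helper, not part of the original answer.
    """
    records = {}
    name = None
    with open(filename, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                name = line[1:]  # description without the leading ">"
                records[name] = []
            elif name is not None:
                records[name].append(line)  # sequence line of the current record
    # join the collected lines once per record
    return {name: "".join(parts) for name, parts in records.items()}
```

Collecting the lines in a list and joining them once avoids repeated string concatenation on long sequences.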


## 2. 
Write a Python function that takes as input a sequence string and returns a list with 4 entries
that are the number of A, C, G, and T in the sequence [2pt].

In [3]:
#the following function uses dict structure to store count
def countBases1(seq):
    count = {"A": 0, "C": 0, "G": 0, "T": 0} #define the data structure
    for base in seq: #iterate through the string
        count[base] += 1 #count
    return [count["A"], count["C"], count["G"], count["T"]]

#the following function uses "Counter" data structure to directly generate the count
from collections import Counter
def countBases2(seq):
    count = Counter(seq)
    return [count["A"], count["C"], count["G"], count["T"]]

#the following function uses list data structure
def countBases3(seq):
    count = [0] * 4 # count = [0, 0, 0, 0]
    for base in seq:
        if base == "A":
            count[0] += 1
        elif base == "C":
            count[1] += 1
        elif base == "G":
            count[2] += 1
        elif base == "T": #any character other than A, C, G, T is ignored
            count[3] += 1
    return count

#countBases1 and countBases2 will generally be faster, since they need fewer comparisons per base than the if/elif chain in countBases3.

In [4]:
print(countBases1(seq))
print(countBases2(seq))
print(countBases3(seq))

[41, 22, 39, 37]
[41, 22, 39, 37]
[41, 22, 39, 37]
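
The functions above assume the sequence contains only uppercase A, C, G, and T; `countBases1`, for example, raises a `KeyError` on any other character. Real records sometimes contain lowercase letters or ambiguity codes such as N. One possible way to tolerate them is sketched below. The name `count_bases_safe` is hypothetical, and ignoring non-ACGT symbols is an assumption about the desired behavior.

```python
def count_bases_safe(seq):
    """Count A, C, G, T case-insensitively, ignoring every other symbol."""
    seq = seq.upper()
    # str.count scans the string once per base
    return [seq.count(base) for base in "ACGT"]

print(count_bases_safe("ACGTNacgtRY"))  # [2, 2, 2, 2]
```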


## 3. 
Write a Python function that takes two inputs: a sequence string and a string of two letters (e.g.,
“CG” or “CT”). This function returns the number of times the two letters occur consecutively in
the sequence. The count should be performed by sliding a 2-nt window along the
sequence one base at a time. For example, given the input pair (“AAAAACAAACC”, “AA”), the
function should return 6 [2pt].

In [6]:
#the following function uses for loop to count the occurrence
def countDi1(seq, di): #di must be length of 2
    count = 0
    for i in range(len(seq) - 1): 
        if seq[i:i+2] == di:
            count += 1
    return count

#the following function uses built-in function "find" to count the occurrence
#each time we "find" one occurrence, we move to the next start position to perform "find" again
#"find" function will return -1 if it can't find the substring
def countDi2(seq, di):
    count = start = 0
    while True:
        start = seq.find(di, start) + 1 
        if start > 0:
            count += 1
        else:
            return count
        
#countDi2 is typically faster than countDi1 because the search runs inside the built-in str.find
#rather than in a Python-level loop over every position.
#countDi2 can also count substrings of any length, such as k-mers rather than dinucleotides,
#occurring in the sequence.

In [7]:
print(countDi1("AAAAACAAACC", "AA"))
print(countDi2("AAAAACAAACC", "AA"))
print(countDi2(seq, "AG"))

6
6
16
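
A tempting shortcut is `seq.count(di)`, but `str.count` counts non-overlapping occurrences only, which does not match the sliding-window definition in the question. The sketch below contrasts it with a regular-expression lookahead, which does count overlapping matches. The name `count_overlapping` is hypothetical.

```python
import re

def count_overlapping(seq, sub):
    # a zero-width lookahead matches at every start position without consuming characters
    return len(re.findall(f"(?={re.escape(sub)})", seq))

print("AAAAACAAACC".count("AA"))               # 3 (non-overlapping)
print(count_overlapping("AAAAACAAACC", "AA"))  # 6 (sliding window)
```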


## 4.
Explore the NCBI website, go to the following two pages, and download the FASTA files for the
human gene PTPN11 and its Drosophila orthologue csw.
<blockquote>
<a href="https://www.ncbi.nlm.nih.gov/nuccore/NM_002834">https://www.ncbi.nlm.nih.gov/nuccore/NM_002834</a>
<br>
<a href="https://www.ncbi.nlm.nih.gov/nuccore/NM_057783.3">https://www.ncbi.nlm.nih.gov/nuccore/NM_057783.3</a>
</blockquote>
For each of the two FASTA files, print the output of function #2 and function #3 with input “CG”.
Compare the results and describe your finding [2pt].

In [8]:
seq = readFASTA("sequence1.fasta")
print(countBases1(seq))
print(countBases2(seq))
print(countBases3(seq))
print(countDi1(seq, "CG"))
print(countDi2(seq, "CG"))

[1773, 1139, 1410, 1751]
[1773, 1139, 1410, 1751]
[1773, 1139, 1410, 1751]
99
99


In [9]:
seq = readFASTA("sequence2.fasta")
print(countBases1(seq))
print(countBases2(seq))
print(countBases3(seq))
print(countDi1(seq, "CG"))
print(countDi2(seq, "CG"))

[2395, 1876, 1675, 1718]
[2395, 1876, 1675, 1718]
[2395, 1876, 1675, 1718]
455
455
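
To compare CG counts between sequences of different length and composition, one common normalization is the observed/expected ratio: the CG count divided by (C count × G count / sequence length). This ratio is not part of the original answer. The sketch also assumes that `sequence1.fasta` is the human PTPN11 record and `sequence2.fasta` is the Drosophila csw record, following the order of the links in the question. The counts are copied from the outputs above.

```python
def cpg_obs_exp(base_counts, cg_count):
    """Observed/expected CG ratio from [A, C, G, T] counts (hypothetical helper)."""
    n = sum(base_counts)
    _, c, g, _ = base_counts
    expected = c * g / n  # expected CG count if C and G were placed independently
    return cg_count / expected

human = cpg_obs_exp([1773, 1139, 1410, 1751], 99)   # assumed: sequence1.fasta
fly = cpg_obs_exp([2395, 1876, 1675, 1718], 455)    # assumed: sequence2.fasta
print(round(human, 2), round(fly, 2))  # 0.37 1.11
```

Under these assumptions, CG is strongly under-represented in the human transcript (about 0.37 of the expected count) and close to expectation in the fly transcript (about 1.11). This is consistent with the well-known CpG depletion of vertebrate genomes, which is commonly attributed to deamination of methylated cytosines.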


## 5.
[Bonus] Write another Python function that takes as input a sequence string and returns a list
with 16 entries that are the outputs of function #3 for all 16 possible two letter strings [Bonus 1
pt].


In [18]:
#the following function first generates all possible dinucleotides, then runs function #3 on each one to count it.
from itertools import product
def countAllDi1(seq):
    result = []
    for di in product("ACGT", repeat = 2):
        result.append(countDi2(seq, "".join(di)))
    #using for loop to generate all possible dinucleotides instead:
    '''
    for base1 in "ACGT":
        for base2 in "ACGT":
            di = base1 + base2
    '''
    return result

#the following function iterates through the sequence once and counts all dinucleotides.
from collections import Counter
def countAllDi2(seq):
    count = Counter() #count = {}
    for i in range(len(seq) - 1):
        count[seq[i:i+2]] += 1
        '''
        if seq[i:i+2] in count:
            count[seq[i:i+2]] += 1
        else:
            count[seq[i:i+2]] = 1
        '''
    return count #you can also return a list of these values with a given order


#If you want to count k-mers instead of dinucleotides:
from itertools import product
def countAllKmer_slow(seq, k):
    result = []
    for kmer in product("ACGT", repeat = k): #all 4**k possible k-mers
        result.append(countDi2(seq, "".join(kmer))) #one scan of seq per k-mer
    return result

from collections import Counter
def countAllKmer(seq, k):
    count = Counter()
    for i in range(len(seq) - k + 1):
        count[seq[i:i+k]] += 1
    return count #you can also return a list of these values with a given order
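
The two k-mer strategies above do different amounts of work: the slow version scans the whole sequence once for each of the 4**k possible k-mers, while the single-pass version visits each position once. The self-contained sketch below checks that the two strategies give the same counts on a random sequence. The function names and the random test sequence are hypothetical stand-ins, not the originals.

```python
from collections import Counter
from itertools import product
import random

def kmer_counts_single_pass(seq, k):
    # one pass over the sequence, then read the counts out in ACGT order
    counts = Counter(seq[i:i+k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]

def count_overlapping(seq, sub):
    # repeated str.find, restarting one position after each match
    count = start = 0
    while True:
        start = seq.find(sub, start) + 1
        if start == 0:  # find returned -1: no further match
            return count
        count += 1

def kmer_counts_by_scanning(seq, k):
    # one scan of the sequence for each of the 4**k k-mers
    return [count_overlapping(seq, "".join(p)) for p in product("ACGT", repeat=k)]

random.seed(0)
demo_seq = "".join(random.choice("ACGT") for _ in range(2000))
assert kmer_counts_single_pass(demo_seq, 3) == kmer_counts_by_scanning(demo_seq, 3)
```

Because the number of scans grows as 4**k, the gap between the two approaches widens quickly as k increases.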

In [11]:
print(countAllDi1(seq))
print(countAllDi2(seq))

[870, 521, 481, 523, 621, 452, 455, 348, 499, 501, 323, 352, 404, 402, 416, 495]
Counter({'AA': 870, 'CA': 621, 'AT': 523, 'AC': 521, 'GC': 501, 'GA': 499, 'TT': 495, 'AG': 481, 'CG': 455, 'CC': 452, 'TG': 416, 'TA': 404, 'TC': 402, 'GT': 352, 'CT': 348, 'GG': 323})
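
`countAllDi2` returns a `Counter`, while the question asks for a list with 16 entries. One possible conversion is sketched below, assuming the same AA, AC, ..., TT order that `countAllDi1` produces. The name `di_counts_as_list` is hypothetical.

```python
from collections import Counter
from itertools import product

def di_counts_as_list(seq):
    count = Counter(seq[i:i+2] for i in range(len(seq) - 1))
    # a Counter returns 0 for missing keys, so absent dinucleotides need no special case
    return [count["".join(di)] for di in product("ACGT", repeat=2)]

print(di_counts_as_list("AAAAACAAACC"))
# [6, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```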


In [17]:
%time countAllKmer(seq, 5)

CPU times: user 3.63 ms, sys: 86 µs, total: 3.72 ms
Wall time: 3.78 ms


Counter({'ATTCA': 10,
         'TTCAT': 10,
         'TCATT': 7,
         'CATTC': 7,
         'TCATA': 4,
         'CATAC': 5,
         'ATACC': 3,
         'TACCC': 5,
         'ACCCC': 8,
         'CCCCA': 11,
         'CCCAG': 16,
         'CCAGC': 19,
         'CAGCG': 14,
         'AGCGC': 12,
         'GCGCT': 9,
         'CGCTT': 9,
         'GCTTA': 3,
         'CTTAG': 2,
         'TTAGA': 12,
         'TAGAA': 10,
         'AGAAC': 15,
         'GAACA': 21,
         'AACAC': 14,
         'ACACC': 16,
         'CACCA': 28,
         'ACCAC': 18,
         'CCACC': 24,
         'ACCAG': 8,
         'CCAGG': 5,
         'CAGGC': 6,
         'AGGCC': 6,
         'GGCCA': 8,
         'GCCAC': 13,
         'CCACA': 18,
         'CACAG': 7,
         'ACAGC': 19,
         'CAGCT': 14,
         'AGCTC': 11,
         'GCTCC': 6,
         'CTCCC': 7,
         'TCCCG': 3,
         'CCCGT': 4,
         'CCGTC': 3,
         'CGTCC': 9,
         'GTCCG': 6,
         'TCCGA': 9,
         'CCG

In [19]:
%time countAllKmer_slow(seq, 5)

CPU times: user 27.6 ms, sys: 1.5 ms, total: 29.1 ms
Wall time: 29.3 ms


[58,
 35,
 14,
 28,
 36,
 18,
 19,
 13,
 13,
 10,
 5,
 13,
 21,
 12,
 12,
 22,
 43,
 14,
 24,
 12,
 9,
 11,
 8,
 5,
 16,
 11,
 10,
 6,
 9,
 1,
 14,
 8,
 27,
 7,
 12,
 9,
 19,
 5,
 11,
 7,
 5,
 4,
 5,
 3,
 4,
 8,
 9,
 14,
 27,
 10,
 5,
 21,
 12,
 14,
 16,
 11,
 7,
 6,
 6,
 12,
 11,
 6,
 14,
 12,
 41,
 13,
 12,
 14,
 9,
 16,
 12,
 4,
 15,
 19,
 4,
 6,
 9,
 8,
 6,
 12,
 16,
 18,
 8,
 8,
 19,
 8,
 7,
 3,
 7,
 7,
 2,
 3,
 2,
 8,
 3,
 6,
 12,
 10,
 11,
 8,
 7,
 10,
 8,
 6,
 10,
 8,
 4,
 2,
 3,
 5,
 5,
 5,
 6,
 8,
 2,
 8,
 2,
 1,
 6,
 1,
 8,
 8,
 7,
 6,
 2,
 4,
 2,
 10,
 20,
 15,
 14,
 13,
 11,
 4,
 10,
 2,
 12,
 8,
 8,
 1,
 6,
 9,
 9,
 6,
 19,
 14,
 30,
 7,
 9,
 3,
 13,
 6,
 8,
 12,
 14,
 7,
 2,
 11,
 12,
 2,
 5,
 3,
 10,
 2,
 6,
 6,
 3,
 2,
 7,
 4,
 0,
 2,
 4,
 2,
 3,
 2,
 11,
 12,
 2,
 4,
 3,
 3,
 7,
 2,
 10,
 7,
 2,
 8,
 3,
 10,
 10,
 9,
 20,
 10,
 6,
 13,
 14,
 3,
 6,
 5,
 4,
 10,
 1,
 5,
 20,
 6,
 11,
 24,
 12,
 6,
 4,
 4,
 10,
 6,
 7,
 3,
 17,
 11,
 10,
 3,
 8,
 7,
 6,
 7,
 9,
 3,
 4,
