<img src="../graphics/icr_logo.png" alt="drawing" width="300"/>

# Basic programming with Python
## Part 05: DNA Analysis

With a few tweaks, we can extend the codon script to analyse a real DNA sequence: the partial mRNA annotated as breast cancer 2 early onset (BRCA2) in *Thalassiosira pseudonana* CCMP1335.

***

⚙️ ***Exercise:*** 
- Count the number of occurrences of each distinct codon in the BRCA2 sequence.

***

In [1]:
bases = ('t', 'c', 'a', 'g')

codon_counters = {}

# Build every possible three-base combination (codon)
for base1 in bases:
    for base2 in bases:
        for base3 in bases:
            # Add a key to the codon_counters dictionary using the bases you have iterated over.
            # What is their initial "count"?
            codon_counters[base1 + base2 + base3] = 0
            

# Check the result: every codon should be present with a count of 0
print(codon_counters)

{'ttt': 0, 'ttc': 0, 'tta': 0, 'ttg': 0, 'tct': 0, 'tcc': 0, 'tca': 0, 'tcg': 0, 'tat': 0, 'tac': 0, 'taa': 0, 'tag': 0, 'tgt': 0, 'tgc': 0, 'tga': 0, 'tgg': 0, 'ctt': 0, 'ctc': 0, 'cta': 0, 'ctg': 0, 'cct': 0, 'ccc': 0, 'cca': 0, 'ccg': 0, 'cat': 0, 'cac': 0, 'caa': 0, 'cag': 0, 'cgt': 0, 'cgc': 0, 'cga': 0, 'cgg': 0, 'att': 0, 'atc': 0, 'ata': 0, 'atg': 0, 'act': 0, 'acc': 0, 'aca': 0, 'acg': 0, 'aat': 0, 'aac': 0, 'aaa': 0, 'aag': 0, 'agt': 0, 'agc': 0, 'aga': 0, 'agg': 0, 'gtt': 0, 'gtc': 0, 'gta': 0, 'gtg': 0, 'gct': 0, 'gcc': 0, 'gca': 0, 'gcg': 0, 'gat': 0, 'gac': 0, 'gaa': 0, 'gag': 0, 'ggt': 0, 'ggc': 0, 'gga': 0, 'ggg': 0}
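
The three nested loops make the construction explicit. As an alternative sketch (not part of the original exercise), the same dictionary can be built with `itertools.product` from the standard library. The name `codon_counters_alt` is introduced here only for illustration.

```python
from itertools import product

bases = ('t', 'c', 'a', 'g')

# product(bases, repeat=3) yields every ordered triple of bases,
# in the same order as three nested for-loops over `bases`.
codon_counters_alt = {''.join(triple): 0 for triple in product(bases, repeat=3)}

# Four bases in three positions give 4 * 4 * 4 = 64 codons.
print(len(codon_counters_alt))
```

Either approach gives 64 keys, all starting at zero; the nested loops are easier to read when you are new to Python, while `product` scales more conveniently if the word length changes.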


In [2]:
# You don't need to modify this cell: we have just defined the sequence here.

# https://www.genome.jp/dbget-bin/www_bget?tps:THAPS_263089
# Thalassiosira pseudonana CCMP1335 breast cancer 2 early onset (BRCA2), partial mRNA
DNA_seq = 'gggtgcgacgattcattgttttcggacaagtggataggcaaccactaccggtggattgtc' + \
          'tggaagctagcagcaatggagagacggtttccacaccatcttggaggacattacttgacg' + \
          'tacgagcgtgtgctgaaacaaatgaagggccgctacgataaggaacttcgtaatttcaga' + \
          'cggcctgcagtacgcataatgctcaaccgagatgttgcagcgagtttgccagtcatctta' + \
          'tgcgtaagccaaatccttcgattcaaatcaagaccgccaaaaggaagttcttccgacgag' + \
          'atcaaagaagaagtccgactggagttgacggatggatggtactcactacctgctgtagtg' + \
          'gacgaaatactgttgaagtttgttgaagaaaggagaatcgcagtgggatcaaaactaatg' + \
          'atttgcaatgggcagttagttggatctgatgacggagtggagcctctcgatgacagctac' + \
          'tcatcttccaaacgagattgtcctctattgctgggcatctctgccaacaactcccgttta' + \
          'gcaagatgggatgcaactctaggttttgtacctcgcaacaactctaatctatacggcggc' + \
          'aatcttttggtcaaatccctgcaagacattttcatcggcggaggtactgttccggctatt' + \
          'gatttggttgtttgtaagaagtacccaaggatgtttctagagcaattaaacggtggagct' + \
          'tccattcatcttacagaagccgaagaagcagcacgccaaagtgagtacgattcaaggcat' + \
          'cagcgagcaagcgagagatatgccgacgatgctacgaaggaatgttcagaggtaagttca' + \
          'ttgctgttcacattcttcactatgaagccacttccgttgctttggtacaatcttgtcact' + \
          'gactcatcttttggcgttcatgattcgcacaggaaatcgatgaggatgctcctactcagt' + \
          'ggaaagagatga'

# Calculate length of sequence
sequence_length = len(DNA_seq)
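
Because the sequence is read three bases at a time, it is worth checking whether its length divides evenly by three before counting. The sketch below uses a hypothetical helper, `count_whole_codons`, and a made-up toy sequence rather than the BRCA2 data; the same call could be applied to `DNA_seq`.

```python
def count_whole_codons(sequence):
    """Return (number of complete codons, number of leftover bases)."""
    return divmod(len(sequence), 3)


# Hypothetical 9-base example, not taken from the BRCA2 sequence
toy_seq = 'atggcgtga'

print(count_whole_codons(toy_seq))        # (3, 0): three codons, nothing left over
print(count_whole_codons(toy_seq + 'a'))  # (3, 1): one stray base at the end
```

A non-zero second value would mean the final slice is shorter than three bases and so would not match any key in `codon_counters`.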

In [3]:
# Step through the sequence one codon at a time: start indices 0, 3, 6, ...
# Stopping at sequence_length - 2 skips any incomplete codon at the very end.
for index_start in range(0, sequence_length - 2, 3):
    index_end = index_start + 3

    # Get slices of the sequence using the variables index_start and index_end
    codon = DNA_seq[index_start:index_end]

    # Update the codon_counters dictionary object
    codon_counters[codon] += 1
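
For comparison, here is a minimal sketch of the same counting idea using `collections.Counter` from the standard library. It is an alternative to the pre-filled dictionary, not the method this exercise asks for, and it runs on a made-up toy sequence (`toy_seq`) so that the result is easy to check by hand.

```python
from collections import Counter

# Hypothetical 12-base example, not taken from the BRCA2 sequence
toy_seq = 'atggcgatgtga'

# Slice the sequence into non-overlapping codons: atg, gcg, atg, tga
codons = [toy_seq[i:i + 3] for i in range(0, len(toy_seq) - 2, 3)]

# Counter tallies each distinct item in the list
toy_counts = Counter(codons)

print(toy_counts['atg'])  # 2
print(toy_counts['ttt'])  # 0, because a Counter returns 0 for missing keys
```

The trade-off is that a `Counter` only stores codons it has actually seen, whereas the pre-filled `codon_counters` dictionary lists all 64 codons, including those with a count of zero.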

In [4]:
# Iterate over (key, value) pairs in the codon_counters dictionary
for key, value in codon_counters.items():

    # If the value is greater than zero, we'll print it out
    if value > 0:
        print(key, ":", value)


ttt : 6
ttc : 6
tta : 4
ttg : 10
tct : 6
tcc : 5
tca : 9
tcg : 3
tat : 1
tac : 10
tgt : 3
tgc : 3
tga : 1
tgg : 6
ctt : 8
ctc : 4
cta : 8
ctg : 6
cct : 5
cca : 5
ccg : 3
cat : 5
cac : 3
caa : 5
cag : 2
cgt : 3
cgc : 4
cga : 5
cgg : 3
att : 5
atc : 6
ata : 3
atg : 8
act : 4
aca : 2
acg : 3
aat : 5
aac : 7
aaa : 8
aag : 10
agt : 5
agc : 3
aga : 7
agg : 5
gtt : 7
gtc : 5
gta : 5
gtg : 4
gct : 4
gcc : 3
gca : 10
gcg : 1
gat : 12
gac : 9
gaa : 10
gag : 9
ggt : 3
ggc : 7
gga : 10
ggg : 2
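
The output above follows the order in which the keys were created. To see the most frequent codons first, the (key, value) pairs can be sorted by value. The sketch below uses a handful of counts copied from the output above; `sample_counts` is a name introduced only for this illustration, and with the full dictionary you would pass `codon_counters.items()` instead.

```python
# A few counts copied from the output above; 'taa' is not printed there,
# so its count is 0.
sample_counts = {'gat': 12, 'ttg': 10, 'tat': 1, 'tga': 1, 'taa': 0}

# Keep only the codons that were observed at least once
observed = [(codon, count) for codon, count in sample_counts.items() if count > 0]

# Sort by count, highest first; ties keep their original order
ranked = sorted(observed, key=lambda pair: pair[1], reverse=True)

for codon, count in ranked:
    print(codon, ":", count)
```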
