Problem
For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11 = 7 because 29 = 11 × 2 + 7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if a mod n = b mod n; in this case, we use the notation a ≡ b mod n.

Two useful facts in modular arithmetic are that if a ≡ b mod n and c ≡ d mod n, then a + c ≡ b + d mod n and a × c ≡ b × d mod n. To check your understanding of these rules, you may wish to verify these relationships for a = 29, b = 73, c = 10, d = 32, and n = 11.
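
The suggested check can be sketched in a few lines of Python. The variable names simply mirror the symbols in the text; `sum_ok` and `prod_ok` are illustrative names, not part of the problem statement.

```python
# Values suggested in the problem statement.
a, b, c, d, n = 29, 73, 10, 32, 11

# Preconditions: a is congruent to b, and c is congruent to d, modulo n.
assert a % n == b % n
assert c % n == d % n

# The two rules: congruence is preserved by addition and by multiplication.
sum_ok = (a + c) % n == (b + d) % n
prod_ok = (a * c) % n == (b * d) % n
print(sum_ok, prod_ok)
```

Both comparisons should come out true, since 39 and 105 leave the same remainder modulo 11, as do 290 and 2336.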

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.
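
As a rough illustration of why the modulus helps, the sketch below multiplies a made-up list of factors twice: once keeping the full product, and once reducing modulo 1,000,000 after every step. The list `factors` is a hypothetical stand-in for 1000 residues with 4 codons each. Python integers do not overflow, so here the benefit is only that the running value stays small; in languages with fixed-width integers the per-step reduction is what keeps the result correct.

```python
MOD = 1_000_000

# Hypothetical input: 1000 factors of 4 (e.g. 1000 residues with 4 codons each).
factors = [4] * 1000

full = 1      # exact product, grows to hundreds of digits
running = 1   # product reduced modulo MOD after every multiplication
for f in factors:
    full *= f
    running = (running * f) % MOD

# The multiplication rule for congruences guarantees both approaches agree.
print(len(str(full)) > 600)
print(full % MOD == running)
```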

Given: A protein string of length at most 1000 aa.

Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)

In [17]:
# two congruent numbers: 29 ≡ 73 mod 11
29 % 11 == 73 % 11

True

In [19]:
# what is modulo?
# a % n == a - (n * (a//n))
29 % 11 == 7
29 - (11 * (29//11))

7


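Worked example: write the number of possible codons under each letter of the protein sequence, then multiply.
S stands for the stop codon in this case (not serine); there are 3 possible stop codons.

MAS
1 × 4 × 3 = 12


MA
1 × 4 = 4

Stop codons are implicit in FASTA protein sequences; for example, the amino acid sequence MA actually has a stop codon after the A, so its count is 4 × 3 = 12.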

In [29]:

# list comprehension: capital letters of the alphabet, less 'B', 'J', 'O', 'U', 'X', and 'Z'
import string
aa_list = [ltr for ltr in string.ascii_uppercase if ltr not in 'BJOUXZ']
print(aa_list, len(aa_list))

['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'] 20


In [32]:
# create a dictionary with amino acid letters as keys and the number of codons for each amino acid as values.
# caveat: 'L', 'S', and 'R' appear twice in aa_list, and dict(zip(...)) keeps only the last value per key,
# so they end up as L=4, S=2, R=2 instead of the 6 codons each has in the codon table below.
aa_list = ['V','A','D','E','G','F','L','S','Y','C','W','L','P','H','Q','R','I','M','T','N','K','S','R']
aa_combos_list = [4,4,2,2,4,2,2,4,2,2,1,4,4,2,2,4,3,1,4,2,2,2,2]
aa_combo_dict = dict(zip(aa_list, aa_combos_list))
print(aa_combo_dict)

{'V': 4, 'A': 4, 'D': 2, 'E': 2, 'G': 4, 'F': 2, 'L': 4, 'S': 2, 'Y': 2, 'C': 2, 'W': 1, 'P': 4, 'H': 2, 'Q': 2, 'R': 2, 'I': 3, 'M': 1, 'T': 4, 'N': 2, 'K': 2}


In [36]:
# calculate the number of possible RNA sequences that could produce an amino acid sequence.
# the product is always multiplied by 3 at the end: every sequence ends with one stop codon, and there are 3 possible stop codons.
test_seq = 'MA'

product = 1
for aa in test_seq:
    product *= aa_combo_dict[aa]

product *= 3
print(product)

12


In [58]:
# read the input and join all lines into a single string
# note: this cell uses the hand-built aa_combo_dict from above, in which L, S, and R are undercounted
with open('input.txt', 'r') as f:
    test_seq = ''.join(f.read().splitlines())

product = 1
for aa in test_seq:
    product = product * aa_combo_dict[aa]
# multiply by 3 for the stop codon, then reduce modulo 1,000,000
product = (product * 3) % 1000000
print(product)

503296


In [135]:
import re
with open('codon_table.txt', 'r') as f_codon:
    codon_table = f_codon.read()

print(codon_table)
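# match a single capital letter between spaces, or one followed by a newline (last column);
# the multi-letter 'Stop' entries are not matched
pattern = r' [A-Z] | [A-Z]\n'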
aminos_repeated = re.findall(pattern, codon_table)
print(aminos_repeated)

UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G

[' F ', ' L ', ' I ', ' V\n', ' F ', ' L ', ' I ', ' V\n', ' L ', ' L ', ' I ', ' V\n', ' L ', ' L ', ' M ', ' V\n', ' S ', ' P ', ' T ', ' A\n', ' S ', ' P ', ' T ', ' A\n', ' S ', ' P ', ' T ', ' A\n', ' S ', ' P ', ' T ', ' A\n', ' Y ', ' H ', ' N ', ' D\n', ' Y ', ' H ', ' N ', ' D\n', ' Q ', ' K ', ' E\n', ' Q ', ' K ', ' E\n', ' C ', ' R ', ' S ', ' G\n', ' C ', ' R 

In [141]:
# getting rid of spaces and newline characters
aminos_repeated = ''.join(aminos_repeated).split()
print(aminos_repeated)


['F', 'L', 'I', 'V', 'F', 'L', 'I', 'V', 'L', 'L', 'I', 'V', 'L', 'L', 'M', 'V', 'S', 'P', 'T', 'A', 'S', 'P', 'T', 'A', 'S', 'P', 'T', 'A', 'S', 'P', 'T', 'A', 'Y', 'H', 'N', 'D', 'Y', 'H', 'N', 'D', 'Q', 'K', 'E', 'Q', 'K', 'E', 'C', 'R', 'S', 'G', 'C', 'R', 'S', 'G', 'R', 'R', 'G', 'W', 'R', 'R', 'G']
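
An alternative to the regular expression, sketched here under the assumption that the table is always whitespace-separated `codon amino-acid` pairs, is to split the text into tokens and pair them up. The names `CODON_TABLE`, `codon_to_aa`, and `codon_counts` are illustrative; the table is embedded as a string so the sketch runs without `codon_table.txt`. One side effect of this approach is that the `Stop` entries are counted too, so the factor of 3 does not have to be hard-coded.

```python
from collections import Counter

# Same layout as the codon table printed above, embedded to keep the sketch self-contained.
CODON_TABLE = """\
UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G
"""

# Tokens alternate: codon, amino acid, codon, amino acid, ...
tokens = CODON_TABLE.split()
codon_to_aa = dict(zip(tokens[0::2], tokens[1::2]))

# Number of codons per amino acid, including the 'Stop' entries.
codon_counts = Counter(codon_to_aa.values())
print(len(codon_to_aa), codon_counts["L"], codon_counts["Stop"])
```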


In [148]:
# counting the occurrences of each amino acid
from collections import Counter
print(Counter(aminos_repeated).values())
print(Counter(aminos_repeated).keys())


dict_values([2, 6, 3, 4, 1, 6, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 6, 4, 1])
dict_keys(['F', 'L', 'I', 'V', 'M', 'S', 'P', 'T', 'A', 'Y', 'H', 'N', 'D', 'Q', 'K', 'E', 'C', 'R', 'G', 'W'])
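
As a cross-check that does not depend on the file, the counts above can be compared with a codon table generated in code. This is a sketch, not the notebook's method: `AMINO_ACIDS` is the standard genetic code written as one letter per codon in U, C, A, G order (with `*` for stop), and `printed_counts` re-pairs the keys and values shown in the output above.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# UUU, UUC, UUA, UUG, UCU, ... in the same order as AMINO_ACIDS.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
codon_to_aa = dict(zip(codons, AMINO_ACIDS))
generated_counts = Counter(codon_to_aa.values())

# Keys and values from the Counter output above, paired as amino acid -> codon count.
printed_counts = dict(zip(
    "FLIVMSPTAYHNDQKECRGW",
    [2, 6, 3, 4, 1, 6, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 6, 4, 1],
))

# Set the stop codons aside, then compare the 20 amino acid counts.
stop_count = generated_counts.pop("*")
print(dict(generated_counts) == printed_counts, stop_count)
```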


In [152]:
# synthesis solution
import re
from collections import Counter

# count how many codons encode each amino acid (the 'Stop' entries are not matched by the pattern)
with open('codon_table.txt', 'r') as f_codon:
    aminos_counter = Counter(''.join(re.findall(' [A-Z] | [A-Z]\\n', f_codon.read())).split())
    aminos_combos_dict = dict(aminos_counter)

# read the protein sequence and join all lines into a single string
with open('input.txt', 'r') as f:
    test_seq = ''.join(f.read().splitlines())

product = 1
for aa in test_seq:
    product = product * aminos_combos_dict[aa]
# multiply by 3 for the three possible stop codons, then reduce modulo 1,000,000
product = (product * 3) % 1000000
print(product)

414976
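
The same calculation can be wrapped in a reusable function that reduces modulo 1,000,000 after every multiplication, as the problem statement hints. This is a minimal sketch: `count_rna_strings`, `CODON_COUNTS`, and `STOP_CODONS` are illustrative names, and the counts are typed in from the Counter output above rather than read from `codon_table.txt`.

```python
# Codons per amino acid, taken from the Counter output above.
CODON_COUNTS = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3, 'K': 2, 'L': 6,
    'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2,
}
STOP_CODONS = 3  # UAA, UAG, UGA


def count_rna_strings(protein, modulus=1_000_000):
    """Number of RNA strings that translate to `protein`, modulo `modulus`."""
    # Start with the implicit stop codon, then fold in each residue.
    total = STOP_CODONS % modulus
    for aa in protein:
        # A KeyError here means the input contains a letter that is not an amino acid.
        total = (total * CODON_COUNTS[aa]) % modulus
    return total


print(count_rna_strings("MA"))  # 12, matching the small test above
```

Because the running value never exceeds the modulus times 6, this version would also carry over to languages without arbitrary-precision integers.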
