# Inferring mRNA from Protein

## Pitfalls of Reverse Translation

When researchers discover a new protein, they would like to infer the strand of mRNA from which this protein could have been translated, thus allowing them to locate genes associated with this protein on the genome.

Unfortunately, although any RNA string can be translated into a unique protein string, reversing the process yields a huge number of possible RNA strings from a single protein string because most amino acids correspond to multiple RNA codons (see the RNA Codon Table).

Because of memory considerations, most data formats that are built into languages have upper bounds on how large an integer can be: in some versions of Python, an "int" variable may be required to be no larger than $2^{31} − 1$, or 2,147,483,647. As a result, to deal with very large numbers in Rosalind, we need to devise a system that allows us to manipulate large numbers without actually having to store large numbers.

## Problem 

For positive integers $a$ and $n$, $a$ modulo $n$ (written $a$ mod $n$ in shorthand) is the remainder when $a$ is divided by $n$. For example, $29$ mod $11 = 7$ because $29 = 11 × 2 + 7$.
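
In Python, the remainder is available through the `%` operator, and the built-in `divmod` returns the quotient and the remainder together. A minimal sketch that checks the example above (the variable names are illustrative only):

```python
# Check the worked example: 29 = 11 * 2 + 7, so 29 mod 11 is 7.
a, n = 29, 11

quotient, remainder = divmod(a, n)  # quotient is 2, remainder is 7
print(remainder)                    # same value as a % n

# The defining identity a = n * quotient + remainder holds.
print(a == n * quotient + remainder)
```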

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that $a$ and $b$ are congruent modulo $n$ if $a$ mod $n = b$ mod $n$; in this case, we use the notation $a ≡ b$ mod $n$.

Two useful facts in modular arithmetic are that if $a ≡ b$ mod $n$ and $c ≡ d$ mod $n$, then $a + c ≡ b + d$ mod $n$ and $a × c ≡ b × d$ mod $n$. To check your understanding of these rules, you may wish to verify these relationships for $a = 29$, $b = 73$, $c = 10$, $d = 32$, and $n = 11$.
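
The suggested check can also be done in a few lines of Python. This is only a sketch using the values given above; the names `sum_matches` and `product_matches` are introduced here for illustration.

```python
# Values suggested in the text.
a, b, c, d, n = 29, 73, 10, 32, 11

# Premises: a is congruent to b, and c is congruent to d, modulo n.
assert a % n == b % n
assert c % n == d % n

# Conclusions: sums and products of congruent numbers stay congruent.
sum_matches = (a + c) % n == (b + d) % n
product_matches = (a * c) % n == (b * d) % n
print(sum_matches, product_matches)
```

Both comparisons hold, which is what allows a long product to be reduced modulo $n$ at any intermediate step without changing the final remainder.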

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.

## Thinking about the problem

1. Get the codon table as a dictionary.
2. What are the stop codons, and how many are there? There are 3: UAA, UAG, and UGA.
3. How many codons encode each amino acid?

* Add the codon table to this workbook.
* Make a dictionary containing the number of codons per amino acid.
* Use that dictionary to count the possible RNA strings: multiply the number of codons for amino acid 1 × the number of codons for amino acid 2 × etc.
* Multiply the result by the number of stop codons.
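
The counting rule in the list above can be written as a single product. As a sketch, write $c(x)$ for the number of codons that encode amino acid $x$ and $p_1 p_2 \cdots p_k$ for the protein string (both symbols are introduced here for illustration). The factor 3 is the number of stop codons, and 1,000,000 is the modulus used in the code below.

$$N ≡ 3 × \prod_{i=1}^{k} c(p_i) \text{ mod } 1{,}000{,}000$$

For the two-residue protein MA, M has one codon and A has four, so $N = 3 × 1 × 4 = 12$.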


In [1]:
codons = {'UUU': 'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V', 'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V', 'UUA': 'L', 'CUA': 'L', 
           'AUA': 'I', 'GUA': 'V', 'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V', 'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A', 
           'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A', 'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A', 'UCG': 'S', 'CCG': 'P', 
           'ACG': 'T', 'GCG': 'A', 'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D', 'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D', 
           'UAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E', 'UAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E', 'UGU': 'C', 
           'CGU': 'R', 'AGU': 'S', 'GGU': 'G', 'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G', 'UGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 
           'GGA': 'G', 'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G' 
}                                                              


In [2]:
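# Load a Rosalind dataset file and return its contents as a stripped string
def loadRosalind(filepath):
    # Echo the path so the output shows which dataset was read
    print(filepath)
    txt = ""  # returned as-is if the file cannot be found
    try:
        with open(filepath) as file:
            txt = file.read().strip()
    except FileNotFoundError:
        print("File not found")

    return txt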

In [3]:
# Make a dictionary of the number of possible codons per amino acid
def codonPossible():
    possible = {}
    # Each codon adds one to the tally of the amino acid (or 'Stop') it encodes
    for k, v in codons.items():
        if v not in possible:
            possible[v] = 0
        possible[v] += 1
    return possible

In [8]:
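def permutations(sequence):
    per = codonPossible()
    # print(per)  # uncomment to inspect the codon counts
    num = per['Stop']  # there are three stop codons
    for a in sequence:
        # Reduce at every step so the running product stays below 1,000,000
        num = (num * per[a]) % 1000000
    return num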
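
Because multiplication respects congruence, the running product can be reduced modulo 1,000,000 after every multiplication or once at the end, and the result is the same. The sketch below illustrates this with two hypothetical helper functions and a codon-count dictionary restricted to five amino acids; the names `MODULUS`, `STOP_CODONS`, `CODON_COUNTS`, and the test sequence are assumptions made for the example, not part of the notebook.

```python
MODULUS = 1_000_000
STOP_CODONS = 3

# Codon counts for a few amino acids only, taken from the RNA codon table.
CODON_COUNTS = {"M": 1, "A": 4, "L": 6, "S": 6, "R": 6}


def count_reduce_at_end(protein):
    """Multiply everything first, then take the remainder once."""
    total = STOP_CODONS
    for amino_acid in protein:
        total *= CODON_COUNTS[amino_acid]
    return total % MODULUS


def count_reduce_each_step(protein):
    """Take the remainder after every multiplication."""
    total = STOP_CODONS
    for amino_acid in protein:
        total = (total * CODON_COUNTS[amino_acid]) % MODULUS
    return total


protein = "M" + "LSR" * 50  # hypothetical sequence, long enough to overflow 32 bits
print(count_reduce_at_end(protein) == count_reduce_each_step(protein))
```

Python integers grow as needed, so both versions run; reducing at each step matters in languages with fixed-width integers and keeps the intermediate values small in any language.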

In [9]:
test = "MA"

In [10]:
print(permutations(test))

{'F': 2, 'L': 6, 'I': 3, 'V': 4, 'M': 1, 'S': 6, 'P': 4, 'T': 4, 'A': 4, 'Y': 2, 'H': 2, 'N': 2, 'D': 2, 'Stop': 3, 'Q': 2, 'K': 2, 'E': 2, 'C': 2, 'R': 6, 'G': 4, 'W': 1}
12


In [7]:
pro = loadRosalind("/mnt/c/Users/rwswo/Documents/Bioinformatics/git/rosalindTry/rosalind_mrna.txt")
print(permutations(pro))

/mnt/c/Users/rwswo/Documents/Bioinformatics/git/rosalindTry/rosalind_mrna.txt
161984
