# Rosalind Problems - Bioinformatics Stronghold

## Counting DNA Nucleotides (nt)

The nucleus of any living cell consists of macromolecules called chromatin. One class of marcomolecules in the chormatin are nucleic acids. Nucleic acids are polygmers - repeating chains of small, similarly strucutred molecules called monomers (stands.)

The **nucleotide (nt)** is the unit of stand length and is composed of three parts: a sugar molecule, a negatively charged phosphate ion, and a compound called a nucleobase (base.) THe sugar of one nucleotide bonds to the the phosphate of the next nucleotide in the chain, forming a sugar-phosphate backbone.

Nucleotides of a specific type of nucleic acid always contains the same sugar and phosphate molecule; thus they only differ by their choice of base. Therefore a strand of nucleic acid can be differentiated from another based only on the order of its bases - this ordering defines a nucleic acid's **primary structure**. 

For a strand of **deoxyribose nucleic acid (DNA)**, the four necleobases are molecules called **adenine (A)**, **cytosine (C)**, **guanine (G)**, and **thymine (T)**.

### Problem: 

Given a string over the alphabet {A,C,G,T}, count the number of times each symbol occurs in the string.

In [5]:
input_string = "ATTCGTAGGCATCGCTCCTGCCCTGTGGTTACGTTCGCACCGGGTGTCAGGTCTGAAATTATGAGAATTTTACACTCCTAATGTTGGGCGACTGAACTGATTTTGAAGTCAGGCGGTAGTATACTGAGTCAGGGTCGGAGAGTGAGAGATGTATGAGCGTGGTTTGCTCCGCTAGCCCAGGCCGGCGAGTGCTGATGCGTAGCTGTATCTACAATTCGTCAGCTTTTGGCGCAGTAATGAATATCAGAGTGCCTAAGCAGCTCTTGCCCTACGGCTAATCGAATCTTCAGATTCAAAGGGGTACACTAGCTGTTGCCAAACCGTCGGAGACGCCACCTCACGAACTCACATCATATCAGTCGAAGATGGTACGTTAGAGCCCATGTGCCTATGGAGGCATGGTATATAGCCTGGCTGCGGTGTAATGAAGTCACCTGGCCAACCTGGGGCACTTGTCTCGTGACATAGTCTTATAGGGGTGCCGCAAGAAGGTCCCACCAAAAATCACGGATACGCGTTCGCCCGAAGGCCCTTGGAAACCTTCCAGCTCAAACTTGAGAGATACGTAGTTTTCCCACCCGTCACAAGAGGTGAGATTACAACAAAGTATGTCTGCTCGTCCCTGTTAGCTATAGCAGCTCGTATAACAACATTTCCCGACTGCGCGTCGCTGATCGAACAATAGGCTAGAGTTACAGAGCCTATCACAGGCCGAAGAGGACAAGCTTCGCTGGTACCTATAAAGGGAAAACCCCAGATTAATCCCGCCTTATCGTCCTAAGTAGTCGCGGTCAATATGGTACTATCCGACACCCTTTGATTGTGAGTCTTCACTTACCTATTCCGGATTTCCTCTCCCAGTAGGTTATAGTTCATGACGTCTTTTAGCACGGGATCACACGTGGCATGGATTAAATAGGATACACCTGGTGCCATGCAAAGTCCACGCTTAAGCCGAAGAACCCCGTAAAGCAC"

# Create dictonary for each character, scan through string, counting occurences.
dict = {'A':0, 'C':0, 'G':0, 'T':0}

for char in input_string:
    count = dict[char]
    dict[char] = count + 1
    
print(dict.values())

dict_values([247, 244, 242, 242])


## The Second Nucleic Acid (RNA)

Along side DNA in the chromatin a molecule with a different sugar can ribose known as **ribose nucleic acid (RNA)**. RNA differs from DNA as in it contains a base called **uracil** instread of thymine. 

The primary structure of DNA and RNA is similar because DNA serves as the blueprint for the creation of special kinds of RNA molecule called **messenger RNA (mRNA)**. mRNA is created during RNA transcription, during which a strand of DNA is used as a template for constrcuting a strand of RNA by compy nucleotides one at a time, where uracil is used in place of thymine.

### Problem: 

Given a DNA string t, convert t to RNA by replacing all thymine (T) with uracil (U).

In [6]:
t = "AAAACGTTGCCCTCCTCCCAGGTTCTGGTAAACGTGCACAGCGTCCCCTGCCTTTCATATCGTACTTCTAGGCCCTCGTAAGGAGTCCGGTATGTAATATCGATCACGATCGCCTATGAGGTGCGGGAATTGCCAAATCACTGGCCTAGTTGACTCAGTCACCCTCTTTTGCAAGGCCTTTAGCGAGGAAACCTGCACCGCCGAGGTACGCAATTGCTACATGCTACCCAAGAAGGGCATTCAGCACGCGAGAGCGATACTGTACTCGAGCACCACCATCACGATAGAAAGGATGGTTGTATTCACACGCGGTTAGGTGGAACCTTGATAAATGTGCGCGGATGCGAATGTGCCAAGTTGCCTGCTCTTTCTTAAGGTTCCGGTCTGACGGGAACACAAGGCGAGTAAGAGGGTCTTAGCAAGGGGTGCGCCTTAGCGACTACTATTAGCTTGCACACAAACAAGCGAGTAGTATGGCCTCGCTCGTGTTCAGCTCGGTTTGCGCGTGACCATTGCATTTTTTAAATTGAAAAACCAGTGGTGCTGCTCACTAGCATTATCCGTTTGCTAGCCCTTCCAATGTAACAATGCAGTTGGGCTGAAGAGAGACCAGTACGCCCAGCTATTCTAGAATCCACCTCTGGAAGAACAACTGAGTATTCTCGAAGTTTACCAGCATCCCTCAGTAAAATTTCAACTATGATGTTGTCATCCAATTGCTCCCGAGAAAATTCGACGTGTAGCCCGATGAAAGGTAAGTCGAGTGGGATACCTCTAATGATAGTGCAGAGAATGGCCCGTAGCCTCGGCACTAGAGCCCGTTGCCACTAAGAGCCAAGAAAAGCCCTCGTGAGCAGAGCCTCCGTGCAGTCTCTCTGGCCGGCAATGCAAGGACAGGCTGGTTGC"
print(t.replace('T', 'U'))

AAAACGUUGCCCUCCUCCCAGGUUCUGGUAAACGUGCACAGCGUCCCCUGCCUUUCAUAUCGUACUUCUAGGCCCUCGUAAGGAGUCCGGUAUGUAAUAUCGAUCACGAUCGCCUAUGAGGUGCGGGAAUUGCCAAAUCACUGGCCUAGUUGACUCAGUCACCCUCUUUUGCAAGGCCUUUAGCGAGGAAACCUGCACCGCCGAGGUACGCAAUUGCUACAUGCUACCCAAGAAGGGCAUUCAGCACGCGAGAGCGAUACUGUACUCGAGCACCACCAUCACGAUAGAAAGGAUGGUUGUAUUCACACGCGGUUAGGUGGAACCUUGAUAAAUGUGCGCGGAUGCGAAUGUGCCAAGUUGCCUGCUCUUUCUUAAGGUUCCGGUCUGACGGGAACACAAGGCGAGUAAGAGGGUCUUAGCAAGGGGUGCGCCUUAGCGACUACUAUUAGCUUGCACACAAACAAGCGAGUAGUAUGGCCUCGCUCGUGUUCAGCUCGGUUUGCGCGUGACCAUUGCAUUUUUUAAAUUGAAAAACCAGUGGUGCUGCUCACUAGCAUUAUCCGUUUGCUAGCCCUUCCAAUGUAACAAUGCAGUUGGGCUGAAGAGAGACCAGUACGCCCAGCUAUUCUAGAAUCCACCUCUGGAAGAACAACUGAGUAUUCUCGAAGUUUACCAGCAUCCCUCAGUAAAAUUUCAACUAUGAUGUUGUCAUCCAAUUGCUCCCGAGAAAAUUCGACGUGUAGCCCGAUGAAAGGUAAGUCGAGUGGGAUACCUCUAAUGAUAGUGCAGAGAAUGGCCCGUAGCCUCGGCACUAGAGCCCGUUGCCACUAAGAGCCAAGAAAAGCCCUCGUGAGCAGAGCCUCCGUGCAGUCUCUCUGGCCGGCAAUGCAAGGACAGGCUGGUUGC


## The Secondary and Tertiary Structures of DNA

The primary structure of a nucleic acid is determined by the ordering of its bases, yet it does not describe the large, 3D shape of the molecule. In 1953, the following structure for DNA was proposed:

1) DNA is composed of two strings, running in opposite directions. 

2) Each base bonds to a base in the opposite strand: A-T, C-G (always)

3) The two strands are twisted together into a long spiral double helix. 


1) and 2) compose the **secondary structure** of DNA. 3) Describes the **tertiary structure**.

Note that the **complement** of a base is the base in which it bonds to. The bonding of two complementary bases is called a **base pair (bp)**. Thus the length of a DNA molecule is commonly given in bp rather than nt; we can determine the other strand by taking the complement of the first (running in opposite direction.)

Example:

'AAAACCCGGT' <-> 'ACCGGGTTTT'


In [25]:
s = "AACTCCAGACAGGGGCTCTATCGTTACGAGGCGCCCACGTAGGAGGCGCCGCCGTACCCCAACTTTCCTCGATCTTACACGATAGTTGGGGATAGAGAACGGGCCACATGACTCTAGGCAGCCATTTTGTGTCTCGTAAAGGCTGGGGAACCCTTAGCCGCTTCACGACAACGCGATCTGGTGCGCCCCTTGAGGGGACTGCCTAGCGTTAGTATCGTCCAGGTCCATACCTACTCACTACAGTGTATGGGATCCTTTCGCATCGGCGCTTGACATTTTTAGGCATTGCTCTGAAAGTAACGGTCGACTAGAGTGCTCGAGATCCAATGTCAGAAGCCGCTCCACCGATTTAGGGATGGCTACTGAGGTCTCGTAGCGCAGACTCTGTATTATATGAAGGGCCCATATCGCCGCAAATCAGCGGTAGGGGGCGAAATTGGGCAATTCTTCGAGCTGAGTCTCCGGTTATTGTAAGGTTTGCATGAACCTTCGAGCGGGTGTTGTCTTACAAGCCATCCGAGCAGTTCCCCGGCAAGCCCTGCACCCCGCCTGAATGCTGCATTTTTGGTACAACCTAATGTCTTATAGAGATACCTTAGCTAACGGAGTATAATTTCCATTCTTGCCCTCTACTCAAGATAGGTATAGGACAGTGCCTTTCATCCGTGTACTGACGTAAGCTAAGCACTCGGTGTAACCAGCTGTGAAAATGTAGTACCAGGTTTAGAGGATCACGTCAGGGTTCTTTTTATGTTAATGCACAGGGGGAACGTGGACCACTATTAGATAAGGATCCTTCTAAAGTTTTCGTCGTTGCGGATCGACGTTTCCCACGGTTTATA"
s_c = "" # complement of s. 

for char in s:
    if char == 'A':
        s_c += 'T'
    elif char == 'T':
        s_c += 'A'
    elif char == 'C':
        s_c += 'G'
    elif char == 'G':
        s_c += 'C'
        
print(s_c[::-1])

TATAAACCGTGGGAAACGTCGATCCGCAACGACGAAAACTTTAGAAGGATCCTTATCTAATAGTGGTCCACGTTCCCCCTGTGCATTAACATAAAAAGAACCCTGACGTGATCCTCTAAACCTGGTACTACATTTTCACAGCTGGTTACACCGAGTGCTTAGCTTACGTCAGTACACGGATGAAAGGCACTGTCCTATACCTATCTTGAGTAGAGGGCAAGAATGGAAATTATACTCCGTTAGCTAAGGTATCTCTATAAGACATTAGGTTGTACCAAAAATGCAGCATTCAGGCGGGGTGCAGGGCTTGCCGGGGAACTGCTCGGATGGCTTGTAAGACAACACCCGCTCGAAGGTTCATGCAAACCTTACAATAACCGGAGACTCAGCTCGAAGAATTGCCCAATTTCGCCCCCTACCGCTGATTTGCGGCGATATGGGCCCTTCATATAATACAGAGTCTGCGCTACGAGACCTCAGTAGCCATCCCTAAATCGGTGGAGCGGCTTCTGACATTGGATCTCGAGCACTCTAGTCGACCGTTACTTTCAGAGCAATGCCTAAAAATGTCAAGCGCCGATGCGAAAGGATCCCATACACTGTAGTGAGTAGGTATGGACCTGGACGATACTAACGCTAGGCAGTCCCCTCAAGGGGCGCACCAGATCGCGTTGTCGTGAAGCGGCTAAGGGTTCCCCAGCCTTTACGAGACACAAAATGGCTGCCTAGAGTCATGTGGCCCGTTCTCTATCCCCAACTATCGTGTAAGATCGAGGAAAGTTGGGGTACGGCGGCGCCTCCTACGTGGGCGCCTCGTAACGATAGAGCCCCTGTCTGGAGTT


## Rabbits and Recurrences

We begin with a pair of rabbits - one male and one female. Each month such a pair mates, and one month later k pairs are produced (one male, one female.)
After a month of being born, rabbits reach reproductive age. Rabbits never stop reporducing or dying. 

Given n months and value k, what is the the total number of rabbit pairs after n months?

e.g. n=5, k=3 -> 19 pairs. 

#### Recurrence relationship:

Month m=0, there is one pair of mating age, 0 new born. Total: 1
Month m=1, there is one pair of mating age, k pairs of new borns. Total: k+1
Month m=2, there are k+1 pairs of mating age, k pairs of new borns: Total: 2k+1

In general, for the ith month suppose there are x,y pairs of mating age, new born respectively. Then for month i+1, we have
x+y,k(x) respectvely. 

Let $F_i$ denote the number of pairs of rabbits on the $i^{th}$ month. Then for $i=0, i=1$, we have $F_0=1$, $F_1=k+1$, and $F_i = F_{i-1} + kF_{i-2}$ for $i \geq 2$.

In [2]:
def recurrence(m,k):
    if m == 0:
        return 0
    elif m == 1:
        return 1
    else:
        return recurrence(m-1, k) + k * recurrence(m-2, k) 
# See a bottom up solution.

In [4]:
m = 36
k = 3
print(recurrence(m, k))

3048504677680


## GC-Content

99.9% of the 3.2 billion base pairs in a human genome are common to almost all other humans (barring those with mahor genetic defects.) For any species, the average case genome is very similar.

For DNA, this means that the frequencey of bases will be similar for memebers of the same species. 

For a molecule of DNA that is from an unknown species, we can analysis the frequencies of bases in the molecule and compare them to a database of DNA to attempt to determine the species. 

Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule - this we analysis its **GC-Content**. The GC-content of a DNA string is the percentage of symbols in the string that are either 'C' or 'G'.



### FASTA format

The FASTA format is a means to label strings (DNA) In this format, we introduce the label of the string using a '>' followed by the label. Then the next line is the string itself. For example,

\>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG

\>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC

\>Rosalind_0808CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT



### Problem:

Give a list of FASTA formatted DNA, compute the string with largest GC-content.

In [31]:
# Get input from file, store content in list of pairs: label, string
file = open('question_inputs/computeGC.txt', 'r')

# preprocessing
dna_list = file.read().replace('\n', '').split('>')
dna_list.pop(0)

# For each pair, compute GC-content of string, store label and freq.
largest_pair = [None, -1.0]

i = 0
while (i < len(dna_list)):
    label = dna_list[i]
    seq = dna_list[i+1]
    # Compute GC-content
    count = 0
    for char in seq:
        if char == 'G' or char == 'C':
            count += 1
    if largest_pair[1] < count/len(seq):
        largest_pair = [label, count/len(seq)]
    
    i += 2
    
print(largest_pair)

['Rosalind_6489', 0.5282583621683967]


## Point Mutations

A **mutation** is an error that occurs during the creation or copying of a nucleci acid (in particular DNA.) Mutations come in forms, the simplest and most common is a **point mutation** which replaces one base with another at a single nucleotide. For DNA, this also changes the complementary base.

For example, the pair G-C mutations to A-T.

The **Hamming distance** between two strings of equal length is the number of points in which they differ. 

DNA from two species with a common ancestor are expected to have the similar strings (i.e. small Hamming difference.)

### Problem: 

Given two strings of the same length, compute the Hamming distance between the two.

In [4]:
s = "GAGCCTACTAACGGGAT"
t = "CATCGTAATGACGGCCT"

s = "CACATAACAGCCCTGCAACCACCTAGGCCTGGAGCTACTGCTGCCAGCACTCCTGCAGTGAACTTGAAATACTTCGATGCATAATGAACTTTGTCTATTTCGGGCGCACGCAACACCGACAGCTACGATCTTCATCGAAGTTTTAGCGCCAAGATAGTTACTGCTTGTAGCCAGGTGGCGACGCTTAATAGCACGAGTGACCCTGAGGACAAATGTTGCCTCAGGTCATCAAGACGCGGCGGCTATTACTCAGTCAAACCGGCATGGGTCCCACGACCTGATACATCTCCGACTGGATGATATTATCAGATTAGCCGCAGCGTATCAAGGAACTATTCAGTAGTACTGCCTCGGGGGGTATGAACGCACGAGGCATTTAATAGTTAATGCTAACCACACTTCCAAATCATATGAAGTGGGACAGTTAGCAATGATGGCACTGGAAGAGTTCACAGTTCCTCTAAAAAGAACTAGGGTCCGGACATGGACGCTTTCGGTCGGATTTACTGCGAGTTCAAATTGGATGAAGTGGCCCACCGAATAGGTTCGAGCAACTTACTCGCAGCTATAAGGGAGTGTTATACATGTATATTTCCCCTTTTAAGCCCGTACGTGGCTCCAGATATAGACTCCTTGCAGGAAGGTGTGCACCCCTCGCTGTGCACGTGGCCGTAGATCTCGGCCCGTCCATGAGGCGATCAGTGCCAGGTGCTTTAGCCTTGTGCTGTCGGCACGTCTAGCTCCGATATGGGTCCCCTTGCCTTTTACTCACCGAAACGCGGTCGGGGGGGGTTGGTTCGGTGCCAAGGGTAAAGGGTAGCTCGAGATGTAGCGCCGTAAAACAAACGAATGGGTAATGGTATCTCCTGCAGTCTAGGTCCTATGAGCCTCTAGTGAGTATCTCGTCTCATCCACGTGCGTCAGCATAACTT"
t = "CTATTAACTTATTGGCAACCAAGTACGGCTAGTGCTACAGAAACCACTAATCCTAAACTGCACCTTAAATATTTGCATGTACAACAAGTGTAGTACACATGATGCACAAGCGAGCGCGGGAGTTCGGAGCACCGATTATTATTCAGCGCGAGGAAAACCTCTTCGTTCGGTGAAGTGGCGGGGTTTTATGCCAGGTAACGGAATGACGGCAAATATACCGCCCGGTGTCCTACTTTATGCTACAACTACAATTTCGTACCGGCATGGGTCTCTCCGGCTAAGCCATTGCCCATTGGAACGTATAGCAGCGGAAGCCATATCGTATCAAAATTCTAGCCTGGGAGGCCGCCTCAGGCACAATGGACACCCGAGGTATCTAGTAGTTGGTCCTATGCACACTTCAGTACGCTTTGACACACGACAGATAGTCGAGATGGCCGAGATACGCCGCAGCTTTGCTTTAATGAGAGCTAGGTCCCGTGTTCGGATACGTTAGGCCATATTAGGAACACGTCGCACATGGATCCCCTGGAGAACCATATCGGTGTTAGCCTCTTGGCGGCCGCTGCAGAGCACTGTTGTACGTGAACATGTTCTATACCGGGCTCGTACATTACTGCGAGTCTATCTTTGGTGTAGTAGATTAGGCTTACCTAGCATTACACTCTCACCAAGATAGCTGCTACCCCAATAGTGGAAACGGACAACGGGTTTCAGTCGGGTGCCGACCGCACGGATGACATACATACGTCACCGCTCGAGGTTGACGAGTCGTAGCAAGCTCGGTAGGCCTTTTTGACGACGGAAGTGATGAGTACAGCTGTTCAATTAGAAGCTGCCCACTAAGGTGGATGTAAATCTAGCCGAGTCAGTGCAAGCCCGGCGAGGCACCGGCGAGTCTCTCGTCTAAAACAGGTGCGGCGGAGCAGCAT"


ham_dst = 0
for i in range(len(s)):
    if s[i] != t[i]:
        ham_dst += 1
        
print(ham_dst)

446


## Mendelian Inheritance

In 1865, Gregor Mendel described the contemporary hereditary **blending inheritance** model, which states that the inheritance must be a blend of its partent's traits. This model however is not observed in people.

Mendel viewed the traits of pea plants as discrete blocks called **factors**, each of which has a distinct form called **alleles**.

Mendel's **first law (law of segregation)** states that for a given factor, every organism possesses a pair of alleles for a given factor. If an individual's two allels for a given factor are the same, then it is **homozygous** for the factor; if they differ, then the individual is **heterozygous**. Thus for an offspring with two parents will receive one allele from each parent; which one is randomly decided. 

Further, any factor corresponds to only two possible alleles - the **dominant** and **recessive** alleles. By possessing only one copy of the dominate allele, the organism will **display the trait** of the dominant allele. That is, the only way that an organism can display a trait encoded by a recessive allele is if the individual is homozygous recessive for that factor.

We encode the dominant allele of a factor using captial letters, and the recessive allele by lower case letters. As an organism may possess the recessive allele but not display it, we define an organism's **genotype** as its precise genetic makeup, and its **phenotype** as its physical manifestation of its underlying traits.





### Problem

Given populations K,M,N of homozygous dominant, hetrozygois, and homozygous recessive factors respectively, compute the probability of any two mating organisms will produce an individual possessing a dominant allele. (An organism cannot mate with itself.) Assume at least two memebers of each population.

### Solution

From k,m, and n numbers in each population, we consider all events (i.e. ways to mate) along with the probability that the resulting individual will possess the dominant allele:

Parents:ways that offspring can have the dominante allele.

KK:4/4  
KM:4/4  
KN:4/4  

MK:4/4  
MM:3/4  
MN:2/4  

NK:4/4  
NM:2/4  
NN:0/4  

In [20]:
import numpy as np

k=29
m=19
n=17
sum_terms = k+m+n
total_outcomes = sum_terms*(sum_terms-1)

dominante_allele_outcomes = [
    k*(k-1)*1,
    k*m*1,
    k*n*1,
    
    m*k*1,
    m*(m-1)*0.75,
    m*n*0.5,
    
    n*k*1,
    n*m*0.5,
    n*(n-1)*0
    
]

print(np.sum(dominante_allele_outcomes) / total_outcomes)

0.8364182692307692


## Translating RNA into Protein

**Proteins** are chains of small molecules called **amino acids** - 20 amnio acids commonly appear in all species. Similar to nucleic acids, the **primary structure of a protein** is the order of the amnio acids.

Proteins power all practical functions carried out by a cell; thus the key to understanding life lies in understanding the relationship between a chain of amino acids and the function of the protein that this chain will eventually construct. 

The translation of an RNA molecule (mRNA) into amino acids are how amino acids are created. Somehow, the 4 RNA bases must be translated into a language of **20 amino acids**; for every possible amino acid, we translate 3 nucleobase strings (called **codons**) into amino acids - this implies there are $4^3=64$ possible codons.  
Two special types of codons are the **start codon (AUG)** and three **stop codons (UAA, UAG, UGA)**, these do not code for amnio acid, only indicate the translation to end.

#TODO update

### Problem 
Given an input RNA string s corresponding to a strand of mRNA, output the protein string encoded by s. 

In [34]:
# Create dict where k-v pair corresponds to RNA condon-amino acids for the purpose of protein creation.
dict = {}
with open('question_inputs/rna_codon_table.txt', 'r') as f:
    for line in f:
        k,v = line.split()
        dict[k] = v
        
# input string
s = 'AUGUGCCAUAUUGUUACCAGAUAUUCUGCACACCGCCCGGUACAUGCGGAACUUCAAUUUGCAAGCUCACCUACGCGCUCUUCCGACGGGGGGUACAAAAGCUUGAAGGCCAUGGGUACUACCGUUCUUCUCCGGUCGUUGCGGUCCAUAUCUUGCCGUACUUAUUAUAUUAUACCCUGCCGCUCGCCGAGGCUCGAGCACUUUCGCCUCAGCUGCAAGAUCUCGCUGUUCCACGCGCGUCAGACAAUGCUUAAGGUUACAGUAAUGACGUACCCCGCAAAAAUUGCGGCGUGCGCAAGGAGGGCUAAACUACCCUUCAUACUGUGCGGUGGUAUGGUGCCUAGCUGUUCCCGUGUCAUGAAAUACGGGUGUAUCCGCUACCGUGAGACGGGAAAACAACUCAAGUGCGUUAAGACGGCCCUACGUCCUACUAAUGAGAAUAGUUCUAUUGCCCGUGCGAAGUGCAAUGUCAAUGUGAAAGGUUGGCUACGGUAUCAGUUAUGCGCCGAUCAAGGUAUGGCGCAACUACGCGCCUGCCCUAUGGGUCGAAGGGGUCGAUCGGACAUGAGGUUGAUAGGACCUCGCGUGCUACUACCAUCGGGCGGCUGCAGCAAAAGUGGCUGGUACAAUCUUAGUCUAACUGACAGACCGCUGCCGGUCUCAUGUUUACCCAUGGGCUCGUUCAAGUCUGCAUCGUCGAAUAUUACAGGGGCGAGCCCCUGUAUGACUAUAAUGCCAAUUAUCAAAACACAGAAUUUUGCUCAUUGUAUGCCAAAAUUGACCCCGCGGAUAACUCAGCAGUUGAAACGACUUCUGCUGCCUCCGGUGUUUGCUCCCGCUAGGCCACGAGGCCAUUCACACGUGUAUCAGACCGGGAUUGCUUUCUGCACACCGUGGACACCUGAGCCCUACCAAUCGCGAUGUUGCCAUGGGUUGCUUGCAGUUCUAUUGAUAAGGCGCACAUUUACUUCCUUGCAACCGAGCCGCUUCCGUAAUAACAUAGCCGAGGUAUCACUCUCGGAGAGCAUUUGGGCAUAUUGCUACCACAUCGAGCCGUACUCACUUGAGCUAUACCCUGAAAUGGGUCAUGCAAGUACCUCCGUAAGUCGGCUCCGUCCCCGACAAGUAGCGGUGCGAGUAAGAGCAAACAGACGCAUAAGUCUCAGAGUCCAAACGCAAUGUGGGAAGUUGCCCACUGGUUUGUGGGUACCCAUCGGUAAAUCUAACGGAGACACUCUGCAACCGUGGGUUGAAGUAGAUAAAAAAUUCCUGGGUGAACGUACGGACACUGUAGAGACUUAUUUAGAAAUUUGCAUCGUCAGGCCGGAGGGCUUUUCUAGAUCUGUAGAGUUACAGAGAUGCGAUUUCAGGUUCUGGACGCAAGUUUCCACUCAAGUCACAAAGCCACCAGUGAAAGCUGCCGUAUGCAUCUACAAUGUCUCCCGGCGUUCAUACUUAUCGGCGCCUAGCAACCCGUCCCUCCCCCUUUUUAGCCAACGAUGCUGUAACUGGGGUUUCUCCACACACUCAGAAAUGCCGGGGGACGAUCUUUCUCCUUUGCUGUCGGUAAAACAUUUACACAUGGGGCUACGGGUACUGGAAGCAGGGUACCAAGACUCCAAACGUCACUAUUAUACUACAUGCCGGGAGUCUUCACUUUAUUUGCGAAUUGCGCUUGCGGUGGUAGCGCGGCGCAAGGUCCGGCAUCACAGCCAAGACGAUACUAAGAGUCUACCUUGGAACAACAGUCGAUAUAAACAUCUCUAUGUCGUAAGUACUACAAUUGUUGAGUACAAGCCGGGCGCCUUUGUAUCGAGACUCCGGCACAACCGCGACGAAACUCCAUACUCCAAGAACUUGCUCUGUCUAAGCAAUAGUUCAUCAUCGGAUGGAUUGACCCUACAGGAGAGCCACGACAUUAUAAAACUGAUCCACUUGACGCAGGCUCCUCCCUCUUGCACUAGUCACCUACUAAGAGACUUUCCAGGGACACGGGCGAUAGAAGCACCAUGGAUGCAUUUCAAUAAGAUCAAGACCAACAGGCUGAGGCUCGCGGCAGGCAUCAUUUCCCUUGGCGGCUACGCACCUGCAAGCGCGGCGCUUACACGUCCGAUUCAUAUUAUUUCGGGCGUGACGCGUUGGCUGCCCCCCCUAACAUAUCCAACCACAAGUGGAACAUCCAAUGUGGCACUCCCACACGCUGCGUGGGUGGGAUGUGAUACUUCAGACCUACAUUUUGUUAAACAUCCAACUCUACGUCGAAUUACGGUUAAAGGGCUCACAAGAGUGACUUCACUAUUAAGGUUUGUAAAUAAUUCUCCCGUGGAACUGGAAGACGUUACCUUUGUGUUUACGCGUCCAUUCGAAGUCCAGCGACUCUUGCUAAGCGCACGUCAAUGCGGCCCGUCACGUAAAUCUGACAAGCCGUACUCACAGUUCAAUGAGCUACCAGUGGAAUACAACAUAUCCCGGAGGAGCACACGCCCUGAUCACUCCAAUCCUUUACGGGCACAGCGGGAACGUCAAGUACUUGAACCGGACAUUGAUCUUUGUAGUCAUGGCAUGUUGUUAUUUCGUCAUCCGCCAGGUCGUUUAUGCCUGAGCAUUCGAUCGACCAAUCCGCGCGCUCCCAGGAAGUCAGGAAAAGCCCACGAUGUCGAUCAUCGUCCUCACGGACUCAGAUGCCGCGUGCUGAGAUUCGAAGCGAGUUCCGCUCCAGGUAGUGUGUGGUCCGCCUUCUGCAAUAAUCUGCCCUUAGGGUCUUUGCUCGCAUGCCCGGUGGGUGCUAUGUUAAGAAGGGCGGUUCGAGAAGCGAAGCUAACGGUGACAGGUGUUACUUUGACAAACGUGAUAGCAAACGAGACGUGGGCCUCUGGACCCCCGCUUGGUAGGCGACGUCUCUCCGGCAUUGCGCUUACGGGUUCCCUUCACUGCCCUCGUGCAGUCUCACCCUAUUGCCUUCUGAGGUCUCUCAUAUGUUGUGAAGUACUAGGUUAUGGGGCCGGCUGCUUACUUCGUUUGAUCAGCAGCCAGUCGGUGUUGAGAGUUAUGCUAGAGGCGCGACGCACCACACUCGGGAUCGGCAGUAUUAUAGCAACAAAAUUCACGUGCUUGUUAACCCAAGUGUGGCAAGAAGCAGCCAACCACAAGUGCCACGGCGCUAGGAUUAGGAUGUCUUUACAGUCUUUUGGAACCGGUCGUACGAUAGCGAGUAAUAACGCCCCCCGAAGAUCGUCGGUUGAGUUGGGUCACCAGCCAUGUCAUCUCCGGCUCGCUAAAUUUGCUGUAAAAGAUCGUCCGAUAGCCAUAUAUAGGCCAAUCAACGCCAAUGAAUGGGGCUACGGUAUUGCAACAAAGAAUACUAGUUUCUACCAUCGUAGGAUACUGGCGGCCACCGGGUCCUUCAAUACUAAAAGUUCGGGGAAAGGGGGGGUUAAGUCUACUGACCUAGAACUAUAUCGGAGCUCGAAUAAGAUGUCGUCGGUCACGCGGCGCAGCAUGCUAUCUACGUUAGCCCUAUCGAAAUCGCUGCGAAGGUCGCGUCGAAUCGUUCAGCUCGCGCCUCGUCGACCCCGGCAUUGGAACUUUUCCGAUGUAUAUUCGCCGCUGACAAUUGCGGGCGCGUAUCAUAGCGACAAGCAAACUCUAGCACUGACUUCUCGGUCGGCAUACUUGGUAGGUAGAUCAGUAACGCUCCUUACUAAACGUAAAGAUUAUGAUUCACUAUUAGAGUUUGCGUUUCUAUUUGAUACUGCGCUUUCCGCUCUAUCAUCUAUCCCCCUCUCCGUUUGGGUUCGUCCAAAAAACCUCACAAGAUUCUCUAUAUUGAAUAUCUCAUAUCUGCUUCCGUGUGUGAAUGACGUCACAGCAGCACUGUCCCGCACAACAAAUUUCGGUUGUAUCCUCAAAUCGGCGAUCCGCUAUUACCGGCCGGAUAAACACAGCCCUGUUAAUGUAGCCGUUUAUCCAGACGCAGAGCGGAUCAGAUAUCGAGUAAACGCGGAACGUACCCUAGUAACCCUUCGUGUAGGUGCUUUAUGCCUUUACCCUACUGGCUUUGUCGAUAAUGCAGCUGAUGUUGCUCUCUAUGCUUACGGGGUCUACGUCGGAGGCUUAGAACCUCGUCCUACACUUUAUAAAAUAGUAUCCGGGCAACAAGCUGCUAUCCAGAACAGGGCCGAGAUACAGACAACCGCUAGGUACGUAACAGUACCAAGCGACGUUAACGAUCUUGGAGAAGCUACACCGGCCUUAACUAAGGUAAAUAUCUCAGUGAUAUCUGUUCCACUGUCACGUUUGUCGGGCUCCACCGGAAUGAAAGCCCCUACAUUUGAAGCACCUACAACUGGAUCCGAAAGCGAGUGCUCUAUACGUUGCCCUUUGGUUAUGAAACAUGGGCCAUGGCUUGAGAAUGGUUUCCGUAGUGUCGCCUCUAGGGCCGUCGCAAAAUCCAUUACGUUUGAAGAGAUCGGGUACAAGGCGAGGACCAAAGCCAAUACGUUCUGUCGGCCAAGCUCCUGUAGACUCAAUUUUCGUUCGCUGCUACCUUGGGGGUUUAGAGAACCGAAUUCGGUCAAUGGGUCGGGUGUAGCAGCGCGAGUGCUAAUAUGUAGUCUGAAGCCGCCGCUCAAUCGGGAAGAAGAAAAGAUUCGGGCAGUAAGGGGACUGUUAUGGUCGGUAUAUGCCAAAAAGGCCCGAGCAUGCCGUGUCGGGCAUUCAGGCAGCAAUCUGCCAAUGUCGGACCCUGACCCUAGAAGGUUGCUAGUCCCGGAUAGUCGCCGGGAUUAUUGUAAUUACGAAGAUUCAAAAGGCAUAAUCCCGACCACGAAUUAUUGCGCCCGCAGAGUACCACCCUACCGCAAAACAGGGCCAAUAGCGCGUGGAAGCCAACCGGCAGUGAAAGAUUCCCCUUCAAUGCGACCAGCCGCAACUCGCUCAUUCAAAAGACACGCCCGUGCCAAACAGCACGCUAUGGUAGUACUCUCUAAACUAUUACUUGAUGGUGGGCACUGCUGCACGAUUCCCUUCAGGUCGUAUACAAAAUUGUCGAUGGCUGUUAGGCCGGGACAAGUCCUUUCCAGGGCAAAGAUGUUGUUUAAGAUCCUAUCUUUUCAUGAGGUACUGACUACCCGGCUGCUGCACACUUUUGGUUUGCAUUCGUUUACAAUGGAAGCCUACACGGUGUCCCUUUAUUCUUAUGCGGCGCGUGUGUGCUCCCAUGACCCUCUAGUGGGCUUGUACUACGGCACUAAAGAUACCCGUAUCGGGCCUGUCCGUAACUUGCGGUUAAGUCUGAGGCCAAUACGAGAUAUCCUUUCGCGUGGACACCUGGUCCCCACAGUUCCCGUCACAAGCCUGGGGAGGAAUAAAACCGGGGUAAAUAUUCGCACGUUUUCAUCGGCGCCCUUGUAUGACGGAAUUGAGUUUGGCCCACAUCUCAGCGCCCUCGCUUGUUCGUCAAGCCACUACGUUAGACCCGUGGGGGAUAUCCAUCGUAUAAAUGCUCAAUAUUUGAUUCAGCGUCCUCAGCGCCAGUUGCCGUUCAAUUCAGAAACCCCGGUAUCGCAGAACGAACUCCUCACCGUCAAUCUGUAUCAAUAUUUCACUUGCGAGCCCAUAGACCAUUGUCCGCGCACUUCCGAUACCUCGACGCUUGAGGACCUAAUGCCACUAGUGCCUCGACCUCUUCCGCUAUCGGCUACCAUGUACACGGACGCGCUCUAUACCCAUGUGUCCAAAACUCUACUAAGCUAUAGUUACGACCAACACAAACCUCCGACCAUAAAUGCGCUAAAGGGCGAGGGCGCGCCAAUAUGGGGGCCCCUGUCAAUCGAUAGGGUUACUCGACUGGCGUCAACUGUGGAGAUUCUUAGAUUAAGUGACUAUGAAUCAACCUUUUACCCGCCUACGCCGGACGGCGAUUCAUACUUGACGAAGUUAUACAAGUCGAUGACACCCGCCCAGUGCGUGAGACUCAUGCUUGAGCCAAAGCGUACCGAAAAAACUCGGGGGGGGAAUUCCGGUAAGUUGCAGAGCGAUGCACAAUCAUGUGCGGACCUUCAAAGGACUCUUUCGACCAAAUUAAGAUGGCGCAGCGAGGCGGCAGAGGAUGCACUUACUAGUCCUGCGACUGACGAUAUCGUUUUCUGCCCAGCCCAUCACCUUCGAAUACAUCCCUUCUCCCGCUAUAUGCAGUCCCUUACCUCUCCGAAGGCUACCAGAGAACUCAAGAUGGGGAUACUACCAAUACCUCUCCCGCCUUACUCCCGCCACUGUGCGUCUUACAUAUCGAUUUCUAUAAUAGGGCAUGGGAAAUCGCUAUGCCCAUGCGGACUUGCCACCGUGGCCUACCCUCACCGUGACGAUAUCACUAAUCAAGUCUCAAGCUCAGUUAAAAACUUGGUUUUACAAUCAAUGUGCCCUCGUAUUCUAAUUCAGCACCUUCUGUCAAUUGUUCGUGUCCUUGUGGUUAAUAGCACUUUAUGUCAUUGGAUAGUACGAUAUGUCUACCCCAAAUGCGACGAAUAUGUGUCAGGUGACAACAUCGUGGCUUGCCAAUCCCGACUCCUCCUGCGGGUCUACAGCGACGAGCUCACAGCUUCCCUAGCAAAAAUGAGUUCCCGCUUUCAAACUGAUUUGCUGGGCCGGGUGAGAGGGAGACUGCCGGCACCAGAAUUAACUGUACGGGCUGCGGGGAGUCCUAACGAAGAAUCAACCAAACUUACCAGCGAGCACCUUGUUUCCCCCUUGCUAUCAUGCCCGCAAGAAUGUUUUAACAGGCAGAACAUACAUGUACUCAUAUCCUUUAGCCAGGCGUAUGCUGACUCAGGAGUCGUCCAAUGCCAGAAUGGUGAUAGUUUACAUAUCUUCCACUCAAACAGCCGCAGAAUCGUCCUGGGCGUUAUGUCAUUCAUGCUAUCGGCAAAGGUUGGGCACCGGAAUUAUGGUGGAAGUGAUGACAAUGCAGACAGCCACACACAUCGCGGUAGCCGUGGGGAUUUAUCGGCUGACUUCUUAGUUAAGUCGCGCGGUCGUGUAUGGAGAAGUGUAUGGUAUCUCCUGAAGAAUGCAAUCUCCGGUUUAAGAUGUCCCUCCUGGAUAUGGAGUAAUACCAUCGCUUAUCAGGCGAGUAGGACGAUAAUGUCAAUUGGGCGUCUCCGACAGAUUAACCAGGGUCAGAUCUCUCCACGUGAGAGUGGGGGUCAAACGACCAGUGUGCAGGGCGUUAAAAAUGAAACAGGCUCCAUAUGCUCCAAAGGAUUCUUUAUCGUAUGGUCGAGUGUUGAGGCUCCAACGUUCCACGAUGAUAUGCAAGCUAGGUCUUCUAUUGGGCAAGCAAUUAUACGCUGUGCCAGUGAGCUGGAUGCGCGCGUACGAGAUGUUGAAACGGUGGGAUCCGGGGCAGACGGGCUCCCAUCGGAGAGCUCUUCAACAUGUUCAAGCUUUCGGAAGCUGAUGGUUGGUACUCCGCCACCCGCGGUGCACGAGACCAAAGUAAGGAAAACCUCUGCGAUGCGUGCCAGGACAGAUUGCUUAGCACAGCGCGGGGAGCCAGUGAAGGUCCUGGUAUCGUCUCCAUUAGGUAGAAACUGGACUGCUCUAAUAUCUAGAGGGCGCUAUGAGAGGCAUCCAACUGAGCCUACGAAAUCUAGCCAGCUUGAUGAGACGACGAGAGCUAUGCGACAAGAGAUACGCAAGCCCAGUUCCGUUAGUUGCUAUCAAUGUUCGAAAGAGGGACACAAGCAUAGAGCGUGCGUUUCACGCCUGGCUUGGUUUAUAAAUUGGGUUCACAACAAGUACACGAUGGCACCUAGCCUCUUGCAUUGCGGAGGACGUGGCGUUAGCCGCGUCGCACUAAACACUUUUAAAAUCGAAUGUUUCUUAAGCCCAGUACUGCGAGCACGUUAUGCCAGUGCAGAACGUUCCCUACAGCAUGAUUUGAGAGUCCUCAGUACUCCUGUCCUGUCAAAUUUGCACGGGGACACCGAAGAAAGUUGCGGGAUUUACAGCGUGGCGUUACAGGGUUGUCUUCUCUACCACUCACCCAAGAGACAGUAUUCCUAUCCGCAAUCACCUACGCGUGUAGGCAGAAACCUCAGCAAACGCGUGUGCAAGGCGCAAGUUCGUUCCGCACGAUGGAUUGCCUUGCAAACGAAUAAAUACUUCUGCGAUGACGGGAGCCCCUCAGGCCUUGUCUUCACGGCACGAGCAGCCUUCUUUACCGAGGCGUUUCCCCAUCUAGCCCUCAUCCAGCUUAAGAAUACAUUGCUUAAUGUUGCACAUAACAGCGCGGUACCCCGAGAUGAUUCCGACUGUCAUCUCUCGGAAUUGUCGUACCCAACGGCCCGGAUGUCGUUGCCCUCAAGAAGGGAGCCGCAAGACAAGGUCAGUUGGUUAGCAGAUUGGAAGACACCUCGCAUUCAAACACCUUAUCGUCAACUCGUUACGUCUAGUUACCAUAAAUUACCCUCCGUUGAGUCGGCAAUAAGUCUCGACUGUGCCCAUUCUCUGCUUUAUGUGGGUCUUAAGGUUGCGGUCGAUUCGUCGGGAUGGUGCGUAUCUUAUUGGCACAGACACCCUGACACGCCAGGAACACGAACUACAGAUGCUGCUAGGCUAGGGAACACUAUGAGUGACACAACCAACGCGAAAGUUUGGCAUCCGCGAAAAGGGCUUUGCUCUACGAACCCACUUAUUGUAGGCCCGACAACGGUGCUCCCUUCGUCCGUCUCUAGAUCCUCGACUCGUUUGCAUUGGGAGAAGAUGUUACUGAACUGGCAAUCCCGUGCAGGGGAUUGUCGGACGACAAACGUUGGUCCAUGUAGAUGCGCCGUUAACAAGGGACGUAUCGGGAUAGUUCUGCGGACAUGCCACGAAUCCGCCGAAGCACAUUCUGCUCAACGAAUUACUACAAGAUCCUCCUCUCGGCGUCCGCUAGGUCUCACCGUUAACUACCACCUUAGCAAAAGCAGCCUGGCCUGGCAGUAUGUGGCAGGGGAAUAUAUCUACUUUACCGAGGGUUUACCCUUGCAUAGAAAUCGGAAUCAUUGCUAUGCUAAAAUGAGCGUGGAGUUUAUCACGAUGUUGACUGAGUGCACCAGUGGCGAGAACGGCAAGAUCUCUACUCGUGCAGUCUUCCGGACAGAGUAUUCUUACUCAUCUAUUGGCUCGAACGGUAAGAGUGAAGCUAGCGGUACCAGGUGUGGGGCUUGUCGAGCCAAGAGGGCCGGUCUCAAUGCCUCAAACGUGACUGAUUGCCUAGGUCCUUGGGCAGCGGCUUUAUCGAAUGUGGUGCGAUUUGGCGUGGGACUGAAAGUUUUAUUGCUGUCUGCAUGGGGCGUCCACGCGGCUAAGUCACAGGCGUGUCCUUAUCUGAGACAGGCUGGAGAGACUAACUCGGCCAUAUGGGUUGAGGAGGUCGUUGUGAGGCCGAAUGAAAGUCAAAUCGUCGGUGGUUGCUCGAACAUUUGGAGAGAGGUAAGUUUGAAAAGUCUUCGCAUCUACAGGUACAAGUCUUUGCCAAACACCGAUCGGAGACUAUCCGAUCGCAUGAACGCCGAUACUUACACUUCCACGCCAACCCUUAAAUAUUCGGGGAGUACCCUUGCACGAUCAAUCCAAGGUUGUGUGAAGAAAAUGCAGUCGACUACCCAACUCCGUCUCCGCGAUCCGACUUUGCACAGUUACCUGCUGCAGCUUGAACUAGUUCUUGCGACUCCUCUCCCGCGCAUGUAA'
output = ''
i = 0
while i < len(s):
    codon = s[i:i+3]
    amino_acid = dict[codon]
    if amino_acid == 'Stop':
        break
    else:
        output += amino_acid
    i += 3
print(output)

MCHIVTRYSAHRPVHAELQFASSPTRSSDGGYKSLKAMGTTVLLRSLRSISCRTYYIIPCRSPRLEHFRLSCKISLFHARQTMLKVTVMTYPAKIAACARRAKLPFILCGGMVPSCSRVMKYGCIRYRETGKQLKCVKTALRPTNENSSIARAKCNVNVKGWLRYQLCADQGMAQLRACPMGRRGRSDMRLIGPRVLLPSGGCSKSGWYNLSLTDRPLPVSCLPMGSFKSASSNITGASPCMTIMPIIKTQNFAHCMPKLTPRITQQLKRLLLPPVFAPARPRGHSHVYQTGIAFCTPWTPEPYQSRCCHGLLAVLLIRRTFTSLQPSRFRNNIAEVSLSESIWAYCYHIEPYSLELYPEMGHASTSVSRLRPRQVAVRVRANRRISLRVQTQCGKLPTGLWVPIGKSNGDTLQPWVEVDKKFLGERTDTVETYLEICIVRPEGFSRSVELQRCDFRFWTQVSTQVTKPPVKAAVCIYNVSRRSYLSAPSNPSLPLFSQRCCNWGFSTHSEMPGDDLSPLLSVKHLHMGLRVLEAGYQDSKRHYYTTCRESSLYLRIALAVVARRKVRHHSQDDTKSLPWNNSRYKHLYVVSTTIVEYKPGAFVSRLRHNRDETPYSKNLLCLSNSSSSDGLTLQESHDIIKLIHLTQAPPSCTSHLLRDFPGTRAIEAPWMHFNKIKTNRLRLAAGIISLGGYAPASAALTRPIHIISGVTRWLPPLTYPTTSGTSNVALPHAAWVGCDTSDLHFVKHPTLRRITVKGLTRVTSLLRFVNNSPVELEDVTFVFTRPFEVQRLLLSARQCGPSRKSDKPYSQFNELPVEYNISRRSTRPDHSNPLRAQRERQVLEPDIDLCSHGMLLFRHPPGRLCLSIRSTNPRAPRKSGKAHDVDHRPHGLRCRVLRFEASSAPGSVWSAFCNNLPLGSLLACPVGAMLRRAVREAKLTVTGVTLTNVIANETWASGPPLGRRRLSGIALTGSLHCPRAVSPYCLLRSLICCEVLGYG

## Finding a Motif in DNA



If the same interval of DNA in the genomes of two different organisms is found together, it is suggestive that the interval has the same function in both organisms. 

A **motif** is such a commonly shared interval of DNA. A common task is to search for an organism's genome for a known motif. 

Genomes are riddled with intervals of DNA that occur multiple times (perhaps with only slight modifications) called **repeats**. The occur too often to be simply random chance. Thus the language of DNA is quite powerful.


### Problem

Given two string s and t, report the starting index of all occurances of t in s, using 1-based indexing. 

In [37]:
s = "ATGGCACGAGGCACGAGGCACGAGGGCGGCACGATTCAGGCACGAAGGCACGACTGGCACGAGGCCGGGCACGAGGCACGAAGGCACGATGGCACGAAGGCACGACGGCACGAAAGAGGCACGAAGGCACGAGGCACGAGGCACGACCGAGGCACGAATCGGCACGATCCCTTTTAGGCACGATGAGCTCGGCACGATCCGCCAAGGCACGACTGGCACGAGGCACGAGGACGGCACGAGGCACGATTCCGGCACGATACCTGGGCACGAGGGCACGAGGTACACGACGGCACGAAACGGCACGACTACCGGCACGAAGGCACGACTACGGGCACGAGGCACGATGGTGGCACGACTGCGGCACGACGGCGGCACGACGGCACGAGGCACGAAGGCACGAGGCACGAGGCACGAGGCACGAGCCTTTCTAAGGCACGAGGCACGAAGCGTCAATTAGGCACGACAGCGCGGCACGAGGGCACGATGGCACGAGGCACGAGGGCACGAGGCACGAGGCCCACGGCACGAAATTGGGCACGATCGGCACGAGAGCAGGGCACGAGTCGAGGCACGATCGGGCACGAGCAGGCACGAGACGGCACGATTCTGGCACGAGTGGCACGAAGGCACGATCTGGGCACGAGGCACGATAAGGCACGATGGCACGAAGGGGCACGACGGGCACGAGGCACGAGGCACGAGGCACGAGATGGCACGAGGGCACGAAAGGCACGATCCCGTCACGGCACGACAGGCACGATGGCACGAGGCACGATCTGAGTGGCACGATCGGCGGCACGAGGGCACGAGGCACGAGGCACGACGGCACGAGGCACGACTGATGTGGGCACGAGGCACGATAGTCGGCACGACGGCACGAGGCACGAGGGGGCACGAGGCACGAGGCACGAGGCACGACGGCACGAGGCACGAGGCACGAGCGGCACGAACGGCACGAGGCACGA"
t = "GGCACGAGG"

index_loc = []

for i in range(len(s) - len(t)):
    if s[i:i+len(t)] == t:
        index_loc.append(i+1)
        
print(index_loc)

[3, 10, 17, 56, 68, 126, 133, 215, 222, 233, 264, 272, 331, 379, 394, 401, 408, 432, 470, 486, 493, 501, 508, 637, 681, 688, 695, 712, 762, 795, 803, 810, 825, 847, 874, 881, 891, 898, 905, 920, 927, 952]


## Consensus and Profile

We can count the point mutations between two strings, but for multiple strings (of equal length) we can do more. For several homologous strands, we can find an average-case strand to represent the most likely common ancestor of the given strands. 

For each index of the the strings, by taking the largest number A,C,G, and T bases as a representative of all the strings forms the **concensus string**. The **profile** of the strings (all of length n) is the 4xn matrix where entry $1,j$ denotes the count of A in all strings at position j, $2,j$ the count of C in all string at position j, and so on.




### Problem 

Given a collection of DNA strings of equal length, all in FASTA format, output the concnsus string along with the resulting profile.

### Helper function: read FASTAf file.

In [41]:
def read_FASTA_file(file):
    # preprocessing
    dna_list = file.readlines()

    dict = {}

    curr_key = ''
    curr_string = ''

    for item in dna_list:
        # Line contains label
        if item[0] == '>':
            # store current_key with current_string
            if len(curr_string) != 0:
                dict[curr_key.replace('\n','').replace('>','')] = curr_string.replace('\n','')
                curr_key = ''
                curr_string = ''

            curr_key = item
        # line contains piece of string. 
        else:
            curr_string = curr_string + item

    # Add last pair to dict.
    dict[curr_key.replace('\n', '').replace('>','')] = curr_string.replace('\n', '')
    file.close()
    return dict

In [77]:
print(read_FASTA_file(open('question_inputs/consensus.txt', 'r')))

{'>Rosalind_1': 'ATCCAGCT', '>Rosalind_2': 'GGGCAACT', '>Rosalind_3': 'ATGGATCT', '>Rosalind_4': 'AAGCAACC', '>Rosalind_5': 'TTGGAACT', '>Rosalind_6': 'ATGCCATT', '>Rosalind_7': 'ATGGCACT'}


In [121]:
import numpy as np

# Get contents of FASTA file
dict = read_FASTA_file(open('question_inputs/consensus_input.txt', 'r'))

output_file = open('question_outputs/consensus_output.txt', 'w')

# create profile matrix
"""
row 0 -> A
row 1 -> C
row 2 -> G
row 3 -> T
"""
list_strings = list(dict.values())
P = np.zeros(shape=(4, len(list_strings[0])))
P = P.astype(int)

# Add values to profile matrix.
for string in list_strings:
    for i,char in enumerate(string):
        if char == 'A':
            P[0][i] += 1
        elif char == 'C':
            P[1][i] += 1
        elif char == 'G':
            P[2][i] += 1
        else:
            P[3][i] += 1

# Output results, print to file. 
# output must be formatted carefully!
ave_string = ""
for i in  range(len(P[0])):
    idx = np.argmax(P[:,i])
    if idx == 0:
        ave_string += "A"
    elif idx == 1:
        ave_string += "C"
    elif idx == 2:
        ave_string += "G"
    else:
        ave_string += "T"

def get_line(P,i):
    send = ""
    for j in range(len(P[i])):
        send += str(P[i][j])
        if j != len(P[i])-1:
            send += " "
    return send
        
output_file.write(ave_string +'\n')
output_file.write('A: ' + get_line(P, 0) +'\n')
output_file.write('C: ' + get_line(P, 1) +'\n')
output_file.write('G: ' + get_line(P, 2) +'\n')
output_file.write('T: ' + get_line(P, 3))

output_file.close()

## Mortal Fibonacci Rabbits

Recall the Fiboncacci sequence for the number of rabbits after n months from earlier. We expand on this sequence by specifying that rabbits will 'die out' after living for m months.

We assume that we start with one pair of rabbits; it takes a rabbit one month before being able to reproduce in which a single pair will produce a single pair of offsprings.

In [202]:
import math
import sys
np.set_printoptions(threshold=sys.maxsize)

n = 98
m = 19

#Note: Values of matrix will become large; make sure data type of matrix can handle it!
dp_matrix = np.full((n), -math.inf, dtype=object)
dp_matrix[0] = 1
dp_matrix[1] = 1

def helper(i):
    # Set value
    if dp_matrix[i] == -math.inf:
        # First m values are given by $F_n = F_{n-1} + F_{n-2}$
        if i < m:
            dp_matrix[i] = helper(i-1) + helper(i-2) 
        # The remainder is given by $F_n = F_{n-2} + F_{n-3} +  ... + F_{n-m-1}$
        else:
            dp_matrix[i] = 0
            for j in range(m-1):
                dp_matrix[i] += helper(i-2-j)
    return dp_matrix[i]

print((helper(n-1)))

134779554541299204933


## Overlap Graphs

**Overlap Graph**: Given a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph $O_k$ where each string is a vertex and a directed edge $(s,t)$ exists if there is a $k$ length suffix of $s$ that matches a $k$ length prefix of $t$. 

In [43]:

import networkx as nx

k = 3

# Get contents of FASTA file
dict = read_FASTA_file(open('question_inputs/graph_overlap_input.txt', 'r'))

digraph = nx.DiGraph()
digraph.add_nodes_from(dict.keys())

pairs = [(s,t) for s in dict.items() for t in dict.items() if s!=t]

#for each pair, if prefix matches suffix, create edge. 
# direct edge from suffix to prefix string. 
for s,t in pairs:
    if s[1][:k] == t[1][-k:]:
        digraph.add_edge(t[0], s[0])

for e in digraph.edges():
    print(e[0] +" "+e[1])

Rosalind_8614 Rosalind_1595
Rosalind_4448 Rosalind_0539
Rosalind_8520 Rosalind_0716
Rosalind_4044 Rosalind_7161
Rosalind_4044 Rosalind_8999
Rosalind_7430 Rosalind_6432
Rosalind_1009 Rosalind_2535
Rosalind_1009 Rosalind_1808
Rosalind_1009 Rosalind_4575
Rosalind_7285 Rosalind_6992
Rosalind_7285 Rosalind_4261
Rosalind_7285 Rosalind_5499
Rosalind_8041 Rosalind_6798
Rosalind_1885 Rosalind_2463
Rosalind_6992 Rosalind_1213
Rosalind_4261 Rosalind_6438
Rosalind_4261 Rosalind_0707
Rosalind_3229 Rosalind_6992
Rosalind_3229 Rosalind_4261
Rosalind_3229 Rosalind_5499
Rosalind_3349 Rosalind_7682
Rosalind_5934 Rosalind_0716
Rosalind_2898 Rosalind_4448
Rosalind_3479 Rosalind_7955
Rosalind_3479 Rosalind_9492
Rosalind_3479 Rosalind_4859
Rosalind_6715 Rosalind_1009
Rosalind_0602 Rosalind_6798
Rosalind_0117 Rosalind_2890
Rosalind_0247 Rosalind_3479
Rosalind_0247 Rosalind_1565
Rosalind_0247 Rosalind_7216
Rosalind_6766 Rosalind_5934
Rosalind_6766 Rosalind_2447
Rosalind_9413 Rosalind_2465
Rosalind_4914 Rosali

## Calculating Expected Offspring

Molecular biology research uses averages to predict the expected number of antibiotic-resistant pathogenic bacteria in a future outbreak, estimate the predicted number of locations in the genome taht will match a given motif, and to study the distribution of alleles throughout an evolving population.

### Problem

Given 6 integers that represent the number of couples in a population each possessing each genotype pairing for a given factor, return the expected number of offspring displaying the dominant phenotype in the next generation, assuming that every couple has two offspring.

In order, the six values correspond to the genotype:  
AA-AA  
AA-Aa  
AA-aa  
Aa-Aa  
Aa-aa  
aa-aa  


### Solution

For each population, the probabilty that one offsping has the dominate trait is given below. To compute the expected value we apply the formula, and multiple by two as there are two offsprings (assuming independence.)

AA-AA -> 1  
AA-Aa -> 1  
AA-aa -> 1  
Aa-Aa -> 0.75  
Aa-aa -> 0.5  
aa-aa -> 0  



In [51]:
import numpy as np

pop = np.array([19004, 16403, 19730, 19939, 18620, 16352])

prob = np.array([1, 1, 1, 0.75, 0.5, 0])

print(np.sum(pop*prob*2))

158802.5


## Finding a Shared Motfi

Recall a **motif** is a substring of nucleotides or amino acids, yet this motif may not be known in advance. 

A common motif to find is the longest common substring of a collection of genetic strings. A longer motif will likely indicate a greater shared function between the genetic strings.

### Problem
Given a collection of DNA strings in FASTA format, determine the longest common substring (LCS) of the collection. 

TODO see generalized suffix trees. 

In [65]:
# Get contents of FASTA file
dict = read_FASTA_file(open('question_inputs/lcs.txt', 'r'))

A = list(dict.values())[0]
B = list(dict.values())[1:]

# TODO