# Rosalind Problems - Bioinformatics Stronghold

## Counting DNA Nucleotides (nt)

The nucleus of any living cell consists of macromolecules called chromatin. One class of marcomolecules in the chormatin are nucleic acids. Nucleic acids are polygmers - repeating chains of small, similarly strucutred molecules called monomers (stands.)

The **nucleotide (nt)** is the unit of stand length and is composed of three parts: a sugar molecule, a negatively charged phosphate ion, and a compound called a nucleobase (base.) THe sugar of one nucleotide bonds to the the phosphate of the next nucleotide in the chain, forming a sugar-phosphate backbone.

Nucleotides of a specific type of nucleic acid always contains the same sugar and phosphate molecule; thus they only differ by their choice of base. Therefore a strand of nucleic acid can be differentiated from another based only on the order of its bases - this ordering defines a nucleic acid's **primary structure**. 

For a strand of **deoxyribose nucleic acid (DNA)**, the four necleobases are molecules called **adenine (A)**, **cytosine (C)**, **guanine (G)**, and **thymine (T)**.

### Problem: 

Given a string over the alphabet {A,C,G,T}, count the number of times each symbol occurs in the string.

In [5]:
input_string = "ATTCGTAGGCATCGCTCCTGCCCTGTGGTTACGTTCGCACCGGGTGTCAGGTCTGAAATTATGAGAATTTTACACTCCTAATGTTGGGCGACTGAACTGATTTTGAAGTCAGGCGGTAGTATACTGAGTCAGGGTCGGAGAGTGAGAGATGTATGAGCGTGGTTTGCTCCGCTAGCCCAGGCCGGCGAGTGCTGATGCGTAGCTGTATCTACAATTCGTCAGCTTTTGGCGCAGTAATGAATATCAGAGTGCCTAAGCAGCTCTTGCCCTACGGCTAATCGAATCTTCAGATTCAAAGGGGTACACTAGCTGTTGCCAAACCGTCGGAGACGCCACCTCACGAACTCACATCATATCAGTCGAAGATGGTACGTTAGAGCCCATGTGCCTATGGAGGCATGGTATATAGCCTGGCTGCGGTGTAATGAAGTCACCTGGCCAACCTGGGGCACTTGTCTCGTGACATAGTCTTATAGGGGTGCCGCAAGAAGGTCCCACCAAAAATCACGGATACGCGTTCGCCCGAAGGCCCTTGGAAACCTTCCAGCTCAAACTTGAGAGATACGTAGTTTTCCCACCCGTCACAAGAGGTGAGATTACAACAAAGTATGTCTGCTCGTCCCTGTTAGCTATAGCAGCTCGTATAACAACATTTCCCGACTGCGCGTCGCTGATCGAACAATAGGCTAGAGTTACAGAGCCTATCACAGGCCGAAGAGGACAAGCTTCGCTGGTACCTATAAAGGGAAAACCCCAGATTAATCCCGCCTTATCGTCCTAAGTAGTCGCGGTCAATATGGTACTATCCGACACCCTTTGATTGTGAGTCTTCACTTACCTATTCCGGATTTCCTCTCCCAGTAGGTTATAGTTCATGACGTCTTTTAGCACGGGATCACACGTGGCATGGATTAAATAGGATACACCTGGTGCCATGCAAAGTCCACGCTTAAGCCGAAGAACCCCGTAAAGCAC"

# Create dictonary for each character, scan through string, counting occurences.
dict = {'A':0, 'C':0, 'G':0, 'T':0}

for char in input_string:
    count = dict[char]
    dict[char] = count + 1
    
print(dict.values())

dict_values([247, 244, 242, 242])


## The Second Nucleic Acid (RNA)

Along side DNA in the chromatin a molecule with a different sugar can ribose known as **ribose nucleic acid (RNA)**. RNA differs from DNA as in it contains a base called **uracil** instread of thymine. 

The primary structure of DNA and RNA is similar because DNA serves as the blueprint for the creation of special kinds of RNA molecule called **messenger RNA (mRNA)**. mRNA is created during RNA transcription, during which a strand of DNA is used as a template for constrcuting a strand of RNA by compy nucleotides one at a time, where uracil is used in place of thymine.

### Problem: 

Given a DNA string t, convert t to RNA by replacing all thymine (T) with uracil (U).

In [6]:
t = "AAAACGTTGCCCTCCTCCCAGGTTCTGGTAAACGTGCACAGCGTCCCCTGCCTTTCATATCGTACTTCTAGGCCCTCGTAAGGAGTCCGGTATGTAATATCGATCACGATCGCCTATGAGGTGCGGGAATTGCCAAATCACTGGCCTAGTTGACTCAGTCACCCTCTTTTGCAAGGCCTTTAGCGAGGAAACCTGCACCGCCGAGGTACGCAATTGCTACATGCTACCCAAGAAGGGCATTCAGCACGCGAGAGCGATACTGTACTCGAGCACCACCATCACGATAGAAAGGATGGTTGTATTCACACGCGGTTAGGTGGAACCTTGATAAATGTGCGCGGATGCGAATGTGCCAAGTTGCCTGCTCTTTCTTAAGGTTCCGGTCTGACGGGAACACAAGGCGAGTAAGAGGGTCTTAGCAAGGGGTGCGCCTTAGCGACTACTATTAGCTTGCACACAAACAAGCGAGTAGTATGGCCTCGCTCGTGTTCAGCTCGGTTTGCGCGTGACCATTGCATTTTTTAAATTGAAAAACCAGTGGTGCTGCTCACTAGCATTATCCGTTTGCTAGCCCTTCCAATGTAACAATGCAGTTGGGCTGAAGAGAGACCAGTACGCCCAGCTATTCTAGAATCCACCTCTGGAAGAACAACTGAGTATTCTCGAAGTTTACCAGCATCCCTCAGTAAAATTTCAACTATGATGTTGTCATCCAATTGCTCCCGAGAAAATTCGACGTGTAGCCCGATGAAAGGTAAGTCGAGTGGGATACCTCTAATGATAGTGCAGAGAATGGCCCGTAGCCTCGGCACTAGAGCCCGTTGCCACTAAGAGCCAAGAAAAGCCCTCGTGAGCAGAGCCTCCGTGCAGTCTCTCTGGCCGGCAATGCAAGGACAGGCTGGTTGC"
print(t.replace('T', 'U'))

AAAACGUUGCCCUCCUCCCAGGUUCUGGUAAACGUGCACAGCGUCCCCUGCCUUUCAUAUCGUACUUCUAGGCCCUCGUAAGGAGUCCGGUAUGUAAUAUCGAUCACGAUCGCCUAUGAGGUGCGGGAAUUGCCAAAUCACUGGCCUAGUUGACUCAGUCACCCUCUUUUGCAAGGCCUUUAGCGAGGAAACCUGCACCGCCGAGGUACGCAAUUGCUACAUGCUACCCAAGAAGGGCAUUCAGCACGCGAGAGCGAUACUGUACUCGAGCACCACCAUCACGAUAGAAAGGAUGGUUGUAUUCACACGCGGUUAGGUGGAACCUUGAUAAAUGUGCGCGGAUGCGAAUGUGCCAAGUUGCCUGCUCUUUCUUAAGGUUCCGGUCUGACGGGAACACAAGGCGAGUAAGAGGGUCUUAGCAAGGGGUGCGCCUUAGCGACUACUAUUAGCUUGCACACAAACAAGCGAGUAGUAUGGCCUCGCUCGUGUUCAGCUCGGUUUGCGCGUGACCAUUGCAUUUUUUAAAUUGAAAAACCAGUGGUGCUGCUCACUAGCAUUAUCCGUUUGCUAGCCCUUCCAAUGUAACAAUGCAGUUGGGCUGAAGAGAGACCAGUACGCCCAGCUAUUCUAGAAUCCACCUCUGGAAGAACAACUGAGUAUUCUCGAAGUUUACCAGCAUCCCUCAGUAAAAUUUCAACUAUGAUGUUGUCAUCCAAUUGCUCCCGAGAAAAUUCGACGUGUAGCCCGAUGAAAGGUAAGUCGAGUGGGAUACCUCUAAUGAUAGUGCAGAGAAUGGCCCGUAGCCUCGGCACUAGAGCCCGUUGCCACUAAGAGCCAAGAAAAGCCCUCGUGAGCAGAGCCUCCGUGCAGUCUCUCUGGCCGGCAAUGCAAGGACAGGCUGGUUGC


## The Secondary and Tertiary Structures of DNA

The primary structure of a nucleic acid is determined by the ordering of its bases, yet it does not describe the large, 3D shape of the molecule. In 1953, the following structure for DNA was proposed:

1) DNA is composed of two strings, running in opposite directions. 

2) Each base bonds to a base in the opposite strand: A-T, C-G (always)

3) The two strands are twisted together into a long spiral double helix. 


1) and 2) compose the **secondary structure** of DNA. 3) Describes the **tertiary structure**.

Note that the **complement** of a base is the base in which it bonds to. The bonding of two complementary bases is called a **base pair (bp)**. Thus the length of a DNA molecule is commonly given in bp rather than nt; we can determine the other strand by taking the complement of the first (running in opposite direction.)

Example:

'AAAACCCGGT' <-> 'ACCGGGTTTT'


In [25]:
s = "AACTCCAGACAGGGGCTCTATCGTTACGAGGCGCCCACGTAGGAGGCGCCGCCGTACCCCAACTTTCCTCGATCTTACACGATAGTTGGGGATAGAGAACGGGCCACATGACTCTAGGCAGCCATTTTGTGTCTCGTAAAGGCTGGGGAACCCTTAGCCGCTTCACGACAACGCGATCTGGTGCGCCCCTTGAGGGGACTGCCTAGCGTTAGTATCGTCCAGGTCCATACCTACTCACTACAGTGTATGGGATCCTTTCGCATCGGCGCTTGACATTTTTAGGCATTGCTCTGAAAGTAACGGTCGACTAGAGTGCTCGAGATCCAATGTCAGAAGCCGCTCCACCGATTTAGGGATGGCTACTGAGGTCTCGTAGCGCAGACTCTGTATTATATGAAGGGCCCATATCGCCGCAAATCAGCGGTAGGGGGCGAAATTGGGCAATTCTTCGAGCTGAGTCTCCGGTTATTGTAAGGTTTGCATGAACCTTCGAGCGGGTGTTGTCTTACAAGCCATCCGAGCAGTTCCCCGGCAAGCCCTGCACCCCGCCTGAATGCTGCATTTTTGGTACAACCTAATGTCTTATAGAGATACCTTAGCTAACGGAGTATAATTTCCATTCTTGCCCTCTACTCAAGATAGGTATAGGACAGTGCCTTTCATCCGTGTACTGACGTAAGCTAAGCACTCGGTGTAACCAGCTGTGAAAATGTAGTACCAGGTTTAGAGGATCACGTCAGGGTTCTTTTTATGTTAATGCACAGGGGGAACGTGGACCACTATTAGATAAGGATCCTTCTAAAGTTTTCGTCGTTGCGGATCGACGTTTCCCACGGTTTATA"
s_c = "" # complement of s. 

for char in s:
    if char == 'A':
        s_c += 'T'
    elif char == 'T':
        s_c += 'A'
    elif char == 'C':
        s_c += 'G'
    elif char == 'G':
        s_c += 'C'
        
print(s_c[::-1])

TATAAACCGTGGGAAACGTCGATCCGCAACGACGAAAACTTTAGAAGGATCCTTATCTAATAGTGGTCCACGTTCCCCCTGTGCATTAACATAAAAAGAACCCTGACGTGATCCTCTAAACCTGGTACTACATTTTCACAGCTGGTTACACCGAGTGCTTAGCTTACGTCAGTACACGGATGAAAGGCACTGTCCTATACCTATCTTGAGTAGAGGGCAAGAATGGAAATTATACTCCGTTAGCTAAGGTATCTCTATAAGACATTAGGTTGTACCAAAAATGCAGCATTCAGGCGGGGTGCAGGGCTTGCCGGGGAACTGCTCGGATGGCTTGTAAGACAACACCCGCTCGAAGGTTCATGCAAACCTTACAATAACCGGAGACTCAGCTCGAAGAATTGCCCAATTTCGCCCCCTACCGCTGATTTGCGGCGATATGGGCCCTTCATATAATACAGAGTCTGCGCTACGAGACCTCAGTAGCCATCCCTAAATCGGTGGAGCGGCTTCTGACATTGGATCTCGAGCACTCTAGTCGACCGTTACTTTCAGAGCAATGCCTAAAAATGTCAAGCGCCGATGCGAAAGGATCCCATACACTGTAGTGAGTAGGTATGGACCTGGACGATACTAACGCTAGGCAGTCCCCTCAAGGGGCGCACCAGATCGCGTTGTCGTGAAGCGGCTAAGGGTTCCCCAGCCTTTACGAGACACAAAATGGCTGCCTAGAGTCATGTGGCCCGTTCTCTATCCCCAACTATCGTGTAAGATCGAGGAAAGTTGGGGTACGGCGGCGCCTCCTACGTGGGCGCCTCGTAACGATAGAGCCCCTGTCTGGAGTT


## Rabbits and Recurrences

We begin with a pair of rabbits - one male and one female. Each month such a pair mates, and one month later k pairs are produced (one male, one female.)
After a month of being born, rabbits reach reproductive age. Rabbits never stop reporducing or dying. 

Given n months and value k, what is the the total number of rabbit pairs after n months?

e.g. n=5, k=3 -> 19 pairs. 

#### Recurrence relationship:

Month m=0, there is one pair of mating age, 0 new born. Total: 1
Month m=1, there is one pair of mating age, k pairs of new borns. Total: k+1
Month m=2, there are k+1 pairs of mating age, k pairs of new borns: Total: 2k+1

In general, for the ith month suppose there are x,y pairs of mating age, new born respectively. Then for month i+1, we have
x+y,k(x) respectvely. 

Let $F_i$ denote the number of pairs of rabbits on the $i^{th}$ month. Then for $i=0, i=1$, we have $F_0=1$, $F_1=k+1$, and $F_i = F_{i-1} + kF_{i-2}$ for $i \geq 2$.

In [2]:
def recurrence(m,k):
    if m == 0:
        return 0
    elif m == 1:
        return 1
    else:
        return recurrence(m-1, k) + k * recurrence(m-2, k) 
# See a bottom up solution.

In [4]:
m = 36
k = 3
print(recurrence(m, k))

3048504677680


## GC-Content

99.9% of the 3.2 billion base pairs in a human genome are common to almost all other humans (barring those with mahor genetic defects.) For any species, the average case genome is very similar.

For DNA, this means that the frequencey of bases will be similar for memebers of the same species. 

For a molecule of DNA that is from an unknown species, we can analysis the frequencies of bases in the molecule and compare them to a database of DNA to attempt to determine the species. 

Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule - this we analysis its **GC-Content**. The GC-content of a DNA string is the percentage of symbols in the string that are either 'C' or 'G'.



### FASTA format

The FASTA format is a means to label strings (DNA) In this format, we introduce the label of the string using a '>' followed by the label. Then the next line is the string itself. For example,

\>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG

\>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC

\>Rosalind_0808CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT



### Problem:

Give a list of FASTA formatted DNA, compute the string with largest GC-content.

In [31]:
# Get input from file, store content in list of pairs: label, string
file = open('question_inputs/computeGC.txt', 'r')

# preprocessing
dna_list = file.read().replace('\n', '').split('>')
dna_list.pop(0)

# For each pair, compute GC-content of string, store label and freq.
largest_pair = [None, -1.0]

i = 0
while (i < len(dna_list)):
    label = dna_list[i]
    seq = dna_list[i+1]
    # Compute GC-content
    count = 0
    for char in seq:
        if char == 'G' or char == 'C':
            count += 1
    if largest_pair[1] < count/len(seq):
        largest_pair = [label, count/len(seq)]
    
    i += 2
    
print(largest_pair)

['Rosalind_6489', 0.5282583621683967]
