# Rosalind Problems - Bioinformatics Stronghold

## Counting DNA Nucleotides (nt)

The nucleus of a eukaryotic cell is densely filled with macromolecules collectively called chromatin. One class of macromolecules in the chromatin is the nucleic acids. Nucleic acids are polymers: repeating chains of small, similarly structured molecules called monomers. Because nucleic acid polymers tend to be long and thin, they are commonly called strands.

The **nucleotide (nt)** is the monomer of a nucleic acid and serves as the unit of strand length. It is composed of three parts: a sugar molecule, a negatively charged phosphate ion, and a compound called a nucleobase (or simply base). The sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, forming a sugar-phosphate backbone.

Nucleotides of a specific type of nucleic acid always contain the same sugar and phosphate molecules; they differ only in their base. A strand of nucleic acid can therefore be distinguished from another solely by the order of its bases; this ordering defines a nucleic acid's **primary structure**.

For a strand of **deoxyribose nucleic acid (DNA)**, the four nucleobases are molecules called **adenine (A)**, **cytosine (C)**, **guanine (G)**, and **thymine (T)**.

### Problem: 

Given a string over the alphabet {A,C,G,T}, count the number of times each symbol occurs in the string.

In [5]:
input_string = "ATTCGTAGGCATCGCTCCTGCCCTGTGGTTACGTTCGCACCGGGTGTCAGGTCTGAAATTATGAGAATTTTACACTCCTAATGTTGGGCGACTGAACTGATTTTGAAGTCAGGCGGTAGTATACTGAGTCAGGGTCGGAGAGTGAGAGATGTATGAGCGTGGTTTGCTCCGCTAGCCCAGGCCGGCGAGTGCTGATGCGTAGCTGTATCTACAATTCGTCAGCTTTTGGCGCAGTAATGAATATCAGAGTGCCTAAGCAGCTCTTGCCCTACGGCTAATCGAATCTTCAGATTCAAAGGGGTACACTAGCTGTTGCCAAACCGTCGGAGACGCCACCTCACGAACTCACATCATATCAGTCGAAGATGGTACGTTAGAGCCCATGTGCCTATGGAGGCATGGTATATAGCCTGGCTGCGGTGTAATGAAGTCACCTGGCCAACCTGGGGCACTTGTCTCGTGACATAGTCTTATAGGGGTGCCGCAAGAAGGTCCCACCAAAAATCACGGATACGCGTTCGCCCGAAGGCCCTTGGAAACCTTCCAGCTCAAACTTGAGAGATACGTAGTTTTCCCACCCGTCACAAGAGGTGAGATTACAACAAAGTATGTCTGCTCGTCCCTGTTAGCTATAGCAGCTCGTATAACAACATTTCCCGACTGCGCGTCGCTGATCGAACAATAGGCTAGAGTTACAGAGCCTATCACAGGCCGAAGAGGACAAGCTTCGCTGGTACCTATAAAGGGAAAACCCCAGATTAATCCCGCCTTATCGTCCTAAGTAGTCGCGGTCAATATGGTACTATCCGACACCCTTTGATTGTGAGTCTTCACTTACCTATTCCGGATTTCCTCTCCCAGTAGGTTATAGTTCATGACGTCTTTTAGCACGGGATCACACGTGGCATGGATTAAATAGGATACACCTGGTGCCATGCAAAGTCCACGCTTAAGCCGAAGAACCCCGTAAAGCAC"

# Create a dictionary with one counter per nucleotide, then scan the string and count occurrences.
# The name `counts` avoids shadowing Python's built-in `dict` type.
counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}

for char in input_string:
    counts[char] += 1

print(counts.values())

dict_values([247, 244, 242, 242])
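
As an alternative, the same tally can be sketched with `collections.Counter` from the standard library. The function name `count_nucleotides` and the short test string are assumptions introduced here for illustration; the sketch prints the counts in A, C, G, T order, separated by spaces.

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of A, C, G and T in dna, in that order."""
    tally = Counter(dna)  # maps each symbol to its number of occurrences
    # Counter returns 0 for a missing key, so an absent base is not an error.
    return [tally[base] for base in "ACGT"]

# Small example string so the result is easy to check by hand.
print(*count_nucleotides("AGCTTTTCATTCTGAC"))  # prints: 3 4 2 7
```

Unlike a dictionary pre-filled with four keys, this version does not raise a `KeyError` on an unexpected symbol; it simply ignores it in the output.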


## The Second Nucleic Acid (RNA)

Alongside DNA, the chromatin contains a second nucleic acid built on a different sugar, called ribose; this molecule is known as **ribose nucleic acid (RNA)**. RNA differs from DNA in that it contains a base called **uracil (U)** instead of thymine.

The primary structures of DNA and RNA are similar because DNA serves as the blueprint for creating a special kind of RNA molecule called **messenger RNA (mRNA)**. mRNA is created during RNA transcription, in which a strand of DNA is used as a template for constructing a strand of RNA by copying nucleotides one at a time, with uracil used in place of thymine.
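
The copying step described above can be sketched as a loop that walks the DNA string one nucleotide at a time. This is only a minimal illustration of the simplified model used here, not of the full biological process; the function name `transcribe` and the short example string are assumptions made for this sketch.

```python
def transcribe(dna):
    """Build an RNA string from dna, copying one nucleotide at a time."""
    rna = []
    for base in dna:
        # Thymine is the only base that changes; A, C and G are copied as-is.
        rna.append("U" if base == "T" else base)
    return "".join(rna)

print(transcribe("GATGGAACTTGACTACGTAAATT"))  # prints: GAUGGAACUUGACUACGUAAAUU
```

For a whole string, `dna.replace('T', 'U')` gives the same result in a single call; the explicit loop is shown only to mirror the one-at-a-time description.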

### Problem: 

Given a DNA string t, convert t to RNA by replacing all thymine (T) with uracil (U).

In [6]:
t = "AAAACGTTGCCCTCCTCCCAGGTTCTGGTAAACGTGCACAGCGTCCCCTGCCTTTCATATCGTACTTCTAGGCCCTCGTAAGGAGTCCGGTATGTAATATCGATCACGATCGCCTATGAGGTGCGGGAATTGCCAAATCACTGGCCTAGTTGACTCAGTCACCCTCTTTTGCAAGGCCTTTAGCGAGGAAACCTGCACCGCCGAGGTACGCAATTGCTACATGCTACCCAAGAAGGGCATTCAGCACGCGAGAGCGATACTGTACTCGAGCACCACCATCACGATAGAAAGGATGGTTGTATTCACACGCGGTTAGGTGGAACCTTGATAAATGTGCGCGGATGCGAATGTGCCAAGTTGCCTGCTCTTTCTTAAGGTTCCGGTCTGACGGGAACACAAGGCGAGTAAGAGGGTCTTAGCAAGGGGTGCGCCTTAGCGACTACTATTAGCTTGCACACAAACAAGCGAGTAGTATGGCCTCGCTCGTGTTCAGCTCGGTTTGCGCGTGACCATTGCATTTTTTAAATTGAAAAACCAGTGGTGCTGCTCACTAGCATTATCCGTTTGCTAGCCCTTCCAATGTAACAATGCAGTTGGGCTGAAGAGAGACCAGTACGCCCAGCTATTCTAGAATCCACCTCTGGAAGAACAACTGAGTATTCTCGAAGTTTACCAGCATCCCTCAGTAAAATTTCAACTATGATGTTGTCATCCAATTGCTCCCGAGAAAATTCGACGTGTAGCCCGATGAAAGGTAAGTCGAGTGGGATACCTCTAATGATAGTGCAGAGAATGGCCCGTAGCCTCGGCACTAGAGCCCGTTGCCACTAAGAGCCAAGAAAAGCCCTCGTGAGCAGAGCCTCCGTGCAGTCTCTCTGGCCGGCAATGCAAGGACAGGCTGGTTGC"
print(t.replace('T', 'U'))

AAAACGUUGCCCUCCUCCCAGGUUCUGGUAAACGUGCACAGCGUCCCCUGCCUUUCAUAUCGUACUUCUAGGCCCUCGUAAGGAGUCCGGUAUGUAAUAUCGAUCACGAUCGCCUAUGAGGUGCGGGAAUUGCCAAAUCACUGGCCUAGUUGACUCAGUCACCCUCUUUUGCAAGGCCUUUAGCGAGGAAACCUGCACCGCCGAGGUACGCAAUUGCUACAUGCUACCCAAGAAGGGCAUUCAGCACGCGAGAGCGAUACUGUACUCGAGCACCACCAUCACGAUAGAAAGGAUGGUUGUAUUCACACGCGGUUAGGUGGAACCUUGAUAAAUGUGCGCGGAUGCGAAUGUGCCAAGUUGCCUGCUCUUUCUUAAGGUUCCGGUCUGACGGGAACACAAGGCGAGUAAGAGGGUCUUAGCAAGGGGUGCGCCUUAGCGACUACUAUUAGCUUGCACACAAACAAGCGAGUAGUAUGGCCUCGCUCGUGUUCAGCUCGGUUUGCGCGUGACCAUUGCAUUUUUUAAAUUGAAAAACCAGUGGUGCUGCUCACUAGCAUUAUCCGUUUGCUAGCCCUUCCAAUGUAACAAUGCAGUUGGGCUGAAGAGAGACCAGUACGCCCAGCUAUUCUAGAAUCCACCUCUGGAAGAACAACUGAGUAUUCUCGAAGUUUACCAGCAUCCCUCAGUAAAAUUUCAACUAUGAUGUUGUCAUCCAAUUGCUCCCGAGAAAAUUCGACGUGUAGCCCGAUGAAAGGUAAGUCGAGUGGGAUACCUCUAAUGAUAGUGCAGAGAAUGGCCCGUAGCCUCGGCACUAGAGCCCGUUGCCACUAAGAGCCAAGAAAAGCCCUCGUGAGCAGAGCCUCCGUGCAGUCUCUCUGGCCGGCAAUGCAAGGACAGGCUGGUUGC


## The Secondary and Tertiary Structures of DNA

The primary structure of a nucleic acid is determined by the ordering of its bases, yet it does not describe the overall three-dimensional shape of the molecule. In 1953, Watson and Crick proposed the following structure for DNA:

1) DNA is composed of two strands, running in opposite directions.

2) Each base bonds to a base in the opposite strand: A always bonds with T, and C always bonds with G.

3) The two strands are twisted together into a long spiral called a double helix.


Points 1) and 2) make up the **secondary structure** of DNA. Point 3) describes the **tertiary structure**.

Note that the **complement** of a base is the base to which it bonds. The bonding of two complementary bases is called a **base pair (bp)**. Because the two strands pair base for base, the length of a DNA molecule is commonly given in bp rather than nt. Given one strand, we can determine the other by taking the complement of each base and reversing the result, since the second strand runs in the opposite direction; this is called the **reverse complement**.

Example:

'AAAACCCGGT' <-> 'ACCGGGTTTT'
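
The example above can be checked with a short sketch built on `str.maketrans` and `str.translate`. The names `COMPLEMENT` and `reverse_complement` are assumptions introduced for illustration.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Return the opposite strand: complement every base, then reverse."""
    return strand.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAACCCGGT"))  # prints: ACCGGGTTTT
```

Applying the function twice returns the original strand, which is a quick sanity check. Note that `translate` leaves any character outside A, C, G, T unchanged rather than dropping it.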


In [25]:
s = "AACTCCAGACAGGGGCTCTATCGTTACGAGGCGCCCACGTAGGAGGCGCCGCCGTACCCCAACTTTCCTCGATCTTACACGATAGTTGGGGATAGAGAACGGGCCACATGACTCTAGGCAGCCATTTTGTGTCTCGTAAAGGCTGGGGAACCCTTAGCCGCTTCACGACAACGCGATCTGGTGCGCCCCTTGAGGGGACTGCCTAGCGTTAGTATCGTCCAGGTCCATACCTACTCACTACAGTGTATGGGATCCTTTCGCATCGGCGCTTGACATTTTTAGGCATTGCTCTGAAAGTAACGGTCGACTAGAGTGCTCGAGATCCAATGTCAGAAGCCGCTCCACCGATTTAGGGATGGCTACTGAGGTCTCGTAGCGCAGACTCTGTATTATATGAAGGGCCCATATCGCCGCAAATCAGCGGTAGGGGGCGAAATTGGGCAATTCTTCGAGCTGAGTCTCCGGTTATTGTAAGGTTTGCATGAACCTTCGAGCGGGTGTTGTCTTACAAGCCATCCGAGCAGTTCCCCGGCAAGCCCTGCACCCCGCCTGAATGCTGCATTTTTGGTACAACCTAATGTCTTATAGAGATACCTTAGCTAACGGAGTATAATTTCCATTCTTGCCCTCTACTCAAGATAGGTATAGGACAGTGCCTTTCATCCGTGTACTGACGTAAGCTAAGCACTCGGTGTAACCAGCTGTGAAAATGTAGTACCAGGTTTAGAGGATCACGTCAGGGTTCTTTTTATGTTAATGCACAGGGGGAACGTGGACCACTATTAGATAAGGATCCTTCTAAAGTTTTCGTCGTTGCGGATCGACGTTTCCCACGGTTTATA"
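s_c = ""  # complement of s, built one base at a time

for char in s:
    if char == 'A':
        s_c += 'T'
    elif char == 'T':
        s_c += 'A'
    elif char == 'C':
        s_c += 'G'
    elif char == 'G':
        s_c += 'C'

# Reverse the complement, because the opposite strand runs in the opposite direction.
print(s_c[::-1])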

TATAAACCGTGGGAAACGTCGATCCGCAACGACGAAAACTTTAGAAGGATCCTTATCTAATAGTGGTCCACGTTCCCCCTGTGCATTAACATAAAAAGAACCCTGACGTGATCCTCTAAACCTGGTACTACATTTTCACAGCTGGTTACACCGAGTGCTTAGCTTACGTCAGTACACGGATGAAAGGCACTGTCCTATACCTATCTTGAGTAGAGGGCAAGAATGGAAATTATACTCCGTTAGCTAAGGTATCTCTATAAGACATTAGGTTGTACCAAAAATGCAGCATTCAGGCGGGGTGCAGGGCTTGCCGGGGAACTGCTCGGATGGCTTGTAAGACAACACCCGCTCGAAGGTTCATGCAAACCTTACAATAACCGGAGACTCAGCTCGAAGAATTGCCCAATTTCGCCCCCTACCGCTGATTTGCGGCGATATGGGCCCTTCATATAATACAGAGTCTGCGCTACGAGACCTCAGTAGCCATCCCTAAATCGGTGGAGCGGCTTCTGACATTGGATCTCGAGCACTCTAGTCGACCGTTACTTTCAGAGCAATGCCTAAAAATGTCAAGCGCCGATGCGAAAGGATCCCATACACTGTAGTGAGTAGGTATGGACCTGGACGATACTAACGCTAGGCAGTCCCCTCAAGGGGCGCACCAGATCGCGTTGTCGTGAAGCGGCTAAGGGTTCCCCAGCCTTTACGAGACACAAAATGGCTGCCTAGAGTCATGTGGCCCGTTCTCTATCCCCAACTATCGTGTAAGATCGAGGAAAGTTGGGGTACGGCGGCGCCTCCTACGTGGGCGCCTCGTAACGATAGAGCCCCTGTCTGGAGTT
