### A Rapid Introduction to Molecular Biology 

Cell nuclei are filled with chromatin, which contains nucleic acids. Nucleic acids are polymers whose monomer unit is the **nucleotide**. 

**Nucleotides** are made up of a sugar, a negatively charged phosphate ion, and a *nucleobase* (*base* for short). Nucleotides of a specific nucleic acid always have the same sugar and phosphate ion, differing only in the base. 

Polymerization occurs when the sugar of one nucleotide bonds to the phosphate of the next, forming the sugar-phosphate backbone of a nucleic acid strand. Strands are distinguished by the order of their bases. 

For DNA (deoxyribonucleic acid), the sugar is deoxyribose, and the four possible bases are A, C, G, and T (adenine, cytosine, guanine, thymine).


### Problem 
Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [3]:
# a set alone can't hold the counts (and is unordered),
# so use a dictionary to count occurrences

bases = {}

with open("datasets/rosalind_dna.txt", "r") as f:
    string = f.readline().strip()

# iterate over the fixed alphabet so every key exists, even if a base is absent
for base in "ACGT":
    bases[base] = string.count(base)

print(bases["A"], bases["C"], bases["G"], bases["T"])

# storing the counts in a dictionary makes the code easier to extend and maintain

206 228 199 234


In [4]:

## alternatively, using a generator expression & unpacking

with open("datasets/rosalind_dna.txt", "r") as fh:
    dna = fh.read()
    print(*(dna.count(nuc) for nuc in "ACGT"))


# although more compact, this version can be less readable
# and less efficient for larger strings because we are looping through the
# whole string once per base (four passes in total)


206 228 199 234
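
Both cells above call `str.count` once per symbol, so the string is scanned four times. As a sketch of a single-pass alternative, `collections.Counter` from the standard library can tally every symbol in one scan. The function name `count_nucleotides` and the short test string are illustrative and not part of the original solution.

```python
from collections import Counter


def count_nucleotides(dna):
    """Return the counts of A, C, G, T in `dna` as a tuple (hypothetical helper)."""
    counts = Counter(dna)  # one pass over the string
    # Counter returns 0 for missing keys, so an absent base does not raise KeyError
    return tuple(counts[base] for base in "ACGT")


print(*count_nucleotides("AACGT"))  # prints 2 1 1 1
```

For strings of at most 1000 nt either approach is fine; the single pass mostly matters for much longer inputs or larger alphabets.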


### The Second Nucleic Acid 

RNA (ribonucleic acid), whose sugar is ribose rather than deoxyribose, is also present in chromatin. It has the base *uracil* in place of *thymine*. 

DNA serves as the template for mRNA, which is created during transcription and can then travel to the far reaches of the cell. In contrast, DNA stays in the nucleus. 

### Problem 
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.
Return: The transcribed RNA string of t.
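
To make the definition concrete, here is a minimal sketch on a short hand-made string. The helper name `transcribe` and the alphabet check are assumptions added for illustration; the problem itself only requires the replacement.

```python
def transcribe(dna):
    """Return the RNA string for a DNA coding strand (hypothetical helper)."""
    # Assumption: reject anything outside the DNA alphabet instead of passing it through
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA string, found: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(transcribe("GATTACA"))  # prints GAUUACA
```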

In [5]:
with open("datasets/rosalind_rna.txt","r") as t: 
    t = t.read().strip()
    u = ""

    for bp in t: 
        if bp == "T":
            u = u + "U"
        else: 
            u = u + bp 

print(u)

# this approach works but is inefficient
# strings are immutable - each concatenation may build a new string
# which can result in O(n^2) complexity as string length grows

AAGGAAGUGUCAUUCAGGGGUAAUAGCAGGCUGACUUGCCCCCCCGUACCGCACUGGAGUACAGCGAUUAAUGAUGUGGCUUACGUCCUGGAUUUGUGCACAGGUGGGCUUCCCGCAAGUGUGUGGCCGUGCCUCCGGUAGCGUCAUAAGGUAAGUGCAGCCCGCACCUAUUGUAGAUCGUAGUAGGGGAGAGGGGGGCUCUCUGGAUUAGCAGGGGGAAACAGAUGGGGCAUAGAUACUUGAGAUGCACCAGUCCGGCAAAUGGCUAAUGGUUAUCGCCUCGAGUAGACUACUCAGUUAGGCGUCUCCUUGAUUUGAGAUUGUAUGCCCGCCGACUACCCCCAUUCCUGAUAACGUGAUCGGGUAAUACGAAAGAUUAAGAUAUGAUUGUAAACUUUUGUGUCGCAUGAGUGAGAGUACCGUAGCAGGUCGUAAGAAGUGAAUUACAGCUACCUGCCUGUUCUUUGUUGAGCAUUAAAUGGAGAGACAGACGUUCUCCGGUUAUUGUGUAACGCCGCUAAUACCUAUAUUGCCGACGGCUGAAAAUGAAUCUACGCCAUUAACACCCCGUCCGAUUUCAUCGCGUCCAUAAAGCCCCCCCGCGAUCUCAGAGUGACAUUUAGUAAGCUACCCCCCGUAUCCUUACUGGAAACAAUGAGCCAGUUGAUGCCGUCGUUCCGCGUGUGGUGCCGUCGGAGUCUAUAGGUAGAACUAUCGUUGUUCCCGCGCGAAAUGAACCCGGUUGUAAAAUUAGCGCAGCAGUGGACGCAACUCCUGCUCUCGUCCAAGGUAGGGGACCCAGUACUGCAAGCUUGCUAGGAGUCAAUCAUAACUUCAAUCUGUGAUCCCCUCGAUAGGAACUUGAGGGGGGAGUGCCCAAAAUCCCAAAACGAACCAGCAGA


In [6]:
# Note: the *reverse* complement arises because the two strands run in
# *opposite* directions

# reverse first, then replace each base with its complement
# (the two steps can be done in either order; the result is the same)
with open('datasets/rosalind_revc.txt','r') as s:
    s = s.read().strip()

sc_list = list(s) 
sc_list.reverse()


# list comprehension
sc_list = [
    'A' if nt == 'T' else 
    'T' if nt == 'A' else
    'G' if nt == 'C' else
    'C' if nt == 'G' else nt

    for nt in sc_list
]

sc = ''.join(sc_list)
sc



'CGGGAGCTGTGCGTAGCGGTTCCATGATGACCCATTCCACCGTTCAAACCCTGACACCTTAGCATCGGTACTTGAGCGTTAAGTGCTGAAGAACAGAACTCTCCTCACGATGTACCTCTACAAATCATTAAACTGTTAATTCGGCCGCAAAACCTAGTCCGGGACTCTAATTTCATACTCTAACTCAGGCTCGCTGTCTGATGCGATCAACCGTTATCTTCTCCCGCCCTGGTAGGGTGCTAAAAGATGTTATGGGCTCGAGAGGACGCTTTGTCTGTTGAATCCTCCTAGCCTCAGAGCTTCCATCTATAAACGTTCGTTCTGACCACGTACCGGTATCGGCGCGCAGTCCTCCTAATTCAGAACTGAGTAAACGAGGAAAACAATTTTATAGGTGTTACACAGCGGCAACCATGACACTGAACGTAAAATCCCGGGCTGTAATCGAGAAGTGCGAGAGTGTAAAATGGACGTTTTCAGTGCATCCTGCATCGGTCCGGCGGGTTCTGATGTTCGGCGGACCGCTTAGGTGTGACACGCGAGAAAGTAACACGACTCACTGGTGCACCAATCTCCTATGTTGCAAACGGATTCTTGTCACTGCCTCTTTTGAACCCCCCACTAAAAGAGGTGCTCCAAGTTTGCTACACGAACATGACGCTCGAGAACAATCGTCGAACGAAGCAGATTGGTGGTTTATAAAACTATTTTTACAAGTCAGTTCTCGATATATCCGTTGGGGGTCGCCTAAATGACACTCGGAAGAAAACGACATGTTACTGAGTTGTCTCCTACTAGACGCGGTCACTGAAACGCAATCTTGTGGTGTGACCAGAGATCATAGGGGGGACC'

### The Secondary and Tertiary Structures of DNA 

The DNA molecule is a double helix made up of two strands running in opposite directions. Each base bonds to its complement on the opposite strand (A with T, G with C); each bonded pair is called a **base pair**.

The secondary structure comprises the two opposite strands and their base pairs, while the tertiary structure refers to the 3D shape of the molecule. 

### Problem 
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s<sup>c</sup> formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.
Return: The reverse complement s<sup>c</sup> of s.
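
The example from the problem statement can be checked directly. The sketch below (the helper names are illustrative) also suggests that reversing and complementing can be applied in either order, since complementing acts on each symbol independently.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_then_complement(s):
    """Reverse the string, then complement each symbol."""
    return s[::-1].translate(COMPLEMENT)


def complement_then_reverse(s):
    """Complement each symbol, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


print(reverse_then_complement("GTCA"))  # prints TGAC
print(complement_then_reverse("GTCA"))  # prints TGAC
```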



In [7]:
## more efficient method using translation table

with open('datasets/rosalind_revc.txt', 'r') as s:
    s = s.read().strip()

sc = s[::-1].translate(str.maketrans("ATGC", "TACG"))
sc


'CGGGAGCTGTGCGTAGCGGTTCCATGATGACCCATTCCACCGTTCAAACCCTGACACCTTAGCATCGGTACTTGAGCGTTAAGTGCTGAAGAACAGAACTCTCCTCACGATGTACCTCTACAAATCATTAAACTGTTAATTCGGCCGCAAAACCTAGTCCGGGACTCTAATTTCATACTCTAACTCAGGCTCGCTGTCTGATGCGATCAACCGTTATCTTCTCCCGCCCTGGTAGGGTGCTAAAAGATGTTATGGGCTCGAGAGGACGCTTTGTCTGTTGAATCCTCCTAGCCTCAGAGCTTCCATCTATAAACGTTCGTTCTGACCACGTACCGGTATCGGCGCGCAGTCCTCCTAATTCAGAACTGAGTAAACGAGGAAAACAATTTTATAGGTGTTACACAGCGGCAACCATGACACTGAACGTAAAATCCCGGGCTGTAATCGAGAAGTGCGAGAGTGTAAAATGGACGTTTTCAGTGCATCCTGCATCGGTCCGGCGGGTTCTGATGTTCGGCGGACCGCTTAGGTGTGACACGCGAGAAAGTAACACGACTCACTGGTGCACCAATCTCCTATGTTGCAAACGGATTCTTGTCACTGCCTCTTTTGAACCCCCCACTAAAAGAGGTGCTCCAAGTTTGCTACACGAACATGACGCTCGAGAACAATCGTCGAACGAAGCAGATTGGTGGTTTATAAAACTATTTTTACAAGTCAGTTCTCGATATATCCGTTGGGGGTCGCCTAAATGACACTCGGAAGAAAACGACATGTTACTGAGTTGTCTCCTACTAGACGCGGTCACTGAAACGCAATCTTGTGGTGTGACCAGAGATCATAGGGGGGACC'

In [8]:
with open('datasets/rosalind_rna.txt','r') as t: 
    t = t.read().strip()

u_l = []

for bp in t: 
    if bp == "T":
        u_l.append('U')
    else: 
        u_l.append(bp)


u = ''.join(u_l)
u

# this is a more efficient way, with O(n) complexity (linear)

'AAGGAAGUGUCAUUCAGGGGUAAUAGCAGGCUGACUUGCCCCCCCGUACCGCACUGGAGUACAGCGAUUAAUGAUGUGGCUUACGUCCUGGAUUUGUGCACAGGUGGGCUUCCCGCAAGUGUGUGGCCGUGCCUCCGGUAGCGUCAUAAGGUAAGUGCAGCCCGCACCUAUUGUAGAUCGUAGUAGGGGAGAGGGGGGCUCUCUGGAUUAGCAGGGGGAAACAGAUGGGGCAUAGAUACUUGAGAUGCACCAGUCCGGCAAAUGGCUAAUGGUUAUCGCCUCGAGUAGACUACUCAGUUAGGCGUCUCCUUGAUUUGAGAUUGUAUGCCCGCCGACUACCCCCAUUCCUGAUAACGUGAUCGGGUAAUACGAAAGAUUAAGAUAUGAUUGUAAACUUUUGUGUCGCAUGAGUGAGAGUACCGUAGCAGGUCGUAAGAAGUGAAUUACAGCUACCUGCCUGUUCUUUGUUGAGCAUUAAAUGGAGAGACAGACGUUCUCCGGUUAUUGUGUAACGCCGCUAAUACCUAUAUUGCCGACGGCUGAAAAUGAAUCUACGCCAUUAACACCCCGUCCGAUUUCAUCGCGUCCAUAAAGCCCCCCCGCGAUCUCAGAGUGACAUUUAGUAAGCUACCCCCCGUAUCCUUACUGGAAACAAUGAGCCAGUUGAUGCCGUCGUUCCGCGUGUGGUGCCGUCGGAGUCUAUAGGUAGAACUAUCGUUGUUCCCGCGCGAAAUGAACCCGGUUGUAAAAUUAGCGCAGCAGUGGACGCAACUCCUGCUCUCGUCCAAGGUAGGGGACCCAGUACUGCAAGCUUGCUAGGAGUCAAUCAUAACUUCAAUCUGUGAUCCCCUCGAUAGGAACUUGAGGGGGGAGUGCCCAAAAUCCCAAAACGAACCAGCAGA'

In [9]:
## another alternative 
# efficient and most Pythonic, using replace() as designed

with open('datasets/rosalind_rna.txt','r') as t: 
    t = t.read().strip()
    u = t.replace("T","U")

print(u)

AAGGAAGUGUCAUUCAGGGGUAAUAGCAGGCUGACUUGCCCCCCCGUACCGCACUGGAGUACAGCGAUUAAUGAUGUGGCUUACGUCCUGGAUUUGUGCACAGGUGGGCUUCCCGCAAGUGUGUGGCCGUGCCUCCGGUAGCGUCAUAAGGUAAGUGCAGCCCGCACCUAUUGUAGAUCGUAGUAGGGGAGAGGGGGGCUCUCUGGAUUAGCAGGGGGAAACAGAUGGGGCAUAGAUACUUGAGAUGCACCAGUCCGGCAAAUGGCUAAUGGUUAUCGCCUCGAGUAGACUACUCAGUUAGGCGUCUCCUUGAUUUGAGAUUGUAUGCCCGCCGACUACCCCCAUUCCUGAUAACGUGAUCGGGUAAUACGAAAGAUUAAGAUAUGAUUGUAAACUUUUGUGUCGCAUGAGUGAGAGUACCGUAGCAGGUCGUAAGAAGUGAAUUACAGCUACCUGCCUGUUCUUUGUUGAGCAUUAAAUGGAGAGACAGACGUUCUCCGGUUAUUGUGUAACGCCGCUAAUACCUAUAUUGCCGACGGCUGAAAAUGAAUCUACGCCAUUAACACCCCGUCCGAUUUCAUCGCGUCCAUAAAGCCCCCCCGCGAUCUCAGAGUGACAUUUAGUAAGCUACCCCCCGUAUCCUUACUGGAAACAAUGAGCCAGUUGAUGCCGUCGUUCCGCGUGUGGUGCCGUCGGAGUCUAUAGGUAGAACUAUCGUUGUUCCCGCGCGAAAUGAACCCGGUUGUAAAAUUAGCGCAGCAGUGGACGCAACUCCUGCUCUCGUCCAAGGUAGGGGACCCAGUACUGCAAGCUUGCUAGGAGUCAAUCAUAACUUCAAUCUGUGAUCCCCUCGAUAGGAACUUGAGGGGGGAGUGCCCAAAAUCCCAAAACGAACCAGCAGA
