<h1 id="toctitle">Text manipulation solutions to exercises</h1>
<ul id="toc"/>

##Calculating AT content

First, we store the values we need: the A count, the T count, and the length of the DNA sequence:

In [1]:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')

We can print these out to check that they look OK:

In [2]:
print("length: " + str(length))
print("A count: " + str(a_count))
print("T count: " + str(t_count))

length: 54
A count: 16
T count: 21


Now let's try the calculation:

In [3]:
at_content = a_count + t_count / length
print("AT content is " + str(at_content))

AT content is 16


This is definitely not right: the AT content should be between zero and one. Division binds more tightly than addition, so only `t_count` was divided by `length`. We need parentheses:

In [5]:
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0


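That is still wrong. In Python 2, dividing one integer by another with `/` throws away the fractional part, so remember to include the line that switches on true division: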

In [6]:
from __future__ import division

at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.685185185185
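As a side-by-side comparison, the sketch below shows the two kinds of division on the counts from this exercise. It uses `float()` rather than the `__future__` import; that is a substitute technique, not the approach taken above, chosen here only because it behaves the same way in Python 2 and Python 3.

```python
# Counts taken from the printed output earlier in this exercise.
a_count = 16
t_count = 21
length = 54

# Floor division discards the fractional part. This is what a plain "/"
# does for two integers in Python 2.
floored = (a_count + t_count) // length

# Converting one operand to a float forces true division in any version.
at_content = (a_count + t_count) / float(length)

print("floor division: " + str(floored))
print("true division: " + str(round(at_content, 4)))
```

The rounding is only there to keep the printed value short; the calculation itself is unchanged.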


Final version:

In [7]:
from __future__ import division

my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.685185185185
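If the same calculation is needed for several sequences, it could be wrapped in a function. The following is a minimal sketch rather than part of the original solution: the name `at_content`, the decision to accept lower-case input, and the choice to return zero for an empty sequence are all assumptions made for illustration.

```python
def at_content(dna):
    """Return the fraction of A and T bases in dna (hypothetical helper)."""
    dna = dna.upper()  # assumption: lower-case input should be counted too
    if len(dna) == 0:
        return 0.0  # assumption: report zero rather than divide by zero
    # float() forces true division in both Python 2 and Python 3.
    return (dna.count('A') + dna.count('T')) / float(len(dna))

my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
print("AT content is " + str(at_content(my_dna)))
```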


##Complementing DNA

The obvious approach, four replacements one after another, doesn't work:

In [8]:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
# replace A with T
replacement1 = my_dna.replace('A', 'T')
# replace T with A
replacement2 = replacement1.replace('T', 'A')
# replace C with G
replacement3 = replacement2.replace('C', 'G')
# replace G with C
replacement4 = replacement3.replace('G', 'C')
# print the result of the final replacement
print(replacement4)

ACACAACCAAAACCAAAACAAAAACCAAACAAACAAAAAAAACCAACCCAACAA


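This fails because we change each A to T and then change every T, including the new ones, back to A; the same thing happens with C and G. To avoid this, we can either replace via intermediate placeholder letters: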

In [9]:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
replacement1 = my_dna.replace('A', 'H')
replacement2 = replacement1.replace('T', 'J')
replacement3 = replacement2.replace('C', 'K')
replacement4 = replacement3.replace('G', 'L')
replacement5 = replacement4.replace('H', 'T')
replacement6 = replacement5.replace('J', 'A')
replacement7 = replacement6.replace('K', 'G')
replacement8 = replacement7.replace('L', 'C')
print(replacement8)

TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA


or write the replacements in lower case and convert back to upper case at the end:

In [10]:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
replacement1 = my_dna.replace('A', 't')
print(replacement1)
replacement2 = replacement1.replace('T', 'a')
print(replacement2)
replacement3 = replacement2.replace('C', 'g')
print(replacement3)
replacement4 = replacement3.replace('G', 'c')
print(replacement4)
print(replacement4.upper())

tCTGtTCGtTTtCGTtTtGTtTTTGCTtTCtTtCtTtTtTtTCGtTGCGTTCtT
tCaGtaCGtaatCGatatGataaaGCataCtatCtatatataCGtaGCGaaCta
tgaGtagGtaatgGatatGataaaGgatagtatgtatatatagGtaGgGaagta
tgactagctaatgcatatcataaacgatagtatgtatatatagctacgcaagta
TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA
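Both fixes work by making sure a replaced base can never be replaced a second time. Another way to get the same guarantee, sketched here as an alternative and not as the method this exercise expects, is to look up each base once in a dictionary. The name `pairs` is an assumption introduced for the example.

```python
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

# Each base maps straight to its partner, so no replacement can be undone
# by a later one.
pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

# Build the complement one base at a time, then join the pieces.
complement = "".join([pairs[base] for base in my_dna])
print(complement)
```

This version raises a `KeyError` if the sequence contains anything other than upper-case A, T, C or G, which may or may not be the behaviour you want.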


##Restriction fragment lengths

We can work out the answer just by looking at the sequence, with the recognition site marked and the position numbers written above it:

`0         1         2         3         4         5`
`0123456789012345678901234567890123456789012345678901234`
`ACTGATCGATTACGTATAGTA`__`GAATTC`__`TATCATACATATATATCGATGCGTTCAT`


The cut falls after the first base of the `GAATTC` motif, so the first fragment runs from position 0 to 21 and the second from position 22 to the end. The lengths are therefore 22 and 33.

The same idea in code:

In [12]:
my_dna = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
frag1_length = my_dna.find("GAATTC") + 1
frag2_length = len(my_dna) - frag1_length
print("length of fragment one is " + str(frag1_length))
print("length of fragment two is " + str(frag2_length))

length of fragment one is 22
length of fragment two is 33
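One way to double-check these numbers is to extract the fragments themselves and measure them. The sketch below assumes the slicing syntax used in the next exercise; the names `cut`, `frag1` and `frag2` are introduced here for illustration.

```python
my_dna = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"

# The cut position is one past the start of the motif, as in the code above.
cut = my_dna.find("GAATTC") + 1

# Slicing is exclusive at the end, so the two pieces never overlap.
frag1 = my_dna[:cut]
frag2 = my_dna[cut:]

print(frag1 + " (" + str(len(frag1)) + ")")
print(frag2 + " (" + str(len(frag2)) + ")")
```

If the motif were absent, `find` would return -1 and the cut would land at position 0, so a script meant for arbitrary sequences would need to check for that case first.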


##Splicing out introns

First, store the DNA and extract the exons as separate variables:

In [15]:
my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"

Remember that we start counting from zero, and that positions are inclusive at the start and exclusive at the end:

In [16]:
exon1 = my_dna[0:63]
exon2 = my_dna[90:]

Now we can just join the two exons and print them out:

In [17]:
print(exon1 + exon2)

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAATCATCGATCGATATCGATGCATCGACTACTAT


To calculate the coding percentage, we just have to take the length of the exons, divide by the length of the sequence, and multiply by 100:

In [19]:
coding_length = len(exon1 + exon2)
total_length = len(my_dna)
print((coding_length / total_length) * 100)

78.0487804878


To print out the upper/lower case version, we take the middle section (the intron), convert it to lower case, then join the three pieces back together:

In [20]:
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]
print(exon1 + intron.lower() + exon2)

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAtcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT
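Because the end of one slice is the start of the next, it is easy to check that no base has been lost or counted twice. The following sketch pulls the whole exercise together under that check; the variable names `intron_start`, `intron_end`, `pieces_match` and `coding_percent` are assumptions, and `float()` stands in for the `__future__` import so that the result does not depend on the Python version.

```python
my_dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"

# Hypothetical names for the boundaries used in the slices above.
intron_start = 63
intron_end = 90

exon1 = my_dna[:intron_start]
intron = my_dna[intron_start:intron_end]
exon2 = my_dna[intron_end:]

# The three slices should account for every base exactly once.
pieces_match = (exon1 + intron + exon2 == my_dna)

# float() forces true division in both Python 2 and Python 3.
coding_percent = 100 * len(exon1 + exon2) / float(len(my_dna))

print("pieces reassemble: " + str(pieces_match))
print("coding percentage: " + str(round(coding_percent, 1)))
```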

