<h1 id="toctitle">Lists and loops exercise solutions</h1>
<ul id="toc"/>

## Processing DNA in a file

First, just read each line and print to the screen:

In [1]:
file = open("input.txt") 
for dna in file: 
    print(dna) 

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC

ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA

ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA



Now we can grab the part of the sequence from the 15th base to the end:

In [2]:
file = open("input.txt") 
for dna in file:
    trimmed_dna = dna[14:] 
    print(trimmed_dna) 

TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC

ACTATCGATGATCTAGCTACGATCGTAGCTGTA

ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA



Looks good, the printed sequences are definitely shorter!
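
Python counts positions from zero, so the 15th base sits at index 14. Here is a quick check on a short made-up string (the `example` sequence is hypothetical and does not come from `input.txt`):

```python
# hypothetical sequence, used only to illustrate the indexing:
# 14 A's followed by 6 C's
example = "A" * 14 + "C" * 6

# index 14 is the 15th character, so the slice starts at the first C
trimmed = example[14:]
print(trimmed)
```

Everything before index 14 (the first 14 characters) is dropped.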

Now switch to writing to a file:

In [4]:
file = open("input.txt") 
output = open("trimmed.txt", "w") 
for dna in file: 
    trimmed_dna = dna[14:] 
    output.write(trimmed_dna) 
output.close()

Notice where each step happens: we open the output file before the loop, write to it inside the loop, and close it after the loop.
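
As an alternative sketch (not the approach used in the solution above), Python's `with` statement closes files automatically when the block ends. This example writes its own small input file into a temporary folder so that it can run on its own; the file contents and paths are made up for illustration.

```python
import os
import tempfile

# create a throwaway folder and a small made-up input file
folder = tempfile.mkdtemp()
input_path = os.path.join(folder, "input.txt")
output_path = os.path.join(folder, "trimmed.txt")
with open(input_path, "w") as f:
    f.write("A" * 14 + "CCGG\n")
    f.write("A" * 14 + "TTAA\n")

# both files are closed automatically at the end of the with block
with open(input_path) as file, open(output_path, "w") as output:
    for dna in file:
        output.write(dna[14:])

# read the output file back to check what was written
with open(output_path) as f:
    result = f.read()
print(result)
```

With this style there is no `close()` call to forget.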

Add one more statement to print the length of the trimmed sequence to the screen:

In [5]:
# open the input file 
file = open("input.txt") 
 
# open the output file 
output = open("trimmed.txt", "w") 
 
# go through the input file one line at a time 
for dna in file: 

    # get the substring from the 15th character to the end 
    trimmed_dna = dna[14:]

    # write the trimmed sequence to the output file
    output.write(trimmed_dna)

    # print out the length to the screen
    print("processed sequence with length " + str(len(trimmed_dna))) 
output.close()

processed sequence with length 43
processed sequence with length 38
processed sequence with length 49
processed sequence with length 34
processed sequence with length 48
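
Each length above is one larger than the number of bases, because every line read from the file ends with a newline character and `len()` counts it. A minimal sketch of how one might strip it first, using the first trimmed sequence from the earlier output; whether you want this depends on whether the newline should end up in the output file.

```python
# the first trimmed sequence from above, with its trailing newline
trimmed_dna = "TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC\n"

# rstrip("\n") removes the newline at the end of the line
clean_dna = trimmed_dna.rstrip("\n")

print(len(trimmed_dna))
print(len(clean_dna))
```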


## Multiple exons from genomic DNA

There are two files involved here - the DNA and the exon locations. Start with the locations:

In [7]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    print(line) 

5,58

72,133

190,276

340,398



Use `split()` to turn each line into a list of two elements:

In [9]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    print(positions) 

['5', '58\n']
['72', '133\n']
['190', '276\n']
['340', '398\n']


To make it easier to work with, let's assign the start and stop to variables:

In [10]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = positions[0] 
    stop = positions[1] 
    print("start is " + start + ", stop is " + stop)

start is 5, stop is 58

start is 72, stop is 133

start is 190, stop is 276

start is 340, stop is 398



Looks good. Next we tackle the DNA part: open and read the sequence, then use the start/stop positions to extract the exon:

In [15]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = positions[0] 
    stop = positions[1] 
    exon = genomic_dna[start:stop] 
    print("exon is: " + exon) 

TypeError: slice indices must be integers or None or have an __index__ method

Problem: when we split a string, the resulting elements of the list are strings. Look at the output from this:

In [12]:
"123,456,789".split(',')

['123', '456', '789']

and notice how the numbers are surrounded by quotes. We need to turn them into integers with `int()`:

```python
    start = int(positions[0]) 
    stop = int(positions[1]) 
```

In [16]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = int(positions[0]) 
    stop = int(positions[1])
    exon = genomic_dna[start:stop] 
    print("exon is: " + exon) 

exon is: CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
exon is: CGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
exon is: CGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
exon is: CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG
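
Note that the stop value still carries a trailing newline (`'58\n'`), yet the conversion works: `int()` ignores leading and trailing whitespace. A small check, using the first line of `exons.txt` as shown earlier:

```python
line = "5,58\n"
positions = line.split(',')

# int() ignores surrounding whitespace, including the newline
start = int(positions[0])
stop = int(positions[1])
print(start, stop)
```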


OK. Next step - do something useful with the exons. We have to concatenate them all to make one long coding sequence. Because we are only dealing with a single exon at a time, we have to do it inside the loop. Here's the easiest way:

In [17]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 

# create a new variable to hold the coding sequence
# at first it is just an empty string
coding_sequence = "" 


for line in exon_locations: 
    positions = line.split(',') 
    start = int(positions[0]) 
    stop = int(positions[1]) 
    exon = genomic_dna[start:stop] 
    
    # take the original coding sequence,
    # add the new exon on to the end, 
    # then store the result back in the coding sequence variable
    coding_sequence = coding_sequence + exon 
    
    
    print("coding sequence is : " + coding_sequence) 

coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG


Notice how the coding sequence gets longer each time round the loop as more exons are added to it. 
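
Another common way to build up a sequence, sketched here as an alternative rather than as part of the original solution, is to collect the exons in a list and join them once at the end. The `genomic_dna` string and `exon_lines` list are made-up stand-ins for the contents of the two input files.

```python
# made-up stand-ins for the contents of the two input files
genomic_dna = "AAAACCCCGGGGTTTTAAAA"
exon_lines = ["0,4\n", "8,12\n"]

exons = []
for line in exon_lines:
    positions = line.split(',')
    start = int(positions[0])
    stop = int(positions[1])
    # store each exon in the list instead of concatenating straight away
    exons.append(genomic_dna[start:stop])

# join all the exons into one string, with nothing between them
coding_sequence = "".join(exons)
print(coding_sequence)
```

Keeping the list around also makes it easy to inspect individual exons later.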

After the loop has finished we can write the coding sequence to a file. Final version:

In [19]:
# open the genomic dna file and read the contents 
genomic_dna = open("genomic_dna2.txt").read() 
 
# open the exons locations file 
exon_locations = open("exons.txt") 
 
# create a variable to hold the coding sequence 
coding_sequence = "" 
 
# go through each line in the exon locations file 
for line in exon_locations: 

    # split the line using a comma 
    positions = line.split(',') 

    # get the start and stop positions 
    start = int(positions[0]) 
    stop = int(positions[1]) 

    # extract the exon from the genomic dna 
    exon = genomic_dna[start:stop] 

    # append the exon to the end of the current coding sequence 
    coding_sequence = coding_sequence + exon 

# write the coding sequence to an output file 
output = open("coding_sequence.txt", "w") 
output.write(coding_sequence) 
output.close() 

### Bonus exercise: sliding windows

We can start by defining some variables: a DNA sequence and a window size.

In [29]:
dna = "aacgtcgat"
window_size = 4

Let's get the first few windows manually:

In [30]:
window1 = dna[0:4]
window2 = dna[1:5]
window3 = dna[2:6]
print(window1)
print(window2)
print(window3)

aacg
acgt
cgtc


You can see the pattern here:
- the stop position of the window is always 4 more than the start (or whatever the window size is)
- the start position increases by one each time

So we can use `range()` to generate the list of start positions:

In [31]:
start_positions = list(range(len(dna)))
print(start_positions)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


And now write a loop to get the window for each start position:

In [32]:
for start in range(len(dna)):
    stop = start + window_size
    window = dna[start:stop]
    print(window)

aacg
acgt
cgtc
gtcg
tcga
cgat
gat
at
t


We can see what's going on even more clearly if we print some spaces at the start of the window to make it line up with the original sequence. Use `*` to repeat a string:

In [33]:
print(dna)
for start in range(len(dna)):
    stop = start + window_size
    window = dna[start:stop]
    print((' ' * start) + window)

aacgtcgat
aacg
 acgt
  cgtc
   gtcg
    tcga
     cgat
      gat
       at
        t


Notice that we have some incomplete windows at the end. If we want to avoid this, we need to stop the `range()` while there are still enough bases at the end:

In [34]:
print(dna)
for start in range(len(dna) - window_size + 1):
    stop = start + window_size
    window = dna[start:stop]
    print((' ' * start) + window)

aacgtcgat
aacg
 acgt
  cgtc
   gtcg
    tcga
     cgat


Now that we have a loop to generate all the windows, it's easy to calculate their AT content. Remember the first line: in Python 2 it makes `/` return a decimal result rather than rounding down to a whole number.

In [41]:
from __future__ import division
for start in range(len(dna) - window_size + 1):
    stop = start + window_size
    window = dna[start:stop]
    at = (window.count('a') + window.count('t')) / len(window)
    print(start, window, at)

(0, 'aacg', 0.5)
(1, 'acgt', 0.5)
(2, 'cgtc', 0.25)
(3, 'gtcg', 0.25)
(4, 'tcga', 0.5)
(5, 'cgat', 0.5)
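
If the calculation is needed in more than one place, it could be wrapped in a function. The name `at_content` is introduced here for illustration and is not part of the original solution. This sketch divides by `float(len(sequence))`, which gives a decimal result in both Python 2 and Python 3 without the `__future__` import.

```python
def at_content(sequence):
    # fraction of bases in a lowercase sequence that are a or t
    at_count = sequence.count('a') + sequence.count('t')
    return at_count / float(len(sequence))

dna = "aacgtcgat"
window_size = 4
for start in range(len(dna) - window_size + 1):
    window = dna[start:start + window_size]
    print(window, at_content(window))
```

The function assumes a non-empty, lowercase sequence; an empty string would cause a division by zero.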


If we want to include partial windows at the start and end, we just have to adjust the call to `range()` so that it starts with a negative number and ends at the length of the sequence. Negative start positions have to be bumped up to zero:

In [43]:
from __future__ import division
for start in range(1 - window_size, len(dna)):
    stop = start + window_size
    if start < 0:
        start = 0
    
    
    window = dna[start:stop]
    at = (window.count('a') + window.count('t')) / len(window)
    print(start,window,at)

(0, 'a', 1.0)
(0, 'aa', 1.0)
(0, 'aac', 0.6666666666666666)
(0, 'aacg', 0.5)
(1, 'acgt', 0.5)
(2, 'cgtc', 0.25)
(3, 'gtcg', 0.25)
(4, 'tcga', 0.5)
(5, 'cgat', 0.5)
(6, 'gat', 0.6666666666666666)
(7, 'at', 1.0)
(8, 't', 1.0)
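
The same clamping can be written with `max()`, sketched here as an alternative to the `if` statement. Slicing past the end of a string is safe in Python, so no matching check is needed for `stop`.

```python
dna = "aacgtcgat"
window_size = 4

windows = []
for start in range(1 - window_size, len(dna)):
    stop = start + window_size
    # negative start positions are bumped up to zero
    start = max(start, 0)
    windows.append(dna[start:stop])

print(windows)
```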

