<h1 id="toctitle">Exercise solutions</h1>
<ul id="toc"/>

## Accession names, again

The bulk of the work here will be coming up with patterns to describe the various criteria. Here's a skeleton program that will create a list to hold the accession numbers and loop over them:

In [3]:
import re
accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]
for acc in accs:
    print(acc)

xkn59438
yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp


The first criterion is easy - the pattern we are looking for is just the number `5`:

In [4]:
for acc in accs: 
    if re.search(r"5", acc): 
        print("\t" + acc)

	xkn59438
	hedle3455
	xjhd53e
	45da


Next, accessions that contain `d` or `e`. The easiest way to solve this is probably with alternation:

In [5]:
for acc in accs: 
    if re.search(r"(d|e)", acc): 
        print("\t" + acc) 

	yhdck2
	eihd39d9
	chdsye847
	hedle3455
	xjhd53e
	45da
	de37dp
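
Because both alternatives here are single characters, a character group is another way to write the same pattern. The sketch below checks that `[de]` selects the same accessions as `(d|e)`; the names `with_alternation` and `with_group` are made up for this illustration.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# alternation, as in the solution above
with_alternation = [acc for acc in accs if re.search(r"(d|e)", acc)]

# character group: one character that is either d or e
with_group = [acc for acc in accs if re.search(r"[de]", acc)]

# both approaches should pick out the same accessions
print(with_alternation == with_group)
```

Alternation is still the tool to reach for when the alternatives are longer than one character.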


For accessions that contain both `d` and `e` in that order we can't use an alternation because we need __both__ letters. We can express it like this: `d`, followed by any character repeated any number of times, followed by `e`:

In [6]:
for acc in accs: 
    if re.search(r"d.*e", acc): 
        print("\t" + acc) 

	chdsye847
	hedle3455
	xjhd53e
	de37dp
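
To see what the pattern actually matched, we can ask the match object for its text with `group()`. This is only a sketch for inspecting the matches; the `matched` dict is a name invented for the example. Note that `.*` is allowed to match zero characters, which is why `de37dp` qualifies.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# map each matching accession to the part of it that the pattern matched
matched = {}
for acc in accs:
    match = re.search(r"d.*e", acc)
    if match:
        matched[acc] = match.group()
        print(acc + " matched on " + match.group())
```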


We can use a very similar pattern for the next problem: `d` and `e` with exactly one character between them. Without the `*`, the `.` matches a single character:

In [7]:
for acc in accs: 
    if re.search(r"(d.e)", acc): 
        print("\t" + acc) 

	hedle3455
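
One thing to keep in mind is that `.` matches any character, not just letters. None of the accessions in this exercise has a digit between `d` and `e`, so the result is the same here, but a stricter version could use the character group `[a-z]`. The string `d3e` below is a made-up test case, not one of the exercise accessions.

```python
import re

# the dot is happy with a digit between d and e
dot_matches_digit = re.search(r"d.e", "d3e") is not None

# a lowercase-letter group is not
group_matches_digit = re.search(r"d[a-z]e", "d3e") is not None

# the stricter pattern still finds the real accession
group_matches_letter = re.search(r"d[a-z]e", "hedle3455") is not None

print(dot_matches_digit, group_matches_digit, group_matches_letter)
```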


The next one, accessions that contain both `d` and `e` in any order, is surprisingly tricky. If we re-frame it as `d` followed by anything followed by `e` __or__ `e` followed by anything followed by `d`, it becomes a bit clearer:

In [8]:
for acc in accs: 
    if re.search(r"d.*e", acc) or re.search(r"e.*d", acc): 
        print("\t" + acc) 

	eihd39d9
	chdsye847
	hedle3455
	xjhd53e
	de37dp
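
The two searches can also be folded into a single pattern by putting the alternation inside the regular expression. This is just an alternative sketch, not a better answer; the list names are invented for the comparison.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# two separate searches joined with Python's "or"
two_searches = [acc for acc in accs if re.search(r"d.*e", acc) or re.search(r"e.*d", acc)]

# one search with the "or" written as an alternation in the pattern
one_pattern = [acc for acc in accs if re.search(r"d.*e|e.*d", acc)]

print(one_pattern)
```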


To find accessions that start with either `x` or `y`, we need to combine an alternation with a start-of-string anchor:

In [9]:
for acc in accs: 
    if re.search(r"^(x|y)", acc): 
        print("\t" + acc) 

	xkn59438
	yhdck2
	xjhd53e
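
For a simple prefix test like this one we don't strictly need a regular expression: the string method `startswith()` accepts a tuple of prefixes. The sketch below compares the two approaches; the list names are invented for the example.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# anchored alternation, as in the solution above
with_regex = [acc for acc in accs if re.search(r"^(x|y)", acc)]

# plain string method: True if the string starts with any prefix in the tuple
with_startswith = [acc for acc in accs if acc.startswith(("x", "y"))]

print(with_regex == with_startswith)
```

The regular expression version is easier to extend, which is what the next criterion needs.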


We can modify this quite easily to add the requirement that the accession ends with `e`. Watch out for the bit in the middle - it has to match anything, any number of times:

In [10]:
for acc in accs: 
    if re.search(r"^(x|y).*e$", acc): 
        print("\t" + acc) 

	xjhd53e
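
When a pattern has to describe the whole string, `re.fullmatch()` lets us drop both anchors. The following is a sketch of that alternative, assuming Python 3.4 or later, where `re.fullmatch()` is available.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# fullmatch only succeeds if the pattern covers the entire string,
# so no ^ or $ is needed
whole_string = [acc for acc in accs if re.fullmatch(r"(x|y).*e", acc)]

print(whole_string)
```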


To match three or more digits in a row, we need a more specific quantifier (the curly brackets) and a character group that contains all ten digits:

In [12]:
for acc in accs: 
    if re.search(r"[0123456789]{3,}", acc): 
        print("\t" + acc) 

	xkn59438
	chdsye847
	hedle3455


Alternatively, we can use the shortcut `\d`, which means any digit:

In [13]:
for acc in accs: 
    if re.search(r"\d{3,}", acc): 
        print("\t" + acc) 

	xkn59438
	chdsye847
	hedle3455
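
Because `{3,}` means "three or more", the match grabs the whole run of digits, not just the first three. A quick way to check this is to look at the matched text; the `runs` dict is a name invented for this sketch.

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# map each matching accession to the run of digits that was matched
runs = {}
for acc in accs:
    match = re.search(r"\d{3,}", acc)
    if match:
        runs[acc] = match.group()

print(runs)
```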


The last one, accessions that end with `d` followed by `a`, `r` or `p`, uses a character group and an end-of-string anchor:

In [14]:
for acc in accs: 
    if re.search(r"d[arp]$", acc): 
        print("\t" + acc) 

	45da
	de37dp


## Double digest

Let's write a pattern for the enzyme AbcI. `N` means any base, so the pattern is

`A[ATGC]TAAT`

We can use `re.finditer()` to find the start of all the cut sites:

In [15]:
dna = open("long_dna.txt").read().rstrip("\n") 

print("AbcI cuts at:") 
for match in re.finditer(r"A[ATGC]TAAT", dna): 
    print(match.start()) 

AbcI cuts at:
1140
1625


Be careful though: the cut position is actually three bases downstream of the start of the match, so we have to add 3 to `match.start()`:

In [16]:
dna = open("long_dna.txt").read().rstrip("\n") 

print("AbcI cuts at:") 
for match in re.finditer(r"A[ATGC]TAAT", dna): 
    print(match.start() + 3) 

AbcI cuts at:
1143
1628
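
The "match start plus offset" step can be wrapped in a small helper so that the offset is stated once. This is a sketch only: the function name `find_cut_sites` and the short `toy_dna` string are invented for illustration and are not the contents of `long_dna.txt`.

```python
import re

def find_cut_sites(pattern, offset, sequence):
    """Return the cut positions for one enzyme: each match start plus the offset."""
    return [match.start() + offset for match in re.finditer(pattern, sequence)]

# made-up sequence with two AbcI sites, starting at positions 2 and 12
toy_dna = "GGATTAATCCGGACTAATGG"

abc1_cuts = find_cut_sites(r"A[ATGC]TAAT", 3, toy_dna)
print(abc1_cuts)
```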


Once we've got the cut positions, how do we calculate the fragment sizes? We measure the distance from the current cut site to the previous one (or to the start of the sequence):

In [17]:
dna = open("long_dna.txt").read().rstrip("\n") 

last_cut = 0
for match in re.finditer(r"A[ATGC]TAAT", dna): 
    cut_position = match.start() + 3
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position

fragment size is 1143
fragment size is 485


Notice how the current cut position becomes the last cut position for the next iteration. We also have to remember the last fragment, from the last cut to the end:

In [18]:
dna = open("long_dna.txt").read().rstrip("\n") 

last_cut = 0
for match in re.finditer(r"A[ATGC]TAAT", dna): 
    cut_position = match.start() + 3
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position
    
# now the last fragment
fragment_size = len(dna) - last_cut
print("fragment size is " + str(fragment_size))

fragment size is 1143
fragment size is 485
fragment size is 384
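
The same bookkeeping can be written without a `last_cut` variable by listing all the fragment boundaries (start of sequence, each cut, end of sequence) and pairing each one with its neighbour. This is a sketch of an alternative, with a hypothetical function name `fragment_sizes`; the sequence length of 2012 is inferred from the output above (last cut at 1628 plus a final fragment of 384).

```python
def fragment_sizes(cut_positions, sequence_length):
    """Return the fragment sizes produced by cutting at the given positions."""
    # boundaries run from the start of the sequence, through each cut, to the end
    boundaries = [0] + sorted(cut_positions) + [sequence_length]
    # pair each boundary with the next one and take the difference
    return [end - start for start, end in zip(boundaries, boundaries[1:])]

sizes = fragment_sizes([1143, 1628], 2012)
print(sizes)
```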


Doing the same for two enzymes is trickier. We need to change our strategy. First, make a big list of all the cut positions for both enzymes:

In [19]:
all_cuts = []
# add cut positions for AbcI 
for match in re.finditer(r"A[ATGC]TAAT", dna): 
    all_cuts.append(match.start() + 3) 
 
# add cut positions for AbcII 
for match in re.finditer(r"GC[AG][AT]TG", dna): 
    all_cuts.append(match.start() + 4) 

print(all_cuts)

[1143, 1628, 488, 1577]


These aren't in the right order, so we have to sort them:

In [20]:
all_cuts.sort()
all_cuts

[488, 1143, 1577, 1628]

Now we can go through the list of all cuts with the same logic:

In [22]:
last_cut = 0
for cut_position in all_cuts:
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position
    
# now the last fragment
fragment_size = len(dna) - last_cut
print("fragment size is " + str(fragment_size))

fragment size is 488
fragment size is 655
fragment size is 434
fragment size is 51
fragment size is 384
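
Putting the pieces together, the whole double digest can be sketched as one function that takes a list of (pattern, offset) pairs. The function name `double_digest` and the short `toy_dna` sequence are invented for this example, so it can be run without `long_dna.txt`; a useful sanity check is that the fragment sizes add up to the length of the sequence.

```python
import re

def double_digest(sequence, enzymes):
    """Return fragment sizes after cutting with every (pattern, offset) pair."""
    # collect the cut positions for all enzymes, then put them in order
    cuts = []
    for pattern, offset in enzymes:
        for match in re.finditer(pattern, sequence):
            cuts.append(match.start() + offset)
    cuts.sort()

    # same logic as above: distance from each cut to the previous one
    sizes = []
    last_cut = 0
    for cut_position in cuts:
        sizes.append(cut_position - last_cut)
        last_cut = cut_position
    # and the last fragment, from the last cut to the end
    sizes.append(len(sequence) - last_cut)
    return sizes

# made-up sequence: AbcI site, AbcII site, spacer, AbcI site
toy_dna = "GGATTAATCC" + "GCATTG" + "CCCC" + "ACTAATGG"
enzymes = [(r"A[ATGC]TAAT", 3), (r"GC[AG][AT]TG", 4)]

toy_sizes = double_digest(toy_dna, enzymes)
print(toy_sizes)
print(sum(toy_sizes) == len(toy_dna))
```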

