# Practical 14

In this practical we will start working with sorting algorithms. In particular we will work with **selection sort** and **insertion sort**.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical14.pdf)

## Sorting algorithms

The basic principle of sorting algorithms is quite simple: we have an input sequence of unsorted elements $U=u_{1},u_{2}, ..., u_{n}$ and we output a new sequence $S = s_{1}, s_{2}, ... , s_{n}$ which is a permutation of $U$ such that $s_{1}\leq s_{2} \leq ... \leq s_{n}$.

There are several sorting algorithms; the first one we will work with is **selection sort**.

### Selection sort

Selection sort is one of the simplest sorting algorithms. Given $U=u_{1},u_{2},...,u_{n}$, the idea of **selection sort** is to loop through all the elements of $U$, find the minimum $u_m$ and place it at the beginning of the sequence by swapping $u_{1}$ with $u_m$. The next iteration looks for the minimum starting from $u_2$, and so on.

If $U$ has $n$ elements, for each position $i=0,..,n-1$ in the list (using Python's 0-based indexing) we need to perform the following two steps:

1. (argmin) Find the index of the minimum element in the sublist $U[i:]$, let's call it $m$ (i.e. $u_{m} = min(U[i:])$);

2. (swap) Swap $u_m$ with $u_i$.

A reminder on how selection sort works is reported in the following picture (taken from the lecture). Yellow cells are the minimum found at each iteration, while orange cells are those already sorted.

![](img/pract14/selection_sort.png)
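
The two steps (argmin and swap) can be sketched as a plain function before wrapping them in a class. This is only a minimal illustration of the idea; the function name `selection_sort` is an assumption made here and is not the class-based implementation requested in the exercises.

```python
def selection_sort(U):
    """Sort the list U in place with selection sort and return it."""
    n = len(U)
    for i in range(n - 1):
        # (argmin) find the index m of the minimum element in U[i:]
        m = i
        for j in range(i + 1, n):
            if U[j] < U[m]:
                m = j
        # (swap) move the minimum found to position i
        U[i], U[m] = U[m], U[i]
    return U


print(selection_sort([7, 5, 10, -11, 3, -4, 99, 1]))
# prints [-11, -4, 1, 3, 5, 7, 10, 99]
```

The last position needs no iteration of its own: once the first $n-1$ elements are in place, the remaining one is necessarily the maximum.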

### Insertion sort


## A testing class

In [None]:
%reset -f 

class MySorting:
    

The line ```if __name__ == "__main__":``` checks whether the code is being executed as a script (i.e. it is not being imported as a module by another piece of code).

Executing the code as it is will not give any particular error as the tests we set-up passed correctly. But if we uncomment the ```res.append(1)``` the tests will fail and we produce the following testing output, which reports the three failed tests, the expected values and the obtained values: 

![](img/pract13/doctest.png)

Another way of testing the code is through unit tests; we will see this later on. 

### Raising exceptions and using assertions

One thing we can do is to raise exceptions whenever some pre-conditions are not met, in order to ensure that these do not lead to erroneous behaviours. This can be done with ```raise Exception("exception text")```. More info on raising exceptions is available [here](https://docs.python.org/3/tutorial/errors.html#raising-exceptions). 

**Example**: 
Consider the following ```MyIntPair``` class that works with integers. If we want to make sure it only contains integers we can add a ```raise Exception``` in case it is not an integer. 
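
A minimal sketch of such a class is given below. The attribute names `x` and `y` and the `sum` method are assumptions made for illustration only.

```python
class MyIntPair:
    """A pair of integers (illustrative sketch)."""

    def __init__(self, x, y):
        # pre-condition: both values must be integers
        if not isinstance(x, int) or not isinstance(y, int):
            raise Exception("x and y must be integers, got {} and {}".format(x, y))
        self.x = x
        self.y = y

    def sum(self):
        """Return the sum of the two integers."""
        return self.x + self.y


pair = MyIntPair(3, 4)
print(pair.sum())

# a non-integer value violates the pre-condition and raises the exception
try:
    MyIntPair(3, "four")
except Exception as e:
    print("Error:", e)
```

Raising the exception in the constructor means an invalid pair can never be created, so the other methods do not need to repeat the check.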

## Exercises


1. Implement a class SelectionSort that has the following attributes: ```data``` (the actual data to sort), ```swaps``` (initialized to 0) that counts how many swaps have been done to perform the sorting, ```comparisons``` (initialized to 0) that counts how many comparisons have been done and ```verbose```, a boolean (default = True) that is used to decide whether the method should report what is happening at each step, together with some stats. The class has one method called ```sort``` that implements the selection sort algorithm (two more methods might be needed to compute ```swap``` and ```argmin``` -- see description above). 

Once you have implemented the class you can test it with some data like:
```
[7, 5, 10, -11 ,3, -4, 99, 1]
```
or you can create a random list of N integers with:
```
import random
N = 10000
d = []
for i in range(0, N):
    d.append(random.randint(0, 1000))
```
Test the class with N = 10000.

Add a private ```__time``` variable that stores the time spent doing the sorting. This can be done by:
```
import time
...
start_t = time.time()
...
end_t = time.time()
self.__time = end_t - start_t
```
How long does it take with a list of 10000 elements? With 20000?
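
The timing pattern above can also be wrapped in a small helper, so that the same measurement is repeated for different input sizes. This is a sketch under stated assumptions: the helper name `time_call` is hypothetical, the built-in `sorted` is only a stand-in for your own sorting method, and `time.perf_counter()` is used instead of `time.time()` because it is meant for measuring short intervals.

```python
import random
import time


def time_call(func, data):
    """Return the number of seconds that func(data) takes to run."""
    start_t = time.perf_counter()
    func(data)
    end_t = time.perf_counter()
    return end_t - start_t


for N in (10000, 20000):
    # random list of N integers, as in the exercise text
    d = [random.randint(0, 1000) for _ in range(N)]
    # sorted is a placeholder: pass your own sorting function instead
    print("N = {}: {:.6f}s".format(N, time_call(sorted, d)))
```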

<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [None]:
%reset -f

import random
import time

class SelectionSort:
    def __init__(self,data, verbose = True):
        self.__data = data
        self.__comparisons = 0
        self.__swaps = 0
        self.__verbose = verbose
        self.__time = 0
        
    def getData(self):
        return self.__data
    
    def getTime(self):
        return self.__time
    
    def getComparisons(self):
        return self.__comparisons
    
    def getSwaps(self):
        return self.__swaps
    
    def swap(self, i, j):
        """
        swaps elements i and j in data.
        """
        if(i != j): #no point in swapping if i==j
            tmp = self.__data[i]
            self.__data[i] = self.__data[j]
            self.__data[j] = tmp
        
    def argmin(self, i):
        """
        returns the index of the smallest element of
        self.__data[i:]
        """
        mpos = i
        N = len(self.__data)
        minV = self.__data[mpos]
        for j in range(i + 1,N): # from i+1 to N. U[i+1:]
            if(self.__data[j] < minV):
                mpos = j
                minV = self.__data[j]
            #just for checking
            self.__comparisons += 1
        
        return mpos
    
    def sort(self):
        self.__comparisons = 0
        self.__swaps = 0
        if self.__verbose:
            print("Initial list:")
            print(self.__data)
            print("\n")
            
        #to check performance    
        start_t = time.time()
        for i in range(len(self.__data) - 1):
                j = self.argmin(i)
                self.swap(i,j) 
                self.__swaps += 1
                if self.__verbose:
                    print("It. {}. data[{}]<->data[{}] {}<->{}".format(i,
                                                                       i,
                                                                       j,
                                                                       self.__data[i],
                                                                       self.__data[j]))
                    print(self.__data)    
        end_t = time.time()
        
        self.__time = end_t - start_t
        
        if self.__verbose:
            print(self.__data)
            print("\nNumber of comparisons: {}".format(self.__comparisons))
            print("Number of swaps: {}".format(self.__swaps))
            print("In {:.4f}s".format(self.__time))



if __name__ == "__main__":
    d = [7, 5, 10, -11 ,3, -4, 99, 1]
    selSorter = SelectionSort(d, verbose = True)
    selSorter.sort()
    d = []
    for i in range(0,10000):
        d.append(random.randint(0,1000))
    selSorter = SelectionSort(d, verbose = False)
    selSorter.sort()
    print("\nNumber of elements: {}".format(len(d)))
    print("Number of comparisons: {}".format(selSorter.getComparisons()))
    print("Number of swaps: {}".format(selSorter.getSwaps()))
    print("In {:.4f}s".format(selSorter.getTime()))
    test = True
    for el in range(0,len(d)-1): #start from 0 to check the first pair too
        test = test and (d[el]<= d[el+1])
    print("Sorting test passed? {}".format(test))
    
    d = []
    for i in range(0,20000):
        d.append(random.randint(0,1000))
    selSorter = SelectionSort(d, verbose = False)
    selSorter.sort()
    print("\nNumber of elements: {}".format(len(d)))
    print("Number of comparisons: {}".format(selSorter.getComparisons()))
    print("Number of swaps: {}".format(selSorter.getSwaps()))
    print("In {:.4f}s".format(selSorter.getTime()))
    test = True
    for el in range(0,len(d)-1): #start from 0 to check the first pair too
        test = test and (d[el]<= d[el+1])
    print("Sorting test passed? {}".format(test))
    

</div>

2. Define a unittest class (see previous practical) to test the SelectionSort class. Some examples of things to check are:

a. swap(i,j);swap(i,j) == original data;
b. length of data does not change after swap;
c. swap does not change elements other than i and j;
d. if j=argmin(i), data[j] is less than or equal to all elements in data[i:];
e. after sort $\forall$ i=0,..,n-2 : data[i]<=data[i+1];
f. Any other check?

Perform unit testing on the SelectionSort class (copy the code from your previous exercise).

<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [None]:
import random
import time
import unittest

class SelectionSort:
    def __init__(self,data, verbose = True):
        self.__data = data
        self.__comparisons = 0
        self.__swaps = 0
        self.__verbose = verbose
        self.__time = 0
        
    def getData(self):
        return self.__data
    
    def getTime(self):
        return self.__time
    
    def getComparisons(self):
        return self.__comparisons
    
    def getSwaps(self):
        return self.__swaps
    
    def swap(self, i, j):
        """
        swaps elements i and j in data.
        """
        if(i != j): #no point in swapping if i==j
            tmp = self.__data[i]
            self.__data[i] = self.__data[j]
            self.__data[j] = tmp
        
    def argmin(self, i):
        """
        returns the index of the smallest element of
        self.__data[i:]
        """
        mpos = i
        N = len(self.__data)
        minV = self.__data[mpos]
        for j in range(i + 1,N): # from i+1 to N. U[i+1:]
            if(self.__data[j] < minV):
                mpos = j
                minV = self.__data[j]
            #just for checking
            self.__comparisons += 1
        
        return mpos
    
    def sort(self):
        self.__comparisons = 0
        self.__swaps = 0
        if self.__verbose:
            print("Initial list:")
            print(self.__data)
            print("\n")
            
        #to check performance    
        start_t = time.time()
        for i in range(len(self.__data) - 1):
                j = self.argmin(i)
                self.swap(i,j) 
                self.__swaps += 1
                if self.__verbose:
                    print("It. {}. data[{}]<->data[{}] {}<->{}".format(i,
                                                                       i,
                                                                       j,
                                                                       self.__data[i],
                                                                       self.__data[j]))
                    print(self.__data)    
        end_t = time.time()
        
        self.__time = end_t - start_t
        
        if self.__verbose:
            print(self.__data)
            print("\nNumber of comparisons: {}".format(self.__comparisons))
            print("Number of swaps: {}".format(self.__swaps))
            print("In {:.4f}s".format(self.__time))



class Test(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(Test, self).__init__(*args, **kwargs)
        #create a test list:
        x = []
        for i in range(300):
            x.append(random.randint(-100,100))
        self.sorter = SelectionSort(x, verbose = False)

    def test_swap1(self):
        """swap of swap is identical to beginning"""
        #let's copy data
        dcopy = self.sorter.getData()[:]
        for i in range(40):
            i1 = random.randint(0,len(dcopy) - 1)
            i2 = random.randint(0,len(dcopy) - 1)
            self.sorter.swap(i1,i2)
            self.sorter.swap(i1,i2)
            self.assertEqual(self.sorter.getData(), dcopy)

    def test_swap2(self):
        """length of swapped data does not change"""
        l = len(self.sorter.getData())
        for i in range(40):
            i1 = random.randint(0,l - 1)
            i2 = random.randint(0,l - 1)
            self.sorter.swap(i1,i2)
            self.assertEqual(len(self.sorter.getData()), l)

    def test_swap3(self):
        """swapping only changes i and j indexes (if i!=j)"""
        for i in range(40):
            #let's copy data before each swap
            dcopy = self.sorter.getData()[:]
            i1 = random.randint(0,len(dcopy) - 1)
            i2 = random.randint(0,len(dcopy) - 1)
            if i1 != i2:
                self.sorter.swap(i1,i2)
                for ind in range(0,len(dcopy)):
                    if ind != i1 and ind != i2:
                        self.assertEqual(dcopy[ind], self.sorter.getData()[ind])

    def test_argmin(self):
        """
        tests if j=argmin(i) then data[j] <= every element of data[i:]
        """
        l = len(self.sorter.getData())
        for i in range(40):
            ind = random.randint(0,l - 1)
            minP = self.sorter.argmin(ind)
            for j in self.sorter.getData()[ind:]:
                self.assertTrue(self.sorter.getData()[minP] <= j)

    def test_sort(self):
        """tests if the sort works"""
        self.sorter.sort()
        d = self.sorter.getData()
        #len(d) - 1 so that the last pair is checked as well
        for el in range(0,len(d) - 1):
            self.assertTrue(d[el] <= d[el+1])

    def test_empty(self):
        """sorting of empty list is empty"""
        sorter = SelectionSort([], verbose = False)
        sorter.sort()
        self.assertEqual(sorter.getData(),[])


#run the tests only after the Test class has been defined
if __name__ == "__main__":
    unittest.main()


<div class="alert alert-info">

**Note:** 
Note the lines that I used to initialize the Test class:
```
def __init__(self, *args, **kwargs):
    super(Test, self).__init__(*args, **kwargs)
```
They allow us to define the random test data within the Test class: since we override the constructor, we need to call the superclass constructor, passing on all the parameters it needs. 

</div>
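
As an alternative to overriding `__init__`, `unittest.TestCase` provides the `setUp` method, which is called before every test method. The sketch below is only an illustration: the class name `TestWithSetUp` and its two tests are hypothetical, and the built-in `sorted` is used as a stand-in so that the example is self-contained.

```python
import random
import unittest


class TestWithSetUp(unittest.TestCase):
    def setUp(self):
        # setUp runs before every test method, so each test gets fresh data
        self.data = [random.randint(-100, 100) for _ in range(300)]

    def test_length(self):
        """the test data has the expected length"""
        self.assertEqual(len(self.data), 300)

    def test_sorted_is_ordered(self):
        """each element is lower than or equal to the next one"""
        s = sorted(self.data)
        for i in range(len(s) - 1):
            self.assertLessEqual(s[i], s[i + 1])


# run the tests explicitly (works both in a script and in a notebook cell)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWithSetUp)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

With `setUp` a test that modifies the data (like a sort) cannot influence the tests that run after it, because the list is rebuilt each time.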


You can find a solution to run unittests here: [selection_sort_test.py](file_samples/selection_sort_test.py)
Reminder: you can run the unittest with:

```
python3 -m unittest selection_sort_test.py
```

</div>

2. CRISPR-Cas9 is quite a neat system to perform genome editing. Guide RNAs (gRNAs) can transport Cas9 to anywhere in the genome for gene editing, but no editing can occur at any site other than one at which Cas9 recognizes the protospacer adjacent motif (PAM). The PAM site is a 2-6 base pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system. Some commonly used PAMs are the following:
```
NGG (where N is any base)
NGA
YG (where Y is a Pyrimidine, i.e. C or T)
TTTN
YTN
```
Write a function that loads the fasta sequences in [contig82.fasta](file_samples/contig82.fasta) and for each sequence reports the number of sites of each of the above PAMs and their frequency (i.e. the number of sites over the length of the sequence). Hint: load the sequences with Biopython's SeqIO.



<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [None]:
%reset -f

from Bio import SeqIO
import re

def countPAMs(filename):
    for seq_record in SeqIO.parse(filename, "fasta"):
        s = seq_record.seq
        l = len(s)
        ident = seq_record.id
        m = re.findall("[ATCG]GG", str(s))
        NGG = len(m)
        m = re.findall("[ATCG]GA", str(s))
        NGA = len(m)
        m = re.findall("[CT]G", str(s)) #inside [] no "|" is needed
        YG = len(m)
        m = re.findall("TTT[ATCG]", str(s))
        TTTN = len(m)
        m = re.findall("[CT]T[ATCG]", str(s)) #inside [] no "|" is needed
        YTN = len(m)
        print("{} (len:{}):\n\tNGG:{} (1 on {} bases)".format(ident, l,NGG, l/NGG))
        print("\tNGA:{} (1 on {} bases)".format(NGA, l/NGA))
        print("\tYG:{} (1 on {} bases)".format(YG, l/YG))
        print("\tTTTN:{} (1 on {} bases)".format( TTTN, l/TTTN))
        print("\tYTN:{} (1 on {} bases)".format( YTN, l/YTN))
        
        
fn = "file_samples/contigs82.fasta"
countPAMs(fn)
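
Note that `re.findall` reports non-overlapping matches only, while PAM sites can overlap (for example `GGG` contains the end of one `NGG` site and the start of the next). If overlapping sites should be counted as well, one possible approach is a zero-width lookahead. The following is a minimal sketch; the helper name `count_overlapping` is an assumption made for illustration.

```python
import re


def count_overlapping(pattern, seq):
    """Count all (possibly overlapping) occurrences of pattern in seq."""
    # a zero-width lookahead matches at every start position without
    # consuming characters, so overlapping sites are all reported
    return len(re.findall("(?=" + pattern + ")", seq))


seq = "AGGGG"
print(len(re.findall("[ATCG]GG", seq)))    # non-overlapping: prints 1
print(count_overlapping("[ATCG]GG", seq))  # overlapping: prints 3
```

Which of the two counts is the appropriate one depends on how a "site" is defined in the analysis.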


</div>

3. Write a python function ```sortCSV(mystr)``` that gets a comma separated string and returns a comma separated string with all elements sorted in alphabetically decreasing order. Define some unittests to check if the function has been implemented correctly (some things to check: length of initial string is the same as that of the final string, number of elements is the same, each element in the output string must come after the next one in lexicographical order,...). 


<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [None]:
%reset -f

import random
import unittest

def sortCSV(mystr):
    tmp = mystr.split(",")
    tmp.sort(reverse=True)
    return ",".join(tmp)



class Testing(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(Testing, self).__init__(*args, **kwargs)
        #create a random string
        self.alphabet = "abcdefghkjilmnopqrstuvwyz"
        self.data = ""
        #create 15 random strings
        for i in range(15):
            word = ""
            #each of them has a random length up to 20
            j = random.randint(1,20)
            for ind in range(j):
                #pick up to 20 random letters
                t = random.randint(1,len(self.alphabet)-1)
                word += self.alphabet[t]
            if(len(self.data) == 0):
                self.data = word
            else:
                self.data += "," + word


    def test_reslen(self):
        self.assertTrue(len(self.data) == len(sortCSV(self.data)))

    def test_elcount(self):
        res = sortCSV(self.data).split(",")
        self.assertTrue(len(self.data.split(",")) == len(res))

    def test_elsorting(self):
        res = sortCSV(self.data).split(",")
        for ind in range(len(res)-1):
            self.assertTrue(res[ind] >= res[ind+1]) #>= because equal words are allowed

    def test_empty(self):
        self.assertEqual(sortCSV(""),"")
    
    def test_onlyOne(self):
        j = random.randint(1,20)
        word = ""
        for ind in range(j):
            #pick up to 20 random letters
            t = random.randint(0,len(self.alphabet)-1)
            word += self.alphabet[t]
        self.assertEqual(sortCSV(word), word)

if __name__ == "__main__":
    mystr = "book,tree,final,example,testing,zed,all,hair,lady,figure,tap,spring,test,fin,tail"
    print("Original:\n{}".format(mystr))
    print("Sorted:\n{}".format(sortCSV(mystr)))
    unittest.main()


You can find a solution to run unittests here: [pract13_ex3.py](file_samples/pract13_ex3.py)


</div>

4. Solve a modified version of the following exercise of Practical 11 implementing a function that parses the "ExpXml" text through several regular expressions.

The exercise is reported below:

Write a python script that retrieves all the information present in SRA regarding PacBio sequencing performed on E.coli strain K12 (query term is “E.coli K12 wgs PacBio”). 

Print the number of results and for each id report the title, the experiment accession, the instrument, the library strategy, the library source, the total number of spots and total number of bases sequenced.
Sample output:

```
Entries found: 9

[1] Results for id 357838:
Title: E. coli K12 PacBio RS C2 CCS sequencing
Experiment accession: SRX255779
Instrument: PACBIO_SMRT
Library strategy: WGS
Library source: GENOMIC
Total spots:1798302
Total bases:4228754616

 ...
 ...
```

A sample "ExpXml" string:

```
<Summary><Title>E. coli K12 PacBio RS C2 CCS sequencing</Title><Platform instrument_model="PacBio RS">PACBIO_SMRT</Platform><Statistics total_runs="22" total_spots="1798302" total_bases="4228754616" total_size="16799546700" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA071585" center_name="NBACC" contact_name="Sergey Koren" lab_name=""/><Experiment acc="SRX255779" ver="1" status="public" name="E. coli K12 PacBio RS C2 CCS sequencing"/><Study acc="SRP020003" name="Escherichia coli K12 Re-sequencing"/><Organism taxid="511145" ScientificName="Escherichia coli str. K-12 substr. MG1655"/><Sample acc="SRS000462" name=""/><Instrument PACBIO_SMRT="PacBio RS"/><Library_descriptor><LIBRARY_NAME>PacBio RS CCS</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA194437</Bioproject><Biosample>SAMN00000224</Biosample>
```



<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [None]:
%reset -f 

import re
from Bio import Entrez



def parseExp(expStr):
    #raw strings (r"...") avoid invalid escape sequences such as "\." or "\("
    m = re.search(r"<Title>([A-Za-z0-9_\. \(\)]*)</Title>", expStr)
    if(m):
        #group 1 is the text captured between the two tags
        title = "Title: " + m.group(1)
        print(title)
        
    m = re.search("<Experiment acc=\"([A-Z0-9]*)\"", expStr)
    if(m):
        acc = "Experiment accession: " + m.groups()[0]
        print(acc)
    m = re.search(r'<Platform ([A-Za-z0-9_=" \(\)]*)>([A-Za-z0-9_\(\)]*)</Platform>', expStr)
    if(m):
        platform = "Instrument: " + m.groups()[1]
        print(platform)
    m = re.search(r'<LIBRARY_STRATEGY>([A-Za-z0-9_=" \(\)]*)</LIBRARY_STRATEGY>', expStr)
    if(m):
        src = "Library strategy: "  + m.groups()[0]
        print(src)
    m = re.search(r'<LIBRARY_SOURCE>([A-Za-z0-9_=" \(\)]*)</LIBRARY_SOURCE>', expStr)
    if(m):
        src = "Library source: "  + m.groups()[0]    
        print(src)
    m = re.search("total_spots=\"([0-9]*)\" total_bases=\"([0-9]*)\"",expStr)
    if(m):
        spots = "Total spots:" + m.groups()[0] +"\nTotal bases:" + m.groups()[1]
        print(spots)

    
Entrez.email = "my_email"
handle = Entrez.esearch(db="sra", term="E.coli K12 wgs PacBio", retmax = 10)
res = Entrez.read(handle)
#uncomment to see all fields:
#for el in res.keys():
#    print(el , " : ", res[el])

print("Entries found: {}".format(res["Count"]))

cnt = 1
for ids in res["IdList"]:
    print("\n[{}] Results for id {}:".format(cnt, ids))
    handle = Entrez.esummary(db="sra",  id = ids)
    res = Entrez.read(handle)
    cnt += 1
          
    for r in res:
        info = r['ExpXml']
        #print(info)
        parseExp(info)
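
The regular expressions can also be tried offline, without querying Entrez, on a shortened copy of the sample "ExpXml" string given in the exercise. The sketch below is only an illustration: it uses a non-greedy `.*?` instead of explicit character classes, which is a different (and more permissive) choice than the one made in `parseExp`.

```python
import re

# shortened version of the sample "ExpXml" string shown in the exercise
exp = ('<Summary><Title>E. coli K12 PacBio RS C2 CCS sequencing</Title>'
       '<Platform instrument_model="PacBio RS">PACBIO_SMRT</Platform>'
       '<Statistics total_runs="22" total_spots="1798302" '
       'total_bases="4228754616"/></Summary>')

# .*? is non-greedy: it stops at the first closing tag
m = re.search(r'<Title>(.*?)</Title>', exp)
title = m.group(1) if m else None

# two capturing groups, one for each number
m = re.search(r'total_spots="(\d+)" total_bases="(\d+)"', exp)
spots, bases = (int(m.group(1)), int(m.group(2))) if m else (None, None)

print("Title: {}".format(title))
print("Total spots:{}".format(spots))
print("Total bases:{}".format(bases))
```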
        



</div>

5. The file [DNA_seq.fasta](file_samples/DNA_seq.fasta) contains a synthetic DNA sequence. Let's assume we have two restriction enzymes, LagI and JagII, that cut at the sites CNC/ATT and GAGRK/TNG respectively (where N is any base, R is A or G and K is A or C or T). Note that "/" is just a representation of where the enzyme cuts, therefore we do not need to specify this in the regular expression, but we need to take it into account when we cut the DNA.

Ex. if the sequence is:
ATACATTCCCCCGGAATCGCCCCCCCTCCATTCC
digesting the sequence with LagI would give:
["ATAC","ATTCCCCCGGAATCGCCCCCCCTCC", "ATTCC"]
digesting this further with JagII would give:
["ATAC","ATTCCCCCGGAA", "TCGCCCCCCCTCC", "ATTCC"]

Write a python script that simulates a digestion with LagI, size selection to keep only the fragments longer than 50 base pairs, and a digestion with JagII, printing the lengths of the obtained fragments. What happens to the fragments if we digest first with JagII and then with LagI?

<div class="tggle" onclick="toggleVisibility('ex5');">Show/Hide Solution</div>
<div id="ex5" style="display:none;">

In [None]:
%reset -f 

from Bio import SeqIO
import re

#overhang is the number of bases of the site that precede the cut:
#in our case it can be 3 or 5 (3 for LagI and 5 for JagII)
def digestSequence(seq, regex, overhang):
    digests = []
    sP = 0
    print("\tRestriction sites:")
    #finditer returns an iterator (possibly empty) over all the sites
    for site in re.finditer(regex,seq):
        print("\t{} {} {}".format(site.start(),
                               site.end(),
                               site.group()))
        digests.append(seq[sP:site.start()+overhang])
        sP = site.start() + overhang
    #last element:
    digests.append(seq[sP:])
    return digests

fn = "file_samples/DNA_seq.fasta"

myseq = SeqIO.read(fn, "fasta")

regexLagI = "C[ATCG]CATT"
regexJagII = "GAG[AG][ACT]T[ATCG]G"


s = str(myseq.seq)

print("Initial sequence:")
print(s)
print("\n")

        
print("LagI restriction:")
digests = digestSequence(s, regexLagI, 3)
#filter digests
dig = [x for x in digests if len(x) > 50]
print("Lengths:")
print([len(x) for x in dig])
finalDigests = []

print("\nJagII restriction:")
for d in dig:
    tmp = digestSequence(d, regexJagII, 5)
    for t in tmp:
        finalDigests.append(t)

#print("Final digests:" + str(finalDigests))
print("Final lengths: " + str([len(x) for x in finalDigests])) 


#Other JagII first and LagI second:
print("\n###########################")
print("##### JagII first #########")
print("###########################")
print("JagII restriction:")
digests = digestSequence(s, regexJagII, 5)
#filter digests
dig = [x for x in digests if len(x) > 50]
print("Lengths:")
print([len(x) for x in dig])
finalDigests = []

print("\nLagI restriction:")
for d in dig:
    tmp = digestSequence(d, regexLagI, 3)
    for t in tmp:
        finalDigests.append(t)

#print("Final digests:" + str(finalDigests))
print("Final lengths: " + str([len(x) for x in finalDigests])) 



</div>