# Practical 13

In this practical we will continue with object oriented programming and will focus on code testing in Python. Finally, we will introduce regular expressions.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical13.pdf)

## Testing your code 

Testing is an important step in making sure that code behaves predictably. Although some bugs can always slip through, **testing is the process of making the code as predictable as possible**.

Two types of testing exist: **white-box** testing looks into every detail of the implemented code to inspect its correctness, while **black-box** testing ignores how the code is implemented and focuses only on the correctness of the output it produces.

Testing is a complex topic and here we will just scratch the surface (there are many books on the topic if you are interested).

A **test** is a piece of code written with the sole purpose of checking the correctness of another piece of code. 

Testing requires three successive steps: first we **set up** (or prepare) the test, arranging the connections/interfaces to the test data; then we **execute** the piece of code we want to test, using the interfaces devised at the previous step; finally we **verify** that the results are the ones we expected.
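
As a minimal sketch of these three steps (the function `count_bases` and its test data are made up for this illustration):

```python
def count_bases(dna):
    """Count how many times each base occurs in a DNA string."""
    counts = {}
    for base in dna:
        counts[base] = counts.get(base, 0) + 1
    return counts

# 1. set up: prepare the test data and the expected result
test_dna = "AACGT"
expected = {"A": 2, "C": 1, "G": 1, "T": 1}

# 2. execute: run the code under test
obtained = count_bases(test_dna)

# 3. verify: compare what we got with what we expected
test_passed = (obtained == expected)
print("test passed:", test_passed)
```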

### Doctest 

A very simple way to specify tests for the code is to use a built-in module called **doctest**. It searches your Python file for pieces of text that look like **interactive Python sessions** (that is, lines starting with ```>>> ```), executes them, and checks that they give the result specified on the following line.

<div class="alert alert-warning">

**Important:** 

Note the space after ">>> ". That is where the test starts. An example:
```
"""
This is a function that returns three values in a list...
>>> fun(x)
[x, y, z]
"""
```
</div>

**Example:**
Let's define some doctest tests for the simple function computing the first N prime numbers.

In [None]:
%reset -f 

def getFirstNprimes(N):
    '''
    This function should output the first N prime numbers.
    >>> getFirstNprimes(1)
    [2]
    >>> getFirstNprimes(2)
    [2, 3]
    >>> getFirstNprimes(10)
    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    '''
    if N == 0:
        return []
    res = [2]
    current = 3
    while len(res) < N:
        if len([x for x in res if current % x == 0]) == 0:
            res.append(current)
        current += 1
    #uncomment next line to introduce a bug
    #res.append(1)
    return res        

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    print(getFirstNprimes(20))

The line ```if __name__ == "__main__":``` checks whether the code is being executed as a script (i.e. it has not been imported as a module by another piece of code).

Executing the code as it is will not report any error, as the tests we set up pass. But if we uncomment ```res.append(1)``` the tests fail and we obtain the following testing output, which reports the three failed tests, the expected values and the obtained values: 

![](img/pract13/doctest.png)
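
The report above is printed on screen. As a sketch of how the same outcome could also be collected programmatically (the helper `run_doctests` and the function `first_primes` are made up for this illustration, while `DocTestFinder` and `DocTestRunner` are classes of the doctest module):

```python
import doctest

def first_primes(n):
    """
    Return the first n prime numbers.
    >>> first_primes(1)
    [2]
    >>> first_primes(5)
    [2, 3, 5, 7, 11]
    """
    res = []
    current = 2
    while len(res) < n:
        # current is prime if no smaller prime divides it
        if all(current % p != 0 for p in res):
            res.append(current)
        current += 1
    return res

def run_doctests(func):
    """Run the examples in the docstring of func, return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False: do not try to guess the module the function comes from
    for test in finder.find(func, func.__name__, module=False,
                            globs={func.__name__: func}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted

failed, attempted = run_doctests(first_primes)
print("{} failed out of {} examples".format(failed, attempted))
```

In everyday use ```doctest.testmod()``` is usually enough; a helper like this one is only handy when the number of failures has to be used by other code.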

Another way of testing the code is through unit tests, which we will see later on. 

### Raising exceptions and using assertions

Exceptions are a very good way to inform the program that something unexpected has happened (e.g. a division by zero, an attempt to access a position in a list that does not exist, summing a string and an integer...).

One thing we can do is to raise exceptions whenever some pre-conditions are not met, to ensure that they do not lead to erroneous behaviour. This can be done with ```raise Exception("exception text")```. More information on raising exceptions is [here](https://docs.python.org/3/tutorial/errors.html#raising-exceptions). 

**Example**: 
Consider the following ```MyIntPair``` class that works with integers. If we want to make sure it only contains integers we can add a ```raise Exception``` in case it is not an integer. 

In [None]:
class MyIntPair:
    def __init__(self, x,y):
        if not type(x) == int:
            raise Exception("x {} is not integer".format(x))
        if not type(y) == int:
            raise Exception("y {} is not integer".format(y))
        self.x = x
        self.y = y
        
    def __add__(self, other):
        return (self.x + other.x, self.y + other.y)
    
A = MyIntPair(5,10)
B = MyIntPair(3,6)
print(A + B)
C = MyIntPair(1, "two")

Note that if we try to pass something other than an integer, the execution will stop. To amend this behaviour we need to intercept the exception and deal with it appropriately. This topic is quite broad; you can find more information on it [here](https://docs.python.org/3/tutorial/errors.html#handling-exceptions).

Simple exception handling can be done with the ```try - except``` construct, which tries to run some statements and is ready to intercept and handle any exception that might occur:

In [None]:
class MyIntPair:
    def __init__(self, x,y):
        if not type(x) == int:
            raise Exception("x: {} is not integer".format(x))
        if not type(y) == int:
            raise Exception("y: {} is not integer".format(y))
        self.x = x
        self.y = y
        
    def __add__(self, other):
        return (self.x + other.x, self.y + other.y)
    
try: 
    A = MyIntPair(5,10)
    B = MyIntPair(3,6)
    #Uncomment to see a different error
    #print(A/0)
    print(A + B)
    C = MyIntPair(1, "two")
    print(A + C)

except Exception as e:
    print("Whoops something went wrong. Ignore the rest.")
    print(str(e))
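

The code above catches every ```Exception``` in the same way. As a sketch of how different errors could be handled separately (the function `safe_divide` is made up for this illustration), several ```except``` clauses can follow the same ```try```:

```python
def safe_divide(a, b):
    """Divide a by b, turning the possible errors into readable messages."""
    try:
        result = a / b
    except ZeroDivisionError as e:
        # raised by the division when b is 0
        return "cannot divide: {}".format(e)
    except TypeError as e:
        # raised when a and b cannot be divided (e.g. a string and an int)
        return "wrong types: {}".format(e)
    else:
        # executed only if no exception occurred
        return "result: {}".format(result)

messages = [safe_divide(10, 2), safe_divide(10, 0), safe_divide("ten", 2)]
for msg in messages:
    print(msg)
```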
    

Similarly, we can write assertions that test whether a condition is True; otherwise the execution is stopped with an ```AssertionError```. This error can be caught and handled appropriately. Note that in the code below only the AssertionError is caught: a division by zero (e.g. ```print(10/0)```) is not caught and would stop the execution.

In [45]:
class MyIntPair:
    def __init__(self, x,y):
        assert type(x) == int, "x: {} is not integer".format(x)
        assert type(y) == int, "y: {} is not integer".format(y)
        self.x = x
        self.y = y
        
        
    def __add__(self, other):
        return (self.x + other.x, self.y + other.y)
    
try: 
    A = MyIntPair(5,10)
    B = MyIntPair(3,6)
    #Uncomment to see a different error (not captured!)
    #print(10/0)
    print(A + B)
    C = MyIntPair(1, "two")
    print(A + C)

except AssertionError as e:
    print("Whoops something went wrong. Ignore the rest.")
    print(str(e))

(8, 16)
Whoops something went wrong. Ignore the rest.
y: two is not integer


## Unit tests

Python comes with a fully-fledged testing module called ```unittest```. Have a look [here](https://docs.python.org/3.8/library/unittest.html) for detailed information on this module.

The module ```unittest``` must be imported first with ```import unittest``` and then the Test class must be implemented to perform the tests.

In a nutshell, to create some unit tests we need to define a ```Testing``` class (we are free to call it as we like) which is a subclass of the class ```unittest.TestCase```. Within this class, we have then to specify the tests we want to run. Every test is a method and its name **must start** with ```test_``` (e.g. ```test_length```).

Tests can use the assertions ```assertEqual(value1, value2)```, ```assertTrue(condition)``` or ```assertFalse(condition)```, which respectively check the equality of two values (i.e. the known result and the output of the method to be tested) and the truth value of a condition (typically computed on the output of the method under test).
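
A minimal sketch of these three assertions at work (the function `gc_content` and the class `TestGC` are made up for this illustration). The suite is run through ```unittest.TextTestRunner``` so that the result can be inspected from the code itself:

```python
import unittest

def gc_content(dna):
    """Return the fraction of G and C bases in a DNA string (0 if empty)."""
    if len(dna) == 0:
        return 0
    return (dna.count("G") + dna.count("C")) / len(dna)

class TestGC(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(gc_content("GGCC"), 1.0)
        self.assertEqual(gc_content("ATGC"), 0.5)

    def test_range(self):
        self.assertTrue(0 <= gc_content("ATTTGCA") <= 1)

    def test_empty(self):
        self.assertFalse(gc_content("") > 0)

# build the suite and run it explicitly: unittest.main() would instead
# terminate the interpreter once the tests are done
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGC)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("tests run: {}, all passed: {}".format(result.testsRun,
                                             result.wasSuccessful()))
```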

We can invoke the ```unittest``` module from within the script by adding ```unittest.main()```, for example in the ```main``` part guarded by ```if __name__ == "__main__":```. In this case, we can run the tests just by calling something like:
```
python3 my_testing_function.py
```

Without ```unittest.main()``` in the script, we need to invoke unittest with:
```
python3 -m unittest my_testing_function.py
```

The unittest module will give us feedback on how the tests performed. In particular, if all the tests pass we should get something like:

```
python3 -m unittest file_samples/my_testing_function.py 

.....
----------------------------------------------------------------------
Ran 5 tests in 0.387s

OK
```
Otherwise, if some tests fail, we get something like:

```
...FF
======================================================================
FAIL: test_one (file_samples.my_testing_function.Testing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/biancol/Google Drive/work/courses/QCBsciprolab/file_samples/my_testing_function.py", line 33, in test_one
    self.assertEqual(getFirstNprimes(1),[2])
AssertionError: Lists differ: [10] != [2]

First differing element 0:
10
2

- [10]
+ [2]

======================================================================
FAIL: test_ten (file_samples.my_testing_function.Testing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/biancol/Google Drive/work/courses/QCBsciprolab/file_samples/my_testing_function.py", line 37, in test_ten
    [2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
AssertionError: Lists differ: [2, 3, 5, 7, 11, 13, 17, 10, 23, 29] != [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

First differing element 7:
10
19

- [2, 3, 5, 7, 11, 13, 17, 10, 23, 29]
?                           ^

+ [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
?                           ^


----------------------------------------------------------------------
Ran 5 tests in 0.770s

FAILED (failures=2)

```

**Example:**
Let's define some unit tests for the simple function computing the first N prime numbers. The file is available here: [my_testing_function.py](file_samples/my_testing_function.py).

In [None]:
%reset -f 
"""
In file my_testing_function.py
"""

import unittest
import random

def getFirstNprimes(N):
    '''
    This function should output the first N prime numbers.
     '''
    if N <= 0:
        return []
    res = [2]
    current = 3
    while len(res) < N:
        if len([x for x in res if current % x == 0]) == 0:
            res.append(current)
        current += 1
    #uncomment next line to introduce a bug
    #res.append(1)
    #or a more subtle error (randint needs both bounds, and both are included):
    #ind = random.randint(0, len(res) - 1)
    #res[ind] = 10
    return res        

class Testing(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(getFirstNprimes(0),[])
    
    def test_one(self):
        self.assertEqual(getFirstNprimes(1),[2])
    
    def test_ten(self):
        self.assertEqual(getFirstNprimes(10),
                         [2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
    
    def test_len(self):
        for i in range(0,10):
            n = random.randint(1,1000)
            self.assertFalse(len(getFirstNprimes(n)) != n)
    
    
    def test_negative(self):
        self.assertTrue(len(getFirstNprimes(-1)) == 0)

if __name__ == "__main__":
    #uncomment to run the tests in the main (without -m)
    #unittest.main()
    print(getFirstNprimes(20))

Running the tests on the previous code without any "intentional bug" will produce an OK message as above. Note that the random error is quite tricky, as it affects a different position every time (try running the tests several times). Errors of this type are quite difficult to detect and require several rounds of testing: when random steps are involved, the error might not occur every time.
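
One common way to tame errors that depend on random steps is to fix the seed of the random number generator, so that a failing run can be repeated exactly. The sketch below illustrates the idea under some assumptions: `flaky_first_primes` is a made-up buggy variant that misbehaves only now and then, and the generator is passed in explicitly as `rng`.

```python
import random

def first_primes(n):
    res = []
    current = 2
    while len(res) < n:
        if all(current % p != 0 for p in res):
            res.append(current)
        current += 1
    return res

def flaky_first_primes(n, rng):
    """Buggy variant: now and then one element is overwritten with 10."""
    res = first_primes(n)
    if res and rng.random() < 0.3:
        res[rng.randrange(len(res))] = 10
    return res

expected = first_primes(10)
failing_seeds = []
for seed in range(50):
    # a generator with a known seed always produces the same random sequence
    rng = random.Random(seed)
    if flaky_first_primes(10, rng) != expected:
        failing_seeds.append(seed)

print("seeds that expose the bug:", failing_seeds)
```

Once a failing seed is known, calling `flaky_first_primes(10, random.Random(seed))` with that seed reproduces the same wrong output every time, which makes debugging much easier.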

## Regular Expressions 

A **regular expression** (**regex**) is a string of characters defining a search pattern with which we can carry out operations such as pattern and string matching, find/replace etc.

There are two types of characters: **normal characters** (which have to match themselves) and **special characters**, which are used to specify repetitions (```*, ?, +, {x,y}```), a set of elements (```[]```), negation (```[^]```), the beginning of a string (```^```), the end of a string (```$```), etc.

As seen in the lecture, the syntax of the most common regular expressions is the following:

![](img/pract13/regex.png)

With regex it is possible to filter strings according to patterns that can include specific words (optional with ```?```, present one or more times with ```+```, zero or more times with ```*```, or from ```n``` to ```m``` times with ```{n,m}```), anchored at the beginning of the string with ```^``` or at the end of the string with ```$```. Sets of characters can be specified with ```[]``` and the negation of a set of characters with ```[^]```.

Let's see some examples:

What does the following regex match?

```
regex = "[A-Z]__[0-9]{1,4}__[a-z:-]*__[A-Z]"
```
One capital letter, two underscores, from 1 to 4 digits, two underscores, zero or more characters that are lowercase letters, ```:``` or ```-```, two underscores and one capital letter. Examples:

```
B__2786__prxywflh-bityeqwdmuzbygwpjtadzbvjyzmq:prucyz-rkrgjmytczdjejsgvpn-__X
Q__5879__rarwqytmpqa-l__M
A__45__lwonmfel-qdbcd__X
T__5776__lmeqq__T
A__126____T

```
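
A quick way to check a description like the one above is to try the regex on strings that should and should not match. The candidate strings below are made up for this illustration:

```python
import re

regex = "[A-Z]__[0-9]{1,4}__[a-z:-]*__[A-Z]"

candidates = ["Q__5879__rarwqytmpqa-l__M",   # valid
              "A__126____T",                 # valid: the middle part can be empty
              "a__126____T",                 # starts with a lowercase letter
              "A__12345__abc__T",            # five digits instead of 1 to 4
              "A__45__lwonMfel__X"]          # capital letter in the middle part

# re.fullmatch succeeds only if the whole string matches the pattern
accepted = [s for s in candidates if re.fullmatch(regex, s)]
print(accepted)
```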

What does the following regex match (assuming only IUPAC ambiguous nucleotide alphabet)?

```
regex = "[ATCG]*([^ATCG]+[ATCG]*)*"
```
It matches all DNA strings made of zero or more A, T, C, G bases, followed by zero or more repetitions of a run of one or more non-A,T,C,G bases (e.g. N) followed by zero or more A, T, C, G bases:
```
ATACAYYATATACA
CGCTTANTAT
AAATTCGW
TTACWWWWWWWWA
```

What does the following regex encode for?

```
regex = "^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"
```
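It matches strings made of four groups of one to three digits separated by dots (the shape of an IPv4 address, although the value of each group is not checked):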

```
61.92.71.6
497.7.0.88
168.9.620.755
21.5.5.1
6.994.06.45
```
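
Note that strings such as ```497.7.0.88``` have the right shape but are not valid IPv4 addresses, since each group should be between 0 and 255. A possible sketch of a stricter check, combining the regex with a numeric test, is the following (the helper functions `looks_like_ip` and `is_valid_ipv4` are made up for this illustration; the regex is written as a raw string, ```r"..."```, so that the backslashes reach the ```re``` module unchanged):

```python
import re

regex = r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"

def looks_like_ip(s):
    """True if s has the shape matched by the regex."""
    return re.match(regex, s) is not None

def is_valid_ipv4(s):
    """True if s has the right shape AND every group is between 0 and 255."""
    if not looks_like_ip(s):
        return False
    return all(0 <= int(part) <= 255 for part in s.split("."))

for s in ["61.92.71.6", "497.7.0.88", "21.5.5", "a.b.c.d"]:
    print("{}\tshape: {}\tvalid: {}".format(s, looks_like_ip(s),
                                            is_valid_ipv4(s)))
```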

In Python there is a specific module to deal with regular expressions. It is called ```re``` and must be imported with ```import re``` before using it.

### Match, search, finditer and findall
The first three methods we are going to see are *match*, *search* and *finditer*, which all work with ```MatchObject``` instances:

* ```re.match(regex, str)``` where ```regex``` is a string representing the regular expression and ```str``` is the input string. This method tries to match the regex **from the beginning** of the string. It returns a ```MatchObject```, or None if no match could be found;

* ```re.search(regex,str)``` searches ```regex``` in the whole string and returns a ```MatchObject``` with the first occurrence of the pattern, or None if the regex could not match anything;

* ```re.finditer(regex,str)``` returns an iterator over ```MatchObject``` instances, one for each of the **non-overlapping** matches of the regular expression in the string. The string is scanned left-to-right, and matches are returned in the order found.


A ```MatchObject```, if not None, provides the following information:

* ```MatchObject.group()``` : the matched string;

* ```MatchObject.start()``` : the starting point of the matched string in the tested string (str);

* ```MatchObject.end()``` : the ending point of the matched string in the tested string (str); 

* ```MatchObject.groups()``` : when defining a regular expression, we can define subgroups by using "()". This method returns a tuple containing all the subgroups.

Another method available is:

```re.findall(regex,str)``` returns a **list** with all the occurrences of the regex in the string. Note that **if a group is specified** (with "()"), this method reports only the group; if several groups are specified, it reports a tuple with all the groups for each match.
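
The effect of groups on ```findall``` is easier to see on a small example (the string `text` is made up for this illustration):

```python
import re

text = "GCTGC AAGCAGC TTGCGGC"

# no group: the whole match is reported
no_group = re.findall("GC[ATCG]GC", text)
# one group: only the content of the group is reported
one_group = re.findall("GC([ATCG])GC", text)
# two groups: one tuple per match, with one element per group
two_groups = re.findall("(GC([ATCG])GC)", text)

print(no_group)    # ['GCTGC', 'GCAGC', 'GCGGC']
print(one_group)   # ['T', 'A', 'G']
print(two_groups)  # [('GCTGC', 'T'), ('GCAGC', 'A'), ('GCGGC', 'G')]
```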

Let's see some examples of match:

In [39]:
%reset -f

import re 


myStr = "hi there, i am using Python and learning hOw to UsE Regular Expressions AKA REGEX"

m = re.search("123", myStr)
print(myStr)
if(m):
    print("123 is in myStr")
else:
    print("123 is NOT in myStr")
    print("m is {}".format(m))
    
a = myStr.split()
print("\nCapitalized words:")
for word in a:
    match = re.match("[A-Z]+[a-zA-Z]*",word)
    if(match):
        print(word)
        
print("\nOn the whole string with SEARCH:")
result = re.search(" [A-Z]+[a-zA-Z]*|^[A-Z]+[a-zA-Z]*| [A-Z]+[a-zA-Z]*$", myStr)

print("{}: starts:{} ends: {} that is: \"{}\"".format(result.group(), 
                                        result.start(), 
                                        result.end(), 
                                        myStr[result.start():result.end()]))

print("\nIteratively with finditer")
#get all the words starting with a capital letter
for m in re.finditer(" [A-Z]+[a-zA-Z]*|^[A-Z]+[a-zA-Z]*| [A-Z]+[a-zA-Z]*$", myStr):
    print(type(m))
    print("{}: starts:{} ends: {} that is: \"{}\"".format(m.group(), 
                                        m.start(), 
                                        m.end(), 
                                        myStr[m.start():m.end()]))

hi there, i am using Python and learning hOw to UsE Regular Expressions AKA REGEX
123 is NOT in myStr
m is None

Capitalized words:
Python
UsE
Regular
Expressions
AKA
REGEX

On the whole string with SEARCH:
 Python: starts:20 ends: 27 that is: " Python"

Iteratively with finditer
<class '_sre.SRE_Match'>
 Python: starts:20 ends: 27 that is: " Python"
<class '_sre.SRE_Match'>
 UsE: starts:47 ends: 51 that is: " UsE"
<class '_sre.SRE_Match'>
 Regular: starts:51 ends: 59 that is: " Regular"
<class '_sre.SRE_Match'>
 Expressions: starts:59 ends: 71 that is: " Expressions"
<class '_sre.SRE_Match'>
 AKA: starts:71 ends: 75 that is: " AKA"
<class '_sre.SRE_Match'>
 REGEX: starts:75 ends: 81 that is: " REGEX"


**Example:**
There are three types of valid Uniprot accessions:

1. [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
2. [A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
3. [A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]

Devise a single regex capable of capturing all of them and use it to filter out the strings that do not encode valid Uniprot accessions from the following list:

 ["P68250", "SRX4799393", "XT44312B","LFKT12DDF", "A0A022YWF9", 
 "PRJNA230403", "P15163", "Q86US8","W0075458831", 
 "A0A2K5MMC5", "SAMN10177882","Q8LKW0", "Q8IEH2_PLAF7"]

Test the strings that passed the filter by accessing: 
https://www.uniprot.org/uniprot/xxxxxxx where "xxxxxxx" is a valid Uniprot accession.

In [38]:
%reset -f 
import re

accessionList = ["P68250", "SRX4799393", "XT44312B","LFKT12DDF",
                 "A0A022YWF9", "PRJNA230403", "P15163", 
                 "Q86US8","W0075458831", "A0A2K5MMC5", "SAMN10177882",
                "Q8LKW0", "Q8IEH2_PLAF7"]
for el in accessionList:
    match = re.match("[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}",el)
    if(match):
        print("https://www.uniprot.org/uniprot/" + match.group())
    else:
        print("skipping:" + el)

https://www.uniprot.org/uniprot/P68250
skipping:SRX4799393
skipping:XT44312B
skipping:LFKT12DDF
https://www.uniprot.org/uniprot/A0A022YWF9
skipping:PRJNA230403
https://www.uniprot.org/uniprot/P15163
https://www.uniprot.org/uniprot/Q86US8
skipping:W0075458831
https://www.uniprot.org/uniprot/A0A2K5MMC5
skipping:SAMN10177882
https://www.uniprot.org/uniprot/Q8LKW0
https://www.uniprot.org/uniprot/Q8IEH2
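

Note that ```Q8IEH2_PLAF7``` passes the filter because ```re.match``` only requires the regex to match at the beginning of the string, and the reported accession is just the matched prefix. Whether this is desirable depends on the intent; if the whole string must be an accession, one option is ```re.fullmatch```, as in this sketch (the test strings are made up, except for the one taken from the list above):

```python
import re

# same pattern used above
regex = "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"

results = {}
for el in ["Q8IEH2", "Q8IEH2_PLAF7", "P68250XYZ"]:
    prefix_ok = re.match(regex, el) is not None      # match at the beginning
    whole_ok = re.fullmatch(regex, el) is not None   # match of the whole string
    results[el] = (prefix_ok, whole_ok)
    print("{}\tmatch: {}\tfullmatch: {}".format(el, prefix_ok, whole_ok))
```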


**Example:**
The BisI restriction enzyme cuts the DNA at a specific site (restriction site): 
```
GCNGC where N is any base 
```
given the following DNA string:
```
DNA = "ACAAAGACCAAGAAAATGGAATACTCGGAGGGGTAGACCACTAGAAGTGAGCTGCATAAGCGCCGCAAGCGGATTATTTGCCACTCAGTCATTGGCAGGAGTTGCAGGCTGCCAGAGTTCGACAGTTGCAGTATGCCGCGGATCCACGTTTGGAGCAGCGCCAGTTTGGATAGACACCAGGACCCAGAAGCATTACAAAATGCGGCGTAACCTCGCTGTAGACCATGAAACTACGACCGGGTGAATCAGTCCACTCTGCTGCTGGAGCTACCATTGGCAGCACCGGTTGTGGGTAAGGAAAGCTCACCGTTAATCTACTGCAGATGGACAGAGTCCGGGTCCTACACACGCGGCGAGACACCCATTTGTGGCCAGTAGCCGCTTCAGGTCGAGTAATCTGCCAGCAAATCGCGGCGAGCGAGTGACTTGTCGACGAAGCAAAGCAATGCAAACCGTTGGCGGCAGCAATCTCCTGCATATGATGGTGGGACGCTACAATAATCATGTACCAGGCAGCCTATGCATAATATGGCTGCCCACCTACGGGCTATTATAGTGCGACAGCTACCAAAAAAGGTCGCTAGCATGCGCGATACAGCTAACTTTTACGCAGCATACCTTTCCAGCTAGCCGCGATTGT" 
```
Write a regex and some Python code that does the following:

1. prints the start and end position of the first restriction site and the corresponding sequence;
2. prints all the positions of the restriction sites and the corresponding sequences;
3. counts how many restriction sites there are and prints a list with all the distances between consecutive restriction sites;
4. defines two groups to get the whole restriction site and also the base in between the GCs.


In [40]:
import re

DNA = """ACAAAGACCAAGAAAATGGAATACTCGGAGGGGTAGACCACTAGAAGTGAGCTGCATAAGCGCCGCAAGCGGAT
TATTTGCCACTCAGTCATTGGCAGGAGTTGCAGGCTGCCAGAGTTCGACAGTTGCAGTATGCCGCGGATCCACGTTTGGAGCA
GCGCCAGTTTGGATAGACACCAGGACCCAGAAGCATTACAAAATGCGGCGTAACCTCGCTGTAGACCATGAAACTACGACCGG
GTGAATCAGTCCACTCTGCTGCTGGAGCTACCATTGGCAGCACCGGTTGTGGGTAAGGAAAGCTCACCGTTAATCTACTGCAG
ATGGACAGAGTCCGGGTCCTACACACGCGGCGAGACACCCATTTGTGGCCAGTAGCCGCTTCAGGTCGAGTAATCTGCCAGCA
AATCGCGGCGAGCGAGTGACTTGTCGACGAAGCAAAGCAATGCAAACCGTTGGCGGCAGCAATCTCCTGCATATGATGGTGGG
ACGCTACAATAATCATGTACCAGGCAGCCTATGCATAATATGGCTGCCCACCTACGGGCTATTATAGTGCGACAGCTACCAAA
AAAGGTCGCTAGCATGCGCGATACAGCTAACTTTTACGCAGCATACCTTTCCAGCTAGCCGCGATTGT""" 

DNA = DNA.replace("\n","")
regex = "GC[ATCG]GC"
#let's try with match
match = re.match(regex,DNA)
if match:
    print("DNA starts with a restriction site\n")
else:
    print("DNA does NOT start with a restriction site\n")
#with search
first = re.search(regex,DNA)
if first:
    print("1st Restriction site starts at {} ends at {}: {}\n".format(first.start(),
                                                                first.end(),
                                                                first.group()
                                                               ))
allrest = re.finditer(regex,DNA)
if allrest: 
    cnt = 0
    curPos = 0
    distances = []
    for site in allrest:
        print("{} res. site starts at {} ends at {}: {}".format(cnt + 1,site.start(),
                                                                site.end(),
                                                                site.group()
                                                               ))
        cnt +=1
        distances.append(site.start() - curPos -1 )
        curPos = site.end()

print("\nWe have {} restriction sites".format(cnt))
print("Distances:\n{}".format(distances))

regex = "(GC([ATCG])GC)"
allrest = re.finditer(regex,DNA)
if allrest: 
    for site in allrest:
        print("res. site S:{} E:{}: {} whole group: {} base: {}".format(site.start(),
                                                                site.end(),
                                                                site.group(),
                                                                site.groups()[0],
                                                                site.groups()[1]
                                                               ))

print("\nFindall test (tuple with all groups!):")
allres = re.findall(regex,DNA)
print(allres)

DNA does NOT start with a restriction site

1st Restriction site starts at 50 ends at 55: GCTGC

1 res. site starts at 50 ends at 55: GCTGC
2 res. site starts at 61 ends at 66: GCCGC
3 res. site starts at 107 ends at 112: GCTGC
4 res. site starts at 134 ends at 139: GCCGC
5 res. site starts at 154 ends at 159: GCAGC
6 res. site starts at 201 ends at 206: GCGGC
7 res. site starts at 257 ends at 262: GCTGC
8 res. site starts at 276 ends at 281: GCAGC
9 res. site starts at 349 ends at 354: GCGGC
10 res. site starts at 377 ends at 382: GCCGC
11 res. site starts at 410 ends at 415: GCGGC
12 res. site starts at 458 ends at 463: GCGGC
13 res. site starts at 512 ends at 517: GCAGC
14 res. site starts at 531 ends at 536: GCTGC
15 res. site starts at 609 ends at 614: GCAGC
16 res. site starts at 629 ends at 634: GCCGC

We have 16 restriction sites
Distances:
[49, 5, 40, 21, 14, 41, 50, 13, 67, 22, 27, 42, 48, 13, 72, 14]
res. site S:50 E:55: GCTGC whole group: GCTGC base: T
res. site S:61 E:66: 

A lot of information on regex can be found [here](https://docs.python.org/3.7/library/re.html) and [here](https://docs.python.org/3/howto/regex.html).

If you want to practice with regular expressions, you can use this website: [https://regex101.com/r/gU7oT6/2](https://regex101.com/r/gU7oT6/2).
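
A final remark on the restriction site example: ```finditer``` and ```findall``` report **non-overlapping** matches only, so two sites sharing some bases (e.g. in ```GCAGCTGC```) would be counted once. Whether overlapping sites should be counted depends on the biological question. If they should, a possible sketch uses a lookahead, ```(?=...)```, on a short made-up sequence:

```python
import re

dna = "AAGCAGCTGCTT"

# finditer reports non-overlapping matches only
plain = [(m.start(), m.group()) for m in re.finditer("GC[ATCG]GC", dna)]

# a lookahead (?=...) matches without consuming characters, so the search
# restarts one position later and overlapping sites are reported as well
overlapping = [(m.start(), m.group(1))
               for m in re.finditer("(?=(GC[ATCG]GC))", dna)]

print(plain)        # [(2, 'GCAGC')]
print(overlapping)  # [(2, 'GCAGC'), (5, 'GCTGC')]
```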

## Exercises


1. The following function is supposed to get two lists of integers (let's call them X and Y) and return the list of elements that are contained in both (let's call it B). Is it correct? Devise a unit test to check whether it is correct or not (things to check are the length of the result, transitivity, repeated elements in one of the lists, one list empty...). If it is not correct, propose a correct version of the function.


In [42]:
def myListIntersection(X,Y):
    tmp = X + Y
    vals = [x for x in tmp if tmp.count(x) == 2]
    return list(set(vals))

A = [1, 2, 3, 4, 7, 12]
B = [4, 1, 7, 120]
C = [120, 6]
D = []

print("A, B: {}".format(myListIntersection(A,B)))
print("A, C: {}".format(myListIntersection(A,C)))
print("B, C: {}".format(myListIntersection(B,C)))
print("A, D: {}".format(myListIntersection(A,D)))

A, B: [1, 4, 7]
A, C: []
B, C: [120]
A, D: []


<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [None]:
%reset -f

import unittest
import random

#this has problems (test X=[1,1,2,2,3], Y=[4,3])
def myListIntersection2(X,Y):
    tmp = X + Y
    vals = [x for x in tmp if tmp.count(x) == 2]
    return list(set(vals))

##Correct!
def myListIntersection(X,Y):
    inx = [ x for x in X if x in Y]
    iny = [y for y in Y if y in X]
    return list(set(inx + iny))
    
A = [1, 2, 3, 4, 7, 12]
B = [4, 1, 7, 120]
C = [120, 6]
D = []


class Test(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(Test, self).__init__(*args, **kwargs)
        #create two test lists:
        self.x = []
        self.y = []
        for i in range(15):    
            self.x.append(random.randint(0,10))
            self.y.append(random.randint(0,10))
        #print("{}\n{}".format(self.x,self.y))
                          
    def test_reslen(self):
        r = myListIntersection(self.x, self.y)
        self.assertFalse(len(r) > len(self.x))
        self.assertFalse(len(r) > len(self.y))
        s1 = [a for a in self.x if a in self.y]
        s2 = [a for a in self.y if a in self.x]
        S = set(s1+s2)
        self.assertTrue(len(r) == len(S))
    
    def test_empty(self):
            self.assertEqual(myListIntersection(self.x, []),[])
            self.assertEqual(myListIntersection([], self.y),[]) 
    
    def test_transitivity(self):
            #sorted() returns a new sorted list, while list.sort() returns None
            v = sorted(myListIntersection(self.x, self.y))
            v1 = sorted(myListIntersection(self.y, self.x))
            self.assertEqual(v,v1)
    
    def test_doubleEls(self):
        dX = self.x + self.x
        dY = self.y + self.y
        v1 = myListIntersection(dX,dY)
        v1.sort()
        v2 = myListIntersection(self.x,self.y)
        v2.sort()
        self.assertEqual(v1, v2)

if __name__ == "__main__":
    print("A, B: {}".format(myListIntersection(A,B)))
    print("A, C: {}".format(myListIntersection(A,C)))
    print("B, C: {}".format(myListIntersection(B,C)))
    print("A, D: {}".format(myListIntersection(A,D)))   
    unittest.main()

    

<div class="alert alert-info">

**Note:** 
Note the line that I used to initialize the Test class. 
```
def __init__(self, *args, **kwargs):
        super(Test, self).__init__(*args, **kwargs)
```
this allows us to define the random test data within the Test class (these lines are basically because we need to call the super-class constructor with all the parameters it needs). 

</div>
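
An alternative to overriding ```__init__``` is the ```setUp``` method of ```unittest.TestCase```, which is called automatically before each test method. The sketch below is not taken from the solution file: the function `list_intersection`, the class `TestWithSetUp` and the seed value are assumptions made for this illustration.

```python
import random
import unittest

def list_intersection(X, Y):
    return list(set(x for x in X if x in Y))

class TestWithSetUp(unittest.TestCase):
    def setUp(self):
        # called automatically before each test method
        rng = random.Random(13)  # fixed seed: same test data at every run
        self.x = [rng.randint(0, 10) for _ in range(15)]
        self.y = [rng.randint(0, 10) for _ in range(15)]

    def test_empty(self):
        self.assertEqual(list_intersection(self.x, []), [])

    def test_symmetry(self):
        self.assertEqual(sorted(list_intersection(self.x, self.y)),
                         sorted(list_intersection(self.y, self.x)))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWithSetUp)
result = unittest.TextTestRunner().run(suite)
```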


You can find a solution to run unittests here: [pract13_ex1.py](file_samples/pract13_ex1.py)

</div>

2. CRISPR-Cas9 is quite a neat system to perform genome editing. Guide RNAs (gRNAs) can transport Cas9 to anywhere in the genome for gene editing, but no editing can occur at any site other than one at which Cas9 recognizes the protospacer adjacent motif (PAM). The PAM site is a 2-6 base pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system. Some commonly used PAMs are the following:
```
NGG (where N is any base)
NGA
YG (where Y is a Pyrimidine, i.e. C or T)
TTTN
YTN
```
Write a function that loads the fasta sequences [contig82.fasta](file_samples/contig82.fasta) and, for each sequence, reports the number of sites of each of the above PAMs and its frequency (i.e. the number of sites over the length of the sequence). Hint: load the sequence with biopython and SeqIO.



<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [41]:
%reset -f

from Bio import SeqIO
import re

def countPAMs(filename):
    for seq_record in SeqIO.parse(filename, "fasta"):
        s = seq_record.seq
        l = len(s)
        ident = seq_record.id
        m = re.findall("[ATCG]GG", str(s))
        NGG = len(m)
        m = re.findall("[ATCG]GA", str(s))
        NGA = len(m)
        #inside [] the "|" is a literal character: [CT] already means C or T
        m = re.findall("[CT]G", str(s))
        YG = len(m)
        m = re.findall("TTT[ATCG]", str(s))
        TTTN = len(m)
        m = re.findall("[CT]T[ATCG]", str(s))
        YTN = len(m)
        print("{} (len:{}):\n\tNGG:{} (1 on {} bases)".format(ident, l,NGG, l/NGG))
        print("\tNGA:{} (1 on {} bases)".format(NGA, l/NGA))
        print("\tYG:{} (1 on {} bases)".format(YG, l/YG))
        print("\tTTTN:{} (1 on {} bases)".format( TTTN, l/TTTN))
        print("\tYTN:{} (1 on {} bases)".format( YTN, l/YTN))
        
        
fn = "file_samples/contigs82.fasta"
countPAMs(fn)


MDC020656.85 (len:2802):
	NGG:128 (1 on 21.890625 bases)
	NGA:163 (1 on 17.190184049079754 bases)
	YG:226 (1 on 12.398230088495575 bases)
	TTTN:62 (1 on 45.193548387096776 bases)
	YTN:256 (1 on 10.9453125 bases)
MDC001115.177 (len:3118):
	NGG:74 (1 on 42.13513513513514 bases)
	NGA:159 (1 on 19.61006289308176 bases)
	YG:215 (1 on 14.502325581395349 bases)
	TTTN:74 (1 on 42.13513513513514 bases)
	YTN:337 (1 on 9.252225519287833 bases)
MDC013284.379 (len:5173):
	NGG:182 (1 on 28.423076923076923 bases)
	NGA:307 (1 on 16.850162866449512 bases)
	YG:456 (1 on 11.344298245614034 bases)
	TTTN:162 (1 on 31.932098765432098 bases)
	YTN:577 (1 on 8.965337954939342 bases)
MDC018185.243 (len:22724):
	NGG:723 (1 on 31.43015214384509 bases)
	NGA:1230 (1 on 18.47479674796748 bases)
	YG:1932 (1 on 11.761904761904763 bases)
	TTTN:538 (1 on 42.237918215613384 bases)
	YTN:2424 (1 on 9.374587458745875 bases)
MDC018185.241 (len:23761):
	NGG:713 (1 on 33.32538569424965 bases)
	NGA:1420 (1 on 16.733098591549297

	TTTN:249 (1 on 32.9277108433735 bases)
	YTN:947 (1 on 8.657866948257656 bases)
MDC019067.226 (len:2238):
	NGG:58 (1 on 38.58620689655172 bases)
	NGA:103 (1 on 21.728155339805824 bases)
	YG:206 (1 on 10.864077669902912 bases)
	TTTN:97 (1 on 23.072164948453608 bases)
	YTN:278 (1 on 8.050359712230216 bases)
MDC036568.1 (len:2133):
	NGG:49 (1 on 43.53061224489796 bases)
	NGA:100 (1 on 21.33 bases)
	YG:182 (1 on 11.719780219780219 bases)
	TTTN:53 (1 on 40.24528301886792 bases)
	YTN:241 (1 on 8.850622406639005 bases)
MDC014019.318 (len:2896):
	NGG:131 (1 on 22.106870229007633 bases)
	NGA:143 (1 on 20.251748251748253 bases)
	YG:270 (1 on 10.725925925925926 bases)
	TTTN:85 (1 on 34.07058823529412 bases)
	YTN:354 (1 on 8.180790960451978 bases)
MDC007995.528 (len:3172):
	NGG:102 (1 on 31.098039215686274 bases)
	NGA:169 (1 on 18.76923076923077 bases)
	YG:283 (1 on 11.208480565371024 bases)
	TTTN:109 (1 on 29.10091743119266 bases)
	YTN:364 (1 on 8.714285714285714 bases)
MDC026961.60 (len:4594):
	

</div>
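
As a side note, the regexes for the PAMs can also be derived automatically from their IUPAC codes. The following is only a sketch on a short made-up sequence, without biopython; the dictionary `IUPAC` and the functions `pam_to_regex` and `count_pams` are names introduced for this illustration.

```python
import re

# IUPAC codes needed by the PAMs listed in the exercise
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "Y": "[CT]"}

def pam_to_regex(pam):
    """Translate a PAM written with IUPAC codes into a regular expression."""
    return "".join(IUPAC[base] for base in pam)

def count_pams(seq, pams):
    """Return {pam: number of non-overlapping occurrences in seq}."""
    return {pam: len(re.findall(pam_to_regex(pam), seq)) for pam in pams}

seq = "ATGGCGATTTACGG"
counts = count_pams(seq, ["NGG", "NGA", "YG", "TTTN", "YTN"])
for pam, n in counts.items():
    # guard against an empty sequence before dividing by its length
    per_base = n / len(seq) if len(seq) > 0 else 0
    print("{}: {} sites ({:.3f} per base)".format(pam, n, per_base))
```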

3. Write a python function ```sortCSV(mystr)``` that gets a comma separated string and returns a comma separated string with all elements sorted in alphabetically decreasing order. Define some unittests to check if the function has been implemented correctly (some things to check: length of initial string is the same as that of the final string, number of elements is the same, each element in the output string must come after the next one in lexicographical order,...). 


<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [None]:
%reset -f

import random
import unittest

def sortCSV(mystr):
    tmp = mystr.split(",")
    tmp.sort(reverse=True)
    return ",".join(tmp)



class Testing(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(Testing, self).__init__(*args, **kwargs)
        #create a random string
        self.alphabet = "abcdefghijklmnopqrstuvwxyz"
        self.data = ""
        #create 15 random strings
        for i in range(15):
            word = ""
            #each of them has a random length up to 20
            j = random.randint(1,20)
            for ind in range(j):
                #pick a random letter (index 0 included, so that "a" can be drawn too)
                t = random.randint(0,len(self.alphabet)-1)
                word += self.alphabet[t]
            if(len(self.data) == 0):
                self.data = word
            else:
                self.data += "," + word


    def test_reslen(self):
        self.assertEqual(len(self.data), len(sortCSV(self.data)))

    def test_elcount(self):
        res = sortCSV(self.data).split(",")
        self.assertEqual(len(self.data.split(",")), len(res))

    def test_elsorting(self):
        res = sortCSV(self.data).split(",")
        for ind in range(len(res)-1):
            #>= and not >: the random data may contain the same word twice
            self.assertGreaterEqual(res[ind], res[ind+1])

    def test_empty(self):
        self.assertEqual(sortCSV(""),"")
    
    def test_onlyOne(self):
        j = random.randint(1,20)
        word = ""
        for ind in range(j):
            #pick up to 20 random letters
            t = random.randint(0,len(self.alphabet)-1)
            word += self.alphabet[t]
        self.assertEqual(sortCSV(word), word)

if __name__ == "__main__":
    mystr = "book,tree,final,example,testing,zed,all,hair,lady,figure,tap,spring,test,fin,tail"
    print("Original:\n{}".format(mystr))
    print("Sorted:\n{}".format(sortCSV(mystr)))
    unittest.main()


You can find a solution to run unittests here: [pract13_ex3.py](file_samples/pract13_ex3.py)
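
Since the test data above are random, a failing run can be hard to reproduce. The following is only a minimal sketch of one possible variant, assuming a fixed seed and the standard `setUp` hook of `unittest` instead of `__init__`; the class name `SeededTesting` and the seed value are illustrative choices, not part of the original solution.

```python
import random
import unittest


def sortCSV(mystr):
    # same function as in the solution above
    tmp = mystr.split(",")
    tmp.sort(reverse=True)
    return ",".join(tmp)


class SeededTesting(unittest.TestCase):  # hypothetical class name
    def setUp(self):
        # setUp runs before every test method; the fixed seed (an
        # arbitrary value) makes the random data identical on every run
        rng = random.Random(13)
        letters = "abcdefghijklmnopqrstuvwxyz"
        words = []
        for _ in range(15):
            length = rng.randint(1, 20)
            words.append("".join(rng.choice(letters) for _ in range(length)))
        self.data = ",".join(words)

    def test_same_elements(self):
        # sorting must not add, drop or alter any element
        self.assertEqual(sorted(self.data.split(",")),
                         sorted(sortCSV(self.data).split(",")))

    def test_decreasing(self):
        res = sortCSV(self.data).split(",")
        # >= rather than > so that repeated words do not fail the test
        for first, second in zip(res, res[1:]):
            self.assertGreaterEqual(first, second)


# run the tests without exiting the interpreter (handy inside a notebook)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SeededTesting)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Comparing the sorted lists of elements checks both the element count and the content in a single assertion.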


</div>

4. Solve a modified version of the following exercise from Practical 11 by implementing a function that parses the "ExpXml" text with several regular expressions.

The exercise is reported below:

Write a python script that retrieves all the information present in SRA regarding PacBio sequencing performed on E.coli strain K12 (query term is “E.coli K12 wgs PacBio”). 

Print the number of results and for each id report the title, the experiment accession, the instrument, the library strategy, the library source, the total number of spots and total number of bases sequenced.
Sample output:

```
Entries found: 9

[1] Results for id 357838:
Title: E. coli K12 PacBio RS C2 CCS sequencing
Experiment accession: SRX255779
Instrument: PACBIO_SMRT
Library strategy: WGS
Library source: GENOMIC
Total spots:1798302
Total bases:4228754616

 ...
 ...
```

A sample "ExpXml" string:

```
<Summary><Title>E. coli K12 PacBio RS C2 CCS sequencing</Title><Platform instrument_model="PacBio RS">PACBIO_SMRT</Platform><Statistics total_runs="22" total_spots="1798302" total_bases="4228754616" total_size="16799546700" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA071585" center_name="NBACC" contact_name="Sergey Koren" lab_name=""/><Experiment acc="SRX255779" ver="1" status="public" name="E. coli K12 PacBio RS C2 CCS sequencing"/><Study acc="SRP020003" name="Escherichia coli K12 Re-sequencing"/><Organism taxid="511145" ScientificName="Escherichia coli str. K-12 substr. MG1655"/><Sample acc="SRS000462" name=""/><Instrument PACBIO_SMRT="PacBio RS"/><Library_descriptor><LIBRARY_NAME>PacBio RS CCS</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA194437</Bioproject><Biosample>SAMN00000224</Biosample>
```



<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [36]:
%reset -f 

import re
from Bio import Entrez



def parseExp(expStr):
    #raw strings (r"...") pass the backslashes to the regex engine untouched
    m = re.search(r"<Title>([A-Za-z0-9_\. \(\)]*)</Title>", expStr)
    if(m):
        #group 1 is the text captured between the two tags
        title = "Title: " + m.group(1)
        print(title)
        
    m = re.search(r'<Experiment acc="([A-Z0-9]*)"', expStr)
    if(m):
        acc = "Experiment accession: " + m.groups()[0]
        print(acc)
    m = re.search(r'<Platform ([A-Za-z0-9_=" \(\)]*)>([A-Za-z0-9_\(\)]*)</Platform>', expStr)
    if(m):
        platform = "Instrument: " + m.groups()[1]
        print(platform)
    m = re.search(r'<LIBRARY_STRATEGY>([A-Za-z0-9_=" \(\)]*)</LIBRARY_STRATEGY>', expStr)
    if(m):
        strategy = "Library strategy: "  + m.groups()[0]
        print(strategy)
    m = re.search(r'<LIBRARY_SOURCE>([A-Za-z0-9_=" \(\)]*)</LIBRARY_SOURCE>', expStr)
    if(m):
        src = "Library source: "  + m.groups()[0]    
        print(src)
    m = re.search(r'total_spots="([0-9]*)" total_bases="([0-9]*)"',expStr)
    if(m):
        spots = "Total spots:" + m.groups()[0] +"\nTotal bases:" + m.groups()[1]
        print(spots)

    
Entrez.email = "my_email"
handle = Entrez.esearch(db="sra", term="E.coli K12 wgs PacBio", retmax = 10)
res = Entrez.read(handle)
#uncomment to see all fields:
#for el in res.keys():
#    print(el , " : ", res[el])

print("Entries found: {}".format(res["Count"]))

cnt = 1
for ids in res["IdList"]:
    print("\n[{}] Results for id {}:".format(cnt, ids))
    handle = Entrez.esummary(db="sra",  id = ids)
    #use a new name so that the search results stored in res are not overwritten
    summary = Entrez.read(handle)
    cnt += 1
          
    for r in summary:
        info = r['ExpXml']
        #print(info)
        parseExp(info)
        



Entries found: 9

[1] Results for id 357838:
Title: E. coli K12 PacBio RS C2 CCS sequencing
Experiment accession: SRX255779
Instrument: PACBIO_SMRT
Library strategy: WGS
Library source: GENOMIC
Total spots:1798302
Total bases:4228754616

[2] Results for id 357018:
Title: E. coli K12 MG1655 PacBio RS C2 Sequencing
Experiment accession: SRX255228
Instrument: PACBIO_SMRT
Library strategy: WGS
Library source: GENOMIC
Total spots:1389597
Total bases:3032527263

[3] Results for id 357016:
Title: E. Coli K12 454 Sequencing To Use For PacBio RS Correction
Experiment accession: SRX255226
Instrument: LS454
Library strategy: WGS
Library source: GENOMIC
Total spots:1385764
Total bases:720878709

[4] Results for id 285700:
Title: WGA E. coli genome DNA sequenced by C2 chemistry (Control)
Experiment accession: SRX209661
Instrument: PACBIO_SMRT
Library strategy: WGS
Library source: GENOMIC
Total spots:245223
Total bases:373483301

[5] Results for id 285699:
Title: WGA E. coli genome DNA sequenced by 
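
The solution above needs a network connection to query SRA. As a minimal offline sketch, the same parsing idea can be tried on a shortened excerpt of the sample "ExpXml" string given in the exercise text. The `PATTERNS` table and the function `parse_exp_to_dict` are hypothetical names introduced here for illustration, and the patterns use the looser classes `[^<]*` and `[^"]*` instead of the explicit character classes of the solution.

```python
import re

# shortened excerpt of the sample "ExpXml" string from the exercise text
sample = ('<Summary><Title>E. coli K12 PacBio RS C2 CCS sequencing</Title>'
          '<Platform instrument_model="PacBio RS">PACBIO_SMRT</Platform>'
          '<Statistics total_runs="22" total_spots="1798302" '
          'total_bases="4228754616" total_size="16799546700"/></Summary>'
          '<Experiment acc="SRX255779" ver="1" status="public"/>'
          '<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>'
          '<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>')

# hypothetical table: output label -> pattern with one capturing group
PATTERNS = {
    "Title": r"<Title>([^<]*)</Title>",
    "Experiment accession": r'<Experiment acc="([^"]*)"',
    "Instrument": r"<Platform[^>]*>([^<]*)</Platform>",
    "Library strategy": r"<LIBRARY_STRATEGY>([^<]*)</LIBRARY_STRATEGY>",
    "Library source": r"<LIBRARY_SOURCE>([^<]*)</LIBRARY_SOURCE>",
    "Total spots": r'total_spots="([0-9]*)"',
    "Total bases": r'total_bases="([0-9]*)"',
}


def parse_exp_to_dict(exp_str):
    """Return a dict label -> captured text for every pattern that matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        m = re.search(pattern, exp_str)
        if m:
            found[label] = m.group(1)
    return found


info = parse_exp_to_dict(sample)
for label, value in info.items():
    print("{}: {}".format(label, value))
```

Returning a dictionary instead of printing inside the function makes the parser easy to check with the unit tests seen earlier in this practical.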

</div>

5. The file [DNA_seq.fasta](file_samples/DNA_seq.fasta) contains a synthetic DNA sequence. Let's assume we have two restriction enzymes, LagI and JagII, that cut at the sites CNC/ATT and GAGRK/TNG respectively (where N is any base, R is A or G and K is A or C or T). Note that "/" only marks where the enzyme cuts: it does not need to appear in the regular expression, but we need to take it into account when we cut the DNA.

Ex. if the sequence is:
ATACATTCCCCCGGAATCGCCCCCCCTCCATTCC
digesting the sequence with LagI would give:
["ATAC","ATTCCCCCGGAATCGCCCCCCCTCC", "ATTCC"]
digesting this further with JagII would give:
["ATAC","ATTCCCCCGGAA", "TCGCCCCCCCTCC", "ATTCC"]

Write a python script that simulates a digestion with LagI, a size selection that keeps only the fragments longer than 50 base pairs, and a digestion with JagII, printing the lengths of the obtained fragments. What happens to the fragments if we digest first with JagII and then with LagI?

<div class="tggle" onclick="toggleVisibility('ex5');">Show/Hide Solution</div>
<div id="ex5" style="display:none;">

In [37]:
%reset -f 

from Bio import SeqIO
import re

#overhang is the number of bases of the site that come before the cut:
#3 for LagI (CNC/ATT) and 5 for JagII (GAGRK/TNG)
def digestSequence(seq, regex, overhang):
    digests = []
    sP = 0
    #finditer always returns an iterator (possibly empty): no check is needed
    print("\tRestriction sites:")
    for site in re.finditer(regex,seq):
        print("\t{} {} {}".format(site.start(),
                               site.end(),
                               site.group()))
        digests.append(seq[sP:site.start()+overhang])
        sP = site.start() + overhang
    #last element:
    digests.append(seq[sP:])
    return digests

fn = "file_samples/DNA_seq.fasta"

myseq = SeqIO.read(fn, "fasta")

regexLagI = "C[ATCG]CATT"
regexJagII = "GAG[AG][ACT]T[ATCG]G"


s = str(myseq.seq)

print("Initial sequence:")
print(s)
print("\n")

        
print("LagI restriction:")
digests = digestSequence(s, regexLagI, 3)
#filter digests
dig = [x for x in digests if len(x) > 50]
print("Lengths:")
print([len(x) for x in dig])
finalDigests = []

print("\nJagII restriction:")
for d in dig:
    tmp = digestSequence(d, regexJagII, 5)
    for t in tmp:
        finalDigests.append(t)

#print("Final digests:" + str(finalDigests))
print("Final lengths: " + str([len(x) for x in finalDigests])) 


#The other way round: JagII first and LagI second:
print("\n###########################")
print("##### JagII first #########")
print("###########################")
print("JagII restriction:")
digests = digestSequence(s, regexJagII, 5)
#filter digests
dig = [x for x in digests if len(x) > 50]
print("Lengths:")
print([len(x) for x in dig])
finalDigests = []

print("\nLagI restriction:")
for d in dig:
    tmp = digestSequence(d, regexLagI, 3)
    for t in tmp:
        finalDigests.append(t)

#print("Final digests:" + str(finalDigests))
print("Final lengths: " + str([len(x) for x in finalDigests])) 



Initial sequence:
CTTACATGGCAATAACCCCCCGTTTCTACTTCTAGAGGAGAAAAGTATTGACATGAGCGCTCCCGGCACAAGGGCCAAAGAAGTCTCCAATTTCTTATTTCCGAATGACATGCGTCTCCTTGCGGGTAAATCACCGACCGCAATTCATAGAAGCCTGGGGGAACAGATAGGTCTAATTAGCTTAAGAGAGTAAATCCTGGGATCATTCAGTAGTAACCATAAACTTACGCTGGGGCTTCTTCGGCGGATTTTTACAGTTACCAACCAGGAGATTTGAAGTAAATCAGTTGAGGATTTAGCCGCGCTATCCGGTAATCTCCAAATTAAAACATACCGTTCCATGAAGGCTAGAATTACTTACCGGCCTTTTCCATGCCTGCGCTATACCCCCCCACTCTCCCGCTTATCCGTCCGAGCGGAGGCAGTGCGATCCTCCGTTAAGATATTCTTACGTGTGACGTAGCTATGTATTTTGCAGAGCTGGCGAACGCGTTGAACACTTCACAGATGGTAGGGATTCGGGTAAAGGGCGTATAATTGGGGACTAACATAGGCGTAGACTACGATGGCGCCAACTCAATCGCAGCTCGAGCGCCCTGAATAACGTACTCATCTCAACTCATTCTCGGCAATCTACCGAGCGACTCGATTATCAACGGCTGTCTAGCAGTTCTAATCTTTTGCCAGCATCGTAATAGCCTCCAAGAGATTGATGATAGCTATCGGCACAGAACTGAGACGGCGCCGATGGATAGCGGACTTTCGGTCAACCACAATTCCCCACGGGACAGGTCCTGCGGTGCGCATCACTCTGAATGTACAAGCAACCCAAGTGGGCCGAGCCTGGACTCAGCTGGTTCCTGCGTGAGCTCGAGACTCGGGATGACAGCTCTTTAAACATAGAGCGGGGGCGTCGAACGGTCGAGAAAGTCATAGTACCTCGGGTACCAACTTACTCAGGTTATTGCTTGAAGCTGTACTA
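
In the solution above the regular expressions and the cut positions (3 and 5) are written by hand. As a minimal sketch, both could instead be derived from the site notation itself. The names `CODES`, `site_to_regex` and `digest` are hypothetical, the letter codes are exactly those defined in this exercise, and the `toy` sequence is made up for the example.

```python
import re

# ambiguity codes exactly as defined in the text of this exercise
CODES = {"N": "[ACGT]", "R": "[AG]", "K": "[ACT]"}


def site_to_regex(site):
    """Turn a site such as 'CNC/ATT' into (regex, cut position)."""
    cut = site.index("/")          # bases of the site before the cut
    letters = site.replace("/", "")
    regex = "".join(CODES.get(base, base) for base in letters)
    return regex, cut


def digest(seq, site):
    """Cut seq at every (non-overlapping) occurrence of site."""
    regex, cut = site_to_regex(site)
    fragments = []
    start = 0
    for m in re.finditer(regex, seq):
        fragments.append(seq[start:m.start() + cut])
        start = m.start() + cut
    fragments.append(seq[start:])  # what is left after the last cut
    return fragments


toy = "AAACGCATTGGGAGATTAGCC"      # made-up sequence with one site each
print(site_to_regex("CNC/ATT"))
print(digest(toy, "CNC/ATT"))
print(digest(toy, "GAGRK/TNG"))
```

With this approach the cut position always agrees with the site string, so it cannot drift out of sync with the regular expression.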

</div>