In [None]:
import jupman
jupman.init()

# Practical 6

In this practical we will see how to define functions to reuse code, talk about the scope of variables and finally see how to deal with files in Python.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical6.pdf)

## Functions

A function is a block of code that has a name and performs a task. A function can be thought of as a box that gets an input and returns an output. Why should we use functions? For many reasons, including: 
    1. *Reduce code duplication*: put in functions the parts of code that are needed several times in the program, so that you don't need to repeat the same code over and over again;
    2. *Decompose a complex task*: make the code easier to write and understand by splitting the whole program into several simpler functions.

Both improve code readability and make your code easier to understand.

The basic definition of a function is:
```
def function_name(input) :
    #code implementing the function
    ...
    ...
    return return_value
```

Functions are defined with the **def** keyword that precedes the *function_name*, followed by a list of parameters in parentheses. A colon **:** is used to end the line holding the definition of the function. The code implementing the function is specified by using indentation. A function **might** or **might not** return a value. In the first case a **return** statement is used.


**Example:** 
Define a function that implements the sum of two integer lists (note that there is no check that the two lists actually contain integers and that they have the same size).

In [5]:
def int_list_sum(l1,l2):
    """implements the sum of two lists of integers having the same size  
    """
    ret =[]
    for i in range(len(l1)):
        ret.append(l1[i] + l2[i])
    return ret

L1 = list(range(1,10))
L2 = list(range(20,30))
print("L1:", L1)
print("L2:", L2)

res = int_list_sum(L1,L2)

print("L1+L2:", res)

res = int_list_sum(L1,L1)

print("L1+L1", res)

L1: [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
L1+L2: [21, 23, 25, 27, 29, 31, 33, 35, 37]
L1+L1 [2, 4, 6, 8, 10, 12, 14, 16, 18]


Note that once the function has been defined, it can be called as many times as wanted with different input parameters. Moreover, **a function does not do anything until it is actually called.**
A function can return **0** (in this case the return value would be "None"), **1** or **more** results. Notice also that collecting the results of a function is **not mandatory**.
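As a minimal sketch of these three cases (the function names `greet`, `square` and `min_max` are made up for this illustration):

```python
def greet(name):
    """prints a greeting and returns nothing"""
    print("Hello", name)

def square(x):
    """returns one result"""
    return x * x

def min_max(values):
    """returns two results, packed in a tuple"""
    return min(values), max(values)

nothing = greet("Python")    # the call prints, the returned value is None
print("greet returned:", nothing)

square(4)                    # result not collected: it is simply discarded
sq = square(4)               # result collected in a variable
print("square:", sq)

lowest, highest = min_max([7, 1, 125, 4])   # the tuple is unpacked into two names
print("min:", lowest, "max:", highest)
```

Unpacking the returned tuple into separate names is often more readable than indexing it with `result[0]`, `result[1]`, and so on.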

**Example:**
Let's write a function that, given a list of elements, prints only the even-placed ones without returning anything.

In [6]:
def get_even_placed(myList):
    """prints the even placed elements of myList"""
    ret = [myList[i] for i in range(len(myList)) if i % 2 == 0]
    print(ret)

L1 = ["hi", "there", "from","python","!"]
L2 = list(range(13))

print("L1:", L1)
print("L2:", L2)

print("even L1:")
get_even_placed(L1)
print("even L2:")       
get_even_placed(L2)

L1: ['hi', 'there', 'from', 'python', '!']
L2: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
even L1:
['hi', 'from', '!']
even L2:
[0, 2, 4, 6, 8, 10, 12]


**Note that the function above is polymorphic** (i.e. it works on several data types, provided that we can iterate through them).
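To make the point concrete, here is a small sketch using a hypothetical variant called `even_placed`, which returns the list instead of printing it. The same code works on any object that supports `len()` and indexing:

```python
def even_placed(sequence):
    """returns the elements of sequence found at even indexes (0, 2, 4, ...)"""
    return [sequence[i] for i in range(len(sequence)) if i % 2 == 0]

print(even_placed(["hi", "there", "from", "python", "!"]))  # a list
print(even_placed("python"))                                # a string
print(even_placed((10, 20, 30, 40)))                        # a tuple
print(even_placed(range(7)))                                # a range
```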

**Example:**
Let's write a function that, given a list of integers, returns the number of elements, the maximum and minimum.

In [9]:
def get_info(myList):
    """returns len of myList, min and max value (assumes elements are integers)"""
    tmp = myList[:] #copy the input list
    tmp.sort()
    return len(tmp), tmp[0], tmp[-1] #return type is a tuple

A = [7, 1, 125, 4, -1, 0]

print(A)
result = get_info(A)
print("Len:", result[0], "Min:", result[1], "Max:",result[2] )

    

[7, 1, 125, 4, -1, 0]
Len: 6 Min: -1 Max: 125


Please note that the return value above is actually a tuple.
Importantly, a function needs to be defined (i.e. its code has to be written) before it can actually be used. 

In [7]:
A = [1,2,3]
my_sum(A)

def my_sum(myList):
    ret = 0
    for el in myList:
        ret += el
    return ret

        

NameError: name 'my_sum' is not defined
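A possible fix is simply to move the definition above the first call. The sketch below also shows that what matters is the moment of the *call*, not the position in the file: the hypothetical functions `caller` and `helper` work even though `helper` is written after `caller`, because `helper` is looked up only when `caller()` actually runs.

```python
def my_sum(myList):
    ret = 0
    for el in myList:
        ret += el
    return ret

A = [1, 2, 3]
print(my_sum(A))      # my_sum is already defined here, so the call works

def caller():
    # helper is looked up when caller() is executed, not when it is defined
    return helper() + 1

def helper():
    return 41

print(caller())       # both functions exist by the time of this call
```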

## Namespace and variable scope

**Namespaces** are mappings from *names* to objects, or in other words places where names are associated to objects. Namespaces can be considered as the context. According to Python's reference a **scope** is a *textual region of a Python program, where a namespace is directly accessible*, which means that Python will look into that *namespace* to find the object associated to a name. Four **namespaces** are made available by Python:

    1. **Local**: the innermost that contains local names;
    
    2. **Enclosing**: the scope of the enclosing function, 
    it does not contain local nor global names;
    
    3. **Global**: contains the global names;
    
    4. **Built-in**: contains all built-in names 
    (e.g. print, len, range, ...)
    
When one refers to a name, Python tries to find it in the current namespace; if it is not found there, it continues looking in the namespaces that enclose it until the built-in namespace is reached. If the name is not found there either, the Python interpreter will raise a **NameError** exception, meaning it cannot find the name. The order in which namespaces are considered is: Local, Enclosing, Global and Built-in (LEGB).  

Consider the following example:

In [19]:
def my_function():
    var = 1  #local variable
    print("Local:", var)
    b = "my string"
    print("Local:", b)
    
var = 7 #global variable
my_function()
print("Global:", var)
print(b)
    

Local: 1
Local: my string
Global: 7


NameError: name 'b' is not defined

Variables defined within a function can only be seen within the function. That is why variable b, which is defined only within my_function, causes a NameError when it is printed outside of it. Variables defined outside all functions are **global** to the whole program.
The local variable var exists only in the namespace of my_function, while outside the function var has its global value.

And the following:

In [21]:
def outer_function():
    var = 1 #outer
    
    def inner_function():
        var = 2 #inner
        print("Inner:", var)
        print("Inner:", B)    
    
    inner_function()
    print("Outer:", var)
    
    
var = 3 #global
B = "This is B"
outer_function()
print("Global:", var)
print("Global:", B)

Inner: 2
Inner: This is B
Outer: 1
Global: 3
Global: This is B


Note in particular that the variable B is global, therefore it is accessible everywhere, including inside the inner_function. On the contrary, the var assigned within inner_function exists only in the namespace of inner_function: outer_function and the global scope each have their own var, with different values, as shown in the example.

In a nutshell, remember the three simple rules seen in the lecture. Within a **def**:

    1. Name assignments create local names by default;
    2. Name references search the following four scopes in the order: 
    local, enclosing functions (if any), then global and finally built-in (LEGB)
    3. Names declared in global and nonlocal statements map assigned names to 
    enclosing module and function scopes.
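The third rule can be illustrated with a short sketch. The names `counter`, `increment` and `make_accumulator` are invented for this example; it only shows how **global** and **nonlocal** redirect an assignment to the module scope or to the enclosing function scope.

```python
counter = 0                  # global name

def increment():
    global counter           # assignments below refer to the global name
    counter += 1

def make_accumulator():
    total = 0                # local to make_accumulator, enclosing for add
    def add(value):
        nonlocal total       # assignments below refer to the enclosing name
        total += value
        return total
    return add

increment()
increment()
print("counter:", counter)

acc = make_accumulator()
acc(10)
print("total:", acc(5))
```

Without the `global` and `nonlocal` statements, `counter += 1` and `total += value` would be treated as assignments to new local names (rule 1) and would fail because those local names have no value yet.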

## Argument passing

Three important things to bear in mind are:

1. Passing an argument is actually assigning an object to a local variable name;
2. Assigning an object to a variable name within a function **does not affect the caller**;
3. Changing a **mutable** object in place within a function **affects the caller**

Consider the following examples:

In [24]:
"""Assigning the argument does not affect the caller"""

def my_f(x):
    x = "local value" #local
    print("Local: ", x)

x = "global value" #global
my_f(x)
print("Global:", x)
my_f(x)



Local:  local value
Global: global value
Local:  local value


In [27]:
"""Changing a mutable affects the caller"""

def my_f(myList):
    myList[1] = "new value1"
    myList[3] = "new value2"
    print("Local: ", myList)

myList = ["old value"]*4
print("Global:", myList)
my_f(myList)
print("Global now: ", myList)

Global: ['old value', 'old value', 'old value', 'old value']
Local:  ['old value', 'new value1', 'old value', 'new value2']
Global now:  ['old value', 'new value1', 'old value', 'new value2']


Recall what was seen in the lecture:

![](img/pract6/argument_passing.png)

The behaviour above is because Python always passes a reference to the object and binds it to the local parameter name. **Immutable objects** cannot be changed in place, so the effect is the same as if they were passed **by value** (as if a copy were made), while **mutable objects** behave as if passed **by reference** (changing them in place effectively changes the original object). 

To avoid making changes to a **mutable object** passed as parameter one needs to **explicitly make a copy** of it.

Consider the example seen before.
**Example:**
Let's write a function that, given a list of integers, returns the number of elements, the maximum and minimum.

In [32]:
def get_info(myList):
    """returns len of myList, min and max value (assumes elements are integers)"""
    myList.sort()
    return len(myList), myList[0], myList[-1] #return type is a tuple


def get_info_copy(myList):
    """returns len of myList, min and max value (assumes elements are integers)"""
    tmp = myList[:] #copy the input list!!!!
    tmp.sort()
    return len(tmp), tmp[0], tmp[-1] #return type is a tuple

A = [7, 1, 125, 4, -1, 0]
B = [70, 10, 1250, 40, -10, 0, 10]

print("A:", A) 
result = get_info(A)
print("Len:", result[0], "Min:", result[1], "Max:",result[2] )

print("A now:", A) #whoops A is changed!!!

print("\n###### With copy now ########")

print("\nB:", B) 
result = get_info_copy(B)
print("Len:", result[0], "Min:", result[1], "Max:",result[2] )

print("B now:", B) #B is not changed!!!


A: [7, 1, 125, 4, -1, 0]
Len: 6 Min: -1 Max: 125
A now: [-1, 0, 1, 4, 7, 125]

###### With copy now ########

B: [70, 10, 1250, 40, -10, 0, 10]
Len: 7 Min: -10 Max: 1250
B now: [70, 10, 1250, 40, -10, 0, 10]
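One caveat, sketched below under the assumption that the list contains other mutable objects (here, lists of integers): the slice `myList[:]` copies only the outer list. The inner lists are still shared with the original, so changing them in place affects the caller again. The standard library function `copy.deepcopy` copies the nested objects as well.

```python
import copy

nested = [[3, 1, 2], [9, 8, 7]]

shallow = nested[:]              # new outer list, same inner lists
deep = copy.deepcopy(nested)     # inner lists are copied too

shallow[0].sort()                # changes an inner list shared with nested
deep[1].sort()                   # changes only the deep copy

print("nested: ", nested)
print("shallow:", shallow)
print("deep:   ", deep)
```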


### Positional arguments

Arguments can be passed to functions following the order in which they appear in the function definition.

Consider the following example:

In [33]:
def print_parameters(a,b,c,d):
    print("1st param:", a)
    print("2nd param:", b)
    print("3rd param:", c)
    print("4th param:", d)
    

print_parameters("A", "B", "C", "D")

1st param: A
2nd param: B
3rd param: C
4th param: D


### Passing arguments by keyword

Given the name of an argument as specified in the definition of the function, parameters can be passed using the **name = value** syntax.

For example:

In [42]:
def print_parameters(a,b,c,d):
    print("1st param:", a)
    print("2nd param:", b)
    print("3rd param:", c)
    print("4th param:", d)
    

print_parameters(a = 1, c=3, d=4, b=2)
print("\n###############\n")
print_parameters("first","second",d="fourth",c="third")
print("\n###############\n")


1st param: 1
2nd param: 2
3rd param: 3
4th param: 4

###############

1st param: first
2nd param: second
3rd param: third
4th param: fourth

###############



Arguments passed positionally and by name can be used at the same time, but arguments passed positionally must always be to the left of those passed by name. The following code in fact is not accepted by the Python interpreter:

In [43]:
def print_parameters(a,b,c,d):
    print("1st param:", a)
    print("2nd param:", b)
    print("3rd param:", c)
    print("4th param:", d)
    
print_parameters(d="fourth",c="third", "first","second")

SyntaxError: positional argument follows keyword argument (<ipython-input-43-4991b2c31842>, line 7)
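For comparison, here is a sketch of an accepted call, with the positional arguments first and the keyword arguments afterwards. It also shows a second kind of mistake, which is detected only when the call is executed: giving the same parameter a value both positionally and by name.

```python
def print_parameters(a, b, c, d):
    print("1st param:", a)
    print("2nd param:", b)
    print("3rd param:", c)
    print("4th param:", d)

# positional arguments first, keyword arguments afterwards: accepted
print_parameters("first", "second", d="fourth", c="third")

# 'a' receives a value both positionally and by keyword: rejected at call time
try:
    print_parameters("first", "second", "third", a="again")
    error = None
except TypeError as err:
    error = err
    print("TypeError:", err)
```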

### Specifying default values

During the definition of a function it is possible to specify default values. The syntax is the following:
```
def my_function(par1 = val1, par2 = val2, par3 = val3):
```

Consider the following example:

In [52]:
def print_parameters(a="defaultA", b="defaultB",c="defaultC"):
    print("a:",a)
    print("b:",b)
    print("c:",c)
    
print_parameters("param_A")
print("\n#################\n")
print_parameters(b="PARAMETER_B")

a: param_A
b: defaultB
c: defaultC

#################

a: defaultA
b: PARAMETER_B
c: defaultC
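Default values can be mixed with mandatory parameters, as long as the parameters without a default come first in the definition. A small sketch (the function `describe_gene` and its parameters are hypothetical, chosen only for illustration):

```python
def describe_gene(name, organism="H. sapiens", strand="+"):
    """name is mandatory, organism and strand have default values"""
    return name + " (" + organism + ", strand " + strand + ")"

print(describe_gene("PKLR"))                  # both defaults are used
print(describe_gene("PKLR", strand="-"))      # only strand is overridden
print(describe_gene("Pklr", "M. musculus"))   # organism passed positionally
```

Calling `describe_gene()` with no arguments would fail, because `name` has no default value.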


## Exercises

1. Implement a function that takes as input a string representing a DNA sequence and computes its reverse-complement. The function should preserve the case of each letter (i.e. A becomes T, but a becomes t). For simplicity, any character other than (A,T,C,G,a,t,c,g) is converted to a capital N.

    1. Apply the function to the DNA string "ATTACATATCATACTATCGCNTTCTAAATA"
    2. Apply the function to the DNA string "acaTTACAtagataATACTaccataGCNTTCTAAATA"
    3. Apply the function to the DNA string "TTTTACCKKKAKTUUUITTTARRRRRAIUTYYA"
    4. Check that the reverse complement of the reverse complement of the string in 1. is exactly the original string.

<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [16]:
def reverse_complement(DNA):
    """the function reverse complements a string of DNA"""
    revDict = {"A" : "T", "a" : "t",
             "C" : "G", "c" : "g",
             "G" : "C", "g" : "c",
             "T" : "A", "t" : "a"
            }
    result = ""
    for base in DNA:
        ### dict.get(val, default) : returns default if val not in
        result = revDict.get(base, "N") + result
        
    return result



S1 = "ATTACATATCATACTATCGCNTTCTAAATA"
S2 = "acaTTACAtagataATACTaccataGCNTTCTAAATA"
S3 = "TTTTACCKKKAKTUUUITTTARRRRRAIUTYYA"

print("S1   ", S1)
print("rc S1", reverse_complement(S1))

print("\nS2   ", S2)
print("rc S2", reverse_complement(S2))

print("\nS3   ", S3)
print("rc S3", reverse_complement(S3))

print("\nrev_comp(rev_comp(S1) == S1?)", reverse_complement(reverse_complement(S1)) == S1)

S1    ATTACATATCATACTATCGCNTTCTAAATA
rc S1 TATTTAGAANGCGATAGTATGATATGTAAT

S2    acaTTACAtagataATACTaccataGCNTTCTAAATA
rc S2 TATTTAGAANGCtatggtAGTATtatctaTGTAAtgt

S3    TTTTACCKKKAKTUUUITTTARRRRRAIUTYYA
rc S3 TNNANNTNNNNNTAAANNNNANTNNNGGTAAAA

rev_comp(rev_comp(S1) == S1?) True
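An alternative sketch of the same idea (the name `reverse_complement_join` is made up here): instead of prepending one character at a time to a growing string, it walks the input backwards and joins all complemented bases in a single step, which tends to scale better on long sequences.

```python
def reverse_complement_join(DNA):
    """reverse complements a string of DNA, mapping unknown characters to N"""
    revDict = {"A": "T", "a": "t",
               "C": "G", "c": "g",
               "G": "C", "g": "c",
               "T": "A", "t": "a"}
    # reversed() yields the characters from last to first
    return "".join([revDict.get(base, "N") for base in reversed(DNA)])

S1 = "ATTACATATCATACTATCGCNTTCTAAATA"
print("S1   ", S1)
print("rc S1", reverse_complement_join(S1))
```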


</div>

2. Write the following python functions and test them with some parameters of your choice: 

    1. *getDivisors*: the function has a positive integer as parameter and returns a list of all the positive divisors of the integer in input (excluding the number itself). Example: getDivisors(6) --> [1,2,3]
    2. *checkSum*: the function has a list and an integer as parameters and returns True if the sum of all elements in the list  equals the integer, False otherwise. Example: checkSum([1,2,3], 6) --> True, checkSum([1,2,3],1) --> False.
    3. *checkPerfect*: the function gets an integer as parameter and returns True if the integer is a [perfect number](https://en.wikipedia.org/wiki/Perfect_number), False otherwise. A number is perfect if all its divisors (excluding itself) sum to its value. Example: checkPerfect(6) --> True because 1+2+3 = 6. Hint: use the functions implemented before.
    4. *getFirstNPerfects*: the function gets an integer N as parameter and returns a dictionary with the first N perfect numbers. The key of the dictionary is the perfect number, while the value of the dictionary is the list of its divisors. Example: getFirstNPerfects(1) --> {6 : [1,2,3]}
    
Get and print the first 4 perfect numbers and finally test if 33550336 is a perfect number.

**WARNING:** do not try to find more than 4 perfect numbers as it might take a while!!!

<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [84]:
def getDivisors(intVal):
    """returns the integer divisors of intVal"""
    ret = [x for x in range(1,intVal//2 + 1) if intVal % x == 0]
    #OR:
    #for i in range(1,intVal//2+1):
    #    if(intVal % i == 0):
    #        ret.append(i)
    return ret

def checkSum(intList, intVal):
    """checks if the sum of elements in intList equals intVal"""
    s = 0
    for x in intList:
        s += x
    return (s == intVal)

def checkPerfect(intVal):
    """checks if intVal is a perfect number"""
    divisors = getDivisors(intVal)
    return checkSum(divisors,intVal)

def getFirstNPerfects(N):
    """Finds the first N perfect numbers"""
    i = 0
    val = 2
    ret = {}
    while(i<N):
        if(checkPerfect(val)):
            i+=1
            ret[val] = getDivisors(val)
            val += 1
        else:
            val += 1
    
    return ret
            
        
perfects = getFirstNPerfects(4)
perKeys = list(perfects.keys())
perKeys.sort()
for p in perKeys:
    print(p, " = ", "+".join([str(x) for x in perfects[p]]))
    
print("Is 33550336 a perfect number?", checkPerfect(33550336))

6  =  1+2+3
28  =  1+2+4+7+14
496  =  1+2+4+8+16+31+62+124+248
8128  =  1+2+4+8+16+32+64+127+254+508+1016+2032+4064
Is 33550336 a perfect number? True
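Checking 33550336 with the solution above requires testing millions of candidate divisors. A possible speed-up, sketched here under the assumption that Python 3.8 or later is available (for `math.isqrt`), is to test candidates only up to the square root of the number: every divisor x found there has a partner intVal // x above it. The name `getDivisorsFast` is hypothetical.

```python
import math

def getDivisorsFast(intVal):
    """returns the sorted positive divisors of intVal, excluding intVal itself"""
    small = []   # divisors up to the square root
    large = []   # their partners, collected in decreasing order
    for x in range(1, math.isqrt(intVal) + 1):
        if intVal % x == 0:
            if x != intVal:
                small.append(x)
            partner = intVal // x
            if partner != x and partner != intVal:
                large.append(partner)
    return small + large[::-1]

print(getDivisorsFast(28))
print("Is 33550336 a perfect number?", sum(getDivisorsFast(33550336)) == 33550336)
```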


</div>

<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [104]:
text = "nOBody Said iT was eAsy, No oNe Ever saId it WoulD be tHis hArd…"
initials = [x for x in text if x.isupper()]

print(initials)
print("*".join(initials))

['O', 'B', 'S', 'T', 'A', 'N', 'N', 'E', 'I', 'W', 'D', 'H', 'A']
O*B*S*T*A*N*N*E*I*W*D*H*A


</div>

3. Given the following list of gene correlations:
```
geneCorr = [["G1C2W9", "G1C2Q7", 0.2], ["G1C2W9", "G1C2Q4", 0.9], 
["Q6NMS1", "G1C2W9", 0.8],["G1C2W9", "Q6NMS1",0.4], ["G1C2Q7", "G1C2Q4",0.76]]
```
where each sublist ["gene1", "gene2", corr] represents a correlation between *gene1* and *gene2* with correlation *corr*, create another list containing only the elements having a high correlation (i.e. > 0.75). Print this list.

Expected result:
```
[['G1C2W9', 'G1C2Q4', 0.9], ['Q6NMS1', 'G1C2W9', 0.8], ['G1C2Q7', 'G1C2Q4', 0.76]]
```

<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [79]:
geneCorr = [["G1C2W9", "G1C2Q7", 0.2], ["G1C2W9", "G1C2Q4", 0.9], ["Q6NMS1", "G1C2W9", 0.8],
            ["G1C2W9", "Q6NMS1",0.4], ["G1C2Q7", "G1C2Q4",0.76]]

highlyCorr = [x for x in geneCorr if x[2] > 0.75]

print(geneCorr, "\n")
print(highlyCorr)

[['G1C2W9', 'G1C2Q7', 0.2], ['G1C2W9', 'G1C2Q4', 0.9], ['Q6NMS1', 'G1C2W9', 0.8], ['G1C2W9', 'Q6NMS1', 0.4], ['G1C2Q7', 'G1C2Q4', 0.76]] 

[['G1C2W9', 'G1C2Q4', 0.9], ['Q6NMS1', 'G1C2W9', 0.8], ['G1C2Q7', 'G1C2Q4', 0.76]]


</div>
4. Given the following sequence of DNA:

DNA = "GATTACATATATCAGTACAGATATATACGCGCGGGCTTACTATTAAAAACCCC"
    
    1. Create a dictionary reporting the frequency of each base (i.e. key is the 
    base and value is the frequency).
    2. Create a dictionary representing an index of all possible dimers (i.e. 2 
    bases, 16 dimers in total): AA, AT, AC, AG, TA, TT, TC, TG, ... . In this case, 
    keys of the dictionary are dimers and values are lists with all possible starting 
    positions of the dimer.  
    3. Print the DNA string.
    4. Print for each base its frequency.
    5. Print all positions of the dimer "AT".

The expected result is:
```
sequence: GATTACATATATCAGTACAGATATATACGCGCGGGCTTACTATTAAAAACCCC
G has frequency: 0.1509433962264151
C has frequency: 0.22641509433962265
A has frequency: 0.3584905660377358
T has frequency: 0.2641509433962264
{'GG': [32, 33], 'TC': [11], 'GT': [14], 'CA': [5, 12, 17], 'TT': [2, 36, 42], 
'CG': [27, 29, 31], 'TA': [3, 7, 9, 15, 21, 23, 25, 37, 40, 43], 'AG': [13, 18], 
'GA': [0, 19], 'CT': [35, 39], 'GC': [28, 30, 34], 'AT': [1, 6, 8, 10, 20, 22, 24, 41], 
'CC': [49, 50, 51], 'AA': [44, 45, 46, 47], 'AC': [4, 16, 26, 38, 48]} 

Dimer AT is found at: [1, 6, 8, 10, 20, 22, 24, 41]
```

<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [90]:
DNA = "GATTACATATATCAGTACAGATATATACGCGCGGGCTTACTATTAAAAACCCC"

n = len(DNA)

baseFreq = {"A" : DNA.count("A")/n, "T" : DNA.count("T")/n, 
            "C": DNA.count("C")/n, "G" : DNA.count("G")/n }

dimersDict ={}

print("sequence:", DNA)

for base in baseFreq:
    print(base, "has frequency:", baseFreq[base])
    
for ind in range(len(DNA) -1 ): #need -1 because at each iteration I get the dimer [ind:ind+2]
    dimer = DNA[ind:ind+2]
    if(dimer in dimersDict):
        dimersDict[dimer].append(ind)
    else:
        dimersDict[dimer] = [ind]
print(dimersDict, "\n")
print("Dimer AT is found at:", dimersDict["AT"])

sequence: GATTACATATATCAGTACAGATATATACGCGCGGGCTTACTATTAAAAACCCC
G has frequency: 0.1509433962264151
C has frequency: 0.22641509433962265
A has frequency: 0.3584905660377358
T has frequency: 0.2641509433962264
{'GG': [32, 33], 'TC': [11], 'GT': [14], 'CA': [5, 12, 17], 'TT': [2, 36, 42], 'CG': [27, 29, 31], 'TA': [3, 7, 9, 15, 21, 23, 25, 37, 40, 43], 'AG': [13, 18], 'GA': [0, 19], 'CT': [35, 39], 'GC': [28, 30, 34], 'AT': [1, 6, 8, 10, 20, 22, 24, 41], 'CC': [49, 50, 51], 'AA': [44, 45, 46, 47], 'AC': [4, 16, 26, 38, 48]} 

Dimer AT is found at: [1, 6, 8, 10, 20, 22, 24, 41]
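In the spirit of this practical, the same solution could be packaged into reusable functions. The sketch below uses two hypothetical names, `base_frequencies` and `kmer_index`; the second one takes the length `k` as a parameter with default value 2, so it indexes dimers unless told otherwise.

```python
def base_frequencies(seq):
    """returns a dictionary base -> relative frequency of the base in seq"""
    freqs = {}
    for base in seq:
        freqs[base] = freqs.get(base, 0) + 1    # count occurrences
    for base in freqs:
        freqs[base] = freqs[base] / len(seq)    # turn counts into frequencies
    return freqs

def kmer_index(seq, k=2):
    """returns a dictionary k-mer -> list of its starting positions in seq"""
    index = {}
    for ind in range(len(seq) - k + 1):
        # setdefault creates the empty list the first time a k-mer is seen
        index.setdefault(seq[ind:ind + k], []).append(ind)
    return index

DNA = "GATTACATATATCAGTACAGATATATACGCGCGGGCTTACTATTAAAAACCCC"
print(base_frequencies(DNA))
print("Dimer AT is found at:", kmer_index(DNA)["AT"])
```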


</div>

5. Given the following table, reporting molecular weights for each amino acid, store them in a dictionary where the key is the one letter code and the value is the molecular weight (e.g. {"A" : 89, "R" : 174}).

![](img/pract5/molecular_weights.png)


Write a python script to answer the following questions:

    1. What is the average molecular weight of an amino acid?
    2. What is the total molecular weight and number of amino acids 
    of the P53 peptide GSRAHSSHLKSKKGQSTSRHK?
    3. What is the total molecular weight and number of amino acids 
    of the peptide YTSLIHSLIEESQNQQEKNEQELLELDKWASLWNWF?

<div class="tggle" onclick="toggleVisibility('ex5');">Show/Hide Solution</div>
<div id="ex5" style="display:none;">

In [62]:
mws = {"A" : 89, "R" : 174, "N" : 132, "D" : 133, "B": 133, "C": 121, "Q": 146, 
       "E": 147, "Z": 147, "G": 75, "H": 155, "I": 131, "L" : 131,
      "K" : 146, "M" : 149, "F" : 165, "P" : 115, "S" : 105, "T" : 119, 
       "W" : 204, "Y": 181, "V" : 117
      }
avgW = 0
for amino in mws:
    avgW = avgW + mws[amino]

avgW = avgW/len(mws)
print("The average molecular weight of amino acids is:", avgW)

P53amino = "GSRAHSSHLKSKKGQSTSRHK"

totW = 0
for amino in P53amino:
    totW += mws[amino]
print("Peptide ", P53amino, "has",len(P53amino), "amino acids and a total mw of",totW, "Da")

totW = 0 #reusing same variable name
peptide = "YTSLIHSLIEESQNQQEKNEQELLELDKWASLWNWF"
for amino in peptide:
    totW += mws[amino]
    
print("Peptide ", peptide, "has",len(peptide), "amino acids and a total mw of",totW, "Da")

The average molecular weight of amino acids is: 137.04545454545453
Peptide  GSRAHSSHLKSKKGQSTSRHK has 21 amino acids and a total mw of 2662 Da
Peptide  YTSLIHSLIEESQNQQEKNEQELLELDKWASLWNWF has 36 amino acids and a total mw of 5076 Da
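The two loops above do the same work on different peptides, which makes them a natural candidate for a function. A minimal sketch (the name `peptide_weight` is an assumption; the weights are the ones used in the solution above):

```python
def peptide_weight(peptide, weights):
    """returns the sum of the weights of all residues in peptide"""
    total = 0
    for amino in peptide:
        total += weights[amino]
    return total

mws = {"A": 89, "R": 174, "N": 132, "D": 133, "B": 133, "C": 121, "Q": 146,
       "E": 147, "Z": 147, "G": 75, "H": 155, "I": 131, "L": 131,
       "K": 146, "M": 149, "F": 165, "P": 115, "S": 105, "T": 119,
       "W": 204, "Y": 181, "V": 117}

for pep in ["GSRAHSSHLKSKKGQSTSRHK", "YTSLIHSLIEESQNQQEKNEQELLELDKWASLWNWF"]:
    print("Peptide", pep, "has", len(pep), "amino acids and a total mw of",
          peptide_weight(pep, mws), "Da")
```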


</div>

6. The following string is an extract of a [blast](https://www.ncbi.nlm.nih.gov/pubmed/2231712) alignment with compacted textual output. This is a tab (\\t) separated text file where the columns report the following info: 
the first column is the query_id, the second the subject_id (i.e. the reference on which we aligned the query), the third is the percentage of identity and then we have the alignment length, number of mismatches, gap opens, start point of the alignment on the query, end point of the alignment on the query, start point of the alignment on the subject, end point of the alignment on the subject and evalue of the alignment.

```
# Fields: query id,subject id,% identity,alignment length,mismatches,gap opens,q.start,q.end,s.start,s.end,evalue
ab1_400	scaffold16155	98.698	384	4	1	12	394	6700	7083	0.0
ab1_400	scaffold14620	98.698	384	4	1	12	394	1240	857	0.0
92A2_SP6_344	scaffold14394	95.575	113	5	0	97	209	250760	250648	2.92e-44
92A2_SP6_344	scaffold10682	97.849	93	2	0	18	110	898	990	3.81e-38
92A2_T7_558	scaffold277	88.746	311	31	3	21	330	26630	26937	5.81e-103
92A2_T7_558	scaffold277	89.545	220	21	2	27	246	27167	26950	6.06e-73
92A2_T7_558	scaffold1125	88.125	320	31	5	30	346	231532	231847	7.51e-102
ab1_675	scaffold4896	100.000	661	0	0	15	675	79051	78391	0.0
ab1_676	scaffold4896	99.552	670	0	3	7	673	78421	79090	0.0
```

1. For each alignment, store the subject id, the percentage of identity and evalue, 
subject start and end in a dictionary using the query id as key. All this information can be
stored in a dictionary having the query id as key and a dictionary with all the information as value:
```
alignments["ab1_400"] = {"subjectid" : "scaffold16155", "perc_id" : 98.698, "evalue" : "0.0"}
```
2. Print the whole dictionary
3. Print only the alignments having percentage of identity > 90%

Note: skip the first comment line (i.e. skip line if starts with "#").
Note1: when storing the percentage of identity remember to convert the string into a float.

The expected output is shown below. Since a dictionary cannot hold two values for the same key, when a query id appears on several lines only its last alignment is kept (the order of the keys may differ):

```
{'92A2_SP6_344': {'perc_id': 97.849, 'subjectid': 'scaffold10682', 'evalue': '3.81e-38'}, 
'92A2_T7_558': {'perc_id': 88.125, 'subjectid': 'scaffold1125', 'evalue': '7.51e-102'}, 
'ab1_400': {'perc_id': 98.698, 'subjectid': 'scaffold14620', 'evalue': '0.0'}, 
'ab1_676': {'perc_id': 99.552, 'subjectid': 'scaffold4896', 'evalue': '0.0'}, 
'ab1_675': {'perc_id': 100.0, 'subjectid': 'scaffold4896', 'evalue': '0.0'}} 


Alignments with identity > 90%:

Query id	Subject id	% ident	evalue
92A2_SP6_344 	 scaffold10682 	 97.849 	 3.81e-38
ab1_400 	 scaffold14620 	 98.698 	 0.0
ab1_676 	 scaffold4896 	 99.552 	 0.0
ab1_675 	 scaffold4896 	 100.0 	 0.0
```

<div class="tggle" onclick="toggleVisibility('ex6');">Show/Hide Solution</div>
<div id="ex6" style="display:none;">

In [99]:
blast_out = """# Fields: query id,subject id,% identity,alignment length,mismatches,gap opens,q.start,q.end,s.start,s.end,evalue
ab1_400	scaffold16155	98.698	384	4	1	12	394	6700	7083	0.0
ab1_400	scaffold14620	98.698	384	4	1	12	394	1240	857	0.0
92A2_SP6_344	scaffold14394	95.575	113	5	0	97	209	250760	250648	2.92e-44
92A2_SP6_344	scaffold10682	97.849	93	2	0	18	110	898	990	3.81e-38
92A2_T7_558	scaffold277	88.746	311	31	3	21	330	26630	26937	5.81e-103
92A2_T7_558	scaffold277	89.545	220	21	2	27	246	27167	26950	6.06e-73
92A2_T7_558	scaffold1125	88.125	320	31	5	30	346	231532	231847	7.51e-102
ab1_675	scaffold4896	100.000	661	0	0	15	675	79051	78391	0.0
ab1_676	scaffold4896	99.552	670	0	3	7	673	78421	79090	0.0"""

alignments = dict()

for align in blast_out.split("\n"):
    if(align.startswith("#") == False):
        align_info = align.split("\t")
        alignments[align_info[0]] = {"subjectid" : align_info[1], 
                                     "perc_id" : float(align_info[2]),
                                    "evalue" : align_info[-1]
                                    } 

print(alignments, "\n\n")
print("Alignments with identity > 90%:\n")
print("Query id\tSubject id\t% ident\tevalue")
for a in alignments:
    if(alignments[a]["perc_id"]>90):
        print(a,"\t", alignments[a]["subjectid"],"\t", alignments[a]["perc_id"],"\t", alignments[a]["evalue"])


{'ab1_675': {'perc_id': 100.0, 'subjectid': 'scaffold4896', 'evalue': '0.0'}, 'ab1_676': {'perc_id': 99.552, 'subjectid': 'scaffold4896', 'evalue': '0.0'}, '92A2_SP6_344': {'perc_id': 97.849, 'subjectid': 'scaffold10682', 'evalue': '3.81e-38'}, 'ab1_400': {'perc_id': 98.698, 'subjectid': 'scaffold14620', 'evalue': '0.0'}, '92A2_T7_558': {'perc_id': 88.125, 'subjectid': 'scaffold1125', 'evalue': '7.51e-102'}} 


Alignments with identity > 90%:

Query id	Subject id	% ident	evalue
ab1_675 	 scaffold4896 	 100.0 	 0.0
ab1_676 	 scaffold4896 	 99.552 	 0.0
92A2_SP6_344 	 scaffold10682 	 97.849 	 3.81e-38
ab1_400 	 scaffold14620 	 98.698 	 0.0
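Note that with this data structure each query id keeps only its last alignment. If all alignments are needed, one option is to store a *list* of alignments for each query id. The sketch below wraps this in a hypothetical function `parse_blast` and, as a further assumption, converts the evalue to a float; the sample text reuses three lines of the extract above with a shortened header.

```python
def parse_blast(text):
    """returns a dictionary query id -> list of alignment dictionaries"""
    alignments = {}
    for line in text.split("\n"):
        if line.startswith("#") or line.strip() == "":
            continue                      # skip comments and empty lines
        fields = line.split("\t")
        hit = {"subjectid": fields[1],
               "perc_id": float(fields[2]),
               "evalue": float(fields[-1])}
        # setdefault creates the empty list the first time a query id is seen
        alignments.setdefault(fields[0], []).append(hit)
    return alignments

sample = ("# Fields: shortened header\n"
          "ab1_400\tscaffold16155\t98.698\t384\t4\t1\t12\t394\t6700\t7083\t0.0\n"
          "ab1_400\tscaffold14620\t98.698\t384\t4\t1\t12\t394\t1240\t857\t0.0\n"
          "92A2_T7_558\tscaffold277\t88.746\t311\t31\t3\t21\t330\t26630\t26937\t5.81e-103")

result = parse_blast(sample)
for query in result:
    for hit in result[query]:
        print(query, "\t", hit["subjectid"], "\t", hit["perc_id"], "\t", hit["evalue"])
```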


</div>

7. The following text (separated with a tab "\\t") is an extract of protein-protein interactions network stored in the database [STRING](https://string-db.org/) involving [PKLR](https://www.ncbi.nlm.nih.gov/gene/5313) (Pyruvate kinase, liver and RBC) that plays a key role in glycolysis:

```
#node1	node2	node1_ext_id	node2_ext_id	
ENO1	TPI1	ENSP00000234590	ENSP00000229270
PKLR	ENO1	ENSP00000339933	ENSP00000234590
PKLR	ENO3	ENSP00000339933	ENSP00000324105
PGK1	ENO1	ENSP00000362413	ENSP00000234590
PGK1	TPI1	ENSP00000362413	ENSP00000229270
GPI	TPI1	ENSP00000405573	ENSP00000229270
PKLR	ENO2	ENSP00000339933	ENSP00000229277
PGK1	ENO3	ENSP00000362413	ENSP00000324105
PGK1	ENO2	ENSP00000362413	ENSP00000229277
GPI	PKLR	ENSP00000405573	ENSP00000339933
ENO2	TPI1	ENSP00000229277	ENSP00000229270
PGK2	ENO1	ENSP00000305995	ENSP00000234590
ENO3	PGK2	ENSP00000324105	ENSP00000305995
PGK2	TPI1	ENSP00000305995	ENSP00000229270
ENO3	TPI1	ENSP00000324105	ENSP00000229270
PGK2	ENO2	ENSP00000305995	ENSP00000229277
GPI	ENO3	ENSP00000405573	ENSP00000324105
PKLR	LDHB	ENSP00000339933	ENSP00000229319
PKLR	LDHC	ENSP00000339933	ENSP00000280704
PKLR	TPI1	ENSP00000339933	ENSP00000229270
PGK1	PKLR	ENSP00000362413	ENSP00000339933
GPI	ENO2	ENSP00000405573	ENSP00000229277
PKLR	PGK2	ENSP00000339933	ENSP00000305995
GPI	PGK1	ENSP00000405573	ENSP00000362413
ME3	PKLR	ENSP00000352657	ENSP00000339933
ME3	LDHB	ENSP00000352657	ENSP00000229319
ME3	LDHC	ENSP00000352657	ENSP00000280704
GPI	PGK2	ENSP00000405573	ENSP00000305995
GPI	ENO1	ENSP00000405573	ENSP00000234590
GPI	LDHB	ENSP00000405573	ENSP00000229319
ENO3	ENO2	ENSP00000324105	ENSP00000229277
GPI	LDHC	ENSP00000405573	ENSP00000280704
ENO3	LDHB	ENSP00000324105	ENSP00000229319
ENO3	LDHC	ENSP00000324105	ENSP00000280704
ENO1	LDHB	ENSP00000234590	ENSP00000229319
LDHB	TPI1	ENSP00000229319	ENSP00000229270
LDHC	TPI1	ENSP00000280704	ENSP00000229270
PGK2	LDHC	ENSP00000305995	ENSP00000280704
PGK1	LDHB	ENSP00000362413	ENSP00000229319
PGK1	PGK2	ENSP00000362413	ENSP00000305995
ENO1	ENO2	ENSP00000234590	ENSP00000229277
LDHC	ENO1	ENSP00000280704	ENSP00000234590
LDHB	ENO2	ENSP00000229319	ENSP00000229277
LDHC	ENO2	ENSP00000280704	ENSP00000229277
ENO3	ENO1	ENSP00000324105	ENSP00000234590
PGK1	LDHC	ENSP00000362413	ENSP00000280704
GPI	ME3	ENSP00000405573	ENSP00000352657
PGK2	LDHB	ENSP00000305995	ENSP00000229319
ME3	TPI1	ENSP00000352657	ENSP00000229270
```

Here is a graphic representation of the protein-protein interactions:

![](img/pract5/PKLR.png)

    Note: we can assume that relations between nodes are symmetric: node1 --> node2
    implies node2 --> node1.
    
    1. Store the network information in a dictionary having node1 as key and
    the list of all nodes2 associated to it as value (remember to skip the first 
    line that is the header). Remember symmetry, therefore add also node2 --> node1
    2. Find all first neighbours of "PKLR" (i.e. the nodes that are directly connected 
    to "PKLR") and print them
    3. Find all first neighbours of "ME3" (i.e. the nodes that are directly connected 
    to "ME3") and print them
    4. Find all the second neighbours of "ME3" (i.e. the nodes that are connected to nodes 
    directly connected to "ME3"). 

<div class="tggle" onclick="toggleVisibility('ex7');">Show/Hide Solution</div>
<div id="ex7" style="display:none;">

In [115]:
networkStr = """#node1	node2	node1_ext_id	node2_ext_id	
ENO1	TPI1	ENSP00000234590	ENSP00000229270
PKLR	ENO1	ENSP00000339933	ENSP00000234590
PKLR	ENO3	ENSP00000339933	ENSP00000324105
PGK1	ENO1	ENSP00000362413	ENSP00000234590
PGK1	TPI1	ENSP00000362413	ENSP00000229270
GPI	TPI1	ENSP00000405573	ENSP00000229270
PKLR	ENO2	ENSP00000339933	ENSP00000229277
PGK1	ENO3	ENSP00000362413	ENSP00000324105
PGK1	ENO2	ENSP00000362413	ENSP00000229277
GPI	PKLR	ENSP00000405573	ENSP00000339933
ENO2	TPI1	ENSP00000229277	ENSP00000229270
PGK2	ENO1	ENSP00000305995	ENSP00000234590
ENO3	PGK2	ENSP00000324105	ENSP00000305995
PGK2	TPI1	ENSP00000305995	ENSP00000229270
ENO3	TPI1	ENSP00000324105	ENSP00000229270
PGK2	ENO2	ENSP00000305995	ENSP00000229277
GPI	ENO3	ENSP00000405573	ENSP00000324105
PKLR	LDHB	ENSP00000339933	ENSP00000229319
PKLR	LDHC	ENSP00000339933	ENSP00000280704
PKLR	TPI1	ENSP00000339933	ENSP00000229270
PGK1	PKLR	ENSP00000362413	ENSP00000339933
GPI	ENO2	ENSP00000405573	ENSP00000229277
PKLR	PGK2	ENSP00000339933	ENSP00000305995
GPI	PGK1	ENSP00000405573	ENSP00000362413
ME3	PKLR	ENSP00000352657	ENSP00000339933
ME3	LDHB	ENSP00000352657	ENSP00000229319
ME3	LDHC	ENSP00000352657	ENSP00000280704
GPI	PGK2	ENSP00000405573	ENSP00000305995
GPI	ENO1	ENSP00000405573	ENSP00000234590
GPI	LDHB	ENSP00000405573	ENSP00000229319
ENO3	ENO2	ENSP00000324105	ENSP00000229277
GPI	LDHC	ENSP00000405573	ENSP00000280704
ENO3	LDHB	ENSP00000324105	ENSP00000229319
ENO3	LDHC	ENSP00000324105	ENSP00000280704
ENO1	LDHB	ENSP00000234590	ENSP00000229319
LDHB	TPI1	ENSP00000229319	ENSP00000229270
LDHC	TPI1	ENSP00000280704	ENSP00000229270
PGK2	LDHC	ENSP00000305995	ENSP00000280704
PGK1	LDHB	ENSP00000362413	ENSP00000229319
PGK1	PGK2	ENSP00000362413	ENSP00000305995
ENO1	ENO2	ENSP00000234590	ENSP00000229277
LDHC	ENO1	ENSP00000280704	ENSP00000234590
LDHB	ENO2	ENSP00000229319	ENSP00000229277
LDHC	ENO2	ENSP00000280704	ENSP00000229277
ENO3	ENO1	ENSP00000324105	ENSP00000234590
PGK1	LDHC	ENSP00000362413	ENSP00000280704
GPI	ME3	ENSP00000405573	ENSP00000352657
PGK2	LDHB	ENSP00000305995	ENSP00000229319
ME3	TPI1	ENSP00000352657	ENSP00000229270"""

netData = dict()

for line in networkStr.split("\n"):
    
    if(not line.startswith("#") ):
        info = line.split("\t")
        #insert node1
        if(info[0] in netData):
            netData[info[0]].append(info[1])
        else:
            netData[info[0]] = [info[1]]
        #insert node2
        if(info[1] in netData):
            netData[info[1]].append(info[0])
        else:
            netData[info[1]] = [info[0]]
        

print(netData)

print("\nPKLR is directly connected to:", netData["PKLR"] )

firstNeigh = netData["ME3"]
print("\nME3 is directly connected to:", firstNeigh  )

secondNeigh = []
for node in firstNeigh:
    if(node in netData): #important, the node might not have connections
        neighbours = netData[node]
        for n in neighbours:
            if(n not in secondNeigh and  n not in firstNeigh and n != "ME3"):
                secondNeigh.append(n)
            
print("\nME3's second neighbours:", secondNeigh  )

{'ME3': ['PKLR', 'LDHB', 'LDHC', 'GPI', 'TPI1'], 'PGK2': ['ENO1', 'ENO3', 'TPI1', 'ENO2', 'PKLR', 'GPI', 'LDHC', 'PGK1', 'LDHB'], 'PGK1': ['ENO1', 'TPI1', 'ENO3', 'ENO2', 'PKLR', 'GPI', 'LDHB', 'PGK2', 'LDHC'], 'ENO2': ['PKLR', 'PGK1', 'TPI1', 'PGK2', 'GPI', 'ENO3', 'ENO1', 'LDHB', 'LDHC'], 'LDHB': ['PKLR', 'ME3', 'GPI', 'ENO3', 'ENO1', 'TPI1', 'PGK1', 'ENO2', 'PGK2'], 'LDHC': ['PKLR', 'ME3', 'GPI', 'ENO3', 'TPI1', 'PGK2', 'ENO1', 'ENO2', 'PGK1'], 'GPI': ['TPI1', 'PKLR', 'ENO3', 'ENO2', 'PGK1', 'PGK2', 'ENO1', 'LDHB', 'LDHC', 'ME3'], 'PKLR': ['ENO1', 'ENO3', 'ENO2', 'GPI', 'LDHB', 'LDHC', 'TPI1', 'PGK1', 'PGK2', 'ME3'], 'TPI1': ['ENO1', 'PGK1', 'GPI', 'ENO2', 'PGK2', 'ENO3', 'PKLR', 'LDHB', 'LDHC', 'ME3'], 'ENO3': ['PKLR', 'PGK1', 'PGK2', 'TPI1', 'GPI', 'ENO2', 'LDHB', 'LDHC', 'ENO1'], 'ENO1': ['TPI1', 'PKLR', 'PGK1', 'PGK2', 'GPI', 'LDHB', 'ENO2', 'LDHC', 'ENO3']}

PKLR is directly connected to: ['ENO1', 'ENO3', 'ENO2', 'GPI', 'LDHB', 'LDHC', 'TPI1', 'PGK1', 'PGK2', 'ME3']

ME3 is dir
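The neighbour searches above can also be written as functions, so that they can be applied to any node. This is only a sketch: the names `first_neighbours` and `second_neighbours` are made up, and the small network `toy` is invented to keep the example short (it is not part of the STRING data).

```python
def first_neighbours(network, node):
    """returns the nodes directly connected to node (empty list if none)"""
    return network.get(node, [])

def second_neighbours(network, node):
    """returns the nodes two steps away from node, excluding node itself
    and its first neighbours"""
    first = first_neighbours(network, node)
    second = []
    for mid in first:
        for n in first_neighbours(network, mid):
            if n != node and n not in first and n not in second:
                second.append(n)
    return second

# hypothetical toy network, stored as in the solution (node -> list of neighbours)
toy = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
       "D": ["B", "C"], "E": ["C"]}

print("first neighbours of A: ", first_neighbours(toy, "A"))
print("second neighbours of A:", second_neighbours(toy, "A"))
```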

</div>