
# Practical 7

In this practical we will keep practicing with functions and will see how to get input from the command line. 

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical7.pdf)

## Functions


Reminder. The basic definition of a function is:
```
def function_name(input) :
    #code implementing the function
    ...
    ...
    return return_value
```

Functions are defined with the **def** keyword, which precedes the *function_name*; a list of parameters follows in parentheses. A colon **:** ends the line holding the definition of the function. The code implementing the function is specified by using indentation. A function **might** or **might not** return a value. In the first case a **return** statement is used; otherwise the function implicitly returns ```None```.
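
As a minimal sketch of the two cases (the names ```square``` and ```greet``` are made up for this illustration), a function with a **return** statement hands a value back to the caller, while a function without one gives back ```None```:

```python
def square(x):
    """Return the square of x."""
    return x * x

def greet(name):
    """Print a greeting; there is no return statement."""
    print("Hello, {}!".format(name))

result = square(4)       # result holds the returned value
nothing = greet("Ada")   # prints the greeting; nothing is None
print(result, nothing)
```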


## Getting input from the command line

To call a program ```my_python_program.py``` from the command line, open a terminal (in Linux) or the command prompt (in Windows) and, assuming that python is in the path, ```cd``` into the folder containing your python program (e.g. ```cd C:\python\my_exercises\```) and type 
```python3 my_python_program.py```
or
```python my_python_program.py```
To pass arguments from the command line, put them after the program name (e.g. ```python my_python_program.py param1 param2 param3```).

Python provides the module **sys** to interact with the interpreter. In particular, **sys.argv** is a list representing all the arguments passed to the python script from the command line.

Consider the following code:

In [None]:
"""Test input from command line in systest.py"""
import sys

if(len(sys.argv) != 4):
    print("Dear user, I was expecting three parameters. You gave me ",len(sys.argv)-1)
    sys.exit(1)
else:
    for i in range(0,len(sys.argv)):
        print("Param {}: {} ({})".format(i,sys.argv[i], type(sys.argv[i])))

Invoking the ```systest.py``` script from command line with the command  ```python3 exercises/systest.py 1st_param 2nd 3``` will return:
```
Param 0: exercises/systest.py (<class 'str'>)
Param 1: 1st_param (<class 'str'>)
Param 2: 2nd (<class 'str'>)
Param 3: 3 (<class 'str'>)
```
Invoking the ```systest.py``` script from command line with the command  ```python3 exercises/systest.py 1st_param``` will return:
```
Dear user, I was expecting three parameters. You gave me  1
```

Note that the parameter at index 0, ```sys.argv[0]``` holds the name of the script, and that all parameters are actually **strings** (and therefore need to be cast to numbers if we want to do mathematical operations on them).
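
The conversion mentioned above could look like the following sketch. The helper ```to_numbers``` is hypothetical (it is not part of **sys**), and the list passed to it simulates what ```sys.argv[1:]``` would contain in a real script:

```python
import sys

def to_numbers(args):
    """Convert a list of strings to floats; return None at the first bad value."""
    numbers = []
    for a in args:
        try:
            numbers.append(float(a))
        except ValueError:
            print("Cannot convert '{}' to a number".format(a))
            return None
    return numbers

# In a script one would write: values = to_numbers(sys.argv[1:])
# (the slice skips the script name stored at index 0)
values = to_numbers(["3", "4.5"])   # simulated command line arguments
print(sum(values))
```

Catching ```ValueError``` lets the program print a readable message instead of stopping with a traceback when the user types something that is not a number.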

A more flexible and powerful way of getting input from the command line makes use of the ```argparse``` [module](https://docs.python.org/3/howto/argparse.html). 

## Argparse

```argparse``` is a command line parsing module that handles both mandatory parameters (positional arguments) and optional arguments.


Very briefly, the basic syntax of the ```argparse``` module (for more information check the [official documentation](https://docs.python.org/3/howto/argparse.html)) is the following.

1. Import the module:

```
import argparse
```

2. Define an ```ArgumentParser``` object:

```
parser = argparse.ArgumentParser(description="This is the description of the program")
```

Note the parameter *description*, a string describing the program.

3. Add positional arguments:
```
parser.add_argument("arg_name", type = obj, help = "Description of the parameter")
```
where ```arg_name``` is the name of the argument (which will be used to retrieve its value). The argument has type ```obj```: the parser converts the command line string to that type and reports an error if the conversion fails. Its description is given in the ```help``` string.

4. Add optional arguments:
```
parser.add_argument("-o", "--optional_arg", type = obj, default = def_val, help = "Description of the parameter")
```
where ```-o``` is the short form of the parameter (and it is optional), ```--optional_arg``` is the extended name, which must be followed by a value when used, and ```default``` is optional and gives a default value to the parameter. If no default is specified and the argument is not passed, it gets the value ```None```. ```help``` is again the description string.

5. Parse the arguments:
```
args = parser.parse_args()
```
the parser checks the arguments and stores their values in the ```argparse.Namespace``` object that we called ```args```.

6. Retrieve and process arguments:
```
myArgName = args.arg_name
myOptArg = args.optional_arg
```
now the variables contain the values specified by the user and we can use them.
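
The six steps can also be tried without a terminal, for instance inside a notebook, because ```parse_args``` accepts an explicit list of strings in place of the real command line. The sketch below assumes a toy parser whose arguments ```name``` and ```--count``` are invented for the illustration:

```python
import argparse

parser = argparse.ArgumentParser(description="Toy parser for testing")
parser.add_argument("name", type=str, help="A name (hypothetical argument)")
parser.add_argument("-c", "--count", type=int, default=1,
                    help="How many times to greet (hypothetical option)")

# Passing a list makes parse_args ignore the real command line
args = parser.parse_args(["World", "-c", "2"])
print(args.name, args.count)

# Without -c the option falls back to its default value
defaults = parser.parse_args(["World"])
print(defaults.count)
```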

**Example:**
Let's write a program that takes a string (S) and an integer (N) as input and prints the string repeated N times. Three optional parameters are specified: verbosity (-v) to make the program print a more descriptive output, separator (-s) to separate the copies of the string (defaults to " ") and trailpoints (-p) to set how many "." are appended at the end of the output (defaults to 1). 

In [None]:
import argparse
parser = argparse.ArgumentParser(description="""This script gets a string 
                                 and an integer and repeats the string N times""")
parser.add_argument("string", type=str,
                    help="The string to be repeated")
parser.add_argument("N", type=int,
                    help="The number of times to repeat the string")

parser.add_argument("-v", "--verbose", action="store_true",
                    help="increase output verbosity")

parser.add_argument("-p", "--trailpoints", type = int, default = 1, help="Adds these many trailing points")
parser.add_argument("-s", "--separator", type = str, default = " ", help="The separator between repeated strings")

args = parser.parse_args()

trailP = "." * args.trailpoints
# join puts the separator only between copies, so nothing has to be trimmed
# afterwards (slicing with [:-len(separator)] fails for an empty separator)
answer = args.separator.join([args.string] * args.N) + trailP

if args.verbose:
    print("the string {} repeated {} times is: {}".format(args.string, args.N, answer))
else:
    print(answer)


Executing the program from command line without parameters gives the message:

![](img/pract7/noargs.png)

Calling it with the ```-h``` flag:

![](img/pract7/help.png)

With the positional arguments ```"ciao a tutti"``` and ```3```:

![](img/pract7/pos_args.png)

With the positional arguments ```"ciao a tutti"``` and ```3```, and with the optional parameters ```-s "___" -p 3 -v```

![](img/pract7/sample.png)


**Example:**
Let's write a program that reads a text file specified by the user and prints it to screen. Optionally, the file might be compressed with gzip to save space, so the user should also be able to read gzipped files. Hint: use the module gzip, which is very similar to the standard file management methods ([more info here](https://docs.python.org/3/library/gzip.html?highlight=gzip#module-gzip)). You can find a text file here [textFile.txt](file_samples/textFile.txt) and its gzipped version here [textFile.gz](file_samples/textFile.gz):


In [None]:
import argparse
import gzip

parser = argparse.ArgumentParser(description="""Reads and prints a text file""")
parser.add_argument("filename", type=str, help="The file name")
parser.add_argument("-z", "--gzipped", action="store_true", help="If set, input file is assumed gzipped")

args = parser.parse_args()
inputFile = args.filename
fh = ""
if(args.gzipped):
    fh = gzip.open(inputFile, "rt")
else:
    fh = open(inputFile, "r")

for line in fh:
    line = line.strip("\n")
    print(line)

fh.close()


The output:

![](img/pract7/read_gz.png)
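
To see how close ```gzip.open``` is to the built-in ```open```, the following sketch writes a small gzipped text file and reads it back. The file name ```demo.txt.gz``` and the temporary directory are assumptions made only to keep the example self-contained:

```python
import gzip
import os
import tempfile

text = "first line\nsecond line\n"

# Work in a temporary directory so that no real file is touched
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt.gz")   # hypothetical file name

    # "wt" writes text and compresses it on the fly
    with gzip.open(path, "wt") as fh:
        fh.write(text)

    # "rt" decompresses and yields str lines, like open(path, "r")
    with gzip.open(path, "rt") as fh:
        lines = [line.rstrip("\n") for line in fh]

print(lines)
```

The mode ```"rt"``` matters: with the default mode ```gzip.open``` returns bytes rather than strings.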

## Exercises

1. Modify the program of Exercise 4 of Practical 6 in order to allow users to specify the input and output files from command line. Then test it with the provided files. The text of the exercise follows:

Write a python program that reads two files. The first is a one column text file ([contig_ids.txt](file_samples/contig_ids.txt)) with the identifiers of some contigs that are present in the second file, which is a fasta formatted file ([contigs82.fasta](file_samples/contigs82.fasta)). The program will write on a third, fasta formatted file (e.g. filtered_contigs.fasta) only those entries in *contigs82.fasta* having identifier in *contig_ids.txt*.



<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [8]:
import argparse

def readIDS(f):
    """reads a one column file in and stores
    the ids in a dictionary that is returned at the end"""
    ret = dict()
    with open(f, "r") as file:
        for line in file:
            line = line.strip()
            if(line not in ret):
                ret[line] = 1
    return ret

def filterFasta(inF, outF, ids2keep):
    oF = open(outF, "w")
    
    outputME = False
    with open(inF, "r") as file:
        for line in file:
            line = line.strip()
            if(line.startswith(">")):
                #this is the header
                if(line[1:] in ids2keep):
                    oF.write(line +"\n")
                    outputME = True
                    print("Writing contig ", line[1:])
                else:
                    outputME = False
            else:
                if(outputME):
                    oF.write(line +"\n")
        
    oF.close()
    

parser = argparse.ArgumentParser(description="Filters a fasta file")
parser.add_argument("inputFasta", type = str, help = "The input fasta file")
parser.add_argument("inputIDS", type = str, help = "The IDS to keep")
parser.add_argument("outputFasta", type = str, help = "The output fasta file with filtered entries")
args = parser.parse_args()
idsFile = args.inputIDS
inFasta = args.inputFasta
outFasta = args.outputFasta

ids = readIDS(idsFile)
filterFasta(inFasta,outFasta, ids)

</div>

2. [Blast](https://www.ncbi.nlm.nih.gov/pubmed/2231712) is a well known tool to perform sequence alignment between a pool of query sequences and a pool of subject sequences. Among other formats, it can produce a tab separated (```\t```) text output with user specified fields. Comments in the file are written in lines starting with a hash character ("#"). A sample blast output file is [blast_sample.tsv](file_samples/blast_sample.tsv); please download it and spend some time looking at it. The meaning of all columns is specified in the file header: 
```
# Fields: query id, subject id, query length, % identity, alignment length, 
identical, gap opens, q. start, q. end, s. start, s. end, evalue
```
Write a python program with:
    1. A function (*readBlast*) that reads in the blast .tsv file ignoring comment lines;
    2. A function (*filterBlast*) that gets a string in input representing a blast alignment and filters it according to the following user specified parameters. It should return True if all filters are passed, False otherwise: 
    
        1. % identity > identity_threshold (default 0%)
        2. evalue < evalue_threshold (default 0.5)
        3. alignment length / query length > align_threshold (default 0)

The program should report how many entries out of the total passed the filter.

Test several combinations of filters like:

    1. identity_threshold = 97, all others default
    2. evalue_threshold = 0.5, all others default
    3. align_threshold = 0.9, all others default
    4. align_threshold = 0.9, identity_threshold = 97
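
As a small illustration of how one line of such a file can be taken apart (the alignment line below is invented for the example, not taken from blast_sample.tsv), splitting on the tab character gives the twelve fields in the order listed in the header:

```python
# A made-up alignment line with the 12 tab separated fields of the header
line = "query1\tsubj1\t200\t98.5\t190\t187\t0\t1\t190\t5\t194\t1e-50\n"

fields = line.strip().split("\t")
query_len = int(fields[2])     # query length
identity = float(fields[3])    # % identity
align_len = int(fields[4])     # alignment length
evalue = float(fields[-1])     # float() understands scientific notation

# fraction of the query covered by the alignment
coverage = align_len / query_len
print(identity, evalue, coverage)
```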

<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [39]:


def filterBlast(seq, id_thr = 0, eval_thr = 1, al_thr = 0):
    """filters an alignment represented as a string. Returns True if filters OK, False otherwise"""
    infos = seq.split("\t")
    return float(infos[3]) > id_thr and float(infos[-1]) < eval_thr and float(infos[4])/int(infos[2]) > al_thr

def readBlast(inFile):
    """Reads the blast .tsv file and returns a list"""
    alignments = []
    # the with statement closes the file automatically
    with open(inFile, "r") as fh:
        for line in fh:
            if not line.startswith("#"):
                alignments.append(line)
    return alignments

def filterAndCount(alignList, id_thr = 0, eval_thr = 1, al_thr = 0):
    """filters alignments and counts the ones passing the filter"""
    passed = [x for x in alignList if filterBlast(x, id_thr, eval_thr, al_thr)]
    print("{} out of {} aligns passed filter (id_thr:{},eval_thr:{},al_thr:{})".format(
        len(passed), len(alignList), id_thr,eval_thr,al_thr))

blastF = "file_samples/blast_sample.tsv"

myAligns = readBlast(blastF)

filterAndCount(myAligns, id_thr=97)
filterAndCount(myAligns, eval_thr=1)
filterAndCount(myAligns, al_thr=0.9)
filterAndCount(myAligns, al_thr=0.9, id_thr=97)



8390 out of 90790 aligns passed filter (id_thr:97,eval_thr:1,al_thr:0)
90780 out of 90790 aligns passed filter (id_thr:0,eval_thr:1,al_thr:0)
27711 out of 90790 aligns passed filter (id_thr:0,eval_thr:1,al_thr:0.9)
804 out of 90790 aligns passed filter (id_thr:97,eval_thr:1,al_thr:0.9)


</div>

3. Fasta is a [format](https://en.wikipedia.org/wiki/FASTA_format) representing nucleotide or peptide sequences. 
Each entry of a fasta file starts with a ">" followed by the identifier of the entry (the header of the sequence). All the lines following a header represent the sequence belonging to that identifier. Here you can find an example of fasta file: [contigs82.fasta](file_samples/contigs82.fasta). Please download it and have a look at its content.

Write a python program that reads a fasta file and prints to screen the identifier of the sequence and the frequency of all the characters in the sequence (note that sequences might contain all IUPAC codes in case of SNPs). Hint: use a dictionary.

<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [28]:
def countFrequency(seq):
    """gets a sequence in input and returns
    a dictionary with bases as keys and frequency as value"""
    bases = {}
    for b in seq:
        if b not in bases:
            bases[b] = 1
        else:
            bases[b] += 1
    
    for b in bases:
        bases[b] = bases[b]/len(seq)
    return bases

def printData(ident, freqDict):
    """get the identifier and a dictionary with all frequencies
    and prints both information on the screen"""
    print(ident,":")
    for f in freqDict:
        print("\t {} has freq {:.3f}".format(f,freqDict[f]))

def processFasta(file):
    header = ""
    seq = ""
    with open(file, "r") as f:
        for line in f:
            line = line.strip()
            if(line.startswith(">")):
                if(len(header) == 0 ):
                    #first entry:
                    header = line[1:]
                else:
                    #this is a new entry
                    freqData = countFrequency(seq)
                    printData(header,freqData)
                    seq = ""
                    header = line[1:]
            else:
                seq +=line
    #processing the final entry
    freqData = countFrequency(seq)
    printData(header,freqData)
                    

inFasta = "file_samples/contigs82.fasta"
processFasta(inFasta)

MDC020656.85 :
	 N has freq 0.052
	 A has freq 0.308
	 C has freq 0.163
	 G has freq 0.203
	 T has freq 0.273
MDC001115.177 :
	 N has freq 0.018
	 T has freq 0.316
	 C has freq 0.207
	 G has freq 0.157
	 A has freq 0.302
MDC013284.379 :
	 N has freq 0.005
	 T has freq 0.321
	 G has freq 0.194
	 C has freq 0.171
	 A has freq 0.309
MDC018185.243 :
	 N has freq 0.001
	 W has freq 0.000
	 R has freq 0.001
	 Y has freq 0.001
	 G has freq 0.181
	 A has freq 0.307
	 S has freq 0.000
	 T has freq 0.304
	 M has freq 0.000
	 C has freq 0.204
	 K has freq 0.000
MDC018185.241 :
	 N has freq 0.006
	 W has freq 0.000
	 R has freq 0.001
	 Y has freq 0.002
	 G has freq 0.183
	 A has freq 0.335
	 S has freq 0.000
	 T has freq 0.304
	 M has freq 0.001
	 C has freq 0.168
	 K has freq 0.001
MDC004527.213 :
	 N has freq 0.013
	 T has freq 0.332
	 G has freq 0.155
	 C has freq 0.273
	 A has freq 0.227
MDC003661.174 :
	 N has freq 0.007
	 A has freq 0.246
	 C has freq 0.213
	 G has freq 0.232
	 T has freq 0.

</div>

4. Write a python program that reads two files. The first is a one column text file ([contig_ids.txt](file_samples/contig_ids.txt)) with the identifiers of some contigs that are present in the second file, which is a fasta formatted file ([contigs82.fasta](file_samples/contigs82.fasta)). The program will write on a third, fasta formatted file (e.g. filtered_contigs.fasta) only those entries in *contigs82.fasta* having identifier in *contig_ids.txt*.

<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [17]:
def readIDS(f):
    """reads a one column file in and stores
    the ids in a dictionary that is returned at the end"""
    ret = dict()
    with open(f, "r") as file:
        for line in file:
            line = line.strip()
            if(line not in ret):
                ret[line] = 1
    return ret

def filterFasta(inF, outF, ids2keep):
    oF = open(outF, "w")
    
    outputME = False
    with open(inF, "r") as file:
        for line in file:
            line = line.strip()
            if(line.startswith(">")):
                #this is the header
                if(line[1:] in ids2keep):
                    oF.write(line +"\n")
                    outputME = True
                    print("Writing contig ", line[1:])
                else:
                    outputME = False
            else:
                if(outputME):
                    oF.write(line +"\n")
        
    oF.close()
    

idsFile = "file_samples/contig_ids.txt"
inFasta = "file_samples/contigs82.fasta"
outFasta = "file_samples/filtered_contigs.fasta"

ids = readIDS(idsFile)
filterFasta(inFasta,outFasta, ids)

Writing contig  MDC020656.85
Writing contig  MDC001115.177
Writing contig  MDC013284.379
Writing contig  MDC018185.243
Writing contig  MDC018185.241
Writing contig  MDC004527.213
Writing contig  MDC012176.157
Writing contig  MDC001204.810
Writing contig  MDC004389.256
Writing contig  MDC018297.229
Writing contig  MDC001802.364
Writing contig  MDC014057.243
Writing contig  MDC021015.302
Writing contig  MDC017187.314
Writing contig  MDC012865.410
Writing contig  MDC000427.83
Writing contig  MDC017187.319
Writing contig  MDC004364.265
Writing contig  MDC002360.219
Writing contig  MDC015155.172
Writing contig  MDC019140.398
Writing contig  MDC019140.399
Writing contig  MDC011390.337
Writing contig  MDC007154.375
Writing contig  MDC010588.505
Writing contig  MDC002519.240
Writing contig  MDC006346.711
Writing contig  MDC011551.182
Writing contig  MDC002717.156
Writing contig  MDC006346.719
Writing contig  MDC007838.447
Writing contig  MDC007018.186
Writing contig  MDC017873.233
Writing cont

</div>

5. Write a python program that:

    1. reads the text file [sample_text.txt](file_samples/sample_text.txt) and stores in a dictionary how many times each word appears (hint: the key is the word and the count is the value);
    2. prints to screen how many lines the file has and how many distinct words are in the file;
    3. writes to a text file (scientist_histo.csv) the histogram of the words in comma separated value format (i.e. word,count). Words must be sorted alphabetically;
    4. Finally, write a function that prints to screen (alphabetically) all the words that have a count higher than a threshold N and apply it with N = 15.
    
<div class="tggle" onclick="toggleVisibility('ex5');">Show/Hide Solution</div>
<div id="ex5" style="display:none;">

In [41]:
def wordHisto(myText):
    """this function returns a dictionary
    with the count of each word in myText (separated by " " or "\n") 
    """
    myDict = dict()
    tmp = myText.replace("\n"," ")
    
    for word in tmp.split(" "):
        if(word not in myDict):
            myDict[word] = 1
        else:
            myDict[word] += 1
    return myDict

def writeWordHisto(outF, data):
    """this function writes to outFile
    the word histogram data contained in the dictionary data
    the output format is comma separated
    """
    fh = open(outF, "w")
    dictKeys = list(data.keys())
    dictKeys.sort()
    fh.write("#word,count\n")
    for k in dictKeys:
        curVal = data[k]
        myStr = k + "," + str(curVal)+"\n" #string to write in file
        fh.write(myStr)
    fh.close() #remember to close the file

def printWords(N, data):
    """prints the words in data that have a count higher than N"""
    dictKeys = list(data.keys())
    dictKeys.sort()
    for w in dictKeys:
        cnt = data[w]
        if(cnt > N):
            print("Word \"{}\" is present {} times".format( w, cnt))
    
    
file = "file_samples/sample_text.txt"
outFile = "file_samples/sample_text_histo.csv"
wholeText = ""
fh = open(file,"r")
wholeText = fh.read()
wholeText = wholeText.strip() #to remove the final newline character

fh.close()
print("The file {} has {} lines".format(file, wholeText.count("\n") +1 )) #n lines have n-1 \n

#Let's do the job:
wordD = wordHisto(wholeText)
writeWordHisto(outFile, wordD)
print("The total number of distinct words is ", len(wordD))
printWords(5,wordD)


The file file_samples/sample_text.txt has 28 lines
The total number of distinct words is  308
Word "challenge" is present 6 times
Word "future" is present 22 times
Word "reply" is present 7 times
Word "shop" is present 6 times
Word "thunder" is present 11 times
Word "umbrella" is present 10 times


</div>

6. Write the following python functions and test them with some parameters of your choice: 

    1. *getDivisors*: the function has a positive integer as parameter and returns a list of all the positive divisors of the integer in input (excluding the number itself). Example: ```getDivisors(6) --> [1,2,3]```
    
    2. *checkSum*: the function has a list and an integer as parameters and returns True if the sum of all elements in the list  equals the integer, False otherwise. Example: ```checkSum([1,2,3], 6) --> True```, ```checkSum([1,2,3],1) --> False```.
    
    3. *checkPerfect*: the function gets an integer as parameter and returns True if the integer is a [perfect number](https://en.wikipedia.org/wiki/Perfect_number), False otherwise. A number is perfect if all its divisors (excluding itself) sum to its value. Example: ```checkPerfect(6) --> True``` because 1+2+3 = 6. Hint: use the functions implemented before.
    
Use the three implemented functions to write a fourth function:

*getFirstNPerfects*: the function gets an integer N as parameter and returns a dictionary with the first N perfect numbers. The key of the dictionary is the perfect number, while the value of the dictionary is the list of its divisors. Example: ```getFirstNPerfects(1) --> {6 : [1,2,3]}```
    
Get and print the first 4 perfect numbers and finally test if 33550336 is a perfect number.

**WARNING:** do not try to find more than 4 perfect numbers as it might take a while!!!

<div class="tggle" onclick="toggleVisibility('ex6');">Show/Hide Solution</div>
<div id="ex6" style="display:none;">

In [24]:
def getDivisors(intVal):
    """returns the integer divisors of intVal"""
    ret = [x for x in range(1,intVal//2 + 1) if intVal % x == 0]
    #OR:
    #for i in range(1,intVal//2+1):
    #    if(intVal % i == 0):
    #        ret.append(i)
    return ret

def checkSum(intList, intVal):
    """checks if the sum of elements in intList equals intVal"""
    s = 0
    for x in intList:
        s += x
    return (s == intVal)

def checkPerfect(intVal):
    """checks if intVal is a perfect number"""
    divisors = getDivisors(intVal)
    return checkSum(divisors,intVal)

def getFirstNPerfects(N):
    """Finds the first N perfect numbers"""
    i = 0
    val = 2
    ret = {}
    while(i<N):
        if(checkPerfect(val)):
            i+=1
            ret[val] = getDivisors(val)
            val += 1
        else:
            val += 1
    
    return ret
            
        
perfects = getFirstNPerfects(4)
perKeys = list(perfects.keys())
perKeys.sort()
for p in perKeys:
    print(p, " = ", "+".join([str(x) for x in perfects[p]]))
    
print("Is 33550336 a perfect number?", checkPerfect(33550336))

6  =  1+2+3
28  =  1+2+4+7+14
496  =  1+2+4+8+16+31+62+124+248
8128  =  1+2+4+8+16+32+64+127+254+508+1016+2032+4064
Is 33550336 a perfect number? True
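
The search is slow because ```getDivisors``` tries every candidate up to half of the number. One possible speed-up, shown here only as a sketch and not as part of the solution above, uses the fact that divisors come in pairs (d and intVal // d), so it is enough to test candidates up to the square root. The name ```getDivisorsFast``` is hypothetical, and ```math.isqrt``` requires Python 3.8 or later:

```python
import math

def getDivisorsFast(intVal):
    """Return the sorted proper divisors of intVal (intVal itself excluded)."""
    if intVal < 2:
        return []
    divisors = {1}
    for d in range(2, math.isqrt(intVal) + 1):
        if intVal % d == 0:
            divisors.add(d)
            divisors.add(intVal // d)   # the paired divisor
    return sorted(divisors)

print(getDivisorsFast(28))
print(sum(getDivisorsFast(33550336)) == 33550336)
```

A set is used so that the square root of a perfect square is not counted twice.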


</div>