Basic Python and native data structures - 4
==========

Importing standard Python modules
---------------

Standard Python modules are libraries that ship with the Python interpreter, so no additional software needs to be installed. They only need to be **imported**. The **import** keyword lets us import standard (and non-standard) Python modules. Some common ones:
- os
- math
- sys
- itertools
- dozens of others are available. See https://docs.python.org/3/py-modindex.html

In [None]:
import os
os.listdir('.')

In [None]:
os.path.exists('data.txt')

In [None]:
os.path.isdir('.ipynb_checkpoints/')

In [None]:
# functions usually have a useful help message
os.mkdir?
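The `?` suffix used above is a Jupyter/IPython feature. As a minimal sketch of the plain-Python equivalent, the same help text can be read from the function's `__doc__` attribute (or displayed with the built-in `help` function):

```python
import os

# Every documented function stores its help message in the __doc__ attribute
doc = os.mkdir.__doc__
print(doc)

# In an interactive session, help(os.mkdir) displays the same information
```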

**A couple more example libraries**

In [None]:
import random

In [None]:
random.random()

In [None]:
random.randint(0, 100)

In [None]:
random.choice(['cat', 'dog', 'cow', 'banana'])

In [None]:
import time

In [None]:
start = time.time()
i = 0
while i < 100000:
    i += 1
end = time.time()
print(end - start)
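For timing short pieces of code, the standard library also offers `time.perf_counter`, which uses the highest-resolution clock available. The sketch below repeats the measurement above with it; the variable names `total` and `elapsed` are chosen here for illustration only.

```python
import time

# perf_counter() is meant for measuring short durations
start = time.perf_counter()
total = 0
for i in range(100000):
    total += i
elapsed = time.perf_counter() - start

print(f"Loop took {elapsed:.6f} seconds")
```

The exact value printed depends on the machine, so it will differ from run to run.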

**Import comes in different flavors**

In [None]:
import math
math.pi

In [None]:
from math import pi
pi

In [None]:
# aliases are possible on the module itself
import math as m
m.pi

In [None]:
# or an alias on the function/variable itself
from math import pi as PI
PI

Keywords
------

- Keywords are special names that are part of the Python language.
- **A variable cannot be named after a keyword** --> a SyntaxError would be raised
- The list of keywords can be obtained using these commands (**import** is itself a keyword; **print** is a built-in function in Python 3, not a keyword; both will be explained during this course)

In [None]:
import keyword
# Here we are using the "dot" operator, which gives access to an object's attributes and functions
print(keyword.kwlist)

In [None]:
# raise is a keyword, so using it as a variable name is a SyntaxError
raise = 1
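To check whether a name is reserved before using it, the `keyword` module provides the `iskeyword` function. A small sketch follows; the list of candidate names is arbitrary and chosen here for illustration.

```python
import keyword

# Candidate variable names to test (arbitrary examples)
candidates = ["raise", "import", "print", "data"]

# Map each name to True if it is a reserved keyword, False otherwise
checks = {name: keyword.iskeyword(name) for name in candidates}

for name, is_reserved in checks.items():
    print(name, is_reserved)
```

Note that in Python 3 `print` is reported as `False`: it is a built-in function rather than a keyword.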

Exceptions
------

Exceptions are used to handle unexpected errors and avoid crashes.

In [None]:
d = {'first': 1,
     'second': 2}
d['third']

In [None]:
# Exceptions can be intercepted and crashes can be avoided
try:
    d['third']
except:
    print('Key not present')

In [None]:
# Specific exceptions can be intercepted
try:
    d['third']
except KeyError:
    print('Key not present')
except:
    print('Another error occurred')

In [None]:
# An error that is not a KeyError is handled by the generic "except" branch
try:
    d['second'].non_existent_method()
except KeyError:
    print('Key not present')
except:
    print('Another error occurred')

In [None]:
# The exception can be assigned to a variable to inspect it
try:
    d['second'].non_existent_method()
except KeyError:
    print('Key not present')
except Exception as e:
    print(f'Another error occurred: {e}')

In [None]:
# Exceptions can be created and "raised" by the user
if d['second'] == 2:
    raise Exception('I don\'t like 2 as a number')
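Raising and intercepting can be combined. The sketch below raises a more specific built-in exception (`ValueError`) inside a function and catches it in the caller; `check_not_two` is a hypothetical helper written only for this illustration.

```python
def check_not_two(number):
    """Return number unchanged, or raise a ValueError if it is 2."""
    if number == 2:
        raise ValueError("I don't like 2 as a number")
    return number


try:
    check_not_two(2)
except ValueError as e:
    # str(e) gives the message passed when the exception was raised
    message = str(e)
    print(f"Caught: {message}")
```

Using a specific exception type lets the caller intercept only this error, while any other problem still surfaces normally.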

Simple file reading/writing
---------

**Writing**

In [None]:
mysequence = """>sp|P56945|BCAR1_HUMAN Breast cancer anti-estrogen resistance protein 1 OS=Homo sapiens GN=BCAR1 PE=1 SV=2
MNHLNVLAKALYDNVAESPDELSFRKGDIMTVLEQDTQGLDGWWLCSLHGRQGIVPGNRL
KILVGMYDKKPAGPGPGPPATPAQPQPGLHAPAPPASQYTPMLPNTYQPQPDSVYLVPTP
SKAQQGLYQVPGPSPQFQSPPAKQTSTFSKQTPHHPFPSPATDLYQVPPGPGGPAQDIYQ
VPPSAGMGHDIYQVPPSMDTRSWEGTKPPAKVVVPTRVGQGYVYEAAQPEQDEYDIPRHL
LAPGPQDIYDVPPVRGLLPSQYGQEVYDTPPMAVKGPNGRDPLLEVYDVPPSVEKGLPPS
NHHAVYDVPPSVSKDVPDGPLLREETYDVPPAFAKAKPFDPARTPLVLAAPPPDSPPAED
VYDVPPPAPDLYDVPPGLRRPGPGTLYDVPRERVLPPEVADGGVVDSGVYAVPPPAEREA
PAEGKRLSASSTGSTRSSQSASSLEVAGPGREPLELEVAVEALARLQQGVSATVAHLLDL
AGSAGATGSWRSPSEPQEPLVQDLQAAVAAVQSAVHELLEFARSAVGNAAHTSDRALHAK
LSRQLQKMEDVHQTLVAHGQALDAGRGGSGATLEDLDRLVACSRAVPEDAKQLASFLHGN
ASLLFRRTKATAPGPEGGGTLHPNPTDKTSSIQSRPLPSPPKFTSQDSPDGQYENSEGGW
MEDYDYVHLQGKEEFEKTQKELLEKGSITRQGKSQLELQQLKQFERLEQEVSRPIDHDLA
NWTPAQPLAPGRTGGLGPSDRQLLLFYLEQCEANLTTLTNAVDAFFTAVATNQPPKIFVA
HSKFVILSAHKLVFIGDTLSRQAKAADVRSQVTHYSNLLCDLLRGIVATTKAAALQYPSP
SAAQDMVERVKELGHSTQQFRRVLGQLAAA
"""

In [None]:
# First, we open a file in "w" write mode
fh = open("mysequence.fasta", "w")
# Second, we write the data into the file:
fh.write(mysequence)
# Third, we close:
fh.close()

**Reading**

In [None]:
# First, we open the file in read mode (r)
fh = open('mysequence.fasta', 'r')
# Second, we read the content of the file
data = fh.read()
# Third, we close
fh.close()
# data is now a string that contains the content of the file being read
data

In [None]:
print(data)

**For both writing and reading you can use the "with" keyword (a context manager), which automatically closes the file after use, even if an exception occurs**


**Writing**

In [None]:
# First, we open a file in "w" write mode with the context manager
with open("mysequence.fasta", "w") as fh:
    # Second, we write the data into the file:
    fh.write(mysequence)
# On leaving the block, the file is automatically and safely closed

**Reading**

In [None]:
# First, we open the file in read mode (r) with the context manager
with open('mysequence.fasta', 'r') as fh:
    # Second, we read the content of the file
    data = fh.read()
# On leaving the block, the file is automatically and safely closed
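The statement that the file gets closed even when an exception occurs can be checked directly. The following is a minimal sketch that writes to a temporary file (the file name `example.txt` and the error message are made up for illustration) and deliberately raises an error inside the `with` block:

```python
import os
import tempfile

# Work in a temporary directory so that no existing file is touched
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")

try:
    with open(path, "w") as fh:
        fh.write("first line\n")
        # Simulate a problem happening while the file is still open
        raise RuntimeError("something went wrong while writing")
except RuntimeError as e:
    print(f"Caught: {e}")

# The file object still exists, but the "with" block has closed it
print(fh.closed)
```

The `closed` attribute of the file object is `True` after the block, and the text written before the error is kept in the file.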

Notice the **\n** character (newline) in the string...

In [None]:
data

In [None]:
data.split("\n")

In [None]:
data.split("\n", 1)

In [None]:
header, sequence = data.split("\n", 1)

In [None]:
header

In [None]:
sequence

In [None]:
# we want to get rid of the \n characters
seq1 = sequence.replace("\n","")

In [None]:
# another way is to use the split/join pair
seq2 = "".join(sequence.split("\n"))

In [None]:
seq1 == seq2

In [None]:
# make sure that every letter is upper case
seq1 = seq1.upper()

In [None]:
# With the sequence, we can now play around 
seq1.count('A')

If a file is very large, reading it all at once with the "read" method could fill up our memory! It is advisable to iterate over the file line by line with a "for" loop.

In [None]:
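# the context manager makes sure the file is closed once the loop ends
with open('mysequence.fasta') as fh:
    # iterating over the file object reads one line at a time
    for line in fh:
        # remove the newline character at the end of the line
        # also removes spaces and tabs at the right end of the string
        line = line.rstrip()
        print(line)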
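Line-by-line iteration also allows computing a result without ever holding the whole file in memory. As a sketch, the code below counts the residues of a tiny FASTA file; the file `toy.fasta` and its two-line sequence are made up here so that the example runs on its own.

```python
import os
import tempfile

# Write a tiny FASTA file to a temporary directory (made-up sequence)
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "w") as fh:
    fh.write(">toy_sequence\nMNHLNV\nLAKALY\n")

n_residues = 0
with open(path) as fh:
    for line in fh:
        line = line.rstrip()
        # Skip the header line, which starts with ">"
        if line.startswith(">"):
            continue
        # Only the current line is in memory at any time
        n_residues += len(line)

print(n_residues)
```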

---


Exercises
---------

Write a function `create_directory` that creates a directory with a name specified by the user, using the `os.mkdir` function.

**Note:** what happens if you run the function twice with the same argument? Do you need to change your function in some way because of this?

Create a function `read_single_fasta` that reads a fasta file provided by the user (*e.g.* the `mysequence.fasta` that was created earlier) and returns two variables:

* The sequence header
* The actual sequence

Create a function `read_fasta_and_count` that reads a fasta file provided by the user (*e.g.* the `mysequence.fasta` that was created earlier) and returns a dictionary with the *frequency* of each amino acid.

Write a function `count_k_mers` that takes two arguments:

* A list of objects
* A variable `k`

And returns the number of all possible [combinations](https://en.wikipedia.org/wiki/Combination) of length *k* from the list of objects.

**Note:** you can implement the generation of combinations using pure python or using the `itertools.combinations` function

Slightly more difficult ones
--------------

Write a function that will calculate the average word length of a text stored in a file (i.e., the sum of all the lengths of the word tokens in the text, divided by the number of word tokens). Add an option to exclude blank lines or chapter headers from the computation.

Use the `../data/aristotle.txt` file downloaded below

If all the previous exercises sounded boring to you
-------------------

An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list in the `../data/unixdict.txt` file (downloaded below), write a program that finds the largest sets of words that share the same characters, i.e. the anagram groups containing the most words.