Basic Python and native data structures
==========

Let's start with some poetic wisdom
---------------------------------

In [None]:
import this

Resources
----------
- https://docs.python.org
- https://docs.python.org/3/tutorial/index.html

Hello world example
-------------------

In [None]:
print("hello world")

About indentation
-----------------

Before starting, you need to know that in Python, code indentation is an essential part of the syntax. It is used to delimit code blocks such as loops and functions. It may seem cumbersome, but it makes all Python code consistent and readable. The following code is incorrect:

```python
>>> a = 1
>>>   b = 2
```
since the two statements are not aligned despite being part of the same block of statements (the main block). Instead, they must be indented in the same way:
```python
>>> a = 1
>>> b = 2
```
Here is another example involving a loop and a function (def):
```python
def example():
    for i in [1, 2, 3]:
        print(i)
```     
In C, it may look like 
```c
void example(){
  int i;
  for (i=1; i<=3; i++){
      printf("%d\n", i);
  }
}
```
OR
```c
void example(){
int i;
for (i=1; i<=3; i++)
{
printf("%d\n", i);
}
}
```

**Note:** both tabs and spaces can be used to define the indentation, but conventionally **4 spaces** are preferred. 
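
To see how the interpreter reacts to inconsistent indentation without stopping the notebook, the incorrect snippet shown earlier can be handed to the built-in `compile` function as a string. This is only a small sketch; the variable names `source` and `indentation_ok` are made up for the illustration.

```python
# The two statements belong to the same block but are not aligned.
source = "a = 1\n  b = 2\n"

try:
    # compile() only parses the code; it does not run it
    compile(source, "<example>", "exec")
    indentation_ok = True
except IndentationError as error:
    indentation_ok = False
    print("IndentationError:", error.msg)
```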

Rules and conventions on naming variables
-------------------------------

* Variable names are unlimited in length
* Variable names start with a letter or underscore *_* followed by letters, numbers or underscores.
* Variable names are case-sensitive
* **Variable names cannot be Python keywords (see below)**

Variable names conventionally have lower-case letters, with multiple words separated by underscores. 

**Other rules and style conventions:** PEP8 style recommendations (https://www.python.org/dev/peps/pep-0008/)
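
As a quick check of the rules above, the string method `isidentifier` reports whether a piece of text has the shape of a valid name. The candidate names below are invented for the example. Note that `isidentifier` only checks the characters, so it also answers `True` for keywords.

```python
# Check a few candidate names (all of them are made-up examples)
for name in ["gene_name", "_private", "GeneName", "2nd_gene", "gene-name"]:
    print(name, name.isidentifier())

# Names are case-sensitive: these are two different variables
gene = "BCAR1"
Gene = "TP53"
print(gene, Gene)
```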

Basic numeric types
----------------

**Integers**

In [None]:
a = 10  
b = 2
a + b

In [None]:
# augmented assignment operators
a = 10
a += 2    # equivalent to a = a + 2   (there is no ++ operator like in C/C++)
a

In [None]:
a = 10
a = a + 2
a

**Boolean**

In [None]:
test = True
if test:
    print(test)

In [None]:
test = False
if not test:
    print(test)

In [None]:
# Other types can be treated as booleans
# The main example is integers: 0 is false, any other value is true
true_value = 1
false_value = 0
if true_value:
    print(true_value)
if not false_value:
    print(false_value)

**Integers, Float and Complex**

In [None]:
float1 = 2.1           
float2 = 2.0
float3 = 2.

complex_value = 1 + 2j

In [None]:
float3

**Basic mathematical operators**

In [None]:
1 + 2

In [None]:
1 - 2

In [None]:
3 * 2

In [None]:
# true division: the result is always a float, even with integer operands
3 / 2

In [None]:
3

In [None]:
# explicit casting
float(3)

In [None]:
# floor division
3 // 2

In [None]:
# modulo (division remainder)
3 % 2

In [None]:
# exponent
3 ** 2

**Promotion:** when you mix numeric types in an expression, the operands are converted (or coerced) to the widest type involved (int, then float, then complex)

In [None]:
5 + 3.1
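
The built-in `type` function makes the promotion visible. A minimal sketch (the variable names are arbitrary):

```python
int_plus_float = 5 + 3.1               # int + float
float_plus_complex = 2.0 + (1 + 2j)    # float + complex

print(type(5))                    # <class 'int'>
print(type(int_plus_float))       # <class 'float'>
print(type(float_plus_complex))   # <class 'complex'>
```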

**Converting types: casting**

A value of one type can be converted to another type through "casting".

In [None]:
int(3.1)

In [None]:
float(3)

In [None]:
bool(1)

In [None]:
bool(0)

Functions
-------

Functions allow you to reuse code in a flexible way.

In [None]:
def sum_numbers(first, second):
    return first + second

In [None]:
sum_numbers(1, 2)

In [None]:
sum_numbers('one', 'two')

In [None]:
sum_numbers('one', 2)

In [None]:
# positional vs. keyword arguments
def print_variables(first, second):
    print(f'First variable: {first}')
    print(f'Second variable: {second}')
  

In [None]:
print_variables(1, 2)

In [None]:
print_variables()

In [None]:
print_variables(1)

In [None]:
# default values make the arguments optional; they can also be passed by keyword
def print_variables(first=None, second=None):
    print(f'First variable: {first}')
    print(f'Second variable: {second}')


In [None]:
print_variables()

In [None]:
print_variables(second=2)

Flow control operators
------

In [None]:
# if/elif/else
animals = ['dog', 'cat', 'cow']
if 'cats' in animals:
    print('Cats found!')
elif 'cat' in animals:
    print('Only one cat found!')
else:
    print('Nothing found!')

In [None]:
# for loop
foods = ['pasta', 'rice', 'lasagna']
for food in foods:
    print(food)

In [None]:
# nested for loops
foods = ['pasta', 'rice', 'lasagna']
desserts = ['cake', 'biscuit']
for food in foods:
    for dessert in desserts:
        print(food, dessert)

In [None]:
# while loop
counter = 0
while counter < 10:
    counter += 1
counter

In [None]:
# A normal loop
for value in range(10):
    print(value)

In [None]:
# continue: skip the rest of the loop body and go to the next element
for value in range(10):
    if value != 3:
        continue
    print(value)

In [None]:
# break: exit the "for" loop (here, as soon as the value 3 is reached)
for value in range(10):
    if value == 3:
        break
    print(value)

In [None]:
# continue only skips to the next iteration of the innermost loop
foods = ['pasta', 'rice', 'lasagna']
desserts = ['cake', 'biscuit']
for food in foods:
    for dessert in desserts:
        if food == 'rice':
            continue
        print(food, dessert)

In [None]:
# break only exits from the innermost loop
foods = ['pasta', 'rice', 'lasagna']
desserts = ['cake', 'biscuit']
for food in foods:
    for dessert in desserts:
        if dessert == 'biscuit':
            break
        print(food, dessert)

A note about objects
-------

- Everything in Python is an object, which can be seen as an advanced version of a variable
- Objects have attributes and methods
- The built-in **dir** function allows the user to discover them
- To access them, use the following syntax: VARIABLE.ATTRIBUTE or VARIABLE.METHOD()

In [None]:
print(dir(bool))
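
As a small illustration of the VARIABLE.ATTRIBUTE and VARIABLE.METHOD() syntax, here is a sketch using a float and a string (the variable names are arbitrary). The loop at the end hides the "special" names starting with an underscore, which make up most of the output of `dir`.

```python
number = 3.0
word = "python"

print(number.real)           # an attribute: no parentheses
print(number.is_integer())   # a method of float objects: True
print(word.upper())          # a method of str objects: 'PYTHON'

# show only the "public" names of a string object
for name in dir(word):
    if not name.startswith("_"):
        print(name)
```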

Data structures
------------


There are quite a few data structures available. The built-in data structures are: 
- **lists**
- **tuples**
- **dictionaries**
- **strings**
- **sets** 

Lists, strings and tuples are **ordered sequences** of objects. Unlike strings, which contain only characters, lists and tuples can contain objects of any type. Lists and tuples are like arrays. Tuples, like strings, are **immutable**. Lists are mutable, so they can be extended or reduced at will. Sets are mutable, unordered collections of unique elements.

Lists are enclosed in brackets:

```python
l = [1, 2, "a"]
```

Tuples are enclosed in parentheses:

```python
t = (1, 2, "a")
```

Tuples are generally slightly faster and consume less memory than lists.
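
The memory claim can be checked with `sys.getsizeof`, which returns the size in bytes of the container itself (not of the objects it holds). The exact numbers are an implementation detail of the interpreter; the sketch below assumes the standard CPython interpreter.

```python
import sys

as_list = [1, 2, 3]
as_tuple = (1, 2, 3)

list_size = sys.getsizeof(as_list)
tuple_size = sys.getsizeof(as_tuple)
print(list_size, tuple_size)   # on CPython the tuple is the smaller of the two
```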

Dictionaries are built with curly brackets:

```python
d = {"a":1, "b":2}
```

Sets are made using the **set** built-in function or curly brackets with at least one element. More about the data structures below:

|                      | immutable  | mutable     |
|----------------------|------------|-------------|
| ordered sequence     | string     |             |
| ordered sequence     | tuple      |  list       |
| unordered collection |            |  set        |
| hash table           |            |  dict       |


**Indexing** starts at 0, like in C

In [None]:
s1 = "Example"
s1[0]

In [None]:
# last index is therefore the length of the string minus 1
s1[len(s1)-1]

In [None]:
s1[6]

In [None]:
# Negative index can be used to start from the end
s1[-2]

In [None]:
# Careful with indexing out of bounds
s1[100]

Strings and slicing
-----

There are 4 ways to represent strings:
- with single quotes
- with double quotes
- with triple single quotes
- with triple double quotes

In [None]:
"Simple string"

In [None]:
'Simple string'

In [None]:
# double quotes let you put a single quote inside the string, and vice versa
"John's book"

In [None]:
#we can also use escaping
'John\'s book'

In [None]:
"""This is an example of 
a long string on several lines"""

**String operations**

In [None]:
s1 = "First string"
s2 = "Second string"
# + operator concatenates strings
s1 + " and " + s2

In [None]:
# Strings are immutable
# Try (this raises a TypeError)
s1[0] = 'e'

In [None]:
# to change a character, you have to create a new string
'N' + s1[1:]

**Slicing sequence syntax**

Applies to strings as well as lists, tuples and any other **sequence** (_i.e._ an object that supports indexing like a list, tuple, etc.)

<pre>
- [start:end:step]   most general slicing
- [start:end:]      (step=1)
- [start:end]       (step=1)
- [start:]          (step=1, end=length of the sequence)
- [:]               (start=0, end=length, step=1)
- [::2]             (start=0, end=length, step=2)
</pre>
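
Each form can be tried on a short string. This is a minimal sketch with an arbitrary example string; note that the end index is always excluded, so an explicit end of -1 stops just before the last character.

```python
s = "abcdefgh"

print(s[2:6:2])   # 'ce'
print(s[2:6])     # 'cdef'
print(s[2:])      # 'cdefgh'
print(s[:])       # 'abcdefgh'
print(s[::2])     # 'aceg'
print(s[2:-1])    # 'cdefg'    (the last character is left out)
print(s[::-1])    # 'hgfedcba' (a negative step walks backwards)
```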

In [None]:
s1 = 'Banana'
s1[1:6:2]

In [None]:
s = 'TEST'
s[-1:-4:-2]

In [None]:
# slicing. using one : character means from start to end index.
s1 = "First string"
s1[:]

In [None]:
s1[::2]

In [None]:
# indexing
s1[0]

**Other string operations**

In [None]:
print(dir(s1))

Well, that's a lot! Here are the most commonly used ones:
- split
- find
- index
- replace
- lower
- upper
- endswith
- startswith
- strip

In [None]:
# split is very useful when parsing files
s = 'first second third'
s.split()

In [None]:
# a different character can be used as well as separator
s = 'first,second,third'
s.split(',')

In [None]:
# Upper is a very easy and handy method
s.upper()

In [None]:
# Methods can be chained as well!
s.upper().lower().split(',')

Lists
-----

A list can be created with the **list** function or with square brackets **[]**

In [None]:
# you can put any kind of object in a list. This is not an array!
l = [1, 'a', 3]
l

In [None]:
# slicing and indexing like for strings are available
l[0]
l[0::2]

In [None]:
l

In [None]:
# lists are mutable sequences:
l[1] = 2
l

**Mathematical operators can be applied to lists as well**

In [None]:
[1, 2] + [3, 4]

In [None]:
[1, 2] * 10

**Adding elements to a list: append vs. extend**

Lists have several methods, among which are **append** and **extend**. The former appends a single object (which may itself be another list) to the end of the list, while the latter appends each element of an iterable object (e.g., another list) to the end of the list.

For example, we can append an object (here the character 'c') to the end of a simple list as follows:

In [None]:
stack = ['a','b']
stack.append('c')
stack

In [None]:
stack.append(['d', 'e', 'f'])
stack

In [None]:
stack[3]

The object ``['d', 'e', 'f']`` has been appended to the existing list as a single element. Sometimes, however, what we want is to append the elements of a given list one by one rather than the list itself. You can do that manually, of course, but a better solution is to use the **extend** method as follows:


In [None]:
# the manual way
stack = ['a', 'b', 'c']
stack.append('d')
stack.append('e')
stack.append('f')
stack

In [None]:
# semi-manual way, using a "for" loop
stack = ['a', 'b', 'c']
to_add = ['d', 'e', 'f']
for element in to_add:
    stack.append(element)
stack

In [None]:
# the smarter way
stack = ['a', 'b', 'c']
stack.extend(['d', 'e','f'])
stack

Tuples
----

Tuples are sequences similar to lists, but **immutable**. Use parentheses to create a tuple.

In [None]:
t = (1, 2, 3)
t

In [None]:
# simple creation: the parentheses are optional
t = 1, 2, 3
print(t)
# tuples are immutable, so this assignment raises a TypeError
t[0] = 3

In [None]:
# Would this work?
(1)

In [None]:
# To enforce a tuple creation, add a comma
(1,)

**Same operators as lists**

In [None]:
(1,) * 5

In [None]:
t1 = (1,0)
t1 += (1,)
t1

**Why tuples instead of lists?**

- generally faster than lists
- protect the data (immutable)
- tuples, being immutable, can be used as keys in dictionaries (more on that later)
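
The last point can be sketched as follows. The dictionary `coverage` and its values are invented for the illustration; it maps (chromosome, position) pairs to a number. Dictionaries and the `try`/`except` construct are covered in later sections.

```python
# tuples as dictionary keys (made-up data)
coverage = {("chr1", 100): 12, ("chr2", 250): 7}
print(coverage[("chr1", 100)])   # 12

try:
    coverage[["chr3", 10]] = 3   # a list is mutable, so it cannot be a key
    list_key_accepted = True
except TypeError as error:
    list_key_accepted = False
    print(error)
```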

Sets
----

Sets are constructed from a sequence (or some other iterable object). Since sets cannot have duplicates, they are usually used to build collections of unique items (e.g., a set of identifiers).

A set can be created with the **set** function or with curly braces **{}** containing at least one element (empty braces create a dictionary).


In [None]:
a = {'1', '2', 'a', '4'}
a

In [None]:
# a list preserves duplicates
a = [1, 1, 1, 2, 2, 3, 4]
a

In [None]:
# a set ignores duplicates
a = {1, 2, 1, 2, 2, 3, 4}
a

In [None]:
a = []
to_add = [1, 1, 1, 2, 2, 3, 4]
for element in to_add:
    if element in a:
        continue
    else:
        a.append(element)
a

In [None]:
# Sets have the very handy "add" method
a = set()
to_add = [1, 1, 1, 2, 2, 3, 4]
for element in to_add:
    a.add(element)
a

**Sets have very interesting operators**

Which operators are available?
- | for union
- & for intersection
- < for proper subset (<= for subset)
- \- for difference
- ^ for symmetric difference

In [None]:
a = {'a', 'b', 'c'}
b = {'a', 'b', 'd'}
c = {'a', 'e', 'f'}

In [None]:
# intersection
a & b

In [None]:
# union
a | b

In [None]:
# difference
a - b

In [None]:
# symmetric difference
a ^ b

In [None]:
# is my set a proper subset of the other?
a < b
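
The comparison operators distinguish between "subset" and "proper subset" (a subset that is not equal to the other set). A short sketch with two sets whose names are chosen only for this example:

```python
letters = {'a', 'b', 'c'}
small = {'a', 'b'}

print(small < letters)           # True: every element of small is in letters
print(small <= letters)          # True
print(letters < letters)         # False: a set is not a proper subset of itself
print(letters <= letters)        # True
print(small.issubset(letters))   # True: method equivalent of <=
```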

In [None]:
# operators can be chained as well
a & b & c

In [None]:
# the same operations can be performed using the equivalent methods
a.intersection(b).intersection(c)

In [None]:
# a more complex operation
a.intersection(b).difference(c)

Dictionaries
------

- A dictionary is a collection of items.
- Each item is a pair made of a **key** and a **value**. They are useful to convert a **key** to its corresponding **value** (_e.g._ gene identifier to its common name)
- Dictionaries are indexed by key rather than by position. Since Python 3.7 they preserve the insertion order of their keys.
- You can access the keys or the values independently.

In [None]:
d = {} # an empty dictionary

In [None]:
d = {'first':1, 'second':2} # initialise a dictionary

In [None]:
# access a value given its key:
d['first']

In [None]:
# add a new pair of key/value:
d['third'] = 3

In [None]:
# what are the keys ?
d.keys()

In [None]:
# what are the values ?
d.values()

In [None]:
# what are the key/values pairs?
d.items()

In [None]:
# can be used in a for loop as well
for key, value in d.items():
    print(key, value)

In [None]:
# Delete a key (and its value)
del d['third']
d

In [None]:
# naive for loop approach:
for key in d.keys():
    print(key, d[key])

In [None]:
# no need to call the "keys" method explicitly
for key in d:
    print(key, d[key])

In [None]:
# careful not to look for keys that are NOT in the dictionary
d['fourth']

In [None]:
# the "get" method allows a safe retrieval of a value
d.get('fourth')

In [None]:
# the "get" method returns None if the key is not present
# a different default value can be specified for a missing key
d.get('fourth', 4)
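
Another way to avoid the error is to test for the key first with the `in` operator, which looks at the keys of a dictionary and not at its values. A minimal sketch (the dictionary name `numbers` is arbitrary):

```python
numbers = {'first': 1, 'second': 2}

print('first' in numbers)      # True
print('fourth' in numbers)     # False
print(2 in numbers)            # False: 2 is a value, not a key
print(2 in numbers.values())   # True

if 'fourth' in numbers:
    print(numbers['fourth'])
else:
    print('Key not found!')
```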

**Note on the "None" type**

In [None]:
n = None
n

In [None]:
print(n)

In [None]:
None + 1

In [None]:
# None is "falsy": in a test it behaves like False
if n:
    print(1)
else:
    print(0)

In [None]:
# we can explicitly test for a variable being "None"
value = d.get('fourth')
if value is None:
    print('Key not found!')

Importing standard Python modules
---------------

Standard Python modules are libraries that are available without installing additional software (they come with the Python interpreter). They only need to be **imported**. The **import** keyword allows us to import standard (and non-standard) Python modules. Some common ones:
- os
- math
- sys
- urllib
- tens of others are available. See https://docs.python.org/3/py-modindex.html
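
For instance, `sys` gives information about the running interpreter and `math` provides common mathematical functions. A short sketch:

```python
import math
import sys

print(sys.version_info.major)   # major version of the running interpreter
print(math.sqrt(16))            # 4.0
print(math.floor(3.7))          # 3
```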

In [None]:
import os
os.listdir('.')

In [None]:
os.path.exists('data.txt')

In [None]:
os.path.isdir('.ipynb_checkpoints/')

**Import comes in different flavors**

In [None]:
import math
math.pi

In [None]:
from math import pi
pi

In [None]:
# aliases are possible on the module itself
import math as m
m.pi

In [None]:
# or alias on the function/variable itself
from math import pi as PI
PI

Keywords
------

- Keywords are special names that are part of the Python language.
- **A variable cannot be named after a keyword**; doing so raises a SyntaxError.
- The list of keywords can be obtained using these commands (**import** is itself a keyword; **print** is a built-in function in Python 3, not a keyword)

In [None]:
import keyword
# Here we use the "dot" operator, which gives access to an object's attributes and methods
print(keyword.kwlist)
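
The same module also offers `iskeyword`, which tests a single name. In the sketch below, `list` is the name of a built-in function rather than a keyword: using it as a variable name is allowed by the language, but it hides the built-in and is best avoided.

```python
import keyword

print(keyword.iskeyword("raise"))     # True
print(keyword.iskeyword("for"))       # True
print(keyword.iskeyword("list"))      # False: a built-in name, not a keyword
print(keyword.iskeyword("counter"))   # False: an ordinary name
```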

In [None]:
raise = 1

Exceptions
------

Exceptions are used to handle unexpected errors and avoid crashes.

In [None]:
d = {'first': 1,
     'second': 2}
d['third']

In [None]:
# Exceptions can be intercepted and crashes can be avoided
try:
    d['third']
except:
    print('Key not present')

In [None]:
# Specific exceptions can be intercepted
try:
    d['third']
except KeyError:
    print('Key not present')
except:
    print('Another error occurred')

In [None]:
# An error that no specific clause matches (here an AttributeError) is caught by the bare except
try:
    d['second'].non_existent_method()
except KeyError:
    print('Key not present')
except:
    print('Another error occurred')

In [None]:
# The exception can be assigned to a variable to inspect it
try:
    d['second'].non_existent_method()
except KeyError:
    print('Key not present')
except Exception as e:
    print('Another error occurred: {0}'.format(e))

In [None]:
# Exception can be created and "raised" by the user
if d['second'] == 2:
    raise Exception('I don\'t like 2 as a number')
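
In practice it is usually more informative to raise one of the specific built-in exceptions, such as `ValueError`, so that the caller can intercept exactly that case. The function `gc_fraction` below is a hypothetical helper written only for this illustration.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C characters in a sequence (hypothetical helper)."""
    if len(sequence) == 0:
        raise ValueError('empty sequence')
    return (sequence.count('G') + sequence.count('C')) / len(sequence)

try:
    gc_fraction('')
except ValueError as error:
    message = str(error)
    print('Could not compute:', message)

print(gc_fraction('GATC'))   # 0.5
```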

Simple file reading/writing
---------

**Writing**

In [None]:
mysequence = """>sp|P56945|BCAR1_HUMAN Breast cancer anti-estrogen resistance protein 1 OS=Homo sapiens GN=BCAR1 PE=1 SV=2
MNHLNVLAKALYDNVAESPDELSFRKGDIMTVLEQDTQGLDGWWLCSLHGRQGIVPGNRL
KILVGMYDKKPAGPGPGPPATPAQPQPGLHAPAPPASQYTPMLPNTYQPQPDSVYLVPTP
SKAQQGLYQVPGPSPQFQSPPAKQTSTFSKQTPHHPFPSPATDLYQVPPGPGGPAQDIYQ
VPPSAGMGHDIYQVPPSMDTRSWEGTKPPAKVVVPTRVGQGYVYEAAQPEQDEYDIPRHL
LAPGPQDIYDVPPVRGLLPSQYGQEVYDTPPMAVKGPNGRDPLLEVYDVPPSVEKGLPPS
NHHAVYDVPPSVSKDVPDGPLLREETYDVPPAFAKAKPFDPARTPLVLAAPPPDSPPAED
VYDVPPPAPDLYDVPPGLRRPGPGTLYDVPRERVLPPEVADGGVVDSGVYAVPPPAEREA
PAEGKRLSASSTGSTRSSQSASSLEVAGPGREPLELEVAVEALARLQQGVSATVAHLLDL
AGSAGATGSWRSPSEPQEPLVQDLQAAVAAVQSAVHELLEFARSAVGNAAHTSDRALHAK
LSRQLQKMEDVHQTLVAHGQALDAGRGGSGATLEDLDRLVACSRAVPEDAKQLASFLHGN
ASLLFRRTKATAPGPEGGGTLHPNPTDKTSSIQSRPLPSPPKFTSQDSPDGQYENSEGGW
MEDYDYVHLQGKEEFEKTQKELLEKGSITRQGKSQLELQQLKQFERLEQEVSRPIDHDLA
NWTPAQPLAPGRTGGLGPSDRQLLLFYLEQCEANLTTLTNAVDAFFTAVATNQPPKIFVA
HSKFVILSAHKLVFIGDTLSRQAKAADVRSQVTHYSNLLCDLLRGIVATTKAAALQYPSP
SAAQDMVERVKELGHSTQQFRRVLGQLAAA
"""

In [None]:
# First, we open a file in "w" write mode
fh = open("mysequence.fasta", "w")
# Second, we write the data into the file:
fh.write(mysequence)
# Third, we close:
fh.close()

**Reading**

In [None]:
# First, we open the file in read mode (r)
fh = open('mysequence.fasta', 'r')
# Second, we read the content of the file
data = fh.read()
# Third we close
fh.close()
# data is now a string that contains the content of the file being read
data

In [None]:
print(data)

**For both writing and reading you can use the context manager keyword "with" that will automatically close the file after using it, even in the case of an exception happening** 
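
The closing behaviour can be observed through the `closed` attribute of the file object. The sketch below writes to a throwaway file in a temporary directory (the path and file name are arbitrary) so that it does not touch any other file.

```python
import os
import tempfile

# a throwaway file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as fh:
    fh.write("some text\n")
print(fh.closed)            # True: closed when leaving the block

try:
    with open(path) as fh:
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass
closed_after_error = fh.closed
print(closed_after_error)   # True: closed even though an exception occurred
```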


**Writing**

In [None]:
# First, we open a file in "w" write mode with the context manager
with open("mysequence.fasta", "w") as fh:
    # Second, we write the data into the file:
    fh.write(mysequence)
# When getting out of the block, the file is automatically closed in a secure way 

**Reading**

In [None]:
# First, we open the file in read mode (r) with the context manager
with open('mysequence.fasta', 'r') as fh:
    # Second, we read the content of the file
    data = fh.read()
# When getting out of the block, the file is automatically closed in a secure way 

Notice the **\n** character (newline) in the string...

In [None]:
data

In [None]:
data.split("\n")

In [None]:
data.split("\n", 1)

In [None]:
header, sequence = data.split("\n", 1)

In [None]:
header

In [None]:
sequence

In [None]:
# we want to get rid of the \n characters
seq1 = sequence.replace("\n","")

In [None]:
# another way is to use the split/join pair
seq2 = "".join(sequence.split("\n"))

In [None]:
seq1 == seq2

In [None]:
# make sure that every letter is upper case
seq1 = seq1.upper()

In [None]:
# With the sequence, we can now play around 
seq1.count('A')

In [None]:
counter = {}
counter['A'] = seq1.count('A')
counter['T'] = seq1.count('T')
counter['C'] = seq1.count('C')
counter['G'] = seq1.count('G')
counter

If a file is too big, using the "read" method could completely fill our memory! It is advisable to use a "for" loop, which reads the file one line at a time.

In [None]:
with open('mysequence.fasta') as fh:
    for line in fh:
        # remove the newline character at the end of the line
        # also removes spaces and tabs at the right end of the string
        line = line.rstrip()
        print(line)

In [None]:
header = ''
sequence = ''
with open('mysequence.fasta') as fh:
    for line in fh:
        # remove the newline character at the end of the line
        # also removes spaces and tabs at the right end of the string
        line = line.rstrip()
        if line.startswith('>'):
            header = line
        else:
            sequence += line

In [None]:
header

In [None]:
sequence

List, set and dictionary comprehensions: more compact constructors
--------------

There is a more concise way to create a list, set or dictionary: the comprehension.

This is a bit more advanced and can be harder to read at first, so it is advisable to use it only
after you are familiar with the "regular" constructs.
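
To make the link with the "regular" constructs explicit, here is the same list built twice: once with a `for` loop and once with a comprehension. The variable names are arbitrary.

```python
# the "regular" way: start from an empty list and append in a loop
squares_loop = []
for x in range(5):
    squares_loop.append(x ** 2)

# the same list written as a comprehension
squares_comprehension = [x ** 2 for x in range(5)]

print(squares_comprehension)                   # [0, 1, 4, 9, 16]
print(squares_loop == squares_comprehension)   # True
```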

In [None]:
range(10)

In [None]:
my_list = [x*2 for x in range(10)]
my_list

In [None]:
redundant_list = [x for x in my_list*2]
redundant_list

In [None]:
my_set = {x for x in my_list*2}
my_set

In [None]:
my_dict = {x:x+1 for x in my_list}
my_dict

In [None]:
# an "if" clause can be used in a comprehension to filter elements
even_numbers = [x for x in range(10) if not x%2]
even_numbers

In [None]:
# a conditional expression (if/else) can also be used to compute each element
parity = ['odd' if x%2 else 'even' for x in range(10)]
parity

On objects and references
----------

A common source of errors for Python beginners

In [None]:
a = [1, 2, 3]
a

In [None]:
b = a  # b is a second reference to the same list, not a copy
b[0] = 10
print(a)

In [None]:
# How to copy a list instead of creating a second reference to it
a = [1, 2, 3]

# First, use the list() function
b1 = list(a)

# Second, use the slice operator
b2 = a[:]
b1[0] = 10
b2[0] = 10
a  # unchanged

In [None]:
# What about this ?
a = [1,2,3,[4,5,6]]
# copying the object
b = a[:]
b[3][0] = 10 # let us change the first item of the 4th item (4 to 10)
a

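Here we see that the inner list is still shared between `a` and `b`: slicing (like **list**) performs a shallow copy, which copies the outer list but not the objects it contains.
To copy the nested objects as well, use **deepcopy** from the **copy** module.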

In [None]:
from copy import deepcopy
a = [1,2,3,[4,5,6]]
b = deepcopy(a)
b[3][0] = 10 # let us change the first item of the 4th item (4 to 10)
a
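
The difference between the two kinds of copy can be checked with the `is` operator, which tells whether two names refer to the very same object. A short sketch using `copy` and `deepcopy` from the **copy** module (the variable names are arbitrary):

```python
from copy import copy, deepcopy

original = [1, 2, 3, [4, 5, 6]]
shallow = copy(original)     # same result as original[:] or list(original)
deep = deepcopy(original)

print(shallow is original)         # False: the outer list is a new object
print(shallow[3] is original[3])   # True: the inner list is shared
print(deep[3] is original[3])      # False: the inner list was copied too
```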