# Python crash course

Author: [Michele Lombardi](mailto:michele.lombardi2@unibo.it)

Python is a programming language, designed to be _productive_, _expressive_ and _easy to read_.
* No compilation step (unlike C, C# or Java)
* Multiple back-ends (you can think of them as "interpreters")
    * The default back-end (CPython) compiles the source code to bytecode on the fly, then runs it in an interpreter written in C

Python is not a particularly fast language (quite the opposite, in fact). However:

* There is a very large collection of external modules (i.e. libraries/packages)
* These external modules are often written in C and very efficient

In summary, python offers the combination of:

* An extremely productive (but slow) programming language
* Very efficient implementation for several "basic" operations

...And this is what makes it attractive for data scientists:

* Python allows one to quickly write flexible scripts for data analysis...
* ...Which can also be quite efficient, since the most expensive operations are run by external modules.

Here's a plot showing the trend in open job positions for different queries on a search engine (courtesy of Jean-François Puget):

![trends0.png](resources/trends0.png)

Just to be super clear: python is the blue line.

In order to run a python program (or python _script_) one usually needs to open a terminal window and then call:

```sh
python scriptname.py
```

Instead, we will work with **jupyter notebooks**. 

## Jupyter notebooks (formerly IPython notebooks)

This lab session is supported via a series of _Jupyter notebooks_. Jupyter is an awesome framework that simplifies the task of writing code interactively.

A notebook consists of a sequence of cells, some containing text, and some containing code. Cells that contain code _can be executed_: they are passed to a separate process (the _kernel_), which runs the python interpreter. The output is then returned to the notebook and displayed.

We don't really need the details: it's enough to understand that each code cell in a notebook can be executed. You can do it in a number of ways:

* Press the ">|" button in the toolbar
* Select any option from the "Cell" menu
* Press `Ctrl+Enter` or `Shift+Enter`

Usually, the keyboard shortcuts are the most convenient way. Occasionally you will need to run a whole notebook again, and that can be conveniently done from the "Cell" menu.

Note that `Shift+Enter` moves to the next cell after executing the current one, while `Ctrl+Enter` stays in the current cell.

We can operate on a jupyter notebook in two "modes":

* _Normal_ mode (cells have blue borders)
* _Edit_ mode (cells have green borders)

You can always switch to normal mode by pressing `Esc` and to edit mode by pressing `Enter` (or by double-clicking). After executing a cell, the notebook always switches to "normal" mode.

You can do a number of other useful operations on cells (adding, inserting, deleting...) from the "Edit" menu, and in general from the toolbar.

If you are a keyboard maniac like me, you can find a quick overview of the available keyboard shortcuts [in this pdf file](resources/weidadeyue_jupyter-notebook.pdf).

## Hello, world

Python has probably the simplest "hello, world" ever. You just need to write:

In [2]:
print("Hello, world!")

Hello, world!


Try to execute the above cell: select it and then press `Shift+Enter` (or `Ctrl+Enter`). The string "Hello, world!" should be printed immediately after the cell.

## Variables in Python

Variables in python are dynamically typed.

* There is no keyword for defining new variables
* The type of the content of a variable can change at run time

Here are some examples, and also a quick list of basic data types:

In [1]:
a = 10 # integer (arbitrary precision in Python 3)
b = 2.5 # real-valued number (typically C double)
c = 'A string with "quotes"'
d = "A string with 'apostrophes'"
e = True # A true boolean
f = False # A false boolean
g = None # Nothing (like "null" in Java or "NULL" in C)

Try to execute the above cell (and do the same with all the ones that will follow)!

The strings after the "#" character are comments.
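
As a small illustration of the second point above (the type of a variable's content can change at run time), here is a minimal sketch. The names `x` and `big` are chosen just for this example.

```python
# The same name can be rebound to values of different types.
x = 10
print(type(x))   # <class 'int'>
x = "ten"
print(type(x))   # <class 'str'>

# Python 3 integers have arbitrary precision: no overflow here.
big = 2**100
print(big)       # 1267650600228229401496703205376
```

The built-in `type` function returns the type of the value currently stored in a variable.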

You can also experiment with the code, either by editing the cell or by inserting a new cell, like the one you have below:

## Arithmetic operators

Python supports the usual arithmetic operators:

In [8]:
a = 10 + 2 - 4 # sum and difference
b = 10 + (2 - 4) # change of priority (parentheses)
c = 10 * 2 # product
d = 10 / 3 # true division (in Python 3, "/" always returns a float)
e = 10 / float(3) # explicit cast (only needed in Python 2, where int / int truncates)
f = 10 % 3 # modulus (remainder of integer division)
g = 10 / 3.0 # real division
h = 10**2 # power
i = abs(-3.4) # absolute value

print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
print(g)
print(h)
print(i)

a = 3
2 / float(a)

8
8
20
3.3333333333333335
3.3333333333333335
1
3.3333333333333335
100
3.4


0.6666666666666666
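
The outputs above show that `/` between two integers returns a float in Python 3. When a truncated (integer) result is needed, the `//` operator and the built-in `divmod` function can be used. A minimal sketch, where `q` and `r` are illustrative names:

```python
q = 10 // 3            # floor division
r = 10 % 3             # remainder
print(q, r)            # 3 1
print(divmod(10, 3))   # (3, 1): quotient and remainder in one call
print(-7 // 2)         # -4: the result is rounded towards minus infinity
print(7.5 // 2)        # 3.0: with a float operand the result is a float
```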

## Logical operators

Logical operators are available as:

In [7]:
v = True
w = False
a = v and w # Logical "and"
b = v or w # Logical "or"
c = not v # not

print(a)
print(b)
print(c)

False
True
False


If we are dealing with bits (e.g. with the boolean `True` and `False` values) we can also use the bitwise operators `&` and `|` instead of `and` and `or`:

In [9]:
a2 = v & w
b2 = v | w
print(a, a2)
print(b, b2)

False False
True True
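
The two families of operators give the same result on booleans, but they are not interchangeable. The sketch below highlights one difference: `and`/`or` stop evaluating as soon as the result is known, while `&`/`|` always evaluate both operands. The helper `check` and the list `calls` are hypothetical names introduced only for this illustration.

```python
calls = []

def check(name, value):
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return value

# "and" short-circuits: the right operand is skipped if the left one is False
r1 = check("left", False) and check("right", True)
print(r1, calls)     # False ['left']

calls.clear()
# "&" always evaluates both operands
r2 = check("left", False) & check("right", True)
print(r2, calls)     # False ['left', 'right']

# On integers, "&" and "|" work bit by bit
print(6 & 3, 6 | 3)  # 2 7
```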


## Comparison operators

The comparison operators are:

In [9]:
print(5 < 10) # Less than
print(5 <= 10) # Less than or equal
print(5 == 10) # equal
print(5 >= 10) # Greater than or equal
print(5 > 10) # Greater than
a = 7
print(5 <= a <= 10) # Newbie error in C, works in python!

True
True
False
False
False
True


## String formatting

The `%` operator for strings is overloaded to allow C-style formatting, e.g.:

In [14]:
one = 1
two = 2.0
three = one + two
print('And %d + %f = %.2f' % (one, two, three))

And 1 + 2.000000 = 3.00


* `%d` prints an integer
* `%f` prints a real
* `%.2f` prints a real with two decimals
* `% (one, ...)` specifies the source for each field
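
For comparison, the same line can be produced with two other formatting mechanisms of the language: the `str.format` method and f-strings (the latter assume Python 3.6 or later). This is just a sketch of the alternatives; the `%` operator shown above keeps working.

```python
one = 1
two = 2.0
three = one + two

s1 = 'And %d + %f = %.2f' % (one, two, three)             # C-style
s2 = 'And {:d} + {:f} = {:.2f}'.format(one, two, three)   # str.format
s3 = f'And {one:d} + {two:f} = {three:.2f}'               # f-string

print(s1)
print(s2)
print(s3)
# All three lines print: And 1 + 2.000000 = 3.00
```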

## Lists

Lists (mutable sequences) are primitive data structures:

In [15]:
a = ['a', 'is', 'a', 'list']
b = ['a', 1, True] # a list with multiple types
c = [1, 2, 3]

a[0] = 'b' # indexing (traditional)
print(a[-1]) # backwards indexing

a.append('another element') # append an element
print(a)

a.pop() # pop last element
print(a)

list
['b', 'is', 'a', 'list', 'another element']
['b', 'is', 'a', 'list']



Lists are objects, i.e. data items with callable methods.

Want to know which methods are available?

* You can either Google "python list"...
* ...Or use the jupyter on-line help!

First, build a list:

In [16]:
a = [1, 2, 3]

Then, use TAB completion. Write:

a.[AND THEN HIT "TAB"]

In [17]:
a.

SyntaxError: invalid syntax (<ipython-input-17-a0d310e2b5e6>, line 1)

You should see a pop-up list, from which you can pick a method.

Then you can go further: select a method and/or put the cursor at the end of the method name, then hit [SHIFT+TAB].

In [None]:
a.append

The notebook should display a small (but incredibly useful) help window at the bottom of the page.
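
Similar information is available outside a notebook too, via the built-in `dir` function and the docstrings attached to each method (the built-in `help` function prints them in a readable form). A minimal sketch, where `methods` is an illustrative name:

```python
a = [1, 2, 3]

# Public methods of the list (names that do not start with an underscore)
methods = [name for name in dir(a) if not name.startswith('_')]
print(methods)

# The docstring of a method: a short description of what it does
print(a.append.__doc__)
```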

## Tuples

Tuples in python are _immutable sequences_. They are also a primitive data structure:

In [14]:
a = ('a', 'is', 'a', 'tuple')
b = ('a', 1, True) # a tuple with multiple types

print(a[0])
a[0] = 'b' # ERROR!!! Tuples are immutable (comment out this line to run the rest)

a = ('b', a[1], a[2], a[3]) # This works
print(a[-1]) # backwards indexing

print(len(a)) # number of elements

a


TypeError: 'tuple' object does not support item assignment

In the code, `len` is a function that returns the number of elements in the tuple. It works for most data structures in python (including lists).

Tuples have "superpowers" ;-)

In [15]:
# tuple assignment
a, b = 10, 20
print(a, b)

# tuple unpacking
c = (1, 2, 3)
d, e, f = c 
# works with every iterable sequence!
d, e, f = [1, 2, 3]

# tuple printing
print(d, e, f) # automatic separator: space

10 20
1 2 3


You can obtain more help (and see the other available methods) in the usual way.
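
Two more uses of tuple assignment and unpacking are sketched below: swapping two variables without a temporary one, and collecting "the rest" of a sequence with a starred name. All variable names are illustrative.

```python
# Swap two variables: the right-hand side is packed into a tuple first
a, b = 10, 20
a, b = b, a
print(a, b)          # 20 10

# A starred name collects all the remaining items in a list
first, *rest = [1, 2, 3, 4]
print(first, rest)   # 1 [2, 3, 4]

# Unpacking can be nested, e.g. in the header of a for loop
for name, (x, y) in [('p', (0, 1)), ('q', (2, 3))]:
    print(name, x, y)
```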

## Dictionaries

Dictionaries (i.e. maps) are also primitive data structures in python:

In [17]:
a = {'name':'gigi', 'age':23} # key:value pairs
b = {'name':'gigi', 23:'age'} # keys of different types
# a key can be of any immutable type
c = {(0, 1):'a tuple as key!'}

print(a['name']) # prints 'gigi'

a['name'] = 'pino'
print(a)

print(len(a)) # number of items
print(a.keys()) # view of the keys
print(a.values()) # view of the values
print(a.items()) # view of the (key, value) tuples

gigi
{'name': 'pino', 'age': 23}
2
dict_keys(['name', 'age'])
dict_values(['pino', 23])
dict_items([('name', 'pino'), ('age', 23)])


You can get more help in the usual way.
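
A few more common dictionary operations, as a minimal sketch: iterating over key/value pairs, reading a key that may be missing, and adding or deleting entries. The `'city'` key and its value are made up for the example.

```python
a = {'name': 'gigi', 'age': 23}

# Iterate over (key, value) pairs via tuple unpacking
for key, value in a.items():
    print(key, value)

# "get" returns a default instead of raising KeyError for a missing key
print(a.get('city', 'unknown'))   # unknown

a['city'] = 'Bologna'   # assigning to a new key inserts it
del a['age']            # "del" removes a key (and its value)
print(a)                # {'name': 'gigi', 'city': 'Bologna'}
```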

## Sets

The last primitive data structure that we will see is the _set_, i.e. an unordered collection of items without repetitions.

In [18]:
a = [1, 1, 2, 2, 3, 3] # This is a list
b = set(a) # returns {1, 2, 3} (a set obtained from the list)
c = {1, 1, 2, 2, 3, 3} # builds {1, 2, 3}
print(c)

b.add(4) # add an element
b.remove(3) # remove an element

print(len(b)) # number of elements

{1, 2, 3}
3
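
Sets also support the usual operations from set theory. A minimal sketch follows; `sorted` is used only to print the elements in a predictable order, since sets are unordered.

```python
a = {1, 2, 3}
b = {3, 4}

print(sorted(a | b))   # union: [1, 2, 3, 4]
print(sorted(a & b))   # intersection: [3]
print(sorted(a - b))   # difference: [1, 2]
print(2 in a)          # membership test: True
```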


## Operators for collections

Some operators have ad hoc behavior when applied to collections:

In [20]:
a = [1, 2, 3] + [4, 5] # list concatenation
b = (1, 2, 3) + (4, 5) # tuple concatenation

print(sum(a) + sum(b)) # sum (lists and tuples)

print(2 in a) # membership testing (any collection)

c = {'a':1, 'b':2}
print('a' in c) # "in" looks for keys in dictionaries
print(1 in c) # so this is False

30
True
True
False


Note in particular the `sum` function and the `in` operator.

## Control flow in python

* Instructions end when the line ends (no final ";")
* What if we need instructions on multiple lines?

In [26]:
x = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 9 + 10 + 11 + 12 + \
    13 + 14 # You can use "\" to break a line
print(x)
print(1 + 2 + 3 + 4 + 5 + 6 + 7 + 9 + 10 + 11 + 12 +
      13 + 14) # But there's no need to do that inside parentheses

97
97


* There are no "{}" block delimiters!
* Blocks are defined _via indentation_

Some practical examples of blocks inside **conditional instructions**:

In [30]:
b = 10
if -1 <= b <= 1: # no parentheses needed
    b *= 2
    print('b = %f' % b) # no need for parentheses
                       # with a single term
elif b > 1:  # "elif"/"else" are optional
    print('Big number') # different block, different indentation
else:
    print('Small number')

Big number


If the block consists of a single instruction, it can stay on the same line as the `if`:

In [31]:
if b > 0: print(b)

10


**For loops** have the following syntax:

In [21]:
for x in ['first', 'second', 'third']:
    print(x)
    
for i in range(10):
    print(i)

for i in range(10): # range returns [0,..., 9]
    if i == 0: continue # continue, as in C
    if i > 5: break # break, as in C
    print(i)

# enumerate yields (index, item) tuples
# they can be "separated" via tuple unpacking
for i, x in enumerate(['three', 'item', 'list']):
    print(i, x)

# "zip" yields tuples, pairing up the items of its arguments
# range(x, y) = integers from x to y-1
for x, y in zip(range(5), range(5, 10)):
    print(x, y) # "0 5" "1 6" "2 7" ...

first
second
third
0
1
2
3
4
5
6
7
8
9
1
2
3
4
5
0 three
1 item
2 list
0 5
1 6
2 7
3 8
4 9


The `enumerate` function is particularly useful when we need to iterate over both items and indices. In many cases, `zip` can be used to avoid indices altogether.
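
To make the last remark concrete, here is a minimal sketch that visits two lists in parallel, first with explicit indices and then with `zip`. The lists `names` and `scores` are illustrative data.

```python
names = ['gigi', 'gianni', 'gino']
scores = [20, 30, 24]

# Index-based version: works, but the indices are just noise
for i in range(len(names)):
    print(names[i], scores[i])

# Same loop with zip: no indices needed
for name, score in zip(names, scores):
    print(name, score)

# enumerate accepts an optional starting value for the index
for pos, name in enumerate(names, start=1):
    print(pos, name)

# zip is lazy: wrap it in list() to see all the pairs at once
pairs = list(zip(names, scores))
print(pairs)   # [('gigi', 20), ('gianni', 30), ('gino', 24)]
```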

**While loops** are also available:

In [33]:
a = 10
while a > 0:
    print(a)
    a -= 1 # decrement (no ++ or -- operators)

10
9
8
7
6
5
4
3
2
1


## Comprehensions

Comprehensions are a compact _syntactical construct to build and fill data structures on the fly_. Some examples:

In [23]:
[v**2 for v in range(1, 16) if v % 2 == 0]

[4, 16, 36, 64, 100, 144, 196]

In [34]:
# list comprehension
a = [i**2 for i in range(5) if i % 2 == 0]

# set comprehension
b = {i**2 for i in range(5) if i % 2 == 0}

# dictionary comprehension (<key>:<expr>)
c = {i : i**2 for i in range(5) if i % 2 == 0}

# multiple levels
a = {(i,j) : 0 for i in range(10)
               for j in range(10)}

The general syntax is (using lists as examples):

```
<list comprehension> ::= [<generator expression>]
<generator expression> ::= <expr>
                           {for <var> in <collection>
                            [if <condition>]}
```
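
One way to read this syntax is to expand a comprehension into the equivalent explicit loop. The sketch below does this for the list comprehension used earlier, and shows that multiple `for` clauses behave like nested loops (the leftmost one is the outermost). The names `squares_loop` and `pairs` are illustrative.

```python
squares = [i**2 for i in range(5) if i % 2 == 0]

# Equivalent explicit loop
squares_loop = []
for i in range(5):
    if i % 2 == 0:
        squares_loop.append(i**2)

print(squares)        # [0, 4, 16]
print(squares_loop)   # [0, 4, 16]

# Two "for" clauses correspond to two nested loops
pairs = [(i, j) for i in range(2) for j in range(3)]
print(pairs)   # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```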

## Functions

Functions can be defined using the syntax:

In [35]:
# Simple function
def f1(x):
    return x**2

# Function within function
def f3(x):
    a = 10
    def f4(x): # scope includes that of f3
        return x*a # f4 can access local vars of f3
    return f4(x)

In [25]:
def f():
    return 42, 21

a, b = f()
a, b

(42, 21)

As in many other programming languages, the `return` statement specifies what the function should send back to the caller.

In python, functions have access to the variables in the environment in which they are defined (e.g. `f4` can access variables defined by `f3`).

* This can be very useful at times...
* ...And it's incredibly messy on many other occasions
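
Both sides of this remark can be sketched in a few lines. In the first part, access to the enclosing environment is used to build specialized functions. In the second, a function silently changes behavior because a variable it depends on is modified elsewhere. The names `make_multiplier`, `scaled` and `scale` are hypothetical, chosen for the example.

```python
# Useful: the inner function "remembers" the value of factor
def make_multiplier(factor):
    def multiply(x):
        return x * factor
    return multiply

double = make_multiplier(2)
triple = make_multiplier(3)
print(double(10), triple(10))   # 20 30

# Messy: the result depends on a variable defined outside the function,
# which is looked up when the function is called
scale = 10

def scaled(x):
    return x * scale

print(scaled(1))   # 10
scale = 100
print(scaled(1))   # 100
```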

It is possible to specify _named arguments_ and _default values_ for functions:

In [36]:
def the_answer_is(res = 42):
    return res

print(the_answer_is()) # prints 42
print(the_answer_is(13)) # prints 13
print(the_answer_is(res=3)) # prints 3

42
13
3


Python incorporates some elements of _functional programming_. In particular, functions are objects!

Here, `transform` calls the function passed as argument (in the `f` variable) on each element of the `x` collection:

In [37]:
def transform(x, f):
    return [f(v) for v in x]

def g(x):
    return 2*x

transform([1, 2, 3], g)

[2, 4, 6]

Functions as objects are very handy in some situations. One of them is sorting collections:

In [39]:
voti = [('gigi', 20), ('gianni', 30), ('gino', 24)]

def f(t): return -t[1]

# sorted returns a sorted copy of the collection
# key = score function to be used for sorting
for name, score in sorted(voti, key=f):
    print(name) # prints gianni, gino, gigi

gianni
gino
gigi


When we need a small function for an ad-hoc task (e.g. sorting), we can use a nameless function (aka "lambda function") instead.

This code is equivalent to the one above, but it avoids the explicit definition of the score function:

In [40]:
voti = [('gigi', 20), ('gianni', 30), ('gino', 24)]

# sorted returns a sorted copy of the collection
# key = score function to be used for sorting
for name, score in sorted(voti, key=lambda t: -t[1]):
    print(name) # prints gianni, gino, gigi

gianni
gino
gigi


The lambda function is constructed by:

In [41]:
lambda t: -t[1]

<function __main__.<lambda>>

The general syntax is:

```
lambda <parameter list>: <expression to be returned>
```
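
Lambda functions are accepted wherever a function object is expected. As a minimal sketch, here are a few more built-in functions that take a `key` argument, applied to the same `voti` list as above (the names `best`, `by_name` and `by_score` are illustrative):

```python
voti = [('gigi', 20), ('gianni', 30), ('gino', 24)]

# max with a key: the item with the highest score
best = max(voti, key=lambda t: t[1])
print(best)       # ('gianni', 30)

# sort by name instead of score
by_name = sorted(voti, key=lambda t: t[0])
print(by_name)    # [('gianni', 30), ('gigi', 20), ('gino', 24)]

# descending order without negating the score
by_score = sorted(voti, key=lambda t: t[1], reverse=True)
print(by_score)   # [('gianni', 30), ('gino', 24), ('gigi', 20)]
```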

## Modules

Python has a huge collection of external modules (both standard and non-standard).

A module can be imported in several ways:

In [42]:
import numpy # A module for vector operations
import numpy as np # Renaming, for ease of access
import numpy.random # Import a submodule
from numpy import random # Import a submodule (allows accessing with "random")
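
The different import forms only change the name through which a module (or one of its members) is accessed: the underlying object is the same. The sketch below checks this using the standard `math` and `os` modules, chosen here only because they ship with every python installation.

```python
import math              # access with "math.<name>"
import math as m         # same module, shorter name
from math import sqrt    # import a single function
import os.path           # import a submodule
from os import path      # same submodule, accessible as "path"

print(math.sqrt(16.0))        # 4.0
print(m is math)              # True
print(sqrt is math.sqrt)      # True
print(path is os.path)        # True
```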

# Numpy

[**numpy**](http://www.numpy.org) is a python module for vector computation. Numpy enables efficient computations over vectors, using an approach and API that are inspired by Matlab.

We will heavily use the [scikit-learn](http://scikit-learn.org/stable/index.html) python module, but we will have plenty of time to see how it works.

Finally, we will use two more important python modules, namely [matplotlib](https://matplotlib.org/index.html) and [pandas](https://pandas.pydata.org). For both of them, we will be interested in very specific functionalities and it will make sense to present them at the moment of use.


## Numpy basics

Numpy provides a fundamental data structure called **array**. Numpy arrays are data structures corresponding to fixed-size vectors, matrices, or tensors (i.e. the generalization of matrices to n dimensions).

They are internally stored as C arrays (i.e. sequences of contiguous elements in main memory), which makes them very efficient to process. This storage method is the same, no matter what the number of array dimensions is. As a consequence, each array object has an associated _shape_ field (a tuple) that defines the actual structure of the data.

An all-zero, 5x5 matrix can be built with:

In [6]:
import numpy as np # It is customary to rename "numpy" as np

np.zeros((5, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

The `(5, 5)` parameter represents the shape.

Notice that jupyter always displays the output of the last expression in a cell, even if no `print` instruction is issued.

An all-one, 3x4 matrix can be obtained with:

In [2]:
print(np.ones((3, 4)))

[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]


One-dimensional arrays can be built by passing a single number instead of a tuple, i.e.:

In [4]:
print(np.zeros(5))
print(np.ones(4))

[ 0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.]
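
The shape of an array can be inspected directly. A minimal sketch, using the `shape`, `ndim` and `size` attributes of numpy arrays (the names `v`, `m` and `t` are illustrative):

```python
import numpy as np

v = np.zeros(3)           # vector
m = np.zeros((2, 3))      # matrix
t = np.zeros((2, 3, 4))   # 3-dimensional tensor

for arr in (v, m, t):
    # shape tuple, number of dimensions, total number of elements
    print(arr.shape, arr.ndim, arr.size)
# (3,) 1 3
# (2, 3) 2 6
# (2, 3, 4) 3 24
```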


An array of numbers separated by a fixed increment can be built with:

In [4]:
print(np.arange(1, 10, 2))

[1 3 5 7 9]


Where the first argument is the starting number, the second is the limit (non-included) and the third is the increment. All arguments except the limit are optional. Hence we can write:

In [6]:
x = np.arange(10)
print(x)

[0 1 2 3 4 5 6 7 8 9]


This is equivalent to `np.arange(0, 10, 1)`: the start defaults to 0 and the increment to 1.

Once an array has been built, we can change its shape. This is very efficient because the memory representation stays the same: it's only the shape tuple that is updated.

Reshaping is done using the `reshape` method:

In [8]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
print(x.reshape((2, 5)))

[[0 1 2 3 4]
 [5 6 7 8 9]]


`reshape` returns a reshaped copy of the original array (careful: this is not strictly true, since whenever possible the result is a _view_ that shares its data with the original).
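
The caveat matters in practice: when the reshaped array shares memory with the original one, writing to one of them changes the other as well. A minimal sketch, relying on `np.shares_memory` and on the `copy` method to obtain an independent array (the names `y` and `z` are illustrative):

```python
import numpy as np

x = np.arange(10)
y = x.reshape((2, 5))

print(np.shares_memory(x, y))   # True: y is a view on the data of x
y[0, 0] = 99
print(x[0])                     # 99: the original array changed too

# An explicit copy is independent of the original
z = x.reshape((2, 5)).copy()
z[0, 0] = -1
print(x[0])                     # 99: x is not affected
```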

Elementwise operations on numpy arrays can be efficiently performed using the usual arithmetic operators. This is possible since python allows operator overloading, which numpy takes great advantage of.

Some examples:

In [10]:
a = np.arange(1, 10).reshape((3,3))
b = np.arange(1, 10).reshape((3,3))
print(a)
print(b)

print(a + b)
print(a * b) # elementwise product (not the matrix product!)
print(a / b)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
[[ 1  4  9]
 [16 25 36]
 [49 64 81]]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


The trick works even if one of the two terms is a scalar:

In [8]:
c = 2

print(a * c)
print(c + a)
print(c / a)

[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
[[ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[ 2.          1.          0.66666667]
 [ 0.5         0.4         0.33333333]
 [ 0.28571429  0.25        0.22222222]]


If we need the matrix multiplication or the dot product, we can use:

In [11]:
print(np.matmul(a, b)) # Matrix multiplication
print(np.dot(a[0], b[0]))

[[ 30  36  42]
 [ 66  81  96]
 [102 126 150]]
14
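
In Python 3.5 and later, the `@` operator can be used as a shorthand for matrix multiplication. A minimal sketch, reusing the same 3x3 matrices as above:

```python
import numpy as np

a = np.arange(1, 10).reshape((3, 3))
b = np.arange(1, 10).reshape((3, 3))

# "@" gives the same result as np.matmul
print(np.array_equal(a @ b, np.matmul(a, b)))   # True

# Between two vectors it computes the dot product
print(a[0] @ b[0])                # 14

# Matrix-vector product: here it selects the first column of a
print(a @ np.array([1, 0, 0]))    # [1 4 7]
```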


The example also shows that the rows (i.e. the first dimension, or _axis_) of a numpy array can be accessed with the notation:

```
<array>[<index>]
```

Numpy arrays, however, support much more flexible forms of indexing. We will use matrices to show some examples, but the presented methods work for n-dimensional arrays:

In [16]:
a = np.arange(16).reshape((4,4))

print(a)
print(a[1,1]) # Access by row and column
print(a[1:3, 1]) # Range of rows, single column
print(a[1:3, 1:3]) # Range of rows, range of columns
print(a[1, :]) # One full row
print(a[:, 1]) # One full column
print(a[2:, :]) # All rows, starting with index 2

print(a > 3)

print(a[a > 3]) # Indexing with a boolean mask (this loses the shape)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
5
[5 9]
[[ 5  6]
 [ 9 10]]
[4 5 6 7]
[ 1  5  9 13]
[[ 8  9 10 11]
 [12 13 14 15]]
[[False False False False]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]
[ 4  5  6  7  8  9 10 11 12 13 14 15]
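
Boolean masks can also be used on the left-hand side of an assignment, to modify only the selected elements. When the shape must be preserved, `np.where` is one possible alternative. A minimal sketch follows; the arrays `b` and `c` and the replacement values are chosen just for illustration.

```python
import numpy as np

a = np.arange(16).reshape((4, 4))

# Assignment through a boolean mask: only the selected elements change
b = a.copy()
b[b > 3] = 0
print(b)
# [[0 1 2 3]
#  [0 0 0 0]
#  [0 0 0 0]
#  [0 0 0 0]]

# np.where(condition, x, y) picks from x where the condition holds,
# from y elsewhere, and keeps the shape of the array
c = np.where(a > 3, a, -1)
print(c.shape)       # (4, 4)
print(c[0], c[1])    # [-1 -1 -1 -1] [4 5 6 7]
```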
