[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/guvendemirel/QMULSBM_PhDWorkshop/blob/master/SBM_PhD_python_ws_part1and2.ipynb)

# QMUL SBM PHD Python Workshop 
## Introduction to Python Programming

This notebook covers the first and second parts of the PhD Workshop on [Python](https://www.python.org/) programming. In the first session, we will introduce the basics of programming with Python, looking at variable assignment and different data types (numerical, text, lists, and tuples). In the second session, we continue with data types (sets, dictionaries) and the building blocks of programming (loops, conditional statements, and functions). In the following sessions, we will work on extracting and cleaning data and on processing textual data with natural language processing techniques.

The first set of computer lab tutorials on Python programming follows the structure and organisation of McKinney, W. 2017. Python for Data Analysis, 2nd Edition, O'Reilly, but there are many equally good online resources. We might also cover web scraping if time permits.

## Install Python and Jupyter Notebook on Your Personal Computers

You can install Python and Jupyter Notebook by downloading the most recent stable version of the [Anaconda](https://www.anaconda.com/products/individual) distribution from the provided link. Different interpreters and development environments can be used with Python. We will use the [Jupyter Notebook](https://jupyter.org/), a web-browser-based application that allows you to run code and to insert text, equations, and graphics alongside it. The workshop notebooks can also be opened directly on Google Colab.

After installation, you can open a Jupyter notebook in which you can write and execute code. Follow the steps on the [Jupyter Notebook website](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html) for operating-system-specific instructions, which involve running `jupyter notebook` in a terminal. You can `rename` and `save` notebooks from the `File` menu.

## Markdown and Code Cells

You can insert **code** and **markdown** cells into Jupyter notebooks. Code cells are for typing and running Python code. Markdown cells are for the text content and organizing and formatting documents. 

You can add section titles with `#`, `##`, and `###`. You can create numbered and bulleted lists by starting each line with the corresponding marker, e.g. `1.`, `2.`, `3.` for numbered lists and `-` or `*` for bullets. You can format text with, for instance, `*italic_text*` and `**bold_text**`. Visit the [Markdown Guide](https://www.markdownguide.org/) for further details.

You can change the cell type from the drop-down menu at the top or choose the cell and press the keys <kbd>y</kbd> and <kbd>m</kbd> for the code and the markdown cells, respectively. In Google Colab, you need to press <kbd>ctrl</kbd> + <kbd>m</kbd> + <kbd>y</kbd> and <kbd>ctrl</kbd> + <kbd>m</kbd> + <kbd>m</kbd>, respectively.

Python is an interpreted language, meaning that you can freely execute different segments of code (cells, in Jupyter notebooks) without first compiling the whole program to machine code. This reduces development time and facilitates interactive experimentation with data, but it slows down execution, especially for computationally intensive tasks. This is why most heavy computations are implemented in other languages such as C and C++ and then wrapped in Python.

Clicking *Run* or pressing <kbd>shift</kbd>+<kbd>enter</kbd> (also <kbd>ctrl</kbd>+<kbd>enter</kbd> in Jupyter Notebook) runs the selected cell (more options are available under *Cell*).

## Getting Started

We shall now start using Python as a simple calculator. You can use the operators `+` (add), `-` (subtract), `*` (multiply), `/` (divide), `**` (power), and `%` (modulus) in *expressions*. For instance:

In [1]:
# Calculate (3 - 5 * 1.8) / 1.2
(3 - 5 * 1.8 ) / 1.2

-5.0
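To illustrate the remaining operators from the list above, here is a minimal sketch that applies each one to two small integers. The values and variable names are made up for this illustration.

```python
# Illustrative values (not part of the workshop data)
a = 17
b = 5

total = a + b        # addition: 22
difference = a - b   # subtraction: 12
product = a * b      # multiplication: 85
quotient = a / b     # division always returns a float: 3.4
power = a ** 2       # power: 289
remainder = a % b    # modulus, i.e. the remainder of the division: 2

print(total, difference, product, quotient, power, remainder)
```

Note that `/` returns a `float` even when both operands are integers.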

### Variable Assignment

An assignment evaluates the expression on the right-hand side of `=` and binds the result to the variable named on the left-hand side.

In [2]:
# Assign square of 3 to x
x = 3 ** 2

You can print the value of the variable `x` by simply running `x` or `print(x)`. `print()` is the first *function* we are seeing and using. A function is a block of code that takes inputs through its arguments, performs some task, and potentially returns a value at the end.

In [3]:
print(x)

9


Some data types, such as integers, floating-point numbers, and strings, are *immutable*: their values cannot be changed in place. When you appear to alter the value, Python creates a new object and rebinds the variable name to it.

Check what results you expect to see:

In [4]:
x = 10
y = x 
x *= 2 #This is called augmented assignment and it is equivalent to x = x * 2
print('x:', x, ',y:', y) #You can print multiple values by putting a comma in between 

x: 20 ,y: 10
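One way to see the rebinding is the built-in `id()` function, which returns the identity of the object a name currently refers to. `id()` is not used elsewhere in this workshop; the sketch below is only meant to make the behaviour visible.

```python
x = 10
id_before = id(x)   # identity of the object that x refers to
x *= 2              # builds a new int object and rebinds the name x to it
id_after = id(x)

print(x)                        # 20
print(id_before == id_after)    # False: x now refers to a different object
```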


## Data Types

Python has several built-in data types and you can define new types by creating classes. A thorough coverage of object-oriented programming is beyond the scope of this workshop. 

You can learn the type of an object by calling the `type` function. Here we focus on the most fundamental data types. 

### Numeric Data Types
Numeric data is held as integers (`int`) and floating-point numbers (`float`).

In [5]:
x = 1
y = 2.8
# Print the types of the variables x and y
print(type(x), type(y), type(x + y)) 

<class 'int'> <class 'float'> <class 'float'>


For numerical computations, the most commonly used Python package (library) is `numpy`, which most importantly provides a collection of mathematical functions and the `array` type (`ndarray`). We will look at `numpy` in more detail later but we shall now illustrate how it is used. For using functions or classes from a package, you should first `import` the package.

In [6]:
import numpy as np # np is an alias for numpy, which we will use to access numpy functions

We can now call functions and constants from the `numpy` package, for instance `np.sqrt()` and `np.pi`.

In [7]:
# Print the result of the expression pi + sqrt(3)
print(np.pi+np.sqrt(3))

4.87364346115867


### Text Data Type

The text data type in Python is `str`, which is immutable but iterable. You define a `str` by using `''` or `""`. 

In [8]:
# Assign your first name and last name to the respective variables
first_name = 'Guven'
last_name = 'Demirel'

You can concatenate `str` objects by `+`: 

In [9]:
# Create your full name variable
name = first_name + ' ' + last_name
# Print your full name:
name

'Guven Demirel'

You can access the individual characters of your string variable because it is a sequence. Python is zero-indexed, meaning that the indices of the first, second, and third elements are 0, 1, and 2. Subscripting is done by `x[n]`, which returns the element at index `n` of `x`; for example, `x[3]` returns the fourth element.

In [10]:
# Print the second letter of your name
name[1]

'u'

The last element can be accessed by -1 and you can go backwards by -1, -2, -3, etc.

In [11]:
# Print the first two and last two letters of your name
print(name[0], name[1], name[-2], name[-1])

G u e l


You can *slice* the string by giving the first (inclusive) and last (exclusive) indices of the slice. For instance, `x[1:4]` returns the elements of `x` at indices 1, 2, and 3.

In [12]:
# Return the second to fourth letters of your name, including both
name[1:4]

'uve'

Note that you cannot change this value because `str` objects are immutable.

In [13]:
# Try the following
name[3] = 'z'

TypeError: 'str' object does not support item assignment
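Because a string cannot be changed in place, the usual workaround is to build a new string from slices of the old one. The following sketch assumes the same example name as above; `new_name` is a variable introduced only for this illustration.

```python
name = 'Guven Demirel'

# Build a new string: everything before index 3, the new letter, everything after index 3
new_name = name[:3] + 'z' + name[4:]

print(new_name)  # Guvzn Demirel
print(name)      # Guven Demirel (the original string is unchanged)
```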

You can get the length of a `str` object (and any iterable) by calling the `len()` function:

In [14]:
# Print the number of letters in your name
len(name)

13

#### Str Methods
When you create a string object, it comes equipped with a set of methods. You can get the full list of methods with `dir(str)`. Some useful methods are as follows:
- `split`: splits the string into a list of words separated by whitespace
- `count`: counts the occurrences of the provided substring in the string
- `find`: returns the index at which the first instance of the substring starts (-1 if it is not found)
- `isalpha`: returns `True` if all characters are alphabetical letters, `False` otherwise
- `isdigit`: returns `True` if all characters are digits, `False` otherwise
- `lower` (`upper`): returns a copy with all letters converted to lower (upper) case
- `replace`: returns a copy with all instances of the substring replaced by the new substring
- `join`: joins a sequence of strings, using the string on which it is called as the separator

In [15]:
sentence = "We have all been looking forward to this day"
# split the sentence to a list of words
words = sentence.split()
#print words
words

['We', 'have', 'all', 'been', 'looking', 'forward', 'to', 'this', 'day']

In [16]:
# Count the number of instances of the phrases 
#'THE' / 'the' / 'tHe' ... and also the start index of the first instance
text = "instances of THE, while the first starts at index"
print('counts of "the" ', text.upper().count('THE'), 
      ' - first "the" starts at index: ', text.upper().find('THE'))

counts of "the"  2  - first "the" starts at index:  13
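The methods from the list that are not used in the cells above can be tried out in the same way. The sample strings below are hypothetical and chosen only to make each result easy to check.

```python
code = 'SBM2021'         # hypothetical sample strings
phrase = 'Data Science'

has_only_letters = code.isalpha()              # False: the string contains digits
year_is_digits = code[3:].isdigit()            # True: '2021' consists of digits only
lowered = phrase.lower()                       # 'data science'
replaced = phrase.replace('Data', 'Computer')  # 'Computer Science'
date = '-'.join(['2021', '03', '15'])          # '2021-03-15'

print(has_only_letters, year_is_digits)
print(lowered, '|', replaced, '|', date)
```

None of these calls changes `code` or `phrase`; each returns a new object, which is consistent with strings being immutable.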


## LIST

A list is a sequence data type that provides an ordered, iterable collection of individual elements. You can create a list of any objects you want: for instance, a list of numbers, a list of strings, a list mixing numbers and strings, a list of lists, or a list of functions.

In [17]:
# list of integers 2, 7, 9
first_list = [2,7,9] 
# print the list
first_list

[2, 7, 9]

In [18]:
# create a list of strings 'apple' and 'orange'
second_list = ['apple', 'orange']
# print the list 
second_list

['apple', 'orange']

An empty list is created by `x = []`, which is useful when you populate a list inside a loop.

In [19]:
x = [] #empty list

You can concatenate lists by using `+`. Note the data types of elements are not affected.

In [20]:
# Append second_list to the end of the first_list
first_list + second_list

[2, 7, 9, 'apple', 'orange']

You can convert other data types to list by the `list()` function, which is useful in some instances.

In [21]:
my_string = 'SBM PhD Programme'
# Create a list of characters in my_string
list(my_string)

['S',
 'B',
 'M',
 ' ',
 'P',
 'h',
 'D',
 ' ',
 'P',
 'r',
 'o',
 'g',
 'r',
 'a',
 'm',
 'm',
 'e']

Lists are sequences, which means you can index and slice them in the same way as strings.

In [22]:
x = [1, 4, "three", -2.8]
# Subscript the element in index 2
x[2]

'three'

In [23]:
# Slice the list of all elements except the last one
x[:-1]

[1, 4, 'three']

Lists are mutable, meaning you can change the values of individual elements, unlike strings and the basic numeric data types.

In [24]:
# Change the value of "three" to 3
x[2] = 3
#Print x
x

[1, 4, 3, -2.8]

### List Aliasing 

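Lists are mutable, and assignment in Python binds a name to an object rather than copying it. When you assign a variable that holds a list to another variable, both names become aliases for the same list object; no copy, shallow or deep, is made. A change made through one name is therefore visible through the other, so keep this in mind when working with lists.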

In [25]:
# What output do you expect from the following?
x = [3, 6, 8]
y = x
y[2] = -1
print(x, y)

[3, 6, -1] [3, 6, -1]


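If you want a separate list object, you can create a shallow copy with the `list()` function. A shallow copy is a new outer list, but nested objects, such as the inner list in the example, are still shared with the original.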

In [26]:
# What output do you expect from the following?
x = [3, 6, [1, 7, 2]]
y = list(x)
y[1] = -1
print(x, y)

[3, 6, [1, 7, 2]] [3, -1, [1, 7, 2]]
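The difference between a shallow copy and a fully independent copy only shows up when a nested element is modified. The sketch below uses `copy.deepcopy` from the standard library, which is not otherwise covered in this workshop; the variable names `shallow` and `deep` are introduced for illustration.

```python
import copy

x = [3, 6, [1, 7, 2]]
shallow = list(x)         # new outer list, but the inner list is shared with x
deep = copy.deepcopy(x)   # independent copy, including the inner list

shallow[2][0] = -1        # modifies the inner list that x also refers to
deep[2][1] = 100          # only affects the deep copy

print(x)        # [3, 6, [-1, 7, 2]]
print(shallow)  # [3, 6, [-1, 7, 2]]
print(deep)     # [3, 6, [1, 100, 2]]
```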


### List Methods
There are various methods of lists for inserting, removing and sorting elements, which are mutator methods, meaning that they change the list object itself. You call the method `method1` of the object x by calling `x.method1()`.
- `insert(ind, y)` inserts element `y` at index `ind` and shifts all subsequent items one position to the right
- `append(y)` adds element `y` at the end of the list
- `extend([y1, y2, y3])` appends the elements `y1`, `y2`, and `y3` at the end of the list
- `pop(ind)` removes and returns the element at index `ind`, or the last element if no index is provided
- `sort()` sorts the list in place if its elements are mutually comparable (e.g. all numbers or all strings)

In [27]:
y = [-6, 2, 'python', 3.1, [1.2, 0]]
y.insert(2, 'anaconda')
# What output do you expect?
y

[-6, 2, 'anaconda', 'python', 3.1, [1.2, 0]]

In [28]:
# Add 'jupyter' at the end of the list
y.append('jupyter')
# Add [-2, 6] at the end of the list
y.extend([-2, 6])
# Remove the first (zeroth index) element
y.pop(0)
# Print y
y

[2, 'anaconda', 'python', 3.1, [1.2, 0], 'jupyter', -2, 6]

In [29]:
list1 = [-6, 7, 2]
# Sort the elements
list1.sort()
# Print list1
print(list1)

[-6, 2, 7]


In [30]:
# What output do you expect?
list2 = [0, -2.2, 'accounting']
list2.sort()

TypeError: '<' not supported between instances of 'str' and 'float'

## TUPLE
A tuple is a fixed-length, immutable sequence of Python objects. It is very similar to list but cannot be altered. You can convert any sequence or iterable to tuple using `tuple()` (similar to `list()`). Elements can be accessed with square brackets [] as for other sequence types.

In [31]:
# Create a tuple
tup = 1, 2, 3, 'a', None, [1, 6, 2]
tup

(1, 2, 3, 'a', None, [1, 6, 2])

In [32]:
# Slice elements in indices 2 to 3 (the end index 4 is excluded)
tup[2:4]

(3, 'a')

In [33]:
# What do you expect?
tup[-1] = 0
tup

TypeError: 'tuple' object does not support item assignment

In [34]:
# What do you now expect?
tup[-1][1] = -2
tup

(1, 2, 3, 'a', None, [1, -2, 2])

Tuples can be unpacked:

In [35]:
a = (-1, 1)
x, y = a
# What output do you expect?
print(x, y)

-1 1
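Two further uses of tuples follow directly from the points above: converting another sequence with `tuple()` and swapping two variables through unpacking. This is a small sketch with made-up values.

```python
coords = tuple([4, 9])   # convert a list to a tuple
x, y = coords            # unpack the two elements into x and y
x, y = y, x              # swap the values: the right-hand side is packed into a tuple first

print(coords)  # (4, 9)
print(x, y)    # 9 4
```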


## SET
A set is an unordered collection of unique elements. They can be created by the `set` constructor or using curly brackets {}. They are very handy because they allow set operations such as union, intersection, and difference.
- Intersection of sets `a` and `b`: `a & b` or `a.intersection(b)`
- Union of sets `a` and `b`: `a | b` or `a.union(b)`
- Difference of set `a` from `b`: `a - b` or  `a.difference(b)`

In [36]:
x = set([2, 2, 2, 1, 3, 3])
y = {3, 'a', 6, 1, 'a'}
# What outputs do you expect?
print(x,y)

{1, 2, 3} {1, 'a', 3, 6}


In [37]:
a = {0, 1, 2, 3, 4}
b = set([3, 4, 5, 6, 7, 8])
# Print union of a and b
print(a | b)
# Print intersection of a and b
print(a & b)
# Print difference of a from b
print(a - b)

{0, 1, 2, 3, 4, 5, 6, 7, 8}
{3, 4}
{0, 1, 2}
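The method forms listed above give the same results as the operators, and the uniqueness property offers a quick way to remove duplicates from a list. The `visits` list in this sketch is hypothetical.

```python
a = {0, 1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8}

# Method forms are equivalent to the operators
same_union = a.union(b) == (a | b)
same_intersection = a.intersection(b) == (a & b)
same_difference = a.difference(b) == (a - b)
print(same_union, same_intersection, same_difference)  # True True True

# Removing duplicates from a list (hypothetical data)
visits = ['UK', 'Spain', 'UK', 'Italy', 'Spain']
unique_visits = sorted(set(visits))   # sorted() returns a list in a fixed order
print(unique_visits)  # ['Italy', 'Spain', 'UK']
```

Since sets are unordered, `sorted()` is used here to obtain a reproducible order for printing.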


## DICT

The `dict` structures are dictionaries of **key-value** pairs, where keys and values are Python objects. Dictionaries are critical for labelling objects; for instance, they provide the basis of indices in tabular data (Pandas), as we will soon see. You can create a dictionary by putting the `key:value` pairs, separated by commas, in curly brackets {}.
For example, let's create a dictionary of profits (say in million pounds):

In [38]:
firms = {"firm A": 2.68, "firm B": None, "firm C": 1.13}
firms

{'firm A': 2.68, 'firm B': None, 'firm C': 1.13}

You access values by key, i.e. `my_dict[key]`:

In [39]:
# What is the profit of firm A?
firms['firm A']

2.68

You can insert new key-value pairs or update existing ones by assignment: `my_dict[key]=value`

In [40]:
# Decrease the profit entry of firm A by 0.13
firms['firm A'] -= .13

# Insert new entry for firm D that has a profit of 3.0
firms['firm D'] = 3.0

# print firms dictionary
firms

{'firm A': 2.5500000000000003, 'firm B': None, 'firm C': 1.13, 'firm D': 3.0}

You can remove elements using the `pop()` method as in lists.

In [41]:
# Remove the firm B entry
firms.pop('firm B')

# Print the dictionary
firms

{'firm A': 2.5500000000000003, 'firm C': 1.13, 'firm D': 3.0}

The `keys()` and `values()` methods return the keys and the values of the dictionary.

In [42]:
# Keys
print(firms.keys())

# Values
print(firms.values())

dict_keys(['firm A', 'firm C', 'firm D'])
dict_values([2.5500000000000003, 1.13, 3.0])
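The objects returned by `keys()` and `values()` can be converted to lists or passed to functions such as `max()`. The sketch below re-creates a small dictionary with rounded, illustrative profit figures so that it runs on its own.

```python
# Illustrative profits in million pounds (rounded values, assumed for this sketch)
firms = {'firm A': 2.55, 'firm C': 1.13, 'firm D': 3.0}

names = list(firms.keys())       # ['firm A', 'firm C', 'firm D']
profits = list(firms.values())   # [2.55, 1.13, 3.0]
highest = max(firms.values())    # 3.0

print(names)
print(profits)
print(highest)

# Iterating over a dictionary yields its keys
for name in firms:
    print(name, firms[name])
```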


You can pair up sequences with `zip` function to form dictionaries, i.e. `dict(zip(seq1, seq2))`:

In [43]:
countries = ('UK', 'Spain', 'Italy', 'France') # keys
capitals = ('London', 'Madrid', 'Rome', 'Paris') # values

# Create a dict that returns capital for the country chosen
my_dict = dict(zip(countries, capitals))
# Print my_dict
my_dict

{'UK': 'London', 'Spain': 'Madrid', 'Italy': 'Rome', 'France': 'Paris'}

## Control Flow

Control flow refers to the specification of the order of execution of different blocks of code using mainly conditional statements and loops. Boolean data type and operations are critical for control flows.

The scalar Boolean data type in Python is `bool`, which has two instances `True` and `False`. 

In [44]:
type(True)

bool

### Boolean Operations
- `a & b`: AND - True if both a and b are True
- `a | b`: OR - True if either a and/or b is True
- `a ^ b`: XOR - True if any one of a or b is True, but not both
- `a == b`: True if a equals b
- `a != b`: True if a is not equal to b
- `a <= b, a < b`: True if a is less than or equal to (less than) b
- `a >= b, a > b`: True if a is greater than or equal to (greater than) b
- `a is b`: True if a and b reference the same Python object
- `a is not b`: True if a and b reference different Python objects

In [45]:
# What outputs do you expect?
print(True & False)
print(True ^ True)

False
False


In [46]:
list1 = [1, 2, 0]
list2 = [1, 2, 0]
# What output do you expect?
print(list1 == list2, list1 is list2)

True False


In [47]:
list3 = list1
# What output do you expect?
print(list2 == list3, list1 is list3, list2 is list3)

True True False
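The comparison operators from the list can be combined with `&`, `|`, and `^`. In this sketch (values made up), each comparison is wrapped in parentheses because `&`, `|`, and `^` bind more tightly than comparisons in Python.

```python
a = 3
b = 5

at_most = a <= b                       # True
different = a != b                     # True
both = (a < b) & (b < 10)              # True: both comparisons hold
either = (a > b) | (a == 3)            # True: the second comparison holds
exactly_one = (a < b) ^ (a == 3)       # False: both hold, so XOR is False

print(at_most, different, both, either, exactly_one)
```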


## Choice Statements:
Choice statements control the flow in the program depending on which condition is met.
```python
if condition1:
    statement1
elif condition2:
    statement2
elif condition3:
    statement3
else:
    statement4
```
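A concrete instance of this template might look as follows. The profit figure and the category thresholds are hypothetical; only the first branch whose condition is true is executed.

```python
profit = 1.13  # hypothetical profit in million pounds

if profit > 2:
    category = 'high'
elif profit > 1:
    category = 'medium'
elif profit > 0:
    category = 'low'
else:
    category = 'loss-making'

print(category)  # medium
```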

As an example, we will generate two standard normal variables and choose the greater value. 

In [48]:
np.random.seed(42) #set the seed for pseudo-random numbers 

# a standard normal random number 
x = np.random.standard_normal() 

# generate another standard normal random number
y = np.random.standard_normal()

# format replaces the {} in the string with the corresponding argument
print("x={0:.2f} and y={1:.2f}".format(x, y)) 

x=0.50 and y=-0.14


In [49]:
# Print the maximum of x and y
if x >= y: 
    print("Maximum is {0:.2f}".format(x))
else:
    print("Maximum is {0:.2f}".format(y))

Maximum is 0.50


## Loops 

`for` loops are used for iterating over collections, such as lists, tuples, and any iterators. 

The `range` function creates a lazy sequence of integers to iterate over. For example, `range(1, 5)` iterates over 1, 2, 3, and 4 (the end value is excluded). The advantage over a list or tuple is that the full sequence is never stored: only the value needed in the current iteration is produced, which saves a lot of memory when looping over a large range.
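A minimal sketch of a `for` loop over `range` is shown here; the variable `total` is introduced only for this illustration.

```python
total = 0
for i in range(1, 5):   # i takes the values 1, 2, 3, 4
    total += i

print(total)                 # 10
print(list(range(1, 5)))     # [1, 2, 3, 4]
print(list(range(3)))        # [0, 1, 2]: with one argument, counting starts at 0
```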

In Python, explicit `for` loops are best reserved for cases that cannot be expressed otherwise, because they are usually somewhat slower than list comprehensions and much slower than Numpy array operations (both to follow).

If you want to check time performance of an operation you can use the `%%time` and `%%timeit` magics.

Example: create a loop that calculates the square of one million randomly generated standard normal variables.

In [50]:
# Generate 1000000 standard normal variables
vals = np.random.standard_normal(1000000)

In [51]:
%%timeit
# Create an empty list
vals_squared = []
# Loop over the vals array and append to the list the square of the current element
for val in vals: 
    vals_squared.append(val ** 2)

234 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


You can do the same with **list comprehension** in shorter time:

In [52]:
%%timeit
vals_squared = [val ** 2 for val in vals]

203 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It is even faster to use Numpy methods and operations, which are directly implemented in C.

In [53]:
%%timeit
vals_squared = vals ** 2 # note that this works with Numpy arrays, not lists

1.43 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In most cases, we need to combine `for` loops with `if else` statements. You can also do this in list comprehensions:

In [54]:
# For the list below
my_list = [1, 10, 'analytics', 'business analytics', 'data', 'machine learning', 
           'statistics', 'python']

# We want to convert all strings to uppercase (common task in data cleaning) 
# and skip other types 
# Hint: isinstance(x, type) checks if x is an instance of class type
cleaned_list = [entry.upper() for entry in my_list if isinstance(entry, str)] 
cleaned_list

['ANALYTICS',
 'BUSINESS ANALYTICS',
 'DATA',
 'MACHINE LEARNING',
 'STATISTICS',
 'PYTHON']
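The comprehension above uses `if` as a filter. When every entry should be kept and only some transformed, an `if ... else` expression can be placed before the `for` instead. The sketch below, using a shortened version of the list, shows this variant next to the equivalent explicit loop.

```python
my_list = [1, 10, 'analytics', 'data']

# if ... else before `for` transforms each entry instead of filtering
mixed = [entry.upper() if isinstance(entry, str) else entry for entry in my_list]
print(mixed)  # [1, 10, 'ANALYTICS', 'DATA']

# Equivalent explicit loop
mixed_loop = []
for entry in my_list:
    if isinstance(entry, str):
        mixed_loop.append(entry.upper())
    else:
        mixed_loop.append(entry)

print(mixed_loop == mixed)  # True
```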