[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/guvendemirel/QMULSBM_PhDWorkshop/blob/master/SBM_PhD_python_ws_part3.ipynb)

# QMUL SBM PhD Python Workshop - Part 3

In this session, we will continue with functions and with the Numpy and Pandas packages for numerical operations and for working with tabular data.

## Functions
Functions are crucial building blocks of programming: they eliminate redundancy and provide abstraction. If you repeat code that solves a non-trivial task in different parts of your program, it is a good indication that you should introduce a function. Organizing your code into functions makes it easier to understand and easier to maintain. We have so far used several built-in functions, e.g. `print()`, and functions from other packages such as Numpy, e.g. `np.sqrt()`. A function is called by providing its arguments, which you can look up with `help()`.

We can also define our own functions. Let's now define a simple function. **Positional arguments** must always be supplied when the function is called. **Keyword arguments** have default values, which are used if no value is provided. The function definition syntax is as follows:

```python
def function_name(arg1, arg2 = val2):
    code
    return value
```
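As a minimal sketch of this syntax, the hypothetical function `describe_course` below (not part of the workshop material) takes one positional argument and one keyword argument with a default value:

```python
def describe_course(name, level="PhD"):
    """Return a short label; `level` falls back to its default if omitted."""
    return f"{name} ({level})"

# Only the positional argument is given, so the default level is used
label_default = describe_course("Finance")

# The keyword argument overrides the default
label_keyword = describe_course("Finance", level="MSc")

print(label_default, label_keyword)
```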

### Namespaces
The local namespace (variable scope) is created when the function is called, and the arguments are automatically loaded into it. You can read variables from the global namespace inside the function. However, you cannot change their values by plain assignment: if you try, a new local variable with the same name is created instead. You must use the `global` keyword if you want to update the value of a global variable.
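The difference between assigning to a local name and updating a global one can be sketched as follows. The names `counter`, `assign_local`, and `increment_global` are made up for illustration:

```python
counter = 0  # variable in the global namespace

def assign_local():
    counter = 100   # creates a new local variable; the global one is untouched
    return counter

def increment_global():
    global counter  # refer to the global variable instead of creating a local one
    counter += 1

local_result = assign_local()
after_local = counter      # value of the global after the local assignment
increment_global()
after_global = counter     # value of the global after the global update

print(local_result, after_local, after_global)
```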

As an example, let's write a function that cleans the names of subject areas.

In [None]:
import re #regular expressions library for handling text

courses = [' accounting ', 'finance ', 'Marketing ', 'supply chain management#', '?interna92tional BuSiness1\n']

# Create a list of characters to remove
remove_chars = '[_!#?*0-9]'

def clean_name(name):
    cl_name = __.strip() #remove the beginning and ending white spaces and new lines
    cl_name = re.sub(remove_chars, '', __) #remove irrelevant characters
    cl_name = __.title() #first letter capital 
    __

# Write a list comprehension that applies clean_name to each name in the courses list
cleaned_courses = __
cleaned_courses

Final check about variable scope:

In [None]:
# What happens if you try accessing cl_name outside the function (in the global scope), 
# where it is not in the namespace?
cl_name

Since everything in Python is an object, arguments are passed by object reference, in contrast to passing by value, which is common in many other languages. The behaviour depends on whether the argument is of a mutable or an immutable data type, similar to the behaviour of variable assignment that we have seen before. The following exercises illustrate variable scopes and the behaviour for mutable and immutable arguments.
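A small sketch of the mutable versus immutable distinction, using made-up names (`rebind`, `mutate`) rather than the workshop's own functions:

```python
def rebind(text):
    # Strings are immutable: this binds the local name to a new object,
    # so the caller's string is unaffected
    text = text.upper()
    return text

def mutate(items):
    # Lists are mutable: append changes the object the caller also refers to
    items.append("new")

word = "finance"
shouted = rebind(word)

modules = ["accounting"]
mutate(modules)

print(word, shouted, modules)
```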

In [None]:
# Try the following code - Version 1
def clean_name(name):
    name = name.strip() 
    name = re.sub(remove_chars, '', name) 
    name = name.title()
    return name

name1 = "?businesS analytics#"
name2 = clean_name(__)

# What output do you expect and why?
print(name1, name2)

In [None]:
# Try the following code - Version 2
def clean_name_alt(name):
    name1 = name.strip() 
    name1 = re.sub(remove_chars, '', name1) 
    name1 = name1.title() #first letter capital 
    return name1

name1 = "something else"
name2 = "?businesS analytics#"
name3 = clean_name_alt(name2)

# What output do you expect and why?
print(name1, name2, name3)

In [None]:
# Try the following code - Version 3
def clean_name_alt(name):
    # set the scope of the variable to global
    __ 
    name1 = name.strip() 
    name1 = re.sub(remove_chars, '', name1) 
    name1 = name1.title() #first letter capital 
    return __

name1 = "something else"
name2 = "?businesS analytics#"
name3 = clean_name_alt(name2)

# What output do you expect and why?
print(name1, name2, name3)

In [None]:
# Check whether name1 and name3 are the same object
__

In [None]:
# Working with list arguments

# Global scope:
courses = [' accounting ', 'finance ', 'Marketing ', 'supply chain management#', '?interna92tional BuSiness1\n']
no_courses = 0

def clean_names(name_list):
    # set the scope to global
    __ no_courses
    # loop over the list by both index and value
    for __, __ in __:
        # increment no_courses
        __
        # call clean_name function and update the list
        __        

# clean the courses list
__

# What output do you expect and why?
print(courses, no_courses)

## Numpy Package
NumPy is the main package for numerical computations in Python. It is used together with the Pandas package for data analysis. Numpy is highly efficient even for large amounts of data because its methods are implemented in C. NumPy provides methods, operations, and functions that can be applied to whole arrays without the need for loops or list comprehensions.

### Numpy Arrays
The core data structure of Numpy is `ndarray` (N-dimensional array). Arrays are homogeneous, meaning that all elements have the same data type (mostly numeric or logical), unlike lists, which can be heterogeneous. With arrays, you can apply operations as if they were scalars.
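To illustrate the contrast with lists, here is a short sketch with made-up values. Note that `*` repeats a list but multiplies an array element by element:

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
prices_arr = np.array(prices)

doubled_list = [2 * p for p in prices]  # lists need an explicit loop or comprehension
doubled_arr = 2 * prices_arr            # arrays: the operation applies to every element
repeated = 2 * prices                   # for a list, * means repetition, not arithmetic

print(doubled_list, doubled_arr, repeated)
```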

In [None]:
# Import the numpy package
__

# Set the seed for random numbers
__

# Create my_array 2 x 3 array of standard normal variables
my_array = __.__.__((2, 3)) 
my_array

All mathematical operators are applied element-wise:

In [None]:
# Multiply all entries of my_array by 10
10 * my_array

In [None]:
# Subtract 0.2*my_array from my_array
__

In [None]:
# Divide 10 by each entry of my_array
__

In [None]:
# Raise 0.2 to the power given by each entry of my_array
__

In [None]:
# Take the fourth power of each entry of my_array
__

You can access the shape of the array by the `shape` attribute:

In [None]:
# How many rows and columns does my_array have?
__

You can create arrays from other collections by using the `np.array()` function.

In [None]:
my_list = [3, 6, -2]
# Create my_array from my_list
my_array = __
# Change the values of both my_list and my_array
my_list *= 4  
my_array *= 2

# What results do you expect?
print(my_list, my_array)

In [None]:
my_list = [3, 6, -2]
# create a list that contains 2 * each value of my_list elements
__

You can create two-dimensional arrays from lists with equal length:

In [None]:
data1 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
# Create 2D array from data1
arr1 = __
# print arr1
__

A commonly used Numpy array method is `reshape((n1, n2))`, which reshapes the array to a 2D array with n1 rows and n2 columns (it also works with higher-dimensional shapes). Another useful method is `ravel()`, which flattens the array to one dimension.
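A minimal sketch of both methods on a small made-up array:

```python
import numpy as np

values = np.arange(6)           # array([0, 1, 2, 3, 4, 5])
grid = values.reshape((2, 3))   # 2 rows, 3 columns, filled row by row
flat = grid.ravel()             # back to one dimension

print(grid)
print(flat)
```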

In [None]:
# range(1, 13) -> 1, 2, 3, ..., 12
# Numpy's arange is similar to the range iterator but returns an array 
arr2 = __
# Print the array
arr2

In [None]:
# Reshape the array to 3x4 two-dimensional array
arr2.__

In [None]:
# Reshape the array to 2 rows and assign to arr3
arr3 = __

# Flatten arr3
__

You can compare two numerical arrays to form a Boolean array.

Create two standard normally distributed 2d (2x4) arrays and check whether the entry in the first array is greater than or equal to the second array entry.

In [None]:
arr1 = __
arr2 = __
# create the boolean array that contains whether arr1 value is greater than or equal to arr2 value
__

## Indexing and Slicing Numpy Arrays
Indexing and slicing are done in a very similar way to lists and tuples. If the array has two or more dimensions, you provide an index (or slice) for each dimension.

In [None]:
arr1 = np.array([3, 6, -2, 7, 9])
# The element in index 1
print(__)
# Slice that starts with the index 2 up until the end
print(__)

With Numpy arrays, you can pass multiple indices as a list:

In [None]:
# Subscript elements at index 0, 2, and 3
arr1[__]

Working with 2D arrays:

In [None]:
arr2 = np.arange(12).reshape(4,3)
arr2

In [None]:
# Subscript to the element in row 1 and column 2
__

In [None]:
# Dice to the sub-matrix from row 2 to the last row and columns 0 and 2 (not 1)
__

You can then assign values to these slices. If you assign a scalar, its value is repeated for each entry.

In [None]:
# For the dice above, assign the values to 100
__
# Print arr2


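You can also index by using logical conditions: a Boolean array of the same shape selects the elements for which the condition is `True`. This works for Numpy arrays but not for plain lists and tuples, which do not accept a Boolean mask as an index.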
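Indexing with a logical condition can be sketched on a small made-up array as follows:

```python
import numpy as np

scores = np.array([4.0, 9.5, 7.0, 2.5])

mask = scores >= 7     # Boolean array, one entry per element
high = scores[mask]    # copy of the elements where the mask is True
scores[mask] = 0.0     # assignment through the mask changes scores in place

print(mask, high, scores)
```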

In [None]:
# Data type conversion needed for the next replacement (remember homogeneous types)
arr2 = arr2.astype(np.float64)

# Identify the elements of arr2 which are greater than or equal to 7
__

# Set the values >= 7 to normal random numbers with mean 10, std dev =2
arr2[__] = __(10, 2, np.sum(arr2 >= 7))
arr2

## Numpy functions
Numpy provides a wide range of functions and methods that can be applied efficiently on arrays.  

**Unary functions**:
They are all element-wise transformations.
- `abs`: Compute the absolute value
- `sqrt`: Compute the square root
- `exp`: Compute the exponential
- `log`,`log10`: Natural logarithm, log base 10
- `sign`: Compute the sign
- `rint`: Round to the nearest integer
- `isnan`: Return boolean array indicating whether each value is NaN (Not a Number)
- `cos`, `cosh`, `sin`, `sinh`, `tan`, `tanh`: Regular and hyperbolic trigonometric functions

**Binary functions:**
They take two arrays and return a single array as the result. 
- `maximum`: Element-wise maximum
- `minimum`: Element-wise minimum
- `mod`: Element-wise modulus
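A few of the functions listed above, applied to made-up arrays, as a quick sketch:

```python
import numpy as np

a = np.array([-4.0, 1.0, 9.0])
b = np.array([2.0, 5.0, 3.0])

abs_a = np.abs(a)           # unary: absolute value of each element
root = np.sqrt(abs_a)       # unary: square root of each element
larger = np.maximum(a, b)   # binary: element-wise maximum of the two arrays
remainder = np.mod(a, b)    # binary: element-wise remainder of a divided by b

print(abs_a, root, larger, remainder)
```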

In [None]:
# Create an array of 0, 1, ...,9
arr1 = __
# Exponentiate that array elementwise
print(__)

Create two standard normal 2d (2x4) arrays and choose the maximum of the two arrays for each element

In [None]:
arr1 = __
arr2 = __
# Create arr3, which is the element-wise maximum of the two
arr3 = __
# Print arr1, arr2, arr3

In [None]:
arr = np.arange(12).reshape(4,3)
arr

In [None]:
# Overall mean, sum, standard deviation
arr.__, arr.__, arr.__

In [None]:
# Mean across the rows
arr.__

In [None]:
# Maximum across the columns
arr.__  

## Working with Tabular Data: Pandas
We will now learn the basics of Pandas, the Python package for working with tabular data, and start working on data pre-processing.

Suppose that you have just started working as an analyst at a film production company and your job involves analysing market trends in the film industry. As a data source on individual films and audience preferences, you start your analysis by downloading the datasets from IMDB https://www.imdb.com/interfaces/. Your first objective is to clean the data in the individual datasets and to form an integrated dataset of movies produced in the past two decades, containing Title, Genre, Year, Runtime (Minutes), IMDB Rating, and Number of Votes, by merging two datasets.

## Organising your workspace

To work with data files, we must first ensure that they are in our working directory. Below, we change the working directory to the folder on our local drive that contains the data files. For this, we use the `os` package, which provides functions for interacting with the operating system. The function `os.getcwd()` returns the current working directory, while `os.chdir()` changes the working directory. The `os.path` module contains functions to work with path names in a way that is robust across operating systems (Windows, macOS, and Linux).

In [None]:
#import os
import os

# Print your current working directory
print(os.getcwd())

# Assign your home directory to the variable HOME. The expanduser function replaces ~ with the home directory
HOME = os.path.expanduser('~')

# Locate the folder in which you saved the data and create a path by joining them
# In my case, my HOME is "C:\Users\Guven" and my files are under 
# C:\Users\Guven\Documents\PhD workshop\
PROJECT_DIR = os.path.join(HOME, 'Documents', 'PhD workshop')

# Change to the folder that contains the dataset
os.chdir(PROJECT_DIR)

# Print your current working directory
print(os.getcwd())

If you are working on Google Colab, do the following after you copy the files into the designated folder on your Google Drive:

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')

os.chdir(os.path.join(os.getcwd(), 'drive', 'My Drive', 'Colab Notebooks'))
os.listdir()

## Reading data with Pandas

The first step of data analysis is reading data from files. The Pandas library provides several functions for reading data from different types of files, including comma- or tab-separated files. We first read the data from the `title.basics.tsv.gz` file, which includes the following data fields:  
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title.

We now read the data with the Pandas `read_csv` function, which takes the file path and `sep` (the separator, which is `\t` for tab because our data is tab-separated). This returns a `dataframe` object, which holds the observations (films) in the rows and the features in the columns. Each row is identified by its index value, very much like the keys of a Python dictionary.
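To see what `read_csv` does with tab-separated input without downloading anything, here is a sketch that reads a tiny in-memory table. The rows and identifiers are made up and only mimic the layout of the real file; `io.StringIO` stands in for a file path:

```python
import io
import pandas as pd

# Made-up rows in a tab-separated layout (not real IMDB data)
tsv_text = (
    "tconst\tprimaryTitle\tstartYear\n"
    "xx001\tFilm A\t2001\n"
    "xx002\tFilm B\t\\N\n"
)

sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_df)
print(sample_df.shape)   # (number of rows, number of columns)
```

Because the placeholder `\N` is read as ordinary text, the `startYear` column here holds strings rather than numbers.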

In [None]:
# Import pandas library (convention: as pd)
__

# Read the data from title_basics into the dataframe movies_df
__ = __(__, sep=__, low_memory=False)

## Data preprocessing

We shall now explore the data set. For this, we can use the `head()` and `tail()` methods of the dataframe, which display the top or bottom rows, respectively.

In [None]:
# Display the top 10 rows of movies_df
__

The column `tconst` is the unique identifier for the record. Hence, we would like to set it as the index with the `set_index()` method. By default, Pandas methods do not mutate the original dataframe but return a new dataframe. To overwrite the original dataframe, you should pass the argument `inplace=True`. Passing the argument `drop=True` drops the column used for setting the index.
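The difference between the default behaviour and `inplace=True` can be sketched on a small made-up dataframe (the identifiers are hypothetical):

```python
import pandas as pd

films = pd.DataFrame({"tconst": ["xx001", "xx002"],
                      "primaryTitle": ["Film A", "Film B"]})

# Default: a new dataframe is returned and the original is left as it was
indexed = films.set_index("tconst", drop=True)
unchanged = "tconst" in films.columns

# With inplace=True the original dataframe itself is modified
films.set_index("tconst", drop=True, inplace=True)

print(indexed.index.name, unchanged, list(films.columns))
```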

In [None]:
movies_df.__(__, drop=__, __)

# print the tail (last 5)
__

The `info` method provides key information on the data frame, including names and types of variables, data shape (number of rows and columns), and memory. As you can see below, there are 8333396 titles recorded on IMDB.

In [None]:
# Show info about the dataframe
movies_df.__()

We can now move to data cleaning. First, let's remove duplicate entries, if any, by the `drop_duplicates` method.

In [None]:
# What is the initial number of rows?
# Hint: you can use the shape attribute as in Numpy arrays
N1 = movies_df.__

# Drop the duplicates in-place
__

# Print the number of rows that have been dropped
__

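We can select one of the columns by using square brackets `[]`, which keeps the association with the index and returns a Series. Depending on the Pandas version and settings, this Series may be a view of part of the original dataframe, in which case any changes made to it also apply to the original dataframe. Do not rely on this behaviour: assign through the dataframe itself when you intend to modify it.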

In [None]:
# Return a view of the "titleType" series
__

Not all of the columns are relevant. Especially if you are working with big data sets, it is beneficial to drop the irrelevant variables directly. You need to identify these variables based on your research questions and initial exploration of the data set. You can use the `drop` method with the argument `axis = 1` to drop the columns, for which you pass the names. You can obtain the columns by using the `columns` attribute of the dataframe.
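As a sketch of `drop` with `axis=1` and the `columns` attribute, on a made-up dataframe with the same column names:

```python
import pandas as pd

films = pd.DataFrame({"primaryTitle": ["Film A", "Film B"],
                      "isAdult": [0, 0],
                      "endYear": ["\\N", "\\N"]})

# axis=1 means the labels refer to columns; a new dataframe is returned
trimmed = films.drop(["isAdult", "endYear"], axis=1)
remaining = list(trimmed.columns)

print(remaining)
```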

In [None]:
# Drop the columns ["originalTitle", "isAdult", "endYear"] 
__

# Check the remaining columns by the columns attribute
__

### Missing Values

We need to treat string and numerical variables separately and handle the missing values properly. As an example, we shall first check how many numeric entries the `runtimeMinutes` column has. For this, you can call the `pandas.Series.str` functions (`isnumeric` checks whether each entry consists of numeric characters).

In [None]:
# Print the number of entries 
print(__)
    
# Print the number of numeric entries of the runtimeMinutes column
print(__)

The missing values in this data set are encoded as `\N`, which we replace with the missing value indicator `np.nan` so that they are properly handled by Pandas. We use the `replace()` method to replace a given value with a desired value, in our case `\\N` with `np.nan`. 

Note that all variables are currently held as `object` type, which is used for strings and for columns with mixed data types. As we will see, Pandas cannot infer the correct data types in this case because the missing values are recorded as strings. We want to ultimately cast to the following data types:
- 'titleType': 'string'
- 'primaryTitle': 'string'
- 'startYear': 'Int64'
- 'runtimeMinutes': 'float64'
- 'genres': 'string'
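The replacement step can be sketched on a made-up dataframe as follows. In Python source, the placeholder is written `'\\N'` because the backslash has to be escaped:

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({"startYear": ["2001", "\\N"],
                      "genres": ["Drama", "\\N"]})

# Every cell that is exactly the placeholder becomes a missing value
cleaned = films.replace("\\N", np.nan)
missing_per_column = cleaned.isna().sum()

print(cleaned)
print(missing_per_column)
```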

In [None]:
# Replace the missing value place holder \\N with np.nan (missing value indicator)
movies_df.__('\\N', __, __)

We shall now check whether the numeric variables truly hold numeric entries. 

In [None]:
# Print the number of rows in the dataframe and the number of rows with numeric entries
# for the 'startYear' column
print(__, __, __)

# for the 'runtimeMinutes' column
print(__, __, __)

Since we see a mismatch in the 'runtimeMinutes' column, let's inspect it and find the source of the problem. For this, we select the rows where the entry is not numeric by passing a Boolean array. If you have a dataframe `X`, you can select the rows where the column `c1` satisfies a certain condition (say, is negative) with `X[X['c1'] < 0]`. 
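A sketch of this kind of row selection on made-up data, where one cell holds a stray text value. The cast to the `string` type is an assumption of this sketch: it makes `isnumeric` return a nullable Boolean Series, so missing entries can be filled with `False` before negating:

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({"runtimeMinutes": ["90", "Comedy", np.nan, "45"]})

as_text = films["runtimeMinutes"].astype("string")
is_numeric = as_text.str.isnumeric().fillna(False)  # missing counts as not numeric

# Rows that hold a value, but not a numeric one
non_numeric_rows = films[~is_numeric & films["runtimeMinutes"].notna()]

print(non_numeric_rows)
```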

In [None]:
# print the rows of the dataframe in which `runtimeMinutes` is not numeric
movies_df[__]

In [None]:
# Replace non-numeric runtimeMinutes with missing values
movies_df.loc[__,__] = np.nan

We are now ready to correct the data types by using the `astype()` method, to which we pass a dictionary of data types. Note that we cast 'startYear' to a float type first and then to a nullable integer type, because the 'object' type can be converted to float but not directly to a nullable integer when there are missing values. The code below uses the narrower 'float32' and 'Int16' types, which are sufficient for these columns and use less memory than 'float64' and 'Int64'.
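The two-step cast can be sketched on a made-up Series of years stored as text. The sketch uses 'float64' and 'Int64'; the same pattern applies to narrower types:

```python
import numpy as np
import pandas as pd

years = pd.Series(["2001", np.nan, "1999"], dtype="object")

as_float = years.astype("float64")   # text to float; the missing entry stays NaN
as_int = as_float.astype("Int64")    # nullable integer; NaN becomes <NA>

print(as_int)
```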

In [None]:
# Column data types
column_types = {'titleType': 'string', 'primaryTitle': 'string', 
                'startYear': 'float32', 'runtimeMinutes': 'float32', 
                'genres': 'string'}

# Convert all data types using the dictionary
movies_df = movies_df.__(__) 

# Correct the data type for startYear
movies_df['startYear'] = movies_df['startYear'].__('Int16')

# Check the variables and data types
movies_df.__

We shall now inspect missing values. You can use the `isna()` method to obtain a data frame which holds a value `True` for cells where the data is missing.

In [None]:
# Find how many missing values there are for each variable
__

We have only a handful of missing values in the primary title. We will exclude the rows without a title. For this, we use the `dropna()` method. You can restrict which columns are checked for missing values by specifying the `subset` argument.

In [None]:
# Drop rows if they do not have a title
movies_df.__

Let's drop rows if both `runtimeMinutes` and `genres` are missing. For this, you can use the `how='all'` argument of the `dropna()` method.
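Both uses of `dropna()` can be sketched on a small made-up dataframe:

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({
    "primaryTitle": ["Film A", np.nan, "Film C", "Film D"],
    "runtimeMinutes": [90.0, 80.0, np.nan, np.nan],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
})

# Drop a row only if its title is missing (row 1)
with_title = films.dropna(subset=["primaryTitle"])

# Drop a row only if runtimeMinutes AND genres are both missing (row 2);
# row 3 is kept because only one of the two is missing
kept = with_title.dropna(subset=["runtimeMinutes", "genres"], how="all")

print(kept)
```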

In [None]:
# If both runtimeMinutes and genres are missing, drop that row
movies_df.__(__=['runtimeMinutes', 'genres'], __, __)

# Create a new data_frame by selecting only movies (titleType = 'movie') that were produced after 2000
movies_selected_df = __

We now drop the 'titleType' field, which is always 'movie' at this point and hence not needed, rename the variables for convenience, and sort according to a given variable. The `rename()` method takes a dictionary as the `columns` argument in the format {var1_old_name: var1_new_name, var2_old_name: var2_new_name}. We then sort the data in descending order of year by using the `sort_values()` method with the keyword argument `ascending=False`.
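A sketch of `rename()` and `sort_values()` on made-up rows. With `ascending=False`, the largest year comes first:

```python
import pandas as pd

films = pd.DataFrame({"primaryTitle": ["Film A", "Film B", "Film C"],
                      "startYear": [2005, 2019, 2011]})

renamed = films.rename(columns={"primaryTitle": "movie", "startYear": "year"})
newest_first = renamed.sort_values("year", ascending=False)

print(newest_first)
```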

In [None]:
# Drop the column "titleType"
movies_selected_df.__(__, axis = __, inplace = True)

# Rename the variables
movies_selected_df.__(columns = {'primaryTitle': 'movie', 
                            'startYear': 'year', 
                            'runtimeMinutes': 'minutes'}, 
                      inplace = True)

# sort the dataframe wrt year in-place
movies_selected_df.__ 

# show the head of the dataframe
movies_selected_df.__

We can now impute the missing values of `minutes` with the median of the column. For this, we can use the `fillna()` method with the median as the positional argument.
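Median imputation can be sketched on a made-up Series. Note that `median()` skips missing values by default:

```python
import numpy as np
import pandas as pd

minutes = pd.Series([90.0, np.nan, 100.0, 120.0])

median_minutes = minutes.median()         # computed from the non-missing values
imputed = minutes.fillna(median_minutes)  # returns a new Series

print(median_minutes, imputed.tolist())
```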

In [None]:
# Impute any missing values in the minutes column by its median
movies_selected_df['minutes'] = __

# show the head of the dataframe
movies_selected_df.head()

### DataFrame Merging
We shall now merge the user ratings data with the movie dataframe. For this, we shall first read the `title.ratings.tsv.gz` dataset into a dataframe as we did for the first data set. Here, we will directly specify the index column, which is `tconst` as for the movie dataframe. This is done by passing the argument `index_col = 'tconst'`. We also set the data types by passing a dictionary that maps column names to data types as the `dtype` argument.   

In [None]:
# Read the dataset title.ratings.tsv.gz to a dataframe and set the index and the data types
ratings_df = __("title.ratings.tsv.gz", __ = "tconst", 
                         __ = {'averageRating': 'float64', 'numVotes': 'Int64'}, 
                         sep = '\t')

#Show top rows
ratings_df.head()

As you can see, it associates the same identifier `tconst` with the `averageRating` and `numVotes` variables.

The Pandas `merge` method allows merging a dataframe with another dataframe. Here, we use the index `tconst` for matching the two dataframes, by setting the arguments `left_index = True` and `right_index = True`. This means that if two rows in the two datasets have the same index value, they belong to the same entity (movie). The argument `how = 'inner'` specifies that only rows whose index value appears in both the left and the right dataframe are kept. If there are non-matching indices in either, those rows are excluded.

In [None]:
# Inner-Merge the ratings_df with movies_selected_df (key should exist in both)
final_movies_df = movies_selected_df.__(__, how = __, left_index = __, right_index = __)

# Show the head of the data frame
final_movies_df.head()

In [None]:
# Check the number of entries
__

## EDA

We shall now look at the descriptive statistics for numeric variables by using the `describe` method:

In [None]:
# Check descriptive statistics
final_movies_df.__

There seem to be some movies with a very low number of votes. Hence, we shall keep only those movies with at least 10000 votes. We shall then have a quick look at the top 10 movies with the highest `averageRating`.

In [None]:
# Choose only those movies with at least 10000 votes
final_movies_df = __

# Display the top 10 movies in terms of averageRating (if equal, more votes first)
final_movies_df.__(__, ascending = __).__

We look at the correlation between numerical variables by using the `corr()` method.

In [None]:
# Check the correlation coefficients 
final_movies_df.__

There are some positive correlations between the number of votes, average rating, and the minutes. However, the correlation coefficient is a linear measure of association and it does not imply causation. We shall look at scatterplots. 

Matplotlib is the main visualisation package in Python and standard plots are directly implemented as methods of dataframes. Hence, you can call the `plot` method directly from a dataframe object. Here, we specify the plot type by `kind = 'scatter'`. We provide the x and y axes and the title of the plot.

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Scatter plot between numVotes and averageRating
__.plot(kind = __, x =__, y = __, title =__, figsize=(10,8))

We can see that the variation in averageRating decreases with numVotes, as expected since $SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}$ if $n$ individual viewer ratings are independent (plausible) and randomly sampled. 
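The standard error formula can be checked with a small simulation. This is only a sketch under stated assumptions: individual ratings are drawn independently from a normal distribution with a made-up mean of 7 and standard deviation of 2, which is not estimated from the IMDB data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0        # assumed standard deviation of individual ratings
n_movies = 2000    # number of simulated movies per group

spread = {}
for n_votes in (10, 1000):
    # Each row is one simulated movie with n_votes independent ratings
    ratings = rng.normal(loc=7.0, scale=sigma, size=(n_movies, n_votes))
    # Spread of the per-movie average ratings
    spread[n_votes] = ratings.mean(axis=1).std()
    # Compare the simulated spread with sigma / sqrt(n)
    print(n_votes, spread[n_votes], sigma / np.sqrt(n_votes))
```

The simulated spread of the averages should be close to $\frac{\sigma}{\sqrt{n}}$ in both groups, and much smaller for the group with more votes.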

We shall now plot and inspect boxplots to see how ratings evolve over time. We can directly plot a boxplot using the `boxplot` method, specifying the column for which to plot the boxplot. The optional argument `by = 'year'` specifies that we want to plot the boxplots separately for different values of the variable `year`.

In [None]:
# Create a box plot of averageRating by year
final_movies_df.__(column = __, by = __, rot = 90, figsize=(10,8))

The ratings look pretty stable over time.

We shall finally look at the association between the genre and the movie rating. We must consider that each movie can belong to multiple genres. We first start by identifying the different genres. For this, we can use the `str` methods that we learned before. I copy some code below for this. Work on this on your own at home for practice.

In [None]:
# The cat method concatenates all entries in a column using the specified separator (here ',')
genres = final_movies_df["genres"].str.cat(sep = ',') 

# This returns a long str of individual genres. To obtain a set of genres, we first
# split from ',' to a list and then remove duplicates by the set() constructor
genres = set(genres.split(','))

# The following calculates the average of ratings for each genre in the list genres.
# final_movies_df["genres"].str.contains(genre) checks if each cell contains the genre being iterated
# hence this will return a slice for the correct genre, for which we then extract
# the "averageRating" column and take its mean
avg_ratings = {genre: final_movies_df[final_movies_df["genres"].str.contains(genre)]["averageRating"].mean() 
               for genre in genres}
avg_ratings