<a href="https://colab.research.google.com/github/mickaeltemporao/reproducible-research-in-python/blob/master/notebooks/reproducible-research-in-python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reproducible Research in Python
_Workshop at McMaster University, October 25th 2019_

This notebook was created as part of a workshop on *Reproducible Research in Python*. 
- You can access the workshop materials here: [Reproducible Research in Python](https://github.com/mickaeltemporao/reproducible-research-in-python).

## Important Note

This is a hands-on workshop. It works best if you code along with me, run into bugs, and try to understand your errors.
Learn by doing and try to avoid copy/pasting.

Feel free to ask questions at any time during the workshop.

## Prerequisites
Prior to the workshop, you need:
- [ ] An account on [Google Colab](https://colab.research.google.com/)
- [ ] An account on [GitHub](https://github.com/)

## Structure
This workshop is divided into three parts. The first part is an introduction to the [Python](https://www.python.org/) programming language, where you will learn the basics of the language and how to use built-in libraries. The second part will teach you how to acquire, explore, and transform data. In the third and last part, you will learn how to train, save, and load models from data.

## Software
- [ ] [Python 3.6.8+](https://docs.python-guide.org/starting/installation/) 

## Resources
- [The Python Package Index](https://pypi.org/)
- [Installing Python Packages](https://packaging.python.org/tutorials/installing-packages/)
- [Pandas Cheatsheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [seaborn: statistical data visualization](https://seaborn.pydata.org/examples/index.html)
- [SciKit-Learn](https://scikit-learn.org/stable/)
- [Python Regular Expressions](https://www.w3schools.com/python/python_regex.asp)

Python Packaging and Dependency Management
- [Minimal Package Structure](https://python-packaging.readthedocs.io/en/latest/minimal.html)
- [Project Templates](https://cookiecutter.readthedocs.io/en/latest/index.html)
    - [Cookiecutter PyPackage](https://github.com/audreyr/cookiecutter-pypackage)
- [Guidelines to document your code](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html)
- [Guidelines to choose a license](https://help.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository)
- [Packaging and Dependency Management with Poetry](https://poetry.eustace.io)


## License and credit
Science should be open and shared. This workshop is inspired by and built on top of other openly licensed material, so unless otherwise noted, all materials for this workshop are licensed under the Creative Commons Attribution Share Alike 4.0 International License.

The source for the materials of this course is on GitHub at [mickaeltemporao/reproducible-research-in-python](https://github.com/mickaeltemporao/reproducible-research-in-python).

## Contact
For any follow-up questions:
- twitter: [@mickaeltemporao](https://twitter.com/mickaeltemporao)
- email: mickael.temporao [at] gmail [dot] com


# Introduction

![Draw the Owl](http://www.forimpact.org/wp-content/uploads/2014/01/HowToDrawOwl.jpg)

## Agenda

- [Python Basics](#python-basics) 
- [Data Acquisition and Exploration](#data-acquisition-and-exploration)
- [Data Modeling](#data-modeling)






# Python Basics

**Learning Objective:** 
- Learn how code is executed in an Interactive Python Environment
- Get familiar with Python and some of its data types 
- Learn how to use functions, modules, and packages.


## Python 

- Open Source 
- General Purpose Programming Language
- Created by Guido van Rossum
- Interpreted
- Large Community 


## IPython Shell

- Run Python commands interactively

Magic Commands

```
- %lsmagic
- %who
- %history
- %save
- %run
- %?
```

## Hello World


In [0]:
print("Greetings!")

If you run the code below, what is the output?


In [0]:
# Press CTRL/CMD+ENTER to run this cell
print("5" + "3")

In Python, the `+` operator joins (concatenates) *strings*. Because `"5"` and `"3"` are both strings, the result is the string `53`, not the number 8.
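A minimal sketch of the difference between joining strings and adding numbers. The variable names are illustrative only.

```python
# "+" joins strings end to end (concatenation)
as_text = "5" + "3"

# Convert the strings to integers first to add them as numbers
as_number = int("5") + int("3")

print(as_text)    # 53
print(as_number)  # 8
```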

If you run the code below, what do you see?


In [0]:
print(" _____")
# This is a comment.
print("|     |")
# Here's another comment.
print("|     |")
"This is a string"
print("|_____|")


Instructions are executed sequentially.


### Hack Time

In [0]:
# Your code here.
# Print your first and last name.


## Arithmetic with Python

In its most basic form, Python can be used as a simple calculator. Consider the following arithmetic operators:

- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: **
- Modulo: % 
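As a quick illustration, here is each operator applied to two arbitrary numbers. The values and variable names are made up for this sketch.

```python
a, b = 9, 4

total = a + b        # 13
difference = a - b   # 5
product = a * b      # 36
quotient = a / b     # 2.25 (division always returns a float)
power = a ** 2       # 81
remainder = a % b    # 1 (what is left after dividing 9 by 4)

print(total, difference, product, quotient, power, remainder)
```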

### Hack Time 

In [0]:
# Your code here.
# Divide 7 by 3.

# Raise 2 to the power of 5.


## Variable assignment

Variables are names that let you store a value (e.g. 5) or an object (e.g. a function).

Python uses the symbol **"="** as the assignment operator.


In [0]:
x = 24
x

### Hack Time



In [0]:
# Your code here.
# Assign numerical values to two variables named `day_1` and `day_2`.

# Add these two variables together.

# Create a `my_total` variable containing the sum of the `day_1` and `day_2` variables.

# Print the contents of the `my_total` variable.


## Basic data types in Python

Python works with numerous data types. Some of the most basic types to get started are:

- Whole numbers like 2 are called integers (*int*).
- Decimal values like 2.5 are called floating-point numbers (*float*).
- Textual values like "orange" or 'bananas' are called strings (*str*).
- Logical values (True or False) are called booleans (*bool*).
- Lists are ordered collections that can contain values of any Python type (*list*).

In [0]:
day_1 = 20
type(day_1)


In [0]:
day_2 = 30
type(day_2)


In [0]:
description = "Vote Share"
type(description)


In [0]:
increasing = True
type(increasing)


In [0]:
data = [description, increasing, "Tuesday", day_1, "Wednesday", day_2]
type(data)

### Hack Time

In [0]:
# Your Code Here.
## Create a variable containing the average of the vote shares for day_1 and day_2

## What is the type of this new variable?


## List Manipulation

You can select, slice or edit elements in a list.

Note that Python is zero-indexed: the first element of a list is at index 0.


In [0]:
# Select an element in a list
data
data[3]


In [0]:
# Slicing lists: list[begin:end]
data[2:]


In [0]:
# Editing a list
data[0] = "Monday"
data[1] = 25.6
data


In [0]:
# Adding to a list
day_3 = ["Thursday", 40]
day_3
data = data + day_3
data
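Two details of indexing are worth a small sketch: negative indices count from the end of the list, and the end index of a slice is excluded. The `polls` list is a made-up stand-in that mirrors the shape of `data`.

```python
polls = ["Monday", 25.6, "Tuesday", 20, "Wednesday", 30]

first = polls[0]      # index 0 is the first element
last = polls[-1]      # negative indices count from the end
middle = polls[2:4]   # the end index is excluded: elements 2 and 3 only

print(first, last, middle)
```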


### Unpacking elements from lists


In [0]:
# Unpack contents of a list into multiple variables
a, b = range(2)

print("a:", a)
print("b:", b)


In [0]:
# You can use the asterisk to unpack multiple elements
a, b, *c = range(20)

print("a:", a)
print("b:", b)
print("c:", c)


## Functions and Methods
We have already used some functions (e.g. `print()`, `type()`, `range()`).

- A function is a group of related statements that performs a specific task.
- Functions help break a program into smaller, modular chunks.
- They make your code more organized and manageable.
- They avoid repetition and make code reusable.


The general form that functions take is:

```
output = function_name(input)
```



In [0]:
result = type(day_3)
result


In [0]:
# Note that help is also a function!
help(help)


In [0]:
# Alternatively IPython offers a shortcut
?print


In [0]:
data[1]


In [0]:
round(data[1])


### Defining your own functions

#### The syntax of a function
```python
def function_name(parameters):
    """A one line summary docstring of the function."""
    tmp = first_statement(s)
    output = second_statement(tmp)
    return output
```
A function definition consists of the following components:
- The keyword `def` marks the start of the function header.
- A function name uniquely identifies it.
- Parameters (optional) are the names through which we pass values (arguments) to the function.
- A colon (:) marks the end of the function header.
- An optional documentation string (docstring) describes what the function does.
- One or more valid Python statements make up the function body. Statements must share the same indentation level (usually 4 spaces).
- An optional `return` statement returns a value from the function.
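The components listed above can be seen together in a small sketch. The function `poll_lead` and its parameters are hypothetical and only serve as an illustration.

```python
def poll_lead(first_share, second_share):
    """Return the lead of the first party over the second, in points."""
    lead = first_share - second_share
    return lead

print(poll_lead(34.5, 31.0))  # 3.5
```

The header holds the keyword, name, parameters, and colon. The indented body holds the docstring, one statement, and the `return`.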


### Hack Time


In [0]:
# Your code here.
## Let's create a function that returns the mean of the items in a list.


### Methods
Methods are functions that belong to objects.

The general form that methods take is:
```python
object.method(input)
```


In [0]:
data.index("Tuesday")


In [0]:
help(data.index)


Each type of data has its own set of methods.



In [0]:
print(description)
type(description)


In [0]:
description.upper()


In [0]:
description.count("i")


You can also chain methods.


In [0]:
description.lower().count("i")


In [0]:
day_4 = ['Friday', 35]
data.extend(day_4)
data


## Modules and Packages

A module is a set of Python commands saved in a script (e.g. `script.py`).
You can load a module and access all its contents at any time using the command `import module`.

Packages are a standardized way of organizing code and usually consist of multiple modules.
- Minimal Package Structure: https://python-packaging.readthedocs.io/en/latest/minimal.html

Python comes with a standard library of pre-installed modules that you can import directly.
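There are a few common ways to write an import. The sketch below uses only standard-library modules. The alias `stats` is an arbitrary choice for this example.

```python
import math                  # import the whole module
from math import sqrt        # import a single name from a module
import statistics as stats   # import a module under a shorter alias

print(math.floor(2.7))          # 2
print(sqrt(16))                 # 4.0
print(stats.median([1, 3, 5]))  # 3
```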



In [0]:
import math
pi = math.pi
pi


### File-system interaction


In [0]:
import os
# Execute a shell command
os.system("touch test_script.py")


In [0]:
# Return the current working directory
os.getcwd()


In [0]:
# List all of the files and sub-directories in a particular folder
os.listdir()


In [0]:
# Create folders recursively
my_path = "my_tmp_project/test1/test2/test3"
os.makedirs(my_path)
os.listdir()


In [0]:
# Delete the leaf directory, then any parent directories left empty
os.removedirs(my_path)
os.listdir()


In [0]:
# Handling slashes / in file paths
file = "process.py"
folder = "Documents/project1"
full_path = os.path.join(folder, file)
full_path


In [0]:
os.rename("test_script.py", "tmp_script.py")
os.listdir()


In [0]:
# Create and write data to a file
file_path = "tmp_file.txt"
file_contents = "Hello Again,\nThis is a new Line!"

file = open(file_path, 'w') 
file.write(file_contents) 
file.close() 


In [0]:
# Using the contextual `with` statement
with open(file_path, 'w') as file: 
    file.write(file_contents) 
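The point of the `with` statement is that the file is closed automatically when the block ends, even if an error occurs inside it. A minimal sketch follows, written against a throwaway temporary directory so it leaves no files behind. The names `demo_path` and `handle` are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")

    with open(demo_path, "w") as handle:
        handle.write("Hello Again")
        open_inside = not handle.closed   # the file is open inside the block

    closed_after = handle.closed          # and closed once the block ends

print(open_inside, closed_after)  # True True
```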


In [0]:
# Reading the contents of a file
with open(file_path, "r") as file:
    read_contents = file.read()

print(read_contents)

In [0]:
## Delete a file
os.remove(file_path)
os.listdir()


In [0]:
# Get the directory and file name from a full path
file = os.path.basename(full_path)
folder = os.path.dirname(full_path)
print(file, folder)


In [0]:
# Check if a file or folder exists
os.path.exists(full_path)


In [0]:
# Get the extension of a file
name, extension = os.path.splitext(file)
print(name, extension)


### Installing packages

To install a package in Python, run the command `pip install package_name` directly in your terminal.

There are thousands of packages available, such as:
- matplotlib
- numpy
- pandas
- PyTorch
- scikit-learn
- ...

For more packages, see:
- The Python Package Index: https://pypi.org/
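Before installing, it can be handy to check whether a package is already importable. The helper below is a hypothetical sketch built on the standard-library function `importlib.util.find_spec`. Note that the import name can differ from the name used with `pip` (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("json"))                  # True: part of the standard library
print(is_installed("surely_not_a_package"))  # False, assuming no such package exists
```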


In [0]:
# The `!` prefix lets IPython pass a command directly to the terminal.
!pip install wikipedia


# Data Acquisition and Exploration
**Learning Objective:** 
- Get familiar with common data exploration libraries
- Learn to acquire and clean data
- Learn to explore and visualize 



## Acquiring Data
With some Python basics in hand, we will start combining existing packages to acquire and explore data.


In [0]:
# Load the required libraries
import pandas as pd
import wikipedia as wp
import matplotlib
matplotlib.rcParams['figure.figsize'] = [10, 5]


In [0]:
# Identify the Wikipedia page from which to acquire the data
page_title = "Opinion polling for the 2019 Canadian federal election"


In [0]:
# Get the HTML source
html = wp.page(page_title).html().encode("UTF-8")


In [0]:
# Extract tables and convert the html tables into pd.DataFrame()
df = pd.read_html(html)[0].iloc[2:,:]


## Cleaning Data


In [0]:
# Inspect the data
df.head()


In [0]:
# We notice that there seems to be a double header
df.columns


In [0]:
# What is the type of the columns?
type(df.columns)


In [0]:
# Let's use a loop to extract and clean the first level of each MultiIndex column name
columnn_names = []
for c in df.columns:
    tmp = c[0].lower()
    columnn_names.append(tmp.replace(" ", "_"))

columnn_names


In [0]:
# Let's use regular expressions in a list comprehension this time
import re
regex = "[a-z]+"
columnn_names = ["_".join(re.findall(regex, i)) for i in columnn_names]
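To see what the two cleaning steps do, here is a standalone sketch on made-up header strings (the real Wikipedia headers may differ). `re.findall("[a-z]+", ...)` keeps only runs of lowercase letters, so digits and brackets from footnote markers are dropped.

```python
import re

raw_names = ["Polling firm", "Last dateof polling[1]", "Samplesize[3]"]  # made-up headers

clean_names = []
for name in raw_names:
    lowered = name.lower().replace(" ", "_")
    # Keep only runs of lowercase letters and join them with underscores
    clean_names.append("_".join(re.findall("[a-z]+", lowered)))

print(clean_names)
# ['polling_firm', 'last_dateof_polling', 'samplesize']
```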


In [0]:
# Let's edit the columns of our dataset
df.columns = columnn_names
df.head()


In [0]:
# Let's further rename those columns
names_dict = {
    "polling_firm": "source",
    "last_dateof_polling": "date",
    "samplesize": "sample_size",
    "marginof_error": "error",
    "cons": "cpc",
    "liberal": "lpc",
    "green": "gpc",
    "polling_method": "method",
}

type(names_dict)


In [0]:
# Pass the new dictionary as an argument to the .rename method
df.rename(columns=names_dict, inplace=True)
df.head()


In [0]:
# Let's check the data types
df.dtypes


In [0]:
# The date field needs to be converted
df['date'] = pd.to_datetime(df['date'])
df.head()


In [0]:
# We should also only keep the numeric values for the margins of error
regex = r"(\d+\.*\d*)"
df['error'] = df['error'].str.extract(regex, expand=False)


In [0]:
# Let's look again at our dataset
df.head()


In [0]:
# What if we look at a random subsample
df.sample(5)


In [0]:
# Let's clean the sample size
regex = r"\(.*\)"
df['sample_size'] = df['sample_size'].str.replace(regex, "", regex=True)
df['sample_size'] = df['sample_size'].str.replace(" |,", "", regex=True)
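The same two substitutions can be tried on plain strings with the standard `re` module. The sample values below are invented for illustration and are not taken from the dataset.

```python
import re

raw_sizes = ["1,503 (1/3)", "2 117", "900"]  # made-up values

clean_sizes = []
for value in raw_sizes:
    value = re.sub(r"\(.*\)", "", value)  # drop anything in parentheses
    value = re.sub(r" |,", "", value)     # drop spaces and thousands separators
    clean_sizes.append(int(value))

print(clean_sizes)  # [1503, 2117, 900]
```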


In [0]:
# How does the data look now?
df.sample(5)


In [0]:
# What about the data types?
df.info()


In [0]:
# Which of these variables are still objects?
df.select_dtypes(include='object')


In [0]:
# Let's use a dictionary to recode the data types
convert_dict = {
    'error': float,
    'sample_size': int,
    'lead': float
}

df = df.astype(convert_dict)


In [0]:
# Let's look once again at our data
df.sample(5)


In [0]:
# What are the remaining objects?
df.select_dtypes(include='object')


In [0]:
# Keep only necessary variables by creating a variable filter
to_keep = [
    'source',
    'date',
    'lpc',
    'cpc',
    'ndp',
    'bq',
    'gpc',
    'ppc',
    'method'
]

df = df[to_keep]


## Data IO


In [0]:
# Save the cleaned dataframe to a file
file_name = "national_polls_2019.csv"
df.to_csv(file_name, index=False)
print(df)

df.dtypes


In [0]:
# Read the data back in from the saved CSV file.

# More info on read_csv
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
df = pd.read_csv("national_polls_2019.csv", parse_dates=['date'])
df.dtypes


## Data Exploration and Visualization


In [0]:
# Let's convert this into a time-series dataframe
df.set_index('date', inplace=True)


In [0]:
# Sort the time series chronologically (ascending dates)
df = df.sort_values(by=['date', 'source'])


In [0]:
# How does the data look now?
df.head()


In [0]:
# What about the tail?
df.tail()


In [0]:
# A time indexed data frame provides much more control over the data
df.loc[df.index > '2019-10-15']


In [0]:
# We can look at a single party
df.lpc.loc['2019-10-20']


In [0]:
# We can focus on a subset of columns
parties = ["lpc", "cpc", "ndp", "bq", "gpc", "ppc"]
df.loc[:, parties]


In [0]:
# We can aggregate/resample the data
df[parties].resample('D').mean().head()


In [0]:
# We can also use pandas to plot
df[parties].resample('D').mean().plot()
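A minimal sketch of daily resampling on toy data, written with the method form `.resample('D').mean()` (the `how=` argument is not available in recent pandas versions). The `toy` frame and its values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"lpc": [30.0, 32.0, 34.0]},
    index=pd.to_datetime(["2019-10-01", "2019-10-01", "2019-10-02"]),
)

# Two polls on October 1st are averaged into a single daily value
daily = toy.resample("D").mean()
print(daily)
```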



### Anatomy of a Figure
![Anatomy of a Figure](https://matplotlib.org/3.1.1/_images/anatomy.png)


In [0]:
# We can look at the distributions for each party
df[parties].plot(kind='kde')


In [0]:
# Or do a simple box-plot
df[parties].boxplot()


In [0]:
# Let's look at missing values
df.isnull().mean()


In [0]:
# We can remove missing values
df.dropna()


In [0]:
# We just lost half of our dataset...
# Maybe we should forward-fill the missing values (at most 3 in a row)
tmp_df = df.ffill(limit=3)
tmp_df.isnull().mean()

df = tmp_df
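Forward filling copies the last known value into the gaps that follow it, and `limit` caps how many consecutive gaps are filled. A toy sketch with invented numbers:

```python
import pandas as pd

toy = pd.Series([30.0, None, None, None, None, 35.0])

# Only the first 3 consecutive missing values are filled with 30.0
filled = toy.ffill(limit=3)

print(filled.isnull().sum())  # 1
```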


In [0]:
# Let's investigate which polling firms have been most active
df.source.value_counts()


In [0]:
# Remove the firms that released fewer than 5 polls
tmp_mask = df.source.value_counts() >= 5
mask = tmp_mask.index[tmp_mask]

df = df[df.source.isin(mask)]


In [0]:
# Once again we could decide to visualize directly the result
df.source.value_counts().plot(kind='barh')


In [0]:
# Let's do grouped operations and see how each of these firms portrayed the Liberal Party
df.groupby('source').lpc.describe().sort_values(by='mean')


In [0]:
# We can also look at the means for all the parties
df.groupby('source')[parties].mean().sort_values('lpc')


In [0]:
# We can also apply custom functions by groups
z_score = lambda x: (x-x.mean()) / x.std()
df.reset_index().groupby('source')[parties].apply(z_score).head()


In [0]:
# Most algorithms need you to shape the data in a long format
long_df = pd.melt(
    df.reset_index(),
    id_vars=['date', 'source'],
    value_vars=parties,
    var_name='party',
    value_name='share',
)

long_df.head()


In [0]:
# Seaborn, a statistical data visualization library uses long-format
import seaborn as sns
sns.set(style="whitegrid", palette="muted")

sns.swarmplot(
    x="party",
    y="share",
    hue="source",
    data=long_df,
)


In [0]:
# What if we need to add the polling method back?
new_df = long_df.merge(
    df[['method', 'source']].reset_index(),
    on=['date', 'source']
)

new_df.head()


In [0]:
# We can also expand the dataframe back to a wide format
new_df = new_df.pivot_table(
    index=['date', 'source', 'method'],
    columns='party',
    values='share',
)

new_df.head()
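The wide-to-long and long-to-wide round trip is easier to follow on a tiny table. The sketch below uses invented firms and shares. `pd.melt` stacks the party columns into rows, and `pivot_table` spreads them back into columns.

```python
import pandas as pd

wide = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-01", "2019-10-02"]),
    "source": ["Firm A", "Firm B"],
    "lpc": [31.0, 33.0],
    "cpc": [34.0, 32.0],
})

# Wide to long: one row per (date, source, party)
long = pd.melt(wide, id_vars=["date", "source"], value_vars=["lpc", "cpc"],
               var_name="party", value_name="share")

# Long back to wide: one column per party
back = long.pivot_table(index=["date", "source"], columns="party", values="share")

print(long.shape)  # (4, 4)
print(back.shape)  # (2, 2)
```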

# Data Modeling
**Learning Objective:** 
- Learn to create data pre-processing functions
- Learn how to train and save model objects
- Learn to load and make predictions on unseen data



Let's try to forecast the election based on existing polls!

## Preparing the training set

In [0]:
# For the training part we will rely on polls from the 2015 election.
title_train = "Opinion polling for the 2015 Canadian federal election"
html = wp.page(title_train).html().encode("UTF-8")
df_train = pd.read_html(html)[0]


In [0]:
# Cleaning the training set.
import re


In [0]:
# A function to fix the column names
def fix_names(input_df, names_dict):
    """Renames the columns in the input dataframe."""
    regex = "[a-z]+"

    columnn_names = []

    tmp_df = input_df.copy()

    for c in tmp_df.columns:
        tmp = c.lower()
        columnn_names.append(tmp.replace(" ", "_"))

    tmp_names = ["_".join(re.findall(regex, i)) for i in columnn_names]
    tmp_df.columns = tmp_names

    return tmp_df.rename(columns=names_dict)


In [0]:
# Let's edit them...
df_train = fix_names(df_train, names_dict)
df_train.columns


In [0]:
# The 2015 table has no 'ppc' column, so selecting it would raise a KeyError.
# Remember lists also have useful methods: drop 'ppc' from the filter first.
to_keep.remove('ppc')


In [0]:
# Let's keep relevant variables only. What does the training set look like?
df_train = df_train[to_keep]
df_train.head()


In [0]:
# Let's store and remove the election results
results_2015 = df_train.iloc[1]
df_train = df_train.drop(1).dropna()


In [0]:
# Let's deal with missing values
df_train.dropna(inplace=True)


In [0]:
# What about the data types?
df_train.select_dtypes(include='object')


In [0]:
# Let's fix that date variable
df_train['date'] = pd.to_datetime(df_train.date)
df_train.sample(3)


In [0]:
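# As we mentioned, most algorithms require the data to be in long-format
# Note: list.remove() returns None, so build the 2015 party list separately
parties_2015 = [p for p in parties if p != 'ppc']

df_train = pd.melt(
    df_train.reset_index(),
    id_vars=['date', 'source', 'method'],
    value_vars=parties_2015,
    var_name='party',
    value_name='share',
)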

df_train.head()


Let's do some more exploration and see whether polls actually improve as we get closer to election day.


In [0]:
# We need to merge the outcome of the election back
targets = (
    results_2015
    .transpose()
    .iloc[2:-1]
    .reset_index()
)

targets.columns = ['party', 'outcome']
targets['outcome'] = targets.outcome.astype('float')

df_train = df_train.merge(targets)
df_train.head()


In [0]:
# Does time have an impact on the error of pollsters?
df_train['error'] = abs(df_train.share - df_train.outcome)
df_train.set_index('date', inplace=True)
df_train.error.resample('D').mean().plot()


In [0]:
# What about the data collection method?
df_train.method.value_counts()


In [0]:
# Let's use some regex to do an initial cleaning
regex = r"\(.*\)|/| |rolling"
df_train['method'] = df_train.method.str.replace(regex, "", regex=True)
df_train['method'].value_counts()


In [0]:
# Let's group these even further
df_train['method'] = df_train.method.str.lower().str[:3]
df_train['method'].value_counts()


In [0]:
# Let's use seaborn this time, as we now have a long dataset, and see if there is an observable difference between the data collection methods
sns.violinplot(x="method", y="error",
               split=True, inner="quart",
               data=df_train)


## Preparing the test set


In [0]:
# Now that we have some intuition about 2015,
# we need to prepare our test set and verify it has the same form as the train set.
df_test = new_df.stack()
df_test.name = 'share'
df_test = df_test.reset_index().set_index('date')

data_2019 = {
    "party": ["lpc", "cpc", "bq", "ndp", "gpc"],
    "outcome": [33.1, 34.4, 7.7, 15.9, 6.5],
}

df_test = df_test.reset_index().merge(pd.DataFrame(data_2019)).set_index('date')
df_test['error'] = abs(df_test.share - df_test.outcome)
all(df_test.columns == df_train.columns)


In [0]:
# Let's create a function to clean the method string!
def str_magic(input_series):
    """Clean a series of polling-method labels and keep the first 3 letters."""
    regex = r"\(.*\)|/| |rolling"
    tmp = input_series.copy()
    tmp = tmp.str.replace(regex, "", regex=True)
    return tmp.str.lower().str[:3]

df_test['method'] = str_magic(df_test['method'])


## Feature Creation


In [0]:
# We need to prepare our features
election_day_2015 = "2015-10-19"
election_day_2019 = "2019-10-21"

def add_days(df, election_day):
    """Add a `days` column: the number of days between each poll and election day."""
    delta = pd.to_datetime(election_day) - df.reset_index()['date']
    delta.index = df.index
    df['days'] = delta.dt.days
    return df

df_train = add_days(df_train, election_day_2015)
df_test = add_days(df_test, election_day_2019)
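The `days` feature is simply the gap between election day and each poll date. A toy sketch with two made-up poll dates:

```python
import pandas as pd

poll_dates = pd.Series(pd.to_datetime(["2019-10-20", "2019-09-21"]))

# Subtracting dates gives timedeltas; `.dt.days` turns them into whole days
days_left = (pd.to_datetime("2019-10-21") - poll_dates).dt.days

print(days_left.tolist())  # [1, 30]
```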



In [0]:
# One-Hot Encoding
# Let's find the group with the most counts
df_train.method.value_counts().plot(kind='barh')


In [0]:
# Let's drop the most common value
train_dummies = pd.get_dummies(df_train['method'])
train_dummies.pop('tel')
df_train = pd.concat([df_train, train_dummies], axis=1)

test_dummies = pd.get_dummies(df_test['method'])
test_dummies.pop('tel')
df_test = pd.concat([df_test, test_dummies], axis=1)

y_var = 'outcome'
X_vars = ['share', 'days', 'ivr', 'onl']

predictions = []
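One-hot encoding turns each category into its own indicator column. Dropping one column (here `tel`) makes it the reference category, which avoids redundant information in a linear model. A toy sketch with invented labels:

```python
import pandas as pd

methods = pd.Series(["tel", "onl", "ivr", "tel"], name="method")

dummies = pd.get_dummies(methods)
dummies.pop("tel")  # "tel" becomes the reference category

print(list(dummies.columns))  # ['ivr', 'onl']
```

Rows that were `tel` now have no indicator set in either remaining column.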


In [0]:
# Now that we have our train and test sets let's train our models

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import pickle

models = [
    LinearRegression(),
    RandomForestRegressor(),
]


In [0]:
# Fit, predict, and save your models
for i, model in enumerate(models):
    model.fit(df_train[X_vars], df_train[y_var])
    predictions.append(model.predict(df_test[X_vars]))
    # The `with` block closes the file once the model is written
    with open(f"model_{i}.pkl", 'wb') as file:
        pickle.dump(model, file)

predictions[0]

In [0]:
# Load a saved model from disk and make a prediction
input_date = '2019-09-20'

file_name = "model_0.pkl"
with open(file_name, 'rb') as file:
    loaded_model = pickle.load(file)

predictions = loaded_model.predict(df_test.loc[input_date,X_vars])
results = df_test.loc[input_date, [y_var] + ["party", "share"]].assign(model_0=predictions)
results['abs_e_poll'] = abs(results.outcome - results.share)
results['abs_e_model_0'] = abs(results.outcome - results.model_0)
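Saving and loading with `pickle` works the same way for any Python object. The sketch below round-trips a plain dictionary that stands in for a fitted model. The names and values are made up, and a temporary directory keeps the example from leaving files behind.

```python
import os
import pickle
import tempfile

toy_model = {"intercept": 1.5, "coef": [0.9, -0.01]}  # stand-in for a fitted model

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Save: binary write mode
    with open(model_path, "wb") as handle:
        pickle.dump(toy_model, handle)

    # Load: binary read mode
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_model)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.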


In [0]:
# Did our model beat the polls? 
print(results.loc[:,results.columns.str.contains('abs_e')].sum())


In [0]:
# Bonus - Packaging
## > Let's go to your terminal!