# Data Modeling

This notebook was created as part of a workshop on *Reproducible Research in Python*. 

- The complete workshop materials are available at: [Reproducible Research in Python](https://github.com/mickaeltemporao/reproducible-research-in-python).

**Learning Objectives:**
- Learn to create data pre-processing functions
- Learn how to train and save model objects
- Learn to load saved models and make predictions on unseen data



## Data Acquisition & Cleaning

Let's try to forecast the election based on existing polls!



In [0]:
# Installing and Importing Packages
!pip install wikipedia
import wikipedia as wp 
import pandas as pd

In [0]:
# Let's gather the Data
# For the training set we will rely on polls from the 2015 election.
# For the test set we will rely on polls from the 2019 election.
page_titles = [
    "Opinion polling for the 2015 Canadian federal election",
    "Opinion polling for the 2019 Canadian federal election",
]

html_pages = [wp.page(page).html().encode("UTF-8") for page in page_titles]
dfs = [pd.read_html(html)[0] for html in html_pages]


In [0]:
# Creating a function to rename the column names
import re

names_dict = {
    "polling_firm": "source",
    "last_dateof_polling": "date",
    "samplesize": "sample_size",
    "marginof_error": "error",
    "cons": "cpc",
    "liberal": "lpc",
    "green": "gpc",
    "polling_method": "method",
}

def fix_names(input_df, names_dict):
    """Return a copy of input_df with cleaned and renamed column names."""
    regex = "[a-z]+"

    column_names = []
    tmp_df = input_df.copy()

    # Flatten a two-level header by keeping only the top level
    if isinstance(tmp_df.columns[0], tuple):
        tmp_df.columns = [col[0] for col in tmp_df.columns]

    # Lowercase the names and replace spaces with underscores
    for c in tmp_df.columns:
        tmp = str(c).lower()
        column_names.append(tmp.replace(" ", "_"))

    # Keep only runs of letters, which drops footnote markers and punctuation
    tmp_names = ["_".join(re.findall(regex, i)) for i in column_names]
    tmp_df.columns = tmp_names

    return tmp_df.rename(columns=names_dict)
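
To see what the cleaning step does to individual headers, here is a minimal sketch that applies the same lowercase, underscore, and letters-only logic to a few strings. The helper `clean_name` and the sample headers are assumptions made for illustration; they are not taken from the scraped tables.

```python
import re


def clean_name(raw_name):
    """Lowercase a header, then keep only its runs of letters joined by underscores."""
    lowered = str(raw_name).lower().replace(" ", "_")
    return "_".join(re.findall("[a-z]+", lowered))


# Hypothetical headers, chosen to resemble scraped table headers
raw_headers = ["Polling firm", "Last dateof polling[1]", "Cons.", "Samplesize[3]"]
cleaned = [clean_name(header) for header in raw_headers]
print(cleaned)
```

Because digits are discarded along with punctuation, two headers that differ only by a number would end up with the same name, so it is worth checking the cleaned names for duplicates.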


In [0]:
# Edit the column names in both data frames
dfs = [fix_names(df, names_dict) for df in dfs]


In [0]:
# Subset the variables and merge the data into a single dataframe.
# This will help us prepare our train/test sets later on, so that both
# sets have the same form.

to_keep = [
    'source',
    'date',
    'lpc',
    'cpc',
    'ndp',
    'bq',
    'gpc',
    'method'
]

dfs = [df[to_keep] for df in dfs]
df = pd.concat(dfs)
df


In [0]:
# As we mentioned, most algorithms require the data to be in long format
parties = ["lpc", "cpc", "ndp", "bq", "gpc"]

df = pd.melt(
    df,
    id_vars=['date', 'source', 'method'],
    value_vars=parties,
    var_name='party',
    value_name='share',
)
df.source.value_counts()


This is still not the long data frame we need, because the `share` column mixes two things: the vote shares predicted by the polls and our target variable (the election outcome).
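
The following sketch reproduces the situation on a tiny table. The pollster name and all numbers are made up for illustration; only the column layout follows the notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per source, one column per party
wide = pd.DataFrame({
    "date": ["2015-10-18", "2015-10-19"],
    "source": ["Pollster A", "Election"],
    "lpc": [38.0, 40.0],
    "cpc": [31.0, 32.0],
})

# Melt to one row per (source, party) pair
long_df = pd.melt(wide, id_vars=["date", "source"], var_name="party", value_name="share")
print(long_df)

# The election outcome sits in the same column as the poll estimates
is_outcome = long_df["source"] == "Election"
print(int(is_outcome.sum()), "rows hold the outcome rather than a poll")
```

After melting, the outcome rows have to be moved into a column of their own before `share` can be used as a feature.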


In [0]:
# Add the target variable 
# We need a year to merge the target on
df['date'] = pd.to_datetime(df.date)
df['year'] = df.date.dt.year
mask = df['source'] == 'Election'
targets_df = df.loc[mask].rename(columns={'share':'target'})

df.shape


In [0]:
# We can now merge the target into the original dataframe
df = df.merge(
    targets_df[['year', 'party', 'target']], 
    how='left', 
    on=['year', 'party']
)
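
As a sketch of what this merge achieves, the toy example below attaches the outcome to every poll row of the same year and party. The data are invented, and the `validate="many_to_one"` argument is an optional safeguard added here, not something the notebook uses: it makes pandas raise an error if a year and party pair has more than one outcome row.

```python
import pandas as pd

# Hypothetical long table mixing one poll and the election outcome
polls = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015],
    "party": ["lpc", "cpc", "lpc", "cpc"],
    "source": ["Pollster A", "Pollster A", "Election", "Election"],
    "share": [38.0, 31.0, 40.0, 32.0],
})

# Pull the outcome rows out and rename their share to target
is_outcome = polls["source"] == "Election"
targets = polls.loc[is_outcome, ["year", "party", "share"]].rename(columns={"share": "target"})

# Attach the target to every row with the same year and party
merged = polls.merge(targets, how="left", on=["year", "party"], validate="many_to_one")

# Keep only the poll rows
merged = merged.loc[merged["source"] != "Election"]
print(merged)
```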



In [0]:
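# And remove the election-result rows, which are the target we are trying to forecast.
# The mask is recomputed here because the merge above rebuilt the index.
df = df.loc[df['source'] != 'Election']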
df.sample(5)


In [0]:
# Let's deal with missing values
df = df.dropna()
df.sample(5)


In [0]:
# Are the data types correct?
df.dtypes


In [0]:
# Let's look at remaining objects
df.select_dtypes(include='object')


Ok, we now have a long data set.


## Modeling Data Exploration 


Let's explore a bit more to see whether the polls actually improve as election day approaches.

If we want to explore the data further to build intuition about the model and the features we want to create, we need to focus only on the training set. This simulates a real-world situation in which we only use information that is available at the time of the prediction.



In [0]:
# We will create a mask to separate the training set from the test set.
df.set_index('date', inplace=True)
training_mask = df.year < 2019

In [0]:
# Now let's see whether the pollsters' error is related to time.
# We measure the size of the error as the absolute gap between poll and outcome.
df['error'] = abs(df.share - df.target)
df.loc[training_mask].error.resample('D').mean().plot()
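
To make the daily averaging concrete, here is a small sketch on invented numbers. It assumes, as in the notebook, a datetime index and a `year` column used to keep only training rows.

```python
import pandas as pd

# Hypothetical polls: three from the training year, one from the test year
toy = pd.DataFrame(
    {
        "share": [30.0, 34.0, 38.0, 20.0],
        "target": [40.0, 40.0, 40.0, 25.0],
        "year": [2015, 2015, 2015, 2019],
    },
    index=pd.to_datetime(["2015-10-01", "2015-10-01", "2015-10-03", "2019-10-01"]),
)

# Absolute polling error for every row
toy["error"] = (toy["share"] - toy["target"]).abs()

# Explore the training rows only, then average the error per calendar day
train_only = toy.loc[toy["year"] < 2019]
daily_error = train_only["error"].resample("D").mean()
print(daily_error)
```

Days without any poll appear as missing values in the resampled series, which shows up as gaps in a line plot.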


In [0]:
# What about the data collection method?
df.loc[training_mask].method.value_counts()


In [0]:
# Let's define a function to do some initial cleaning
def str_magic(input_series):
    """Lowercase the labels and keep only their first three characters."""
    return input_series.str.lower().str[:3]

df['method'] = str_magic(df['method'])
df['method'].value_counts()
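
The effect of this truncation can be sketched on a handful of labels. The labels below are hypothetical examples, not the actual values found in the polling tables.

```python
import pandas as pd

# Hypothetical method labels with inconsistent casing and detail
methods = pd.Series(["Telephone", "telephone/IVR", "Online", "IVR", "Online (rolling)"])

# Same transformation as str_magic: lowercase, then first three characters
short = methods.str.lower().str[:3]
print(short.value_counts())
```

Note that a mixed label such as "telephone/IVR" is folded into "tel", so this shortcut trades some detail for a small, tidy set of categories.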


In [0]:
# Let's use seaborn this time, as we now have a long dataset, and see if there
# is an observable difference between the data collection methods
import seaborn as sns
sns.violinplot(x="method", y="error",
               split=True, inner="quart",
               data=df.loc[training_mask])


## Feature Creation


Now that we have some intuition about 2015, let's add new features to the data!

In [0]:
# We need to prepare our features
# Let's add the number of days until the election
election_day_2015 = "2015-10-19"
election_day_2019 = "2019-10-21"

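def count_days(df, election_day):
    """Return the number of days between each row's date index and election_day."""
    # The dates live in the index, so move them back into a column to subtract
    output = pd.to_datetime(election_day) - df.reset_index()['date']
    # Restore the original index so the result lines up with df on assignment
    output.index = df.index
    return output.dt.days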

df.loc[training_mask, 'days'] = count_days(df.loc[training_mask], election_day_2015)
df.loc[~training_mask, 'days'] = count_days(df.loc[~training_mask], election_day_2019)
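
The date arithmetic behind this feature can be checked on a few values. This sketch uses the 2015 election day from the cell above and three made-up poll dates.

```python
import pandas as pd

# Hypothetical poll dates in the run-up to the 2015 election
poll_dates = pd.to_datetime(pd.Series(["2015-10-01", "2015-10-12", "2015-10-19"]))
election_day = pd.Timestamp("2015-10-19")

# Subtracting dates yields timedeltas; .dt.days converts them to whole days
days_left = (election_day - poll_dates).dt.days
print(days_left.tolist())
```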


In [0]:
# One-Hot Encoding
# Let's find the group with the most counts
df.loc[training_mask, 'method'].value_counts().plot(kind='barh')


In [0]:
# Let's drop the most common value ('tel'), which becomes the reference category
dummies = pd.get_dummies(df['method'])
dummies.pop('tel')
df = pd.concat([df, dummies], axis=1)
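
Here is a minimal sketch of the same idea on a toy series, where the reference category is looked up rather than typed by hand. The values are invented, and `dtype=int` is only used to print 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical cleaned method labels
methods = pd.Series(["tel", "onl", "ivr", "tel"], name="method")

# One indicator column per category
dummies = pd.get_dummies(methods, dtype=int)

# Drop the most frequent category so it acts as the baseline
reference = methods.value_counts().idxmax()
dummies = dummies.drop(columns=reference)
print(dummies)
```

Rows belonging to the reference category are the ones where every remaining indicator is 0.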


In [0]:
# Finally, separate the training set from the test set.
df_train = df.loc[training_mask].copy()
df_test = df.loc[~training_mask].copy()

# And define our model.
y_var = 'target'
X_vars = ['share', 'days', 'ivr', 'onl']

predictions = []


## Model Training

In [0]:
# Now that we have our train and test sets let's train our models

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import pickle

models = [
    LinearRegression(),
    RandomForestRegressor(),
]


In [0]:
# Fit, predict, and save your models
for i, model in enumerate(models):
    model.fit(df_train[X_vars], df_train[y_var])
    predictions.append(model.predict(df_test[X_vars]))
    # Use a context manager so the file is closed once the model is written
    with open(f"model_{i}.pkl", 'wb') as f:
        pickle.dump(model, f)

predictions[0]
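
The save and load cycle can be tested in isolation. The sketch below fits a linear regression on three invented points, writes it to a temporary directory, and reads it back; the file name `toy_model.pkl` is an arbitrary choice for this example.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]
model = LinearRegression().fit(X, y)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict like the original one
prediction = restored.predict([[3.0]])[0]
print(round(prediction, 6))
```

Two caveats apply to pickled models in general: a pickle file can execute arbitrary code when loaded, so only load files you trust, and a model is best reloaded with the same library version that saved it.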

In [0]:
# Load a saved model from disk and make a prediction
input_date = '2019-09-20'

file_name = "model_0.pkl"
with open(file_name, 'rb') as f:
    loaded_model = pickle.load(f)

# Use a new name so the list of predictions built earlier is not overwritten
model_0_preds = loaded_model.predict(df_test.loc[input_date, X_vars])
results = df_test.loc[input_date, [y_var, "party", "share"]].assign(model_0=model_0_preds)
results['abs_e_poll'] = abs(results.target - results.share)
results['abs_e_model_0'] = abs(results.target - results.model_0)


In [0]:
# Did our model beat the polls? 
print(results.loc[:,results.columns.str.contains('abs_e')].sum())
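
For reference, the comparison can be sketched on a small invented table. None of these numbers are real poll or model results; they only show how the two error columns are built and summarised.

```python
import pandas as pd

# Hypothetical outcome, poll share, and model prediction for three parties
results = pd.DataFrame({
    "party": ["lpc", "cpc", "ndp"],
    "target": [33.0, 34.0, 16.0],
    "share": [32.0, 36.0, 15.0],
    "model_0": [33.5, 35.0, 16.5],
})

# Absolute error of the raw poll and of the model
results["abs_e_poll"] = (results["target"] - results["share"]).abs()
results["abs_e_model_0"] = (results["target"] - results["model_0"]).abs()

# Total and mean absolute error for each column whose name contains abs_e
summary = results.filter(like="abs_e").agg(["sum", "mean"])
print(summary)
```

A comparison on a single date rests on very few rows, so it is better read as a sanity check than as evidence that one approach beats the other.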


In [0]:
# Bonus - Packaging
## > Let's go to your terminal!