This notebook extracts and formats four predictor variables (number of working inhabitants, number of college graduates, number of young people aged 18-24, and number of immigrants) for each Parisian district over the period 2006-2016 (the last available year). The data come from Insee's IRIS database, which collects several hundred variables at the sub-city level.

We selected four variables that we believe have a strong (potentially causal) influence on election outcomes in each district of Paris. Our assumption may be wrong, but that will be easy to see once we put the data into the model: it won't run, or it will tell us that these variables are not correlated with the outcome. The model will use these predictors to try to predict election results in each district, but we'll do that in another notebook.

Let's start with some import statements and a handy function to extract predictors:

In [1]:
%load_ext lab_black
%load_ext watermark

import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from fbprophet import Prophet
from pathlib import Path

logging.getLogger().setLevel(logging.CRITICAL)

repos = ["activite_residents", "diplomes_formation", "population", "population"]
var_codes = ["C_ACTOCC1564", "P_NSCOL15P_SUP", "P_POP1824", "P_POP_IMM"]
var_names = ["actifs_occupes", "college_grad", "youth", "immigration"]

In [2]:
def extract_predictor(repo: str, var_code: str, var_name: str) -> pd.Series:
    """
    Gets all files in the given repo, selects wanted predictor variable, 
    restricts to Paris, extracts the district numbers, aggregates predictor by district,
    and then returns formatted time series
    """
    basepath = Path(f"../../../Downloads/db_iris_all/{repo}/")
    files_in_path = basepath.glob("*.xls")
    print(f"Began extracting {var_name} predictor from {repo} repo...\n")

    # load and concat files (heavy):
    preds = pd.DataFrame()
    for file in files_in_path:
        df = pd.read_excel(
            file,
            header=5,
            sheet_name="IRIS",
            usecols=["DEP", "LIBCOM", f"{var_code[:1]}{file.stem[-2:]}{var_code[1:]}"],
            dtype={"DEP": "category", "LIBCOM": "category"},
            nrows=40_500,
        )
        df = df[df.DEP == "75"].reset_index(drop=True).drop("DEP", axis=1)
        preds = pd.concat([preds, df], axis=1)

    # drop duplicated column values:
    preds = preds.T.drop_duplicates().T
    # drop duplicated column names:
    preds = preds.loc[:, ~preds.columns.duplicated()]

    # extract district number:
    preds["LIBCOM"] = preds.LIBCOM.str.extract(r"(\d+)").astype(int)
    preds = preds.rename(columns={"LIBCOM": "arrondissement"})

    # aggregate by district and prettify columns:
    preds = preds.groupby("arrondissement").sum()
    preds.columns = preds.columns.str[1:3].astype(int) + 2000
    preds.columns.name = "year"
    preds = preds.sort_index(axis=1)
    preds = preds.stack()
    preds.name = var_name

    print(f"Finished extracting and aggregating {var_name} predictor.\n")
    return preds
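
To make the column-name logic in `extract_predictor` concrete, here is a minimal sketch on toy data. The file stems and numbers are made up for illustration (the real Insee file names may differ); only the string expressions are taken from the function above.

```python
import pandas as pd

var_code = "C_ACTOCC1564"
# Hypothetical file stems: only their last two characters (the year) matter.
stems = ["base-ic-activite-residents-2006", "base-ic-activite-residents-2007"]

# Same expression as in `usecols`: insert the two-digit year after the first letter.
columns = [f"{var_code[:1]}{stem[-2:]}{var_code[1:]}" for stem in stems]
print(columns)  # ['C06_ACTOCC1564', 'C07_ACTOCC1564']

# Toy IRIS-level rows: two IRIS zones in district 1, one in district 2.
toy = pd.DataFrame(
    {
        "arrondissement": [1, 1, 2],
        columns[0]: [10.0, 20.0, 5.0],
        columns[1]: [11.0, 21.0, 6.0],
    }
)

# Aggregate IRIS zones by district, then recover the year from each column name.
by_district = toy.groupby("arrondissement").sum()
by_district.columns = by_district.columns.str[1:3].astype(int) + 2000
by_district.columns.name = "year"

# Wide (one column per year) to long (one row per district and year).
long = by_district.sort_index(axis=1).stack()
long.name = "actifs_occupes"
print(long)
```

The result is a Series indexed by `(arrondissement, year)`, which is the shape the function returns for each predictor.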

The raw Excel files holding the data are very heavy, so this function takes some time to run, but it is worth it: it loads the files containing each predictor for each year on record, filters and formats them, and returns the corresponding time series as a pandas Series. Let's run it and go get a cup of coffee:

In [3]:
predictors = []
for r, c, n in zip(repos, var_codes, var_names):
    predictors.append(extract_predictor(r, c, n))

Began extracting actifs_occupes predictor from activite_residents repo

Finished extracting and aggregating actifs_occupes predictor

Began extracting college_grad predictor from diplomes_formation repo

Finished extracting and aggregating college_grad predictor

Began extracting youth predictor from population repo

Finished extracting and aggregating youth predictor

Began extracting immigration predictor from population repo

Finished extracting and aggregating immigration predictor



In [4]:
predictors = pd.concat(predictors, axis=1)
predictors

arrondissement,year,actifs_occupes,college_grad,youth,immigration
1,2006,9485.059228,6453.630949,1874.672223,3148.134542
1,2007,9546.148694,6731.105539,1866.646378,3227.921219
1,2008,9469.633224,6770.997547,1816.180756,3121.358408
1,2009,9665.691628,6994.804860,1842.097989,3121.406343
1,2010,9558.180760,7009.239728,1779.033595,3021.113668
...,...,...,...,...,...
20,2012,91753.677270,44244.946169,18234.641207,43045.904247
20,2013,90488.610079,63058.398038,18156.671990,42888.160363
20,2014,90469.181326,66282.764695,18133.119561,42123.803538
20,2015,90370.240523,68786.240273,17977.773858,41633.325845


Had a nice coffee? As you can see, we now have the predictors ready to match with past election results and then to feed to the model! Ready? Well, not completely... The data stop in 2016, but we will train our model on elections as recent as 2017 and test it on the 2019 European elections, so we need data for the period 2017-2019.

Unfortunately, this type of data generally takes two years to produce. This means 2019 data should be available around 2021, and we can't wait that long! Facebook's Prophet library comes in very handy here and will allow us to make some reasonable extrapolations of the predictors' values. Ideally, we should think hard about Prophet's default settings and whether they suit our use case; we could even check whether our predictors can be predicted from other, already available data.

Here, however, I'll do a quick and dirty extrapolation, sticking to Prophet's defaults. We'll see how the model handles that, and we can always do better afterwards if needed. Actually, I think it could be even more helpful to incorporate measurement error on predictors *into* the model, so that the Bayesian machinery takes it into account, so let's not spend too much time here, at least for our first iteration.

Let's turn our `year` variable into a real datetime (December 31 of each year, i.e. New Year's Eve) and write our extrapolation function:

In [3]:
predictors = predictors.reset_index().set_index("arrondissement")
predictors["year"] = pd.to_datetime(predictors.year, format="%Y") + pd.DateOffset(
    months=11, days=30
)
predictors

arrondissement,year,actifs_occupes,college_grad,youth,immigration
1,2006-12-31,9485.059228,6453.630949,1874.672223,3148.134542
1,2007-12-31,9546.148694,6731.105539,1866.646378,3227.921219
1,2008-12-31,9469.633224,6770.997547,1816.180756,3121.358408
1,2009-12-31,9665.691628,6994.804860,1842.097989,3121.406343
1,2010-12-31,9558.180760,7009.239728,1779.033595,3021.113668
...,...,...,...,...,...
20,2012-12-31,91753.677270,44244.946169,18234.641207,43045.904247
20,2013-12-31,90488.610079,63058.398038,18156.671990,42888.160363
20,2014-12-31,90469.181326,66282.764695,18133.119561,42123.803538
20,2015-12-31,90370.240523,68786.240273,17977.773858,41633.325845
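
To see why an offset of 11 months and 30 days lands on December 31, here is a minimal sketch on a few made-up years. It also compares the result with `pd.offsets.YearEnd`, an alternative the notebook does not use. The years are cast to strings before parsing, which is an assumption made here for portability across pandas versions.

```python
import pandas as pd

years = pd.Series([2006, 2007, 2016])

# Parsing with format="%Y" yields January 1 of each year...
jan_first = pd.to_datetime(years.astype(str), format="%Y")

# ...and adding 11 months then 30 days moves each date to December 31.
year_end = jan_first + pd.DateOffset(months=11, days=30)
print(year_end.dt.strftime("%Y-%m-%d").tolist())

# Alternative (not used in the notebook): roll forward to the next year end.
year_end_alt = jan_first + pd.offsets.YearEnd()
print((year_end == year_end_alt).all())
```

Anchoring observations on year-end dates keeps them aligned with the future dates requested with `freq="Y"` in the function below, since that alias denotes year end in the pandas versions contemporary with this notebook.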


In [86]:
def extrapol_pred(
    district: int, predictor: str, pred_df: pd.DataFrame, timeframe: int
) -> pd.DataFrame:
    """
    Quick and dirty extrapolation of predictor in the district, 
    for the number of years specified in timeframe variable.
    The function uses Facebook's Prophet default settings -- hence 'quick and dirty'.
    """
    df = pred_df.loc[district, ["year", predictor]].reset_index(drop=True)
    df.columns = ["ds", "y"]  # Prophet needs these names

    m = Prophet()
    m.fit(df)
    future = m.make_future_dataframe(periods=timeframe, freq="Y")
    forecast = m.predict(future)

    forecast = forecast.iloc[-timeframe:][["ds", "yhat"]]
    forecast.columns = ["year", predictor]

    forecast.index = [district] * len(forecast)
    forecast.index.name = "arrondissement"
    forecast = forecast.reset_index().set_index(["arrondissement", "year"])
    return forecast
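
Since each series has only about ten yearly points, a cheap sanity check on the Prophet output could be a straight-line extrapolation. The sketch below is not what this notebook does: `linear_extrapolation` is a hypothetical helper using NumPy's `polyfit`, and the numbers are made up.

```python
import numpy as np
import pandas as pd


def linear_extrapolation(series: pd.Series, timeframe: int) -> pd.Series:
    """Fit a straight line to a yearly series and extend it by `timeframe` years."""
    years = series.index.to_numpy(dtype=float)
    values = series.to_numpy(dtype=float)
    # Degree-1 least-squares fit: value = slope * year + intercept
    slope, intercept = np.polyfit(years, values, deg=1)
    last_year = int(series.index.max())
    future_years = np.arange(last_year + 1, last_year + 1 + timeframe)
    return pd.Series(
        slope * future_years + intercept, index=future_years, name=series.name
    )


# Toy series indexed by integer year (made-up values).
observed = pd.Series(
    [100.0, 102.0, 104.0, 106.0], index=[2013, 2014, 2015, 2016], name="toy"
)
baseline = linear_extrapolation(observed, timeframe=3)
print(baseline)
```

If Prophet's forecasts for a district diverge sharply from such a baseline, that series is probably worth a closer look before it goes into the model.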

Each pair (district, predictor) represents a time series that we extrapolate for the next three years (2017-2019). Then, we combine all that in a dataframe:

In [90]:
districts_dfs = []

for district in predictors.index.unique():
    extrapol = []
    for predictor in predictors.columns.difference(["year"]):
        extrapol.append(extrapol_pred(district, predictor, predictors, timeframe=3))
    
    print(f"Finished extrapolating all 4 predictors for district {district}\n")
    districts_dfs.append(pd.concat(extrapol, axis=1))

districts_dfs = pd.concat(districts_dfs)

Finished extrapolating all 4 predictors for district 1

Finished extrapolating all 4 predictors for district 2

Finished extrapolating all 4 predictors for district 3

Finished extrapolating all 4 predictors for district 4

Finished extrapolating all 4 predictors for district 5

Finished extrapolating all 4 predictors for district 6

Finished extrapolating all 4 predictors for district 7

Finished extrapolating all 4 predictors for district 8

Finished extrapolating all 4 predictors for district 9

Finished extrapolating all 4 predictors for district 10

Finished extrapolating all 4 predictors for district 11

Finished extrapolating all 4 predictors for district 12

Finished extrapolating all 4 predictors for district 13

Finished extrapolating all 4 predictors for district 14

Finished extrapolating all 4 predictors for district 15

Finished extrapolating all 4 predictors for district 16

Finished extrapolating all 4 predictors for district 17

Finished extrapolating all 4 predictors 

The only thing left to do is to concatenate the extrapolations with the observed data:

In [114]:
predictors = pd.concat(
    [predictors.reset_index().set_index(["arrondissement", "year"]), districts_dfs],
    sort=True,
).sort_index()
predictors.to_csv("data/predictors_by_district.csv")
predictors

arrondissement,year,actifs_occupes,college_grad,immigration,youth
1,2006-12-31,9485.059228,6453.630949,3148.134542,1874.672223
1,2007-12-31,9546.148694,6731.105539,3227.921219,1866.646378
1,2008-12-31,9469.633224,6770.997547,3121.358408,1816.180756
1,2009-12-31,9665.691628,6994.804860,3121.406343,1842.097989
1,2010-12-31,9558.180760,7009.239728,3021.113668,1779.033595
...,...,...,...,...,...
20,2015-12-31,90370.240523,68786.240273,41633.325845,17977.773858
20,2016-12-31,90874.227479,71851.173023,41180.244725,17537.759874
20,2017-12-31,90955.383757,73470.335532,40830.947447,17485.646561
20,2018-12-31,90935.955004,77881.217226,40333.333708,17380.539014
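
Note that the columns now appear in alphabetical order. A minimal sketch with made-up numbers suggests why: when the two frames list their columns in different orders, `pd.concat(..., sort=True)` sorts the column axis, and `sort_index()` then orders the rows by district and date.

```python
import pandas as pd

names = ["arrondissement", "year"]

# Observed data: columns in their original order (toy values).
observed = pd.DataFrame(
    {"youth": [10.0, 11.0], "immigration": [5.0, 6.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, pd.Timestamp("2015-12-31")), (1, pd.Timestamp("2016-12-31"))],
        names=names,
    ),
)

# Extrapolated data: same columns, but in a different order.
extrapolated = pd.DataFrame(
    {"immigration": [7.0], "youth": [12.0]},
    index=pd.MultiIndex.from_tuples([(1, pd.Timestamp("2017-12-31"))], names=names),
)

# sort=True sorts the non-concatenation axis (columns) when it is not aligned.
combined = pd.concat([observed, extrapolated], sort=True).sort_index()
print(combined)
```

In the notebook, the extrapolated frames get their column order from `predictors.columns.difference(["year"])`, which returns sorted labels, so the two inputs are not aligned and the combined frame ends up alphabetical.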


And now we're ready to match predictors against past election results, and to give data to the model! Let's do that in another notebook.

In [115]:
%watermark -a AlexAndorra -n -u -v -iv

logging 0.5.1.2
seaborn 0.9.0
numpy   1.17.3
pandas  0.25.3
AlexAndorra 
last updated: Fri Nov 22 2019 

CPython 3.7.5
IPython 7.9.0
