This notebook follows `munic_model_prod.ipynb`, where we ran the model. Here, we'll extract and format the out-of-sample data, then push them through the model to get predictions for the 2020 Paris city-council elections. We'll analyze and plot the results in the notebook `munic_model_analysis.ipynb`. You don't have to understand the model to read this notebook, but if you're curious about it, please go ahead and read the other notebook!

So, as usual, let's start by importing the necessary packages and defining handy helper functions:

In [1]:
import arviz as az
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pymc3 as pm
import xarray as xr

from bokeh.palettes import brewer
from scipy.special import softmax
from typing import List

In [2]:
az.style.use("arviz-darkgrid")
SPAN_POLLS = 5
ALPHA_POLLS = 2 / (SPAN_POLLS + 1)
CANDIDATES = {
    "Simonnet": "farleft",
    "Hidalgo": "left",
    "Belliard": "green",
    "Buzyn": "center",
    "Griveaux": "center",
    "Dati": "right",
    "Federbusch": "farright",
}
MONTHS = {"janvier": 1, "février": 2, "mars": 3}
PARTIES = ["farleft", "left", "green", "center", "right", "farright", "other"]
Nparties = len(PARTIES) - 1
PARTIES_AGG = [
    "farleft_agg",
    "left_agg",
    "green_agg",
    "center_agg",
    "right_agg",
    "farright_agg",
]
RIGHT_POLLSTER = {
    "Harris Interactive": "Harris",
    "Ifop-Fiducial": "Ifop",
    "Ipsos-Sopra Steria": "Ipsos",
    "Ipsos-Sopra Steria[87]": "Ipsos",
    "Ipsos- Sopra [90]Steria": "Ipsos"
}
BINS = np.array([15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0])
COLORS = {
    "farleft": np.array(brewer["Reds"][7][::-1]),
    "left": np.array(brewer["PuRd"][7][::-1]),
    "green": np.array(brewer["Greens"][7][::-1]),
    "center": np.array(brewer["Oranges"][7][::-1]),
    "right": np.array(brewer["Blues"][7][::-1]),
    "farright": np.array(brewer["Purples"][7][::-1]),
    "other": np.array(brewer["Greys"][7][::-1]),
}

In [3]:
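def compute_analyt_weights(df: pd.DataFrame) -> pd.DataFrame:
    """
    Attach pollster ratings to each poll and compute its analytical weight,
    i.e. log(sample size) times the pollster's rating for each party.
    Pollsters without a rating get the median rating.
    """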
    pollster_ratings = pd.read_csv("../data/polls_1st_round/pollsters_weights.csv")
    df = pd.merge(
        df, pollster_ratings, how="left", left_on="pollster", right_on="sondage"
    )

    for p in PARTIES[:-1]:
        df[f"weightsondeur_{p}"].fillna(
            pollster_ratings[f"weightsondeur_{p}"].median(), inplace=True
        )
        df[f"analyt_weights_{p}"] = np.log(df.samplesize) * df[f"weightsondeur_{p}"]

    return df.set_index("date").sort_index()


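def agg_polls(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each date, average all polls up to that date, weighting them by
    recency (exponential decay) times their analytical weight.
    The aggregate sample size only uses the recency weights.
    """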
    unique_dates = sorted(set(df.index))

    for p in PARTIES[:-1]:
        for i, d_outer in enumerate(unique_dates):
            for j, d_inner in enumerate(unique_dates[: i + 1]):
                df.loc[d_inner, f"expon_weights_{p}"] = (1 - ALPHA_POLLS) ** (i - j)

            df[f"final_weights_{p}"] = (
                df[f"analyt_weights_{p}"] * df[f"expon_weights_{p}"]
            )
            final_weights = df.loc[:d_outer, f"final_weights_{p}"]
            vote_share = df.loc[:d_outer, f"{p}"]

            df.loc[d_outer, f"{p}_agg"] = np.average(vote_share, weights=final_weights)

            # compute aggregate sample size only once:
            if p == "right":
                # same weights, whatever the party:
                expon_weights = df.loc[:d_outer, "expon_weights_right"]
                sample_size = df.loc[:d_outer, "samplesize"]
                df.loc[d_outer, "samplesize_agg"] = round(
                    np.average(sample_size, weights=expon_weights)
                )

    return df.reset_index()[
        ["date", "samplesize_agg"] + [f"{p}_agg" for p in PARTIES[:-1]]
    ]

Now, let's load our MCMC samples and data -- nothing really thrilling here:

In [32]:
# posterior samples
trace_prod = az.from_netcdf("trace_prod_bis.nc")
post = trace_prod.posterior
trace_prod

In [5]:
# test data
unemp = pd.read_excel(
    "../data/predictors/chomage-zone-demploi-2003-2019.xls",
    header=5,
    sheet_name="txcho_ze",
)
unemp = unemp[unemp["LIBZE2010"] == "Paris"].iloc[:, 4:].T
unemp.columns = ["unemployment"]
unemp.index = pd.period_range(start=unemp.index[0], periods=len(unemp), freq="Q")
unemp

Unnamed: 0,unemployment
2003Q1,8.4
2003Q2,8.7
2003Q3,8.6
2003Q4,9.0
2004Q1,9.2
...,...
2018Q3,7.7
2018Q4,7.4
2019Q1,7.4
2019Q2,7.2


In [6]:
# training data
d = pd.read_csv("../data/whole_formatted_data.csv", index_col=0)
district_id, districts = d.arrondissement.factorize(sort=True)
Ndistricts = len(districts)
type_id, types = d.type.factorize(sort=True)

Here is something more fun: we're going to scrape the most recent polls from the appropriate [Wikipedia page](https://fr.wikipedia.org/wiki/%C3%89lections_municipales_de_2020_%C3%A0_Paris). Our model is trained on polls and unemployment data from previous elections, so to get predictions for the coming city-council elections (March 15th, 2020), we need the latest [unemployment figures in Paris](https://www.insee.fr/fr/statistiques/1893230) (most recent are for Q3 2019) and the latest polls. Here is how the scraping of polls goes:

In [7]:
raw_polls = pd.read_html(
    "https://fr.wikipedia.org/wiki/%C3%89lections_municipales_de_2020_%C3%A0_Paris",
    attrs={"class": "wikitable centre"},
    match="Date de réalisation",
    decimal=",",
    thousands=" ",
    na_values=["—", "?"],
)[0]
raw_polls.columns = raw_polls.columns.droplevel([0, 2])
raw_polls = raw_polls[
    ~raw_polls.Source.str.contains("candidature|annonce|renonce|retire|command", regex=True)
].drop(["Gantzer puis aucun chef de file", "Villani", "Bournazel", "Campion", "Berkani", "Autres"], axis=1)

# clean polls' characteristics:
raw_polls = raw_polls.rename(
    columns={
        "Source": "pollster",
        "Date de réalisation": "date",
        "Échantillon": "samplesize",
    }
)
raw_polls["pollster"] = raw_polls.pollster.replace(RIGHT_POLLSTER)

# drop any poll whose sample size is nan, then cast as int
raw_polls = raw_polls.dropna(subset=["samplesize"])
raw_polls["samplesize"] = raw_polls["samplesize"].str.split().str.join("").astype(int)

# take last field date:
field = raw_polls["date"].str.split(" au ", expand=True)[1].str.split(expand=True)
field.columns = ["day", "month"]
field["month"] = field["month"].replace(MONTHS)
field["year"] = 2020
raw_polls["date"] = pd.to_datetime(field[["day", "month", "year"]])

# clean candidates' values:
raw_polls[list(CANDIDATES.keys())] = raw_polls[CANDIDATES.keys()].astype(float)
raw_polls["Buzyn"] = raw_polls[["Buzyn", "Griveaux"]].fillna(0).sum(axis=1)
raw_polls = (
    raw_polls.drop("Griveaux", axis=1)
    .rename(columns=CANDIDATES)
    .sort_values("date")
    .dropna()
    .reset_index(drop=True)
)
raw_polls.to_csv("oos_data/raw_polls_2020.csv")
raw_polls

Unnamed: 0,pollster,date,samplesize,farleft,left,green,center,right,farright
0,Ifop,2020-01-17,955,5.0,25.0,14.0,15.0,19.0,5.0
1,Ifop,2020-01-17,955,5.0,25.0,14.0,16.0,17.0,5.0
2,Odoxa,2020-01-20,879,8.0,24.0,13.0,16.0,18.0,5.0
3,Odoxa,2020-01-23,916,4.0,23.0,14.5,16.0,20.0,6.0
4,Ipsos,2020-02-19,1000,5.0,24.0,13.0,19.0,20.0,4.0
5,Harris,2020-02-19,1092,6.0,23.0,13.0,17.0,23.0,5.0
6,Odoxa,2020-02-19,809,7.0,23.0,14.0,17.0,25.0,4.0
7,Ifop,2020-02-21,976,6.0,24.0,12.0,19.0,22.0,3.5
8,Ifop,2020-02-28,946,5.0,24.0,11.0,20.0,25.0,3.5
9,Elabe,2020-02-28,1001,5.0,24.0,9.5,18.5,25.0,4.0


Pretty nice, uh? Yeah, pandas is awesome! Now, we're going to aggregate those polls by recency, sample size and historical performance of the pollster. This last weight is based on [pollster ratings I computed](https://www.pollsposition.com/indicateurs/pollster_ratings) and updated with last year's polls (2019 European elections) -- I didn't have time to open-source this analysis yet, but hopefully one day I will! I'm not going to detail everything in writing, but you can see how it's done in the code of the two functions we defined at the beginning -- `compute_analyt_weights` and `agg_polls`. The code below also transforms the polls from their natural habitat ($[0, 1]$) to the real line ($(-\infty, +\infty)$). This is for technical reasons that I'm not going to detail here -- if you're curious, I explain everything in the notebook of the model.
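
To make the weighting a bit more concrete, here is a small sketch of the idea for a single party. The three polls and the pollster ratings are invented, and the age of a poll is counted in polls here, whereas `agg_polls` counts it in distinct field dates:

```python
import numpy as np

SPAN_POLLS = 5
ALPHA_POLLS = 2 / (SPAN_POLLS + 1)  # same smoothing factor as in the notebook

# toy polls for one party, oldest first (made-up numbers)
vote_share = np.array([15.0, 16.0, 19.0])
sample_size = np.array([955, 879, 1000])
pollster_rating = np.array([1.0, 0.8, 1.2])  # hypothetical ratings

# recency: the latest poll has age 0 and weight 1, older polls decay geometrically
age = np.arange(len(vote_share))[::-1]
expon_weights = (1 - ALPHA_POLLS) ** age

# sample size and pollster performance
analyt_weights = np.log(sample_size) * pollster_rating

final_weights = analyt_weights * expon_weights
agg = np.average(vote_share, weights=final_weights)
print(expon_weights, agg)
```

Because the most recent poll gets the largest weight, the aggregate sits closer to it than a plain mean would.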

In [7]:
raw_polls = pd.read_csv("oos_data/raw_polls_2020.csv", index_col=0)
oos_polls = compute_analyt_weights(raw_polls)
oos_polls = agg_polls(oos_polls).drop_duplicates()

# revert the softmax:
oos_polls[PARTIES_AGG] = oos_polls[PARTIES_AGG].div(100).apply(np.log) + 1
oos_polls.round(2)

Unnamed: 0,date,samplesize_agg,farleft_agg,left_agg,green_agg,center_agg,right_agg,farright_agg
0,2020-01-17,955.0,-2.0,-0.39,-0.97,-0.86,-0.71,-2.0
2,2020-01-20,922.0,-1.83,-0.4,-0.99,-0.85,-0.71,-2.0
3,2020-01-23,920.0,-1.94,-0.43,-0.97,-0.84,-0.67,-1.92
4,2020-02-19,950.0,-1.87,-0.45,-1.01,-0.75,-0.55,-2.07
7,2020-02-21,956.0,-1.85,-0.44,-1.03,-0.73,-0.54,-2.12
8,2020-02-28,963.0,-1.91,-0.44,-1.14,-0.7,-0.47,-2.18
10,2020-03-02,1001.0,-1.94,-0.43,-1.16,-0.71,-0.45,-2.19
11,2020-03-06,1043.0,-2.03,-0.42,-1.14,-0.69,-0.43,-2.2
13,2020-03-09,1021.0,-2.05,-0.41,-1.16,-0.68,-0.44,-2.21
14,2020-03-10,1042.0,-2.03,-0.39,-1.17,-0.68,-0.44,-2.23
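
One way to see why a log can undo a softmax: the softmax is unchanged when a constant is added to all of its inputs, so `softmax(log(p) + 1)` gives back `p` whenever `p` sums to one. The sketch below checks this on made-up shares for seven categories. It is only an illustration of that property: in this notebook only the six named parties are transformed and the "other" category is handled by the model, so the model notebook remains the reference for the actual reasoning.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax for a 1-d array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


# made-up vote shares in percent, summing to 100
shares = np.array([5.0, 24.0, 11.0, 20.0, 25.0, 3.5, 11.5])

latent = np.log(shares / 100) + 1  # same transform as in the cell above
recovered = softmax(latent) * 100  # the "+ 1" cancels out in the softmax
print(recovered)
```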


Ok, we're ready to make out-of-sample predictions! This is going to look weird and esoteric, but I'm actually just taking the parameters for each (district, party) pair and pushing them through the model to get the posterior probabilities of each party, in each district -- this will make more sense when we get to the visualization part.

The interesting part though is that we're gonna use ArviZ's `InferenceData` capabilities instead of raw numpy arrays. That way, we won't have to take care of shape broadcasting, adding new axes, removing axes, and similar issues that usually arise. And when I say "we", it's actually Oriol Abril-Pla who helped me tremendously to code this part. So, let's sincerely thank Oriol and [check out his great blog](https://oriolabril.github.io/oriol_unraveled/) ;)
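
As a rough illustration of what named dimensions buy us, here is a toy comparison between xarray and raw numpy. The shapes and variable names are invented for the example and are much smaller than the real posterior:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# toy posterior: 2 chains, 50 draws, 3 districts, 4 parties (random values)
beta = xr.DataArray(
    rng.normal(size=(2, 50, 3, 4)), dims=("chain", "draw", "district", "party")
)
district_effect = xr.DataArray(
    rng.normal(size=(2, 50, 3)), dims=("chain", "draw", "district")
)
polls = xr.DataArray(rng.normal(size=4), dims=("party",))

# xarray matches dimensions by name, so no reshaping is needed
mu = beta + district_effect + polls

# the numpy equivalent needs a manually inserted axis for `district_effect`
mu_np = beta.values + district_effect.values[..., None] + polls.values
print(dict(mu.sizes))
```

Forgetting the `[..., None]` in the numpy version raises a shape error here, but with other shapes it can silently broadcast along the wrong axis, which is the kind of issue named dimensions avoid.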

Let's begin by defining some variables and helper functions, to make our code more readable:

In [8]:
# standardize latest polls:
last_polls = (oos_polls[PARTIES_AGG].iloc[-1] - d[PARTIES_AGG].stack().mean()) / d[PARTIES_AGG].stack().std()
last_polls.values

array([-0.02300336,  0.37182675,  0.18305373,  0.30088352,  0.35941297,
       -0.08042145])

In [9]:
# standardize latest unemployment:
last_unemployment = (
    (np.log(unemp.iloc[-1]) - np.log(d["unemployment"]).mean())
    / np.log(d["unemployment"]).std()
).iloc[0]
last_unemployment

-1.0281829564033451

In [10]:
# extract incumbency matrix
center_results = pd.get_dummies(d.loc[d.date == "2019-05-25", "winner"])["center"].values
incumb_mat = np.c_[np.zeros(center_results.size), center_results, np.zeros(center_results.size)].astype(int)
incumb_mat

array([[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [33]:
trace_prod.add_groups(predictions_constant_data=
    xr.Dataset(
        {
            "incumbent_indicator": (["district", "incumbents"], incumb_mat),
            "stdz_polls": (["party"], last_polls.values),
            "log_unemployment": ([], last_unemployment),
        },
        coords={
            "district": post.district.values,
            "incumbents": trace_prod.constant_data.incumbents.values,
            "party": trace_prod.constant_data.party.values,
        },
    )
)
const = trace_prod.predictions_constant_data
trace_prod

In [12]:
def prepend_strings(base: str, strings: List[str]) -> List[str]:
    """Prefix each item with `base` and an underscore, e.g. `poll_error_left`"""
    return [f"{base}_{item}" for item in strings]

def ds_to_da_list(ds: xr.Dataset, var_names: List[str]) -> List[xr.DataArray]:
    """Turn part of an xarray data set into a list of xarray data arrays"""
    return [ds[name] for name in var_names]

In [34]:
# step 1: reorganize variables to avoid loops
# handle the party loops
post["poll_error"] = xr.concat(ds_to_da_list(post, prepend_strings("poll_error", post.party.values)), "party")
post["type_effect"] = xr.concat(ds_to_da_list(post, prepend_strings("type_effect", post.party.values)), "party")

In [35]:
# align party and params dims
post["β_district"] = xr.concat(
    [
        da.rename({"params_short" if "params_short" in da.dims else "params_extend": "params"})
        for da in ds_to_da_list(post, prepend_strings("β_district", post.party.values))
    ], 
    "party",
    fill_value=0
)

# align incumbent with party dimension
_, incumbent_indicator = xr.align(
    post["party"],
    const["incumbent_indicator"].rename(incumbents="party"),
    join="left",
    fill_value=0
)
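
For readers who have not met `xr.align` before, here is a small sketch of what the left join with `fill_value=0` does. The party labels of the incumbency matrix are an assumption made for the example (only three parties have an incumbency column), and the dummies are invented:

```python
import numpy as np
import xarray as xr

names = ["farleft", "left", "green", "center", "right", "farright"]
parties = xr.DataArray(names, dims="party", coords={"party": names})

# hypothetical incumbency dummies: 2 districts, 3 parties only
incumbents = xr.DataArray(
    np.array([[0, 1, 0], [1, 0, 0]]),
    dims=("district", "party"),
    coords={"party": ["left", "center", "right"]},
)

# join="left" reindexes `incumbents` on the party labels of `parties`;
# parties without an incumbency column are filled with 0
_, aligned = xr.align(parties, incumbents, join="left", fill_value=0)
print(aligned)
```

After the alignment, the indicator has one column per party and can be multiplied with any array carrying the full `party` dimension.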

In [38]:
post["poll_error"].sel(election_type="municipale").std(("chain", "draw"))

In [39]:
# broadcasting city-level polls to each district
_, poll_broadcast = xr.broadcast(
    post["district"],
    const["stdz_polls"],
)
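# add noise due to individual polling errors
poll_error_std = post["poll_error"].sel(election_type="municipale").std(("chain", "draw"))
poll_uncert = xr.apply_ufunc(
    lambda poll, poll_std: np.random.normal(poll, poll_std),
    poll_broadcast,
    poll_error_std,
)
# NOTE: this assignment overrides the noisy version above,
# so the polling noise is not carried into the predictions below
poll_uncert = poll_broadcast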

In [41]:
# step 2: remove unnecessary vars from memory, to relieve RAM
# use dask?
# post.drop([
#     *prepend_strings("poll_diff", post.party.values),
#     *prepend_strings("type_effect", post.party.values),
#     *prepend_strings("β_district", post.party.values),
# ])

# step 3: write the calculation mus_parties
mus_parties = (
    poll_uncert
    + post["β_district"].sel(params="local_standing")
    + post["type_effect"].sel(election_type="municipale")
    + post["β_district"].sel(params="log_unemployment") * const["log_unemployment"]
    + post["β_district"].sel(params="incumbency") * incumbent_indicator
).drop("params")
mus_parties

In [42]:
def add_other_party(mu_parties, vary_piv_mu, vary_piv_std, rng=None):
    """
    Implement last steps of the model: 
        - Sample the last category from the posterior and concatenate it to the other parties
        - Softmax everything to get vote shares
    Parameters
    ----------
    mu_parties: posterior latent means (pre-softmax) for the six parties
    vary_piv_mu: posterior mean of last multinomial category
    vary_piv_std: posterior std of last multinomial category
    rng: numpy random Generator; a fresh one is created when None
    """
    if rng is None:
        rng = np.random.default_rng()
    # append last category:
    vary_pivot = rng.normal(loc=vary_piv_mu, scale=vary_piv_std, size=(*mu_parties.shape[:-1], 1))
    post_preds = np.concatenate((mu_parties, vary_pivot), axis=-1)
    
    # preferences of each district
    share_est = softmax(post_preds, axis=-1) * 100
    
    return share_est

In [43]:
# step 4-6: 
#     add vary_pivot and concat
#     apply softmax along party dimension

share_est = xr.apply_ufunc(
    add_other_party,
    mus_parties,
    post["vary_pivot"].mean(),
    post["vary_pivot"].std(),
    kwargs={"rng": np.random.default_rng()},
    input_core_dims=[["party"], [], []],
    output_core_dims=[["party_complete"]],
).assign_coords(party_complete=post.party_complete)
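
The `input_core_dims` / `output_core_dims` arguments are the least intuitive part of this call, so here is a stripped-down sketch of the same pattern. The helper is a simplified, deterministic stand-in for `add_other_party` (it appends a fixed pivot instead of sampling one), and the input is random toy data:

```python
import numpy as np
import xarray as xr


def append_pivot_and_softmax(mu: np.ndarray, pivot: float = 0.0) -> np.ndarray:
    """Append a constant pivot category on the last axis, then softmax that axis."""
    pivot_col = np.full((*mu.shape[:-1], 1), pivot)
    full = np.concatenate((mu, pivot_col), axis=-1)
    expo = np.exp(full - full.max(axis=-1, keepdims=True))
    return expo / expo.sum(axis=-1, keepdims=True) * 100


rng = np.random.default_rng(0)
mu = xr.DataArray(
    rng.normal(size=(2, 10, 3, 6)), dims=("chain", "draw", "district", "party")
)

shares = xr.apply_ufunc(
    append_pivot_and_softmax,
    mu,
    kwargs={"pivot": 0.0},
    input_core_dims=[["party"]],  # moved to the last axis before calling the function
    output_core_dims=[["party_complete"]],  # name of the new, longer last axis
)
print(dict(shares.sizes))
```

The function only ever sees plain numpy arrays with the core dimension last. Since the output is one category longer than the input, it needs a new dimension name rather than reusing `party`.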

In [44]:
az.to_netcdf(share_est, "oos_data/share_est_bis.nc")
share_est.to_dataset(name="share_est") # dataset views are more compact and informative than dataarray ones

Now we have the proportions for each party, in each district! But we'll also want to map those results on the geography of Paris. To do that, we'll compute the most likely winner in each district, its probability of winning, and the proportion of the votes it's expected to get. This is done by the code below. I want to thank Grégoire David for [his open-source project, france-geojson](https://github.com/gregoiredavid/france-geojson), where I found the geographic shapes of Paris.
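
In plain numpy, the logic for one district boils down to the sketch below. The draws are simulated from made-up normal distributions (so they don't even sum to 100), just to show how the winner, its probability and its expected share are derived from posterior samples:

```python
import numpy as np

rng = np.random.default_rng(42)
parties = np.array(["left", "center", "right"])

# made-up posterior draws of vote shares for one district: (draws, parties)
draws = rng.normal(loc=[30.0, 28.0, 20.0], scale=2.0, size=(4000, 3))

# which party comes first in each draw?
winner_per_draw = draws.argmax(axis=1)

# share of draws won by each party = probability of winning
win_prob = np.bincount(winner_per_draw, minlength=len(parties)) / len(draws)

# most probable winner and the mean vote share it gets across draws
best = win_prob.argmax()
likely_winner = parties[best]
mean_share = draws[:, best].mean()
print(likely_winner, win_prob[best], mean_share)
```

The cell below does the same thing with `idxmax` and `value_counts`, and adds a highest-density interval around the winner's share.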

In [45]:
win_summary = []
for district in share_est.district.values:
    winners = pd.DataFrame(
        share_est.sel(district=district).idxmax(dim="party_complete").to_series().value_counts(normalize=True).multiply(100).round().astype(int)
    ).reset_index().loc[[0]] # last statement keeps only most probable winner
    winners.columns = ["winner", "prob"]
    winners["district"] = district
    
    winner = winners.loc[0, "winner"]
    samples_winner = share_est.sel(district=district).sel(party_complete=winner)
    
    winners["mean"] = samples_winner.mean().data
    winners["low"] = az.hdi(samples_winner).sel(hdi="lower")["x"].data
    winners["high"] = az.hdi(samples_winner).sel(hdi="higher")["x"].data
    
    win_summary.append(winners)
win_summary = pd.concat(win_summary, ignore_index=True)

# bin mean estimates to match with colors
win_summary["bins_idx"] = np.digitize(win_summary["mean"], BINS)

def label_color(row):
    """
    Associate winner with its color, based on the magnitude of the mean estimate.
    """
    return COLORS[row["winner"]][row["bins_idx"]]

win_summary["color"] = win_summary.apply(label_color, axis=1)
win_summary

Unnamed: 0,winner,prob,district,mean,low,high,bins_idx,color
0,center,90,4,28.562973,26.981177,30.088558,2,#fdae6b
1,left,63,5,26.421203,24.809263,27.968289,2,#c994c7
2,right,100,6,30.721111,28.513962,32.96453,2,#9ecae1
3,right,100,7,39.343477,36.468735,41.806597,3,#6baed6
4,right,100,8,34.626346,31.281976,38.010807,2,#9ecae1
5,left,100,9,29.026972,27.472882,30.498489,2,#c994c7
6,left,100,10,33.316364,30.660068,35.424876,2,#c994c7
7,left,100,11,29.716197,28.185143,31.203296,2,#c994c7
8,left,100,12,30.525909,28.434261,32.222697,2,#c994c7
9,left,100,13,29.072681,27.381592,30.718977,2,#c994c7
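
The `bins_idx` column comes from `np.digitize`, which returns the index of the bin each mean falls into. Here is a small sketch with the same `BINS` and a placeholder palette (the colour names are hypothetical). It also shows an edge case: a mean of 75 or more yields index 7, one past the end of a 7-colour palette, so this sketch clips the index, which the cell above does not do:

```python
import numpy as np

BINS = np.array([15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0])
palette = np.array(["c0", "c1", "c2", "c3", "c4", "c5", "c6"])  # hypothetical ramp

# two means taken from the table above, plus two made-up extremes
means = np.array([12.0, 28.562973, 39.343477, 80.0])

bins_idx = np.digitize(means, BINS)  # index i such that BINS[i-1] <= x < BINS[i]

# guard against indices falling outside the palette
safe_idx = np.clip(bins_idx, 0, len(palette) - 1)
colors = palette[safe_idx]
print(bins_idx, colors)
```

With the vote shares seen in this notebook the clipping never kicks in, so it is only a safety net.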


In [46]:
paris_shape = (
    gpd.read_file("../data/paris_shape.json")
    .set_index("code")
    .sort_index()
    .drop("nom", axis=1)
)

# merge first four districts and reappend
first_four = paris_shape.loc[: "75104"].copy()
first_four["new_name"] = 4
first_four = first_four.dissolve(by="new_name")

paris_shape = paris_shape.loc["75105": ].copy()
paris_shape["new_name"] = districts[1:]
paris_shape = paris_shape.reset_index(drop=True).set_index("new_name")

paris_shape = first_four.append(paris_shape).reset_index().rename(columns={"new_name": "district"})

# merge with winner summary
paris_shape = paris_shape.merge(win_summary, left_on="district", right_on="district")
paris_shape["winner"] = paris_shape["winner"].str.title()

# save file (remove any previous export first, in case the driver cannot overwrite it)
try: 
    os.remove("oos_data/paris_shape_bis.geojson")
except FileNotFoundError:
    pass
paris_shape.to_file("oos_data/paris_shape_bis.geojson", driver='GeoJSON')
paris_shape

Unnamed: 0,district,geometry,winner,prob,mean,low,high,bins_idx,color
0,4,"POLYGON ((2.36841 48.85574, 2.36902 48.85322, ...",Center,90,28.562973,26.981177,30.088558,2,#fdae6b
1,5,"POLYGON ((2.34456 48.85399, 2.36432 48.84617, ...",Left,63,26.421203,24.809263,27.968289,2,#c994c7
2,6,"POLYGON ((2.31663 48.84675, 2.32829 48.85179, ...",Right,100,30.721111,28.513962,32.96453,2,#9ecae1
3,7,"POLYGON ((2.32078 48.86308, 2.33285 48.85930, ...",Right,100,39.343477,36.468735,41.806597,3,#6baed6
4,8,"POLYGON ((2.32712 48.88349, 2.32576 48.86955, ...",Right,100,34.626346,31.281976,38.010807,2,#9ecae1
5,9,"POLYGON ((2.32576 48.86955, 2.32712 48.88349, ...",Left,100,29.026972,27.472882,30.498489,2,#c994c7
6,10,"POLYGON ((2.36468 48.88429, 2.37702 48.87192, ...",Left,100,33.316364,30.660068,35.424876,2,#c994c7
7,11,"POLYGON ((2.37702 48.87192, 2.39430 48.85649, ...",Left,100,29.716197,28.185143,31.203296,2,#c994c7
8,12,"POLYGON ((2.36595 48.84491, 2.36432 48.84617, ...",Left,100,30.525909,28.434261,32.222697,2,#c994c7
9,13,"POLYGON ((2.36595 48.84491, 2.39007 48.82569, ...",Left,100,29.072681,27.381592,30.718977,2,#c994c7


We've now stored all the data we need -- it's time to plot them and see what the model is telling us about the coming elections... See you in the notebook `munic_model_analysis.ipynb` or, even better, [here](https://mybinder.org/v2/gh/AlexAndorra/pollsposition_models/master?urlpath=%2Fvoila%2Frender%2Fdistrict-level%2Fmunic_model_analysis.ipynb)! Indeed, this notebook is designed to be seen online, not really to be read on GitHub, where the map's JavaScript is not displayed.

In [97]:
%load_ext watermark
%watermark -a AlexAndorra -n -u -v -iv

pymc3     3.9.3
pandas    1.0.5
arviz     0.10.0
numpy     1.19.1
geopandas 0.8.1
xarray    0.16.0
AlexAndorra 
last updated: Fri Oct 02 2020 

CPython 3.8.5
IPython 7.18.1
