This notebook aggregates the polls of each election we're interested in. The aggregates are used later as predictors in the district-level model of elections in Paris (see notebook `munic_model_prod.ipynb`). For each election, polls are weighted by their sample size, their recency and the historical performance of the pollster, as known at the time of the election.

In [1]:
%load_ext watermark

import numpy as np
import os
import pandas as pd
import scipy as sp

from typing import List

In [2]:
NB_PARTIES = {
    "nbfarleft": "farleft",
    "nbleft": "left",
    "nbgreen": "green",
    "nbcenter": "center",
    "nbright": "right",
    "nbfarright": "farright",
}
VARIABLES_TO_KEEP = [
    "type",
    "dateelection",
    "date",
    "sondage",
    "samplesize",
    "nbfarleft",
    "nbleft",
    "nbgreen",
    "nbcenter",
    "nbright",
    "nbfarright",
]
DATES_ELECTIONS = {
    "presid2007": "2007-04-22",
    "legis2007": "2007-06-10",
    "munic2008": "2008-03-09",
    "euro2009": "2009-06-07",
    "regio2010": "2010-03-14",
    "presid2012": "2012-04-22",
    "legis2012": "2012-06-10",
    "munic2014": "2014-03-23",
    "euro2014": "2014-05-25",
    "regio2015": "2015-12-06",
    "presid2017": "2017-04-23",
    "legis2017": "2017-06-11",
}
SPAN = 5  # span of poll-aggregation

Let's load the polls of all elections from 2006 (inclusive) to 2019 (exclusive). 2019 will be our out-of-sample test election, and we already have the poll aggregation for it; we start in 2006 because our district-level predictors do. Now let's see the data:

In [3]:
all_polls = pd.read_csv(
    "../data/polls_1st_round/tour1_complet_unitedfl.csv",
    parse_dates=["date", "dateelection"],
    usecols=VARIABLES_TO_KEEP,
).sort_values(["date", "sondage"])

all_polls = all_polls[
    (all_polls.sondage != "seats")
    & (all_polls.sondage != "result")
    & (all_polls.dateelection.dt.year >= 2006)
    & (all_polls.dateelection.dt.year < 2019)
].reset_index(drop=True)
all_polls

Unnamed: 0,type,dateelection,date,sondage,samplesize,nbfarleft,nbleft,nbgreen,nbcenter,nbright,nbfarright
0,president,2007-04-22,2006-05-18,Kantar,715.0,5.0,30.0,2.5,8.0,34.0,10.0
1,president,2007-04-22,2006-06-15,Kantar,788.0,6.0,32.0,2.0,8.0,31.0,12.5
2,president,2007-04-22,2006-07-17,Kantar,601.0,4.0,32.0,1.5,6.0,35.0,11.5
3,president,2007-04-22,2006-09-05,Kantar,683.0,3.5,34.0,1.5,7.0,36.0,10.0
4,president,2007-04-22,2006-10-05,Kantar,839.0,5.0,29.5,2.0,7.0,38.0,9.5
...,...,...,...,...,...,...,...,...,...,...,...
579,legislatives,2017-06-11,2017-06-07,Elabe,1152.0,11.0,9.0,3.0,29.0,23.0,17.0
580,legislatives,2017-06-11,2017-06-07,Harris,500.0,12.0,7.0,3.0,30.0,19.0,17.0
581,legislatives,2017-06-11,2017-06-07,Ifop,886.0,11.0,8.0,3.5,30.0,20.0,18.0
582,legislatives,2017-06-11,2017-06-07,OpinionWay,1667.0,12.0,7.0,3.0,30.0,21.0,18.0


We have to add the polls for the 2008 and 2014 Paris city-council elections -- these are not included in our database and our pollster ratings because 1/ there aren't a lot of them and 2/ only a handful of pollsters surveyed these races. So usually they don't hold a lot of information. But here they do: as our goal in the model will be to predict the 2020 Paris city-council elections, these elections are particularly relevant, and their associated polls -- although limited -- are of interest.

So let's load these bad boys and concatenate them with the previous polls:

In [4]:
for year in ["2008", "2014"]:
    # same data folder as the other inputs of this notebook
    new_polls = pd.read_excel(f"../data/polls_1st_round/paris_city_council_{year}.xlsx")
    new_polls["type"] = "municipale"
    new_polls["dateelection"] = pd.to_datetime(DATES_ELECTIONS[f"munic{year}"])

    all_polls = pd.concat([all_polls, new_polls], ignore_index=True, sort=False)

all_polls = all_polls.sort_values(["date", "sondage"])
all_polls[list(NB_PARTIES.keys())] = all_polls[list(NB_PARTIES.keys())].fillna(0)
all_polls

Unnamed: 0,type,dateelection,date,sondage,samplesize,nbfarleft,nbleft,nbgreen,nbcenter,nbright,nbfarright
0,president,2007-04-22,2006-05-18,Kantar,715.0,5.0,30.0,2.5,8.0,34.0,10.0
1,president,2007-04-22,2006-06-15,Kantar,788.0,6.0,32.0,2.0,8.0,31.0,12.5
2,president,2007-04-22,2006-07-17,Kantar,601.0,4.0,32.0,1.5,6.0,35.0,11.5
3,president,2007-04-22,2006-09-05,Kantar,683.0,3.5,34.0,1.5,7.0,36.0,10.0
4,president,2007-04-22,2006-10-05,Kantar,839.0,5.0,29.5,2.0,7.0,38.0,9.5
...,...,...,...,...,...,...,...,...,...,...,...
579,legislatives,2017-06-11,2017-06-07,Elabe,1152.0,11.0,9.0,3.0,29.0,23.0,17.0
580,legislatives,2017-06-11,2017-06-07,Harris,500.0,12.0,7.0,3.0,30.0,19.0,17.0
581,legislatives,2017-06-11,2017-06-07,Ifop,886.0,11.0,8.0,3.5,30.0,20.0,18.0
582,legislatives,2017-06-11,2017-06-07,OpinionWay,1667.0,12.0,7.0,3.0,30.0,21.0,18.0


Now, for each election, we want to aggregate all those polls and weight them by their recency, sample size and historical performance of the pollster. This last weight is approximated by our pollster ratings. So our goal is to get the polling aggregation on the eve of each election. 
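
To make the weighting concrete, here is how the aggregate computed by the helper functions of this notebook can be written down. The notation ($v$, $w$, $n$, $r$, $k$) is ours and only serves as a reading aid; it is a sketch of what the code does, not an additional modeling assumption.

$$
\hat{v}_p = \frac{\sum_i w_{i,p}\, v_{i,p}}{\sum_i w_{i,p}},
\qquad
w_{i,p} = \log(n_i)\; r_{s(i),p}\; (1-\alpha)^{k_i},
\qquad
\alpha = \frac{2}{\text{SPAN}+1}
$$

Here $v_{i,p}$ is the share of party $p$ in poll $i$, $n_i$ is the sample size of the poll, $r_{s(i),p}$ is the rating weight of the pollster for that party (column `weightsondeur_{p}`), and $k_i$ is the number of distinct poll dates separating poll $i$ from the most recent poll date. With `SPAN = 5`, $\alpha = 1/3$, so a poll loses a third of its weight each time a more recent poll date appears.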

And the pollster ratings we'll use will be different for each election: they will be based on all the polls we'd have seen up to (but not including) that election. For instance, the pollster ratings for the 2017 presidential election are based on all polls of all elections in our database _before_ this election -- because at the time, while doing our aggregation, we wouldn't have known the future performance of pollsters during this election. That way we're not cheating and our model will be fit on data it could have known at the time of each election. Got it?
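
The cutoff rule can be illustrated on toy data. This is only a sketch: the ratings used in this notebook are precomputed and loaded from CSV files, the helper `polls_known_before` is a hypothetical name, and the rows below are made up.

```python
import pandas as pd

# Toy data (made up): one row per poll, tagged with the election it targets.
polls = pd.DataFrame(
    {
        "sondage": ["Ifop", "Kantar", "Ifop", "Kantar", "Ifop"],
        "dateelection": pd.to_datetime(
            ["2012-04-22", "2012-04-22", "2012-06-10", "2012-06-10", "2017-04-23"]
        ),
    }
)


def polls_known_before(df: pd.DataFrame, election_date: str) -> pd.DataFrame:
    """Hypothetical helper: keep polls of elections strictly before `election_date`."""
    cutoff = pd.Timestamp(election_date)
    return df[df["dateelection"] < cutoff]


# Ratings for the 2012 legislative elections may only use earlier elections:
history = polls_known_before(polls, "2012-06-10")
print(history)
```

The strict inequality is what keeps the polls of the election being rated out of its own ratings.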

One last thing: we don't have any pollster ratings for the 2007 legislative elections or for the 2008 and 2014 city-council elections -- both because there weren't enough polls and because they came from too few different pollsters. So for each of them we'll take the ratings of the next election that has some (because these ratings include polls from the elections that happened just _before_ it).
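
This substitution can be written as a small lookup table. The sketch below mirrors the file names used in this notebook, but `RATINGS_FALLBACK`, `ratings_key` and `ratings_path` are hypothetical names introduced for illustration only.

```python
# Hypothetical lookup: elections without their own ratings -> election whose
# ratings file is used instead.
RATINGS_FALLBACK = {
    "legis2007": "euro2009",
    "munic2008": "euro2009",
    "munic2014": "euro2014",
}


def ratings_key(election: str) -> str:
    """Return the election whose ratings are used for `election`."""
    return RATINGS_FALLBACK.get(election, election)


def ratings_path(election: str) -> str:
    """Build the path of the ratings file used for `election`."""
    return f"../data/polls_1st_round/classement_{ratings_key(election)}.csv"


print(ratings_path("munic2008"))
print(ratings_path("presid2017"))
```

A dictionary keeps the exceptions in one place, so adding another election without ratings would not require a new branch.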

The helper functions basically execute this roadmap: they compute the weights for our aggregation -- based on the pollster ratings, the recency and the sample size of the poll -- and then they aggregate the polls election by election:

In [5]:
def compute_analyt_weights(election: str, df: pd.DataFrame) -> pd.DataFrame:
    """Merge the pollster ratings into `df` and compute per-party analytical weights."""

    if (election == "legis2007") or (election == "munic2008"):
        pollster_ratings = pd.read_csv("../data/polls_1st_round/classement_euro2009.csv")
        print(f"Just loaded classement_euro2009 for {election}\n")

    elif election == "munic2014":
        pollster_ratings = pd.read_csv("../data/polls_1st_round/classement_euro2014.csv")
        print(f"Just loaded classement_euro2014 for {election}\n")

    else:
        pollster_ratings = pd.read_csv(f"../data/polls_1st_round/classement_{election}.csv")
        print(f"Just loaded classement_{election}.csv\n")

    df = pd.merge(df, pollster_ratings, how="left", on="sondage")

    for p in NB_PARTIES.values():
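        # pollsters without a rating get the median rating; assign the result
        # back instead of relying on a chained in-place fillna
        df[f"weightsondeur_{p}"] = df[f"weightsondeur_{p}"].fillna(
            pollster_ratings[f"weightsondeur_{p}"].median()
        )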
        df[f"analyt_weights_{p}"] = np.log(df.samplesize) * df[f"weightsondeur_{p}"]

    return df.set_index("date").sort_index()


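def agg_polls(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the running weighted average of each party's polls for one election.

    `df` must be indexed by poll date, as returned by `compute_analyt_weights`.
    """
    alpha = 2 / (SPAN + 1)
    unique_dates = sorted(set(df.index))

    for nb_p, p in NB_PARTIES.items():
        for i, d_outer in enumerate(unique_dates):
            # recency weights as of d_outer: 1 for polls on d_outer, then
            # multiplied by (1 - alpha) for each earlier poll date
            for j, d_inner in enumerate(unique_dates[: i + 1]):
                df.loc[d_inner, f"expon_weights_{p}"] = (1 - alpha) ** (i - j)

            df[f"final_weights_{p}"] = (
                df[f"analyt_weights_{p}"] * df[f"expon_weights_{p}"]
            )
            # only polls published up to d_outer enter the average
            final_weights = df.loc[:d_outer, f"final_weights_{p}"]
            vote_share = df.loc[:d_outer, nb_p]

            df.loc[d_outer, f"{p}_agg"] = np.average(vote_share, weights=final_weights)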

            # compute aggregate sample size only once:
            if p == "right":
                # same weights, whatever the party:
                expon_weights = df.loc[:d_outer, "expon_weights_right"]
                sample_size = df.loc[:d_outer, "samplesize"]
                df.loc[d_outer, "samplesize_agg"] = round(
                    np.average(sample_size, weights=expon_weights)
                )

    return df.reset_index()[
        ["type", "dateelection", "samplesize_agg"]
        + [f"{p}_agg" for p in NB_PARTIES.values()]
    ]
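
For intuition, the weights applied on the last poll date (the only row we keep per election) can be sketched in vectorized form. This is a simplified illustration for a single party on made-up numbers, not the implementation used in this notebook; the `toy` dataframe and its column names are assumptions.

```python
import numpy as np
import pandas as pd

SPAN = 5
alpha = 2 / (SPAN + 1)

# Toy polls for one party (made-up numbers), indexed by poll date.
toy = pd.DataFrame(
    {
        "share": [30.0, 32.0, 31.0, 29.0],
        "analyt_weight": [6.5, 6.9, 6.2, 7.0],
    },
    index=pd.to_datetime(["2017-04-01", "2017-04-10", "2017-04-10", "2017-04-20"]),
).sort_index()

# Rank each distinct poll date (0 = oldest); polls on the same day share a rank.
codes, _ = pd.factorize(toy.index, sort=True)

# Number of distinct poll dates between each poll and the most recent one.
steps_back = codes.max() - codes

# Recency weight: 1 for the latest date, shrinking by (1 - alpha) per step back.
expon_weight = (1 - alpha) ** steps_back
final_weight = toy["analyt_weight"].to_numpy() * expon_weight

agg = np.average(toy["share"], weights=final_weight)
print(steps_back, round(agg, 2))
```

Note that recency is counted in distinct poll dates rather than in calendar days, as in `agg_polls`: two polls ten days apart are one step apart if no other poll was published in between.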

And now we just have to run these functions for all the elections we're interested in:

In [6]:
polls_series = []

for election in DATES_ELECTIONS:
    election_df = compute_analyt_weights(
        election, all_polls[all_polls.dateelection == DATES_ELECTIONS[election]]
    )
    polls_series.append(agg_polls(election_df).iloc[-1])

polls_df = (
    pd.concat(polls_series, axis=1).T.sort_values("dateelection").reset_index(drop=True)
)
polls_df.to_excel("../data/polls_1st_round/aggregated_polls.xlsx")
polls_df

Just loaded classement_presid2007.csv

Just loaded classement_euro2009 for legis2007

Just loaded classement_euro2009 for munic2008

Just loaded classement_euro2009.csv

Just loaded classement_regio2010.csv

Just loaded classement_presid2012.csv

Just loaded classement_legis2012.csv

Just loaded classement_euro2014 for munic2014

Just loaded classement_euro2014.csv

Just loaded classement_regio2015.csv

Just loaded classement_presid2017.csv

Just loaded classement_legis2017.csv



Unnamed: 0,type,dateelection,samplesize_agg,farleft_agg,left_agg,green_agg,center_agg,right_agg,farright_agg
0,president,2007-04-22,1513,2.64232,23.8718,0.708134,18.8066,27.3619,14.2822
1,legislatives,2007-06-10,916,7.33043,27.5673,3.70615,11.1186,41.3225,6.30612
2,municipale,2008-03-09,755,3.58771,44.5113,5.78942,8.09858,32.7406,2.18221
3,europeennes,2009-06-07,2287,12.9945,19.6148,13.3515,11.1685,27.2626,5.75632
4,regionales,2010-03-14,907,9.82754,28.9297,13.5842,4.684,28.9307,9.08352
5,president,2012-04-22,1400,13.6837,27.627,2.5297,10.3735,26.7834,15.8625
6,legislatives,2012-06-10,1193,7.97386,32.1113,5.07061,2.77511,33.6824,15.2586
7,municipale,2014-03-23,977,5.59076,38.1997,6.53985,0.0,36.5966,8.23862
8,europeennes,2014-05-25,3248,7.5563,16.6892,8.88391,9.68667,21.4976,23.2143
9,regionales,2015-12-06,1749,5.26773,22.9061,5.86065,0.0,28.1125,28.8815


In [7]:
%watermark -a AlexAndorra -n -u -v -iv

pandas 0.25.3
scipy  1.3.1
numpy  1.17.3
AlexAndorra 
last updated: Thu Jan 23 2020 

CPython 3.7.5
IPython 7.9.0
