# Study Catalan Elections Participation Dataset

Load libraries:

In [65]:
import pandas as pd
import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import logging
from unidecode import unidecode

pp = pprint.PrettyPrinter(indent=2)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

Load the dataset:

In [66]:
df = pd.read_pickle('../../data/raw/catalan-elections-participation.pkl')
df_original = df.copy()

## Dataset Structure

Visualize the structure of the dataset:

In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930819 entries, 0 to 930818
Data columns (total 24 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   index_autonumeric       930819 non-null  Int64 
 1   id_eleccio              930819 non-null  string
 2   nom_eleccio             930819 non-null  string
 3   id_nivell_territorial   930819 non-null  string
 4   nom_nivell_territorial  930819 non-null  string
 5   territori_codi          930819 non-null  string
 6   territori_nom           930819 non-null  string
 7   districte               855641 non-null  Int64 
 8   seccio                  868235 non-null  Int64 
 9   cens_electoral          930819 non-null  Int64 
 10  votants                 930819 non-null  Int64 
 11  abstencio               930819 non-null  Int64 
 12  vots_nuls               930819 non-null  Int64 
 13  vots_blancs             930819 non-null  Int64 
 14  vots_candidatures       878103 non-n

| Column name        | Description                                 | Type      |
|--------------------|---------------------------------------------|-----------|
| TERRITORI_NOM      | Name of the territory                       | Text      |
| DISTRICTE          | Electoral district                          | Text      |
| SECCIO             | Electoral section                           | Text      |
| MESA               | Polling station                             | Text      |
| CENS_ELECTORAL     | Electoral census of the territory           | Number    |
| PADRO              | Population register of the territory        | Number    |
| ESCONS             | Seats to be elected in the constituency     | Number    |
| NOMBRE_MESES       | Number of polling stations                  | Number    |
| VOTANTS            | Number of voters                            | Number    |
| ABSTENCIO          | Number of people who abstained              | Number    |
| VOTS_NULS          | Null votes                                  | Number    |
| VOTS_BLANCS        | Blank votes                                 | Number    |
| VOTS_CANDIDATURES  | Votes for candidacies                       | Number    |
| VOTS_VALIDS        | Valid votes (for candidacies + blank)       | Number    |
| VOTS_PRIMER_AVAN   | Voters at the first turnout update          | Number    |
| HORA_PRIMER_AVAN   | Time of the first turnout update            | Text      |
| VOTS_SEGON_AVAN    | Voters at the second turnout update         | Number    |
| HORA_SEGON_AVAN    | Time of the second turnout update           | Text      |
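
The description of `VOTS_VALIDS` implies an identity that can be checked on the data. The following is a minimal sketch on a small hand-made frame, using the lowercase column names reported by `df.info()`. The values and the helper name `check_valid_votes` are illustrative assumptions, not taken from the dataset:

```python
import pandas as pd


def check_valid_votes(frame: pd.DataFrame) -> pd.Series:
    """Return True where vots_valids == vots_candidatures + vots_blancs."""
    expected = frame["vots_candidatures"] + frame["vots_blancs"]
    return frame["vots_valids"] == expected


# Illustrative values only (not real election results).
sample = pd.DataFrame(
    {
        "vots_candidatures": [200, 150],
        "vots_blancs": [10, 5],
        "vots_valids": [210, 160],  # second row is deliberately inconsistent
    }
)
consistent = check_valid_votes(sample)
print(consistent.tolist())
```

With the nullable `Int64` columns of the real dataset, the comparison yields `<NA>` wherever one of the operands is missing, so those rows would need to be handled separately.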

## Types of Elections

First of all, we will divide `id_eleccio` into `type`, `year` and `sequential` as we did in the previous notebook. This will allow us to analyze the dataset by type of election.

In [68]:
df['type'] = df['id_eleccio'].str[:1]
df['year'] = df['id_eleccio'].str[1:5].astype(int)
df['sequential'] = df['id_eleccio'].str[5:]
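
To make the slicing explicit, here is a minimal sketch applied to one identifier that appears in the data (`A19951`). The helper name `split_election_id` is hypothetical and only wraps the three slices used above:

```python
import pandas as pd


def split_election_id(ids: pd.Series) -> pd.DataFrame:
    """Split identifiers such as 'A19951' into type, year and sequential parts."""
    return pd.DataFrame(
        {
            "type": ids.str[:1],  # first character: election type code
            "year": ids.str[1:5].astype(int),  # next four characters: year
            "sequential": ids.str[5:],  # remainder: sequential number
        }
    )


parts = split_election_id(pd.Series(["A19951"]))
print(parts)
```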

Show the types of elections:

In [69]:
types = df[['type', 'nom_eleccio']].groupby(['type']).first()
print(types)
print(len(types))

                                   nom_eleccio
type                                          
A     Eleccions al Parlament de Catalunya 1995
C                         Eleccions Municipals
D          Eleccions a Diputacions Provincials
E               Eleccions al Parlament Europeu
G                      Eleccions Generals 2008
M                    Eleccions Municipals 2007
R               Referèndum Constitucional 1978
S                      Eleccions al Senat 1979
V     Eleccions al Consell General d'Aran 2023
9


Now we know that the dataset contains data from 9 different types of elections:

| Type | Election Type Name                          |
|------|---------------------------------------------|
| A    | Elections to the Parliament of Catalonia    |
| C    | Elections to the County Councils            |
| D    | Elections to the Provincial Councils        |
| E    | Elections to the European Parliament        |
| G    | Elections to the Congress                   |
| M    | Municipal Elections                         |
| R    | Constitutional Referendum                   |
| S    | Elections to the Senate                     |
| V    | Elections to the General Council of Aran    |

One of these types was not present in the previous dataset: `R`, the Constitutional Referendum (`Referèndum Constitucional 1978`).
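
If the English names are needed later, the table of election types can be kept as a lookup. This is a minimal sketch; the dictionary name `ELECTION_TYPE_NAMES` is an assumption, and its contents are copied from the table of election types:

```python
import pandas as pd

# English names copied from the table of election types.
ELECTION_TYPE_NAMES = {
    "A": "Elections to the Parliament of Catalonia",
    "C": "Elections to the County Councils",
    "D": "Elections to the Provincial Councils",
    "E": "Elections to the European Parliament",
    "G": "Elections to the Congress",
    "M": "Municipal Elections",
    "R": "Constitutional Referendum",
    "S": "Elections to the Senate",
    "V": "Elections to the General Council of Aran",
}

# Map a few type codes to their names (codes not in the dict would become NaN).
codes = pd.Series(["A", "R", "M"])
names = codes.map(ELECTION_TYPE_NAMES)
print(names.tolist())
```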

## Filter by Election Type

As we explained in the previous notebook, we are only interested in the data from the elections to the Parliament of Catalonia ('A'), municipal elections ('M'), elections to the European Parliament ('E') and elections to the Congress ('G'), so we filter the dataset:

In [70]:
df = df[df['type'].isin(['M', 'E', 'A', 'G'])]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 678882 entries, 548 to 930234
Data columns (total 27 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   index_autonumeric       678882 non-null  Int64 
 1   id_eleccio              678882 non-null  string
 2   nom_eleccio             678882 non-null  string
 3   id_nivell_territorial   678882 non-null  string
 4   nom_nivell_territorial  678882 non-null  string
 5   territori_codi          678882 non-null  string
 6   territori_nom           678882 non-null  string
 7   districte               624021 non-null  Int64 
 8   seccio                  633605 non-null  Int64 
 9   cens_electoral          678882 non-null  Int64 
 10  votants                 678882 non-null  Int64 
 11  abstencio               678882 non-null  Int64 
 12  vots_nuls               678882 non-null  Int64 
 13  vots_blancs             678882 non-null  Int64 
 14  vots_candidatures       678882 non-null

## Check for missing values

Now, we compute the percentage of missing values in each column:

In [71]:
# Calculate the percentage of missing values in each column and sort them
df.isnull().mean().sort_values(ascending=False) * 100

nombre_meses              93.256118
escons                    93.256118
padro                     93.115593
mesa                      49.661502
vots_segon_avan           22.344678
vots_primer_avan          22.344678
districte                  8.081080
seccio                     6.669348
index_autonumeric          0.000000
vots_valids                0.000000
year                       0.000000
type                       0.000000
hora_segon_avan            0.000000
hora_primer_avan           0.000000
vots_blancs                0.000000
vots_candidatures          0.000000
id_eleccio                 0.000000
vots_nuls                  0.000000
abstencio                  0.000000
votants                    0.000000
cens_electoral             0.000000
territori_nom              0.000000
territori_codi             0.000000
nom_nivell_territorial     0.000000
id_nivell_territorial      0.000000
nom_eleccio                0.000000
sequential                 0.000000
dtype: float64

Based on the previous table, `escons`, `nombre_meses` and `padro` are missing in more than 93% of the rows.

`mesa` (about 50%) and `vots_segon_avan` and `vots_primer_avan` (about 22% each) also have a significant share of missing values, while `districte` and `seccio` are missing in under 10% of the rows.

We will study them to see if we can fill them with some value or if we need to remove them.
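
One way to make this reading of the table reproducible is to bucket columns by their missing share. The sketch below does so on a toy frame. The thresholds (90% and 20%) and the function name `bucket_missing` are assumptions chosen for illustration, not values used elsewhere in this notebook:

```python
import pandas as pd


def bucket_missing(
    frame: pd.DataFrame, high: float = 90.0, moderate: float = 20.0
) -> pd.Series:
    """Label each column 'high', 'moderate' or 'low' by its percentage of missing values."""
    pct = frame.isnull().mean() * 100
    labels = pd.Series("low", index=pct.index)
    labels[pct >= moderate] = "moderate"
    labels[pct >= high] = "high"  # overrides 'moderate' for the sparsest columns
    return labels


# Toy data: column names come from the dataset, values are made up.
toy = pd.DataFrame(
    {
        "escons": [None] * 10,  # 100% missing
        "mesa": [1, None, 2, None, 3, None, 4, None, 5, None],  # 50% missing
        "votants": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],  # 0% missing
    }
)
buckets = bucket_missing(toy)
print(buckets)
```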

### Missing values in `escons`

In [72]:
df[df["escons"].isnull()]

Unnamed: 0,index_autonumeric,id_eleccio,nom_eleccio,id_nivell_territorial,nom_nivell_territorial,territori_codi,territori_nom,districte,seccio,cens_electoral,...,hora_segon_avan,mesa,vots_primer_avan,vots_segon_avan,padro,escons,nombre_meses,type,year,sequential
548,99504,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25165,Peramola,1,0,336,...,18:00:00,,167,232,,,,A,1995,1
549,99503,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25164,Penelles,1,0,477,...,18:00:00,,113,275,,,,A,1995,1
550,99502,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25163,"Coma i la Pedra, la",1,0,218,...,18:00:00,,61,117,,,,A,1995,1
551,99501,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25161,Conca de Dalt,1,0,400,...,18:00:00,,95,185,,,,A,1995,1
552,99500,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25158,"Palau d'Anglesola, el",1,0,1321,...,18:00:00,,285,770,,,,A,1995,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
929151,101498,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25170,Bellaguarda,1,0,316,...,18:00:00,,39,141,,,,A,1995,1
930231,100525,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25169,"Pobla de Cérvoles, la",1,0,188,...,18:00:00,,48,119,,,,A,1995,1
930232,100524,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25168,"Poal, el",1,0,537,...,18:00:00,,200,370,,,,A,1995,1
930233,100523,A19951,Eleccions al Parlament de Catalunya 1995,DM,Districte Municipal,25167,Pinós,1,0,295,...,18:00:00,,83,152,,,,A,1995,1


First we list the territorial levels present in the dataset (`id_nivell_territorial` itself has no missing values), and then we calculate the percentage of missing ``escons`` values for each level:

In [73]:
df[["id_nivell_territorial", "nom_nivell_territorial"]].drop_duplicates()

Unnamed: 0,id_nivell_territorial,nom_nivell_territorial
548,DM,Districte Municipal
588,ME,Mesa
7876,SE,Secció
8459,CA,Catalunya
14911,PR,Província
14915,VE,Vegueria
14925,CO,Comarca
22989,MU,Municipi
112991,PR,Provincia
120016,ME,Municipi


In [74]:
100 * df[df["escons"].isnull()].value_counts("id_nivell_territorial") / df.value_counts(
    "id_nivell_territorial"
)

id_nivell_territorial
CA      2.083333
CL    100.000000
CO      0.152439
DM    100.000000
ME    100.000000
MU      2.172104
PR      2.083333
SE    100.000000
VE     23.232323
Name: count, dtype: float64

We can see that for ``Mesa``, ``Col·legi``, ``Districte Municipal`` and ``Secció`` the percentage of missing values is 100.0%. At the other territorial levels it is low (between about 0.15% and 2.2%), except for ``Vegueria`` (``VE``), where it reaches 23.2%.
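
The same per-level percentages can also be obtained with a `groupby`, which avoids dividing two `value_counts` results. This is a minimal sketch on a toy frame; the rows are made up and only the column names come from the dataset:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id_nivell_territorial": ["ME", "ME", "MU", "MU", "MU", "MU"],
        "escons": [None, None, 9.0, 11.0, None, 13.0],
    }
)

# The mean of the boolean mask per group is the share of missing values in that group.
missing_pct = (
    toy["escons"].isnull().groupby(toy["id_nivell_territorial"]).mean() * 100
)
print(missing_pct)
```

Unlike the ratio of counts, this form reports 0% for levels without any missing value instead of `NaN`.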

We can also study the percentage of missing values by election type (`type`):

In [75]:
100 * df[df["escons"].isnull()].value_counts("type") / df.value_counts("type")

type
G    93.117926
A    93.382706
M    92.577261
E    94.132990
Name: count, dtype: float64

We will filter out the territorial levels with 100.0% of missing values:

In [76]:
df_filtered = df[~df["id_nivell_territorial"].isin(["CL", "DM", "ME", "SE"])]
print(
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts(
            "id_nivell_territorial"
        )
        / df_filtered["id_nivell_territorial"].value_counts()
    ).sort_values(ascending=False)
)
print(
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts("type")
        / df_filtered["type"].value_counts()
    ).sort_values(ascending=False)
)
print(
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts("year")
        / df_filtered["year"].value_counts()
    )
    .sort_values(ascending=False)
    .dropna()
)
print(
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts("territori_nom")
        / df_filtered["territori_nom"].value_counts()
    )
    .sort_values(ascending=False)
    .dropna()
)

id_nivell_territorial
VE    23.232323
MU     2.172104
CA     2.083333
PR     2.083333
CO     0.152439
Name: count, dtype: Float64
type
E    11.988964
G     0.127218
A     0.115723
M     0.064167
Name: count, dtype: Float64
year
2009    95.285858
2012     0.795229
2019     0.570578
2021     0.496032
2011     0.350350
2016     0.099305
2017     0.099206
2015     0.066335
Name: count, dtype: float64
territori_nom
CERA Tarragona      100.0
CERA Lleida         100.0
CERA Girona         100.0
CERA Catalunya      100.0
CERA Barcelona      100.0
                   ...   
Gironella         2.12766
Gisclareny        2.12766
Godall            2.12766
Golmés            2.12766
Tarragona         1.06383
Name: count, Length: 956, dtype: Float64


Most of the missing values correspond to the ``CERA *`` and ``Residents Absents`` entries in the ``territori_nom`` column. This makes sense: these "territories" are not physical places but groups of voters, so they have no seats to be elected.
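
A minimal sketch of how such non-territorial entries could be flagged follows. The matching rule (names starting with `CERA`, or equal to `Residents Absents`) is an assumption based on the names mentioned above, and the helper `is_non_territorial` is hypothetical; the real dataset may use other spellings:

```python
import pandas as pd


def is_non_territorial(names: pd.Series) -> pd.Series:
    """Flag entries that denote groups of voters rather than physical places (assumed rule)."""
    return names.str.startswith("CERA") | names.eq("Residents Absents")


names = pd.Series(["CERA Barcelona", "Residents Absents", "Tarragona", "Gironella"])
flags = is_non_territorial(names)
print(flags.tolist())
```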

In [94]:
territori_nom_escons_null = (
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts("territori_nom")
        / df_filtered["territori_nom"].value_counts()
    )
    .sort_values(ascending=False)
    .dropna()
)
eleccions_escons_null = (
    (
        100
        * df_filtered[df_filtered["escons"].isnull()].value_counts("type")
        / df_filtered["type"].value_counts()
    )
    .sort_values(ascending=False)
    .dropna()
)
generals_escons_null = (
    df_filtered[(df_filtered["escons"].isnull()) & (df_filtered["type"] == "G")]
)
autonomiques_escons_null = (
    df_filtered[(df_filtered["escons"].isnull()) & (df_filtered["type"] == "A")]
)
europees_escons_null = (
    df_filtered[(df_filtered["escons"].isnull()) & (df_filtered["type"] == "E")]
)
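# europees_escons_null only holds rows with missing `escons`, so the share is
# computed against all European-election rows to avoid a constant 100%.
europees = df_filtered[df_filtered["type"] == "E"]
print(
    (
        100
        * europees_escons_null.value_counts("territori_nom")
        / europees["territori_nom"].value_counts()
    )
    .sort_values(ascending=False)
)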

Vegueries do not have seats to be elected, so in the Catalan elections their `escons` values are missing.

In the European elections, no municipality has seats to be elected, so `escons` is missing for municipalities as well.

### Missing values in `nombre_meses`

### Missing values in `padro`

### Missing values in `mesa`

### Missing values in `vots_segon_avan` and `vots_primer_avan`

### Missing values in `districte`

### Missing values in `seccio`