# Study of the Clean Catalan Elections Dataset

Load libraries:

In [2]:
import pandas as pd
import pprint
import matplotlib.pyplot as plt
import seaborn as sns

pp = pprint.PrettyPrinter(indent=2)

Load the clean dataset:

In [3]:
df = pd.read_pickle('../../data/processed/catalan-elections-clean-data.pkl')
df_original = df.copy()

## Dataset Structure 

Inspect the structure of the dataset:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12339340 entries, 0 to 12339339
Data columns (total 21 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   index_autonumeric       int64         
 1   nom_eleccio             object        
 2   id_nivell_territorial   object        
 3   nom_nivell_territorial  object        
 4   territori_codi          object        
 5   territori_nom           object        
 6   seccio                  float64       
 7   vots                    int64         
 8   escons                  float64       
 9   districte               float64       
 10  mesa                    object        
 11  party_code              int32         
 12  party_name              object        
 13  party_abbr              object        
 14  party_color             object        
 15  type                    object        
 16  year                    int32         
 17  sequential              object        
 18  

| Column name               | Description                                            | Type      |
|---------------------------|--------------------------------------------------------|-----------|
| INDEX_AUTONUMERIC         | Autonumeric index identifier for the row               | Number    |
| NOM_ELECCIO               | Name of the electoral process                          | Plain Text|
| ID_NIVELL_TERRITORIAL     | Identifier of the territorial level (Municipality, Vegueria, County...) | Plain Text|
| NOM_NIVELL_TERRITORIAL    | Name of the territorial level of the record (Municipality, County...) | Plain Text|
| TERRITORI_CODI            | Territory code                                         | Plain Text|
| TERRITORI_NOM             | Name of the territory                                  | Plain Text|
| DISTRICTE                 | Electoral district                                     | Plain Text|
| SECCIO                    | Electoral section                                      | Plain Text|
| MESA                      | Electoral table                                        | Plain Text|
| PARTY_CODE                | Code of the party                                      | Number    |
| PARTY_NAME                | Name of the party                                      | Plain Text|
| PARTY_ABBR                | Acronym of the party                                   | Plain Text|
| PARTY_COLOR               | Color of the party                                     | Plain Text|
| VOTS                      | Votes of the party                                     | Number    |
| ESCONS                    | Seats of the party                                     | Number    |
| TYPE                      | Type of election                                       | Plain Text|
| YEAR                      | Year when the election took place                      | Number    |
| MONTH                     | Month when the election took place                     | Number    |
| DAY                       | Day when the election took place                       | Number    |
| DATE                      | Date when the election took place                      | Datetime  |

## Group candidatures

One of the challenges of forecasting elections is the large number of candidatures. Most candidatures belong to a large party, but there are also many small parties and independent candidatures. To obtain historical data for the largest parties, we need to group the candidatures that belong to the same party.

We will try to group the candidatures automatically, but this will only be useful when the candidatures have a similar name. When the candidatures have a different name (for example parties that have merged, or parties that have changed their name), we will need to group them manually.
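As a rough illustration of what automatic grouping by similar name could look like, the sketch below normalizes party names (case, accents, punctuation) so that spelling variants map to the same key. The helper `normalize_party_name` is hypothetical and is not part of the notebook; the sample names are taken from the dataset output shown further down.

```python
import re
import unicodedata


def normalize_party_name(name: str) -> str:
    """Return a simplified comparison key for a party name (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks ("ñ" -> "n")
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Compare case-insensitively
    lowered = without_accents.casefold()
    # Replace punctuation with spaces and trim the result
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


names = [
    "Falange Española de las J.O.N.S.",
    "FALANGE ESPAÑOLA DE LAS J.O.N.S.",
    "Partit Popular",
    "PARTIT POPULAR / PARTIDO POPULAR",
]
keys = [normalize_party_name(n) for n in names]

# The two Falange spellings collapse to one key; the bilingual
# Partit Popular variant does not, so it would still need manual grouping.
print(keys)
```

Normalization of this kind only catches differences in case, accents, and punctuation. Bilingual names, mergers, and renamed parties still produce different keys.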

There are several ways to group the candidatures: by code, by acronym, by name, or by the ``AGRUPACIO_*`` columns. Keep in mind, however, that the ``AGRUPACIO_*`` columns are 77% empty. We will start by grouping the candidatures by code.

The small parties or independent candidatures could be grouped in a single category called "Others".
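A minimal sketch of the "Others" idea, using toy data rather than the real dataset. The threshold `MIN_VOTES`, the column `party_group`, and the vote figures are all assumptions made for illustration; the notebook does not define how small a party must be to fall into "Others".

```python
import pandas as pd

# Toy data standing in for the real frame (values are invented)
toy = pd.DataFrame(
    {
        "party_abbr": ["PSC", "ERC", "CiU", "SR", "TXS"],
        "vots": [500, 400, 300, 20, 10],
    }
)

MIN_VOTES = 100  # hypothetical cut-off for a "small" party

# Total votes per party, then the set of parties below the cut-off
total_votes = toy.groupby("party_abbr")["vots"].sum()
small_parties = total_votes[total_votes < MIN_VOTES].index

# Keep the acronym for large parties, replace it with "Others" for small ones
toy["party_group"] = toy["party_abbr"].where(
    ~toy["party_abbr"].isin(small_parties), "Others"
)

grouped = toy.groupby("party_group")["vots"].sum().sort_values(ascending=False)
print(grouped)
```

A vote-share threshold or a fixed number of top parties would work equally well as the criterion; the absolute cut-off is used here only because it is the shortest to write.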

### Group parties by code

The first step is to count the individual candidatures. This will tell us whether the grouping is useful.

In [18]:
# Show the number of unique parties by code
candidatures = len(df['party_code'].unique())
candidatures

636

In [19]:
unique_df = df[['party_code', 'party_name', 'party_abbr']].drop_duplicates()
sorted_unique_df = unique_df.sort_values(by='party_code')
sorted_unique_df

|  | party_code | party_name | party_abbr |
|---|---|---|---|
| 1880988 | 3.000000e+00 | Unificació Comunista d'Espanya | U.C.E. |
| 1880989 | 6.000000e+00 | Partit dels Socialistes de Catalunya (PSC-PSOE) | PSC |
| 1594306 | 1.000000e+01 | Esquerra Republicana de Catalunya | ERC |
| 5620082 | 1.100000e+01 | Partido Socialista de Andalucía-Partido Andaluz | PSA |
| 1594307 | 1.200000e+01 | Convergència i Unió | CiU |
| ... | ... | ... | ... |
| 12181354 | 4.390542e+08 | FEM SALOU FEM REPÚBLICA | SR |
| 12181356 | 4.390542e+08 | UNIDOS SALOU COSTA DORADA | UNIDOS SALOU |
| 12181358 | 4.390542e+08 | TRABAJANDO POR SALOU-TREBALLANT PER SALOU | TXS |
| 12275390 | 2.019050e+13 | BARCELONA ETS TÚ | BCN ETS TÚ |


In [20]:
# Group by 'party_code', then aggregate to get the count and the first 'party_name'
count_df = (
    unique_df.groupby("party_code")
    .agg(
        count=("party_code", "size"),  # Number of distinct (name, acronym) rows per code
        first_party_name=(
            "party_name",
            "first",
        ),  # Get the first 'party_name'
    )
    .reset_index()
)
sorted_count_df = count_df.sort_values(by="count", ascending=False).reset_index(
    drop=True
)
sorted_count_df

|  | party_code | count | first_party_name |
|---|---|---|---|
| 0 | 7.510000e+02 | 2 | Endavant Cerdanya |
| 1 | 6.430000e+02 | 2 | Escons en Blanc |
| 2 | 3.010000e+02 | 2 | Ciutadans-Partido de la Ciudadanía |
| 3 | 3.320000e+02 | 2 | Partit Comunista del Poble de Catalunya |
| 4 | 9.060000e+02 | 2 | Som Catalans |
| ... | ... | ... | ... |
| 630 | 8.970000e+02 | 1 | Recuperemos Torredembarra/Recuperem Torredembarra |
| 631 | 8.980000e+02 | 1 | Republicanos |
| 632 | 8.990000e+02 | 1 | Roquetencs Tots Units |
| 633 | 9.000000e+02 | 1 | Sentit Comú |


In [21]:
# Merge count_df and unique_df by 'party_code'
merged_df = pd.merge(
    count_df, unique_df, on="party_code", how="inner"
)

# Sort merged_df by 'count' and 'party_code'
sorted_merged_df = merged_df.sort_values(
    by=["count", "party_code"], ascending=[False, True]
).reset_index(drop=True)

sorted_merged_df

|  | party_code | count | first_party_name | party_name | party_abbr |
|---|---|---|---|---|---|
| 0 | 1.800000e+01 | 2 | Falange Española de las J.O.N.S. | Falange Española de las J.O.N.S. | FE-JONS |
| 1 | 1.800000e+01 | 2 | Falange Española de las J.O.N.S. | FALANGE ESPAÑOLA DE LAS J.O.N.S. | FE de las JONS |
| 2 | 8.600000e+01 | 2 | Partit Popular | Partit Popular | PP |
| 3 | 8.600000e+01 | 2 | Partit Popular | PARTIT POPULAR / PARTIDO POPULAR | PP |
| 4 | 3.010000e+02 | 2 | Ciutadans-Partido de la Ciudadanía | Ciutadans-Partido de la Ciudadanía | C's |
| ... | ... | ... | ... | ... | ... |
| 647 | 4.317042e+08 | 1 | SOM-HI VILA-RODONA | SOM-HI VILA-RODONA | SV |
| 648 | 4.390542e+08 | 1 | FEM SALOU FEM REPÚBLICA | FEM SALOU FEM REPÚBLICA | SR |
| 649 | 4.390542e+08 | 1 | UNIDOS SALOU COSTA DORADA | UNIDOS SALOU COSTA DORADA | UNIDOS SALOU |
| 650 | 4.390542e+08 | 1 | TRABAJANDO POR SALOU-TREBALLANT PER SALOU | TRABAJANDO POR SALOU-TREBALLANT PER SALOU | TXS |


### Group parties by name

In [22]:
# Show the number of unique parties by name
candidatures_name = len(df['party_name'].unique())
candidatures_name

617

### Group parties by acronym

In [23]:
# Show the number of unique parties by acronym
candidatures_acronym = len(df['party_abbr'].unique())
candidatures_acronym

548
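The counts by code, name, and acronym can also be gathered in a single call with `DataFrame.nunique`, which makes the three keys easier to compare side by side. The sketch below uses a small toy frame built from rows shown in the merged output above. Note that `nunique` ignores missing values by default, whereas `len(series.unique())` counts `NaN` as a value, so the two approaches can differ by one on columns that contain gaps.

```python
import pandas as pd

# Toy frame with the same column names as the dataset
toy = pd.DataFrame(
    {
        "party_code": [18, 18, 86, 86],
        "party_name": [
            "Falange Española de las J.O.N.S.",
            "FALANGE ESPAÑOLA DE LAS J.O.N.S.",
            "Partit Popular",
            "PARTIT POPULAR / PARTIDO POPULAR",
        ],
        "party_abbr": ["FE-JONS", "FE de las JONS", "PP", "PP"],
    }
)

# Number of distinct values per grouping key
unique_counts = toy[["party_code", "party_name", "party_abbr"]].nunique()
print(unique_counts)
```

In this toy example the code is the coarsest key (2 groups), the acronym is in between (3), and the name is the finest (4), which shows that a single party can appear under several names and acronyms.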

## Candidatures statistics

### Plot votes by candidature

### Plot seats by candidature

## Region statistics

### Plot votes by region

### Plot seats by region

### Plot number of tables by region