# Study of the Clean Catalan Elections Dataset

Load libraries:

In [11]:
import pandas as pd
import pprint
import matplotlib.pyplot as plt
import seaborn as sns

pp = pprint.PrettyPrinter(indent=2)

Load the clean dataset:

In [12]:
df = pd.read_pickle('../../data/processed/catalan-elections-clean-data.pkl')
df_original = df.copy()

## Dataset Structure 

Inspect the structure of the dataset:

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12339340 entries, 0 to 12339339
Data columns (total 21 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   index_autonumeric       int64         
 1   nom_eleccio             object        
 2   id_nivell_territorial   object        
 3   nom_nivell_territorial  object        
 4   territori_codi          object        
 5   territori_nom           object        
 6   seccio                  float64       
 7   vots                    int64         
 8   escons                  float64       
 9   districte               float64       
 10  mesa                    object        
 11  party_code              float64       
 12  party_name              object        
 13  party_abbr              object        
 14  party_color             object        
 15  type                    object        
 16  year                    int32         
 17  sequential              object        
 18  

| Column name               | Description                                            | Type      |
|---------------------------|--------------------------------------------------------|-----------|
| INDEX_AUTONUMERIC         | Autonumeric index identifier for the row               | Number    |
| NOM_ELECCIO               | Name of the electoral process                          | Plain Text|
| ID_NIVELL_TERRITORIAL     | Identifier of the territorial level (Municipality, Vegueria, County...) | Plain Text|
| NOM_NIVELL_TERRITORIAL    | Name of the territorial level of the record (Municipality, County...) | Plain Text|
| TERRITORI_CODI            | Territory code                                         | Plain Text|
| TERRITORI_NOM             | Name of the territory                                  | Plain Text|
| DISTRICTE                 | Electoral district                                     | Plain Text|
| SECCIÓ                    | Electoral section                                      | Plain Text|
| MESA                      | Electoral table                                        | Plain Text|
| PARTY_CODE                | Code of the party                                      | Number    |
| PARTY_NAME                | Name of the party                                      | Plain Text|
| PARTY_ABBR                | Acronym of the party                                   | Plain Text|
| PARTY_COLOR               | Color of the party                                     | Plain Text|
| VOTS                      | Votes of the party                                     | Number    |
| ESCONS                    | Seats of the party                                     | Number    |
| TYPE                      | Type of election                                       | Plain Text|
| YEAR                      | Year when the election took place                      | Number    |
| MONTH                     | Month when the election took place                     | Number    |
| DAY                       | Day when the election took place                       | Number    |
| DATE                      | Date when the election took place                      | Datetime  |

## Group candidatures

One of the challenges of forecasting elections is the large number of candidatures. Most of the candidatures belong to a big party, but there are also many small parties and independent candidatures. We need to group the candidatures that belong to the same party in order to get the historical data of the biggest parties.

We will try to group the candidatures automatically, but this only works when the candidatures have similar names. When the names differ (for example, parties that have merged or been renamed), we will need to group them manually.
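As a rough illustration of what automatic grouping by similar names could look like, the sketch below reduces each name to a comparison key by removing accents, uppercasing, and tidying spaces around hyphens. The helper `normalize_name`, the column `name_key`, and the toy rows are assumptions made for this example, not part of the dataset or of the analysis in this notebook.

```python
import re
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Hypothetical helper: reduce a candidature name to a comparable key."""
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Uppercase, remove spaces around hyphens, collapse repeated whitespace
    key = without_accents.upper()
    key = re.sub(r"\s*-\s*", "-", key)
    return re.sub(r"\s+", " ", key).strip()


# Toy sample (not the real dataset)
sample = pd.DataFrame(
    {
        "party_name": [
            "Esquerra Republicana de Catalunya-Acord Municipal",
            "ESQUERRA REPUBLICANA DE CATALUNYA - ACORD MUNICIPAL",
            "Partit Popular",
        ]
    }
)
sample["name_key"] = sample["party_name"].map(normalize_name)

# Names that differ only in case or spacing now share the same key
n_groups = sample["name_key"].nunique()
```

A key like this only catches spelling and formatting variants. Mergers and renames would still need a manual mapping.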

There are several ways to group the candidatures: by code, by acronym, by name, or by the `AGRUPACIO_*` columns. Keep in mind, however, that the `AGRUPACIO_*` columns are 77% empty. We will start by grouping the candidatures by code.

Small parties and independent candidatures could be grouped into a single category called "Others".
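One possible way to build such an "Others" category is to relabel every candidature whose share of the total vote falls below a threshold. The sketch below uses a toy DataFrame. The 10% threshold (`MIN_VOTE_SHARE`) and the `party_group` column are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy sample (not the real dataset)
votes = pd.DataFrame(
    {
        "party_abbr": ["ERC", "PSC", "C.IND/1", "SP-AMUNT", "ERC"],
        "vots": [500, 400, 30, 20, 50],
    }
)

# Assumed cut-off: candidatures under 10% of all votes count as small
MIN_VOTE_SHARE = 0.10

# Total votes per acronym and its share of all votes
totals = votes.groupby("party_abbr")["vots"].sum()
shares = totals / totals.sum()

# Acronyms below the threshold are relabelled as "Others"
small = shares[shares < MIN_VOTE_SHARE].index
votes["party_group"] = votes["party_abbr"].where(
    ~votes["party_abbr"].isin(small), "Others"
)

# Votes per group, largest first
grouped = votes.groupby("party_group")["vots"].sum().sort_values(ascending=False)
```

The threshold could equally be defined on seats or on the number of elections contested. Vote share is just the simplest criterion to demonstrate.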

### Group parties by code

The first step is to count the individual candidatures. This will tell us whether the grouping is useful or not.

In [8]:
# Show the number of unique parties by code
candidatures = len(df['party_code'].unique())
candidatures

8617

In [27]:
# List each distinct (code, name, acronym) combination, sorted by code
unique_df = df[['party_code', 'party_name', 'party_abbr']].drop_duplicates()
sorted_unique_df = unique_df.sort_values(by='party_code')
sorted_unique_df

Unnamed: 0,candidatura_codi,candidatura_denominacio,candidatura_sigles
0,1,Conservadors de Catalunya,C.i.C.
171727,2,Partit dels Comunistes de Catalunya,PCC
226225,3,Unificació Comunista d'Espanya,U.C.E.
1,4,Partit Socialista Unificat de Catalunya,PSUC
2835068,5,C.E. Front Comunista de Catalunya,C.E.-FCC
...,...,...,...
12181272,439064190,JUNTS PER L'AMPOLLA,JUNTS
11244646,439064910,Candidatura Independent per l'Ampolla,C.IND/1
12181366,439074190,SOM POBLE- ALTERNATIVA MUNICIPALISTA,SP-AMUNT
12181367,439074191,JUNTS PER LA CANONJA,JUNTS


In [36]:
# Group by 'party_code' to count its distinct name/acronym variants
# and keep the first 'party_name' as a label
count_df = (
    unique_df.groupby("party_code")
    .agg(
        count=("party_code", "size"),  # Rows in unique_df, i.e. variants per code
        first_party_name=("party_name", "first"),  # First 'party_name' seen
    )
    .reset_index()
)
sorted_count_df = count_df.sort_values(by="count", ascending=False).reset_index(
    drop=True
)
sorted_count_df

Unnamed: 0,candidatura_codi,count,first_candidatura_denominacio
0,193,47,Esquerra Republicana de Catalunya-Acord Municipal
1,842,36,Partit dels Socialistes de Catalunya-Candidatu...
2,1039,30,CANDIDATURA D'UNITAT POPULAR-ALTERNATIVA MU...
3,301,29,Ciutadans-Partido de la Ciudadanía
4,86,13,Partit Popular
...,...,...,...
8612,82054110,1,Esquerra Republicana de Catalunya -Reagrupamen...
8613,82054072,1,Candidatura d Unitat Popular-CAV
8614,82054071,1,Unió d'Independents de Sant Cugat
8615,82054036,1,Una Altra Democràcia és Possible


In [37]:
# Merge count_df and unique_df by 'party_code'
merged_df = pd.merge(
    count_df, unique_df, on="party_code", how="inner"
)

# Sort merged_df by 'count' and 'party_code'
sorted_merged_df = merged_df.sort_values(
    by=["count", "party_code"], ascending=[False, True]
).reset_index(drop=True)

sorted_merged_df

Unnamed: 0,candidatura_codi,count,first_candidatura_denominacio,candidatura_denominacio,candidatura_sigles
0,193,47,Esquerra Republicana de Catalunya-Acord Municipal,Esquerra Republicana de Catalunya-Acord Municipal,ERC-AM
1,193,47,Esquerra Republicana de Catalunya-Acord Municipal,ESQUERRA REPUBLICANA DE CATALUNYA - Acord Muni...,ERC - AM
2,193,47,Esquerra Republicana de Catalunya-Acord Municipal,ESQUERRA REPUBLICANA DE CATALUNYA-ACORD MUNICIPAL,ERC-AM
3,193,47,Esquerra Republicana de Catalunya-Acord Municipal,ESQUERRA REPUBLICANA DE CATALUNYA - ACORD MUNI...,ERC- AM
4,193,47,Esquerra Republicana de Catalunya-Acord Municipal,ESQUERRA REPUBLICANA DE CATALUNYA - ACORD MUNI...,ERC - AM
...,...,...,...,...,...
8841,439064190,1,JUNTS PER L'AMPOLLA,JUNTS PER L'AMPOLLA,JUNTS
8842,439064910,1,Candidatura Independent per l'Ampolla,Candidatura Independent per l'Ampolla,C.IND/1
8843,439074190,1,SOM POBLE- ALTERNATIVA MUNICIPALISTA,SOM POBLE- ALTERNATIVA MUNICIPALISTA,SP-AMUNT
8844,439074191,1,JUNTS PER LA CANONJA,JUNTS PER LA CANONJA,JUNTS
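
Since a single code can carry many spellings of the same name, a grouped dataset needs one label per code. The cells above take the first name encountered. As an alternative sketch, the snippet below picks the name variant that accumulates the most votes. The toy rows and the `party_label` column are assumptions for this example, and choosing by votes is only one possible criterion.

```python
import pandas as pd

# Toy sample (not the real dataset)
rows = pd.DataFrame(
    {
        "party_code": [193, 193, 193, 86],
        "party_name": [
            "Esquerra Republicana de Catalunya-Acord Municipal",
            "ESQUERRA REPUBLICANA DE CATALUNYA-ACORD MUNICIPAL",
            "Esquerra Republicana de Catalunya-Acord Municipal",
            "Partit Popular",
        ],
        "vots": [100, 300, 150, 80],
    }
)

# Sum votes per (code, name) pair
votes_per_name = rows.groupby(
    ["party_code", "party_name"], as_index=False
)["vots"].sum()

# For each code, keep the name variant with the most votes
canonical = (
    votes_per_name.sort_values("vots", ascending=False)
    .drop_duplicates("party_code")
    .set_index("party_code")["party_name"]
)

# Attach the chosen label to every row
rows["party_label"] = rows["party_code"].map(canonical)
```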


### Group parties by name

In [12]:
# Show the number of unique parties by name
candidatures_name = len(df['party_name'].unique())
candidatures_name

7351

### Group parties by acronym

In [14]:
# Show the number of unique parties by acronym
candidatures_acronym = len(df['party_abbr'].unique())
candidatures_acronym

548
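
The three counts can also be obtained in one call with `DataFrame.nunique`. Note that `len(Series.unique())` counts a missing value as one distinct entry, while `nunique()` ignores missing values unless `dropna=False` is passed. The sketch below shows the difference on a toy DataFrame with made-up rows, so its numbers are unrelated to the counts above.

```python
import numpy as np
import pandas as pd

# Toy sample (not the real dataset); the last row has a missing code and acronym
sample = pd.DataFrame(
    {
        "party_code": [193.0, 193.0, 86.0, np.nan],
        "party_name": ["Name A", "NAME A", "Name B", "Name C"],
        "party_abbr": ["ERC-AM", "ERC - AM", "PP", None],
    }
)
cols = ["party_code", "party_name", "party_abbr"]

# Matches len(Series.unique()): a missing value counts as one distinct entry
unique_counts = sample[cols].nunique(dropna=False)

# Default behaviour: missing values are ignored
unique_counts_no_missing = sample[cols].nunique()
```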

## Candidatures statistics

### Plot votes by candidature

### Plot seats by candidature

## Region statistics

### Plot votes by region

### Plot seats by region

### Plot number of tables by region