# Study of the Clean Catalan Elections Dataset

Load libraries:

In [2]:
import pandas as pd
import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import textdistance
from tqdm import tqdm
from unidecode import unidecode

pp = pprint.PrettyPrinter(indent=2)

Load the clean dataset:

In [3]:
df = pd.read_pickle('../../data/processed/catalan-elections-clean-data.pkl')
df_original = df.copy()

## Dataset Structure 

Visualize the structure of the dataset:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12339340 entries, 0 to 12339339
Data columns (total 21 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   index_autonumeric       int64         
 1   nom_eleccio             object        
 2   id_nivell_territorial   object        
 3   nom_nivell_territorial  object        
 4   territori_codi          object        
 5   territori_nom           object        
 6   seccio                  Int64         
 7   vots                    int32         
 8   escons                  float64       
 9   districte               Int64         
 10  mesa                    object        
 11  party_code              int32         
 12  party_name              object        
 13  party_abbr              object        
 14  party_color             object        
 15  type                    object        
 16  year                    int32         
 17  round                   object        
 18  

| Column name               | Description                                            | Type      |
|---------------------------|--------------------------------------------------------|-----------|
| INDEX_AUTONUMERIC         | Autonumeric index identifier for the row               | Number    |
| NOM_ELECCIO               | Name of the electoral process                          | Plain Text|
| ID_NIVELL_TERRITORIAL     | Identifier of the territorial level (Municipality, Vegueria, County...) | Plain Text|
| NOM_NIVELL_TERRITORIAL    | Name of the territorial level of the record (Municipality, County...) | Plain Text|
| TERRITORI_CODI            | Territory code                                         | Plain Text|
| TERRITORI_NOM             | Name of the territory                                  | Plain Text|
| DISTRICTE                 | Electoral district                                     | Plain Text|
| SECCIO                    | Electoral section                                      | Plain Text|
| MESA                      | Electoral table                                        | Plain Text|
| PARTY_CODE                | Code of the party                                      | Number    |
| PARTY_NAME                | Name of the party                                      | Plain Text|
| PARTY_ABBR                | Acronym of the party                                   | Plain Text|
| PARTY_COLOR               | Color of the party                                     | Plain Text|
| VOTS                      | Votes of the party                                     | Number    |
| ESCONS                    | Seats of the party                                     | Number    |
| TYPE                      | Type of election                                       | Plain Text|
| YEAR                      | Year when the election took place                      | Number    |
| MONTH                     | Month when the election took place                     | Number    |
| DAY                       | Day when the election took place                       | Number    |
| DATE                      | Date when the election took place                      | Datetime  |

## Group candidatures

One of the challenges of forecasting elections is the large number of candidatures. Most of the candidatures belong to a big party, but there are also many small parties and independent candidatures. We need to group the candidatures that belong to the same party in order to get the historical data of the biggest parties.

We will try to group the candidatures automatically, but this will only be useful when the candidatures have a similar name. When the candidatures have a different name (for example parties that have merged, or parties that have changed their name), we will need to group them manually.

There are many ways to group the candidatures. We could use the code, the acronym, the name, or the ``AGRUPACIO_*`` columns, but we need to keep in mind that 77% of the values in those columns are empty. We will start by grouping the candidatures by code.

The small parties or independent candidatures could be grouped in a single category called "Others".
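As a rough illustration of the "Others" idea, the sketch below relabels every party whose overall vote share falls under a cutoff. The function name `collapse_small_parties`, the `party_group` column, and the `min_share` cutoff are assumptions made for this example; only `party_name` and `vots` come from the dataset, and the toy values are invented.

```python
import pandas as pd


def collapse_small_parties(votes, min_share=0.01, other_label="Others"):
    """Relabel parties whose overall vote share is below min_share."""
    # Total votes per party across all rows
    totals = votes.groupby("party_name")["vots"].sum()
    shares = totals / totals.sum()
    small = shares[shares < min_share].index

    result = votes.copy()
    # Keep the original name for large parties, use the label for small ones
    result["party_group"] = result["party_name"].where(
        ~result["party_name"].isin(small), other_label
    )
    return result


# Toy data (made up, not taken from the dataset)
toy = pd.DataFrame(
    {
        "party_name": ["A", "B", "C", "A", "D"],
        "vots": [500, 450, 30, 400, 20],
    }
)
grouped = collapse_small_parties(toy, min_share=0.05)
print(grouped)
```

A share-based cutoff is only one option; a fixed number of top parties, or a seat-based rule using `escons`, would work similarly.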

### Group party by code

The first step is to analyze the number of individual candidatures. This will help us to know whether the grouping is useful or not.

In [5]:
# Show the number of unique parties by code
candidatures = len(df['party_code'].unique())
candidatures

920

In [6]:
unique_df = df[['party_code', 'party_name', 'party_abbr']].drop_duplicates()
sorted_unique_df = unique_df.sort_values(by='party_code')
sorted_unique_df

Unnamed: 0,party_code,party_name,party_abbr
12174323,-2147483648,BARCELONA ETS TÚ,BCN ETS TÚ
0,1,Conservadors de Catalunya,C.i.C.
171727,2,Partit dels Comunistes de Catalunya,PCC
226225,3,Unificació Comunista d'Espanya,U.C.E.
1,4,Partit Socialista Unificat de Catalunya,PSUC
...,...,...,...
12174318,431664190,TOT(S) VILALLONGA DEL CAMP,TOT(S)
12174319,431704190,SOM-HI VILA-RODONA,SV
12174320,439054190,FEM SALOU FEM REPÚBLICA,SR
12174321,439054192,UNIDOS SALOU COSTA DORADA,UNIDOS SALOU


In [7]:
# Group by 'party_code', then aggregate to get the count and the first 'party_name'
count_df = (
    unique_df.groupby("party_code")
    .agg(
        count=("party_code", "size"),  # Count the number of occurrences
        first_party_name=(
            "party_name",
            "first",
        ),  # Get the first 'party_name'
    )
    .reset_index()
)
sorted_count_df = count_df.sort_values(by="count", ascending=False).reset_index(
    drop=True
)
sorted_count_df

Unnamed: 0,party_code,count,first_party_name
0,301,5,Ciutadans-Partido de la Ciudadanía
1,86,4,Partit Popular
2,1083,3,Junts per Catalunya
3,662,3,Partit Animalista Contra el Maltractament Animal
4,643,2,Escons en Blanc
...,...,...,...
915,614,1,Iniciativa per Catalunya Verds-EUiA: L'Esquerra
916,615,1,Salamanca-Zamora-León
917,616,1,Libertas-Ciudadanos de España
918,617,1,Los Verdes-Grupo Verde Europeo


In [8]:
# Merge count_df and unique_df by 'party_code'
merged_df = pd.merge(
    count_df, unique_df, on="party_code", how="inner"
)

# Sort merged_df by 'count' and 'party_code'
sorted_merged_df = merged_df.sort_values(
    by=["count", "party_code"], ascending=[False, True]
).reset_index(drop=True)

sorted_merged_df

Unnamed: 0,party_code,count,first_party_name,party_name,party_abbr
0,301,5,Ciutadans-Partido de la Ciudadanía,Ciutadans-Partido de la Ciudadanía,C's
1,301,5,Ciutadans-Partido de la Ciudadanía,Ciutadans-Partido de la Ciudadanía,Cs
2,301,5,Ciutadans-Partido de la Ciudadanía,CIUTADANS-PARTIDO DE LA CIUDADANIA,CS
3,301,5,Ciutadans-Partido de la Ciudadanía,CIUTADANS-PARTIDO DE LA CIUDADANÍA,Cs
4,301,5,Ciutadans-Partido de la Ciudadanía,CIUTADANS-PARTIDO DE LA CIUDADANIA,Cs
...,...,...,...,...,...
952,431664190,1,TOT(S) VILALLONGA DEL CAMP,TOT(S) VILALLONGA DEL CAMP,TOT(S)
953,431704190,1,SOM-HI VILA-RODONA,SOM-HI VILA-RODONA,SV
954,439054190,1,FEM SALOU FEM REPÚBLICA,FEM SALOU FEM REPÚBLICA,SR
955,439054192,1,UNIDOS SALOU COSTA DORADA,UNIDOS SALOU COSTA DORADA,UNIDOS SALOU


After cleaning, we have **920 candidatures**, a huge reduction from the **7351 candidatures** in the original dataset.

But there are still many candidatures that could be grouped into the same party. To do this, we need text processing to group the candidatures that have similar names.

### Group party by name similarity

Grouping the candidatures by name similarity is a complex task. We will use the Levenshtein distance to measure the similarity between the names of the candidatures. The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
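To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming recurrence. It is meant as an illustration only; the notebook itself relies on `textdistance`. The `normalized_levenshtein` helper is a hypothetical addition, not part of the notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance divided by the longer length (hypothetical helper)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("actua", "ara"))
print(normalized_levenshtein("actua", "ara"))
```

Because the raw distance is an absolute edit count, short names can fall under a fixed threshold even when they are unrelated. Dividing by the longer length is one possible way to compensate for that.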

In order to improve the process of grouping the candidatures by similarity, we will apply the following preprocessing steps to the names of the candidatures:
- Convert names to lowercase
- Remove accents
- Remove special characters

Then, we will store this new information in a new column called ``clean_party_name``.

In [9]:
def clean_party_name(party_name):
    party_name = party_name.lower()
    party_name = unidecode(party_name)
    party_name = party_name.replace("(", "").replace(")", "").replace(",", "").replace("-", "")
    return party_name

def clean_party_names(df):
    df['clean_party_name'] = df['party_name'].apply(clean_party_name)
    return df
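
The function above removes only parentheses, commas and hyphens, and deleting a hyphen glues words together. A broader variant could replace every run of non-alphanumeric characters with a single space. The sketch below uses only the standard library; `clean_party_name_broad` is a hypothetical name, and `unicodedata` may transliterate some characters differently from `unidecode`.

```python
import re
import unicodedata


def clean_party_name_broad(party_name: str) -> str:
    """Lowercase, strip accents, and turn any other symbol into a space."""
    # Decompose accented characters, then drop the non-ASCII combining marks
    text = unicodedata.normalize("NFKD", party_name.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace each run of characters other than a-z and 0-9 with one space
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


print(clean_party_name_broad("Els Verds-Grup Verd Europeu"))
print(clean_party_name_broad("Els Verds - Grup Verd Europeu"))
```

With this variant, spellings that differ only in punctuation or spacing collapse to the same string before any distance is computed.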

In [10]:
df = clean_party_names(df)

In [11]:
df.head()

Unnamed: 0,index_autonumeric,nom_eleccio,id_nivell_territorial,nom_nivell_territorial,territori_codi,territori_nom,seccio,vots,escons,districte,...,party_name,party_abbr,party_color,type,year,round,month,day,date,clean_party_name
0,1,Eleccions al Parlament de Catalunya 1980,CA,Catalunya,9,Catalunya,0,4095,0.0,,...,Conservadors de Catalunya,C.i.C.,#195182,A,1980,1,3,20,1980-03-20,conservadors de catalunya
1,2,Eleccions al Parlament de Catalunya 1980,CA,Catalunya,9,Catalunya,0,507753,25.0,,...,Partit Socialista Unificat de Catalunya,PSUC,#9E352F,A,1980,1,3,20,1980-03-20,partit socialista unificat de catalunya
2,3,Eleccions al Parlament de Catalunya 1980,CA,Catalunya,9,Catalunya,0,606717,33.0,,...,Partit dels Socialistes de Catalunya (PSC-PSOE),PSC,#DD2809,A,1980,1,3,20,1980-03-20,partit dels socialistes de catalunya pscpsoe
3,4,Eleccions al Parlament de Catalunya 1980,CA,Catalunya,9,Catalunya,0,27807,0.0,,...,Fuerza Nueva,FN,#0000C4,A,1980,1,3,20,1980-03-20,fuerza nueva
4,5,Eleccions al Parlament de Catalunya 1980,CA,Catalunya,9,Catalunya,0,240871,14.0,,...,Esquerra Republicana de Catalunya,ERC,#FFB232,A,1980,1,3,20,1980-03-20,esquerra republicana de catalunya


In [12]:
party_names = df["clean_party_name"].unique()
print(len(party_names))


def calculate_distance_matrix(text, distance_algorithm):
    n = len(text)
    distance_matrix = np.zeros((n, n), dtype=float)  # Initialize a matrix with zeros

    # Compute each pair only once (upper triangle of the matrix)
    for i in tqdm(range(n), desc="Calculating distances"):
        # Start from i + 1 to skip the distance of a string to itself
        for j in range(i + 1, n):
            distance_matrix[i, j] = distance_algorithm(text[i], text[j])
            # Put a high sentinel in the mirrored cell so that each pair is
            # reported only once (the threshold filter drops it later)
            distance_matrix[j, i] = 100.0

    # Put the same sentinel on the diagonal so that self-pairs are dropped too
    np.fill_diagonal(distance_matrix, 100.0)

    # Convert the NumPy array to a pandas DataFrame
    return pd.DataFrame(distance_matrix, index=text, columns=text)


levenshtein_distances = calculate_distance_matrix(
    party_names, textdistance.levenshtein.distance
)

836


Calculating distances: 100%|██████████| 836/836 [00:01<00:00, 626.54it/s]


In [15]:
def get_similar_parties(distances, threshold):
    similar_parties = (
        distances[distances < threshold]
        .stack()
        .reset_index()
        .rename(columns={"level_0": "party1", "level_1": "party2", 0: "distance"})
        .sort_values(by="party1")
    )
    return similar_parties

In [16]:
# Show the names of the parties that have a distance below 5
levenshtein_similar_parties = get_similar_parties(levenshtein_distances, 5)
levenshtein_similar_parties

Unnamed: 0,party1,party2,distance
55,#som sant vicenc,nou sant vicenc,3.0
57,aae osona,aae solsones,4.0
24,actua,dcide,4.0
22,actua,entesa,4.0
23,actua,ara,3.0
3,alternativa verdamoviment ecologista catalunya,alternativa verdamoviment ecologista de catalunya,3.0
33,ara,djv,3.0
31,ara,piai,3.0
32,ara,babord,4.0
20,aralar,ara talarn,4.0


In [17]:
jaro_winkler_distances = calculate_distance_matrix(party_names, textdistance.jaro_winkler.distance)
jaccard_distances = calculate_distance_matrix(party_names, textdistance.jaccard.distance)
cosine_distances = calculate_distance_matrix(party_names, textdistance.cosine.distance)

Calculating distances: 100%|██████████| 836/836 [00:01<00:00, 555.54it/s]
Calculating distances: 100%|██████████| 836/836 [00:08<00:00, 103.29it/s]
Calculating distances: 100%|██████████| 836/836 [00:06<00:00, 126.00it/s]


In [18]:
jaro_winkler_similar_parties = get_similar_parties(jaro_winkler_distances, 0.1)

jaro_winkler_similar_parties

Unnamed: 0,party1,party2,distance
196,"agrupacio d'electors ""+bellvei""",agrupacio d'electors alternativa per bot,0.098214
195,"agrupacio d'electors ""+bellvei""",agrupacio electors savall,0.091155
328,agrupacio d'electors alternativa per bot,agrupacio d'electors tria roda de bera,0.086347
296,agrupacio d'electors esparreguera 2031,agrupacio d'electors som rupia,0.092710
185,agrupacio d'electors futur per jafre,agrupacio d'electors tria roda de bera,0.088772
...,...,...,...
270,units per martorelles,units per sant llorenc,0.098268
307,units per sant cugat,units per sant llorenc,0.094545
324,units per sant llorenc,units per masllorenc,0.068852
209,v vilanova365,v vilanova 365,0.014286


In [19]:
jaccard_similar_parties = get_similar_parties(jaccard_distances, 0.1)

jaccard_similar_parties

Unnamed: 0,party1,party2,distance
1,alternativa verdamoviment ecologista catalunya,alternativa verdamoviment ecologista de catalunya,0.061224
4,candidatura d'unitat popular,candidatures d'unitat popular,0.1
12,convergencia democratica aranesapartit naciona...,convergencia democratica aranesa partit nacion...,0.016949
3,els verds grup verd europeu,els verdsgrup verd europeu,0.071429
10,en comu podemguanyem el canvi,en comu podem guanyem el canvi,0.064516
16,grup independents urbanitzacions,grup independent urbanitzacions,0.03125
14,independents per vilaller,independents per olivella,0.076923
2,izquierda republicanapartit republica d'esquerra,partit republica d'esquerraizquierda republicana,0.0
11,les piles pel futur,les piles pel f utur,0.05
0,partido espanol democrata,partido democrata espanol,0.0
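
The 0.0 distances in this output (names that contain the same words in a different order) are consistent with a Jaccard distance computed over bags of characters, which ignores order entirely. The sketch below reproduces that behaviour with the standard library. It is an assumption about how the character-level comparison works, not a copy of the `textdistance` implementation, and `char_jaccard_distance` is a hypothetical name.

```python
from collections import Counter


def char_jaccard_distance(a: str, b: str) -> float:
    """1 - |intersection| / |union| over multisets of characters."""
    bag_a, bag_b = Counter(a), Counter(b)
    intersection = sum((bag_a & bag_b).values())  # minimum count per character
    union = sum((bag_a | bag_b).values())  # maximum count per character
    return 1.0 - intersection / union if union else 0.0


print(char_jaccard_distance("partido espanol democrata", "partido democrata espanol"))
print(char_jaccard_distance("les piles pel futur", "les piles pel f utur"))
```

Order-insensitivity helps with reordered names, but it also means that two unrelated names built from the same letters would be treated as identical.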


In [21]:
cosine_similar_parties = get_similar_parties(cosine_distances, 0.1)

cosine_similar_parties

Unnamed: 0,party1,party2,distance
29,a. d'independents progressistes i nacionalistes,agrupacio d'independents progressistes i nacio...,0.086913
6,alternativa verdamoviment ecologista catalunya,alternativa verdamoviment ecologista de catalunya,0.031096
18,candidatura d'unitat popular,candidatures d'unitat popular,0.052486
0,centre democratic i social,centro democratico y social,0.094178
32,convergencia democratica aranesapartit naciona...,convergencia democratica aranesa partit nacion...,0.008511
16,els verds grup verd europeu,els verdsgrup verd europeu,0.036376
7,els verds ecologistes,los verdes ecologistas,0.069516
28,en comu podemguanyem el canvi,en comu podem guanyem el canvi,0.032796
37,grup independents urbanitzacions,grup independent urbanitzacions,0.015749
25,grupo independiente liberal,independents per llobera,0.096475


### Join the similar parties

Joining similar parties (or candidatures) under the same party code is not a trivial task. We must be careful, because candidatures with similar names could belong to different parties. To guard against this, we will ensure that grouped candidatures have not competed in the same election.

The function below helps us discard pairs of candidatures with similar names that have competed in the same election and are therefore different parties.

Also, candidatures with similar names may already share the same `party_code`. We only need to group the candidatures that have different `party_code` values.
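
The two rules above (skip pairs that already share a `party_code`, skip pairs that competed in the same election) can be sketched as a small union-find over candidate pairs. Everything here is illustrative: `group_candidatures`, `elections_by_name` and `code_by_name` are hypothetical names, and the toy codes and election labels are made up.

```python
def group_candidatures(pairs, elections_by_name, code_by_name):
    """Merge similar names into groups, applying the two exclusion rules."""
    parent = {}

    def find(name):
        # Return the representative of the group that contains name
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for first, second in pairs:
        if code_by_name[first] == code_by_name[second]:
            continue  # already grouped by code, nothing to do
        if elections_by_name[first] & elections_by_name[second]:
            continue  # competed in the same election: different parties
        parent[find(first)] = find(second)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return [members for members in groups.values() if len(members) > 1]


# Toy inputs (codes and election labels are invented for the example)
pairs = [
    ("els verds ecologistes", "los verdes ecologistas"),
    ("partit popular", "centro democratico y social"),
]
code_by_name = {
    "els verds ecologistes": 1,
    "los verdes ecologistas": 2,
    "partit popular": 3,
    "centro democratico y social": 4,
}
elections_by_name = {
    "els verds ecologistes": {"E1"},
    "los verdes ecologistas": {"E2"},
    "partit popular": {"E1", "E2"},
    "centro democratico y social": {"E2"},
}
groups = group_candidatures(pairs, elections_by_name, code_by_name)
print(groups)
```

One limitation of this sketch is that merging is transitive: if A is merged with B and B with C, then A and C end up together even if they competed against each other, so the resulting groups would still need a final check.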

In [34]:
# Checks if a party has competed in the same election with another party
def has_competed_together(df, party1, party2, column="party_name"):
    counts = df[
        (
            (df[column] == party1)
            | (df[column] == party2)
        )
        & (df["id_nivell_territorial"] == "CA")
    ].groupby("nom_eleccio").size().reset_index(name="count")["count"]

    # Check if any value in 'counts' is 2
    return (counts == 2).any()
    
has_competed_together(df, "partit popular", "centro democratico y social", column="clean_party_name")

True
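
Note that `has_competed_together` counts rows per election, so an election in which one of the two names appears twice would also yield a count of 2. If that can happen in the data, counting distinct names per election is a safer check. The sketch below shows this variant on invented toy data; `competed_together` is a hypothetical name.

```python
import pandas as pd


def competed_together(df, party1, party2, column="clean_party_name"):
    """True if both names appear in at least one election at the CA level."""
    rows = df[
        df[column].isin([party1, party2])
        & (df["id_nivell_territorial"] == "CA")
    ]
    # Number of distinct names (not rows) found in each election
    names_per_election = rows.groupby("nom_eleccio")[column].nunique()
    return bool((names_per_election == 2).any())


# Toy data (made up): in E2 the name "a" appears twice and "b" is absent
toy = pd.DataFrame(
    {
        "nom_eleccio": ["E1", "E1", "E2", "E2"],
        "id_nivell_territorial": ["CA"] * 4,
        "clean_party_name": ["a", "b", "a", "a"],
    }
)
print(competed_together(toy, "a", "b"))
print(competed_together(toy[toy["nom_eleccio"] == "E2"], "a", "b"))
```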

In [41]:
df[
    (df["nom_eleccio"] == "Eleccions al Parlament de Catalunya 2010")
    & (
        (df["clean_party_name"] == "partit popular")
        | (df["clean_party_name"] == "centro democratico y social")
    )
    & (df["id_nivell_territorial"] == "CA")
]

Unnamed: 0,index_autonumeric,nom_eleccio,id_nivell_territorial,nom_nivell_territorial,territori_codi,territori_nom,seccio,vots,escons,districte,...,party_name,party_abbr,party_color,type,year,round,month,day,date,clean_party_name
1880993,1869024,Eleccions al Parlament de Catalunya 2010,CA,Catalunya,9,Catalunya,,387066,18.0,,...,Partit Popular,PP,#01A7E3,A,2010,1,11,28,2010-11-28,partit popular
1881010,1869041,Eleccions al Parlament de Catalunya 2010,CA,Catalunya,9,Catalunya,,218,0.0,,...,Centro Democràtico y Social,CDS,#3563A8,A,2010,1,11,28,2010-11-28,centro democratico y social


In [36]:
df[
    ((df["clean_party_name"] == "partit popular") | (df["clean_party_name"] == "centro democratico y social"))
    & (df["id_nivell_territorial"] == "CA")
].groupby("nom_eleccio").size().reset_index(name="count")

Unnamed: 0,nom_eleccio,count
0,Eleccions Municipals 1991,1
1,Eleccions Municipals 1995,1
2,Eleccions Municipals 1999,1
3,Eleccions Municipals 2003,1
4,Eleccions Municipals 2007,1
5,Eleccions Municipals 2011,1
6,Eleccions Municipals 2015,1
7,Eleccions al Congrés 1989,1
8,Eleccions al Congrés 1993,1
9,Eleccions al Congrés 1996,1


In [20]:
# Show the names of the parties that competed in the most recent election
df[df["date"] == df["date"].max()]["clean_party_name"].unique()

array(['per un mon mes just', 'esquerra republicana de catalunya',
       'partit popular', 'ciutadanspartido de la ciudadania',
       'front nacional de catalunya', 'escons en blanc', 'vox',
       'suport civil catala', 'junts per catalunya',
       'partit nacionalista de catalunya', 'moviment corrent roig',
       'union europea de pensionistas', 'en comu podempodem en comu',
       'partit democrata europeu catala',
       "canditatura d'unitat popularun nou cicle per guanyar",
       'partit comunista dels treballadors de catalunya',
       'izquierda en positivo',
       'moviment primaries per la independencia de catalunya',
       'recortes cerogrup verdmunicipalistes',
       'partit dels socialistes de catalunya pscpsoe',
       'unidos por la democracia + jubilados',
       'alianza por el comercio y la vivienda', "som terres de l'ebre"],
      dtype=object)

## Candidatures statistics

### Plot votes by candidature

### Plot seats by candidature

## Region statistics

### Plot votes by region

### Plot seats by region

### Plot number of tables by region