# Create Simple ML Model

In this notebook we create a baseline model: a simple machine learning algorithm trained without the spatial data. It will serve as the reference point for the more complex models created later.

The data we will use contains the following information:
- The percentage of votes for each major party in each election held in Catalonia.
- The socio-economic index of each census section in Catalonia.
- The age groups of each census section in Catalonia.
- The proportion of foreign-born residents in each census section in Catalonia.

## Load data

First, we will load the libraries.

In [1]:
import pandas as pd
import geopandas as gpd
import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import logging
import pysal as ps
import contextily
from splot.esda import plot_moran
from pysal.viz import splot
from unidecode import unidecode
from pysal.explore import esda
from pysal.lib import weights
from numpy.random import seed
from sklearn.model_selection import train_test_split

pp = pprint.PrettyPrinter(indent=2)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)


Now we load the dataset.

In [3]:
data = gpd.read_file("../../data/output/merged_data.geojson")
data.head()

Unnamed: 0,MUNICIPI,DISTRICTE,SECCIO,MUNDISSEC,cens_electoral_percentage_4_A19801,cens_electoral_percentage_4_A19841,cens_electoral_percentage_4_E19871,cens_electoral_percentage_4_E19891,cens_electoral_percentage_4_E20041,cens_electoral_percentage_4_G19771,...,vots_valids_percentage_999999999_M19871,vots_valids_percentage_999999999_M19911,vots_valids_percentage_999999999_M19951,vots_valids_percentage_999999999_M19991,vots_valids_percentage_999999999_M20031,vots_valids_percentage_999999999_M20071,vots_valids_percentage_999999999_M20111,vots_valids_percentage_999999999_M20151,vots_valids_percentage_999999999_M20191,geometry
0,80018,1,1,8001801001,16.898349,8.6,0.840336,0.7724,0.084175,14.143646,...,9.62963,0.0,0.0,5.921053,0.0,0.0,3.206997,12.598791,13.577023,"POLYGON ((408792.512 4596753.053, 408797.803 4..."
1,80018,1,2,8001801002,0.0,6.558642,0.925926,0.668648,0.0,0.0,...,9.66908,0.0,0.0,4.17802,0.0,0.0,4.210526,13.98286,11.01993,"POLYGON ((408265.107 4597223.944, 408314.501 4..."
2,80018,1,3,8001801003,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,13.963964,0.0,0.0,3.51682,12.434692,8.695652,"POLYGON ((409117.362 4598172.550, 409152.489 4..."
3,80018,1,4,8001801004,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.508772,0.0,0.0,2.241848,11.373578,10.777778,"POLYGON ((408670.544 4598076.591, 408718.938 4..."
4,80018,1,5,8001801005,0.0,0.0,0.0,0.0,0.068729,0.0,...,0.0,0.0,0.0,8.643617,0.0,0.0,3.773585,12.807676,7.623318,"POLYGON ((408075.992 4597102.922, 408093.732 4..."


Since we won't use the geographical data, we drop the column that contains it.

In [4]:
df = data.drop(columns=["geometry"])

In [5]:
df.to_csv("../../data/output/only_votes.csv", index=False)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5083 entries, 0 to 5082
Columns: 1392 entries, MUNICIPI to vots_valids_percentage_999999999_M20191
dtypes: float64(1388), object(4)
memory usage: 54.0+ MB


In [7]:
# show all columns with dtypes object
df.select_dtypes(include=["object"]).head()

Unnamed: 0,MUNICIPI,DISTRICTE,SECCIO,MUNDISSEC
0,80018,1,1,8001801001
1,80018,1,2,8001801002
2,80018,1,3,8001801003
3,80018,1,4,8001801004
4,80018,1,5,8001801005


## Analyze data

Each census section belongs to a province, but the dataset has no variable for it. We create one from the first two characters of the `MUNICIPI` code.

This new variable will be used to stratify the split of the dataset into training and test sets.

In [9]:
# create the province column based on the first two characters of the `MUNICIPI` column
df["province"] = df["MUNICIPI"].str[:2]

# show the proportion of each province
df["province"].value_counts(normalize=True)

province
08    0.712178
43    0.106630
17    0.104859
25    0.076333
Name: proportion, dtype: float64

Most census sections (71.2%) are in the province of Barcelona. Tarragona (10.7%) and Girona (10.5%) have similar shares, and Lleida has the fewest (7.6%). This imbalance could be a problem when splitting the dataset into training and test sets: the model could be biased towards Barcelona and may generalize poorly to the three smaller provinces.
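
For readability, the two-digit prefixes can be mapped to province names. The sketch below assumes the prefixes are the standard INE province codes and runs on a small made-up frame rather than the real dataset; `PROVINCE_NAMES` and `province_name` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Assumed mapping: standard INE two-digit province codes for Catalonia.
PROVINCE_NAMES = {"08": "Barcelona", "17": "Girona", "25": "Lleida", "43": "Tarragona"}

# Toy stand-in for the real data; the MUNICIPI values are made-up strings.
toy = pd.DataFrame({"MUNICIPI": ["080018", "170010", "250019", "430012", "080193"]})

# Same derivation as in the notebook: first two characters of the code.
toy["province"] = toy["MUNICIPI"].str[:2]
toy["province_name"] = toy["province"].map(PROVINCE_NAMES)

# Share of rows per province name, largest first.
shares = toy["province_name"].value_counts(normalize=True)
print(shares)
```

Note that the mapping only works if `MUNICIPI` is stored as a string, so that the leading zero of `08` is preserved.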

We have some possible ways to tackle this problem:
- Use stratified sampling when splitting the dataset into training and testing sets. This way, the proportion of census sections from each province will be the same in the training and testing sets.
- Remove census sections from Barcelona to end up with similar proportions of census sections from each province. This way, the model will have to generalize to all provinces.
- Create a model for each province.

The last option doesn't seem correct: it would assume that the relationship between the features and the target differs from province to province, which is unlikely. The second option would throw away a large share of the Barcelona sections. The first option seems best, as it keeps all the data and gives the training and test sets the same mix of provinces.
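
To illustrate what stratification buys us, here is a minimal sketch on made-up labels with roughly the imbalance seen above. The feature matrix is random noise, the test fraction is arbitrary, and the helper `shares` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy labels with roughly the imbalance seen above (made-up data).
labels = np.array(["08"] * 712 + ["43"] * 107 + ["17"] * 105 + ["25"] * 76)
features = rng.normal(size=(labels.size, 3))


def shares(values):
    """Return the proportion of each label as a dict."""
    names, counts = np.unique(values, return_counts=True)
    return dict(zip(names, counts / counts.sum()))


# Stratified split: each label keeps (almost) the same share in both parts.
_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.35, stratify=labels, random_state=0
)

full = shares(labels)
test = shares(lab_test)
max_gap = max(abs(full[k] - test[k]) for k in full)
print(f"largest share gap between full data and test split: {max_gap:.4f}")
```

Without `stratify`, the gap depends on the random draw and tends to be largest for the smallest province.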

## Prepare data

We want to convert this data into a format that a machine learning algorithm can use. The objective is to predict the percentage of votes for each party from the socio-economic index, the age groups, the proportion of foreign-born residents and each party's vote percentages in past elections.

We have multiple variables expressing different ways to calculate the percentage of votes for each party. We will only use the `cens_electoral_percentage_*` columns.

In [12]:
# Drop the `vots_*`, `votants_percentage_*` and `vots_valids_percentage_*` columns.
# The "vots_" pattern already covers the `vots_valids_percentage_*` columns.
drop_mask = df.columns.str.contains("vots_|votants_percentage_")
df_filtered = df.loc[:, ~drop_mask]

We set `MUNDISSEC` as the index of the dataset and then drop the columns that we won't use.

In [13]:
# Set "MUNDISSEC" as index
df_filtered = df_filtered.set_index("MUNDISSEC")

# Keep the province for stratification, then remove the identifier columns
strarify_col = df["province"]
df_filtered = df_filtered.drop(columns=["MUNICIPI", "DISTRICTE", "SECCIO", "province"])

We will split the data into `X` and `y` where `X` contains the features and `y` contains the target variable.

We have multiple parties to predict, so we will use a multi-target regression model. Therefore, `y` will be a matrix with the percentage of votes for each party in the most recent election (2021).

In [14]:
# Columns of `df_filtered` whose name contains "2021" go into the `y` dataframe
# The rest go into the `X` dataframe
y = df_filtered.filter(regex="2021")
X = df_filtered.drop(columns=y.columns)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (5083, 338)
y shape: (5083, 9)


We have 5083 census sections and 9 target columns, one per party code. Now we split the data into training and test sets, stratified by province:

- The training set, used to fit the model, contains 65% of the data.
- The test set, used to evaluate the model, contains the remaining 35%.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, stratify=strarify_col, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (3303, 338)
X_test shape: (1780, 338)
y_train shape: (3303, 9)
y_test shape: (1780, 9)
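
The shapes above can be reproduced by hand. This is a small arithmetic sketch, assuming scikit-learn rounds the test count up for a fractional `test_size` and gives the remainder to the training set; that rule is our reading of its behaviour, not something stated in this notebook.

```python
import math

n_samples = 5083   # rows in X
test_size = 0.35   # fraction passed to train_test_split

# Assumed rounding rule: test count rounded up, remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(f"train share: {n_train / n_samples:.3f}, test share: {n_test / n_samples:.3f}")
```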


In [19]:
print(f"X_train proportion of provinces: {X_train.index.str[:2].value_counts(normalize=True)}")
print(f"X_test proportion of provinces: {X_test.index.str[:2].value_counts(normalize=True)}")
print(f"y_train proportion of provinces: {y_train.index.str[:2].value_counts(normalize=True)}")
print(f"y_test proportion of provinces: {y_test.index.str[:2].value_counts(normalize=True)}")

X_train proportion of provinces: MUNDISSEC
08    0.712080
43    0.106570
17    0.105056
25    0.076294
Name: proportion, dtype: float64
X_test proportion of provinces: MUNDISSEC
08    0.712360
43    0.106742
17    0.104494
25    0.076404
Name: proportion, dtype: float64
y_train proportion of provinces: MUNDISSEC
08    0.712080
43    0.106570
17    0.105056
25    0.076294
Name: proportion, dtype: float64
y_test proportion of provinces: MUNDISSEC
08    0.712360
43    0.106742
17    0.104494
25    0.076404
Name: proportion, dtype: float64


## Multi-Target Learning in XGBoost 2.0

### Model definition

In [55]:
import xgboost as xgb

# Create DMatrix objects for training and testing
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [21]:
# Parameters for the multi-target regression task
params = {
    "objective": "reg:squarederror",  # Squared error regression objective
    "max_depth": 6,                   # Maximum depth of each tree
    "eta": 0.1,                       # Learning rate
    "num_parallel_tree": 1            # Trees grown per boosting round (1 = plain boosting)
}
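
With the parameters above, XGBoost should use its default multi-target strategy, which, as far as we know, fits separate trees for each target. XGBoost 2.0 also offers vector-leaf trees through the `multi_strategy` parameter. Below is a hedged sketch of how the dictionary could be extended, assuming XGBoost 2.0 or later and that the `hist` tree method is required for this mode; `base_params` and `vector_leaf_params` are illustrative names not used in the notebook.

```python
# Base parameters, copied from the cell above.
base_params = {
    "objective": "reg:squarederror",
    "max_depth": 6,
    "eta": 0.1,
    "num_parallel_tree": 1,
}

# Assumption: XGBoost >= 2.0, where `multi_strategy` is available.
# "multi_output_tree" asks for trees whose leaves hold one value per target;
# the default, "one_output_per_tree", fits separate trees for each target.
vector_leaf_params = {
    **base_params,
    "tree_method": "hist",
    "multi_strategy": "multi_output_tree",
}

print(vector_leaf_params)
```

Passing `vector_leaf_params` to `xgb.train` in place of `params` would let a single tree model all nine targets jointly; whether that improves on the baseline would need to be checked on the test set.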

### Model training

In [56]:
# Train the model
num_boost_round = 200
model = xgb.train(params, dtrain, num_boost_round)

### Model evaluation

In [57]:
# Predict the targets for the test data
y_pred = model.predict(dtest)

In [31]:
y_pred.shape

(1780, 9)

In [34]:
# Convert the predictions to a DataFrame
y_pred_df = pd.DataFrame(y_pred, columns=y_test.columns, index=y_test.index)

y_pred_df.head()

MUNDISSEC,cens_electoral_percentage_6_A20211,cens_electoral_percentage_10_A20211,cens_electoral_percentage_86_A20211,cens_electoral_percentage_301_A20211,cens_electoral_percentage_693_A20211,cens_electoral_percentage_1031_A20211,cens_electoral_percentage_1099_A20211,cens_electoral_percentage_2019838_A20211,cens_electoral_percentage_999999999_A20211
8200901003,19.775175,7.070753,2.197721,3.372607,4.083109,1.671946,4.443032,1.193168,0.151337
8187805001,9.582745,15.086157,1.079001,2.293527,4.511827,10.145722,2.7946,3.252101,0.251875
8244401001,11.228998,16.776817,1.293505,2.735657,3.052797,9.099374,5.321958,3.792694,0.248029
8245704006,17.763641,7.729955,2.061852,3.686827,4.341558,2.850704,4.332741,1.837639,0.123481
8282401004,11.792672,15.609639,1.819223,2.721537,3.584018,17.79307,4.351094,5.00096,0.465495


In [48]:
y_test.head()

MUNDISSEC,cens_electoral_percentage_6_A20211,cens_electoral_percentage_10_A20211,cens_electoral_percentage_86_A20211,cens_electoral_percentage_301_A20211,cens_electoral_percentage_693_A20211,cens_electoral_percentage_1031_A20211,cens_electoral_percentage_1099_A20211,cens_electoral_percentage_2019838_A20211,cens_electoral_percentage_999999999_A20211
8200901003,21.580928,6.900878,2.38394,3.262233,5.018821,1.380176,5.520703,1.003764,0.107546
8187805001,8.754209,15.572391,0.420875,2.861953,5.218855,10.43771,2.946128,2.777778,0.16835
8244401001,12.068966,18.448276,0.91954,2.356322,3.333333,10.632184,5.344828,3.448276,0.270936
8245704006,14.779874,5.345912,1.886792,4.716981,4.297694,2.93501,5.24109,1.991614,0.104822
8282401004,13.84959,14.147431,2.755026,2.755026,3.797468,19.061802,4.61653,4.169769,0.574407


In [59]:
import re

def parse_column_name(column_name: str, variable_name: str = "cens_electoral_percentage"):
    """
    Parse the column name to extract party code, election type, year, and repetition.

    Args:
        column_name (str): Column name formatted as `VARIABLE_PARTYCODE_ELECTIONSCODE`.
        variable_name (str): The prefix used in the column name.

    Returns:
        dict: A dictionary containing parsed components: party_code, election_type, election_year, election_repetition.
    """
    # Dynamic regex pattern using the provided variable name
    pattern = rf"{re.escape(variable_name)}_(\d+)_([A-Z])(\d{{4}})(\d)"

    # Use regex to find all elements
    match = re.match(pattern, column_name)

    if match:
        return {
            'party_code': match.group(1),
            'election_type': match.group(2),
            'election_year': match.group(3),
            'election_repetition': match.group(4)
        }
    else:
        raise ValueError(f"Invalid column name format: {column_name}")
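
As a quick illustration of the column layout, the sketch below parses one of the 2021 target columns. It is a standalone variant that uses named groups and `fullmatch` instead of calling the function above; `PATTERN` is a name introduced only for this example.

```python
import re

# Same layout as above:
# <variable>_<party code>_<type letter><4-digit year><repetition digit>
PATTERN = re.compile(
    r"cens_electoral_percentage_(?P<party_code>\d+)_"
    r"(?P<election_type>[A-Z])(?P<election_year>\d{4})(?P<election_repetition>\d)"
)

match = PATTERN.fullmatch("cens_electoral_percentage_6_A20211")
parsed = match.groupdict() if match else None
print(parsed)
# {'party_code': '6', 'election_type': 'A', 'election_year': '2021', 'election_repetition': '1'}
```

Using `fullmatch` rejects names with trailing characters, which `re.match` would accept.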

In [58]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Ensure y_test and y_pred are NumPy arrays
y_test_np = y_test.values if isinstance(y_test, pd.DataFrame) else np.array(y_test)
y_pred_np = y_pred.values if isinstance(y_pred, pd.DataFrame) else np.array(y_pred)

# Retrieve the column names from the DataFrame (y_test)
columns = y_test.columns

# Initialize lists to hold metrics for each target
mae = []
mse = []
rmse = []
r2 = []

# Calculate metrics for each target column
for i in range(y_test_np.shape[1]):
    mae.append(mean_absolute_error(y_test_np[:, i], y_pred_np[:, i]))
    mse.append(mean_squared_error(y_test_np[:, i], y_pred_np[:, i]))
    rmse.append(np.sqrt(mse[i]))
    r2.append(r2_score(y_test_np[:, i], y_pred_np[:, i]))

    # Extract the party code from the column name using the parse function
    parsed_info = parse_column_name(columns[i])

    # Display results for each target
    print(f"Target {i+1} (Party Code: {parsed_info['party_code']}):")
    print(f" - MAE: {mae[i]:.4f}")
    print(f" - MSE: {mse[i]:.4f}")
    print(f" - RMSE: {rmse[i]:.4f}")
    print(f" - R²: {r2[i]:.4f}")
    print()


Target 1 (Party Code: 6):
 - MAE: 1.1347
 - MSE: 2.2300
 - RMSE: 1.4933
 - R²: 0.9026

Target 2 (Party Code: 10):
 - MAE: 1.0280
 - MSE: 2.2021
 - RMSE: 1.4840
 - R²: 0.8941

Target 3 (Party Code: 86):
 - MAE: 0.4731
 - MSE: 0.4835
 - RMSE: 0.6953
 - R²: 0.8320

Target 4 (Party Code: 301):
 - MAE: 0.5314
 - MSE: 0.4797
 - RMSE: 0.6926
 - R²: 0.7921

Target 5 (Party Code: 693):
 - MAE: 0.6382
 - MSE: 0.7648
 - RMSE: 0.8745
 - R²: 0.8089

Target 6 (Party Code: 1031):
 - MAE: 1.0474
 - MSE: 2.9809
 - RMSE: 1.7265
 - R²: 0.9528

Target 7 (Party Code: 1099):
 - MAE: 0.5694
 - MSE: 0.5834
 - RMSE: 0.7638
 - R²: 0.8305

Target 8 (Party Code: 2019838):
 - MAE: 0.6349
 - MSE: 1.0874
 - RMSE: 1.0428
 - R²: 0.8597

Target 9 (Party Code: 999999999):
 - MAE: 0.0894
 - MSE: 0.0236
 - RMSE: 0.1536
 - R²: 0.5928
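
The per-target loop above can also be written without explicit column slicing, using the `multioutput="raw_values"` option of scikit-learn's metric functions. This is a minimal sketch on made-up data; `y_true` and `y_hat` are stand-ins for `y_test` and `y_pred`, not the notebook's actual arrays.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(42)

# Made-up targets and noisy predictions: 50 rows, 3 target columns.
y_true = rng.uniform(0, 20, size=(50, 3))
y_hat = y_true + rng.normal(scale=1.0, size=y_true.shape)

# multioutput="raw_values" returns one score per target column.
mae_per_target = mean_absolute_error(y_true, y_hat, multioutput="raw_values")
mse_per_target = mean_squared_error(y_true, y_hat, multioutput="raw_values")
rmse_per_target = np.sqrt(mse_per_target)
r2_per_target = r2_score(y_true, y_hat, multioutput="raw_values")

for i in range(y_true.shape[1]):
    print(
        f"target {i}: MAE={mae_per_target[i]:.4f} "
        f"RMSE={rmse_per_target[i]:.4f} R2={r2_per_target[i]:.4f}"
    )
```

Each returned array has one entry per target, in the same order as the columns of `y_true`.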



In [53]:
X.columns

Index(['cens_electoral_percentage_4_A19801',
       'cens_electoral_percentage_4_A19841',
       'cens_electoral_percentage_4_E19871',
       'cens_electoral_percentage_4_E19891',
       'cens_electoral_percentage_4_E20041',
       'cens_electoral_percentage_4_G19771',
       'cens_electoral_percentage_4_G19791',
       'cens_electoral_percentage_4_G19821',
       'cens_electoral_percentage_4_G19861',
       'cens_electoral_percentage_4_G19891',
       ...
       'cens_electoral_percentage_999999999_M19831',
       'cens_electoral_percentage_999999999_M19871',
       'cens_electoral_percentage_999999999_M19911',
       'cens_electoral_percentage_999999999_M19951',
       'cens_electoral_percentage_999999999_M19991',
       'cens_electoral_percentage_999999999_M20031',
       'cens_electoral_percentage_999999999_M20071',
       'cens_electoral_percentage_999999999_M20111',
       'cens_electoral_percentage_999999999_M20151',
       'cens_electoral_percentage_999999999_M20191'],
      dt