# Modeling of the Finnish 2019 election results data

This notebook trains different machine learning algorithms on the data from the Finnish 2019 general elections. The output variable is the vote total per candidate, and the input variables are the background features found in the data. (For more details, see the separate notebook on wrangling this data.)

The main purpose of this exercise is to get some hands-on experience with the algorithms being tested on the data. In addition to that, I am naturally also curious to see if a machine learning model that predicts the vote totals reasonably well from the background variables can be found.

In [1]:
# Importing the necessary libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Before any modeling can be done, it is of course necessary to load the data and split it into the training and testing samples. In the interests of making my results replicable, I am doing the splitting with a set random seed.
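As a side note, the effect of fixing the seed can be checked with a minimal sketch. It runs on a small synthetic array rather than the election data, and the names `X`, `y` and `same_split` are made up for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same seed select exactly the same rows
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=17)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=17)

same_split = np.array_equal(y_test_a, y_test_b)
print("Same test rows with the same seed:", same_split)
```

Leaving `random_state` unset would instead draw a different split on each run, which makes the evaluation metrics vary between runs.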

In [2]:
# Reading the data from the .csv file into a dataframe
data = pd.read_csv('election_data.csv')

# Checking the data frame
data.head()

   gender       age  euro  parl  council  votes
0     0.0  0.435294   0.0   0.0      0.0   1073
1     0.0  0.682353   0.0   0.0      1.0    236
2     0.0  0.705882   0.0   0.0      0.0    249
3     0.0  0.717647   0.0   0.0      0.0     48
4     0.0  0.647059   0.0   0.0      1.0    746


In [3]:
# Splitting the data: the first five columns are the features, the last one holds the vote totals
train_data, test_data, train_votes, test_votes = train_test_split(
    data.iloc[:, :5], data.iloc[:, 5:], test_size=0.2, random_state=17)

# Checking the split data
for item in [train_data, test_data, train_votes, test_votes]:
    print(item.info())
    print(item.head(), '\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1974 entries, 1021 to 2191
Data columns (total 5 columns):
gender     1974 non-null float64
age        1974 non-null float64
euro       1974 non-null float64
parl       1974 non-null float64
council    1974 non-null float64
dtypes: float64(5)
memory usage: 92.5 KB
None
      gender       age  euro  parl  council
1021     1.0  0.458824   0.0   1.0      1.0
350      0.0  0.717647   0.0   0.0      1.0
496      1.0  0.458824   0.0   0.0      0.0
1160     1.0  0.776471   0.0   0.0      0.0
1142     0.0  0.376471   0.0   0.0      1.0 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 494 entries, 2081 to 819
Data columns (total 5 columns):
gender     494 non-null float64
age        494 non-null float64
euro       494 non-null float64
parl       494 non-null float64
council    494 non-null float64
dtypes: float64(5)
memory usage: 23.2 KB
None
      gender       age  euro  parl  council
2081     1.0  0.400000   0.0   0.0      1.0
192      0.0 

The checks done on the loaded and split data don't show any problems, so the next step is to move on to the actual modeling. The first type of machine learning model that I am applying here is linear regression. It is a natural starting point, since the background attributes of the candidates (age, gender, incumbency in different posts) can easily be thought of as having additive effects on their vote totals.
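Written out, the model assumed by a linear regression on these five features has roughly the following form. The $\beta$ symbols are notation introduced for this sketch; the feature names are the column names of the data.

```latex
\widehat{\text{votes}} = \beta_0
  + \beta_1\,\text{gender}
  + \beta_2\,\text{age}
  + \beta_3\,\text{euro}
  + \beta_4\,\text{parl}
  + \beta_5\,\text{council}
```

scikit-learn's `LinearRegression` estimates the intercept $\beta_0$ and the coefficients $\beta_1, \dots, \beta_5$ by ordinary least squares, that is, by minimizing the sum of squared differences between the predicted and the observed vote totals in the training data.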

In [4]:
# Create the model and fit it to the training data
regression = LinearRegression()
regression.fit(train_data, train_votes)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Before evaluating how well the above model predicts the vote totals, I am curious to see what kind of relationships it posits between the background variables and the vote totals. To find that out, I am printing the model's coefficients.

In [5]:
# Printing the model coefficients
print("Regression model coefficients: {}".format(list(zip(list(train_data.columns.values), regression.coef_[0]))))

Regression model coefficients: [('gender', 173.47209865915841), ('age', -1221.558774070455), ('euro', 17251.504084596687), ('parl', 5278.3568495857526), ('council', 1088.3067378854057)]


Looking at the coefficients, it seems that female candidates tend to get slightly more votes than male ones, while older candidates get fewer votes than younger ones. Candidates who hold a seat in the European Parliament, the Finnish Parliament or a local council also tend to get more votes than those who don't.
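To make the reading of the coefficients concrete, here is a minimal sketch of how a fitted linear model turns them into a prediction. It uses synthetic, noise-free data with made-up coefficients, not the election data, and the names `X`, `y`, `model` and `manual` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data only: three made-up features and a noise-free linear target
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = 100.0 + X @ np.array([200.0, -50.0, 1000.0])

model = LinearRegression().fit(X, y)

# A prediction is the intercept plus the coefficient-weighted sum of the features
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))
```

Each coefficient is therefore the change in the predicted total when its feature increases by one unit and the others stay fixed. For a 0/1 feature such as `parl`, that is the difference between the two groups. The `age` values in the data appear to be scaled to the range from 0 to 1, so, assuming min-max scaling, its coefficient would describe the difference between the oldest and the youngest candidate rather than the effect of one year.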

Of course, more important than the relationships between the background variables and the vote totals is how well the model predicts the totals from those variables. Thus, the next step is to make predictions from the test portion of the data and to calculate a few metrics that show how far off these predictions are from the actual vote totals observed in the test data.

In [6]:
# Using the model to predict vote totals from the test data
regr_pred_votes = regression.predict(test_data)

# Printing the evaluation metrics
print("Root mean square error:", np.sqrt(metrics.mean_squared_error(test_votes, regr_pred_votes)))
print("Maximum error:", metrics.max_error(test_votes, regr_pred_votes))
print("Explained variance:", metrics.explained_variance_score(test_votes, regr_pred_votes))

Root mean square error: 1417.99755237
Maximum error: 11742.3960113
Explained variance: 0.52650308842
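An error figure like the RMSE is easier to judge next to a naive baseline. The following is a minimal sketch of such a comparison on synthetic data, using scikit-learn's `DummyRegressor`, which always predicts the mean of the training targets. The data and the variable names are made up for illustration, so the printed numbers say nothing about the election model itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data only: one informative feature plus Gaussian noise
rng = np.random.default_rng(17)
X = rng.random((500, 1))
y = 1000.0 * X[:, 0] + rng.normal(0.0, 100.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17)

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
rmse_model = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("Baseline RMSE:", rmse_baseline)
print("Model RMSE:", rmse_model)
```

The same pattern could be applied to `train_data`, `train_votes`, `test_data` and `test_votes` from this notebook to see how much of the error the background variables actually remove compared with predicting the average vote total for everyone.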
