# Wrangling the Finnish 2019 election results data

This notebook loads and preprocesses the results of the Finnish 2019 parliamentary elections. The data consists of the vote totals for all candidates together with some background information about them. The aim is to prepare the data for trying out different machine learning algorithms in a subsequent notebook.

The data is sourced from an elections information and results [website](https://tulospalvelu.vaalit.fi/indexe.html) maintained by the Finnish Ministry of Justice. The data file that I am using gives the election results broken down by candidate. The specifications and descriptions for the results data can be found in this [PDF document](https://tulospalvelu.vaalit.fi/EKV-2019/ohje/Vaalien_tulostiedostojen_kuvaus_EKV-EPV-2019_EN.pdf), whose fourth chapter details the fields in the data file that I am using.

In [None]:
# Importing the necessary libraries

from urllib import request
from zipfile import ZipFile

import numpy as np
import pandas as pd

In [None]:
# Downloading the data from the source and unzipping it

request.urlretrieve('https://tulospalvelu.vaalit.fi/EKV-2019/ekv-2019_ehd_maa.csv.zip',
                   filename = 'ekv-2019_ehd_maa.csv.zip')

with ZipFile('ekv-2019_ehd_maa.csv.zip', 'r') as zip_archive:
    zip_archive.extractall()
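The name of the file that comes out of an archive does not have to match the name of the archive itself, so it can be worth listing the archive members before reading anything. Below is a minimal sketch of that check. The helper `csv_members` and the in-memory demo archive are my own illustrations and not part of the election data; with the real download, the archive path from the cell above would be passed in instead.

```python
import io
from zipfile import ZipFile


def csv_members(archive):
    """Return the names of the .csv files inside a zip archive."""
    with ZipFile(archive, 'r') as zip_archive:
        return [name for name in zip_archive.namelist()
                if name.lower().endswith('.csv')]


# Building a tiny in-memory archive as a stand-in for the downloaded file
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as demo_archive:
    demo_archive.writestr('demo_results.csv', '1;2;3\n')
    demo_archive.writestr('readme.txt', 'not data')

members = csv_members(buffer)
print(members)
```

The returned name is the one to pass to `pd.read_csv` after extraction.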

The data in the `.csv` file consists of nearly fifty different variable fields, many of which are not particularly relevant for modeling purposes, e.g. candidate number and voting area identifier. The data is also very granular: the file has candidate-wise rows for each voting area of the electoral districts as well as separate rows for each candidate's electoral district vote totals. For modeling purposes most of the data in the `.csv` file is therefore redundant and needs to be pruned heavily in order to be usable.

From a modeling perspective the most usable variables in the raw data are the following: the candidates' first and last names, their gender and age, whether they were members of the Finnish or the European parliament or municipal councillors at the time of the elections, and of course the candidates' vote totals, which is the dependent variable predicted by modeling its relationship with the other variables. The first and last names of the candidates aren't used in the modeling per se, but they are necessary for differentiating the candidates from each other when aggregating the data in order to extract the vote totals. Besides these variables, there are a couple of other background factors, like the candidates' party affiliation and their place of residence, that might prove useful for modeling purposes but are somewhat tricky to recode into usable features. I might come back to these variables later on, but for the time being I am leaving them aside.

In [None]:
# Listing the columns to be extracted from the .csv file and naming them
data_cols = [17, 18, 19, 20, 26, 27, 28, 34]
col_names = ['f_name', 'l_name', 'gender', 'age', 'euro', 'parl', 'council', 'votes']

# Reading the data from the .csv file into a pandas dataframe
df = pd.read_csv('ekv-2019_teat_maa.csv', sep = ';', header = None, names = col_names, usecols = data_cols,
                 encoding = 'latin_1')

# Checking the data frame
df.head()
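To make the column selection above easier to follow, here is a small sketch of how `usecols` and `names` interact when the file has no header row. The three-row extract, the field positions and the `demo_` names are made up for illustration and do not follow the layout of the real results file. pandas ignores the order of the entries in `usecols`, so the names are assigned to the selected columns in ascending position order.

```python
from io import StringIO

import pandas as pd

# Hypothetical extract: six semicolon-separated fields, no header row
raw = (
    "01;Anna;Virtanen;2;41;120\n"
    "01;Anna;Virtanen;2;41;80\n"
    "02;Matti;Korhonen;1;52;45\n"
)

# Zero-based positions of the fields to keep, and the names to give them
demo_cols = [1, 2, 3, 4, 5]
demo_names = ['f_name', 'l_name', 'gender', 'age', 'votes']

demo_df = pd.read_csv(StringIO(raw), sep=';', header=None,
                      names=demo_names, usecols=demo_cols)
print(demo_df)
```

The field at position 0 is dropped and the remaining five columns get the given names.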

Now that the variables in the dataset have been reduced to a more useful and manageable selection, the next step is to extract the rows that hold each candidate's total votes across all voting areas. The simplest way of getting the totals is to group the rows by individual candidate and use the pandas `.max()` method on them. This works because a candidate's district-wide total is at least as large as their vote count in any single voting area, so it is always the largest value among that candidate's rows. As there were a few candidates with the same first and last names in the elections, it is better to group by all the variables other than the votes to make sure the candidates are properly separated from each other.

In [None]:
# Making a list of the grouping variables by excluding the votes column
grouping_cols = col_names[:-1]

# Grouping the candidate data and finding the vote totals with .max()
data = df.groupby(grouping_cols).max()

# Moving the grouping keys from the index back into regular columns
data = data.reset_index(level = grouping_cols)

# Checking the resulting data frame
data.head()
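The effect of the grouping keys can be illustrated with a toy example. The data below is invented: two candidates share the same name but differ in age, and each has two voting-area rows plus one district total row. Grouping by name alone merges them into one row, while including the age keeps them apart.

```python
import pandas as pd

# Invented rows: per candidate, two voting-area counts followed by the district total
toy = pd.DataFrame({
    'f_name': ['Anna'] * 6,
    'l_name': ['Virtanen'] * 6,
    'age':    [41, 41, 41, 67, 67, 67],
    'votes':  [120, 80, 200, 10, 5, 15],
})

# Grouping by every background column keeps the namesakes separate
toy_keys = ['f_name', 'l_name', 'age']
totals = toy.groupby(toy_keys).max().reset_index()

# Grouping by name only collapses both candidates into a single row
by_name_only = toy.groupby(['f_name', 'l_name']).max().reset_index()

print(totals)
print(by_name_only)
```

In `totals` each candidate keeps their own district total, whereas `by_name_only` reports a single row with the larger of the two totals.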

Of course, before proceeding further it is important to make sure that the information extracted from the raw data is valid and accurate. The simplest way to do this is to look at the distribution of the vote totals in the extracted data.

In [None]:
data['votes'].describe()

The summary above shows that the number of observations equals the number of candidates in the election, 2468 persons (cf. the [official candidate lists](https://tulospalvelu.vaalit.fi/EKV-2019/en/ehd_listat_kokomaa.htm)). The min and max values also correspond to the vote counts of the candidates who received the fewest and the most votes in the election, Mr. Jarmo Vikman and Mr. Jussi Halla-Aho (cf. the [Yle candidate listing](https://vaalit.yle.fi/ev2019/en/candidates)).
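The same kind of sanity check can also be written as code so that it fails loudly if the extraction goes wrong. The following is only a sketch: the helper `check_totals`, its parameter names and the toy frame are my own assumptions. With the real data, the expected count would be the 2468 candidates mentioned above.

```python
import pandas as pd


def check_totals(frame, expected_candidates, key_cols):
    """Run a few simple sanity checks on the aggregated candidate data."""
    return {
        # One row per candidate is expected after the aggregation
        'row_count_ok': len(frame) == expected_candidates,
        # No combination of grouping keys should appear twice
        'no_duplicate_keys': not frame.duplicated(subset=key_cols).any(),
        # Vote counts can never be negative
        'no_negative_votes': bool((frame['votes'] >= 0).all()),
    }


# Toy stand-in for the aggregated data
toy_data = pd.DataFrame({
    'f_name': ['Anna', 'Matti'],
    'l_name': ['Virtanen', 'Korhonen'],
    'votes':  [200, 45],
})

checks = check_totals(toy_data, expected_candidates=2, key_cols=['f_name', 'l_name'])
print(checks)
```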

Since some machine learning methods work best with normalised variables, the next step is to normalise the background variables and to drop the name columns, which have become redundant at this point. Also, because the variables for Finnish and European parliament as well as municipal council membership have been recorded in the data as strings, a conditional expression is needed to cast them to a numeric type.

In [None]:
# Dropping name columns since they're not needed for further work on the data
data = data.drop(columns = ['f_name', 'l_name'])

# Finding highest age among candidates for normalising the age variable
max_age = data['age'].max()

# Transforming background variables into float types and normalised distributions
data['age'] = data['age'].transform(func = lambda x: x / max_age)
data['euro'] = data['euro'].transform(func = lambda x: 1.0 if x == '1' else 0.0)
data['parl'] = data['parl'].transform(func = lambda x: 1.0 if x == '1' else 0.0)
data['council'] = data['council'].transform(func = lambda x: 1.0 if x == '1' else 0.0)
data['gender'] = data['gender'].transform(func = lambda x: x-1.0) # Original encoding: male = 1; female = 2

# Checking the results of the above operations
data.info()
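As an alternative to the element-wise lambdas, the same transformations can be expressed as vectorised column operations. The sketch below uses an invented toy frame and assumes, as in the cell above, that membership is recorded as the string `'1'` and that gender is coded male = 1, female = 2.

```python
import pandas as pd

# Invented candidates; an empty string stands for "not a member"
toy = pd.DataFrame({
    'gender':  [1, 2, 2],
    'age':     [30, 60, 45],
    'euro':    ['1', '', ''],
    'parl':    ['', '1', ''],
    'council': ['1', '1', ''],
    'votes':   [200, 15, 90],
})

# Scaling age by the highest age so that the values fall in (0, 1]
toy['age'] = toy['age'] / toy['age'].max()

# Turning the membership strings into 1.0 / 0.0 flags
for flag in ['euro', 'parl', 'council']:
    toy[flag] = (toy[flag] == '1').astype(float)

# Shifting gender from 1/2 to 0.0/1.0
toy['gender'] = toy['gender'] - 1.0

print(toy)
```

Note that dividing by the maximum keeps the relative spacing of the ages but does not map the youngest candidate to zero, unlike min-max scaling.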

Finally, it is best to save the wrangled and cleaned data into a new `.csv` file so that it is readily at hand when starting the actual modeling work. Saving the data in `.csv` format also makes it easier to work with the data using tools other than Python, such as R.

In [None]:
data.to_csv(path_or_buf = 'election_data.csv', index = False)
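A quick way to confirm that the saved file can be read back unchanged is a round trip through a temporary file. This is a sketch with a toy frame and a hypothetical file name, not a check that was run on the real data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned data
toy = pd.DataFrame({
    'gender': [0.0, 1.0],
    'age':    [0.5, 1.0],
    'votes':  [200, 15],
})

# Writing to a temporary directory and reading the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_election_data.csv')
    toy.to_csv(path_or_buf=path, index=False)
    restored = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes or values differ
pd.testing.assert_frame_equal(toy, restored)
print(restored.dtypes)
```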