# Election Calculator Model Selection
author: gwiazdan  
date: 27-06-2025

In [1]:
import sys
sys.path.append('scripts')
from IPython.display import Markdown, display
from formating_features import format_scientific_notation, format_time
from combination_generating import generate_results
from function_fitting import function_fitter
from functions import linear_function, improved_linear_function, linear_splines, pchip_interpolation
ff = function_fitter(2)
possible_results = generate_results(ff.keys)

## Table of Contents


1. [Introduction](#introduction)
2. [Overview of the problem](#overview-of-the-problem)
3. [Proposed solutions](#proposed-solutions)
4. [Summary](#summary)

# Introduction


## Project Purpose ❓
The project addresses the problem of estimating how votes are distributed at lower administrative levels. It was never intended to be a calculator based on small samples or regression models, since that level of precision is unnecessary. Instead, it is meant as a general-purpose tool for gaining insight into how votes may be distributed, not for producing the most accurate distribution achievable.
## Solution finding 🔍
The search for an optimal function will be guided by the following criteria:
- Minimization of the mean squared error (MSE) relative to the summed votes,
- Acceptable computational efficiency (not too time-consuming to calculate).

# Overview of the problem


## Mathematical theory 📄

Assume that the vote percentage grows according to a function $f$. Function $f$ takes the **countrywide vote percentage** as an argument and returns the **local vote percentage**. The function may differ from one constituency to another. Therefore, it is convenient to assume that each function belongs to a **function space** $F$, where $f_n \in F, n \in N$, and $N$ is a finite discrete set of territorial unit identifiers.

Each territorial unit $n \in N$ has a 2-tuple $(a_n, b_n)$, where $a_n$ is the "weight" of the territorial unit (total votes), and $b_n$ is a historical reference value that describes the status quo. Then, it is possible to say that $\displaystyle \sum_{n\in N} a_n f_n(x) = a_{total} x$.

In practice, the sum may differ from the theoretical value. The error for party $p$ is then $\Delta_p = a_{total} x_p - \displaystyle \sum_{n\in N} a_n f_n(x_p)$, where $x_p$ is the countrywide vote percentage for party $p$. Since this formula covers a single party, the total error is $\mathrm{Error} = \displaystyle \sum_{p\in P} \Delta_p$, where $P$ is the set of political parties.
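The error definition can be illustrated with a minimal sketch. It measures the error in votes, consistent with the identity $\sum_{n\in N} a_n f_n(x) = a_{total} x$. The function name `party_error` and all numbers below are hypothetical and are not taken from the project's scripts.

```python
def party_error(x_p, weights, local_shares):
    """Return a_total * x_p - sum_n a_n * f_n(x_p) for a single party."""
    a_total = sum(weights)
    predicted_votes = sum(a * s for a, s in zip(weights, local_shares))
    return a_total * x_p - predicted_votes


# Hypothetical data: three territorial units, two parties.
weights = [1000, 3000, 6000]             # a_n: total votes per unit
countrywide = {"A": 0.40, "B": 0.60}     # x_p: countrywide share per party
local = {                                # f_n(x_p): made-up model output
    "A": [0.50, 0.45, 0.35],
    "B": [0.50, 0.55, 0.65],
}

errors = {p: party_error(countrywide[p], weights, local[p]) for p in countrywide}
total_error = sum(errors.values())                          # signed sum over parties
mse = sum(e ** 2 for e in errors.values()) / len(errors)    # mean squared error
```

In this made-up example the two signed errors are about +50 and -50 votes, so their sum is close to zero even though neither party is reproduced exactly. Squaring the errors, as the MSE criterion does, avoids this cancellation.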

# Proposed solutions

## Basic linearity 📐

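Assume that each $f_n(x)=\frac{b_n}{x_0}x$, where $x_0$ is the historical countrywide vote percentage that corresponds to the reference values $b_n$. Then: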

$$\begin{equation}
\sum_{n\in N} a_n f_n(x) = a_{total} x \tag{1}
\end{equation}$$

$$\begin{equation}
\sum_{n\in N} a_n \frac{b_n}{x_0} x = a_{total} x \tag{2}
\end{equation}$$

$$\begin{equation}
\frac{x}{x_0} \sum_{n \in N} a_n b_n = a_{total} x \tag{3}
\end{equation}$$

$$\begin{equation}
\sum_{n \in N} a_n b_n = a_{total} x_0 \tag{4}
\end{equation}$$

Equation (4) is satisfied by construction.

However, the party results within a constituency may not sum to 100%, which can be a major problem for the model.
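Both properties can be checked on a small made-up example. The sketch below assumes that $x_0$ is the $a_n$-weighted mean of the historical values $b_n$, which is what equation (4) states. The helper names and the data are hypothetical, and `linear_share` is only an illustration, not the project's `linear_function`.

```python
def linear_share(x, x0, b_n):
    """Basic linear model: f_n(x) = (b_n / x0) * x."""
    return b_n / x0 * x


def weighted_mean(values, weights):
    """Weighted mean of values with the given weights."""
    return sum(a * v for a, v in zip(weights, values)) / sum(weights)


# Hypothetical history: two territorial units, two parties.
weights = [4000, 6000]                                 # a_n
history = {"A": [0.70, 0.30], "B": [0.30, 0.70]}       # b_n per party

# Assumption: x0 is the weighted mean of b_n, so equation (4) holds.
x0 = {p: weighted_mean(b, weights) for p, b in history.items()}

# New countrywide shares to project onto the units.
new = {"A": 0.60, "B": 0.40}
predicted = {
    p: [linear_share(new[p], x0[p], b) for b in history[p]] for p in history
}

# Countrywide share recovered from the unit-level predictions.
recovered = {p: weighted_mean(predicted[p], weights) for p in history}

# Sum of party shares inside each unit.
unit_totals = [sum(predicted[p][i] for p in history) for i in range(len(weights))]
```

Here `recovered` reproduces the new countrywide shares for both parties, while `unit_totals` is above 1 in the first unit and below 1 in the second, which is the drawback described above.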

Pros:
- Easy to implement
- Quick and memoryless
- The sum of votes from constituencies approximates the total votes  

Cons:
- Votes do not sum to 100% within a constituency
- Too idealistic

In [2]:
raw_mse, raw_rmse, time = ff.calculate_mse(linear_function, possible_results)

mse = format_scientific_notation(raw_mse)
rmse = format_scientific_notation(raw_rmse * 100)

display(Markdown("### Calculated Mean Squared Error for the Linear Model:" + "\n" +
r"$$\Delta_{MSE} = " + mse + r"$$" + "\n" + r"$$\Delta_{RMSE} = " + rmse + r"\%$$" + "\n" + r"$$\mathrm{t_{{avg}}}=" + format_time(time) + r"$$" "\nIn theory, the MSE for this model should be zero. However, in practice, due to the limitations of floating-point arithmetic, the accumulation of small rounding errors during calculations leads to a small, non-zero result."))

### Calculated Mean Squared Error for the Linear Model:
$$\Delta_{MSE} = 8.52 \times 10^{4}$$
$$\Delta_{RMSE} = 3.63 \times 10^{-10}\%$$
$$\mathrm{t_{{avg}}}=191.58 ms$$
In theory, the MSE for this model should be zero. However, in practice, due to the limitations of floating-point arithmetic, the accumulation of small rounding errors during calculations leads to a small, non-zero result.

## Linear function with total sum correction

Each time the column vector of votes is calculated, it is rescaled so that its entries sum to the total number of votes $v_{total}$:
$$\mathbb{V}=\begin{bmatrix}v_1 \\ v_2 \\ \vdots \\v_n \end{bmatrix}$$
$$\varrho = \frac{v_{total}}{\|\mathbb{V}\|_1}$$
$$\mathbb{V}_{rescaled}=\varrho \mathbb{V}$$
Then, the integer part of the rescaled votes is calculated using the floor function $\lfloor\mathbb{V}_{rescaled}\rfloor$. To ensure that the sum matches the total votes, the remaining votes are distributed one by one to the parties with the highest remainders, in descending order (the largest remainder method).
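The rescaling and rounding steps above can be sketched as follows. This is a minimal illustration, not the project's `improved_linear_function`. The name `rescale_to_total` is hypothetical, and the sketch assumes non-negative votes with a positive sum and an integer `v_total`.

```python
import math


def rescale_to_total(votes, v_total):
    """Rescale votes to sum to v_total, rounding by largest remainders."""
    l1_norm = sum(votes)                 # ||V||_1 for non-negative entries
    rho = v_total / l1_norm              # scaling factor
    rescaled = [rho * v for v in votes]
    result = [math.floor(v) for v in rescaled]

    # Votes still missing after taking the floor of every entry.
    leftover = v_total - sum(result)

    # Party indices ordered by descending fractional remainder.
    order = sorted(
        range(len(votes)),
        key=lambda i: rescaled[i] - result[i],
        reverse=True,
    )
    for i in order[:leftover]:
        result[i] += 1
    return result


print(rescale_to_total([520.0, 310.0, 270.0], 1000))  # [473, 282, 245]
```

In the example the raw votes sum to 1100, the floors sum to 998, and the two missing votes go to the parties with the largest fractional parts.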

In [3]:
raw_mse, raw_rmse, time = ff.calculate_mse(improved_linear_function, possible_results)

mse = format_scientific_notation(raw_mse)
rmse = format_scientific_notation(raw_rmse * 100)

display(Markdown("### Calculated Mean Squared Error for the Improved Linear Model:" + "\n" +
r"$$\Delta_{MSE} = " + mse + r"$$" + "\n" + r"$$\Delta_{RMSE} = " + rmse + r"\%$$" + "\n" + r"$$\mathrm{t_{{avg}}}=" + format_time(time) + r"$$"))

### Calculated Mean Squared Error for the Improved Linear Model:
$$\Delta_{MSE} = 2.01 \times 10^{12}$$
$$\Delta_{RMSE} = 8.57 \times 10^{-3}\%$$
$$\mathrm{t_{{avg}}}=78.50 ms$$

## Linear splines model with total sum correction

Below is the piecewise function used in this model:
$$
f_n(x) = 
\begin{cases} 
      b_n + \frac{1-b_n}{1-x_0}(x-x_0) & \text{for } x \in \langle x_0, 1 \rangle \\
      \frac{b_n}{x_0}x & \text{for } x \in \langle 0, x_0 )
\end{cases}
$$
Function $f_n$ has the following properties:
- $f_n: \langle 0,1 \rangle \to \langle 0,1 \rangle$
- It is continuous on $\langle 0,1 \rangle$: $f_n \in \mathrm{C}_{\langle 0,1 \rangle}$
- Its derivative is discontinuous at $x_0$ (unless $b_n = x_0$): $\frac{df_n}{dx} \notin \mathrm{C}_{\langle 0,1 \rangle}$

As in the basic linear model, the countrywide sum is preserved on both pieces:
$$ a_{total}x = \displaystyle \sum_{n\in N} a_n f_n(x) $$
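The piecewise definition can be written as a short function. This is a minimal sketch that assumes $0 < x_0 < 1$. The name `linear_spline` and the example values are hypothetical, and it is not the project's `linear_splines` implementation.

```python
def linear_spline(x, x0, b_n):
    """Piecewise linear f_n through (0, 0), (x0, b_n) and (1, 1)."""
    if x < x0:
        # Lower piece: the basic linear model.
        return b_n / x0 * x
    # Upper piece: straight line from (x0, b_n) to (1, 1).
    return b_n + (1 - b_n) / (1 - x0) * (x - x0)


# Hypothetical unit with historical share 0.5 at a countrywide share of 0.4.
x0, b_n = 0.4, 0.5
values = [linear_spline(x, x0, b_n) for x in (0.0, 0.2, 0.4, 0.7, 1.0)]

# Slopes on both sides of x0; they differ unless b_n == x0.
left_slope = b_n / x0
right_slope = (1 - b_n) / (1 - x0)
```

The function passes through $(0, 0)$, $(x_0, b_n)$ and $(1, 1)$, so it stays inside $\langle 0,1 \rangle$, while the change of slope at $x_0$ is the discontinuity of the derivative noted above.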



In [4]:
raw_mse, raw_rmse, time = ff.calculate_mse(linear_splines, possible_results)

mse = format_scientific_notation(raw_mse)
rmse = format_scientific_notation(raw_rmse * 100)

display(Markdown("### Calculated Mean Squared Error for the Linear splines Model:" + "\n" +
r"$$\Delta_{MSE} = " + mse + r"$$" + "\n" + r"$$\Delta_{RMSE} = " + rmse + r"\%$$" + "\n" + r"$$\mathrm{t_{{avg}}}=" + format_time(time) + r"$$"))

### Calculated Mean Squared Error for the Linear splines Model:
$$\Delta_{MSE} = 1.81 \times 10^{11}$$
$$\Delta_{RMSE} = 7.72 \times 10^{-4}\%$$
$$\mathrm{t_{{avg}}}=98.62 ms$$

## Piecewise Cubic Hermite Model with total sum preservation
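The model is based on SciPy's PCHIP interpolation. Function $f_n$ has the following properties:
- $f_n: \langle 0,1 \rangle \to \langle 0,1 \rangle$
- It is continuous on $\langle 0,1 \rangle$: $f_n \in \mathrm{C}_{\langle 0,1 \rangle}$
- Its derivative is continuous on $\langle 0,1 \rangle$: $\frac{df_n}{dx} \in \mathrm{C}_{\langle 0,1 \rangle}$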
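A minimal sketch of such an interpolant is shown below, using `scipy.interpolate.PchipInterpolator`. The choice of nodes $(0, 0)$, $(x_0, b_n)$ and $(1, 1)$ is an assumption that mirrors the linear splines model. The notebook does not state which nodes the project's `pchip_interpolation` uses, and the name `pchip_share` is hypothetical.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def pchip_share(x0, b_n):
    """Build a PCHIP interpolant through (0, 0), (x0, b_n) and (1, 1).

    Assumes 0 < x0 < 1, because the nodes must be strictly increasing.
    """
    nodes_x = np.array([0.0, x0, 1.0])
    nodes_y = np.array([0.0, b_n, 1.0])
    return PchipInterpolator(nodes_x, nodes_y)


# Hypothetical unit with historical share 0.5 at a countrywide share of 0.4.
f = pchip_share(0.4, 0.5)
xs = np.linspace(0.0, 1.0, 101)
ys = f(xs)
```

PCHIP preserves the monotonicity of the data, so with increasing nodes the interpolant stays within $\langle 0,1 \rangle$ and has a continuous first derivative. Building one interpolant per territorial unit and party is one plausible reason for a higher running time, although the notebook does not confirm where the time is spent.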

In [5]:
raw_mse, raw_rmse, time = ff.calculate_mse(pchip_interpolation, possible_results)

mse = format_scientific_notation(raw_mse)
rmse = format_scientific_notation(raw_rmse * 100)

display(Markdown("### Calculated Mean Squared Error for the Piecewise Cubic Hermite Model:" + "\n" +
r"$$\Delta_{MSE} = " + mse + r"$$" + "\n" + r"$$\Delta_{RMSE} = " + rmse + r"\%$$" + "\n" + r"$$\mathrm{t_{{avg}}}=" + format_time(time) + r"$$"))

### Calculated Mean Squared Error for the Piecewise Cubic Hermite Model:
$$\Delta_{MSE} = 1.02 \times 10^{12}$$
$$\Delta_{RMSE} = 4.36 \times 10^{-3}\%$$
$$\mathrm{t_{{avg}}}=12.02 s$$

# Summary
To conclude, the most efficient model is the **piecewise linear function** (linear splines). It achieves the lowest RMSE among the models with total sum correction ($7.72 \times 10^{-4}\%$) and correctly maps the interval $\langle 0,1 \rangle$ to $\langle 0,1 \rangle$, unlike the other linear models.

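The **PCHIP model** is the smoothest of the models that map $\langle 0,1 \rangle$ to $\langle 0,1 \rangle$, being the only one of them in the $\mathrm{C}^1$ class. However, its average computation time (12.02 s, compared with under 200 ms for each of the other models) is significantly higher, making it inefficient for practical use.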

