# Example: Swiss Referenda

This notebook shows how to use the `predikon` library to predict vote outcomes.
The data is a subsample (10%) of Swiss referenda results.
The full dataset can be found in the [submatrix-factorization](https://github.com/indy-lab/submatrix-factorization/blob/master/data/munvoteinfo.pkl) repo.

## Imports

In [1]:
import numpy as np
from predikon import LogisticSubSVD, GaussianSubSVD, WeightedAveraging

DATA_PATH = '../tests/data/'

## Load Data

Each entry `data[i,j]` is the fraction of "yes" votes in region `i` for referendum `j`.
A region in this dataset is a Swiss municipality.
The `weights` are the number of valid votes in each municipality.
The `outcomes` are the aggregate national outcomes for each referendum.

In [2]:
# Use the built-in `float` and `int`: the `np.float` and `np.int` aliases were removed in NumPy 1.24.
data = np.loadtxt(f'{DATA_PATH}/data.csv', dtype=float, delimiter=',')
weights = np.loadtxt(f'{DATA_PATH}/weights.csv', dtype=int, delimiter=',')
outcomes = np.loadtxt(f'{DATA_PATH}/outcomes.csv', dtype=float, delimiter=',')

## Prepare Data

The matrix `Y` contains the regional results of the `V` historical votes.
The vector `y` contains the regional results of the vote for which we would like to make predictions.

In [3]:
Y, y = data[:, :-1], data[:, -1]
ytrue = outcomes[-1]
R, V = Y.shape
print(f'Number of regions: {R:>3}')
print(f'Number of votes:   {V:>3}')

Number of regions: 217
Number of votes:    31


## Set Observations

Set which regions are observed.
The unobserved regional results are `nan`.

In [4]:
# Fix the seed for reproducibility.
np.random.seed(200)
# Random permutation of the regions.
inds = np.random.permutation(R)
# Proportion of observed results.
p = 0.1
# Number of observations (10 %).
n = int(np.ceil(R * p))
# Set observations.
obs = inds[:n]
# Define new vector of (partial) regional results.
ynew = np.full(R, np.nan)
ynew[obs] = y[obs]
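
As a quick sanity check on the masking step, the sketch below rebuilds a partially observed vector on synthetic data and uses `np.isnan` to confirm that only the selected regions carry a value. All names ending in `_toy` and the values they hold are made up for illustration; they are not part of the dataset or of `predikon`.

```python
import numpy as np

# Synthetic stand-in for the regional results (illustrative values only).
rng = np.random.default_rng(0)
R_toy = 20
y_toy = rng.uniform(0.3, 0.7, size=R_toy)

# Observe 10% of the regions, rounding up as in the cell above.
p_toy = 0.1
n_toy = int(np.ceil(R_toy * p_toy))
obs_toy = rng.permutation(R_toy)[:n_toy]

# Unobserved regions stay NaN; observed regions get their result.
ynew_toy = np.full(R_toy, np.nan)
ynew_toy[obs_toy] = y_toy[obs_toy]

# Boolean mask of the observed regions.
observed_mask = ~np.isnan(ynew_toy)
print(f'Observed regions: {observed_mask.sum()} / {R_toy}')
```

The same `~np.isnan(ynew)` mask can be applied to the real `ynew` vector to check which municipalities are treated as observed.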

## Evaluate Models

We evaluate three models:
1. A weighted average baseline
2. Our algorithm with a Gaussian likelihood
3. Our algorithm with a Bernoulli likelihood

We set the number of latent dimensions to `D=10` and the regularizer to `reg=1e-5`.
We report the predicted aggregate outcome and compare it against the true aggregate outcome.
An aggregate outcome is the weighted average of the regional observations and the regional predictions, where the weight of each region is its number of valid votes.
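
The aggregation described above can be sketched on a toy example with three regions. The arrays `weights_toy` and `pred_toy` are hypothetical values chosen for illustration; `pred_toy` stands for a vector that already combines observed and predicted regional results.

```python
import numpy as np

# Toy example with three regions (illustrative values only).
weights_toy = np.array([100, 300, 600])   # valid votes per region
pred_toy = np.array([0.40, 0.50, 0.60])   # observed or predicted fraction of "yes"

# Weighted average, with the number of valid votes as weights.
agg_manual = weights_toy.dot(pred_toy) / weights_toy.sum()

# Equivalent computation with NumPy's built-in weighted average.
agg_numpy = np.average(pred_toy, weights=weights_toy)

print(f'Aggregate outcome: {agg_manual * 100:.2f}%')
```

Because the largest region leans toward "yes", the aggregate (55%) sits above the unweighted mean of the three regional results (50%).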

In [5]:
# Hyperparameters: number of latent dimensions and L2 regularizer.
D, reg = 10, 1e-5
# Define models.
base = WeightedAveraging(Y, weighting=weights)
gaus = GaussianSubSVD(Y, weighting=weights, n_dim=D, add_bias=True, l2_reg=reg)
bern = LogisticSubSVD(Y, weighting=weights, n_dim=D, add_bias=True, l2_reg=reg)

for model in [base, gaus, bern]:
    print(model)
    # Predict missing results.
    pred = model.fit_predict(ynew)
    # Compute aggregate outcome (average weighted by the number of valid votes).
    ypred = weights.dot(pred) / np.sum(weights)
    print(f'  Predicted outcome: {ypred*100:.2f}%')
    print(f'  True outcome:      {ytrue*100:.2f}%')
    print(f'  Absolute diff.:    {np.abs(ypred - ytrue)*100:.4f}\n')

Weighted Averaging
  Predicted outcome: 52.13%
  True outcome:      55.67%
  Absolute diff.:    3.5426

GaussianSubSVD (dim=10,l2=1e-05)
  Predicted outcome: 54.32%
  True outcome:      55.67%
  Absolute diff.:    1.3529

Logistic SubSVD (dim=10,l2=1e-05)
  Predicted outcome: 54.30%
  True outcome:      55.67%
  Absolute diff.:    1.3713

