# Data Literacy
#### University of Tübingen, Winter Term 2021/22
## Exercise Sheet 4
&copy; 2021 Prof. Dr. Philipp Hennig & Nico Krämer & Emilia Magnani

This sheet is **due on Monday, November 22, 2021 at 10am sharp (i.e. before the start of the lecture).**

---

## Data estimation
Last week, we looked at maximum likelihood estimation for exit polls / election data in the context of the German general election in September 2021.
This week, we will continue the analysis and augment the maximum likelihood estimators from last week with uncertainty quantification.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# For the docstrings / type hints of the functions we provide.
from typing import Union, Optional, Tuple

The next snippet loads the data and extracts the results for one party, the true voting share of that party, and some other useful quantities.

In [None]:
# Load the data
data = pd.read_csv("data_slim.csv")


# Choose one party here.
my_party = "SPD"

# Grouped results
result_my_party = int(data[data["Gruppenname"] == my_party]["Anzahl"].sum())
result_others = int(data[data["Gruppenname"] != my_party]["Anzahl"].sum())

# True proportion of votes that `my_party` received
truth = result_my_party / (result_my_party + result_others)


# All votes as an array of strings
votes_all = np.concatenate(
    (np.tile(my_party, result_my_party), np.tile("Not " + my_party, result_others))
)

# An array of the relevant parties
parties_all = np.array([my_party, "Not " + my_party])

The next snippet provides a function that simulates an exit poll. You can use your solution from last week's sheet instead.

In [None]:
def exit_poll(
    rng: np.random.Generator,
    *,
    poll_size: int,
    votes: np.ndarray,
    parties: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    """Conduct an exit poll.

    Parameters
    ----------
    rng
        Random number generator.
    poll_size
        Poll size. How many people are polled.
    votes
        The true election results.
    parties
        List of parties.

    Returns
    -------
    Exit poll counts and full exit poll.
    """
    poll = rng.choice(votes, size=(poll_size,), replace=False)
    poll_counts = count(poll=poll[None, :], parties=parties)
    return poll_counts[0], poll


def count(poll: np.ndarray, parties: np.ndarray) -> np.ndarray:
    """Count the number of occurrences of each party in a batch of exit polls."""
    return np.count_nonzero(poll[..., None] == parties[None, None, :], axis=1)


# Quick check that the function runs. The seed is arbitrary.
rng = np.random.default_rng(seed=1)
exit_poll_counts, _ = exit_poll(
    rng, poll_size=100, votes=votes_all, parties=parties_all
)

## Uncertainty quantification via Fisher information 

In an exit poll for an election with $K$ parties, the counts $N_1, ..., N_K$ of the $K$ parties follow a multinomial distribution,
$$
p(N_1, ..., N_K \mid \pi_1, ..., \pi_K) = \frac{\Gamma\left(\sum_k N_k + 1 \right)}{\prod_k \Gamma(N_k + 1)} \prod_{k=1}^K \pi_k^{N_k},
$$
where $\Gamma$ is the Gamma function.
Let $|N| = \sum_k N_k$. Given a sample $(N_1, ..., N_K)$ (an exit poll), the maximum likelihood estimate for $\pi = (\pi_1, ..., \pi_K)$ is
$$
\hat \pi = (N_1 / |N|, ..., N_K / |N|).
$$
In the following, we will consider the case of $K=2$ (the counts for one party, and the counts for "not" this party, i.e., all the others).
This reduces the multinomial distribution to a binomial distribution, with parameters $(\pi, 1 - \pi)$.
You know from the lecture that the Fisher information for this setup is
$$
I(\pi) = \frac{|N|}{\pi (\pi - 1)}.
$$
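Asymptotically, the MLE is Gaussian-distributed around the true parameter,
$\hat{\pi} \sim \mathcal{N}\left(\pi_\text{truth},  - I(\hat \pi)^{-1}\right)$.
With the sign convention above, $I(\pi)$ is negative for $\pi \in (0, 1)$, so $-I(\hat \pi)^{-1}$ is a positive variance.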
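As a minimal sketch of how the formula above translates into code, the hypothetical helper `asymptotic_variance` (not part of the provided code) evaluates $-I(\hat \pi)^{-1}$ using the sign convention of this sheet. The poll counts are made up for illustration, and the helper assumes $0 < \hat \pi < 1$.

```python
import numpy as np


def asymptotic_variance(pi_hat: float, poll_size: int) -> float:
    """Asymptotic variance of the binomial MLE, i.e. -1 / I(pi_hat).

    Assumes 0 < pi_hat < 1; otherwise the Fisher information is undefined.
    """
    # Sign convention of this sheet: I(pi) = |N| / (pi * (pi - 1)) < 0.
    fisher = poll_size / (pi_hat * (pi_hat - 1.0))
    return -1.0 / fisher


# Hypothetical poll: 26 of 100 respondents voted for the party.
variance = asymptotic_variance(pi_hat=0.26, poll_size=100)
std = np.sqrt(variance)
```

Algebraically, the returned value equals $\hat \pi (1 - \hat \pi) / |N|$, so the standard deviation shrinks like $1 / \sqrt{|N|}$.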

**Task:**
Use the formula for the Fisher information to write a function that computes the (asymptotic) covariance of the MLE. 

**Task:**
Conduct an (artificial) exit poll, and evaluate how the covariance evolves for increasing exit poll sizes $|N|$.
To this end, plot the Gaussian approximation $f(\pi) = \mathcal{N}(\pi; \hat \pi, -I(\hat \pi)^{-1})$ and the true likelihood function $p(N \mid \pi)$ for $|N| \in \{10, 20, 50 \}$, and compare them to the true vote share $\pi_\text{truth}$.
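One possible way to prepare the two curves for this comparison is sketched below. The helper names `gaussian_pdf` and `normalised_likelihood` and the poll counts are hypothetical. Rescaling the likelihood so that it integrates to one over the grid is a choice made here to put both curves on the same scale; the sheet does not prescribe it.

```python
import numpy as np


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / variance) / np.sqrt(2.0 * np.pi * variance)


def normalised_likelihood(pi_grid, n_party, poll_size):
    """Binomial likelihood as a function of pi, rescaled to integrate to 1 on the grid."""
    log_lik = n_party * np.log(pi_grid) + (poll_size - n_party) * np.log1p(-pi_grid)
    lik = np.exp(log_lik - log_lik.max())  # subtract the maximum for numerical stability
    return lik / (lik.sum() * (pi_grid[1] - pi_grid[0]))


pi_grid = np.linspace(0.001, 0.999, 999)
curves = {}
for poll_size, n_party in [(10, 3), (20, 5), (50, 13)]:  # made-up poll counts
    pi_hat = n_party / poll_size
    variance = pi_hat * (1.0 - pi_hat) / poll_size  # equals -1 / I(pi_hat)
    curves[poll_size] = (
        gaussian_pdf(pi_grid, pi_hat, variance),
        normalised_likelihood(pi_grid, n_party, poll_size),
    )
```

Each entry of `curves` can then be drawn with `plt.plot(pi_grid, ...)`, with a vertical line at `truth` added via `plt.axvline`.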


## Uncertainty quantification via bootstrap estimation

In the current setting, we can compute the Fisher information in closed form, but this is often not the case.
An alternative is the bootstrap estimator, which resamples a given data set repeatedly to quantify the variability of an estimator.
More precisely, we resample the conducted exit poll with replacement and recompute the estimator.
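The resampling step can be sketched as follows. The function name `bootstrap_mle` and the small made-up poll are assumptions for illustration; in the notebook, the `poll` array returned by `exit_poll` would take the place of the synthetic one.

```python
import numpy as np


def bootstrap_mle(rng, poll, party, num_samples):
    """Resample the poll with replacement and recompute the MLE for `party`."""
    poll_size = poll.shape[0]
    # Each row is one bootstrap replicate with the same size as the original poll.
    resampled = rng.choice(poll, size=(num_samples, poll_size), replace=True)
    # The MLE is the fraction of votes for `party` in each replicate.
    return np.mean(resampled == party, axis=1)


rng = np.random.default_rng(seed=1)
# Made-up poll of 1000 respondents, 260 of them voting for "SPD".
poll = np.array(["SPD"] * 260 + ["Not SPD"] * 740)
mles = bootstrap_mle(rng, poll, party="SPD", num_samples=500)
```

The spread of `mles` (for instance its standard deviation or a histogram) then quantifies the variability of the estimator.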


Instead of the (nonparametric) bootstrap described above, one can also use a parametric bootstrap. There, instead of resampling the data with replacement, we use the knowledge that $(N_1, ..., N_K) \sim p(N \mid \pi)$ follows a binomial (or, for $K > 2$, multinomial) distribution with parameter $\pi=(\pi_1, ..., \pi_K)$. We parametrise this distribution with the MLE, sample from $p(N \mid \hat \pi)$, and recompute the maximum likelihood estimate for each sample.
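For the binomial case ($K=2$), a parametric bootstrap might look like the sketch below. The function name `parametric_bootstrap_mle` and the counts are hypothetical.

```python
import numpy as np


def parametric_bootstrap_mle(rng, n_party, poll_size, num_samples):
    """Sample new counts from Binomial(|N|, pi_hat) and recompute the MLE."""
    pi_hat = n_party / poll_size
    # Draw synthetic poll counts from the model fitted with the MLE.
    counts = rng.binomial(n=poll_size, p=pi_hat, size=num_samples)
    return counts / poll_size


rng = np.random.default_rng(seed=2)
# Made-up poll: 260 of 1000 respondents voted for the party.
param_mles = parametric_bootstrap_mle(rng, n_party=260, poll_size=1000, num_samples=500)
```

Unlike the nonparametric variant, this only needs the counts, not the full array of individual votes.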


**Task:**
Implement the bootstrap estimator and the parametric bootstrap estimator for the MLE of $\pi$. Choose a poll size of, e.g., $|N|=1000$. Repeat the plot from above, but replace the Gaussian bell with a histogram of bootstrapped MLEs. Choose the number of bootstrap samples appropriately.

We can use the bootstrap estimator on a wide range of estimates. For example, we can quantify the uncertainty of an estimate of $P(\{\text{My Party}\} > \text{threshold})$.
Last week, we saw how to compute, for a given exit poll $N=(N_1, ..., N_K)$, the probability of a party receiving more than a certain share of votes. Using the bootstrap (or parametric bootstrap), we can resample the data and recompute this probability for each sample.
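A rough sketch of this resampling is given below. It is an assumption-laden illustration: the probability is computed with a Gaussian approximation based on the asymptotic variance $\hat \pi (1 - \hat \pi) / |N|$, which is not necessarily the method from last week's sheet, and the names `prob_exceeds` and `bootstrap_prob_exceeds` as well as the counts are hypothetical.

```python
import math

import numpy as np


def prob_exceeds(pi_hat, poll_size, threshold):
    """Gaussian approximation of P(pi > threshold) given an estimate pi_hat."""
    std = math.sqrt(pi_hat * (1.0 - pi_hat) / poll_size)
    if std == 0.0:  # degenerate resample: all or no votes for the party
        return float(pi_hat > threshold)
    z = (pi_hat - threshold) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z


def bootstrap_prob_exceeds(rng, n_party, poll_size, threshold, num_samples):
    """Parametric bootstrap of the exceedance probability."""
    pi_hat = n_party / poll_size
    counts = rng.binomial(n=poll_size, p=pi_hat, size=num_samples)
    return np.array(
        [prob_exceeds(c / poll_size, poll_size, threshold) for c in counts]
    )


rng = np.random.default_rng(seed=3)
# Made-up poll: 260 of 1000 respondents voted for the party.
probs = bootstrap_prob_exceeds(
    rng, n_party=260, poll_size=1000, threshold=0.22, num_samples=50
)
```

Swapping `prob_exceeds` for the estimator from last week leaves the bootstrap loop unchanged.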

**Task:**
Implement this resampling, and plot 50 bootstrap samples, each of which is an estimate of the probability of the SPD exceeding 22% of the votes (a share of 0.22), for increasing exit poll sizes $|N|$. How large does the poll have to be for us to be confident that the SPD exceeds 22% of the votes?

Instead of plotting the individual samples, we can also track how their standard deviation evolves as the exit poll size increases.
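The following sketch illustrates the idea on the bootstrapped MLEs rather than on the exceedance probabilities, purely to keep the example short; the same loop applies to any bootstrapped quantity. The value `pi_true`, the poll sizes, and the number of samples are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
pi_true = 0.26  # made-up true share, stands in for `truth`
poll_sizes = np.array([100, 400, 1600, 6400])
num_samples = 2000

stds = []
for poll_size in poll_sizes:
    n_party = rng.binomial(n=poll_size, p=pi_true)  # one simulated exit poll
    pi_hat = n_party / poll_size
    # Parametric bootstrap of the MLE for this poll.
    bootstrapped = rng.binomial(n=poll_size, p=pi_hat, size=num_samples) / poll_size
    stds.append(bootstrapped.std())
stds = np.array(stds)

# Asymptotic standard deviation of the MLE at the true share, for reference.
reference = np.sqrt(pi_true * (1.0 - pi_true) / poll_sizes)
```

Plotting `stds` and `reference` against `poll_sizes` on doubly logarithmic axes should show both decaying roughly like $1 / \sqrt{|N|}$.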

**Task:**
Plot the standard deviation of the bootstrap samples from above against the exit poll size. Compare this to the actual error of the estimate.