# Estimation

In [None]:
from typing import Callable, Tuple
from functools import partial
from dataclasses import dataclass, field

In [None]:
# for retro implementations
import math
import random

In [None]:
# for modern implementations
import numpy as np
import pandas as pd
from scipy import stats

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
sns.set_theme()

In [None]:
import ipywidgets
from IPython.core.pylabtools import figsize
figsize(9, 6)

## The estimation game

In [None]:
sample = np.array([-0.441, 1.774, -0.101, -1.138, 2.975, -2.138])

Suppose this sample was drawn from a normal distribution. What do you think is the mean parameter, μ, of that distribution?

One choice is to use the sample mean, $\bar{x}$, as an estimate of $\mu$.

In this example, $\bar{x}$ is 0.155, so it would be reasonable to guess $\mu = 0.155$.

This process is called *estimation*, and the statistic we used (the sample mean) is called an *estimator*.

Using the sample mean to estimate $\mu$ seems obvious, but what if there are outliers?

For example, here is another sample from a normal distribution, collected by a surveyor who sometimes puts the decimal point in the wrong place:


In [None]:
sample_with_outliers = np.array([-0.441, 1.774, -0.101, -1.138, 2.975, -213.8])
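
To see how much a single misplaced decimal point matters, the following minimal sketch compares the sample mean and the sample median of the sample above. It uses only the standard library's `statistics` module, in the spirit of the "retro" implementations, and the variable names are illustrative.

```python
import statistics

# the sample above, with one value recorded as -213.8 instead of -2.138
outlier_sample = [-0.441, 1.774, -0.101, -1.138, 2.975, -213.8]

mean_with_outlier = statistics.mean(outlier_sample)
median_with_outlier = statistics.median(outlier_sample)

print(f'mean:   {mean_with_outlier:.3f}')    # dragged far from 0 by the outlier
print(f'median: {median_with_outlier:.3f}')  # barely affected by the outlier
```

The median depends only on the middle values of the sorted sample, so one extreme value leaves it almost unchanged.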

### RMSE

In [None]:
# how can this code be improved?
def RMSE(estimates, actual):
    e2 = [(estimate-actual)**2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)


In [None]:
def rmse(estimates: np.ndarray, actual: np.float64) -> np.float64:
    '''
    Returns the square root of the mean of the squares of the errors
    '''
    return np.sqrt(((estimates-actual)**2).mean())

In [None]:
print(f'RMSE: {RMSE(sample, sample.mean())}')
print(f'rmse: {rmse(sample, sample.mean())}')

Both functions above compute the root mean squared error (RMSE), which is the square root of the mean squared error (MSE).

If there are no outliers, the sample mean minimizes the mean squared error (MSE). That is, if we play the game many times, and each time compute the error $\bar{x}-\mu$, the sample mean minimizes


$$
\mathrm{MSE} = \frac{1}{m} \sum (\bar{x}-\mu)^2
$$

where $m$ is the number of times we play the game.

Here is a function that simulates the estimation game and reports the RMSE of the sample mean and the sample median:

In [None]:
def Estimate1(n=7, m=1000):
    mu = 0
    sigma = 1
    means = []
    medians = []
    # sample n values m times
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        xbar = np.mean(xs)
        median = np.median(xs)
        means.append(xbar)
        medians.append(median)
    print('rmse xbar', RMSE(means, mu))
    print('rmse median', RMSE(medians, mu))


In [None]:
Estimate1()

In [None]:
def norm_estimate(n: int = 7, m: int = 1000, estimator: Callable = np.mean) -> np.float64:
    '''
    computes the root mean square error
    
    :param n: the sample size used to compute mu
    :param m: the number of estimates (means)
    :param estimator: The function used as an estimator - e.g mean or median
    
    :returns the rmse of all sampled means from the true mean
    '''
    
    std_norm = stats.norm(0, 1)
    # generate a list of estimates
    xs = np.array([estimator(std_norm.rvs(n)) for _ in range(m)])
    return rmse(xs, 0)
    

In [None]:
print(f'Mean: {norm_estimate():0.2f}')
print(f'Median: {norm_estimate(estimator=np.median):0.2f}')

When I ran this code, the RMSE of the sample mean was 0.39, which means that if we use $\bar{x}$ to estimate the mean of this distribution, based on a sample with n = 7, we should expect to be off by 0.39 on average. Using the median to estimate the mean yields RMSE 0.46, which confirms that $\bar{x}$ yields lower RMSE, at least for this example.
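
As a rough cross-check on the simulation, the standard error of the sample mean for a normal distribution is $\sigma/\sqrt{n}$, and for large $n$ the standard error of the sample median is approximately $\sqrt{\pi/2}\,\sigma/\sqrt{n}$. The sketch below evaluates both for $n = 7$. The formula for the median is only a large-sample approximation, so a small difference from the simulated value is expected at this sample size.

```python
import math

n = 7
sigma = 1

# exact standard error of the sample mean
se_mean = sigma / math.sqrt(n)
# large-sample approximation for the sample median of a normal distribution
se_median_approx = math.sqrt(math.pi / 2) * sigma / math.sqrt(n)

print(f'mean:   {se_mean:.3f}')
print(f'median: {se_median_approx:.3f}')
```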

Minimizing MSE is a nice property, but it’s not always the best strategy. For example, suppose we are estimating the distribution of wind speeds at a building site. If the estimate is too high, we might overbuild the structure, increasing its cost. But if it’s too low, the building might collapse. Because cost as a function of error is not symmetric, minimizing MSE is not the best strategy.
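
A small sketch can make the asymmetry concrete. Everything in it is hypothetical: the wind speeds are simulated from a normal distribution with made-up parameters, and the cost function simply charges nine times as much for underestimating as for overestimating. Under those assumptions, the estimate with the lowest average cost sits well above the sample mean.

```python
import random
import statistics

random.seed(17)

# hypothetical wind speeds (m/s); the parameters are made up for illustration
speeds = [random.gauss(20, 5) for _ in range(2000)]

def average_cost(estimate, observed, under=9.0, over=1.0):
    """Mean cost of designing for `estimate` (hypothetical asymmetric cost)."""
    total = 0.0
    for x in observed:
        if x > estimate:
            total += under * (x - estimate)   # underbuilt: expensive
        else:
            total += over * (estimate - x)    # overbuilt: cheaper
    return total / len(observed)

# grid search over candidate design speeds from 10.0 to 39.9 m/s
candidates = [c / 10 for c in range(100, 400)]
best = min(candidates, key=lambda c: average_cost(c, speeds))

print(f'sample mean:        {statistics.mean(speeds):.1f}')
print(f'lowest-cost choice: {best:.1f}')
```

With these weights the minimizer lands near the 90th percentile of the observed speeds rather than near their mean.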


As another example, suppose I roll three six-sided dice and ask you to predict the total. If you get it exactly right, you get a prize; otherwise you get nothing. In this case the value that minimizes MSE is 10.5, but that would be a bad guess, because the total of three dice is never 10.5. For this game, you want an estimator that has the highest chance of being right, which is a __maximum likelihood estimator__ (MLE). If you pick 10 or 11, your chance of winning is 1 in 8, and that’s the best you can do.
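
The dice example is small enough to check by enumerating every outcome. The following sketch, which uses only the standard library, computes the distribution of the total, its expected value, and the most likely totals.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# enumerate all 6**3 = 216 equally likely outcomes of three dice
totals = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))
pmf = {total: Fraction(count, 216) for total, count in totals.items()}

expected_total = sum(total * p for total, p in pmf.items())
best_prob = max(pmf.values())
best_guesses = sorted(t for t, p in pmf.items() if p == best_prob)

print(f'expected total: {float(expected_total)}')  # the guess that minimizes MSE
print(f'most likely totals: {best_guesses}, each with probability {best_prob}')
```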

## Guess the variance
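
For reference, the two estimators compared in this section can be written out as follows. The notation $S^2$ and $S_{n-1}^2$ matches the rest of the section. The expectations in the last line are a standard result for independent samples with variance $\sigma^2$, stated here without proof.

$$
S^2 = \frac{1}{n} \sum (x_i - \bar{x})^2
$$

$$
S_{n-1}^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2
$$

$$
E[S^2] = \frac{n-1}{n} \sigma^2, \qquad E[S_{n-1}^2] = \sigma^2
$$

So $S^2$ underestimates $\sigma^2$ by $\sigma^2/n$ on average, which is about 0.14 for $n = 7$ and $\sigma = 1$.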

In [None]:
def var(data: np.ndarray) -> np.float64:
    # biased estimator S^2: divide by n
    return ((data - data.mean())**2).sum() / len(data)

def sample_var(data: np.ndarray) -> np.float64:
    # unbiased estimator S_{n-1}^2: divide by n - 1
    return ((data-data.mean())**2).sum() / (len(data)-1)

In [None]:
# test both sets of functions: ours and NumPy's should agree
print(f'biased: {var(sample):0.2f} (numpy: {np.var(sample):0.2f})')
print(f'unbiased: {sample_var(sample):0.2f} (numpy: {np.var(sample, ddof=1):0.2f})')

Here is a function that simulates the estimation game and tests the performance of $S^2$ and $S_{n-1}^{2}$:

In [None]:
def MeanError(estimates, actual):
    # computes the mean difference between the estimates and the actual value:
    errors = [estimate-actual for estimate in estimates]
    return np.mean(errors)


def Estimate2(n=7, m=1000):
    mu = 0
    sigma = 1
    estimates1 = []
    estimates2 = []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        biased = np.var(xs)
        unbiased = np.var(xs, ddof=1)
        estimates1.append(biased)
        estimates2.append(unbiased)
    print('mean error biased', MeanError(estimates1, sigma**2))
    print('mean error unbiased', MeanError(estimates2, sigma**2))

In [None]:
Estimate2()

In [None]:
def mean_error(estimates: np.ndarray, actual: np.float64) -> np.float64:
    # mean signed difference between the estimates and the actual value
    return (estimates - actual).mean()

def var_estimator(n: int = 7, m: int = 1000, estimator: Callable = np.var) -> np.float64:
    std_norm = stats.norm(0, 1)
    # list of estimated variances
    estimates = np.array([estimator(std_norm.rvs(n)) for _ in range(m)])
    # 1 squared is 1
    return mean_error(estimates, 1)

In [None]:
# change n and see what happens
print(f'biased: {var_estimator(estimator=np.var):.2f}')
print(f'unbiased: {var_estimator(estimator=partial(np.var, ddof=1)):.2f}')

In [None]:
def test_estimator(num_samples: int):
    print(f'Samples: {num_samples}')
    print(f'Biased: {var_estimator(n=num_samples, estimator=np.var):.3f}')
    print(f'Unbiased: {var_estimator(n=num_samples, estimator=partial(np.var, ddof=1)):.3f}')

In [None]:
ipywidgets.interact(
    test_estimator,
    num_samples=ipywidgets.IntSlider(
        value=5,
        min=2,  # the unbiased estimator divides by n - 1, so it needs n >= 2
        max=100,
        description='Sample size:'
    )
);

In [None]:
nvals = np.arange(2, 101)
estimates = np.array([
    var_estimator(n=n, estimator=partial(np.var, ddof=1)) for n in nvals
])


In [None]:
p = sns.lineplot(
    x=nvals,
    y=estimates
);
p.set(
    xlim=(0, 100),
    ylim=(-0.05, 0.05),
    xlabel = r'Sample size used to estimate $\sigma^2$',
    ylabel = 'Mean error of the unbiased estimator'
);

In [None]:
ipywidgets.interact(
    test_estimator,
    num_samples=ipywidgets.Dropdown(
        options=[5, 10, 50, 100, 500, 1000, 10000],
        value=5,
        description='Sample size:'
    )
);

## Sampling distributions

Suppose you are a scientist studying gorillas in a wildlife preserve. You want to know the average weight of the adult female gorillas in the preserve. To weigh them, you have to tranquilize them, which is dangerous, expensive, and possibly harmful to the gorillas. But if it is important to obtain this information, it might be acceptable to weigh a sample of 9 gorillas. Let’s assume that the population of the preserve is well known, so we can choose a representative sample of adult females. We could use the sample mean, $\bar{x}$, to estimate the unknown population mean, μ.


Having weighed 9 female gorillas, you might find $\bar{x} = 90$ kg and sample standard deviation $S = 7.5$ kg. The sample mean is an unbiased estimator of &mu;, and in the long run it minimizes MSE. So if you report a single estimate that summarizes the results, you would report 90 kg.


But how confident should you be in this estimate? If you only weigh n = 9 gorillas out of a much larger population, you might be unlucky and choose the 9 heaviest gorillas (or the 9 lightest ones) just by chance. Variation in the estimate caused by random selection is called sampling error. To quantify sampling error, we can simulate the sampling process with hypothetical values of &mu; and &sigma;, and see how much $\bar{x}$ varies.

Since we don’t know the actual values of &mu; and &sigma; in the population, we’ll use the estimates $\bar{x}$ and S. So the question we answer is: “If the actual values of &mu; and &sigma; were 90 kg and 7.5 kg, and we ran the same experiment many times, how much would the estimated mean, $\bar{x}$, vary?” The following function answers that question:

In [None]:
def SimulateSample(mu=90, sigma=7.5, n=9, m=1000):
    means = []
    for j in range(m):
        xs = np.random.normal(mu, sigma, n)
        xbar = np.mean(xs)
        means.append(xbar)
    means = np.array(means)
    ci = np.percentile(means, (5, 95))
    stderr = RMSE(means, mu)
    print('standard error', stderr)
    print('confidence interval', ci)

In [None]:
SimulateSample()

In [None]:
@dataclass
class ConfidenceEstimate:
    
    mu: np.float64
    sigma: np.float64
    # the estimates
    means: np.ndarray
        
    def __str__(self):
        # adjacent f-strings are concatenated, which avoids embedding indentation in the output
        return (
            f'mu: {self.mu}, sigma: {self.sigma}, estimates: {len(self.means)}, '
            f'stderr: {self.stderr:.2f}, ci: [{self.ci[0]:.2f}, {self.ci[1]:.2f}]'
        )
    
    @property
    def stderr(self) -> np.float64:
        '''
        Standard error (SE) is a measure of how far we expect the estimate to be off, on average
        '''
        return rmse(self.means, self.mu)
    
    @property
    def ci(self) -> Tuple[np.float64, np.float64]:
        '''
        A confidence interval (CI) is a range that includes a given fraction of the sampling distribution
        '''
        return np.percentile(self.means, (5, 95))
        
        
def simulate_sample(mu: np.float64, sigma: np.float64, n: int=9, m: int=1000) -> ConfidenceEstimate:
    '''
    Runs an experiment m times to see how much an estimated mean varies
    
    :param mu: the estimated mean
    :param sigma: the estimated standard deviation
    :param n: the size of each sample
    :param m: the number of experiments
    '''
    norm_dist = stats.norm(mu, sigma)
    # compute the means of m samples of n items
    means = np.array([
        norm_dist.rvs(n).mean() for _ in range(m)
    ])
    return ConfidenceEstimate(mu, sigma, means)


In [None]:
estimate = simulate_sample(90, 7.5)
print(estimate)

&mu; and &sigma; are the hypothetical values of the parameters. __n__ is the sample size, the number of gorillas we measured. __m__ is the number of times we run the simulation.

Here is a plot of the empirical CDF of the estimates:

In [None]:
p = sns.ecdfplot(
    x = estimate.means
)
p.fill_between(estimate.ci, (1, 1), facecolor='pink', alpha=0.3);
p.axvline(x=estimate.mu, linestyle='--');
p.set(
    xlim=(80, 100),
    xlabel = 'Sample means'
);

This distribution is called the sampling distribution of the estimator. It shows how much the estimates would vary if we ran the experiment over and over.
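
For a normal population, the sampling distribution of $\bar{x}$ is itself normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, so the simulated standard error and confidence interval can be checked against closed-form values. The sketch below does this with the standard library's `statistics.NormalDist`, using the same hypothetical parameters as the simulation (90 kg, 7.5 kg, $n = 9$).

```python
import math
from statistics import NormalDist

mu, sigma, n = 90, 7.5, 9

# standard error of the sample mean for a normal population
stderr = sigma / math.sqrt(n)

# central 90% interval of the sampling distribution N(mu, stderr)
sampling_dist = NormalDist(mu, stderr)
ci = (sampling_dist.inv_cdf(0.05), sampling_dist.inv_cdf(0.95))

print(f'standard error: {stderr:.2f}')
print(f'90% CI: ({ci[0]:.2f}, {ci[1]:.2f})')
```

The simulated values should land close to these, with small differences from run to run because `m` is finite.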

## Exponential Distributions

Let’s play one more round of the estimation game. I’m thinking of a distribution. It’s an exponential distribution, and here’s a sample:

In [None]:
sample = [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]

The mean of an exponential distribution is 1/&lambda;, so working backwards we might choose

$$
L=1/\bar{x}
$$

where L is the maximum likelihood estimator of &lambda;.

But we know that $\bar{x}$ is not robust in the presence of outliers, so we expect L to have the same problem.

We can choose an alternative based on the sample median. The median of an exponential distribution is $\ln(2)/\lambda$, so working backwards again, we can define an estimator

$$
L_{m} = \ln(2)/m
$$

where *m* is the sample median.
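
As a quick illustration, both estimators can be applied to the six values in the sample above. This is a minimal standard-library sketch, and the variable name `xs` is just a convenient label for that sample.

```python
import math
import statistics

# the exponential sample shown above
xs = [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]

L = 1 / statistics.mean(xs)                # estimator based on the sample mean
Lm = math.log(2) / statistics.median(xs)   # estimator based on the sample median

print(f'L:  {L:.3f}')
print(f'Lm: {Lm:.3f}')
```

For this sample the two estimates nearly agree, at about 0.12 each.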

To test the performance of these estimators, we can simulate the sampling process:


In [None]:
def Estimate3(n=7, m=1000):
    lam = 2
    means = []
    medians = []
    for _ in range(m):
        xs = np.random.exponential(1.0/lam, n)
        L = 1 / np.mean(xs)
        Lm = math.log(2) / np.median(xs)
        # append inside the loop so that every simulated sample is recorded
        means.append(L)
        medians.append(Lm)
    print('rmse L', RMSE(means, lam))
    print('rmse Lm', RMSE(medians, lam))
    print('mean error L', MeanError(means, lam))
    print('mean error Lm', MeanError(medians, lam))


In [None]:
Estimate3()

In [None]:
def exp_mean(x: np.ndarray) -> np.float64:
    return 1 / np.mean(x)


def exp_median(x: np.ndarray) -> np.float64:
    return np.log(2) / np.median(x)


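def exp_estimator(lam: np.float64, n=7, m=1000, estimator: Callable=exp_mean) -> np.ndarray:
    # the first positional argument of stats.expon is loc, so the rate must be passed as scale=1/lam
    exp_dist = stats.expon(scale=1/lam)
    estimates = np.array([
        estimator(exp_dist.rvs(n)) for _ in range(m)
    ])
    return estimates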

In [None]:
lam = 2
means = exp_estimator(lam, n=100, estimator=exp_mean)
medians = exp_estimator(lam, n=100, estimator=exp_median)

In [None]:
print(f'rmse L: {rmse(means, lam)}')
print(f'rmse Lm: {rmse(medians, lam)}')
print(f'mean error L: {mean_error(means, lam)}')
print(f'mean error Lm: {mean_error(medians, lam)}')
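
The mean error of $L$ does not shrink to zero for a fixed sample size, because $L$ is biased: for $n$ independent exponential values, $E[1/\bar{x}] = n\lambda/(n-1)$. The following sketch checks this with a standard-library simulation. The seed, the sample size, and the number of iterations are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)

lam, n, iters = 2, 7, 100_000

# each estimate is L = 1 / xbar = n / sum(xs) for a fresh sample of n values
estimates = [
    n / sum(random.expovariate(lam) for _ in range(n))
    for _ in range(iters)
]

simulated_mean = statistics.fmean(estimates)
expected_mean = n * lam / (n - 1)   # theoretical E[L]

print(f'simulated mean of L: {simulated_mean:.3f}')
print(f'theoretical mean:    {expected_mean:.3f}')
```

The bias $\lambda/(n-1)$ shrinks as $n$ grows, which is why the mean error looks small when `n=100`.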

## Exercises

**Exercise:**  In this chapter we used $\bar{x}$ and median to estimate µ, and found that $\bar{x}$ yields lower MSE. Also, we used $S^2$ and $S_{n-1}^2$ to estimate σ², and found that $S^2$ is biased and $S_{n-1}^2$ unbiased.
Run similar experiments to see if $\bar{x}$ and median are biased estimates of µ. Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.

In [None]:
# Solution

def Estimate4(n=7, iters=100000):
    """Mean error for xbar and median as estimators of population mean.

    n: sample size
    iters: number of iterations
    """
    mu = 0
    sigma = 1

    means = []
    medians = []
    for _ in range(iters):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        xbar = np.mean(xs)
        median = np.median(xs)
        means.append(xbar)
        medians.append(median)

    print('Experiment 1')
    print('mean error xbar', MeanError(means, mu))
    print('mean error median', MeanError(medians, mu))
    
Estimate4()

$\bar{x}$ and median yield lower mean error as the number of iterations increases, so neither one is obviously biased, as far as we can tell from the experiment.

In [None]:
# Solution

def Estimate5(n=7, iters=100000):
    """RMSE for biased and unbiased estimators of population variance.

    n: sample size
    iters: number of iterations
    """
    mu = 0
    sigma = 1

    estimates1 = []
    estimates2 = []
    for _ in range(iters):
        xs = [random.gauss(mu, sigma) for i in range(n)]
        biased = np.var(xs)
        unbiased = np.var(xs, ddof=1)
        estimates1.append(biased)
        estimates2.append(unbiased)

    print('Experiment 2')
    print('RMSE biased', RMSE(estimates1, sigma**2))
    print('RMSE unbiased', RMSE(estimates2, sigma**2))

Estimate5()

The biased estimator of variance yields lower RMSE than the unbiased estimator, by about 10%.  And the difference holds up as the number of iterations increases.

In [None]:
def generate_estimates(estimator: Callable, n = 10, iters = 10000):
    """
    Generates an array of estimates for a given statistic.

    n: sample size
    iters: number of iterations
    """
    return np.array([estimator(stats.norm(0, 1).rvs(n)) for _ in range(iters)])

In [None]:
mean_err = mean_error(generate_estimates(np.mean), 0)
median_err = mean_error(generate_estimates(np.median), 0)
var_biased_err = rmse(generate_estimates(np.var), 1)
var_unbiased_err = rmse(generate_estimates(partial(np.var, ddof=1)), 1)
print(f'Mean error: {mean_err:0.4}, Median error: {median_err:0.4}, Var biased: {var_biased_err:0.2f}, Var unbiased: {var_unbiased_err:0.2f}')

In [None]:
def generate_estimates(n = 10, iters = 1000):
    mu = 0
    sigma = 1
    means = []
    medians = []
    var_biased = []
    var_unbiased = []
    dist = stats.norm(loc=mu, scale=sigma)
    for _ in range(iters):
        xs = dist.rvs(n)
        means.append(np.mean(xs))
        medians.append(np.median(xs))
        var_biased.append(np.var(xs))
        var_unbiased.append(np.var(xs, ddof=1))
    return (
        mean_error(np.array(means), mu),
        mean_error(np.array(medians), mu),
        rmse(np.array(var_biased), sigma**2),
        rmse(np.array(var_unbiased), sigma**2)
    )

In [None]:
mean_err, median_err, var_biased_err, var_unbiased_err = generate_estimates(iters=100000)
print(f'Mean error: {mean_err:0.4}, Median error: {median_err:0.4}, Var biased: {var_biased_err:0.2f}, Var unbiased: {var_unbiased_err:0.2f}')

**Exercise:** Suppose you draw a sample with size n=10 from an exponential distribution with λ=2. Simulate this experiment 1000 times and plot the sampling distribution of the estimate L. Compute the standard error of the estimate and the 90% confidence interval.

Repeat the experiment with a few different values of `n` and make a plot of standard error versus `n`.



In [None]:
def exponential_estimates(lam=2, n=10, iters=1000):
    dist = stats.expon(scale=1/lam)
    return np.array([1/dist.rvs(n).mean() for _ in range(iters)])

In [None]:
lam = 2
# sample size n=10, repeated 1000 times, as the exercise asks
estimates = exponential_estimates(lam, n=10, iters=1000)
stderr = rmse(estimates, lam)
ci = np.percentile(estimates, (5, 95))
print(f'Std err: {stderr}, ci: {np.round(ci, 3)}')

In [None]:
p = sns.ecdfplot(
    estimates,
    label = 'CDF'
)
p.axvline(ci[0], label='5th percentile', color='darkred', linestyle='--')
p.axvline(ci[1], label='95th percentile', color='darkgreen', linestyle='--')
p.set(
    xlabel = 'estimate',
    ylabel = 'CDF'
);
plt.legend(loc='lower right');



### My conclusions:

1. With sample size 10:
    
        standard error 0.762510819389
        confidence interval (1.2674054394352277, 3.5377353792673705)

2. As sample size increases, standard error and the width of the CI decrease:

        n       stderr  90% CI
        10      0.90    (1.3, 3.9)
        100     0.21    (1.7, 2.4)
        1000    0.06    (1.9, 2.1)

All three confidence intervals contain the actual value, 2.
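
Standard errors like those in the table can be regenerated with a loop over sample sizes. The following is a minimal standard-library sketch rather than the code that produced the table: the seed and the helper name `standard_error_of_L` are assumptions made for illustration, so the exact values will differ from those shown.

```python
import math
import random

random.seed(3)

def standard_error_of_L(lam, n, iters=1000):
    """RMSE of L = 1/xbar around lam, over `iters` simulated samples of size n."""
    squared_errors = []
    for _ in range(iters):
        xbar = sum(random.expovariate(lam) for _ in range(n)) / n
        squared_errors.append((1 / xbar - lam) ** 2)
    return math.sqrt(sum(squared_errors) / iters)

lam = 2
ns = [10, 100, 1000]
stderrs = [standard_error_of_L(lam, n) for n in ns]

for n, se in zip(ns, stderrs):
    print(f'{n:>5}  {se:.2f}')
```

Passing `ns` and `stderrs` to a line plot, for example `sns.lineplot(x=ns, y=stderrs)`, gives the plot of standard error versus `n` that the exercise asks for.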

**Exercise:** In games like hockey and soccer, the time between goals is roughly exponential. So you could estimate a team’s goal-scoring rate by observing the number of goals they score in a game. This estimation process is a little different from sampling the time between goals, so let’s see how it works.

Write a function that takes a goal-scoring rate, `lam`, in goals per game, and simulates a game by generating the time between goals until the total time exceeds 1 game, then returns the number of goals scored.

Write another function that simulates many games, stores the estimates of `lam`, then computes their mean error and RMSE.

Is this way of making an estimate biased?

In [None]:
def simulate_game(lam):
    """Simulates a game and returns the estimated goal-scoring rate.

    lam: actual goal scoring rate in goals per game
    """
    goals = 0
    t = 0
    while True:
        time_between_goals = random.expovariate(lam)
        t += time_between_goals
        if t > 1:
            break
        goals += 1

    # estimated goal-scoring rate is the actual number of goals scored
    return goals

In [None]:
lam = 2
def estimate_game(lam, m = 10000):
    return np.array([simulate_game(lam) for _ in range(m)])

Simulate many games and use the number of goals scored as an estimate of the true long-term goal-scoring rate.

In [None]:
estimates = estimate_game(lam, m = 100000)
p = sns.histplot(
    estimates,
    binwidth=1,
    stat = 'probability'
)
p.set(
    xlabel = 'Goals scored',
    ylabel = 'PMF',
    title = f'L: {np.mean(estimates):.2f}, RMSE: {rmse(estimates, lam):.2f}, Err: {mean_error(estimates, lam):.4f}'
);

1. RMSE for this way of estimating lambda is 1.4
2. The mean error is small and decreases with m, so this estimator appears to be unbiased.

One note: If the time between goals is exponential, the distribution of goals scored in a game is Poisson.
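
That note can be checked numerically. The sketch below simulates games from exponential inter-goal times, using only the standard library, and compares the simulated distribution of goals with the Poisson PMF $e^{-\lambda}\lambda^k/k!$. The seed, the number of games, and the helper name `goals_in_game` are illustrative choices.

```python
import math
import random
from collections import Counter

random.seed(5)

def goals_in_game(lam):
    """Counts exponential inter-goal times that fit within one game."""
    goals, t = 0, random.expovariate(lam)
    while t <= 1:
        goals += 1
        t += random.expovariate(lam)
    return goals

lam, games = 2, 50_000
counts = Counter(goals_in_game(lam) for _ in range(games))

# compare the simulated PMF with the Poisson PMF for the first few counts
for k in range(6):
    simulated = counts[k] / games
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    print(f'{k}: simulated {simulated:.3f}, Poisson {poisson:.3f}')
```

Because a Poisson distribution with rate $\lambda$ has mean $\lambda$ and variance $\lambda$, this is consistent with the estimator being unbiased with an RMSE of $\sqrt{2} \approx 1.4$ when $\lambda = 2$.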