# Sampling distributions

In [42]:
import altair as alt
import numpy as np
import numpy.typing as npt
import pandas as pd
import seaborn as sns
import types

from scipy.stats import truncnorm

## Introduction

We calculate sampling statistics to summarise the information contained in our sample data. For instance, we calculate the sample mean to get the average of all the values in a sample. However, we also need to know how _accurate_ our sampling statistic is. This is what the sampling distribution is for: it is a tool that helps us quantify our uncertainty about the sampling statistic we observed.

The sampling distribution is the probability distribution of a sampling statistic. That statistic could be the sample mean mentioned above, but it could equally be a sample median or variance. This means that, theoretically, our observed sampling statistic is just one of many values that we could have obtained, depending on which units happen to be included in the sample.

Assuming we have a sample that is **representative** of the wider population, we can calculate a sampling statistic that, at first glance, appears to estimate a corresponding value for the entire population. In the case of the sample mean, we think of it as an estimate of the population mean. (The general label for the population mean, median and so on is _population parameter_.)

How sure are we that the sample and population means are very close to each other? Is the sample mean’s value a little less than that of the population mean, a little over, or something else?

## Simulating the sampling distribution of a sampling statistic

To illustrate what we mean, let’s set up a simulated population of heights. We draw 5,000 heights from a normal distribution with a mean of 170 cm and a standard deviation of 7.5 cm, and treat these 5,000 units as our population.

In [8]:
MEAN_HEIGHT_CM = 170
SD_HEIGHT_CM = 7.5
POPULATION_SIZE = 5000

# Type: numpy.ndarray
pop_height = np.random.normal(loc=MEAN_HEIGHT_CM, scale=SD_HEIGHT_CM, size=POPULATION_SIZE)

pop_height_pd = pd.DataFrame({"height_cm": pop_height})

In [16]:
pop_height_viz = (
    alt.Chart(pop_height_pd)
    .mark_bar()
    .encode(
        alt.X("height_cm", bin=alt.Bin(maxbins=100), title="Generated heights in cm"),
        alt.Y("count()", title="Frequency")
    )
    .properties(title="Simulated population of heights (N = 5,000)")
)

When we visualise our population as a histogram, the distribution should be bell-shaped, because the heights were drawn from a normal distribution.

In [17]:
pop_height_viz.show()
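
Alongside the histogram, a numerical check can support the claim that the distribution is bell-shaped. The following is a minimal sketch, assuming a freshly generated population with the same parameters as above; the seed, the `rng` generator and the `share_within` helper are illustrative names that do not appear elsewhere in this notebook. For normally distributed data, roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two.

```python
import numpy as np

# Assumption: an arbitrary seed, used only to make this sketch reproducible.
rng = np.random.default_rng(seed=1)
population = rng.normal(loc=170, scale=7.5, size=5000)


def share_within(values: np.ndarray, n_sd: float) -> float:
    """Return the share of values within n_sd standard deviations of the mean."""
    distance = np.abs(values - values.mean())
    return float((distance <= n_sd * values.std()).mean())


within_one_sd = share_within(population, 1)
within_two_sd = share_within(population, 2)

print(f"Within 1 SD: {within_one_sd:.1%}")
print(f"Within 2 SD: {within_two_sd:.1%}")
```

Shares that are far from these reference values would suggest that the population is not approximately normal.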

In [25]:
pop_mean = float(pop_height.mean())
pop_mean

169.9318020969051

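As we can see, our population mean is **169.9 cm**. It is slightly below the 170 cm we specified because the population is itself a finite random draw. When we draw a random sample and calculate its sample mean, we will compare that mean to this population mean.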

### Drawing a _representative_ sample

Since we are performing simple random sampling, we sample _without_ replacement, so that no unit can be selected more than once. Our sample consists of 50 units.

In [27]:
height_sample = np.random.choice(pop_height, size=50, replace=False)
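
To make the effect of `replace=False` concrete, the sketch below samples unit positions rather than heights, which makes repeated units easy to count. It assumes a hypothetical population of 5,000 units and a sample of 50; the seed and all variable names are illustrative.

```python
import numpy as np

# Assumption: an arbitrary seed, used only to make this sketch reproducible.
rng = np.random.default_rng(seed=2)
population_size = 5000
sample_size = 50

# Draw positions of units instead of values, so duplicates are easy to detect.
without_replacement = rng.choice(population_size, size=sample_size, replace=False)
with_replacement = rng.choice(population_size, size=sample_size, replace=True)

n_unique_without = np.unique(without_replacement).size
n_unique_with = np.unique(with_replacement).size

print(f"Unique units without replacement: {n_unique_without} of {sample_size}")
print(f"Unique units with replacement: {n_unique_with} of {sample_size}")
```

Without replacement, every drawn unit is distinct by construction. With replacement, repeats are possible, although they are uncommon when the sample is small relative to the population.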

In [33]:
print(f"""
    Sample mean: {float(height_sample.mean())} cm
    Population mean: {pop_mean} cm
""")


    Sample mean: 168.00689687187167 cm
    Population mean: 169.9318020969051 cm
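
The gap between the two means above comes from one particular draw of 50 units. In a simulation, where the whole population is available, we can repeat the draw many times and look at how the sample means vary. The following is a minimal sketch under stated assumptions: it generates its own population with the same parameters as above, and the seed, the `draw_sample_means` helper and the choice of 1,000 repetitions are illustrative.

```python
import numpy as np

# Assumption: an arbitrary seed, used only to make this sketch reproducible.
rng = np.random.default_rng(seed=3)
population = rng.normal(loc=170, scale=7.5, size=5000)


def draw_sample_means(
    population: np.ndarray,
    sample_size: int,
    n_samples: int,
    rng: np.random.Generator,
) -> np.ndarray:
    """Draw n_samples simple random samples and return the mean of each."""
    return np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])


sample_means = draw_sample_means(population, sample_size=50, n_samples=1000, rng=rng)

print(f"Population mean:         {population.mean():.2f} cm")
print(f"Average of sample means: {sample_means.mean():.2f} cm")
print(f"Spread of sample means:  {sample_means.std():.2f} cm")
```

The sample means scatter around the population mean, and their spread is much smaller than the spread of individual heights. This collection of means approximates the sampling distribution of the sample mean for samples of 50 units.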



Consider the following situation. We’ve drawn a random sample of our population and calculated the sample mean height. In a real-world setting, we may never know the true population mean. This means that, unlike the above, we only have the sample mean value to work with.

We need to ask ourselves how _accurate_ this sample mean is. Since we only have a single estimate of the population mean, we need a way to know how far we’re off the mark with our observed sample mean.

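We use [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) to help us quantify this uncertainty. We draw 2,000 resamples from our sample, each the same size as the original sample and drawn _with_ replacement, and calculate the mean of each one, so that we get 2,000 bootstrapped means to plot.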

In [56]:
def calculate_bootstrapped_mean(input_sample: npt.NDArray) -> float:
    """Return the mean of one bootstrap resample of the same size as the input."""
    resample = np.random.choice(input_sample, size=input_sample.size, replace=True)
    return float(resample.mean())

In [64]:
def get_sampling_distribution_of_means(
    input_sample: npt.NDArray,
    size: int
) -> list[float]:
    """
    Return `size` bootstrapped means, where `size` is the number of
    bootstrap resamples to draw (not the size of the population).
    """
    output = []

    for _ in range(size):
        mean = calculate_bootstrapped_mean(input_sample)
        output.append(mean)

    return output
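
The loop above is easy to follow. As an alternative sketch, the same resampling can be done in one step by drawing a matrix of positions with one row per resample and averaging across each row. The function name `bootstrap_means_vectorised`, the explicit `rng` argument and the example sample are assumptions introduced for illustration; they are not part of the notebook's own code.

```python
import numpy as np
import numpy.typing as npt


def bootstrap_means_vectorised(
    input_sample: npt.NDArray[np.float64],
    n_resamples: int,
    rng: np.random.Generator,
) -> npt.NDArray[np.float64]:
    """Return one bootstrapped mean per resample."""
    # Each row holds the positions of one resample, drawn with replacement.
    indices = rng.integers(
        0, input_sample.size, size=(n_resamples, input_sample.size)
    )
    return input_sample[indices].mean(axis=1)


# Assumption: an arbitrary seed and a hypothetical sample of 50 heights.
rng = np.random.default_rng(seed=4)
example_sample = rng.normal(loc=170, scale=7.5, size=50)

boot_means = bootstrap_means_vectorised(example_sample, 2000, rng)
print(boot_means.shape)
```

This version returns a NumPy array rather than a list, which is convenient for the summary statistics and plots that usually follow.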

In [65]:
get_sampling_distribution_of_means(height_sample, 10)

[167.4281968550072,
 167.79757563580097,
 167.84940185770188,
 168.63354835555253,
 167.1251013063803,
 167.37484097848912,
 166.37854645665269,
 167.18859030502344,
 165.86999549065044,
 167.00667164006163]
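
With a full set of bootstrapped means, the uncertainty can be summarised numerically as well as plotted. The sketch below assumes a hypothetical sample of 50 heights and 2,000 resamples; the seed and variable names are illustrative. It reports the standard deviation of the bootstrapped means, which serves as a bootstrap estimate of the standard error, together with a 95% percentile interval. For comparison, it also computes the textbook standard error of the mean, the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np

# Assumption: an arbitrary seed and a hypothetical sample of 50 heights.
rng = np.random.default_rng(seed=5)
sample = rng.normal(loc=170, scale=7.5, size=50)

# 2,000 bootstrap resamples, each the size of the original sample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

bootstrap_se = float(boot_means.std(ddof=1))
analytic_se = float(sample.std(ddof=1) / np.sqrt(sample.size))

# Percentile interval: the middle 95% of the bootstrapped means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean:             {sample.mean():.2f} cm")
print(f"Bootstrap SE:            {bootstrap_se:.2f} cm")
print(f"Analytic SE:             {analytic_se:.2f} cm")
print(f"95% percentile interval: {ci_low:.2f} to {ci_high:.2f} cm")
```

The two standard errors should be close to each other. The percentile interval is only one of several ways to build a bootstrap interval, and it is used here because it is the simplest to read off the bootstrapped means.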