# From samples to populations

This notebook deals with:

1. Interval estimation
2. Significance testing

## Set up the notebook

In [1]:
from __future__ import annotations
from opynuni.stats import adt, sampling
from opynuni.pandasloader import PandasLoader
from statsmodels.stats import proportion, weightstats

In [2]:
# init loader object
pdloader = PandasLoader()

## Confidence intervals

A confidence interval for some parameter $\theta$ is a range of plausible values $(\theta^{-}, \theta^{+})$.
It can represent the uncertainty in an estimate $\hat\theta$.

One interpretation is the **plausible range interpretation of confidence intervals**: a $100(1-\alpha)\%$ confidence interval represents the range of values of $\theta$ that are plausible at the $100(1-\alpha)\%$ confidence level, given the observed data (for example, at the 95% level when $\alpha = 0.05$).

### Large sample approximate 100(1-$\alpha$)% confidence intervals of the mean

Given a random sample $x_{1}, \ldots, x_{n}$ of size $n>30$ from a population with mean $\mu$, an approximate $100(1-\alpha)\%$ confidence interval for $\mu$, valid for large $n$, is

$$(\mu^{-}, \mu^{+}) = \bigg( \bar{x} \pm z \frac{s}{\sqrt{n}} \bigg),$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $z = q_{1-(\alpha/2)}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution.
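As a minimal sketch of the calculation, the interval can be computed from raw observations with the standard library alone, using the standard error $s/\sqrt{n}$. The function name `z_confint_mean` and the data are hypothetical and chosen only for illustration; this is not the implementation used by `statsmodels` or `opynuni`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confint_mean(xs, alpha=0.05):
    """Approximate 100(1 - alpha)% z-interval for the mean of a large sample."""
    n = len(xs)
    xbar = mean(xs)
    s = stdev(xs)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile q_{1 - alpha/2}
    half_width = z * s / sqrt(n)  # z times the standard error of the mean
    return xbar - half_width, xbar + half_width


# hypothetical data, for illustration only: the integers 1..40 (n = 40 > 30)
lower, upper = z_confint_mean(list(range(1, 41)))
print(f"confint(lower={lower:.4f}, upper={upper:.4f})")
```

Changing `alpha` changes the confidence level: `alpha=0.01` gives a wider 99% interval.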

#### Example: Number of goals per game

Data on the number of goals scored per game in an English Premier League season have been provided.
Calculate a 95% $z$-interval for the expected number of goals scored in a game, and comment whether it is plausible that on average more than three goals are scored per game.

In [3]:
# get data
games = pdloader.get('number_of_goals_per_game')
# get confint
res = weightstats.zconfint(x1=games["goals"])
adt.ConfIntADT(res)

confint(lower=2.6601, upper=2.9821)

The upper limit of the 95% confidence interval, 2.9821, lies below 3, so neither 3 nor any larger value is plausible at the 95% confidence level.
It is therefore implausible that, on average, more than three goals are scored per game: a mean that large would make the observed data unlikely.

#### Example: Ozone levels (pp45)

In [4]:
# init object from the sample summary statistics
sample = sampling.ZSample(mean=15.171, std=8.2773, nobs=1594)
# get confint
sample.zconfint_mean()

confint(lower=14.7647, upper=15.5773)
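The interval above can be cross-checked by hand from the summary statistics. The sketch below assumes that `ZSample.zconfint_mean` applies the usual normal approximation $\bar{x} \pm z \, s/\sqrt{n}$ at the 95% level; the helper name `z_confint_from_summary` is hypothetical and not part of `opynuni`.

```python
from math import sqrt
from statistics import NormalDist


def z_confint_from_summary(mean, std, nobs, alpha=0.05):
    """Approximate z-interval for a mean, given only summary statistics."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile q_{1 - alpha/2}
    half_width = z * std / sqrt(nobs)  # z times the standard error
    return mean - half_width, mean + half_width


# summary statistics taken from the ozone example
lower, upper = z_confint_from_summary(mean=15.171, std=8.2773, nobs=1594)
print(f"confint(lower={lower:.4f}, upper={upper:.4f})")
```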

### Approximate 100(1-$\alpha$)% confidence interval for a proportion

Given an estimate $\hat p$ of a proportion $p$, obtained by observing $x$ successes in a sequence of $n$ independent **Bernoulli trials**, each with probability of success $p$, an approximate $100(1-\alpha)\%$ confidence interval for the proportion is

$$(p^{-}, p^{+}) = \bigg( \hat p \pm z \sqrt{\frac{\hat p \: (1 - \hat p)}{n}} \bigg)$$

where $\hat p = x/n$ and $z = q_{1-(\alpha/2)}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution.
This confidence interval is valid when both $np \geq 5$ and $n(1-p) \geq 5$.

*Note, this last condition really means that there are at least 5 successes and 5 failures in the sample.*

#### Example: Proportion of females among asthma cases (pp46)

Of the 1761 cases involving asthma patients who were admitted to hospital where the gender was recorded, 907 were female.
Calculate an approximate 95% confidence interval for $p$, the underlying proportion of females among persons admitted for asthma.

In [5]:
res = proportion.proportion_confint(count=907, nobs=1761)
# sampling.to_namedtuple(res, 'confint')
adt.ConfIntADT(res)

confint(lower=0.4917, upper=0.5384)
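As a cross-check, the same interval can be sketched directly from the formula for a proportion, including the rule of thumb of at least 5 successes and 5 failures. The function name `proportion_confint_normal` is hypothetical; it mirrors the normal-approximation formula and is not the `statsmodels` implementation.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confint_normal(count, nobs, alpha=0.05):
    """Approximate normal-theory confidence interval for a proportion."""
    # validity check: at least 5 successes and 5 failures in the sample
    if min(count, nobs - count) < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = count / nobs
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile q_{1 - alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / nobs)
    return p_hat - half_width, p_hat + half_width


# counts taken from the asthma example: 907 females among 1761 cases
lower, upper = proportion_confint_normal(count=907, nobs=1761)
print(f"confint(lower={lower:.4f}, upper={upper:.4f})")
```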

## Testing hypotheses

#### Example: Mean duration of hospital stay (pp47)

In [6]:
# initialise object from summary statistics, then run a two-sided z-test of H0: mu = 2
sampling.ZSample(mean=1.885, std=1.85, nobs=1762).twosided_ztest_mean(mu0=2)

results(zstat=-2.6093, pval=0.0091)
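The result above can be reproduced by hand. The sketch below assumes that `twosided_ztest_mean` tests $H_{0}: \mu = \mu_{0}$ against $H_{1}: \mu \neq \mu_{0}$ using the statistic $z = (\bar{x} - \mu_{0}) / (s/\sqrt{n})$ and the two-sided $p$-value $2\,\Phi(-|z|)$; the standalone function here is a hypothetical stand-in, not the `opynuni` code.

```python
from math import sqrt
from statistics import NormalDist


def twosided_ztest_mean(mean, std, nobs, mu0):
    """Two-sided large-sample z-test of H0: mu = mu0 from summary statistics."""
    zstat = (mean - mu0) / (std / sqrt(nobs))  # standardised distance from mu0
    pval = 2 * NormalDist().cdf(-abs(zstat))  # probability in both tails
    return zstat, pval


# summary statistics taken from the hospital stay example
zstat, pval = twosided_ztest_mean(mean=1.885, std=1.85, nobs=1762, mu0=2)
print(f"results(zstat={zstat:.4f}, pval={pval:.4f})")
```

Because the $p$-value is below 0.01, the sample provides strong evidence against a mean stay of 2 under these assumptions.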