# Fundamental definitions

Inferential statistics is the art of making better guesses about a large group of things (the **population**) from a **sample** we measure out of that group.

A **parameter** is a number used to describe a population. The population mean, written $\mu$, is a **parameter**, while the sample mean, written $\bar{x}$, is a **statistic**: a number computed from a sample. **These are *not* the same**, and therefore should not be interpreted in the same way.
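
To make the distinction concrete, here is a minimal sketch that simulates a population and draws one sample from it. The population (heights in cm), its size and the seed are all made up for illustration.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical population: 100,000 simulated heights in cm.
population = [rng.gauss(170, 10) for _ in range(100_000)]
sample = rng.sample(population, 50)

mu = fmean(population)  # parameter: describes the whole population
x_bar = fmean(sample)   # statistic: describes only this sample
print(mu, x_bar)
```

The two numbers are close but not equal, and drawing a different sample would give a different $\bar{x}$ while $\mu$ stays fixed.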

## The importance of good sampling

To avoid **sampling bias**, we need to select our sample carefully. To select samples that represent the population accurately, we need to:

1. Define the population of interest
2. Assign a probability of selection to each member
3. Draw a sample and apply sampling weights

This ensures we don't overselect some portion of the population. This is called **probability sampling**. It's harder to do than picking subjects haphazardly or based on availability, but it will give us a much better picture of the actual population.

One method of choosing a random sample is **systematic sampling**. For example, to pick 100 subjects from a population of 1000:

1. Set $n = 10$, because $1000/100 = 10$.
2. Choose a number at random between 1 and 10.
3. Select the object with that number, and every 10th object thereafter
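
The three steps above can be sketched in a few lines of Python. The function name `systematic_sample` and the numbered population are illustrative assumptions, and the sketch assumes the population size is a multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Return every step-th item of `population`, starting from a random offset."""
    step = len(population) // sample_size  # sampling interval, e.g. 1000 // 100 = 10
    start = rng.randrange(step)            # random 0-based offset in [0, step)
    return population[start::step][:sample_size]

population = list(range(1, 1001))  # hypothetical subjects numbered 1..1000
sample = systematic_sample(population, 100, random.Random(0))
print(len(sample), sample[:3])
```

Note that only the starting point is random: once it is chosen, the rest of the sample is fully determined.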

You could also use **stratified sampling**, which consists of dividing your population of interest into groups (strata) based on common characteristics, then sampling from each group. You could then compare the dependent variable across strata, using stratum membership as an independent variable.
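
As a rough sketch, one common variant draws the same fraction at random from every stratum (proportional allocation). That choice, the helper name `stratified_sample` and the urban/rural population are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, rng):
    """Draw the same fraction of items at random from every stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)  # group items by stratum
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)     # sample size for this stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 600 "urban" and 400 "rural" subjects.
people = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]
picked = stratified_sample(people, lambda p: p[0], 0.1, random.Random(0))
counts = {name: sum(1 for p in picked if p[0] == name) for name in ("urban", "rural")}
print(counts)  # {'urban': 60, 'rural': 40}
```

Each stratum ends up represented in the sample in the same proportion as in the population.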

In a **cluster sample**, the population is sampled making use of pre-existing groups.
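
One simple way to do this, shown here as a sketch only, is to pick a few whole groups at random and keep every member of the chosen groups (one-stage cluster sampling). The classrooms and the helper name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical pre-existing groups: 20 classrooms of 25 pupils each.
classrooms = {f"class_{c}": [f"pupil_{c}_{i}" for i in range(25)] for c in range(20)}
sample = cluster_sample(classrooms, 4, random.Random(0))
print(sorted(sample), sum(len(pupils) for pupils in sample.values()))
```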

# The Binomial Distribution

This is the "coin-flipping distribution": it models situations where each trial has only two possible outcomes, so the number of successes takes **discrete** values. These values are generated by a **Bernoulli process**, a sequence of trials (also called Bernoulli trials) that each either fail or succeed. Each trial is independent (see [independent events in the probability page](../../Maths%20and%20Computer%20Science/Stats/1.%20Probability.ipynb#Independent-events)).

We can then calculate the probability of getting exactly $k$ successes like so:

$P(X = k) = {{n}\choose{k}} p^k (1 - p)^{n - k}$

where

- $n$ is the number of trials and $k$ is the number of successes we are asking about (ex: $n$ is 10 and $k$ is 5 if we want to measure how probable it is to get 5 heads when tossing a coin 10 times)
- ${{n}\choose{k}} = \frac{n!}{k!(n-k)!}$ (${{n}\choose{k}}$ is a [combination (see Set Theory)](../../Maths%20and%20Computer%20Science/Set%20Theory.ipynb#Permutations)), also called the **binomial coefficient**. We "*choose*" $k$ successes out of $n$ trials.
- $p$ is the [probability of success](../../Maths%20and%20Computer%20Science/Stats/1.%20Probability.ipynb#Independent-events)
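
The formula translates directly into Python. In this sketch `n` is the number of trials and `k` the number of successes, matching the binomial coefficient above; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of getting 5 heads when tossing a fair coin 10 times.
print(binomial_pmf(5, 10, 0.5))  # 0.24609375
```

Summing `binomial_pmf(k, n, p)` over every `k` from 0 to `n` gives 1, which is a quick sanity check on the formula.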


# The Normal Distribution

A **normal distribution** is characterised by two parameters: the mean $\mu$ (mu) and the standard deviation $\sigma$ (sigma), or equivalently the variance $\sigma^2$. The distribution with a mean of 0 and a standard deviation of 1 is called the **standard normal distribution** or the **Z distribution**.

![image.png](attachment:image.png)
(*Source: Statistics in a Nutshell*)

This standard lets us convert *any* normal distribution into the standard normal distribution, using only the mean and standard deviation of our distribution. This is called **normalisation** (or standardisation), and the resulting data points are called **Z-scores**. This is how it's done:

$Z=\frac{x-\mu}{\sigma}$

($x$ being the data point we're looking at, $\mu$ the mean of our distribution and $\sigma$ its standard deviation)
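
A minimal sketch of the computation on a small, made-up data set, treating the data as the whole distribution (so the standard deviation is the population one, `pstdev`):

```python
from statistics import mean, pstdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements

mu = mean(data)       # 5.0
sigma = pstdev(data)  # 2.0 (population standard deviation)

z_scores = [(x - mu) / sigma for x in data]
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A Z-score reads as "how many standard deviations away from the mean": the value 9.0 sits two standard deviations above it.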



# The Central Limit Theorem

> The Central Limit Theorem states that the sampling distribution of the sample
mean approximates the normal distribution, regardless of the distribution of the
population from which the samples are drawn, if the sample size is sufficiently
large.  
> *Statistics in a Nutshell*

In other words, if $n$ is sufficiently large, we have $\bar{X} \approx N(\mu,\frac{\sigma^2}{n})$, where $\mu$ and $\sigma^2$ are the population mean and variance and $n$ is the sample size.
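
A quick simulation can make this plausible. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1 and variance 1, chosen arbitrarily) and checks that the sample means have roughly the mean and variance the theorem predicts. The sample size, number of trials and seed are arbitrary.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)
n = 50         # size of each sample
trials = 5000  # number of samples drawn

# Exponential population with rate 1: mean 1, variance 1, far from normal.
sample_means = [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

mean_of_means = fmean(sample_means)      # should be close to mu = 1
var_of_means = pvariance(sample_means)   # should be close to sigma^2 / n = 0.02
print(mean_of_means, var_of_means)
```

Plotting a histogram of `sample_means` would also show the familiar bell shape, even though the population itself is skewed.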

# Hypothesis Testing


## p-value

### A Warning: handle with care

Statistical hypothesis testing is a fundamental procedure used practically everywhere in science. It's a powerful conceptual tool, but it needs to be handled with care, lest your conclusions become flawed and your results biased.  
The delicate nature of p-values and confidence intervals makes them a subject of much controversy. Fortunately, we have other tools at our disposal to interpret experimental results.


## Type I and type II errors

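We can always make mistakes in our interpretation of the data. We can have either false positives (Type I errors: rejecting a null hypothesis that is actually true) or false negatives (Type II errors: failing to reject a null hypothesis that is actually false).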

## Confidence intervals


# The T-Test

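On small samples (roughly fewer than 30 observations), where the population standard deviation is unknown and has to be estimated by the sample standard deviation $s$, we use the t distribution instead of the normal distribution. The t statistic is: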

$$t = \frac{\overline{x} - \mu}{s / \sqrt{n}}$$

The t distribution is flatter than the normal distribution, with fatter tails, when the sample size is small. As $n$ increases, it approaches the normal distribution.
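
The t statistic itself is easy to compute with the standard library. In this sketch the function name `t_statistic`, the sample values and the hypothesised mean of 5.0 are made up for illustration; `stdev` is the sample standard deviation $s$ (with $n - 1$ in the denominator).

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)) for a one-sample test against mu."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]  # hypothetical measurements
print(t_statistic(sample, 5.0))
```

Turning this value into a p-value requires the t distribution with $n - 1$ degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.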

