# Statistical Inference

*A quick note - you can find LaTeX symbols [here](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols#Set_and.2For_logic_notation).*

## What is it?

In brief, it's the process of generating conclusions about a population from a noisy sample. We'll collect *data* on a *population*, forming a *sample*.

(Note there are a couple of key words here - *population*, and *sample*.)

There are two broad schools: Bayesian and frequentist.

### I'm confused. Can you give me an example?

Say we want to understand the mean height of all the male adults in Australia. Our *estimand* is therefore "the mean height of all male adults in Australia".

We're probably not actually going to go and collect the height of this entire population (i.e. every male adult in Australia). Instead, we'll *estimate* the mean by taking a *sample* of, say, 100 male adults in Australia. We can then use this to develop an *estimator* - i.e. some kind of probabilistic model - that will allow us to *estimate* the *estimand*. 

In this process, we've taken a *sample* and linked it to a *population*.

*[The above largely draws from this post.](https://www.quora.com/What-is-an-estimator-and-an-estimands-in-statistical-models-Why-this-is-important)*

---

## Probabilities

### Some basic rules of probabilities

* The probability that nothing occurs is 0.
* The probability that something occurs is 1.
* The probability of something is 1 minus the probability that the opposite occurs.
* The probability of at least one of two (or more) mutually exclusive events is the sum of their respective probabilities.
* If an event A implies the occurrence of event B, then the probability of A occurring is no greater than the probability that B occurs.
* For any two events, the probability that at least one occurs is the sum of their probabilities minus the probability of their intersection, i.e. $$P(A\cup B) = P(A) + P(B) - P(A\cap B)$$
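
As a quick sanity check, the rules above can be verified on a toy sample space. The sketch below assumes a single roll of a fair six-sided die; the events `A` and `B` and the helper `prob` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

# Complement rule: P(A) = 1 - P(not A)
assert prob(A) == 1 - prob(outcomes - A)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` rather than a floating-point tolerance.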

### Conditional probabilities

Conditional probability is "the measure of the probability of an event [...] given that [...] another event has occurred."

We can define this as $P(A|B) = \dfrac{P(A\cap B)}{P(B)}$, provided $P(B) > 0$. In other words, the probability of event $A$ occurring *given* that event $B$ has occurred is equal to the probability of the intersection of events $A$ and $B$ divided by the probability of event $B$ occurring. This sounds complex, but it is quite intuitive if you draw it out: once we know $B$ has occurred, $B$ becomes the new sample space, and we ask how much of it is also covered by $A$.
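
A small worked case may help here. The sketch below again assumes a fair six-sided die, with illustrative events `A` (even roll) and `B` (roll greater than 3), and applies the definition directly.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

# P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3: of the three outcomes in B, two are even
```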

*[Obligatory Wikipedia link.](https://en.wikipedia.org/wiki/Conditional_probability)*

### Bayes' theorem

*Sometimes known as Bayes' law or Bayes' rule~~, or Bae's theorem if you really love Bayes~~*.

Building on the definition of conditional probability, we arrive at Bayes' theorem.

The most straightforward expression is: $P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$. The key thing here is that we can now swap $P(A|B)$ and $P(B|A)$.
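
One way to see where this comes from, using only the definition of conditional probability: the joint probability $P(A\cap B)$ can be factored in two ways,

$$P(A\cap B) = P(A|B)P(B) = P(B|A)P(A)$$

and dividing both sides by $P(B)$ (assuming $P(B) > 0$) gives the expression for $P(A|B)$.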

*[Obligatory Wikipedia link.](https://en.wikipedia.org/wiki/Bayes%27_theorem)*

### A classic example: diagnostic tests

Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative.

Let $D$ and $D^C$ be the event that the subject of the test does or does not have the disease being tested for.

Therefore, $\text{sensitivity} = P(+|D)$ and $\text{specificity} = P(- | D^C)$. Both of these values should be high.

Conversely, $\text{positive predictive value} = P(D|+)$ and $\text{negative predictive value} = P(D^C|-)$.

#### Working back and forth

So, if we have a situation where a test has a sensitivity of 99.7% and a specificity of 98.5%, this is on the surface a pretty good test.

But consider there is a population with a 0.1% prevalence of said disease.

What is the positive predictive value?

We can work this out by applying Bayes' theorem, which lets us swap the conditioning from $P(+|D)$ to $P(D|+)$.

$P(D|+) = \dfrac{P(+|D)P(D)}{P(+|D)P(D) + P(+|D^C)P(D^C)} = \dfrac{0.997\times 0.001}{0.997\times 0.001 + 0.015\times 0.999} $

In [2]:
pos_predictive_value = (0.997*0.001)/(0.997*0.001 + 0.015*0.999)
print(pos_predictive_value)

0.06238268051558003


Therefore a positive test result only provides a 6.2% probability that the person actually has the disease.

We can take this idea further and explore the concept of a *likelihood ratio*. The LR compares how likely a given test result is in someone who has the disease *versus* in someone who does not. It can be calculated for positive and negative results: $LR+ = \dfrac{P(+|D)}{P(+|D^C)} = \dfrac{\text{sensitivity}}{1-\text{specificity}}$, and $LR- = \dfrac{P(-|D)}{P(-|D^C)} = \dfrac{1-\text{sensitivity}}{\text{specificity}}$.

This is typically used to generate *post-test odds* by the following formula:

$\dfrac{P(D|+)}{P(D^C|+)} = \dfrac{P(D)}{P(D^C)} \times LR+$.
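
As a rough check, the odds form can be applied to the same test as before (sensitivity 99.7%, specificity 98.5%, prevalence 0.1%). Converting the post-test odds back to a probability should recover the positive predictive value of about 6.2%. The variable names are illustrative.

```python
sensitivity = 0.997
specificity = 0.985
prevalence = 0.001

# Likelihood ratios for a positive and a negative result
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity

# Pre-test odds of disease, then post-test odds after a positive result
pre_test_odds = prevalence / (1 - prevalence)
post_test_odds = pre_test_odds * lr_pos

# Convert odds back to a probability: p = odds / (1 + odds)
post_test_prob = post_test_odds / (1 + post_test_odds)
print(round(lr_pos, 1), round(post_test_prob, 4))
```

The large $LR+$ (roughly 66) multiplies very small pre-test odds, which is why the post-test probability stays low despite the test looking strong on paper.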

#### Another example - pregnancy tests

A [web site](www.medicine.ox.ac.uk/bandolier/band64/b64-7.html) for home pregnancy tests cites the following: “When the subjects using the test were women who collected and tested their own samples, the overall sensitivity was 75%. Specificity was also low, in the range 52% to 75%.” Assume the lower value for the specificity. Suppose a subject has a positive test and that 30% of women taking pregnancy tests are actually pregnant. What number is closest to the probability of pregnancy given the positive test?



In [20]:
specificity = 0.52
sensitivity = 0.75
p_pregnant = 0.3

From above: $$P(D|+) = \dfrac{P(+|D)P(D)}{P(+|D)P(D) + P(+|D^C)P(D^C)}$$

In [23]:
p_pregnant_g_pos = (sensitivity*p_pregnant) / (sensitivity*p_pregnant + (1-specificity)*(1-p_pregnant))

In [24]:
p_pregnant_g_pos

0.40106951871657753

---

## Distributions

*[Obligatory Wikipedia link.](https://en.wikipedia.org/wiki/Probability_distribution)*

### Expected value

For a given distribution, the 'expected value' can be intuitively described as the long-run average value of repetitions of the experiment it represents. This is more commonly referred to as the *mean* or *first moment* but may also be referred to as *expectation* or *mathematical expectation*.

#### Discrete values

For a discrete value, it's just the sum of the possible values, each weighted by its probability, i.e. $$E[X] = \sum_{x} xp(x)$$
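
For instance, a minimal sketch of this sum for a fair six-sided die (an assumed example, not from the original notes):

```python
import numpy as np

values = np.arange(1, 7)    # faces of the die: 1..6
probs = np.full(6, 1 / 6)   # each face is equally likely

# E[X] = sum over x of x * p(x)
expected_value = np.sum(values * probs)
print(expected_value)
```

The result is 3.5 (up to floating-point rounding), a value the die can never actually show, which is a reminder that the expected value is a long-run average rather than a typical outcome.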

#### Continuous values

For a continuous value, it's similar: an integral of each value weighted by the **probability density function** $f$, i.e. $$E[X] = \int x f(x)\,dx$$
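
A numerical sketch of the continuous case: integrating $x$ weighted by the density at $x$ over the whole real line. The choice of a normal distribution with mean 2 and standard deviation 1 is an assumption made purely for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Assumed example distribution: normal with mean 2 and standard deviation 1.
dist = norm(loc=2.0, scale=1.0)

# E[X] = integral of x * f(x) dx, where f is the density.
# quad returns the integral estimate and an estimate of the absolute error.
expected_value, abs_error = quad(lambda x: x * dist.pdf(x), -np.inf, np.inf)

print(expected_value, dist.mean())
```

The numerical integral should agree closely with `dist.mean()`, which SciPy computes analytically.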

#### Expected values in the context of inference

Consider the following:
* The population mean is the center of mass of a population.
* The sample mean is the center of mass of the observed data - i.e. of the samples we took.
* The sample mean is an estimate of the population mean. (We'll use our sample to *infer* something about the population.)
* The sample mean is unbiased, i.e. its expected value equals the population mean.
* The more data that goes into the sample mean, the more concentrated its density/mass function is around the population mean.
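
The last two points can be illustrated with a small simulation. This is a sketch under assumed numbers: a hypothetical population of heights that is normal with mean 175 cm and standard deviation 7 cm, and a hypothetical helper `sample_means`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters (heights in cm), chosen for illustration.
population_mean, population_sd = 175.0, 7.0

def sample_means(n, repeats=2000):
    """Draw `repeats` independent samples of size n and return each sample's mean."""
    samples = rng.normal(population_mean, population_sd, size=(repeats, n))
    return samples.mean(axis=1)

for n in (10, 100, 1000):
    means = sample_means(n)
    # Average of the sample means (close to 175) and their spread (shrinks with n)
    print(n, round(means.mean(), 2), round(means.std(), 3))
```

The average of the sample means stays near the population mean at every sample size (unbiasedness), while their spread shrinks as $n$ grows (concentration).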

*[Obligatory Wikipedia link.](https://en.wikipedia.org/wiki/Expected_value)*

---

## Continuous Distributions

Continuous values can take any outcome in a continuum.

A useful introduction to these concepts can be found [here](https://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html).

#### Probability density function

A PDF is a function associated with a continuous random variable.

To be a valid PDF, a function $p$ must:
* Always be $\geq 0$;
* Have the total area under the function $= 1$.

Hence, areas under PDFs correspond to probabilities for that random variable.
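
Both conditions, and the link between area and probability, can be checked numerically. The sketch below uses the standard normal density as an assumed example and compares the area under it between -1 and 1 with the probability given by the cumulative distribution function.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Total area under the standard normal density (should be 1)
total_area, _ = quad(norm.pdf, -np.inf, np.inf)

# Area under the density between -1 and 1 ...
area_between, _ = quad(norm.pdf, -1, 1)

# ... should match P(-1 <= X <= 1) computed from the CDF
prob_between = norm.cdf(1) - norm.cdf(-1)

print(total_area, area_between, prob_between)
```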

Examples of probability density functions include:
* The [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution).


#### Useful references

<a href="https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)">Wikipedia link.</a>

[SciPy implementation.](https://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous_uniform.html)


#### Survival function

The survival function is essentially $1 - \text{CDF}$, where $\text{CDF}$ is the cumulative distribution function. In other words, it gives $P(X > x)$.
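
SciPy exposes this as the `sf` method on its distributions. A minimal sketch for a uniform distribution on 0 to 1, with the point `x = 0.3` chosen arbitrarily:

```python
from scipy.stats import uniform

x = 0.3  # arbitrary point in [0, 1]

cdf_value = uniform.cdf(x, loc=0, scale=1)  # P(X <= x)
sf_value = uniform.sf(x, loc=0, scale=1)    # P(X > x) = 1 - CDF

print(cdf_value, sf_value)
```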

#### Example of use

A random variable $X$ is uniform on the interval from 0 to 1: its density is a box from 0 to 1 of height 1 (so that $f(x)=1$ for $0≤x≤1$). What is its median expressed to two decimal places?

In [1]:
from scipy.stats import uniform

In [2]:
uniform.median(loc=0, scale=1)

0.5

---

## Discrete Distributions

You could broadly describe discrete outcomes as things you can "count": they take values from a finite or countably infinite set.

#### Probability mass function

A probability mass function at a value corresponds to a probability that a random variable takes that value.

To be a valid pmf, a function, $p$, must:
* Always be $\geq 0$;
* Sum to one over all the values that the random variable can take.
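
These two conditions are easy to check in code. The helper `is_valid_pmf` below is a hypothetical function written for this illustration, not part of SciPy; the Poisson check truncates the infinite sum at an arbitrary cut-off of 100 terms.

```python
import numpy as np
from scipy.stats import poisson

def is_valid_pmf(probs, tol=1e-9):
    """Hypothetical helper: True if all probabilities are >= 0 and sum to 1 (within tol)."""
    probs = np.asarray(probs, dtype=float)
    return bool(np.all(probs >= 0) and abs(probs.sum() - 1) < tol)

print(is_valid_pmf([0.2, 0.8]))  # non-negative and sums to one
print(is_valid_pmf([0.5, 0.6]))  # sums to 1.1, so not a valid PMF

# A Poisson variable has countably many values; summing the first 100
# terms for rate mu=3 already gets extremely close to one.
k = np.arange(0, 100)
poisson_total = poisson.pmf(k, mu=3).sum()
print(poisson_total)
```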

Examples of probability mass functions include:
* The [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution);
* The [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) (used for binary outcomes - the most common example being coin flips).

[Wikipedia link](https://en.wikipedia.org/wiki/Probability_mass_function).

[SciPy implementation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_discrete.pmf.html)

#### Example of use
A random variable takes the value -4 with probability .2 and 1 with probability .8. What is the variance of this random variable?

In [4]:
import numpy as np
from scipy.stats import rv_discrete

In [4]:
example_pmf = rv_discrete(values=((-4, 1), (0.2, 0.8)))

In [5]:
example_pmf.mean()

0.0

In [6]:
example_pmf.var()

4.0
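
The same numbers can be reproduced by hand from the definitions, which may make it clearer what `mean()` and `var()` are doing. This sketch uses the identity $Var(X) = E[X^2] - E[X]^2$; the names `xs` and `ps` are illustrative.

```python
import numpy as np

xs = np.array([-4.0, 1.0])  # values the random variable can take
ps = np.array([0.2, 0.8])   # their probabilities

mean = np.sum(xs * ps)                   # E[X]
second_moment = np.sum(xs**2 * ps)       # E[X^2]
variance = second_moment - mean**2       # Var(X) = E[X^2] - E[X]^2

print(mean, variance)
```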

#### Another example
Consider the following PMF:

In [18]:
# p(x) = x / 10 for x in 1, 2, 3, 4
values = np.arange(1, 5)
example_pmf_2 = rv_discrete(values=(values, values/values.sum()))

In [19]:
example_pmf_2.mean()

3.0