# Chapter 5

## Statistical inference

### Reminders about random variables and probability measures

Recall that *a random variable* is just a function $X:\Omega\rightarrow \mathbb{R}$.

Recall that a *probability measure* is just a function $P$ from events to numbers between zero and one which satisfies [the probability axioms](https://logic-teaching.github.io/philstatsbook/Chap02.html#probability-axioms).

The pdf, cdf, and ccdf say how probable, in the sense of $P$, it is for the random variable $X$ to take certain values. 

- The *pdf* is the function $f(x)=P(X=x)$, which answers the question "how probable is it for $X$ to have outcome $x$?"

- The *cdf* is the function $F(x)=P(X\leq x)$, which answers the question "how probable is it for $X$ to have outcome $\leq x$?"

- The *ccdf* is the function $\overline{F}(x)=P(X> x)$, which answers the question "how probable is it for $X$ to have outcome $> x$?"

Again, there are subtleties with the pdf when the sample space is infinite; but we are going to put those to the side.
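As a concrete illustration of these three functions, here is a minimal Python sketch for a finite sample space. The two-coin-flip sample space `omega`, the uniform measure `P`, and the random variable `X` that counts heads are all assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all sequences of two coin flips, each equally probable.
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, len(omega)) for w in omega}

def X(w):
    """Random variable: a function from outcomes to numbers (here, number of heads)."""
    return w.count("H")

def pdf(x):
    """f(x) = P(X = x): total probability of the outcomes w with X(w) = x."""
    return sum(P[w] for w in omega if X(w) == x)

def cdf(x):
    """F(x) = P(X <= x)."""
    return sum(P[w] for w in omega if X(w) <= x)

def ccdf(x):
    """Complementary cdf: P(X > x)."""
    return sum(P[w] for w in omega if X(w) > x)

print(pdf(1), cdf(1), ccdf(1))
```

In this example the cdf and ccdf at any point sum to one, which is one way of seeing that either of them determines the other.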

### Probability distributions

The pdf, cdf, and ccdf determine one another, and so we just refer to them all collectively as the *probability distribution* of $X$.

We write $X\sim F$ to say that $X$ has probability distribution with cdf $F$. 

Similarly one sees $X\sim f$ to say that $X$ has probability distribution with pdf $f$.

Again, when you see $X\sim F$, the $X$ is a random variable, and the $F$ dictates how probable it is that the random variable takes certain values.

### The task of statistical inference

The task of statistical inference is to infer, based on data, which probability distribution is generating the data.

E.g. am I dealing with a fair coin or a biased coin? Am I dealing with a bell shaped distribution or a distribution which is weighted towards the extremes?

Hence the inference is "*from* data to the nature of the *probability distribution* generating the data."

Note the apparent contrast in character to traditional confirmation theory à la [Bayes' Theorem](https://logic-teaching.github.io/philstatsbook/Chap02.html#theorem-bayes-theorem-formula), where the inference is "from *the prior probability measure* and the evidence to the posterior probability measure." (Later when we look at Bayesian approaches in statistics we can ask whether this apparent contrast can be sustained: ostensibly what is missing so far is a prior).

### The format of the probability distributions

There are theoretically a ton of different probability distributions to choose from. But practically speaking, one just restricts attention to a few families of them, which are parameterized by a few simple numbers.

For instance: 

- Binomial $\mathrm{Binom}(n,p)$ is parameterized by the number of trials $n$ and the probability of success $p$ on each trial. 

- Poisson $\mathrm{Pois}(\lambda)$ is parameterized by the rate $\lambda$.

The various numbers like $n,p$ and $\lambda$ are called *the parameters*. In *parametric statistical inference* one just decides on a family ahead of time and then tries to use the data to figure out which parameter is at issue.


Hence, we assume that we are dealing with a family $\{F_{\theta}: \theta\in \Theta\}$ of probability distributions, all drawn from one such parametric family, and we are trying to figure out which member of it is generating our data.

The set $\Theta$ is called *the parameter space*.

For instance: 

- In one case we might be concerned with $\{\mathrm{Binom}(n,p): 5\leq n\leq 10, .4\leq p\leq .6\}$. In this case, the parameter space is $\Theta = \{5,\ldots, 10\}\times [.4,.6]$, where $[.4,.6]$ is just all the real numbers between and including .4 and .6.

- In another case we might be concerned with $\{\mathrm{Pois}(\lambda): 3\leq \lambda\leq 10\}$. In this case, the parameter space is $\Theta = [3,10]$, where $[3,10]$ is all the real numbers between and including 3 and 10.

### The format of the data

We assume that our data comes in the following format: 

a sequence $X_1, \ldots, X_n\sim F_{\theta}$, 

where $\theta$ is the true parameter of the distribution, 

and where $X_1, \ldots, X_n$ are [independent](https://logic-teaching.github.io/philstatsbook/Chap04.html#independence-of-random-variables).

We don't know what the parameter $\theta$ is. We're trying to figure that out.

Since we're in a probabilistic setting, we won't get exact knowledge of $\theta$, but we'll get close, in various ways.
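To make the data format concrete, here is a minimal simulation sketch using only the Python standard library. The helper names `draw_binomial` and `sample_data`, the seed, and the choice of true parameter (5 trials, success probability 0.5) are hypothetical and serve only as illustration.

```python
import random

def draw_binomial(rng, trials, p):
    """One draw from Binom(trials, p): count successes in independent trials."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def sample_data(n, trials, p, seed=0):
    """Return n independent draws X_1, ..., X_n from Binom(trials, p)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [draw_binomial(rng, trials, p) for _ in range(n)]

# Hypothetical true parameter; in practice it is unknown to the statistician.
true_theta = (5, 0.5)
data = sample_data(n=20, trials=true_theta[0], p=true_theta[1])
print(data)
```

The statistician only ever sees the list `data`. The inference problem is to say something about `true_theta` on that basis.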


## The task of statistical inference

We have a family $\{F_{\theta}: \theta\in \Theta\}$ of probability distributions.

One of them $F_{\theta}$, whose parameter is called the *true parameter* $\theta$, is generating the data.

The data has the form of independent random variables $X_1, \ldots, X_n \sim F_{\theta}$, where $\theta$ is the true parameter. 

Our job is to look at the data and to try to figure out the true parameter.


Wasserman writes: 

> Many inferential problems can be identified as being one of three types: estimation, confidence sets, and hypothesis testing ({cite}`Wasserman2013-bc` p. 90)

Most philosophical discussion concerns hypothesis testing.

In this chapter and the next we say a little about estimation.


## Averages

If one has some data $X_1, \ldots, X_n$, the natural thing to do is to take its average $\overline{X}_n$:

$$\overline{X}_n(\omega) = \frac{1}{n} \sum_{i=1}^n X_i(\omega) = \frac{X_1(\omega)+\cdots +X_n(\omega)}{n}$$

If $n$ is clear from context, we may write $\overline{X}$ instead of $\overline{X}_n$.

Hence, note the notational convention: an overline usually denotes an average.
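As a small illustration of the formula, the average can be computed as follows. The data values are made up for the example, and `average` is a hypothetical helper name.

```python
def average(xs):
    """Return (X_1 + ... + X_n) / n for a non-empty list of numbers."""
    return sum(xs) / len(xs)

# Made-up realized values X_1(w), ..., X_5(w).
data = [2, 3, 1, 4, 2]
print(average(data))  # 12 / 5 = 2.4
```

Note that quite different data sets, such as `[0, 0, 0, 0, 12]`, have the same average, so the average does not let one recover the individual data points.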

Stigler {cite}`Stigler2016-lx` notes that the naturalness of this is a little puzzling, for two reasons:

- On first principles: it is a priori not obvious how "you can actually gain information by throwing information away" and indeed this was a "truly revolutionary" idea ({cite}`Stigler2016-lx` pp. 3-4).

- On the basis of the historical record: "In antiquity, and in the Middle Ages, when reaching for a summary of diverse data, people chose an individual example" and not an average (pp. 32-33). 

Let us use some simple ideas developed last time to try to dispel these puzzles.

The first idea pertains to what happens in the limit.

The second idea pertains to the risk at finite stages.

## What happens to averages in the limit

### Proposition (expectation of the average)

Suppose independent $X_1, \ldots, X_n, \ldots \sim F_{\theta}$ with common expectation $\theta$.

Then $\mathbb{E}\overline{X}_n = \theta$.

*Proof*:

$\mathbb{E}\overline{X}_n = \mathbb{E} \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n} \sum_{i=1}^n \theta = \frac{1}{n} \cdot n \cdot \theta = \theta$
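The proposition can be checked by brute force in a small case. The sketch below assumes, purely for illustration, that each $X_i$ is the roll of a fair six-sided die, so that $\theta = 7/2$; it enumerates every possible sequence of $n$ rolls and computes the expectation of the average exactly.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)              # assumed example: a fair six-sided die
theta = Fraction(sum(faces), 6)  # expectation of a single roll, 7/2

def expected_average(n):
    """Exact expectation of the average of n independent rolls, by enumeration."""
    outcomes = list(product(faces, repeat=n))
    prob = Fraction(1, len(outcomes))  # each sequence of rolls is equally probable
    return sum(prob * Fraction(sum(w), n) for w in outcomes)

for n in (1, 2, 3):
    print(n, expected_average(n), theta)
```

For each `n` the two printed values agree, as the proposition predicts.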

This next proposition was mentioned in [the last chapter](https://logic-teaching.github.io/philstatsbook/Chap04.html#proposition-root-n-rule):

### Proposition (variance of the average; aka the root n rule)

Suppose independent $X_1, \ldots, X_n, \ldots \sim F_{\theta}$ with common expectation $\theta$ and variance $\sigma^2$.

Then $\mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}$.

*Proof*

One has 

$\mathrm{Var}(\overline{X}_n) = \sum_{i=1}^n \mathrm{Var}(\frac{X_i}{n})$ by independence

$\hspace{5mm} = \sum_{i=1}^n \frac{1}{n^2}  \cdot \mathrm{Var}(X_i)$ by [earlier proposition](https://logic-teaching.github.io/philstatsbook/Chap04.html#proposition-how-multiplication-and-addition-of-reals-affects-variance)

$\hspace{5mm} = \frac{1}{n^2} \sum_{i=1}^n  \mathrm{Var}(X_i)$

$\hspace{5mm} = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}$.
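The identity $\mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}$ can likewise be verified exactly in a small case. The sketch below again assumes, as an illustrative choice, that each $X_i$ is a fair die roll; `variance_of_average` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # assumed example: a fair six-sided die

def variance_of_average(n):
    """Exact variance of the average of n independent rolls, by enumeration."""
    outcomes = list(product(faces, repeat=n))
    prob = Fraction(1, len(outcomes))
    averages = [Fraction(sum(w), n) for w in outcomes]
    mean = sum(prob * a for a in averages)
    return sum(prob * (a - mean) ** 2 for a in averages)

sigma2 = variance_of_average(1)  # variance of a single roll
for n in (1, 2, 3, 4):
    # The two printed fractions should coincide: Var of the average, and sigma^2 / n.
    print(n, variance_of_average(n), sigma2 / n)
```

Since the variance shrinks like $1/n$, the standard deviation shrinks like $1/\sqrt{n}$, which is where the name "root n rule" comes from.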



## Consistency in mse of the average

Suppose independent $X_1, \ldots, X_n, \ldots \sim F_{\theta}$ with common expectation $\theta$ and variance $\sigma^2$.

Consider the following quantity, *the mean squared error of the average*, which one can think of as measuring the distance between the average and the expectation:

$$\mathbb{E}(\overline{X}_n-\theta)^2$$

Since $\overline{X}_n$ itself has expectation $\theta$, this is also $\mathrm{Var}(\overline{X}_n)$,

which by the previous proposition is $\mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}$.

Hence, as $n$ gets bigger and bigger, this goes to zero. 

That is, in the limit, the mean squared error of the average goes to zero.
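One way to make the limit statement tangible is to ask how large $n$ must be before the mean squared error $\frac{\sigma^2}{n}$ drops below a chosen tolerance. The sketch below is only an illustration: the variance $\sigma^2 = 35/12$ (that of a fair die roll) and the tolerance $1/100$ are assumed values, and the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def mse_of_average(sigma2, n):
    """Mean squared error of the average of n observations: sigma^2 / n."""
    return Fraction(sigma2) / n

def smallest_n(sigma2, eps):
    """Smallest n with sigma^2 / n <= eps, namely ceil(sigma^2 / eps)."""
    return ceil(Fraction(sigma2) / Fraction(eps))

sigma2 = Fraction(35, 12)  # assumed variance: that of a fair die roll
eps = Fraction(1, 100)     # assumed tolerance
n = smallest_n(sigma2, eps)
print(n, float(mse_of_average(sigma2, n)))  # n = 292
```

However small the tolerance, some finite $n$ suffices, which is just what it means for $\frac{\sigma^2}{n}$ to go to zero.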



## Risk in the finite

In general, for a random variable $Y$ used to estimate $\theta$, one can consider the *mean squared error of $Y$*:

$$\mathbb{E}(Y-\theta)^2$$

Consider again Stigler's remark:

> "In antiquity, and in the Middle Ages, when reaching for a summary of diverse data, people chose an individual example" and not an average (pp. 32-33). 

Suppose that instead of the average, you choose an individual data point, like $X_{7}$.

Since $X_7$ has expectation $\theta$, its mean squared error is just its variance:

$$\mathbb{E}(X_7-\theta)^2=\mathrm{Var}(X_7)=\sigma^2$$

Hence, for any $n>1$ one has that 

$$\mathbb{E}(\overline{X}_n-\theta)^2 = \frac{\sigma^2}{n}<\sigma^2 =\mathbb{E}(X_7-\theta)^2$$

Hence, in the short run, the average is less risky. 
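The comparison can also be seen in simulation. The following is a sketch under stated assumptions: the data are taken to be fair die rolls (so $\theta = 3.5$ and $\sigma^2 = 35/12 \approx 2.92$), the sample size is $n = 10$, and the function name, seed, and number of repetitions are arbitrary choices for the illustration.

```python
import random

def simulate_mse(n=10, reps=20000, seed=0):
    """Estimate the MSE of X_7 and of the average over many simulated data sets."""
    rng = random.Random(seed)
    theta = 3.5  # expectation of a fair die roll
    total_single = 0.0
    total_average = 0.0
    for _ in range(reps):
        data = [rng.randint(1, 6) for _ in range(n)]  # X_1, ..., X_n
        x7 = data[6]                                  # the individual X_7 (requires n >= 7)
        xbar = sum(data) / n                          # the average
        total_single += (x7 - theta) ** 2
        total_average += (xbar - theta) ** 2
    return total_single / reps, total_average / reps

mse_single, mse_average = simulate_mse()
print(mse_single, mse_average)
```

Under these assumptions the first estimate should come out near $\sigma^2$ and the second near $\frac{\sigma^2}{10}$, in line with the inequality above.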