# Classical Statistical Inference

## Bayesian vs Frequentist

Within the field of statistics there are two prominent schools of thought, with opposing views: the **Bayesian** and the **Classical** (aka **Frequentist**). 

In the Bayesian view, unknown quantities are treated as random variables with known prior distributions. We conduct an experiment, make an observation ($X$), and then try to infer the value that the random variable ($\Theta$) took during the experiment. We derive a posterior distribution $P(\Theta \mid X)$, which tells us how likely it is that $\Theta = \theta$ given that the observation $X$ was made.

By contrast, in Classical Inference, the unknown quantity $\theta$ is viewed as a deterministic constant that happens to be unknown. The classical approach then strives to develop an estimate of $\theta$ that comes with some performance guarantees (for example, a confidence interval).

Suppose that we are trying to measure a physical constant, say the mass of the electron, by means of noisy experiments. The classical statistician will argue that the mass of the electron, while unknown, is just a constant, and that there is no justification for modeling it as a random variable. The Bayesian statistician will counter that a prior distribution simply reflects our state of knowledge. For example, if we already know from past experiments a rough range for this quantity, we can express this knowledge by postulating a prior distribution which is concentrated over that range.

A classical statistician will often object to the arbitrariness of picking a particular prior. A Bayesian statistician will counter that every statistical procedure contains some hidden choices. Furthermore, in some cases, classical methods turn out to be equivalent to Bayesian ones, for a particular choice of a prior. By locating all of the assumptions in one place, in the form of a prior, the Bayesian statistician contends that these assumptions are brought to the surface and are amenable to scrutiny.

## Classical Inference

In classical inference we treat the unknown parameter as an unknown constant ($\theta$) rather than a random variable ($\Theta$). The observation $X = (X_1, X_2, \dots, X_n)$ is a random variable, whose probability distribution depends on the value of $\theta$. This probability distribution is denoted by $P_X(x;\theta)$. Remember that this notation indicates that $P_X$ depends on $\theta$, not that it is conditioned on $\theta$ in the probabilistic sense.  
**$\theta$ is not a random variable. It just decides the probabilities of other random variables, like $X$.**  

<img src="../images/classical_inference/classical_inference.png">

Our task in classical inference is to come up with an estimate $\hat{\Theta}$. The estimate is decided on the basis of the observations $X$ made during the experiment. Thus $\hat{\Theta} = g(X)$. Recall that a function of a R.V. is a R.V. itself. $\hat{\Theta}$ depends on the observations $X$, thus $\hat{\Theta}$ is also a random variable. Each time we conduct the experiment, $\hat{\Theta}$ takes a value $\hat{\theta}$ and $X$ takes the value $x$. $g$ is called the estimator function, and the process of designing the estimator is called modelling. The model should be such that it gives a **"close enough"** estimate of $\theta$ for all possible values of $\theta$.


There are two types of tasks in classical inference:
1. Parameter Estimation :  
    In parameter estimation, $\theta$ can take values from a continuous range of real numbers. Thus, we have to assign a value to $\hat{\Theta}$ from this continuous interval. We are interested in estimators ($g$) that have some desirable properties. For example, we may require that the expected value of the estimation error be zero, or that the estimation error be small with high probability, for all possible values of $\theta$.  
<br/>
2. Hypothesis Testing :  
    In hypothesis testing, $\theta$ can take only a finite number of discrete values, and we have to choose one of them for $\hat{\Theta}$ in each experiment. In particular, $g$ calculates the "likelihood" of each hypothesis under the observed data, and chooses a hypothesis by comparing the likelihoods with a suitably chosen threshold. Here the range of $g$ is the set of all possible hypotheses.  
    

## Classical Parameter Estimation

#### Expected Value Estimation

We will study Classical Estimation by trying to estimate the mean of a random variable. Suppose $X_1,X_2,\dots,X_n$ are i.i.d. (independent identically distributed) random variables drawn from a Gaussian Distribution with mean $\theta$ and variance $\sigma^2$. Parameter $\theta$ is unknown and we want to estimate it using classical statistics.

We can take our estimator to be $g$, where
$$\hat{\Theta}_n = g(X_1, X_2, \dots, X_n) = \text{sample mean} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

The desirable properties of any estimator $g$ are as follows:  
1. $E[\hat{\Theta}_n] = \theta$ **(zero bias)**  
This property should be true for all $\theta$. Recall that $E[\hat{\Theta}_n]$ depends on $\theta$. We don't want our estimates to be systematically low or high, no matter what the value of $\theta$ is. This property is true for our estimator function above.  
<br/>
2. $\hat{\Theta}_n \to \theta$, as $n \to \infty$ **(consistency)**  
The Weak Law of Large Numbers ensures that the sample mean satisfies this property (with convergence understood in probability), for all values of $\theta$.  
<br/>
3. Mean Squared Error $E[(\hat{\Theta}_n - \theta)^2]$ **(mse)**  
The mean squared error should be as small as possible. For our specific problem of mean estimation and our specific estimator $g$, we know that $E[\hat{\Theta}_n] = \theta$. Therefore, $E[(\hat{\Theta}_n - \theta)^2] = var(\hat{\Theta}_n) = \sigma^2/n$, so the MSE decreases as $n$ increases.
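
The three properties above can be checked empirically. The following is a minimal simulation sketch, not part of the original notes: the helper names (`sample_mean`, `simulate`) and the values $\theta = 2$, $\sigma = 3$ are assumptions chosen only for illustration. It repeats the experiment many times and compares the empirical bias and MSE of the sample mean against $0$ and $\sigma^2/n$.

```python
import random
import statistics


def sample_mean(xs):
    """The estimator g: maps the observations to their sample mean."""
    return sum(xs) / len(xs)


def simulate(theta, sigma, n, trials, seed=0):
    """Repeat the experiment `trials` times; return (empirical bias, empirical MSE)."""
    rng = random.Random(seed)
    estimates = [
        sample_mean([rng.gauss(theta, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    bias = statistics.fmean(estimates) - theta
    mse = statistics.fmean((e - theta) ** 2 for e in estimates)
    return bias, mse


if __name__ == "__main__":
    theta, sigma = 2.0, 3.0  # illustrative values; theta is "unknown" to the estimator
    for n in (10, 100, 1000):
        bias, mse = simulate(theta, sigma, n, trials=1000)
        print(f"n={n:5d}  bias={bias:+.4f}  mse={mse:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```

As $n$ grows, the empirical MSE should track $\sigma^2/n$ and shrink toward zero, while the empirical bias stays near zero for every $n$.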

#### Mean Squared Error of an Estimator

Unlike in the example above, $E[\hat{\Theta}_n] = \theta$ does not hold for every estimator. In the general case, applying the identity $E[Z^2] = var(Z) + (E[Z])^2$ with $Z = \hat{\Theta} - \theta$, we derive:  

$$E[(\hat{\Theta} - \theta)^2] = var(\hat{\Theta} - \theta) + (E[\hat{\Theta} - \theta])^2 = var(\hat{\Theta}) + (bias)^2$$


In the previous example, our bias was zero. So,  
$$E[(\hat{\Theta}_n - \theta)^2] = var(\hat{\Theta}_n) = \sigma^2/n$$
<br/>
If we take a dumb estimator which always gives out an estimate of 0, irrespective of the observation, then
$$E[(\hat{\Theta}_n - \theta)^2] = var(\hat{\Theta}_n) + (bias)^2 = 0 + (0 - \theta)^2 = \theta^2$$  

<br/>
Generally the task of Parameter Estimation includes reporting the $MSE$ or $\sqrt{MSE}$, along with designing and implementing an estimator with most of the desirable properties.
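
The decomposition $MSE = var + (bias)^2$ can also be verified numerically. This is a small sketch under assumed values ($\theta = 2$, $\sigma = 3$, $n = 10$); the function names are hypothetical and introduced only for this illustration. It compares the sample mean with the constant-zero estimator discussed above.

```python
import random
import statistics


def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) computed from a list of repeated estimates."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - theta
    variance = statistics.fmean((e - mean_est) ** 2 for e in estimates)
    mse = statistics.fmean((e - theta) ** 2 for e in estimates)
    return mse, variance, bias


def run_estimator(estimator, theta, sigma, n, trials, seed=0):
    """Apply `estimator` to `trials` independent samples of size n."""
    rng = random.Random(seed)
    return [
        estimator([rng.gauss(theta, sigma) for _ in range(n)])
        for _ in range(trials)
    ]


def sample_mean(xs):
    return sum(xs) / len(xs)


def always_zero(xs):
    """The 'dumb' estimator: ignores the data and reports 0."""
    return 0.0


if __name__ == "__main__":
    theta, sigma, n = 2.0, 3.0, 10  # illustrative values
    for name, est in (("sample mean", sample_mean), ("always zero", always_zero)):
        mse, var, bias = mse_decomposition(
            run_estimator(est, theta, sigma, n, trials=2000), theta
        )
        print(f"{name:12s} mse={mse:.4f}  var={var:.4f}  bias^2={bias**2:.4f}")
```

For the constant-zero estimator the variance term vanishes and the whole error comes from the squared bias $\theta^2$; for the sample mean it is the other way around.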

#### Confidence Interval

The value of an estimator $\hat{\Theta}$ may not be informative enough on its own. When we need to represent our estimate in more concrete terms, we specify a confidence interval.

A $1-\alpha$ confidence interval is an interval $[\hat{\Theta}_-,\hat{\Theta}_+]$ such that $P(\hat{\Theta}_- \leq \theta \leq \hat{\Theta}_+) \geq 1-\alpha$ for all $\theta$.

We have mentioned beforehand that $\theta$ is not a random variable. How, then, does the probability of $\theta$ lying in an interval make sense? It makes sense because the endpoints $\hat{\Theta}_-$ and $\hat{\Theta}_+$ are random variables. The probability refers to the chance that our random interval captures $\theta$, not to the chance of $\theta$ taking a particular value.

Confidence Intervals for the problem of mean estimation (discussed above) can be calculated using Normal Tables and Central Limit Theorem. If you need to see an example, click [here](https://www.youtube.com/watch?v=mImHCY0A3a0&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=203). The [next video](https://www.youtube.com/watch?v=MzvRQFYUEFU&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=204) goes over the concept of t-tables.
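
As a rough illustration of the idea, the sketch below builds a $1-\alpha$ confidence interval for the mean under the simplifying assumption that $\sigma$ is known, so that a normal quantile can be used directly (with unknown $\sigma$ one would turn to the t-tables mentioned above). The function names and parameter values are assumptions made for this example. The `coverage` helper repeats the experiment to show that it is the interval, not $\theta$, that is random.

```python
import math
import random
from statistics import NormalDist


def mean_confidence_interval(xs, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for the mean, assuming sigma is known."""
    n = len(xs)
    center = sum(xs) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided standard normal quantile
    half_width = z * sigma / math.sqrt(n)
    return center - half_width, center + half_width


def coverage(theta, sigma, n, trials, alpha=0.05, seed=0):
    """Fraction of repeated experiments whose interval contains the fixed theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        lo, hi = mean_confidence_interval(xs, sigma, alpha)
        hits += lo <= theta <= hi
    return hits / trials


if __name__ == "__main__":
    print(mean_confidence_interval([1.0, 2.0, 3.0, 4.0], sigma=1.0))
    print(coverage(theta=2.0, sigma=3.0, n=25, trials=2000))
```

The reported coverage should be close to $1-\alpha$, which is exactly the guarantee in the definition above.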

#### Maximum Likelihood Estimation

Often, the quantity that we want to estimate cannot be expressed as the expected value of some random variable:  
$\theta \neq E[f(X)]$
<br/>

In such cases, we will pick the value of $\theta$ which maximises the probability of $X = x$.
$$\hat{\theta}_{ML} = g_{ML}(x) = \operatorname*{arg\,max}_{\theta} P_X(x;\theta)$$

We can compare this method to Bayesian Estimation. There we try to pick a value of $\Theta$ for which the posterior distribution $P(\Theta \mid X)$ is the maximum. If you take the prior to be equal for all $\Theta$ (all values are equiprobable), then the MAP Bayesian Estimator would be the value which has maximum likelihood probability $P(X \mid \Theta)$. Maximizing $P(X \mid \Theta)$ is similar to maximizing $P(X ; \theta)$ in Classical Inference.

Despite the similarity in mechanics, the two methods are philosophically very different. In the Bayesian setting, you are asking what the most likely value of $\Theta$ is, whereas in the Classical setting, you are asking which value of $\theta$ makes the observation most likely (least surprising).

Finally we study an example of Maximum Likelihood Estimation, where we try to estimate the bias of a coin, given we get $K$ heads on tossing the coin $n$ times. We calculate the likelihood distribution. We see that it is easier to maximize log-likelihood function, rather than the likelihood function. Finally we arrive at the ML estimator for $\theta$, which is given by the equation:  
$$\hat{\Theta}_{ML} = \frac{K}{n}$$
Follow this [video](https://www.youtube.com/watch?v=00krscK7iBA&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=207) for more details. It also discusses how to derive Maximum Likelihood estimate for the mean of a Gaussian Distribution. We see that for this mean inference problem, $\hat{\Theta}_{ML}$ is the same as the estimate that we obtained earlier using the estimator $g(X) = \sum{X_i}/n$
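
The steps summarized above for the coin example can be written out explicitly. This is a sketch of the standard derivation, assuming that $K$ is binomial with parameters $n$ and $\theta$ and that the observed count $k$ satisfies $0 < k < n$:

$$P_K(k;\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$$

$$\log P_K(k;\theta) = \log\binom{n}{k} + k\log\theta + (n-k)\log(1-\theta)$$

$$\frac{d}{d\theta}\log P_K(k;\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \implies \hat{\theta}_{ML} = \frac{k}{n}$$

The second derivative, $-k/\theta^2 - (n-k)/(1-\theta)^2$, is negative, so this stationary point is indeed a maximum. Replacing the realized value $k$ by the random variable $K$ gives the estimator $\hat{\Theta}_{ML} = K/n$.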


## Hypothesis Testing

#### Binary Hypothesis Testing

Binary Hypothesis Testing is an inference problem where $\theta$ takes just two values. For historical reasons, the two hypotheses are denoted as $H_0$ and $H_1$. $H_0$ is often called the null hypothesis (default model, to be proved or disproved based on the evidence) and $H_1$ the alternate hypothesis.

The available observation is a vector $X = (X_1,\dots,X_n)$ of random variables whose distribution depends on the hypothesis. $P(X \in A ; H_j)$ denotes the probability that the sample $X$ lies in a set $A$ when $H_j$ is true. Again, $P(X \in A ; H_j)$ is not a conditional probability, because the true hypothesis is not a random variable.

In Hypothesis Testing, the function $g$ is generally referred to as the Decision Rule, with $g(x) = H_0$ or $g(x) = H_1$. Any decision rule can be represented by a partition of the set of all possible values of the observation vector $X$ into two subsets: a set $R$ called the Rejection Region, and its complement $R^c$, the Acceptance Region. Hypothesis $H_0$ is rejected if the observation lies in the Rejection Region and accepted otherwise.

For a particular choice of the rejection region R, there are two possible types of errors:  
(a) Reject $H_0$ even though $H_0$ is true. This is called **Type I** error, or a **false rejection**, and happens with probability $\alpha(R) = P(X \in R; H_0)$.  
(b) Accept $H_0$ even though $H_0$ is false. This is called **Type II** error, or a **false acceptance**, and happens with probability $\beta(R) = P(X \notin R; H_1)$.  

We define the rejection regions as follows:  
$$R = \{ x \mid L(x) > \xi \},$$  
where the Likelihood Ratio $L(x)$ is defined as  
$$L(x) = \frac{P_X(x;H_1)}{P_X(x;H_0)}.$$

See the dice rolling example (on Pg488 of the textbook) to understand these concepts better.  

Note that choosing $\xi$ trades off the probabilities of the two types of errors, as illustrated by the preceding example. Indeed, as $\xi$ increases, the rejection region becomes smaller. As a result, the false rejection probability $\alpha(R)$ decreases while false acceptance probability $\beta(R)$ increases. Because of this trade-off there is no single best way to choose $\xi$. The most popular approach is as follows:  
<img src="../images/classical_inference/LRT.png">  
<br/>  

When $L(X)$ is a continuous random variable, the probability $P(L(X) > \xi; H_0)$ moves continuously from 1 to 0 as $\xi$ increases. Thus, we can find a value of $\xi$ for which the requirement $P(L(X) > \xi; H_0) = \alpha$ is satisfied. If, however, $L(X)$ is a discrete random variable, it may be impossible to satisfy the equality $P(L(X) > \xi; H_0) = \alpha$ exactly, no matter how $\xi$ is chosen, as seen in the dice rolling example. In such cases, there are several possibilities:
<ul>
    <li>Strive for approximate equality.</li>
    <li>Choose the smallest value of $\xi$ that satisfies $P(L(X) > \xi; H_0) \leq \alpha$.</li>
</ul>
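
To make the trade-off and the choice of $\xi$ concrete, here is a minimal sketch for a continuous case that is not taken from the textbook: a single observation that is $N(0,1)$ under $H_0$ and $N(1,1)$ under $H_1$ (both means are assumptions for illustration, as are the function names). In this setting $L(x) > \xi$ is equivalent to $x > c$ for a cutoff $c$ that depends on $\xi$, so both error probabilities follow from the normal CDF.

```python
import math
from statistics import NormalDist


def likelihood_ratio(x, mu0=0.0, mu1=1.0):
    """L(x) = P_X(x; H1) / P_X(x; H0) for one unit-variance Gaussian observation."""
    return NormalDist(mu1, 1.0).pdf(x) / NormalDist(mu0, 1.0).pdf(x)


def error_probabilities(xi, mu0=0.0, mu1=1.0):
    """Return (alpha, beta) for the rejection region {x : L(x) > xi}; needs mu1 > mu0."""
    # L(x) = exp((mu1 - mu0) * x - (mu1^2 - mu0^2) / 2), so L(x) > xi  <=>  x > c.
    c = (math.log(xi) + (mu1 ** 2 - mu0 ** 2) / 2) / (mu1 - mu0)
    alpha = 1 - NormalDist(mu0, 1.0).cdf(c)  # false rejection: X in R under H0
    beta = NormalDist(mu1, 1.0).cdf(c)       # false acceptance: X not in R under H1
    return alpha, beta


def threshold_for_alpha(alpha, mu0=0.0, mu1=1.0):
    """Find xi such that P(L(X) > xi; H0) = alpha (possible because L(X) is continuous)."""
    c = NormalDist(mu0, 1.0).inv_cdf(1 - alpha)
    return math.exp((mu1 - mu0) * c - (mu1 ** 2 - mu0 ** 2) / 2)


if __name__ == "__main__":
    for xi in (0.5, 1.0, 2.0, 4.0):
        a, b = error_probabilities(xi)
        print(f"xi={xi:4.1f}  alpha={a:.4f}  beta={b:.4f}")
    xi = threshold_for_alpha(0.05)
    print("xi for alpha=0.05:", xi, error_probabilities(xi))
```

As $\xi$ increases, the printed $\alpha$ falls while $\beta$ rises, which is the trade-off described above; `threshold_for_alpha` then picks the $\xi$ that meets a target false rejection probability exactly.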


#### Significance Testing

Hypothesis testing problems encountered in realistic settings do not always involve two well-specified alternatives, so the methodology in the preceding section cannot be applied. The purpose of this section is to introduce an approach to this more general class of problems.

Consider problems such as the following:
1. A coin is tossed repeatedly and independently. Is the coin fair?
2. A die is tossed repeatedly and independently. Is the die fair?
3. We observe a sequence of i.i.d. normal random variables $X_1, X_2, \dots, X_n$. Are they standard normal?

In all of the above cases we are dealing with a phenomenon that involves uncertainty, presumably governed by a probabilistic model. We have a default hypothesis, usually called the null hypothesis, denoted by $H_0$, and we wish to determine on the basis of the observations $X = (X_1,\dots, X_n)$ whether the null hypothesis should be rejected or not.

We will mostly restrict the scope of our discussion to situations with the following characteristics:
<ol>
    <li><b>Parametric models</b>: We assume that the observations $X_1,\dots,X_n$ have a distribution governed by a joint PMF (discrete case) or a joint PDF (continuous case), which is completely determined by an unknown parameter $\theta$ (scalar or vector) belonging to a given set M of possible parameters.  
    </li>
    <li><b>Simple null hypothesis</b>: The null hypothesis asserts that the true value of $\theta$ is equal to a given element $\theta_0$ of M.  
    </li>
    <li><b>Alternative hypothesis</b>: The alternative hypothesis, denoted by $H_1$, is just the statement that $H_0$ is not true, i.e., $\theta \neq \theta_0$.
    </li>
</ol>

Now study Example 9.15, where we try to answer the question "A coin is tossed repeatedly and independently. Is the coin fair?" This gives insight into how Significance Testing is performed. We summarize and generalize the essence of Example 9.15 as below   
<img src="../images/classical_inference/significance_testing.png">

Given a value of $\alpha$, if the hypothesis $H_0$ ends up being rejected, one says that $H_0$ is rejected at the $\alpha$ significance level. This does not mean that the probability of $H_0$ being true is less than $\alpha$. Instead, it means that when this methodology is used and $H_0$ is true, we will falsely reject it a fraction $\alpha$ of the time. Rejecting a hypothesis at the 1% significance level does not mean that the probability of $H_0$ being true (in spite of being rejected) is 1%. It means that the observed data are highly unusual under the model associated with $H_0$: data this extreme would arise only 1% of the time, which provides strong evidence that $H_0$ is false.  

Quite often, statisticians skip steps (c) and (d) in the above described methodology. Instead, once they calculate the realized value s of S, they determine and report an associated **p-value** defined by  
$$\text{p-value} = \min\{\alpha \mid H_0 \text{ would be rejected at the } \alpha \text{ significance level}\}$$  

Equivalently, the p-value is the value of $\alpha$ for which $s$ would be exactly at the threshold between rejection and non-rejection. Thus, for example, the null-hypothesis will be rejected at the 5% significance level if and only if the p-value is smaller than 0.05. 

[Videos by Khan Academy](https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests) are also great for understanding significance testing and p-values. Note that the null and alternative hypotheses in these videos do not exactly fit the criterion for the alternative hypothesis specified above.

In short, saying that the p-value of a realized statistic $S=s$ is $p$ means that, under the null hypothesis, the probability of $S$ taking the value $s$ or any value less likely than $s$ is $p$. You can either report $s$ together with its p-value, or fix a significance level in advance. If the p-value for the experiment is greater than the significance level, we do not reject $H_0$; if it is less than the significance level, we reject $H_0$.
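
The description above translates directly into a computation for the coin example. The sketch below is one possible reading, under the assumption that the statistic $S$ is the number of heads in $n$ independent tosses and that $H_0$ is $\theta_0 = 0.5$; the function names are hypothetical. It sums the null probabilities of all outcomes that are no more likely than the observed one.

```python
import math


def binomial_pmf(k, n, p):
    """P(S = k) when S counts heads in n independent tosses with head probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def p_value(s, n, theta0=0.5):
    """Probability under H0 of observing s or any outcome at most as likely as s."""
    p_obs = binomial_pmf(s, n, theta0)
    total = 0.0
    for k in range(n + 1):
        pk = binomial_pmf(k, n, theta0)
        # Small relative tolerance so outcomes tied with s are not lost to rounding.
        if pk <= p_obs * (1 + 1e-9):
            total += pk
    return total


def reject_null(s, n, alpha, theta0=0.5):
    """Reject H0 at significance level alpha when the p-value is below alpha."""
    return p_value(s, n, theta0) < alpha


if __name__ == "__main__":
    n = 100
    for s in (50, 55, 60, 65):
        print(f"s={s}  p-value={p_value(s, n):.4f}  reject at 5%: {reject_null(s, n, 0.05)}")
```

Reporting the p-value alongside $s$ lets the reader apply whatever significance level they prefer, which is why it is often reported instead of a single accept/reject decision.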