###### Hypothesis Testing
## Exposition
In this section, I review how hypothesis testing is carried out. The material is drawn largely from Wasserman's "All of Statistics", Cox and Hinkley's "Theoretical Statistics", and Rao's "Linear Statistical Inference and Its Applications". For a more thorough exposition, read Wasserman Chapter 10, Keener Chapter 12, and Cox/Hinkley Chapters 3, 4, and 5.

A hypothesis test is a means of statistical inference: we are given data from an experiment and want to characterize the underlying distribution that we assume generated it. To do so, we formulate a **null hypothesis**, $H_0$, which is a claim about the parameters of the distribution (in the parametric setting) or about a statistical functional of the distribution (in the nonparametric setting). More formally, if we let $\Omega$ denote the set of all values the data could take, and let $X=(x_1,\dots,x_n)$ denote our observations from an experiment, then $H_0$ either partially or completely specifies the distribution of $X$ (a probability measure on the Borel $\sigma$-algebra of $\Omega$). For example, suppose the samples come from a normal distribution. A null hypothesis that partially specifies the distribution makes a claim about the mean or the variance, but not both. A hypothesis that completely specifies the distribution fixes both the mean and the variance, since together they fully characterize a normal distribution. A hypothesis test is flexible in that the null is not restricted to a single point in the parameter space; we can just as well test whether the parameter lies in an interval or some other set. This gives the two types of hypotheses in parametric inference: **simple** ones (a single point, so the distribution is completely specified) and **composite** ones (intervals, unions of intervals, etc.).

After specifying the null hypothesis, our goal is to decide, based on the data, whether to reject the null or retain it. A reader familiar with statistics will recognize the connection between this task (judging parameter values in light of the data) and likelihood functions, the realm of maximum likelihood estimation (MLE). To make the decision, we first need a test statistic and a critical value. The **test statistic** is a function of the data (and hence a random variable) that outputs a real number, conventionally larger when the data give stronger evidence against the null and smaller when the data are more consistent with it. We then set a threshold for the test statistic, a real number known as the **critical value**. If the test statistic exceeds the critical value, we reject the null. Choosing this threshold is challenging, and there is an unavoidable degree of arbitrariness in it. For this reason, confidence intervals are often preferred unless all the assumptions of the hypothesis test are met.

With the test statistic and critical value, we can deduce the region of $\Omega$ in which we reject the null; this is known as the **rejection region**. Together, the critical value and the test statistic determine the size of the hypothesis test. The **size** is the largest probability (the supremum over all distributions allowed by the null) of rejecting the null when the null is actually true. Intuitively, it is the worst-case chance that we err in this way: the null is true, but the data land in the rejection region we constructed, so we reject it. The error of rejecting the null when it is in fact true is known as a **type I error**. The other form of error is the reverse: retaining the null when it is false, known as a **type II error**. More generally, we define the **power function** of a test as the probability that the data lie in the rejection region, viewed as a function of the parameter $\theta$ ranging over both the null and the alternative. In symbols,

$$\text{Power of }\theta=\beta(\theta)=\mathbb{P}(\text{Data is in rejection region}|\theta).$$ 
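
As a concrete illustration of the power function, here is a minimal sketch for a one-sided z-test of a normal mean. The setting (known $\sigma$, sample size $n=25$, level $0.05$) and the function name `power_one_sided_z` are assumptions made for this example and do not come from the references.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """beta(theta) for the test that rejects H0: theta <= theta0 when
    sqrt(n) * (xbar - theta0) / sigma > z_alpha, with sigma assumed known."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value
    # Under theta, the z-statistic is N(shift, 1) with this shift.
    shift = sqrt(n) * (theta - theta0) / sigma
    # P(statistic > z_alpha | theta)
    return 1 - std_normal.cdf(z_alpha - shift)


for theta in (-0.2, 0.0, 0.2, 0.5):
    print(f"theta = {theta:+.1f}  power = {power_one_sided_z(theta):.3f}")
```

At the boundary $\theta=\theta_0$ the power equals the size $\alpha$, and it increases as $\theta$ moves into the alternative.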

In non-randomized tests, we reject the null if the data fall in some critical region $R$ and retain it otherwise. In randomized tests, we have a test (or critical) function $\phi:\Omega\rightarrow [0,1]$, where $\phi(x)$ is the probability of rejecting the null given the data $x$. A non-randomized test can be seen as a randomized test in which $\phi$ is the indicator function that takes the value 1 if the data lie in the critical region and 0 otherwise. Observe that

$$\beta(\theta)=\mathbb{P}(\text{Reject } H_0|\theta)=\mathbb{E}(\mathbb{P}(\text{Reject } H_0|\theta,X)|\theta)=\mathbb{E}(\phi(X)|\theta).$$
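
To see why randomization can be useful, consider discrete data, where a non-randomized test usually cannot hit a prescribed size exactly. The sketch below assumes $X\sim\text{Binomial}(n,p)$ with $H_0:p=p_0$ against $p>p_0$, and uses a test function that rejects outright above a cutoff $c$ and with probability $\gamma$ at $c$. The names `randomized_test`, `power`, `c`, and `gamma` are hypothetical and chosen only for this example.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def randomized_test(n, p0, alpha):
    """Return (c, gamma) for phi(x) = 1 if x > c, gamma if x == c, 0 otherwise,
    chosen so that E[phi(X) | p0] equals alpha."""
    tail = 0.0  # running value of P(X > c | p0)
    for c in range(n, -1, -1):
        pmf_c = binom_pmf(c, n, p0)
        if tail + pmf_c > alpha:
            return c, (alpha - tail) / pmf_c
        tail += pmf_c
    raise ValueError("alpha must be smaller than 1")


def power(n, p, c, gamma):
    """beta(p) = E[phi(X) | p] for the randomized test defined by (c, gamma)."""
    upper = sum(binom_pmf(k, n, p) for k in range(c + 1, n + 1))
    return upper + gamma * binom_pmf(c, n, p)


c, gamma = randomized_test(n=10, p0=0.5, alpha=0.05)
print(c, round(gamma, 4), power(10, 0.5, c, gamma), power(10, 0.8, c, gamma))
```

Without the randomization at $x=c$, the attainable sizes would jump between the binomial tail probabilities on either side of $\alpha$.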


Testing two simple hypotheses: Let us consider two simple hypotheses about the mean age of athletes on a baseball team: $H_0:\mu=25$ versus $H_1:\mu=30$. This is a poorly formulated pair of hypotheses, but bear with me. Let $L(X;\mu=25)$ and $L(X;\mu=30)$ denote the likelihoods of the data under the two parameter values. With only two distributions, the power function of the test takes just two values, $\alpha=\beta(25)$ and $\beta(30)$, where

$$\beta(25)=\mathbb{E}(\phi(X)|\mu=25)=\int \phi(x)L(x;25)d\mu(x).$$

Here $d\mu(x)$ denotes integration against the dominating measure (not the mean $\mu$); for discrete data the integral is the sum $\sum_x \phi(x)L(x;25)$.

Our goal is to find a $\phi$ that keeps $\beta(25)$ small while making $\beta(30)$ as large as possible. In simple words, we want a test that rarely rejects the null when it is true and reliably rejects it when it is false. We can do so with the help of Lagrange multipliers. Suppose that $\phi$ has **level** $\alpha$, meaning that the size (the supremum of the power function under the null) is bounded above by $\alpha$, so that $\mathbb{E}(\phi(X)|\mu=25)\leq\alpha$. Then for any $\lambda\geq 0$,

$$\mathbb{E}(\phi(X)|\mu=30)\leq \mathbb{E}(\phi(X)|\mu=30)-\lambda\mathbb{E}(\phi(X)|\mu=25)+\lambda\alpha.$$

Our objective is to maximize the right-hand side by picking the right $\phi$. As $\lambda\alpha$ is a constant, it suffices to maximize

$$\mathbb{E}(\phi(X)|\mu=30)-\lambda\mathbb{E}(\phi(X)|\mu=25)=\int L(x|\mu=30)\phi(x)d\mu(x)-\lambda\int L(x|\mu=25)\phi(x)d\mu(x)$$

$$=\int \left[L(x|\mu=30)-\lambda L(x|\mu=25)\right]\phi(x)d\mu(x).$$

From here, we can split the integral over two regions: $S_1$, the set of $x$ where $L(x|\mu=30)>\lambda L(x|\mu=25)$, and its complement $S_2$, where $L(x|\mu=30)\leq\lambda L(x|\mu=25)$. Then the above becomes

$$\int_{S_1} \left[L(x|\mu=30)-\lambda L(x|\mu=25)\right]\phi(x)d\mu(x)-\int_{S_2}\left[\lambda L(x|\mu=25)-L(x|\mu=30)\right]\phi(x)d\mu(x).$$

The choice of $\phi$ from here is clear: the integrand $\left[L(x|\mu=30)-\lambda L(x|\mu=25)\right]\phi(x)$ is positive on $S_1$ and nonpositive on $S_2$, so we take $\phi(x)=1$ for $x\in S_1$ and $\phi(x)=0$ for $x\in S_2$. What does this result tell us? Well, if

$$\frac{L(X|\mu=30)}{L(X|\mu=25)}> \lambda,$$

then $\phi$ is one and so we reject the null. (Where the ratio equals $\lambda$ exactly, the integrand vanishes, so $\phi$ may take any value in $[0,1]$ there; this freedom is what lets a randomized test attain the level exactly.) Any test of this form, where the decision to reject or retain the null is based on a ratio of likelihoods, is known as a **likelihood ratio test**. These tests are very powerful, and an important result stemming from the above discussion is the **Neyman-Pearson lemma**: given a level $\alpha\in(0,1)$, there exists a likelihood ratio test $\phi_\alpha$ with level $\alpha$, and any likelihood ratio test with level $\alpha$ maximizes the power under the alternative, $\mathbb{E}(\phi(X)|\theta_{H_1})$, among all tests with level at most $\alpha$. If you would like to flesh out the arguments sketched here, I strongly recommend Keener Chapter 12, which treats the problem abstractly, as well as Rao pages 446-464. Both do this much better than I can if you want the full mathematical rigor.
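
The baseball example can be made concrete under one extra assumption that the discussion above does not make: that the ages are i.i.d. normal with a known standard deviation $\sigma$. In that case the log likelihood ratio is an increasing function of the sample mean, so the likelihood ratio test reduces to rejecting when the sample mean exceeds a cutoff. The sketch below uses made-up ages and $\sigma=5$; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

MU0, MU1 = 25.0, 30.0  # H0: mu = 25 versus H1: mu = 30


def log_likelihood_ratio(xs, sigma):
    """log[L(x; mu=30) / L(x; mu=25)] for i.i.d. N(mu, sigma^2) data."""
    return sum((x - MU0) ** 2 - (x - MU1) ** 2 for x in xs) / (2 * sigma**2)


def np_test(xs, sigma, alpha=0.05):
    """Size-alpha likelihood ratio test; returns (reject, cutoff, power)."""
    n = len(xs)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Cutoff on the sample mean that gives size alpha under mu = 25.
    cutoff = MU0 + z_alpha * sigma / sqrt(n)
    # log LR = n (MU1 - MU0) (xbar - (MU0 + MU1) / 2) / sigma^2, so the same
    # cutoff can be expressed as a threshold on the log likelihood ratio.
    log_threshold = n * (MU1 - MU0) * (cutoff - (MU0 + MU1) / 2) / sigma**2
    reject = log_likelihood_ratio(xs, sigma) > log_threshold
    # Power: P(xbar > cutoff | mu = 30).
    power = 1 - NormalDist(MU1, sigma / sqrt(n)).cdf(cutoff)
    return reject, cutoff, power


ages = [31, 27, 29, 33, 26, 30, 28, 32, 29]  # hypothetical data
reject, cutoff, power = np_test(ages, sigma=5.0)
print(reject, round(cutoff, 3), round(power, 3), round(fmean(ages), 3))
```

The threshold on the likelihood ratio plays the role of $\lambda$ above; fixing the size $\alpha$ determines it.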


## In summary

Hypothesis testing is generally a procedure that follows the order of steps given below: 
1. Identify the parameter of interest. 
2. State the null hypothesis, $H_0$.
3. Specify an appropriate alternative hypothesis, $H_1$.
4. Choose a significance level, $\alpha$.
5. Determine an appropriate test statistic of the data, $T(X)$.
6. State the rejection region for the statistic.
7. Compute the necessary sample quantities, substitute them into the test statistic, and compute its value.
8. Decide whether or not $H_0$ should be rejected and report that in the problem context.
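
The steps above can be sketched end to end for a two-sided z-test of a normal mean. The data, the null value $\mu_0=10$, and the known $\sigma=0.5$ are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Hypothetical measurements; sigma is assumed known for this illustration.
data = [10.3, 9.8, 10.6, 10.1, 10.9, 10.4, 9.9, 10.7]
sigma = 0.5

# Steps 1-3: the parameter is the mean mu; H0: mu = 10 versus H1: mu != 10.
mu0 = 10.0

# Step 4: significance level.
alpha = 0.05


# Step 5: test statistic T(X) = sqrt(n) * (xbar - mu0) / sigma.
def z_statistic(xs, mu0, sigma):
    return sqrt(len(xs)) * (fmean(xs) - mu0) / sigma


# Step 6: rejection region |T| > z_{alpha/2}.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

# Step 7: plug the sample into the statistic.
t_obs = z_statistic(data, mu0, sigma)

# Step 8: decide and report.
reject = abs(t_obs) > critical_value
print(f"T = {t_obs:.3f}, critical value = {critical_value:.3f}, reject H0: {reject}")
```

With these particular numbers the statistic falls just inside the critical value, so the null is retained at the 5% level.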

## Tests

### Wald Test
The Wald test is a means of hypothesis testing that relies on large-sample asymptotic normality to test a simple null hypothesis about a scalar parameter, $H_0:\theta=\theta_0$, versus the two-sided alternative $H_1:\theta\neq\theta_0$. For example, to test whether two parameters $\theta_1$ and $\theta_2$ share the same value, take $\theta=\theta_1-\theta_2$ and $\theta_0=0$. Within this framework, we recognize that the MLE of $\theta$ is asymptotically normal, so under $H_0$

$$\frac{\hat{\theta}-\theta_0}{\hat{se}(\hat{\theta})}=\left(\hat{\theta}-\theta_0\right)\sqrt{I_n(\hat{\theta})}\xrightarrow{D}\mathcal{N}(0,1).$$

With this in mind, the (asymptotically) size $\alpha$ Wald test has rejection region $R=\{W:|W|>z_{\alpha/2}\}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution and the test statistic is specified by

$$W=\frac{\hat{\theta}-\theta_0}{\hat{se}}.$$

Side notes: $\hat{\theta}$ need not be the MLE. For instance, we can use the plug-in estimator, which shares the same asymptotic normality for most parameters. Similarly, we could estimate the standard error with the plug-in estimator or the bootstrap for difficult estimators such as the median or the mode.
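
A minimal sketch of the Wald test, assuming Bernoulli data so that $\hat{\theta}=\hat{p}$ is the sample proportion and the estimated standard error is $\sqrt{\hat{p}(1-\hat{p})/n}$. The function name and the counts in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_test_bernoulli(successes, n, p0, alpha=0.05):
    """Wald test of H0: p = p0 versus H1: p != p0 for Bernoulli data.

    Returns (W, reject)."""
    p_hat = successes / n  # MLE of p
    if p_hat in (0.0, 1.0):
        raise ValueError("estimated standard error is zero; W is undefined")
    se_hat = sqrt(p_hat * (1 - p_hat) / n)  # plug-in standard error
    w = (p_hat - p0) / se_hat
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return w, abs(w) > z


w, reject = wald_test_bernoulli(successes=60, n=100, p0=0.5)
print(f"W = {w:.3f}, reject H0: {reject}")
```

Note that the standard error is evaluated at $\hat{p}$ rather than at $p_0$, which is what distinguishes the Wald statistic from the score-type statistic that uses $\sqrt{p_0(1-p_0)/n}$.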

### Likelihood Ratio Test
"The likelihood ratio test (LRT) is more general and can be used for testing a vector-valued parameter" (Wasserman).

Consider testing $H_0:\theta \in \Theta_0$ versus $H_1:\theta \not\in \Theta_0$ on the basis of $n$ observations of data, $X_1,\dots,X_n$, assumed to stem from the underlying distribution. 

Then the **likelihood ratio test statistic** is given by 

$$
\lambda = 2 \log \bigg(\frac{\sup\limits_{\theta \in \Theta} \mathcal{L}_n(X_1,\dots,X_n|\theta)}
{\sup\limits_{\theta \in \Theta_0} \mathcal{L}_n (X_1,\dots,X_n|\theta) } \bigg) 
        =   2 \log \bigg(\frac{\mathcal{L}_n (X_1,\dots,X_n|\hat{\theta}_{MLE}) }         {\mathcal{L}_n(X_1,\dots,X_n|\hat{\theta}_{MLE,0}) } \bigg)
$$

Under $H_0$ and suitable regularity conditions, the LRT statistic is asymptotically chi-squared: $\lambda \xrightarrow{D} \chi^2_{r-q}$ where $r = \dim(\Theta)$ and $q = \dim(\Theta_0)$.

* Computation trick: $\lambda = 2 \log\mathcal{L}_n(X_1,\dots,X_n|\hat{\theta}_{MLE}) - 2 \log\mathcal{L}_n(X_1,\dots,X_n|\hat{\theta}_{MLE,0})$
    * Find the log-likelihood and plug in the unconstrained and constrained MLEs
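
The computation trick can be sketched for a simple case: i.i.d. Poisson counts with $H_0$ fixing the rate at a single value, so that $r-q=1$. The Poisson model, the function names, and the counts are assumptions for illustration. To stay within the standard library, the critical value uses the fact that the upper-$\alpha$ quantile of $\chi^2_1$ equals $z_{\alpha/2}^2$.

```python
from math import log
from statistics import NormalDist, fmean


def poisson_loglik(rate, xs):
    """Poisson log-likelihood up to the additive constant -sum(log(x!)),
    which cancels in the likelihood ratio."""
    total = sum(xs)
    log_term = total * log(rate) if total > 0 else 0.0
    return log_term - len(xs) * rate


def poisson_lrt(xs, rate0, alpha=0.05):
    """LRT of H0: rate = rate0 against H1: rate != rate0.

    Returns (statistic, reject)."""
    rate_hat = fmean(xs)  # unconstrained MLE; the constrained MLE is rate0
    stat = 2 * (poisson_loglik(rate_hat, xs) - poisson_loglik(rate0, xs))
    # Upper-alpha quantile of chi-squared with 1 degree of freedom.
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return stat, stat > critical


counts = [3, 5, 4, 6, 2, 7, 5, 4, 6, 8]  # hypothetical data
stat, reject = poisson_lrt(counts, rate0=3.0)
print(f"LRT statistic = {stat:.3f}, reject H0: {reject}")
```

For composite nulls with more than one constrained dimension, the same difference of log-likelihoods applies, but the constrained MLE has to be computed and the reference distribution has $r-q$ degrees of freedom.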


## Problems
1) Consider random variables $Y_1,\dots,Y_n$ representing directions round a circle, i.e. angles between $0$ and $2\pi$. The null hypothesis is that the random variables are independently uniformly distributed. Show that a suitable test statistic for detecting clustering around $0$ is $\sum\limits_{i=1}^n \cos Y_i$ and that it can be interpreted as the abscissa of a random walk in the plane with steps of unit length. Suggest test statistics when clustering may be around $0$ and $\pi$, and also when clustering may be around an unknown direction. For both cases, obtain a normal approximation to the distribution under the null hypothesis of a uniform distribution of angles around the circle.




# Sources:
* All of Statistics, Chapter 10 - Wasserman
* Linear Statistical Inference and Its Applications, Chapter 7 - Rao
* Theoretical Statistics, Chapter 12 - Keener
* Theoretical Statistics, Chapters 3-5 - Cox and Hinkley
* Statistics, Chapters 26-27 - Freedman, Pisani, Purves