# Test statistic and p-value assignment

Recall that any fixed classification rule that does not use a training set to tune its parameters can be converted into a statistical test. To do so, we must fix a null hypothesis as the source of negative samples and measure the false positive rate.
Usually, a classifier internally computes a decision value $t$, also known as a test score. Thus we can assign a false positive rate $p(t)$ to each threshold value $t$. This rate is known as the p-value.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import numpy.random as rnd

from pandas import Series
from pandas import DataFrame

from tqdm import tnrange
from plotnine import *

# Local imports
from common import *
from convenience import *

## I. Formal definition of a p-value

Let $t$ be a test statistic computed from a sample $\boldsymbol{x}$ as $f(\boldsymbol{x})$. Let $\mathcal{D}$ be the distribution of $\boldsymbol{x}$ determined by the null hypothesis. Then the p-value is the probability

\begin{align*}
\mathrm{pvalue}(t)=\Pr[\boldsymbol{x}_*\gets\mathcal{D}, t_*=f(\boldsymbol{x}_*): t_*\geq t]\enspace.
\end{align*}

Note that this definition corresponds to a classifier that classifies a sample $\boldsymbol{x}$ as positive if $f(\boldsymbol{x})\geq t$. Since the null hypothesis generates only negative samples, $\mathrm{pvalue}(t)$ is just the fraction of false positives.

Obviously, we can reverse the classification rule, and then the p-value is the probability

\begin{align*}
\mathrm{pvalue}(t)=\Pr[\boldsymbol{x}_*\gets\mathcal{D}, t_*=f(\boldsymbol{x}_*): t_*\leq t]\enspace.
\end{align*}

Both definitions are quite common. Intuitively, you should fix the direction based on which values you consider extreme.
If you consider larger values more extreme, use the first formula; if you consider smaller values more extreme, use the second. It is also possible to use a double threshold and consider a cut-off based on $|t|$. This leads to the third formula:

\begin{align*}
\mathrm{pvalue}(t)=\Pr[\boldsymbol{x}_*\gets\mathcal{D}, t_*=f(\boldsymbol{x}_*): |t_*|\geq |t|]\enspace.
\end{align*}
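
The three formulas can be approximated by simulation when the null distribution is easy to sample from. The following is a minimal sketch, not a definitive implementation: the helper `empirical_pvalue` and its `direction` argument are hypothetical names introduced here, and the setup (null hypothesis $\mathcal{N}(\mu=0,\sigma=1)$ with the identity as test statistic) is an assumption chosen only for illustration.

```python
import numpy as np

def empirical_pvalue(t, null_scores, direction="greater"):
    """Estimate a p-value as the fraction of null test scores at least as extreme as t."""
    null_scores = np.asarray(null_scores)
    if direction == "greater":      # first formula: t_* >= t
        return np.mean(null_scores >= t)
    if direction == "less":         # second formula: t_* <= t
        return np.mean(null_scores <= t)
    if direction == "two-sided":    # third formula: |t_*| >= |t|
        return np.mean(np.abs(null_scores) >= np.abs(t))
    raise ValueError(f"unknown direction: {direction}")

# Assumed setup for illustration: x ~ N(0, 1) under the null hypothesis
# and the test statistic is f(x) = x.
rng = np.random.default_rng(0)
null_scores = rng.normal(loc=0.0, scale=1.0, size=100_000)

t_observed = 1.5
p_greater = empirical_pvalue(t_observed, null_scores, "greater")
p_less = empirical_pvalue(t_observed, null_scores, "less")
p_two_sided = empirical_pvalue(t_observed, null_scores, "two-sided")
print(p_greater, p_less, p_two_sided)
```

The estimates are only as precise as the number of simulated null scores allows, so very small p-values need many more samples than moderate ones.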

## II. Statistical power of a test

Different classification algorithms are good for different tasks. The same is true for statistical tests.
Thresholding based on the p-value allows us to control the rate of false positives (the significance level).
At the same time, it also affects the rate of false negatives. The latter is harder to quantify as it depends on two factors:
* definition of a test statistic (*statistical test*)
* the data distribution of positive cases (*alternative hypothesis*)
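
Both effects of thresholding on the p-value can be observed in a small simulation. This is a sketch under assumed settings that do not come from the text: the null hypothesis is $\mathcal{N}(\mu=0,\sigma=1)$, the test statistic is the observation itself with larger values counted as more extreme, and the single alternative is arbitrarily chosen as $\mathcal{N}(\mu=2,\sigma=1)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05  # significance level

# Assumed setup: null hypothesis x ~ N(0, 1), test statistic t = x,
# larger values are more extreme, so pvalue(t) = Pr[t_* >= t] = 1 - cdf(t).
x_null = rng.normal(loc=0.0, scale=1.0, size=200_000)
pvalues_null = stats.norm.sf(x_null)  # sf(t) = 1 - cdf(t)

# Classify a sample as positive when its p-value is at most alpha.
false_positive_rate = np.mean(pvalues_null <= alpha)

# The same rule applied to samples from one assumed alternative, N(mu=2, sigma=1).
x_alt = rng.normal(loc=2.0, scale=1.0, size=200_000)
pvalues_alt = stats.norm.sf(x_alt)
false_negative_rate = np.mean(pvalues_alt > alpha)

print(false_positive_rate, false_negative_rate)
```

The false positive rate stays close to `alpha` regardless of the alternative, whereas the false negative rate changes as soon as the mean of the alternative distribution is changed.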

Of course, a test cannot work equally well for all alternative hypotheses.
Hence, a statistical test is commonly designed to work well for a class of reasonable alternative hypotheses.
For instance, let the null hypothesis be that the iid (independent and identically distributed) data sample is from a normal distribution $\mathcal{N}(\mu=0,\sigma=1)$. Then it makes sense to consider a class of alternative hypotheses where the iid data sample is from a normal distribution $\mathcal{N}(\mu\neq 0,\sigma=1)$.
This situation naturally arises in quality control, where we must check that some physical quantity is zero and the measurement procedure is corrupted by additive Gaussian noise $\mathcal{N}(\mu=0,\sigma=1)$.

Now for any alternative hypothesis specified by a distribution $\mathcal{D}$, we can compute the recall, that is, the probability that a positive sample is classified as positive:

\begin{align*}
\Pr[\boldsymbol{x}\gets \mathcal{D}: \text{the test rejects the null hypothesis}]\enspace.
\end{align*}

This is known as the **power** of the statistical test. A good test has high power for all alternative hypotheses.
For our example, this cannot be achieved, since $\mathcal{N}(\mu\neq 0,\sigma=1)$ can be arbitrarily close to
$\mathcal{N}(\mu=0,\sigma=1)$. In general, the best we can achieve is that for each alternative hypothesis, our test performs roughly as well as the best test designed for that particular pair of null and alternative hypotheses.
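
To see how the power collapses as the alternative approaches the null hypothesis, consider the following sketch. It assumes a specific test that the text does not prescribe: a two-sided z-test on the mean of $n$ iid observations, which rejects when $\sqrt{n}\,|\bar{x}|$ exceeds the normal quantile for the chosen significance level. The function name `power_two_sided_z` and the choice $n=25$ are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def power_two_sided_z(mu, n, alpha=0.05):
    """Power of the two-sided z-test on the mean of n iid N(mu, 1) observations."""
    z = stats.norm.ppf(1 - alpha / 2)  # rejection threshold for |sqrt(n) * mean|
    shift = np.sqrt(n) * mu            # mean of sqrt(n) * mean under the alternative
    # Under the alternative, sqrt(n) * mean ~ N(shift, 1); the test rejects
    # when this value is >= z or <= -z.
    return stats.norm.sf(z - shift) + stats.norm.cdf(-z - shift)

n = 25  # assumed sample size
mus = [0.0, 0.01, 0.1, 0.5, 1.0]
powers = [power_two_sided_z(mu, n) for mu in mus]
for mu, power in zip(mus, powers):
    print(f"mu = {mu:4.2f}  power = {power:.3f}")
```

At $\mu=0$ the power coincides with the significance level, and for small $|\mu|$ it stays barely above it, which is why no test can have high power against every alternative in this class.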

# Homework

## 2.1 Properties of p-values (<font color='red'>1p</font>)

Let the test statistic $t$ be $t=\sin(x)$. Find the p-value function $\mathrm{pvalue}(t)$ for the null hypothesis where $x$ is sampled uniformly from the range $[-\pi, \pi]$. Find out what the distribution of $\mathrm{pvalue}(t)$ is for $t=\sin(x)$. Explain why you get this result.
* You can use simulations or simple probability computations to determine when $\sin(x_*)\geq t$.
* You can use simulations or simple probability computations to determine the distribution of $\mathrm{pvalue}(t)$.

In [None]:
?rnd.uniform

In [None]:
ggplot(??)+ geom_jitter(aes(y='??', x=0), height=0, width=0.1, alpha=0.3) + xlim(-1,1)

In [None]:
ggplot(??) + geom_histogram(aes(x='??'), bins=20)

## 2.2 Power of a statistical test (<font color='red'>1p</font>)

Consider two statistical tests defined by the test statistics $t_1=\frac{1}{|x|}$ and $t_2=|x|$. 
Find the p-value functions $\mathrm{pvalue}_i(t)$ for the null hypothesis where $x$ is sampled from $\mathcal{N}(\mu=0, \sigma=1)$.
As an alternative hypothesis, consider the case that $x$ is sampled from $\mathcal{N}(\mu\neq 0, \sigma=1)$.
For simplicity, you can fix $\mu=1$.
Compute the power of both tests at the significance level $5\%$.
You can do simulations, but there exists a simple closed-form solution for this.

**Clarification:** Use the one-sided formula $\mathrm{pvalue}(t)=\Pr[\boldsymbol{x}_*\gets\mathcal{D}, t_*=f(\boldsymbol{x}_*): t_*\geq t]$
for computing p-values.



In [None]:
?rnd.normal

In [None]:
?stats.norm.ppf

In [None]:
?stats.norm.cdf