# Permutation tests based on ranks, and calibrating parametric tests using permutations

This notebook explores some common nonparametric tests based on ranks, and shows that (depending
on the null hypothesis) parametric tests can be calibrated nonparametrically to have the correct significance level when the assumptions of the parametric tests fail.

## Notation

Given a set of real numbers $\{x_j\}_{j=1}^N$, $x_{(i)}$ denotes the $i$th smallest element of the set. 
That is, $x_{(1)} \le x_{(2)} \le \ldots \le x_{(N)}$.

The _(mid-)rank_ $r_j$ of the $j$th observation, $x_j$, is $\#\{ k: x_k < x_j \} + (\#\{ k: x_k = x_j \}+1)/2$.
This assigns tied observations the average of the ranks they would have had if they had been distinct.

For example, for the set $\{1, 4, 2, 6\}$ the corresponding (mid-)ranks are $1, 3, 2, 4$; for the set
$\{1, 4, 2, 6, 2, 2\}$, the mid-ranks are  $1, 5, 3, 6, 3, 3$;
and for the set $\{1, 4, 2, 6, 2, 2, 2\}$, the mid-ranks are  $1, 6, 3.5, 7, 3.5, 3.5, 3.5$.

The sum of the (mid-)ranks of $N$ observations is $N(N+1)/2$.
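
The definition of mid-ranks can be transcribed almost literally into code. The following is a minimal sketch using only the Python standard library; the function name `midranks` is an illustrative choice, not something defined elsewhere in these notes. It reproduces the three examples above and checks that the mid-ranks sum to $N(N+1)/2$.

```python
def midranks(x):
    """Mid-ranks, straight from the definition (quadratic time, fine for small N)."""
    return [
        sum(xk < xj for xk in x) + (sum(xk == xj for xk in x) + 1) / 2
        for xj in x
    ]

for data in ([1, 4, 2, 6], [1, 4, 2, 6, 2, 2], [1, 4, 2, 6, 2, 2, 2]):
    r = midranks(data)
    N = len(data)
    # The mid-ranks always sum to N(N+1)/2, with or without ties.
    print(data, "->", r, "| sum:", sum(r), "=", N * (N + 1) / 2)
```

For larger data sets, `scipy.stats.rankdata` with its default `method="average"` should give the same mid-ranks much faster.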

Many of the methods described in this notebook can be performed using the original data or using
the ranks of the data.

When the data are random variables $\{X_j \}_{j=1}^N$, the (random) ranks will be written using uppercase letters, $\{R_j\}_{j=1}^N$.

As in the notebook on [causal inference](./causal-inference.ipynb), there are $T$ possible treatments (or $T$ groups being compared), numbered 0 to $T-1$, 
and $W_j$ is the treatment that subject $j$ receives (or the group that subject $j$ belongs to).
Subject $j$'s response if assigned to treatment $t$ is $x_{jt}$, $j=1, \ldots, N$, $t=0, \ldots, T-1$.
The number of subjects assigned to treatment $t$ is $\sum_{j=1}^N 1(W_j = t)$.
(When there are only two treatments, $t=0$ and $t=1$, $n :=\sum_j W_j$ is the number of subjects 
assigned to treatment 1.)
The effect of treatment $t$ on subject $j$ (compared to control) is 
\begin{equation}
\tau_{jt} := x_{jt} - x_{j0}, \;\;t=1, \ldots, T-1.
\end{equation}
The average effect of treatment $t$ (compared to control) is
\begin{equation}
\bar{\tau}_t := \frac{1}{N} \sum_{j=1}^N (x_{jt} - x_{j0}), \;\; t=1, \ldots, T-1.
\end{equation}



### Why use ranks instead of the original data?

+ Ranks may make sense for ordinal data, even if the original data are not quantitative (in the sense that arithmetic with the numbers doesn't make sense). This occurs, for example, with Likert scales.
+ If the measurements are "contaminated" by outliers, using ranks can decrease the sensitivity to the outliers, potentially increasing power.
+ If the original data have no ties, the null distribution of the rank-based statistic is "universal": it depends only on the sample sizes, so it can be tabulated once and for all. (This is less important than it used to be because of increases in computing power.)
+ Often, the normal approximation to the null distribution of the rank-based statistic is more accurate than the normal approximation to the null distribution of the test statistic for raw data. (This is less important than it used to be because of increases in computing power.)

### Why use the original data instead of ranks?

+ The original units are more meaningful for things like treatment effect, estimating shifts, etc.
+ Tests using the original data may have more power.


## The two-sample problem

We now specialize to the case of two groups or treatments. 
There are $N$ items in all, $n$ in the "treatment" group and $m := N-n$ in the "control" group.
Let $\{X_j\}_{j=1}^m$ denote the data for the control group and $\{Y_j\}_{j=1}^n$ denote the
data for the treatment group.

In the "randomization" version of the two-sample problem, the two groups are provided by nature, and the 
question is whether the groups are "statistically distinguishable," in the sense that
they don't look like a single group that was partitioned randomly. 
In the context of causal inference,
the groups started as a single group of subjects, randomly partitioned into two groups
that receive different treatments, and the question is whether their responses are statistically
distinguishable.

There is also a "population" version of the two-sample problem, which asks whether
the two groups look like IID samples from the same parent population, i.e., whether there is some
$F$ for which $\{X_1, \ldots, X_m, Y_1, \ldots, Y_n \}$ are IID $F$.
If two groups are IID samples from the same distribution, then every subset of $n$ of the 
$N$ values is equally likely to be the group of size $n$ and every subset of $m$
of the $N$ values is equally likely to be the group of size $m$, so tests that work for
the randomization version of the 2-sample problem also work for the population version
of the 2-sample problem.

### Testing for a difference in location

#### The Wilcoxon rank-sum test.

Define the sum of the ranks of the responses in the treatment group, group 1:
\begin{equation}
W_Y := \sum_j W_j R_j.
\end{equation}
Under the null, every subset of $n$ of the responses is equally likely to be the
treatment group.
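The expected value of $W_Y$ under the null is $n(N+1)/2$: each rank has probability $n/N$ of landing in the treatment group, and the ranks sum to $N(N+1)/2$.
Moreover, if there are no ties, the distribution of $W_Y$ under the null is symmetric about that value.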
If treatment tends to increase responses, $W_Y$ will tend to be larger than its median under the null; 
if treatment tends 
to decrease responses, $W_Y$ will tend to be smaller than its median under the null.
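
For small samples, the null distribution of $W_Y$ can be enumerated exactly by listing every subset of $n$ of the $N$ pooled ranks. The sketch below does that with the standard library; the data and the function names (`midranks`, `rank_sum_exact`) are hypothetical, and the $P$-value is one-sided (against the alternative that treatment increases responses).

```python
from itertools import combinations

def midranks(x):
    return [sum(a < b for a in x) + (sum(a == b for a in x) + 1) / 2 for b in x]

def rank_sum_exact(control, treatment):
    """Exact upper-tail permutation P-value for the rank sum W_Y."""
    pooled = list(control) + list(treatment)
    ranks = midranks(pooled)
    N, n = len(pooled), len(treatment)
    observed = sum(ranks[len(control):])   # ranks of the treatment responses
    # Under the null, every subset of n of the N ranks is equally likely
    # to be the set of treatment ranks.
    null = [sum(ranks[j] for j in subset)
            for subset in combinations(range(N), n)]
    p_upper = sum(w >= observed for w in null) / len(null)
    null_mean = sum(null) / len(null)
    return observed, p_upper, null_mean

control = [1.2, 0.4, 2.1, 1.7]      # hypothetical data
treatment = [2.5, 3.1, 1.9, 2.8]    # hypothetical data
w_y, p, null_mean = rank_sum_exact(control, treatment)
print(w_y, p, null_mean)
```

Here $N = 8$ and $n = 4$, so there are $\binom{8}{4} = 70$ equally likely subsets; the mean of the enumerated null distribution is $n(N+1)/2 = 18$. Full enumeration becomes infeasible quickly as $N$ grows, in which case random permutations can be sampled instead.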

**Equivalent tests.**
Multiplying the test statistic by a (positive) constant produces an equivalent test, meaning
that it rejects the null for exactly the same data.
Hence using $\bar{W}_Y := W_Y/n$ produces an equivalent test.

Since $\sum_j R_j = N(N+1)/2$, the sum of the control ranks is $W_X := \sum_j (1-W_j) R_j = N(N+1)/2 - W_Y$, so as $W_Y$ increases, $W_X$ (and the mean of the control ranks) decreases. The Wilcoxon rank-sum test is therefore equivalent to a test based on the difference between the mean ranks for treatment and control, $W_Y/n - W_X/m$.

The _Mann-Whitney test_ is also equivalent to the Wilcoxon rank-sum test; it uses
the test statistic 
\begin{equation}
\sum_{j \in \{1, \ldots, m\}, k \in \{1, \ldots, n\}} 1_{X_j < Y_k}.
\end{equation}
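
When there are no ties, the rank of $Y_k$ in the pooled sample is the number of control values below $Y_k$ plus the rank of $Y_k$ among the treatment values, so the Mann-Whitney statistic equals $W_Y - n(n+1)/2$. The following sketch checks that identity numerically on hypothetical data; the function names are illustrative only.

```python
def rank_sum(control, treatment):
    """W_Y: sum of the pooled-sample ranks of the treatment values (assumes no ties)."""
    pooled = sorted(list(control) + list(treatment))
    return sum(pooled.index(y) + 1 for y in treatment)

def mann_whitney(control, treatment):
    """Number of (control, treatment) pairs with X_j < Y_k."""
    return sum(x < y for x in control for y in treatment)

control = [1.2, 0.4, 2.1, 1.7]      # hypothetical data, no ties
treatment = [2.5, 3.1, 1.9, 2.8]    # hypothetical data, no ties
n = len(treatment)
w_y = rank_sum(control, treatment)
u = mann_whitney(control, treatment)
print(w_y, u, w_y - n * (n + 1) // 2)
```

Because the two statistics differ by a constant that depends only on $n$, they order the possible assignments identically and so give the same permutation $P$-value.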

#### Permutation $t$-test.

Use the $t$ statistic as the test statistic, but calibrate it using the permutation distribution
rather than Student's t distribution.
See [introduction to permutation tests](./permute-intro.ipynb).
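
As a rough illustration, the sketch below estimates a one-sided permutation $P$-value for the pooled-variance $t$ statistic by random reshuffling. It assumes hypothetical data, an arbitrary seed, and illustrative function names; the `(hits + 1) / (reps + 1)` estimate counts the observed assignment as one of the permutations, which keeps the estimated $P$-value from being zero.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_stat(x, y):
    """Two-sample pooled-variance t statistic, treatment (y) minus control (x)."""
    m, n = len(x), len(y)
    pooled_var = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var * (1 / m + 1 / n))

def permutation_t_test(control, treatment, reps=10_000, seed=12345):
    """Estimate the upper-tail permutation P-value of the t statistic."""
    rng = random.Random(seed)
    pooled = list(control) + list(treatment)
    m = len(control)
    observed = t_stat(control, treatment)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)                       # random re-partition into two groups
        if t_stat(pooled[:m], pooled[m:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (reps + 1)

control = [1.2, 0.4, 2.1, 1.7]      # hypothetical data
treatment = [2.5, 3.1, 1.9, 2.8]    # hypothetical data
t_obs, p_hat = permutation_t_test(control, treatment)
print(t_obs, p_hat)
```

The reference distribution here comes entirely from the reshuffling, so the test does not rely on normality or equal variances to have the right level under the randomization null.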


### Testing for a difference in dispersion: the Siegel-Tukey test

Assign "ranks" differently: the smallest observation gets rank 1, the largest gets rank 2, the second-smallest gets rank 3,
and so on, alternating between the two ends of the ordered data. Then apply the Wilcoxon rank-sum test to these "ranks." The null distribution is the same as for the rank-sum test, but this
test has power against the alternative that treatment affects dispersion (and not location).
By contrast, the Wilcoxon rank-sum test has power against the alternative that treatment affects location, not dispersion.
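
The alternating assignment described above can be sketched as follows, assuming no ties. The function name `alternating_ranks` and the data are hypothetical. (Some presentations of the Siegel-Tukey test alternate ends in pairs after the first rank; this sketch follows the simpler one-at-a-time scheme described here.)

```python
def alternating_ranks(x):
    """1 to the smallest, 2 to the largest, 3 to the second-smallest, ... (no ties)."""
    order = sorted(range(len(x)), key=lambda j: x[j])  # indices, smallest to largest
    ranks = [0] * len(x)
    lo, hi = 0, len(x) - 1
    for rank in range(1, len(x) + 1):
        if rank % 2 == 1:            # odd "ranks" go to the low end
            ranks[order[lo]] = rank
            lo += 1
        else:                        # even "ranks" go to the high end
            ranks[order[hi]] = rank
            hi -= 1
    return ranks

print(alternating_ranks([5, 1, 9, 3, 7]))

control = [4.8, 5.1, 5.0, 4.9]      # hypothetical: tightly clustered
treatment = [3.0, 7.2, 2.5, 6.8]    # hypothetical: same center, more spread
ranks = alternating_ranks(control + treatment)
dispersion_stat = sum(ranks[len(control):])   # rank sum for the treatment group
print(dispersion_stat)
```

Extreme observations receive small "ranks," so a more dispersed treatment group produces an unusually small rank sum; in this example the treatment group holds the four smallest "ranks."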


### Testing for a difference in distribution: the Smirnov test

A common measure of the "distance" between two probability distributions $F$ and $G$ is the _Smirnov_ or _Kolmogorov-Smirnov_
distance,
\begin{equation}
\| F-G \|_\infty := \sup_{x \in \mathbb{R}} |F(x)-G(x)|.
\end{equation}
A two-sample test sensitive to general differences (rather than changes in location and/or spread)
can be based on that distance.

Let $\hat{F}$ be the empirical CDF of the responses to treatment and $\hat{G}$ be the empirical
CDF of the responses to control:
\begin{equation}
\hat{F}(x) := \frac{1}{n} \sum_{j=1}^n 1_{x \ge Y_j}
\end{equation}
\begin{equation}
\hat{G}(x) := \frac{1}{m} \sum_{j=1}^m 1_{x \ge X_j}.
\end{equation}
The _Smirnov_ test uses $\|\hat{F} - \hat{G}\|_\infty$ as the test statistic.

Note that the value of $\|\hat{F} - \hat{G}\|_\infty$ depends only on the relative ordering
of the values $\{X_j\}$ and $\{Y_j\}$, not on their numerical values.
Thus, replacing the original data by their ranks does not change the test statistic
(or its distribution under the null hypothesis): the Smirnov test is a rank-based test.
Moreover, if there are no ties, the null distribution of the test statistic depends only on $n$ and $m$.
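
A small sketch of the Smirnov statistic, using hypothetical data and illustrative function names. Because both empirical CDFs are right-continuous step functions that jump only at observed values, the supremum is attained at one of the pooled observations, so it suffices to evaluate the difference there. The example also checks that replacing the data by their pooled ranks leaves the statistic unchanged.

```python
def ecdf(sample):
    """Empirical CDF: fraction of the sample that is <= x."""
    size = len(sample)
    return lambda x: sum(s <= x for s in sample) / size

def smirnov_statistic(control, treatment):
    F_hat, G_hat = ecdf(treatment), ecdf(control)
    pooled = list(control) + list(treatment)
    # The sup over all real x is attained at an observed value.
    return max(abs(F_hat(x) - G_hat(x)) for x in pooled)

control = [1.2, 0.4, 2.1, 1.7]      # hypothetical data, no ties
treatment = [2.5, 3.1, 1.9, 2.8]    # hypothetical data, no ties
d_raw = smirnov_statistic(control, treatment)

# Replace each value by its rank in the pooled sample (valid when no ties).
pooled_sorted = sorted(control + treatment)
rank_of = {v: i + 1 for i, v in enumerate(pooled_sorted)}
d_ranks = smirnov_statistic([rank_of[v] for v in control],
                            [rank_of[v] for v in treatment])
print(d_raw, d_ranks)
```

A permutation $P$-value could then be obtained in the same way as for the other statistics, by recomputing the statistic over re-partitions of the pooled data.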


### Paired data

Suppose we observe pairs of data, $\{(X_j, Y_j)\}_{j=1}^N$. 

#### Sign test

#### Wilcoxon signed rank test

The _signed (mid)-rank_ of an observation $x_k \in \{x_j\}_{j=1}^N$ is $\mathrm{sgn}(x_k) r_k$, where
$r_k$ is the rank of $|x_k|$ in the set $\{|x_j|\}_{j=1}^N$, and 
\begin{equation}
\mathrm{sgn}(x) := \left \{ 
    \begin{array}{lr}  
    -1, & x < 0 \cr
    0, & x=0 \cr
    1, & x > 0. 
    \end{array}
    \right .
\end{equation}
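
The definition of signed mid-ranks can be sketched directly in code. The function names and the example values below are hypothetical; ties in absolute value receive mid-ranks, and a zero receives signed rank 0 because $\mathrm{sgn}(0) = 0$.

```python
def sgn(x):
    """-1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def signed_midranks(x):
    """sgn(x_k) times the mid-rank of |x_k| among the absolute values."""
    abs_x = [abs(v) for v in x]
    ranks = [
        sum(a < b for a in abs_x) + (sum(a == b for a in abs_x) + 1) / 2
        for b in abs_x
    ]
    return [sgn(v) * r for v, r in zip(x, ranks)]

values = [-3, 1, 0, 2, -1]           # hypothetical values
signed = signed_midranks(values)
print(signed)
```

In this example the absolute values $1$ and $|-1|$ are tied, so each gets mid-rank 2.5, with opposite signs.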



## The $k$-sample problem

(This is conventionally called the $k$-sample problem, but here $T$ denotes the number of groups.)
The size of group $t$ is $n_t := \sum_{j=1}^N 1(W_j = t)$, $t=0, \ldots, T-1$.

### Testing for differences in location

#### The Kruskal-Wallis test

\begin{equation}
    K := \frac{12}{N(N+1)} \sum_{t=0}^{T-1} n_t \left ( \bar{R}_t - (N+1)/2 \right )^2,
\end{equation}
where $\bar{R}_t$ is the mean of the ranks of the observations in group $t$.
(The mean of all $N$ (mid-)ranks is $(N+1)/2$.)

The asymptotic null distribution (as all the group sizes approach infinity)
is chi-square with $T-1$ degrees of freedom.
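
The sketch below computes $K$ for three small hypothetical groups and estimates a permutation $P$-value by reshuffling the pooled ranks among the groups. Function names, data, and the seed are illustrative assumptions. For comparison it also evaluates the chi-square approximation, using the fact that the chi-square survival function with 2 degrees of freedom is $e^{-x/2}$; that shortcut applies only when $T = 3$.

```python
import random
from math import exp

def midranks(x):
    return [sum(a < b for a in x) + (sum(a == b for a in x) + 1) / 2 for b in x]

def k_stat(ranks, sizes):
    """Kruskal-Wallis K; ranks are listed group by group, sizes are the n_t."""
    N = len(ranks)
    total, start = 0.0, 0
    for n_t in sizes:
        mean_rank = sum(ranks[start:start + n_t]) / n_t
        total += n_t * (mean_rank - (N + 1) / 2) ** 2
        start += n_t
    return 12 / (N * (N + 1)) * total

def kruskal_wallis_permutation(groups, reps=5000, seed=2024):
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    ranks = midranks([v for g in groups for v in g])
    observed = k_stat(ranks, sizes)
    perm, hits = list(ranks), 0
    for _ in range(reps):
        rng.shuffle(perm)            # random reassignment of ranks to groups
        # Small tolerance guards against floating-point ties with the observed value.
        if k_stat(perm, sizes) >= observed - 1e-9:
            hits += 1
    return observed, (hits + 1) / (reps + 1)

groups = [[1.2, 0.4, 2.1], [2.5, 3.1, 1.9], [4.0, 3.6, 5.2]]   # hypothetical data
K, p_perm = kruskal_wallis_permutation(groups)
p_chi2 = exp(-K / 2)    # chi-square tail with T - 1 = 2 degrees of freedom only
print(K, p_perm, p_chi2)
```

With groups this small, the permutation $P$-value and the chi-square approximation can differ noticeably, which is one reason to calibrate by permutation when the group sizes are far from the asymptotic regime.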

#### The permutation $F$ test

## One-sample tests

One-sample tests are typically used to test whether the data are a sample from a symmetric distribution or
from a distribution with a specified median.



## When there is no random assignment

### Imaginary randomization

+ Gender, ethnicity, SES, etc. 

Null: the characteristic/label is assigned "as if" at random. 