# Combinations of tests and stratified tests

## Intersection-Union Hypotheses

In many situations, a null hypothesis of interest is the intersection of simpler hypotheses. For instance, the hypothesis that a university does not discriminate in its graduate admissions might be represented as 

(does not discriminate in arts and humanities) $\cap$ (does not discriminate in sciences) $\cap$ (does not discriminate in engineering) $\cap$ (does not discriminate in professional schools).

In this example, the alternative hypothesis is a _union_, viz.,

(discriminates in arts and humanities) $\cup$ (discriminates in sciences) $\cup$ (discriminates in engineering) $\cup$ (discriminates in professional schools).

Framing a test this way leads to an _intersection-union test_.
The null hypothesis is

$$
   H_0 \equiv \cap_{j=1}^n H_{0j}
$$

and the alternative is

$$
   H_1 \equiv \cup_{j=1}^n H_{0j}^c.
$$

There can be good reasons for representing a null hypothesis as such an intersection. 
In the example just mentioned, the applicant pool might be quite different across disciplines, making it hard to judge at the aggregate level whether there is discrimination, while testing within each discipline is more straightforward.

Hypotheses about multivariate distributions can sometimes be expressed as the intersection of hypotheses about each dimension separately. For instance, the hypothesis that an $n$-dimensional distribution has zero mean could be represented as 

(1st component has zero mean) $\cap$ (2nd component has zero mean) $\cap$ $\cdots$ $\cap$ ($n$th component has zero mean).

The alternative is again a union:

(1st component has nonzero mean) $\cup$ (2nd component has nonzero mean) $\cup$ $\cdots$ $\cup$ ($n$th component has nonzero mean).

## Combinations of experiments and stratified experiments

The same kind of issue arises when combining information from different experiments.
For instance, imagine testing whether a drug is effective. We might have several randomized, controlled trials in different places, or a large experiment involving a number of centers, each of which performs its own randomization (i.e., the randomization is stratified).

How can we combine the information from the separate (independent) experiments to test the null hypothesis that the drug is ineffective? 

Again, the overall null hypothesis is "the drug doesn't help," which can be written as an intersection of hypotheses

(drug doesn't help in experiment 1) $\cap$ (drug doesn't help in experiment 2) $\cap$ $\cdots$  $\cap$ (drug doesn't help in experiment $n$),

and the alternative can be written as

(drug helps in experiment 1) $\cup$ (drug helps in experiment 2) $\cup$ $\cdots$  $\cup$ (drug helps in experiment $n$),

a union.

## Combining evidence

Suppose we have a test of each "partial" null hypothesis $H_{0j}$. Clearly, if the $P$-value for one of those tests is sufficiently small, that's evidence that the overall null $H_0$ is false.

Suppose instead that none of the individual $P$-values is small, but many are "not large." 
Is there a way to combine them to get stronger evidence about $H_0$?

## Combining functions

Let $\lambda$ be an $n$-vector of statistics such that the distribution of $\lambda_j$
if hypothesis $H_{0j}$ is true is known. 
We assume that smaller values of $\lambda_j$ are stronger evidence against $H_{0j}$.
For instance, $\lambda_j$ might be the $P$-value of $H_{0j}$ for some test.

Consider a function

$$ \phi: [0, 1)^n \rightarrow \Re; \lambda = (\lambda_1, \ldots, \lambda_n) \mapsto \phi(\lambda)
$$ 
with the properties:

+ $\phi$ is non-increasing in every argument, i.e., $\phi( \ldots, \lambda_k, \ldots) \ge \phi( \ldots, \lambda_k', \ldots)$ if $\lambda_k \le \lambda_k'$, $k = 1, \ldots, n$.

+ $\phi$ attains its maximum if any of its arguments equals 0.

+ $\phi$ attains its minimum if all of its arguments equal 1.

+ for all $\alpha > 0$, there exist finite functions $\phi_-(\alpha)$, $\phi_+(\alpha)$ such that if every partial null hypothesis $\{H_{0j}\}$ is true, 
$$\Pr \{\phi_-(\alpha) \le \phi(\lambda) \le \phi_+(\alpha) \} \ge 1-\alpha$$
and $[\phi_-(\alpha), \phi_+(\alpha)] \subset [\phi_-(\alpha'), \phi_+(\alpha')]$ if $\alpha \ge \alpha'$.

Then we can use $\phi(\lambda)$ as the basis of a test of $H_0 = \cap_{j=1}^n H_{0j}$.

### Fisher's combining function

$$ \phi_F(\lambda) \equiv -2 \sum_{j=1}^n \ln(\lambda_j).$$

### Liptak's combining function

$$ \phi_L(\lambda) \equiv \sum_{j=1}^n \Phi^{-1}(1-\lambda_j),$$

where $\Phi^{-1}$ is the inverse standard normal CDF.

### Tippett's combining function

$$ \phi_T(\lambda) \equiv \max_{j=1}^n (1-\lambda_j).$$

### Direct combination of test statistics

$$ \phi_D \equiv \sum_{j=1}^n f_j(\lambda_j), $$

where $\{ f_j \}$ are suitable decreasing functions. For instance, if $\lambda_j$ is the $P$-value for $H_{0j}$ corresponding to some test statistic $T_j$ for which larger values are stronger evidence against $H_{0j}$, we could use $\phi_D = \sum_j T_j$.
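As a rough illustration, the first three combining functions can be written in a few lines of Python using only the standard library. The function names and the example $P$-values below are made up for this sketch; they do not come from any particular package.

```python
import math
from statistics import NormalDist


def fisher(p_values):
    """Fisher's combining function: -2 times the sum of the log P-values."""
    return -2 * sum(math.log(p) for p in p_values)


def liptak(p_values):
    """Liptak's combining function: sum of standard normal quantiles of 1 - P."""
    std_normal = NormalDist()
    # inv_cdf requires its argument to be strictly between 0 and 1
    return sum(std_normal.inv_cdf(1 - p) for p in p_values)


def tippett(p_values):
    """Tippett's combining function: 1 minus the smallest P-value."""
    return max(1 - p for p in p_values)


# Hypothetical partial P-values: none is below 0.05, but all are "not large"
p_values = [0.08, 0.11, 0.20, 0.09]
for phi in (fisher, liptak, tippett):
    print(phi.__name__, phi(p_values))
```

Each function is non-increasing in every argument, so larger values of the combined statistic are stronger evidence against $H_0$. Calibrating the statistic, i.e., turning it into a $P$-value for $H_0$, is a separate step.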

## Fisher's combining function for independent $P$-values

Suppose $H_0$ is true, that $\lambda_j$ is the $P$-value of $H_{0j}$ for some pre-specified test, that the distribution of $\lambda_j$ is continuous under $H_{0j}$, and that $\{ \lambda_j \}$ are independent if $H_0$ is true.

Then, if $H_0$ is true, $\{ \lambda_j \}$ are IID $U[0,1]$.

Under $H_{0j}$, the distribution of $-\ln \lambda_j$ is exponential(1):

$$
   \Pr \{ -\ln \lambda_j \le x \} = \Pr \{ \ln \lambda_j \ge -x \} = \Pr \{ \lambda_j \ge e^{-x} \} = 1 - e^{-x}.
$$

The distribution of 2 times an exponential is $\chi_2^2$:
the pdf of a chi-square with $k$ degrees of freedom is

$$
   \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2-1} e^{-x/2}.
$$

For $k=2$, this simplifies to $e^{-x/2}/2$, the exponential density scaled by a factor of 2.
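The same conclusion follows directly from the distribution function. For $x \ge 0$, if $\lambda_j \sim U[0,1]$,

$$
   \Pr \{ -2 \ln \lambda_j \le x \} = \Pr \{ \lambda_j \ge e^{-x/2} \} = 1 - e^{-x/2},
   \qquad
   \frac{d}{dx} \left ( 1 - e^{-x/2} \right ) = \frac{1}{2} e^{-x/2},
$$

which is the $\chi_2^2$ density.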

Thus, under $H_0$, $\phi_F(\lambda)$ is the sum of $n$ independent $\chi_2^2$ random variables. A sum of independent chi-square random variables is itself chi-square, with degrees of freedom equal to the sum of the degrees of freedom of the summands.

Hence, under $H_0$,

$$
  \phi_F(\lambda) \sim \chi_{2n}^2,
$$

the chi-square distribution with $2n$ degrees of freedom.

Let $\chi_{k}^2(\alpha)$ denote the $1-\alpha$ quantile of the chi-square distribution
with $k$ degrees of freedom.
If we reject $H_0$ when

$$
   \phi_F(\lambda) \ge \chi_{2n}^2(\alpha),
$$

that yields a significance level $\alpha$ test of $H_0$.
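A simulation gives a quick sanity check of this result. The sketch below draws $-2 \sum_j \ln U_j$ for IID uniform $U_j$ and compares simulated tail probabilities with the chi-square tail. To stay within the standard library it uses the closed form $\Pr \{ \chi_{2n}^2 \ge x \} = e^{-x/2} \sum_{k=0}^{n-1} (x/2)^k / k!$, which holds for even degrees of freedom; `scipy.stats.chi2.sf` would be the usual route when SciPy is available. The function names, seed, and number of replications are arbitrary choices for illustration.

```python
import math
import random


def chi2_sf_even(x, n):
    """P(X >= x) for X chi-square with 2n degrees of freedom, n a positive integer."""
    if x <= 0:
        return 1.0
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def fisher_p_value(p_values):
    """P-value of Fisher's combined test for independent, continuous P-values."""
    stat = -2 * sum(math.log(p) for p in p_values)
    return chi2_sf_even(stat, len(p_values))


def simulate_fisher(n, reps, seed=12345):
    """Draw reps values of -2 * sum(log U_j) for n IID uniform U_j."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined
    return [-2 * sum(math.log(1 - rng.random()) for _ in range(n))
            for _ in range(reps)]


n, reps = 5, 20_000
draws = simulate_fisher(n, reps)
for x in (5.0, 10.0, 18.307):
    simulated = sum(d >= x for d in draws) / reps
    print(f"x = {x}: simulated {simulated:.4f}, chi-square {chi2_sf_even(x, n):.4f}")
```

For example, `fisher_p_value([0.1] * 5)` is about 0.011: five independent $P$-values of 0.1, none of them small on its own, combine into fairly strong evidence against $H_0$.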


## When $P$-values have atoms

A real random variable $X$ is first-order stochastically larger than a real random variable $Y$ if for all $x \in \Re$,

$$
   \Pr \{ X \ge x \} \ge \Pr \{ Y \ge x \},
$$
with strict inequality for some $x \in \Re$.

Suppose $\{\lambda_j \}$ for $\{ H_{0j}\}$ satisfy

$$
   \Pr \{ \lambda_j \le p \mid H_{0j} \} \le p \quad \mbox{for all } p \in [0, 1].
$$

This takes into account the possibility that $\lambda_j$ does not have a continuous 
distribution under $H_{0j}$, ensuring that $\lambda_j$ is still a _conservative_ $P$-value.

Let $U \sim U[0,1]$. Since $\ln$ is monotone, it follows that for all $x \in \Re$

$$
   \Pr \{ -2 \ln \lambda_j \ge x \} \le \Pr \{ -2 \ln U \ge x \}.
$$

That is, if $\lambda_j$ does not have a continuous distribution, 
a $\chi_2^2$ random variable is stochastically larger than $-2\ln \lambda_j$.

It turns out that $X$ is stochastically larger than $Y$ if and only if
there is some probability space on which there exist 
two random variables, $\tilde{X}$ and $\tilde{Y}$ such that $\tilde{X} \sim X$,
$\tilde{Y} \sim Y$, and $\Pr \{\tilde{X} \ge \tilde{Y} \} = 1$. 
(See, e.g., Grimmett and Stirzaker, _Probability and Random Processes_, 3rd edition,
Theorem 4.12.3.)
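One concrete way to build such a pair is to drive both variables with the same uniform draw. The sketch below assumes, purely for illustration, a discrete $P$-value that is uniform on $\{1/m, 2/m, \ldots, 1\}$; rounding a uniform $U$ up to the next multiple of $1/m$ gives a variable with that distribution that is never smaller than $U$. The function name and the choice $m = 10$ are made up.

```python
import math
import random


def coupled_pair(u, m):
    """Map one uniform draw u in (0, 1] to a coupled pair (X, Y) with X >= Y."""
    lam = math.ceil(m * u) / m    # discrete P-value: uniform on {1/m, ..., 1}, never below u
    x = -2 * math.log(u)          # chi-square with 2 degrees of freedom
    y = -2 * math.log(lam)        # distributed as -2 ln(lambda_j) for the discrete P-value
    return x, y


rng = random.Random(2024)
m = 10
# 1 - rng.random() lies in (0, 1], so log(u) is always defined
pairs = [coupled_pair(1 - rng.random(), m) for _ in range(10_000)]
print(all(x >= y for x, y in pairs))
```

Marginally, the first coordinate is $\chi_2^2$ and the second has the distribution of $-2 \ln \lambda_j$, yet on this shared probability space the first is never smaller than the second.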

Let $\{X_j\}_{j=1}^n$ be IID $\chi_2^2$ random variables,
and let $Y_j \equiv - 2 \ln \lambda_j$, $j=1, \ldots, n$, where $\{\lambda_j\}$ are independent, as before.

Then there is some probability space 
for which we can define $\{\tilde{Y_j}\}$ and $\{\tilde{X_j}\}$ such that

+ $(\tilde{Y_j})$ has the same joint distribution as $(Y_j)$

+ $(\tilde{X_j})$ has the same joint distribution as $(X_j)$

+ $\tilde{X_j} \ge \tilde{Y_j}$ for all $j$ with probability one.

Then

+ $\sum_j \tilde{Y_j}$ has the same distribution as $\sum_j Y_j = -2 \sum_j \ln \lambda_j$,

+ $\sum_j \tilde{X_j}$ has the same distribution as $\sum_j X_j$ (namely, chi-square with $2n$ degrees of freedom),

+ $\sum_j \tilde X_j  \ge \sum_j \tilde{Y_j}$.

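Hence, 

$$
  \Pr \left \{-2 \sum_j \ln \lambda_j \ge \chi_{2n}^2(\alpha) \right \} 
  = \Pr \left \{ \sum_j \tilde{Y_j} \ge \chi_{2n}^2(\alpha) \right \}
  \le \Pr \left \{ \sum_j \tilde{X_j} \ge \chi_{2n}^2(\alpha) \right \} = \alpha.
$$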

Thus, we still get a conservative hypothesis test if one or more of the $P$-values for the
partial tests have atoms under their respective null hypotheses $\{H_{0j}\}$.
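For a small discrete example the rejection probability can be computed exactly by enumeration. The sketch below assumes, for illustration only, that each partial $P$-value is uniform on $\{1/m, 2/m, \ldots, 1\}$ under its null (so $\Pr \{ \lambda_j \le p \} \le p$) and that the partial $P$-values are independent. It rejects when the chi-square tail probability of Fisher's statistic is at most $\alpha$, which is the same event as the statistic being at least $\chi_{2n}^2(\alpha)$. The function names are hypothetical.

```python
import itertools
import math


def chi2_sf_even(x, n):
    """P(X >= x) for X chi-square with 2n degrees of freedom, n a positive integer."""
    if x <= 0:
        return 1.0
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def exact_rejection_rate(m, n, alpha):
    """Chance Fisher's test rejects when n independent P-values are uniform on {1/m, ..., 1}."""
    support = [k / m for k in range(1, m + 1)]
    rejections = 0
    # every combination of partial P-values is equally likely under the null
    for lam in itertools.product(support, repeat=n):
        stat = -2 * sum(math.log(p) for p in lam)
        if chi2_sf_even(stat, n) <= alpha:
            rejections += 1
    return rejections / m ** n


rate = exact_rejection_rate(m=20, n=3, alpha=0.05)
print(f"exact rejection probability: {rate:.5f} (nominal level 0.05)")
```

The exact rejection probability falls below the nominal level, as the coupling argument guarantees; how far below depends on how coarse the partial $P$-values are.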

## Dependent tests

If $\{ \lambda_j \}_{j=1}^n$ are dependent, the distribution of $\phi_F(\lambda)$ is no longer chi-square when the null hypotheses are true.
Nonetheless, one can calibrate a test based on Fisher's combining function (or any other combining function) by simulation.
This is commonly used in multivariate permutation tests involving dependent partial tests
using "lockstep" permutations.

See, e.g., Pesarin, F. and L. Salmaso, 2010. _Permutation Tests for Complex Data: Theory, Applications and Software_, Wiley, ISBN 978-0-470-51641-6.

Also see [the permute Python package](http://statlab.github.io/permute/).
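To make the idea of calibrating by simulation concrete, here is a minimal sketch of a two-sample test on multivariate responses. It assumes the partial test statistic for each coordinate is the difference in means (treatment minus control), with larger values counting as stronger evidence. Each random relabeling of units is applied to all coordinates at once ("lockstep"), which preserves the dependence among the partial tests. The function names, the simulated data, and the number of permutations are all made up for illustration; this is not the API of the `permute` package.

```python
import math
import random


def mean_diffs(rows, labels):
    """Per-coordinate difference in means: label 1 (treatment) minus label 0 (control)."""
    d = len(rows[0])
    n1 = sum(labels)
    n0 = len(labels) - n1
    sums1, sums0 = [0.0] * d, [0.0] * d
    for row, lab in zip(rows, labels):
        target = sums1 if lab else sums0
        for j, value in enumerate(row):
            target[j] += value
    return [sums1[j] / n1 - sums0[j] / n0 for j in range(d)]


def npc_fisher(rows, labels, reps=500, seed=0):
    """Permutation P-value for Fisher's combination of dependent partial tests."""
    rng = random.Random(seed)
    d = len(rows[0])
    stats = [mean_diffs(rows, labels)]      # index 0 holds the observed statistics
    perm = list(labels)
    for _ in range(reps):
        rng.shuffle(perm)                   # one relabeling, shared by every coordinate
        stats.append(mean_diffs(rows, perm))
    total = len(stats)

    def partial_p(j, t):
        # fraction of arrangements (observed included) with statistic at least t
        return sum(s[j] >= t for s in stats) / total

    # Fisher's combining function for the observed data and for each relabeling
    phi = [-2 * sum(math.log(partial_p(j, s[j])) for j in range(d)) for s in stats]
    return sum(f >= phi[0] for f in phi) / total


# Synthetic data: two correlated coordinates, treatment shifts both by 3
rng = random.Random(1)
labels = [1] * 15 + [0] * 15
rows = []
for lab in labels:
    shared = rng.gauss(0, 1)                # shared noise makes the coordinates dependent
    rows.append([shared + rng.gauss(0, 1) + 3.0 * lab for _ in range(2)])

p_value = npc_fisher(rows, labels)
print(p_value)
```

The same set of relabelings is used twice: once to turn each partial statistic into a partial $P$-value, and again to find the null distribution of the combined statistic. No chi-square approximation is involved.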

## Stratified Permutation Tests

Two examples: 

+ Boring, A., K. Ottoboni, and P.B. Stark, 2016. Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness, _ScienceOpen_, doi 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

+ Hessler, M.,  D.M. Pöpping, H. Hollstein, H. Ohlenburg, P.H. Arnemann, C. Massoth, L.M. Seidel, A. Zarbock & M. Wenk, 2018. Availability of cookies during an academic course session affects evaluation of teaching, _Medical Education, 52_, 1064–1072. doi 10.1111/medu.13627
