# Multiple Testing and Simultaneous Confidence Sets

References:
+ Benjamini, Y., and Y. Hochberg, 1995, Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing, _JRSS B, 57_, 289-300.
+ Benjamini, Y., and D. Yekutieli, 2001. The control of the false discovery rate in multiple testing under dependency, _Ann. Statist., 29_, 1165-1188 https://doi.org/10.1214/aos/1013699998
+ Blanchard, G., and E. Roquain, 2008. Two simple sufficient conditions for FDR control, _Electronic Journal of Statistics, 2_, 963–992. https://doi.org/10.1214/08-EJS180
+ Genovese, C.R., and L. Wasserman, 2004. A stochastic process approach to false discovery control, _Ann. Statist., 32_, 1035–1061. https://doi.org/10.1214/009053604000000283
+ Hsu, J., 1996. _Multiple Comparisons: Theory and Methods_, Chapman and Hall, London.
+ Marcus, R., E. Peritz, and K.R. Gabriel, 1976. On Closed Testing Procedures with Special Reference to Ordered Analysis of Variance, _Biometrika, 63_, 655-660, https://doi.org/10.2307/2335748
+ Shaffer, J., 1995. Multiple Hypothesis Testing, _Ann. Rev. Psychol., 46_, 561-584.
+ Simes, R.J., 1986. An improved Bonferroni procedure for multiple tests of significance. _Biometrika, 73_, 751–754. https://doi.org/10.1093/biomet/73.3.751
+ Wang and Ramdas, 2022. False Discovery Rate Control with E-values, _Journal of the Royal Statistical Society Series B, 84_, 822–852. https://doi.org/10.1111/rssb.12489 


## What is multiplicity, and why does it matter?

If we test hypotheses at significance level $\alpha$, by definition, the chance we reject a particular null hypothesis if that hypothesis is true is at most $\alpha$.

But often we test not just one, but several or many hypotheses.
For example, we might be evaluating a collection of drugs, and want to test the
family of null hypotheses that each is not effective.
Or we might be evaluating one drug using several tests for different clinical outcomes.

Suppose we test each null hypothesis at level $\alpha$.
If more than one of the hypotheses is true, the chance that we erroneously reject one or more true nulls may be much larger than the chance of rejecting each individual true null.

Similarly, if we make a collection of $1-\alpha$ confidence sets for $n>1$ parameters, by definition, the chance that each confidence set will contain its corresponding parameter is at least $1-\alpha$, but the chance that _all $n$_ sets contain their corresponding parameter can be much lower.

By analogy, the chance that a fair coin lands heads in a single toss is 1/2, but the chance a fair coin lands heads at least once in 10 independent tosses is $1-(1/2)^{10} > 0.999$.
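To make the analogy concrete for tests, the following minimal sketch computes the chance of at least one false rejection among $m$ true nulls, assuming (for illustration only) that the tests are independent and that each has level exactly $\alpha$. The function name is hypothetical.

```python
def chance_of_at_least_one_false_rejection(alpha: float, m: int) -> float:
    """Chance of one or more rejections among m independent tests of true
    nulls, each with exact level alpha (an illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** m


# The chance grows quickly with the number of tests.
for m in (1, 10, 100):
    print(m, round(chance_of_at_least_one_false_rejection(0.05, m), 3))
```

With dependent tests the chance can be smaller or larger than this, but the union bound still limits it to $m\alpha$.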

The issue of multiplicity has become more obvious as Statistics has been brought to bear on problems in genetics and genomics, brain imaging, and similar problems, where millions of hypotheses are tested in a single study.

There is an enormous literature on multiple testing procedures and simultaneous confidence sets. This chapter presents a tiny slice.

## The per-comparison error rate (PCER)

The error rate "per true tested hypothesis" is called the _per-comparison error rate_ (PCER).
If every hypothesis is tested at level $\alpha$, the PCER is controlled at level $\alpha$.

But if any of the nulls is true and is tested at exact level $\alpha$, the chance of making at least one Type I error is at least $\alpha$,
and is typically larger.

The same multiplicity issues arise in computing confidence sets:
If $\{ I_j \}_{j=1}^m$ are individually level $1-\alpha$ confidence sets for
a collection of parameters $\{ \mu_j \}_{j=1}^m$, 
so that 
\begin{equation}
   \mathbb{P} \{ I_j \ni \mu_j \} = 1-\alpha, \;\; j = 1, \ldots, m,
\end{equation}
then the event
\begin{equation}
A = \cap_{j=1}^m \{ I_j \ni \mu_j \}
\end{equation}
typically has probability much less than $1-\alpha$.

## Notation

We'll use the notation in Benjamini and Hochberg (1995).

$\mbox{ }$  | <div style="width:200px">Declared non-significant</div> |  <div style="width:200px">Declared significant</div>| <div style="width:150px">Total</div> |
:---- | :------------------: | :-------------: | :-----------: | 
True null hypotheses | $U$ |  $V$ | $m_0$ | 
False null hypotheses | $T$ | $S$ | $m - m_0$ |
Total | $m-R$ | $R$ |  $m$ |


The number of hypotheses tested is $m$, considered to be known.
The number of null hypotheses that are true is $m_0$; $m_0$ is unknown.
The total number of rejected null hypotheses is the random variable $R$, which is observable.
The random variables $U$, $V$, $T$, and $S$ are not observable.

If each hypothesis is tested individually at level $\alpha$, then the PCER is
$\mathbb{E}(V/m) \le \alpha$.

## The Familywise Error Rate (FWER)

Let $\{ H_j \}_{j=1}^m$ ($m$ for multiplicity)
be the family of null hypotheses to
be tested, and let $H_0 = \cap_{j} H_j$ be the _grand null hypothesis._
If $H_0$ is true and each $H_j$ is tested at level $\alpha$, the expected number of (necessarily incorrect) rejections is at most $\alpha m$.
The _familywise error rate_ (FWER) is the probability of one or more incorrect
rejections:
\begin{equation}
\mbox{FWER} = \mathbb{P} \{ V > 0 \} = \mathbb{P} \left \{ \mbox{ reject one or more true } H_j, \; \; j \in \{1, \ldots, m \} \right \}.
\end{equation}
But what is $\mathbb{P}$ here?
**Strong control** of the FWER at level $\alpha$
means that the probability of one or more
incorrect rejections is at most $\alpha$, no matter which (if any) of the 
hypotheses $\{ H_j \}$ happen to be true.
**Weak control** of the FWER at level $\alpha$ means that the probability of one
or more incorrect rejections when the _grand null hypothesis_ $H_0$ is true is
at most $\alpha$:
\begin{equation}
\mathbb{P}_0 \{ \mbox{ reject one or more $H_j$ } \} \le \alpha.
\end{equation}
The FWER can be much larger than the significance level at which the
individual hypotheses are tested.

This _multiplicity problem_ is commonly ignored, which tends to
make results appear more significant than they really are: the
true significance level of the overall test is larger than the reported, "nominal" significance
level.
This problem is exacerbated by "publication bias" in favor of positive results.
If only results that are statistically significant are considered worthy of publication,
then many tests may be performed in searching for one that rejects the null: the chance that
a true null is (eventually) rejected can be very large.
Because the non-rejections are "invisible" (not published), it's hard to know how many tests were
done before the test that led to rejection.

### Closed testing procedures

Primary reference:  Marcus et al. (1976).

One way to control the FWER is to use _closed testing procedures_, which test a collection of hypotheses that is closed 
under intersection.
That is, if the (possibly compound) hypotheses $H_1$ and $H_2$ are in the collection, so is the hypothesis
$H_1 \cap H_2$.

We have a set of $K$ composite null hypotheses, $\{H_k \}_{k=1}^K$.
We want to test all $K$ hypotheses in such a way that the chance of erroneously rejecting one or more true nulls is 
at most $\alpha$; that is, we want the FWER to be at most $\alpha$.

Let $\mathcal{H}$ denote the set of all intersections of subsets of $\{H_k\}$.
The _closure principle_ says that if we test as follows, it controls the FWER.
Test every (composite) null in $\mathcal{H}$ at level $\alpha$.
Reject a null $H_0 \in \mathcal{H}$ only if its test and the 
test of every hypothesis in $\mathcal{H}$ that is a subset of $H_0$ all reject.


The proof that this procedure controls FWER at level $\alpha$ is simple. 
Let $A$ be the event that one or more _true_ $H_k$ is rejected.
Let $B$ be the event that the level-$\alpha$ test of the intersection of all the true nulls rejects.
Because $\mathcal{H}$ is closed under intersections, the intersection of all true nulls is one of the hypotheses in $\mathcal{H}$.
Then
\begin{equation}
\mathbb{P} \{A \cap B \} = \mathbb{P}(A | B) \mathbb{P}(B) \le \mathbb{P}(B) \le \alpha
\end{equation}
since the intersection of the true nulls is itself true and is tested at level $\alpha$.
The intersection of all the true nulls is a subset of every true null, so this procedure rejects a true null only if it has also rejected the intersection of all
true nulls. That is, $A \subset B$, so $A \cap B = A$, and hence $\mathbb{P}(A) \le \alpha$.
Thus closed testing limits the FWER to $\alpha$.
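The closure principle can be sketched in a few lines for a small family. The sketch below assumes (this choice is not from the text) that every intersection hypothesis is tested with a Bonferroni test, which rejects the intersection of $|S|$ hypotheses when the smallest of their $P$-values is at most $\alpha/|S|$. The function name is hypothetical.

```python
from itertools import combinations


def closed_testing_bonferroni(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by closed testing.

    Assumption for illustration: each intersection hypothesis is tested by a
    Bonferroni test (reject if min p-value in the subset <= alpha / size).
    """
    K = len(pvalues)
    # Level-alpha local test of every nonempty intersection: 2^K - 1 tests.
    local_reject = {}
    for size in range(1, K + 1):
        for subset in combinations(range(K), size):
            smallest = min(pvalues[i] for i in subset)
            local_reject[subset] = smallest <= alpha / size
    # Reject H_k only if every intersection that involves H_k is rejected.
    return [
        k for k in range(K)
        if all(rej for subset, rej in local_reject.items() if k in subset)
    ]


print(closed_testing_bonferroni([0.01, 0.04, 0.03]))  # [0]
```

In the example, the intersection of the second and third hypotheses is not rejected (0.03 > 0.05/2), so neither of them is rejected individually. The number of intersections grows as $2^K$, so brute-force enumeration like this is practical only for small $K$.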

### Procedures based on Bonferroni's Inequality

Bonferroni's inequality (also known as the _union bound_) says that
for any collection of events $\{A_j\}$,
$\mathbb{P} \{ \cup_j A_j \} \le \sum_j \mathbb{P} A_j$.
A consequence is that the chance of one or more type I errors in an arbitrary collection of
tests is at most the sum of their separate chances of type I errors.
Thus if each hypothesis is tested at 
level $\alpha/m$, the FWER is
$\mathbb{P}\{V \ge 1\} \le \alpha$.

This _Bonferroni adjustment_ gives strong control of the FWER, but the resulting tests can be very 
conservative--unnecessarily so.
There are other methods that rely on the union bound in more subtle ways to give more powerful testing
procedures that also control FWER.
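As a minimal sketch of the Bonferroni adjustment (the function name is hypothetical), each of the $m$ $P$-values is compared with $\alpha/m$:

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Test each of the m hypotheses at level alpha/m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


# With m = 3 and alpha = 0.05, the per-test threshold is about 0.0167.
print(bonferroni_reject([0.001, 0.02, 0.04]))  # [True, False, False]
```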


_Holm's Sequentially Rejective Bonferroni Method_ is based on the ordered $P$-values
$P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ of the $m$ hypotheses.
Order the hypotheses in the same way, so that the $P$-value of $H_{(1)}$ is $P_{(1)}$, etc.
Holm's procedure is 

> Reject $H_{(i)}$ if $P_{(k)} \le \alpha/(m-k+1)$ for all $k \le i$.  
Do not reject the other hypotheses.


**Theorem.**  
Holm's method controls the FWER at level $\alpha$.

**Proof.**  
Let $m_0$ be the number of true null hypotheses.
+ If $m_0 = m$, there is an incorrect rejection only if $P_{(1)} \le \alpha/m$, which has probability at most $\alpha$, by Bonferroni's inequality.
+ If $m_0 = m-1$, there is an incorrect rejection if either $H_{(1)}$ is one of the true null hypotheses and $P_{(1)} \le \alpha/m$, 
or if $H_{(1)}$ is false, $P_{(1)} \le \alpha/m$, and $P_{(2)} \le \alpha/(m-1)$. 
Let $P'_j$ be the $j$th smallest $P$-value among the $m_0$ true null hypotheses. 
There can only be an incorrect rejection if $P'_1 \le \alpha/(m-1)$ (but that condition is not sufficient for an incorrect rejection). By Bonferroni's inequality, the chance of an incorrect rejection is thus at most $\alpha$.
+ One can proceed similarly for $m_0 = m-2, \ldots, 1$, arguing that an incorrect rejection can only occur (but does not necessarily occur) if $P'_1 \le \alpha/m_0$; in each case, the chance of that event is at most $\alpha$, by Bonferroni's inequality. 
+ If $m_0 = 0$, there can be no incorrect rejection.

Holm's method is an example of a _step-down procedure_.
The schematic of a step-down procedure is that one looks at the smallest $P$-value first.
If that is larger than some threshold, no hypothesis is rejected. 
If not, the corresponding hypothesis is rejected, and one goes on to the second-smallest $P$-value.
As soon as one reaches the point that the $j$th smallest $P$-value is larger than the
$j$th threshold, no more hypotheses are rejected.
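Holm's method, viewed as a step-down procedure, might be sketched as follows. The function name is hypothetical; the thresholds $\alpha/(m-k+1)$ are those given above.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm's step-down method: returns one reject/retain flag per hypothesis."""
    m = len(pvalues)
    # Indices of the hypotheses, from smallest P-value to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = k - 1 for the k-th smallest
        if pvalues[i] <= alpha / (m - rank):  # threshold alpha / (m - k + 1)
            reject[i] = True
        else:
            break  # stop at the first P-value that exceeds its threshold
    return reject


print(holm_reject([0.01, 0.04, 0.03]))  # [True, False, False]
print(holm_reject([0.03, 0.01, 0.02]))  # [True, True, True]
```

In the first example the second-smallest $P$-value, 0.03, exceeds $0.05/2$, so the procedure stops after one rejection even though 0.04 is below the last threshold, 0.05.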

In a _step-up procedure_, one looks first at the largest $P$-value. If that is sufficiently
small, all the hypotheses are rejected. If not, the corresponding hypothesis is
not rejected, and one goes on to the second largest $P$-value. As soon as one reaches
the point that the $j$th largest $P$-value is smaller than the $j$th threshold,
all the remaining hypotheses are rejected. 


### Independent Test Statistics
Suppose we wish to test with FWER not exceeding $\alpha$
the family of hypotheses $\{H_i \}_{i=1}^m$ using independent test statistics
$\{T_i\}_{i=1}^m$.
Suppose we test each hypothesis at level $\beta$.
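If all $m$ null hypotheses are true and each test has level exactly $\beta$, the probability of one or more incorrect rejections (the FWER) is
$1- (1-\beta)^m$; otherwise it is no larger. To have FWER equal to $\alpha$ in that case requires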
\begin{eqnarray}
\alpha &=& 1- (1-\beta)^m \nonumber \\
(1-\alpha )^{1/m} &=& 1 - \beta \nonumber \\
\beta &=& 1 - (1-\alpha)^{1/m} .
\end{eqnarray}
Thus if we test the hypotheses individually at level $1 - (1-\alpha)^{1/m}$, the
FWER is at most $\alpha$.
This approach is called _Šidák's adjustment_.
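A small numerical comparison of the per-test levels may help. The function names below are hypothetical.

```python
def sidak_level(alpha: float, m: int) -> float:
    """Per-test level beta = 1 - (1 - alpha)^(1/m) for m independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def bonferroni_level(alpha: float, m: int) -> float:
    """Per-test level alpha / m from the union bound."""
    return alpha / m


for m in (2, 10, 100):
    print(m, bonferroni_level(0.05, m), sidak_level(0.05, m))
```

The Šidák level is never smaller than the Bonferroni level, but the gain is slight for small $\alpha$, and it relies on the independence of the test statistics.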

### Simes' inequality

See R.J. Simes (1986).

Suppose we are testing $m$ null hypotheses $\{ H_j \}$ using independent
test statistics $T_j$.
Let $P_{(j)}$ be the $j$th smallest $P$-value among the $m$ $P$-values.
Simes' method is

> reject the grand null hypothesis if for some $j$, $P_{(j)} \le j\alpha/m$.

When the grand null is true, Simes' method rejects it with probability at most $\alpha$; that is, it controls the FWER in the weak sense.
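A minimal sketch of Simes' test of the grand null (the function name is hypothetical):

```python
def simes_reject_grand_null(pvalues, alpha=0.05):
    """Reject the grand null if P_(j) <= j * alpha / m for some j."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return any(p <= j * alpha / m for j, p in enumerate(ordered, start=1))


# Bonferroni would not reject here (the smallest P-value exceeds 0.05/3),
# but the second-smallest P-value is below 2 * 0.05 / 3.
print(simes_reject_grand_null([0.02, 0.03, 0.04]))  # True
print(simes_reject_grand_null([0.02, 0.04, 0.06]))  # False
```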

**Theorem.** (Simes, 1986)  
Let $P_{(j)}$ be the $j$th order statistic of $m$ IID $U(0,1)$ random variables.
Then for $\alpha \in [0, 1]$,
\begin{equation}
A_m(\alpha) := \mathbb{P} \{ P_{(j)} > j\alpha/m, \; \; j = 1, \ldots, m \} = 1-\alpha .
\end{equation}


**Proof.**  
The proof is by induction on $m$.  Clearly, the statement is true for $m=1$: the chance a $U[0, 1]$ 
random variable is
greater than $\alpha$ is $1-\alpha$, and more generally, the chance (under the null) that a genuine $P$-value is
greater than $\alpha$ is at least $1-\alpha$.
For $m > 1$, $\{ P_{(1)}/P_{(m)}, P_{(2)}/P_{(m)}, \ldots, P_{(m-1)}/P_{(m)} \}$ are distributed as the
order statistics of $m-1$ iid $U(0,1)$ random variables, independent of $P_{(m)}$.
Thus, for $p \ge \alpha$,
\begin{eqnarray}
\mathbb{P} \left \{ P_{(j)} > \frac{j\alpha}{m};\; j = 1, \ldots, m-1 | P_{(m)} = p \right \} 
&=&
\mathbb{P} \left \{ P_{(j)}/p > \frac{j\alpha (m-1)}{pm(m-1)} ;\; j = 1, \ldots, m-1 | P_{(m)} = p \right \}
\nonumber \\
&=&
\mathbb{P} \left \{ P_{(j)}/p > \frac{j \frac{(m-1)\alpha}{pm}}{m-1};\; j = 1, \ldots, m-1 | P_{(m)} = p \right \}
\nonumber \\
&=&
A_{m-1} \left ( \frac{(m-1)\alpha}{pm} \right) .
\end{eqnarray}
The distribution function of $P_{(m)}$ is $p^m$, $p \in [0, 1]$, so the density
of $P_{(m)}$ is $mp^{m-1}$.
Suppose $A_{m-1}(\alpha) = 1-\alpha$, $\alpha \in [0, 1]$.
\begin{eqnarray}
A_m(\alpha) &=& \int_\alpha^1 A_{m-1} \left ( \frac{(m-1)\alpha}{pm} \right ) 
mp^{m-1} dp  \nonumber \\
&=& 
\int_\alpha^1 \left (1- \frac{(m-1)\alpha}{pm} \right ) mp^{m-1} dp  \nonumber \\
&=& 
\left . p^m \right |_\alpha^1 -
\left . \alpha p^{m-1} \right |_\alpha^1
\nonumber \\
&=& 1 - \alpha^m - \alpha + \alpha^m
\nonumber \\
&=& 1 - \alpha .
\end{eqnarray}
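The identity $A_m(\alpha) = 1-\alpha$ can be checked by simulation. The sketch below (hypothetical function name; the seed and number of replications are arbitrary choices) draws $m$ IID uniform $P$-values repeatedly and records how often every order statistic exceeds its threshold.

```python
import random


def simes_no_rejection_frequency(m, alpha, reps=20000, seed=0):
    """Monte Carlo estimate of P{P_(j) > j*alpha/m for all j} for IID U(0,1)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        ordered = sorted(rng.random() for _ in range(m))
        if all(p > j * alpha / m for j, p in enumerate(ordered, start=1)):
            count += 1
    return count / reps


# Each estimate should be close to 1 - alpha = 0.95, whatever m is.
for m in (1, 5, 20):
    print(m, simes_no_rejection_frequency(m, 0.05))
```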

Because $j \alpha/m \ge \alpha/(m-j+1)$ for $j = 1, \ldots, m$,
this test rejects the grand null at least as often as Holm's Bonferroni-based method rejects any hypothesis.

Simes' result has been generalized to 
_positively regression dependent test statistics_, defined as follows.
Two random variables $X$ and $Y$ are _positively regression dependent_ if
for $x_0 < x_1$, a random variable that has the
conditional distribution of $Y$ given $X=x_1$ is stochastically
larger than that of one with the conditional distribution of $Y$ given $X=x_0$
($Y$ tends to be larger when $X$ is larger).
Positively correlated normal random variables have positive regression dependence.



### Chebychev's Other Inequality

**Theorem.** (J. Hsu, 1996, _Multiple Comparisons: Theory and Methods_  Theorem A.1.1.)  
Let $X$ be an $n$-dimensional random variable. Suppose the functions $f, g : \Re^n \rightarrow
\Re$ satisfy
\begin{equation}
[f(x_2) - f(x_1)][g(x_2) - g(x_1)] \ge 0
\end{equation}
for all $x_1, x_2$ in the support of the distribution of $X$.
Then, provided the expectations exist,
\begin{equation}
\mathbb{E}[f(X)g(X)] \ge \mathbb{E}[f(X)]\mathbb{E}[g(X)] .
\end{equation}
I.e., $f(X)$ and $g(X)$ are positively correlated.



**Proof.**  
Let $X$, $X_1$ and $X_2$ be iid. 
Then
\begin{eqnarray}
0 &\le& \mathbb{E} \left [ (f(X_2) - f(X_1) )(g(X_2) - g(X_1) ) \right ] \nonumber \\
&=& \mathbb{E} \left [ (f(X_2)g(X_2) + f(X_1)g(X_1)) - (f(X_1)g(X_2) + f(X_2)g(X_1)) \right ]
\nonumber \\
&=& 2 \left [ \mathbb{E}[f(X)g(X)] - \mathbb{E}[f(X)]\mathbb{E}[g(X)] \right ] .
\end{eqnarray}

**Corollary.** (Kimball's inequality; Hsu (1996) Corollary A.1.1.)  
Let $V$ be a univariate random variable. 
If $\{g_j\}_{j=1}^m$ are bounded, nonnegative
real functions, monotone in the same direction, then
\begin{equation}
\mathbb{E} \left [ \prod_{j=1}^m g_j(V) \right ] \ge \prod_{j=1}^m \mathbb{E}g_j(V).
\end{equation}


**Proof.**  
Use induction from two functions to $m$ functions in the Theorem, 
taking $n=1$.

### Application to the one-way model
Suppose we are interested in the _one-way model_.
We observe
\begin{equation}
X_{ia} = \mu_i + \epsilon_{ia}, \;\; i=1, \ldots, m; \; a = 1, \ldots, n_i .
\end{equation}
This is a model for making $n_i$ observations of the response to treatment $i$
for $m$ different treatments, under the assumption that the response is a mean
response plus a random effect.
Assume that the errors $\epsilon_{ia}$ are iid $N(0, \sigma^2)$, $\sigma^2$ unknown.
Let $\hat{\mu}_i = \bar{X}_i = \frac{1}{n_i} \sum_{a=1}^{n_i} X_{ia}$.
Let $\nu = \sum_{i=1}^m (n_i - 1)$, and define
\begin{equation}
\hat{\sigma}^2 = 
\frac{1}{\nu}\sum_{i=1}^m \sum_{a=1}^{n_i} (X_{ia} - \bar{X}_i)^2 .
\end{equation}
The estimators $\{\hat{\mu}_i\}$ are independent normals with means $\{ \mu_i \}$
and variances $\{ \sigma^2/n_i \}$, independent of $\hat{\sigma}^2$,
and $\nu \hat{\sigma}^2/\sigma^2 \sim \chi^2_\nu$.
For future use, define
\begin{equation}
\hat{\mu} = (\hat{\mu}_j)_{j=1}^m.
\end{equation}
and
\begin{equation}
\hat{\sigma}_B^2 = \frac{1}{m-1} \sum_{i=1}^m n_i \left ( \hat{\mu}_i - 
\frac{1}{m}\sum_{i=1}^m \hat{\mu}_i \right )^2.
\end{equation}

Suppose we wish to find simultaneous confidence intervals for the
set of parameters $\{ \mu_i \}$.

Define the Studentized test statistics 
\begin{equation}
T_i = \frac{\hat{\mu}_i - \mu_i}{\hat{\sigma}/\sqrt{n_i}}, \; i=1, \ldots, m .
\end{equation}
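These test statistics are dependent, because of the common divisor $\hat{\sigma}$.
If they were independent, the intervals
\begin{equation}
\left [ \hat{\mu}_i - \frac{\hat{\sigma}}{\sqrt{n_i}} t_{\nu, 1- ( 1- ( 1-\alpha)^{1/m})/2},
\; \hat{\mu}_i + \frac{\hat{\sigma}}{\sqrt{n_i}} t_{\nu, 1- ( 1- ( 1-\alpha)^{1/m})/2} \right ],
\; i = 1, \ldots, m,
\end{equation}
where $t_{\nu, q}$ is the $q$ quantile of Student's $t$ distribution with $\nu$ degrees of freedom,
would be exact $1-\alpha$ simultaneous confidence intervals for $\{ \mu_i \}$: each interval covers with probability $(1-\alpha)^{1/m}$, as in Šidák's adjustment.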

Kimball's inequality lets one show that these intervals are in fact conservative as
a result of the dependence on $\hat{\sigma}$.
Let $A_i$ be the event that the $i$th interval covers. Consider the function
$g_i(\hat{\sigma}) = \mathbb{P} \{ A_i | \hat{\sigma} \}$.
These functions all increase monotonically with $\hat{\sigma}$.
Recall that $\{\hat{\mu}_i\}$ are independent of each other and of $\hat{\sigma}$.
Thus
\begin{eqnarray}
\mathbb{P} \left \{ \cap_{i=1}^m A_i \right \} &=& \mathbb{E} 1_{\cap_{i=1}^m A_i}
\nonumber \\
&=& \mathbb{E} \Pi_{i=1}^m 1_{A_i} \nonumber \\
&=& \mathbb{E}_{\hat{\sigma}} \mathbb{E} (\Pi_{i=1}^m 1_{A_i} | \hat{\sigma} )
\nonumber \\
&=& \mathbb{E}_{\hat{\sigma}} \Pi_{i=1}^m \mathbb{P} \{ A_i | \hat{\sigma} \}
\nonumber \\
&\ge& \Pi_{i=1}^m \mathbb{E}_{\hat{\sigma}} \mathbb{P} \{ A_i | \hat{\sigma} \}
\nonumber \\
&=& \Pi_{i=1}^m \mathbb{P} \{ A_i \} \nonumber
\\
&=& 1 - \alpha,
\end{eqnarray}
where Kimball's inequality, applied to the functions $g_i$, gives the inequality.


### Comparisons and Contrasts
We specialize to the case that we are interested in a collection of $m$ parameters
$\{ \mu_i \}_{i=1}^m$.  
Let $\mu = ( \mu_i )_{i=1}^m$.
The hypotheses we wish to test involve comparing the
parameters or linear combinations of the parameters.
For example, we might be interested in the 
family of hypotheses $\{ H_{ij}: \mu_i = \mu_j \}_{i= 1, \ldots,
m-1; j = i+1, \ldots, m }$ (all pairwise comparisons).
For $\mathcal{I}$ a subset of $\{ 1, \ldots, m \}$, let
$H_{\mathcal{I}}$ denote the hypothesis that all $\{ \mu_i \}_{i \in \mathcal{I}}$ are equal
(perhaps a better notation would be that $\#\{ \mu_i \}_{i \in \mathcal{I}} = 1$).
We might be interested in the family of hypotheses $\{ H_{\mathcal{I}} \}_{\mathcal{I} \in {\bf \mathcal{I}}}$,
where ${\bf \mathcal{I}}$ is a collection of subsets of $\{ 1, \ldots, m \}$.

A _contrast_ is a linear combination $\sum_{i=1}^m c_i \mu_i = c \cdot \mu$, with
the restriction that $\sum_{i=1}^m c_i = c \cdot {\mathbf{1}} = 0$. 
A _pairwise comparison_ is a contrast with $c_i = 1$ for some $i$, $c_j = -1$ for some $j \ne i$, and all other
components of $c$ equal to zero.

We are going to assume that we are in a one-way model with an equal number
$N$ of observations of each of the $m$ treatments. 
We again assume that the observational errors are iid $N(0, \sigma^2)$, with $\sigma^2$ unknown.
The rest of the notation is as above.

The _cost_ in terms of reduced power tends to increase with the number of hypotheses
tested; if one is not interested in testing all possible contrasts, one can have
more power testing the limited set. Some major divisions of families of hypotheses tested
in the one-way model include, in decreasing order of complexity,
ACC (all contrasts comparison), MCA (all pairwise comparisons), MCB (multiple comparisons
with the [sample] best), and MCC (multiple comparisons with control).
MCC involves the fewest comparisons: $m-1$ sample values are compared with the $m$th,
which is the control. In MCB, there are also only $m-1$ comparisons, but the measured
effect $\hat{\mu}_i$
that the other $m-1$ are compared with is that one observed to be best; under the 
grand null, that is equally likely to be any of the $\hat{\mu}_i$.
In MCA, there are $(m^2 - m)/2$ hypotheses tested, and in ACC, an infinite number are
tested.

### The Scheffé Method
The Scheffé method controls the FWER for all possible contrasts (ACC).
The _grand null_ in this case is that all the $\mu_i$ are equal, so all
contrasts are zero.

Recall that if $Y$ has a chi-square distribution with $k$ degrees of freedom
and $Y'$ has a chi-square distribution with $\ell$ degrees of freedom, and
$Y$ and $Y'$ are independent, then
\begin{equation}
\frac{Y/k}{Y'/\ell}
\end{equation}
has an $F$ distribution with $k$ and $\ell$ degrees of freedom, denoted $F_{k,\ell}$.
Let $F_{k,\ell,\alpha}$ denote the upper $\alpha$ critical value of $F_{k,\ell}$, that is, its $1-\alpha$ quantile.
It is a standard result in the analysis of variance that under the one-way normal model with equal sample sizes, when all the $\mu_i$ are equal,
$(m-1)\hat{\sigma}_B^2/\sigma^2 \sim \chi_{m-1}^2$ and 
$\nu\hat{\sigma}^2/\sigma^2 \sim \chi_\nu^2$ are independent, so
\begin{equation}
\frac{\hat{\sigma}_B^2}{\hat{\sigma}^2} \sim F_{m-1,\nu}.
\end{equation}

The variables $\{ \sqrt{n_i}(\hat{\mu}_i - \mu_i)/\sigma \}$ are iid $N(0,1)$,
and $\sigma^{-2} \sum_{i=1}^m n_i (\hat{\mu}_i - \mu_i)^2$ has a chi-square distribution
with $m$ degrees of freedom, and is independent of $\hat{\sigma}^2$,
so whatever be $\{ \mu_i \}_{i=1}^m$,
\begin{equation}
\mathbb{P} \left \{ 
\frac{\sum_{i=1}^m n_i | \hat{\mu}_i - \mu_i |^2}{m \hat{\sigma}^2}
\le F_{m,\nu,\alpha} \right \} = 1-\alpha .
\end{equation}
Equivalently,
\begin{equation}
\mathbb{P} \{ \sum_{i=1}^m n_i|\hat{\mu}_i - \mu_i|^2 \le m \hat{\sigma}^2
F_{m,\nu,\alpha} \} = 1-\alpha .
\end{equation}
In the case all $n_i = N$, this becomes
\begin{equation}
\mathbb{P} \{ \| \hat{\mu} - \mu \|^2 \le \frac{m}{N} \hat{\sigma}^2
F_{m,\nu,\alpha} \} = 1-\alpha .
\end{equation}
That is, the chance is $1-\alpha$ that $\hat{\mu} \in \Re^m$ 
is in a ball centered at $\mu$ of radius 
\begin{equation}
r_\alpha = \hat{\sigma} \sqrt{\frac{m F_{m,\nu,\alpha}}{N}}.
\end{equation}
The unit ball in $\Re^m$ can be characterized as
\begin{equation}
\{ \beta \in \Re^m : |c \cdot \beta| \le \|c\| \;\; \forall c \in \Re^m \},
\end{equation}
so 
\begin{equation}
\mathbb{P} \{ |c \cdot \hat{\mu} - c \cdot \mu | \le \|c\| r_\alpha \; \forall c \in \Re^m \}
= 1-\alpha .
\end{equation}
This gives simultaneous confidence intervals for $c \cdot \mu$ (whether or not $c$ is a
_contrast_) as
\begin{equation}
\mathcal{I}_c = [ c \cdot \hat{\mu} - \|c\|r_\alpha, c \cdot \hat{\mu} + \|c\|r_\alpha ].
\end{equation}
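For testing contrasts, one rejects the hypothesis that $c \cdot \mu = 0$ if
$|c \cdot \hat{\mu}| > \|c\| r_\alpha$, and one rejects the grand null hypothesis
if that happens for any contrast $c$, that is, if $\| \hat{\mu} - \bar{\mu} \mathbf{1} \| > r_\alpha$, where $\bar{\mu} := \frac{1}{m} \sum_{i=1}^m \hat{\mu}_i$.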
Any number of contrasts can be tested this way, with FWER strongly controlled at 
level $\alpha$.
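As a rough sketch, a Scheffé interval for $c \cdot \mu$ in the balanced case can be computed as below. It takes $r_\alpha = \hat{\sigma}\sqrt{m F_{m,\nu,\alpha}/N}$, the radius implied by the statement $\| \hat{\mu} - \mu \|^2 \le \frac{m}{N} \hat{\sigma}^2 F_{m,\nu,\alpha}$. The function name and the argument `f_crit` are hypothetical: `f_crit` stands for $F_{m,\nu,\alpha}$ and must be supplied by the user, for example from a table or a statistics library.

```python
import math


def scheffe_interval(c, mu_hat, sigma_hat, N, f_crit):
    """Simultaneous interval for c . mu in the balanced one-way model.

    f_crit is the user-supplied critical value F_{m, nu, alpha}.
    """
    m = len(mu_hat)
    radius = sigma_hat * math.sqrt(m * f_crit / N)  # r_alpha
    center = sum(ci * mi for ci, mi in zip(c, mu_hat))  # c . mu_hat
    half_width = math.sqrt(sum(ci * ci for ci in c)) * radius  # ||c|| r_alpha
    return center - half_width, center + half_width


# Pairwise comparison of treatments 1 and 2 among m = 3 treatments;
# the numbers are made up for illustration.
print(scheffe_interval([1, -1, 0], [1.0, 2.0, 3.0], 1.0, 3, 4.0))
```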

Note that if one uses Scheffé's method to produce confidence intervals only for 
the effects $\{ \mu_i \}$, it is unnecessarily conservative: it amounts to projecting
a ball onto the coordinate axes, which is equivalent to taking the corresponding
hyperrectangle as the confidence set for $\mu$. That hyperrectangle strictly contains
the ball, so it has higher coverage probability than the ball. 
If we were interested only in simultaneous confidence intervals for $\{ \mu_i\}$,
we could get shorter confidence intervals by starting with a hyperrectangular 
confidence region for $\mu$ (with faces aligned with the axes), and projecting _that_
set.  This is more or less what Tukey's maximum modulus method does.

### Tukey's Maximum Modulus Method

Tukey's method was originally introduced for all pairwise comparisons, but can
be modified for ACC.
Again, let's take $n_i = N$.
Define $c^*(\alpha)$ to satisfy
\begin{equation}
\mathbb{P} \left \{ \frac{|\hat{\mu}_i - \hat{\mu_j} - ( \mu_i - \mu_j ) |}{\hat{\sigma}\sqrt{2/N}}
\le c^*(\alpha) \forall j < i \right \} = 1-\alpha.
\end{equation}
Values of $c^*(\alpha)$ can be found by numerical integration.
Then
\begin{equation}
\mathcal{I}_{ij} = [\hat{\mu}_i - \hat{\mu}_j - c^*(\alpha) \hat{\sigma}\sqrt{2/N},
\hat{\mu}_i - \hat{\mu}_j + c^*(\alpha) \hat{\sigma}\sqrt{2/N}], \; j < i
\end{equation}
are simultaneous level $1-\alpha$ confidence intervals for the $(m^2 - m)/2$ pairwise
differences $\mu_i - \mu_j$, $j < i$.
By construction, the tests

> reject $H_{ij}: \mu_i = \mu_j$ if $|\hat{\mu}_i - \hat{\mu}_j| > c^*(\alpha) \hat{\sigma}\sqrt{2/N}$

control the FWER for all pairwise comparisons at level $\alpha$.
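As an alternative to numerical integration, $c^*(\alpha)$ can be approximated by simulation. The sketch below (hypothetical function name; the seed and number of replications are arbitrary) uses the fact that, in the balanced normal model, the largest Studentized pairwise difference is distributed as $(\max_i Z_i - \min_i Z_i)/(\sqrt{2} S)$, where the $Z_i$ are IID standard normal and $\nu S^2 \sim \chi^2_\nu$ independently of them.

```python
import math
import random


def tukey_c_star(m, nu, alpha=0.05, reps=20000, seed=0):
    """Monte Carlo approximation to c*(alpha) for m treatments, nu error df."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # chi-square with nu df is Gamma(shape=nu/2, scale=2); s = sigma_hat/sigma
        s = math.sqrt(rng.gammavariate(nu / 2.0, 2.0) / nu)
        stats.append((max(z) - min(z)) / (math.sqrt(2.0) * s))
    stats.sort()
    # Empirical (1 - alpha) quantile of the maximum Studentized difference.
    return stats[math.ceil((1.0 - alpha) * reps) - 1]


print(tukey_c_star(2, 20), tukey_c_star(5, 20))
```

For $m = 2$ there is a single comparison, so the value should be near the two-sided Student $t$ critical value with $\nu$ degrees of freedom; it grows with $m$.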

## The False Discovery Rate
See Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001).

Rejecting a null hypothesis is sometimes called a _statistical discovery_,
so rejecting a true null hypothesis is a _false discovery_. 
The _False Discovery Rate_ (FDR) of a testing procedure is the expected fraction of rejected null hypotheses that are in fact true (with the fraction defined to be zero if no hypothesis is rejected). 
That is, let $V$ denote the number of incorrectly rejected null hypotheses and let $R$ denote the total 
number of rejected null hypotheses, and define
\begin{equation}
  \mbox{FDR} := \mathbb{E}(V/R | R>0) \mathbb{P}(R>0) = \mathbb{E} Q,
\end{equation}
where
\begin{equation}
Q := \left \{ \begin{array}{ll}
        V/R, & R>0 \\
        0, & R=0.
        \end{array}
        \right .
\end{equation}
is the _false discovery proportion_ (FDP), a random variable.
That is, the FDR is the expected FDP.
(The conditioning is to prevent division by zero.
The definition of $Q$ makes sense because if no hypothesis was rejected, no hypothesis was rejected
erroneously.)
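The false discovery proportion for one realized set of decisions can be computed as in this minimal sketch (hypothetical function name). It requires knowing which nulls are true, so it is useful in simulations rather than in practice, where $V$ is not observable.

```python
def false_discovery_proportion(null_is_true, rejected):
    """Return (V, R, Q) for one set of test decisions.

    null_is_true[i] says whether H_i is true; rejected[i] whether it was rejected.
    """
    V = sum(1 for t, r in zip(null_is_true, rejected) if t and r)
    R = sum(1 for r in rejected if r)
    Q = V / R if R > 0 else 0.0  # Q is defined to be 0 when nothing is rejected
    return V, R, Q


# One true null rejected among three rejections: Q = 1/3.
print(false_discovery_proportion([True, True, False, False],
                                 [True, False, True, True]))
```

Averaging $Q$ over many simulated data sets estimates the FDR; since $Q \le 1$ and $Q = 0$ unless $V > 0$, that average can never exceed the fraction of data sets with $V > 0$, which estimates the FWER.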

1. If all null hypotheses are true, the FDR is the same as the FWER: $\mathbb{P} \{V \ge 1 | m_0 = m\} = \mathbb{E}(Q)$. Controlling the FDR thus controls the FWER in a weak sense.
1. When the number $m_0$ of true null hypotheses is less than the total number $m$ of hypotheses, FDR $\le$ FWER. Thus controlling the FWER controls the FDR.

### The Benjamini-Hochberg and Benjamini-Yekutieli procedures to control the FDR

The Benjamini-Hochberg procedure for testing $m$ hypotheses at FDR level $\alpha$ is as follows:
Let $\{P_1, \ldots, P_m\}$ be the $P$-values of the $m$ hypotheses, 
and let $\{P_{(1)}, \ldots, P_{(m)}\}$
be the $P$-values ordered so that $P_{(1)} \le \cdots \le P_{(m)}$.

+ Find $K(\alpha) := \max \left \{k \in \{1, \ldots, m\} : P_{(k)} \le \frac{k}{m} \alpha \right \}$ (with $\max \emptyset := 0$).
+ Reject the null hypotheses $H_{(i)}$ for all $i \le K(\alpha)$.
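The two steps above can be sketched as follows (the function name is hypothetical):

```python
def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: one flag per hypothesis."""
    m = len(pvalues)
    # Indices of the hypotheses, from smallest P-value to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    K = 0  # K(alpha), with the convention that the max of the empty set is 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] <= k * alpha / m:
            K = k  # keep the largest k whose P-value meets its threshold
    reject = [False] * m
    for i in order[:K]:
        reject[i] = True
    return reject


# The two smallest P-values exceed their thresholds (0.0125 and 0.025), but the
# third meets its threshold (0.0375), so the three smallest are all rejected.
print(benjamini_hochberg_reject([0.02, 0.03, 0.035, 0.9]))
# [True, True, True, False]
```

Unlike a step-down procedure, the loop does not stop at the first $P$-value that exceeds its threshold: a later, larger $P$-value can still trigger rejection of every hypothesis with a smaller $P$-value.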

This procedure (the B-H procedure) works for independent $P$-values and $P$-values that have 
_positive regression dependence on a subset_,
defined as follows:
A set $D \subset \Re^m$ is _increasing_ if $x \in D$ and $y \ge x$ implies $y \in D$.
An $m$-vector $X$ of $P$-values is PRDS if for every component $X_k$ corresponding to a true null $H_k$ 
and every increasing set $D \subset \Re^m$,
\begin{equation}
 \mathbb{P} \{ X \in  D | X_k \le x \}
\end{equation}
is an increasing function of $x$.


The B-H procedure guarantees that $\mbox{FDR} \le \frac{m_0}{m}\alpha \le \alpha$, where $m_0$ is the number of true null
hypotheses.
The Benjamini-Yekutieli procedure works for arbitrary dependence. It involves an additional
function $c(m):= \sum_{i=1}^m 1/i$:

+ Find $K^*(\alpha) := \max \left \{k \in \{1, \ldots, m\} : P_{(k)} \le \frac{k}{mc(m)} \alpha \right \}$ (with $\max \emptyset := 0$).
+ Reject the null hypotheses $H_{(i)}$ for all $i \le K^*(\alpha)$.
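A sketch of the Benjamini-Yekutieli steps (hypothetical function name); the only change from the Benjamini-Hochberg thresholds is the divisor $c(m)$:

```python
def benjamini_yekutieli_reject(pvalues, alpha=0.05):
    """Benjamini-Yekutieli step-up procedure: one flag per hypothesis."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # c(m) = 1 + 1/2 + ... + 1/m
    order = sorted(range(m), key=lambda i: pvalues[i])
    K = 0  # K*(alpha), with the convention that the max of the empty set is 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] <= k * alpha / (m * c_m):
            K = k
    reject = [False] * m
    for i in order[:K]:
        reject[i] = True
    return reject


# With m = 4, c(m) is about 2.08, so the thresholds are about
# 0.006, 0.012, 0.018, and 0.024.
print(benjamini_yekutieli_reject([0.001, 0.011, 0.5, 0.9]))
# [True, True, False, False]
```

Because $c(m) \approx \ln m$ for large $m$, the thresholds are considerably smaller than the Benjamini-Hochberg thresholds; that is the price of validity under arbitrary dependence.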


**Theorem.** Benjamini and Hochberg (1995) Theorem 1.  
If the test statistics are independent, then for any configuration of
false null hypotheses, the Benjamini-Hochberg procedure controls the FDR at level $\alpha$.

The proof relies on the following lemma:

**Lemma.**  Benjamini and Hochberg (1995)  
Let the number of true null hypotheses be $m_0$, $0 \le m_0 \le m$.
Order the hypotheses such that the first $m_0$ are the true ones.
Let $m_1 = m - m_0$ be the number of false null hypotheses.
If the test statistics of the true null hypotheses are independent, for the procedure
just given,
\begin{equation}
\mathbb{E}(Q| P_{m_0 + 1} = p_1 , \ldots , P_m = p_{m_1} ) \le \frac{m_0}{m} \alpha .
\end{equation}

**Proof of Lemma.**  
Benjamini and Hochberg prove the lemma by induction; the proof is similar to the proof of Simes' inequality.
For simplicity, we assume that the test statistics have continuous distributions so that the $P$-values of
the true nulls have uniform distributions, but the result is true without that assumption.
Suppose $m=1$.
Then the procedure rejects $H_1$ if $P_1 \le \alpha$.
If $m_0 = 0$, no incorrect rejection can occur, so
\begin{equation}
\mathbb{E}(Q| P_1 = p_1) = 0 \le \frac{0}{1} \alpha .
\end{equation}
If $m_0 = 1$, an incorrect rejection occurs if $P_1 \le \alpha$.
There is no $P_{m_0 + 1}$, so
\begin{equation}
\mathbb{E}(Q| P_{m_0 + 1} = p_1 , \ldots , P_m = p_{m_1} )  = \mathbb{E}(Q) = \mathbb{P} \{ P_1 \le \alpha \}
= \alpha \le \frac{1}{1} \alpha .
\end{equation}

Suppose that the lemma is true for all $m' \le m$; we shall show that it is
then true for $m' = m+1$.
If $m_0 = 0$, the null hypotheses are all false, so $Q$ is identically zero, as is
the conditional expectation of $Q$:
\begin{equation}
\mathbb{E}(Q| P_1 = p_1, \ldots, P_m = p_m ) = 0 \le \frac{m_0}{m+1} \alpha .
\end{equation}
Suppose $m_0 > 0$. 
Let $P'_i$, $i = 1, \ldots, m_0$ be the $P$-values corresponding
to the true null hypotheses. Let $P'_{(m_0)}$ be the largest $P'_i$. 
Since $\{ P'_i \}_{i=1}^{m_0}$ are IID $U[0, 1]$, the density of $P'_{(m_0)}$ is $f(u) = m_0 u^{m_0 - 1}$.
Let $\{p_j\}_{j=1}^{m_1}$ be the $P$-values of the false null hypotheses, ordered
so that $p_1 \le p_2 \le \cdots \le p_{m_1}$.
Define 
\begin{equation}
j_0 :=  \max \left \{j : p_j \le \frac{m_0+j}{m+1} \alpha \right \},
\end{equation}
and
\begin{equation}
p_0 = \frac{m_0 + j_0}{m+1}\alpha .
\end{equation}
Note that $p_{j_0} \le p_0$.
Calculate the expectation, conditioning on the value of $P'_{(m_0)}$:
\begin{eqnarray}
\mathbb{E}(Q| P_{m_0+1} = p_1, \ldots, P_m = p_{m_1} ) &=&
\int_0^{p_0} \mathbb{E} \left (Q| P'_{(m_0)} = u, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1} \right ) f(u) du
\nonumber \\
&+& \int_{p_0}^1 \mathbb{E}(Q| P'_{(m_0)} = u, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1} ) f(u) du 
\end{eqnarray}
In the first integral, $u \le p_0$, so all $m_0$ true null hypotheses are rejected along with $j_0$ false ones, and 
$Q = \frac{m_0}{m_0 + j_0}$.
Recall that $p_0 = \frac{m_0 + j_0}{m+1} \alpha$.
Thus
\begin{eqnarray}
\int_0^{p_0} \mathbb{E}(Q| P'_{(m_0)} = u, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1} ) f(u) du
&=&
\int_0^{p_0} 
\mathbb{E}(\frac{m_0}{m_0 + j_0} | P'_{(m_0)} = u, P_{m_0+1} = p_1, \ldots, P_m = p_{m_1} ) 
f(u) du
\nonumber \\
&=&
\int_0^{p_0} \frac{m_0}{m_0 + j_0} m_0 u^{m_0 - 1} du
\nonumber \\
&=& \frac{m_0}{m_0 + j_0} p_0^{m_0}
\nonumber \\
&= & \frac{m_0}{m_0 + j_0} \frac{m_0 + j_0}{m+1} \alpha p_0^{m_0-1}
\nonumber \\
&=& \frac{m_0}{m+1} \alpha p_0^{m_0 - 1}.
\end{eqnarray}
Consider the second integral. 
On the domain of the second integral, $P'_{(m_0)} = u \ge p_0 \ge p_{j_0}$.
Here, the true null hypothesis with the largest $P$-value, and possibly other true nulls, will not be rejected.
Suppose $j > j_0$ so that $p_{j+1} \ge p_j \ge p_{j_0}$.
Recall that $p_{j_0} \le p_0$.
Break the domain of integration into the intervals
$p_j \le u \le p_{j+1}$, $j = j_0 + 1, \ldots, m_1 - 1$, together with
$p_0 \le u \le p_{j_0+1}$ and $p_{m_1} \le u \le 1$.
Because $u, p_{j+1}, \ldots, p_{m_1}$ are all greater than the threshold
value $p_0$, their values
cannot result in any hypothesis being rejected.

Let $\{ H_{(i)} \}_{i=1}^m$ denote the entire set of $m$ null
hypotheses, ordered by their $P$-values.
In the second integral,
the $P$-values of the true null hypotheses are all no larger than $u$
(by definition of $u$---it's the largest $P$-value among the true null hypotheses).
Recall that the rejection procedure is to reject all hypotheses with smaller $P$-values
than $p_0$, so the rejection of $H_{(i)}$ implies that there must be some $k$,
$i \le k \le m_0 + j - 1$ for which
\begin{equation}
p_{(k)} \le \frac{k}{m+1} \alpha.
\end{equation}
This is equivalent to
\begin{equation}
\frac{p_{(k)}}{u} \le \frac{k}{m_0 + j - 1} \frac{m_0 + j - 1}{(m+1) u} \alpha .
\end{equation}
The proof is now similar to that of Simes' inequality: conditional on $P'_{(m_0)} = u$,
the remaining $m_0 - 1$ values $P'_i/u$ are IID $U[0,1]$ random variables; $\{p_i/u \}_{i=1}^j$ are
some numbers between 0 and 1 corresponding to false null hypotheses.
We are testing $m_0 + j - 1 = m' \le m$ hypotheses, of which $m_0 - 1$ are true, using a different value of $\alpha$,
namely $\frac{m_0 + j - 1}{(m+1)u} \alpha$.
Because $m' \le m$, we can apply the induction hypothesis:
\begin{equation}
\mathbb{E}(Q|P'_{(m_0)} = u, P_{m_0 + 1} = p_1, \ldots, P_m = p_{m_1} )
\le \frac{m_0 - 1}{(m+1) u} \alpha.
\end{equation}
This bound does not depend on $p_j$ or $p_{j+1}$,
so
\begin{eqnarray}
\int_{p_0}^1 \mathbb{E}(Q|P'_{(m_0)} = u, P_{m_0 + 1} = p_1, \ldots, P_m = p_{m_1} )
f(u) du &\le&
\int_{p_0}^1 \frac{m_0 - 1}{(m+1) u} \alpha m_0 u^{m_0 - 1} du 
\nonumber \\
&=& \frac{m_0}{m+1} \alpha \int_{p_0}^1 (m_0 - 1) u^{m_0 - 2} du
\nonumber \\
&=& \frac{m_0}{m+1} \alpha ( 1 - p_0^{m_0 - 1}).
\end{eqnarray}
Adding this to the value of the first integral, $\frac{m_0}{m+1} \alpha p_0^{m_0 - 1}$, gives the bound $\frac{m_0}{m+1} \alpha$, which proves the Lemma.


**Proof of FDR Theorem.**  
Whatever be the joint distribution of $P_{m_0+1}, \cdots, P_m$ corresponding to the
false null hypotheses, integrating the inequality in the Lemma gives
\begin{equation}
\mathbb{E}(Q) = \mathbb{E}(\mathbb{E}(Q| P_{m_0+1}, \cdots, P_m)) \le \frac{m_0}{m} \alpha \le \alpha .
\end{equation}

The FDR-controlling procedure
is equivalent to picking the per-test significance level $a$ _a posteriori_ to maximize the number 
$r(a)$ of hypotheses whose $P$-values are no larger than $a$, subject to the constraint
\begin{equation}
a m / r(a) \le \alpha .
\end{equation}
That is, we reject as many hypotheses as possible, subject to the constraint that
a bound on the expected number of incorrect rejections is at most the 
FDR level times the number of hypotheses actually rejected.
Heuristically, if every hypothesis were tested at a fixed level $a$, the expected number of incorrect rejections would be $\mathbb{E}(V) \le a m_0 \le a m$, so
the proportion of incorrect rejections is roughly at most $a m / r(a) \le \alpha$.
This is only a heuristic, because $a$ is chosen using the data; the Lemma supplies the proof.
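
The equivalence can be illustrated numerically. The sketch below assumes the usual step-up form of the procedure (reject the $k$ smallest $P$-values, for the largest $k$ with $p_{(k)} \le k\alpha/m$) and compares it with a direct search over candidate per-test levels, written `a` in the code to keep them distinct from the FDR level `alpha`. The function names and the example $P$-values are made up for illustration.

```python
import numpy as np

def bh_count(pvals, alpha):
    """Number of step-up rejections: largest k with p_(k) <= k * alpha / m."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    passed = np.nonzero(p <= alpha * np.arange(1, m + 1) / m)[0]
    return 0 if passed.size == 0 else int(passed[-1]) + 1

def a_posteriori_count(pvals, alpha):
    """Maximize r(a) = #{i : p_i <= a} over levels a with a * m / r(a) <= alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    best = 0
    # r(a) changes only at observed P-values, so they are the only candidates needed
    for a in p:
        r = int(np.sum(p <= a))
        if a * m / r <= alpha:
            best = max(best, r)
    return best

# illustrative P-values (not from any real study)
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_count(pvals, 0.05), a_posteriori_count(pvals, 0.05))
```

Restricting the search to observed $P$-values loses nothing: for $a$ strictly between two consecutive ordered $P$-values, $r(a)$ is unchanged while $a m / r(a)$ is larger than at the lower endpoint.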

One complaint about this FDR-controlling procedure (see Shaffer, 1995) is that because $Q$ is defined to be zero when no
rejection occurs, the conditional FDR given that some rejection does occur can exceed $\alpha$.
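
To see why, let $R$ denote the number of rejections and suppose $\mathbb{P}\{R > 0\} > 0$. Because $Q = 0$ whenever $R = 0$,
\begin{equation}
\mathbb{E}Q = \mathbb{E}(Q | R > 0) \, \mathbb{P}\{R > 0\},
\;\;\mbox{ so }\;\;
\mathbb{E}(Q | R > 0) = \frac{\mathbb{E}Q}{\mathbb{P}\{R > 0\}} \ge \mathbb{E}Q .
\end{equation}
As an extreme example, if all $m$ null hypotheses are true, every rejection is a false rejection, so $Q = 1$ whenever $R > 0$ and $\mathbb{E}(Q | R > 0) = 1$ no matter how small $\alpha$ is.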

## Using $E$-values to control the FDR

Wang & Ramdas (2022) show that the Benjamini-Hochberg procedure applied to $E$-values instead of $P$-values controls the FDR _under arbitrary dependence of the test statistics_, with no need for the Benjamini-Yekutieli adjustment. 

In particular, suppose we have an $E$-value $E_j$ for each hypothesis $H_j$, $j \in \{1, \ldots, m\}$.
For $k \in \{1, \ldots, m\}$, let $E_{[k]}$ be the $k$th order statistic of the $E$-values, ordered from largest to smallest, so that $E_{[1]} \ge E_{[2]} \ge \cdots \ge E_{[m]}$.
The procedure is:

+ Let $K_E(\alpha) := \max \left \{ k \in \{1, \ldots, m\}: k E_{[k]}/m \ge 1/\alpha \right \}$, with $K_E(\alpha) := 0$ if no such $k$ exists.
+ Reject the hypotheses with the $K_E(\alpha)$ largest $E$-values.
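
The two steps above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; the function name `ebh_reject` and the example $E$-values are made up, and the code takes $K_E(\alpha) = 0$ (no rejections) when no $k$ satisfies the condition.

```python
import numpy as np

def ebh_reject(evals, alpha):
    """Reject the K largest E-values, K = max{k : k * E_[k] / m >= 1/alpha} (0 if none)."""
    e = np.asarray(evals, dtype=float)
    m = len(e)
    order = np.argsort(-e)                 # indices from largest to smallest E-value
    k = np.arange(1, m + 1)
    passed = np.nonzero(k * e[order] / m >= 1.0 / alpha)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size > 0:
        # step-up: reject everything up to the largest passing k
        reject[order[: passed[-1] + 1]] = True
    return reject

# illustrative E-values (not from any real study)
evals = [50.0, 0.5, 12.0, 40.0, 2.0, 1.0]
rej = ebh_reject(evals, alpha=0.1)
print(rej)
```

In this example the largest $E$-value alone does not pass ($1 \times 50/6 < 10$), but the two largest together do ($2 \times 40/6 \ge 10$), so both are rejected. That is the step-up character of the rule.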

To prove the result requires a bit of new terminology.
A testing procedure based on $E$-values is _self-consistent for level $\alpha$_ if every $E$-value $E_k$ corresponding
to a rejected null satisfies
\begin{equation}
   E_k \ge \frac{m}{\alpha R},
\end{equation}
where $R$ is the number of rejected nulls.
This condition is adapted from a similar condition for FDR-controlling procedures based on $P$-values given by
Blanchard & Roquain (2008).
The idea is that the threshold $E$-value for rejecting a hypothesis should decrease as the FDR (and hence the number of rejections) is allowed to grow. The specific functional form (depending inversely on $R$ and $\alpha$) is an additional restriction.

**Theorem** (Wang and Ramdas, 2022, Proposition 2).  
Every $E$-value based test that is self-consistent for level $\alpha$ has FDR at most $\alpha m_0/m$.

**Proof.**  
Let $\mathbf{E}$ denote the vector of $E$-values; let $\mathcal{N}$ denote the indices of the true null 
hypotheses; and let $\mathcal{R}(\mathbf{E})$ denote the indices of the null hypotheses that are rejected by the test.
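For every rejected index $k$, the self-consistency condition implies $R \ge \frac{m}{\alpha E_k}$, i.e., $\frac{1}{R \vee 1} \le \frac{\alpha E_k}{m}$. Since $E$-values are nonnegative, dropping the indicator in the last step below can only increase the sum, so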
\begin{eqnarray}
    Q := \frac{V}{R \vee 1} &=& \frac{|\mathcal{R}(\mathbf{E}) \cap \mathcal{N}|}{R \vee 1} \\
    &=& \sum_{k \in \mathcal{N}} \frac{1_{k \in \mathcal{R}(\mathbf{E})}}{R \vee 1} \\
    &\le& \sum_{k \in \mathcal{N}} \frac{1_{k \in \mathcal{R}(\mathbf{E})} \alpha E_k}{m} \\
    &\le& \sum_{k \in \mathcal{N}} \frac{ \alpha E_k}{m}.
\end{eqnarray}
For the true nulls $k \in \mathcal{N}$, $\mathbb{E}E_k \le 1$, so
\begin{equation}
   \mathbb{E} Q \le \sum_{k \in \mathcal{N}} \mathbb{E} \frac{\alpha E_k}{m} \le \frac{\alpha m_0}{m}.
\end{equation}
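
A small simulation can illustrate the theorem under dependence. The sketch below rests on assumptions that are not in the text above: equicorrelated normal test statistics with correlation `rho`, false nulls with mean `delta`, and $E$-values equal to the likelihood ratio of $N(\delta, 1)$ to $N(0, 1)$, which has expected value 1 when the mean is 0. The rejection rule is the step-up rule on $E$-values described earlier; all names and parameter values are illustrative.

```python
import numpy as np

def ebh_reject(evals, alpha):
    """Reject the K largest E-values, K = max{k : k * E_[k] / m >= 1/alpha} (0 if none)."""
    e = np.asarray(evals, dtype=float)
    m = len(e)
    order = np.argsort(-e)
    passed = np.nonzero(np.arange(1, m + 1) * e[order] / m >= 1.0 / alpha)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size > 0:
        reject[order[: passed[-1] + 1]] = True
    return reject

def simulate_ebh_fdr(m=20, m0=15, alpha=0.1, rho=0.8, delta=3.0, reps=10000, seed=2022):
    """Monte Carlo estimate of E(Q) when the first m0 hypotheses are true nulls."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(m)
    mu[m0:] = delta                        # assumed alternative: mean shift delta
    q = np.empty(reps)
    for i in range(reps):
        w = rng.standard_normal()          # shared component induces dependence
        z = mu + np.sqrt(rho) * w + np.sqrt(1 - rho) * rng.standard_normal(m)
        e = np.exp(delta * z - delta**2 / 2)   # likelihood-ratio E-values
        rej = ebh_reject(e, alpha)
        q[i] = rej[:m0].sum() / max(rej.sum(), 1)
    return q.mean()

fdr_hat = simulate_ebh_fdr()
print(f"estimated FDR: {fdr_hat:.4f}; bound alpha * m0 / m: {0.1 * 15 / 20:.4f}")
```

Up to Monte Carlo error, the estimate should not exceed $\alpha m_0/m$. The bound is generally conservative, so the estimate may fall well below it.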

Wang and Ramdas develop the approach much further, including finding situations where the raw $E$-values can be
"boosted," and modifications to enforce structure among the rejected nulls.

## Other quantities

Genovese and Wasserman (2004) propose controlling a percentile of the false discovery proportion $Q$, rather than the false discovery rate, $\mathbb{E}Q$. That is, they give a method for which $\mathbb{P}\{Q \ge \alpha \} \le 1-\gamma$.
Similar approaches have been taken by others.

## Analogues for confidence sets

### Individual confidence sets

### Simultaneous confidence sets

### The False Coverage Rate (FCR)

## Selective inference

### Error rates for selective confidence sets