\section{Multivariate regression}
t

\section{Principal components analysis}
t

\section{Factor analysis}
t

\section{High-dimensional statistics: Multiple testing problem}
Suppose that we have a large number $p$ of variables, some of which may explain an outcome $y$. How do we know which ones are important? Which statistical measures can we use?

Genome-wide association study: We have $n$ individuals measured at $p \sim 10^6$ positions on the genome, where each $x_{ij}$ takes the value 0, 1 or 2, denoting how many copies of the reference DNA letter individual $i$ carries at position $j$ of the genome. In addition, each individual has been measured for cholesterol levels (outcome $y$). Which genomic positions affect cholesterol levels? We can do a linear regression of $y$ on each predictor $x_j$ separately, which leads to the regression summary statistics $(\hat{\beta}_j, SE_j, P_j)$ for each $x_j$. The task is to infer which of the $10^6$ predictors truly alter cholesterol levels.

\subsection{P-value}
In the simplest modeling approach we start by computing some summary statistics of association, such as the effect size estimate $\hat{\beta}$, its standard error $SE$ and a P-value. We have one null hypothesis for each variable. The P-value is the probability of getting something at least as extreme as what has been observed, if the null hypothesis were true. A P-value is a probability of data given a hypothesis, NOT a probability of the hypothesis given data.

The P-value (of data $y$) is the probability that, if additional data $Z$ were generated according to the null hypothesis, the corresponding test statistic $t(Z)$ computed from $Z$ would be at least as large as our observed $t(y)$. That is, the P-value of data $y$ is $Pr(t(Z) \geq t(y) | Z \sim NULL)$.
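
The definition can be checked numerically with a small Monte Carlo sketch. It assumes, purely for illustration, a hypothetical observed z-score of 2, the test statistic $t(y) = |z|$ and a standard normal null distribution; all variable names are made up for this sketch.

\begin{minted}[breaklines]{R}
set.seed(1)                   # illustrative seed (assumption)
z.obs = 2                     # hypothetical observed z-score
t.obs = abs(z.obs)            # observed test statistic t(y) = |z|
R.mc = 1e5                    # number of data sets simulated under the null
t.null = abs(rnorm(R.mc))     # t(Z) for Z generated under the null, z ~ N(0,1)
p.mc = mean(t.null >= t.obs)  # Monte Carlo estimate of Pr(t(Z) >= t(y) | NULL)
p.exact = 2 * pnorm(-t.obs)   # exact two-sided P-value for comparison
c(p.mc = p.mc, p.exact = p.exact)
\end{minted}

The two numbers should agree up to Monte Carlo error, which shrinks as \texttt{R.mc} grows.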

Example: Let's fit one linear regression with $p=1$ and locate its observed t-statistic and P-value in the null distribution of the t-statistic.

\begin{minted}[breaklines]{R}
set.seed(29)
n = 100
x = rnorm(n) # predictor
y = rnorm(n) # outcome, independent of x
lm.fit = lm(y ~ x)
x.grid = seq(-3, 3, 0.05) # to define the plotting region
plot(x.grid, dt(x.grid, df = n-2), lty = 2, lwd = 2, t = "l",
xlab = expression(hat(beta)/SE), ylab = "density", main = "NULL DISTR") #null distribution of t-stat.
t.stat = summary(lm.fit)$coeff[2,3] # observed t-statistic: Estimate/SE
points(t.stat, 0, pch = 19, cex = 2, col = "red")
segments(t.stat*c(1,-1), c(0,0), t.stat*c(1,-1), rep(dt(t.stat, df = n-2), 2), col="red") #height of the plotted t density at +-t
text(2, 0.25, paste0("P=",signif(summary(lm.fit)$coeff[2,4],3)), col="red")
legend("topright",pch=19,col="red",leg="observed t")
\end{minted}

Recently, there has been much discussion about how traditional use of P-values has led to poor replicability, especially with high dimensional data sets with multiple testing issues. 

\subsection{Distribution of summary statistics}
Let's generate some data, first without any real effects. Our purpose is to see how a large set of p-values behave. 

\begin{minted}[breaklines]{R}
set.seed(6102017)
n = 1000 #individuals
p = 5000 #variables measured on each individual
X = matrix( rnorm(n*p), n, p) #just random variables
y = rnorm(n) #outcome variable that is not associated with any of x
#by mean-centering y and each x, we can ignore intercept terms (since they are 0, see Lecture 0)
X = as.matrix( scale(X, scale = F) ) #mean-centers columns of X to have mean 0
y = as.vector( scale(y, scale = F) )
#apply lm to each column of X separately and without intercept (see Lecture 0.)
lm.res = apply(X, 2 , function(x) summary(lm(y ~ -1 + x))$coeff[1,])
# lm.res has 4 rows: beta, SE, t-stat and P-value
pval = lm.res[4,] #pick P-values
par(mfrow = c(1,2))
plot(density(lm.res[3,]), sub = "", xlab = "t-stat", main = "", lwd = 2) #should be t with n-2 df
curve(dt(x, df = n-2), from = -4, to = 4, add = T, col = "blue", lwd = 2, lty = 2) #t distr in blue
curve(dnorm(x, 0, 1), from = -4, to = 4, add = T, col = "red", lwd = 2, lty = 3)#normal distr in red
hist(pval, breaks = 50, xlab = "P-value", main = "", col = "steelblue")
\end{minted}

We see that the empirical distribution of t-statistics accurately follows its theoretical $t(n-2)$ distribution, and since $n$ is large enough, this distribution resembles the normal distribution $N(0,1)$. The histogram shows that the p-values seem to be distributed uniformly between 0 and 1. This is indeed their distribution when the data follows the null hypothesis. 

Let's also compare the distributions via QQ-plots.
\begin{minted}[breaklines]{R}
par(mfrow=c(1,2)) #Let's make qqplots for t-stats and for P-values
qqnorm(lm.res[3,], cex = 0.7, pch = 3); qqline(lm.res[3,], col = "red")
#((1:p)-0.5) / p gives us
#p equally spaced values in (0,1) to represent quantiles of Uniform(0,1).
qqplot(-log10( ((1:p)-0.5) / p), -log10(pval), xlab = "Theoretical",
ylab = "Observed", main = "QQ-plot for -log10 P-val", cex = 0.7, pch = 3)
abline(0, 1, col = "red")
\end{minted}

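Why are p-values distributed as $U(0,1)$ under the null? By construction, under the null a P-value falls at or below any threshold $q_0$ with probability $q_0$, i.e. $Pr(P \leq q_0 | NULL) = q_0$ for every $q_0 \in [0,1]$ (assuming a continuous test statistic). Thus the cumulative distribution function (cdf) of null p-values is $F(x) = x$, $0 \leq x \leq 1$, which equals the cdf of $U(0,1)$.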
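
A short derivation makes the uniformity explicit. As an assumption for this sketch, let the test statistic $T$ have a continuous cdf $F_0$ under the null (the symbols $T$ and $F_0$ are introduced here only for illustration), and let large values of $T$ count as extreme, so that $P = 1 - F_0(T)$. Then
\[
Pr(P \leq q_0 | NULL) = Pr(1 - F_0(T) \leq q_0 | NULL) = Pr(F_0(T) \geq 1 - q_0 | NULL) = 1 - (1 - q_0) = q_0
\]
where the third equality uses the fact that $F_0(T) \sim U(0,1)$ whenever $T$ has the continuous cdf $F_0$ (the probability integral transform).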

At significance level 0.05, about $0.05\cdot 5000 = 250$ predictors are expected to be labeled as significant even though they are all just random noise. This growing flood of false positives with an increasing number of tests is the multiple testing problem that arises in the standard hypothesis testing framework when the significance level $\alpha$ is kept fixed while $p$ grows.

Let's add some ($m=50$) predictors that have non-zero effects on the outcome $y$.
\begin{minted}[breaklines]{R}
set.seed(6102017)
n = 1000 #individuals
p = 5000 #variables measured on each individual
m = 50 #number of predictors that have an effect: they are x_1,...,x_m.
b = 0.5 #effect size of predictors that have an effect
X = matrix(rnorm(n*p), n, p) # random predictors
y = X[,1:m] %*% rep(b,m) + rnorm(n) #outcome variable that is associated with x_1,...,x_m
#by mean-centering y and each x, we can ignore intercept terms (since they are 0)
X = as.matrix(scale(X, scale = F)) #mean-centers columns of X
y = as.vector(scale(y, scale = F))
#apply lm to each column of X separately and without intercept
lm.res = apply(X, 2 , function(x) summary(lm(y ~ -1 + x))$coeff[1,])
#has 4 rows: beta, SE, t-stat and pval
pval = lm.res[4,]
par(mfrow = c(1,2))
plot(density(lm.res[3,]), sub = "", xlab = "t-stat", main = "", lwd = 2) #under null is t with n-2 df
curve(dnorm(x), -4, 4, col = "red", lty = 3, add = T) #normal distribution in red
hist(pval, breaks = 50, xlab = "P-value", main = "", col = "dodgerblue")
\end{minted}

Now the P-value distribution differs from the null assumption of uniform distribution by an enrichment of the smallest p-values.

Let's try QQ-plots.

\begin{minted}[breaklines]{R}
par(mfrow = c(1,2)) #Let's make qqplots for t-stats and for P-values
qqnorm(lm.res[3,], cex = 0.6, pch = 3)
qqline(lm.res[3,], col = "red")
#Now we use ppoints() to give the p quantiles from the uniform:
qqplot(-log10(ppoints(p)), -log10(pval), xlab = "Theoretical",
ylab = "Observed", main = "QQ-plot for -log10 P-val", cex = 0.6, pch = 3)
abline(0, 1, col = "red")
\end{minted}

The next question is how we can make statistically sound inference about which of the tested predictors have non-zero effects.

\subsection{Multiple testing framework}
Let $H_j$, ``the predictor $j$ is null'', be the null hypothesis for predictor $x_j$, $j = 1, \dots, p$. We introduce the following notation:
\begin{enumerate}
\item $p$ is the total number of hypotheses tested
\item $p_0$ is the number of true null hypotheses, an unknown parameter
\item $p - p_0$ is the number of false null hypotheses, i.e. true non-null (``alternative'') hypotheses
\item $FD$ is the number of false discoveries or false positives (Type I errors)
\item $TD$ is the number of true discoveries or true positives
\item $FN$ is the number of false negatives or false non-discoveries (Type II errors)
\item $TN$ is the number of true negatives or true non-discoveries
\item $D = FD + TD$ is the number of discoveries, i.e. rejected null hypotheses
\end{enumerate}

Of these we only observe $p$ and $D$ and will make statistical inference on the rest.
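
To make the notation concrete, here is a minimal sketch on a hypothetical toy example with five tests where, unlike in real data, the truth is known. The vectors and all names ending in \texttt{.toy} are assumptions made up for this illustration.

\begin{minted}[breaklines]{R}
# Hypothetical toy example (assumption): 5 tests with known truth
is.null.toy = c(FALSE, TRUE, TRUE, FALSE, FALSE) # TRUE if the null hypothesis holds
dis.toy     = c(TRUE,  TRUE, FALSE, FALSE, TRUE) # TRUE if the test was called a discovery
p.toy  = length(is.null.toy)                     # number of tests
p0.toy = sum(is.null.toy)                        # number of true nulls
FD.toy = sum(dis.toy & is.null.toy)              # false discoveries
TD.toy = sum(dis.toy & !is.null.toy)             # true discoveries
FN.toy = sum(!dis.toy & !is.null.toy)            # false negatives
TN.toy = sum(!dis.toy & is.null.toy)             # true negatives
D.toy  = FD.toy + TD.toy                         # all discoveries
c(p = p.toy, p0 = p0.toy, FD = FD.toy, TD = TD.toy, FN = FN.toy, TN = TN.toy, D = D.toy)
\end{minted}

In a real analysis only \texttt{p.toy} and \texttt{D.toy} would be observable; the other counts require knowing \texttt{is.null.toy}.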

\subsection{P-values and family-wise error rate FWER}
The simplest inference procedure is to fix a statistical significance threshold $\alpha$ and call each predictor significant (at significance level $\alpha$) if its p-value is $\leq \alpha$. Thus $\alpha$ is also the Type I error rate, or false positive rate, i.e. the rate at which null predictors are labelled as significant. This can be written as $E(FD/p_0) = \alpha$. Traditional significance testing controls only this false positive rate. Methods that control the much more stringent family-wise error rate (FWER) are often used in the multiple testing setting.

FWER is the probability of making at least one false discovery across all the tests carried out in a multiple testing setting:
\[
FWER = Pr(FD \geq 1) = E(I(FD \geq 1))
\]
where $I$ is the indicator function.

Example: Suppose that 10 independent groups each run a clinical trial of the same drug on the same disease, each testing at significance level 0.05. What is the FWER of such a procedure under the null hypothesis?
\[
Pr(\text{at least one } P \leq 0.05 | NULL) = 1 - Pr(\text{all } P > 0.05 | NULL) = 1 - (1-0.05)^{10} \approx 0.401
\]
This example shows why it is problematic that only ``significant'' results tend to get published: a proper assessment of the drug should be based on all 10 available studies, NOT only on the one that happened to give a ``significant'' result, since that study is likely to be biased towards a larger effect size given that the other studies did not report significant results.
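
The calculation generalizes directly to any number of independent tests. The sketch below uses a hypothetical helper \texttt{fwer.indep} (a name made up here) and assumes that all null hypotheses are true and that the tests are independent.

\begin{minted}[breaklines]{R}
# FWER of k independent tests, each at level alpha, when all nulls are true
fwer.indep = function(k, alpha = 0.05) 1 - (1 - alpha)^k # hypothetical helper
k.grid = c(1, 10, 100, 1000)                             # numbers of tests to compare
cbind(k = k.grid, FWER = round(fwer.indep(k.grid), 3))
\end{minted}

With a fixed per-test level the FWER approaches 1 quickly as the number of tests grows, which is why a correction is needed.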

\subsection{Bonferroni correction}
The simplest way to control FWER at level $\alpha$ is to apply the significance threshold $\alpha_B = \alpha/p$ to each test, i.e. to report as significant the predictors whose p-value is $\leq \alpha_B$. This is called the Bonferroni correction for multiple testing. The proof that it does the job uses the union bound over the $p_0$ true null hypotheses (indexed here, without loss of generality, as $j = 1, \dots, p_0$):
\[
FWER = Pr\bigg( \bigcup_{j=1}^{p_0}\{P_j \leq \alpha_B\} | NULL \bigg) \leq \sum_{j=1}^{p_0} Pr(P_j \leq \alpha_B | NULL) = p_0 \cdot \alpha_B = p_0 \frac{\alpha}{p} \leq p\frac{\alpha}{p} = \alpha 
\]
Its advantages are thus complete generality (the proof makes no assumption about the dependence between the tests) and a very simple form that is easy to apply in practice.
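
The bound can also be checked empirically. The following sketch simulates data sets in which every hypothesis is null, using independent $U(0,1)$ P-values, and records how often the Bonferroni threshold produces at least one false discovery. The seed, sizes and variable names are assumptions chosen for illustration.

\begin{minted}[breaklines]{R}
set.seed(2)                 # illustrative seed (assumption)
alpha.sim = 0.05            # target FWER
p.sim = 100                 # number of tests per data set, all null
R.sim = 2000                # number of replicated data sets
# For each replicate: is at least one null P-value at or below alpha/p?
any.fd = replicate(R.sim, min(runif(p.sim)) <= alpha.sim / p.sim)
fwer.hat = mean(any.fd)     # empirical FWER of the Bonferroni procedure
fwer.hat
\end{minted}

Up to simulation error, the empirical FWER should not exceed \texttt{alpha.sim}.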

The Bonferroni correction controls FWER, but it is very stringent and hence has low statistical power to detect true effects. This has motivated a lot of research on how to improve power, e.g. the Holm method.

\subsection{Holm method}
\begin{enumerate}
    \item Order the p-values from the lowest to the highest: $P_{(1)} \leq \dots \leq P_{(p)}$, and let the corresponding hypotheses be $H_{(1)}, \dots, H_{(p)}$.
    \item For a given significance level $\alpha$, let $j$ be the smallest index such that $P_{(j)} > \frac{\alpha}{p + 1 - j}$.
    \item Reject the null hypotheses $H_{(1)}, \dots, H_{(j-1)}$ and do not reject $H_{(j)}, \dots, H_{(p)}$.
    \item If $j=1$, do not reject any of the null hypotheses, and if no such $j$ exists, reject all of the null hypotheses.
\end{enumerate}

Proof: Let $I_0$ be the set of the $p_0$ indices of the true null hypotheses. Let $k$ be the rank of the first true null hypothesis in the ordered sequence of p-values, i.e. $H_{(1)}, \dots, H_{(k-1)}$ are false but $H_{(k)}$ is true. Because the procedure stops at the first non-rejection, a false discovery can occur only if $H_{(k)}$ is rejected, so it suffices to show that the probability that $H_{(k)}$ is rejected is $\leq \alpha$. Since there are $p_0 -1$ true nulls in the ordered sequence of hypotheses after $H_{(k)}$, it follows that
\[
k + p_0 - 1 \leq p \implies \frac{1}{p+1-k} \leq \frac{1}{p_0} \implies \frac{\alpha}{p + 1 -k} \leq \frac{\alpha}{p_0}
\]
Because $P_{(k)}$ is the smallest p-value among the true nulls, we get
\[
Pr\bigg( P_{(k)} \leq \frac{\alpha}{p+1-k} \bigg) \leq Pr\bigg( P_{(k)} \leq \frac{\alpha}{p_0} \bigg) = Pr \bigg( \bigcup_{i \in I_0} \bigg\{P_i \leq \frac{\alpha}{p_0} \bigg\} \bigg) \leq \sum_{i \in I_0}Pr \bigg(P_i \leq \frac{\alpha}{p_0} \bigg) = p_0 \frac{\alpha}{p_0} = \alpha
\]

\begin{minted}[breaklines]{R}
fwer = 0.05
p.ex = 5 #use ".ex" to not mix up with p=5000 existing variables that we will reuse later
pval.ex = c(0.4, 0.001, 0.8, 0.011, 0.12)
#Bonferroni rejects:
(pval.ex <= fwer/p.ex)

#For Holm we first sort P values in ascending order
sorted.pval = sort(pval.ex)
#we compute individual rejection threshold for EACH hypothesis in ascending order
alpha.holm = fwer/( p.ex + 1 - (1:p.ex) )
rbind(sorted.pval, alpha.holm)
\end{minted}
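
The Holm decision itself can be sketched by locating the first ordered P-value that exceeds its threshold. This is only one possible way to code the steps of the procedure; the variable names are made up for illustration, and the result is cross-checked against R's built-in \texttt{p.adjust}.

\begin{minted}[breaklines]{R}
alpha.ex = 0.05
pval.h = c(0.4, 0.001, 0.8, 0.011, 0.12)          # same toy P-values as above
p.h = length(pval.h)
ord = order(pval.h)                               # indices of hypotheses, smallest P-value first
thr = alpha.ex / (p.h + 1 - (1:p.h))              # Holm thresholds for ranks 1,...,p
fail = which(pval.h[ord] > thr)                   # ranks whose P-value exceeds its threshold
j = if (length(fail) > 0) min(fail) else p.h + 1  # first non-rejection (p+1 if there is none)
reject.holm = rep(FALSE, p.h)
reject.holm[ord[seq_len(j - 1)]] = TRUE           # reject H_(1),...,H_(j-1)
reject.bonf = (pval.h <= alpha.ex / p.h)          # Bonferroni for comparison
rbind(pval.h, reject.holm, reject.bonf)           # logicals are shown as 0/1
# Cross-check against the adjusted P-values from p.adjust
all(reject.holm == (p.adjust(pval.h, method = "holm") <= alpha.ex))
\end{minted}

In this toy example Holm rejects the hypotheses with P-values 0.001 and 0.011, whereas Bonferroni rejects only the first of them.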

Exercise: Assume that both Bonferroni and Holm methods have rejected the hypothesis corresponding to the $k$ smallest p-values. What is the ratio of p-value thresholds that is needed to reject the next hypothesis with these methods?

\begin{minted}[breaklines]{R}
p.thresh = 0.5 #this is very liberal significance level for raw P-values, but not after FWER adjustment
sum( pval < p.thresh )

sum( p.adjust(pval, method = "holm") < p.thresh )

sum( p.adjust(pval, method = "bonferroni") < p.thresh )

#Let's see how many true and false positives we have
signif.tests = (pval < p.thresh)
S = sum(signif.tests[1:m]) #True positives
V = sum(signif.tests[(m+1):p]) #False positives
print(paste("Raw P-values: TP =",S,"FP =",V))

signif.tests = (p.adjust(pval, method="holm") < p.thresh)
S = sum(signif.tests[1:m]) #True positives
V = sum(signif.tests[(m+1):p]) #False positives
print(paste("Holm: TP =",S,"FP =",V))
\end{minted}

\subsection{Criticism towards FWER control}
By controlling FWER we can clearly keep the number of false positives low across the whole experiment. However, FWER methods are so wary of false positives that they tend to miss many true effects. Therefore, less stringent multiple testing correction methods have been developed to control the false discovery rate (FDR), which will be our next topic.

Frequentist approach vs. Bayesian inference

\section{High-dimensional statistics: False discovery rate}
In the multiple testing problem, control of the family-wise error rate (FWER) seemed to be a reasonable way to filter out a few of the most prominent candidates for true positives from a vast set of null variables.

\subsection{Distribution of z-scores}
Suppose that in linear regression $y=\mu + x \beta + \epsilon$ the true slope is $\beta$. If $var(x) = v_x$ and error variance is $\sigma^2$, then the sampling variance of $\hat{\beta}$ is $v_{\beta} = \sigma^2/(nv_x)$ and z-score for testing the slope is
\[
z = \frac{\hat{\beta}}{\sqrt{v_{\beta}}} \sim N\bigg(\frac{\beta}{\sqrt{v_{\beta}}}, 1\bigg)
\]
We can compute p-values for such z-scores using \texttt{pchisq(z\^{}2, df = 1, lower = F)}, because under the null $z \sim N(0,1)$ and hence $z^2 \sim \chi_1^2$. With this method we can easily generate p-values for $p_0$ null variables ($\beta=0$) and $m$ non-null variables ($\beta \neq 0$) and test various inference methods on those p-values.
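
As a quick sanity check, the chi-square tail probability of $z^2$ coincides with the two-sided normal tail probability of $z$. The z-scores below are hypothetical values chosen only for illustration.

\begin{minted}[breaklines]{R}
z.ex = c(-2.5, -1, 0.3, 1.96, 4)                      # hypothetical z-scores
p.chisq = pchisq(z.ex^2, df = 1, lower.tail = FALSE)  # P-value via chi-square with 1 df
p.norm  = 2 * pnorm(-abs(z.ex))                       # two-sided normal tail probability
cbind(z = z.ex, p.chisq, p.norm)
all.equal(p.chisq, p.norm)                            # the two computations agree
\end{minted}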

Let's generate p-values for $p=1000$ variables that we have tested independently. Let's also assume that $m=100$ of them actually do have an effect, each explaining 1\% of the variance of the corresponding outcome.
\begin{minted}[breaklines]{R}
set.seed(11102017)
n = 1000
p = 1000
m = 100
b = sqrt( 0.01 / (1 - 0.01) ) #This means each predictor explains 1%, See Lecture 0.
#Generate P-values, 1,...,m are true effects, m+1,...,p are null.
eff = c(rep(T, m), rep(F, p-m)) #indicator for non-null effects
pval = pchisq( rnorm(p, b*sqrt(n)*as.numeric(eff), 1)^2, df = 1, lower = F)
boxplot(-log10(pval)[eff], -log10(pval)[!eff], col = c("limegreen","gray"),
names = c("EFFECTS","NULL"), ylab = "-log10 Pval")
abline(h = -log10(0.05), col = "blue", lty = 2) #significance threshold 0.05
abline(h = -log10(0.05 / p), col = "red", lty = 2) #Bonferroni corrected threshold of 0.05
\end{minted}

We see that true effects tend to have smaller p-values than null effects, but there is some overlap between the distributions and therefore from p-values only we cannot have a rule that detects all true positives but no false positives.

\begin{minted}[breaklines]{R}
alpha = 0.05
dis = (pval < alpha) #logical indicating discoveries
table(!dis, eff, dnn = c("non-discov","non-null"))

dis = (p.adjust(pval, method = "bonferroni") < alpha)
table(!dis, eff, dnn = c("non-discov","non-null"))
\end{minted}

The problem with the raw p-value threshold is that there are many false discoveries. The problem with the Bonferroni correction is that only a third of all true effects is discovered. We want something in between. We want to directly control the proportion of false discoveries made out of all discoveries.

\subsection{Definition of false discovery rate}
Let's define the false discovery proportion (FDP) as a random variable
\[
FDP = \frac{FD}{\max(1, D)} = \begin{cases}
FD/D, & D > 0 \\
0, & D = 0
\end{cases}
\]
False discovery rate (FDR) is the expectation of FDP:
\[
FDR = E(FDP)
\]
Thus it is the (expected) proportion of false discoveries among all discoveries. By controlling FDR at a given level $\alpha_F$, we tolerate more false discoveries as the number of tests increases, as long as we also keep making more true discoveries. Note the difference from FWER control, where we aim to avoid even a single false discovery in the experiment, no matter whether we are making 1, 10 or 100000 discoveries altogether. An FDR of 5\% means that, on average, 5\% of all features called significant are truly null.

\begin{minted}[breaklines]{R}
sort.pval = sort(pval) #sorted from smallest to largest
sort.eff = eff[order(pval)] #whether each sorted pval is from a true positive
fdp = cumsum(!sort.eff)/(1:p) #which proportion of discoveries are false
cols = rep("red",p); cols[sort.eff] = "orange" #true pos. in orange
plot(log10(sort.pval), fdp, xlab = "log10 P-value", ylab = "FDP",
col = cols, cex = 0.7, pch = 20)
alpha = 0.05
i = max( which(fdp < alpha) )
print( paste("fdp <",alpha,"when P-value is <",signif(sort.pval[i],3)) )
abline(v = log10(sort.pval[i]), col = "blue")
abline(h = alpha, col = "red")
#shows the step where fdp < alpha breaks
cbind(D=i:(i+1), FD=fdp[i:(i+1)]*c(i,i+1), fdp=fdp[i:(i+1)], pval=sort.pval[i:(i+1)])
\end{minted}

But how can we in general control FDR given the set of p-values? Such a method was formulated by Yoav Benjamini and Yosef Hochberg in 1995.

\subsection{Benjamini-Hochberg procedure (1995)}
Let $H_j$ be the null hypothesis for test $j$ and let $P_j$ be the corresponding p-value. Denote the ordered sequence of p-values as $P_{(1)} \leq P_{(2)} \leq \dots \leq P_{(p)}$ and let $H_{(j)}$ be the hypothesis corresponding to the $j$th smallest p-value. The Benjamini-Hochberg procedure at level $\alpha_F$, denoted $BH(\alpha_F)$, is to reject the null hypotheses $H_{(1)}, \dots, H_{(k)}$, where $k$ is the largest index $j$ for which $P_{(j)} \leq \frac{j}{p}\alpha_F$. For independent tests and for any configuration of false null hypotheses, $BH(\alpha_F)$ controls the FDR at level $\alpha_F$.

\subsubsection{Intuitive explanation why BH works}
Consider the p-value $P_{(j)}$. Since $P_{(j)}$ is the $j$th smallest p-value, if we draw a significance threshold at $P_{(j)}$, we make exactly $j$ discoveries. On the other hand, we expect that out of all $p_0$ null effects about $p_0P_{(j)}$ give a p-value $\leq P_{(j)}$. Thus we have an approximation for the false discovery proportion at threshold $P_{(j)}$:
\[
FDP(P_{(j)}) \approx \frac{p_0 P_{(j)}}{j} \leq \frac{pP_{(j)}}{j}
\]
If we simply ask for which $j$ this estimated $FDP(P_{(j)})$ is $\leq \alpha_F$, we get
\[
\frac{p P_{(j)}}{j} \leq \alpha_F \iff P_{(j)} \leq \frac{j}{p}\alpha_F
\]
which is the condition of the BH procedure.

As in the Holm procedure, in BH the rejection of the tested hypotheses depends not only on their p-values but also on their rank among all the p-values. We see that the critical threshold increases from the Bonferroni threshold $\alpha_F/p$ for the smallest p-value to the unadjusted threshold $\alpha_F$ for the largest p-value. Crucially, if any p-value $P_{(j)}$ is below its own threshold $j \alpha_F/p$, then ALL the hypotheses corresponding to smaller p-values will also be rejected, no matter whether they are below their own thresholds.
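
This last property is easy to see on a small example. The sketch below applies the BH rule by hand to five hypothetical, already sorted P-values (made up for illustration) and cross-checks the decisions against R's built-in \texttt{p.adjust}; the variable names are assumptions of this sketch.

\begin{minted}[breaklines]{R}
alpha.bh = 0.05
pval.bh = c(0.001, 0.03, 0.035, 0.039, 0.9)     # hypothetical P-values, already sorted
p.bh = length(pval.bh)
thr.bh = (1:p.bh) * alpha.bh / p.bh             # BH thresholds j * alpha / p
below = (pval.bh <= thr.bh)                     # is P_(j) below its own threshold?
k.bh = if (any(below)) max(which(below)) else 0 # largest such j (0 means no rejections)
reject.bh = (1:p.bh) <= k.bh                    # reject H_(1),...,H_(k)
rbind(pval.bh, thr.bh, below, reject.bh)        # logicals are shown as 0/1
# Cross-check against the adjusted P-values from p.adjust
all(reject.bh == (p.adjust(pval.bh, method = "BH") <= alpha.bh))
\end{minted}

Here the second and third P-values exceed their own thresholds, yet they are rejected because the fourth P-value is below its threshold.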

\begin{minted}[breaklines]{R}
alpha = 0.05
i.BH = max(which(sort.pval <= ((1:p)*alpha)/p))
print(paste("Reject 1,...,",i.BH," i.e, if P-value <=",signif(sort.pval[i.BH],3)))

print(paste0("Discoveries:",i.BH,
"; False Discoveries:",sum(!sort.eff[1:i.BH]),
"; fdp=",signif(sum(!sort.eff[1:i.BH])/i.BH,2)))

pval.BH = p.adjust(pval,method = "BH")
#These are pvals adjusted by a factor p/rank[i]
#AND by the fact that no adjusted P-value can be larger than any of other adjusted
#P-values that come later in the ranking of the original P-values (See exercise 3)
sum(pval.BH < alpha) #should be D given above

plot(1:p, -log10(sort.pval), pch = 3, xlab = "rank", ylab = "-log10 pval", col = cols, cex = 0.7)
abline(h = -log10(alpha), col = "blue", lwd = 2)
abline(h = -log10(alpha/p), col = "purple", lwd = 2)
lines(1:p, -log10((1:p)*alpha/p), col = "cyan", lwd = 2)
legend("topright", legend = paste(c("P=","BH","Bonferr."), alpha),
col = c("blue","cyan","purple"), lwd = 2)
\end{minted}

\subsection{Benjamini-Yekutieli procedure}
The BH procedure was proven for independent test statistics and therefore it is not as generally applicable as the Bonferroni and Holm methods. Benjamini and Yekutieli proved that an extension of BH controls FDR under any dependency structure: when the Benjamini-Hochberg procedure is conducted with $\alpha_F/\sum_{j=1}^p\frac{1}{j}$ in place of $\alpha_F$, it controls the FDR at level $\leq \alpha_F$.

\begin{minted}[breaklines]{R}
alpha = 0.05
i.BY = max(which(sort.pval <= ((1:p)*alpha/sum(1/(1:p))/p)))
print(paste("Reject 1,...,",i.BY," i.e, if P-value <",signif(sort.pval[i.BY],2)))

print(paste0("Discoveries:",i.BY,
"; False Discoveries:",sum(!sort.eff[1:i.BY]),
"; fdp=",signif(sum(!sort.eff[1:i.BY])/i.BY,2)))

pval.BY = p.adjust(pval,method = "BY") #these are pvals adjusted by factor p/rank[i]*sum(1/(1:p))
sum(pval.BY < alpha) #should be D given above
\end{minted}

As we see, BY is more conservative than BH, which is the price to pay for a proven guarantee of FDR control under all possible dependency structures.
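
To get a feel for the size of the BY correction, a rough worked calculation for $p = 1000$ and $\alpha_F = 0.05$ may help. It relies on the standard approximation of the harmonic sum by $\log p + \gamma$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant; this approximation is brought in here as an aid and is not part of the procedure itself.
\[
\sum_{j=1}^{1000} \frac{1}{j} \approx \log(1000) + 0.5772 \approx 7.49, \qquad \text{so} \qquad \frac{\alpha_F}{\sum_{j=1}^{p} 1/j} \approx \frac{0.05}{7.49} \approx 0.0067
\]
In other words, with 1000 tests BY at level 0.05 behaves roughly like BH at level 0.0067, and the correction factor grows only logarithmically in $p$.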

\subsection{Relationship between FWER and FDR}
We have
\[
FDR = E(FDP) = Pr(D = 0)\cdot0 + Pr(D > 0)\cdot E\bigg( \frac{FD}{D}|D>0 \bigg) = Pr(D > 0)\cdot E \bigg(\frac{FD}{D} | D>0 \bigg)
\]

FDR weakly controls FWER: If all null hypotheses are true, then FDR is the same as FWER. To see this, note that here $FD=D$, and therefore if $FD=0$ then $FDP = 0$ and if $FD > 0$ then $FDP = 1$. Thus, $FDR = E(FDP) = Pr(FD > 0) = FWER$. This means that, if all null hypotheses are true, any method that controls FDR at level $\alpha$ also controls FWER at level $\alpha$. However, if some null hypotheses are false, then controlling FDR does not typically control FWER.

FWER controls FDR: Because $FDP \leq I(FD > 0)$, where $I$ is the indicator function, by taking expectations we get
\[
FDR = E(FDP) \leq Pr(FD > 0) = FWER
\]
Thus, any method that controls FWER at level $\alpha$ also controls FDR at level $\alpha$.

\begin{minted}[breaklines]{R}
p = 1000 #variables for each data set
alpha = 0.1 #target FWER
R = 1000 #replications of data set
res = matrix(NA, ncol = 3, nrow = R) #collect the number of discoveries by BH, Holm, Bonferr.
for(rr in 1:R){
#Generate P-values that are null.
pval = runif(p)
res[rr,] = c(sum(p.adjust(pval, method = "BH") < alpha),
sum(p.adjust(pval, method = "holm") < alpha),
sum(p.adjust(pval, method = "bonferroni") < alpha))
}
apply(res > 0, 2, mean) #which proportion report at least one discovery?

#Adjusted P-values depend on the whole set of P-values, not only on the P-value itself:
pval = c(0.014, 0.09, 0.05, 0.16)
p.adjust(pval, method = "BH")
p.adjust(c(pval, 0.001), method = "BH")

pval = c(0.01, 0.02, 0.02, 0.03)
rbind(pval, p.adjust(pval, "bonferroni"))

rbind(pval, p.adjust(pval, "holm"))

rbind(pval, p.adjust(pval, "BH"))
\end{minted}

We say that the Holm method is a step-down procedure and the BH method is a step-up procedure.

Step-down procedures start by testing the hypothesis with the lowest p-value and step down through the sequence of p-values while rejecting the hypotheses. The procedure stops at the first non-rejection and labels the remaining hypotheses as non-rejected. 

Step-up procedures start by testing the hypothesis with the highest p-value and step up through the sequence of p-values while retaining hypotheses. The procedure stops at the first rejection and labels all remaining hypotheses as rejected.

Single step procedures are the ones where the same criterion is used for each hypothesis, independent of its rank among the p-values. The Bonferroni correction and fixed significance level testing are examples of single step procedures.

\section{High-dimensional statistics: Q-value}
So far we have used p-values for inference based on false positive rates, for FWER and for FDR. Let's here consider a quantity called the Q-value that can be attached to each test and that gives an empirical estimate of the FDR among all tests with at least as small Q-values. Before defining the Q-value, we will first refine the BH method by empirically estimating $p_0$, the number of null tests. When $p_0$ is considerably smaller than $p$, we can improve the accuracy of the FDR method by estimating $p_0$.

Let's define, for each p-value threshold $t \in [0, 1]$
\[
FDR(t) = E \bigg(\frac{FD(t)}{D(t)} \bigg)
\]
where the random variables $FD(t) = \#\{\text{null P-values} \leq t\}$ and $D(t) = \#\{\text{P-values} \leq t\}$ are defined in an experiment where in total $p$ P-values are available.

To refine our understanding of FDR methods, let's think that the random variables $FD(t)$ and $D(t)$ result from $p$ draws of p-values from a mixture distribution between $U(0,1)$ (for null p-values) and an alternative distribution with cdf $\Phi_1$ and pdf $\phi_1$ (for non-null p-values), with mixture proportion $\pi_0$ for the null. In other words, the cdf $\Phi$ and pdf $\phi$ of the p-values are
\[
\begin{cases}
\Phi(t) = \pi_0 \cdot t + (1 - \pi_0)\Phi_1(t), & t \in [0, 1] \\
\phi(t) = \pi_0 \cdot 1 + (1 - \pi_0)\phi_1(t), & t \in [0,1]
\end{cases}
\]
We can interpret sampling from such a mixture distribution as a 2-step process. Namely, we first choose between the null distribution (with probability $\pi_0$) and the alternative distribution (with probability $1 - \pi_0$), and second, conditional on the chosen distribution, we sample a p-value from the chosen distribution.
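
The 2-step process can be sketched directly in code. The mixture proportion 0.8 and the $Beta(0.1, 4.9)$ alternative are assumptions chosen for illustration, as are the seed and the variable names.

\begin{minted}[breaklines]{R}
set.seed(3)                          # illustrative seed (assumption)
p.mix = 10000                        # number of P-values to draw
pi0.mix = 0.8                        # mixture proportion of the null (assumed value)
# Step 1: choose the component; TRUE = null, with probability pi0
from.null = (runif(p.mix) < pi0.mix)
# Step 2: sample the P-value from the chosen component
pval.mix = ifelse(from.null,
                  runif(p.mix),             # null: U(0,1)
                  rbeta(p.mix, 0.1, 4.9))   # non-null: Beta(0.1, 4.9), assumed for illustration
mean(from.null)                      # realized proportion of nulls, close to pi0.mix
\end{minted}

Unlike a design with a fixed number of non-null tests, the number of nulls here is itself random, with a $Bin(p, \pi_0)$ distribution.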

Suppose we do $p=10000$ tests of which $m=2000$ are non-null, i.e. we can model this as a mixture distribution with $\pi_0 = \frac{p-m}{p} = 0.8$. Null p-values come from $U(0,1)$ and non-null p-values come from the distribution $Beta(0.1, 4.9)$ (for demonstration purposes).

\begin{minted}[breaklines]{R}
p = 10000
m = 2000
beta.1 = 0.1 # weight for unit interval's end point 1
beta.0 = 4.9 # weight for unit interval's end point 0
null.pval = runif(p-m, 0, 1)
alt.pval = rbeta(m, beta.1, beta.0) #non-null = alternative distribution
pval = c(alt.pval, null.pval) #all P-values together
eff = c(rep(T, m), rep(F, p - m)) #indicator for non-null effects
par(mfrow=c(1,3)) #Empirical histogram and theoretical curve for (1), (2), (3)
hist(null.pval, breaks = 20, prob = T, col = "limegreen", main = "null", xlab = "",
xlim = c(0,1), sub="", lwd = 1.5, lty = 2, xaxs="i", border=NA)
curve(dunif(x, 0, 1), 0, 1, add = T, col = "orange", lwd = 2)
hist(alt.pval, breaks = 20, prob = T, col = "limegreen", main = "non-null", xlab = "",
xlim = c(0,1), sub="", lwd = 1.5, lty = 2, xaxs="i", border=NA)
curve(dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1, add = T, col = "orange", lwd = 2)
legend("topright", fill = c("limegreen","orange"),
legend = c("empir","theor"), cex = 1.5)
hist(pval, breaks = 20, prob = T, col = "limegreen", main = "combined", xlab = "",
xlim = c(0,1), sub="", lwd = 1.5, lty = 2, xaxs="i", border=NA)
curve((p-m)/p*1 + m/p*dbeta(x, shape1 = beta.1, shape2 = beta.0), 0, 1, add = T,
col = "orange", lwd = 2)

summary(null.pval)
summary(alt.pval)
\end{minted}

Since $\Phi(t)$ is the probability that a particular p-value from this mixture distribution is $\leq t$, the random variables $D(t)$ and $FD(t)$ are distributed as
\[
D(t) \sim Bin(p, \Phi(t))
\]
\[
FD(t) | D(t) \sim Bin(D(t), \theta_t)
\]
where
\[
\theta_t = Pr(NULL|P \leq t) = 
\frac{Pr(NULL)Pr(P \leq t |NULL)}{Pr(P \leq t)} =
\frac{\pi_0 t}{\pi_0 t + (1 - \pi_0)\Phi_1(t)}
\]
Using the law of total expectation: $E_Y(Y) = E_X(E_Y(Y|X))$ we have that
\[
FDR(t) = E\bigg(\frac{FD(t)}{D(t)} \bigg) = 
E\bigg(E\bigg(\frac{FD(t)}{D(t)}|D(t) \bigg)\bigg) = E \bigg(\frac{1}{D(t)}E(FD(t)|D(t)) \bigg) =
E \bigg( \frac{1}{D(t)}\theta_t D(t) \bigg) = \theta_t
\]
On the other hand $E(FD(t)) = E(E(FD(t) | D(t))) = E(D(t)\theta_t) = \theta_t E(D(t))$. Thus
\[
\frac{E(FD(t))}{E(D(t))} = \frac{\theta_t E(D(t))}{E(D(t))} = \theta_t = FDR(t)
\]
So we can estimate $FDR(t)$ by the ratio of expectations of $FD(t)$ and $D(t)$. 
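
For the mixture used in the demonstration, $\theta_t$ can be evaluated exactly, which shows how the FDR changes with the threshold. The helper \texttt{theta.t} is a hypothetical name introduced for this sketch, and its default arguments ($\pi_0 = 0.8$ and a $Beta(0.1, 4.9)$ alternative) are the assumed demonstration values.

\begin{minted}[breaklines]{R}
# theta_t = Pr(NULL | P <= t) for a U(0,1) null mixed with a Beta(a, b) alternative
theta.t = function(t, pi0 = 0.8, a = 0.1, b = 4.9) { # hypothetical helper, assumed defaults
  pi0 * t / (pi0 * t + (1 - pi0) * pbeta(t, a, b))
}
t.grid = c(0.001, 0.01, 0.05, 0.5, 1)                # P-value thresholds to compare
cbind(t = t.grid, FDR = theta.t(t.grid))
\end{minted}

At $t = 1$ every test is a discovery, so the FDR equals $\pi_0$; stricter thresholds give a smaller FDR because the non-null P-values concentrate near 0.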

For each p-value threshold $t$, denote the number of all discoveries at threshold $t$ by $\hat{D}(t) = \#\{\text{p-values} \leq t\}$. We use this to estimate $E(D(t)) \approx \hat{D}(t)$.

To estimate $E(FD(t))$, we remember that the null p-values are uniformly distributed and hence $E(FD(t)) = p_0 \cdot t = \pi_0 \cdot p \cdot t$. In estimating $\pi_0$ we again rely on the fact that the null p-values are uniformly distributed and that most p-values near 1 are expected to be from the null distribution.
\begin{minted}[breaklines]{R}
hist(pval, breaks = 40, prob = T, col = "limegreen", main = "All P-values", xlab = "",
xlim = c(0,1), sub = "", lwd = 1.5, lty = 2, xaxs = "i", border = NA)
abline(h = 1, lty = 2, lwd = 2)
\end{minted}

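We can see that the density of p-values $> 0.2$ looks fairly flat, which suggests that those p-values come mostly from the null distribution. Since about $\pi_0 \cdot p \cdot (1-\lambda)$ null p-values are expected to exceed a threshold $\lambda$, this motivates the estimator
\[
\hat{\pi}_0(\lambda) = 
\frac{\#\{P_j > \lambda | j=1, \dots, p\}}{(1-\lambda)p}
\]
A practical choice for the parameter $\lambda$ could be 0.5.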

\begin{minted}[breaklines]{R}
lambda = seq(0, 1, by = 0.05)
pi.0 = sapply(lambda, function(x) {sum(pval > x)}) / p / (1 - lambda)
plot(lambda, pi.0, t = "b", xlab = expression(lambda), ylab = expression(hat(pi)[0]))
abline(h = 1 - m/p, col = "red", lty = 2) #this is the true value
\end{minted}

With this estimator $\hat{\pi}_0$ of $\pi_0$, we have the estimate
\[
\widehat{FDR}(t) = \frac{\hat{\pi}_0 \cdot p \cdot t}{\hat{D}(t)}
\]
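
Combining $E(FD(t)) = \pi_0 \cdot p \cdot t$ with $E(D(t)) \approx \hat{D}(t)$ gives a plug-in estimate of $FDR(t)$, which the following sketch computes on freshly simulated P-values. The seed, the choice $\lambda = 0.5$, the guard against $\hat{D}(t) = 0$ and all names ending in \texttt{.q} are assumptions of this illustration, not a definitive implementation.

\begin{minted}[breaklines]{R}
set.seed(4)                                         # illustrative seed (assumption)
p.q = 10000; m.q = 2000                             # tests and non-null tests, as in the demonstration
pval.q = c(rbeta(m.q, 0.1, 4.9), runif(p.q - m.q))  # non-null and null P-values
lambda.q = 0.5                                      # tuning parameter of the pi0 estimator
pi0.hat = sum(pval.q > lambda.q) / ((1 - lambda.q) * p.q)
# Plug-in estimate pi0.hat * p * t / D.hat(t) for a single threshold t;
# max(1, .) avoids division by zero when there are no discoveries
fdr.hat = function(t) pi0.hat * p.q * t / max(1, sum(pval.q <= t))
c(pi0.hat = pi0.hat, fdr.hat.at.0.05 = fdr.hat(0.05))
\end{minted}

Here \texttt{pi0.hat} should land near the true value 0.8, since very few non-null P-values exceed 0.5 under this alternative.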