* Bayes formula states $$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+P(B|A_3)P(A_3)}.$$

### Sample mean and variance

\begin{eqnarray*}
\bar y_n 
&=&
 (y_1 + y_2 + \cdots + y_n)/n = n^{-1} \sum_{i=1}^n y_i;\\
s_n^2 
&=&
 \{ (y_1 - \bar y)^2 + (y_2 - \bar y)^2 + \cdots + (y_n - \bar y)^2 \}/(n-1)
= (n-1)^{-1}  \sum_{i=1}^n (y_i - \bar y)^2.
\end{eqnarray*}

### Population mean and variance

Suppose $\mu = E(Y)$ and $\sigma^2 = Var(Y)$.
We have
\begin{eqnarray*}
E(\bar{y}_n) &=&
n^{-1} \sum_{i=1}^n E (y_i) = \mu;\\
Var(\bar y_n) &=&
n^{-2} \sum_{i=1}^n Var(y_i) = \sigma^2/n
\end{eqnarray*}


### Sample variance

The sample variance can also be written as
$$
s^2 = \mbox{average of } \{ (y_i - y_j)^2/2\}  \mbox{ for } i \neq j.
$$
This identity is proved in the sample variance algebra below.

Since $y_1$ and $y_2$ are independent with the same mean,
$$
E\{ (y_1 - y_2)^2\} = Var(y_1 - y_2) = Var(y_1) + Var(y_2) = 2 \sigma^2.
$$
Hence,
$$
E (s^2) = \{ \mbox{average of } E \{ (y_i - y_j)^2/2\}  \mbox{ for } i \neq j \} = \sigma^2.
$$

Because of this, we say $s^2$ is an unbiased estimator of $\sigma^2$.

### Sample variance algebra

In words: the sample variance $s^2$ 
is half the average squared difference between pairs of distinct observations.

To prove this claim, we note
\begin{eqnarray*}
\sum_{i, j} (y_i - y_j)^2 
&=&
\sum_{i, j} \{  (y_i - \bar y) - (y_j - \bar y)\}^2\\
&=&
n \sum_{i} (y_i - \bar y)^2 + n \sum_{j} (y_j - \bar y)^2 - 2 \sum_{i,j}(y_i - \bar y)(y_j - \bar y).
\end{eqnarray*}

Since $\sum_i (y_i - \bar y) = 0$, the cross term vanishes:
$$
 \sum_{i,j}(y_i - \bar y)(y_j - \bar y) = 0.
$$
Hence,
$$
\frac{1}{n(n-1)} \sum_{i, j} (y_i - y_j)^2 
 =
\frac{1}{n-1}  \sum_{i} (y_i - \bar y)^2 + \frac{1}{n-1}  \sum_{j} (y_j - \bar y)^2 
=
2 s^2.
$$
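The identity can be checked numerically. The following is a minimal R sketch; the data vector `y` is made up for illustration only.

```r
# Made-up data, for illustration only
y <- c(2.1, 3.4, 1.9, 5.0, 4.2)
n <- length(y)

# All squared pairwise differences (y_i - y_j)^2; the diagonal (i = j) is zero
d2 <- outer(y, y, function(a, b) (a - b)^2)

# Average of (y_i - y_j)^2 / 2 over the n(n-1) ordered pairs with i != j
s2_pairs <- sum(d2) / (n * (n - 1)) / 2

# Should agree with the usual sample variance
all.equal(s2_pairs, var(y))
```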

### Sample variance algebra

Also note that
$$
\sum_i (y_i - \bar y)^2
=
\sum_i (y_i^2 - 2 y_i \bar y + \bar y^2)
=
\sum_i (y_i^2 )  - 2 \bar y \sum_i y_i + n \bar y^2
=
\sum_i (y_i^2 )  - 2 n \bar y^2 + n \bar y^2 =
\sum_i y_i^2   - n \bar y^2.
$$

We get
$$
 \sum_i (y_i - \bar y)^2
= \sum_i y_i^2   - n \bar y^2
$$
and
$$
 \sum_i y_i^2 
=\sum_i (y_i - \bar y)^2
+ n \bar y^2.
$$
 
The sum of squares formulas will be used
very often.
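As a quick sanity check, the two forms of the sum of squares can be compared in R. This is only a sketch on a made-up vector `y`.

```r
# Made-up data, for illustration only
y <- c(2.1, 3.4, 1.9, 5.0, 4.2)
n <- length(y)

ss_centered <- sum((y - mean(y))^2)        # sum_i (y_i - ybar)^2
ss_shortcut <- sum(y^2) - n * mean(y)^2    # sum_i y_i^2 - n * ybar^2

all.equal(ss_centered, ss_shortcut)
```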

## Point estimation

#### A point estimate is our educated, data-supported guess of the value of ${\mathbf \beta}$.

#### The only requirement for an estimator is: it is a function of data.


One way of estimating ${\mathbf \beta}$ is to use the value $\hat {\mathbf \beta}$ that minimizes
$$
 ({\bf y} - {\bf X} {\mathbf \beta})^T ({\bf y} - {\bf X} {\mathbf \beta}) 
 = \sum_{i=1}^N (y_i - {\bf x_i} {\mathbf \beta})^2.
$$

The optimization solution is given by
$$
 \hat {\mathbf \beta} = ( {\bf X}^T  {\bf X})^{-1}  {\bf X}^T {\bf y}.
 $$

### Some terminology

* We call $ {\bf r} = {\bf y} - {\bf X} \hat {\mathbf \beta}~$ **residuals**;


* We call
$\hat {\bf y} = {\bf X} \hat {\mathbf \beta}~$ **fitted values**;


* We call $\hat {\mathbf \beta}~$ the **least squares estimate**.
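The three quantities above can be computed directly from the matrix formula. The sketch below uses simulated data (the sample size, coefficients and noise level are arbitrary assumptions) and compares the result with `lm()`.

```r
# Simulated data; all numbers here are arbitrary choices for illustration
set.seed(1)
N <- 30
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5)

X <- cbind(1, x)                              # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # least squares estimate
y_hat <- X %*% beta_hat                       # fitted values
r <- y - y_hat                                # residuals

# Compare with R's built-in least squares fit
fit <- lm(y ~ x)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
all.equal(as.numeric(r), unname(resid(fit)))
```

Solving the linear system with `solve(A, b)` avoids forming the inverse explicitly, which is usually the more stable choice.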

## Confidence Intervals/Prediction Intervals

Suppose the model is suitable.

* Our best guess of the expected $y$ value at ${\bf x}$ is ${\bf x} \hat{\mathbf \beta}$: the fitted value.


* A 95% CI for the expected response ${\bf x} {\mathbf \beta}$, centred at the fitted value $\hat{y} = {\bf x} \hat{\mathbf \beta}$, is 

$$
{\bf x} \hat{\mathbf \beta} \pm 
t_{N-k-1, 0.975} \sqrt{\widehat{var}({\bf x} \hat{\mathbf \beta})},
$$

where $t_{N-k-1, 0.975}$ is the 97.5% 
quantile of the t-distribution with $N-k-1$ degrees of freedom and

$$
\widehat{var}({\bf x} \hat{\mathbf \beta}) 
= {\bf x}({\bf X}^T{\bf X})^{-1}{\bf x}^T \hat{\sigma}^2.
$$

* Similarly, the estimated variance of the $j$th coefficient is

$$
\widehat{var}(\hat{\beta}_j) 
= \{({\bf X}^T{\bf X})^{-1}\}_{jj} \, \hat{\sigma}^2.
$$

* A 95\% prediction interval for $y|x$ is
$$
\hat y \pm t_{N-k-1,0.975}\sqrt{(1+x(\mathbf{X}^T\mathbf{X})^{-1}x^T)\hat\sigma^2} = \hat y \pm t_{N-k-1,0.975}\sqrt{\hat\sigma^2 + var(\hat{y})}.
$$
Here we have an extra $1$ inside the $\sqrt{\cdot}$ because a future observation is $x \beta + \epsilon$: its error $\epsilon$ contributes an additional $\sigma^2$ on top of the variance of $\hat y$.

### R code for manually calculating CI and PI given a predicted value x = new.x, dataframe = data

In [None]:
# Structure the data that's easy to use
X = cbind(1, data$xcols) # our design matrix X (add a intercept column, make sure y is not included)
#print(X)
y = data$ycol

# Compute LSE manually 
invXTX = solve(t(X) %*% X)
XTy = t(X) %*% y
beta = invXTX %*% XTy
# round to 3 decimals and check
print(round(beta, 3))

# Predict response for new observation
new.X = (some number)
(y_pred = beta[1] + beta[2]*new.X )

# Verify y_pred using R function
# (assumes `fit` is the lm() fit of the same model, with the predictor named temp)
predict(fit, newdata=data.frame(temp=new.X))

# Compute confidence interval
new.Xvec = c(1, new.X)
N = nrow(data)
k = ncol(X) - 1
# compute the estimate of the error variance sigma^2
sigma2 = as.numeric(t(y-X%*%beta)%*%(y-X%*%beta) / (N-k-1))
se.conf = sqrt(sigma2*as.numeric(t(new.Xvec)%*%invXTX%*%new.Xvec))
# compute t quantile
cv = qt(0.025, N-k-1, lower.tail=FALSE)
# construct CI (y_pred is the prediction computed above)
c(y_pred-cv*se.conf, y_pred+cv*se.conf)

# Verify CI using R function
predict(fit, newdata=data.frame(temp=new.X),
        interval="confidence")

# Compute prediction interval
se.pred = sqrt(as.numeric(sigma2)*(1+as.numeric(t(new.Xvec)%*%invXTX%*%new.Xvec)))
c(y_pred-cv*se.pred, y_pred+cv*se.pred)

# Verify PI using R function
predict(fit, newdata=data.frame(temp=new.X),
        interval="prediction")

Recall that the least squares estimator (its derivation may be left as an exercise) is given by
$$
 \hat {\mathbf \beta} = ( {\bf X}^T  {\bf X})^{-1}  {\bf X}^T {\bf y}.
$$


Given this knowledge, we find
\begin{align*}
{\bf X}^T ( {\bf y} - {\bf X} \hat {\mathbf \beta} )
& = {\bf X}^T {\bf y} - {\bf X}^T {\bf X} ( {\bf X}^T  {\bf X})^{-1}  {\bf X}^T {\bf y}\\
& = {\bf X}^T {\bf y} - {\bf X}^T {\bf y}\\
&= 0.
\end{align*}


Since ${\bf r} = {\bf y} - {\bf X} \hat {\mathbf \beta}$, the above conclusion can be written as ${\bf X}^T {\bf r} = 0$.


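* The residuals are orthogonal to every column of ${\bf X}$: nothing left in the residuals is linearly related to ${\bf X}$.

Because the first column of ${\bf X}$ is all 1s, the first row of ${\bf X}^T {\bf r} = 0$ says that the residuals sum to zero. Dividing by $N$ gives $\bar{y} - \bar{\bf x} \hat {\mathbf \beta} = 0$, which in vector form reads $\bar{\bf y} - \bar{\bf X} \hat {\mathbf \beta}=0$, where every entry of $\bar{\bf y}$ is $\bar y$ and every row of $\bar{\bf X}$ is the row $\bar{\bf x}$ of column means.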

We may write
\begin{align*}
{\bf y} - \bar{\bf y}
& = ({\bf y} - {\bf X} \hat {\mathbf \beta})
   + ({\bf X} \hat {\mathbf \beta} - \bar{\bf y})\\
& = ({\bf y} - {\bf X} \hat {\mathbf \beta})
+ ({\bf X} - \bar{\bf X}) \hat {\mathbf \beta}.
\end{align*}


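With this decomposition, and because the cross term $2 \hat {\mathbf \beta}^T ({\bf X} - \bar{\bf X})^T {\bf r}$ vanishes (${\bf X}^T {\bf r} = 0$, and the residuals sum to zero since the first column of ${\bf X}$ is all 1s), we find
\begin{align*}
({\bf y} - \bar{\bf y})^T ({\bf y} - \bar{\bf y})
& = 
({\bf y} - {\bf X} \hat {\mathbf \beta})^T({\bf y} - {\bf X} \hat {\mathbf \beta})
 + 
\hat {\mathbf \beta}^T ({\bf X} - \bar{\bf X})^T ({\bf X} - \bar{\bf X}) \hat {\mathbf \beta}\\
& = 
({\bf y} - \hat{\bf y})^T({\bf y} - \hat{\bf y})
 + 
(\hat{\bf y} - \bar{\bf y} )^T(\hat{\bf y} - \bar{\bf y} )
\end{align*}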

Look closely at the decomposition:
\begin{align*}
({\bf y} - \bar{\bf y})^T ({\bf y} - \bar{\bf y})
& =  
\hat {\mathbf \beta}^T ({\bf X} - \bar{\bf X})^T ({\bf X} - \bar{\bf X}) \hat {\mathbf \beta}
+
({\bf y} - {\bf X} \hat {\mathbf \beta})^T({\bf y} - {\bf X} \hat {\mathbf \beta})
\end{align*}
or, equivalently,
\begin{align*}
({\bf y} - \bar{\bf y})^T ({\bf y} - \bar{\bf y})
& = (\hat {\bf y} - \bar {\bf y})^T (\hat {\bf y} - \bar {\bf y})
+ {\bf r}^T {\bf r}
\end{align*}


* the LHS is the total variation in ${\bf y}$.


* the first term on the RHS is the variation in $\hat {\bf y}$.


* the second term is the variation in residual ${\bf r}$.
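Both the orthogonality ${\bf X}^T {\bf r} = 0$ and the sum of squares decomposition can be verified numerically. The sketch below uses simulated data with two predictors; the coefficients and sample size are arbitrary assumptions.

```r
# Simulated data; all numbers here are arbitrary choices for illustration
set.seed(2)
N <- 40
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 3 + 1.5 * x1 - 0.5 * x2 + rnorm(N)

X <- cbind(1, x1, x2)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
y_hat <- as.numeric(X %*% beta_hat)
r <- y - y_hat

# Orthogonality: every entry of X^T r is zero up to rounding error
max_abs_Xtr <- max(abs(t(X) %*% r))
max_abs_Xtr

# Total variation = variation in fitted values + variation in residuals
SS_tot <- sum((y - mean(y))^2)
SS_reg <- sum((y_hat - mean(y))^2)
SS_err <- sum(r^2)
all.equal(SS_tot, SS_reg + SS_err)
```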


This decomposition results in the analysis of variance table (ANOVA):

\begin{matrix}
\mbox{Source} & \mbox{Df} & \mbox{SS} & \mbox{MSS} & \mbox{F} \\
\mbox{Regr} & k & 
(\hat {\bf y} - \bar {\bf y})^T (\hat {\bf y} - \bar {\bf y}) 
&  (\hat {\bf y} - \bar {\bf y})^T (\hat {\bf y} - \bar {\bf y}) /k & F \\
\mbox{Residual/error} & N-k-1 & 
({\bf y}-{\bf X}\hat{\bf \beta})^T({\bf y}-{\bf X}\hat{\bf \beta}) &
({\bf y}-{\bf X}\hat{\bf \beta})^T({\bf y}-{\bf X}\hat{\bf \beta})/(N-k-1) & \\
\mbox{Total} & N-1 & {\bf y}^T{\bf y}- \bar{\bf y}^T\bar{\bf y} & &
\end{matrix}


The entry $F$ in the regression row is
$$
F = 
\frac{(\hat {\bf y} - \bar {\bf y})^T (\hat {\bf y} - \bar {\bf y}) /k }
{({\bf y}-{\bf X}\hat{\bf \beta})^T({\bf y}-{\bf X}\hat{\bf \beta})/(N-k-1)}
=\dfrac{MSS_{regr}}{MSS_{error}}.
$$

To test $H_0: \beta_j = 0$ for a single coefficient, use the t-statistic
$$ 
 T = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}, \qquad
 se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \{(X^TX)^{-1}\}_{jj}}, \qquad
 \hat{\sigma}^2 = \mbox{MSS}_{error}.
$$

How do we know whether the fit is good?

$$
R^2 
= \frac{ 
(\hat{\bf y}-\bar{\bf y})^T(\hat{\bf y}-\bar{\bf y})}
{({\bf y} - \bar {\bf y})^T ({\bf y} - \bar {\bf y}) }.
$$


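$R^2$ lies between 0 and 1. The closer it is to 1, the larger the share of the variation in ${\bf y}$ explained by the regression, and the better the fit.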
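The F-statistic, the t-statistics and $R^2$ can all be computed from the formulas above. The following sketch, on simulated data with assumed coefficients, checks them against `summary(lm(...))`.

```r
# Simulated data; all numbers here are arbitrary choices for illustration
set.seed(3)
N <- 50
k <- 2                                   # number of predictors (excluding intercept)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + 2 * x1 + 0 * x2 + rnorm(N)

X <- cbind(1, x1, x2)
XtX_inv <- solve(t(X) %*% X)
beta_hat <- as.numeric(XtX_inv %*% t(X) %*% y)
y_hat <- as.numeric(X %*% beta_hat)

SS_reg <- sum((y_hat - mean(y))^2)
SS_err <- sum((y - y_hat)^2)
sigma2_hat <- SS_err / (N - k - 1)       # MSS_error

F_obs <- (SS_reg / k) / sigma2_hat
T_obs <- beta_hat / sqrt(sigma2_hat * unname(diag(XtX_inv)))
R2 <- SS_reg / sum((y - mean(y))^2)

# Compare with the built-in summary
s <- summary(lm(y ~ x1 + x2))
all.equal(F_obs, unname(s$fstatistic["value"]))
all.equal(T_obs, unname(s$coefficients[, "t value"]))
all.equal(R2, s$r.squared)
```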


### Variance of the experiment error 

We use the residual mean square to estimate $\sigma^2$:
$$
\hat \sigma^2 
= 
\frac{ 
{\bf r}^T {\bf r}}{N-k-1}
=
\frac{
({\bf y}-\hat{\bf y})^T({\bf y}-\hat{\bf y})}{(N-k-1)}.
$$


* If $\hat {\bf \beta} = {\bf \beta}$, 
we would have ${\bf r} = {\bf \epsilon}$. 
* The estimator is explained by $var(\epsilon_i) = \sigma^2$ 
according to the model assumption.

# 2 samples(2.2)
### Two-sided alternative (we want to know if there is a difference in the compared groups):
Consider the problem of testing for two-sided alternative
$$
H_0: \mu_1 = \mu_2;  \mbox{ vs } H_a: \mu_1 \neq \mu_2.
$$


Let us do it under the assumptions that two samples
are independent, each sample is made of iid observations
such that
$Y_{ij} \sim N( \mu_i, \sigma^2)$. Note that the variance carries no subscript: every observation in samples 1 and 2 is assumed to have the same variance.

Question we want to answer: do they have different means?


* Elements in the assumption: (1) independence, (2) normality, and (3) equal variance.


### Standard data analysis

Compute the sample means and variances

$$
\bar y_1 = n_1^{-1} \sum_{i=1}^{n_1} y_{1i}; ~~~
\bar y_2 = n_2^{-1} \sum_{i=1}^{n_2} y_{2i}.
$$
and

$$
s_1^2 = (n_1-1)^{-1} \sum_i (y_{1i} - \bar y_1)^2; ~~
s_2^2 = (n_2-1)^{-1} \sum_i (y_{2i} - \bar y_2)^2.
$$

Compute **pooled variance** estimator (note that $s_1^2$ and $s_2^2$ are the variances of sample 1 and 2), ONLY FOR EQUAL VARIANCE ASSUMPTION:

$$
s^2 =
\frac{(n_1-1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2},
$$

and then the t-statistic
$$
T = \frac{\bar y_1 - \bar y_2}{\sqrt{(1/n_1 + 1/n_2) s^2}}.
$$

### How does df affect T-distribution

As the sample size increases (df $\to \infty$), the t-distribution gets closer to the standard normal distribution.

A larger df means a slimmer density with lighter tails, since the variance $df/(df-2)$ (for df $> 2$) gets smaller.
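The effect of df can be seen by comparing the 97.5% quantile of the t-distribution with that of the standard normal. This is a small sketch; the grid of df values is an arbitrary choice.

```r
dfs <- c(2, 5, 10, 30, 100, 1000)        # arbitrary grid of degrees of freedom

q_t <- qt(0.975, df = dfs)               # 97.5% quantiles of the t-distribution
q_norm <- qnorm(0.975)                   # 97.5% quantile of N(0, 1)

# The t quantile shrinks towards the normal quantile as df grows
data.frame(df = dfs, t_quantile = q_t, excess_over_normal = q_t - q_norm)

# Variance of the t-distribution, df / (df - 2), defined for df > 2
var_t <- dfs[dfs > 2] / (dfs[dfs > 2] - 2)
var_t
```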

### R-code for calculating p-value for two-sided hypothesis test (equal variance, adjust for unequal variance):

In [None]:
# assumes xx is sample 1 (n1 = length(xx)) and yy is sample 2 (n2 = length(yy))
ssPool = ((n1 - 1)*var(xx) + (n2 - 1)*var(yy))/ (n1 + n2 - 2)

T_obs = (mean(xx) - mean(yy))/((1/n1 + 1/n2)*ssPool)^.5

pValue = 2*(1 - pt(abs(T_obs), df = n1 + n2 - 2))
    ## two sided: key words: more extreme



### One-sided alternative (we want to know if there is a difference between 2 groups, and ALSO if it is positive or negative (direction)): 
  
  
* The hypotheses are $ H_0: \mu_1 = \mu_2; \mbox{ vs } H_1: \mu_1 > \mu_2$.


* The best practice is to ensure that $T$ is **lined up** with $H_1$.


* In this example, $H_1$ states $\mu_1 - \mu_2 > 0$. 

So we calculate

$$T =\frac{ (\hat{\mu}_1 - \hat{\mu}_2)} {\sqrt{ (1/n_1+1/n_2) s^2}}$$

where $\hat{\mu}_1 = \bar{y}_1$ and $\hat{\mu}_2 = \bar{y}_2$.
    
    
* Reject $H_0$ in favour of $H_1$ when $P( T > T_{obs})$ is below the nominal level (the usual choice is 5\%).

### R-code for calculating p-value for one-sided hypothesis test (equal variance, adjust for unequal variance):

In [None]:
# assumes xx is sample 1 (n1 = length(xx)) and yy is sample 2 (n2 = length(yy))
ssPool = ((n1 - 1)*var(xx) + (n2 - 1)*var(yy))/(n1 + n2 - 2)

T_obs = (mean(xx) - mean(yy))/sqrt((1/n1+1/n2)*ssPool)

pValue = pt(T_obs, df = n1 + n2 - 2, lower.tail = F)  
      ## upper side is calculated

###check your work (remove alternative parameter for two-sided test)
## R programming language can do it in one strike.

t.test(xx, yy, alternative = "greater", var.equal = T)

t.test(xx, yy, alternative = "greater", var.equal = F)
## analysis under unequal variance assumption gives very similar p-value.

### Statistical reasoning

In statistics, we examine how much built-in variation is in $\bar y_1 - \bar y_2$.


Under the model assumption, we have

$$ Var(\bar y_1 - \bar y_2) = (1/n_1 + 1/n_2) \sigma^2.$$


In applications, we are not given the value of $\sigma^2$.
However, the pooled variance estimator $s^2$ is a good estimate.



### Note: t test statistic is constructed by standardizing the difference in means under the null, i.e., dividing by the standard deviation. A smaller variance will lead to a larger test statistic value.

The underlying distribution of the test statistic doesn't change with the options given in the question and so the smaller variance strictly increases the power.

## Equal Variance

Because of this, we get a good metric:
$$
T = \frac{\bar y_1 - \bar y_2}{\sqrt{(1/n_1 + 1/n_2) s^2}}.
$$


Statistical theory shows that, when $H_0$ is true, $T$ has a
t-distribution with $n_1 + n_2 - 2$ degrees of freedom (**df**).


## Unequal variance

### Effect of unequal variance (cannot use pooled variance as it does not make sense)

* One remedy to t-test in this case is to change the t-statistic itself to

$$
T = \frac{(\bar y_1 - \bar y_2)}{\sqrt{(s^2_1/n_1 + s^2_2/n_2)}}
$$

so that the denominator matches the variance of the numerator even if $\sigma_1^2 \neq \sigma_2^2$.


* Yet even after this remedy, $T$ still does not have an exact t-distribution.


* The distribution of $T$ depends on the size of $\sigma_2/\sigma_1$ which is unknown in this situation.

### Welch's t-test remedy (how to calculate df for unequal variance only)

* Use the new definition of $T$.


* **PRETEND** $T$ has a t-distribution with $f$ degrees of freedom:

$$
\frac{1}{f} 
= \left ( \frac{R}{1+R} \right )^2 \frac{1}{n_1 - 1} 
+ 
\frac{1}{(n_2-1)(1+R)^2}
$$

where $R = (s_1^2/n_1)/(s_2^2/n_2)$.

### R code for calculating p value for UNEQUAL VARIANCE assumption (two-sided hypothesis test, using Welch's t-test)

In [None]:
# y1, y2 are the two samples; n1 = length(y1), n2 = length(y2)
# Calculate Welch's t-stat
T_obs = (mean(y1) - mean(y2))/sqrt(var(y1)/n1+var(y2)/n2)

# calculate df using Welch's eq:
R <- (var(y1)/n1)/(var(y2)/n2)
df_uneq <- 1/(((R/(1+R))^2 * (1/(n1-1)) + 1/((n2 - 1)*(1+R)^2)))

# two-sided p-value; for H1: mu1 > mu2 use pt(T_obs, df_uneq, lower.tail = FALSE)
p2 <- 2*pt(abs(T_obs), df_uneq, lower.tail = FALSE)

# check p2 (t.test defaults to the two-sided Welch test)
test.res2 = t.test(y1, y2)
test.res2$p.value
# reject null hypothesis if p2 < alpha

### The effect of increasing sample size (repetition)

* Larger $n_1, n_2$ tend to give a larger observed $|T|$, $|T_{obs}|$, if $\mu_1 \neq \mu_2$.


* Larger $|T_{obs}|$ leads to smaller p-value and hence increased power of detecting the fact that $\mu_1 \neq \mu_2$.


* If $\mu_1 = \mu_2$, the size of $T$ is unaffected by $n_1, n_2$. Hence, type I error remains under control.
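The gain in power from a larger sample can be illustrated with `power.t.test()` from base R. In this sketch the true difference `delta = 1` and the common standard deviation `sd = 1` are assumed values, chosen only for illustration.

```r
n_per_group <- c(5, 10, 20, 50)          # arbitrary grid of per-group sample sizes

# Power of the two-sided, equal-variance two-sample t-test at the 5% level,
# assuming a true mean difference of 1 and a common sd of 1
power <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})

data.frame(n_per_group, power)
```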

### F-test in the two-sample problem (check whether $\sigma_1 = \sigma_2$ is a reasonable assumption)

* Obtain $F_{obs} = s_1^2/s_2^2$. 

* A value of $F_{obs}$ either close to 0 or extremely large suggests a violation of $H_0: \sigma_1 = \sigma_2$.
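A minimal sketch of this F-test in R follows. It relies on the standard fact that, under normality and $H_0$, $F_{obs}$ has an F-distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. The samples `y1` and `y2` are simulated, and the result is compared with `var.test()`.

```r
# Simulated samples; sizes and parameters are arbitrary choices for illustration
set.seed(4)
y1 <- rnorm(12, mean = 0, sd = 1)
y2 <- rnorm(15, mean = 0, sd = 1)
n1 <- length(y1)
n2 <- length(y2)

F_obs <- var(y1) / var(y2)

# Two-sided p-value: both very small and very large F_obs count as extreme
p_lower <- pf(F_obs, n1 - 1, n2 - 1)
p_value <- 2 * min(p_lower, 1 - p_lower)

# Compare with the built-in test
vt <- var.test(y1, y2)
all.equal(F_obs, unname(vt$statistic))
all.equal(p_value, vt$p.value)
```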

# 2.3

### Our p-value for two-sided alternative, randomization test

We use the randomization test to avoid relying on the normality assumption and to guard against the influence of lurking variables.

* Let $d_{obs} = \bar y_1 - \bar y_2$: the observed difference in two sample means. (We used $d^*$ in the last slide).

* Let $d_i$ be the value of $\bar y_1 - \bar y_2$ based on permuted observations, $i=1, 2, \ldots, {n_1+n_2 \choose n_1}$.

* Let $c_1= n\{|d_i| > |d_{obs}|\}$, and $c_2 =n \{|d_i| = |d_{obs}|\}$, where $n\{\cdot\}$ counts the permutations that satisfy the condition.

We define the p-value as
*  pvalue $= (c_1 + 0.5c_2)/{n_1+n_2 \choose n_1} $.

We reject $H_0$ if the p-value is smaller than 0.05 (or another pre-agreed level).

### Code for performing randomization test

In [None]:
total = choose(n1+n2, n1)
mm = abs(mean(yy) - mean(xx))
zz = c(xx, yy)
zbar = mean(zz)
dd = combn(zz, n2, FUN=mean)                  # mean of every possible "group 2" of size n2
dd = abs(dd + (n2*dd - (n1+n2)*zbar)/n1)      # |group-2 mean - group-1 mean| for each split
pp = (sum(dd > mm) + 0.5*sum(dd== mm)) / total   # p-value

### Single Factor with multiple levels

The t-test is mostly used to compare the effects of two treatments. We now consider the situation where more than two treatments derived from a single factor are being investigated.

To be called a **single factor** experiment, these treatments should intrinsically be connected.

For example, fertilizers of various kinds or mixtures; the temperature at different levels, the medicine of several kinds, or a medicine at various dosages.

Linear model proposed:

$$
y_{ij} = \eta + \tau_i + \epsilon_{ij}
$$
$\eta$ is the overall mean, $\tau_i$ is the mean response from the $i$th treatment after subtracting the overall mean. The error term $\epsilon_{ij}$ is what cannot be explained by **the treatment effect** $\tau_i$.

$$
\epsilon_{ij} \sim N(0, \sigma^2)
$$
and they are independent of each other

### Intuitive but easily justified estimates

Let

* $\bar{y}_{i \cdot} = (y_{i1} + y_{i2} + \cdots + y_{i n_i})/n_i$;


* $\bar{y}_{\cdot j} = (y_{1j} + y_{2j} + \cdots + y_{kj})/k$;


* $\bar{y}_{\cdot \cdot} = \sum_{i,j} y_{ij}/{N}$.



The following estimates of parameters are generally used:


* $\hat \eta = \bar{y}_{..}; ~~~\hat \tau_i = \bar{y}_{i \cdot} - \bar{y}_{..}$.


Each observed value can be decomposed as

$$
y_{ij} 
 = \bar{y}_{..} + (\bar{y}_{i \cdot} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i \cdot})
 = \hat \eta + \hat \tau_i + r_{ij}.
$$
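The decomposition can be reproduced in a few lines of R. The data below are made up, with deliberately unequal group sizes; `trt` is a hypothetical treatment label.

```r
# Made-up data with unequal group sizes, for illustration only
y   <- c(5.1, 4.8, 5.5,  6.2, 6.0, 6.4, 6.1,  4.0, 4.3)
trt <- factor(c("A", "A", "A", "B", "B", "B", "B", "C", "C"))

eta_hat <- mean(y)                         # overall mean
group_mean <- tapply(y, trt, mean)         # treatment means
tau_hat <- group_mean - eta_hat            # estimated treatment effects

# Residuals: each observation minus its own treatment mean
r <- y - as.numeric(group_mean[as.character(trt)])

# Each observation is recovered as eta_hat + tau_hat_i + r_ij
reconstructed <- eta_hat + as.numeric(tau_hat[as.character(trt)]) + r
all.equal(y, reconstructed)

# The estimated effects satisfy sum_i n_i * tau_hat_i = 0
n_i <- as.numeric(table(trt))
weighted_sum <- sum(n_i * as.numeric(tau_hat))
weighted_sum
```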



## The question this experiment aims to answer


Do these treatments have different effects in terms of the brightness of the pulp sheets they produce?


In statistical language: test the hypothesis that $H_0: \tau_1 = \tau_2 = \cdots = \tau_k = 0$.


* If their sum is 0 and they are equal, then all of them must be zero.


#### Whether they are all zero or not is best reflected in the size of

SS$_{trt} = n_1\hat \tau_1^2 + n_2\hat \tau_2^2 + \cdots + n_k\hat\tau_k^2 
= \sum_{i=1}^k n_i(\bar y_{i\cdot} - \bar{y}_{\cdot \cdot})^2.$


* We call this quantity the treatment sum of squares.


## F-test/Distribution
Is $\mbox{SS}_{trt}$ sufficiently large to justify rejecting $H_0$?

* We compare its size against the residual sum of squares

$\mbox{SS}_{err} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2$.

This leads to the Analysis of Variance Table.

$F = \mbox{MSS}_{Trt}/\mbox{MSS}_{err}$ has an F-distribution with $k-1$ and $N-k$ degrees of freedom when $H_0$ is true.

An unusually large $F_{obs}$ indicates that the treatment sum of squares is likely inflated due to unequal $\tau_i$ values.

Hence, we compute the p-value as $P( F > F_{obs})$ and reject $H_0$ when the p-value is small (below 5%, say).


## ANOVA for One-way layout


\begin{matrix}
\mbox{Source} & \mbox{DF} & \mbox{SS}  & \mbox{MSS} & \mbox{F} \\
\mbox{Trt} & k-1
& 
\sum_{i=1}^k n_i(\bar y_{i \cdot}-\bar y_{\cdot\cdot})^2
&
\mbox{SS}_{trt}/(k-1) & * \\
\mbox{Resid/Error} & N-k
& 
\sum_{i=1}^k \sum_{j=1}^{n_i}(y_{i j} - \bar y_{i \cdot})^2 = \mbox{SS}_{tot} - \mbox{SS}_{trt}
&
\mbox{SS}_{err}/(N-k)& \\
\mbox{Total} & N-1 &
\sum_{i=1}^k \sum_{j=1}^{n_i}(y_{i j} - \bar y_{\cdot \cdot})^2
&
SS_{tot} / (N-1)
\end{matrix}



The F-statistic is defined to be
$F = \mbox{MSS}_{trt}/\mbox{MSS}_{err}$.

## Equations/Sample code to calculate ANOVA table

#### Sum of Squares for treatment 


* SS$_{trt}=\sum_{i=1}^k n_i(\bar y_{i \cdot}-\bar y_{\cdot\cdot})^2$


* MSS $_{trt}=\mbox{SS}_{trt}/(k-1)$


In [None]:
aa = c(...) ; bb = c(...) ; cc = c(...) ; dd = c(...) ; 
yy = c(aa, bb, cc, dd)

aabar = mean(aa); bbbar = mean(bb); ccbar = mean(cc); ddbar = mean(dd); 
yybar = mean(yy)

n = length(aa); k = 4   # assumes all k = 4 groups share the same size n
SS.trt = n*((aabar - yybar)^2+(bbbar - yybar)^2 + (ccbar - yybar)^2 + (ddbar - yybar)^2)
MSS.trt = SS.trt/(k-1)

print(c(aabar, bbbar, ccbar, ddbar)) ; print(mean(yy)) ; print(SS.trt) ; print(MSS.trt)

#### Sum of Squares for error

* SS$_{err} =\sum_{i=1}^k \sum_{j=1}^{n_i} ( y_{i j} - \bar y_{i \cdot})^2$ 

* MSS$_{err}$ = SS$_{err}/(N-k)$

In [None]:
SS.e = sum((aa - aabar)^2)+sum((bb-bbbar)^2)+sum((cc-ccbar)^2) + sum((dd-ddbar)^2)

N = length(yy); k = 4   # total number of observations and number of treatments
MSS.e = SS.e/(N-k)

print(SS.e);  print(MSS.e)

#### Total Sum of Squares/F value/p-value:

SS$_{tot} = \sum_{i=1}^k \sum_{j=1}^{n_i} ( y_{i j} - \bar y_{\cdot \cdot})^2$

Remark: it is not used for inference. It should equal the sum of the other two.

In [None]:
SS.tot = sum( (yy - mean(yy))^2)

print(SS.tot)

#F-value
f = MSS.trt/MSS.e

#p-value
p.value = pf(f, k-1, N-k, lower.tail=F)

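# Critical value of the F test at level alpha (e.g. alpha = 0.05)
qf(1-alpha, k-1, N-k)
# For the F test, if the observed f > critical value, the result is statistically significant.
# This step is optional: it is equivalent to checking p.value < alpha.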

In [None]:
#check your ans:

## standard R function for one-way anova
## we first organize the data into required data.frame format.
trt = c(rep("aa", n), rep("bb", n), rep("cc", n), rep("dd", n))
pulp.data = data.frame(yy, trt)

SS.tot = sum((yy-mean(yy))^2) 
## extra calculation for other purposes.

pulp.aov <- aov(yy ~ trt, pulp.data)
summary(pulp.aov)

# 3.1


### Note: for Tukey’s method and the Bonferroni method, a simultaneous CI that does NOT contain 0 means that the corresponding pair of groups has significantly different means.

i.e., if the interval excludes 0, then the difference between the means of the two groups is significant.

### Applying this idea for one-way layout

### Bonferroni - used to control family-wise type I error, proof in lecture 3.1 sub

* there are $k' = k(k-1)/2$ ($k$ choose 2) parameters of interest ($\mu_i - \mu_j$) 
    in one-way layout.

* for two-sided simultaneous CIs, set $\alpha' = \alpha/k'$.
* The Bonferroni method rejects any $H_{ij}: \mu_i = \mu_j$ only if
$
|t_{ij}| > t( 1-\alpha'/2; N-k).
$

* the t-statistic for this purpose 

$$
\frac{ (\bar y_{j \cdot} - \bar y_{i \cdot})-(\mu_j - \mu_i) }{\sqrt{ (1/n_i + 1/n_j) \hat \sigma^2}}.
$$


* the two-sided simultaneous CIs for $\mu_j - \mu_i$ derived from it:

$$
(\bar y_{j \cdot} - \bar y_{i \cdot})
\pm t(1-\alpha'/2, N-k) ~ \hat \sigma \sqrt{1/n_i + 1/n_j}.
$$


* the error variance $\hat \sigma^2$ is estimated by MSS$_{err}$.
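The Bonferroni intervals above can be sketched in R as follows. The data are simulated for $k = 4$ treatments with $n = 5$ replicates each; the group means and the error standard deviation are assumed values for illustration.

```r
# Simulated one-way layout; all numbers are arbitrary choices for illustration
set.seed(5)
k <- 4
n <- 5
N <- k * n
trt <- factor(rep(c("A", "B", "C", "D"), each = n))
y <- rnorm(N, mean = rep(c(10, 10.5, 10, 9.5), each = n), sd = 0.4)

ybar <- tapply(y, trt, mean)                       # treatment means
n_i <- as.numeric(table(trt))                      # group sizes
sigma2_hat <- sum((y - as.numeric(ybar[as.character(trt)]))^2) / (N - k)  # MSS_err

pairs <- combn(k, 2)               # each column is a pair (i, j) with i < j
k_prime <- ncol(pairs)             # k(k-1)/2 comparisons
alpha <- 0.05
crit <- qt(1 - (alpha / k_prime) / 2, df = N - k)  # Bonferroni critical value

i <- pairs[1, ]
j <- pairs[2, ]
est <- as.numeric(ybar[j] - ybar[i])
half <- crit * sqrt(sigma2_hat * (1 / n_i[i] + 1 / n_i[j]))

ci <- data.frame(pair = paste(names(ybar)[j], names(ybar)[i], sep = " - "),
                 lower = est - half, upper = est + half)
ci   # pairs whose interval excludes 0 differ significantly
```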

### Variations:

* When $n_i = n_j = n$, we have $\sqrt{1/n_i + 1/n_j} = \sqrt{2/n}$.


* Logic: transfer $\theta \to \mu_j - \mu_i$, 
$\hat \theta \to (\bar y_{j \cdot} - \bar y_{i \cdot})$, 
and 
$\widehat{var}(\hat \theta) \to (1/n_i + 1/n_j) \hat \sigma^2$.


* Simultaneous one-sided CIs use $t(1-\alpha', N-k)$ and keep only one of the two bounds, for example the lower bounds

$$
\mu_j - \mu_i \geq (\bar y_{j \cdot} - \bar y_{i \cdot})
- t(1-\alpha', N-k) ~ \hat \sigma \sqrt{1/n_i + 1/n_j}.
$$

### Tukey's method is one of many possible remedies. 


* We confine ourselves to the context of one-way layout.


* The difference in means is estimated by 
    $\bar y_{i\cdot}-\bar y_{j \cdot}$


* The key quantity is given by

$$ t_{ij} = 
\frac{(\bar y_{i\cdot}-\bar y_{j \cdot}) - (\mu_i - \mu_j)}
{\sqrt{ (1/n_i + 1/n_j)s^2}}.
$$


* A key quantity for testing $\mu_i - \mu_j = 0$ for all $i, j$
is 

$$
t^* = \sqrt{2} \max \{ |t_{ij}| \}
$$

> with $\mu_i - \mu_j = 0$ when calculating $t_{ij}$.


**Pay no attention to the $\sqrt{2}$ here; it is statistically irrelevant at this moment.**

* Reject the null hypothesis that all treatments have equal means when, for some pair $i, j$,
$$|t_{ij}| > qtukey(1- \alpha; k, N-k)/\sqrt{2}.$$


* The null rejection rate is given by $\alpha$.

* the two-sided simultaneous CIs for $\mu_j - \mu_i$ derived from it:

$$
(\bar y_{j \cdot} - \bar y_{i \cdot})
\pm \frac {1}{\sqrt{2}}qtukey(1-\alpha; k, N-k) ~ \hat \sigma \sqrt{1/n_i + 1/n_j}.
$$

### Simultaneous CI by Tukey method


We may use the Tukey method to construct simultaneous CIs based on the same idea:

$$
\frac{ \sqrt{2}|(\hat \tau_i - \hat \tau_j)-(\tau_i - \tau_j)|}
{\sqrt{(1/n_i + 1/n_j)s^2}} \leq qtukey(1- \alpha; k, N-k)
$$
> for all $i, j$.


In particular, for the pulp example, $k=4, N=20$, the
($\mu_B - \mu_D$) part of the simultaneous 95\% CIs is given by

$$
0.62 \pm 2.86\times \sqrt{1/5 + 1/5}\times \sqrt{0.106}
= [ 0.03, 1.21].
$$


Since 0 is not in this interval, the method again finds
that $\mu_B - \mu_D \neq 0$ at the $0.05$ significance level.
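The numbers in this example can be reproduced with `qtukey()`. The sketch below takes the estimate 0.62 and the error mean square 0.106 from the text as given, and also computes the Bonferroni critical value for comparison.

```r
k <- 4
N <- 20
n <- 5               # replicates per treatment
alpha <- 0.05
s2 <- 0.106          # error mean square quoted in the text
diff_BD <- 0.62      # estimated mu_B - mu_D quoted in the text

# Tukey critical value, rescaled by sqrt(2) to match the t-statistic scale
crit_tukey <- qtukey(1 - alpha, nmeans = k, df = N - k) / sqrt(2)
ci_tukey <- diff_BD + c(-1, 1) * crit_tukey * sqrt((1 / n + 1 / n) * s2)

round(crit_tukey, 2)
round(ci_tukey, 2)

# Bonferroni critical value for the same k(k-1)/2 comparisons
crit_bonf <- qt(1 - (alpha / choose(k, 2)) / 2, df = N - k)
crit_tukey < crit_bonf   # Tukey gives the narrower intervals here
```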


**Remark: conclusions of different methods do not have to be identical**

## 3.3 
### Sample size determination.

* Let
$$
\mbox{SS}_{err} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i \cdot})^2
$$

>be the sum of squares due to error terms. 


* We will show that $\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i \cdot})^2/\sigma^2$ has a chi-square distribution with df = $n_i-1$.


* This further implies $E[\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i \cdot})^2] = (n_i-1) \sigma^2$.

* Similarly, we have E$(\mbox{SS}_{err}) = (N-k) \sigma^2$.


* Therefore, we choose


$$
s^2 = (N-k)^{-1} \mbox{SS}_{err}
$$


>as an estimator of $\sigma^2$. 


* We claimed that

$$
F = \frac{\mbox{MSS}_{trt}}{\mbox{MSS}_{err}} = \frac{\mbox{MSS}_{trt}}{s^2}
$$

> has F-distribution with $k-1$ and $N-k$ degrees of freedom **when $H_0$ is true**.


* These facts are the basis for sample size calculation.

* Suppose that scientists hope to confirm that treatment effects are significantly different by an experiment.

* Even if such a difference is real, there is no guarantee that a simple experiment will provide convincing statistical evidence.

(1) As long as one cannot eliminate the randomness in the experiment,
there is no way to make the case with a 100% guarantee.

(2) The smaller the difference in treatment effects, the larger
the required sample size.

(3) Increasing sample size (replicates) elevates the chance of detecting the difference **if there is any**.

**If we believe the effect is very small, there is no point in trying to confirm it**
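These points can be made concrete with a power calculation. The sketch below uses a standard fact that is not derived in these notes: with $n$ replicates per treatment, $F$ has a noncentral F-distribution with noncentrality parameter $n \sum_i \tau_i^2 / \sigma^2$ when $H_0$ is false. The helper `anova_power` and all numerical values of $\tau_i$ and $\sigma$ are hypothetical.

```r
# Power of the one-way ANOVA F-test with n replicates per treatment.
# Hypothetical helper: tau = treatment effects (summing to 0), sigma = error sd.
anova_power <- function(n, tau, sigma, alpha = 0.05) {
  k <- length(tau)
  N <- n * k
  ncp <- n * sum(tau^2) / sigma^2            # noncentrality parameter
  f_crit <- qf(1 - alpha, k - 1, N - k)      # rejection threshold under H0
  pf(f_crit, k - 1, N - k, ncp = ncp, lower.tail = FALSE)
}

tau <- c(-0.3, 0.1, 0.3, -0.1)               # assumed treatment effects
n_grid <- c(3, 5, 10, 20)                    # replicates per treatment

power_large <- sapply(n_grid, anova_power, tau = tau, sigma = 0.4)
power_small <- sapply(n_grid, anova_power, tau = tau / 3, sigma = 0.4)

# Power grows with n, and smaller effects need many more replicates
data.frame(n = n_grid, power_large, power_small)
```

To choose a sample size, one would increase `n` until the power reaches a target such as 0.8 for the smallest effect worth detecting.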