## Case Study: Framingham Heart Study

Data considered:
- n = 1329 men
- X = Cholesterol measurement in 1948 (mg/dl)
- Y = whether they had developed CVD after 10 years (present/absent)

#### Binomial Sampling
Let $\pi_H=P(present\mid high),\ \pi_L = P(present\mid low)$.

Hypotheses: $H_0: \pi_H = \pi_L, H_a: \pi_H \neq \pi_L$

Assumptions: 
 - Within each cholesterol group, each person is an independent Bernoulli trial whose chance of developing CVD is $\pi_H$ (high) or $\pi_L$ (low). The group sizes are $n_H = 286, n_L = 1043$.
 - Then for fixed $n_H, n_L$, the numbers of people who develop CVD are $y_H\sim Binomial(n_H=286,\pi_H)$ and $y_L\sim Binomial(n_L=1043,\pi_L)$.
 - The estimate of $\pi_H-\pi_L$ is $\hat\pi_H-\hat\pi_L$ where $\hat\pi_H = y_H/n_H, \hat\pi_L = y_L/n_L$ are the sample proportions.
 - Since the two groups are independent, $var(\hat\pi_H-\hat\pi_L) = var(\hat\pi_H) + var(\hat\pi_L)\\=n_H\pi_H(1-\pi_H)/n^2_H + n_L\pi_L(1-\pi_L)/n^2_L\\= \pi_H(1-\pi_H)/n_H+\pi_L(1-\pi_L)/n_L$
 - Under $H_0$ the two proportions are equal, so the standard error uses a pooled estimate: $se(\hat\pi_H-\hat\pi_L) = \sqrt{\hat\pi_c(1-\hat\pi_c)(n_H^{-1}+n_L^{-1})}$ where $\hat\pi_c=\frac{y_L+y_H}{n_L + n_H}$ is the combined sample proportion.
 - By the CLT, the test statistic is approximately $N(0,1)$ under $H_0$.

Test statistic: $\frac{\hat\pi_H-\hat\pi_L}{se(\hat\pi_H-\hat\pi_L)}=5.575$

p-value: $2P(Z\geq 5.575)\approx 2.5\times 10^{-8}<0.05$

Conclusion: We have strong evidence that the probability of developing CVD is not the same for High and Low cholesterol groups.
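
The pooled test statistic and its p-value can be reproduced directly from the counts. The cell below is a minimal sketch, assuming the cell counts 41/245 (high) and 51/992 (low) from the table in this case study; the variable names (`se_pooled`, `z`, `p_value`) are illustrative only.

```r
# Counts from the 2x2 table of this case study (High: 41 present, 245 absent;
# Low: 51 present, 992 absent)
y_h <- 41; n_h <- 41 + 245
y_l <- 51; n_l <- 51 + 992

# Sample proportions and the combined (pooled) proportion used under H0
pi_h_hat <- y_h / n_h
pi_l_hat <- y_l / n_l
pi_c_hat <- (y_h + y_l) / (n_h + n_l)

# Pooled standard error and z statistic
se_pooled <- sqrt(pi_c_hat * (1 - pi_c_hat) * (1 / n_h + 1 / n_l))
z <- (pi_h_hat - pi_l_hat) / se_pooled

# Two-sided p-value from the standard normal approximation
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(z)
print(p_value)
```

Squaring `z` should give the `X-squared` value that `prop.test(cvd, correct=FALSE)` reports, since that function tests the same hypothesis without a continuity correction.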


In [5]:
cvd<-matrix(c(41,245,51,992), nrow=2,byrow=TRUE)
dimnames(cvd)<-list(c("High","Low"), c("Present","Absent"))
names(dimnames(cvd))<-c("Cholesterol","Cardio Vascular Disease")
print(cvd)

           Cardio Vascular Disease
Cholesterol Present Absent
       High      41    245
       Low       51    992


In [8]:
# estimate for pi
pi_h = 41/(41+245)
pi_l = 51/(51+992)
print(pi_h)
print(pi_l)

[1] 0.1433566
[1] 0.04889741


In [10]:
# sample size 
n_h = 41 + 245
n_l = 51 + 992
conf.level = 0.95
crit.val = qnorm(1-(1-conf.level)/2)
crit.val

In [16]:
# unpooled standard error (used for the CI; the hypothesis test uses the pooled version)
se.hat = sqrt(pi_h * (1 - pi_h)/n_h + pi_l * (1 - pi_l)/n_l)
se.hat

In [17]:
# 95% CI
c((pi_h-pi_l)-crit.val*se.hat, (pi_h-pi_l)+crit.val*se.hat)

In [19]:
# easier way for binomial sampling
prop.test(cvd, correct=FALSE)


	2-sample test for equality of proportions without continuity
	correction

data:  cvd
X-squared = 31.082, df = 1, p-value = 2.474e-08
alternative hypothesis: two.sided
95 percent confidence interval:
 0.05178874 0.13712972
sample estimates:
    prop 1     prop 2 
0.14335664 0.04889741 


In [21]:
# or Pearson's chi-squared test
chisq.test(cvd, correct=FALSE)


	Pearson's Chi-squared test

data:  cvd
X-squared = 31.082, df = 1, p-value = 2.474e-08


In [20]:
# Caution: by default prop.test applies a continuity correction, so the result differs from the manual calculation
prop.test(cvd)


	2-sample test for equality of proportions with continuity correction

data:  cvd
X-squared = 29.633, df = 1, p-value = 5.221e-08
alternative hypothesis: two.sided
95 percent confidence interval:
 0.0495611 0.1393574
sample estimates:
    prop 1     prop 2 
0.14335664 0.04889741 


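The 95% CI for $\pi_H-\pi_L$ does not include 0, with or without the continuity correction, which agrees with rejecting $H_0$.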

#### Contingency Table
Consider a row factor $R$ with $I$ levels and a column factor $C$ with $J$ levels.

Then, define $P(R=i,C=j)=\pi_{ij}, P(R=i)=\pi_{i\cdot}, P(C=j)=\pi_{\cdot j}$

Hypotheses: $H_0: \pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}$ for all $(i,j)$, $H_a: \pi_{ij} \neq\pi_{i\cdot}\pi_{\cdot j}$ for at least one $(i,j)$  
Under $H_0$ there is no relationship between the two factors.

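For each cell, the estimated expected cell count under $H_0$ is $\hat\mu_{ij} = n\hat\pi_{i\cdot}\hat\pi_{\cdot j} = y_{i\cdot}y_{\cdot j}/n$, where $\hat\pi_{i\cdot}=y_{i\cdot}/n$ and $\hat\pi_{\cdot j}=y_{\cdot j}/n$.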

Test statistic: $X^2 = \sum_{j=1}^J\sum_{i=1}^I \frac{(y_{ij}-\hat\mu_{ij})^2}{\hat\mu_{ij}}$, which approximately follows $\chi^2_{(I-1)(J-1)}$ under $H_0$.

Motivation for the denominator: if $y\sim Poisson(\mu)$ then $var(y)=E(y)=\mu$, so each term is a squared standardized count.

For this case, the test statistic is $X^2 = 31.08$ on $(2-1)(2-1)=1$ d.f., p-value $<0.0001$

Strong evidence that the two factors are not independent: CVD status depends on cholesterol level.

When $I=J=2$, the chi-square test of independence is equivalent to comparing two proportions.
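
The expected counts and the $X^2$ statistic can be computed by hand from the table margins. The cell below is a minimal sketch that rebuilds the same `cvd` table used earlier; the names `mu_hat`, `x2` and `x2_prop` are illustrative only. The final lines compare the result with `prop.test` to illustrate the equivalence for a 2x2 table.

```r
# Rebuild the 2x2 table of this case study
cvd <- matrix(c(41, 245, 51, 992), nrow = 2, byrow = TRUE,
              dimnames = list(c("High", "Low"), c("Present", "Absent")))
n <- sum(cvd)

# Estimated expected counts under independence: (row total * column total) / n
mu_hat <- outer(rowSums(cvd), colSums(cvd)) / n

# Pearson chi-squared statistic and its degrees of freedom
x2 <- sum((cvd - mu_hat)^2 / mu_hat)
df <- (nrow(cvd) - 1) * (ncol(cvd) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Statistic reported by the two-proportion test without continuity correction
x2_prop <- unname(prop.test(cvd, correct = FALSE)$statistic)

print(mu_hat)
print(c(x2 = x2, x2_prop = x2_prop, df = df, p_value = p_value))
```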

**Formal approach**  
Let $Y_{ij}$ be the random variable representing the number of observations in cell $(i,j)$.

Let $y_{ij}$ be the observed cell counts.

Then $Y=(Y_{11},Y_{12},Y_{21},Y_{22})\sim Multinomial(n,\pi_{11},\pi_{12},\pi_{21},\pi_{22})$, with $$P(Y=y)=\frac{n!}{y_{11}!y_{12}!y_{21}!y_{22}!}\pi_{11}^{y_{11}}\pi_{12}^{y_{12}}\pi_{21}^{y_{21}}\pi_{22}^{y_{22}}$$

The log-likelihood is 
$$\log\mathcal{L}=\sum_{j=1}^2\sum_{i=1}^2 y_{ij}\log\pi_{ij}+\log{n\choose y_{11},y_{12},y_{21},y_{22}}$$

Maximizing $\log\mathcal{L}$ w.r.t. the $\pi_{ij}$ subject to $\sum\sum\pi_{ij}=1$ gives $\hat\pi_{ij}=y_{ij}/n$
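
One way to see this is with a Lagrange multiplier. The following is a sketch only; the multiplier $\lambda$ and the function $\Lambda$ are notation introduced here, and the multinomial coefficient is dropped because it does not depend on the $\pi_{ij}$.

$$\begin{aligned}
\Lambda(\pi,\lambda) &= \sum_{j=1}^2\sum_{i=1}^2 y_{ij}\log\pi_{ij} - \lambda\Big(\sum_{j=1}^2\sum_{i=1}^2\pi_{ij}-1\Big) \\
\frac{\partial \Lambda}{\partial\pi_{ij}} &= \frac{y_{ij}}{\pi_{ij}}-\lambda = 0 \;\Rightarrow\; \pi_{ij} = y_{ij}/\lambda \\
\sum\sum \pi_{ij} = 1 &\;\Rightarrow\; \lambda = \sum\sum y_{ij} = n \;\Rightarrow\; \hat\pi_{ij} = y_{ij}/n
\end{aligned}$$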

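Under $H_0: \pi_{ij}=\pi_{i\cdot}\pi_{\cdot j}$, substitute $\pi_{i\cdot}\pi_{\cdot j}$ for $\pi_{ij}$ and maximize over the row and column probabilities, which gives $\hat\pi_{i\cdot}=y_{i\cdot}/n$ and $\hat\pi_{\cdot j}=y_{\cdot j}/n$.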

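The likelihood ratio statistic compares the maximized likelihoods of the reduced (independence) model $\mathcal{L}_R$ and the full (unrestricted) model $\mathcal{L}_F$. Under $H_0$ it approximately follows a chi-square distribution:

$$G^2 = -2\log(\mathcal{L}_R/\mathcal{L}_F)\sim\chi^2_{(I-1)(J-1)}$$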

To obtain the d.f.:  
df(Unrestricted/FULL) - df(Independence/REDUCED)  
= #parameters in FULL ($\pi_{ij}$) - #parameters in REDUCED ($\pi_{i\cdot},\pi_{\cdot j}$)  
= $(IJ-1)-(I+J-2) = (I-1)(J-1)$  
$-1$ because of the constraint $\sum\sum\pi_{ij}=1$  
$-2$ because of the constraints $\sum\pi_{i\cdot}=1,\sum\pi_{\cdot j}=1$
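
Plugging the two sets of maximum likelihood estimates into $G^2$ gives $G^2 = 2\sum\sum y_{ij}\log(y_{ij}/\hat\mu_{ij})$ with $\hat\mu_{ij}=y_{i\cdot}y_{\cdot j}/n$. The cell below is a minimal sketch of that calculation for the CVD table, assuming all observed counts are positive (otherwise the logarithm needs special handling); the names `g2`, `n_row` and `n_col` are illustrative only.

```r
# Rebuild the 2x2 table of this case study
cvd <- matrix(c(41, 245, 51, 992), nrow = 2, byrow = TRUE,
              dimnames = list(c("High", "Low"), c("Present", "Absent")))
n <- sum(cvd)

# Fitted counts under the reduced (independence) model
mu_hat <- outer(rowSums(cvd), colSums(cvd)) / n

# Likelihood ratio statistic: G^2 = -2 log(L_R / L_F)
#                                 = 2 * sum(y_ij * log(y_ij / mu_hat_ij))
g2 <- 2 * sum(cvd * log(cvd / mu_hat))

# Degrees of freedom: free parameters in FULL minus free parameters in REDUCED
n_row <- nrow(cvd)
n_col <- ncol(cvd)
df <- (n_row * n_col - 1) - (n_row + n_col - 2)

p_value <- pchisq(g2, df = df, lower.tail = FALSE)
print(c(G2 = g2, df = df, p_value = p_value))
```

$G^2$ and the Pearson $X^2$ generally differ numerically, but both are referred to the same $\chi^2_{(I-1)(J-1)}$ distribution.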