Assignee: Martin He

### Problem 1: Linear Regression
<br>
For this problem, we use the Heart and Estrogen/progestin Replacement Study (HERS) dataset. HERS was
a randomized, blinded, placebo-controlled secondary prevention trial of estrogen plus progestin for secondary
prevention of coronary heart disease (CHD) in postmenopausal women.

1. Perform multiple linear regression using the lm function in R to predict baseline glucose levels in the
HERS clinical trial of hormone therapy. Use exercise, age, drinkany, and BMI as the covariates, and
glucose as the response. Remove the rows with diabetes == yes.

First, import the data and keep only the rows with diabetes == "no".

In [1]:
hers = read.csv("hersdata.csv")
hers = hers[hers$diabetes =="no",]
head(hers)

| | ï..HT | age | raceth | nonwhite | smoking | drinkany | exercise | physact | globrat | poorfair | ... | LDL | HDL | TG | tchol1 | LDL1 | HDL1 | TG1 | SBP | DBP | age10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | placebo | 70 | African American | yes | no | no | no | much more active | good | no | ... | 122.4 | 52 | 73 | 201 | 137.6 | 48 | 77 | 138 | 78 | 7.0 |
| 2 | placebo | 62 | African American | yes | no | no | no | much less active | good | no | ... | 241.6 | 44 | 107 | 216 | 150.6 | 48 | 87 | 118 | 70 | 6.2 |
| 4 | placebo | 64 | White | no | yes | yes | no | much less active | good | no | ... | 116.2 | 56 | 159 | 207 | 122.6 | 57 | 137 | 152 | 72 | 6.4 |
| 5 | placebo | 65 | White | no | no | no | no | somewhat less active | good | no | ... | 150.6 | 42 | 107 | 235 | 172.2 | 35 | 139 | 175 | 95 | 6.5 |
| 6 | hormone therapy | 68 | African American | yes | no | yes | no | about as active | good | no | ... | 137.8 | 52 | 111 | 202 | 126.6 | 53 | 112 | 174 | 98 | 6.8 |
| 8 | hormone therapy | 69 | White | no | no | no | yes | much more active | very good | no | ... | 121.2 | 46 | 139 | 190 | 113.4 | 54 | 113 | 178 | 82 | 6.9 |


Next, fit the multiple linear regression model:

In [2]:
fit.linreg <- lm(glucose ~ exercise + age + drinkany + BMI, data = hers)
summary(fit.linreg)


Call:
lm(formula = glucose ~ exercise + age + drinkany + BMI, data = hers)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.582  -6.381  -0.901   5.508  32.015 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  89.47279    7.11416  12.577   <2e-16 ***
exerciseyes  -0.91077    0.42933  -2.121   0.0340 *  
age           0.06093    0.03144   1.938   0.0528 .  
drinkanyno  -10.34285    6.66006  -1.553   0.1206    
drinkanyyes  -9.66632    6.66150  -1.451   0.1469    
BMI           0.48899    0.04163  11.746   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.406 on 2024 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.07188,	Adjusted R-squared:  0.06959 
F-statistic: 31.35 on 5 and 2024 DF,  p-value: < 2.2e-16


2. What is the interpretation of the Estimate column in the 'exercise' row of the summary? It can be viewed
using the command summary(model).

From the result shown above, the estimate for 'exerciseyes' is -0.91.<br>
This means the expected glucose level is 0.91 lower for a woman who exercises than for a woman who<br>
does not, holding the other covariates fixed (i.e., same age, same BMI, and same drinkany status).
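
As a rough check of this interpretation, the sketch below plugs the reported coefficients into the fitted equation for two hypothetical women who differ only in exercise. The helper name `predict_glucose` and the example values (age 65, BMI 27, drinkany "no") are illustrative assumptions, not part of the assignment.

```r
# Coefficients copied from the summary(fit.linreg) output above.
b <- c(intercept = 89.47279, exerciseyes = -0.91077, age = 0.06093,
       drinkanyno = -10.34285, drinkanyyes = -9.66632, BMI = 0.48899)

# Hypothetical helper: fitted glucose for one woman, built by hand
# from the coefficient table (dummy variables are 0/1 indicators).
predict_glucose <- function(exercise, age, drinkany, bmi) {
  unname(b["intercept"] +
    b["exerciseyes"] * (exercise == "yes") +
    b["age"] * age +
    b["drinkanyno"] * (drinkany == "no") +
    b["drinkanyyes"] * (drinkany == "yes") +
    b["BMI"] * bmi)
}

# Two women with identical age, BMI and drinkany, differing only in exercise.
with_ex    <- predict_glucose("yes", age = 65, drinkany = "no", bmi = 27)
without_ex <- predict_glucose("no",  age = 65, drinkany = "no", bmi = 27)

# The difference equals the exerciseyes coefficient, whatever the other values are.
diff_ex <- with_ex - without_ex
print(diff_ex)
```

Changing the assumed age, BMI, or drinkany value shifts both predictions by the same amount, so the difference stays at the exerciseyes estimate.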

3. What is the R-square value of the model? What is its interpretation?

From the result shown above, the multiple R-squared is 0.072.<br>
This means the model explains about 7.2% of the variance in glucose using the covariates 'age', 'drinkany', 'BMI', and 'exercise'.<br>
Because R-squared is low, most of the variation in glucose is left unexplained, so we should reassess this regression model and try to improve it.
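
The other fit statistics in the summary follow from R-squared and the degrees of freedom. A minimal sketch, using only numbers printed in the summary output (the variable names are my own):

```r
# Values copied from the summary(fit.linreg) output above.
r2       <- 0.07188   # multiple R-squared
df_resid <- 2024      # residual degrees of freedom
p        <- 5         # estimated slopes, excluding the intercept
n        <- df_resid + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# The overall F statistic can be written in terms of R-squared.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(c(n = n, adj_r2 = adj_r2, f_stat = f_stat), 4))
```

Both values should agree with the printed adjusted R-squared and F-statistic up to rounding of the inputs.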

4. Use the command confint(model) to get the confidence interval of age. What is its interpretation?

In [3]:
confint(fit.linreg)

| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 75.52095 | 103.42462775 |
| exerciseyes | -1.75275 | -0.06879174 |
| age | -0.0007263436 | 0.12258237 |
| drinkanyno | -23.40413 | 2.71843279 |
| drinkanyyes | -22.73044 | 3.39780235 |
| BMI | 0.4073452 | 0.57063578 |


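From the result shown above, the 95% confidence interval for the coefficient of 'age' is [-0.0007, 0.1226].<br>
To interpret, we are 95% confident that each additional year of age is associated with a change in expected glucose of between -0.0007 and 0.1226 mg/dL, holding the other covariates fixed. The interval contains zero, which agrees with the p-value of 0.0528 for age in the summary: the age effect is not significant at the 5% level.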
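
For a linear model, confint() returns the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds the age interval from the numbers in the coefficient table. It assumes the rounded estimate and standard error as printed, so the last digits can differ slightly from the confint() output.

```r
# Estimate and standard error for age, copied from summary(fit.linreg).
est      <- 0.06093
se       <- 0.03144
df_resid <- 2024

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_age <- est + c(-1, 1) * t_crit * se

print(round(ci_age, 4))
```

The lower bound comes out slightly below zero, matching the sign of the 2.5 % entry for age in the confint() table.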

### Problem 2: Logistic Regression
<br>
The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the
association between the “type A” behavior pattern and coronary heart disease (CHD). In WCGS, more than
3000 men were recruited to the study, and a number of (potential) risk factors were recorded at entry. The
men were then followed for about ten years, and it was recorded whether they developed CHD.
<br>
<br>
In this exercise, we will restrict ourselves to study the effect of smoking and age on the risk for CHD.
You may read the WCGS data into R by the command:

In [4]:
wcgs = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/wcgs.txt", sep="\t", header=TRUE, na.strings=".")
head(wcgs)

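| id | age | height | weight | sbp | dbp | chol | behpat | ncigs | dibpat | ... | wghtcat | agec | cholc | smoke | bmi | bage_50 | chol_50 | age_10 | bmi_10 | sbp_50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2343 | 50 | 67 | 200 | 132 | 90 | 249 | 1 | 25 | 1 | ... | 200 | 2 | 2 | 1 | 31.321 | 1 | 0.4526 | 0.3721 | 0.6803 | 0.0673 |
| 3656 | 51 | 73 | 192 | 120 | 74 | 194 | 1 | 25 | 1 | ... | 200 | 3 | 1 | 1 | 25.3286 | 1 | -0.6474 | 0.4721 | 0.081 | -0.1727 |
| 3526 | 59 | 70 | 200 | 158 | 94 | 258 | 1 | 0 | 1 | ... | 200 | 4 | 3 | 0 | 28.6939 | 1 | 0.6326 | 1.2721 | 0.4176 | 0.5873 |
| 22057 | 51 | 69 | 150 | 126 | 80 | 173 | 1 | 0 | 1 | ... | 170 | 3 | 1 | 0 | 22.1487 | 1 | -1.0674 | 0.4721 | -0.237 | -0.0527 |
| 12927 | 44 | 71 | 160 | 126 | 80 | 214 | 1 | 0 | 1 | ... | 170 | 1 | 2 | 0 | 22.313 | 0 | -0.2474 | -0.2279 | -0.2205 | -0.0527 |
| 16029 | 47 | 64 | 158 | 116 | 76 | 206 | 1 | 80 | 1 | ... | 170 | 2 | 2 | 1 | 27.1177 | 0 | -0.4074 | 0.0721 | 0.2599 | -0.2527 |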


1. Fit a logistic regression model with smoke as covariate. Is there a statistically significant effect
of smoking on the risk of developing CHD? Your outcome variable is “chd69” and your covariate is
“smoke”, which is a categorical binary variable.

In [5]:
fit.smoke <- glm(chd69 ~ smoke, data = wcgs, family = binomial)
summary(fit.smoke)


Call:
glm(formula = chd69 ~ smoke, family = binomial, data = wcgs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4731  -0.4731  -0.3497  -0.3497   2.3769  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.7636     0.1042  -26.54  < 2e-16 ***
smoke         0.6299     0.1337    4.71 2.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2  on 3153  degrees of freedom
Residual deviance: 1758.4  on 3152  degrees of freedom
AIC: 1762.4

Number of Fisher Scoring iterations: 5


From the result shown above, smoking has a statistically significant effect on the risk of developing CHD: the<br>
p-value for smoke is 2.47e-06, which is much smaller than 0.05, and the positive coefficient (0.63) means smokers have higher odds of CHD.<br>
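
The p-value in the coefficient table is a Wald test (estimate divided by standard error, compared with a standard normal). The deviances printed in the same summary also allow a likelihood ratio test. A minimal sketch using only the reported numbers, with variable names of my own choosing:

```r
# Wald test for smoke, from the coefficient table of summary(fit.smoke).
est <- 0.6299
se  <- 0.1337
z      <- est / se
p_wald <- 2 * pnorm(-abs(z))   # two-sided normal p-value

# Likelihood ratio test: drop in deviance from the null model,
# compared with a chi-squared distribution on 1 degree of freedom.
lr_stat <- 1781.2 - 1758.4
p_lr    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(z = z, p_wald = p_wald, lr_stat = lr_stat, p_lr = p_lr))
```

With the fitted objects available, `anova(fit.smoke, test = "Chisq")` would give the likelihood ratio test directly. Both tests point to the same conclusion here.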

2. Compute a 95% confidence interval for the regression coefficient for smoking. Also estimate the
odds ratio for smoking and determine a 95% confidence interval for the odds ratio. Explain in words
the meaning of the estimated odds ratio and its confidence interval.

In [6]:
confint(fit.smoke)

Waiting for profiling to be done...


| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -2.9741104 | -2.5653968 |
| smoke | 0.3697937 | 0.8945827 |


The result above shows that a 95% confidence interval for the regression coefficient of smoke (the log odds ratio) is [0.37, 0.89].<br>

In [7]:
exp(coef(fit.smoke))

The result above shows that the estimated odds ratio of developing CHD for smokers versus non-smokers is 1.8774.<br>
This means the odds (not the probability) of developing CHD are 87.74% higher for smokers than for non-smokers.<br>
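
An odds ratio is not a ratio of probabilities. The sketch below converts the fitted log-odds from summary(fit.smoke) into probabilities for each group and compares the two ratios. The variable names are illustrative.

```r
# Coefficients copied from summary(fit.smoke).
b0 <- -2.7636   # log-odds of CHD for non-smokers (smoke = 0)
b1 <-  0.6299   # change in log-odds for smokers (smoke = 1)

# Inverse logit turns log-odds into probabilities.
p_non <- plogis(b0)
p_smk <- plogis(b0 + b1)

odds <- function(p) p / (1 - p)

odds_ratio <- odds(p_smk) / odds(p_non)  # identical to exp(b1)
risk_ratio <- p_smk / p_non              # ratio of probabilities

print(round(c(p_non = p_non, p_smk = p_smk,
              odds_ratio = odds_ratio, risk_ratio = risk_ratio), 4))
```

Because CHD is fairly rare in both groups, the two ratios are close, but the ratio of probabilities is the smaller of the two.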

In [8]:
exp(confint(fit.smoke))

Waiting for profiling to be done...


| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 0.05109287 | 0.07688866 |
| smoke | 1.44743591 | 2.44631483 |


The result above shows that we are 95% confident that the odds ratio of developing CHD for smokers versus non-smokers<br>
is between 1.4474 and 2.4463. Since the interval lies entirely above 1, smoking is associated with higher odds of CHD.
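
The message "Waiting for profiling to be done..." indicates that confint() on a glm fit computes profile-likelihood intervals. A simpler Wald interval (what `confint.default(fit.smoke)` would return) can be rebuilt from the coefficient table. The sketch below does that with the rounded estimate and standard error as printed, and places it next to the reported profile interval.

```r
# Estimate and standard error for smoke, copied from summary(fit.smoke).
est <- 0.6299
se  <- 0.1337

# Wald interval on the log-odds scale, then exponentiated to the odds ratio scale.
z_crit   <- qnorm(0.975)
wald_log <- est + c(-1, 1) * z_crit * se
wald_or  <- exp(wald_log)

# Profile-likelihood interval reported by exp(confint(fit.smoke)) above.
profile_or <- c(1.44743591, 2.44631483)

print(round(rbind(wald = wald_or, profile = profile_or), 4))
```

The two intervals are close for this model. Either way, the interval is computed on the coefficient scale first and exponentiated afterwards.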

3. Now we move on to a continuous variable “age”. Similarly, use logistic regression to study the
effect of age (at entry to the study) on the risk of developing CHD. Compute the estimated odds ratio
and explain in words the meaning of it.

In [9]:
fit.age <- glm(chd69 ~ age, data = wcgs, family = binomial)
summary(fit.age)


Call:
glm(formula = chd69 ~ age, family = binomial, data = wcgs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6209  -0.4545  -0.3669  -0.3292   2.4835  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.93952    0.54932 -10.813  < 2e-16 ***
age          0.07442    0.01130   6.585 4.56e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2  on 3153  degrees of freedom
Residual deviance: 1738.4  on 3152  degrees of freedom
AIC: 1742.4

Number of Fisher Scoring iterations: 5


In [10]:
exp(coef(fit.age))

The result above shows that the estimated odds ratio of developing CHD for a one-year increase in age is 1.0773.<br>
This means each additional year of age at entry multiplies the odds of developing CHD by 1.0773, i.e., increases the odds by about 7.7%.

4. When interpreting the effect of age, it may be reasonable to give the odds ratio corresponding
to a ten-year increase in age (rather than a one-year increase as we did in the previous question).
The easiest way to achieve this is to fit the logistic model using age/10 as a covariate. You can use
“I(age/10)” as a new column in R. Compare the estimates of this model with the one in the previous
question. What do you see?<br>
(Hint: Think of the odds ratio in the new problem as the odds ratio in the old problem multiplied by
the corresponding factor year by year for 10 times.)

In [11]:
fit.ageten <- glm(chd69 ~ age_10, data = wcgs, family = binomial)
summary(fit.ageten)


Call:
glm(formula = chd69 ~ age_10, family = binomial, data = wcgs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6209  -0.4545  -0.3669  -0.3292   2.4835  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.49531    0.06927 -36.026  < 2e-16 ***
age_10       0.74423    0.11302   6.585 4.56e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2  on 3153  degrees of freedom
Residual deviance: 1738.4  on 3152  degrees of freedom
AIC: 1742.4

Number of Fisher Scoring iterations: 5


In [12]:
exp(coef(fit.ageten))

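The result above shows that the estimated odds ratio of developing CHD for a ten-year increase in age is 2.1048.<br>
This means each additional ten years of age increases the odds of developing CHD by 110.48%.<br>
Compared with the previous model, the slope is ten times larger (0.74423 versus 0.07442), while the z value, p-value, and deviances are unchanged, so the two models fit the data identically. The ten-year odds ratio is the one-year odds ratio raised to the tenth power: 1.0773^10 ≈ 2.10. The intercept differs because, judging from the rows printed by head(wcgs), the dataset column age_10 appears to be a centered version of age/10 rather than I(age/10) itself.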
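
The relationship between the one-year and ten-year odds ratios can be checked numerically. The sketch below first uses the reported coefficients, then refits both parameterizations on simulated data to show that `I(age/10)` rescales the slope by exactly ten. The simulated ages, sample size, and true coefficients are assumptions chosen for illustration and are not the WCGS data.

```r
# Part 1: reported slopes from summary(fit.age) and summary(fit.ageten).
b_year   <- 0.07442
b_decade <- 0.74423
or_year   <- exp(b_year)
or_decade <- exp(b_decade)
# Ten yearly multiplications of the odds equal one ten-year multiplication.
print(c(or_year_pow10 = or_year^10, or_decade = or_decade))

# Part 2: simulated data (assumed values, not WCGS).
set.seed(1)
n   <- 3000
age <- sample(39:59, n, replace = TRUE)
chd <- rbinom(n, size = 1, prob = plogis(-6 + 0.07 * age))

fit_year   <- glm(chd ~ age,         family = binomial)
fit_decade <- glm(chd ~ I(age / 10), family = binomial)

slope_year   <- unname(coef(fit_year)[2])
slope_decade <- unname(coef(fit_decade)[2])

# Same fit, different units: the slope scales by 10, the deviance does not change.
print(c(slope_year = slope_year, slope_decade = slope_decade))
print(c(dev_year = deviance(fit_year), dev_decade = deviance(fit_decade)))
```

With `I(age / 10)` the intercept is also unchanged, since age = 0 means the same thing in both parameterizations. Only centering the covariate moves the intercept.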