-
Notifications
You must be signed in to change notification settings - Fork 1
/
05-statistical-analysis-r.Rmd
472 lines (320 loc) · 20.1 KB
/
05-statistical-analysis-r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
```{r eval = FALSE, include = FALSE}
source("setup.r")
```
# Statistical Analysis
## Setup
In this section, we'll provide some examples in R of basic statistical analyses you might need to implement at some point.
The analyses you conduct as part of real research will almost certainly be more complicated than the code shown here. However, the hope is that you'll get enough of a feel for the mechanics of R to begin putting the pieces together as demanded by a given project.
### Data
If you'd like to try any of these examples on your own, we're using the [NHANES I Epidemiologic Followup Study](https://wwwn.cdc.gov/nchs/nhanes/nhefs/default.aspx/) (NHEFS) data set freely available at the [website](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) for *Causal Inference: What If* by Miguel A. Hernán and James M. Robins.
For both SAS and R, we'll use the CSV file they provide.
The website itself contains extensive code examples in SAS, Stata, R, and Python---everything you could want!---so those who will be taking advanced epidemiologic or causal inference coursework should refer to those resources for guidance on how to implement relevant analyses. We'll cover a little bit here, but only briefly.
### NHEFS Codebook
```{r nhefs-codebook, echo = FALSE}
readxl::read_excel("data/NHEFS_Codebook.xls") %>%
knitr::kable()
```
## Exploratory Data Analysis
First, let's read in our data and glance at it using a couple of different views.
```{r read-nhefs, message = F}
nhefs <- readr::read_csv("data/nhefs.csv")
head(nhefs)
tibble::glimpse(nhefs)
```
For now, let's focus on observations (rows) that don't have any missing data. To do this, let's get a summary of how much missingness we have across the `r format(nrow(nhefs), big.mark = ",")` observations.
```{r count-missing}
allvars <- names(nhefs)
df_miss <- data.frame(variable = allvars)
df_miss <- df_miss %>%
dplyr::rowwise() %>%
dplyr::mutate(n_miss = sum(is.na(nhefs[[variable]])))
print(df_miss, n = nrow(df_miss))
```
That looks a little hairy. Basically, all we did here was to create data.frame, save the NHEFS variable names to its first column, then look up each variable in the NHEFS data set and count the number of missing values.
However, we could have gotten this summary in a number of different ways.
```{r alt-missing-counts}
lapply(nhefs, function(x) table(is.na(x)))
sapply(nhefs, function(x) sum(is.na(x)))
nhefs[complete.cases(nhefs), ]
```
Let's drop every row with a missing value for any variable
```{r omit-too-many}
nhefs2 <- na.omit(nhefs)
tibble::glimpse(nhefs2)
```
Oops, we have only `r nrow(nhefs2)` rows left in the data set! Perhaps we should reconsider and pick several variables that will be of interest to us for our current purposes.
```{r subset-variables}
selectvars <- c(
"age", "alcoholfreq", "asthma", "wt71",
"smokeintensity", "sbp", "allergies",
"weakheart"
)
nhefs3 <- na.omit(nhefs[, selectvars])
tibble::glimpse(nhefs3)
```
That's better. We've discarded `r nrow(nhefs) - nrow(nhefs3)` rows and focused on a subset of variables of interest.
Look back at the output we got from `glimpse()`. Are the variables in the formats we expect/need? I see one that we may need to update, depending on how we plan to use it. (Hint: Look back at the codebook for the variable summaries)
It seems that `alcoholfreq` is currently being treated as a numeric variables, but really, it's a factor with 5 discrete levels. Furthermore, level 5 is Unknown, which we should probably treat as missing. We'll also use a factor format for `alcoholfreq`.
[We could leave `alcoholfreq` as a numeric variable and wrap it in the `factor()` function only when we needed to. In some cases, that approach might be preferable, in some cases not.]{.marginnote}
```{r factor-check}
nhefs3$alcfreq <- as.factor(nhefs3$alcoholfreq)
table(nhefs3$alcfreq)
```
Let's replace those 5 missing values with NA's, too, and drop the corresponding rows from our data set. Let's also get rid of the original `alcoholfreq` variable and just hang on to the factor variable we made.
```{r alcfreq}
# In words: where alcfreq in nhefs3 equals 5, assign NA
nhefs3$alcfreq[nhefs3$alcfreq == 5] <- NA
nhefs4 <- na.omit(nhefs3)
nhefs4$alcoholfreq <- NULL
tibble::glimpse(nhefs4)
```
Let's summarize our dataset:
```{r eda-stats}
# Continuous variables
contvars <- c("age", "wt71", "smokeintensity", "sbp")
summary(nhefs4[, contvars])
# Categorical variables
catvars <- c("asthma", "allergies", "alcfreq", "weakheart")
lapply(
X = catvars,
FUN = function(x) {
summarytools::freq(nhefs4[x])
})
```
[Based on this [tidy example](https://drsimonj.svbtle.com/quick-plot-of-all-variables).]{.marginnote}
A few plots may help us, too:
```{r eda-density, warning = FALSE, dev = "svg"}
library(ggplot2)
library(ggthemes)
nhefs4 %>%
dplyr::select(all_of(contvars)) %>%
tidyr::gather() %>%
ggplot(aes(x = value)) +
facet_wrap(~key, scales = "free") +
geom_density() +
theme_tufte(base_size = 16)
```
```{r eda-bars, warning = FALSE, dev = "svg"}
nhefs4 %>%
dplyr::select(all_of(catvars)) %>%
tidyr::gather() %>%
ggplot(aes(x = value)) +
facet_wrap(~key, nrow = 2, scales = "free") +
geom_bar(stat = "count", width = 0.5, fill = "lightgray") +
theme_tufte(base_size = 16)
```
Don't worry just now about the plot code. We'll be looking in some detail at the ggplot2 package later.
## Tabular Analysis
Soon you will be no stranger to two-by-two tables and contingency tables. Introductory statistics and epidemiologic methods courses will require you to analyze such tables, either to measure the strength of an exposure's effect on some outcome or to establish a statistical association between two variables.
Going back to our NHEFS data, let's say we're interested in a possible association between quitting smoking and weight loss. The `qsmk` variable is a binary indicator of whether or not an individual quit smoking between 1971 and 1982, while the `sbp` variable measures an individual's systolic blood pressure in 1982.
Typically, we would not want to dichotomize a continuous variable like `sbp`, but for the sake of this example, let's begin by creating a binary indicator of whether a subject had a high (SBP > 140) or low (SBP <= 140) systolic blood pressure in 1982.
```{r tabanalysis-sbp}
# check whether any values of sbp are missing
any(is.na(nhefs$sbp))
# count the number of missing observations
# is.na() will return TRUE if an observation is missing and FALSE if it is not
# we can use sum() to add up the number of TRUE values because R will treat a
# logical variable as numeric for this purpose, where TRUE = 1 and FALSE = 0
sum(is.na(nhefs$sbp))
# create the new variable, called sbp_hi
nhefs$sbp_hi <- NA #initialize with NA
nhefs$sbp_hi[nhefs$sbp > 140] <- 1
nhefs$sbp_hi[nhefs$sbp <= 140] <- 0
# make sure we coded the new variable correctly
summary(nhefs$sbp[nhefs$sbp_hi == 1])
summary(nhefs$sbp[nhefs$sbp_hi == 0])
```
The summary statistics we generated above should comply with our expectations: those we marked as having gained weight all have positive values of `sbp_hi`, while those we marked as not having gained weight all have negative values of `sbp_hi`.
Note, too, that we have `r sum(is.na(nhefs$sbp_hi))` missing values in our new variable, just as we saw in the original outcome.
```{r tabanalysis-2x2}
with(nhefs, table(sbp_hi, qsmk, exclude = NULL))
```
The option `exclude = NULL` above includes missing values in the table output. We'll ignore them for simplicity.
After doing so, we're left with the following two-by-two table for analysis.
```{r tabanalysis-pretty-2x2}
nhefs %>%
count(qsmk, sbp_hi) %>%
na.omit() %>%
pivot_wider(
names_from = sbp_hi,
values_from = n,
names_pref = "sbp"
)
```
A natural choice here would be to conduct a two-sample test for the equality of proportions, where we compare the proportion of subjects in the quit smoking group (`r with(nhefs %>% select(qsmk, sbp_hi) %>% na.omit(), round(mean(sbp_hi[qsmk == 1]), 3))`) who had high systolic blood pressure in 1982 versus the proportion in the non-quit smoking group (`r with(nhefs %>% select(qsmk, sbp_hi) %>% na.omit(), round(mean(sbp_hi[qsmk == 0]), 3))`).
```{r tabanalysis-proptest}
ptest <- prop.test(c(109, 251), c(109 + 284, 251 + 908))
ptest
```
The output above shows us the group proportions, the $\chi^2$ test statistic, and the corresponding *p*-value. We also get a 95% confidence interval for the difference in proportions.
To get a tidier output as a `tibble`, the `broom` package often proves useful:
```{r tabanalysis-broom-2x2}
broom::tidy(ptest)
```
We also could have given the table directly to the `prop.test()` function:
[**Know Your Output** Notice that the proportions and, as a result, the confidence interval given by the second `prop.test()` are different than the first, even though the $\chi^2$ statistic and the *p*-value are the same. The statistical test we used applied to the distribution of cell counts in the table. However, if we needed to report the difference in proportions and its confidence interval, the second method would have given us an incorrect estimate. Always be sure you understand functions' outputs!]{.marginnote .warning}
```{r tabanalysis-tabinput-proptest}
with(nhefs, prop.test(table(qsmk, sbp_hi)))
```
## Regression Models
In this section we will fit some regression models that you may encounter in classes or in your own research. We will leave the theoretical background to your statistics and methods courses but aim to provide enough information so that you understand the basic idea behind each approach and can refer back to this page in the future.
All upcoming examples use variables from the NHEFS dataset that we've referred back to several times by now.
[Note that you can pass all of the fit objects below to `broom::tidy()` in order to retrieve a data frame containing the regression information. Doing so will often make exporting tables and results much easier for you. However, we will focus on the default summary output for now.]{.marginnote}
### Linear regression
Exposure of interest: quitting smoking between 1971 and 1982 (`qsmk`) Outcome: systolic blood pressure in 1982 (continuous, `sbp`)
First, we can run an unadjusted model with weight change as the outcome and quitting smoking as the predictor variable. The symbol `~` indicates that the statement `sbp ~ qsmk` is a formula.
```{r regression-linear-unadj}
fitlm1 <- lm(sbp ~ qsmk, data = nhefs)
summary(fitlm1)
```
You can think of the `fitlm1` object that we saved as a list containing information about the model we fit:
```{r regression-linear-unadj-elements}
names(fitlm1)
```
We can output coefficients,
```{r regression-linear-unadj-elements-coef}
fitlm1$coefficients
```
residuals,
```{r regression-linear-unadj-elements-resid}
head(fitlm1$residuals)
```
and the data on which R fitted the model:
```{r regression-linear-unadj-elements-data}
head(fitlm1$model)
```
::: detour
<p style="text-align:center;">
**Detour**
</p>
Why did we bother to note that the model object saves the "data on which R fitted the model"? Didn't R fit the model on the data we told it to?
To answer this question, let's take a brief detour and check that the data frame saved in `fitlm1` has the same number of rows as the `nhefs` data frame we fed to the `lm()` function.
```{r regression-linear-unadj-elements-datarows}
nrow(fitlm1$model)
nrow(nhefs)
```
Interesting! The number of rows in each data set don't match, but why?
Recall that earlier on this page, we found `r sum(is.na(nhefs$sbp))` missing values of the outcome `sbp`, and if we subtract `r nrow(fitlm1$model)` from `r nrow(nhefs)`, we get `r nrow(nhefs) - nrow(fitlm1$model)`. That's because *R's modeling methods omit rows with missing data for any variables included in the model formula*. The model object saves the row numbers of dropped observations in `fitlm1$na`, and we can verify by extracting these rows from the original `nhefs` data set:
```{r regression-linear-unadj-elements-na}
# extract row numbers that were dropped from the fitlm1 model
# keep only sbp column and see that they're all missing
nhefs[fitlm1$na, c("sbp")] %>%
count(sbp)
```
Not to belabor the point, but let's convince ourselves a bit more by identifying the rows with missing values for `sbp` and comparing this to the list R made:
```{r regression-linear-unadj-compare-drop-rows}
# identify rows with missing values for sbp
# vector of numeric row indices
missindex <- which(is.na(nhefs$sbp))
## sort vectors in numerical order, generate a cross table
## NOTE sorting is unecessary in this case, but it's good
## practice to ensure elements are ordered properly
table(sort(missindex) == sort(fitlm1$na))
```
The index vector we created manually matches the one R created and saved to `fitlm1`.
:::
R also provides functions designed to extract specific information from model fits, though we'll suppress the output since we already printed out some of this information above:
```{r regression-linear-extraction-funs, eval = F}
# extract model coefficients
coef(fitlm1)
# extract model residuals
residuals(fitlm1)
```
Let's look at some plot diagnostics to make sure our linear regression complies with the model's assumptions:
[`par(mfrow = c(2, 2))` tells R that we want a 2x2 plot grid. If you ran `plot(fitlm1)` by itself, R would output four images in succession.]{.marginnote}
```{r regression-linear-unadj-diag, fig.asp = 1, dev = "svg"}
par(mfrow = c(2, 2))
plot(fitlm1)
```
Based on the Q-Q plot, we may have some violations of the normality assumption (*i.e.*, residuals are not normally distributed), as evidenced by the points' departure from the diagonal line. In a real analysis we would need to address this issue, possibly by transforming the outcome. However, our focus here is on implementing models in code, so we will proceed, leaving issues of statistical practice to your relevant coursework.
Our unadjusted estimate, therefore, suggests that those who did not quit smoking between 1971 and 1982 had a `r round(coef(fitlm1)['qsmk'], 2)` mm/Hg higher systolic blood pressure in 1982, on average, versus those who did quit smoking during that timeframe.
The unadjusted linear regression we just ran is actually a special case of the two-sample *t*-test (assuming equal variances between smoking group), to which you'll be introduced early on in your stats classes. Compare the output of the code below to that of the linear regression.
[**Comparing Outputs** What are the *t* statistics?<br /><br /> What are the *p*-values for the difference in weight gain between quit smoking groups?<br /><br /> What are the degrees of freedom?<br /><br /> When you take the difference of "mean of x" and "mean of y" from the output of `t.test()`, do you find that value in the linear regression summary?]{.marginnote .definition}
```{r}
with(nhefs, t.test(sbp[qsmk == 1], sbp[qsmk == 0], var.equal = T))
```
The code below fits a linear model for the effect of `qsmk` on `sbp` adjusted for years of smoking (`smokeyrs`), intensity of smoking (`smokeintensity`), diabetes, sex, and age. Note that we allow for an "interaction" between sex and age, denoted by `sex * age`.
```{r regression-linear-adj}
fitlm2 <- lm(
sbp ~ qsmk + smokeyrs + smokeintensity + diabetes + sex * age,
data = nhefs
)
summary(fitlm2)
```
Our adjusted estimate suggests that quitting smoking was associated with an average `r round(coef(fitlm2)['qsmk'], 2)` kg increase in weight between 1971 and 1982, conditional on the other variables we included in the second linear model.
Almost always, we will want to report the standard error estimates and/or confidence intervals with our measures of effect. The `broom::tidy()` function makes extracting 95% confidence intervals from regression fit objects (*i.e.*, objects like `fitlm1` or `fitlm2`):
```{r regression-linear-adj-tidy-95ci}
broom::tidy(fitlm2, conf.int = TRUE) %>%
dplyr::filter(term == "qsmk")
```
Alternatively, we can calculate the same confidence interval manually:
```{r regression-linear-adj-manual-95ci}
# point estimate
est <- unname(coef(fitlm2)["qsmk"])
# standard error of point estimate
se <- unname(summary(fitlm2)$coefficients[, c("Std. Error")]["qsmk"])
# lower and upper bounds of 95% confidence limits
ll <- est - 1.96 * se
ul <- est + 1.96 * se
c(estimate = est, ll95 = ll, ul95 = ul)
```
### Logistic regression
Let's revisit our tabular analysis, in which we were interested in whether quitting smoking affected the likelihood of a subject's gaining weight between 1971 and 1982.
We can answer the same question with a logistic regression model, using the `glm()` function.
```{r regression-logistic}
fitlog1 <- glm(
sbp_hi ~ qsmk,
data = nhefs,
family = binomial()
)
summary(fitlog1)
```
From this regression, we see that smokers who quit had an average increase of `r round(coef(fitlog1)["qsmk"], 3)` in their log-odds of weight gain between 1971 and 1982.
Usually, you will be asked to report associations from logistic regressions as odds ratios, in which case you would simply exponentiate the coefficient of interest. In this case we would run `exp(0.43304)` and get an odds ratio of `r round(exp(coef(fitlog1)["qsmk"]), 2)`. That is, those who quit smoking had `r round(exp(coef(fitlog1)["qsmk"]), 2)` times the odds of gaining weight between 1971 and 1982, compared to those who did not quit smoking.
As in the linear regression example, you could run a model adjusting for other factors, in which case you would modify the model formula accordingly. We will forego running an adjusted model here to avoid redundancy.
### Log-binomial regression
While the logistic regression remains the most frequently used generalized linear model (GLM) for binary outcomes, epidemiologists are often interested in risk differences and risk ratios.
Luckily, we can use the log-binomial model to estimate contrasts in risks.
```{r regression-logbin}
fitlbin <- glm(
sbp_hi ~ qsmk,
data = nhefs,
family = binomial(link = "log")
)
summary(fitlbin)
```
To get the risk ratio for the effect of quitting smoking on weight gain, we again exponentiate the beta coefficient of interest and, in doing so, estimate that those who quit smoking were `r round(exp(coef(fitlbin)["qsmk"]), 3)` times as likely to gain weight versus non-smokers.
[**Try this** Calculate the risk ratio manually and convince yourself that the log-binomial model gave you the correct answer.]{.marginnote .idea}
### Modified Poisson regression
Another way to get a risk ratio using GLMs involves a procedure called modified Poisson regression. "Modified" here refers to a necessary modification of the standard error estimates when using a Poisson regression model to estimate a risk ratio. Again, save the statistical details for later. All we want to know for the moment are the nuts-and-bolts: how to fit the model and how to modify the standard errors.
```{r regression-modpois}
fitmodpois <- glm(
sbp_hi ~ qsmk,
data = nhefs,
family = poisson()
)
summary(fitmodpois)
```
Let's compare the `qsmk` coefficient estimate to the one we got using the log-binomial model:
```{r regression-compare-lbin-modpois}
modcompare <- rbind(
broom::tidy(fitlbin) %>% mutate(model = "Log-binomial"),
broom::tidy(fitmodpois) %>% mutate(model = "Modified Poisson")
) %>%
filter(term == "qsmk") %>%
select(model, estimate, std.error)
modcompare %>%
knitr::kable()
```
Almost exactly the same point estimate, but we can see that the standard error from the modified Poisson model is considerably larger than that of the log-binomial model's.
To get a correct standard error for the modified Poisson model, we need to use a "robust" estimate of this error, also called the sandwich estimator:
```{r sandwich}
# sandwich() provided by the sandwich package
# first, get the variance-covariance matrix
swvar_modpois <- sandwich::sandwich(fitmodpois)
# extract the variance of the qsmk coefficient and
# take the square root to get the standard error
sqrt(swvar_modpois[2, 2])
```
Compare this modified standard error to that of the log-binomial model. We've accounted for the conservative estimate in the original modified Poisson model.
For more complex models, the log-binomial and modified Poisson models may provide slightly different answers. Each has other benefits and drawbacks that you will learn about if and when you take a class that introduces these methods.