/
birthweight.Rmd
171 lines (122 loc) · 9.68 KB
/
birthweight.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "APES: birth weight data example"
author:
- name: Kevin Y.X. Wang
affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Garth Tarr
affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Jean Y.H. Yang
affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Samuel Mueller
affiliation: Department of Mathematics and Statistics, Macquarie University, Australia
output:
rmarkdown::html_vignette:
fig_width: 8
fig_height: 6
vignette: |
%\VignetteIndexEntry{APES: birth weight data example}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
In this vignette, we will demonstrate a real application of APES on a logistic regression model. The data we have chosen is the birth weight (`birthwt`) data from the `MASS` package. In this data, there are 189 observations and the binary response variable is whether an infant is in the low-weight group (`low`). There are eight predictor variables in total comprising of two numeric variables (`age` and `lwt`) and six others are factor variables.
# Loading packages and setting up the data
```{r}
library(APES)
library(MASS)
library(tidyverse)
data("birthwt", package = "MASS")
theme_set(theme_classic(14) +
theme(legend.position = "bottom"))
```
We will simplify the data using the code below.
```{r}
bwt <- with(birthwt, {
race <- factor(race, labels = c("white", "black", "other"))
ptd <- factor(ptl > 0)
ftv <- factor(ftv)
levels(ftv)[-(1:2)] <- "2+"
data.frame(low = factor(low), ## indicator for low-weight group
age, ## mother's age in years
lwt, ## mother's weight in pounds at last menstrual period
race, ## mother's race (white, black, other)
smoke = (smoke > 0), ## smoking status during pregnancy
ptd, ## indicator for previous premature labours
ht = (ht > 0), ## indicator for history of hypertension
ui = (ui > 0), ## indicator for uterine irritability
ftv ## number of physician visits during the first trimester
)
})
glimpse(bwt)
```
# Single run of APES on birthweight data
We will first fit a "full" logistic regression model utilising all available variables to get a preliminary understanding of the data and the model.
```{r}
full_model = glm(low ~ ., family = binomial, data = bwt)
round(summary(full_model)$coef, 3)
```
While on the outset, many of these variables would have seem to have an impact on the birth weight of an infant, the summary of the logistic model (in particular the p-values) imply that some variables are redundant. So we may wish to perform variable selection on this data using the `APES` package with the main function being `apes`.
```{r}
apes_result = apes(model = full_model)
apes_result
```
The `apes_result` gives us an indiaction as to which variables are considered to be the most important by the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) under an approximated exhaustive search. Note that APES handles a factor variable in similar style as other `R` packages by (alphabetically) selecting a level in the factor as the reference level and compare all other levels against this reference.
# Variable selection stability
Suppose we are using the BIC as the main selection criterion, a single application of APES suggests that we should only select the `ptd` variable. However, [previous studies into this data](http://garthtarr.github.io/mplot/articles/birthweight.html) suggested that this data possess several interesting characteristics, one being variable selection instability. This means that the variables we select based on a single application of any model selection method cannot always be reproduced if the data is slightly perturbed. One way we can confirm the existence of this instability is to perform a bootstrap resampling of the observations, this introduces a perturbation on the data which affects model selection.
```{r}
set.seed(2020)
boot_bwt = bwt[sample(1:nrow(bwt), nrow(bwt), replace = TRUE), ]
full_model2 = glm(low ~ ., family = binomial, data = boot_bwt)
round(summary(full_model2)$coef, 3)
apes(full_model2)
```
We can see that when we perturb on the data, the p-value for the `ht` variable increases from 0.008 to 0.054, which could cause confusion if we take a naive approach to only select variables with p-values < 0.05. Similarly, this perturbation also affects APES, as we have now selected three extra variables under the BIC. Variable selection stability is not a criticism of any specific methods, but rather, it is an issue for data with ambiguous signals. We will describe how the `APES` package uses bootstrap resampling to improve variable selection stability in the next section.
# Bootstrapping with the `APES` package
One way that we can improve variable selection stability is to apply `APES` on many bootstrap resampling of the data and then average the computed results. As the main motivation of APES is to make exhaustive selection fast and it is therefore also ideally suited for such a procedure which requires a large amount of computation.
Running the `apes` function with an extra argument `n_boot` automatically invokes bootstrapping sampling on the rows of the data. The returned object is a `boot_apes` class object which stores all bootstrapped `apes` objects. This class comes with generic `print`, `summary` and `plot` methods, which make the interaction with such an object much easier.
The `summary` method summarises the averaged selection (empirical) probability of each variable for a given information criterion. In the next section, we will explore the `plot` generic methods for this object class.
```{r, results = 'hide'}
boot_result = apes(full_model, n_boot = 100)
class(boot_result)
print(boot_result)
head(names(boot_result))
summary(boot_result, ic = "BIC")
summary(boot_result, ic = "AIC")
```
# Variable importance plot
The variable importance plot is a technique explored in Murray et. al. (2013) and it shows the stability of each variable as empirical probability of selection against different penalty terms, assuming the a general information criterion formulation $-2 \ell + \lambda (p + 1) $, where $\ell$ is the log-likelihood of a model, `\lambda` is the penalty term and $p$ is the number of predictors (excluding the intercept term). The most important difference between this `plot` method and the `summary`/`method` methods is that we are no longer fixated on a single information criterion such as the AIC (where $\lambda = 2$) and the BIC (where $\lambda = \log(p)$). Instead, we can visualise the selection with ever increasing penalty term $\lambda$. Variables with stronger selection stability are those with high probabilities of selection with increasingly higher levels of penalisation.
We can see that the variables in order of strength of stability are `ptd`, `ht` and `lwt`.
```{r}
plot(boot_result, type = "vip")
```
## Tile version of VIP plot
This plot is identical in construction as the VIP plot above. However, the probability of selection is used as colours in a tile plot. The most stably selected variables are on the top of the y-axis.
```{r}
plot(boot_result, type = "vip_tile")
```
## Model averaged coefficient plot
APES stores all the coefficient estimates in each bootstrap run. We can use this information to compute cumulative averages of variable coefficients. This plot allows us to examine the stability of the model coefficient estimates whereas the previous VIP plots shows only the stability of variable selection.
```{r}
plot(boot_result, type = "ma")
```
## Information criterion pathway plot
During the bootstrap computations, APES records the best AIC/BIC models across all bootstrap runs. Due to the induced perturbation, the AIC/BIC-best model in each bootstrapped data do not always coincide. One way to examine the differences between these models is to look into the model size of the best selected model.
Here, each bootstrap run is represented by a black curve with the BIC-selected model of each run coloured as red. We can see that it is rare for a model of size 2 to be selected when using the BIC, with the majority of the models being between 3 and 6 in model size. This is a reason why we should perform such a bootstrap procedure to examine the model selection stability as a single run of APES only identifies a model of size two under the BIC selection criterion.
```{r}
plot(boot_result, type = "path", order = "BIC")
```
# Parallel processing
For large number of bootstrap runs, one might consider using parallel processing to reduce the total computational time. The parallel backend support of `APES` uses the `furrr` package. To invoke parallel processing, this requires a single parameter `workers`.
```{r, eval = FALSE}
parallel_result = apes(model = full_model, n_boot = 100, workers = 2)
parallel_result
```
# Reference
+ Mueller, S. and Welsh, A. H. (2010), On model selection curves. International Statistical Review, 78:240-256. doi: 10.1111/j.1751-5823.2010.00108.x
+ Murray, K., Heritier, S. and Mueller, S. (2013), Graphical tools for model selection in generalized linear models. Statistics in Medicine, 32:4438-4451. doi: 10.1002/sim.5855
+ Tarr G, Mueller S and Welsh AH (2018). mplot: An R Package for Graphical Model Stability and Variable Selection Procedures. Journal of Statistical Software, 83(9), pp. 1-28. doi: 10.18637/jss.v083.i09
+ Wang, K. Y., Tarr, G., Yang, J. Y., & Mueller, S. (2019). Fast and approximate exhaustive variable selection for generalised linear models with APES. Australian & New Zealand Journal of Statistics, 61(4), 445–465. https://doi.org/10.1111/anzs.12276
# Session Info
```{r}
sessionInfo()
```