vignettes/articles/birthweight.Rmd

---
title: "APES: birth weight data example"
author:
- name: Kevin Y.X. Wang
  affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Garth Tarr
  affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Jean Y.H. Yang
  affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
- name: Samuel Mueller
  affiliation: Department of Mathematics and Statistics, Macquarie University, Australia
output: 
  rmarkdown::html_vignette:
    fig_width: 8
    fig_height: 6
vignette: |
  %\VignetteIndexEntry{APES: birth weight data example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction 

In this vignette, we will demonstrate a real application of APES on a logistic regression model. The data we have chosen is the birth weight (`birthwt`) data from the `MASS` package. In this data, there are 189 observations and the binary response variable is whether an infant is in the low-weight group (`low`). There are eight predictor variables in total comprising of two numeric variables (`age` and `lwt`) and six others are factor variables.

# Loading packages and setting up the data

```{r}
library(APES)
library(MASS)
library(tidyverse)
data("birthwt", package = "MASS")

theme_set(theme_classic(14) +
            theme(legend.position = "bottom"))
```

We will simplify the data using the code below.

```{r}
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), ## indicator for low-weight group
             age, ## mother's age in years
             lwt, ## mother's weight in pounds at last menstrual period
             race, ## mother's race (white, black, other)
             smoke = (smoke > 0), ## smoking status during pregnancy
             ptd, ## indicator for previous premature labours
             ht = (ht > 0), ## indicator for history of hypertension
             ui = (ui > 0), ## indicator for uterine irritability
             ftv ## number of physician visits during the first trimester
             )
})

glimpse(bwt)
```

# Single run of APES on birthweight data

We will first fit a "full" logistic regression model utilising all available variables to get a preliminary understanding of the data and the model. 

```{r}
full_model = glm(low ~ ., family = binomial, data = bwt)
round(summary(full_model)$coef, 3)
```

While on the outset, many of these variables would have seem to have an impact on the birth weight of an infant, the summary of the logistic model (in particular the p-values) imply that some variables are redundant. So we may wish to perform variable selection on this data using the `APES` package with the main function being `apes`.

```{r}
apes_result = apes(model = full_model)
apes_result
```

The `apes_result` gives us an indiaction as to which variables are considered to be the most important by the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) under an approximated exhaustive search. Note that APES handles a factor variable in similar style as other `R` packages by (alphabetically) selecting a level in the factor as the reference level and compare all other levels against this reference. 

# Variable selection stability

Suppose we are using the BIC as the main selection criterion, a single application of APES suggests that we should only select the `ptd` variable. However, [previous studies into this data](http://garthtarr.github.io/mplot/articles/birthweight.html) suggested that this data possess several interesting characteristics, one being variable selection instability. This means that the variables we select based on a single application of any model selection method cannot always be reproduced if the data is slightly perturbed. One way we can confirm the existence of this instability is to perform a bootstrap resampling of the observations, this introduces a perturbation on the data which affects model selection. 

```{r}
set.seed(2020)
boot_bwt = bwt[sample(1:nrow(bwt), nrow(bwt), replace = TRUE), ]
full_model2 = glm(low ~ ., family = binomial, data = boot_bwt)
round(summary(full_model2)$coef, 3)
apes(full_model2)
```

We can see that when we perturb on the data, the p-value for the `ht` variable increases from 0.008 to 0.054, which could cause confusion if we take a naive approach to only select variables with p-values < 0.05. Similarly, this perturbation also affects APES, as we have now selected three extra variables under the BIC. Variable selection stability is not a criticism of any specific methods, but rather, it is an issue for data with ambiguous signals. We will describe how the `APES` package uses bootstrap resampling to improve variable selection stability in the next section. 

# Bootstrapping with the `APES` package

One way that we can improve variable selection stability is to apply `APES` on many bootstrap resampling of the data and then average the computed results. As the main motivation of APES is to make exhaustive selection fast and it is therefore also ideally suited for such a procedure which requires a large amount of computation. 

Running the `apes` function with an extra argument `n_boot` automatically invokes bootstrapping sampling on the rows of the data. The returned object is a `boot_apes` class object which stores all bootstrapped `apes` objects. This class comes with generic `print`, `summary` and `plot` methods, which make the interaction with such an object much easier. 

The `summary` method summarises the averaged selection (empirical) probability of each variable for a given information criterion. In the next section, we will explore the `plot` generic methods for this object class. 

```{r, results = 'hide'}
boot_result = apes(full_model, n_boot = 100)
class(boot_result)
print(boot_result)
head(names(boot_result))
summary(boot_result, ic = "BIC")
summary(boot_result, ic = "AIC")
```


# Variable importance plot

The variable importance plot is a technique explored in Murray et. al. (2013) and it shows the stability of each variable as empirical probability of selection against different penalty terms, assuming the a general information criterion formulation $-2 \ell + \lambda (p + 1) $, where $\ell$ is the log-likelihood of a model, `\lambda` is the penalty term and $p$ is the number of predictors (excluding the intercept term). The most important difference between this `plot` method and the `summary`/`method` methods is that we are no longer fixated on a single information criterion such as the AIC (where $\lambda = 2$) and the BIC (where $\lambda = \log(p)$). Instead, we can visualise the selection with ever increasing penalty term $\lambda$. Variables with stronger selection stability are those with high probabilities of selection with increasingly higher levels of penalisation. 

We can see that the variables in order of strength of stability are `ptd`, `ht` and `lwt`.

```{r}
plot(boot_result, type = "vip")
```

## Tile version of VIP plot

This plot is identical in construction as the VIP plot above. However, the probability of selection is used as colours in a tile plot. The most stably selected variables are on the top of the y-axis.

```{r}
plot(boot_result, type = "vip_tile")
```

## Model averaged coefficient plot 

APES stores all the coefficient estimates in each bootstrap run. We can use this information to compute cumulative averages of variable coefficients. This plot allows us to examine the stability of the model coefficient estimates whereas the previous VIP plots shows only the stability of variable selection. 

```{r}
plot(boot_result, type = "ma")
```

## Information criterion pathway plot

During the bootstrap computations, APES records the best AIC/BIC models across all bootstrap runs. Due to the induced perturbation, the AIC/BIC-best model in each bootstrapped data do not always coincide. One way to examine the differences between these models is to look into the model size of the best selected model.

Here, each bootstrap run is represented by a black curve with the BIC-selected model of each run coloured as red. We can see that it is rare for a model of size 2 to be selected when using the BIC, with the majority of the models being between 3 and 6 in model size. This is a reason why we should perform such a bootstrap procedure to examine the model selection stability as a single run of APES only identifies a model of size two under the BIC selection criterion. 

```{r}
plot(boot_result, type = "path", order = "BIC")
```

# Parallel processing

For large number of bootstrap runs, one might consider using parallel processing to reduce the total computational time. The parallel backend support of `APES` uses the `furrr` package. To invoke parallel processing, this requires a single parameter `workers`. 

```{r, eval = FALSE}
parallel_result = apes(model = full_model, n_boot = 100, workers = 2)
parallel_result
```


# Reference 

+ Mueller, S. and Welsh, A. H. (2010), On model selection curves. International Statistical Review, 78:240-256. doi: 10.1111/j.1751-5823.2010.00108.x

+ Murray, K., Heritier, S. and Mueller, S. (2013), Graphical tools for model selection in generalized linear models. Statistics in Medicine, 32:4438-4451. doi: 10.1002/sim.5855

+ Tarr G, Mueller S and Welsh AH (2018). mplot: An R Package for Graphical Model Stability and Variable Selection Procedures. Journal of Statistical Software, 83(9), pp. 1-28. doi: 10.18637/jss.v083.i09

+ Wang, K. Y., Tarr, G., Yang, J. Y., & Mueller, S. (2019). Fast and approximate exhaustive variable selection for generalised linear models with APES. Australian & New Zealand Journal of Statistics, 61(4), 445–465. https://doi.org/10.1111/anzs.12276

# Session Info
```{r}
sessionInfo()
```