analysis/susie_jrssb_example.Rmd

---
title: Illustration of SuSiE in a "difficult" situation
author: Peter Carbonetto
output: workflowr::wflow_html
---

Here we explore the behaviour of SuSiE in a toy example where the
correlations between the predictors are not so straightforward.
Despite being more complicated, SuSiE still correctly infers Credible
Sets.

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold")
```

Load packages
-------------

The mvtnorm package is used to simulate the data.

```{r load-pkgs}
library(mvtnorm)
library(varbvs)
library(susieR)
```

Overview
--------

One of the referees was interested in understanding the behaviour of
SuSiE in an example in which the correlation structure is slightly
more complex than the motivating example provided in the paper.

Consider five correlated predictors in a simple linear regression, $y =
x_1\beta_1, \ldots, x_5\beta_5$, in which only two variables, $x1$ and
$x3$, have an effect on the outcome, $y$; that is, the coefficients
$\beta_1$ and $\beta_3$ are both nonzero. The first two predictors are
very strongly correlated with each other, and the third and fourth
predictors are also very strongly correlated. Therefore, it will be
difficult to distinguish between $x1$ and $x2$, and likewise for $x3$
and $x4$. This correlation structure is the same as the simple
motivating example given in the paper.

To complicate the example, there is a fifth predictor that has no
effect on the outcome, and is also strongly correlated with the other
variables. This is the correlation matrix for the five predictors:

```{r correlation-matrix}
S <- rbind(c(   1, 0.99,  0.8,  0.8, 0.9),
           c(0.99,    1,  0.8,  0.8, 0.9),
           c( 0.8,  0.8,    1, 0.99, 0.9),
           c( 0.8,  0.8, 0.99,    1, 0.9),
           c( 0.9,  0.9,  0.9,  0.9,   1))
```

Therefore, without an abundance of data, it will be difficult to
distinguish between variables 1, 2 and 5, and likewise for variables
3, 4 and 5. The inferential statement that would best capture our
uncertainty about which variables affect the outcome would look
something like this:

$$\mbox{Among coefficients $\beta_1, \ldots, \beta_5$, two (or more)
are not zero}$$

SuSiE is not designed to produce such inferential statements,
but could state instead:

$$(\beta_1 \neq 0 \mbox{ or } \beta_2 \neq 0 \mbox{ or } \beta_5 \neq
0) \mbox{ and } (\beta_3 \neq 0 \mbox{ or } \beta_4 \neq 0)$$

Another valid statement could be:

$$(\beta_1 \neq 0 \mbox{ or } \beta_2 \neq 0) \mbox{ and } (\beta_3
\neq 0 \mbox{ or } \beta_4 \neq 0 \mbox{ or } \beta_5 \neq 0)$$

Although neither of these statements completely capture our
uncertainty about which variables affect the outcome, it does satisfy
our aim as it is stated in the paper:

"A key feature of our method, which distinguishes it from most
existing BVSR methods, is that it produces *Credible Sets* of
variables that quantify uncertainty in which variable should be
selected when multiple, highly correlated variables compete with one
another. These Credible Sets are designed to be as small as possible
while still each capturing a relevant variable.

The key point here is that the Credible Sets are not intended to
capture full uncertainty in which variables should be selected. But
they are intended to capture *all relevant variables*.

Below, we investigate these claims empirically.

Simulate data
-------------

We simulate 200 samples in this scenario.

```{r simulate-data}
set.seed(1)
n <- 200
b <- c(1, 0, -1, 0, 0)
X <- rmvnorm(n,sigma = S)
y <- drop(X %*% b + rnorm(n)/0.3)
```

Multiple regression using varbvs
--------------------------------

First, to provide a counterexample that does not accomplish what we
set out to achieve, we fit the "varbvs" model, and show that varbvs,
although similar to SuSiE in many ways, does not adequately capture
uncertainty in the selected variables.

```{r fit-varbvs}
fit1 <- varbvs(X,NULL,y,logodds = log10(0.5),sa = 1,verbose = FALSE)
summary(fit1)$top.vars
```

varbvs has correctly determined that two variables are useful for
predicting the outcome. However, it incorrectly placed all the weight
on the first and third variables. This is expected behaviour---it is
due to the particular variational approximation used in varbvs---but
it is undesirable when we want to better quantify uncertainty in the
correlated variables.

Multiple regression using SuSiE
-------------------------------

Next, we fit a SuSiE model. (For simplicity, we fit a SuSiE model with
$L = 2$, but SuSiE will give similar results for $L > 2$.)

```{r fit-susie}
fit2 <- susie(X,y,L = 2,standardize = FALSE,estimate_prior_variance = FALSE,
              scaled_prior_variance = 1,min_abs_corr = 0)
susie_get_cs(fit2,X,min_abs_corr = 0)$cs
```

SuSiE identified two Credible Sets:

$$(\beta_1 \neq 0 \mbox{ or } \beta_2 \neq 0 \mbox{ or } \beta_5 \neq
0) \mbox{ and } (\beta_3 \neq 0 \mbox{ or } \beta_4 \neq 0 \mbox{ or }
\beta_5 \neq 0)$$

Looking more closely at the posterior inclusion probabilities, most of
the weight is assigned to $x_3$ and $x_4$ in the first Credible Set,
and to $x_1$ and $x_2$ in the second Credible Set, and, in light of
the strong correlations with $x_5$, *each Credible Set also allows for
the possibility that $x_5$ affects the outcome:*

```{r susie-inclusion-probabilities}
round(fit2$alpha,digits = 3)
```

Therefore, although this statement is not the same as, "among
coefficients $\beta_1, \ldots, \beta_5$, two (or more) are not zero",
SuSiE still has correctly achieves the useful aim of identifying
"Credible Sets" that capture all relevant variables.