# Assignment 3

**Joris LIMONIER**

_Note:_ To limit scrolling, this notebook only contains calls to functions. The actual function bodies are in the `assignment_utils.py` file.


In [1]:
import pandas as pd
import numpy as np
import stan
from assignment_utils import *

%load_ext autoreload
%autoreload 2

## Exercise

In this exercise, you will use the ADNI dataset from the previous lesson.

---

### Preparation

We first load the data and describe it.


In [2]:
adni = ADNI()
adni.diag.describe()


| | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| count | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 |
| mean | 2686.322034 | 0.525424 | 0.38862 | 74.451574 | 1011453.0 | 1521185.0 | -0.001905 |
| std | 2062.148046 | 0.65871 | 0.487732 | 6.648689 | 111362.3 | 168055.6 | 1.000892 |
| min | 2.0 | 0.0 | 0.0 | 55.1 | 727478.0 | 1100687.0 | -2.765395 |
| 25% | 673.25 | 0.0 | 0.0 | 70.5 | 932946.5 | 1396231.0 | -0.719043 |
| 50% | 2718.0 | 0.0 | 0.0 | 74.15 | 1008351.0 | 1504898.0 | 0.022915 |
| 75% | 4690.5 | 1.0 | 1.0 | 78.9 | 1087573.0 | 1634110.0 | 0.684545 |
| max | 5296.0 | 2.0 | 1.0 | 90.9 | 1486036.0 | 2057399.0 | 3.236658 |


We note that the `AGE` variable is distributed very similarly to a Gaussian; the next cell plots its kernel density estimate against a normal density.

In [3]:
adni.plot_kde_vs_norm()
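
As a rough numerical companion to the density plot, the summary statistics printed above can be compared with the quartiles that a Gaussian with the same mean and standard deviation would have. This is only a sketch based on the `describe()` output; it is not part of `assignment_utils.py`, and the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Summary statistics of AGE, copied from the describe() output above.
age_mean, age_std = 74.451574, 6.648689
observed_quartiles = {0.25: 70.5, 0.50: 74.15, 0.75: 78.9}

# Gaussian with the same mean and standard deviation as AGE.
reference = NormalDist(mu=age_mean, sigma=age_std)

# Compare each observed quartile with the matching Gaussian quantile.
quartile_gaps = {}
for prob, observed in observed_quartiles.items():
    expected = reference.inv_cdf(prob)
    quartile_gaps[prob] = observed - expected
    print(f"q={prob:.2f}  observed={observed:6.2f}  gaussian={expected:6.2f}")
```

All three gaps are below one year, which is consistent with the near-Gaussian shape of `AGE`, although quartiles alone say nothing about the tails.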

In [4]:
adni.plot_apoe4()

---


### Question 1

Fit a model to predict the diagnosis (DX) of the subjects using both AGE and APOE4 as predictors.

#### Answer

To avoid writing by hand the boilerplate Stan code that will be passed to PyStan, we write a function that takes the parameters, their prior distributions and the formula for $p_i$; the rest is added automatically. This enforces some conventions, such as naming the predictors `x1`, `x2`, ..., `x<number of parameters - 1>`, but leaves full freedom otherwise.

The resulting Stan code is shown just after the first call to this function. It will not be printed again in the rest of the homework. Note that the inconsistent indentation apparently has no effect on the final execution.

In [25]:
# Prepare the Stan program that will be passed to PyStan
code_age_apoe4_normal = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
    "c": "normal(0, 3)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
)
print(code_age_apoe4_normal)


    data {
      int<lower=1> N;
      int y[N];
      real x1[N];
real x2[N];
    }

    parameters {
      real a;
real b;
real c;
    }

transformed parameters {
  vector[N] p_i;
  for (i in 1:N) {
    p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i])); 
    }
}

    model {
      a ~ normal(0, 3);
b ~ normal(0, 3);
c ~ normal(0, 3);
      y ~ binomial(1, p_i);
    }
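
The body of `_generate_pystan_code` lives in `assignment_utils.py` and is not shown here. A minimal sketch of how such a generator could be written is given below. The function name `generate_stan_code` and its `n_features` argument are hypothetical, and the sketch declares arrays with the `array` keyword, which is the form recommended by the stanc deprecation message that appears when the generated code is compiled.

```python
def generate_stan_code(param_distr, p_i_formula, n_features):
    """Assemble a Stan program for a 0/1 outcome with probability p_i.

    Hypothetical stand-in for `_generate_pystan_code`; the real helper
    may differ.
    """
    # Data block: outcome y plus one array per predictor (x1, x2, ...).
    data_lines = ["  int<lower=1> N;", "  array[N] int y;"]
    data_lines += [f"  array[N] real x{k};" for k in range(1, n_features + 1)]
    # One unconstrained real parameter and one prior statement per entry.
    param_lines = [f"  real {name};" for name in param_distr]
    prior_lines = [f"  {name} ~ {distr};" for name, distr in param_distr.items()]
    blocks = [
        ["data {", *data_lines, "}"],
        ["parameters {", *param_lines, "}"],
        [
            "transformed parameters {",
            "  vector[N] p_i;",
            "  for (i in 1:N) {",
            f"    {p_i_formula}",
            "  }",
            "}",
        ],
        ["model {", *prior_lines, "  y ~ binomial(1, p_i);", "}"],
    ]
    return "\n\n".join("\n".join(block) for block in blocks)


stan_code = generate_stan_code(
    param_distr={"a": "normal(0, 3)", "b": "normal(0, 3)", "c": "normal(0, 3)"},
    p_i_formula=(
        "p_i[i] = exp(a + b * x1[i] + c * x2[i])"
        "/(1 + exp(a + b * x1[i] + c * x2[i]));"
    ),
    n_features=2,
)
print(stan_code)
```

Building the program from lists of lines keeps the indentation consistent, which sidesteps the cosmetic issue visible in the printed code above.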


Now that we have the code for PyStan, we run Stan's sampling algorithm.

In [21]:
posterior = adni.run_stan_model(
  features=["AGE", "APOE4"], program_code=code_age_apoe4_normal, num_samples=1000
)

posterior


Building...

Building: 24.0s, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   0% (1/8000)
Sampling:   0% (2/8000)
Sampling:   0% (3/8000)
Sampling:   0% (4/8000)
Sampling:   1% (103/8000)
Sampling:   3% (203/8000)
Sampling:   4% (302/8000)
Sampling:   5% (401/8000)
Sampling:   6% (501/8000)
Sampling: 

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    b: ()
    c: ()
    p_i: (826,)
Draws: 4000
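
For readers without access to `assignment_utils.py`, a call such as `run_stan_model` could plausibly be reduced to the two steps sketched below: map the chosen columns to the `x1`, `x2`, ... convention, then build and sample with PyStan. The helper names `build_stan_data` and `sample_posterior` are hypothetical, and the two rows are toy values for illustration, not ADNI records. The arithmetic at the end assumes PyStan's defaults of 4 chains and 1000 warmup iterations per chain, which matches the counts in the output above.

```python
def build_stan_data(rows, features, target="DX"):
    """Map a list of row dicts to the data block expected by the Stan code.

    Hypothetical helper: predictors are renamed x1, x2, ... in the order
    given by `features`.
    """
    data = {"N": len(rows), "y": [int(row[target]) for row in rows]}
    for k, name in enumerate(features, start=1):
        data[f"x{k}"] = [float(row[name]) for row in rows]
    return data


def sample_posterior(program_code, data, num_samples=1000, num_chains=4):
    """Build and sample with PyStan 3 (requires the `stan` package)."""
    import stan  # imported lazily so build_stan_data works without PyStan

    model = stan.build(program_code, data=data)
    return model.sample(num_chains=num_chains, num_samples=num_samples)


# Toy rows for illustration only (not taken from the ADNI data).
toy_rows = [
    {"AGE": 70.0, "APOE4": 0, "DX": 0},
    {"AGE": 82.5, "APOE4": 1, "DX": 1},
]
stan_data = build_stan_data(toy_rows, features=["AGE", "APOE4"])
print(stan_data)

# Iteration counts under the assumed defaults.
num_chains, num_warmup, num_samples = 4, 1000, 1000
total_iterations = num_chains * (num_warmup + num_samples)  # progress bar total
total_draws = num_chains * num_samples                      # reported "Draws"
print(total_iterations, total_draws)
```

With `num_samples=5000`, the same arithmetic gives 24000 iterations and 20000 draws, as reported for the models fitted on the 80-year-old subjects.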

Now we compute the WAIC and save it to a dictionary in order to get a metric to compare models later.

In [22]:
# Save results in a dict
waic_res = {"AGE + APOE4": adni.get_waic(fit=posterior)}



In [24]:
waic_res

{'AGE + APOE4': 951.828412891064}
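
The implementation of `get_waic` is not shown in this notebook. As a reminder of what the number represents, here is a minimal sketch of the usual WAIC definition on the deviance scale, $-2\,(\text{lppd} - p_{\text{WAIC}})$, for a 0/1 outcome. The function name `waic_bernoulli` and its input layout are assumptions, and probabilities are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import fmean, variance


def waic_bernoulli(y, p_draws):
    """WAIC on the deviance scale for 0/1 outcomes.

    y       : list of N outcomes (0 or 1)
    p_draws : list of S >= 2 posterior draws, each a list of N probabilities
    """
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters (penalty term)
    for i, y_i in enumerate(y):
        # Log-likelihood of observation i under every posterior draw.
        log_lik = [
            math.log(draw[i]) if y_i == 1 else math.log1p(-draw[i])
            for draw in p_draws
        ]
        # Log of the mean likelihood, computed with a log-sum-exp shift.
        top = max(log_lik)
        lppd += top + math.log(fmean([math.exp(ll - top) for ll in log_lik]))
        # Penalty: variance of the log-likelihood across draws.
        p_waic += variance(log_lik)
    return -2.0 * (lppd - p_waic)


# Two observations and two identical draws with p = 0.5 everywhere:
# the penalty is zero, so WAIC = -2 * 2 * log(0.5) = 4 * log(2).
example_waic = waic_bernoulli([1, 0], [[0.5, 0.5], [0.5, 0.5]])
print(example_waic)
```

Lower values indicate better expected out-of-sample fit, so models are compared by looking for the smallest WAIC.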

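We draw a box plot of the parameters. The box plot below suggests that the `b` parameter is close to zero, whereas `a` and `c` seem to be non-negligible, with a lot of variability for `a`. Assuming `x1` is `AGE` and `x2` is `APOE4`, as the order of `features` suggests, `b` multiplies values around 75 while `c` multiplies values in {0, 1, 2}, so the raw coefficients are not directly comparable: a small `b` does not by itself mean that age plays a minor role in the computation of `DX`.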

In [8]:
adni.get_params_box_plot(fit=posterior, model_params=["a", "b", "c"])

---


### Question 2

Consider subjects who are 80 years old and check the effect of the APOE4 gene on the diagnosis.

Hint: You'll draw many samples from two binomial distributions. One where APOE4 is included in the computation of $p_i$ and one where it's not.


#### Answer

We first restrict the dataset to subjects who are exactly 80 years old. \
We note that there are only 4 patients with this age, which is very few. Furthermore, only one of those 4 patients has developed Alzheimer's disease. This is close to what some researchers face in the medical field: a large number of features for very few observations, only a few of which correspond to sick patients.


In [9]:
adni.eighty

| | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| 139 | 230 | 0.0 | 0 | 80.0 | 1051053.0 | 1714028.0 | -1.030522 |
| 489 | 866 | 0.0 | 0 | 80.0 | 943825.0 | 1388961.0 | 0.238778 |
| 522 | 920 | 1.0 | 0 | 80.0 | 946606.0 | 1464818.0 | -0.398452 |
| 740 | 1285 | 1.0 | 1 | 80.0 | 1025968.0 | 1626103.0 | -0.691147 |


We want to compare the two following models:

1. `DX ~ `$\varnothing$
1. `DX ~ APOE4`

The first model contains only the parameter $a$, the intercept, that is:

$$
p_i = \frac{\exp(a)}{1 + \exp(a)}
$$

while the second model also contains a parameter $b$, the coefficient of the `APOE4` variable, that is:

$$
p_i = \frac{\exp(a + b x_i)}{1 + \exp(a + b x_i)}
$$
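
Both formulas apply the inverse logit function to a linear predictor. The short sketch below (the helper names `inv_logit` and `logit` are chosen here for illustration) evaluates it in a way that avoids overflow for large inputs, and shows which intercept corresponds to the observed proportion among the four 80-year-old subjects.

```python
import math


def inv_logit(z):
    """Inverse logit exp(z) / (1 + exp(z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# One of the four 80-year-old subjects has DX = 1, so the observed
# proportion is 0.25 and the matching intercept is logit(0.25) = -log(3).
a_observed = logit(1 / 4)
print(round(a_observed, 4))             # -1.0986
print(round(inv_logit(a_observed), 4))  # 0.25
```

Stan also provides a built-in `inv_logit` function, so the `p_i` formula could be written as `inv_logit(a + b * x1[i])` instead of spelling out the exponentials.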

We run the first model with a normal prior on `a`.


In [43]:
code_80_yo_intercept_normal = adni._generate_pystan_code(
  param_distr={"a": "normal(-1, 3)"}, p_i_formula="p_i[i] = exp(a)/(1 + exp(a));"
)

posterior_80_yo_intercept = adni.run_stan_model(
  features=[],
  program_code=code_80_yo_intercept_normal,
  num_samples=5000,
  data_name="80 yo",
)
posterior_80_yo_intercept


Building...

Building: 23.9s, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (6000/24000)
Sampling:  50% (12000/24000)
Sampling:  75% (18000/24000)
Sampling: 100% (24000/24000)
Sampling: 100% (24000/24000), done.
Messages received during sampling:
  Gradient evaluation took 3e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 7e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 7e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 7

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    p_i: (4,)
Draws: 20000

In [44]:
adni.get_params_box_plot(fit=posterior_80_yo_intercept, model_params=["a"])


In [63]:
waic_80_yo_intercept_normal = adni.get_waic(
  fit=posterior_80_yo_intercept, data_name="80 yo", sample_size_waic=5000
)
waic_80_yo = {"DX ~ N.A. (intercept only)": waic_80_yo_intercept_normal}
waic_80_yo


{'DX ~ N.A. (intercept only)': 6.872105058012771}

Now we fit a model for 80-year-olds that takes the `APOE4` variable into account.

In [51]:
code_apoe4_normal = adni._generate_pystan_code(
  param_distr={"a": "normal(-2, 3)", "b": "normal(1, 3)"},
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_80_yo_apoe4 = adni.run_stan_model(
  features=["APOE4"],
  program_code=code_apoe4_normal,
  num_samples=5000,
  data_name="80 yo",
)
posterior_80_yo_apoe4


Building...

Building: 24.7s, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (6000/24000)
Sampling:  50% (12000/24000)
Sampling:  75% (18000/24000)
Sampling: 100% (24000/24000)
Sampling: 100% (24000/24000), done.
Messages received during sampling:
  Gradient evaluation took 1e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
  Adjust your expectations accordingly!
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Excepti

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    b: ()
    p_i: (4,)
Draws: 20000

In [52]:
adni.get_params_box_plot(fit=posterior_80_yo_apoe4, model_params=["a", "b"])


In [64]:
waic_80_yo_apoe4_normal = adni.get_waic(
  fit=posterior_80_yo_apoe4, data_name="80 yo", sample_size_waic=5000
)
waic_80_yo["DX ~ APOE4 (normal)"] = waic_80_yo_apoe4_normal
waic_80_yo



{'DX ~ N.A. (intercept only)': 6.872105058012771,
 'DX ~ APOE4 (normal)': 6.654779102440702}

In [140]:
adni.pretty_print_waic(waic=waic_80_yo)

DX ~ N.A. (intercept only) ....................... 6.87211
DX ~ APOE4 (normal) .............................. 6.65478


We see that the model with `APOE4` has a lower (better) WAIC than the intercept-only model, but only by about 0.2. With only four subjects, this is at most a weak hint that `APOE4` is useful in predicting `DX` (Alzheimer's disease). The association is also reported by an external source ([1](https://ici.radio-canada.ca/nouvelle/1866074/alzheimer-apoe4-role-lipide-transport), in French) providing us with domain knowledge.
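
As a complement, the approach suggested in the hint can be sketched directly from the Question 1 model: for each posterior draw, compute $p_i$ at age 80 with and without the `APOE4` term, then draw a 0/1 outcome from each. In the sketch below, the function `simulate_dx` is hypothetical and the draws of `a`, `b` and `c` are synthetic placeholders generated only to make the example runnable; they are not the fitted posterior. With PyStan, one would presumably pass the flattened draws of the fit instead.

```python
import math
import random


def inv_logit(z):
    """Inverse logit; inputs here stay small, so no overflow guard is needed."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_dx(a_draws, b_draws, c_draws, age, apoe4, rng):
    """Draw one 0/1 diagnosis per posterior draw for a given subject profile."""
    outcomes = []
    for a, b, c in zip(a_draws, b_draws, c_draws):
        p = inv_logit(a + b * age + c * apoe4)
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes


rng = random.Random(0)
n_draws = 4000

# Placeholder draws for illustration only; NOT results from the fit above.
a_draws = [rng.gauss(-1.0, 0.1) for _ in range(n_draws)]
b_draws = [rng.gauss(0.0, 0.001) for _ in range(n_draws)]
c_draws = [rng.gauss(1.0, 0.1) for _ in range(n_draws)]

# Same age, with the APOE4 term switched off (0 alleles) and on (1 allele).
without_apoe4 = simulate_dx(a_draws, b_draws, c_draws, age=80, apoe4=0, rng=rng)
with_apoe4 = simulate_dx(a_draws, b_draws, c_draws, age=80, apoe4=1, rng=rng)

print("simulated P(DX=1), APOE4 = 0:", sum(without_apoe4) / n_draws)
print("simulated P(DX=1), APOE4 = 1:", sum(with_apoe4) / n_draws)
```

Comparing the two simulated proportions shows the effect of `APOE4` at a fixed age while propagating the posterior uncertainty in the parameters, and it uses all 826 subjects rather than only the four who are exactly 80.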

---



### Question 3

In the last lesson, we fitted a model to predict the diagnosis using only the size of the brain (norm_brain). Compare this model and the one from Question 1 in terms of WAIC. Is one better than the other?

#### Answer
