# Assignment 3

**Joris LIMONIER**

_Note:_ To keep scrolling manageable, this notebook only contains function calls. The function bodies live in the `assignment_utils.py` file.


In [1]:
import pandas as pd
import numpy as np
import stan
from assignment_utils import *

%load_ext autoreload
%autoreload 2

## Exercise

In this exercise, you will use the ADNI dataset from the past lesson.

---

### Preparation

We first load the data and describe it.


In [2]:
adni = ADNI()
adni.diag.describe()


| | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| count | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 |
| mean | 2686.322034 | 0.525424 | 0.38862 | 74.451574 | 1011453.0 | 1521185.0 | -0.001905 |
| std | 2062.148046 | 0.65871 | 0.487732 | 6.648689 | 111362.3 | 168055.6 | 1.000892 |
| min | 2.0 | 0.0 | 0.0 | 55.1 | 727478.0 | 1100687.0 | -2.765395 |
| 25% | 673.25 | 0.0 | 0.0 | 70.5 | 932946.5 | 1396231.0 | -0.719043 |
| 50% | 2718.0 | 0.0 | 0.0 | 74.15 | 1008351.0 | 1504898.0 | 0.022915 |
| 75% | 4690.5 | 1.0 | 1.0 | 78.9 | 1087573.0 | 1634110.0 | 0.684545 |
| max | 5296.0 | 2.0 | 1.0 | 90.9 | 1486036.0 | 2057399.0 | 3.236658 |


We note that the `AGE` variable is approximately Gaussian. For this reason, we will

In [3]:
adni.plot_kde_vs_norm()

In [4]:
adni.plot_apoe4()

---


### Question 1

Fit a model to predict the diagnosis (DX) of the subjects using both AGE and APOE4 as predictors.

#### Answer


In [20]:
# Prepare the Stan program code that will be passed to PyStan
auto_gen_pystan_code = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3);",
    "b": "normal(0, 3);",
    "c": "normal(0, 3);",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
)
print(auto_gen_pystan_code)


    data {
      int<lower=1> N;
      int y[N];
      real x1[N];
real x2[N];
    }

    parameters {
      real a;
real b;
real c;
    }

transformed parameters {
  vector[N] p_i;
  for (i in 1:N) {
    p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i])); 
    }
}

    model {
      a ~ normal(0, 3);;
b ~ normal(0, 3);;
c ~ normal(0, 3);;
      y ~ binomial(1, p_i);
    }
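The `p_i` expression in the generated program is the inverse-logit (logistic) function. As a minimal sketch, the helper below (a hypothetical `inv_logit`, not part of `assignment_utils.py`) evaluates the same quantity in a numerically stable way and compares it with a literal transcription of `exp(z) / (1 + exp(z))`. It uses only the standard library so it runs without the notebook's dependencies.

```python
import math


def inv_logit(z: float) -> float:
    """Numerically stable logistic function, equal to exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return ez / (1.0 + ez)


def naive_inv_logit(z: float) -> float:
    """Literal transcription of the formula used in the Stan program."""
    return math.exp(z) / (1.0 + math.exp(z))


# Both forms agree for moderate values of the linear predictor.
for z in (-5.0, 0.0, 2.5):
    print(z, inv_logit(z), naive_inv_logit(z))

# The stable form also handles extreme inputs without overflowing.
print(inv_logit(1000.0), inv_logit(-1000.0))
```

Stan provides a built-in `inv_logit` function, which could replace the hand-written ratio in `p_i_formula` if preferred.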


In [21]:
posterior = adni.run_stan_model(features=["AGE", "APOE4"], program_code=auto_gen_pystan_code, num_samples=1000)

posterior


Building...

Building: 24.0s, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   0% (1/8000)
Sampling:   0% (2/8000)
Sampling:   0% (3/8000)
Sampling:   0% (4/8000)
Sampling:   1% (103/8000)
Sampling:   3% (203/8000)
Sampling:   4% (302/8000)
Sampling:   5% (401/8000)
Sampling:   6% (501/8000)
Sampling: 

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    b: ()
    c: ()
    p_i: (826,)
Draws: 4000

In [22]:
# Save results in a dict
waic_res = {"AGE + APOE4": adni.get_waic(fit=posterior)}



In [24]:
waic_res

{'AGE + APOE4': 951.828412891064}
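The `get_waic` helper is defined in `assignment_utils.py` and is not shown here. As a rough sketch of one standard definition, WAIC on the deviance scale is $-2(\text{lppd} - p_{\text{WAIC}})$, where lppd sums the log of the posterior-mean likelihood of each observation and $p_{\text{WAIC}}$ sums the posterior variance of each observation's log-likelihood. The function names and the synthetic `p_draws` below are assumptions for illustration; the actual helper may differ (for instance, it takes a `sample_size_waic` argument).

```python
import math
import random
from statistics import fmean, variance


def bernoulli_log_lik(y_i: int, p: float) -> float:
    """Log-likelihood of a single 0/1 outcome under success probability p."""
    return math.log(p) if y_i == 1 else math.log(1.0 - p)


def waic(y, p_draws):
    """WAIC on the deviance scale (lower is better).

    y       : list of N binary outcomes
    p_draws : list of S posterior draws, each a list of N probabilities in (0, 1)
    """
    lppd = 0.0
    p_waic = 0.0
    for i, y_i in enumerate(y):
        log_lik = [bernoulli_log_lik(y_i, draw[i]) for draw in p_draws]
        # Log of the likelihood averaged over posterior draws.
        lppd += math.log(fmean([math.exp(ll) for ll in log_lik]))
        # Penalty term: posterior variance of the log-likelihood.
        p_waic += variance(log_lik)
    return -2.0 * (lppd - p_waic)


# Synthetic stand-in for the `p_i` draws of a fitted model (S=200, N=5).
rng = random.Random(0)
y = [0, 0, 1, 1, 0]
p_draws = [
    [min(max(rng.gauss(0.4, 0.05), 0.01), 0.99) for _ in y]
    for _ in range(200)
]
print(waic(y, p_draws))
```

When every draw gives the same probabilities, the penalty term vanishes and WAIC reduces to minus twice the log-likelihood.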

In [8]:
adni.get_params_box_plot(fit=posterior, model_params=["a", "b", "c"])

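We notice from the box plot above that the `b` parameter is close to zero and seems to play a very minor role in the computation of the `DX` variable, whereas `a` and `c` seem to be non-negligible, with a lot of variability for `a`. Assuming `b` multiplies `AGE` expressed in years (roughly 55 to 91 in this dataset, unstandardized), a small per-year coefficient does not by itself mean that age has a negligible effect on `DX`.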
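The size of a coefficient depends on the units of its predictor, so one rough check is to compare contributions on the logit scale over each predictor's observed range. The sketch below takes the `min` and `max` values from the `describe()` output above; the coefficients `b_hat` and `c_hat` are hypothetical placeholders chosen for illustration, not the fitted posterior means.

```python
# Observed ranges taken from the describe() output above.
age_min, age_max = 55.1, 90.9
apoe4_min, apoe4_max = 0.0, 2.0

# Hypothetical coefficients for illustration only (NOT the fitted values).
b_hat = 0.02  # change in log-odds per year of AGE
c_hat = 0.60  # change in log-odds per unit of APOE4

# Change in log-odds when each predictor moves across its observed range.
age_swing = b_hat * (age_max - age_min)
apoe4_swing = c_hat * (apoe4_max - apoe4_min)

print(f"AGE swing on the logit scale:   {age_swing:.3f}")
print(f"APOE4 swing on the logit scale: {apoe4_swing:.3f}")
```

With these placeholder values the two swings are of the same order of magnitude even though `b_hat` is thirty times smaller than `c_hat`, which is why standardizing predictors before comparing coefficients is often helpful.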

---


### Question 2

Consider subjects who are 80 years old and check the effect of the APOE4 gene on the diagnosis.

Hint: You'll draw many samples from two binomial distributions. One where APOE4 is included in the computation of $p_i$ and one where it's not.


#### Answer

We first prepare the dataset with subjects who are exactly 80 years old. \
We note that there are only 4 such patients, which is very few. Furthermore, only one of those 4 patients has developed Alzheimer's disease. This is close to what some researchers face in the medical field: a large number of features for very few patients, only some of whom are sick.


In [9]:
adni.eighty

| | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| 139 | 230 | 0.0 | 0 | 80.0 | 1051053.0 | 1714028.0 | -1.030522 |
| 489 | 866 | 0.0 | 0 | 80.0 | 943825.0 | 1388961.0 | 0.238778 |
| 522 | 920 | 1.0 | 0 | 80.0 | 946606.0 | 1464818.0 | -0.398452 |
| 740 | 1285 | 1.0 | 1 | 80.0 | 1025968.0 | 1626103.0 | -0.691147 |


We want to compare the two following models:

1. `DX ~ `$\varnothing$
1. `DX ~ APOE4`

The first model contains only the intercept parameter $a$, that is:

$$
p_i = \frac{\exp(a)}{1 + \exp(a)}
$$

while the second model also contains a parameter $b$ that scales the contribution of the `APOE4` variable $x_i$, that is:

$$
p_i = \frac{\exp(a + b x_i)}{1 + \exp(a + b x_i)}
$$
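Following the hint, one way to compare the two formulas above is to simulate diagnoses from each and compare the simulated rates. Setting $x_i = 0$ in the second formula reduces it to the first, so a single set of draws can illustrate both cases. This is a minimal sketch using only the standard library: `a_draws` and `b_draws` are synthetic stand-ins for posterior draws (in the notebook they would come from the fitted `stan.fit.Fit` object), and `simulate_rate` is a hypothetical helper.

```python
import math
import random


def inv_logit(z: float) -> float:
    """Logistic function exp(z) / (1 + exp(z)), written in a stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_rate(a_draws, b_draws, x, rng):
    """Draw one binomial(1, p) outcome per posterior draw and
    return the fraction of simulated positive diagnoses."""
    hits = 0
    for a, b in zip(a_draws, b_draws):
        p = inv_logit(a + b * x)
        hits += int(rng.random() < p)
    return hits / len(a_draws)


rng = random.Random(42)
n_draws = 4000
# Synthetic placeholders, NOT the fitted posterior.
a_draws = [rng.gauss(-1.0, 0.3) for _ in range(n_draws)]
b_draws = [rng.gauss(0.8, 0.3) for _ in range(n_draws)]

rate_without = simulate_rate(a_draws, b_draws, x=0.0, rng=rng)  # p = inv_logit(a)
rate_with = simulate_rate(a_draws, b_draws, x=1.0, rng=rng)     # p = inv_logit(a + b)
print(rate_without, rate_with, rate_with - rate_without)
```

The difference between the two simulated rates gives a direct, probability-scale reading of the `APOE4` effect, which is easier to interpret than the coefficient on the logit scale.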


In [57]:
code_80_yo_intercept = """
data {
  int<lower=1> N;
  int y[N];
}
parameters {
  real a;
}
transformed parameters {
  vector[N] p_i;
  for (i in 1:N) {
    p_i[i] = exp(a)/(1 + exp(a));
  }
}
model {
  a ~ normal(0, 3);
  y ~ binomial(1, p_i);
}
"""
posterior_80_yo_intercept = adni.run_stan_model(
  features=[], program_code=code_80_yo_intercept, num_samples=10, data_name="80 yo"
)
posterior_80_yo_intercept


Building...



Building: found in cache, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (1010/4040)
Sampling:  50% (2020/4040)
Sampling:  75% (3030/4040)
Sampling: 100% (4040/4040)
Sampling: 100% (4040/4040), done.
Messages received during sampling:
  Gradient evaluation took 2.7e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 1e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 1.1e-05 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation too

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    p_i: (4,)
Draws: 40

In [11]:
waic_80_yo_intercept = adni.get_waic(
  fit=posterior_80_yo_intercept, data_name="80 yo", sample_size_waic=5000
)
waic_80_yo = {"intercept only": waic_80_yo_intercept}



In [58]:
# Same intercept-only model, this time built with the code generator
pystan_code = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    # "b": "uniform(1, 3)",
  },
  p_i_formula="p_i[i] = exp(a)/(1 + exp(a));",
)
posterior_80_yo_intercept = adni.run_stan_model(
  features=[], program_code=pystan_code, num_samples=10, data_name="80 yo"
)
posterior_80_yo_intercept


Building...



Building: found in cache, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (1010/4040)
Sampling:  50% (2020/4040)
Sampling:  75% (3030/4040)
Sampling: 100% (4040/4040)
Sampling: 100% (4040/4040), done.
Messages received during sampling:
  Gradient evaluation took 5e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 9e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 6e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 4

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    p_i: (4,)
Draws: 40

---



### Question 3

In the last lesson, we fitted a model to predict the diagnosis using only the size of the brain (`norm_brain`). Compare this model and the one from Question 1 in terms of WAIC. Is one better than the other?

#### Answer
