# Assignment 3

**Joris LIMONIER**

_Note:_ To keep scrolling manageable, this notebook only contains function calls. The actual implementations are in the `assignment_utils.py` file.


In [1]:
import pandas as pd
import numpy as np
import stan
from assignment_utils import *

%load_ext autoreload
%autoreload 2

## Exercise

In this exercise, you will use the ADNI dataset from the previous lesson.

---

### Preparation

We first load the data and describe it.


In [2]:
adni = ADNI()
adni.diag.describe()


| | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| count | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 |
| mean | 2686.322034 | 0.525424 | 0.38862 | 74.451574 | 1011453.0 | 1521185.0 | -0.001905 |
| std | 2062.148046 | 0.65871 | 0.487732 | 6.648689 | 111362.3 | 168055.6 | 1.000892 |
| min | 2.0 | 0.0 | 0.0 | 55.1 | 727478.0 | 1100687.0 | -2.765395 |
| 25% | 673.25 | 0.0 | 0.0 | 70.5 | 932946.5 | 1396231.0 | -0.719043 |
| 50% | 2718.0 | 0.0 | 0.0 | 74.15 | 1008351.0 | 1504898.0 | 0.022915 |
| 75% | 4690.5 | 1.0 | 1.0 | 78.9 | 1087573.0 | 1634110.0 | 0.684545 |
| max | 5296.0 | 2.0 | 1.0 | 90.9 | 1486036.0 | 2057399.0 | 3.236658 |


We note that the distribution of the `AGE` variable is very close to a Gaussian.

In [86]:
adni.plot_kde_vs_norm()
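
The helper `plot_kde_vs_norm` lives in `assignment_utils.py` and is not shown here. As a rough sketch of the kind of comparison its name suggests, assuming it overlays a Gaussian kernel density estimate with a normal density matched on mean and standard deviation, the two curves could be computed as follows. The function names, the bandwidth rule and the synthetic `fake_age` sample (drawn with the mean and standard deviation reported by `describe()` above) are our assumptions, not the actual implementation or the real data.

```python
import numpy as np


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))


def kde_vs_norm(values, num_points=200):
    """Return a grid, a Gaussian KDE of `values`, and the moment-matched normal pdf."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)
    # Normal-reference rule of thumb for the bandwidth (an assumption of this sketch)
    bandwidth = 1.06 * std * values.size ** (-0.2)
    grid = np.linspace(values.min(), values.max(), num_points)
    # Average of one Gaussian bump per observation, evaluated on the grid
    kde_density = normal_pdf(grid[:, None], mean=values[None, :], std=bandwidth).mean(axis=1)
    normal_density = normal_pdf(grid, mean=values.mean(), std=std)
    return grid, kde_density, normal_density


# Synthetic stand-in for the AGE column (NOT the ADNI data)
rng = np.random.default_rng(0)
fake_age = rng.normal(loc=74.45, scale=6.65, size=826)

grid, kde_density, normal_density = kde_vs_norm(fake_age)
# Largest pointwise gap between the two curves: small when the data look Gaussian
max_gap = np.max(np.abs(kde_density - normal_density))
```

Plotting `kde_density` and `normal_density` against `grid` gives a visual check; `max_gap` is one crude numerical summary of how far apart the two curves are.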

In [85]:
adni.plot_apoe4()

---


### Question 1

Fit a model to predict the diagnosis (`DX`) of the subjects using both `AGE` and `APOE4` as predictors.

#### Answer


In [3]:
# Prepare the Stan program (as a string) that will be passed to Stan
code_to_stan = """
data {
  int<lower=1> N;
  int y[N];
  real x1[N];
  real x2[N];
}
parameters {
  real a;
  real b;
  real c;
}
transformed parameters {
  vector[N] p_i;
  for (i in 1:N) {
    p_i[i] = exp(a + b * x1[i] + c * x2[i]) / (1 + exp(a + b * x1[i] + c * x2[i]));
  }
}
model {
  a ~ normal(0, 3);
  b ~ normal(0, 3);
  c ~ normal(0, 3);
  y ~ binomial(1, p_i);
}
"""

posterior = adni.run_stan_model(features=["AGE", "APOE4"], program_code=code_to_stan, num_samples=100)
posterior

Building...

Building: 27.3s, done.
Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:  11% (500/4400)

<class 'stan.fit.Fit'>


<stan.Fit>
Parameters:
    a: ()
    b: ()
    c: ()
    p_i: (826,)
Draws: 400
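
In the Stan program above, `p_i` is the inverse logit of the linear predictor `a + b * x1 + c * x2`. The short NumPy sketch below illustrates that mapping outside of Stan; the function names are ours and are not part of `assignment_utils.py`. The second variant is an algebraically equivalent form that avoids overflow when the linear predictor is large in absolute value. Stan provides a built-in `inv_logit` function for the same purpose.

```python
import numpy as np


def inv_logit_naive(eta):
    """Direct transcription of exp(eta) / (1 + exp(eta))."""
    eta = np.asarray(eta, dtype=float)
    return np.exp(eta) / (1.0 + np.exp(eta))


def inv_logit_stable(eta):
    """Same function, written so that exp() is only ever called on non-positive values."""
    eta = np.asarray(eta, dtype=float)
    exp_neg_abs = np.exp(-np.abs(eta))  # always in (0, 1], never overflows
    return np.where(eta >= 0, 1.0 / (1.0 + exp_neg_abs), exp_neg_abs / (1.0 + exp_neg_abs))


# Both versions agree on moderate values of the linear predictor
eta_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p_naive = inv_logit_naive(eta_values)
p_stable = inv_logit_stable(eta_values)
```

For very large inputs such as `eta = 1000`, the naive form overflows in `exp` while the stable form returns a valid probability.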

In [28]:
# Save results in a dict
waic_res = {"AGE + APOE4": adni.get_waic(fit=posterior)}


sample_size_waic = 1000 is greater than n_samples_computed = 400. Limiting to available number of samples.



In [27]:
adni.get_box_plot(fit=posterior, model_params=["a", "b", "c"])

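We notice from the box plot above that the `b` parameter is close to zero, whereas `a` and `c` are clearly non-negligible, with a lot of variability for `a`. At first sight, `b` therefore seems to play a very minor role in the computation of the `DX` variable. However, `b` multiplies `AGE`, which ranges from 55.1 to 90.9 in the table above; if `AGE` is passed to the model on its original scale, a small coefficient can still correspond to a noticeable change in the linear predictor, so its magnitude should not be compared directly with that of `c`.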

---


### Question 2

Consider subjects who are 80 years old and check the effect of the APOE4 gene on the diagnosis.

#### Answer

Hint: You'll draw many samples from two binomial distributions. One where APOE4 is included in the computation of $p_i$ and one where it's not.
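
One possible way to follow the hint is sketched below. It assumes that posterior draws of `a`, `b` and `c` are available as 1-D NumPy arrays of equal length (to our knowledge, indexing a PyStan 3 fit with `posterior["a"]` returns an array with a leading parameter dimension, so it would need to be flattened first). The draws used here are placeholders generated from arbitrary normal distributions so that the snippet runs on its own; they are not the fitted posterior, and the resulting numbers are not results. "Without APOE4" is interpreted as setting `APOE4 = 0`, which removes the `c * x2` term from $p_i$.

```python
import numpy as np


def prob_dx(a, b, c, age, apoe4):
    """Probability of DX = 1 for each posterior draw, via the inverse logit."""
    eta = a + b * age + c * apoe4
    return 1.0 / (1.0 + np.exp(-eta))


rng = np.random.default_rng(42)
num_draws = 400

# Placeholder draws (hypothetical values, NOT the fitted posterior)
a_draws = rng.normal(-3.0, 0.5, size=num_draws)
b_draws = rng.normal(0.03, 0.005, size=num_draws)
c_draws = rng.normal(0.8, 0.1, size=num_draws)

age = 80.0
p_without = prob_dx(a_draws, b_draws, c_draws, age=age, apoe4=0)
p_with = prob_dx(a_draws, b_draws, c_draws, age=age, apoe4=1)

# One simulated diagnosis per posterior draw, for each scenario
dx_without = rng.binomial(n=1, p=p_without)
dx_with = rng.binomial(n=1, p=p_with)

# Difference in simulated diagnosis rates between the two scenarios
effect = dx_with.mean() - dx_without.mean()
```

With the real posterior draws substituted in, comparing `dx_with.mean()` and `dx_without.mean()` (or the distributions of `p_with` and `p_without`) gives an estimate of the effect of carrying one APOE4 allele at age 80. The same call with `apoe4=2` covers subjects with two alleles.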


---



### Question 3

In the last lesson, we fitted a model to predict the diagnosis using only the size of the brain (`norm_brain`). Compare this model and the one from Question 1 in terms of WAIC. Is one better than the other?

#### Answer
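
The implementation of `adni.get_waic` is in `assignment_utils.py` and is not shown here. As a minimal sketch of what a WAIC computation can look like for a Bernoulli model, assuming a matrix of success probabilities with one row per posterior draw and one column per subject, one could write the following. It uses the common definition $\mathrm{WAIC} = -2\,(\mathrm{lppd} - p_{\mathrm{WAIC}})$, where lppd is the log pointwise predictive density and $p_{\mathrm{WAIC}}$ sums the posterior variances of the pointwise log-likelihood. The function names and the toy inputs are ours.

```python
import numpy as np


def bernoulli_log_lik(y, p):
    """Pointwise log-likelihood, shape (num_draws, num_obs).

    y: array of 0/1 outcomes, shape (num_obs,)
    p: success probabilities, shape (num_draws, num_obs)
    """
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return y * np.log(p) + (1 - y) * np.log1p(-p)


def waic(log_lik):
    """WAIC on the deviance scale from a (num_draws, num_obs) log-likelihood matrix."""
    # lppd: log of the posterior-mean likelihood, summed over observations
    # (the maximum is subtracted before exponentiating for numerical stability)
    max_ll = log_lik.max(axis=0)
    lppd = np.sum(max_ll + np.log(np.mean(np.exp(log_lik - max_ll), axis=0)))
    # Effective number of parameters: posterior variance of the log-likelihood
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)


# Toy inputs (NOT the ADNI data): 5 subjects, 10 draws, p = 0.5 everywhere
y_toy = np.array([0, 1, 1, 0, 1])
p_toy = np.full((10, 5), 0.5)
waic_toy = waic(bernoulli_log_lik(y_toy, p_toy))
```

On this scale, the model with the lower WAIC is preferred. With the fit from Question 1, the `p_i` draws would play the role of `p_toy`; they may need to be transposed first if they are stored with one row per subject.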
