# Assignment 3

**Joris LIMONIER**

_Note:_ To limit scrolling, this notebook only contains calls to functions. The function bodies themselves are in the `assignment_utils.py` file.


In [1]:
import pandas as pd
import numpy as np
import stan
from assignment_utils import *

%load_ext autoreload
%autoreload 2

waic_res = {}

## Exercise

In this exercise, you will use the ADNI dataset from the past lesson.

---

### Preparation

We first load the data and describe it.


In [2]:
adni = ADNI()
adni.diag.describe()


|  | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| count | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 |
| mean | 2686.322034 | 0.525424 | 0.38862 | 74.451574 | 1011453.0 | 1521185.0 | -0.001905 |
| std | 2062.148046 | 0.65871 | 0.487732 | 6.648689 | 111362.3 | 168055.6 | 1.000892 |
| min | 2.0 | 0.0 | 0.0 | 55.1 | 727478.0 | 1100687.0 | -2.765395 |
| 25% | 673.25 | 0.0 | 0.0 | 70.5 | 932946.5 | 1396231.0 | -0.719043 |
| 50% | 2718.0 | 0.0 | 0.0 | 74.15 | 1008351.0 | 1504898.0 | 0.022915 |
| 75% | 4690.5 | 1.0 | 1.0 | 78.9 | 1087573.0 | 1634110.0 | 0.684545 |
| max | 5296.0 | 2.0 | 1.0 | 90.9 | 1486036.0 | 2057399.0 | 3.236658 |


We note that the distribution of the `AGE` variable is very close to a Gaussian. For this reason, we will fit its respective parameter with a Gaussian prior.
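
As a rough numerical complement to the plot, the quartiles reported by `describe()` can be compared with those of a Gaussian that has the same mean and standard deviation. This is only a sketch based on the summary statistics printed above; it is not what `plot_kde_vs_norm` does internally.

```python
from statistics import NormalDist

# Summary statistics of AGE copied from the describe() output above.
age_mean, age_std = 74.451574, 6.648689
observed = {0.25: 70.5, 0.50: 74.15, 0.75: 78.9}

# Quartiles of a Gaussian with the same mean and standard deviation.
gaussian = NormalDist(mu=age_mean, sigma=age_std)
expected = {q: gaussian.inv_cdf(q) for q in observed}

for q in observed:
    print(f"q={q:.2f}  observed={observed[q]:6.2f}  gaussian={expected[q]:6.2f}")
```

The closer the two columns, the more reasonable the Gaussian description of `AGE`.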

In [3]:
adni.plot_kde_vs_norm()

In [4]:
adni.plot_apoe4()

---


### Question 1

Fit a model to predict the diagnosis (DX) of the subjects using both AGE and APOE4 as predictors.

#### Answer

In order to avoid repeating the boilerplate Stan code that will be passed to PyStan, we write a function that takes the parameters, their prior distributions and the formula for `p_i`; the rest is added automatically. This enforces some conventions, such as naming the variables `x1`, `x2`, ..., `x<number of parameters - 1>`, but leaves full freedom otherwise.

The resulting Stan code is shown just after the first call of this function. It will not be printed again in the rest of the homework. Note that the inconsistent indentation in the printed code apparently has no effect on the final execution.
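
To give an idea of how such a generator could work, here is a minimal sketch. `generate_stan_code` is a hypothetical stand-in: the actual `_generate_pystan_code` lives in `assignment_utils.py` and may differ. The sketch assumes the first parameter is the intercept, and it uses the `array[N]` declaration syntax instead of the bracket syntax that Stan deprecates.

```python
def generate_stan_code(param_distr, p_i_formula):
    """Build a Stan program for a binary-outcome regression (hypothetical helper).

    param_distr maps parameter names to prior strings. The first parameter is
    assumed to be the intercept, so there are len(param_distr) - 1 predictors.
    """
    n_features = len(param_distr) - 1
    data_lines = ["int<lower=1> N;", "array[N] int y;"]
    data_lines += [f"array[N] real x{k};" for k in range(1, n_features + 1)]
    param_lines = [f"real {name};" for name in param_distr]
    prior_lines = [f"{name} ~ {prior};" for name, prior in param_distr.items()]

    def block(name, lines):
        # Indent every line consistently inside its block.
        body = "\n".join("  " + line for line in lines)
        return f"{name} {{\n{body}\n}}"

    return "\n\n".join([
        block("data", data_lines),
        block("parameters", param_lines),
        block("transformed parameters", [
            "vector[N] p_i;",
            "for (i in 1:N) {",
            "  " + p_i_formula,
            "}",
        ]),
        block("model", prior_lines + ["y ~ binomial(1, p_i);"]),
    ])


print(generate_stan_code(
    param_distr={"a": "normal(0, 3)", "b": "normal(0, 3)", "c": "normal(0, 3)"},
    p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
))
```

Building each block from a list of lines keeps the indentation uniform, which avoids the cosmetic issue mentioned above.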

##### Fit of `DX ~ AGE + APOE4 (normal)`

We will always use a normal prior for the `AGE` coefficient because we saw that the distribution of `AGE` looks normal, but we will let the prior distribution for `APOE4` vary and see how our results evolve. Moreover, the choice of the prior distribution parameters for `APOE4` is guided by our observation of its distribution in the preliminary section.

In this first case, we use a normal prior distribution for `APOE4`.


In [5]:
# Prepare the Stan code that will be passed to PyStan
code_age_apoe4_normal = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
    "c": "normal(0, 3)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
)
print(code_age_apoe4_normal)


    data {
      int<lower=1> N;
      int y[N];
      real x1[N];
real x2[N];
    }

    parameters {
      real a;
real b;
real c;
    }

transformed parameters {
  vector[N] p_i;
  for (i in 1:N) {
    p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i])); 
    }
}

    model {
      a ~ normal(0, 3);
b ~ normal(0, 3);
c ~ normal(0, 3);
      y ~ binomial(1, p_i);
    }


Now that we have the code for PyStan, we run Stan's sampling algorithm.
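
The progress counter in the output of the next cell goes up to 8000 even though `num_samples=1000`. A plausible reading, assuming that `run_stan_model` keeps PyStan's defaults of 4 chains and 1000 warmup iterations per chain (the helper's body is not shown here), is that the counter adds warmup and sampling iterations over all chains:

```python
def total_iterations(num_samples, num_chains=4, num_warmup=1000):
    """Iterations shown by the progress counter: (warmup + sampling) over all chains."""
    return num_chains * (num_warmup + num_samples)


# The three values of num_samples used in this notebook.
for n in (800, 1000, 5000):
    print(f"num_samples={n:>4} -> {total_iterations(n)} iterations")
```

Under this assumption, only `num_chains * num_samples` of those iterations are kept as posterior draws.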

In [6]:
posterior_age_apoe4_normal = adni.run_stan_model(
  features=["AGE", "APOE4"], program_code=code_age_apoe4_normal, num_samples=1000
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   0% (1/8000)
Sampling:   0% (2/8000)
Sampling:   0% (3/8000)
Sampling:   0% (4/8000)
Sampling:   1% (103/8000)
Sampling:   3% (203/8000)
Sampling:   4% (302/8000)
Sampling:   5% (402/8000)
Sampling:   6% (501/8000)
S

Now we compute the WAIC and save it to a dictionary in order to get a metric to compare models later.

In [7]:
# Save results in a dict
waic_res["DX ~ AGE + APOE4 (normal)"] = adni.get_waic(fit=posterior_age_apoe4_normal)


  0%|          | 0/826 [00:00<?, ?it/s]

In [8]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
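
For reference, a WAIC of this kind can be sketched as follows. This is an illustration of the usual definition, $-2\,(\mathrm{lppd} - p_{\mathrm{WAIC}})$, and not necessarily what `get_waic` does internally. The function name and the layout of `draws` (one tuple per posterior draw, intercept first) are assumptions made for the example.

```python
import math
from statistics import fmean, variance


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def bernoulli_log_lik(y, z):
    """Log-likelihood of y in {0, 1} when P(y = 1) = exp(z) / (1 + exp(z))."""
    return -softplus(-z) if y == 1 else -softplus(z)


def waic(y, x_rows, draws):
    """WAIC on the deviance scale (lower is better).

    y      -- list of 0/1 outcomes
    x_rows -- list of predictor tuples, one per observation
    draws  -- list of posterior draws (intercept, coef_1, coef_2, ...)
    """
    lppd, p_waic = 0.0, 0.0
    for y_i, x_i in zip(y, x_rows):
        log_liks = [
            bernoulli_log_lik(y_i, d[0] + sum(c * v for c, v in zip(d[1:], x_i)))
            for d in draws
        ]
        top = max(log_liks)
        # Log of the posterior-mean likelihood, computed with log-sum-exp.
        lppd += top + math.log(fmean([math.exp(l - top) for l in log_liks]))
        # Penalty term: variance of the log-likelihood across posterior draws.
        p_waic += variance(log_liks)
    return -2.0 * (lppd - p_waic)
```

With identical draws the penalty term vanishes and the WAIC reduces to minus twice the log-likelihood, which is a convenient sanity check.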


We draw a box plot of the parameters. \
The box plot below suggests that the `b` parameter (the coefficient of `AGE`) plays a very minor role in the computation of `DX`, whereas `a` (the intercept) and `c` (the coefficient of `APOE4`) are non-negligible, with a lot of variability for `a`.

In [9]:
adni.get_params_box_plot(fit=posterior_age_apoe4_normal, model_params=["a", "b", "c"])

Although the `AGE` variable does not seem to play a crucial role in the prediction of `DX`, the output below shows that 0 is not included in the 95% interval for `b` (since it comes from a posterior fit, this is strictly speaking a credible interval rather than a confidence interval). This hints that the role of `AGE` is limited, but not negligible. The small values of `b` may also come from the scale of `AGE`, which takes large values (its mean is about 74.5, compared with values in $\{0, 1, 2\}$ for `APOE4`). One solution would be to normalize the `AGE` column of our dataset and see how the value of `b` evolves. In our case, however, the main point is that 0 does not lie within the 95% interval.
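
The effect of normalizing `AGE` can be anticipated without refitting, because $a + b \cdot \text{AGE} = (a + b\,\mu) + (b\,\sigma) \cdot \frac{\text{AGE} - \mu}{\sigma}$. The sketch below plugs the posterior medians of the normal-prior fit into this identity. It is only a back-of-the-envelope estimate: a proper answer would refit the model on the standardized column, and the priors would then apply to the rescaled parameters.

```python
# Values copied from this notebook: summary statistics of AGE and the
# posterior medians of a and b in the DX ~ AGE + APOE4 (normal) fit.
age_mean, age_std = 74.451574, 6.648689
a_median, b_median = -3.95411, 0.03560


def standardize(x, mean, std):
    """Center and scale a value."""
    return (x - mean) / std


# Slope per standard deviation of AGE, and intercept at the mean age.
b_per_sd = b_median * age_std
a_at_mean_age = a_median + b_median * age_mean

# Both parameterisations give the same linear predictor (APOE4 term left out).
age = 80.0
raw = a_median + b_median * age
rescaled = a_at_mean_age + b_per_sd * standardize(age, age_mean, age_std)

print(f"slope per year of age: {b_median:.4f}")
print(f"slope per SD of age:   {b_per_sd:.4f}")
```

On the standardized scale the `AGE` slope is several times larger than `b`, though still well below the `APOE4` coefficient.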

In [10]:
adni.print_ci_param(fit=posterior_age_apoe4_normal, param_name="a")
adni.print_ci_param(fit=posterior_age_apoe4_normal, param_name="b")
adni.print_ci_param(fit=posterior_age_apoe4_normal, param_name="c")

95% confidence interval for 'a':
	--> 2.5% threshold................ -5.71838
	--> Median........................ -3.95411
	--> 97.5% threshold............... -2.16089
95% confidence interval for 'b':
	--> 2.5% threshold................  0.01270
	--> Median........................  0.03560
	--> 97.5% threshold...............  0.05909
95% confidence interval for 'c':
	--> 2.5% threshold................  1.22888
	--> Median........................  1.47045
	--> 97.5% threshold...............  1.73240
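
Intervals like the ones above are typically obtained as empirical quantiles of the posterior draws. A minimal sketch, assuming the draws of one parameter are available as a plain list (`interval_95` is a hypothetical helper, and `print_ci_param` may be implemented differently):

```python
from statistics import quantiles


def interval_95(draws):
    """Return the 2.5%, 50% and 97.5% empirical quantiles of posterior draws."""
    cuts = quantiles(draws, n=40, method="inclusive")  # cut points every 2.5%
    return cuts[0], cuts[19], cuts[38]


# Placeholder draws (evenly spaced integers), only to show the call.
low, median, high = interval_95(list(range(1001)))
print(low, median, high)
```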


##### Fit of `DX ~ AGE + APOE4 (uniform)`

Now we use a uniform prior distribution for `APOE4`.
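
A prior with restricted support raises one subtlety: Stan expects the parameter to be declared with matching bounds. In the printed program the parameters are declared as plain `real`, and for the exponential prior used later in this notebook stanc reports that the parameter is not constrained to be positive. One possible remedy, sketched here under the assumption that the prior is available as a string such as `uniform(0, 2)`, is to derive the declaration from the prior. `declare_parameter` is a hypothetical helper, not part of `assignment_utils.py`.

```python
import re


def declare_parameter(name, prior):
    """Return a Stan declaration whose bounds match the support of the prior."""
    uniform = re.fullmatch(r"uniform\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)", prior)
    if uniform:
        lower, upper = uniform.groups()
        return f"real<lower={lower}, upper={upper}> {name};"
    if prior.startswith("exponential("):
        return f"real<lower=0> {name};"
    # Priors supported on the whole real line need no bounds.
    return f"real {name};"


for prior in ("normal(0, 3)", "uniform(0, 2)", "exponential(1)"):
    print(declare_parameter("c", prior))
```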


In [11]:
code_age_apoe4_uniform = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
    "c": "uniform(0, 2)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
)
posterior_age_apoe4_uniform = adni.run_stan_model(
  features=["AGE", "APOE4"], program_code=code_age_apoe4_uniform, num_samples=1000
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    c is given a uniform distribution. The uniform distribution is not
    recommended, for two reasons: (a) Except when there are logical or
    physical constraints, it is very unusual for you to be sure that a
    parameter will fall insid

In [12]:
# Save results in a dict
waic_res["DX ~ AGE + APOE4 (uniform)"] = adni.get_waic(fit=posterior_age_apoe4_uniform)


  0%|          | 0/826 [00:00<?, ?it/s]

In [13]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553


In [14]:
adni.get_params_box_plot(fit=posterior_age_apoe4_uniform, model_params=["a", "b", "c"])

In [15]:
adni.print_ci_param(fit=posterior_age_apoe4_uniform, param_name="a")
adni.print_ci_param(fit=posterior_age_apoe4_uniform, param_name="b")
adni.print_ci_param(fit=posterior_age_apoe4_uniform, param_name="c")

95% confidence interval for 'a':
	--> 2.5% threshold................ -5.64866
	--> Median........................ -3.92481
	--> 97.5% threshold............... -2.23835
95% confidence interval for 'b':
	--> 2.5% threshold................  0.01342
	--> Median........................  0.03557
	--> 97.5% threshold...............  0.05809
95% confidence interval for 'c':
	--> 2.5% threshold................  1.22754
	--> Median........................  1.47255
	--> 97.5% threshold...............  1.73321


##### Fit of `DX ~ AGE + APOE4 (exponential)`

Now we use an exponential prior distribution for `APOE4` with $\lambda = 1$.


In [16]:
code_age_apoe4_exponential = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
    "c": "exponential(1)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i] + c * x2[i])/(1 + exp(a + b * x1[i] + c * x2[i]));",
)
posterior_age_apoe4_exponential = adni.run_stan_model(
  features=["AGE", "APOE4"], program_code=code_age_apoe4_exponential, num_samples=1000
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    c is given a exponential distribution, which has strictly positive
    support, but c was not constrained to be strictly positive.
Sampling:   0%
Sampling:   0% (1/8000)
Sampling:   0% (2/8000)
Sampling:   0% (3/8000)
Sampling:   0% (4/80

In [17]:
waic_res["DX ~ AGE + APOE4 (exponential)"] = adni.get_waic(fit=posterior_age_apoe4_exponential)


  0%|          | 0/826 [00:00<?, ?it/s]

In [18]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760


In [19]:
adni.get_params_box_plot(fit=posterior_age_apoe4_exponential, model_params=["a", "b", "c"])

In [20]:
adni.print_ci_param(fit=posterior_age_apoe4_exponential, param_name="a")
adni.print_ci_param(fit=posterior_age_apoe4_exponential, param_name="b")
adni.print_ci_param(fit=posterior_age_apoe4_exponential, param_name="c")

95% confidence interval for 'a':
	--> 2.5% threshold................ -5.66180
	--> Median........................ -3.90151
	--> 97.5% threshold............... -2.30956
95% confidence interval for 'b':
	--> 2.5% threshold................  0.01421
	--> Median........................  0.03552
	--> 97.5% threshold...............  0.05839
95% confidence interval for 'c':
	--> 2.5% threshold................  1.21698
	--> Median........................  1.46051
	--> 97.5% threshold...............  1.72960


We see that the results for the three prior distributions are very close. We can say that (on this run):

1. The `DX ~ AGE + APOE4 (normal)` model performs best (lowest WAIC, 951.660)
1. Then, the `DX ~ AGE + APOE4 (uniform)` model achieves slightly worse, but very close results (WAIC of 952.113)
1. Finally, the `DX ~ AGE + APOE4 (exponential)` model performs marginally worse again (WAIC of 952.205)

Overall, all models are pretty close: the box plots of the parameters look similar and the 95% intervals vary little from one model to another.


In [21]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760


Now we try to remove one of the variables and see how the model evolves. For the `APOE4` variable, we will once again change the prior distribution and see if we can get better results with a simpler model.


##### Fit of `DX ~ AGE`

Now we use the `AGE` variable only.


In [23]:
code_age = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_age = adni.run_stan_model(
  features=["AGE"], program_code=code_age, num_samples=800
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   0% (1/7200)
Sampling:   0% (2/7200)
Sampling:   0% (3/7200)
Sampling:   0% (4/7200)
Sampling:   1% (103/7200)
Sampling:   3% (203/7200)
Sampling:   4% (303/7200)
Sampling:   6% (402/7200)
Sampling:   7% (502/7200)
Sampling:   8% (602/7200)
Sampling:  10% (702/7200)
Sampling:  11% (802/7200)
Sampling:  13% (902/7200)
Sampling:  14% (1002/7200)
Sampling:  15% (1102/7200)
Sampling:  17% (1201/7200)
Sampling:  18% (1301/7200)
Sampling:  19% (1401/720

In [24]:
# Save results in a dict
waic_res["DX ~ AGE"] = adni.get_waic(fit=posterior_age)


  0%|          | 0/826 [00:00<?, ?it/s]

In [25]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760
DX ~ AGE ......................................... 1105.8930574


In [26]:
adni.get_params_box_plot(fit=posterior_age, model_params=["a", "b"])

In [27]:
adni.print_ci_param(fit=posterior_age, param_name="a")
adni.print_ci_param(fit=posterior_age, param_name="b")

95% confidence interval for 'a':
	--> 2.5% threshold................ -2.83073
	--> Median........................ -1.47997
	--> 97.5% threshold...............  0.04901
95% confidence interval for 'b':
	--> 2.5% threshold................ -0.00644
	--> Median........................  0.01375
	--> 97.5% threshold...............  0.03209



##### Fit of `DX ~ APOE4 (normal)`

Now we use a normal prior distribution for `APOE4`, without the `AGE` variable.


In [28]:
# Prepare the Stan code that will be passed to PyStan
code_apoe4_normal = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "normal(0, 3)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_apoe4_normal = adni.run_stan_model(
  features=["APOE4"], program_code=code_apoe4_normal, num_samples=800
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   1% (100/7200)
Sampling:   3% (200/7200)
Sampling:   4% (300/7200)
Sampling:   7% (500/7200)
Sampling:  12% (900/7200)
Sampling:  15% (1100/7200)
Sampling:  19% (1400/7200)
Sampling:  22% (1600/7200)
Sampling:  26% (1900/7200)
Sampling:  31% (2200/7200)
Sampling:  33% (2400/7200)
Sampling:  39% (2800/7200)
Sampling:  53% (3800/7200)
Sampling:  69% (5000/7200)
Sampling:  86% (6200/7200)
Sampling: 100% (7200/7200)
Sampling: 100% (7200/7200), done.
M

In [29]:
# Save results in a dict
waic_res["DX ~ APOE4 (normal)"] = adni.get_waic(fit=posterior_apoe4_normal)


  0%|          | 0/826 [00:00<?, ?it/s]

In [30]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760
DX ~ AGE ......................................... 1105.8930574
DX ~ APOE4 (normal) .............................. 960.8828408


In [31]:
adni.get_params_box_plot(fit=posterior_apoe4_normal, model_params=["a", "b"])

In [32]:
adni.print_ci_param(fit=posterior_apoe4_normal, param_name="a")
adni.print_ci_param(fit=posterior_apoe4_normal, param_name="b")

95% confidence interval for 'a':
	--> 2.5% threshold................ -1.44713
	--> Median........................ -1.24081
	--> 97.5% threshold............... -1.02993
95% confidence interval for 'b':
	--> 2.5% threshold................  1.16566
	--> Median........................  1.40860
	--> 97.5% threshold...............  1.65536


##### Fit of `DX ~ APOE4 (uniform)`

Now we use a uniform prior distribution for `APOE4`, without the `AGE` variable.


In [33]:
# Prepare the Stan code that will be passed to PyStan
code_apoe4_uniform = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "uniform(0, 2)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_apoe4_uniform = adni.run_stan_model(
  features=["APOE4"], program_code=code_apoe4_uniform, num_samples=800
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    b is given a uniform distribution. The uniform distribution is not
    recommended, for two reasons: (a) Except when there are logical or
    physical constraints, it is very unusual for you to be sure that a
    parameter will fall inside a specified range, and (b) The infinite
    gradient induced by a uniform density can cause difficulties for Stan's
    sampling algorithm. As a consequence, we recommend soft constraints
    rather than hard constraints; for example

In [34]:
# Save results in a dict
waic_res["DX ~ APOE4 (uniform)"] = adni.get_waic(fit=posterior_apoe4_uniform)


  0%|          | 0/826 [00:00<?, ?it/s]

In [35]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760
DX ~ AGE ......................................... 1105.8930574
DX ~ APOE4 (normal) .............................. 960.8828408
DX ~ APOE4 (uniform) ............................. 960.4320280


In [36]:
adni.get_params_box_plot(fit=posterior_apoe4_uniform, model_params=["a", "b"])

In [37]:
adni.print_ci_param(fit=posterior_apoe4_uniform, param_name="a")
adni.print_ci_param(fit=posterior_apoe4_uniform, param_name="b")

95% confidence interval for 'a':
	--> 2.5% threshold................ -1.45097
	--> Median........................ -1.24627
	--> 97.5% threshold............... -1.02930
95% confidence interval for 'b':
	--> 2.5% threshold................  1.17078
	--> Median........................  1.41605
	--> 97.5% threshold...............  1.66926


##### Fit of `DX ~ APOE4 (exponential)`

Now we use an exponential prior distribution for `APOE4` with $\lambda = 1$, without the `AGE` variable.


In [38]:
# Prepare the Stan code that will be passed to PyStan
code_apoe4_exponential = adni._generate_pystan_code(
  param_distr={
    "a": "normal(0, 3)",
    "b": "exponential(1)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_apoe4_exponential = adni.run_stan_model(
  features=["APOE4"], program_code=code_apoe4_exponential, num_samples=800
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    b is given a exponential distribution, which has strictly positive
    support, but b was not constrained to be strictly positive.
Sampling:   0%
Sampling:   0% (1/7200)
Sampling:   1% (101/7200)
Sampling:   3% (201/7200)
Sampling:   4% (301/7200)
Sampling:   7% (500/7200)
Sampling:   8% (600/7200)
Sampling:  10% (700/7200)
Sampling:  12% (900/7200)
Sampling:  15% (1100/7200)
Sampling:  18% (1300/7200)
Sampling:  21% (1500/7200)
Sampling:  24% (1700/7200)
Sampling:  25

In [39]:
waic_res["DX ~ APOE4 (exponential)"] = adni.get_waic(fit=posterior_apoe4_exponential)


  0%|          | 0/826 [00:00<?, ?it/s]

In [40]:
adni.pretty_print_waic(waic=waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760
DX ~ AGE ......................................... 1105.8930574
DX ~ APOE4 (normal) .............................. 960.8828408
DX ~ APOE4 (uniform) ............................. 960.4320280
DX ~ APOE4 (exponential) ......................... 960.4046225


In [41]:
adni.get_params_box_plot(fit=posterior_apoe4_exponential, model_params=["a", "b"])

In [42]:
adni.print_ci_param(fit=posterior_apoe4_exponential, param_name="a")
adni.print_ci_param(fit=posterior_apoe4_exponential, param_name="b")

95% confidence interval for 'a':
	--> 2.5% threshold................ -1.45246
	--> Median........................ -1.23589
	--> 97.5% threshold............... -1.03340
95% confidence interval for 'b':
	--> 2.5% threshold................  1.15556
	--> Median........................  1.39967
	--> 97.5% threshold...............  1.64557


Finally, we see that the best models are the ones that use `AGE` and `APOE4` together, regardless of the prior distribution for `APOE4` (WAIC $\approx 952$ in all three cases). Some prior distributions work slightly better than others for `DX ~ AGE + APOE4`, but these differences are tiny compared with the cost of dropping a variable: dropping `AGE` raises the WAIC to about 960 (for every prior we tried in the `DX ~ APOE4` case), and dropping `APOE4` raises it to about 1106.

A summary of the results obtained can be found below.
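
One way to read the results at a glance is to sort the models by WAIC and show the gap to the best one. The values are copied from the `pretty_print_waic` output above; the name `waic_summary` is only used for this sketch.

```python
# WAIC values copied from the pretty_print_waic output above.
waic_summary = {
    "DX ~ AGE + APOE4 (normal)": 951.6598305,
    "DX ~ AGE + APOE4 (uniform)": 952.1127553,
    "DX ~ AGE + APOE4 (exponential)": 952.2045760,
    "DX ~ AGE": 1105.8930574,
    "DX ~ APOE4 (normal)": 960.8828408,
    "DX ~ APOE4 (uniform)": 960.4320280,
    "DX ~ APOE4 (exponential)": 960.4046225,
}

best = min(waic_summary.values())
ranking = sorted(waic_summary.items(), key=lambda item: item[1])

# Lower WAIC is better; the last column is the gap to the best model.
for name, value in ranking:
    print(f"{name:<32} {value:10.3f}  (+{value - best:.3f})")
```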

---


### Question 2

Consider subjects who are 80 years old and check the effect of the APOE4 gene on the diagnosis.

Hint: You'll draw many samples from two binomial distributions. One where APOE4 is included in the computation of $p_i$ and one where it's not.


#### Answer

We first prepare the dataset with the subjects who are exactly 80 years old. \
We note that there are only 4 patients of this age, which is very few. Furthermore, only one of those 4 patients has developed Alzheimer's disease. This is close to what some researchers face in the medical field: a large number of features for very few observations, only some of which correspond to sick patients.


In [43]:
adni.eighty

|  | RID | APOE4 | DX | AGE | WholeBrain.bl | ICV | norm_brain |
|---|---|---|---|---|---|---|---|
| 139 | 230 | 0.0 | 0 | 80.0 | 1051053.0 | 1714028.0 | -1.030522 |
| 489 | 866 | 0.0 | 0 | 80.0 | 943825.0 | 1388961.0 | 0.238778 |
| 522 | 920 | 1.0 | 0 | 80.0 | 946606.0 | 1464818.0 | -0.398452 |
| 740 | 1285 | 1.0 | 1 | 80.0 | 1025968.0 | 1626103.0 | -0.691147 |


We want to compare the two following models:

1. `DX ~ `$\varnothing$
1. `DX ~ APOE4`

The first model is a model containing only the parameter $a$ responsible for the intercept, that is:

$$
p_i = \frac{\exp(a)}{1 + \exp(a)}
$$

while the second model also contains a parameter $b$ responsible for an increase in the `APOE4` variable, that is:

$$
p_i = \frac{\exp(a + b x_i)}{1 + \exp(a + b x_i)}
$$
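
Both formulas are the inverse logit of a linear predictor. Written literally as `exp(z)/(1 + exp(z))`, the expression overflows for large `z`; a common remedy is to branch on the sign of `z`, as in the sketch below (Stan also provides a built-in `inv_logit` function). The helper names are chosen for this example only.

```python
import math


def inv_logit(z):
    """Numerically stable version of exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_intercept_only(a):
    """First model: the same probability for every subject."""
    return inv_logit(a)


def p_with_apoe4(a, b, x):
    """Second model: the probability depends on the APOE4 value x."""
    return inv_logit(a + b * x)


print(p_intercept_only(0.0), p_with_apoe4(0.0, 1.0, 2))
```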

We run the first model with a normal prior on `a`.


In [44]:
code_80_yo_intercept_normal = adni._generate_pystan_code(
  param_distr={"a": "normal(-1, 3)"}, p_i_formula="p_i[i] = exp(a)/(1 + exp(a));"
)

posterior_80_yo_intercept = adni.run_stan_model(
  features=[],
  program_code=code_80_yo_intercept_normal,
  num_samples=5000,
  data_name="80 yo",
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (6000/24000)
Sampling:  50% (12000/24000)
Sampling:  75% (18000/24000)
Sampling: 100% (24000/24000)
Sampling: 100% (24000/24000), done.
Messages received during sampling:
  Gradient evaluation took 9e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 7e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 9e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
  Adjust your expectations accordingly!
  Gradient evaluati

In [45]:
adni.print_ci_param(fit=posterior_80_yo_intercept, param_name="a")

95% confidence interval for 'a':
	--> 2.5% threshold................ -4.00829
	--> Median........................ -1.25547
	--> 97.5% threshold...............  0.79160


In [46]:
adni.get_params_box_plot(fit=posterior_80_yo_intercept, model_params=["a"])


In [47]:
waic_80_yo_intercept_normal = adni.get_waic(
  fit=posterior_80_yo_intercept, data_name="80 yo", sample_size_waic=5000
)
waic_80_yo = {"DX ~ N.A. (intercept only)": waic_80_yo_intercept_normal}
waic_80_yo

  0%|          | 0/4 [00:00<?, ?it/s]

{'DX ~ N.A. (intercept only)': 6.761528149733855}

Now we fit a model for 80-year-olds in which we take the `APOE4` variable into consideration.

In [48]:
code_apoe4_normal = adni._generate_pystan_code(
  param_distr={"a": "normal(-2, 3)", "b": "normal(1, 3)"},
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)

posterior_80_yo_apoe4 = adni.run_stan_model(
  features=["APOE4"],
  program_code=code_apoe4_normal,
  num_samples=5000,
  data_name="80 yo",
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:  25% (6000/24000)
Sampling:  50% (12000/24000)
Sampling:  75% (18000/24000)
Sampling: 100% (24000/24000)
Sampling: 100% (24000/24000), done.
Messages received during sampling:
  Gradient evaluation took 7e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 5e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.

In [49]:
adni.print_ci_param(fit=posterior_80_yo_apoe4, param_name="a")
adni.print_ci_param(fit=posterior_80_yo_apoe4, param_name="b")


95% confidence interval for 'a':
	--> 2.5% threshold................ -6.59817
	--> Median........................ -2.71352
	--> 97.5% threshold...............  0.18776
95% confidence interval for 'b':
	--> 2.5% threshold................ -1.36544
	--> Median........................  2.36138
	--> 97.5% threshold...............  6.44312


In [50]:
adni.get_params_box_plot(fit=posterior_80_yo_apoe4, model_params=["a", "b"])


In [51]:
waic_80_yo_apoe4_normal = adni.get_waic(
  fit=posterior_80_yo_apoe4, data_name="80 yo", sample_size_waic=5000
)
waic_80_yo["DX ~ APOE4 (normal)"] = waic_80_yo_apoe4_normal


  0%|          | 0/4 [00:00<?, ?it/s]

In [52]:
adni.pretty_print_waic(waic=waic_80_yo)

DX ~ N.A. (intercept only) ....................... 6.7615281
DX ~ APOE4 (normal) .............................. 6.4522064


We see that the model with `APOE4` performs better in terms of WAIC than the intercept-only model (6.45 against 6.76), but not by a large margin. This hints that `APOE4` helps predict `DX` (Alzheimer's disease), but that it could be dropped for the sake of simplicity. Domain knowledge ([1](https://ici.radio-canada.ca/nouvelle/1866074/alzheimer-apoe4-role-lipide-transport), in French) argues that `APOE4` is indeed a good predictor for Alzheimer's disease, yet in our data this variable is of little help. With only 4 subjects, the sample probably does not represent the actual distribution of all patients who could be tested for `APOE4`; increasing the number of patients may give more weight to `APOE4`.

The limited usefulness of `APOE4` on this dataset also shows in the intervals computed just above: 0 lies within the 95% interval for $b$ (the parameter corresponding to `APOE4`), so the data are compatible with `APOE4` having no effect at all.
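
To make the effect concrete on the probability scale, and in the spirit of the hint, the sketch below plugs the posterior medians of both models into $p_i$ and draws many Bernoulli samples from each. Using medians only is a simplification: it ignores the posterior uncertainty, which is large here, so the figures are indicative at best.

```python
import math
import random


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


# Posterior medians copied from the interval outputs above.
a_only = -1.25547                      # intercept-only model
a_apoe4, b_apoe4 = -2.71352, 2.36138   # model with APOE4

p_without = inv_logit(a_only)
p_with = {x: inv_logit(a_apoe4 + b_apoe4 * x) for x in (0, 1)}

rng = random.Random(0)  # fixed seed so that the run is reproducible


def simulate(p, n=10_000):
    """Fraction of positive diagnoses among n draws from Binomial(1, p)."""
    return sum(rng.random() < p for _ in range(n)) / n


print(f"APOE4 ignored: p = {p_without:.3f}, simulated = {simulate(p_without):.3f}")
for x, p in p_with.items():
    print(f"APOE4 = {x}:     p = {p:.3f}, simulated = {simulate(p):.3f}")
```

The gap between the two `APOE4` values is what the coefficient $b$ encodes; the width of its interval says how uncertain that gap is.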

---


### Question 3

In the last lesson, we fitted a model to predict the diagnosis using only the size of the brain (norm_brain). Compare this model and the one of question 1 in terms of WAIC. Is one better than the other ?

#### Answer

We run the PyStan sampler with a single variable, `norm_brain`, which is computed in the Python file.

In [53]:
code_age_norm_brain_normal = adni._generate_pystan_code(
  param_distr={
    "a": "normal(-1, 2)",
    "b": "normal(-1, 2)",
  },
  p_i_formula="p_i[i] = exp(a + b * x1[i])/(1 + exp(a + b * x1[i]));",
)
posterior_age_norm_brain_normal = adni.run_stan_model(
  features=["norm_brain"], program_code=code_age_norm_brain_normal, num_samples=1000
)


Building...



Building: found in cache, done.Messages from stanc:
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Sampling:   0%
Sampling:   1% (100/8000)
Sampling:   2% (200/8000)
Sampling:   5% (400/8000)
Sampling:   8% (600/8000)
Sampling:  12% (1000/8000)
Sampling:  16% (1300/8000)
Sampling:  20% (1600/8000)
Sampling:  24% (1900/8000)
Sampling:  29% (2300/8000)
Sampling:  32% (2600/8000)
Sampling:  51% (4100/8000)
Sampling:  70% (5600/8000)
Sampling:  84% (6700/8000)
Sampling: 100% (8000/8000)
Sampling: 100% (8000/8000), done.
Messages received during sampling:
  Gradient evaluati

In [54]:
adni.get_params_box_plot(fit=posterior_age_norm_brain_normal, model_params=["a", "b"])

In [55]:
adni.print_ci_param(fit=posterior_age_norm_brain_normal, param_name="a")
adni.print_ci_param(fit=posterior_age_norm_brain_normal, param_name="b")

95% confidence interval for 'a':
	--> 2.5% threshold................ -0.74544
	--> Median........................ -0.58148
	--> 97.5% threshold............... -0.41693
95% confidence interval for 'b':
	--> 2.5% threshold................ -1.34461
	--> Median........................ -1.14644
	--> 97.5% threshold............... -0.96852


We compute the WAIC of this model and add it to the results of the previous models.

In [56]:
waic_norm_brain = adni.get_waic(fit=posterior_age_norm_brain_normal)
waic_res["DX ~ norm_brain"] = waic_norm_brain


  0%|          | 0/826 [00:00<?, ?it/s]

In [57]:
adni.pretty_print_waic(waic_res)

DX ~ AGE + APOE4 (normal) ........................ 951.6598305
DX ~ AGE + APOE4 (uniform) ....................... 952.1127553
DX ~ AGE + APOE4 (exponential) ................... 952.2045760
DX ~ AGE ......................................... 1105.8930574
DX ~ APOE4 (normal) .............................. 960.8828408
DX ~ APOE4 (uniform) ............................. 960.4320280
DX ~ APOE4 (exponential) ......................... 960.4046225
DX ~ norm_brain .................................. 922.2999914


We note that the `DX ~ norm_brain` model is better (WAIC $\approx 922$) than the `DX ~ AGE + APOE4` models (WAIC $\approx 952$). Since WAIC is our metric of choice and lower is better, we would choose the former over the latter.

It is interesting to note that a smart choice of variable, here one built from two measurements (dividing one by the other), performs better than simply taking a bigger model with more variables. This small experiment supports the idea that working with experts, gaining domain knowledge or (in this case) thinking before blindly applying a bigger model may work better.

Refinement beats brute force! (sometimes)
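
The construction of `norm_brain` can be checked on the four rows shown in Question 2. The sketch below assumes that `norm_brain` is the standardized ratio `WholeBrain.bl / ICV`. This is an assumption about `assignment_utils.py`, but it is consistent with the mean close to 0 and the standard deviation close to 1 in the `describe()` output.

```python
# Rows copied from the table of 80-year-old subjects:
# (WholeBrain.bl, ICV, norm_brain).
rows = [
    (1051053.0, 1714028.0, -1.030522),
    (943825.0, 1388961.0, 0.238778),
    (946606.0, 1464818.0, -0.398452),
    (1025968.0, 1626103.0, -0.691147),
]
ratios = [brain / icv for brain, icv, _ in rows]
norms = [norm for _, _, norm in rows]

# If norm_brain = (ratio - mean) / std, every pair of rows gives the same
# slope 1 / std.
slopes = [
    (norms[k] - norms[0]) / (ratios[k] - ratios[0]) for k in range(1, len(rows))
]
implied_std = 1.0 / slopes[0]
implied_mean = ratios[0] - norms[0] * implied_std

print("slopes:", [round(s, 3) for s in slopes])
print(f"implied mean and std of the ratio: {implied_mean:.4f}, {implied_std:.4f}")
```

Nearly identical slopes indicate that `norm_brain` is an affine function of the ratio, as the assumption predicts.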