## Exam 22/03/2019


We work with the database ADNI, containing information for a clinical cohort of healhty volunteers and patients with Alzheimer's disease.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    RepeatedKFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
import itertools

pio.templates.default = "plotly_white"


In [2]:
dataset = pd.read_csv('data.csv')
dataset[:10]

Unnamed: 0.1,Unnamed: 0,RID,Hippocampus_volume,AGE,PTGENDER,PTEDUCAT,ADAS11,FDG,DX
0,1,2,0.0042,74.3,1,16,10.67,1.369264,NL
1,16,3,0.002769,81.3,1,18,22.0,1.09079,Dementia
2,27,5,0.004312,73.7,1,16,8.67,1.29799,NL
3,46,8,0.004355,84.5,2,18,5.0,1.276278,NL
4,60,10,0.003728,73.9,2,12,12.33,1.118814,Dementia
5,65,14,0.005301,78.5,2,12,4.33,1.25699,NL
6,80,16,0.005406,65.4,1,9,10.33,1.395434,NL
7,93,21,0.005607,72.6,2,18,6.67,1.38279,NL
8,114,23,0.005298,71.7,1,14,4.0,1.364222,NL
9,221,43,0.004564,76.2,1,16,7.0,1.308406,NL


Data fields:
- RID: subject's identifier
- Hippocampus_volume = normalized volume of the brain region hippocampus
- AGE: subject's age
- PTGENDER: subject's sex (1 Male, 2 Female)
- PTEDUCAT: years of education
- ADAS11: clinical score (Alzheimer's disease assessment scale)
- FDG: measure of average brain metabolism
- DX: clinical diagnosis. In order to severity we have NL (normal), MCI (mild cognitive impairment), and Dementia

__Exercise 1 (2 pts).__ Estimate mean and standard deviation for the classification accuracy of the algorithms Logistic Regression and Nearest Neighbours classifier for predicting clinical diagnosis based on the variables Hippocampus_volume, AGE, PTGENDER, PTEDUCAT, ADAS11, and FDG. (use at least 1000 repetitions)

In [3]:
pred_cols = ["Hippocampus_volume", "AGE", "PTGENDER", "PTEDUCAT", "ADAS11", "FDG"]
target = "DX"

X = dataset[pred_cols]
y = pd.Categorical(dataset[target]).codes


Setting `n_repeats = 1000` would take ~12 minutes to run, so if n_repeats < 1000, it is because I didn't have enough time.

In [4]:
n_repeats = 100
rkf = RepeatedKFold(n_splits=2, n_repeats=n_repeats)

acc = {"lr": [], "knn": []}

lr = LogisticRegression(max_iter=1e4, n_jobs=-1, tol=0.1)
knn = KNeighborsClassifier(n_jobs=-1)

for train, test in rkf.split(X, y):
    X_train, y_train = X.iloc[train], y[train]
    X_test, y_test = X.iloc[test], y[test]

    # Logistic Regression
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    acc["lr"].append(accuracy_score(y_test, y_pred))

    # KNN
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc["knn"].append(accuracy_score(y_test, y_pred))


In [5]:
df_acc = pd.DataFrame(acc)
px.histogram(df_acc, barmode="overlay")

In [6]:
print("---> Estimate of the mean")
print(df_acc.mean())
print()

print("---> Estimate of the std")
print(df_acc.std())

---> Estimate of the mean
lr     0.667738
knn    0.621199
dtype: float64

---> Estimate of the std
lr     0.015562
knn    0.016604
dtype: float64


__Exercise 2 (1.5 pts).__ Use boostrap to estimate the significance of the difference between the accuracy of Logistic regression and the one of Nearest Neighbours.

In [7]:
def compute_t(x, y):
    """
    Computes the t-statistic
    """
    # Compute mean
    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # Compute var
    sigma2_x = np.var(x)
    sigma2_y = np.var(y)

    return (mean_x - mean_y) / np.sqrt(sigma2_x / len(x) + sigma2_y / len(y))


def center_concat_data(x, y):
    """
    Computes the centered data of the x and y distributions
    """

    # Center with respect to the concatenation of both distributions
    z = np.concatenate([x, y])
    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # Translate the data with respect to the mean of z
    x_tilde = x - mean_x + np.mean(z)
    y_tilde = y - mean_y + np.mean(z)

    return x_tilde, y_tilde


def compute_t_boot(x, y, n_reps):
    """
    Computes the bootstrapped t statistics for two distributions x and y
    """

    t_boot = []

    # Computate centered x and y
    x_tilde, y_tilde = center_concat_data(x, y)

    # Generate `n_reps` bootstrapped samples from concatenation of centered x and y
    z_tilde = np.concatenate([x_tilde, y_tilde])
    boot_samples = np.random.choice(
        a=z_tilde,
        size=(n_reps, len(x) + len(y)),
        replace=True,
    )

    # compute the t statistic for each bootstrapped sample
    for i in range(n_reps):
        x_sim = boot_samples[i, : len(x)]  # first len(x) samples of i-th boot replicate
        y_sim = boot_samples[i, len(x) :]  # the rest of the i-th boot replicate
        t_boot.append(compute_t(x_sim, y_sim))
    return t_boot


def plot_null_distr(t_obs, t_boot):
    """
    Plots the null hypothesis distribution
    """
    fig = px.histogram(pd.Series(t_boot, name="Null distribution"))
    fig.add_vline(
        x=t_obs,
        line_dash="dot",
        annotation_text="Observed t",
        annotation_position="top",
        annotation_font_size=20,
        annotation_font_color="black",
    )
    fig.update_layout(showlegend=True)
    fig.show()


In [8]:
# Compute observed t-statistic and bootstrapped-t statistic
acc_lr = df_acc["lr"]
acc_knn = df_acc["knn"]
n_reps = 1000

t_obs = compute_t(acc_lr, acc_knn)
t_boot = compute_t_boot(
    x=acc_lr,
    y=acc_knn,
    n_reps=n_reps,
)
plot_null_distr(t_obs, t_boot)


In [9]:
def test_hypothesis(t_obs, t_boot, repetitions):
    """
    Compare an observed t and compute its bootstrap.
    Then, adequately accept or reject an hypothesis
    """
    # Computes test significance
    boot_stat = np.sum(np.abs(t_obs) > np.abs(t_boot)) / repetitions
    confidence_interval = np.quantile(a=t_boot, q=[0.025, 0.975])
    print(
        f"Observed t statistic: {np.round(t_obs, 4)}",
        f"95% confidence interval: {np.round(confidence_interval, 4)}",
        f"Significance of the test: {1-np.round(boot_stat, 4)}",
        sep="\n",
    )


test_hypothesis(t_obs, t_boot, n_reps)


Observed t statistic: 28.9939
95% confidence interval: [-1.7509  1.9378]
Significance of the test: 0.0


We see that the t-statistic is very far from the range of values in our null-distributions. It is also outside of our 95% confidence interval. We conclude that the accuracies are actually different between the KNN and the Logistic Regression classifiers. In other words, the difference between the accuracy of KNN and the one of LR is significant.
___

__Exercise 3 (2 pts).__ Use the information criteria to decide what is the best polynomial model that explains the relatioship between FDG (predictor) and ADAS11 (target) in the group MCI? And in the group NL? 

In [10]:
def fit_poly(X, y):
    """
    Fit a linear model with polynomial features to data `X` and target `y`.
    """
    w_ml = np.linalg.solve(X.T.dot(X), X.T.dot(y))
    w_ml.reshape(1, (len(w_ml)))
    sigma2_ml = np.mean((y - X.dot(w_ml.T)) ** 2)
    return w_ml, sigma2_ml


def gaussian_loglik(X, y, w_ml, sigma2_ml):
    """
    Compute the Gaussian log-likelihood of parameterers `w_ml`, `sigma2_ml` with data `X` and target `y`.
    """
    N = len(y)
    loglik = -N / 2 * np.log(2 * np.pi * sigma2_ml)  # term 1
    loglik -= 1 / (2 * sigma2_ml) * np.sum((y - X.dot(w_ml.T)) ** 2)  # term 2
    return loglik


def deviance(loglik):
    """
    Compute the deviance of a model, given its log-likelihood.
    """
    return -2 * np.array(loglik)


def AIC(training_deviance, n_params):
    """
    Compute the Akaike Information Criterion of a model.
    """
    return training_deviance + 2 * n_params


def AICc(training_deviance, n_params, n_obs):
    """
    Compute the corrected Akaike Information Criterion of a model.
    Usually used if `n_obs / n_params < 40`.
    """
    aic = AIC(training_deviance, n_params)
    corr = 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + corr


def BIC(training_deviance, n_params, n_obs):
    """
    Compute the Akaike Information Criterion of a model.
    """
    return training_deviance + 2 * n_params * np.log(n_obs)


In [11]:
def inf_crit(DX_subset):
    X = dataset[dataset["DX"] == DX_subset][["FDG"]]
    y = dataset[dataset["DX"] == DX_subset]["ADAS11"]

    all_loglik = []
    all_AIC = []
    all_AICc = []
    all_BIC = []
    n_params = np.arange(1, 7)

    for d in n_params:
        X_poly = np.array([X["FDG"] ** i for i in range(d)]).T
        w_ml, sigma2_ml = fit_poly(X_poly, y)
        loglik = gaussian_loglik(X_poly, y, w_ml, sigma2_ml)
        all_loglik.append(loglik)
        
        
    dev = deviance(loglik=all_loglik)
    all_AIC.append(AIC(training_deviance=dev, n_params=n_params))
    all_AICc.append(AICc(training_deviance=dev, n_params=n_params, n_obs=len(X)))
    all_BIC.append(BIC(training_deviance=dev, n_params=n_params, n_obs=len(X)))

    df_info = pd.DataFrame(
        {
            "AIC": all_AIC[0],
            "AICc": all_AICc[0],
            "BIC": all_BIC[0],
        },
        index=n_params,
    )
    return df_info

In [12]:
df_info = inf_crit(DX_subset="MCI")
fig = px.line(df_info)
fig.update_layout(xaxis_title="Number of parameters", yaxis_title="Info criterion")
fig


Best is lowest BIC, that is with 2 parameters (polynomial of degree 1).

In [13]:
df_info = inf_crit(DX_subset="NL")
fig = px.line(df_info)
fig.update_layout(xaxis_title="Number of parameters", yaxis_title="Info criterion")
fig


Best is lowest BIC, that is with 1 parameter (polynomial of degree 0).

__Exercise 4 (1.5 pts).__ What is the best combination of variables (excluded DX) for predicting ADAS11 with a linear model?

__Exercise 5 (1 pts).__ Consider only the healtiest subjects for which diagnosis == NL and ADAS11 < 5. Compare the performance of the classifiers of Exercise 1 for discriminating this group from the group MCI.

In [17]:
pred_cols = ["Hippocampus_volume", "AGE", "PTGENDER", "PTEDUCAT", "ADAS11", "FDG"]
target = "DX"

X = dataset[dataset["ADAS11"] < 5][pred_cols]
y = pd.Categorical(dataset[dataset["ADAS11"] < 5][target]).codes


Setting `n_repeats = 1000` would take ~12 minutes to run, so if n_repeats < 1000, it is because I didn't have enough time.

In [18]:
n_repeats = 100
rkf = RepeatedKFold(n_splits=2, n_repeats=n_repeats)

acc = {"lr": [], "knn": []}

lr = LogisticRegression(max_iter=1e4, n_jobs=-1, tol=0.1)
knn = KNeighborsClassifier(n_jobs=-1)

for train, test in rkf.split(X, y):
    X_train, y_train = X.iloc[train], y[train]
    X_test, y_test = X.iloc[test], y[test]

    # Logistic Regression
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    acc["lr"].append(accuracy_score(y_test, y_pred))

    # KNN
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc["knn"].append(accuracy_score(y_test, y_pred))


In [19]:
df_acc = pd.DataFrame(acc)
px.histogram(df_acc, barmode="overlay")

In [20]:
print("---> Estimate of the mean")
print(df_acc.mean())
print()

print("---> Estimate of the std")
print(df_acc.std())

---> Estimate of the mean
lr     0.745376
knn    0.733441
dtype: float64

---> Estimate of the std
lr     0.038563
knn    0.037023
dtype: float64


__Exercise 6 (1 pts).__ Is it true that a model with the lowest AIC is the best one? 

Not necessarily. For example if the number of observations is low (rule of thumb: if `n_obs / n_params < 40`), the AICc may be prefered. \
There is also the BIC. Similarly to the AIC, the BIC penalizes the increase of the likelihood by the number of parameters used to fit the model. The penalty term is however much larger for BIC than for AIC.

__Exercise 7 (1 pts).__ Explain the concept of bias-variance decomposition

As the number of parameters in a model increases, the bias decreases because we are able to better fit the model (even up to over-fitting actually). However, the variance increases because the model needs to "bend" in order to fit as many data points as possible. This usually means that the model correctly predicts the training data points, but it is unable to generalize to (unseen) test data points.\
As a result, instead of monitoring the $bias$, or the $variance$ alone, we consider the $bias + variance$ and we look for an "elbow" in our plot. This elbow represents the best compromise between the two following phenomena:
- the $bias$ has decreased as much as possible (which makes $bias + variance$ decrease)
- the variance has not yet "blown up" too much

Note the bias decreases as you add more parameters, so "as much as possible" in the first point means "as much as possible, without having the variance blowing up".

__Exercise 8 (1 pts).__ From Exercise 1 compute the probability that the average prediction accuracy of Logistic classifier is greater than 0.66.

In [14]:
px.histogram(df_acc)