## NHANES Dataset Alpha Analysis

This notebook examines how the alpha parameter influences the predictive model, using the National Health and Nutrition Examination Survey (NHANES) dataset. The dataset can be downloaded from https://wwwn.cdc.gov/nchs/nhanes/.

The dataset has been preprocessed, which included removing unnecessary variables and missing data. More information is available in the data directory.

In [1]:
import os
import pandas as pd
import random
import numpy as np
import torch
from train import train_conformance_inference
from data_real import load_data

torch.manual_seed(0)
random.seed(0)
np.random.seed(0)


We initialize the main parameters used to train the predictive model. The model is trained using `n_folds`-fold cross-validation. The number of epochs, batch size, learning rate, and weight decay are set to 1000, 64, 0.001, and 0.01, respectively. These values should be adjusted to the problem at hand, for example through a grid search.

The classification model is trained with an early-stopping patience of 4 and a minimum delta of 0. The regression model uses an early-stopping patience of 6, also with a minimum delta of 0.
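
To make the roles of patience and minimum delta concrete, here is a minimal sketch of a typical early-stopping rule. The `EarlyStopper` class and the validation losses are hypothetical and only illustrate the idea; the logic actually used inside `train_conformance_inference` is not shown in this notebook and may differ.

```python
class EarlyStopper:
    """Hypothetical helper: stop once the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive checks."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember the new best loss and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            # No improvement: one more "bad" epoch.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch (illustration only).
val_losses = [0.90, 0.70, 0.65, 0.66, 0.65, 0.67, 0.66, 0.64]

stopper = EarlyStopper(patience=4, min_delta=0.0)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at)
```

With a minimum delta of 0, any strict decrease in the loss counts as an improvement; a larger patience tolerates longer plateaus before training stops.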

In [2]:
n_folds = 5
n_epochs = 1000
batch_size = 64
learning_rate = 0.001
weight_decay = 0.01
alpha = 0.1
patience_classification = 4
min_delta_classification = 0.0
patience_regression = 6
min_delta_regression = 0.0
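
The grid search mentioned above could be sketched as follows. This is an illustration under stated assumptions: `grid_search`, the candidate values in `param_grid`, and `toy_score` are all hypothetical. In practice the scoring function would train a model with the given parameters and return a validation metric.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one.

    `score_fn` maps a dict of parameters to a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values (not taken from the notebook).
param_grid = {
    "learning_rate": [0.0001, 0.001, 0.01],
    "weight_decay": [0.0, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}


def toy_score(params):
    # Placeholder standing in for "train, then return a validation metric".
    return (
        -abs(params["learning_rate"] - 0.001)
        - abs(params["weight_decay"] - 0.01)
        - abs(params["batch_size"] - 64) / 1000
    )


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

An exhaustive grid grows multiplicatively with each added parameter (27 combinations here), so the candidate lists are usually kept short.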

In this example, we use a specific combination of input variables, with the `diabetes` variable serving as the class variable. The dataset is partitioned into training and test sets in a 75/25 ratio.

In [3]:
# Clinical measures: body mass index, waist circumference, diastolic and systolic pressure, pulse, cholesterol
selected_combination = ['RIDAGEYR.x', 'BMXHT', 'BMXWT', 'BMXBMI', 'BMXWAIST', 'BPXDI1', 'BPXSY1', 'BPXPLS', 'LBDSCHSI_43', 'LBXSTR_43', 'LBXSGL_43', 'RIAGENDR', 'LBXGH_39']
file = os.path.join("./real", "data_0.csv")
data = load_data(file, selected_combination, 'diabetes')

# We divide data into training and test splits in a 75-25 ratio.
n = len(data['y'])
indices = random.sample(np.arange(n).tolist(), n)
train_split = indices[:int((3*n) / 4)]
test_split = indices[int((3*n) / 4):]

train_data = {
    'seqn': data['seqn'][train_split],
    'x': data['x'][train_split],
    'y': data['y'][train_split],
    'w': data['w'][train_split],
    'z': data['z'][train_split]
}
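
The shuffle-and-cut split used above can be checked on a toy example. The helper `shuffle_split` below is hypothetical (it is not part of the notebook's modules) and uses only the standard library; it shows that the two index lists are disjoint and together cover every row.

```python
import random


def shuffle_split(n, train_fraction=0.75, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train and test parts."""
    rng = random.Random(seed)          # local generator, reproducible
    indices = rng.sample(range(n), n)  # a random permutation of the indices
    cut = int(train_fraction * n)
    return indices[:cut], indices[cut:]


train_idx, test_idx = shuffle_split(n=10)

print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

Because the cut position is truncated to an integer, the split is exactly 75/25 only when the number of rows is a multiple of 4.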

We standardize the input data of the training set using the z-score.

In [4]:
# Scale training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data['x'] = scaler.fit_transform(train_data['x'])

The input data of the test set are also standardized, using the scaler previously fitted on the training set.

In [5]:
test_data = {
    'seqn': data['seqn'][test_split],
    'x': data['x'][test_split],
    'y': data['y'][test_split],
    'w': data['w'][test_split],
    'z': data['z'][test_split]
}

# Scale test data
test_data['x'] = scaler.transform(test_data['x'])
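
The following sketch spells out what the two scaling steps do for a single column: the mean and standard deviation are estimated on the training values only and then reused for the test values. The helper names and the toy numbers are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_zscore(column):
    """Return the mean and population standard deviation of a column."""
    return fmean(column), pstdev(column)


def apply_zscore(column, mean, std):
    """Standardize a column with previously estimated statistics."""
    return [(value - mean) / std for value in column]


# Toy values for one feature (illustration only).
train_col = [20.0, 25.0, 30.0, 35.0, 40.0]
test_col = [30.0, 45.0]

mean, std = fit_zscore(train_col)                    # statistics from training data only
train_scaled = apply_zscore(train_col, mean, std)
test_scaled = apply_zscore(test_col, mean, std)      # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step, which is why the test data are transformed but never used for fitting.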

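In the following snippet, we show how the alpha parameter influences the predictive approach: we train one predictor for each alpha level from 0.1 to 1.0 and record the `correct` value returned by `predictor.classify` on the test set.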

In [6]:
alphas = np.linspace(0.1, 1.0, num=10)

dfo = pd.DataFrame()
# We train a predictor for each alpha level
for alpha in alphas:

    # We train the predictor
    predictor = train_conformance_inference(
        data=train_data,
        n_folds=n_folds,
        n_epochs=n_epochs,
        batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        alpha=alpha,
        patience_classification=patience_classification,
        min_delta_classification=min_delta_classification,
        patience_regression=patience_regression,
        min_delta_regression=min_delta_regression
    )

    # We test the predictor
    ci_result, correct = predictor.classify(test_data)

    result = {
        'alpha': alpha,
        'correct': correct
    }

    output_df = pd.DataFrame([result])
    dfo = pd.concat([dfo, output_df], ignore_index=True)


In [7]:
dfo

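|   | alpha | correct |
|---|-------|----------|
| 0 | 0.1 | 0.901218 |
| 1 | 0.2 | 0.804632 |
| 2 | 0.3 | 0.715163 |
| 3 | 0.4 | 0.574266 |
| 4 | 0.5 | 0.564822 |
| 5 | 0.6 | 0.40064 |
| 6 | 0.7 | 0.305241 |
| 7 | 0.8 | 0.194713 |
| 8 | 0.9 | 0.094824 |
| 9 | 1.0 | 0.0 |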
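
One way to read the table is to compare `correct` with `1 - alpha`. This assumes that alpha acts as a target error rate, so that roughly a fraction `1 - alpha` of test cases is expected to be correct; that interpretation is an assumption and is not stated in this notebook. The sketch below copies the values from the table and reports where the gap is largest.

```python
# (alpha, correct) pairs copied from the table above.
results = [
    (0.1, 0.901218),
    (0.2, 0.804632),
    (0.3, 0.715163),
    (0.4, 0.574266),
    (0.5, 0.564822),
    (0.6, 0.40064),
    (0.7, 0.305241),
    (0.8, 0.194713),
    (0.9, 0.094824),
    (1.0, 0.0),
]

# Gap between the observed value and the assumed target 1 - alpha.
gaps = [(alpha, correct - (1.0 - alpha)) for alpha, correct in results]

for alpha, gap in gaps:
    print(f"alpha={alpha:.1f}  gap={gap:+.6f}")

# Alpha level with the largest absolute deviation from 1 - alpha.
worst_alpha, worst_gap = max(gaps, key=lambda pair: abs(pair[1]))
print("largest gap at alpha =", worst_alpha)
```

Under that assumption, most alpha levels stay within about 0.03 of `1 - alpha`, with the largest deviation at alpha = 0.5.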
