## Synthetic Dataset Analysis
This notebook shows how to use both the predictive model and the conformance inference model on a synthetic dataset.

In [1]:
import sys
sys.path.append("..")

import os
import pandas as pd
import random
import numpy as np
import torch
from train import train_conformance_inference
from data_sim import load_data

torch.manual_seed(0)
random.seed(0)
np.random.seed(0)


We initialize the main parameters used to train the predictive model. The model is trained using `n_folds`-fold cross-validation. The number of epochs, batch size, learning rate, and weight decay are set to 1000, 64, 0.0001, and 0.01, respectively. These values should be adjusted to the specific requirements of the problem, for example through a grid search.

The classification model is trained with an early-stopping patience of 4 and a minimum delta of 0. The regression model uses an early-stopping patience of 6, also with a minimum delta of 0.

In [None]:
n_folds = 5
n_epochs = 1000
batch_size = 64
learning_rate = 0.0001
weight_decay = 0.01
alpha = 0.1
patience_classification = 4
min_delta_classification = 0.0
patience_regression = 6
min_delta_regression = 0.0
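
To make the roles of the patience and minimum-delta parameters concrete, here is a minimal sketch of a typical early-stopping rule. The `EarlyStopper` class and the `epochs_run` helper are hypothetical names introduced for illustration, and the validation losses are made up. The training code in this repository may implement early stopping differently.

```python
class EarlyStopper:
    """Hypothetical helper: stops training when the validation loss stalls."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        # An epoch is an improvement only if the loss drops by more than min_delta.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience, min_delta=0.0):
    """Return how many epochs run before early stopping triggers."""
    stopper = EarlyStopper(patience, min_delta)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


# Made-up validation losses, for illustration only.
losses = [0.90, 0.70, 0.65, 0.66, 0.65, 0.67, 0.66, 0.68, 0.64]
stopped_with_patience_4 = epochs_run(losses, patience=4)
stopped_with_patience_6 = epochs_run(losses, patience=6)
```

With these made-up losses, a patience of 4 stops training before the late improvement at the last epoch, whereas a patience of 6 is long enough to reach it.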

In this example, we select a specific combination of input variables, with the diabetes variable serving as the class variable. The dataset is partitioned into training and test sets in a 75/25 ratio.

In [None]:
file = os.path.join("../data", "sim_1.csv")
# We select the following combination of variables
selected_combination = ['X1', 'X2', 'X3', 'X4', 'X5']
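# We load the simulation data using the load_data auxiliary function.
# Note: per the TypeError in this cell's output, load_data also requires an
# `output_var` argument, which this call does not pass yet.
data = load_data(file, selected_combination)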

# We divide data into training and test splits in a 75-25 ratio.
n = len(data['y'])
indices = random.sample(np.arange(n).tolist(), n)
train_split = indices[:int((3*n) / 4)]
test_split = indices[int((3*n) / 4):]

train_data = {
    'x': data['x'][train_split],
    'y': data['y'][train_split],
    'w': data['w'][train_split],
    'z': data['z'][train_split]
}


../data/sim_1.csv


TypeError: load_data() missing 1 required positional argument: 'output_var'

We standardize the input data of the training set using the z-score.

In [None]:
# Scale training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data['x'] = scaler.fit_transform(train_data['x'])
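
As a reminder of what the z-score computes, each feature $j$ of a sample $i$ is transformed as shown below. The symbols are our own notation: $\mu_j$ and $\sigma_j$ are the mean and the population standard deviation of feature $j$, both estimated on the training set.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$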

If needed, the input data of the training set can be balanced, in this case according to the output class variable. 

In [None]:
# Uncomment if data balancing is needed

# # Upsample unbalanced data of the training set
# from sklearn.utils import resample
# y = train_data['y'].reshape(-1)
# c1 = np.argwhere(y == 1.0).reshape(-1)
# c2 = np.argwhere(y == 0.0).reshape(-1)
# print('C1: {}'.format(c1.shape))
# print('C2: {}'.format(c2.shape))
# 
# if c1.shape[0] < c2.shape[0]:
#     c1_upsample = resample(c1, replace=True, n_samples=len(c2), random_state=42)
#     upsampled = np.concatenate([c1_upsample, c2]).reshape(-1)
# else: 
#     c2_upsample = resample(c2, replace=True, n_samples=len(c1), random_state=42)
#     upsampled = np.concatenate([c1, c2_upsample]).reshape(-1)
# print('Upsampled dataset: {}'.format(upsampled.shape))
# 
# train_data = {
#     'x': train_data['x'][upsampled],
#     'y': train_data['y'][upsampled],
#     'w': train_data['w'][upsampled],
#     'z': train_data['z'][upsampled]
# }

The input data of the test set are also standardized, using the scaler previously fitted on the training set.

In [None]:
test_data = {
    'x': data['x'][test_split],
    'o': data['x'][test_split],  # original (unscaled) inputs, kept for display
    'y': data['y'][test_split],
    'w': data['w'][test_split],
    'z': data['z'][test_split]
}

# Scale test data
test_data['x'] = scaler.transform(test_data['x'])
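
The sketch below illustrates, with plain NumPy and made-up values, why the test set is transformed with the statistics of the training set instead of being refitted. It reproduces the z-score by hand and is not the notebook's actual data.

```python
import numpy as np

# Made-up feature matrices, for illustration only.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

x_train_scaled = (x_train - mean) / std
# Reuse the training statistics; the test data never influence them.
x_test_scaled = (x_test - mean) / std
```

After scaling, the training features have zero mean and unit variance, while the test features are expressed on the same scale and may fall outside that range.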

In the following snippet, we show how to train the conformance inference approach. In this example, we use an alpha value of 0.2, passed explicitly in the call rather than taken from the `alpha` variable defined earlier. This parameter should be adjusted to the specific requirements of the problem, for example through a grid search.

In [None]:
# We train the predictor
predictor = train_conformance_inference(
    data=train_data,
    n_folds=n_folds,
    n_epochs=n_epochs,
    batch_size=batch_size,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    alpha=0.2,
    patience_classification=patience_classification,
    min_delta_classification=min_delta_classification,
    patience_regression=patience_regression,
    min_delta_regression=min_delta_regression,
    hidden_sizes_bc=[64],
    hidden_sizes_rr=[64]
)

# We test the predictor
ci_result, ci_correct = predictor.classify(test_data)
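
A grid search over parameters such as alpha could be organized as in the following sketch. The `grid_search` helper and the `toy_score` function are hypothetical. In practice, the scoring function would train the model with the given settings and return a validation metric, which is not shown here.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in `grid` and return the best one.

    `score_fn` is a hypothetical callable that takes keyword arguments and
    returns a validation score where higher is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder scoring function, for illustration only.
def toy_score(alpha, learning_rate):
    return -abs(alpha - 0.2) - abs(learning_rate - 0.001)


grid = {"alpha": [0.05, 0.1, 0.2], "learning_rate": [0.0001, 0.001]}
best_params, best_score = grid_search(toy_score, grid)
```

The cost grows with the product of the number of candidate values per parameter, so it is usually worth keeping the grid small when each evaluation involves a full cross-validated training run.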

The conformance inference approach can be used to predict the diabetes risk of patients, even though the predictive model is trained with only 33% of the data. Results are returned in the `y` field of the `ci_result` variable. The function `compute_classification_metrics` can be used to compute the area under the curve (`auc`), accuracy, recall, precision, F1-score, confusion matrix (`cm`), and cross-entropy (`ce`).

In [None]:
from metrics import compute_classification_metrics

compute_classification_metrics(test_data['w'], ci_result['y'], test_data['y'])
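
For reference, the threshold-based metrics listed above can be derived from the four cells of the confusion matrix, as in the sketch below. The `binary_metrics` function is a hypothetical helper written for illustration and is not the implementation of `compute_classification_metrics`. The AUC and the cross-entropy are left out because they require predicted probabilities rather than hard labels.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Hypothetical helper: basic metrics for 0/1 labels and 0/1 predictions."""
    y_true = np.asarray(y_true).astype(int).reshape(-1)
    y_pred = np.asarray(y_pred).astype(int).reshape(-1)

    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cm": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted class
    }


# Made-up labels and predictions, for illustration only.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```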

The predictive model can also be trained independently, as follows. In this case, the complete training set is used.

In [None]:
from train_bc import cv_loop_bc
from sklearn.model_selection import KFold
from metrics import compute_classification_metrics

k_fold = KFold(n_splits=n_folds, shuffle=False)
# Index every training sample (the previous `- 1` left the last sample out of all folds)
indexes = np.arange(len(train_data['y']))
splits = k_fold.split(indexes)

model_ce, metrics_ce = cv_loop_bc(
    data=train_data,
    splits=splits,
    n_epochs=n_epochs,
    batch_size=batch_size,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    patience=patience_classification,
    min_delta=min_delta_classification,
    hidden_sizes=[64]
    # hidden_sizes=[10, 5, 2]
)
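
To visualize how unshuffled cross-validation partitions the data, the sketch below builds contiguous folds with plain NumPy. It is intended to mimic the splits that `KFold(n_splits=n_folds, shuffle=False)` yields. The `contiguous_folds` function is a hypothetical helper and is not part of the repository.

```python
import numpy as np


def contiguous_folds(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    for val_idx in np.array_split(indices, n_folds):
        # Everything outside the validation fold is used for training.
        train_idx = np.setdiff1d(indices, val_idx)
        yield train_idx, val_idx


folds = list(contiguous_folds(n_samples=10, n_folds=5))
first_train_idx, first_val_idx = folds[0]
```

Each sample appears in exactly one validation fold, which is why the index array passed to the splitter should cover the whole training set.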

In [None]:
metrics_ce

The following snippet shows how to invoke the predictive model on the test set. Note that the `y` field of the test set is not used during prediction; it is only used to compute the error and performance metrics.

In [None]:
model_ce.eval()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
with torch.no_grad():
    yp = model_ce(torch.from_numpy(test_data['x']).to(device)).cpu().detach().numpy()

compute_classification_metrics(test_data['w'], yp, test_data['y'])

In the following snippet, we show the test dataset alongside the results stored in `ci_result`.

In [None]:
import pandas as pd

df = pd.DataFrame(test_data['o'])
df.columns = selected_combination
df['Prob.'] = ci_result['p']
df['Output'] = ci_result['y']
df['CI'] = ci_result['ci']
df['Target'] = np.int32(test_data['y'])

df