# Early prediction of Alzheimer's disease and related dementias based on social determinants using TabNet
In this Jupyter Notebook we use social determinants to estimate a composite score that is associated with Alzheimer's disease and related dementias. We estimate the score with TabNet [1]; a higher score is better, and the maximum possible score is 384. TabNet is an attention-based neural network tailor-made for regression and classification on tabular data. The data set has many missing values, which we handle with simple imputation: the mean for numerical variables and the most frequent value for categorical variables.
Let's start!

## Part 1: Packages and frameworks
First we need to install and import some Python packages. In this notebook we will use PyTorch to implement a TabNet model to interpret the variables and predict the **composite_score**.

In [1]:
%%capture
!pip install pytorch-tabnet

In [2]:
import json
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler

## Part 2: Load data
In this part we load the data and do some initial exploration.

In [4]:
# Load datasets
train_features = pd.read_csv('dataset/train_features.csv') # load train features
train_labels = pd.read_csv('dataset/train_labels.csv') # load train labels
test_features = pd.read_csv('dataset/test_features.csv') # load test features
submission_format = pd.read_csv('dataset/submission_format.csv') # submission format

In [5]:
print(f"In the training data we have {train_features.shape[1]} columns")
print(f"In the test data we have {test_features.shape[1]} columns")

In the training data we have 184 columns
In the test data we have 184 columns


In [6]:
print(f"In the training data we have {train_features['uid'].nunique()} unique patients and there are {train_features.shape[0]} rows in the training data")
print(f"In the label data we have {train_labels['uid'].nunique()} unique patients and there are {train_labels.shape[0]} rows in the label data")

In the training data we have 3276 unique patients and there are 3276 rows in the training data
In the label data we have 3276 unique patients and there are 4343 rows in the label data


We see that the label file has more rows than the training features. This is because, for some patients, we want to estimate the composite score for both 2016 (4 years in the future) and 2021 (9 years in the future). This is also the case for the test data:

In [7]:
print(f"In the test data we have {test_features['uid'].nunique()} unique patients and there are {test_features.shape[0]} rows in the test data")
print(f"In the submission format we have {submission_format['uid'].nunique()} unique patients and there are {submission_format.shape[0]} rows in the submission format")

In the test data we have 819 unique patients and there are 819 rows in the test data
In the submission format we have 819 unique patients and there are 1105 rows in the submission format


In [8]:
# Count the number of patients with a composite score for both 2016 and 2021 in the training data
(train_labels.groupby("uid").count()["year"] == 2).value_counts()

year
False    2209
True     1067
Name: count, dtype: int64

In [9]:
# Count the number of patients for whom we need to predict the composite score for both 2016 and 2021 in the test data
(submission_format.groupby("uid").count()["year"] == 2).value_counts()

year
False    533
True     286
Name: count, dtype: int64

### Exploring missing values

In [10]:
# How many columns have at least one missing value (training data)
(train_features.isna().sum() != 0).value_counts()

True     182
False      2
Name: count, dtype: int64

In [11]:
# How many columns have at least one missing value (test data)
(test_features.isna().sum() != 0).value_counts()

True     182
False      2
Name: count, dtype: int64

In [12]:
round((train_features.isna().sum().sum() / (train_features.shape[0]*train_features.shape[1]))*100,2)

22.45

In [13]:
print("In total there are " + str(round((train_features.isna().sum().sum() / (train_features.shape[0]*train_features.shape[1]))*100,2)) + " % missing values in the training data")

In total there are 22.45 % missing values in the training data


In [14]:
print("In total there are " + str(round((test_features.isna().sum().sum() / (test_features.shape[0]*test_features.shape[1]))*100,2)) + " % missing values in the test data")

In total there are 21.92 % missing values in the test data
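
The overall percentages above do not show how the missing values are spread across columns. A minimal sketch of a per-column breakdown follows; the `toy_features` frame and its column names are hypothetical stand-ins for `train_features`, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_features
toy_features = pd.DataFrame({
    "uid": ["a", "b", "c", "d"],
    "age": [61.0, np.nan, 70.0, 65.0],
    "income": [np.nan, np.nan, np.nan, 1200.0],
})

# Share of missing values per column, in percent, highest first
missing_pct = toy_features.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(2))

# Overall share of missing cells, equivalent to the sum / (rows * columns) computation
overall_pct = round(toy_features.isna().to_numpy().mean() * 100, 2)
print(overall_pct)
```

Applied to the real data, the same `isna().mean()` call on `train_features` would show which columns are mostly empty and might be candidates for dropping rather than imputing.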


## Part 3: Preprocessing
In this part we preprocess the data based on our findings in Part 2.

In [15]:
# duplicate features where we need to estimate composite score for both 2016 and 2021 (training data)
train_data = train_labels.merge(train_features, on="uid")
train_data["pred_year"] = train_data["year"]-2012

In [16]:
# duplicate features where we need to estimate composite score for both 2016 and 2021 (test data)
aligned_test_features = submission_format[["uid","year"]].merge(test_features, on="uid")
aligned_test_features["pred_year"] = aligned_test_features["year"]-2012
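
The merges above rely on each `uid` appearing exactly once in the feature table, and on the merged rows keeping the order of the label (or submission) rows. The sketch below shows one way to make both assumptions explicit. The toy frames are hypothetical, and `how="left"` with `validate="many_to_one"` is a suggested variation, not what the notebook itself uses.

```python
import pandas as pd

# Hypothetical toy stand-ins for train_labels and train_features
labels = pd.DataFrame({
    "uid": ["a", "a", "b"],
    "year": [2016, 2021, 2021],
    "composite_score": [180, 175, 210],
})
features = pd.DataFrame({"uid": ["a", "b"], "age": [61, 70]})

# validate="many_to_one" raises MergeError if a uid appears more than once in features
merged = labels.merge(features, on="uid", how="left", validate="many_to_one")
merged["pred_year"] = merged["year"] - 2012  # 2016 -> 4 and 2021 -> 9 years ahead

# One output row per label row, in the same order as the labels
assert len(merged) == len(labels)
print(merged)
```

A left join keeps every label row even when a `uid` has no features, so a missing patient shows up as NaN values instead of silently dropping a row from the submission.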

In [None]:
# Separate features and target variable
X = train_data.drop(columns=['uid', 'year', 'composite_score'])
y = train_data['composite_score']

In [18]:
# Handle missing values
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

In [19]:
# Select numerical and categorical columns
num_cols = X.select_dtypes(include=['float64', 'int64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

In [20]:
# Impute training features
X[num_cols] = num_imputer.fit_transform(X[num_cols])
X[cat_cols] = cat_imputer.fit_transform(X[cat_cols])

In [21]:
# Encode categorical variables
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le
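
`LabelEncoder.transform` raises a `ValueError` when it meets a category that was not present during fitting, which can happen when the test data is encoded later. As a hedged alternative, the sketch below uses scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. The toy column and its values are hypothetical, and this is not the encoding the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy categorical columns
train_cats = pd.DataFrame({"education": ["low", "high", "mid", "low"]})
test_cats = pd.DataFrame({"education": ["mid", "unknown_level"]})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train_cats)
test_encoded = encoder.transform(test_cats)
print(test_encoded.ravel())
```

Categories are sorted before numbering, so `high`, `low`, `mid` become 0, 1, 2, and the unseen value becomes -1.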

In [22]:
# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
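
In the cells above, the imputers are fitted on all rows before the split, so the validation rows contribute to the imputed means. A minimal sketch of the alternative order (split first, then fit the imputer on the training part only) is shown below. The toy data and the variable names `toy_X`, `X_tr`, `X_va` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy numeric feature with missing values
toy_X = pd.DataFrame(
    {"age": [60.0, np.nan, 70.0, 80.0, np.nan, 65.0, 75.0, 90.0, 55.0, 85.0]}
)
toy_y = pd.Series(range(10))

# Split first, then fit the imputer on the training part only
X_tr, X_va, y_tr, y_va = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
imputer = SimpleImputer(strategy="mean")
X_tr_imp = imputer.fit_transform(X_tr)
X_va_imp = imputer.transform(X_va)  # reuses the mean learned from the training rows

print(imputer.statistics_)
```

Whether this changes the validation RMSE noticeably on the real data is not something this sketch can show; it only keeps the validation estimate free of information from the validation rows.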

## Part 4: Initialize model and start training
Here we initialize our TabNet model and start the training. We continuously validate our model on the validation set after each epoch.

In [None]:
# Initialize TabNet with regularization parameters
tabnet = TabNetRegressor(
    # Increase lambda_sparse for more aggressive feature selection regularization
    lambda_sparse=1e-3,
    # Set optimizer_params with weight decay (L2 regularization)
    optimizer_params={'lr': 2e-3, 'weight_decay': 1e-5},
)

# Train TabNet
tabnet.fit(
    X_train.values, np.asarray(y_train).reshape(-1, 1),
    eval_set=[(X_train.values, np.asarray(y_train).reshape(-1, 1)), (X_val.values, np.asarray(y_val).reshape(-1, 1))],
    eval_metric=['rmse'],
    max_epochs=100,
    patience=10,
    batch_size=256,
    virtual_batch_size=128
)

# Predictions and evaluation
predictions = tabnet.predict(X_val.values)
print('Validation RMSE:', np.sqrt(mean_squared_error(np.asarray(y_val).reshape(-1, 1), predictions)))



epoch 0  | loss: 27772.97882| val_0_rmse: 364.24076| val_1_rmse: 362.62515|  0:00:06s
epoch 1  | loss: 25746.66436| val_0_rmse: 1189.28869| val_1_rmse: 1189.72465|  0:00:06s
epoch 2  | loss: 22131.63837| val_0_rmse: 442.61707| val_1_rmse: 450.47517|  0:00:06s
epoch 3  | loss: 16583.43096| val_0_rmse: 2892.20303| val_1_rmse: 2900.39649|  0:00:07s
epoch 4  | loss: 10510.3039| val_0_rmse: 2617.58171| val_1_rmse: 2617.6321|  0:00:07s
epoch 5  | loss: 5684.32283| val_0_rmse: 3222.36787| val_1_rmse: 3221.00857|  0:00:08s
epoch 6  | loss: 3247.88749| val_0_rmse: 438.30011| val_1_rmse: 413.19221|  0:00:08s
epoch 7  | loss: 2649.79549| val_0_rmse: 60.66295| val_1_rmse: 58.09955|  0:00:09s
epoch 8  | loss: 2423.52051| val_0_rmse: 60.92015| val_1_rmse: 58.83104|  0:00:09s
epoch 9  | loss: 2318.90805| val_0_rmse: 61.42183| val_1_rmse: 59.4533 |  0:00:10s
epoch 10 | loss: 2213.03895| val_0_rmse: 64.87035| val_1_rmse: 58.70809|  0:00:10s
epoch 11 | loss: 2081.13734| val_0_rmse: 133.22496| val_1_rmse



Validation RMSE: 58.03038024812043
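
A validation RMSE is easier to interpret next to a simple baseline. The sketch below computes RMSE by hand and compares a model's predictions with a baseline that always predicts the mean target. The arrays are hypothetical toy values, not results from this data set.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy scores, not taken from the dataset
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_model = np.array([110.0, 140.0, 190.0, 260.0])

# Baseline that always predicts the mean of the targets
y_baseline = np.full_like(y_true, y_true.mean())

print("model RMSE:", rmse(y_true, y_model))
print("baseline RMSE:", rmse(y_true, y_baseline))
```

In the notebook, the corresponding baseline would be `rmse(y_val, np.full(len(y_val), y_train.mean()))`, using the training mean so that the baseline sees no validation targets.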


In [24]:
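# Train a final model on all development data (train + val), with max_epochs based on the best epoch from the previous cell
# Note: TabNetRegressor() is created with default hyperparameters here, not the regularization settings used above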
final_model = TabNetRegressor()

# Train TabNet
final_model.fit(
    X.values, np.asarray(y).reshape(-1, 1),
    eval_set=[(X.values, np.asarray(y).reshape(-1, 1))],
    max_epochs=45,
    batch_size=256,
    virtual_batch_size=128,
    eval_metric=['rmse'],
)



epoch 0  | loss: 27508.19373| val_0_rmse: 103.73148|  0:00:00s
epoch 1  | loss: 24490.67236| val_0_rmse: 88.35856|  0:00:01s
epoch 2  | loss: 19295.88348| val_0_rmse: 128.40075|  0:00:01s
epoch 3  | loss: 12034.27692| val_0_rmse: 98.64919|  0:00:02s
epoch 4  | loss: 5874.52087| val_0_rmse: 82.50635|  0:00:02s
epoch 5  | loss: 3163.83479| val_0_rmse: 64.01388|  0:00:03s
epoch 6  | loss: 2589.76241| val_0_rmse: 62.39752|  0:00:03s
epoch 7  | loss: 2354.72733| val_0_rmse: 62.53137|  0:00:04s
epoch 8  | loss: 2110.50743| val_0_rmse: 60.30189|  0:00:04s
epoch 9  | loss: 2070.09264| val_0_rmse: 57.31378|  0:00:05s
epoch 10 | loss: 1947.47581| val_0_rmse: 57.51101|  0:00:05s
epoch 11 | loss: 1909.17907| val_0_rmse: 55.24808|  0:00:06s
epoch 12 | loss: 1856.01827| val_0_rmse: 57.40648|  0:00:06s
epoch 13 | loss: 1768.50841| val_0_rmse: 61.5733 |  0:00:07s
epoch 14 | loss: 1704.65739| val_0_rmse: 52.52964|  0:00:07s
epoch 15 | loss: 1687.00571| val_0_rmse: 49.77721|  0:00:08s
epoch 16 | loss: 1



## Part 5: Inference and submission
Here we prepare the test data for inference, run inference, and save the predictions to a submission file.

In [None]:
# Impute missing values in test data
aligned_test_features[num_cols] = num_imputer.transform(aligned_test_features[num_cols])
aligned_test_features[cat_cols] = cat_imputer.transform(aligned_test_features[cat_cols])

# Encode categorical variables in test data
for col in cat_cols:
    aligned_test_features[col] = label_encoders[col].transform(aligned_test_features[col])

# Drop year and id from input features
X_aligned_test = aligned_test_features.drop(columns=['uid','year'])
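
Because the model receives `.values`, column names are discarded and the features are matched by position only. The sketch below shows one way to check that the test columns match the training columns and to reorder them. The toy frames `train_X` and `test_X` are hypothetical stand-ins for `X` and `X_aligned_test`.

```python
import pandas as pd

# Hypothetical toy frames standing in for X and X_aligned_test
train_X = pd.DataFrame({"age": [61, 70], "income": [1000, 1200], "pred_year": [4, 9]})
test_X = pd.DataFrame({"pred_year": [4], "age": [66], "income": [900]})

# Fail early if the two frames do not contain the same set of columns
missing = set(train_X.columns) - set(test_X.columns)
extra = set(test_X.columns) - set(train_X.columns)
assert not missing and not extra, f"missing={missing}, extra={extra}"

# Reorder the test columns to match the training order
test_X = test_X[train_X.columns]
print(list(test_X.columns))
```

In the notebook, the equivalent step would be `X_aligned_test = X_aligned_test[X.columns]` before calling `predict`.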

In [26]:
# Predict on the aligned test dataset
aligned_test_predictions = final_model.predict(X_aligned_test.values)

In [27]:
# Prepare the submission file
submission = submission_format.copy()
submission['composite_score'] = aligned_test_predictions.round().astype(int)

# Save the submission file
submission_file_path = 'submission_tabnet.csv'
submission.to_csv(submission_file_path, index=False)
print(f"Submission file saved to: {submission_file_path}")

Submission file saved to: submission_tabnet.csv
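
A regressor is not constrained to the valid score range, so predictions can fall above the maximum of 384 mentioned in the introduction. The sketch below clips predictions before rounding. The lower bound of 0 is an assumption, and the `raw_predictions` array is a hypothetical example with one prediction per row.

```python
import numpy as np

MAX_SCORE = 384  # maximum possible composite score, as stated in the introduction
MIN_SCORE = 0    # assumption: the score cannot be negative

# Hypothetical raw predictions with shape (n, 1)
raw_predictions = np.array([[-3.2], [150.6], [401.9]])

# Flatten, clip to the valid range, then round to integers
clipped = np.clip(raw_predictions.ravel(), MIN_SCORE, MAX_SCORE).round().astype(int)
print(clipped)
```

The same `np.clip` call could be applied to `aligned_test_predictions` before it is written to the submission file.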


## References
[1] [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)