# Part 2 - Predictive Model Development

## Section 1: Developing Classification Models

In this section, we build classification models on the simulated patient data using several machine learning approaches. The NLP tasks, such as sentiment analysis, classification, and clinical text interpretation using specialized language models (SLMs), are covered in the next part.

## Assignment Details
- Author: Khor Kean Teng
- Date: May 23, 2025
- Model: Gemini 2.5 Flash Preview-0520

## Deliverables
- Construct and evaluate predictive models, including traditional models (Random Forest, XGBoost, Neural Networks) and advanced Transformer-based models. 
- Use SLMs for specialized NLP tasks like sentiment analysis, clinical text interpretation, and classification of questionnaire responses. 

## 2.1 Data Preparation

In [1]:
# import libraries
import pandas as pd

# read the CSV file
df = pd.read_csv('data/processed/patients_with_ratings.csv')

# preview the data
display(df.head())

Unnamed: 0,patient_id,age,gender,medical_history,deterioration_label,timestamp,hear_rate,blood_pressure_sys,blood_pressure_dia,oxygen_saturation,...,has_cancer,has_heart attack,has_heart failure,has_copd,has_asthma,has_alzheimer,has_dementia,fatigue_level,activity_level,mental_health_level
0,9b04b,65,Male,History of hypertension and type 2 diabetes.,True,2023-10-27T10:00:00Z,95.5,160.2,98.7,90.3,...,0,0,0,0,0,0,0,5,1,1
1,bffd5,45,Female,No significant medical history.,False,2023-10-27T10:05:00Z,70.2,120.5,75.0,98.5,...,0,0,0,0,0,0,0,2,4,4
2,fb35e,78,Male,"Chronic obstructive pulmonary disease (COPD), ...",True,2023-10-27T10:10:00Z,105.0,150.0,90.0,88.0,...,0,1,0,1,0,0,0,5,2,1
3,1e30e,30,Female,Mild asthma.,False,2023-10-27T10:15:00Z,65.0,110.0,70.0,99.0,...,0,0,0,0,1,0,0,1,5,4
4,116a4,55,Male,High cholesterol.,False,2023-10-27T10:20:00Z,75.5,135.0,85.0,97.0,...,0,0,0,0,0,0,0,3,3,3


In [2]:
# exclude column
exclude_columns = ['patient_id', 'timestamp', 'medical_history', 'describe_fatigue_level', 'describe_lifestyle', 'describe_mental_health', 'extracted_diseases']

# prepare the data
df = df.drop(columns=exclude_columns)

# preview the data
display(df.head())

Unnamed: 0,age,gender,deterioration_label,hear_rate,blood_pressure_sys,blood_pressure_dia,oxygen_saturation,temperature,respiratory_rate,has_stroke,...,has_cancer,has_heart attack,has_heart failure,has_copd,has_asthma,has_alzheimer,has_dementia,fatigue_level,activity_level,mental_health_level
0,65,Male,True,95.5,160.2,98.7,90.3,38.5,22.1,0,...,0,0,0,0,0,0,0,5,1,1
1,45,Female,False,70.2,120.5,75.0,98.5,36.8,16.0,0,...,0,0,0,0,0,0,0,2,4,4
2,78,Male,True,105.0,150.0,90.0,88.0,37.9,25.5,0,...,0,1,0,1,0,0,0,5,2,1
3,30,Female,False,65.0,110.0,70.0,99.0,36.5,14.0,0,...,0,0,0,0,1,0,0,1,5,4
4,55,Male,False,75.5,135.0,85.0,97.0,37.0,17.0,0,...,0,0,0,0,0,0,0,3,3,3


We have dropped the identifier, timestamp, and free-text columns, which are not used as model features. The remaining columns are the candidate features and the target, `deterioration_label`.

In [3]:
# Convert gender (categorical) and deterioration_label (boolean) to numerical
from sklearn.preprocessing import LabelEncoder

# Create a label encoder
le = LabelEncoder()

# Convert gender (object type) to numerical
df['gender'] = le.fit_transform(df['gender'])
print(f"Gender mapping: {dict(zip(le.classes_, range(len(le.classes_))))}")

# Convert deterioration_label (boolean) to integer (0/1)
df['deterioration_label'] = df['deterioration_label'].astype(int)

# Verify the conversions
print("\nUpdated data types:")
print(df.dtypes)

# Preview the transformed data
display(df.head())

Gender mapping: {'Female': 0, 'Male': 1, 'Other': 2}

Updated data types:
age                      int64
gender                   int32
deterioration_label      int32
hear_rate              float64
blood_pressure_sys     float64
blood_pressure_dia     float64
oxygen_saturation      float64
temperature            float64
respiratory_rate       float64
has_stroke               int64
has_diabetes             int64
has_hypertension         int64
has_cancer               int64
has_heart attack         int64
has_heart failure        int64
has_copd                 int64
has_asthma               int64
has_alzheimer            int64
has_dementia             int64
fatigue_level            int64
activity_level           int64
mental_health_level      int64
dtype: object


Unnamed: 0,age,gender,deterioration_label,hear_rate,blood_pressure_sys,blood_pressure_dia,oxygen_saturation,temperature,respiratory_rate,has_stroke,...,has_cancer,has_heart attack,has_heart failure,has_copd,has_asthma,has_alzheimer,has_dementia,fatigue_level,activity_level,mental_health_level
0,65,1,1,95.5,160.2,98.7,90.3,38.5,22.1,0,...,0,0,0,0,0,0,0,5,1,1
1,45,0,0,70.2,120.5,75.0,98.5,36.8,16.0,0,...,0,0,0,0,0,0,0,2,4,4
2,78,1,1,105.0,150.0,90.0,88.0,37.9,25.5,0,...,0,1,0,1,0,0,0,5,2,1
3,30,0,0,65.0,110.0,70.0,99.0,36.5,14.0,0,...,0,0,0,0,1,0,0,1,5,4
4,55,1,0,75.5,135.0,85.0,97.0,37.0,17.0,0,...,0,0,0,0,0,0,0,3,3,3
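
`LabelEncoder` assigns the integer codes shown above in alphabetical order, so `Other` receives the largest value even though gender has no natural ordering. Tree-based models are largely indifferent to this, but a model fed scaled inputs may treat the codes as magnitudes. A minimal sketch of a one-hot alternative follows; it assumes a toy frame (`toy`) in place of the real data and uses `pandas.get_dummies`, which is not what this notebook runs.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({"age": [65, 45, 30], "gender": ["Male", "Female", "Other"]})

# One indicator column per gender category instead of a single ordinal code.
toy_onehot = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(toy_onehot.columns.tolist())
```

Each row then has exactly one of the `gender_*` indicators set to 1, at the cost of two extra feature columns.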


In [4]:
# check the data types
df.dtypes

age                      int64
gender                   int32
deterioration_label      int32
hear_rate              float64
blood_pressure_sys     float64
blood_pressure_dia     float64
oxygen_saturation      float64
temperature            float64
respiratory_rate       float64
has_stroke               int64
has_diabetes             int64
has_hypertension         int64
has_cancer               int64
has_heart attack         int64
has_heart failure        int64
has_copd                 int64
has_asthma               int64
has_alzheimer            int64
has_dementia             int64
fatigue_level            int64
activity_level           int64
mental_health_level      int64
dtype: object

In [5]:
# check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

Series([], dtype: int64)

All columns are now numeric and no column contains missing values, so the data is ready for modeling.

## 2.2 Model Development - Classification

In the machine learning stage, we perform classification on the `deterioration_label` column. We use Random Forest and a neural network (`MLPClassifier`) from `sklearn`, and XGBoost from the `xgboost` library. We also explore two transformer-based models for tabular data, FT Transformer and TabTransformer, both of which apply attention mechanisms to tabular features.

### 2.2.1 Training and Testing Data

We will split the data into training and testing sets. The training set will be used to train the models, while the testing set will be used to evaluate their performance. 

In [6]:
# train test split
from sklearn.model_selection import train_test_split

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['deterioration_label']), df['deterioration_label'], test_size=0.3, random_state=42)

# check the shape of the data
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

# save the y_test
y_test.to_csv('data/processed/y_test_ml.csv', index=False)

# save X_train
X_train.to_csv('data/processed/X_train_ml.csv', index=False)

X_train shape: (842, 21)
X_test shape: (361, 21)
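
The split above does not stratify on the target, so the proportion of deteriorating patients can differ between the training and testing sets. A sketch of a stratified variant is shown below. It uses synthetic stand-ins (`X_demo`, `y_demo`) because the real class balance is not shown here, and the 20% positive rate is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 20% positive class (illustrative values only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

For the notebook's data, the equivalent change would be passing `stratify=df['deterioration_label']` to `train_test_split`.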


### 2.2.2 Traditional Models

We start with the traditional machine learning models: Random Forest, XGBoost, and a neural network.

In [7]:
# Import necessary libraries for modeling
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import pickle
import os

# Create directory for model storage if it doesn't exist
os.makedirs('models', exist_ok=True)

# Dictionary to store model results for later evaluation
model_results = {}


In [8]:
# ---- Random Forest ----
print("Training Random Forest model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_train_preds = rf_model.predict(X_train)
rf_test_preds = rf_model.predict(X_test)
rf_test_proba = rf_model.predict_proba(X_test)[:, 1]

# Store results
model_results['random_forest'] = {
    'train_preds': rf_train_preds,
    'test_preds': rf_test_preds,
    'test_proba': rf_test_proba
}

# Save the model
with open('models/random_forest_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

print("Random Forest model trained and saved!")


Training Random Forest model...
Random Forest model trained and saved!


In [9]:
# ---- XGBoost ----
print("Training XGBoost model...")
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions
xgb_train_preds = xgb_model.predict(X_train)
xgb_test_preds = xgb_model.predict(X_test)
xgb_test_proba = xgb_model.predict_proba(X_test)[:, 1]

# Store results
model_results['xgboost'] = {
    'train_preds': xgb_train_preds,
    'test_preds': xgb_test_preds,
    'test_proba': xgb_test_proba
}

# Save the model
with open('models/xgboost_model.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

print("XGBoost model trained and saved!")


Training XGBoost model...
XGBoost model trained and saved!


In [10]:
# ---- Neural Network ----
print("Training Neural Network model...")

# Scale features for neural network
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define and train the neural network
nn_model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # Two hidden layers
    activation='relu',
    solver='adam',
    alpha=0.0001,
    batch_size='auto',
    learning_rate='adaptive',
    max_iter=200,
    random_state=42
)
nn_model.fit(X_train_scaled, y_train)

# Make predictions
nn_train_preds = nn_model.predict(X_train_scaled)
nn_test_preds = nn_model.predict(X_test_scaled)
nn_test_proba = nn_model.predict_proba(X_test_scaled)[:, 1]

# Store results
model_results['neural_network'] = {
    'train_preds': nn_train_preds,
    'test_preds': nn_test_preds,
    'test_proba': nn_test_proba,
    'scaler': scaler  # Save the scaler for future predictions
}

# Save the model and scaler
with open('models/neural_network_model.pkl', 'wb') as f:
    pickle.dump({
        'model': nn_model,
        'scaler': scaler
    }, f)

print("Neural Network model trained and saved!")

Training Neural Network model...
Neural Network model trained and saved!


In [11]:
# Save all model results for later evaluation
with open('models/model_results.pkl', 'wb') as f:
    pickle.dump(model_results, f)

print("All models trained and results saved for evaluation!")

All models trained and results saved for evaluation!
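
Each entry in `model_results` stores hard predictions (`test_preds`) and positive-class probabilities (`test_proba`), which is enough to compute the usual classification metrics later. A minimal sketch of how one such entry could be scored is given below. The helper `summarize`, the labels, and the predictions are hypothetical and for illustration only; they are not results from the models above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

def summarize(y_true, result):
    """Compute test metrics from one entry shaped like those in model_results."""
    preds, proba = result["test_preds"], result["test_proba"]
    return {
        "accuracy": accuracy_score(y_true, preds),
        "precision": precision_score(y_true, preds, zero_division=0),
        "recall": recall_score(y_true, preds, zero_division=0),
        "f1": f1_score(y_true, preds, zero_division=0),
        # ROC AUC needs scores or probabilities, not hard labels.
        "roc_auc": roc_auc_score(y_true, proba),
    }

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 1, 0])
demo_result = {
    "test_preds": np.array([0, 1, 1, 1, 0, 0]),
    "test_proba": np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2]),
}

metrics_demo = summarize(y_true_demo, demo_result)
print(metrics_demo)
```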


### 2.2.3 Transformer-based Models

#### 2.2.3.1 FT Transformer

In this section, we explore the FT Transformer model for tabular data. FT Transformer (Feature Tokenizer + Transformer) converts every feature into an embedding (a token) and applies attention across these tokens to learn the relationships between features.
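
To make the idea of attention over features more concrete, the NumPy sketch below shows how numerical features might be turned into a token sequence before the transformer layers. It is a simplified illustration under assumed sizes and random weights, not the `tab_transformer_pytorch` implementation: each scalar feature gets its own weight and bias vector, and a classification token is prepended.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, dim = 4, 3, 8   # illustrative sizes, not the notebook's

x = rng.normal(size=(n_rows, n_features))   # scaled numerical features
W = rng.normal(size=(n_features, dim))      # one weight vector per feature
b = rng.normal(size=(n_features, dim))      # one bias vector per feature

# Token for feature j of row i: x[i, j] * W[j] + b[j]
tokens = x[:, :, None] * W[None, :, :] + b[None, :, :]

# Prepend a [CLS] token to every row (learned in a real model, random here).
cls = np.broadcast_to(rng.normal(size=(1, 1, dim)), (n_rows, 1, dim))
sequence = np.concatenate([cls, tokens], axis=1)

print(sequence.shape)  # (4, 4, 8): rows, 1 + n_features tokens, embedding dim
```

The transformer layers then attend over the token axis, so every feature token can be weighted against every other one.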

In [12]:
# ---- FTTransformer Implementation ----
import torch
import numpy as np
from tab_transformer_pytorch import FTTransformer
from sklearn.preprocessing import StandardScaler

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)

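# Identify categorical and numerical columns
# Note: `gender` was label-encoded to int32 earlier, so it matches neither the
# object/category filter nor the int64/float64 filter and is left out of both
# lists (compare the printed column lists in the output below).
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()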

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Prepare categorical data (if any)
categorical_dims = {}
for col in categorical_cols:
    X_train[col] = X_train[col].astype('category')
    categorical_dims[col] = len(X_train[col].cat.categories)

# Get categories for FTTransformer
if len(categorical_cols) > 0:
    categories = tuple([categorical_dims[col] for col in categorical_cols])
else:
    categories = tuple()  # Empty if no categorical columns

# Scale numerical features
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[numerical_cols].values)
X_test_num = scaler.transform(X_test[numerical_cols].values)

# Convert data to PyTorch tensors
if len(categorical_cols) > 0:
    X_train_cat = torch.tensor(X_train[categorical_cols].values, dtype=torch.long)
    X_test_cat = torch.tensor(X_test[categorical_cols].values, dtype=torch.long)
else:
    X_train_cat = torch.zeros((X_train.shape[0], 0), dtype=torch.long)
    X_test_cat = torch.zeros((X_test.shape[0], 0), dtype=torch.long)
    
X_train_num = torch.tensor(X_train_num, dtype=torch.float)
X_test_num = torch.tensor(X_test_num, dtype=torch.float)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float).reshape(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float).reshape(-1, 1)

# Define FTTransformer model
model = FTTransformer(
    categories=categories,            # categories tuple from above
    num_continuous=len(numerical_cols),  # number of numerical columns
    dim=32,                           # embedding dimension
    dim_out=1,                        # binary classification (0 or 1)
    depth=6,                          # transformer layers
    heads=8,                          # attention heads
    attn_dropout=0.1,                 # dropout rate
    ff_dropout=0.1                    # feed forward dropout
)

# Define loss function and optimizer
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
batch_size = 64
epochs = 30
n_samples = X_train_num.shape[0]
n_batches = n_samples // batch_size + (1 if n_samples % batch_size != 0 else 0)

print("Training FT Transformer model...")
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    
    # Create batches
    indices = torch.randperm(n_samples)
    
    for i in range(n_batches):
        # Get batch indices
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        
        # Get batch data
        if len(categorical_cols) > 0:
            X_cat_batch = X_train_cat[batch_indices]
        else:
            X_cat_batch = X_train_cat  # Empty tensor
            
        X_num_batch = X_train_num[batch_indices]
        y_batch = y_train_tensor[batch_indices]
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(X_cat_batch, X_num_batch)
        loss = loss_fn(outputs, y_batch)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    # Print epoch statistics
    if (epoch + 1) % 2 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/n_batches:.4f}")

print("FT Transformer model training complete!")

Categorical columns: []
Numerical columns: ['age', 'hear_rate', 'blood_pressure_sys', 'blood_pressure_dia', 'oxygen_saturation', 'temperature', 'respiratory_rate', 'has_stroke', 'has_diabetes', 'has_hypertension', 'has_cancer', 'has_heart attack', 'has_heart failure', 'has_copd', 'has_asthma', 'has_alzheimer', 'has_dementia', 'fatigue_level', 'activity_level', 'mental_health_level']
Training FT Transformer model...
Epoch 2/30, Loss: 0.1754
Epoch 4/30, Loss: 0.0983
Epoch 6/30, Loss: 0.0666
Epoch 8/30, Loss: 0.0559
Epoch 10/30, Loss: 0.0442
Epoch 12/30, Loss: 0.0454
Epoch 14/30, Loss: 0.0821
Epoch 16/30, Loss: 0.0289
Epoch 18/30, Loss: 0.0231
Epoch 20/30, Loss: 0.0548
Epoch 22/30, Loss: 0.0257
Epoch 24/30, Loss: 0.0186
Epoch 26/30, Loss: 0.0141
Epoch 28/30, Loss: 0.0128
Epoch 30/30, Loss: 0.0121
FT Transformer model training complete!
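
Note that `gender` does not appear in the numerical column list printed above. The earlier dtype listing shows it as `int32`, which `select_dtypes(include=['int64', 'float64'])` does not match, so the transformer models are trained on 20 of the 21 features. The sketch below reproduces the effect on a toy frame (`toy`, not the real data); `include='number'` is one possible alternative, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the dtypes reported earlier (values are illustrative).
toy = pd.DataFrame({
    "age": np.array([65, 45], dtype="int64"),
    "gender": np.array([1, 0], dtype="int32"),
    "hear_rate": np.array([95.5, 70.2], dtype="float64"),
})

# The filter used in the notebook: int32 columns are silently skipped.
strict = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# A broader filter that matches every numeric dtype.
broad = toy.select_dtypes(include="number").columns.tolist()

print(strict)
print(broad)
```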


In [13]:
# Evaluation
model.eval()
with torch.no_grad():
    y_pred_proba = torch.sigmoid(model(X_test_cat, X_test_num)).numpy()
    y_pred = (y_pred_proba >= 0.5).astype(int)

# Store results
model_results['ft_transformer'] = {
    'test_preds': y_pred.flatten(),
    'test_proba': y_pred_proba.flatten()
}

# Save the model and additional info
model_save = {
    'model': model.state_dict(),
    'num_scaler': scaler,
    'categorical_cols': categorical_cols,
    'numerical_cols': numerical_cols
}

with open('models/ft_transformer_model.pkl', 'wb') as f:
    pickle.dump(model_save, f)

# Update all model results
with open('models/model_results.pkl', 'wb') as f:
    pickle.dump(model_results, f)

print("FT Transformer model evaluated and saved!")

FT Transformer model evaluated and saved!
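
The model is trained with `BCEWithLogitsLoss`, so it outputs raw logits; this is why the evaluation cell applies `torch.sigmoid` before thresholding at 0.5. As a rough illustration of that relationship, the NumPy sketch below computes the loss directly from logits in a numerically stable form and checks it against the plain sigmoid-then-cross-entropy formula. The logits and labels are toy values, not model outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_with_logits(z, y):
    """Mean binary cross-entropy computed from logits (stable form)."""
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive(z, y):
    """Same loss via explicit probabilities; can overflow for large |z|."""
    p = sigmoid(z)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

logits = np.array([2.0, -1.0, 0.5, -3.0])   # toy values
labels = np.array([1.0, 0.0, 0.0, 1.0])

stable = bce_with_logits(logits, labels)
naive = bce_naive(logits, labels)

# Thresholding the probability at 0.5 is equivalent to thresholding the logit at 0.
preds = (sigmoid(logits) >= 0.5).astype(int)

print(stable, naive, preds)
```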


#### 2.2.3.2 Tab Transformer

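In this section, we explore the TabTransformer model for tabular data. TabTransformer is a transformer-based architecture that applies self-attention to the embeddings of categorical features, while continuous features are normalized and passed to the final MLP together with the attention output. It can be used for both classification and regression tasks. Since no categorical columns are detected in this dataset (see the output below), the attention layers have nothing to attend over here, and the model in effect behaves like an MLP on the normalized continuous features.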

In [14]:
# ---- TabTransformer Implementation ----
from tab_transformer_pytorch import TabTransformer
import torch.nn as nn

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Identify categorical and numerical columns
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Prepare categorical data (if any)
categorical_dims = {}
for col in categorical_cols:
    X_train[col] = X_train[col].astype('category')
    categorical_dims[col] = len(X_train[col].cat.categories)

# Get categories for TabTransformer
if len(categorical_cols) > 0:
    categories = tuple([categorical_dims[col] for col in categorical_cols])
else:
    categories = tuple()  # Empty if no categorical columns

# Scale numerical features
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[numerical_cols].values)
X_test_num = scaler.transform(X_test[numerical_cols].values)

# Convert data to PyTorch tensors
if len(categorical_cols) > 0:
    X_train_cat = torch.tensor(X_train[categorical_cols].values, dtype=torch.long)
    X_test_cat = torch.tensor(X_test[categorical_cols].values, dtype=torch.long)
else:
    X_train_cat = torch.zeros((X_train.shape[0], 0), dtype=torch.long)
    X_test_cat = torch.zeros((X_test.shape[0], 0), dtype=torch.long)
    
X_train_num = torch.tensor(X_train_num, dtype=torch.float)
X_test_num = torch.tensor(X_test_num, dtype=torch.float)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float).reshape(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float).reshape(-1, 1)

# Define TabTransformer model
model = TabTransformer(
    categories=categories,            # categories tuple from above
    num_continuous=len(numerical_cols),  # number of numerical columns
    dim=32,                           # embedding dimension
    dim_out=1,                        # binary classification (0 or 1)
    depth=6,                          # transformer layers
    heads=8,                          # attention heads
    attn_dropout=0.1,                 # dropout rate
    ff_dropout=0.1,                    # feed forward dropout
    mlp_hidden_mults = (4, 2),          # relative multiples of each hidden dimension of the last mlp to logits
    mlp_act = nn.ReLU(),
)

# Define loss function and optimizer
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
batch_size = 64
epochs = 30
n_samples = X_train_num.shape[0]
n_batches = n_samples // batch_size + (1 if n_samples % batch_size != 0 else 0)

print("Training Tab Transformer model...")
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    
    # Create batches
    indices = torch.randperm(n_samples)
    
    for i in range(n_batches):
        # Get batch indices
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        
        # Get batch data
        if len(categorical_cols) > 0:
            X_cat_batch = X_train_cat[batch_indices]
        else:
            X_cat_batch = X_train_cat  # Empty tensor
            
        X_num_batch = X_train_num[batch_indices]
        y_batch = y_train_tensor[batch_indices]
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(X_cat_batch, X_num_batch)
        loss = loss_fn(outputs, y_batch)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    # Print epoch statistics
    if (epoch + 1) % 2 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/n_batches:.4f}")

print("Tab Transformer model training complete!")

Categorical columns: []
Numerical columns: ['age', 'hear_rate', 'blood_pressure_sys', 'blood_pressure_dia', 'oxygen_saturation', 'temperature', 'respiratory_rate', 'has_stroke', 'has_diabetes', 'has_hypertension', 'has_cancer', 'has_heart attack', 'has_heart failure', 'has_copd', 'has_asthma', 'has_alzheimer', 'has_dementia', 'fatigue_level', 'activity_level', 'mental_health_level']
Training Tab Transformer model...
Epoch 2/30, Loss: 0.4086
Epoch 4/30, Loss: 0.0933
Epoch 6/30, Loss: 0.0399
Epoch 8/30, Loss: 0.0265
Epoch 10/30, Loss: 0.0211
Epoch 12/30, Loss: 0.0265
Epoch 14/30, Loss: 0.0163
Epoch 16/30, Loss: 0.0140
Epoch 18/30, Loss: 0.0136
Epoch 20/30, Loss: 0.0111
Epoch 22/30, Loss: 0.0108
Epoch 24/30, Loss: 0.0095
Epoch 26/30, Loss: 0.0083
Epoch 28/30, Loss: 0.0079
Epoch 30/30, Loss: 0.0069
Tab Transformer model training complete!


In [15]:
# Evaluation
model.eval()
with torch.no_grad():
    y_pred_proba = torch.sigmoid(model(X_test_cat, X_test_num)).numpy()
    y_pred = (y_pred_proba >= 0.5).astype(int)

# Store results
model_results['tab_transformer'] = {
    'test_preds': y_pred.flatten(),
    'test_proba': y_pred_proba.flatten()
}

# Save the model and additional info
model_save = {
    'model': model.state_dict(),
    'num_scaler': scaler,
    'categorical_cols': categorical_cols,
    'numerical_cols': numerical_cols
}

with open('models/tab_transformer_model.pkl', 'wb') as f:
    pickle.dump(model_save, f)

# Update all model results
with open('models/model_results.pkl', 'wb') as f:
    pickle.dump(model_results, f)

print("Tab Transformer model evaluated and saved!")

Tab Transformer model evaluated and saved!


All the models are now trained, and their predictions are saved to `models/model_results.pkl` for evaluation in the next section. The next part covers Natural Language Processing (NLP) tasks such as sentiment analysis, classification, and clinical text interpretation using specialized language models (SLMs). We will also explore the use of transformer-based models for these tasks.