<a href="https://www.kaggle.com/code/muyiwaobadara/cardiovascular-disease-risk-prediction-training?scriptVersionId=254565407" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/model-training.ipynb)
[![Kaggle Notebook](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/new?source=https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/model-training.ipynb)
[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)

# **Model Development - Cardiovascular Disease Risk Prediction**

## **Introduction**
This notebook marks the beginning of the model development phase for our Cardiovascular Disease Risk Prediction project. Having thoroughly explored the dataset in the Exploratory Data Analysis (EDA) and enriched it with new features during Feature Engineering, we are now ready to train and compare various machine learning models.

* **EDA Notebook:** [![Open In GitHub](https://img.shields.io/badge/View%20EDA%20Notebook-blue?logo=github)](https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/exploratory-data-analysis.ipynb)
* **Feature Engineering Notebook:** [![Open In GitHub](https://img.shields.io/badge/View%20FE%20Notebook-blue?logo=github)](https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/feature-engineering.ipynb)

In this notebook, we will focus on building and evaluating several classification models to predict cardiovascular disease (`cardio`), including:

* **Logistic Regression**
* **Decision Tree / Random Forest**
* **Gradient Boosting Machines (e.g., LightGBM, XGBoost)**
* And potentially others like **Support Vector Machines (SVM)** or **K-Nearest Neighbors (KNN)**.

We will also implement essential preprocessing steps such as One-Hot Encoding and Standard Scaling, and carefully evaluate each model's performance using relevant metrics. Let's get started!

## **Notebook Setup**

Before diving into model development, we need to ensure all necessary libraries are imported and initial settings are configured. The following code cell will import the required Python libraries for data manipulation, numerical operations, machine learning model building, and visualization, and set up basic display options for pandas.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
                            roc_auc_score, confusion_matrix, classification_report, roc_curve,\
                            auc
from sklearn.model_selection import GridSearchCV

import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
print('Setup Completed!')

Setup Completed!


## **Data Loading**

The initial step in this model development phase is to load the dataset that has undergone the complete feature engineering process. This dataset, enriched with new features like BMI, age in years, and blood pressure categories, was the output of our previous feature engineering notebook and has been saved to the GitHub repository.

The following cell will load this prepared dataset directly from its raw URL on GitHub into a pandas DataFrame.


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/mobadara/cardiovascular-disease-risk-prediction/main/datasets/engineered.csv')
df.head()

| | age | gender | height | cholesterol | gluc | smoke | alco | active | cardio | bmi | age_group | blood_pressure_category | pulse_pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.35729 | Male | 168 | Normal | Normal | No | No | Active | 0 | 21.96712 | Middle-Aged | Hypertension Stage 1 | 30 |
| 1 | 55.381246 | Female | 156 | Well Above Normal | Normal | No | No | Active | 1 | 34.927679 | Senior | Hypertension Stage 2 | 50 |
| 2 | 51.627652 | Female | 165 | Well Above Normal | Normal | No | No | Inactive | 1 | 23.507805 | Middle-Aged | Hypertension Stage 1 | 60 |
| 3 | 48.249144 | Male | 169 | Normal | Normal | No | No | Active | 1 | 28.710479 | Middle-Aged | Hypertension Stage 2 | 50 |
| 4 | 59.997262 | Female | 151 | Above Normal | Above Normal | No | No | Inactive | 0 | 29.384676 | Senior | Hypertension Stage 1 | 40 |


Let's make sure that the columns are in the right format.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57586 entries, 0 to 57585
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      57586 non-null  float64
 1   gender                   57586 non-null  object 
 2   height                   57586 non-null  int64  
 3   cholesterol              57586 non-null  object 
 4   gluc                     57586 non-null  object 
 5   smoke                    57586 non-null  object 
 6   alco                     57586 non-null  object 
 7   active                   57586 non-null  object 
 8   cardio                   57586 non-null  int64  
 9   bmi                      57586 non-null  float64
 10  age_group                57586 non-null  object 
 11  blood_pressure_category  57586 non-null  object 
 12  pulse_pressure           57586 non-null  int64  
dtypes: float64(2), int64(3), object(8)
memory usage: 5.7+ MB


Now we need to set the data type of each column to the appropriate format. This makes it easy to apply the right transformation to each column.

In [4]:
df['gender'] = df['gender'].astype('category')
df['cholesterol'] = df['cholesterol'].astype('category')
df['gluc'] = df['gluc'].astype('category')
df['smoke'] = df['smoke'].astype('category')
df['alco'] = df['alco'].astype('category')
df['active'] = df['active'].astype('category')
df['age_group'] = df['age_group'].astype('category')
df['blood_pressure_category'] = df['blood_pressure_category'].astype('category')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57586 entries, 0 to 57585
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   age                      57586 non-null  float64 
 1   gender                   57586 non-null  category
 2   height                   57586 non-null  int64   
 3   cholesterol              57586 non-null  category
 4   gluc                     57586 non-null  category
 5   smoke                    57586 non-null  category
 6   alco                     57586 non-null  category
 7   active                   57586 non-null  category
 8   cardio                   57586 non-null  int64   
 9   bmi                      57586 non-null  float64 
 10  age_group                57586 non-null  category
 11  blood_pressure_category  57586 non-null  category
 12  pulse_pressure           57586 non-null  int64   
dtypes: category(8), float64(2), int64(3)
memory usage: 2.6 MB
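The per-column conversion above can also be written as a single assignment over a list of columns. The following is a minimal sketch on a made-up frame; `toy`, `categorical_cols`, and the values in it are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Toy frame standing in for the engineered dataset (values are illustrative).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"] * 250,
    "smoke": ["No", "No", "Yes", "No"] * 250,
    "bmi": [21.9, 34.9, 23.5, 28.7] * 250,
})

before = toy.memory_usage(deep=True).sum()

# Convert all string-valued columns to the category dtype in one assignment.
categorical_cols = ["gender", "smoke"]
toy[categorical_cols] = toy[categorical_cols].astype("category")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"memory usage: {before} -> {after} bytes")
```

Storing repeated strings as category codes is what accounts for the drop in memory usage reported by the second `df.info()` call.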


The data is now loaded and converted to the appropriate types. Next, we define a `preprocessing` pipeline that applies **one-hot encoding** to the categorical features and **standard scaling** to the numerical columns. **Feature selection** is added as a separate step in the Model Development section.

The training dataset does not contain missing values, but the pipeline imputes them anyway in case a missing data point appears in future predictions.

We do not apply a log transformation, since most of the numerical features are approximately normal and the **outliers** were removed during the **feature engineering** stage of the project.

The following code cell defines the preprocessing step.

In [6]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.drop('cardio')
categorical_features = df.select_dtypes(include=['category']).columns
num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessing = ColumnTransformer(transformers=[
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features),
    ],
    remainder='passthrough'
)
preprocessing
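To illustrate why the imputers and `handle_unknown='ignore'` are included even though the training data is complete, here is a small sketch on invented data. The frames `train` and `future`, the name `toy_preprocessing`, and the category "Occasionally" are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny complete training frame (illustrative values).
train = pd.DataFrame({
    "bmi": [22.0, 30.0, 26.0, 28.0],
    "smoke": ["No", "Yes", "No", "No"],
})
# A future row with a missing number and a category never seen in training.
future = pd.DataFrame({"bmi": [np.nan], "smoke": ["Occasionally"]})

toy_preprocessing = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["bmi"]),
    ("cat", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoke"]),
])

toy_preprocessing.fit(train)
row = toy_preprocessing.transform(future)
# The transformer may return a sparse matrix; convert it for printing.
row = row.toarray() if hasattr(row, "toarray") else np.asarray(row)
print(row)
```

The missing `bmi` is replaced by the training median before scaling, and the unseen category is encoded as all zeros instead of raising an error.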

## **Train Set, Test Set**
Now we split the data into a `train set` and a `test set`, so that the final model can be evaluated on data it has never seen. We set aside 20% of the instances for testing and stratify on the target so that both sets keep the same class balance.

In [7]:
target = 'cardio'
X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,
                                                    stratify=y)
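The effect of `stratify` can be checked on a synthetic label vector. This is a sketch with made-up data (`X_toy`, `y_toy`, and the 476/524 class counts are assumptions chosen to resemble a mildly imbalanced target).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_toy = np.array([1] * 476 + [0] * 524)   # roughly 47.6% positives
X_toy = rng.normal(size=(1000, 3))        # features are irrelevant here

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both parts keep nearly the same positive rate.
print(f"train positive rate: {y_tr.mean():.4f}")
print(f"test positive rate:  {y_te.mean():.4f}")
```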

## **Model Development**
Now that the data is split, it is time for the actual training. We will train classifiers using the algorithms outlined in the introduction.

The training process also involves feature selection, which is performed inside the pipeline as part of training.

In [8]:
selector = Pipeline(steps=[
    ('preprocess', preprocessing),
    ('feature_selection', SelectKBest(score_func=f_classif, k=10)) # k will be tuned
])
selector
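As a rough illustration of what `SelectKBest` with `f_classif` does, the sketch below builds random features and shifts one of them by the label, so that only that column carries signal. The data and the names `X_toy`, `y_toy`, and `toy_selector` are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)
X_toy = rng.normal(size=(500, 4))
X_toy[:, 2] += 3.0 * y_toy  # make column 2 strongly related to the label

# Keep the single feature with the highest ANOVA F-score.
toy_selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)

print("F-scores:", toy_selector.scores_.round(1))
print("kept:    ", toy_selector.get_support())
```

Placing the selector inside a `Pipeline`, as done above, means the scores are recomputed on each training fold during cross-validation, so the held-out fold does not influence which features are kept.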

The training process involves the following steps:
1. Initial model selection
2. Cross Validation
3. Hyperparameter Tuning
4. Final Model Tuning

### **Initial Model Selection**
We start from the candidate classifiers named in the introduction:

* **Logistic Regression**
* **Decision Tree / Random Forest**
* **Gradient Boosting Machines (e.g., LightGBM, XGBoost)**
* And potentially others like **Support Vector Machines (SVM)** or **K-Nearest Neighbors (KNN)**.

### **Cross Validation**
To estimate the performance of the selected models, we apply stratified **k-fold** cross-validation (10 folds) to each model using the training data (`X_train`, `y_train`).
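The stratified splitting itself can be sketched on a toy label vector before running the full comparison. The 40/60 class counts and the names `X_toy`, `y_toy`, and `toy_kfold` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.zeros((100, 1))  # features do not affect how the folds are built

toy_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Positive rate inside each validation fold.
fold_rates = [y_toy[val_idx].mean() for _, val_idx in toy_kfold.split(X_toy, y_toy)]
print(fold_rates)
```

Each validation fold keeps the overall class ratio, so per-fold accuracy is not distorted by folds that happen to contain mostly one class.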

In [11]:
mean_scores = []

models = [
    ['Logistic Regression', LogisticRegression()],
    ['Decision Tree', DecisionTreeClassifier()],
    ['Random Forest', RandomForestClassifier()],
    ['Gradient Boost', GradientBoostingClassifier()],
    ['XGBoost', XGBClassifier()],
    ['Light GBM', LGBMClassifier()],
    ['Support Vector Classifier', SVC()],
    ['k-Nearest Neighbor', KNeighborsClassifier()]
]

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for model in tqdm(models, desc='Performing cross validation'):
    pipeline = Pipeline(steps=[
        ('selector', selector),
        ('model', model[1])
    ])
    scores = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='accuracy')
    mean_scores.append(scores.mean())
model_names = [model[0] for model in models]
mean_cv_results = pd.DataFrame({'Model': model_names, 'Accuracy': mean_scores}).sort_values(by='Accuracy', ascending=False)
mean_cv_results


Performing cross validation:  62%|██████▎   | 5/8 [02:03<01:15, 25.17s/it]

[LightGBM] [Info] Number of positive: 19737, number of negative: 21724
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002424 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 564
[LightGBM] [Info] Number of data points in the train set: 41461, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.476038 -> initscore=-0.095923
[LightGBM] [Info] Start training from score -0.095923

Performing cross validation: 100%|██████████| 8/8 [18:17<00:00, 137.13s/it]


| | Model | Accuracy |
|---|---|---|
| 3 | Gradient Boost | 0.712729 |
| 5 | Light GBM | 0.712512 |
| 6 | Support Vector Classifier | 0.712013 |
| 0 | Logistic Regression | 0.705935 |
| 4 | XGBoost | 0.705566 |
| 2 | Random Forest | 0.670183 |
| 7 | k-Nearest Neighbor | 0.669858 |
| 1 | Decision Tree | 0.616176 |


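The cross-validation estimates put the gradient-boosted ensembles (Gradient Boosting, LightGBM) at the top, with the Support Vector Classifier, Logistic Regression and XGBoost within about one percentage point. We will take the six most accurate models, perform hyperparameter tuning on them, and then compare the results.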

## **Hyperparameter Tuning**
Now that we have a performance estimate for each of the selected models, we will perform hyperparameter tuning on the top six models from the DataFrame above.

We will use the same candidate values for the number of selected features `k` in every tuning run, unless noted otherwise.

In [10]:
k_options = [5, 7, 10, 15, 'all'] # 'all' means keep all features after preprocessing
all_best_results = {}

### **`XGBoost` Tuning**

In [13]:
print("\n--- Tuning XGBoost ---")
xgb_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42))
])

xgb_param_grid = {
    'selector__feature_selection__k': k_options,
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__max_depth': [3, 5, 7],
    'classifier__subsample': [0.7, 0.9],
    'classifier__colsample_bytree': [0.7, 0.9]
}

print("XGBoost Parameter Grid Size:", np.prod([len(v) for v in xgb_param_grid.values()]))

xgb_grid_search = GridSearchCV(
    xgb_pipeline,
    xgb_param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

print("Starting XGBoost tuning...")

with tqdm(total=1, desc="XGBoost Tuning Progress") as pbar:
    xgb_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("XGBoost Tuning Complete.")
all_best_results['XGBoost'] = {
    "score": xgb_grid_search.best_score_,
    "params": xgb_grid_search.best_params_
}

print(f"Best XGBoost Score: {all_best_results['XGBoost']['score']:.4f}")
print("Best XGBoost Params:", all_best_results['XGBoost']['params'])


--- Tuning XGBoost ---
XGBoost Parameter Grid Size: 540
Starting XGBoost tuning...


XGBoost Tuning Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


XGBoost Tuning Progress: 100%|██████████| 1/1 [30:42<00:00, 1842.50s/it]

XGBoost Tuning Complete.
Best XGBoost Score: 0.7806
Best XGBoost Params: {'classifier__colsample_bytree': 0.7, 'classifier__learning_rate': 0.05, 'classifier__max_depth': 5, 'classifier__n_estimators': 100, 'classifier__subsample': 0.7, 'selector__feature_selection__k': 'all'}
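The candidate and fit counts printed in the XGBoost output follow directly from the grid: the number of candidates is the product of the option counts, and each candidate is fitted once per fold. A short sketch of that arithmetic, using only the standard library (the names `n_candidates`, `cv_folds`, and `n_fits` are introduced here for illustration):

```python
from math import prod

k_options = [5, 7, 10, 15, "all"]
xgb_param_grid = {
    "selector__feature_selection__k": k_options,
    "classifier__n_estimators": [100, 200, 300],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__max_depth": [3, 5, 7],
    "classifier__subsample": [0.7, 0.9],
    "classifier__colsample_bytree": [0.7, 0.9],
}

# Every combination of options is one candidate.
n_candidates = prod(len(values) for values in xgb_param_grid.values())

# Each candidate is trained once per cross-validation fold.
cv_folds = 5
n_fits = n_candidates * cv_folds

print(n_candidates)  # 540
print(n_fits)        # 2700
```

Because the count is multiplicative, adding a single two-valued hyperparameter doubles the number of fits.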





### **`LightGBM` Tuning**

In [14]:
print("\n--- Tuning LightGBM ---")
lgbm_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', LGBMClassifier(random_state=42))
])

lgbm_param_grid = {
    'selector__feature_selection__k': k_options,
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__max_depth': [5, 10, 15],
    'classifier__num_leaves': [31, 63], # Specific to LightGBM
    'classifier__subsample': [0.7, 0.9], # Has an effect in LightGBM only when subsample_freq > 0
    'classifier__colsample_bytree': [0.7, 0.9]
}

print("LightGBM Parameter Grid Size:", np.prod([len(v) for v in lgbm_param_grid.values()]))

lgbm_grid_search = GridSearchCV(
    lgbm_pipeline,
    lgbm_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Starting LightGBM tuning...")
with tqdm(total=1, desc="LightGBM Tuning Progress") as pbar:
    lgbm_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("LightGBM Tuning Complete.")
all_best_results['LightGBM'] = {
    "score": lgbm_grid_search.best_score_,
    "params": lgbm_grid_search.best_params_
}
print(f"Best LightGBM Score: {all_best_results['LightGBM']['score']:.4f}")
print("Best LightGBM Params:", all_best_results['LightGBM']['params'])


--- Tuning LightGBM ---
LightGBM Parameter Grid Size: 1080
Starting LightGBM tuning...


LightGBM Tuning Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Fitting 5 folds for each of 1080 candidates, totalling 5400 fits
[LightGBM] [Info] Number of positive: 21930, number of negative: 24138
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008703 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 635
[LightGBM] [Info] Number of data points in the train set: 46068, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.476035 -> initscore=-0.095932
[LightGBM] [Info] Start training from score -0.095932


LightGBM Tuning Progress: 100%|██████████| 1/1 [1:12:19<00:00, 4339.37s/it]

LightGBM Tuning Complete.
Best LightGBM Score: 0.7169
Best LightGBM Params: {'classifier__colsample_bytree': 0.7, 'classifier__learning_rate': 0.01, 'classifier__max_depth': 15, 'classifier__n_estimators': 300, 'classifier__num_leaves': 63, 'classifier__subsample': 0.7, 'selector__feature_selection__k': 'all'}





### **Gradient Boosting**

In [None]:
print("\n--- Tuning Gradient Boosting Classifier ---")
gb_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

gb_param_grid = {
    'selector__feature_selection__k': k_options,
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__max_depth': [3, 5, 7],
    'classifier__subsample': [0.7, 0.9],
    'classifier__max_features': [0.7, 0.9, 'sqrt'] # Specific to GBC
}

print("Gradient Boosting Parameter Grid Size:", np.prod([len(v) for v in gb_param_grid.values()]))

gb_grid_search = GridSearchCV(
    gb_pipeline,
    gb_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Starting Gradient Boosting Classifier tuning...")
with tqdm(total=1, desc="GB Classifier Tuning Progress") as pbar:
    gb_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("Gradient Boosting Classifier Tuning Complete.")
all_best_results['Gradient Boosting'] = {
    "score": gb_grid_search.best_score_,
    "params": gb_grid_search.best_params_
}
print(f"Best GB Classifier Score: {all_best_results['Gradient Boosting']['score']:.4f}")
print("Best GB Classifier Params:", all_best_results['Gradient Boosting']['params'])


--- Tuning Gradient Boosting Classifier ---
Gradient Boosting Parameter Grid Size: 810
Starting Gradient Boosting Classifier tuning...


GB Classifier Tuning Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Fitting 5 folds for each of 810 candidates, totalling 4050 fits


### **Support Vector Classifier**

SVC training time grows quickly with the number of samples, so on larger datasets `RandomizedSearchCV` may be more practical than an exhaustive grid.

In [None]:
print("\n--- Tuning Support Vector Classifier (SVC) ---")
svc_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', SVC(random_state=42, probability=True)) # probability=True for ROC AUC if needed
])

# SVC can be computationally expensive, so a smaller grid or RandomizedSearchCV is often better
svc_param_grid = {
    'selector__feature_selection__k': [5, 10, 'all'], # Reduced k options for faster tuning
    'classifier__C': [0.1, 1, 10], # Regularization parameter
    'classifier__kernel': ['linear', 'rbf'], # Kernel type
    'classifier__gamma': ['scale', 'auto'] # Kernel coefficient for 'rbf'
}

print("SVC Parameter Grid Size:", np.prod([len(v) for v in svc_param_grid.values()]))

svc_grid_search = GridSearchCV( # Using GridSearchCV for demonstration
    svc_pipeline,
    svc_param_grid,
    cv=3, # Reduced CV folds for SVC due to higher computational cost
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    # n_iter=20 # For RandomizedSearchCV, specify number of iterations
)

print("Starting SVC tuning...")
with tqdm(total=1, desc="SVC Tuning Progress") as pbar:
    svc_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("SVC Tuning Complete.")
all_best_results['Support Vector Classifier'] = {
    "score": svc_grid_search.best_score_,
    "params": svc_grid_search.best_params_
}
print(f"Best SVC Score: {all_best_results['Support Vector Classifier']['score']:.4f}")
print("Best SVC Params:", all_best_results['Support Vector Classifier']['params'])
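As an alternative to the exhaustive grid above, a randomized search samples a fixed number of candidates. The following is a minimal sketch on synthetic data, not the notebook's tuning run: `X_toy`, `y_toy`, `toy_svc`, `toy_distributions`, the `loguniform` range for `C`, and `n_iter=6` are all assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic problem so the sketch runs in seconds.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

toy_svc = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC()),
])

toy_distributions = {
    "classifier__C": loguniform(1e-1, 1e1),   # sampled on a log scale
    "classifier__kernel": ["linear", "rbf"],
    "classifier__gamma": ["scale", "auto"],
}

toy_search = RandomizedSearchCV(
    toy_svc,
    toy_distributions,
    n_iter=6,            # number of sampled candidates, independent of grid size
    cv=3,
    scoring="accuracy",
    random_state=42,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(f"{toy_search.best_score_:.4f}")
```

The cost is set by `n_iter` times the number of folds, which makes the runtime predictable regardless of how many hyperparameters are searched.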

### **Logistic Regression**

In [None]:
print("\n--- Tuning Logistic Regression ---")
lr_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)) # liblinear is good for small datasets
])

lr_param_grid = {
    'selector__feature_selection__k': k_options,
    'classifier__C': [0.01, 0.1, 1, 10, 100], # Inverse of regularization strength
    'classifier__penalty': ['l1', 'l2'] # Regularization type
}

print("Logistic Regression Parameter Grid Size:", np.prod([len(v) for v in lr_param_grid.values()]))

lr_grid_search = GridSearchCV(
    lr_pipeline,
    lr_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Starting Logistic Regression tuning...")
with tqdm(total=1, desc="Logistic Regression Tuning Progress") as pbar:
    lr_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("Logistic Regression Tuning Complete.")
all_best_results['Logistic Regression'] = {
    "score": lr_grid_search.best_score_,
    "params": lr_grid_search.best_params_
}
print(f"Best Logistic Regression Score: {all_best_results['Logistic Regression']['score']:.4f}")
print("Best Logistic Regression Params:", all_best_results['Logistic Regression']['params'])

### **Random Forest**

In [None]:
print("\n--- Tuning Random Forest ---")
rf_pipeline = Pipeline(steps=[
    ('selector', selector),
    ('classifier', RandomForestClassifier(random_state=42))
])

rf_param_grid = {
    'selector__feature_selection__k': k_options,
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20], # None means nodes are expanded until all leaves are pure
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

print("Random Forest Parameter Grid Size:", np.prod([len(v) for v in rf_param_grid.values()]))

rf_grid_search = GridSearchCV(
    rf_pipeline,
    rf_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Starting Random Forest tuning...")
with tqdm(total=1, desc="Random Forest Tuning Progress") as pbar:
    rf_grid_search.fit(X_train, y_train)
    pbar.update(1)

print("Random Forest Tuning Complete.")
all_best_results['Random Forest'] = {
    "score": rf_grid_search.best_score_,
    "params": rf_grid_search.best_params_
}
print(f"Best Random Forest Score: {all_best_results['Random Forest']['score']:.4f}")
print("Best Random Forest Params:", all_best_results['Random Forest']['params'])
print("-" * 50)


In [None]:
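# Note: the XGBoost search was scored with ROC AUC, while the other searches used accuracy,
# so this ranking mixes two metrics and the scores are not directly comparable.
sorted_results = sorted(all_best_results.items(), key=lambda item: item[1]['score'], reverse=True)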

for model_name, result in sorted_results:
    print(f"Model: {model_name}")
    print(f"  Best CV Score: {result['score']:.4f}")
    print(f"  Best Parameters: {result['params']}")
    print("-" * 30)
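One way to keep tuned models comparable is to record several metrics in every search. `GridSearchCV` accepts a list of scorer names in `scoring`, with `refit` naming the metric used to pick the best candidate. The sketch below uses synthetic data and a plain logistic regression as stand-ins; `X_toy`, `y_toy`, and `toy_grid` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=42)

toy_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring=["accuracy", "roc_auc"],  # both metrics are computed for each candidate
    refit="accuracy",                 # the best candidate is chosen by accuracy
)
toy_grid.fit(X_toy, y_toy)

best = toy_grid.best_index_
print(f"accuracy: {toy_grid.cv_results_['mean_test_accuracy'][best]:.4f}")
print(f"roc_auc:  {toy_grid.cv_results_['mean_test_roc_auc'][best]:.4f}")
```

With this setup, every model's summary can report the same pair of metrics, whichever one is used for selection.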