# Topic: Ensemble XGBoost vs LightGBM
**Members:**
- 22127070 - Nguyễn Quang Doãn
- 22127102 - Phan Vũ Gia Hân
- 22127373 - Trịnh Anh Tài

**This notebook implements a simple comparison between XGBoost and LightGBM on a tabular classification dataset (heart disease data from Kaggle).**

## 1. Library Setup

### 1.1 Install missing packages

Run the cells below if the packages are not installed yet.

In [1]:
!pip install xgboost lightgbm


In [2]:
pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [3]:
import sys
print(sys.executable)

/opt/homebrew/anaconda3/bin/python


In [4]:
!{sys.executable} -m pip install xgboost lightgbm



### 1.2 Import libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

import time

### 1.3 Define global configuration

In [2]:
RANDOM_STATE = 42

## 2. Dataset Overview
We use the **Heart Failure Prediction** dataset from Kaggle.

### 2.1 Load dataset

In [3]:
# Load the dataset
df = pd.read_csv("heart.csv")

### 2.2 Inspect first rows

In [4]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


### 2.3 Check dataset information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


### 2.4 Check missing values and statistics

In [6]:
df.isnull()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
913,False,False,False,False,False,False,False,False,False,False,False,False
914,False,False,False,False,False,False,False,False,False,False,False,False
915,False,False,False,False,False,False,False,False,False,False,False,False
916,False,False,False,False,False,False,False,False,False,False,False,False
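
The full boolean table is hard to scan at 918 rows. A more compact check is to sum the mask per column. The sketch below uses a small made-up frame (the `toy` data is hypothetical and only illustrates the call):

```python
import pandas as pd

# Hypothetical toy frame with two deliberately missing cells
toy = pd.DataFrame({
    "Age": [40, 49, None],
    "Sex": ["M", None, "F"],
    "MaxHR": [172, 156, 98],
})

# isnull() gives a boolean mask; summing it counts the True cells per column
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing cells:", total_missing)
```

Applied to `df`, the same `df.isnull().sum()` call should report 0 for every column, in line with the 918 non-null counts shown by `df.info()`.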


In [7]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0
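
The `min` row shows 0.0 for both `RestingBP` and `Cholesterol`. Such values are unlikely to be real measurements and may be placeholders for missing data, which `isnull()` would not flag. A minimal sketch for counting them, using a hypothetical helper `count_zeros` and made-up toy values:

```python
import pandas as pd

def count_zeros(frame, columns):
    """Return the number of exact-zero entries in each listed column."""
    return {col: int((frame[col] == 0).sum()) for col in columns}

# Hypothetical toy values, not rows from heart.csv
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 0, 0, 214],
})

zero_counts = count_zeros(toy, ["RestingBP", "Cholesterol"])
print(zero_counts)
```

On the real data this would be called as `count_zeros(df, ["RestingBP", "Cholesterol"])`. How to treat such rows is left open here.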


## 3. Preprocessing

### 3.1 Identify categorical variables  

XGBoost and LightGBM, as used in this notebook, cannot train on raw string columns.  
Categorical features such as `Sex`, `ChestPainType`, or `ST_Slope` are stored as text labels, not numbers, so they must be converted before training.

Encoding converts these labels into numerical form so that the model:
- Treats each category as a separate group
- Does not assume an incorrect ordering between categories
- Can train on all columns of the dataset

Therefore, all non-numeric (categorical) columns must be encoded before training.

In [8]:
cat_variables = [
    'Sex',
    'ChestPainType',
    'RestingECG',
    'ExerciseAngina',
    'ST_Slope'
]   
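
The list above is written by hand. As a cross-check, the non-numeric columns can also be detected from the dtypes. This sketch uses a hypothetical two-row frame with a subset of the real column names:

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns
toy = pd.DataFrame({
    "Age": [40, 49],
    "Sex": ["M", "F"],
    "ChestPainType": ["ATA", "NAP"],
    "Oldpeak": [0.0, 1.0],
})

# Keep every column that is not numeric; these are the ones that need encoding
detected_cat_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(detected_cat_columns)
```

On `df`, the same call should return the five columns that `df.info()` lists with dtype `object`.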

### 3.2 One-hot encoding

One-hot encoding creates one binary (0/1) column for each category.  
Example: `ChestPainType` becomes `ChestPainType_ASY`, `ChestPainType_ATA`, etc.

This avoids imposing a false numeric ordering on the categories and is a simple, safe default for features with only a few distinct values, as is the case here.

Pandas provides a built-in function, `pd.get_dummies()`, that performs **one-hot encoding**.

In [9]:
df = pd.get_dummies(df, prefix=cat_variables, columns=cat_variables, dtype=int)
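
To make the effect of the call concrete, here is a minimal sketch on a made-up single-column frame (the four rows are hypothetical, the category labels come from the dataset):

```python
import pandas as pd

# Hypothetical toy column with three distinct categories
toy = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "ATA"]})

# One 0/1 column is created per category, named <prefix>_<category>
encoded = pd.get_dummies(toy, columns=["ChestPainType"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new columns, and the original `ChestPainType` column is dropped.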

### 3.3 Verify numerical-only dataset after encoding

In [10]:
df.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_F,Sex_M,ChestPainType_ASY,...,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,1
1,49,160,180,0,156,1.0,1,1,0,0,...,1,0,0,1,0,1,0,0,1,0
2,37,130,283,0,98,0.0,0,0,1,0,...,0,0,0,0,1,1,0,0,0,1
3,48,138,214,0,108,1.5,1,1,0,1,...,0,0,0,1,0,0,1,0,1,0
4,54,150,195,0,122,0.0,0,0,1,0,...,1,0,0,1,0,1,0,0,0,1


All columns are now numeric, so the data can be passed directly to XGBoost and LightGBM later.

### 3.4 Create feature matrix (X) and label (y)

We separate the dataset into:
- **X**: all input features used for prediction  
- **y**: the target variable *HeartDisease*

This is done by removing the target column from the DataFrame and keeping it as a separate label vector.

In [12]:
X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]

## 4. Train–Validation Split

We use `train_test_split` from scikit-learn to split the dataset into a training set and a validation set (not a test set).  
The models are trained on the training set (`X_train`, `y_train`) and evaluated on the validation set (`X_val`, `y_val`), which helps us detect **overfitting**.  
This step ensures that each model is evaluated on data it has not seen during training.

#### Split the data
Splitting allows us to measure how well the model generalizes.  
If we train and evaluate on the same data, the model may simply memorize the training samples.
We use 80% of the data for training and 20% for validation, which gives 734 training rows and 184 validation rows.

#### Shuffle the data
Shuffling randomly mixes the rows before splitting.  
This avoids any unintended ordering patterns (for example, all positive cases grouped together) that could bias the split.

#### Validation set
In this project, our goal is to **compare XGBoost and LightGBM** using the same pipeline.  
A validation set is sufficient to compare their performance because:
- it provides an unbiased evaluation during development  
- we are not producing a final benchmark yet  
- a separate test set can be introduced later if needed

#### Output
- **X_train**: input data (features) used to train the model.
- **X_val**: input data used to evaluate the model after training.
- **y_train**: label corresponding to each sample in X_train, used as the “answer” for the model to learn.
- **y_val**: label corresponding to X_val, used to calculate accuracy and test the model's generalization ability.

In [13]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    train_size = 0.8, # train set 80% and validation 20%
    random_state = RANDOM_STATE
)
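
The effect of shuffling described above can be seen on a deliberately sorted toy label vector. This is a sketch under made-up data (`X_toy` and `y_sorted` are hypothetical); `shuffle=True` is the default in `train_test_split`, so the notebook's call already shuffles:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: all negatives first, then all positives
X_toy = np.arange(20).reshape(-1, 1)
y_sorted = np.array([0] * 10 + [1] * 10)

# Without shuffling, the validation set is simply the last 20% of the rows
_, _, _, y_val_ordered = train_test_split(
    X_toy, y_sorted, train_size=0.8, shuffle=False
)

# With shuffling, rows are mixed before the split
_, _, _, y_val_shuffled = train_test_split(
    X_toy, y_sorted, train_size=0.8, shuffle=True, random_state=42
)

print("No shuffle:", y_val_ordered.tolist())
print("Shuffled:  ", y_val_shuffled.tolist())
```

Without shuffling, the validation labels are all positive, which is exactly the kind of biased split the shuffle step is meant to prevent.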

## 5. Model Training

### 5.1 Train XGBoost

In [15]:
# Measure training time
start_time = time.time()

# Initialize XGBoost classifier
xgb_model = XGBClassifier(
    n_estimators = 300,       # Number of trees (boosting rounds)
    learning_rate = 0.1,     # Step size shrinkage (controls how fast the model learns)
    max_depth = 5,           # Maximum depth of each tree (controls model complexity)
    random_state = RANDOM_STATE  # For reproducible results
)

# Train (fit) the model on the training set
xgb_model.fit(X_train, y_train)

# Calculate total training time
xgb_train_time = time.time() - start_time

### 5.2 Train LightGBM  

In [19]:
# Measure training time
start_time = time.time()

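# Initialize LightGBM classifier
lgbm_model = LGBMClassifier(
    n_estimators=300,          # Number of trees (boosting rounds)
    learning_rate=0.05,        # Step size shrinkage (note: XGBoost above uses 0.1)
    num_leaves=31,             # Maximum number of leaves per tree (controls model complexity)
    objective="binary",        # Binary classification (HeartDisease: 0/1)
    random_state=RANDOM_STATE  # For reproducible results
)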

# Train the model on the training data
lgbm_model.fit(X_train, y_train)

# Calculate total training time
lgbm_train_time = time.time() - start_time

[LightGBM] [Info] Number of positive: 401, number of negative: 333
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001051 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Number of data points in the train set: 734, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.546322 -> initscore=0.185819
[LightGBM] [Info] Start training from score 0.185819
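
Both training cells time the fit with `time.time()`. For short runs like these, `time.perf_counter()` is generally the better clock for measuring intervals, and wrapping the pattern in a function keeps the two measurements identical. A minimal sketch, where `timed_fit` and the stub model are hypothetical names introduced for illustration:

```python
import time

def timed_fit(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

class StubModel:
    """Stand-in with a scikit-learn style fit(), used only for this demo."""
    def fit(self, X, y):
        time.sleep(0.01)  # simulate a short training run
        self.fitted_ = True
        return self

stub = StubModel()
elapsed = timed_fit(stub, [[0], [1]], [0, 1])
print(f"Training took {elapsed:.4f} s")
```

In the notebook this would be used as `xgb_train_time = timed_fit(xgb_model, X_train, y_train)`, and likewise for `lgbm_model`.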


## 6. Model Evaluation

### 6.1 Accuracy

We evaluate both models using **Accuracy**, which measures how many predictions match the true labels.  
Since our task is binary classification (HeartDisease: 0/1) and the classes are roughly balanced (about 55% positive, per `df.describe()`), accuracy is a straightforward and appropriate baseline metric.

We compute accuracy for:
- XGBoost model  
- LightGBM model  

Both models are evaluated on the **validation set (X_val)**, which was kept separate during training.

In [20]:
# Predict on validation set
y_pred_xgb = xgb_model.predict(X_val)
y_pred_lgbm = lgbm_model.predict(X_val)

# Compute accuracy
accuracy_xgb = accuracy_score(y_val, y_pred_xgb)
accuracy_lgbm = accuracy_score(y_val, y_pred_lgbm)

# Print results
print("XGBoost Accuracy:", round(accuracy_xgb, 4))
print("LightGBM Accuracy:", round(accuracy_lgbm, 4))

XGBoost Accuracy: 0.8641
LightGBM Accuracy: 0.8587
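
To put the gap in perspective, the printed accuracies can be converted back into counts. The validation set holds 918 - 734 = 184 rows (total rows minus the training rows reported in the LightGBM log). The sketch below is plain arithmetic on the printed values, not a re-run of the models:

```python
n_total = 918   # rows reported by df.info()
n_train = 734   # training rows reported in the LightGBM log
n_val = n_total - n_train

# Accuracy = correct predictions / validation rows, so invert it
xgb_correct = round(0.8641 * n_val)
lgbm_correct = round(0.8587 * n_val)

print("Validation rows:", n_val)
print("XGBoost correct:", xgb_correct)
print("LightGBM correct:", lgbm_correct)
print("Difference in samples:", xgb_correct - lgbm_correct)
```

The two models differ by a single validation sample, so this split alone does not show that either one is more accurate.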


### 6.2 Classification report

While accuracy gives us an overall correctness score, it does not show how the model performs on each class.  
Therefore, we use the **Classification Report**, which includes:

- **Precision**: How many predicted positives are correct  
- **Recall**: How many actual positives are correctly detected  
- **F1-score**: Harmonic mean of precision and recall  
- **Support**: Number of samples in each class

This provides a more detailed view of model performance on the two classes (HeartDisease = 0 or 1).

In [21]:
# Generate classification report for XGBoost
print("=== XGBoost Classification Report ===")
print(classification_report(y_val, y_pred_xgb))

# Generate classification report for LightGBM
print("=== LightGBM Classification Report ===")
print(classification_report(y_val, y_pred_lgbm))

=== XGBoost Classification Report ===
              precision    recall  f1-score   support

           0       0.82      0.87      0.84        77
           1       0.90      0.86      0.88       107

    accuracy                           0.86       184
   macro avg       0.86      0.86      0.86       184
weighted avg       0.87      0.86      0.86       184

=== LightGBM Classification Report ===
              precision    recall  f1-score   support

           0       0.81      0.87      0.84        77
           1       0.90      0.85      0.88       107

    accuracy                           0.86       184
   macro avg       0.85      0.86      0.86       184
weighted avg       0.86      0.86      0.86       184
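
The metrics in these reports follow directly from the confusion counts. The sketch below computes them by hand for the positive class on a small made-up label vector (not taken from the validation set); `binary_prf` is a hypothetical helper name:

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, F1 and support for one class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Hypothetical toy labels: 4 true positives, 2 false positives, 1 false negative
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

precision, recall, f1, support = binary_prf(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Calling the helper with `positive=0` gives the row for the other class, which is how `classification_report` arrives at one line per label.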



### 6.3 Summary comparison