# Topic: Ensemble XGBoost vs LightGBM
**Members:**
- 22127070 - Nguyễn Quang Doãn
- 22127102 - Phan Vũ Gia Hân
- 22127373 - Trịnh Anh Tài

**This notebook implements a simple comparison between XGBoost and LightGBM on a tabular classification dataset (heart disease data from Kaggle).**

## 1. Library Setup

### 1.1 Install missing packages

Run the cells below if these packages are not installed yet.

In [None]:
!pip install xgboost lightgbm

In [None]:
%pip install --upgrade pip

In [None]:
import sys
print(sys.executable)

In [None]:
!{sys.executable} -m pip install xgboost lightgbm

### 1.2 Import libraries

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

import time

### 1.3 Define global configuration

In [None]:
RANDOM_STATE = 42

## 2. Dataset Overview
We use the **Heart Failure Prediction** dataset from Kaggle.

### 2.1 Load dataset

In [None]:
# Load the dataset
df = pd.read_csv("heart.csv")

### 2.2 Inspect first rows

In [None]:
df.head()

### 2.3 Check dataset information

In [None]:
df.info()

### 2.4 Check missing values and statistics

In [None]:
# Count missing values per column
df.isnull().sum()

In [None]:
df.describe()

## 3. Preprocessing

### 3.1 Identify categorical variables  

XGBoost and LightGBM cannot train directly on raw string columns.  
Categorical features such as `Sex`, `ChestPainType`, and `ST_Slope` hold text labels rather than numbers, so the model cannot compare or interpret them as they are.

Encoding converts these labels into numerical form so that the model:
- Treats each category as a separate group
- Does not assume an incorrect ordering between categories
- Can train properly on the full dataset

Therefore, all non-numeric (categorical) columns must be encoded before training.

In [None]:
cat_variables = [
    'Sex',
    'ChestPainType',
    'RestingECG',
    'ExerciseAngina',
    'ST_Slope'
]   
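
The list above is written by hand. As a cross-check, the non-numeric columns can also be detected from the dtypes. The sketch below assumes a small toy frame (the values are made up; only the column names mirror `heart.csv`) and uses `select_dtypes` to list the columns that still need encoding.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror heart.csv.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "HeartDisease": [0, 1, 0],
})

# Every column that is not numeric holds text labels and needs encoding.
detected = toy.select_dtypes(exclude="number").columns.tolist()
print(detected)
```

Running the same call on `df` before encoding should give a list that can be compared against `cat_variables`.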

### 3.2 One-hot encoding

One-hot encoding creates one binary (0/1) column for each category.  
For example, `ChestPainType` becomes `ChestPainType_ASY`, `ChestPainType_ATA`, and so on.

This avoids imposing a false numeric ordering on the categories and is a simple, safe default for tree-based models.

Pandas provides a built-in function, `pd.get_dummies()`, that performs **one-hot encoding**.

In [None]:
df = pd.get_dummies(df, prefix=cat_variables, columns=cat_variables, dtype=int)
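
To make the effect of `pd.get_dummies()` concrete, here is a minimal sketch on a toy frame. The values are invented for illustration and are not taken from the dataset; the call mirrors the one above with a single categorical column.

```python
import pandas as pd

# Toy data (invented values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "ChestPainType": ["ATA", "NAP", "ATA"],
})

# Same call pattern as in the notebook, restricted to one column.
encoded = pd.get_dummies(
    toy,
    prefix=["ChestPainType"],
    columns=["ChestPainType"],
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one `1` among the new `ChestPainType_*` columns, and the numeric column `Age` is left unchanged.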

### 3.3 Verify numerical-only dataset after encoding

In [None]:
df.head()

All variables are now numerical, which makes the dataset ready for XGBoost and LightGBM.

### 3.4 Create feature matrix (X) and label (y)

We separate the dataset into:
- **X**: all input features used for prediction  
- **y**: the target variable *HeartDisease*

This is done by removing the target column from the DataFrame and keeping it as a separate label vector.

In [None]:
X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]

## 4. Train–Validation Split

We use `train_test_split` from scikit-learn to split the dataset into a training set and a validation set (not a test set).
The models are fitted on the **training set** (`X_train`, `y_train`) and evaluated on the **validation set** (`X_val`, `y_val`), which helps us detect **overfitting**.
This step ensures that each model is evaluated on data it has not seen during training.

#### Split the data
Splitting allows us to measure how well the model generalizes.  
If we train and evaluate on the same data, the model may simply memorize the training samples.
We split 80% of the data for training and 20% for validation.

#### Shuffle the data
Shuffling randomly mixes the rows before splitting.  
This avoids any unintended ordering patterns (for example, all positive cases grouped together) that could bias the split.
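
The effect of shuffling can be seen on a small artificial example. The sketch below assumes deliberately sorted labels (50 negatives followed by 50 positives, not the real dataset) and compares an unshuffled split with a shuffled one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Artificial, deliberately sorted labels: 50 negatives, then 50 positives.
y_sorted = np.array([0] * 50 + [1] * 50)
X_dummy = np.arange(100).reshape(-1, 1)

# Without shuffling, the last 20% of the rows become the validation set.
_, _, _, y_val_ordered = train_test_split(
    X_dummy, y_sorted, train_size=0.8, shuffle=False
)

# With shuffling, rows are mixed before the split.
_, _, _, y_val_shuffled = train_test_split(
    X_dummy, y_sorted, train_size=0.8, shuffle=True, random_state=42
)

print("Positive rate, ordered split: ", y_val_ordered.mean())
print("Positive rate, shuffled split:", y_val_shuffled.mean())
```

In the unshuffled case the validation set contains only positive samples, so it would say nothing about how the model handles negatives.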

#### Validation set
In this project, our goal is to **compare XGBoost and LightGBM** using the same pipeline.  
A validation set is sufficient to compare their performance because:
- it provides an unbiased evaluation during development  
- we are not producing a final benchmark yet  
- a separate test set can be introduced later if needed

#### Output
- **X_train**: input data (features) used to train the model.
- **X_val**: input data used to evaluate the model after training.
- **y_train**: labels corresponding to the samples in X_train, used as the “answers” the model learns from.
- **y_val**: labels corresponding to X_val, used to compute accuracy and check the model's generalization ability.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    train_size = 0.8, # train set 80% and validation 20%
    random_state = RANDOM_STATE
)

## 5. Model Training

### 5.1 Train XGBoost

In [None]:
# Measure training time
start_time = time.time()

# Initialize XGBoost classifier
xgb_model = XGBClassifier(
    n_estimators = 300,          # Number of trees (boosting rounds)
    learning_rate = 0.1,         # Step size shrinkage (controls how fast the model learns)
    max_depth = 5,               # Maximum depth of each tree (controls model complexity)
    random_state = RANDOM_STATE  # For reproducible results
)

# Train (fit) the model on the training set
xgb_model.fit(X_train, y_train)

# Calculate total training time
xgb_train_time = time.time() - start_time

### 5.2 Train LightGBM  

In [None]:
# Measure training time
start_time = time.time()

# Initialize LightGBM classifier
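lgbm_model = LGBMClassifier(
    n_estimators=300,           # Number of trees (boosting rounds)
    learning_rate=0.05,         # Step size shrinkage
    num_leaves=31,              # Maximum number of leaves per tree (controls model complexity)
    objective="binary",         # Binary classification (HeartDisease: 0/1)
    random_state=RANDOM_STATE   # For reproducible results
)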

# Train the model on the training data
lgbm_model.fit(X_train, y_train)

# Calculate total training time
lgbm_train_time = time.time() - start_time

## 6. Model Evaluation

### 6.1 Accuracy

We evaluate both models using **Accuracy**, which measures how many predictions match the true labels.  
Since our task is binary classification (HeartDisease: 0/1), accuracy is a straightforward and appropriate baseline metric.

We compute accuracy for:
- XGBoost model  
- LightGBM model  

Both models are evaluated on the **validation set (X_val)**, which was kept separate during training.

In [None]:
# Predict on validation set
y_pred_xgb = xgb_model.predict(X_val)
y_pred_lgbm = lgbm_model.predict(X_val)

# Compute accuracy
accuracy_xgb = accuracy_score(y_val, y_pred_xgb)
accuracy_lgbm = accuracy_score(y_val, y_pred_lgbm)

# Print results
print("XGBoost Accuracy:", round(accuracy_xgb, 4))
print("LightGBM Accuracy:", round(accuracy_lgbm, 4))
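
As a small worked example of what accuracy measures, the sketch below computes it by hand as correct predictions divided by total predictions. The labels are invented for illustration and are not model outputs.

```python
# Invented labels, chosen only to illustrate the arithmetic.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count positions where the prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy = correct predictions / total predictions.
accuracy = correct / len(y_true)
print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy}")
```

`accuracy_score(y_true, y_pred)` performs the same calculation for these two lists.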

### 6.2 Classification report

While accuracy gives us an overall correctness score, it does not show how the model performs on each class.  
Therefore, we use the **Classification Report**, which includes:

- **Precision**: How many predicted positives are correct  
- **Recall**: How many actual positives are correctly detected  
- **F1-score**: Harmonic mean of precision and recall  
- **Support**: Number of samples in each class

This provides a more detailed view of model performance on the two classes (HeartDisease = 0 or 1).

In [None]:
# Generate classification report for XGBoost
print("=== XGBoost Classification Report ===")
print(classification_report(y_val, y_pred_xgb))

# Generate classification report for LightGBM
print("=== LightGBM Classification Report ===")
print(classification_report(y_val, y_pred_lgbm))
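
To show where the numbers in the report come from, here is a minimal sketch that derives precision, recall, F1-score, and support for the positive class from raw counts. The labels are invented for illustration and do not come from either model.

```python
# Invented labels for illustration (1 = HeartDisease, 0 = no HeartDisease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = sum(y_true)       # number of actual positive samples

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Here precision (3 of 5 predicted positives) and recall (3 of 4 actual positives) differ, which is exactly the kind of detail that accuracy alone hides.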

### 6.3 Summary comparison
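
One possible way to collect the results side by side is a small table of accuracy and training time. The sketch below is only a suggestion: `build_summary` is a hypothetical helper (not part of pandas or either library), and the call that uses the notebook's variables is left as a comment because it depends on the cells above having been run.

```python
import pandas as pd


def build_summary(results):
    """Build a comparison table from {model name: (accuracy, train time in seconds)}.

    Hypothetical helper written for this notebook, not a library function.
    """
    rows = [
        {
            "Model": name,
            "Accuracy": round(acc, 4),
            "Train time (s)": round(seconds, 4),
        }
        for name, (acc, seconds) in results.items()
    ]
    return pd.DataFrame(rows).set_index("Model")


# In this notebook, the values computed in sections 5 and 6 would be passed in:
# summary = build_summary({
#     "XGBoost": (accuracy_xgb, xgb_train_time),
#     "LightGBM": (accuracy_lgbm, lgbm_train_time),
# })
# summary
```

Note that a single timed run on a small dataset is noisy, so the training times are best read as a rough indication rather than a benchmark.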