# **Problem Statement**  
## **20. Use LightGBM on a large dataset and compare it with Random Forest.**

Train LightGBM and a Random Forest classifier on a large synthetic binary classification dataset, then compare the two models on test accuracy and training time.

### Constraints & Example Inputs/Outputs

### Constraints
- Dataset is large (on the order of 100,000 samples)
- Binary classification
- Use:
    - LightGBM
    - RandomForestClassifier
- Compare:
    - Accuracy
    - Training time
- No AutoML

### Example Input:
```text
Dataset size: 100,000 samples
Features: 20 numerical features
Target: Binary (0 / 1)
```

### Expected Output:
```text
LightGBM accuracy comparable to Random Forest (within about 1 percentage point)
LightGBM training time << Random Forest
Model comparison summary
```

### Solution Approach

### What is LightGBM?
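LightGBM is a gradient boosting framework with:
- Histogram-based splitting (feature values are bucketed into discrete bins)
- Leaf-wise tree growth (the leaf with the largest loss reduction is split next)
- Fast training on large datasets
- Lower memory footprint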
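
To make "histogram-based splitting" concrete, here is a minimal NumPy sketch for a single feature. It is an illustration only, not LightGBM's implementation: the function name `best_histogram_split`, the 16 equal-width bins, and the simplified gain (squared-error loss with unit hessians, no regularization) are all assumptions made for this example.

```python
import numpy as np

def best_histogram_split(x, grad, n_bins=16):
    """Pick a split threshold for one feature from binned gradient sums.

    Hypothetical helper: scores each candidate by
    G_left^2 / n_left + G_right^2 / n_right.
    """
    # Bucket the feature once; after this only bin indices are used.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

    # One pass over the data builds per-bin gradient sums and counts.
    grad_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    count_hist = np.bincount(bins, minlength=n_bins)

    # Only n_bins - 1 candidate thresholds are scanned, regardless of len(x).
    g_left = np.cumsum(grad_hist)[:-1]
    n_left = np.cumsum(count_hist)[:-1]
    g_right = grad_hist.sum() - g_left
    n_right = count_hist.sum() - n_left

    valid = (n_left > 0) & (n_right > 0)
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = (g_left[valid] ** 2 / n_left[valid]
                   + g_right[valid] ** 2 / n_right[valid])
    best = int(np.argmax(gain))
    return edges[best + 1], gain[best]

# Toy data: the gradient changes sign at x = 0.25.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
grad = np.where(x < 0.25, -1.0, 1.0)
threshold, gain = best_histogram_split(x, grad)
print(f"chosen threshold: {threshold:.3f}")  # expected to land close to 0.25
```

The point of the sketch is that after binning, the search scans bins rather than individual samples, which is where the speed advantage on large datasets comes from.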

### Why Random Forest?
Random Forest is a bagging-based ensemble of decision trees:
- Each tree is trained on a bootstrap sample of the data
- Strong baseline
- Slower to train on large datasets
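
The bagging idea can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions: the helper `bagged_trees_predict` is hypothetical, it uses a hard majority vote, and it runs on a small 2,000-sample dataset so it finishes quickly. scikit-learn's `RandomForestClassifier` averages predicted class probabilities instead of counting votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X_train, y_train, X_test, n_trees=25, seed=0):
    """Train trees on bootstrap samples and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros((n_trees, len(X_test)), dtype=int)
    for t in range(n_trees):
        # Bootstrap: draw n rows with replacement.
        idx = rng.integers(0, n, size=n)
        # max_features="sqrt" adds the per-split feature subsampling
        # that distinguishes a random forest from plain bagging.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X_train[idx], y_train[idx])
        votes[t] = tree.predict(X_test)
    # Majority vote across trees (labels are 0/1).
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_classification(
    n_samples=2_000, n_features=20, n_informative=15, random_state=42
)
preds = bagged_trees_predict(X[:1500], y[:1500], X[1500:])
print("Bagged accuracy:", (preds == y[1500:]).mean())
```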

### Comparison Strategy
1. Generate large dataset
2. Train Random Forest
3. Train LightGBM
4. Measure accuracy
5. Measure training time
6. Run test cases
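
Steps 2 to 5 follow the same pattern for both models, so they can be wrapped in one helper. The sketch below assumes a hypothetical function `benchmark` and a small demo dataset; any estimator with `fit` and `predict`, including `lgb.LGBMClassifier`, could be passed in the same way.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def benchmark(model, X_train, y_train, X_test, y_test):
    """Return (test accuracy, training seconds) for a fit/predict estimator."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return accuracy, train_time

# Small demo dataset (an assumption, chosen so the example runs quickly).
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=20, n_informative=15, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_acc, demo_time = benchmark(
    RandomForestClassifier(n_estimators=20, random_state=42), Xtr, ytr, Xte, yte
)
print(f"accuracy={demo_acc:.4f}, train_time={demo_time:.3f}s")
```

Timing only the `fit` call keeps prediction cost out of the training-time figure, which matches how the comparison is defined in the constraints.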

### Solution Code

In [1]:
# Step 1: Install and Import Libraries
!pip install lightgbm

import numpy as np
import pandas as pd
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb


Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.6.0


In [14]:
# Step 2: Create Large Dataset

X, y = make_classification(
    n_samples=100_000,
    n_features=20,
    n_informative=15,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [15]:
# Approach 1: Baseline: Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

start = time.time()
rf.fit(X_train, y_train)
rf_train_time = time.time() - start


In [16]:
rf_preds = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)

print("Random Forest Accuracy:", rf_accuracy)
print("Random Forest Training Time:", rf_train_time)


Random Forest Accuracy: 0.97285
Random Forest Training Time: 4.52554178237915


### Alternative Solution

In [17]:
# Approach 2: Optimized: LightGBM
lgbm = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=31,
    random_state=42
)

start = time.time()
lgbm.fit(X_train, y_train)
lgbm_train_time = time.time() - start


[LightGBM] [Info] Number of positive: 40078, number of negative: 39922
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002201 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 80000, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500975 -> initscore=0.003900
[LightGBM] [Info] Start training from score 0.003900


In [18]:
lgbm_preds = lgbm.predict(X_test)
lgbm_accuracy = accuracy_score(y_test, lgbm_preds)

print("LightGBM Accuracy:", lgbm_accuracy)
print("LightGBM Training Time:", lgbm_train_time)


LightGBM Accuracy: 0.96795
LightGBM Training Time: 0.5945696830749512




In [19]:
# Model Comparison Summary
comparison_df = pd.DataFrame({
    "Model": ["Random Forest", "LightGBM"],
    "Accuracy": [rf_accuracy, lgbm_accuracy],
    "Training Time (s)": [rf_train_time, lgbm_train_time]
})

comparison_df


| | Model | Accuracy | Training Time (s) |
|---|---|---|---|
| 0 | Random Forest | 0.97285 | 4.525542 |
| 1 | LightGBM | 0.96795 | 0.59457 |
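
The trade-off in these numbers is easier to read as a speedup and an accuracy gap. The short calculation below hard-codes the values reported above; the 20,000 figure is the test set size implied by `test_size=0.2` on 100,000 samples.

```python
# Values copied from the comparison output above.
rf_acc, rf_time = 0.97285, 4.525542
lgbm_acc, lgbm_time = 0.96795, 0.59457
n_test = 20_000  # 20% of 100,000 samples

speedup = rf_time / lgbm_time
acc_gap_pp = (rf_acc - lgbm_acc) * 100
extra_errors = round((rf_acc - lgbm_acc) * n_test)

print(f"Speedup: {speedup:.1f}x")                            # Speedup: 7.6x
print(f"Accuracy gap: {acc_gap_pp:.2f} percentage points")   # Accuracy gap: 0.49 percentage points
print(f"Extra misclassified test samples: {extra_errors}")   # 98
```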


### Alternative Approaches

**Baseline**
- Random Forest
- Bagging-based
- Easy to configure, with built-in feature importances
- Slower to train on large datasets (about 4.5 s vs 0.6 s in this run)

**Optimized**
- LightGBM
- Gradient boosting
- Histogram-based splitting
- Fast training on large datasets

**Advanced**
- XGBoost comparison
- CatBoost for categorical data
- Distributed LightGBM training

### Test Case

In [20]:
# Test Case 1: Models Trained Successfully

assert rf is not None and lgbm is not None
print("Test Case 1 Passed: Models trained")


Test Case 1 Passed: Models trained


In [21]:
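# Test Case 2: Accuracy Check
# LightGBM is not guaranteed to beat Random Forest, so check that the two are within 0.01 of each other.

assert abs(lgbm_accuracy - rf_accuracy) < 0.01
print("Test Case 2 Passed: LightGBM accuracy is within 0.01 of Random Forest")


Test Case 2 Passed: LightGBM accuracy is within 0.01 of Random Forest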

In [22]:
# Test Case 3: Training Time Comparison

assert lgbm_train_time < rf_train_time
print("Test Case 3 Passed: LightGBM trains faster")


Test Case 3 Passed: LightGBM trains faster
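
A single wall-clock measurement can be noisy, so a timing comparison is more robust when it uses the median of several runs. The sketch below assumes a hypothetical helper `median_runtime` and times a stand-in workload; in the notebook the callable could instead fit one of the models.

```python
import statistics
import time

def median_runtime(fn, repeats=5):
    """Call fn() several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload. In the notebook this could be, for example:
#   lambda: RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)
demo_seconds = median_runtime(lambda: sum(i * i for i in range(100_000)))
print(f"median runtime: {demo_seconds:.4f}s")
```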


In [23]:
# Test Case 4: Single Sample Prediction

sample = X_test[:1]

print("RF Prediction:", rf.predict(sample))
print("LightGBM Prediction:", lgbm.predict(sample))


RF Prediction: [1]
LightGBM Prediction: [1]




## Complexity Analysis

**Random Forest**
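- Time: O(T × m × N log N) for T trees, N samples, and m features considered per split (assuming roughly balanced trees)
- Space: O(T × nodes), since every tree is stored in full

**LightGBM**
- Time: about O(N × F) to build histograms plus O(bins × F) to search splits at each tree level, repeated for T trees and F features; no per-split sorting of samples is needed
- Space: O(N × F) for the binned data plus O(bins × F) per histogram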

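➡️ In this run LightGBM trained about 7.6× faster (0.59 s vs 4.53 s), while Random Forest was about 0.5 percentage points more accurate (0.97285 vs 0.96795)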
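
To get a feel for the log N factor in the Random Forest estimate, the snippet below evaluates it for a few dataset sizes, including the 80,000 training rows used here. This is back-of-the-envelope arithmetic only: constant factors, parallelism (`n_jobs=-1`), and the number of trees all affect real runtimes, so it is not a prediction of the measured speedup.

```python
import math

# Size of the log factor at a few dataset sizes.
for n in (1_000, 80_000, 1_000_000):
    print(f"N={n:>9,}  log2(N)={math.log2(n):5.1f}")

n_train = 80_000  # 80% of 100,000 samples
log_factor = math.log2(n_train)
print(f"log2({n_train}) = {log_factor:.1f}")  # log2(80000) = 16.3
```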
