
1️⃣ Bagging (Bootstrap Aggregating)

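Train many models independently, each on a bootstrap sample of the training data (rows drawn at random with replacement).

The models' predictions are combined by majority vote (classification) or by averaging (regression).

Goal = reduce variance (make predictions stable).

Example: Random Forest (bagged decision trees that also consider only a random subset of features at each split)

💡 Think: “many trees trained separately, then majority vote.”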
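
The bagging idea above can be sketched by hand. This is a minimal illustration, assuming scikit-learn and NumPy are installed; the number of trees (`n_models = 25`), the seeds, and the variable names are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice
all_votes = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

votes = np.stack(all_votes)  # shape: (n_models, n_test_samples)

# Majority vote: for each test sample, pick the most frequent predicted digit
bagged_pred = np.array([np.bincount(col, minlength=10).argmax() for col in votes.T])

single_acc = (all_votes[0] == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree: {single_acc:.3f}   bagged vote: {bagged_acc:.3f}")
```

Each tree sees a slightly different dataset, so their individual mistakes differ, and the vote tends to cancel them out.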

2️⃣ Boosting

Train models sequentially; each new model focuses on correcting the errors of the models before it.

Goal = reduce bias (make predictions smarter).

Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.

💡 Think: “one tree learns, then the next tree corrects its mistakes.”
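
To make "each model fixes the errors of the last one" concrete, here is a small sketch of boosting for regression with squared loss, where every new tree is fit to the current residuals. The toy sine data, `learning_rate = 0.1`, the 50 rounds, and the depth-2 trees are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (made-up data): noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # round 0: predict the mean everywhere
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small corrective step
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start {mse_history[0]:.4f} -> end {mse_history[-1]:.4f}")
```

The training error shrinks round by round because every tree only has to explain what is left over, which is why boosting mainly attacks bias.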


---


## 🌳 Boosting Types (Easy Comparison)

| Method                           | How it Works                                                                                                                                                           | Pros                                 | Cons                                              |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------ | ------------------------------------------------- |
| **AdaBoost** (Adaptive Boosting) | Starts with a simple weak learner (usually stumps = depth-1 trees). In each round, it **increases weight** on wrongly classified samples so next tree focuses on them. | Simple, works well on clean data.    | Sensitive to noise (outliers get too much focus). |
| **Gradient Boosting**            | Each new tree learns from the **errors (residuals)** of the last one, using gradient descent to minimize loss.                                                         | Flexible, can optimize many losses.  | Can overfit if too many trees.                    |
| **XGBoost** (Extreme GB)         | Same as gradient boosting, but with **regularization** (to prevent overfit), **parallel training**, and **fast handling of missing data**.                             | Faster + better generalization.      | More hyperparams (tuning needed).                 |
| **LightGBM**                     | Uses **histograms + leaf-wise growth** → grows trees faster and deeper. Great for very large datasets.                                                                 | Extremely fast, handles huge data.   | Can overfit small data.                           |
| **CatBoost**                     | Specially optimized for **categorical features** (like gender, city, color) without needing manual encoding.                                                           | Best for categorical-heavy datasets. | Training can be slower than LightGBM.             |
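
The AdaBoost row says misclassified samples get more weight. One round of the classic (discrete) AdaBoost update, with labels coded as -1/+1, can be sketched as below. The five labels and predictions are invented for the example.

```python
import numpy as np

# Made-up labels and weak-learner predictions (-1 / +1); one mistake at index 1
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])

w = np.full(len(y_true), 1 / len(y_true))  # start with equal weights (0.2 each)

miss = y_true != y_pred
err = w[miss].sum()                     # weighted error of this learner: 0.2
alpha = 0.5 * np.log((1 - err) / err)   # learner's vote strength: ln(2) ≈ 0.693

# Misclassified samples are scaled up by e^alpha, correct ones down by e^-alpha
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()                         # renormalize so the weights sum to 1

print("alpha:", round(alpha, 3))
print("new weights:", w.round(3))       # [0.125 0.5 0.125 0.125 0.125]
```

After one round the single misclassified sample carries half of the total weight, which also shows why a noisy or mislabeled point can end up dominating later rounds.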

---

## 📊 XGBoost Metrics

### 🔹 Classification

| Metric       | Meaning                                                                       | When to Use                                             |
| ------------ | ----------------------------------------------------------------------------- | ------------------------------------------------------- |
| **error**    | % of wrong predictions (1 - accuracy). Lower = better.                        | Quick check, balanced data.                             |
| **logloss**  | Considers both correctness and confidence of prediction. Lower = better.      | Binary classification; use `mlogloss` for multi-class.  |
| **auc**      | Ability to rank positive vs negative correctly (0.5 = random, 1.0 = perfect). | Imbalanced binary classification (fraud, disease).      |
| **mlogloss** | Multi-class version of logloss.                                               | Multi-class problems (digits, images, text categories). |
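
As a rough sanity check of what these metrics measure, the binary ones can be reproduced with scikit-learn on a tiny example. The four labels and probabilities are made up, and the scikit-learn functions are used here as stand-ins for XGBoost's built-in `error`, `logloss`, and `auc`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Made-up binary labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

y_hat = (y_prob >= 0.5).astype(int)         # threshold at 0.5 -> [0, 0, 0, 1]

error = 1 - accuracy_score(y_true, y_hat)   # share of wrong labels: 0.25
ll = log_loss(y_true, y_prob)               # penalizes confident mistakes
auc = roc_auc_score(y_true, y_prob)         # ranking quality: 0.75

print(f"error={error:.2f}  logloss={ll:.4f}  auc={auc:.2f}")
```

Note that `error` only looks at the thresholded labels, while `logloss` and `auc` use the probabilities themselves, so two models with the same `error` can still differ on the other two.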

---

### 🔹 Regression

| Metric    | Meaning                                                                   | When to Use                                                            |
| --------- | ------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **rmse**  | Square-root of average squared error. Big errors hurt more.               | When big mistakes are critical (house prices).                         |
| **mae**   | Average absolute error (treats all errors equally).                       | When all errors are equally bad.                                       |
| **rmsle** | Root mean squared log error (reduces impact of very large target values). | When target values vary a lot (like predicting population or revenue). |
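
The difference between the three regression metrics is easiest to see on a small example with one large miss. The numbers below are invented for illustration, and the metrics are computed directly with NumPy.

```python
import numpy as np

# Made-up targets and predictions; the last prediction is off by 400
y_true = np.array([100.0, 200.0, 300.0, 1000.0])
y_pred = np.array([110.0, 190.0, 310.0, 600.0])

errors = y_pred - y_true

rmse = np.sqrt(np.mean(errors ** 2))   # squaring lets the big miss dominate
mae = np.mean(np.abs(errors))          # every unit of error counts the same
# RMSLE compares log(1 + value), so it measures relative rather than absolute error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(f"rmse={rmse:.2f}  mae={mae:.2f}  rmsle={rmsle:.3f}")
```

Here `mae` is 107.5 while `rmse` is roughly 200, almost entirely because of the single large error.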

---

✅ **Quick rule**:

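* Balanced classes → `error` / `logloss` (multi-class: `merror` / `mlogloss`)
* Imbalanced classes → `auc`
* Regression where large errors must be penalized heavily → `rmse`
* Regression where outliers should not dominate the score → `mae`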

---

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the digits dataset and hold out 20% of it for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Train XGBoost
xgb = XGBClassifier(n_estimators=50, learning_rate=1.0, max_depth=3, eval_metric='mlogloss')
xgb.fit(X_train, y_train)


# Case 1: high learning rate, few trees (trains fast, but is usually less accurate)
#xgb1 = XGBClassifier(n_estimators=50, learning_rate=1.0, max_depth=3, eval_metric='mlogloss')
#xgb1.fit(X_train, y_train)
#print("High LR Accuracy:", accuracy_score(y_test, xgb1.predict(X_test)))

# Case 2: low learning rate, many trees (trains slower, but is usually more accurate)
#xgb2 = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3, eval_metric='mlogloss')
#xgb2.fit(X_train, y_train)
#print("Low LR Accuracy:", accuracy_score(y_test, xgb2.predict(X_test)))

# Predict
y_pred = xgb.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       0.93      0.96      0.95        28
           2       0.97      1.00      0.99        33
           3       1.00      0.94      0.97        34
           4       1.00      0.93      0.97        46
           5       0.96      0.96      0.96        47
           6       0.97      0.94      0.96        35
           7       0.97      0.97      0.97        34
           8       0.97      1.00      0.98        30
           9       0.91      0.97      0.94        40

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360



In [6]:
# Practice problem: compare the accuracy of XGBoost (boosting) with a plain Random Forest (bagging)

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report


# Load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)

# Initialize and train the models

# Model 1: Random Forest with shallow trees (max_depth=3)
rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
rf.fit(X_train, y_train)

# Model 2: XGBoost with the same number of trees and the same depth
xgb = XGBClassifier(n_estimators=50, learning_rate=1.0, max_depth=3, eval_metric='mlogloss')
xgb.fit(X_train, y_train)


# Get the predictions

y_pred_rf = rf.predict(X_test)
y_pred_xgb = xgb.predict(X_test)

# Evaluate the models

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))


Random Forest Accuracy: 0.8833333333333333
              precision    recall  f1-score   support

           0       0.94      0.97      0.96        33
           1       0.87      0.71      0.78        28
           2       0.83      0.91      0.87        33
           3       0.84      0.94      0.89        34
           4       0.98      0.91      0.94        46
           5       0.98      0.85      0.91        47
           6       0.87      0.97      0.92        35
           7       0.79      1.00      0.88        34
           8       0.91      0.70      0.79        30
           9       0.82      0.82      0.82        40

    accuracy                           0.88       360
   macro avg       0.88      0.88      0.88       360
weighted avg       0.89      0.88      0.88       360

XGBoost Accuracy: 0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       0.93      0.96      0.95        2