<a href="https://colab.research.google.com/github/wingated/cs473/blob/main/mini_labs/week_12_xgboost.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BYU CS 473 — XGBoost

In this assignment, you will learn the core ideas behind XGBoost and apply the method to a dataset of your choice.
We’ll connect the math from the textbook to hands-on modeling.

---

## Learning Goals
- Explain the XGBoost objective function and its components.
- Define and use key terms: regularizer, second-order Taylor expansion, leaf weights, gain, and split criterion.
- Apply XGBoost to a dataset, tune hyperparameters, and evaluate results.
- Understand how XGBoost improves upon traditional boosting methods.

## Part 1 — Key Concepts from the Textbook  

Read through the definitions below. For each one, write a **1–2 sentence explanation in your own words**.  

### 1. Regularizer  
Equation (18.47):  
$\Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^J w_j^2$  

**Question:** Why does XGBoost penalize both the **number of leaves** and the **magnitude of leaf weights**?  


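The regularizer penalizes both the number of leaves ($\gamma J$) and the squared magnitude of the leaf weights ($\tfrac{1}{2}\lambda \sum_j w_j^2$). Together these terms discourage overly complex trees and extreme leaf predictions, both of which can lead to overfitting.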
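
As a small illustration of Equation (18.47), the sketch below evaluates the penalty for two hypothetical trees. The function name `tree_penalty` and the leaf weights are made up for this example and are not part of the XGBoost API.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * J + 0.5 * lambda * sum_j w_j^2."""
    num_leaves = len(leaf_weights)                       # J
    l2_term = 0.5 * lam * sum(w * w for w in leaf_weights)
    return gamma * num_leaves + l2_term

# Hypothetical trees: few leaves with small weights vs. many leaves with large weights
small_tree = [0.5, -0.5]
big_tree = [2.0, -1.0, 0.5, -2.0]

small_penalty = tree_penalty(small_tree, gamma=1.0, lam=1.0)
big_penalty = tree_penalty(big_tree, gamma=1.0, lam=1.0)
print(small_penalty)  # 2.25
print(big_penalty)    # 8.625
```

The larger tree pays more on both terms: it has more leaves (the $\gamma J$ term) and larger weights (the $\lambda$ term).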

### 2. Second-order Taylor Expansion of the Loss  
Equation (18.49):  
$L_m(F_m) \approx \sum_{i=1}^N \Big[ \ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \tfrac{1}{2} h_{im} F_m(x_i)^2 \Big] + \Omega(F_m)$  

**Question:** How does including the **Hessian term** (curvature) make boosting more accurate compared to using only gradients?  


Including the Hessian gives each update curvature information. The second-order expansion approximates the loss more closely than a gradient-only (first-order) expansion, so each step accounts for how sharply the loss bends as well as which direction it slopes.
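
To make the comparison concrete, here is a minimal sketch for a single example under the logistic loss, assuming labels in {0, 1} and a raw (pre-sigmoid) score. It compares the exact loss after a step with the first-order and second-order approximations; the helper names are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, score):
    """Logistic loss for a label y in {0, 1} and a raw score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_hess(y, score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)

y, prev_score, step = 1, 0.0, 0.5   # step plays the role of F_m(x_i)
g, h = grad_hess(y, prev_score)

exact = log_loss(y, prev_score + step)
first_order = log_loss(y, prev_score) + g * step
second_order = first_order + 0.5 * h * step ** 2

print("exact:       ", exact)
print("first order: ", first_order)
print("second order:", second_order)
```

For this example the second-order value lands much closer to the exact loss than the first-order value does.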

### 3. Optimal Leaf Weights  
Equation (18.54):  
$w_j^* = - \frac{G_{jm}}{H_{jm} + \lambda}$  

**Question:** What does this formula mean about how leaf weights are chosen?  


Each leaf's weight is the negative sum of the gradients in that leaf ($G_{jm}$) divided by the sum of the Hessians in that leaf plus the regularization strength ($H_{jm} + \lambda$). A larger $\lambda$ enlarges the denominator and shrinks the weight toward zero.
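
A minimal sketch of Equation (18.54), using made-up per-example gradients and Hessians for the examples that fall in one leaf. The function name is illustrative and not an XGBoost call.

```python
def optimal_leaf_weight(grads, hessians, lam):
    """w* = -G / (H + lambda), with G and H summed over the examples in the leaf."""
    G = sum(grads)
    H = sum(hessians)
    return -G / (H + lam)

# Hypothetical leaf containing four examples
grads = [-0.5, -0.5, 0.5, -0.5]      # G = -1.0
hessians = [0.25, 0.25, 0.25, 0.25]  # H = 1.0

w_unregularized = optimal_leaf_weight(grads, hessians, lam=0.0)
w_regularized = optimal_leaf_weight(grads, hessians, lam=1.0)
print(w_unregularized)  # 1.0
print(w_regularized)    # 0.5
```

With $\lambda = 1$ the same leaf receives half the weight it would get without regularization.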

### 4. Gain of a Split  
Equation (18.56):  
$\text{gain} = \tfrac{1}{2}\Bigg( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Bigg) - \gamma$  

**Question:** Why does XGBoost reject splits with **negative gain**?  


A split with negative gain does not improve the regularized objective: the reduction in loss it achieves is smaller than the penalty $\gamma$ for adding a leaf. XGBoost rejects such splits because they add complexity without reducing the loss enough.
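
The sketch below evaluates Equation (18.56) for two hypothetical candidate splits. The gradient and Hessian sums are invented for illustration, and the function names are not part of the XGBoost API.

```python
def leaf_score(G, H, lam):
    """The G^2 / (H + lambda) term that appears three times in the gain."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (children - parent) - gamma

# Children with opposite-sign gradients: the split separates the examples well
useful = split_gain(G_L=-2.0, H_L=1.0, G_R=2.0, H_R=1.0, lam=1.0, gamma=0.5)

# Children with identical gradients: the split separates nothing
useless = split_gain(G_L=-1.0, H_L=1.0, G_R=-1.0, H_R=1.0, lam=1.0, gamma=0.5)

print(useful)   # 1.5
print(useless)  # negative, so this split would be rejected
```

In the second case the children want the same weight as the parent, so the split only adds the $\gamma$ cost of an extra leaf.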

## Part 2 — Visualizing Boosting  

### 2.1 Bagging vs Boosting (Recap)  
Describe in words how **bagging** and **boosting** differ in how they:  
- Use data sampling  
- Build models sequentially or in parallel  
- Reduce bias vs variance  



Bagging samples data with replacement and builds models in parallel to reduce variance, while boosting trains models sequentially, with each new model focusing on correcting errors of the previous ones, reducing bias.
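
The contrast can be sketched on a toy one-dimensional regression problem using decision stumps. This is plain gradient boosting with squared error (each stump fits the current residuals), not XGBoost itself, and the data, `fit_stump` helper, and settings are assumptions made for this example.

```python
import random

def fit_stump(xs, targets):
    """Fit a one-split regression stump by minimizing squared error."""
    mean_all = sum(targets) / len(targets)
    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to a constant
        return lambda x: mean_all
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
n_models = 20

# Bagging: independent stumps on bootstrap samples, predictions averaged
rng = random.Random(0)
bagged = []
for _ in range(n_models):
    idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
    bagged.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
bagged_preds = [sum(s(x) for s in bagged) / n_models for x in xs]

# Boosting: each stump is fit to the residuals left by the ensemble so far
learning_rate = 0.5
boosted_preds = [0.0] * len(xs)
for _ in range(n_models):
    residuals = [y - p for y, p in zip(ys, boosted_preds)]
    stump = fit_stump(xs, residuals)
    boosted_preds = [p + learning_rate * stump(x) for p, x in zip(boosted_preds, xs)]

bagged_mse = mse(bagged_preds, ys)
boosted_mse = mse(boosted_preds, ys)
print("bagged stumps MSE: ", bagged_mse)
print("boosted stumps MSE:", boosted_mse)
```

The bagging loop could run in parallel because no stump depends on another, while the boosting loop is inherently sequential. Averaging weak stumps mostly reduces variance and leaves their bias in place, whereas fitting residuals round after round drives the training error down.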

## Part 3 — Implementing XGBoost on 2 Datasets  

### Step 1 — Look at the example dataset



In [1]:
# Example: load a dataset
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    eta=0.1,        # learning rate
    max_depth=3,    # tree depth
    n_estimators=100,
    reg_lambda=1.0, # L2 regularization
    reg_alpha=0.0   # L1 regularization
)

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.956140350877193


### Step 2 — Implement XGBoost on a dataset of your choice  
- Example locations to find a dataset:  
  - A built-in dataset (e.g. `load_digits`)  
  - A Kaggle dataset  


In [3]:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(objective="multi:softmax", eval_metric="mlogloss", eta=0.1, max_depth=3,
                          n_estimators=100, reg_lambda=1.0, reg_alpha=0.0, num_class=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Digits Accuracy:", accuracy_score(y_test, y_pred))

Digits Accuracy: 0.9638888888888889


### Step 3 — Experiment with Hyperparameters on your dataset and the Cancer dataset
- Change `max_depth`, `eta`, or `n_estimators`.  
- Add regularization with `reg_lambda` and `reg_alpha`.  
- **Question:** How do these changes affect performance?  


In [4]:
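# Retrain on the digits split from the previous cell with a smaller learning rate,
# deeper trees, more estimators, and stronger L1/L2 regularization
model = xgb.XGBClassifier(eta=0.05, max_depth=6, n_estimators=200, reg_lambda=2.0, reg_alpha=1.0)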
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred))

Tuned Accuracy: 0.9527777777777777


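Deeper trees and more estimators can increase accuracy but may also cause overfitting. Regularization can help, but setting it too high may cause underfitting. In this run, the tuned configuration scored slightly lower on the digits test set (about 0.953) than the Step 2 configuration (about 0.964).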

## Part 4 — Reflection  

Answer the following in complete sentences:  
1. What role does the **regularizer** play in preventing overfitting?  
2. How does using the **second-order Taylor expansion** help optimize the trees?  
3. What surprised you most when experimenting with hyperparameters?  
4. Why is XGBoost considered both a **statistical innovation** (Taylor expansion, regularization) and a **computer science innovation** (scalability, out-of-core learning)?  


1. The regularizer penalizes overly complex trees and large leaf weights, which helps the model generalize better.
2. The second-order Taylor expansion adds curvature information to the gradient, which leads to more accurate updates than gradient-only methods.
3. Tuning the hyperparameters revealed that there is a sweet spot, and that pushing regularization too far can hurt accuracy.
4. XGBoost is a statistical innovation because it advances the optimization itself (second-order Taylor expansion and explicit regularization), and a computer science innovation because of its computational efficiency, scalability, and ability to handle very large datasets.

