# Ensemble Methods & Boosting — Student Lab

Week 4 introduces sklearn models, but you must still explain *why* they work (bias/variance).

In [10]:
import numpy as np
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 0 — Dataset (synthetic default, real optional)

### Task 0.1: Choose dataset
Use synthetic by default. Optionally switch to breast cancer dataset.

# TODO: set `use_real = False` or True

In [11]:
# Ensemble methods combine several weaker models (e.g. decision trees) into one stronger model.
# A single model may overfit the training data; combining many models can reduce overfitting and improve generalization.
# Bagging (bootstrap aggregating): e.g. random forest
# Boosting: e.g. gradient boosting
# Stacking: a meta-model combines the predictions of different base models (e.g. SVM + random forest)


use_real = False  # TODO

if use_real:
    data = load_breast_cancer()
    X = data.data
    y = data.target
else:
    X, y = make_classification(
        n_samples=2000, # rows
        n_features=20, # columns
        n_informative=8, # features (not labels) that actually carry signal about the class; knowing just these 8 columns is enough to predict the output well.
        n_redundant=4, # features generated as random linear combinations of the informative ones; they are predictive but add no new information.
        class_sep=1.0, # controls the separation between classes; higher values make the classes more distinct and easier to classify.
        flip_y=0.03, # label noise: the class of about 3% of the samples is assigned at random, which makes the task harder to learn and generalize on.
        random_state=0,
    )

Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
check('shapes', Xtr.shape[0]==ytr.shape[0] and Xva.shape[0]==yva.shape[0])
Xtr.shape

OK: shapes


(1400, 20)

## Section 1 — Baseline vs Trees vs Random Forest

### Task 1.1: Train baseline decision tree vs random forest

# TODO: Train:
- DecisionTreeClassifier(max_depth=?)
- RandomForestClassifier(n_estimators=?, max_depth=?, oob_score=True, bootstrap=True)

Compute accuracy + ROC-AUC on validation.

**Checkpoint:** Why does bagging reduce variance?
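
One way to reason about this checkpoint: if each of `B` models has variance `sigma2` and any two of them have correlation `rho`, the variance of their average is `rho * sigma2 + (1 - rho) * sigma2 / B`. The second term shrinks as `B` grows, and the first shrinks when trees are decorrelated (bootstrap rows, random feature subsets). The sketch below checks that formula with a small NumPy simulation; the helper name `variance_of_average` and the values `rho = 0.3`, `B = 50` are illustrative assumptions, not measurements from the lab's forest.

```python
import numpy as np

def variance_of_average(sigma2: float, rho: float, n_models: int) -> float:
    """Variance of the mean of n_models predictors that share variance sigma2
    and pairwise correlation rho (hypothetical helper for illustration)."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

rng = np.random.default_rng(0)
sigma2, rho, n_models, n_trials = 1.0, 0.3, 50, 20_000  # assumed values

# Correlated "predictions": a component shared by all models (variance rho * sigma2)
# plus an independent component per model (variance (1 - rho) * sigma2).
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(n_trials, 1))
independent = rng.normal(scale=np.sqrt((1.0 - rho) * sigma2), size=(n_trials, n_models))
preds = shared + independent

single_var = preds[:, 0].var()           # variance of one model
ensemble_var = preds.mean(axis=1).var()  # variance of the averaged ensemble
theory_var = variance_of_average(sigma2, rho, n_models)

print(f"single model variance : {single_var:.3f}")
print(f"ensemble variance     : {ensemble_var:.3f}")
print(f"formula               : {theory_var:.3f}")
```

Averaging cannot push the variance below `rho * sigma2`, which is why random forests also randomize the features considered at each split: lower correlation between trees lowers that floor.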

In [12]:
# Boosting: the core idea is to build a strong model by training a series of weak models sequentially, where each new model focuses on correcting the errors made by the previous ones.
# Characteristics of boosting:
# Sequential learning: each model tries to correct the mistakes of the previous ones, so the ensemble concentrates on the difficult cases.
# AdaBoost does this by reweighting samples according to difficulty; gradient boosting instead fits each new model to the remaining errors (the negative gradient of the loss).
# It typically uses simple models (such as decision trees with limited depth) as weak learners.

# Strengths of boosting:
# 1. Improved accuracy: combining many weak models into a strong ensemble can significantly improve predictions.
# 2. Can handle complex patterns
# 3. Reduces the bias of the weak learners


# Weaknesses of boosting:
# 1. Sensitive to outliers and label noise
# 2. It can overfit
# 3. Since it is sequential, it can be slower to train than parallelizable methods like bagging.


# Popular Boosting algorithms include:
# 1. AdaBoost (Adaptive Boosting)
# 2. Gradient Boosting


# Baseline model: the simplest reasonable model, built to set a reference point.
# Random forest: many decision trees, each trained on a bootstrap sample of the rows, with a random subset of the features considered at each split.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_leaf=2, oob_score=True, random_state=0, bootstrap=True, n_jobs=-1) 
# n_estimators=300: the forest contains 300 decision trees.
# oob_score=True: out-of-bag evaluation; each sample is scored only by the trees that did not see it during training, which gives a built-in validation estimate.
# bootstrap=True: each tree is trained on a random sample of the data drawn with replacement, a key part of the random forest algorithm.
# n_jobs=-1: use all available CPU cores to train the trees in parallel, which can speed up training.


tree.fit(Xtr, ytr)
rf.fit(Xtr, ytr)

def eval_model(clf, X, y): # X holds the features, y the true labels
    pred = clf.predict(X) # predicted class labels for the rows in X
    acc = accuracy_score(y, pred) # fraction of predictions that match the true labels
    # most sklearn classifiers have predict_proba; fall back to NaN for AUC if not
    if hasattr(clf, 'predict_proba'):
        proba = clf.predict_proba(X)[:, 1]
        auc = roc_auc_score(y, proba)
    else:
        auc = float('nan')
    return acc, auc

print('tree', eval_model(tree, Xva, yva))
print('rf  ', eval_model(rf, Xva, yva))

if hasattr(rf, 'oob_score_'):
    print('rf oob_score', rf.oob_score_)

tree (0.805, 0.8664611111111112)
rf   (0.88, 0.9492277777777777)
rf oob_score 0.9085714285714286
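
The `oob_score_` printed above is computed from rows that a given tree never saw. A minimal sketch of why such rows always exist: a bootstrap sample of `n` rows drawn with replacement leaves each row out with probability `(1 - 1/n)**n`, roughly `exp(-1)`, or about 37%. The simulation below uses only NumPy and mirrors the lab's 1400 training rows and 300 trees; it illustrates the sampling, and is not a reimplementation of sklearn's OOB scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 1400, 300  # same sizes as the training split and forest above

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: n_rows indices drawn with replacement.
    idx = rng.integers(0, n_rows, size=n_rows)
    in_bag = np.zeros(n_rows, dtype=bool)
    in_bag[idx] = True
    # Rows never drawn are "out of bag" for this tree.
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
expected = (1.0 - 1.0 / n_rows) ** n_rows  # close to exp(-1)

print(f"mean OOB fraction per tree: {mean_oob:.3f}")
print(f"(1 - 1/n)^n               : {expected:.3f}")
```

Because every row is out of bag for roughly a third of the trees, the forest can score each row using only those trees, which is why the OOB score behaves like a validation estimate without a separate hold-out set.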


### Task 1.2: Feature importance gotcha

Inspect `feature_importances_` and explain why correlated features can distort importances.

# TODO: print top 10 features by importance.

In [13]:
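# Correlated features: features that carry the same information in a different form, e.g. height in cm and height in inches.
# The forest can split on either one, so the importance is shared between them and each looks less important than the underlying signal really is.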

# TODO

imp = rf.feature_importances_ # impurity-based importance: each feature's share of the total impurity reduction across the forest (values sum to 1).
top = np.argsort(-imp)[:10] # np.argsort(-imp) returns the feature indices in descending order of importance; [:10] keeps the top 10.
print('top idx', top)
print('top importances', imp[top])

top idx [17 15  3 12  7 11  4 18 13  1]
top importances [0.13533129 0.12294161 0.11825367 0.11781613 0.10023468 0.10020298
 0.05578841 0.03412471 0.03289905 0.03011124]
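
To make the gotcha concrete, here is a small sketch on a toy dataset (not the lab data): one informative column, one pure-noise column, and then the same data with a rescaled copy of the informative column added, echoing the cm/inches example. The variable names and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
height_cm = rng.normal(size=n)   # the informative feature (toy, standardized)
noise = rng.normal(size=n)       # a feature unrelated to the label
y = (height_cm + 0.3 * rng.normal(size=n) > 0).astype(int)

# Same information twice: a rescaled copy is perfectly correlated with the original.
height_in = height_cm / 2.54

X_single = np.column_stack([height_cm, noise])
X_dup = np.column_stack([height_cm, height_in, noise])

rf_single = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_single, y)
rf_dup = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_dup, y)

imp_single = rf_single.feature_importances_
imp_dup = rf_dup.feature_importances_

print("without duplicate [height_cm, noise]           :", imp_single.round(3))
print("with duplicate    [height_cm, height_in, noise]:", imp_dup.round(3))
```

The importance that `height_cm` earns alone is split between the two copies once the duplicate is present, so a ranking by `feature_importances_` can understate a signal that happens to be spread over several correlated columns.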


## Section 2 — Gradient Boosting

### Task 2.1: Train GradientBoostingClassifier

# TODO: Train GB with different n_estimators and learning_rate and compare.

**Checkpoint:** Why can boosting overfit with too many estimators?

In [14]:
# Gradient boosting: a powerful ML model that combines many weak models (usually shallow decision trees) into a strong one.
# Each new tree is fit to the errors of the current ensemble, i.e. the negative gradient of the loss (the residuals, for squared error), so it concentrates on what the previous trees got wrong. Reweighting misclassified samples is how AdaBoost works, not gradient boosting.
# Each decision tree acts as a tutor for the next one, and the process continues until the requested number of trees has been added.

# The model starts from a simple constant prediction and keeps adding trees, each scaled by the learning rate, to fix the remaining mistakes.

# Key Characteristics of Gradient Boosting:
# 1. Works for both classification and regression
# 2. Handles non-linear relationships
# 3. Strong performance on structured, tabular data

# Can overfit if not tuned properly.
# Slower to train, because the trees are built one after another.


settings = [
    {'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.05, 'max_depth': 2},
]

for s in settings:
    gb = GradientBoostingClassifier(random_state=0, **s)
    gb.fit(Xtr, ytr) # Xtr is the training data and ytr is the training labels, we are fitting the model on the training data.
    print('gb', s, eval_model(gb, Xva, yva))

gb {'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 2} (0.8616666666666667, 0.9393722222222223)
gb {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 2} (0.8883333333333333, 0.9497444444444444)
gb {'n_estimators': 200, 'learning_rate': 0.05, 'max_depth': 2} (0.8833333333333333, 0.9481888888888889)
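
For the checkpoint on overfitting, one option is to fit a single larger model and track the loss after every added tree with `staged_predict_proba`. The sketch below rebuilds the same synthetic split as Section 0 so it runs on its own; `n_estimators=300` and the use of log loss are choices made here for illustration, not part of the original task.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Same synthetic data and split as Section 0.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_redundant=4,
    class_sep=1.0, flip_y=0.03, random_state=0,
)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

gb = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=2, random_state=0
)
gb.fit(Xtr, ytr)

# staged_predict_proba yields the predicted probabilities after 1, 2, ..., n trees.
train_loss = [log_loss(ytr, p) for p in gb.staged_predict_proba(Xtr)]
val_loss = [log_loss(yva, p) for p in gb.staged_predict_proba(Xva)]

best_n = int(np.argmin(val_loss)) + 1  # number of trees with the lowest validation loss
print("trees at lowest validation loss:", best_n)
print("train loss  first/last:", round(train_loss[0], 4), round(train_loss[-1], 4))
print("val   loss  best/last :", round(min(val_loss), 4), round(val_loss[-1], 4))
```

Training loss keeps falling as trees are added, because each tree is fit to whatever error remains, including the label noise from `flip_y`. If validation loss stops improving while training loss continues down, the extra trees are fitting noise, which is the overfitting the checkpoint asks about.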


## Section 3 — XGBoost-style knobs (conceptual)

### Task 3.1: Explain what each knob does
Write 2-3 bullets each:
- subsample
- colsample
- learning rate
- max_depth

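- **subsample: Fraction of training rows randomly sampled for each tree; values below 1.0 add randomness, which reduces variance and speeds up training.**
- **colsample: Fraction of features (columns) randomly sampled for each tree; it decorrelates the trees, much like feature sampling in a random forest.**
- **learning_rate: How much each new tree contributes to the ensemble; smaller values need more trees but usually generalize better.**
- **max_depth: Maximum depth of each tree; it controls how complex each tree can be, and deeper trees capture more feature interactions but overfit more easily.**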
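
These knobs can be tried without installing XGBoost, since sklearn's `GradientBoostingClassifier` exposes close analogues. A hedged sketch of the mapping follows; note that sklearn's `max_features` samples columns at every split, which is nearer to XGBoost's `colsample_bynode` than to `colsample_bytree`, so the correspondence is approximate. The dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small toy dataset, used only to show the parameters in action.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=50,
    subsample=0.8,      # each tree is fit on a random 80% of the rows
    max_features=0.5,   # each split considers a random 50% of the columns
    learning_rate=0.1,  # each tree's contribution is scaled by 0.1
    max_depth=3,        # no tree may be deeper than 3 levels
    random_state=0,
).fit(X, y)

# With subsample < 1.0, sklearn records the per-stage improvement on the left-out rows.
print("oob_improvement_ shape:", gb.oob_improvement_.shape)
print("features per split    :", gb.max_features_)
print("deepest tree          :", max(est[0].get_depth() for est in gb.estimators_))
```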

---
## Submission Checklist
- All TODOs completed
- Baseline vs RF vs GB compared
- OOB score discussed (if available)
- Feature importance gotcha explained