
# Module 5 — In-Class Activity (Starter)
**Topic:** Ensemble Learning in Practice  
**Time:** ~45–60 minutes

You will:
- Generate a small, human-readable dataset.
- Split and scale data.
- Train **baseline** models (Logistic Regression, Decision Tree, Random Forest).
- Train **ensemble** models (Bagging and AdaBoost).
- Build a **comparison table** (Accuracy, F1) and a quick bar chart.

> This is a **starter** notebook: several spots are marked with `# TODO:`.  
> Fill them in with Python code before running.


In [None]:

# ============================================
# Step 1. Imports and setup
# ============================================

# TODO: Import any extra packages you need as you go
# (You can add them here as you discover missing names.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

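import warnings
# Silence library warnings to keep the output readable.
# Note: this also hides convergence and deprecation warnings.
warnings.filterwarnings("ignore")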
np.random.seed(7)  # For reproducibility



## Step 2. Generate a simple dataset

Dataset: **Project Habits → High Grade**  
Each row represents a student. Your goal is to predict whether they achieve a high final project grade (≥ 85).


In [None]:

# ============================================
# Step 2. Generate a simple dataset
# ============================================

n = 240  # number of samples (students)

# randint(low, high, size) draws integers in [low, high)
# Make sure the ranges are sensible and easy to explain.
DraftsSubmitted    = np.random.randint(0, 6,  n)   # 0..5 drafts (min=0, max=5)
PeerReviewsGiven   = np.random.randint(0, 11, n)   # 0..10 peer reviews
MeetingsWithTA     = np.random.randint(0, 7,  n)   # 0..6 TA meetings
OnTimeSubmissions  = np.random.randint(0, 11, n)   # 0..10 on-time submissions
WeekendCodingHours = np.random.randint(0, 16, n)   # 0..15 hrs/weekend

# Combine into a dataframe
df = pd.DataFrame({
    "DraftsSubmitted": DraftsSubmitted,
    "PeerReviewsGiven": PeerReviewsGiven,
    "MeetingsWithTA": MeetingsWithTA,
    "OnTimeSubmissions": OnTimeSubmissions,
    "WeekendCodingHours": WeekendCodingHours
})

# TODO: Create a rule-based label column "HighGrade".
# Suggested approach:
#   1) Build a 'score' that increases with good practices.
#   2) Map score → probability with a logistic function.
#   3) Convert to label: HighGrade = 1 if prob > threshold else 0.

score = (
    1.0 * DraftsSubmitted
  + 0.6 * (PeerReviewsGiven / 2.0)
  + 0.9 * MeetingsWithTA
  + 0.8 * (OnTimeSubmissions / 2.0)
  + 0.3 * (WeekendCodingHours / 3.0)
)

prob = 1.0 / (1.0 + np.exp(-0.9 * (score - 6.5)))
df["HighGrade"] = (prob > 0.55).astype(int)

# Quick data check
display(df.head())
print("HighGrade rate:", df["HighGrade"].mean().round(3))
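
Because the label is a fixed threshold on a monotonic function of `score`, the rule `prob > 0.55` is equivalent to a cutoff on `score` itself. The sketch below works out that cutoff by inverting the logistic function. It reuses the slope (0.9), midpoint (6.5), and threshold (0.55) from the cell above; the helper name `logistic` is hypothetical and only for illustration.

```python
import numpy as np

def logistic(score, slope=0.9, midpoint=6.5):
    """Same score -> probability mapping as in the cell above."""
    return 1.0 / (1.0 + np.exp(-slope * (score - midpoint)))

threshold = 0.55

# Solve logistic(s) = threshold for s:
#   s = midpoint + ln(threshold / (1 - threshold)) / slope
score_cutoff = 6.5 + np.log(threshold / (1.0 - threshold)) / 0.9

print("Score cutoff:", round(float(score_cutoff), 3))
print("Probability at the cutoff:", round(float(logistic(score_cutoff)), 3))
```

Since no random noise is added, every student with a score above this cutoff is labeled `HighGrade = 1`, so the classes are perfectly separable by a linear rule on the five features.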



## Step 3. Split and scale the data

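- Use `train_test_split` with `test_size=0.25` and `stratify=y` so the train and test sets keep roughly the same `HighGrade` rate.
- Standardize features with `StandardScaler` for models that are sensitive to feature scale (e.g., Logistic Regression). Fit the scaler on the training set only, then apply it to the test set.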
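
As a rough illustration of both bullets, the sketch below uses a synthetic toy array (not the activity's dataset; the names `X_demo`, `y_demo`, `X_a`, and `X_b` are made up for this example). It checks that a stratified split keeps the class rate similar in both parts, and that a scaler fit on the training part centers the training features exactly but the test features only approximately.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 200 rows, 3 features, 25% positive labels (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = np.array([1] * 50 + [0] * 150)

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
print("Positive rate (train, test):", y_a.mean().round(3), y_b.mean().round(3))

# Fit on the training part only, then reuse the same mean/std on the test part
demo_scaler = StandardScaler()
X_a_sc = demo_scaler.fit_transform(X_a)
X_b_sc = demo_scaler.transform(X_b)

print("Train feature means after scaling:", X_a_sc.mean(axis=0).round(3))
print("Test feature means after scaling: ", X_b_sc.mean(axis=0).round(3))
```

Fitting the scaler on the full dataset would let test-set statistics leak into training, which is why the test part is only transformed.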


In [None]:

# ============================================
# Step 3. Split and scale the data
# ============================================

X = df.drop(columns=["HighGrade"])
y = df["HighGrade"]

# Train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# Standardize for models that benefit from scaling
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

# Sanity check: train size, test size, overall HighGrade rate
len(X_tr), len(X_te), y.mean().round(3)



## Step 4. Train baseline models

Train three baselines:
- **Logistic Regression** (use **scaled** features)
- **Decision Tree** (unscaled)
- **Random Forest** (unscaled)

Store **Accuracy** and **F1** in a results dictionary.
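
To make the two metrics concrete, here is a small worked example on hand-written labels. The helper `evaluate` is a hypothetical convenience function, not part of the starter code; it only wraps the same `accuracy_score` and `f1_score` calls used in this activity.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Return the two metrics used in the comparison table."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]

# Accuracy = (3 + 3) / 8 = 0.75
# Precision = 3 / 4, Recall = 3 / 4, so F1 = 0.75
demo_results = {"Toy model": evaluate(y_true_demo, y_pred_demo)}
print(demo_results)
```

A helper like this could replace the repeated dictionary literals, but the explicit form in the starter cell works just as well.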


In [None]:

# ============================================
# Step 4. Train baseline models
# ============================================

results = {}

# TODO: Initialize and fit Logistic Regression on *scaled* data
lr = LogisticRegression(max_iter=500, random_state=7)
lr.fit(X_tr_sc, y_tr)
pred_lr = lr.predict(X_te_sc)
results["Logistic Regression"] = {
    "Accuracy": accuracy_score(y_te, pred_lr),
    "F1": f1_score(y_te, pred_lr)
}

# TODO: Initialize and fit Decision Tree on *unscaled* data
tree = DecisionTreeClassifier(max_depth=5, random_state=7)
tree.fit(X_tr, y_tr)
pred_tree = tree.predict(X_te)
results["Decision Tree"] = {
    "Accuracy": accuracy_score(y_te, pred_tree),
    "F1": f1_score(y_te, pred_tree)
}

# TODO: Initialize and fit Random Forest on *unscaled* data
rf = RandomForestClassifier(n_estimators=250, random_state=7)
rf.fit(X_tr, y_tr)
pred_rf = rf.predict(X_te)
results["Random Forest"] = {
    "Accuracy": accuracy_score(y_te, pred_rf),
    "F1": f1_score(y_te, pred_rf)
}

pd.DataFrame(results).T



## Step 5. Train ensemble models

Implement two ensembles:
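- **Bagging (Tree):** trains many trees on bootstrap samples of the training data and combines their predictions, which reduces variance.
- **AdaBoost:** trains shallow trees sequentially, giving more weight to the training examples that earlier trees misclassified.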

Add their metrics to the same results dictionary.
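
To show what bagging does under the hood, the sketch below builds a tiny bagged ensemble by hand on synthetic data from `make_classification` (not the activity's dataset). It is a simplification: it combines trees by hard majority vote, whereas scikit-learn's `BaggingClassifier` may combine predictions differently (for example, by averaging predicted probabilities). All variable names here are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_a), size=len(X_a))
    tree_i = DecisionTreeClassifier(max_depth=5, random_state=0)
    trees.append(tree_i.fit(X_a[idx], y_a[idx]))

# One row of predictions per tree, then a majority vote per test sample
votes = np.stack([t.predict(X_b) for t in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (trees[0].predict(X_b) == y_b).mean()
bagged_acc = (majority == y_b).mean()
print("Single bootstrapped tree accuracy:", round(float(single_acc), 3))
print("Majority-vote accuracy:           ", round(float(bagged_acc), 3))
```

Each tree sees a slightly different sample, so their individual errors differ; voting across them is what smooths out the variance of any single tree.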


In [None]:

# ============================================
# Step 5. Train ensemble models
# ============================================

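# TODO: Bagging with DecisionTreeClassifier as base estimator
# Note: the `estimator` keyword (used here and for AdaBoost below) needs
# scikit-learn >= 1.2; older versions call it `base_estimator`.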
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=7),
    n_estimators=200,
    random_state=7
)
bag.fit(X_tr, y_tr)
pred_bag = bag.predict(X_te)
results["Bagging (Tree)"] = {
    "Accuracy": accuracy_score(y_te, pred_bag),
    "F1": f1_score(y_te, pred_bag)
}

# TODO: AdaBoost with shallow trees (max_depth=2)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2, random_state=7),
    n_estimators=300,
    learning_rate=0.5,
    random_state=7
)
ada.fit(X_tr, y_tr)
pred_ada = ada.predict(X_te)
results["AdaBoost"] = {
    "Accuracy": accuracy_score(y_te, pred_ada),
    "F1": f1_score(y_te, pred_ada)
}

pd.DataFrame(results).T.sort_values("F1", ascending=False)
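
One way to see the sequential nature of AdaBoost is to track test accuracy after each boosting round with `staged_score`. The sketch below does this on synthetic data and, as an assumption for simplicity, uses scikit-learn's default base learner (a depth-1 tree) rather than the depth-2 trees configured above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

booster = AdaBoostClassifier(n_estimators=30, random_state=0)
booster.fit(X_a, y_a)

# staged_score yields the test accuracy after 1, 2, ... boosting rounds
staged_acc = list(booster.staged_score(X_b, y_b))
print("Rounds fitted:", len(staged_acc))
print("Accuracy after round 1:   ", round(float(staged_acc[0]), 3))
print("Accuracy after last round:", round(float(staged_acc[-1]), 3))
```

Plotting `staged_acc` against the round number is a quick way to judge whether additional estimators are still helping.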



## Step 6. Visualize comparison

Create a quick horizontal bar chart of F1 scores to compare the models.
Then tweak **two hyperparameters** (e.g., `n_estimators`, `max_depth`, `learning_rate`), re-run Steps 4 to 6, and note how the ranking changes.
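
A minimal sketch of such a tweak, assuming a Random Forest and synthetic data from `make_classification` (the grid values and variable names are illustrative, not prescribed by the activity):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

rows = []
for n_estimators in (10, 50):
    for max_depth in (2, 5):
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        )
        model.fit(X_a, y_a)
        rows.append({
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "F1": f1_score(y_b, model.predict(X_b)),
        })

# One row per hyperparameter combination, best F1 first
sweep = pd.DataFrame(rows).sort_values("F1", ascending=False)
print(sweep)
```

Comparing settings on the test set is fine for a quick classroom experiment; a separate validation split or cross-validation would give a less optimistic estimate.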


In [None]:

# ============================================
# Step 6. Visualize comparison
# ============================================

comparison = pd.DataFrame(results).T.sort_values("F1", ascending=False)

plt.figure(figsize=(7, 4))
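plt.barh(comparison.index, comparison["F1"], alpha=0.85, color="teal")
plt.gca().invert_yaxis()  # barh draws the first row at the bottom; flip so the best model is on top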
plt.xlabel("F1 Score")
plt.title("Model Comparison: Project Habits Dataset")
plt.grid(axis="x", alpha=0.35)
plt.show()

comparison
