# 📘 Day 1: Bagging and Boosting

**🎯 Goal:** Master ensemble methods that power modern AI systems

**⏱️ Time:** 90-120 minutes

**🌟 Why This Matters for AI:**
- Combining several models often improves accuracy and stability over any single model
- Used in production RAG systems to improve retrieval accuracy
- Foundation for Kaggle-winning solutions and real-world ML
- Critical for Agentic AI decision-making reliability
- Powers ranking systems in modern search and recommendation engines
- Essential for building robust production AI systems in 2024-2025

---

## 🤔 What are Ensemble Methods?

**Ensemble = Combining multiple models to make better predictions**

**The Wisdom of Crowds:**
- One expert might be wrong
- But ask 100 experts and average their opinions → usually more accurate, as long as they don't all make the same mistakes!
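
To see why a crowd helps, here is a minimal simulation sketch. It assumes every voter is right 60% of the time and that voters err independently; those numbers and the helper name `majority_vote_accuracy` are made up for illustration. Real models trained on the same data make correlated errors, so the gain in practice is usually smaller.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials=5000, seed=0):
    """Estimate how often a majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each voter is independently correct with probability p_correct
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(n_voters=1, p_correct=0.6)
crowd = majority_vote_accuracy(n_voters=101, p_correct=0.6)
print(f"1 voter:    {single:.1%}")
print(f"101 voters: {crowd:.1%}")
```

An odd number of voters is used so that a tie cannot occur.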

**Real-World Analogy:**
- 🏥 **Medical Diagnosis:** Get second and third opinions from multiple doctors
- 🎯 **Jury Decision:** 12 people decide together (not just 1 judge)
- 📊 **Weather Forecast:** Combines multiple prediction models

**Two Main Approaches:**
1. **Bagging** (Bootstrap Aggregating): Train models independently, then vote
2. **Boosting**: Train models sequentially, each fixing previous errors

Let's explore both! 👇

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Ensemble methods
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    VotingClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Make plots beautiful
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("Let's build powerful ensemble models! 🚀")

## 📊 Our Dataset: Document Relevance for RAG Systems

**Real AI Application:** Building a document classifier for Retrieval-Augmented Generation (RAG)

**Scenario:** You're building a RAG system for customer support. When a user asks a question, your system needs to:
1. Retrieve potentially relevant documents
2. **Classify which documents are actually relevant** ← We're building this!
3. Feed only relevant docs to the LLM

**Features:**
- `query_doc_similarity`: Semantic similarity score (0-1)
- `keyword_matches`: Number of matching keywords
- `doc_popularity`: How often this doc helped others
- `doc_recency`: Days since doc was updated
- `doc_length`: Document word count
- `user_feedback_score`: Historical relevance ratings

**Target:**
- `is_relevant`: 1 = Relevant, 0 = Not Relevant

In [None]:
# Create realistic RAG document classification dataset
np.random.seed(42)

n_samples = 2000

# Generate features for relevant and non-relevant documents
data = {
    'query_doc_similarity': np.concatenate([
        np.random.beta(8, 2, 800),      # Relevant: high similarity
        np.random.beta(2, 5, 1200)       # Not relevant: low similarity
    ]),
    'keyword_matches': np.concatenate([
        np.random.poisson(8, 800),
        np.random.poisson(2, 1200)
    ]),
    'doc_popularity': np.concatenate([
        np.random.exponential(50, 800),
        np.random.exponential(10, 1200)
    ]),
    'doc_recency': np.concatenate([
        np.random.exponential(30, 800),   # Relevant: more recent
        np.random.exponential(100, 1200)  # Not relevant: older
    ]),
    'doc_length': np.concatenate([
        np.random.normal(500, 100, 800),
        np.random.normal(300, 150, 1200)
    ]).clip(min=1),  # A word count can't be negative; a normal draw occasionally is
    'user_feedback_score': np.concatenate([
        np.random.beta(9, 2, 800) * 5,    # Relevant: high ratings (0-5)
        np.random.beta(2, 4, 1200) * 5    # Not relevant: low ratings
    ]),
    'is_relevant': [1] * 800 + [0] * 1200  # 40% relevant, 60% not relevant
}

df = pd.DataFrame(data)

# Shuffle the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("📚 RAG Document Classification Dataset Created!")
print(f"Total documents: {len(df)}")
print(f"Relevant documents: {df['is_relevant'].sum()}")
print(f"Not relevant: {(df['is_relevant'] == 0).sum()}")
print("\nFirst few documents:")
df.head()

In [None]:
# Visualize the difference between relevant and non-relevant documents
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('📊 RAG System: Relevant vs Non-Relevant Documents', fontsize=16, fontweight='bold')

features = ['query_doc_similarity', 'keyword_matches', 'doc_popularity', 
            'doc_recency', 'doc_length', 'user_feedback_score']

for idx, feature in enumerate(features):
    ax = axes[idx // 3, idx % 3]
    
    df[df['is_relevant'] == 1][feature].hist(ax=ax, alpha=0.6, label='Relevant', bins=30, color='green')
    df[df['is_relevant'] == 0][feature].hist(ax=ax, alpha=0.6, label='Not Relevant', bins=30, color='red')
    
    ax.set_title(feature.replace('_', ' ').title())
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.legend()

plt.tight_layout()
plt.show()

print("📈 Notice the patterns! Relevant docs have higher similarity, more keywords, etc.")

## 🔧 Prepare Data for Training

In [None]:
# Separate features (X) and target (y)
X = df.drop('is_relevant', axis=1)
y = df['is_relevant']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Data prepared!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nFeatures: {list(X.columns)}")

## 📊 Baseline: Single Decision Tree

Let's start with a single decision tree to establish a baseline. Then we'll see how ensemble methods improve it!

In [None]:
# Train a single decision tree (baseline)
single_tree = DecisionTreeClassifier(random_state=42, max_depth=10)
single_tree.fit(X_train, y_train)

# Predictions
y_pred_single = single_tree.predict(X_test)
accuracy_single = accuracy_score(y_test, y_pred_single)

print("🌳 Single Decision Tree (Baseline):")
print(f"Accuracy: {accuracy_single:.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_single, target_names=['Not Relevant', 'Relevant']))

print("\n💡 This is our baseline. Can ensemble methods beat it? Let's find out!")

## 1️⃣ Bagging (Bootstrap Aggregating)

**What is Bagging?**
- Train multiple models on different random samples of the data
- Each model votes on the final prediction
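- Combine predictions by majority vote (or by averaging predicted probabilities, which is what scikit-learn's `BaggingClassifier` does when the base model supports it)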

**How it works:**
1. Create multiple random samples from training data (with replacement)
2. Train one model on each sample
3. For prediction: Each model votes, majority wins!
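
The three steps above can be sketched by hand. This is a minimal illustration on a toy dataset from `make_classification` (an assumption for the demo, not the notebook's RAG data), using plain majority voting; all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset (illustrative only, not the RAG dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 vote can never tie
trees = []
for _ in range(n_models):
    # Step 1: bootstrap sample = draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one model on that sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: every tree votes; the majority label wins (labels are 0/1 here)
votes = np.stack([tree.predict(X_te) for tree in trees])  # (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (trees[0].predict(X_te) == y_te).mean()
bagged_acc = (bagged_pred == y_te).mean()
print(f"One bootstrapped tree: {single_acc:.2%}")
print(f"Majority of {n_models} trees: {bagged_acc:.2%}")
```

The trees differ only because each one saw a different bootstrap sample, and that diversity is what the vote averages out.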

**Benefits:**
- ✅ Reduces overfitting
- ✅ More stable predictions
- ✅ Works well with high-variance models (like decision trees)

**🎯 Real AI Use Cases:**
- **RAG systems**: Multiple retrievers vote on document relevance
- **Content moderation**: Ensemble of classifiers reduces false positives
- **Anomaly detection**: Multiple models catch different types of anomalies

In [None]:
# Create Bagging Classifier
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=50,  # 50 decision trees
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Train
bagging.fit(X_train, y_train)

# Predict
y_pred_bagging = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

print("🎒 Bagging Classifier Results:")
print(f"Number of trees: {bagging.n_estimators}")
print(f"Accuracy: {accuracy_bagging:.2%}")
print(f"\n📈 Improvement over single tree: {(accuracy_bagging - accuracy_single):.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_bagging, target_names=['Not Relevant', 'Relevant']))

## 2️⃣ Random Forests (Advanced Bagging)

**What is Random Forest?**
- Bagging + Extra randomness
- Each split considers only a random subset of features
- More diversity = Better ensemble!

**Key Differences from Bagging:**
- Bagging: Every split can pick from all features
- Random Forest: Every split picks from a random subset of features (`max_features`)
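
One way to see the difference is to switch the feature randomness off and on. The sketch below is run on toy data (an assumption for the demo). Setting `max_features=None` lets every split use all features, which makes the forest behave like bagged trees, while `"sqrt"` restricts each split to a random subset. Which setting scores higher depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset with more features than informative signals (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

settings = [
    ("all features per split (bagging-like)", None),
    ("sqrt(n_features) per split", "sqrt"),
]
scores = {}
for label, max_features in settings:
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    forest.fit(X_tr, y_tr)
    scores[label] = forest.score(X_te, y_te)  # mean accuracy on held-out data
    print(f"{label}: {scores[label]:.2%}")
```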

**Benefits:**
- ✅ Often better accuracy than plain bagging of trees
- ✅ Handles high-dimensional data well
- ✅ Built-in feature importance
- ✅ Robust to outliers and noise

**🎯 Real AI Use Cases (2024-2025):**
- **Query routing in RAG**: Which knowledge base to search?
- **Intent classification**: Route user queries to correct AI agent
- **Feature extraction**: Pre-processing for Transformer models
- **Ranking systems**: Combine with neural networks in production

In [None]:
# Create Random Forest Classifier
rf = RandomForestClassifier(
    n_estimators=100,  # 100 trees
    max_depth=15,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)

# Train
rf.fit(X_train, y_train)

# Predict
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print("🌲 Random Forest Results:")
print(f"Number of trees: {rf.n_estimators}")
print(f"Accuracy: {accuracy_rf:.2%}")
print(f"\n📈 Improvement over single tree: {(accuracy_rf - accuracy_single):.2%}")
print(f"📈 Improvement over bagging: {(accuracy_rf - accuracy_bagging):.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Not Relevant', 'Relevant']))

In [None]:
# Analyze feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("🔥 Most Important Features for Document Relevance:")
print(feature_importance)

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='forestgreen')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('🎯 Random Forest: Feature Importance for RAG Document Classification', 
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

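# Report the top feature from the fitted model instead of hard-coding it
top_feature = feature_importance.iloc[0]['Feature']
print(f"\n💡 Insight: {top_feature} is the most important feature for this model!")
print("Check whether this matches intuition - in RAG systems, semantic similarity is usually a strong signal.")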

## 3️⃣ Boosting: AdaBoost

**What is Boosting?**
- Train models sequentially (not in parallel like bagging)
- Each new model focuses on errors from previous models
- Combine models with weighted voting

**AdaBoost (Adaptive Boosting):**
1. Train first model on all data
2. Identify misclassified examples
3. Give MORE weight to misclassified examples
4. Train next model (focuses on hard examples)
5. Repeat!
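
The loop above can be written out by hand. This is a minimal sketch of discrete (binary) AdaBoost with depth-1 trees on toy data. It is meant to show the reweighting idea, not to reproduce scikit-learn's `AdaBoostClassifier` internals, and every name in it is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative only); the update rule is simplest with labels in {-1, +1}
X_demo, y01 = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(X_demo), 1 / len(X_demo))  # Step 1: all examples start equal
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a weak learner that pays attention to the current example weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Step 2: weighted error over the misclassified examples
    miss = pred != y_pm
    err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # how much say this stump gets

    # Step 3: raise weights of mistakes, lower weights of correct examples
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    # Steps 4-5: the next round trains on the updated weights
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
vote_sum = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(vote_sum >= 0, 1, -1)

train_acc_first = (stumps[0].predict(X_demo) == y_pm).mean()
train_acc_all = (ensemble_pred == y_pm).mean()
print(f"First stump (train): {train_acc_first:.2%}")
print(f"{n_rounds} stumps (train): {train_acc_all:.2%}")
```

These are training accuracies, shown only to illustrate the mechanics; judge a real model on held-out data.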

**Key Difference from Bagging:**
- **Bagging**: Models trained independently
- **Boosting**: Models trained sequentially, learning from mistakes

**Benefits:**
- ✅ Often higher accuracy than bagging
- ✅ Mainly reduces bias, and can also reduce variance
- ✅ Works well with weak learners

**Challenges:**
- ⚠️ More prone to overfitting
- ⚠️ Sensitive to noisy data and outliers
- ⚠️ Sequential training (boosting rounds can't run in parallel)

**🎯 Real AI Use Cases:**
- **Face detection**: Classic AdaBoost application (Viola-Jones)
- **Click-through rate prediction**: Ad ranking systems
- **Fraud detection**: Catching subtle patterns
- **RAG re-ranking**: Fine-tuning document relevance scores

In [None]:
# Create AdaBoost Classifier
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # Weak learners (shallow trees)
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)

# Train
adaboost.fit(X_train, y_train)

# Predict
y_pred_ada = adaboost.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)

print("🚀 AdaBoost Results:")
print(f"Number of estimators: {adaboost.n_estimators}")
print(f"Learning rate: {adaboost.learning_rate}")
print(f"Accuracy: {accuracy_ada:.2%}")
print(f"\n📈 Improvement over single tree: {(accuracy_ada - accuracy_single):.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_ada, target_names=['Not Relevant', 'Relevant']))

## 📊 Compare All Models

Let's see which ensemble method performs best!

In [None]:
# Compare all models
results = pd.DataFrame({
    'Model': ['Single Tree', 'Bagging', 'Random Forest', 'AdaBoost'],
    'Accuracy': [accuracy_single, accuracy_bagging, accuracy_rf, accuracy_ada],
    'Type': ['Baseline', 'Ensemble (Bagging)', 'Ensemble (Bagging)', 'Ensemble (Boosting)']
}).sort_values('Accuracy', ascending=False)

print("🏆 Model Comparison for RAG Document Classification:")
print(results.to_string(index=False))
print(f"\n🥇 Best Model: {results.iloc[0]['Model']} with {results.iloc[0]['Accuracy']:.2%} accuracy")

# Visualize comparison
plt.figure(figsize=(12, 6))
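# Map colors by model name so each model keeps its color after sorting
color_map = {'Single Tree': '#95a5a6', 'Bagging': '#3498db',
             'Random Forest': '#2ecc71', 'AdaBoost': '#e74c3c'}
bars = plt.bar(results['Model'], results['Accuracy'], color=results['Model'].map(color_map).tolist())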
plt.ylabel('Accuracy', fontsize=12)
plt.title('🎯 Ensemble Methods Comparison: RAG Document Classification', fontsize=14, fontweight='bold')
plt.ylim(0.7, 1.0)
plt.axhline(y=accuracy_single, color='gray', linestyle='--', alpha=0.5, label='Baseline')

# Add accuracy labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2%}',
             ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.legend()
plt.tight_layout()
plt.show()

print("\n💡 Key Insight: Ensembles usually match or beat a single tree - compare each bar to the dashed baseline!")

## 🌟 Real AI Example: Improving RAG Retrieval Accuracy

**Problem:** In production RAG systems, retrieving irrelevant documents wastes:
- 💰 LLM tokens (costs money)
- ⚡ Processing time (slower responses)
- 🎯 Answer quality (confuses the LLM)

**Solution:** Use ensemble classifier to filter documents BEFORE sending to LLM

**Pipeline:**
1. User asks question
2. Vector search retrieves 100 candidate documents
3. **Ensemble classifier filters to top 10 most relevant** ← Our model!
4. Send only top 10 to LLM
5. LLM generates better answer, faster and cheaper!

Let's simulate this at a smaller scale: 20 candidates, filtered to the top 10!

In [None]:
# Simulate RAG retrieval scenario
print("🔍 RAG System Simulation\n" + "="*50)

# Simulate: Vector search returned 20 candidate documents
sample_docs = X_test.head(20).copy()
sample_labels = y_test.head(20).copy()

# Get predictions from our best model (Random Forest)
predictions = rf.predict(sample_docs)
probabilities = rf.predict_proba(sample_docs)[:, 1]  # Probability of being relevant

# Create results dataframe
rag_results = pd.DataFrame({
    'doc_id': range(1, 21),
    'similarity': sample_docs['query_doc_similarity'].values,
    'predicted_relevant': predictions,
    'relevance_score': probabilities,
    'actual_relevant': sample_labels.values
}).sort_values('relevance_score', ascending=False)

print("\n📚 Retrieved Documents (sorted by ensemble relevance score):")
print(rag_results.head(10).to_string(index=False))

# Calculate metrics
top_10_docs = rag_results.head(10)
precision_at_10 = top_10_docs['actual_relevant'].sum() / 10

# Compare to naive approach (just using similarity)
naive_top_10 = rag_results.nlargest(10, 'similarity')
naive_precision = naive_top_10['actual_relevant'].sum() / 10

print(f"\n📊 Results:")
print(f"  Ensemble Classifier Precision@10: {precision_at_10:.0%}")
print(f"  Naive Similarity Precision@10: {naive_precision:.0%}")
print(f"  Improvement: {(precision_at_10 - naive_precision):.0%}")

# Count relevant docs directly instead of back-computing from the precision
relevant_in_top_10 = int(top_10_docs['actual_relevant'].sum())
print("\n💡 Impact:")
print(f"  Relevant docs sent to LLM: {relevant_in_top_10}/10")
print(f"  Irrelevant docs still sent: {10 - relevant_in_top_10}/10")
print(f"  Candidates filtered out before the LLM: {len(rag_results) - 10}/{len(rag_results)}")

## 🗳️ Bonus: Voting Classifier (Combining Different Models)

**What if we combine different types of models?**

**Voting Classifier:**
- Combines predictions from multiple different algorithms
- Example: Logistic Regression + Random Forest + AdaBoost
- Two types:
  - **Hard Voting**: Majority vote wins
  - **Soft Voting**: Average probabilities (often better when the models' probabilities are reasonably calibrated)
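
A small worked example shows how the two voting types can disagree. The three probabilities below are hypothetical numbers chosen for illustration: two models lean slightly toward "not relevant" while one is very confident the document is relevant.

```python
# Hypothetical P(relevant) from three models for a single document
probs_relevant = {"model_a": 0.45, "model_b": 0.48, "model_c": 0.95}

# Hard voting: each model casts one vote using a 0.5 threshold
hard_votes = [int(p >= 0.5) for p in probs_relevant.values()]
hard_decision = int(sum(hard_votes) > len(hard_votes) / 2)

# Soft voting: average the probabilities, then threshold once
mean_prob = sum(probs_relevant.values()) / len(probs_relevant)
soft_decision = int(mean_prob >= 0.5)

print(f"Hard votes: {hard_votes} -> decision {hard_decision}")
print(f"Mean probability: {mean_prob:.3f} -> decision {soft_decision}")
```

Hard voting returns 0 because two of the three models vote "not relevant", while soft voting returns 1 because the mean probability is about 0.627. Soft voting lets a confident model outweigh two hesitant ones, which only helps if that confidence can be trusted.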

**🎯 Real AI Use Cases:**
- **Production ML**: Combine traditional ML + neural networks
- **Agentic AI**: Multiple specialized agents vote on action
- **Multimodal AI**: Combine text, image, and audio classifiers

In [None]:
# Create diverse base models
log_reg = LogisticRegression(random_state=42, max_iter=1000)
rf_voter = RandomForestClassifier(n_estimators=50, random_state=42)
ada_voter = AdaBoostClassifier(n_estimators=50, random_state=42)

# Create voting classifier with soft voting
voting_clf = VotingClassifier(
    estimators=[
        ('lr', log_reg),
        ('rf', rf_voter),
        ('ada', ada_voter)
    ],
    voting='soft'  # Average probabilities
)

# Train on scaled data (Logistic Regression needs scaling)
voting_clf.fit(X_train_scaled, y_train)

# Predict
y_pred_voting = voting_clf.predict(X_test_scaled)
accuracy_voting = accuracy_score(y_test, y_pred_voting)

print("🗳️ Voting Classifier Results:")
print("Models combined: Logistic Regression + Random Forest + AdaBoost")
print("Voting type: Soft (probability averaging)")
print(f"Accuracy: {accuracy_voting:.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_voting, target_names=['Not Relevant', 'Relevant']))

print("\n💡 Voting ensembles can combine the strengths of different algorithm families!")

## 🎯 YOUR TURN: Exercise 1 - Email Priority Classification

**Challenge:** Build an ensemble classifier for email priority detection!

**Scenario:** You're building an AI assistant that needs to classify emails as:
- High Priority (needs immediate attention)
- Low Priority (can wait)

**Your Task:**
1. Train a Random Forest classifier
2. Train an AdaBoost classifier
3. Compare their accuracy
4. Which works better?

Don't worry - experiment and learn! 💪

In [None]:
# Email priority dataset
np.random.seed(42)

n_emails = 1500

email_data = {
    'from_vip': np.concatenate([
        np.random.binomial(1, 0.7, 600),   # High priority: often from VIP
        np.random.binomial(1, 0.2, 900)    # Low priority: rarely from VIP
    ]),
    'has_deadline_keywords': np.concatenate([
        np.random.binomial(1, 0.8, 600),
        np.random.binomial(1, 0.1, 900)
    ]),
    'reply_expected': np.concatenate([
        np.random.binomial(1, 0.9, 600),
        np.random.binomial(1, 0.3, 900)
    ]),
    'cc_count': np.concatenate([
        np.random.poisson(5, 600),
        np.random.poisson(1, 900)
    ]),
    'sender_email_frequency': np.concatenate([
        np.random.exponential(10, 600),
        np.random.exponential(2, 900)
    ]),
    'is_high_priority': [1] * 600 + [0] * 900
}

email_df = pd.DataFrame(email_data)
email_df = email_df.sample(frac=1, random_state=42).reset_index(drop=True)

print("📧 Email Priority Dataset:")
print(email_df.head())
print(f"\nTotal emails: {len(email_df)}")
print(f"High priority: {email_df['is_high_priority'].sum()}")
print(f"Low priority: {(email_df['is_high_priority'] == 0).sum()}")

In [None]:
# YOUR CODE HERE!
# Hint: Follow the same steps as above

# Step 1: Separate X and y
X_email = None  # YOUR CODE
y_email = None  # YOUR CODE

# Step 2: Split data
# YOUR CODE

# Step 3: Train Random Forest
# YOUR CODE

# Step 4: Train AdaBoost
# YOUR CODE

# Step 5: Compare accuracy
# YOUR CODE

<details>
<summary>📖 Click here for solution</summary>

```python
# Step 1: Separate X and y
X_email = email_df.drop('is_high_priority', axis=1)
y_email = email_df['is_high_priority']

# Step 2: Split data
X_train_em, X_test_em, y_train_em, y_test_em = train_test_split(
    X_email, y_email, test_size=0.2, random_state=42, stratify=y_email
)

# Step 3: Train Random Forest
rf_email = RandomForestClassifier(n_estimators=100, random_state=42)
rf_email.fit(X_train_em, y_train_em)
rf_email_acc = accuracy_score(y_test_em, rf_email.predict(X_test_em))

# Step 4: Train AdaBoost
ada_email = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_email.fit(X_train_em, y_train_em)
ada_email_acc = accuracy_score(y_test_em, ada_email.predict(X_test_em))

# Step 5: Compare
print(f"Random Forest Accuracy: {rf_email_acc:.2%}")
print(f"AdaBoost Accuracy: {ada_email_acc:.2%}")
```
</details>

## 🎓 Key Takeaways

**You just learned:**

### 1. **Bagging (Bootstrap Aggregating)**
   - ✅ Train models independently on random samples
   - ✅ Combine via majority vote
   - ✅ Reduces overfitting and variance
   - **Use when:** You have high-variance models (e.g., deep decision trees)

### 2. **Random Forests**
   - ✅ Advanced bagging with feature randomness
   - ✅ Excellent for high-dimensional data
   - ✅ Built-in feature importance
   - **Use when:** You need robust, accurate predictions plus a quick view of which features matter

### 3. **Boosting (AdaBoost)**
   - ✅ Sequential training, focus on errors
   - ✅ Often higher accuracy than bagging
   - ⚠️ More prone to overfitting
   - **Use when:** You need maximum accuracy and have clean data

### 4. **Voting Ensembles**
   - ✅ Combine different algorithm types
   - ✅ Leverages diverse model strengths
   - **Use when:** You have several different model types (for example, traditional ML + neural networks) that make different mistakes

**🌟 Real-World AI Applications (2024-2025):**
- **RAG Systems:** Ensemble classifiers improve retrieval precision
- **Agentic AI:** Multiple models vote on which action to take
- **Production ML:** Random Forests power real-time ranking systems
- **Content Moderation:** Ensembles reduce false positives/negatives
- **Fraud Detection:** Boosting catches subtle patterns

**When to Use What:**
- **Need speed + interpretability?** → Random Forest
- **Need maximum accuracy?** → Boosting (or wait for Day 2: XGBoost!)
- **Have noisy data?** → Random Forest (more robust)
- **Combining models?** → Voting Classifier

## 🚀 Next Steps

**Practice Exercises:**
1. Experiment with different `n_estimators` (50, 100, 200) - what happens?
2. Try different `max_depth` values in Random Forest
3. Create a Voting Classifier with 4-5 different models
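
As a possible starting point for the first exercise, the sketch below sweeps `n_estimators` with 5-fold cross-validation. It runs on toy data from `make_classification` (an assumption so the snippet stands alone); swap in `X` and `y` from this notebook to test the RAG dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data (illustrative only); replace with the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

cv_results = {}
for n in (50, 100, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Five train/validation splits give a mean and a spread, not a single number
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[n] = (fold_scores.mean(), fold_scores.std())
    print(f"n_estimators={n:>3}: {fold_scores.mean():.2%} +/- {fold_scores.std():.2%}")
```

If the differences between settings are smaller than the fold-to-fold spread, extra trees are mostly costing training time.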

**Coming Next:**
- **Day 2:** Gradient Boosting (XGBoost, LightGBM, CatBoost)
- **Day 3:** Advanced ML Techniques (Stacking, Pipelines, Feature Engineering)

---

**🎉 Congratulations!** You now understand the ensemble methods that power:
- Kaggle winning solutions
- Production RAG systems
- Real-world ML applications at scale

**💬 Questions?** Review the notebook, experiment with hyperparameters, and see how ensemble methods improve accuracy!

---

*Remember: Modern AI systems often combine ensemble methods with deep learning, for example Random Forests for feature selection, then Transformers for final predictions!* 🌟