# Model Tuning and Ensemble Methods
---
## Outline

1. **Introduction to Model Tuning and Pipelines**
2. **GridSearchCV – Hyperparameter Tuning**
3. **Ensemble Methods – Why Combine Models?**
4. **Random Forests – The Forest is Smarter than the Tree**
5. **Tree Ensembles – More Trees, Better Predictions**
6. **Gradient Boosting – Turning Weakness into Strength**
7. **XGBoost – The Power Tool for Boosting**
8. **Mini Exercises**

---

## 1. Introduction to Model Tuning and Pipelines

### What is Model Tuning?

Imagine you’re cooking spaghetti. You adjust the salt, boiling time, or sauce thickness to get the best taste. Model tuning works the same way: you tweak the settings of your algorithm (its hyperparameters, which you choose before training rather than learn from the data) to get the best performance.
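
To make the difference between settings and learned values concrete, here is a minimal sketch. The dataset and the names `X_demo`, `y_demo`, and `clf` are placeholders made up for illustration. `C` and `max_iter` are hyperparameters we choose ourselves, while `coef_` only appears after training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, used only for illustration (an assumption, not real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(C=0.5, max_iter=200)

# Hyperparameters: chosen by us before training
hyperparams = clf.get_params()
print("C:", hyperparams["C"], "| max_iter:", hyperparams["max_iter"])

# Learned parameters: only exist after fit()
clf.fit(X_demo, y_demo)
print("Learned coefficients shape:", clf.coef_.shape)
```

Tuning means searching over values like `C`. The coefficients are whatever training produces for a given choice.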

### What is a Pipeline?

A pipeline is like a recipe. It organizes the steps you take in building your model:

1. Preprocessing (like cleaning or scaling data)
2. Training the model
3. Making predictions



In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

**Example**:

In [2]:
# Define the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),   # Step 1: scale features
    ('model', LogisticRegression()) # Step 2: train model
])
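
Defining a pipeline does not train anything yet. The sketch below shows one way to fit the same two-step pipeline and use it for predictions. The synthetic data and the names `demo_pipe`, `X_tr`, and `X_te` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() learns the scaling statistics and the model from the training data only
demo_pipe.fit(X_tr, y_tr)

# predict() and score() reuse those training statistics to transform the test data
preds = demo_pipe.predict(X_te)
test_accuracy = demo_pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Because the scaler sits inside the pipeline, the test set never influences the scaling, which is the main practical reason to bundle the steps.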

---

## 2. GridSearchCV – Hyperparameter Tuning

### Why Tune?

Different models have different "dials" to adjust. For example, in a Random Forest:

* How many trees?
* How deep should they grow?

`GridSearchCV` tries every combination of the values you list, scores each one with cross-validation, and keeps the combination with the best average score.
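
To see what "all combinations" means in practice, this small sketch lists the combinations of a 3 x 3 grid with scikit-learn's `ParameterGrid`. The fold count `n_folds` is an assumed value chosen for the illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 5, 10],
}

combinations = list(ParameterGrid(param_grid))
n_folds = 5  # assumed number of cross-validation folds

print("Combinations:", len(combinations))                        # 3 x 3 = 9
print("Model fits during search:", len(combinations) * n_folds)  # 9 x 5 = 45
print("First combination:", combinations[0])
```

The cost grows multiplicatively with every value you add, so it helps to keep grids small. By default `GridSearchCV` also refits the best combination once more on the full training data.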

**Example**:

In [3]:
# Generate synthetic classification data

X, y = make_classification(n_samples=500, n_features=8, n_informative=5, n_classes=2, random_state=42)

In [4]:
# Split your data (replace X and y with your actual feature and target variables)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 10]
}

model = RandomForestClassifier()
grid = GridSearchCV(model, param_grid, cv=5)

grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)

Best Parameters: {'max_depth': 10, 'n_estimators': 50}


**Explanation**:

* `n_estimators`: Number of trees in the forest.
* `max_depth`: Maximum depth each tree is allowed to grow to.
* `cv=5`: Score each combination with 5-fold cross-validation.
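
Grid search also works on a whole pipeline, not just a bare model. The sketch below is one possible setup: the step names `scaler` and `model`, the candidate values for `C`, and the synthetic data are all assumptions for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500)),
])

# "model__C" means: parameter C of the step named "model"
pipe_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(demo_pipe, pipe_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["model__C"])
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

`best_score_` is the average cross-validation score of the winning combination. The test accuracy on the last line is the more honest estimate, because the search never saw that data.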

---

## 3. Ensemble Methods – Why Combine Models?

Think about voting in a class. If one person is unsure, the group can still make a good decision. That’s the idea behind ensemble methods: combine several models (weak or strong) to make better predictions.
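
The voting idea can be checked with a small simulation. This sketch assumes a binary task and 11 hypothetical models that are each right 70% of the time and make their mistakes independently. Independence is a strong assumption that real models rarely meet, so treat the result as an upper bound on the benefit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 10_000, 11, 0.7

# correct[i, j] is True when model j gets sample i right;
# each model is right 70% of the time, independently of the others
correct = rng.random((n_samples, n_models)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right
majority_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"One model:     {single_accuracy:.3f}")
print(f"Majority vote: {majority_accuracy:.3f}")
```

Under these assumptions the majority vote should land near 0.92, well above any single voter. When models make the same mistakes, the gain shrinks, which is why ensembles try to make their members differ.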

---

## 4. Random Forests – The Forest is Smarter than the Tree

Random Forest is an ensemble of decision trees. Each tree:

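* Trains on a random sample of the rows (a bootstrap sample, drawn with replacement)
* Considers only a random subset of the features at each split, so it grows differently
* Votes on the final answer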

**Example**:

In [5]:
rf = RandomForestClassifier(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)
print("Accuracy:", rf.score(X_test, y_test))

Accuracy: 0.94
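
To show what happens inside the forest, here is a hand-rolled sketch of the same idea using plain decision trees. It is a simplification, not scikit-learn's implementation: `RandomForestClassifier` averages the trees' predicted probabilities instead of counting hard votes. The tree count (25) and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_depth=5, max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Each tree votes 0 or 1; the majority wins (25 trees, so no ties)
votes = np.array([t.predict(X_te) for t in trees])  # shape (25, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Mean single-tree accuracy:", np.mean([t.score(X_te, y_te) for t in trees]))
print("Majority-vote accuracy:   ", (forest_pred == y_te).mean())
```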


---

## 5. Tree Ensembles – More Trees, Better Predictions

Why not just grow one big tree?

Because:

* One big tree may overfit (memorize the training data)
* Averaging many different trees smooths out the quirks of any single tree, which reduces overfitting
* Each tree captures different parts of the data
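
The overfitting point can be checked by comparing training and test accuracy. This is a sketch on synthetic data, and the names `big_tree` and `forest` are made up for illustration. Exact numbers will vary with the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# max_depth=None lets the single tree grow until every leaf is pure
big_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, model in [("Single tree", big_tree), ("Random forest", forest)]:
    print(f"{name}: train={model.score(X_tr, y_tr):.2f}, test={model.score(X_te, y_te):.2f}")
```

A fully grown tree scores perfectly on the data it was trained on, so the number to watch is the gap between its training and test accuracy.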

---

## 6. Gradient Boosting – Turning Weakness into Strength

Unlike Random Forests (whose trees are independent of each other and can be trained in parallel), Gradient Boosting trains one tree at a time.

Each new tree tries to correct the mistakes made by the trees before it.

**Simple Analogy**:

* Student takes a test and gets some questions wrong.
* Teacher gives a mini-quiz only on the wrong questions.
* The student learns from mistakes and improves over time.

**Example**:

In [6]:
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print("Accuracy:", gb.score(X_test, y_test))

Accuracy: 0.94
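
The "correct the mistakes" loop is easiest to see in a regression setting, where the mistakes are simply the residuals. The following is a simplified sketch of boosting with squared error, not what `GradientBoostingClassifier` does internally (the classifier works with gradients of a classification loss). The toy sine data and all variable names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant guess
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # take a small corrective step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after 50 trees:  {mse_history[-1]:.4f}")
```

A smaller `learning_rate` makes each correction more cautious, so more trees are needed, but the model is usually less prone to overfitting.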


---

## 7. XGBoost – The Power Tool for Boosting

XGBoost is short for *Extreme Gradient Boosting*. It’s usually faster than basic boosting implementations and often, though not always, more accurate.

Why it’s popular:

* Efficient memory use
* Built-in regularization to help limit overfitting
* Handles missing values (`NaN`) without a separate imputation step

**Example**:

In [7]:
xgb_model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb_model.fit(X_train, y_train)
print("Accuracy:", xgb_model.score(X_test, y_test))

Accuracy: 0.91


---

## Model Comparison

Compare different models side by side using 5-fold cross-validation:

In [8]:
models = {
    'RandomForest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'XGBoost': xgb.XGBClassifier()
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name} Avg Score: {scores.mean():.2f}")

RandomForest Avg Score: 0.92
GradientBoosting Avg Score: 0.91
XGBoost Avg Score: 0.91
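
Averages that differ by 0.01 can easily come from fold-to-fold noise. One way to judge this is to report the spread alongside the mean and to reuse the same folds for every model. The sketch below uses only the two scikit-learn models to stay dependency-light. The fold seed and the `results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=42)

# Same shuffled folds for every model, so the comparison is like-for-like
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two means is smaller than the standard deviations, the data at hand does not clearly favor either model.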


---

## 8. Mini Exercises

1. Create a pipeline that scales features and fits a logistic regression.
2. Use `GridSearchCV` to tune a Random Forest on any dataset (e.g., Titanic).
3. Compare accuracy of RandomForest, GradientBoosting, and XGBoost on the same data.
4. Plot feature importances from Random Forest or XGBoost to see what variables matter most.
5. Try reducing overfitting in a boosted model by adjusting learning rate and tree depth.

---
