# Module 6: Ensemble Learning and Decision Trees
In this module, we explore tree-based models, which are simple yet powerful tools for classification and regression, and extend them with ensemble learning methods such as Random Forests.

## 🎯 Learning Objectives
- Understand how decision trees work
- Learn the pros and cons of tree-based models
- Apply and evaluate a Random Forest classifier
- Interpret feature importance in ensemble models

## 🌳 Decision Trees
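A decision tree recursively splits the dataset into branches based on conditions on feature values. At each node, it picks the feature and threshold that best separate the classes, as measured by an impurity criterion such as Gini impurity (the default in scikit-learn's `DecisionTreeClassifier`).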
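To make the idea of a "best" split concrete, here is a minimal sketch of how a single split could be scored with Gini impurity. The helper names `gini` and `best_threshold` and the toy data are made up for illustration; scikit-learn's actual implementation is optimized and searches over all features, so treat this as a simplified picture rather than the library's code.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted_impurity) of the best binary split on one feature."""
    best = (None, np.inf)
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values,
    # so both sides of every candidate split are non-empty
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        # Weight each side's impurity by the share of samples it receives
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): one feature, two well-separated classes
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(x_toy, y_toy)
print(f"Impurity before split: {gini(y_toy):.2f}")
print(f"Best threshold: {threshold}, weighted impurity after split: {impurity:.2f}")
```

A tree builder would repeat this search over every feature, apply the winning split, and then recurse on each side until a stopping rule such as `max_depth` is reached.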

Pros:
- Easy to understand and interpret
- Handles both numeric and categorical data (scikit-learn's trees expect categorical features to be numerically encoded first)

Cons:
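- Prone to overfitting, especially when the tree is allowed to grow to full depth
- Sensitive to small data changes: a slightly different training set can produce a very different tree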
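One way to see the overfitting risk is to compare training and test accuracy for an unrestricted tree and a shallow one. The sketch below assumes the iris dataset and an arbitrary 70/30 split; the depth values and `random_state` are illustrative choices, and the size of the gap will vary with the dataset and the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
# max_depth=None lets the tree grow until its leaves are pure or cannot be split further
for depth in (None, 2):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth setting
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

A training score well above the test score suggests the tree has memorized the training data rather than learned a pattern that generalizes.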

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

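# Load the iris dataset once and reuse its metadata for plotting
iris = load_iris()
X, y = iris.data, iris.target

# Fit a shallow tree; max_depth=3 keeps the plot readable
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)

# Plot the tree
# list(...) is used because some scikit-learn releases reject a NumPy array for class_names
plt.figure(figsize=(12, 6))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.title('Decision Tree')
plt.show()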

## 🌲 Random Forests
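Random Forests train many decision trees, each on a bootstrap sample of the data (rows drawn with replacement) and with only a random subset of features considered at each split. Their predictions are then combined, by voting for classification or averaging for regression, which usually generalizes better than any single tree.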

Advantages:
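- Reduces overfitting compared with a single tree, because averaging many decorrelated trees lowers variance
- Scales to large datasets, since the trees are independent and can be trained in parallel (`n_jobs` in scikit-learn)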
- Provides feature importance metrics
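The mechanism behind a Random Forest can be sketched by hand with plain decision trees. The code below is a simplified illustration, not scikit-learn's implementation: the number of trees, the seeds, and the hard majority vote are assumptions made for this sketch (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting votes).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary choice for this sketch
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set has, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider only a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's predictions: shape (n_trees, n_test_samples)
votes = np.stack([t.predict(X_test) for t in trees])

# Majority vote per test sample (iris labels are the integers 0, 1, 2)
ensemble_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
print(f"Hand-rolled ensemble accuracy: {(ensemble_pred == y_test).mean():.2f}")
```

Because each tree sees a different sample and different candidate features, the trees make different mistakes, and combining them tends to cancel out some of the individual errors.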

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.2f}")

## 🔍 Feature Importance
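Random Forests can estimate which features were most influential in making predictions. In scikit-learn, `feature_importances_` measures how much each feature reduced impurity across all trees, normalized so that the scores sum to 1. These scores are computed from the training data and tend to favor features with many distinct values, so treat them as a guide rather than a definitive ranking.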

In [None]:
import pandas as pd

# Feature importance
feature_importances = pd.Series(rf.feature_importances_, index=load_iris().feature_names)
feature_importances.sort_values().plot(kind='barh')
plt.title('Feature Importance in Random Forest')
plt.xlabel('Importance')
plt.show()
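To see where the forest-level scores come from, the sketch below collects the importances of each individual tree and compares their average with `rf.feature_importances_`. It refits a forest on the full iris dataset so that it runs on its own; the variable name `per_tree` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(iris.data, iris.target)

# Importances of every individual tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

# The forest-level scores should match the average over the trees
print("Matches tree average:", np.allclose(per_tree.mean(axis=0), rf.feature_importances_))

# The spread across trees hints at how stable each score is
for name, mean, std in zip(iris.feature_names, per_tree.mean(axis=0), per_tree.std(axis=0)):
    print(f"{name:20s} {mean:.3f} +/- {std:.3f}")
```

A feature with a large spread across trees has a less reliable importance score than one that ranks consistently.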

## ✅ Practice Exercises
1. Train a decision tree on the wine or breast cancer dataset (`load_wine` or `load_breast_cancer` in `sklearn.datasets`).
2. Try limiting the depth of the tree with `max_depth`. How do training and test accuracy change?
3. Use a Random Forest classifier and evaluate its performance.
4. Plot and interpret feature importances.
5. Reflect: Why are Random Forests less prone to overfitting than a single decision tree?