# 🌳 Introduction to Decision Trees, Bagging, and Random Forests

### 📘 Welcome to Your First AI/ML Model Session!

Welcome! In this 2-hour interactive session, we'll journey from a single predictive model to a powerful ensemble of models. We'll demystify some of the most fundamental and widely used algorithms in machine learning.

**Our Learning Objectives for Today:**

1.  **Understand Decision Trees:** Learn what a Decision Tree is, how it makes decisions, and how to build one in Python.
2.  **Discover Ensemble Learning:** Grasp the idea of 'wisdom of the crowd' by learning about Bagging.
3.  **Master Random Forests:** See how Random Forests improve upon Bagging to become one of the most effective ML models.
4.  **Get Hands-On:** Apply these concepts with simple, fun coding exercises.

Let's get started! 🚀

## Topic 1: The Decision Tree 🌳

A Decision Tree works like a flowchart: it asks a series of questions about your data to arrive at a decision. It's one of the most intuitive models in machine learning!

**How it works:**
*   **Root Node:** The starting point, which contains all your data.
*   **Decision Nodes:** The points where the tree asks a question (e.g., "Is the outlook sunny?").
*   **Branches:** The paths that represent the answer to a question (e.g., 'Yes' or 'No').
*   **Leaf Nodes:** The final endpoints, which give you the prediction (e.g., "Play Tennis").
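
To make the flowchart picture concrete, here is a tiny tree written by hand as plain Python. The rules are made up for illustration (they are not learned from any data), and `predict_play` is a hypothetical helper, not a scikit-learn function.

```python
def predict_play(outlook, humidity, wind):
    """A hand-written decision tree; the rules are illustrative assumptions."""
    # Root node: every example starts here and is asked the first question.
    if outlook == "Overcast":
        return "Play Tennis"        # leaf node
    # Decision node on the 'Sunny' branch: ask about humidity.
    if outlook == "Sunny":
        if humidity == "High":
            return "Don't Play"     # leaf node
        return "Play Tennis"        # leaf node
    # Remaining branch (Rain): ask about wind.
    if wind == "Strong":
        return "Don't Play"         # leaf node
    return "Play Tennis"            # leaf node


print(predict_play("Sunny", "High", "Weak"))
print(predict_play("Rain", "Normal", "Weak"))
```

A real Decision Tree has the same shape; the difference is that the questions are chosen automatically from the training data.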

The tree learns the best questions to ask by finding splits that make the data in each branch as 'pure' (similar) as possible. A common way to measure this is **Gini Impurity**, which is 0 when every sample in a node has the same label and grows as the labels become more mixed.
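
As a rough sketch of how purity can be scored, the snippet below computes Gini Impurity by hand and uses it to rate one candidate split (Humidity) on the Play Tennis labels used in this section. The helper names `gini_impurity` and `weighted_gini` are hypothetical; scikit-learn does this internally and far more efficiently.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: each branch's impurity weighted by its size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0 (perfectly pure)
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5 (evenly mixed)

# Score the question "Is Humidity high?" on the Play Tennis data.
humidity = ['High', 'High', 'High', 'High', 'Normal',
            'Normal', 'Normal', 'High', 'Normal', 'Normal']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']

high = [p for h, p in zip(humidity, play) if h == 'High']
normal = [p for h, p in zip(humidity, play) if h == 'Normal']

print("Before split:", round(gini_impurity(play), 2))
print("After split: ", round(weighted_gini(high, normal), 2))
```

A lower number after the split means the question separated the labels; the tree compares candidate questions this way and keeps the one with the lowest weighted impurity.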

In [None]:
# 💻 Code Example: Predicting Tennis Play

# First, we need to import the necessary tools from scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
import pandas as pd

# Let's create the 'Play Tennis' dataset from our notes
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak'],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

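# Machine learning models need numbers, not text! So, we convert our text data into numbers.
# Each column gets its own LabelEncoder so that every column can be decoded later.
encoders = {col: preprocessing.LabelEncoder().fit(df[col]) for col in df.columns}
df_encoded = df.apply(lambda col: encoders[col.name].transform(col))
le = encoders['Play Tennis']  # the target's encoder, used below to decode predictions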

# Separate the features (X) from the target (y)
X = df_encoded.drop('Play Tennis', axis=1)
y = df_encoded['Play Tennis']

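# Create and train our Decision Tree model!
# random_state fixes how ties between equally good splits are broken, so reruns match.
tree_model = DecisionTreeClassifier(criterion='gini', random_state=42)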
tree_model.fit(X, y)

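# Let's make a prediction! What if: Outlook=Sunny (2), Temp=Mild (2), Humidity=High (0), Wind=Weak (1)
# (LabelEncoder numbers the categories alphabetically, e.g. Overcast=0, Rain=1, Sunny=2.)
new_day = pd.DataFrame([[2, 2, 0, 1]], columns=X.columns)  # keep column names to avoid a warning
prediction_encoded = tree_model.predict(new_day)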

# Convert the prediction back to text ('Yes' or 'No')
prediction = le.inverse_transform(prediction_encoded)
print(f"Prediction for a Sunny, Mild, High Humidity, Weak Wind day: {prediction[0]} Tennis!")

### 🎯 Practice Task: Fruit Identifier

Your turn! Let's create a model to guess if a fruit is an Apple or an Orange based on its weight and texture.

1.  Create two lists: `features` (containing weight and texture) and `labels` (the fruit name).
2.  Train a `DecisionTreeClassifier` on this data.
3.  Predict what fruit has `weight=145` and `texture='Smooth' (1)`.

In [None]:
# Your code here!
from sklearn.tree import DecisionTreeClassifier

# Data: 0 for Bumpy, 1 for Smooth. 0 for Apple, 1 for Orange.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1] # Apple, Apple, Orange, Orange

# 1. Create a Decision Tree Classifier
fruit_classifier = DecisionTreeClassifier()

# 2. Train the classifier using the .fit() method
# fruit_classifier.fit(..., ...)

# 3. Predict for a fruit that weighs 145g and is Smooth (1)
# prediction = fruit_classifier.predict([[145, 1]])
# if prediction == 0:
#     print("It's probably an Apple!")
# else:
#     print("It's probably an Orange!")

print("Task not yet completed. Fill in the code above!")

## Topic 2: Bagging (Bootstrap Aggregating) 🛍️

A single decision tree can sometimes learn the training data *too* well, a problem called **overfitting**. It might not perform well on new, unseen data.

💡 **Idea:** What if we train *many* slightly different trees on slightly different versions of the data and let them vote on the final answer? This is the core idea of Bagging!

**The Bagging Process:**
1.  **Bootstrap:** Create many new datasets from the original one by sampling *with replacement*. Imagine you have a bag of marbles; you pick one, note its color, and *put it back*. You do this until your new bag is full. Some marbles will be picked multiple times, some not at all.
2.  **Train:** Train a separate model (like a decision tree) on each of these new datasets in parallel.
3.  **Aggregate (Vote):** For a new prediction, ask every model for its opinion. The final answer is the one that gets the most votes!

The main goal of Bagging is to **reduce variance**: each tree picks up quirks of its own sample, and combining many trees smooths those quirks out, making the model more stable and reliable.
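
The bootstrap and voting steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea: `bootstrap_sample` and `majority_vote` are hypothetical helpers, and the numbers 0 to 9 stand in for ten training examples.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (one 'bag')."""
    return rng.choices(rows, k=len(rows))


def majority_vote(votes):
    """Return the most common prediction among the models' votes."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)   # fixed seed so the run is repeatable
rows = list(range(10))    # stand-ins for 10 training examples

for i in range(3):
    bag = bootstrap_sample(rows, rng)
    never_picked = sorted(set(rows) - set(bag))
    print(f"Bag {i}: {sorted(bag)}  never picked: {never_picked}")

# Three models vote on one new example.
print(majority_vote(["Yes", "No", "Yes"]))  # Yes
```

Each bag repeats some examples and misses others, which is why the trees trained on them come out slightly different.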

In [None]:
# 💻 Code Example: Upgrading to a Bagging Classifier

# We'll use the same tennis data as before.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# X and y are already loaded and encoded from the first example.

# We create a base decision tree - this is the model we will make copies of.
base_tree = DecisionTreeClassifier()

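# Now, we create the Bagging model.
# n_estimators=100 means we will create 100 decision trees.
# The base tree is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
bagging_model = BaggingClassifier(base_tree, n_estimators=100, random_state=42)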

# Train the bagging model
bagging_model.fit(X, y)

# Let's make the same prediction as before!
# Outlook=Sunny (2), Temp=Mild (2), Humidity=High (0), Wind=Weak (1)
prediction_encoded = bagging_model.predict(pd.DataFrame([[2, 2, 0, 1]], columns=X.columns))
prediction = le.inverse_transform(prediction_encoded)

print(f"Bagging Model Prediction: {prediction[0]} Tennis!")
print("Combining 100 trees usually gives a more stable prediction than a single tree.")

# 🧪 Try changing the n_estimators value (e.g., to 10 or 500) and see if the prediction changes!

### 🎯 Practice Task: Bagging the Fruits

Now apply the Bagging technique to our fruit identifier problem. Will a committee of classifiers do better than a single one?

Use `BaggingClassifier` with a `DecisionTreeClassifier` base to train on the fruit data. How does this compare to the single tree?

In [None]:
# Your code here!
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Data: 0 for Bumpy, 1 for Smooth. 0 for Apple, 1 for Orange.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1] # Apple, Apple, Orange, Orange

# 1. Create the Bagging Classifier
# Hint: Use n_estimators=10 to start
# fruit_bagging_model = BaggingClassifier(..., n_estimators=10, random_state=42)

# 2. Train the model
# fruit_bagging_model.fit(..., ...)

# 3. Predict for a fruit that weighs 145g and is Smooth (1)
# prediction = fruit_bagging_model.predict([[145, 1]])
# if prediction == 0:
#     print("The committee of trees thinks it's an Apple!")
# else:
#     print("The committee of trees thinks it's an Orange!")
    
print("✅ Well done on learning about Bagging! Just fill in the code above.")

## Topic 3: The Random Forest 🌲🌲🌲

The Random Forest is a tweaked, and often more powerful, version of Bagging. It's one of the most popular and effective off-the-shelf models you can use!

It works just like Bagging, but with one extra trick:

**✨ The Magic Ingredient: Feature Randomness ✨**

When a normal bagged tree decides to make a split (ask a question), it looks at *all* the available features (Outlook, Temperature, Humidity, Wind) to find the best one.

In a Random Forest, when a tree wants to make a split, it is only allowed to look at a *random subset* of the features. For example, it might only be allowed to consider 'Humidity' and 'Wind' for one split, and 'Outlook' and 'Temperature' for another.

**Why does this help?**
It forces the trees to be more diverse! If one feature is very strong (like 'Outlook'), Bagging might create 100 trees that all split on it first and end up looking very similar, and similar trees tend to make the same mistakes. By restricting the features available at each split, Random Forest builds trees that are less alike, which usually makes the final vote more robust and accurate.
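
Here is a minimal sketch of the feature-randomness step on its own, assuming the subset size is the square root of the number of features (a common choice; in recent scikit-learn versions `RandomForestClassifier` controls this with `max_features`, which defaults to `"sqrt"`). The loop only shows which features each split would be allowed to consider; it does not build a tree.

```python
import math
import random

features = ["Outlook", "Temperature", "Humidity", "Wind"]

# Assumed subset size: square root of the feature count (2 of the 4 here).
k = max(1, int(math.sqrt(len(features))))

rng = random.Random(0)  # fixed seed so the run is repeatable
for split in range(3):
    candidates = rng.sample(features, k)  # drawn without replacement
    print(f"Split {split}: may only consider {candidates}")
```

Because a strong feature such as 'Outlook' is missing from many of these subsets, other features get a chance to shape the tree.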

In [None]:
# 💻 Code Example: Unleashing the Random Forest

# scikit-learn makes this super easy with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# X and y are the same encoded tennis data

# Create and train the Random Forest model
# n_estimators=100 means it will build 100 trees
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X, y)

# Let's make our prediction one last time
# Outlook=Sunny (2), Temp=Mild (2), Humidity=High (0), Wind=Weak (1)
prediction_encoded = forest_model.predict(pd.DataFrame([[2, 2, 0, 1]], columns=X.columns))
prediction = le.inverse_transform(prediction_encoded)

print(f"Random Forest Model Prediction: {prediction[0]} Tennis!")

### 🎯 Practice Task: A Forest for Fruits

You know the drill! Apply the `RandomForestClassifier` to the fruit dataset. This is often the go-to model for classification tasks like this.

Experiment with `n_estimators`. Does using 10 trees give a different result than using 100?

In [None]:
# Your code here!
from sklearn.ensemble import RandomForestClassifier

# Data: 0 for Bumpy, 1 for Smooth. 0 for Apple, 1 for Orange.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1] # Apple, Apple, Orange, Orange

# 1. Create the Random Forest Classifier with 100 estimators
# fruit_forest_model = ...

# 2. Train the model
# fruit_forest_model.fit(..., ...)

# 3. Predict for a fruit that weighs 145g and is Smooth (1)
# prediction = ...
# if prediction == 0:
#     print("The Random Forest says it's an Apple!")
# else:
#     print("The Random Forest says it's an Orange!")
    
print("Task not yet completed. Give it a try!")

## 🧠 Final Revision Assignment

Congratulations on learning about three powerful machine learning models! Now it's time to put all your new knowledge to the test with a slightly bigger, more realistic dataset: the famous Iris dataset.

This dataset contains measurements for 3 different species of Iris flowers. Your goal is to build models to predict the species based on the measurements.

In [None]:
# First, let's load the dataset from scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# We split the data so we can train on one part and test on another
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Dataset is loaded and split. Ready for your models!")

**Assignment Tasks:**

1.  **Task 1: The Single Tree.** Train a `DecisionTreeClassifier` on the `X_train` and `y_train` data.
2.  **Task 2: Evaluate the Tree.** Use your trained tree to make predictions on `X_test`. Calculate its accuracy using the `accuracy_score` function.
3.  **Task 3: The Bagging Committee.** Train a `BaggingClassifier` (with a Decision Tree base) on the training data.
4.  **Task 4: Evaluate Bagging.** Predict on `X_test` with the bagging model and calculate its accuracy.
5.  **Task 5: The Mighty Forest.** Train a `RandomForestClassifier` on the training data.
6.  **Task 6: Evaluate the Forest.** Predict on `X_test` with the random forest and calculate its accuracy. Which model performed the best?
7.  **Task 7 (Conceptual):** A bank wants to predict loan defaults. Why might a Random Forest be a better choice than a single Decision Tree for this important task? (Write your answer in a markdown cell or as a comment in the code).
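
Tasks 2, 4, and 6 all rely on accuracy. With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels; the hypothetical `accuracy` helper below is a plain-Python sketch of that same calculation, shown only to make the metric concrete.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# 3 of the 4 predictions are right.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```

In the assignment itself, keep using `accuracy_score(y_test, predictions)` from scikit-learn.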

In [None]:
# --- Your Code for the Final Assignment ---

# Task 1 & 2: Decision Tree
print("--- Decision Tree ---")
# your_tree = DecisionTreeClassifier(random_state=42)
# your_tree.fit(X_train, y_train)
# tree_preds = your_tree.predict(X_test)
# tree_accuracy = accuracy_score(y_test, tree_preds)
# print(f"Decision Tree Accuracy: {tree_accuracy:.2f}")

# Task 3 & 4: Bagging Classifier
print("\n--- Bagging Classifier ---")
# your_bagging = BaggingClassifier(n_estimators=100, random_state=42)
# your_bagging.fit(X_train, y_train)
# bagging_preds = your_bagging.predict(X_test)
# bagging_accuracy = accuracy_score(y_test, bagging_preds)
# print(f"Bagging Accuracy: {bagging_accuracy:.2f}")

# Task 5 & 6: Random Forest Classifier
print("\n--- Random Forest ---")
# your_forest = RandomForestClassifier(n_estimators=100, random_state=42)
# your_forest.fit(X_train, y_train)
# forest_preds = your_forest.predict(X_test)
# forest_accuracy = accuracy_score(y_test, forest_preds)
# print(f"Random Forest Accuracy: {forest_accuracy:.2f}")

# Task 7 (Conceptual Answer as a comment):
# A Random Forest is better for a bank because...

## 🎉 You've Completed the Session!

Fantastic work! You've gone from the basics of a single Decision Tree to understanding and implementing powerful ensemble methods like Bagging and Random Forests. These are essential tools for any data scientist or AI practitioner.

Keep experimenting and happy coding!