# Decision Trees: Making Decisions with Data

Welcome to the sixth notebook in our **Machine Learning Basics for Beginners** series! After exploring linear and logistic regression, let's dive into **Decision Trees**, a versatile supervised learning algorithm that can be used for both classification and regression tasks. Decision trees mimic how humans make decisions by asking a series of questions.

**What You'll Learn in This Notebook:**
- What decision trees are and when to use them.
- How decision trees work in simple terms.
- A hands-on example of classifying whether someone will play tennis based on weather conditions.
- An interactive exercise to build a mini decision tree and see predictions.
- Visualizations to understand the tree structure and decision-making process.

Let's get started!

## 1. What are Decision Trees?

**Decision Trees** are a supervised learning algorithm that models decisions as a tree-like structure. Each internal node asks a question about a feature, each "branch" represents one possible answer, and each "leaf" represents an outcome (a class for classification or a value for regression).

- **Goal**: Split the data into smaller groups by asking a series of yes/no questions (or thresholds) until reaching a final prediction.
- **When to Use It**: Use decision trees for classification (e.g., spam or not spam) or regression (e.g., predicting house prices) when you want an interpretable model. They can handle both numerical and categorical data, although some libraries, including scikit-learn, need categories converted to numbers first.
- **Examples**:
  - Classifying if a customer will churn (leave a service) based on usage patterns.
  - Predicting house prices based on location, size, and age.
  - Deciding if someone should play a sport based on weather conditions.
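
To make the regression case concrete, here is a minimal sketch using scikit-learn's `DecisionTreeRegressor`. The house sizes and prices are invented for illustration and are not real data. With `max_depth=1` the tree asks a single question, so every prediction is simply the average price of one of two groups.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: house size (square meters) -> price (in thousands).
# These numbers are made up purely for illustration.
sizes = [[50], [60], [70], [150], [160], [170]]
prices = [100, 110, 120, 300, 310, 320]

# A depth-1 tree makes one split, so it has exactly two leaves.
reg = DecisionTreeRegressor(max_depth=1, random_state=0)
reg.fit(sizes, prices)

# Each leaf predicts the average price of the training houses that fall into it.
small_pred = reg.predict([[55]])[0]
large_pred = reg.predict([[165]])[0]
print(f"Predicted price for a 55 m^2 house: {small_pred:.1f}")
print(f"Predicted price for a 165 m^2 house: {large_pred:.1f}")
```

The small house gets the average of the three small houses (110) and the large house gets the average of the three large ones (310).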

**Analogy**: Imagine you're deciding whether to go outside. You might ask: "Is it raining?" If yes, stay in. If no, ask: "Is it cold?" If yes, wear a jacket; if no, go out as is. A decision tree works the same way, breaking down a complex decision into a series of simple questions.

## 2. How Do Decision Trees Work?

Decision trees work by recursively splitting the data based on features to create a tree structure. Let’s break it down:

1. **Start at the Root**: The tree begins with all the data at the "root node."
2. **Choose a Feature to Split**: The algorithm picks a feature and a value (or category) to split the data into two or more groups. It chooses the split that best separates the data based on the target (e.g., for classification, it tries to group similar classes together). This is often measured using criteria like "Gini impurity" or "entropy" (for classification) or variance (for regression).
3. **Create Branches**: Each split creates branches leading to new nodes (subgroups of data).
4. **Repeat Splitting**: The process repeats for each node, splitting on new features or values until a stopping condition is met (e.g., maximum depth, minimum samples per node, or pure groups).
5. **Make Predictions at Leaves**: The final nodes (leaves) contain the predictions. For classification, it’s the most common class in that group; for regression, it’s the average value.
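
Step 2 can be sketched by hand. The snippet below computes Gini impurity for the five-row play-tennis data used in Section 3 and compares a split on Outlook with a split on Wind. The helper names `gini` and `weighted_gini` are illustrative, and splitting into one group per category is a simplification: scikit-learn's trees use binary threshold splits instead.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(feature_values, labels):
    """Size-weighted average impurity of the groups formed by each category."""
    n = len(labels)
    total = 0.0
    for value in set(feature_values):
        group = [lab for f, lab in zip(feature_values, labels) if f == value]
        total += len(group) / n * gini(group)
    return total

outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain']
wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak']
play = ['No', 'No', 'Yes', 'Yes', 'Yes']

root_impurity = gini(play)                       # before any split
outlook_impurity = weighted_gini(outlook, play)  # after splitting on Outlook
wind_impurity = weighted_gini(wind, play)        # after splitting on Wind

print(f"Impurity before splitting: {root_impurity:.3f}")
print(f"After splitting on Outlook: {outlook_impurity:.3f}")
print(f"After splitting on Wind: {wind_impurity:.3f}")
```

Lower impurity means purer groups. On this data, Outlook separates the labels perfectly (impurity 0), while Wind leaves a mixed group behind (impurity 0.3), so Outlook is the better first question.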

**Analogy**: Think of playing a game of "20 Questions." You ask yes/no questions to narrow down possibilities (e.g., "Is it an animal?" "Does it fly?"), and each answer takes you closer to guessing the object. A decision tree builds a similar flowchart to reach a final answer.

**Key Advantage**: Decision trees are easy to interpret—you can follow the path of questions to understand why a prediction was made.
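
One way to see this interpretability in practice is to print a trained tree as plain-text rules. The sketch below uses scikit-learn's `export_text` on a tiny, made-up version of the "go outside" analogy. The feature names `is_raining` and `is_cold` and the data are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [is_raining, is_cold] -> go outside? (1 = yes, 0 = no)
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# export_text returns the learned questions as an indented, readable string.
rules = export_text(clf, feature_names=['is_raining', 'is_cold'])
print(rules)
```

Reading the printed rules from top to bottom traces the same path the model follows when it makes a prediction.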

## 3. Example: Predicting Whether to Play Tennis

Let’s see a decision tree in action with a small dataset about whether someone will play tennis based on weather conditions. We’ll use features like outlook, temperature, humidity, and wind.

**Dataset** (simplified):
| Outlook | Temperature | Humidity | Wind | Play Tennis (Label) |
|---|---|---|---|---|
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |

We’ll use Python’s `scikit-learn` library to create a decision tree model, train it on this data, and predict for a new day. Focus on the steps and output, not the code details.

**Instructions**: Run the code below to see how a decision tree predicts whether to play tennis and visualizes the tree structure.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Create a small dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak'],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes']
}
df = pd.DataFrame(data)
print("Dataset:")
print(df)
print()

# Convert each text column to integer codes (LabelEncoder numbers categories alphabetically, e.g. No=0, Yes=1)
le_outlook = LabelEncoder()
le_temp = LabelEncoder()
le_humidity = LabelEncoder()
le_wind = LabelEncoder()
le_play = LabelEncoder()

df['Outlook'] = le_outlook.fit_transform(df['Outlook'])
df['Temperature'] = le_temp.fit_transform(df['Temperature'])
df['Humidity'] = le_humidity.fit_transform(df['Humidity'])
df['Wind'] = le_wind.fit_transform(df['Wind'])
df['Play Tennis'] = le_play.fit_transform(df['Play Tennis'])

# Features and target
X = df[['Outlook', 'Temperature', 'Humidity', 'Wind']]
y = df['Play Tennis']

# Create and train the decision tree model
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

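# Predict for a new day: Outlook=Overcast, Temperature=Mild, Humidity=Normal, Wind=Strong
# Build a DataFrame with the training column names so scikit-learn does not warn about missing feature names
new_day = pd.DataFrame([[le_outlook.transform(['Overcast'])[0],
                         le_temp.transform(['Mild'])[0],
                         le_humidity.transform(['Normal'])[0],
                         le_wind.transform(['Strong'])[0]]], columns=X.columns)
prediction = model.predict(new_day)[0]
# Decode the numeric prediction back to its original label ('Yes' or 'No')
predicted_label = le_play.inverse_transform([prediction])[0]
print(f"New Day (Outlook=Overcast, Temperature=Mild, Humidity=Normal, Wind=Strong): Predicted as {predicted_label} to Play Tennis")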

# Visualize the decision tree
plt.figure(figsize=(10, 6))
plot_tree(model, feature_names=['Outlook', 'Temperature', 'Humidity', 'Wind'], 
          class_names=['No', 'Yes'], filled=True, rounded=True)
plt.title('Decision Tree for Playing Tennis')
plt.show()

print("Look at the tree above:")
print("- Each box (node) with a condition shows a question about a feature.")
print("- Follow the path based on the conditions to see how a prediction is made.")
print("- Boxes without a condition (leaves) show the final prediction (Yes or No to Play Tennis).")
print("- The fill color of every box indicates the majority class among its samples.")
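
One thing to be aware of in the code above: `LabelEncoder` numbers categories alphabetically (Overcast=0, Rain=1, Sunny=2), so the tree sees an ordering that the categories do not really have. A common alternative is one-hot encoding, which gives each category its own yes/no column. The sketch below is one possible way to do this with `pd.get_dummies`. Variable names such as `X_onehot` and `onehot_model` are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

raw = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak'],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes'],
})

# One 0/1 column per category, e.g. 'Outlook_Sunny'.
X_onehot = pd.get_dummies(raw.drop(columns='Play Tennis'), dtype=int)
y_labels = raw['Play Tennis']  # string labels can be used directly as the target

onehot_model = DecisionTreeClassifier(max_depth=3, random_state=0)
onehot_model.fit(X_onehot, y_labels)

# Encode the new day, then align it to the training columns;
# categories the new day does not have are filled with 0.
new_day_raw = pd.DataFrame([{
    'Outlook': 'Overcast', 'Temperature': 'Mild',
    'Humidity': 'Normal', 'Wind': 'Strong',
}])
new_day_onehot = pd.get_dummies(new_day_raw, dtype=int).reindex(
    columns=X_onehot.columns, fill_value=0)

onehot_prediction = onehot_model.predict(new_day_onehot)[0]
print(f"One-hot model prediction: {onehot_prediction}")
```

With one-hot columns, every question in the tree becomes "is this category present or not?", which matches how the categories are meant to be read.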

## 4. Interactive Exercise: Build a Mini Decision Tree Path

Now it’s your turn to walk through a decision tree yourself! In this exercise, you’ll manually follow a simplified decision path to predict whether to play tennis based on conditions you input. The path is hand-coded for illustration rather than learned by a model, but it mimics how a decision tree makes splits.

**Instructions**:
- Run the code below.
- Answer the questions about weather conditions when prompted.
- See the prediction based on your inputs and understand the decision path.

In [None]:
# Interactive exercise to simulate a decision tree path
print("Welcome to the 'Build a Mini Decision Tree Path' Exercise!")
print("Answer questions about weather conditions to predict if you should play tennis.")
print("This simulates how a decision tree makes decisions by splitting on features.\n")

# Start at the root node (first decision)
outlook = input("What is the Outlook? (Sunny/Overcast/Rain): ").strip().capitalize()
if outlook not in ['Sunny', 'Overcast', 'Rain']:
    print("Invalid input. Defaulting to Sunny.")
    outlook = 'Sunny'

if outlook == 'Sunny':
    humidity = input("What is the Humidity? (High/Normal): ").strip().capitalize()
    if humidity not in ['High', 'Normal']:
        print("Invalid input. Defaulting to High.")
        humidity = 'High'
    if humidity == 'High':
        print("Prediction: No, don't play tennis (Sunny with High Humidity often means No).")
    else:
        print("Prediction: Yes, play tennis (Sunny with Normal Humidity often means Yes).")
elif outlook == 'Overcast':
    print("Prediction: Yes, play tennis (Overcast often means Yes, regardless of other conditions).")
else:  # Rain
    wind = input("What is the Wind? (Strong/Weak): ").strip().capitalize()
    if wind not in ['Strong', 'Weak']:
        print("Invalid input. Defaulting to Strong.")
        wind = 'Strong'
    if wind == 'Strong':
        print("Prediction: No, don't play tennis (Rain with Strong Wind often means No).")
    else:
        print("Prediction: Yes, play tennis (Rain with Weak Wind often means Yes).")

print("\nThis path shows how a decision tree splits data based on conditions:")
print(f"- First split on Outlook: {outlook}")
if outlook == 'Sunny':
    print(f"- Then split on Humidity: {humidity}")
elif outlook == 'Rain':
    print(f"- Then split on Wind: {wind}")
print("Each answer narrows down the prediction, just like branches in a tree lead to a leaf!")
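
The same hand-written path can also be stored as data instead of nested `if` statements, which is closer to how a tree is represented in memory. This is a minimal sketch: the `TREE` dictionary layout and the `predict` helper are illustrative choices, not how scikit-learn stores its trees.

```python
# The hand-written tree from the exercise, stored as nested dictionaries.
# Dictionaries are internal nodes (a feature to ask about); strings are leaves.
TREE = {
    'feature': 'Outlook',
    'branches': {
        'Sunny': {'feature': 'Humidity', 'branches': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain': {'feature': 'Wind', 'branches': {'Strong': 'No', 'Weak': 'Yes'}},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches each answer."""
    path = []
    while isinstance(node, dict):
        answer = sample[node['feature']]
        path.append(f"{node['feature']}={answer}")
        node = node['branches'][answer]
    return node, path

label, path = predict(TREE, {'Outlook': 'Rain', 'Humidity': 'High', 'Wind': 'Weak'})
print(f"Prediction: {label} via {' -> '.join(path)}")
```

Note that Humidity is never consulted for a rainy day: only the features along the chosen path affect the prediction.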

## 5. Key Considerations for Decision Trees

Decision trees are intuitive and powerful, but they come with some considerations to keep in mind:

- **Overfitting Risk**: Decision trees can easily overfit, especially if the tree grows too deep. They might memorize the training data (including noise) instead of learning general patterns, leading to poor performance on new data. Techniques like limiting depth or pruning (cutting off branches) help prevent this.
- **Instability**: Small changes in the data can lead to a completely different tree structure, making decision trees less stable than many other models.
- **Bias Toward Features with Many Values**: Split criteria tend to favor features with many distinct categories or values, because more candidate splits give more chances to fit the training data by coincidence. The chosen feature might not always be the most informative one.
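
The overfitting risk can be illustrated with a rough experiment. The sketch below compares a shallow tree with an unrestricted one on synthetic data from `make_classification`. The dataset settings (400 samples, 10 features, `flip_y=0.2` for label noise) are arbitrary assumptions chosen for illustration, and exact accuracies will vary with them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a random class to about 20% of samples (noise).
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train accuracy={results[depth][0]:.2f}, "
          f"test accuracy={results[depth][1]:.2f}")
```

The unrestricted tree reaches perfect training accuracy because it memorizes every training row, including the noisy ones. The gap between its training and test accuracy is the sign of overfitting to look for.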

**Analogy**: A decision tree is like a very detailed flowchart for decision-making. If you make the flowchart too complicated with every tiny detail, it might only work for the exact situations you’ve seen before and fail for anything new.

Despite these issues, decision trees are widely used because they’re easy to understand and form the basis for more advanced methods like Random Forests (which combine many trees to improve performance).
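
As a brief preview, a Random Forest can be trained with almost the same code as a single tree. This is a sketch on synthetic data under arbitrary assumed settings, so the accuracies it prints should not be read as a general result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (settings chosen only for illustration).
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# One unrestricted tree versus a forest of 100 trees.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"Single tree test accuracy: {tree_acc:.2f}")
print(f"Random forest test accuracy: {forest_acc:.2f}")
```

Each tree in the forest is trained on a random resample of the data and votes on the final class, which tends to smooth out the instability of any single tree.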

## 6. Key Takeaways

- **Decision Trees** are a supervised learning algorithm for both classification and regression, modeling decisions as a tree of questions and answers.
- They work by splitting data based on features to create branches, continuing until a prediction is made at a leaf node.
- Use them for tasks like customer churn prediction or price estimation when interpretability is important and data can be split into clear groups.
- Be aware of limitations: they can overfit if too complex, are sensitive to small changes in the data, and may be biased toward features with many distinct values.

You’ve now learned a flexible and intuitive algorithm! Decision trees are a stepping stone to understanding more complex ensemble methods and provide a clear way to see how data drives predictions.

**What's Next?**
Move on to **Notebook 7: K-Nearest Neighbors** to learn about a simple yet effective algorithm for classification and regression based on similarity. See you there!