# Lesson 2: Decision Trees for House Price Prediction

## Table of Contents

1. [Introduction](#introduction)
2. [Intuition Behind Decision Trees](#intuition-behind-decision-trees)
   - [Why Choose Decision Trees for House Prices?](#why-choose-decision-trees-for-house-prices)
3. [Anatomy of a Decision Tree](#anatomy-of-a-decision-tree)
4. [Preparing Data for Decision Trees](#preparing-data-for-decision-trees)
   - [Numerical Data](#numerical-data)
   - [Categorical Data](#categorical-data)
   - [One-Hot Encoding](#one-hot-encoding)
   - [Target Encoding](#target-encoding)
   - [Smoothed Target Encoding](#smoothed-target-encoding)
   - [Practical Guide to Smoothed Encoding](#practical-guide-to-smoothed-encoding)
   - [Ordinal and Binary Features](#ordinal-and-binary-features)
   - [Combining Different Encoding Methods](#combining-different-encoding-methods)
   - [Guide to Choosing Encoding Methods](#guide-to-choosing-encoding-methods)
5. [Splitting Criteria Explained](#splitting-criteria-explained)
   - [For Regression Tasks](#for-regression-tasks-eg-predicting-house-prices)
   - [Mean Squared Error](#mean-squared-error-mse)
   - [Evaluating Decision Points](#evaluating-decision-points-understanding-split-quality-in-decision-trees)
   - [Mean Squared Error vs Mean Absolute Error](#mean-squared-error-mse-vs-mean-absolute-error-mae)
   - [For Classification Tasks](#for-classification-tasks-eg-predicting-if-a-house-will-sell-quickly)
     - [Gini Impurity](#1-gini-impurity)
     - [Entropy](#2-entropy)
     - [Information Gain](#3-information-gain)
     - [Comparison: Splits with Different Information Gains](#comparison-splits-with-different-information-gains)
6. [Interpretability and Visualization](#interpretability-and-visualization)
   - [Why Interpretability Matters](#why-interpretability-matters)
   - [How to Interpret Decision Trees](#how-to-interpret-decision-trees)
7. [Understanding Bias, Variance, Tree Depth and Complexity](#understanding-bias-variance-tree-depth-and-complexity)
   - [Bias](#bias)
   - [Variance](#variance)
   - [Identifying the Bias/Variance Tradeoff](#identifying-the-biasvariance-tradeoff)
   - [Managing the Bias/Variance Tradeoff](#managing-the-biasvariance-tradeoff)
   - [Visual Indicators of Bias/Variance](#visual-indicators-of-biasvariance)
8. [Feature Importance and Advanced Capabilities](#feature-importance-and-advanced-capabilities)
   - [Feature Importance in Decision Trees](#feature-importance-in-decision-trees)
   - [Advanced Capabilities](#advanced-capabilities)
   - [Limitations and Solutions](#limitations-and-solutions)
   - [Practical Applications](#practical-applications)
9. [Limitations and Ethical Considerations](#limitations-and-ethical-considerations)
   - [Technical Limitations](#technical-limitations)
   - [Solutions and Mitigations](#solutions-and-mitigations)
   - [Ethical Considerations for Decision Tree Models](#ethical-considerations-for-decision-tree-models)
   - [Best Practices for Ethical Use](#4-best-practices-for-ethical-use)
10. [Theory Conclusion](#theory-conclusion)
    - [Core Concepts](#core-concepts)
    - [Data Handling and Model Characteristics](#data-handling-and-model-characteristics)
    - [Error Metrics and Evaluation](#error-metrics-and-evaluation)
    - [Next Steps](#next-steps)

## Introduction

Decision trees are a versatile machine learning model for both classification and regression tasks. In this lesson, we'll use decision trees to predict house prices based on features like location, size, and amenities.

Imagine you're a real estate agent trying to estimate the fair price of a house based on its characteristics. This is where decision trees can help. They learn a set of rules from historical data to make predictions on new, unseen houses.

Essentially, a decision tree is used to make predictions on the target variable - say price - by recursively splitting the data based on the values of the features, choosing splits that maximize the similarity of the target variable (prices) within each subset.

The result is a tree-like model of decisions and their consequences.

By the end of this lesson, you'll understand how decision trees work, how to train and interpret them, and how they compare to other models for regression tasks.

## Intuition Behind Decision Trees

Imagine you're trying to predict the price of a house based on its features. You might start by asking broad questions like "Is it in a desirable location?" and then progressively get more specific: "How many bedrooms does it have? What's the square footage?".

At each step, you're trying to split the houses into groups that are as similar as possible in terms of price. This is exactly how a decision tree works - it asks a series of questions about the features, each time trying to split the data into more homogeneous subsets.

### Why Choose Decision Trees for House Prices?

Decision trees are particularly well-suited for this task because of several key advantages that become apparent when comparing them to other popular algorithms:

1. **Working with Different Types of Data**
   While decision trees need numbers to make their calculations, they have elegant ways of handling different types of data:
   - Numerical: Price (£180,000 to £39,750,000), square footage (274 to 15,405 sq ft)
     - Used directly as they're already numbers
   - Categorical: Location ("Chelsea", "Hackney"), house type ("Flat", "House", "Penthouse")
     - Can be converted to numbers in smart ways:
       - One-hot encoding: Like giving each location its own yes/no column
       - Target encoding: Converting locations to average prices in that area
     - We'll explore these in detail later in this lesson
   - Ordinal: Number of bedrooms (1-10), bathrooms (1-10), receptions (1-10)
     - Already in a natural order, easy to use

2. **No Feature Scaling Required**
   Unlike many other algorithms, decision trees work with raw values directly. 
   
   Compare this to:
   - Linear/Logistic Regression: Benefits from scaling when regularization or gradient-based training is used, since features with larger values would otherwise dominate
   - Neural Networks: Need normalized inputs (often scaled to 0-1) for stable gradient descent
   - Support Vector Machines (SVM): Highly sensitive to feature scales; require standardization
   - K-Nearest Neighbors: Distance calculations are skewed by different scales; needs normalization

   The tree makes splits based on relative ordering, not absolute values. 
   
   For example, these splits are all equivalent to a decision tree:
   ```python
   # Original scale (Decision Tree works fine)
   if square_footage > 2000:
       predict_price = 1200000
   else:
       predict_price = 800000

   # Scaled by 1000 (needed for Neural Networks)
   if square_footage/1000 > 2:  # Same result for decision tree
       predict_price = 1200000
   else:
       predict_price = 800000

   # Standardized (needed for SVM)
   if (square_footage - mean)/std > 1.2:  # Same result for decision tree
       predict_price = 1200000
   else:
       predict_price = 800000
   ```

3. **Interpretable Decision Making**
   While algorithms like Neural Networks act as "black boxes" and Linear Regression gives abstract coefficients, decision trees create clear, actionable rules. Here's a simple example:
   ```python
   # The computer converts locations to simple yes/no questions
   if location_hackney == 1:  # Is it in Hackney?
       if square_footage > 1200:
           predict_price = "£950K"
       else:
           predict_price = "£650K"
   elif location_wimbledon == 1:  # Is it in Wimbledon?
       if bedrooms > 3:
           predict_price = "£1.2M"
       else:
           predict_price = "£800K"
   ```
   These rules are easy to explain to stakeholders, unlike trying to interpret neural network weights or SVM kernel transformations. The yes/no questions (location_hackney == 1) simply mean "Is this property in Hackney?" - a question anyone can understand!

4. **Handling Missing Data**
   Real estate data often has missing values. For example, some listings might not include the square footage or number of bathrooms.
   
   While most algorithms require these missing values to be filled in or removed, some decision tree implementations can handle missing data directly:
   - They can make predictions even when some feature values are unknown
   - They can use alternative features when a preferred feature is missing (surrogate splits)
   - They often maintain good accuracy even with incomplete information

   Support varies by library, so check whether yours accepts missing values before relying on this.

These advantages mean we can focus on understanding the relationships in our data rather than spending time on complicated data preprocessing. 

This makes decision trees an excellent choice for our house price prediction task, especially when interpretability and ease of use are priorities.
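
The claim in point 2, that rescaling a feature does not change the splits a tree chooses, can be checked with a small sketch. The helper `best_split_groups` below is a hypothetical, simplified stand-in for a tree's split search (one feature, squared-error criterion), and the six houses are made-up illustrative values, not data from this lesson.

```python
import statistics


def sse(values):
    """Sum of squared differences between each value and the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_groups(feature, target):
    """Return (left_rows, right_rows) for the threshold with the lowest total squared error."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_error, best_groups = None, None
    for pos in range(1, len(order)):
        if feature[order[pos - 1]] == feature[order[pos]]:
            continue  # equal values cannot be separated by a threshold
        left, right = order[:pos], order[pos:]
        error = sse([target[i] for i in left]) + sse([target[i] for i in right])
        if best_error is None or error < best_error:
            best_error, best_groups = error, (sorted(left), sorted(right))
    return best_groups


# Made-up example values
square_footage = [850, 1200, 1500, 2100, 2600, 3200]
price = [400_000, 520_000, 610_000, 1_150_000, 1_300_000, 1_500_000]

mean, std = statistics.mean(square_footage), statistics.pstdev(square_footage)
scaled = [x / 1000 for x in square_footage]                 # divide by 1000
standardized = [(x - mean) / std for x in square_footage]   # zero mean, unit variance

groups_raw = best_split_groups(square_footage, price)
groups_scaled = best_split_groups(scaled, price)
groups_standardized = best_split_groups(standardized, price)

# The same houses end up on each side of the split in all three cases
print(groups_raw == groups_scaled == groups_standardized)
```

Only the ordering of the feature values matters to the search, so any rescaling that preserves that ordering produces the same two groups; the threshold value changes, but the houses on each side do not.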



## Anatomy of a Decision Tree

A decision tree is composed of:

- Nodes: Where a feature is tested
- Edges: The outcomes of the test
- Leaves: Terminal nodes that contain the final predictions

A simplified example of a house prices prediction decision tree might look like this:

![structure of a house prices prediction decision tree](../static/house-prices-decision-tree-and-structure.png)

The tree is built by splitting the data recursively. At each step, the algorithm chooses the feature and the split point on that feature that give the greatest reduction in impurity or error. For example, the first split could be on "square footage" at 2,000 sq ft if, of all the candidate splits, that one reduces the error the most.
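
To make the node/edge/leaf vocabulary concrete, here is a minimal sketch of a tree stored as nested Python dictionaries, with a function that follows edges from the root down to a leaf. The structure, feature names, thresholds, and prices are all made up for illustration; real libraries store trees differently.

```python
# Internal nodes test one feature against a threshold; leaves hold a prediction.
tree = {
    "feature": "square_footage",
    "threshold": 2000,
    "left": {  # edge taken when square_footage <= 2000
        "feature": "bedrooms",
        "threshold": 2,
        "left": {"prediction": 450_000},   # leaf
        "right": {"prediction": 700_000},  # leaf
    },
    "right": {"prediction": 1_200_000},  # edge taken when square_footage > 2000 (leaf)
}


def predict(node, house):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while "prediction" not in node:
        branch = "left" if house[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


small_flat = {"square_footage": 900, "bedrooms": 1}
large_house = {"square_footage": 2500, "bedrooms": 4}

print(predict(tree, small_flat))
print(predict(tree, large_house))
```

Each prediction is a single path through the tree, which is why the rules along that path can be read back as an explanation.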



## Preparing Data for Decision Trees

Before we delve into how decision trees make split decisions, it's important to first understand what data we can use.

While decision trees can handle various types of data, we need to convert all features into numerical formats for training. This process is called encoding. 

Different types of features require different encoding approaches:

1. **Numerical Features**
   - Already in usable format (e.g., prices, areas)
   - No encoding needed

2. **Categorical Features**
   - Need conversion to numbers
   - Multiple encoding strategies available
   - Examples: locations, house types

3. **Ordinal Features**
   - Categories with natural order
   - Need to preserve order relationship
   - Example: size (small, medium, large)

4. **Binary Features**
   - Yes/no features
   - Simple 1/0 encoding
   - Example: has_parking, has_garden

Let's explore how to handle each type effectively, understanding the trade-offs and choosing the right approach for our data.

### Numerical Data

Numerical features provide a solid foundation for decision trees because they:
- Work directly without transformation
- Don't require scaling
- Can handle different value ranges
- Support both integers and floating-point numbers

Common numerical features in housing data:
- Price (e.g., £250,000)
- Square footage (e.g., 1,500 sq ft)
- Number of rooms (e.g., 3 bedrooms)
- Age of property (e.g., 25 years)


### Categorical Data

Categorical features are variables that take on a limited number of discrete values. In housing data, these might include:
- Location (Chelsea, Hackney, Mayfair)
- Property type (Flat, House, Penthouse)
- Style (Modern, Victorian, Georgian)

We have three main approaches for encoding categorical data:

1. **One-Hot Encoding**
   - Creates binary columns for each category
   - Best for low/medium cardinality - cardinality is the number of unique categories in a feature
   - Preserves all category information
   - No implied ordering

2. **Target Encoding**
   - Replaces categories with target statistics for each category, for example the mean price for each location
   - Best for features with high cardinality as one-hot encoding will explode the number of features
   - Two variants:
     - Simple (target statistic per category - for instance the mean price for each location)
     - Smoothed (statistic for the category balanced with global statistic)

3. **Binary Encoding**
   - For true yes/no features
   - Simple 1/0 conversion
   - Most memory efficient

Let's examine each approach in detail:

### One-Hot Encoding

One-hot encoding transforms categorical variables by:
- Creating a new binary column for each category
- Setting 1 where the category is present, 0 otherwise
- No information loss or ordering implied

**Ideal for:**
- Categorical variables with few unique values
- When memory isn't a constraint
- When interpretability is important

**Example:**
Property Type (Flat, House, Penthouse) becomes:
- property_type_flat: [1,0,0]
- property_type_house: [0,1,0]
- property_type_penthouse: [0,0,1]

Let's implement one-hot encoding:

In [None]:
import pandas as pd

# Create sample categorical data
data = {
    'property_type': ['Flat', 'House', 'Penthouse', 'Flat', 'House'],
    'location': ['Chelsea', 'Hackney', 'Chelsea', 'Putney', 'Chelsea']
}
df = pd.DataFrame(data)

# One-hot encode multiple columns
df_encoded = pd.get_dummies(df, prefix=['type', 'loc'])

print("Original data:")
print(df)
print("\nFully encoded data:")
print(df_encoded)



### Target Encoding

Target encoding replaces categorical values with statistics calculated from the target variable. For housing data, this means replacing each location with its average house price.

**Advantages:**
- Handles high cardinality efficiently
- Captures relationship with target variable
- Memory efficient
- Works well for decision trees

**Challenges:**
- Risk of overfitting
- Needs handling for rare categories
- Requires cross-validation
- Can leak target information: if each location is encoded with a mean price that was computed using the very rows being predicted, the feature already contains part of the answer, and the model looks better in evaluation than it really is. To avoid this, split the data into training and validation sets and compute the mean price per location from the training set only.

**Simple Target Encoding Example:**
```
Location   | Count | Avg Price
Chelsea    |   100 | £800,000
Hackney    |    50 | £500,000
Mayfair    |    10 | £2,000,000
```

Let's first look at basic target encoding before exploring smoothing:

In [None]:
import pandas as pd
import numpy as np

# Create sample data with clear price patterns
data = {
    'location': ['Chelsea', 'Chelsea', 'Chelsea', 'Hackney', 'Hackney',
                 'Mayfair', 'Chelsea', 'Hackney', 'Mayfair', 'Chelsea'],
    'price': [800000, 820000, 780000, 500000, 520000,
              2000000, 810000, 510000, 1900000, 790000]
}
df = pd.DataFrame(data)

# Simple mean encoding, setting the mean price for each location
location_means = df.groupby('location')['price'].mean()
df['location_encoded'] = df['location'].map(location_means)

# Show encoding results
print("Original data with encoding:")
summary = df.groupby('location').agg({
    'price': ['count', 'mean'],
    'location_encoded': 'first'
}).round(2)

print(summary)

# Demonstrate potential overfitting with rare categories
rare_data = df.copy()
# Create new row with all columns
new_row = pd.DataFrame({
    'location': ['Knightsbridge'],
    'price': [3000000],
    'location_encoded': [None]  # Will be updated after encoding
})
rare_data = pd.concat([rare_data, new_row], ignore_index=True)

# Encode including rare category
rare_means = rare_data.groupby('location')['price'].mean()
rare_data['location_encoded'] = rare_data['location'].map(rare_means)

print("\nEncoding with rare category:")
print(rare_data[rare_data['location'] == 'Knightsbridge'])

display(rare_data)

For a rare category such as "Knightsbridge", which has a single listing, simple target encoding assigns that listing's own price as the encoded value. This is a problem: the feature now contains the exact target for that row, so the model can memorise it and overfit rather than learn anything general about the location.

### Smoothed Target Encoding

Smoothed target encoding addresses the instability of simple target encoding by balancing between:
- The category's mean (which might be unstable)
- The global mean (which is stable but loses category information)

The smoothing formula is:
```
smoothed_value = (n × category_mean + α × global_mean) / (n + α)
```
Where:
- n = number of samples in the category
- α = smoothing factor
- category_mean = mean price for the location
- global_mean = mean price across all locations

**Effect of Smoothing Factor (α):**
- Large n (many samples):
  - (n >> α) → result close to category mean
  - Example: n=100, α=10 → mostly category mean
- Small n (few samples):
  - (n << α) → result close to global mean
  - Example: n=2, α=10 → mostly global mean

This balancing act helps prevent overfitting while preserving useful category information.
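
A quick worked example of the formula may help before the full implementation. The sketch below plugs the two cases from the list above (n=100 and n=2, both with α=10) into the smoothing formula; the two mean prices are hypothetical values chosen for illustration.

```python
def smoothed_mean(n, category_mean, global_mean, alpha):
    """Blend a category mean with the global mean.

    alpha behaves like `alpha` extra samples priced at the global mean.
    """
    return (n * category_mean + alpha * global_mean) / (n + alpha)


global_mean = 800_000      # hypothetical mean price across all locations
category_mean = 2_000_000  # hypothetical mean price for one location

well_sampled = smoothed_mean(100, category_mean, global_mean, alpha=10)
rare = smoothed_mean(2, category_mean, global_mean, alpha=10)

print(f"n=100: £{well_sampled:,.0f}")  # stays close to the category mean
print(f"n=2:   £{rare:,.0f}")          # pulled most of the way to the global mean
```

With 100 samples the category mean carries 100/110 of the weight, so the result (about £1.89M) barely moves. With 2 samples it carries only 2/12 of the weight, so the result (£1.0M) sits much nearer the global mean.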

In [None]:
import pandas as pd
import numpy as np

def smoothed_target_encode(df, column, target, alpha=10):
    """
    Apply smoothed target encoding
    
    Parameters:
    - df: DataFrame
    - column: Category column name
    - target: Target variable name
    - alpha: Smoothing factor
    """
    # Calculate global mean
    global_mean = df[target].mean()
    
    # Calculate category stats
    category_stats = df.groupby(column).agg({
        target: ['count', 'mean']
    }).reset_index()
    category_stats.columns = [column, 'count', 'mean']
    
    # Apply smoothing
    category_stats['smoothed_mean'] = (
        (category_stats['count'] * category_stats['mean'] + alpha * global_mean) /
        (category_stats['count'] + alpha)
    )
    
    return dict(zip(category_stats[column], category_stats['smoothed_mean']))

# Create sample data with varying category frequencies
data = {
    'location': ['Chelsea'] * 50 + ['Hackney'] * 20 + ['Mayfair'] * 5 + ['Putney'] * 2,
    'price': ([800000 + np.random.randn() * 50000 for _ in range(50)] +  # Chelsea
              [500000 + np.random.randn() * 30000 for _ in range(20)] +  # Hackney
              [2000000 + np.random.randn() * 100000 for _ in range(5)] + # Mayfair
              [600000 + np.random.randn() * 40000 for _ in range(2)])    # Putney
}
df = pd.DataFrame(data)

# Compare different smoothing levels
alphas = [0, 5, 20, 100]
results = pd.DataFrame()

for alpha in alphas:
    encoded_values = smoothed_target_encode(df, 'location', 'price', alpha)
    results[f'alpha_{alpha}'] = df['location'].map(encoded_values)

# Add original mean for comparison
original_means = df.groupby('location')['price'].mean()
results['original_mean'] = df['location'].map(original_means)
results['location'] = df['location']
results['count'] = df.groupby('location')['price'].transform('count')

# Show results for one location from each frequency group
print("Effect of smoothing by location frequency:")
for loc in ['Chelsea', 'Hackney', 'Mayfair', 'Putney']:
    sample = results[results['location'] == loc].iloc[0]
    print(f"\n{loc} (n={int(sample['count'])})")
    print(f"Original mean:  £{sample['original_mean']:,.0f}")
    for alpha in alphas:
        print(f"Alpha {alpha:3d}:      £{sample[f'alpha_{alpha}']:,.0f}")

### Practical Guide to Smoothed Encoding

**Choosing α (Smoothing Factor):**

1. **Low α (1-5)**
   - Minimal smoothing
   - Use when categories are very distinct
   - Good with large sample sizes
   - Risk: Might not handle rare categories well

2. **Medium α (10-20)**
   - Balanced smoothing
   - Good default choice
   - Works well with mixed sample sizes
   - Provides some protection against outliers

3. **High α (50+)**
   - Heavy smoothing
   - Use with many rare categories
   - Good for noisy data
   - Risk: Might lose category signal

**Best Practices:**

1. **Cross-Validation**
   - Compute encoding using only training data
   - Apply those mappings to validation/test data
   - Never peek at test set statistics

2. **Category Analysis**
   - Check sample size distribution
   - Consider higher α for skewed distributions
   - Monitor rare categories carefully

3. **Domain Knowledge**
   - Use business context to validate encodings
   - Watch for unexpected category relationships
   - Consider grouping related rare categories
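
The cross-validation practice above can be sketched in plain Python: learn the smoothed means from training rows only, then reuse that mapping on validation rows. The function names and the tiny dataset are hypothetical, and falling back to the training global mean for unseen categories is an assumption of this sketch, not something the lesson prescribes.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, alpha=10):
    """Learn smoothed category means from training rows only.

    Returns (mapping, global_mean); global_mean is the fallback for unseen categories.
    """
    global_mean = sum(targets) / len(targets)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    # totals[category] equals n * category_mean in the smoothing formula
    mapping = {
        category: (totals[category] + alpha * global_mean) / (counts[category] + alpha)
        for category in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode any split with the training mapping; unseen categories get the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


# Made-up training and validation rows
train_locations = ["Chelsea", "Chelsea", "Hackney", "Hackney", "Mayfair"]
train_prices = [800_000, 820_000, 500_000, 520_000, 2_000_000]
valid_locations = ["Hackney", "Knightsbridge"]  # Knightsbridge never appears in training

mapping, global_mean = fit_target_encoding(train_locations, train_prices, alpha=10)
valid_encoded = apply_target_encoding(valid_locations, mapping, global_mean)
print(valid_encoded)
```

Validation prices are never passed to either function, so they cannot influence the encoding.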

### Ordinal and Binary Features

Ordinal and binary features are simpler to handle than general categorical features, but proper encoding is still important.

**Ordinal Features**
- Have a natural order between categories
- Examples:
  - Property condition (Poor → Fair → Good → Excellent)
  - Size category (Small → Medium → Large)
  - Building quality (Basic → Standard → Luxury)

**Binary Features**
- Have exactly two possible values
- Examples:
  - Has parking (Yes/No)
  - Is new build (Yes/No)
  - Has garden (Yes/No)

These features are simpler because:
1. Ordinal features maintain their order relationship
2. Binary features need only two values (0/1)

Let's look at how to encode these properly:

In [None]:
import pandas as pd
import numpy as np

# Create sample data with ordinal and binary features
data = {
    'condition': ['Poor', 'Good', 'Excellent', 'Fair', 'Good'],
    'size_category': ['Small', 'Medium', 'Large', 'Small', 'Large'],
    'has_parking': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'is_new_build': [True, False, True, False, True]
}
df = pd.DataFrame(data)

# Ordinal encoding using mapping
condition_map = {
    'Poor': 0,
    'Fair': 1,
    'Good': 2,
    'Excellent': 3
}

size_map = {
    'Small': 0,
    'Medium': 1,
    'Large': 2
}

# Apply ordinal encoding
df['condition_encoded'] = df['condition'].map(condition_map)
df['size_encoded'] = df['size_category'].map(size_map)

# Binary encoding
df['parking_encoded'] = (df['has_parking'] == 'Yes').astype(int)
df['new_build_encoded'] = df['is_new_build'].astype(int)

print("Original and encoded data:")
print(df)

# Demonstrate mapping preservation
print("\nCondition value ordering:")
for condition, value in sorted(condition_map.items(), key=lambda x: x[1]):
    print(f"{condition}: {value}")

print("\nSize category ordering:")
for size, value in sorted(size_map.items(), key=lambda x: x[1]):
    print(f"{size}: {value}")

# Memory usage comparison (deep=True counts the string contents, not just the pointers)
print("\nMemory usage comparison:")
print(f"Original condition column: {df['condition'].memory_usage(deep=True)} bytes")
print(f"Encoded condition column: {df['condition_encoded'].memory_usage(deep=True)} bytes")

### Combining Different Encoding Methods

Real datasets usually require multiple encoding approaches. Let's create a complete example that:
1. Handles numerical features directly
2. One-hot encodes low-cardinality categoricals
3. Target encodes high-cardinality categoricals
4. Ordinally encodes ordered categories
5. Binary encodes yes/no features

This represents a typical data preparation pipeline for a housing dataset.

In [None]:
import pandas as pd
import numpy as np

# Create a realistic housing dataset
data = {
    # Numerical features
    'price': np.random.normal(800000, 200000, 100),
    'square_feet': np.random.normal(1500, 300, 100),
    'bedrooms': np.random.randint(1, 6, 100),
    
    # Low-cardinality categorical (one-hot encode)
    'property_type': np.random.choice(['Flat', 'House', 'Penthouse'], 100),
    
    # High-cardinality categorical (target encode)
    'location': np.random.choice([
        'Chelsea', 'Hackney', 'Mayfair', 'Putney', 'Richmond',
        'Hampstead', 'Islington', 'Brixton', 'Camden', 'Greenwich'
    ], 100),
    
    # Ordinal features
    'condition': np.random.choice(['Poor', 'Fair', 'Good', 'Excellent'], 100),
    
    # Binary features
    'has_parking': np.random.choice(['Yes', 'No'], 100),
    'is_new_build': np.random.choice([True, False], 100)
}

df = pd.DataFrame(data)

class HousingEncoder:
    """Complete encoding pipeline for housing data"""
    
    def __init__(self, alpha=10):
        self.alpha = alpha
        self.encoders = {}
        self.target_stats = {}
    
    def fit_transform(self, df, target_column='price'):
        df_encoded = pd.DataFrame()
        
        # 1. Keep numerical features as is
        numerical_features = ['square_feet', 'bedrooms']
        df_encoded[numerical_features] = df[numerical_features]
        
        # 2. One-hot encode low-cardinality categorical
        onehot_features = ['property_type']
        onehot_encoded = pd.get_dummies(df[onehot_features])
        df_encoded = pd.concat([df_encoded, onehot_encoded], axis=1)
        
        # 3. Target encode high-cardinality categorical
        self.target_stats = self._compute_target_encoding(
            df, 'location', target_column
        )
        df_encoded['location_encoded'] = df['location'].map(self.target_stats)
        
        # 4. Ordinal encode ordered categories
        condition_map = {
            'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3
        }
        df_encoded['condition_encoded'] = df['condition'].map(condition_map)
        
        # 5. Binary encode yes/no features
        df_encoded['has_parking'] = (df['has_parking'] == 'Yes').astype(int)
        df_encoded['is_new_build'] = df['is_new_build'].astype(int)
        
        return df_encoded
    
    def _compute_target_encoding(self, df, column, target):
        """Compute smoothed target encoding"""
        global_mean = df[target].mean()
        stats = df.groupby(column).agg({
            target: ['count', 'mean']
        }).reset_index()
        stats.columns = [column, 'count', 'mean']
        
        # Apply smoothing
        stats['smoothed_mean'] = (
            (stats['count'] * stats['mean'] + self.alpha * global_mean) /
            (stats['count'] + self.alpha)
        )
        
        return dict(zip(stats[column], stats['smoothed_mean']))

# Apply encoding
encoder = HousingEncoder(alpha=10)
df_encoded = encoder.fit_transform(df)

# Display results
print("Original data sample:")
display(df)

# print("\nFeature summary:")
# print("\nNumerical features:", df_encoded.select_dtypes(include=[np.number]).columns.tolist())
print("\nShape before encoding:", df.shape)
print("Shape after encoding:", df_encoded.shape)

display(df_encoded)

### Guide to Choosing Encoding Methods

#### Decision Framework

1. **For Numerical Features**
   - Use directly without encoding
   - No scaling needed for decision trees
   - Consider creating derived features if meaningful

2. **For Categorical Features**
   - **Use One-Hot Encoding when:**
     - Few unique categories (<30)
     - No natural order
     - Memory isn't constrained
     - Need model interpretability

   - **Use Target Encoding when:**
     - Many unique categories (30+)
     - Strong relationship with target
     - Memory is constrained
     - Have sufficient samples per category

3. **For Ordinal Features**
   - Use ordinal encoding when clear order exists
   - Maintain order relationship
   - Document ordering logic

4. **For Binary Features**
   - Always use simple 1/0 encoding
   - Consistent encoding for Yes/No values
   - Consider combining related binary features
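
As a rough illustration, the decision framework above can be written as a small helper. `suggest_encoding` is a hypothetical function for this lesson only; the 30-category cut-off comes from the rules of thumb above, and the check order (numerical, then binary, then ordinal, then cardinality) is one reasonable reading of them rather than a fixed rule.

```python
def suggest_encoding(values, ordered=False, max_onehot_categories=30):
    """Suggest an encoding for one column, following the rules of thumb above."""
    distinct = set(values)
    # bool is a subclass of int in Python, so exclude it from the numeric check
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        return "numerical (use as is)"
    if len(distinct) == 2:
        return "binary (1/0)"
    if ordered:
        return "ordinal"
    if len(distinct) < max_onehot_categories:
        return "one-hot"
    return "target"


print(suggest_encoding([1500, 2300.5, 980]))
print(suggest_encoding(["Yes", "No", "Yes"]))
print(suggest_encoding(["Poor", "Fair", "Good", "Excellent"], ordered=True))
print(suggest_encoding(["Flat", "House", "Penthouse"]))
print(suggest_encoding([f"area_{i}" for i in range(40)]))
```

Whether a column is ordered cannot be inferred from the values alone, which is why `ordered` is passed in by the caller.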

#### Best Practices

1. **Data Quality**
   - Handle missing values before encoding
   - Check for rare categories
   - Validate category relationships

2. **Cross-Validation**
   - Compute encodings only on training data
   - Apply same encodings to validation/test
   - Never leak target information

3. **Memory & Performance**
   - Monitor memory usage for one-hot encoding
   - Use target encoding for high-cardinality
   - Consider feature importance in selection

4. **Documentation**
   - Document encoding decisions
   - Save encoding mappings
   - Track feature transformations
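
One simple way to save encoding mappings is to write them to a JSON file next to the model, so the same mappings can be reloaded at prediction time. This is only a sketch: the file name is hypothetical, and the mappings mirror the ordinal and binary examples earlier in this section.

```python
import json
import os
import tempfile

# Mappings of the kind built in the ordinal and binary examples
encoding_mappings = {
    "condition": {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3},
    "has_parking": {"No": 0, "Yes": 1},
}

# A temporary folder keeps the example self-contained; use a project path in practice
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "encoding_mappings.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(encoding_mappings, f, indent=2)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == encoding_mappings)
```

JSON is human-readable, so the saved file also serves as documentation of the encoding decisions.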

Remember: The goal is to balance information preservation, model performance, and practical constraints.

## Splitting Criteria Explained

To build a decision tree, we need a way to determine the best feature and value to split on at each node. 

The goal is to create child nodes that are more "pure" or homogeneous than their parent node. The method for measuring this purity and choosing the best split differs between regression and classification tasks.

### For Regression Tasks (e.g., Predicting House Prices):

In regression problems, we're trying to predict a continuous value, like house prices. The goal is to split the data in a way that minimizes the variance of the target variable within each resulting group.

The most common metric used for regression trees is the Mean Squared Error (MSE). This is the default criterion used by scikit-learn's DecisionTreeRegressor. Let's break down how this works:

Imagine you're a real estate agent with a magical ability to instantly sort houses. Your goal? To group similar houses together as efficiently as possible. This is essentially what a decision tree does, but instead of magical powers, it uses mathematics. Let's dive in!

#### Mean Squared Error (MSE)

Imagine you're playing a house price guessing game. Your goal is to guess the prices of houses as accurately as possible.

Let's say we have 5 houses, and their actual prices are:
```pre
House 1: £200,000
House 2: £250,000
House 3: £180,000
House 4: £220,000
House 5: £300,000
```

#### Step 1: Calculate the average price
`(200,000 + 250,000 + 180,000 + 220,000 + 300,000) / 5 = £230,000`

So, your guess for any house would be £230,000.

#### Step 2: Calculate how wrong you are for each house
```pre
House 1: 230,000 - 200,000 = 30,000 
House 2: 230,000 - 250,000 = -20,000
House 3: 230,000 - 180,000 = 50,000
House 4: 230,000 - 220,000 = 10,000
House 5: 230,000 - 300,000 = -70,000
```

#### Step 3: Square these differences
```pre
House 1: 30,000² = 900,000,000
House 2: (-20,000)² = 400,000,000
House 3: 50,000² = 2,500,000,000
House 4: 10,000² = 100,000,000
House 5: (-70,000)² = 4,900,000,000
```

#### Step 4: Add up all these squared differences
`900,000,000 + 400,000,000 + 2,500,000,000 + 100,000,000 + 4,900,000,000 = 8,800,000,000`

#### Step 5: Divide by the number of houses

`8,800,000,000 ÷ 5 = 1,760,000,000`

This final number, 1,760,000,000, is your Mean Squared Error (MSE).

In mathematical notation, this whole process looks like:

$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y})^2$

Let's break this down:
- $n$ is the number of houses (5 in our example)
- $y_i$ is the actual price of each house
- $\hat{y}$ is your guess (the average price, £230,000 in our example)
- $\sum_{i=1}^n$ means "add up the following calculation for each house from the first to the last"
- The $i$ in $y_i$ is just a counter, going from 1 to $n$ (1 to 5 in our example)

As a Python function, this would look like:

In [None]:
def calculate_mse(actual_prices, predicted_price):
    n = len(actual_prices)
    squared_errors = []
    
    for actual_price in actual_prices:
        error = predicted_price - actual_price
        squared_error = error ** 2
        squared_errors.append(squared_error)
    
    mse = sum(squared_errors) / n
    return mse

# Example usage
actual_prices = [200000, 250000, 180000, 220000, 300000]
predicted_price = sum(actual_prices) / len(actual_prices)  # Average price

mse = calculate_mse(actual_prices, predicted_price)
print(f"Mean Squared Error: {mse:.2f}")
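
This also ties back to the earlier statement that regression trees minimize the variance within each group. When the guess is the group's mean, the MSE is the same number as the population variance of the prices, and any other single guess gives a larger error. The short check below uses the five house prices from the example; `mse_for_guess` is a helper name introduced here for illustration.

```python
import statistics

actual_prices = [200000, 250000, 180000, 220000, 300000]


def mse_for_guess(prices, guess):
    """Mean squared error when the same guess is used for every house."""
    return sum((price - guess) ** 2 for price in prices) / len(prices)


mean_price = statistics.mean(actual_prices)
mse_at_mean = mse_for_guess(actual_prices, mean_price)
variance = statistics.pvariance(actual_prices)  # population variance (divides by n)
mse_other_guess = mse_for_guess(actual_prices, 220000)

print(f"MSE using the mean as the guess: {mse_at_mean:,.0f}")
print(f"Population variance of prices:   {variance:,.0f}")
print(f"MSE using £220,000 as the guess: {mse_other_guess:,.0f}")
```

This is why a regression tree predicts the mean price of the houses in each leaf: no other single value gives a lower squared error for that group.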

### Evaluating Decision Points: Understanding Split Quality in Decision Trees

Now, when we split our houses into two groups, we want to measure if this split has made our predictions better. We do this by comparing the error before and after splitting using this formula:

$\Delta MSE = MSE_{before} - \left(\text{fraction of houses in left group} \times MSE_{left} + \text{fraction of houses in right group} \times MSE_{right}\right)$

Let's work through a real example to understand this:

Imagine we have 5 houses with these prices:
```pre
House 1: £200,000
House 2: £250,000
House 3: £180,000
House 4: £220,000
House 5: £300,000
```

We're considering splitting these houses based on whether they have more than 2 bedrooms:
- Left group (≤2 bedrooms): Houses 1, 3 (£200,000, £180,000)
- Right group (>2 bedrooms): Houses 2, 4, 5 (£250,000, £220,000, £300,000)
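
Before working through the arithmetic by hand, here is a minimal sketch of the same calculation in Python, which can be used to check each step. The function names `mse` and `mse_reduction` are chosen for this example; the prices and the grouping are the ones listed above.

```python
def mse(prices):
    """Mean squared error when every house is predicted at the group's mean price."""
    mean = sum(prices) / len(prices)
    return sum((price - mean) ** 2 for price in prices) / len(prices)


def mse_reduction(left, right):
    """MSE before the split minus the size-weighted MSE of the two groups after it."""
    parent = left + right
    n = len(parent)
    weighted_child_mse = (len(left) / n) * mse(left) + (len(right) / n) * mse(right)
    return mse(parent) - weighted_child_mse


left_group = [200000, 180000]            # houses with <=2 bedrooms
right_group = [250000, 220000, 300000]   # houses with >2 bedrooms

print(f"MSE before split: {mse(left_group + right_group):,.0f}")
print(f"MSE left group:   {mse(left_group):,.0f}")
print(f"MSE right group:  {mse(right_group):,.0f}")
print(f"MSE reduction:    {mse_reduction(left_group, right_group):,.0f}")
```

A tree-building algorithm would run this kind of calculation for every candidate split and keep the one with the largest reduction.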

#### 1. First, let's calculate $MSE_{before}$
```pre
Mean price = (200k + 250k + 180k + 220k + 300k) ÷ 5 = £230,000

Squared differences from mean:
House 1: (230k - 200k)² = 900,000,000
House 2: (230k - 250k)² = 400,000,000
House 3: (230k - 180k)² = 2,500,000,000
House 4: (230k - 220k)² = 100,000,000
House 5: (230k - 300k)² = 4,900,000,000

MSE_before = (900M + 400M + 2,500M + 100M + 4,900M) ÷ 5
           = 1,760,000,000
```

#### 2. Now for the left group (≤2 bedrooms):
```pre
Mean price = (200k + 180k) ÷ 2 = £190,000

Squared differences:
House 1: (190k - 200k)² = 100,000,000
House 3: (190k - 180k)² = 100,000,000

MSE_left = (100M + 100M) ÷ 2 = 100,000,000
```

#### 3. And the right group (>2 bedrooms):
```pre
Mean price = (250k + 220k + 300k) ÷ 3 = £256,667

Squared differences:
House 2: (256,667 - 250,000)² ≈ 44,444,444
House 4: (256,667 - 220,000)² ≈ 1,344,444,444
House 5: (256,667 - 300,000)² ≈ 1,877,777,778
(computed with the unrounded mean, £256,666.67)

MSE_right = (44.44M + 1,344.44M + 1,877.78M) ÷ 3 ≈ 1,088,888,889
```

#### 4. Finally, let's put it all together:
```pre
ΔMSE = MSE_before - ((2/5 × MSE_left) + (3/5 × MSE_right))
```
The second part calculates our weighted mean MSE after splitting:

- Left group has 2/5 of the houses, so we multiply its MSE by 2/5
- Right group has 3/5 of the houses, so we multiply its MSE by 3/5

This weighting ensures each house contributes equally to our final calculation.

Let's solve it:
```pre
     = 1,760,000,000 - ((2/5 × 100,000,000) + (3/5 × 1,088,888,889))
     = 1,760,000,000 - (40,000,000 + 653,333,333)
     = 1,760,000,000 - 693,333,333        # This is our weighted mean MSE after splitting
     = 1,066,666,667                      # ΔMSE: The reduction in prediction error
```
The ΔMSE (1,066,666,667) represents the difference between the original MSE and the weighted average MSE after splitting. This number is always non-negative due to a fundamental property of squared errors:

1. MSE is never negative (it is an average of squared differences from the mean)
2. When we split a group:
   - The parent uses one mean for all samples
   - Each subgroup uses its own mean, which minimises squared errors for that subgroup
   - The subgroup means must perform at least as well as the parent mean (due to minimising squared errors locally)
   - Therefore, the weighted average MSE of subgroups cannot exceed the parent MSE

Therefore:
- ΔMSE > 0 means the split has improved predictions (as in our case)
- ΔMSE = 0 means the split makes no difference
- ΔMSE < 0 is mathematically impossible


The larger the ΔMSE, the more effective the split is at creating subgroups with similar house prices. Here the split removes about 1.07 billion of the original 1.76 billion MSE, roughly 61%, which indicates a very effective split.
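
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch: the helper names `mse_around_mean` and `delta_mse` are illustrative choices, not functions from any library.

```python
def mse_around_mean(prices):
    """MSE when every house in the group is predicted at the group mean."""
    mean_price = sum(prices) / len(prices)
    return sum((price - mean_price) ** 2 for price in prices) / len(prices)


def delta_mse(parent, left, right):
    """Parent MSE minus the size-weighted MSE of the two child groups."""
    n = len(parent)
    weighted_after = (
        (len(left) / n) * mse_around_mean(left)
        + (len(right) / n) * mse_around_mean(right)
    )
    return mse_around_mean(parent) - weighted_after


all_prices = [200_000, 250_000, 180_000, 220_000, 300_000]
left_group = [200_000, 180_000]            # Houses 1 and 3 (<=2 bedrooms)
right_group = [250_000, 220_000, 300_000]  # Houses 2, 4 and 5 (>2 bedrooms)

mse_before = mse_around_mean(all_prices)
reduction = delta_mse(all_prices, left_group, right_group)

print(f"MSE before split: {mse_before:,.0f}")
print(f"ΔMSE:             {reduction:,.0f}")
print(f"Share of original error removed: {reduction / mse_before:.1%}")
```

Any small difference from a hand calculation comes from rounding the right-group mean before squaring.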

### A simplified decision tree algorithm in Python:
In practice, you'd use a library like `sklearn` to build a decision tree, but here's a simplified version in Python to illustrate the concept:

In [None]:
import numpy as np
from typing import List, Dict, Any

class House:
    def __init__(self, features: Dict[str, float], price: float):
        self.features = features
        self.price = price

def find_best_split(houses: List[House], feature: str) -> tuple:
    values = sorted(set(house.features[feature] for house in houses))
    
    best_split = None
    best_delta_mse = float('-inf')

    for i in range(len(values) - 1):
        split_point = (values[i] + values[i+1]) / 2
        left = [h for h in houses if h.features[feature] < split_point]
        right = [h for h in houses if h.features[feature] >= split_point]

        if len(left) == 0 or len(right) == 0:
            continue

        mse_before = np.var([h.price for h in houses])
        mse_left = np.var([h.price for h in left])
        mse_right = np.var([h.price for h in right])

        delta_mse = mse_before - (len(left)/len(houses) * mse_left + len(right)/len(houses) * mse_right)

        if delta_mse > best_delta_mse:
            best_delta_mse = delta_mse
            best_split = split_point

    return best_split, best_delta_mse

def build_tree(houses: List[House], depth: int = 0, max_depth: int = 3) -> Dict[str, Any]:
    if depth == max_depth or len(houses) < 2:
        return {'type': 'leaf', 'value': np.mean([h.price for h in houses])}

    features = houses[0].features.keys()
    best_feature = None
    best_split = None
    best_delta_mse = float('-inf')

    for feature in features:
        split, delta_mse = find_best_split(houses, feature)
        if delta_mse > best_delta_mse:
            best_feature = feature
            best_split = split
            best_delta_mse = delta_mse

    if best_feature is None:
        return {'type': 'leaf', 'value': np.mean([h.price for h in houses])}

    left = [h for h in houses if h.features[best_feature] < best_split]
    right = [h for h in houses if h.features[best_feature] >= best_split]

    return {
        'type': 'node',
        'feature': best_feature,
        'split': best_split,
        'left': build_tree(left, depth + 1, max_depth),
        'right': build_tree(right, depth + 1, max_depth)
    }

def predict(tree: Dict[str, Any], house: House) -> float:
    if tree['type'] == 'leaf':
        return tree['value']
    
    if house.features[tree['feature']] < tree['split']:
        return predict(tree['left'], house)
    else:
        return predict(tree['right'], house)

# Example usage
houses = [
    House({'bedrooms': 2, 'area': 80, 'distance_to_tube': 15}, 200),
    House({'bedrooms': 3, 'area': 120, 'distance_to_tube': 10}, 250),
    House({'bedrooms': 2, 'area': 75, 'distance_to_tube': 20}, 180),
    House({'bedrooms': 3, 'area': 100, 'distance_to_tube': 5}, 220),
    House({'bedrooms': 4, 'area': 150, 'distance_to_tube': 2}, 300),
    House({'bedrooms': 3, 'area': 110, 'distance_to_tube': 12}, 240),
    House({'bedrooms': 2, 'area': 70, 'distance_to_tube': 25}, 190),
    House({'bedrooms': 4, 'area': 140, 'distance_to_tube': 8}, 280),
    House({'bedrooms': 3, 'area': 130, 'distance_to_tube': 6}, 260),
    House({'bedrooms': 2, 'area': 85, 'distance_to_tube': 18}, 210)
]

tree = build_tree(houses)

def print_tree(node, indent=""):
    if node['type'] == 'leaf':
        print(f"{indent}Predict price: £{node['value']:.2f}k")
    else:
        print(f"{indent}{node['feature']} < {node['split']:.2f}")
        print(f"{indent}If True:")
        print_tree(node['left'], indent + "  ")
        print(f"{indent}If False:")
        print_tree(node['right'], indent + "  ")

print_tree(tree)

# Test prediction
new_house = House({'bedrooms': 3, 'area': 105, 'distance_to_tube': 7}, 0)  # price set to 0 as it's unknown
predicted_price = predict(tree, new_house)
print(f"\nPredicted price for new house: £{predicted_price:.2f}k")
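
For comparison, here is a hedged sketch of the same task using scikit-learn's `DecisionTreeRegressor`. It assumes scikit-learn is installed and reuses the ten example houses as plain lists. The fitted tree may not match the hand-written one split for split, because scikit-learn tests `feature <= threshold` while the code above tests `feature < split`, and ties between equally good splits can be broken differently.

```python
from sklearn.tree import DecisionTreeRegressor

# Same ten houses as above: [bedrooms, area, distance_to_tube], price in £k
X = [
    [2, 80, 15], [3, 120, 10], [2, 75, 20], [3, 100, 5], [4, 150, 2],
    [3, 110, 12], [2, 70, 25], [4, 140, 8], [3, 130, 6], [2, 85, 18],
]
y = [200, 250, 180, 220, 300, 240, 190, 280, 260, 210]

# max_depth=3 mirrors the depth limit used in build_tree;
# the default splitting criterion is squared error (MSE)
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)

new_house_features = [[3, 105, 7]]
sk_prediction = model.predict(new_house_features)[0]
print(f"scikit-learn prediction for new house: £{sk_prediction:.2f}k")
```

Because each leaf predicts the mean price of its training houses, the prediction always lies between the cheapest and the most expensive house in the training data.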

### Mean Squared Error (MSE) vs Mean Absolute Error (MAE)

When evaluating our decision tree's performance, we need to understand the difference between training metrics and evaluation metrics.

![mean-squared-error-mean-absolute-error](../static/mean-squared-error-mean-absolute-error.png)

Our decision tree algorithm uses MSE as the splitting criterion but measures final performance using MAE. 

Here's why we use these different metrics:

##### 1. Mean Squared Error (MSE)

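   **Calculation:** the average of (predicted house price - actual house price)² across all houses

   For example, if we predict £200,000 for a house that's actually worth £150,000, the error is £50,000 and the squared error is £50,000² = 2.5 billion (in squared pounds). MSE is the mean of these squared errors over all houses.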

   **Visualisation**

   If we plot how wrong our house price prediction is (like £50,000 too high or -£50,000 too low) on the x-axis, and plot the squared value of this error (like £2.5 billion) on the y-axis, we get a U-shaped curve. Because MSE squares the errors, it gives more weight to data points that are further from the mean, making it a good measure of variance within groups.

   **Purpose**

   The decision tree uses MSE to decide where to split data because minimizing MSE is equivalent to minimizing the variance within each group, which helps find splits that create distinct groups of house prices.

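##### 2. Mean Absolute Error (MAE)

   **Calculation:** the average of |predicted house price - actual house price| across all houses

   Using the same example, if we predict £200,000 for a £150,000 house, the absolute error is |£50,000| = £50,000. MAE is the mean of these absolute errors over all houses.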

   **Visualisation**

   If we plot how wrong our prediction is on the x-axis (like £50,000 too high or -£50,000 too low), and plot the absolute value of this error on the y-axis (always positive, like £50,000), we get a V-shaped curve

   **Purpose**
   
   We use MAE to evaluate our final model because it's easier to understand - it directly tells us how many pounds we're off by on average

The decision tree uses MSE's mathematical properties to make splitting decisions, but we report MAE because "off by £50,000 on average" makes more sense than "off by £2.5 billion squared pounds"!

Here's an example to illustrate the difference:

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error 
y_true = [100, 200, 300]
y_pred = [90, 210, 320]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")

Output:

```pre
Mean Squared Error: 200.00
Mean Absolute Error: 13.33
```

In this example, MSE and MAE provide different views of the error. MSE is more sensitive to the larger error (20) in the third prediction, while MAE treats all errors equally.

For house price prediction, MAE is often preferred as it directly translates to the average error in pounds. However, MSE is still commonly used as a splitting criterion in decision trees because minimizing MSE helps create groups with similar target values by minimizing the variance within each group.
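
To see why MSE reacts more strongly to the largest error, the sketch below breaks the same three predictions into their individual contributions. It uses only plain Python, and the variable names are illustrative.

```python
y_true = [100, 200, 300]
y_pred = [90, 210, 320]

errors = [pred - true for pred, true in zip(y_pred, y_true)]  # [-10, 10, 20]
squared = [e ** 2 for e in errors]
absolute = [abs(e) for e in errors]

mse = sum(squared) / len(errors)
mae = sum(absolute) / len(errors)

# Share of the total that each prediction contributes under each metric
for e, sq, ab in zip(errors, squared, absolute):
    print(
        f"error {e:>4}: {sq / sum(squared):.0%} of total squared error, "
        f"{ab / sum(absolute):.0%} of total absolute error"
    )

print(f"MSE = {mse:.2f}, MAE = {mae:.2f}")
```

The error of 20 accounts for half of the total absolute error but two thirds of the total squared error, which is the extra sensitivity described above.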

### For Classification Tasks (e.g., Predicting if a House Will Sell Quickly):

In classification problems, we're trying to predict a categorical outcome, like whether a house will sell quickly or not. The goal is to split the data in a way that maximizes the "purity" of the classes within each resulting group.

There are several metrics used for classification trees, with the most common being Gini Impurity and Entropy. These metrics measure how mixed the classes are within a group.

Let's explore how different distributions of marbles affect our measures of impurity. We will then explore information gain, a measure used in conjunction with impurity metrics to decide how to split the data.

We'll use red marbles to represent quick-selling houses and blue marbles for slow-selling houses.

#### 1. Gini Impurity:
   Gini Impurity measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution in the set.

   Formula: $Gini = 1 - \sum_{i=1}^{c} (p_i)^2$

   Where $c$ is the number of classes and $p_i$ is the proportion of items in the set that belong to class $i$.

   Let's compare three scenarios:

```pre
   a) 10 marbles: 7 red, 3 blue
      Fraction of red = 7/10 = 0.7
      Fraction of blue = 3/10 = 0.3
      
      Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42
```

```pre
   b) 10 marbles: 5 red, 5 blue
      Fraction of red = 5/10 = 0.5
      Fraction of blue = 5/10 = 0.5
      
      Gini = 1 - (0.5² + 0.5²) = 1 - (0.25 + 0.25) = 1 - 0.5 = 0.5
      most impure set
```

```pre
   c) 10 marbles: 9 red, 1 blue
      Fraction of red = 9/10 = 0.9
      Fraction of blue = 1/10 = 0.1
      
      Gini = 1 - (0.9² + 0.1²) = 1 - (0.81 + 0.01) = 1 - 0.82 = 0.18
      purest set
```

**The lower the Gini Impurity, the purer the set. Scenario (c) has the lowest Gini Impurity, indicating it's the most homogeneous.**
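
The three scenarios can be reproduced with a short function. This is a minimal sketch that takes a dictionary of class counts; `gini_impurity` is an illustrative name, not a library function.

```python
def gini_impurity(counts):
    """Gini impurity for a dict mapping class label -> number of items."""
    total = sum(counts.values())
    return 1 - sum((count / total) ** 2 for count in counts.values())


scenarios = {
    "a": {"red": 7, "blue": 3},
    "b": {"red": 5, "blue": 5},
    "c": {"red": 9, "blue": 1},
}

gini_scores = {name: gini_impurity(counts) for name, counts in scenarios.items()}
for name, score in gini_scores.items():
    print(f"Scenario {name}: Gini = {score:.2f}")
```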

#### 2. Entropy:

Entropy is another measure of impurity, based on the concept of information theory. It quantifies the amount of uncertainty or randomness in the data.

$Entropy = -\sum_{i=1}^{c} p_i \log_2(p_i)$

Where $c$ is the number of classes and $p_i$ is the proportion of items in the set that belong to class $i$.

Imagine you're playing a guessing game with marbles in a bag. Entropy measures how surprised you'd be when pulling out a marble. The more mixed the colours, the more surprised you might be, and the higher the entropy.

#### Let's use our marble scenarios:

10 marbles: 7 red, 3 blue

To calculate entropy, we follow these steps:

1. Calculate the fraction of each colour:
```pre
   Red: 7/10 = 0.7
   Blue: 3/10 = 0.3
```

2. For each colour, multiply its fraction by the log2 of its fraction:   
```pre
   Red: 0.7 × log2(0.7) = 0.7 × -0.5146 = -0.3602
   Blue: 0.3 × log2(0.3) = 0.3 × -1.7370 = -0.5211
```

3. Sum these values and negate the result:
```pre
Entropy = -(-0.3602 + -0.5211) = 0.8813
```

#### Let's do this for all scenarios:

a) 7 red, 3 blue
```pre
   Entropy = 0.8813
```
b) 5 red, 5 blue
```pre
   Red: 0.5 × log2(0.5) = 0.5 × -1 = -0.5
   Blue: 0.5 × log2(0.5) = 0.5 × -1 = -0.5
   Entropy = -(-0.5 + -0.5) = 1

Highest entropy, least predictable set
```

c) 9 red, 1 blue
```pre
   Red: 0.9 × log2(0.9) = 0.9 × -0.1520 = -0.1368
   Blue: 0.1 × log2(0.1) = 0.1 × -3.3219 = -0.3322
   Entropy = -(-0.1368 + -0.3322) = 0.4690

Lowest entropy, most predictable set
```

Lower entropy means less surprise or uncertainty. Scenario (c) has the lowest entropy, confirming it's the most predictable (or least mixed) set.

In Python, we could calculate entropy like this:

In [None]:
import math

def calculate_entropy(marbles):
    total = sum(marbles.values())
    entropy = 0
    for count in marbles.values():
        if count == 0:
            continue  # By convention 0 × log2(0) counts as 0; math.log2(0) would raise an error
        fraction = count / total
        entropy -= fraction * math.log2(fraction)
    return entropy

# Example usage
scenario_a = {"red": 7, "blue": 3}
entropy_a = calculate_entropy(scenario_a)
print(f"Entropy for scenario A: {entropy_a:.4f}")

#### 3. Information Gain:

Information Gain measures how much a split improves our ability to predict the outcome. It's a way of measuring how much better you've sorted your marbles after dividing them into groups.

Formula: $IG(T, a) = I(T) - \sum_{v \in values(a)} \frac{|T_v|}{|T|} I(T_v)$

Where:
- $T$ is the parent set
- $a$ is the attribute on which the split is made
- $v$ represents each possible value of attribute $a$
- $T_v$ is the subset of $T$ for which attribute $a$ has value $v$
- $I(T)$ is the impurity measure (Entropy or Gini) of set $T$


#### Let's use a scenario to calculate Information Gain:

We have 20 marbles total, and we're considering splitting them based on a feature (e.g., house size: small or large).
```pre
Before split: 12 red, 8 blue
```

Step 1: Calculate the entropy before the split
```pre
Entropy_before = 0.9710 (calculated as we did above)
```

After split:
```pre
Small houses: 8 red, 2 blue
Large houses: 4 red, 6 blue
```

Step 2: Calculate entropy for each group after the split
- Entropy_small = 0.7219 (calculated for 8 red, 2 blue)
- Entropy_large = 0.9710 (calculated for 4 red, 6 blue)

Step 3: Calculate the weighted average of the split entropies
```pre
Weight_small = 10/20 = 0.5 (half the marbles are in small houses)
Weight_large = 10/20 = 0.5 (half the marbles are in large houses)
Weighted_entropy_after = (0.5 × 0.7219) + (0.5 × 0.9710) = 0.8465
```

Step 4: Calculate Information Gain
```pre
Information Gain = Entropy_before - Weighted_entropy_after
                 = 0.9710 - 0.8465
                 = 0.1245
```

This positive Information Gain indicates that the split has improved our ability to predict marble colours (or in our house analogy, to predict if a house will sell quickly).

#### In Python, we could calculate Information Gain like this:

In [None]:
def calculate_information_gain(before, after):
    entropy_before = calculate_entropy(before)
    
    total_after = sum(sum(group.values()) for group in after)
    weighted_entropy_after = sum(
        (sum(group.values()) / total_after) * calculate_entropy(group)
        for group in after
    )
    
    return entropy_before - weighted_entropy_after

# Example usage
before_split = {"red": 12, "blue": 8}
after_split = [
    {"red": 8, "blue": 2},  # Small houses
    {"red": 4, "blue": 6}   # Large houses
]

info_gain = calculate_information_gain(before_split, after_split)
print(f"Information Gain: {info_gain:.4f}")

#### Comparison: Splits with Different Information Gains

The decision tree algorithm always chooses the split that provides the most Information Gain. 

Let's consider two potential splits of our 20 marbles:

1. Split by house size (small vs large):
   - Small houses: 8 red, 2 blue
   - Large houses: 4 red, 6 blue
   - Information Gain: 0.1245

2. Split by garage presence:
   - Houses with garage: 6 red, 4 blue
   - Houses without garage: 6 red, 4 blue
   - Information Gain: 0

The algorithm would choose the split by house size because it provides more Information Gain. 

Zero Information Gain occurs when a split doesn't change the distribution of the target variable (in this case, marble colours or house selling speed). This happens when the proportions in each resulting group are identical to the proportions in the parent group.

In practice, splits with exactly zero Information Gain are rare. More commonly, you'll see splits with varying degrees of positive Information Gain, and the algorithm will choose the one with the highest value.

Features that provide little or no Information Gain are typically less valuable for prediction and should be considered for removal from the model. Eliminating these low-impact features can simplify the model, potentially improving its generalization ability and computational efficiency without significantly compromising predictive performance.
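
The comparison between the two candidate splits can be sketched in code. The example below is self-contained and assumes class counts are given as `(red, blue)` tuples; the function and variable names are illustrative.

```python
import math


def entropy(counts):
    """Entropy in bits for a sequence of class counts; empty classes are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = (12, 8)  # (red, blue) before any split
candidate_splits = {
    "house_size": [(8, 2), (4, 6)],  # small, large
    "garage": [(6, 4), (6, 4)],      # with garage, without garage
}

gains = {name: information_gain(parent, groups) for name, groups in candidate_splits.items()}
best_feature = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: Information Gain = {gain:.4f}")
print(f"Chosen split: {best_feature}")
```

The garage split leaves both groups with the same 60/40 mix as the parent, so its gain is zero and the size split is chosen.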

## Interpretability and Visualization

After understanding how decision trees split data using criteria like MSE and Gini impurity, it's crucial to explore one of their greatest strengths: interpretability.

Unlike many machine learning models that act as "black boxes," decision trees provide clear insights into their decision-making process.

### Why Interpretability Matters

For house price prediction, interpretability allows us to:
- Explain predictions to stakeholders (buyers, sellers, agents)
- Validate model logic against domain knowledge
- Identify potential biases or errors
- Meet regulatory requirements for transparency

### How to Interpret Decision Trees

#### 1. Reading Tree Structure

Consider this simplified tree for house prices:
```
Area > 2000 sq ft?
├── Yes: Location = "Chelsea"?
│   ├── Yes: £2.5M (n=50)
│   └── No: £1.8M (n=150)
└── No: Number of bedrooms > 2?
    ├── Yes: £950K (n=200)
    └── No: £650K (n=100)
```

Each node tells us:
- The decision rule (e.g., "Area > 2000 sq ft?")
- The number of samples (n)
- The predicted value (for leaf nodes)

#### 2. Decision Paths

Each path from root to leaf represents a complete prediction rule. For example:
- IF area > 2000 sq ft AND location = "Chelsea" THEN price = £2.5M
- IF area ≤ 2000 sq ft AND bedrooms > 2 THEN price = £950K

This allows us to provide clear explanations for any prediction.
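
Reading off every root-to-leaf path can also be automated. The sketch below encodes the example tree as nested dictionaries (a hypothetical representation chosen for illustration, not a library format) and turns each leaf into an IF ... THEN rule.

```python
example_tree = {
    "rule": "area > 2000 sq ft",
    "yes": {
        "rule": "location = Chelsea",
        "yes": {"price": "£2.5M", "n": 50},
        "no": {"price": "£1.8M", "n": 150},
    },
    "no": {
        "rule": "bedrooms > 2",
        "yes": {"price": "£950K", "n": 200},
        "no": {"price": "£650K", "n": 100},
    },
}


def extract_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' string per leaf, walking yes-branches first."""
    if "price" in node:  # leaf node: emit the accumulated conditions
        return [f"IF {' AND '.join(conditions)} THEN price = {node['price']} (n={node['n']})"]
    rules = []
    rules += extract_rules(node["yes"], conditions + (node["rule"],))
    rules += extract_rules(node["no"], conditions + (f"NOT ({node['rule']})",))
    return rules


rules = extract_rules(example_tree)
for rule in rules:
    print(rule)
```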

#### 3. Feature Importance

Decision trees naturally reveal feature importance through:

a) Position in tree:
- Features closer to root affect more predictions
- Top-level splits handle larger portions of data

b) Usage frequency:
- Features used multiple times may be more important
- Different contexts show feature interactions

c) Impact on predictions:
- Splits that create large value differences are important
- Features that reduce variance significantly

## Visualizing Decision Trees

While our simple example above is easy to read, real trees can be much more complex. Here are key visualization approaches:

1. **Full Tree Visualization**
   - Shows complete structure
   - Good for understanding overall patterns
   - Can become overwhelming for deep trees

2. **Pruned Tree Views**
   - Show top few levels
   - Focus on most important decisions
   - More manageable for presentation

3. **Feature Importance Plots**
   - Bar charts of feature importance
   - Easier to digest than full trees
   - Good for high-level insights
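
As a rough illustration of these three views, the sketch below uses scikit-learn's `export_text` to print a full and a truncated text rendering of a small tree, then prints the feature importances as a simple text bar chart. It assumes scikit-learn is installed; the ten houses are small made-up example data with prices in £k.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

feature_names = ["bedrooms", "area", "distance_to_tube"]
X = [
    [2, 80, 15], [3, 120, 10], [2, 75, 20], [3, 100, 5], [4, 150, 2],
    [3, 110, 12], [2, 70, 25], [4, 140, 8], [3, 130, 6], [2, 85, 18],
]
y = [200, 250, 180, 220, 300, 240, 190, 280, 260, 210]

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# 1. Full tree view
full_view = export_text(model, feature_names=feature_names)
print(full_view)

# 2. Pruned view: only the top levels, deeper branches are reported as truncated
top_levels = export_text(model, feature_names=feature_names, max_depth=1)
print(top_levels)

# 3. Feature importance as a text bar chart
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name:<18} {'#' * round(score * 40)} {score:.2f}")
```

For graphical output, `sklearn.tree.plot_tree` draws the same structure with matplotlib.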

## Understanding Bias, Variance, Tree Depth and Complexity

### Bias
- **The error introduced by approximating a real-world problem with a simplified model**
- Represents how far off the model's predictions are from the true values on average
- High bias means the model consistently misses the true patterns (underfitting)

    1. **Shallow Trees (High Bias)**
    ```pre
    Root: Area > 2000 sq ft?
    ├── Yes: £2M
    └── No: £800K
    ```
    - Very simple rules
    - Misses many important factors
    - Similar predictions for different houses

### Variance
- **The model's sensitivity to fluctuations in the training data**
- Represents how much predictions change with different training sets
- High variance means predictions vary significantly with small changes in training data (overfitting)

    2. **Deep Trees (High Variance)**
    ```pre
    Root: Area > 2000 sq ft?
    ├── Yes: Location = "Chelsea"?
    │   ├── Yes: Bedrooms > 3?
    │   │   ├── Yes: Garden = True?
    │   │   │   ├── Yes: £3.2M
    │   │   │   └── No: £2.9M
    ...
    ```
    - Very specific rules
    - Might memorize training data
    - Can make unstable predictions


## Identifying the Bias/Variance Tradeoff

Consider these scenarios:

### Scenario 1: Too Simple (High Bias)
```python
# Example of underfitting
predictions = {
    "2500 sq ft in Chelsea": "£2M",
    "2500 sq ft in Hackney": "£2M",  # Same prediction despite location
    "2500 sq ft in Mayfair": "£2M",  # Location ignored
}
```

### Scenario 2: Too Complex (High Variance)
```python
# Example of overfitting
predictions = {
    "2500 sq ft, Chelsea, 4 bed, garden": "£3.2M",
    "2500 sq ft, Chelsea, 4 bed, no garden": "£2.9M",
    # Small changes lead to large prediction differences
    "2499 sq ft, Chelsea, 4 bed, garden": "£2.7M",  # Just 1 sq ft difference
}
```

### Scenario 3: Balanced
```python
# Example of good balance
predictions = {
    "Large house in Chelsea": "£2.5M-3.0M",
    "Large house in Hackney": "£1.5M-2.0M",
    # Reasonable variations based on key features
}
```

## Managing the Bias/Variance Tradeoff

When building a decision tree, we need to find the right balance between making it too simple (underfitting) and too complex (overfitting). 

Let's explore how to find this balance.

### 1. Control Tree Complexity
We can control how detailed our tree becomes using parameters:
- Maximum depth (how many questions we can ask)
- Minimum samples per leaf (how many houses needed for a conclusion)
- Minimum improvement threshold (how much better a split needs to be)

### 2. Understanding Training vs Validation Error

Training error is how well our model predicts house prices for houses it learned from, while validation error is how well it predicts prices for houses it hasn't seen before.

Think of it like this:
- **Training Error**: How well you can predict prices of houses you studied
- **Validation Error**: How well you can predict prices of new houses

Let's look at how these errors change as we make our tree more complex:


```code
Depth   Training Error  Validation Error   What's Happening
3       £250K           £260K              #  Tree is too simple
                                           #  - Both errors are high
                                           #  - Tree isn't learning enough patterns
 
5       £180K           £200K              #  Tree is just right
                                           #  - Both errors are reasonable
                                           #  - Tree learns genuine patterns
 
7       £120K           £220K              #  Tree is getting too complex
                                           #  - Training error keeps dropping
                                           #  - Validation error starts rising
                                           #  - Starting to memorise training data
 
10      £50K            £300K              #  Tree is way too complex
                                           #  - Training error is very low
                                           #  - Validation error is very high
                                           #  - Tree has memorised training data
```

### 3. Finding the Best Depth Using Cross-Validation

To find the best depth, we:
1. Try different depths
2. Test each one on multiple splits of our data
3. Choose the depth with lowest validation error

```python
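# Pseudocode: test different tree depths
# (cross_validate stands for any routine that returns validation scores for a given depth)
depths = [3, 5, 7, 10, 15]
for depth in depths:
    scores = cross_validate(tree, depth)
    # Choose depth where validation error is lowest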
```

In our example, depth=5 gives the best balance because:
- Training error (£180K) shows it's learning meaningful patterns
- Validation error (£200K) shows these patterns generalise well to new houses
- The gap between training and validation error is reasonable

This balance means our tree has learned genuine relationships in house prices without memorising specific examples from the training data.
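
A runnable version of this depth search might look like the sketch below. It assumes scikit-learn is installed and uses synthetic, made-up data (price driven by area and bedrooms plus noise), so the chosen depth and error values will not match the illustrative table above.

```python
import random

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

random.seed(0)

# Synthetic stand-in data: price in £k depends on area and bedrooms plus noise
X, y = [], []
for _ in range(200):
    area = random.uniform(40, 200)
    bedrooms = random.randint(1, 5)
    X.append([area, bedrooms])
    y.append(2.0 * area + 15 * bedrooms + random.gauss(0, 20))

depths = [3, 5, 7, 10, 15]
validation_mae = {}
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # scikit-learn reports negated MAE so that a higher score is always better
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    validation_mae[depth] = -scores.mean()
    print(f"depth={depth:>2}: mean validation MAE = £{validation_mae[depth]:.1f}k")

best_depth = min(validation_mae, key=validation_mae.get)
print(f"Best depth: {best_depth}")
```

Each depth is scored on five different train/validation splits, which makes the comparison less dependent on one lucky or unlucky split.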

## Visual Indicators of Bias/Variance

### 1. Learning Curves

![model-complexity-bias-variance-contributing-to-total-error](../static/model-complexity-bias-variance-contributing-to-total-error.png)

As model complexity increases, training error keeps decreasing, while validation error falls at first and then rises once the model starts to overfit.

Expected prediction error can be decomposed into squared bias (the error introduced by approximating a real-world problem with a simplified model), variance (the error caused by the model's sensitivity to fluctuations in the training data), and irreducible noise that no model can remove.

Underfitting occurs when the model is too simple (high bias), resulting in both training set and validation set total errors being high.

Overfitting occurs when the model is too complex (high variance), resulting in a large gap between training and validation set total errors.

![model-complexity-error-training-test-samples](../static/model-complexity-error-training-test-samples.png)


![performance-model-complexity-training-validation-sets-overfitting](../static/performance-model-complexity-training-validation-sets-overfitting.png)


## Practical Guidelines

1. **Start Simple**
   - Begin with shallow trees
   - Add complexity gradually
   - Monitor performance changes

2. **Use Domain Knowledge**
   - Consider reasonable decision granularity
   - Identify important feature interactions
   - Set meaningful constraints

3. **Regular Validation**
   - Test on unseen data
   - Check prediction stability
   - Monitor for overfitting signs

Understanding this tradeoff is crucial for:
- Setting appropriate tree depth
- Choosing regularization parameters
- Deciding when to use ensemble methods

Now that we understand how to build well-balanced decision trees, we need to know which features are driving their decisions. 

In the next section, we'll explore how decision trees determine which features are most important for making predictions (like whether location matters more than size for house prices) and discover their advanced capabilities in handling different types of data. This knowledge is crucial for building more effective models and gaining insights from your data.

## Feature Importance and Advanced Capabilities

After understanding how to balance bias and variance, it's crucial to explore how decision trees determine feature importance and their advanced capabilities in handling different types of data and relationships.

### Feature Importance in Decision Trees

#### How Trees Measure Importance

1. **Split Position**
   ```pre
   Root (Most Important)
   ├── Level 1
   │   ├── Level 2
   │   └── Level 2
   └── Level 1
       ├── Level 2
       └── Level 2
   ```
   - Higher splits affect more samples
   - Root splits are most influential
   - Earlier splits indicate greater importance

2. **Impurity Reduction**
   ```python
   # Conceptual example
   importance = sum([
       node.samples * node.impurity_reduction
       for node in feature_splits
   ])
   ```
   - Larger reductions = more important
   - Weighted by number of samples affected
   - Accumulated across all splits using feature

3. **Usage Frequency**
   - Features used multiple times may be more important
   - Different contexts show feature interactions
   - Patterns of use reveal complexity of relationship
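
The impurity-reduction idea can be made concrete with a small runnable sketch. The split records below are hypothetical numbers invented for illustration; each record holds the feature used, the number of samples reaching that node, and the impurity reduction achieved there.

```python
from collections import defaultdict

# Hypothetical split records: (feature, samples at node, impurity reduction)
split_records = [
    ("location", 500, 0.30),
    ("area", 300, 0.20),
    ("area", 200, 0.10),
    ("bedrooms", 100, 0.05),
]

# Accumulate sample-weighted impurity reduction per feature
raw_importance = defaultdict(float)
for feature, samples, reduction in split_records:
    raw_importance[feature] += samples * reduction

# Normalise so the importances sum to 1
total = sum(raw_importance.values())
importance = {feature: score / total for feature, score in raw_importance.items()}

for feature, score in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{feature:<10} {score:.1%}")
```

Note that `area` is used twice, so its two contributions are added before normalising.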

#### Example: London Housing Features

Typical importance hierarchy:
```pre
1. Location (30-40% importance)
   - Primary price determinant
   - Used in multiple splits
   - Strong predictor at all levels

2. Area (20-30% importance)
   - Key size indicator
   - Often appears near root
   - Clear price relationship

3. Property Type (15-20% importance)
   - Important categorical feature
   - Interacts with location/area
   - Distinct price levels

4. Bedrooms (10-15% importance)
   - Secondary size indicator
   - Often appears lower in tree
   - Correlated with area
```

### Advanced Capabilities

#### 1. Handling Non-linear Relationships

Trees naturally capture non-linear patterns:
```pre
Price vs. Area Relationship:
< 1000 sq ft: £500K
1000-2000 sq ft: £1M
2000-3000 sq ft: £2.5M
> 3000 sq ft: £5M
```
- No assumption of linearity
- Step-wise approximation
- Automatic threshold finding
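
The step-wise relationship above behaves like a lookup on thresholds. The sketch below is a minimal illustration using the standard library's `bisect`; it assumes a house exactly on a boundary falls into the higher band, which the table above leaves open.

```python
from bisect import bisect_right

thresholds = [1000, 2000, 3000]  # sq ft boundaries between leaves
leaf_prices = [500_000, 1_000_000, 2_500_000, 5_000_000]  # one price per band


def predict_price(area_sq_ft):
    """Return the constant price of the band that the area falls into."""
    return leaf_prices[bisect_right(thresholds, area_sq_ft)]


for area in [800, 1500, 2500, 3500]:
    print(f"{area} sq ft -> £{predict_price(area):,}")
```

The price jumps are uneven (from £500K to £1M, then to £2.5M, then to £5M), which a single straight line could not reproduce.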

#### 2. Feature Interaction Detection

Trees automatically find interactions:
```pre
Area > 2000 sq ft?
├── Yes: Location = "Chelsea"?
│   ├── Yes: £3M (premium location + large)
│   └── No: £2M (large but standard location)
└── No: Location = "Chelsea"?
    ├── Yes: £1.5M (premium location but small)
    └── No: £800K (standard location and small)
```
- Different location effects by size
- Automatic path discovery
- Hierarchical relationships
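
One way to see the interaction is to compute the Chelsea premium separately for large and small houses from the four leaves above. This is a small illustrative calculation, with the leaf values copied from the tree.

```python
# Leaf prices keyed by (is_large, is_chelsea), taken from the tree above
leaf_price = {
    (True, True): 3_000_000,    # large, Chelsea
    (True, False): 2_000_000,   # large, other location
    (False, True): 1_500_000,   # small, Chelsea
    (False, False): 800_000,    # small, other location
}

premium_large = leaf_price[(True, True)] - leaf_price[(True, False)]
premium_small = leaf_price[(False, True)] - leaf_price[(False, False)]

print(f"Chelsea premium for large houses: £{premium_large:,}")
print(f"Chelsea premium for small houses: £{premium_small:,}")
```

If location and size acted independently, the two premiums would be equal. Here they differ, which is what an interaction between the features means.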

#### 3. Missing Value Handling

Depending on the implementation, trees can handle missing data through:

1. **Surrogate Splits**
```
Primary Split: Area > 2000 sq ft?
Surrogate: Bedrooms > 3?  # If area is missing
```

2. **Built-in Mechanisms**
```python
# Conceptual handling
if value_is_missing(area):
    use_surrogate_split(bedrooms)
else:
    use_primary_split(area)
```
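
A runnable version of this idea for a single node might look like the sketch below. The function name, the dictionary representation of a house, and the final fallback to the left branch are all assumptions made for illustration; real implementations differ in how they choose surrogates and defaults.

```python
def route_house(house):
    """Decide which branch a house takes at one node, using a surrogate if needed."""
    area = house.get("area")
    if area is not None:
        return "right" if area > 2000 else "left"    # primary split on area

    bedrooms = house.get("bedrooms")
    if bedrooms is not None:
        return "right" if bedrooms > 3 else "left"   # surrogate split on bedrooms

    return "left"  # assumed default when both values are missing


print(route_house({"area": 2500, "bedrooms": 2}))  # area known: primary split
print(route_house({"area": None, "bedrooms": 4}))  # area missing: surrogate
print(route_house({}))                             # nothing known: default
```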

#### 4. Categorical Variable Handling

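In principle, a tree can split directly on categories such as:
- Property types
- Locations
- Amenities

Whether encoding is needed depends on the implementation. Some libraries accept categorical features directly, but scikit-learn's decision trees expect numeric input, so the encoding methods covered earlier in this lesson still apply there.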

### Limitations and Solutions

#### 1. Instability
```python
# Small data changes can cause different splits
# Example:
data1 = [..., 999, 1001, ...]  # Best threshold falls between them, at 1000
data2 = [..., 999, 1005, ...]  # One value changes, so the threshold moves to 1002
```

**Solution:** Ensemble methods
- Random Forests
- Gradient Boosting
- Bagging

#### 2. Linear Relationship Inefficiency
```
True relationship: price = 1000 * area
Tree approximation:
area <= 1000: price = 1M
area <= 2000: price = 2M
area <= 3000: price = 3M
```

**Solution:** Feature engineering
- Create derived features
- Transform variables
- Combine with linear models

#### 3. Extrapolation Limitation
```python
# Training data: areas up to 3000 sq ft
# Cannot reliably predict for:
area = 4000  # Beyond training range
```

**Solution:**
- Domain constraints
- Careful feature ranges
- Hybrid models
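
The extrapolation limit follows directly from how leaves work: every area beyond the last threshold lands in the same leaf. The sketch below uses a hypothetical tree with two thresholds, loosely based on the `price = 1000 * area` example above, and compares it with the linear rule.

```python
from bisect import bisect_right

# Hypothetical tree learned from houses of up to 3000 sq ft
split_thresholds = [1000, 2000]
leaf_values = [1_000_000, 2_000_000, 3_000_000]  # constant prediction per leaf


def tree_predict(area):
    """Constant prediction of whichever leaf the area falls into."""
    return leaf_values[bisect_right(split_thresholds, area)]


def linear_predict(area):
    """The underlying linear relationship assumed in this example."""
    return 1000 * area


for area in [2500, 4000, 10_000]:
    print(f"{area:>6} sq ft: tree £{tree_predict(area):,} vs linear £{linear_predict(area):,}")
```

The tree returns the same value for 2500, 4000 and 10,000 sq ft, because it can only repeat the average of the largest houses it saw during training.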

### Practical Applications

1. **Feature Selection**
   - Use importance scores to select features
   - Remove redundant variables
   - Focus on strongest predictors

2. **Model Improvement**
   - Identify key interaction patterns
   - Guide feature engineering
   - Inform data collection

3. **Business Insights**
   - Understand market drivers
   - Identify value factors
   - Guide decision making

Understanding these capabilities helps in:
- Choosing appropriate problems for decision trees
- Setting realistic expectations
- Leveraging tree strengths while mitigating weaknesses

In the next section, we'll explore practical limitations and ethical considerations when using decision trees for real-world applications.

## Limitations and Ethical Considerations

Having explored the capabilities of decision trees, it's crucial to understand their limitations and the ethical considerations in their application, particularly in sensitive domains like housing prices.

### Technical Limitations

#### 1. Decision Boundary Limitations

```python
# Decision trees create rectangular decision boundaries
# Example regions in feature space:
regions = {
    "Chelsea, >2000 sq ft": "High Price",
    "Chelsea, ≤2000 sq ft": "Medium Price",
    "Other, >2000 sq ft": "Medium Price",
    "Other, ≤2000 sq ft": "Low Price"
}

# Cannot easily represent diagonal or curved boundaries
# May need many splits to approximate smooth transitions
```

#### 2. Data Fragmentation
```python
# Deep trees can create very specific rules
path = ("Area > 2000 sq ft AND Location = 'Chelsea' AND "
        "Bedrooms = 4 AND Has_Garden = True AND...")

# Problem: Few samples per leaf
samples_per_rule = {
    "specific_rule_1": 2,  # Too few to be reliable
    "specific_rule_2": 3,
    "specific_rule_3": 1
}
```

#### 3. Prediction Discontinuities
```
Area   Price
1999   £800K  # Just below threshold
2001   £1.2M  # Just above threshold
```

### Solutions and Mitigations

#### 1. Ensemble Methods
```python
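# Instead of a single tree, combine many. Conceptual sketch: `trees`, `bagged_trees`
# and `sequential_trees` are fitted trees, X holds the houses to price, and
# `baseline` and `learning_rate` are illustrative names that are not defined here.
predictions = {
    # Bagging and random forests average their trees' predictions
    'random_forest': sum(tree.predict(X) for tree in trees) / len(trees),
    'bagging': sum(tree.predict(X) for tree in bagged_trees) / len(bagged_trees),
    # Gradient boosting adds each tree's scaled correction to a baseline prediction
    'gradient_boost': baseline + learning_rate * sum(tree.predict(X) for tree in sequential_trees),
}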
```
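
As a runnable counterpart, the sketch below compares a single tree with three scikit-learn ensembles using cross-validated mean absolute error. The data is synthetic and the choice of 50 estimators is arbitrary; both are assumptions made for illustration rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): price depends on area plus noise
rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, size=(300, 1))
y = 1000 * X.ravel() + rng.normal(0, 100_000, size=300)

models = {
    "single_tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boost": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated mean absolute error for each model (lower is better)
cv_mae = {
    name: -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    for name, model in models.items()
}
for name, mae in cv_mae.items():
    print(f"{name:15s} MAE: {mae:,.0f}")
```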

#### 2. Regularization Techniques
```python
tree_params = {
    'max_depth': 10,          # Prevent excessive splitting
    'min_samples_leaf': 20,   # Ensure reliable leaf nodes
    'min_impurity_decrease': 0.01  # Require meaningful splits
}
```
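
To see how these settings counter data fragmentation, one can count the training samples that end up in each leaf. The sketch below assumes scikit-learn and synthetic data; `leaf_sizes` is a hypothetical helper, and only `max_depth` and `min_samples_leaf` from the parameters above are applied.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): 500 houses, two numeric features
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 500_000 * X[:, 0] + 200_000 * X[:, 1] + rng.normal(0, 50_000, size=500)

def leaf_sizes(tree, X):
    """Return the number of samples from X that fall in each leaf."""
    leaf_ids = tree.apply(X)  # index of the leaf each sample ends up in
    _, counts = np.unique(leaf_ids, return_counts=True)
    return counts

unconstrained = DecisionTreeRegressor(random_state=0).fit(X, y)
regularized = DecisionTreeRegressor(
    max_depth=10, min_samples_leaf=20, random_state=0
).fit(X, y)

print("Unconstrained: leaves =", unconstrained.get_n_leaves(),
      "smallest leaf =", leaf_sizes(unconstrained, X).min())
print("Regularized:   leaves =", regularized.get_n_leaves(),
      "smallest leaf =", leaf_sizes(regularized, X).min())
```

With continuous prices, the unconstrained tree keeps splitting until each leaf holds a single house, while the regularized tree is forced to keep at least 20 houses behind every prediction.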

#### 3. Cross-Validation
```python
# Test different parameter combinations
params_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_leaf': [10, 20, 50],
    'min_impurity_decrease': [0.01, 0.02, 0.05]
}
```
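
A grid like this can be passed to scikit-learn's `GridSearchCV`, as in the minimal sketch below. The data is synthetic, and the price is expressed in £ millions as an assumption: `min_impurity_decrease` is compared against impurity measured in squared target units, so the thresholds above would be negligible for prices recorded in pounds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): area and bedrooms drive the price
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, size=400)
bedrooms = rng.integers(1, 6, size=400)
X = np.column_stack([area, bedrooms])
y = 0.001 * area + 0.05 * bedrooms + rng.normal(0, 0.1, size=400)  # price in £ millions

params_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [10, 20, 50],
    "min_impurity_decrease": [0.01, 0.02, 0.05],
}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=params_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MAE: £{-search.best_score_:.3f}M")
```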

## Ethical Considerations for Decision Tree Models

When applying decision trees to housing price prediction, we must carefully consider the ethical implications and societal impact of our models.

### 1. Bias in Training Data

#### Understanding Data Bias

Historical housing data often reflects societal inequalities and biases:
- Certain areas may be over- or underrepresented
- Quality of data may vary by neighborhood
- Historical redlining effects may persist in the data
- Property features may be inconsistently recorded across areas

#### Example of Data Bias
Consider two neighborhoods:

**Affluent Area:**
- 1000+ property records
- Complete feature sets (area, condition, amenities)
- Regular price updates
- Detailed property descriptions

**Developing Area:**
- Only 100 property records
- Missing features
- Irregular price updates
- Basic property information only

This disparity in data quality and quantity can lead to:
- Less accurate predictions in underrepresented areas
- Reinforcement of existing price disparities
- Lower confidence in predictions for certain areas

#### Mitigation Strategies

1. **Data Collection**
   - Actively gather data from underrepresented areas
   - Standardize data collection across all neighborhoods
   - Partner with community organizations for local insights

2. **Model Development**
   - Weight samples to balance representation
   - Use stratified sampling across neighborhoods
   - Include confidence intervals with predictions

3. **Regular Auditing**
   - Monitor prediction accuracy across different areas
   - Track error rates by neighborhood
   - Assess impact on different communities
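
Sample weighting and per-neighborhood error tracking can be sketched together. The example assumes scikit-learn and entirely made-up data for two hypothetical areas, "A" and "B"; inverse-frequency weighting is one common choice among several, not the only way to balance representation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 1000 records from area "A", only 100 from area "B"
rng = np.random.default_rng(7)
neighborhood = np.array(["A"] * 1000 + ["B"] * 100)
area_sqft = rng.uniform(500, 3000, size=1100)
price = np.where(neighborhood == "A", 1200, 800) * area_sqft

# Inverse-frequency weights: each neighborhood contributes the same total weight
names, counts = np.unique(neighborhood, return_counts=True)
weight_per_group = dict(zip(names, len(neighborhood) / (len(names) * counts)))
sample_weight = np.array([weight_per_group[n] for n in neighborhood])

# Features: area plus a 0/1 indicator for neighborhood "B"
X = np.column_stack([area_sqft, neighborhood == "B"])
model = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=0)
model.fit(X, price, sample_weight=sample_weight)

# Audit: mean absolute error per neighborhood
abs_error = np.abs(price - model.predict(X))
for name in names:
    print(name, f"MAE: {abs_error[neighborhood == name].mean():,.0f}")
```

In a real audit the errors would be computed on held-out data rather than on the training set used here.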

### 2. Fairness and Discrimination

#### Protected Characteristics

Decision trees must not perpetuate discrimination based on:
- Race, ethnicity, or national origin
- Religion
- Gender
- Age
- Disability status
- Family status

#### Direct and Indirect Bias

Consider these two approaches:

**Problematic Approach:**
```
If neighborhood = "historically_disadvantaged":
    Predict lower value
```

**Better Approach:**
```
If distance_to_amenities < 1km:
    If property_condition = "excellent":
        Predict based on objective features
```

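The second approach relies on measurable property attributes rather than on a neighborhood label that carries historical bias. Even so, features such as distance to amenities can act as proxies for protected characteristics, so they still need to be reviewed.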

#### Monitoring for Fairness

1. Track prediction ratios across different groups
2. Compare error rates between communities
3. Analyze the impact of model updates on different areas
4. Review feature importance for potential proxy discrimination
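
The first two checks can be sketched with pandas. The audit table below is invented for illustration, and the 5% tolerance is an arbitrary threshold chosen for the example, not an established standard.

```python
import pandas as pd

# Hypothetical audit table: actual sale prices and model predictions by neighborhood
audit = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B", "B"],
    "actual":    [500_000, 620_000, 710_000, 300_000, 340_000, 410_000],
    "predicted": [510_000, 600_000, 700_000, 250_000, 300_000, 330_000],
})

# Ratio below 1 means the model under-values the property
audit["ratio"] = audit["predicted"] / audit["actual"]
audit["abs_pct_error"] = (audit["predicted"] - audit["actual"]).abs() / audit["actual"]

by_group = audit.groupby("neighborhood").agg(
    mean_ratio=("ratio", "mean"),
    mean_abs_pct_error=("abs_pct_error", "mean"),
    n=("actual", "size"),
)
print(by_group.round(3))

# Flag groups whose average prediction ratio drifts more than 5% from 1.0
flagged = by_group.index[(by_group["mean_ratio"] - 1).abs() > 0.05].tolist()
print("Groups needing review:", flagged)
```

In this made-up table the model systematically under-values neighborhood "B", which is the kind of pattern a fairness review should surface.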

### 3. Market Impact and Social Responsibility

#### Housing Market Effects

Our models can influence:
1. **Buyer Behavior**
   - Setting price expectations
   - Influencing negotiation starting points
   - Affecting perceived neighborhood value

2. **Market Dynamics**
   - Property valuation standards
   - Investment patterns
   - Neighborhood development

3. **Housing Accessibility**
   - Affordability assessments
   - Mortgage approvals
   - Insurance rates

#### Responsible Implementation
1. **Transparency**
   - Clearly explain model limitations
   - Provide confidence intervals
   - Document all assumptions
   - Share key factors affecting predictions

2. **Community Impact**
   - Engage with local stakeholders
   - Consider neighborhood stability
   - Monitor displacement risks
   - Support housing accessibility

3. **Market Stability**
   - Avoid reinforcing speculation
   - Maintain price prediction stability
   - Consider local market conditions
   - Support sustainable growth

### 4. Best Practices for Ethical Use

#### Development Guidelines

1. **Data Collection**
   - Ensure representative samples
   - Document data sources
   - Validate data quality
   - Address historical biases

2. **Model Design**
   - Use interpretable features
   - Avoid proxy discrimination
   - Include uncertainty measures
   - Document design choices

3. **Testing and Validation**
   - Test across diverse scenarios
   - Validate with community input
   - Monitor for unintended consequences
   - Run regular fairness audits

#### Deployment Considerations
1. **Model Release**
   - Gradual rollout
   - Monitor impact
   - Gather feedback
   - Be ready to adjust

2. **Ongoing Oversight**
   - Regular audits
   - Community feedback
   - Impact assessment
   - Update protocols

#### Documentation Requirements

Your model documentation should include:
1. Training data sources and limitations
2. Feature selection rationale
3. Fairness considerations and tests
4. Known biases and limitations
5. Intended use guidelines
6. Impact monitoring plan

Ethical considerations aren't just a compliance checklist; they're fundamental to building models that serve society fairly and responsibly. Regularly reviewing and adjusting these practices helps ensure our models contribute positively to the housing market and community well-being.

## Theory Conclusion

Now that we've explored the key concepts behind decision trees, let's summarize the main points and how they apply to our house price prediction task:

### Core Concepts

1. **Regression Trees vs Classification Trees**
   - For house price prediction, we use regression trees
   - Unlike classification trees (which use Gini impurity or entropy), regression trees minimize the variance of the target variable (house prices) within each node
   - Different metrics for different tasks:
     - MSE for regression
     - Gini/Entropy for classification

2. **Splitting Criterion**
   - Regression trees use reduction in Mean Squared Error (MSE)
   - At each node, the algorithm chooses the split that maximizes this reduction:

   $\Delta MSE = MSE_{parent} - (w_{left} \cdot MSE_{left} + w_{right} \cdot MSE_{right})$

   where $w_{left}$ and $w_{right}$ are the proportions of the parent's samples in the left and right child nodes

3. **Recursive Splitting**
   - Tree built by recursively applying splitting process
   - Creates hierarchy of decision rules
   - Continues until stopping condition met:
     - Maximum tree depth reached
     - A further split would leave fewer than the minimum samples per leaf
     - No further improvement possible

4. **Prediction Process**
   - Follow decision rules from root to leaf node
   - Prediction is mean price of houses in leaf node
   - Clear, interpretable decision path
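
The splitting criterion and the leaf prediction from the list above can be checked by hand on a few numbers. The sketch below uses plain Python and made-up prices; `mse` and `mse_reduction` are hypothetical helper names that mirror the $\Delta MSE$ formula.

```python
def mse(values):
    """Mean squared error of values around their own mean (node impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_reduction(parent, left, right):
    """Weighted MSE reduction achieved by splitting parent into left and right."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return mse(parent) - (w_left * mse(left) + w_right * mse(right))

# Made-up prices in £ thousands, ordered by area
prices = [200, 220, 240, 600, 620, 640]

# Candidate 1 separates the cheap houses from the expensive ones
good_split = mse_reduction(prices, prices[:3], prices[3:])
# Candidate 2 isolates a single house
poor_split = mse_reduction(prices, prices[:1], prices[1:])
print(f"Reduction for candidate 1: {good_split:.1f}")
print(f"Reduction for candidate 2: {poor_split:.1f}")

# A leaf predicts the mean price of the houses it contains
left_leaf_prediction = sum(prices[:3]) / 3
print(f"Prediction for the left leaf: {left_leaf_prediction:.0f}")
```

The algorithm would pick the first candidate because it yields the larger reduction; recursive splitting then repeats the same comparison inside each child node.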

### Data Handling and Model Characteristics

5. **Data Preparation**
   - Numerical features: Use directly without transformation
   - Categorical features require encoding:
     - One-hot encoding for low-cardinality
     - Target encoding for high-cardinality
     - Ordinal encoding for ordered categories
   - Binary features: Simple 1/0 encoding

6. **Interpretability**
   - Can visualize tree and follow decision path
   - Provides insights into feature importance
   - Clear decision rules for predictions
   - Natural feature selection through split choices

7. **Bias-Variance Trade-off**
   - Deeper trees: More complex relationships but risk overfitting (high variance)
   - Shallower trees: More generalizable but may oversimplify (high bias)
   - Balance crucial for optimal performance
   - Cross-validation helps find optimal depth

8. **Feature Importance**
   - Natural feature selection through tree construction
   - More important features appear:
     - Higher in tree
     - In more splits
     - With larger reduction in impurity

9. **Advanced Capabilities**
   - Handles non-linear relationships, unlike linear regression
   - Captures complex interactions between features
   - No feature scaling required
   - Some implementations can handle missing values without imputation (support varies by library and version)

10. **Limitations and Solutions**
    - Instability: Small data changes can result in very different trees
    - Solution: Ensemble methods like Random Forests
    - Struggles with smooth, linear relationships
    - Limited extrapolation capability
    - May create biased trees if data is unbalanced

### Error Metrics and Evaluation

11. **Understanding Error Metrics**
    - Training uses MSE for splitting decisions
    - Evaluation often uses MAE for interpretability
    - MSE formula for node impurity, where $\hat{y}$ is the mean price in the node and $n$ is the number of houses in it:
      $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y})^2$

### Next Steps

As we move to practical implementation, we'll focus on:
1. Applying these concepts to real housing data
2. Using scikit-learn's decision tree implementation
3. Tuning hyperparameters for optimal performance
4. Interpreting and visualizing tree decisions
5. Understanding feature importance
6. Handling real-world data challenges

This theoretical foundation prepares us for the practical challenges of implementing decision trees for house price prediction, while understanding both the power and limitations of the approach. The next lesson will demonstrate how to implement these concepts using Python and scikit-learn, and how to gain insights into the London housing market using decision trees.

As we move forward to apply these concepts to our London housing dataset, keep in mind that while the theory provides the foundation, the real insights often come from experimenting with the data, tuning the model, and interpreting the results in the context of the problem at hand.

### Next lesson: [2b_decision_trees_practical.ipynb](./2b_decision_trees_practical.ipynb)