# Practical: Applying decision trees on London Housing Data

In this practical, we will explore using decision trees to predict housing prices in London.

We'll walk through the key steps of loading data, exploratory analysis, data preprocessing, model training and evaluation, and drawing insights.

## Table of Contents

1. [Load imports](#load-imports)
2. [Load data](#load-the-data)
3. [Exploratory data analysis](#exploratory-data-analysis)
4. [Exploratory Data Analysis Discussion](#exploratory-data-analysis-discussion)
5. [Data Processing Deep dive](#data-processing-deep-dive)
   - [Data Cleanliness](#data-cleanliness)
   - [Handling Missing Data](#handling-missing-data)
   - [Handling Missing Categorical Data](#handling-missing-categorical-data)
   - [Handling Missing Continuous Data](#handling-missing-continuous-data)
   - [Features of the data](#features-of-the-data)
   - [Feature engineering](#feature-engineering)
   - [Preparing the data for modeling](#preparing-the-data-for-modeling)
      - [Brief discussion on feature scaling](#feature-scaling)
6. [Running the initial model and comparing missing data methods](#running-the-intial-model-and-comparing-missing-data-methods)
7. [Feature selection](#feature-selection)
8. [Comparing model types: Random Forest](#comparing-model-types-random-forest)
9. [Comparing models: XGBoost gradient boosting](#comparing-models-xgboost-gradient-boosting)
10. [Model persistence](#model-persistence-saving-and-loading-trained-models)
    - [Loading saved models and making predictions](#loading-saved-models-and-making-predictions)
11. [Limitations of Decision Trees](#limitations-of-decision-trees)
12. [Ethical Considerations](#ethical-considerations)
13. [Machine Learning Model Deployment: From Development to Production](#machine-learning-model-deployment-from-development-to-production)
14. [Conclusion](#conclusion)
    - [Further Reading](#further-reading)


## Load imports


In [321]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

np.random.seed(42)

## Load the data

Let's print the data table and have a look at the data.

We'll see it has 11 columns and 3480 rows. The first column is the row number, which will need to be removed. Of the remaining columns, Price is our target and the rest are features of each house.

After removing the row number, the columns are:
- Property Name
- Price
- House Type
- Area in sq ft
- No. of Bedrooms
- No. of Bathrooms
- No. of Receptions
- Location
- City/County
- Postal Code

In [None]:
def load_data(file_path):
    """
    Load data with proper NaN handling and verification
    """
    # Read CSV with na_values parameter to properly interpret NaN values
    df = pd.read_csv(file_path, na_values=['NaN', 'nan', 'NAN', '', 'null', 'NULL'])
    
    return df

# Load the data
df = load_data("../data/London_Housing_Data.csv")

# Display first 10 rows with headers in a more readable format
pd.set_option('display.max_columns', None)  # Show all columns
print("\nFirst 10 rows of the dataset with headers:")
display(df.head(10))

# Remove unnamed column with row numbers
df = df.drop(columns=['Unnamed: 0'])
print("\nDataset shape after removing unnamed column:", df.shape)
display(df.head(10))

# Let's verify the Location column specifically
print("\nUnique values in Location column:")
display(df['Location'].value_counts(dropna=False).head())

## Exploratory data analysis

Let's explore the data to get a better understanding of it, identify any issues and get some insights that will help us prepare it for model training.

In [None]:
# Set pandas display format to be more human readable with 2 decimal places
pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x))


def explore_data(df):
    print("\nStatistical Summary:")
    print(df.describe())
    
    plt.figure(figsize=(10,6))
    sns.histplot(df['Price']/1000000, kde=True)
    plt.title('Distribution of House Prices')
    plt.xlabel('Price (£ millions)')
    plt.ylabel('Frequency')
    plt.show()
    
    plt.figure(figsize=(10,6))
    plt.scatter(df['Area in sq ft'], df['Price']/1000000, alpha=0.5)
    plt.title('Price vs. Area')
    plt.xlabel('Area in sq ft')
    plt.ylabel('Price (£ millions)')
    plt.show()
    
    # Normal correlation matrix (numeric variables only)
    numeric_cols = ['Price', 'Area in sq ft', 'No. of Bedrooms', 'No. of Bathrooms', 'No. of Receptions']
    corr_matrix = df[numeric_cols].corr()
    plt.figure(figsize=(10,8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Matrix (Numeric Variables)')
    plt.show()

    print("\nCorrelation Matrix (Numeric Variables):")
    print(corr_matrix)

    print("\nUnique Values Analysis:")
    categorical_and_discrete = ['House Type', 'Location', 'City/County', 'Postal Code', 
                              'No. of Bedrooms', 'No. of Bathrooms', 'No. of Receptions']
    
    total_rows = len(df)
    for col in categorical_and_discrete:
        unique_count = df[col].nunique()
        null_count = df[col].isnull().sum()
        unique_ratio = (unique_count / total_rows) * 100
        
        print(f"\n{col}:")
        print(f"- Unique values: {unique_count}")
        print(f"- Null values: {null_count}")
        print(f"- Unique ratio: {unique_ratio:.2f}% of total rows")
        print(f"- Most common values:")
        print(df[col].value_counts().head(3))

    # Enhanced correlation matrix (including encoded categorical variables)
    categorical_cols = ['House Type', 'Location', 'City/County', 'Postal Code']
    df_encoded = df.copy()
    le = LabelEncoder()
    for col in categorical_cols:
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))

    enhanced_corr_columns = numeric_cols + categorical_cols
    enhanced_corr_matrix = df_encoded[enhanced_corr_columns].corr()

    plt.figure(figsize=(12,10))
    sns.heatmap(enhanced_corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Enhanced Correlation Matrix (Including Encoded Categorical Variables)')
    plt.show()

    print("\nEnhanced Correlation Matrix (Including Encoded Categorical Variables):")
    print(enhanced_corr_matrix)

    # Analyze categorical variables
    for col in categorical_cols:
        print(f"\nTop 5 {col} by Average Price:")
        print(df.groupby(col)['Price'].mean().sort_values(ascending=False).head())
        
    # Additional visualizations for categorical variables
    for col in categorical_cols:
        plt.figure(figsize=(12, 6))
        sns.boxplot(x=col, y='Price', data=df.sort_values('Price', ascending=False).head(100))
        plt.xticks(rotation=90)
        plt.title(f'Price Distribution by {col} (Top 100 Expensive Properties)')
        plt.ylabel('Price (Millions £)')
        # Convert y-axis values to millions and set fixed locator
        ax = plt.gca()
        yticks = ax.get_yticks()
        ax.yaxis.set_major_locator(plt.FixedLocator(yticks))
        ax.set_yticklabels(['{:.1f}'.format(x/1000000) for x in yticks])
        plt.show()

explore_data(df)

## Exploratory Data Analysis Discussion

Our exploratory data analysis reveals several key insights about the London housing market dataset that will influence our approach to building decision tree models.

### Price Distribution and Implications

The price distribution shows significant right-skew, with properties ranging from £180,000 to £39.75M. This roughly 221x range presents both challenges and opportunities:

#### Advantages for Decision Trees
- Trees naturally handle non-normal distributions without transformation
- Binary splits can effectively separate luxury properties from standard homes 
- No assumption of linearity required

#### Considerations
- Very expensive properties (>£20M) might need special treatment
- When evaluating our model we should consider performance across different price ranges
- We will want to use price bands for stratified sampling when splitting our data into training and test sets
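
A minimal sketch of the price-band idea, assuming quantile bands are an acceptable stand-in for price ranges. The toy prices, the `price_band` column name, and the choice of four bands are all illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy right-skewed prices standing in for the real 'Price' column (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Price": rng.lognormal(mean=13.5, sigma=0.8, size=400)})

# Quantile-based bands: each band holds roughly the same number of properties
toy["price_band"] = pd.qcut(toy["Price"], q=4, labels=False)

# Stratify on the band so every band appears in both splits in similar proportion
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["price_band"], random_state=42
)

print(train_df["price_band"].value_counts(normalize=True).sort_index())
print(test_df["price_band"].value_counts(normalize=True).sort_index())
```

Without `stratify`, a random split of a heavily skewed target can leave the test set with very few expensive properties, which makes errors at the top of the market hard to measure.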

### Feature Relationships

#### Strong Predictors

#### Area (sq ft)
- Strongest correlation with price (0.67)
- Also shows right-skew distribution
- Natural primary splitting candidate for decision trees

#### Room Counts
- Perfect correlation (1.0) between bedrooms, bathrooms, and receptions
- Suggests potential data quality issue
- Should select only one room count feature to avoid redundancy

#### Location Features
- 460 unique locations with 27.64% missing values
- High cardinality (many unique values, each shared by only a few samples) makes one-hot encoding expensive, even though it gives us a way to handle missing data
- Other features such as Postal Code provide hierarchical geographic information, but they also have a high number of unique values

### Data Quality Considerations

#### Missing Data
The 27.64% missing location data requires careful handling:
- Complete case analysis - dropping rows with missing values - would lose too much information
- One-hot encoding with missing indicator preserves data patterns
- Decision trees can handle missing values effectively

#### Identical Room Distributions
All room count features showing identical distributions (mean: 3.10, range: 0-10) suggests:
- Possible data entry issues
- Need for data validation
- We should select only one room count feature to avoid redundancy and co-linearity issues
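
One way to validate this suspicion is to check row by row whether the three counts are always equal. The sketch below uses a toy DataFrame with the same column names; the values are made up for illustration, and the real check would run on `df`.

```python
import pandas as pd

# Toy rows mimicking the suspicious pattern (values are illustrative)
toy = pd.DataFrame({
    "No. of Bedrooms": [2, 3, 5, 0],
    "No. of Bathrooms": [2, 3, 5, 0],
    "No. of Receptions": [2, 3, 5, 0],
})
room_cols = ["No. of Bedrooms", "No. of Bathrooms", "No. of Receptions"]

# A row is "identical" when all three counts collapse to a single unique value
rows_identical = toy[room_cols].nunique(axis=1).eq(1)
print(f"Rows with identical room counts: {rows_identical.mean():.0%}")

# If every row is identical, two of the three columns carry no extra information
if rows_identical.all():
    toy = toy.drop(columns=["No. of Bathrooms", "No. of Receptions"])
print(toy.columns.tolist())
```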

### Recommendations for Model Development

#### 1. Feature Selection Strategy
- Use Area as primary numerical predictor
- Use house type as categorical predictor
- Select one room count feature
- Parse the outcode instead of the full postcode to reduce the cardinality of this feature
- Explore other location features
- Consider adding other engineered features


#### 2. Data Preprocessing Approach
- Retain original price scale (no need for transformation)
- Handle missing locations with the indicator method: one-hot encoding with a missing category
- Consider creating price bands for stratified sampling, so that the train/test split keeps higher price ranges well represented in both datasets
- Consider binning area into meaningful ranges as a new feature

#### 3. Model Building Considerations
- Set `min_samples_leaf` to handle price outliers: requiring more samples per leaf is more robust to outliers, but may miss some of the subtler patterns in the data
- Use cross-validation with stratification
- Consider separate models for different price ranges
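
To make the `min_samples_leaf` trade-off concrete, here is a small sketch on simulated data. The area/price relationship, the noise level, and the three inflated "luxury" prices are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated data: price grows with area, plus noise (illustrative only)
rng = np.random.default_rng(0)
area = rng.uniform(300, 5000, size=300)
price = 800 * area + rng.normal(0, 200_000, size=300)
price[:3] *= 20  # a few extreme outliers standing in for luxury properties
X = area.reshape(-1, 1)

# min_samples_leaf=1 lets the tree isolate each outlier in its own leaf
tree_flexible = DecisionTreeRegressor(min_samples_leaf=1, random_state=42).fit(X, price)
# min_samples_leaf=20 forces every leaf to average at least 20 properties
tree_robust = DecisionTreeRegressor(min_samples_leaf=20, random_state=42).fit(X, price)

print("Leaves (min_samples_leaf=1): ", tree_flexible.get_n_leaves())
print("Leaves (min_samples_leaf=20):", tree_robust.get_n_leaves())
```

The second tree has far fewer leaves, so a single extreme price is diluted by its neighbours rather than memorised.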

#### 4. Evaluation Strategy
- Evaluate performance across price bands
- Use both absolute and percentage errors
- Compare against simpler baseline models
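
A rough sketch of what evaluating across price bands could look like. The actual and predicted prices, the band edges, and the band labels are hypothetical; in practice the predictions would come from a fitted model.

```python
import pandas as pd

# Hypothetical actual vs predicted prices (illustrative numbers only)
results = pd.DataFrame({
    "actual":    [250_000, 400_000, 900_000, 1_500_000, 5_000_000, 12_000_000],
    "predicted": [275_000, 380_000, 990_000, 1_350_000, 4_000_000, 15_000_000],
})
results["abs_error"] = (results["predicted"] - results["actual"]).abs()
results["pct_error"] = results["abs_error"] / results["actual"] * 100

# Hand-picked band edges; in practice these would be chosen from the data
bins = [0, 500_000, 2_000_000, float("inf")]
results["band"] = pd.cut(results["actual"], bins=bins, labels=["low", "mid", "high"])

# Mean absolute error and mean percentage error per band
summary = results.groupby("band", observed=True)[["abs_error", "pct_error"]].mean()
print(summary)
```

Absolute errors are dominated by the most expensive homes, while percentage errors put every band on a comparable scale, which is why it helps to report both.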

### Next Steps

This analysis suggests we should focus on:
- Handling missing categorical data effectively
- Managing high cardinality features
- Dealing with extreme price outliers
- Creating meaningful geographic features
- Evaluating model performance across price ranges

These insights will help us build more robust and interpretable decision tree models for London house price prediction.

## Data Processing Deep dive!

Before training our models, we need to prepare our data appropriately. In this lesson we will create multiple datasets to compare different approaches. 

We will:

1. **Create and compare multiple clean Datasets that handle missing data in different ways**
   - Dataset A: Leave missing values as is and let the decision tree model handle it
   - Dataset B: Drop rows with missing values
   - Dataset C: Fill missing values with 'Unknown'
   - Dataset D: Use one-hot encoding with missing category

2. **Feature Engineering and Selection**
   - Drop columns we deem non-predictive (e.g., Property Name)
   - Create feature sets of increasing complexity:
     - Basic: Bedrooms, Bathrooms, Receptions
     - Property: Basic + area in sq ft + House Type
     - Property + Location Combinations
     - Full: All available features including new engineered features such as postcode outcode

3. **Data Preprocessing**
   - Convert categorical variables to numerical using `LabelEncoder`
   - Split each dataset variant into training and test sets
   - Apply feature scaling where appropriate

This structured approach will allow us to:
- Compare the impact of different missing data handling methods
- Using the best missing data approach, evaluate the model's performance with different feature combinations
- Compare the performance of the decision tree model with a linear regression and a more advanced technique such as a random forest model

The following code sections will implement these preparation steps and create our model comparison framework.

### Data Cleanliness

Our data is dirty! By this we mean it contains missing values and possibly incorrect data. Not to worry, the world is messy and this is a great opportunity to practice our data cleaning and preprocessing skills.

We've uncovered two issues with our data:
- We have missing values in the 'Location' field in 27% of our data
- We have a perfect correlation between bedroom, bathrooms and receptions which suggests a data quality issue

We will need to handle both of these issues before we can train our models.


### Handling Missing Data

Our data has a significant amount of missing values, particularly in the 'Location' field where 27% of the values are NaN. 

While missing data can be problematic for many machine learning models, some decision tree implementations can handle missing values natively.

Some libraries use "surrogate splits", where alternative features that best approximate the original split are used when the primary split feature has missing values.

Others, such as C4.5, use a technique called "fractional samples" or "fractional instances", where samples with missing values are assigned fractional weights to each branch based on the proportion of non-missing samples that go to each branch.

scikit-learn (version 1.3 and later) takes a simpler route for numeric features: at each candidate split it tries sending all missing values left, then right, and keeps whichever gives the better split.

The approach to handling missing data can differ based on whether the variable is categorical or continuous.

### Handling Missing Categorical Data

For categorical variables like 'Location', we can consider the following methods:

1. **Leaving the missing values as is and letting our decision tree model handle it**
2. **Dropping rows with missing values**
3. **Filling NaNs with a specific value (e.g., 'Unknown')**
4. **One-hot encoding with a separate 'Missing' category**

Let's explore these methods and their potential impact on our decision tree model.

#### 1. Leaving the missing values as is and letting our decision tree model handle it

Recent versions of scikit-learn (1.3 and later) let `DecisionTreeRegressor` train on numeric features that contain NaN, without explicit imputation or removal of rows. This applies to numeric inputs only: a string column such as 'Location' still has to be encoded as numbers before it reaches the model. scikit-learn does not use fractional weights. Instead, it learns at each split whether missing values should go left or right. The "fractional samples" technique described next is the classic alternative, used by algorithms such as C4.5.

##### Fractional Samples Technique:

When considering a split for a feature with missing values, the decision tree assigns fractional weights to samples with missing values based on the proportion of non-missing samples that go to each branch.

For example, if 70% of the non-missing samples go to the left branch and 30% go to the right branch, a sample with a missing value would be assigned a weight of 0.7 for the left branch and 0.3 for the right branch.

The decision tree algorithm then evaluates the split on the weighted samples, considering both the non-missing samples and the fractional weights assigned to the missing samples.

During prediction, if a new sample has a missing value for a feature used in a split, the sample follows both branches of the split, and the final prediction is a weighted average of the predictions from the leaves reached in each branch.
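
The weighted-average step can be made concrete with a few lines of arithmetic. This is only a sketch of the fractional-weight idea for a single split; the leaf values are hypothetical.

```python
# Illustrative numbers only: the 70/30 split from the example above and
# hypothetical mean prices in the two child leaves.
n_left, n_right = 70, 30                     # non-missing training samples per branch
leaf_left, leaf_right = 900_000, 2_400_000   # hypothetical leaf predictions in pounds

w_left = n_left / (n_left + n_right)
w_right = n_right / (n_left + n_right)

# A sample with a missing split feature follows both branches; its prediction
# is the weight-averaged value of the leaves it reaches.
prediction = w_left * leaf_left + w_right * leaf_right
print(f"Weights: left={w_left:.1f}, right={w_right:.1f}")
print(f"Blended prediction: {prediction:,.0f}")
```

Here the blended prediction is 0.7 × 900,000 + 0.3 × 2,400,000 = 1,350,000, which sits between the two leaves and closer to the branch that most training samples followed.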

**Pros:**

- No data loss: All available information is preserved, including rows with missing values.
- Avoids introducing bias: The decision tree algorithm can learn from the patterns in the non-missing values without making assumptions about the missing data.
- Handles missing values during prediction: The trained model can make predictions on new data that has missing location values.
- Efficient and convenient: Eliminates the need for separate data preprocessing steps to handle missing values.

**Cons:**

- Increased complexity: The decision tree algorithm needs to handle missing values internally, which can add complexity to the model training process.
- Potential overfitting: If the missing values have a specific pattern or meaning, the decision tree might overfit to that pattern, leading to reduced generalization performance.
- Interpretability: Internal handling of missing values can make the decision tree splits and structure less intuitive and harder to interpret compared to a clean dataset without missing values.

#### 2. Dropping Rows with Missing Values

The simplest approach is to drop any row that has a missing value. 

In [None]:
def drop_rows_with_missing(df):
    return df.dropna()

df_dropped = drop_rows_with_missing(df)
print(f"Rows after dropping: {len(df_dropped)}")
print(f"Percentage of data lost: {(1 - len(df_dropped) / len(df)) * 100:.2f}%")

Pros:
- Easy to implement
- Results in a clean dataset without any missing values

Cons:
- We lose a significant portion of our data (27% in this case), which can reduce the representativeness of our dataset and the predictive power of our model
- If the missing values are not Missing Completely At Random (MCAR) - where the probability of a value being missing is unrelated to any other variable, this can introduce bias
- The model will not be able to make predictions on new data that has missing location values

#### 3. Filling NaNs with a Specific Value

We can fill in the missing values with a specific value that indicates missingness, such as 'Unknown'.

In [325]:
def fill_with_value(df, value):
    return df.fillna(value)

df_filled_unknown = fill_with_value(df, 'Unknown')

Pros:
- Easy to implement: It requires just a single fillna() operation in pandas, making it one of the simplest approaches to handle missing data
- Retains all data points: Unlike dropping rows, this method preserves your entire dataset, ensuring no information is discarded

Cons:
- Problematic assumptions about missing data: When we replace NaN with 'Unknown', we're making an assumption that all missing values represent a specific category that is known to us and related to the observed data.

- This assumption may not be true because:
  - Values might be missing for various unrelated reasons (data entry errors, not collected, etc.)
  - Missing values might not have any meaningful relationship to each other
  - The missingness itself might not be informative

- Potential bias introduction: 
  - If 'Unknown' has a specific meaning in your domain (e.g., deliberately withheld information), using it as a catch-all for missing values could confuse the model and lead to incorrect predictions

#### 4. One-Hot Encoding with a Separate 'Missing' Category

In this approach, we create a separate category for missing values during one-hot encoding. For example, with our 'Location' column:

```code
Original data:
+------------+-----------+
| Row Number | Location  |
+------------+-----------+
| Row 1      | Chelsea   |
| Row 2      | NaN       |
| Row 3      | Hackney   |
| Row 4      | Chelsea   |
+------------+-----------+

After one-hot encoding with missing values:
+------------+------------------+------------------+-------------+
| Row Number | Location_Chelsea | Location_Hackney | Location_NaN|
+------------+------------------+------------------+-------------+
| Row 1      |        1         |        0         |      0      |
| Row 2      |        0         |        0         |      1      |
| Row 3      |        0         |        1         |      0      |
| Row 4      |        1         |        0         |      0      |
+------------+------------------+------------------+-------------+
```

In [None]:
def onehot_encode_with_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """
    One-hot encodes a column while handling missing values.
    
    Args:
        df: Input DataFrame
        column: Name of column to encode
        
    Returns:
        DataFrame with one-hot encoded columns replacing original column
    """
    # Create a copy to avoid modifying original
    df_encoded = df.copy()
    
    # Create dummy variables including NaN values
    dummy_cols = pd.get_dummies(df_encoded[column], dummy_na=True, prefix=column)
    
    # Remove original column and add dummy columns
    df_encoded = df_encoded.drop(columns=[column])
    df_encoded = pd.concat([df_encoded, dummy_cols], axis=1)
    
    return df_encoded

# Create new DataFrame with encoded values
df_onehot = onehot_encode_with_missing(df, 'Location')
display(df_onehot.head())
print(f"Total columns: {len(df_onehot.columns)}")

Pros:
- Retains all data points (3,480 rows)
- Allows the model to treat missing values as a separate category
- Preserves location-specific patterns in the data
- No assumptions made about missing values

Cons:
- Dramatic increase in dimensionality:
  - From 1 column to 657 columns (656 unique locations + 1 NaN column)
  - Creates very sparse matrix (most values will be 0)
  - Even common locations like Putney (96 rows) only use 2.8% of their column
- High memory usage:
  - Each row needs 657 boolean values instead of one categorical value
  - Significant impact on model training time
- Risk of overfitting:
  - Many locations have very few examples (low signal-to-noise ratio)
  - Decision trees might make splits based on rare locations
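
The sparsity and rare-category concerns can be measured directly before committing to one-hot encoding. The sketch below uses a toy 'Location' column (the counts are made up) and an arbitrary "fewer than 3 rows" cut-off for what counts as rare.

```python
import pandas as pd

# Toy 'Location' column with one common and several rare values (illustrative)
locations = pd.Series(
    ["Putney"] * 6 + ["Chelsea"] * 3 + ["Hackney", "Barnes", None, None],
    name="Location",
)

dummies = pd.get_dummies(locations, dummy_na=True, prefix="Location")

# Fraction of cells that are non-zero: low values mean a sparse matrix
density = dummies.to_numpy().mean()

# Share of distinct locations backed by fewer than 3 rows
counts = locations.value_counts()
rare_share = (counts < 3).mean()

print(f"Encoded columns: {dummies.shape[1]}")
print(f"Share of non-zero cells: {density:.1%}")
print(f"Share of locations with fewer than 3 rows: {rare_share:.0%}")
```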

### Handling Missing Continuous Data

Whilst data is present for all our other variables, let's quickly discuss how we would handle missing data for a continuous variable.

For continuous variables like 'Area in sq ft', we can consider the following methods:

1. **Dropping rows with missing values**
2. **Imputing missing values**
   - Using statistics like mean, median
   - Using advanced imputation methods like KNN or MICE
3. **Binning the continuous variable and treating missing values as a separate bin**

#### Dropping Rows with Missing Values

The simplest approach is to drop any row that has a missing value. 

This has the same pros and cons as it did for our categorical data.

Pros:
- Easy to implement
- Results in a clean dataset without any missing values

Cons:
- We lose data points, which can reduce the representativeness of our dataset and the predictive power of our model
- If the missing values are not completely at random (MCAR), this can introduce bias

#### Imputing Missing Values

Instead of dropping rows, we can fill in (impute) the missing values using statistics like mean or median.

```python
def fill_with_stats(df, method):
    # numeric_only=True skips text columns, which have no mean or median
    if method == 'mean':
        return df.fillna(df.mean(numeric_only=True))
    elif method == 'median':
        return df.fillna(df.median(numeric_only=True))
    else:
        raise ValueError(f"Unsupported method: {method}")

df_filled_mean = fill_with_stats(df, 'mean')
df_filled_median = fill_with_stats(df, 'median')

```

Pros:
- Retains all data points
- If the data is Missing Completely At Random (MCAR), where missingness is unrelated to any observed or unobserved values, mean imputation leaves the variable's mean unbiased

Cons:
- Imputed values are estimates, which can introduce "noise": statistical variability or random fluctuations in the data that don't represent true patterns
- If missingness depends on the unobserved value itself, known as Missing Not At Random (MNAR), this method can give biased estimates
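
A small simulation can show why the missingness mechanism matters for mean imputation. Everything here is simulated: the area distribution, the 30% MCAR rate, and the rule that large properties tend to withhold their area are assumptions chosen only to illustrate the effect.

```python
import random
import statistics

random.seed(42)

# Hypothetical floor areas in sq ft (simulated, not the real dataset)
areas = [random.gauss(1700, 500) for _ in range(10_000)]
true_mean = statistics.mean(areas)

# MCAR: every value has the same 30% chance of being missing
mcar_observed = [a for a in areas if random.random() > 0.3]

# MNAR: large properties (> 2,000 sq ft) usually have their area withheld
mnar_observed = [a for a in areas if not (a > 2000 and random.random() < 0.8)]

# Mean imputation fills every gap with the observed mean, so the mean of the
# imputed column equals the mean of the observed values.
mcar_mean = statistics.mean(mcar_observed)
mnar_mean = statistics.mean(mnar_observed)

print(f"True mean:          {true_mean:,.0f}")
print(f"Imputed mean, MCAR: {mcar_mean:,.0f}")
print(f"Imputed mean, MNAR: {mnar_mean:,.0f}")
```

Under MCAR the imputed mean stays close to the true mean, while under this MNAR rule it is pulled noticeably downwards because the largest homes are under-represented among the observed values.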

#### Advanced imputation methods: KNN and MICE

We can also use more advanced imputation methods like KNN or MICE (available in libraries like scikit-learn and impyute).

Key difference:
- KNN looks at similar samples
- MICE builds relationships between features

##### KNN (K-Nearest Neighbors) Imputation:
- K represents the number of neighbours to consider (e.g., if K=5, we look at 5 similar samples)
- For each sample with missing values:
  1. Look at all other complete samples in the dataset
  2. Calculate similarity between samples using features that aren't missing
  3. Find the K most similar samples (the "neighbours")
  4. Fill missing values using:
     - For numerical: average of the K neighbours' values
     - For categorical: most common value among the K neighbours
- Larger K values (e.g., 10-20):
  - Pro: more stable predictions
  - Con: might miss local patterns
- Smaller K values (e.g., 3-5):
  - Pro: better at capturing local patterns
  - Con: more sensitive to noise

   
```python
from sklearn.impute import KNNImputer

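def fill_with_knn(df, n_neighbors=5):
    # KNNImputer only works on numeric data, so restrict to numeric columns
    numeric_df = df.select_dtypes(include='number')
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = imputer.fit_transform(numeric_df)
    # Keep the original index so rows can be aligned back to df
    return pd.DataFrame(imputed, columns=numeric_df.columns, index=numeric_df.index)

df_filled_knn = fill_with_knn(df)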
```

Pros:
- Can capture more complex patterns in the data when imputing
- Can work well if the data conforms to the assumptions of the method (e.g., data is missing at random for KNN)

Cons:
- More computationally expensive than simple methods
- Requires careful selection of parameters (e.g., number of neighbors for KNN)
- Can still introduce bias if the assumptions are not met

##### MICE (Multiple Imputation by Chained Equations):
A more sophisticated approach that works like this:
1. Initial setup:
   - Start with rough guesses for all missing values (e.g., mean/mode)
   - Create multiple copies of the dataset (usually 3-5)

2. For each copy:
   - For each feature with missing values:
     - Temporarily remove the current guesses
     - Build a regression/classification model using other features
     - Predict new values for the missing data
     - Update the guesses with these predictions
   - Repeat this process several times (iterations)

3. Final step:
   - We now have multiple complete datasets
   - Each represents a possible version of reality
   - Combine them for final estimates, capturing uncertainty

###### Example using MICE:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import pandas as pd

def fill_with_mice(df, n_imputations=5, max_iter=10, random_state=42):
    # IterativeImputer only works on numeric data, so restrict to numeric columns
    numeric_df = df.select_dtypes(include='number')
    all_imputations = []

    for i in range(n_imputations):
        # sample_posterior=True draws imputed values from the predictive
        # distribution, so each seed produces a different plausible dataset
        mice_imputer = IterativeImputer(
            max_iter=max_iter,
            sample_posterior=True,
            random_state=random_state + i
        )

        # Fit and transform the data
        imputed_data = mice_imputer.fit_transform(numeric_df)
        all_imputations.append(
            pd.DataFrame(imputed_data, columns=numeric_df.columns, index=numeric_df.index)
        )

    # Combine imputations by taking the mean of each cell across the copies
    final_df = pd.concat(all_imputations).groupby(level=0).mean()

    return final_df

# Apply MICE imputation
df_filled_mice = fill_with_mice(df)

# Compare original vs imputed values for a numeric column with missing data, if any
numeric_df = df.select_dtypes(include='number')
missing_cols = numeric_df.columns[numeric_df.isnull().any()]
if len(missing_cols) > 0:
    missing_col = missing_cols[0]
    print(f"\nComparison for {missing_col}:")
    print(f"Original mean: {df[missing_col].mean():.2f}")
    print(f"Imputed mean: {df_filled_mice[missing_col].mean():.2f}")
```

Pros of MICE:
 - Preserves relationships between variables by using other features to predict missing values
 - Creates multiple imputations to capture uncertainty in the missing data
 - Can handle different variable types (continuous and categorical)
 - More sophisticated than simple mean/median imputation
 - Accounts for the randomness in the imputation process

 Cons of MICE:
 - Computationally intensive, especially with large datasets
 - Assumes missing data is MAR (Missing At Random)
 - May not perform well if relationships between variables are highly non-linear
 - Can be sensitive to the order of variables in the imputation process
 - Multiple imputations need to be combined, which adds complexity


#### Binning the Continuous Variable

We can bin the continuous variable into discrete intervals and treat missing values as a separate bin.

```python

def bin_with_missing(df, column, n_bins=5):
    # Work on a copy so the original DataFrame keeps its continuous values
    df = df.copy()
    df[column] = pd.qcut(df[column], n_bins, labels=False, duplicates='drop')
    df[column] = df[column].astype('object')
    df[column] = df[column].fillna('Missing')
    return df

df_binned = bin_with_missing(df, 'Area in sq ft')
```

Pros:
- Retains all data points
- Allows the model to treat missing values as a separate category
- Can capture non-linear relationships between the variable and the target

Cons:
- Loses some information by discretizing the continuous variable
- The choice of the number of bins can impact model performance

### Using Models that Handle Missing Data

More advanced implementations using multiple trees offer sophisticated approaches to missing data:

1. **Random Forests** 
   - An ensemble method that builds multiple decision trees and averages their predictions
   - Some implementations, following Breiman's original design, fill in missing values using proximities: how often two samples land in the same leaf across the forest
   - Rather than finding correlated features as surrogate splits do, this looks at similar samples across multiple trees to estimate the missing values

2. **XGBoost (Extreme Gradient Boosting)**
   - A boosting algorithm that builds trees sequentially, with each tree correcting errors from previous trees
   - Takes a different approach by learning the optimal default direction for missing values at each split
   - During training, it tests sending missing values both left and right, choosing the direction that minimizes error
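
The "try both directions" idea can be sketched for a single split in plain Python. This is a simplification under stated assumptions: it scores each option with the sum of squared errors, whereas XGBoost scores splits with a gradient-based gain, and the helper names `sse` and `best_missing_direction` are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_missing_direction(feature, target, threshold):
    """Try routing missing values left, then right; keep the lower-error option."""
    left = [t for f, t in zip(feature, target) if f is not None and f <= threshold]
    right = [t for f, t in zip(feature, target) if f is not None and f > threshold]
    missing = [t for f, t in zip(feature, target) if f is None]

    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    return "left" if error_if_left <= error_if_right else "right"


# Toy data: area in sq ft (None = missing) and price in thousands of pounds
area = [500, 700, 900, 2500, 3000, None, None]
price = [300, 350, 400, 2000, 2400, 2100, 2300]

direction = best_missing_direction(area, price, threshold=1000)
print(direction)  # prints "right": the rows with missing area are priced like large homes
```

Because the direction is learned from the training targets, missingness itself becomes a usable signal rather than something that has to be imputed away first.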

Pros:
- No separate imputation step needed (for supported models)
- Missing patterns can contribute to the model's learning
- Often performs better than simple imputation methods

Cons:
- Not all implementations support these methods (e.g., scikit-learn's decision trees don't implement surrogate splits)
- Less transparent than explicit handling methods
- May not be suitable when understanding the missing data mechanism is important

In the next section, we'll examine the features of our London housing dataset and begin building our initial model.

### Features of the data

Let's have a look at our dataset again:

In [None]:
# Print the data types and value ranges for each column
print("Data types and value ranges for each column:\n")

for column in df.columns:
    print(f"\n{column}:")
    if df[column].dtype in ['int64', 'float64']:
        print(f"Type: {df[column].dtype}")
        print(f"Range: {df[column].min():,.2f} to {df[column].max():,.2f}")
        print(f"Mean: {df[column].mean():,.2f}")
    else:
        print(f"Type: {df[column].dtype}")
        print("Categories:")
        value_counts = df[column].value_counts()
        for value, count in value_counts.items():
            print(f"  - {value}: {count:,} occurrences")




The dataset has 3,480 properties including:

**Target variable:**

Price 
- dtype int64
- Continuous variable
- Range: £180,000 to £39,750,000
- Mean: £1,864,173

**Features:**

Property Name 
  - dtype pandas object - strings, no missing values
  - Contains 785 unique property names
  - Categorical variable with text values
  - Most common: "Television Centre" (17 occurrences)


House Type 
- dtype pandas object - strings, no missing values
- Categorical variable with 8 categories:
  - Flat/Apartment (1,565)
  - House (1,430)
  - New development (357)
  - Penthouse (100)
  - Studio (10)
  - Bungalow (9)
  - Duplex (7)
  - Mews (2)

Area in sq ft 
- dtype int64 
- Continuous variable
- Range: 274 to 15,405 square feet
- Mean: 1,713 sq ft


Number of Bedrooms 
- dtype int64 
- Discrete numerical variable
- Range: 0 to 10 bedrooms
- Mean: 3.10


Number of Bathrooms 
- dtype int64 
- Discrete numerical variable
- Range: 0 to 10 bathrooms
- Mean: 3.10


Number of Receptions 
- dtype int64 
- Discrete numerical variable
- Range: 0 to 10 reception rooms
- Mean: 3.10


Location 
- dtype pandas object - strings, has missing values as NAN
- Categorical variable with 460 unique locations
- Most common: "Putney" (96 occurrences)


City/County 
- dtype pandas object - string, no missing values
- Categorical variable with 53 unique values
- Most common: "London" (2,972 occurrences)


Postal Code 
- dtype pandas object - strings, no missing values
- Contains 1,284 unique postal codes
- Alphanumeric string format (e.g., "SW6 3LF")
- Most common: "SW6 3LF" (14 occurrences)

We also have the Postal Code, which we can use to extract more geographical information.

### Feature engineering

Before training our models, we can enhance our dataset through feature engineering - the process of creating new features that might help capture important patterns in the data.

One key opportunity in our London housing dataset is to extract more meaningful geographical information from the postcodes.

Other ideas could include:
- Binning the area feature into more meaningful ranges
- Distance from city center
- Distance from nearest tube station
- School quality score for the area
- Crime rate for the area
- Green space percentage for the area
- Number of amenities (shops, restaurants, etc.) within a radius
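
As a rough illustration of the first idea, the area could be binned with `pd.cut`. This is only a sketch: the bin edges, labels and toy values below are arbitrary assumptions chosen for the example, not values derived from the dataset.

```python
import pandas as pd

# Toy areas in sq ft (made-up values, not rows from the dataset)
toy = pd.DataFrame({"Area in sq ft": [300, 750, 1200, 2600, 8000]})

# Hypothetical bin edges and labels; tune these against the real distribution
area_edges = [0, 500, 1000, 2000, 5000, float("inf")]
area_labels = ["0-500", "500-1000", "1000-2000", "2000-5000", "5000+"]

# Each interval includes its right edge by default, e.g. (500, 1000]
toy["Area band"] = pd.cut(toy["Area in sq ft"], bins=area_edges, labels=area_labels)
print(toy)
```

Whether binning actually helps a tree-based model is an empirical question, since trees already learn their own thresholds on continuous features.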

#### Adding Postcode Outcode feature
Currently, our dataset has 1,284 unique postcodes spread across 3,478 properties, meaning we have an average of only 2.7 properties per postcode. This sparsity could make it difficult for the model to learn meaningful patterns, as many postcodes will have just 1-2 examples.

We can improve this by extracting the "outcode" - the first part of a UK postcode (e.g., "SW6" from "SW6 3LF"). 

Outcodes represent broader geographical areas and offer several advantages:

1. Increased data density: Each outcode will contain more properties, giving the model more examples to learn from within each area

2. Better generalization: The model can learn broader geographical patterns, at roughly the level of a postal district, rather than overfitting to specific streets

3. Reduced dimensionality: Instead of 1,284 unique values, we'll have far fewer unique outcodes, making the feature space more manageable

4. Statistical significance: More properties per group means more reliable average prices and trends for each area
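
To make the density argument concrete, here is a small sketch on a handful of illustrative postcodes (made up for the example, apart from the "SW6 3LF" format shown earlier). It compares group sizes before and after reducing each postcode to its outcode.

```python
import pandas as pd

# Illustrative postcodes only, not rows from the dataset
toy_postcodes = pd.Series(
    ["SW6 3LF", "SW6 3LF", "SW6 1AA", "SW6 2BB", "W1K 4QY", "W1K 7TN", "N1 9GU"]
)

# The outcode is the part before the space
toy_outcodes = toy_postcodes.str.split().str[0]

per_postcode = toy_postcodes.value_counts()
per_outcode = toy_outcodes.value_counts()

print(f"Unique postcodes: {per_postcode.size}, mean group size: {per_postcode.mean():.2f}")
print(f"Unique outcodes:  {per_outcode.size}, mean group size: {per_outcode.mean():.2f}")
print(f"Postcodes seen only once: {(per_postcode == 1).sum()}")
```

Even in this tiny example, most full postcodes appear only once, while every outcode except one covers several properties.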


#### Let's add this new feature to each of our cleaned datasets:


In [None]:
from typing import Optional

def extract_outcode(postcode: str) -> Optional[str]:
    """Extract the outcode (first part) from a postcode; returns None for non-string values such as NaN."""
    return postcode.split()[0] if isinstance(postcode, str) else None

def add_outcode_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Add outcode feature derived from Postal Code column."""
    df_with_outcode = df.assign(
        Outcode=df['Postal Code'].map(extract_outcode)
    )
    
    n_unique = df_with_outcode['Outcode'].nunique()
    avg_properties = len(df_with_outcode) / n_unique
    
    print(f"Created {n_unique} unique outcodes")
    print(f"Average properties per outcode: {avg_properties:.1f}")
    
    return df_with_outcode

# Apply to each of our cleaned datasets
df_original_with_outcode = add_outcode_feature(df)
df_dropped_with_outcode = add_outcode_feature(df_dropped)
df_filled_unknown_with_outcode = add_outcode_feature(df_filled_unknown)
df_onehot_with_outcode = add_outcode_feature(df_onehot)

display(df_original_with_outcode.head(10))
display(df_dropped_with_outcode.head(10))
display(df_filled_unknown_with_outcode.head(10))
display(df_onehot_with_outcode.head(10))
# Example analysis of how outcodes relate to price
print("\nTop 5 outcodes by average price:")

print(df_dropped_with_outcode.groupby('Outcode')['Price'].agg(['mean', 'count'])
      .sort_values('mean', ascending=False)
      .head())

### Preparing the data for modeling

For our initial model, we will compare df_original to our missing data approaches df_dropped, df_filled_unknown and df_onehot.

We need to prepare the data for modeling by:
- Dropping any columns that are not useful for training our model (e.g. property name and 2 of the room count features)
- Converting any categorical variables to numerical variables using `LabelEncoder`
- Splitting our data into training and test sets using an 80/20 split, stratified on the price variable so that the training and test sets have a similar distribution of prices
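
As a quick reminder of what `LabelEncoder` does, here is a minimal sketch on a few toy house types (illustrative values, not the full set of categories in the data):

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration
house_types = ["House", "Flat/Apartment", "Penthouse", "House"]

encoder = LabelEncoder()
codes = encoder.fit_transform(house_types)

print(list(encoder.classes_))  # categories, stored in sorted order
print(codes.tolist())          # one integer code per row
```

The codes are assigned in sorted (alphabetical) order, so their numeric order carries no real meaning. A tree can still split on them, but the resulting thresholds should be interpreted with care.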

#### Feature scaling

We will not apply any scaling to the data, as decision trees are not sensitive to the scale of the features. It is worth noting, however, that our exploratory analysis showed the price variable has a very large range and is not normally distributed.

We need to account for this when splitting our data into training and test sets and be mindful of this when interpreting our model's results - errors are likely to be larger for properties with higher prices.
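
The claim about scale can be checked with a small experiment. The sketch below uses synthetic, integer-valued features (an assumption made so the comparison is exact) and fits the same tree on raw and rescaled versions of the data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic integer-valued features: area (sq ft) and number of bedrooms
X = np.column_stack([
    rng.integers(300, 5000, size=200),
    rng.integers(1, 7, size=200),
]).astype(float)
y = 500 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, size=200)

X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Rescale only the area column by a constant factor
scale = np.array([1000.0, 1.0])

tree_raw = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
tree_scaled = DecisionTreeRegressor(random_state=42).fit(X_train * scale, y_train)

same = np.allclose(tree_raw.predict(X_test), tree_scaled.predict(X_test * scale))
print("Identical predictions after rescaling:", same)
```

Rescaling a feature moves the split thresholds but not the order of the samples, so the tree partitions the data in the same way. This applies to the features only; the skew in the target still affects the size of the errors.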


In [None]:
def prepare_data_for_modeling(df):
    """
    Prepare data for modeling with stratified sampling based on price bands
    """
    # Create a copy to avoid modifying original
    df_model = df.copy()
    
    # Create price bands for stratification
    df_model['price_band'] = pd.qcut(df_model['Price'], q=10, labels=False)

    # Drop unnecessary columns if present
    unnecessary_columns = ['Property Name', 'No. of Bathrooms', 'No. of Receptions']
    df_model = df_model.drop([col for col in unnecessary_columns if col in df_model.columns], axis=1)

    # Convert categorical variables using LabelEncoder
    categorical_cols = ['House Type', 'Location', 'City/County', 'Postal Code', 'Outcode']
    label_encoder = LabelEncoder()
    
    for col in categorical_cols:
        if col in df_model.columns:
            df_model[col] = label_encoder.fit_transform(df_model[col])
    
    # Split features and target
    X = df_model.drop(['Price', 'price_band'], axis=1)
    y = df_model['Price']
    
    # Stratified split using price bands
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.2, 
        random_state=42,
        stratify=df_model['price_band']  # Use price bands for stratification
    )
    
    print("Features included:", X.columns.tolist())
    # print("Training set shape:", X_train.shape)
    # print("Test set shape:", X_test.shape)
    
    # Verify stratification worked
    train_price_stats = y_train.describe()
    test_price_stats = y_test.describe()
    # print("\nPrice distribution comparison:")
    # print("\nTraining set:")
    # print(train_price_stats)
    # print("\nTest set:")
    # print(test_price_stats)
    
    return X_train, X_test, y_train, y_test

df_original_prepared = prepare_data_for_modeling(df)
df_dropped_prepared = prepare_data_for_modeling(df_dropped)
df_filled_unknown_prepared = prepare_data_for_modeling(df_filled_unknown)
df_onehot_prepared = prepare_data_for_modeling(df_onehot)

display(df_original_prepared[0].head(5))
display(df_dropped_prepared[0].head(5))
display(df_filled_unknown_prepared[0].head(5))
display(df_onehot_prepared[0].head(5))

## Running the intial model and comparing missing data methods

Let's create and evaluate our initial model using the `df_dropped` dataset and compare the results to the other missing data approaches.


In [None]:
def create_and_evaluate_model(prepared_data: tuple, model_name: str = "Model") -> tuple:
    """
    Creates and evaluates a decision tree model using prepared data tuples.
    
    Args:
        prepared_data: Tuple of (X_train, X_test, y_train, y_test)
        model_name: Name identifier for the model
        
    Returns:
        tuple: (results, model, X_test), where results is a dictionary
            containing model metrics and feature importance
    """

    
    # Unpack the prepared data
    X_train, X_test, y_train, y_test = prepared_data
    
    # Create and train model using default parameters on the training set
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    results = {
        'name': model_name,
        'metrics': {
            'MAE': mean_absolute_error(y_test, y_pred),
            'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'R2': r2_score(y_test, y_pred)
        },
        'feature_importance': dict(zip(X_train.columns, model.feature_importances_))
    }
    
    # Print results
    print(f"\nResults for {model_name}:")
    print(f"MAE: £{results['metrics']['MAE']:,.2f}")
    print(f"RMSE: £{results['metrics']['RMSE']:,.2f}")
    print(f"R2 Score: {results['metrics']['R2']:.4f}")
    
    # Print top 5 most important features
    top_features = sorted(results['feature_importance'].items(), key=lambda x: x[1], reverse=True)[:5]
    print("\nTop 5 Most Important Features:")
    for feature, score in top_features:
        print(f"{feature}: {score:.4f}")

    return results, model, X_test

# Individual calls for each prepared dataset
# (the function returns a tuple of results, model and X_test)
original_results, decision_tree_model, X_test = create_and_evaluate_model(df_original_prepared, "Original")
dropped_results, _, _ = create_and_evaluate_model(df_dropped_prepared, "Dropped NAs")
unknown_results, _, _ = create_and_evaluate_model(df_filled_unknown_prepared, "Unknown Values")
onehot_results, _, _ = create_and_evaluate_model(df_onehot_prepared, "One-hot Encoded")

# Collect all results
all_results = {
    'original': original_results,
    'dropped': dropped_results,
    'unknown': unknown_results,
    'onehot': onehot_results
}

# used to test saved model later in lesson
decision_tree_test_sample = X_test.sample(10)
# display(decision_tree_test_sample)


Woah! One-hot encoding is the best performing method and area in sq ft is the most important feature.

We determine this because it has the lowest MAE and RMSE and the highest R2 score.

Let's verify this by comparing the performance of each missing data method with our engineered feature outcode included.

In [None]:

df_original_with_outcode_prepared = prepare_data_for_modeling(df_original_with_outcode)
df_dropped_with_outcode_prepared = prepare_data_for_modeling(df_dropped_with_outcode)
df_filled_unknown_with_outcode_prepared = prepare_data_for_modeling(df_filled_unknown_with_outcode)
df_onehot_with_outcode_prepared = prepare_data_for_modeling(df_onehot_with_outcode)

# create_and_evaluate_model returns (results, model, X_test), so unpack all three values
original_with_outcode_results, decision_tree_model, _ = create_and_evaluate_model(df_original_with_outcode_prepared, "Original with Outcode")
dropped_with_outcode_results, _, _ = create_and_evaluate_model(df_dropped_with_outcode_prepared, "Dropped NAs with Outcode")
filled_unknown_with_outcode_results, _, _ = create_and_evaluate_model(df_filled_unknown_with_outcode_prepared, "Unknown Values with Outcode")
onehot_with_outcode_results, _, _ = create_and_evaluate_model(df_onehot_with_outcode_prepared, "One-hot Encoded with Outcode")

display(original_with_outcode_results)
display(dropped_with_outcode_results)
display(filled_unknown_with_outcode_results)
display(onehot_with_outcode_results)


### Performance Changes with Outcode Addition

Let's analyze the results from our second test including our engineered feature postcode outcode.

First, let's recall what these metrics mean:
- MAE (Mean Absolute Error): Average absolute difference between predicted and actual prices 
- RMSE (Root Mean Square Error): Square root of average squared differences, penalizes large errors more
- R2 (R-squared): Proportion of variance explained by model, higher is better (max 1.0)
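
A tiny worked example may help here. The prices below are made up for illustration; the last prediction misses by £1,000,000 to show how one large error affects each metric.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices in £; the last prediction is off by £1,000,000
y_true = np.array([500_000, 750_000, 1_000_000, 3_000_000])
y_pred = np.array([550_000, 700_000, 1_100_000, 2_000_000])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE:  £{mae:,.0f}")
print(f"RMSE: £{rmse:,.0f}")
print(f"R2:   {r2:.4f}")
```

The MAE here is £300,000, while the single large miss pulls the RMSE up to roughly £504,000. A large gap between the two is a hint that a few big errors dominate, which is what we see in the results below.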

#### Original Dataset with Outcode
- MAE: ↓0.3% (£693,983.58)
- RMSE: ↓1.6% (£1,884,277.57)
- R²: ↑3.6% (0.4864)
- Top Features:
    1. Area in sq ft: 0.6167
    2. Postal Code: 0.1364
    3. City/County: 0.0796
    4. House Type: 0.0582
    5. No. of Bedrooms: 0.0407

#### Dropped NAs Dataset with Outcode
- MAE: ↓2.5% (£604,233.61)
- RMSE: ↓3.3% (£1,411,814.09)
- R²: ↑5.6% (0.5641)
- Top Features:
    1. Area in sq ft: 0.5567
    2. Postal Code: 0.1117
    3. City/County: 0.1022
    4. Outcode: 0.0674
    5. House Type: 0.0670

#### Unknown Values Dataset with Outcode
- MAE: ↓2.1% (£655,992.20)
- RMSE: ↓1.7% (£1,786,127.58)
- R²: ↑3.0% (0.5385)
- Top Features:
    1. Area in sq ft: 0.6202
    2. Postal Code: 0.1216
    3. City/County: 0.0802
    4. House Type: 0.0581
    5. Outcode: 0.0500

#### One-hot Encoded Dataset with Outcode
- MAE: ↑2.7% (£619,448.17)
- RMSE: ↑3.1% (£1,629,301.54)
- R²: ↓3.6% (0.6160)
- Top Features:
    1. Area in sq ft: 0.5188
    2. Location_Mayfair: 0.1102
    3. Postal Code: 0.0883
    4. No. of Bedrooms: 0.0694
    5. House Type: 0.0585

### Key Findings

1. **Performance Impact**
   - Outcode improved performance in 3 out of 4 approaches
   - Largest improvements in Dropped NAs approach (↑5.6% R²)
   - One-hot encoding showed slight performance degradation with outcode addition

2. **Feature Importance**
   - Area in sq ft remained dominant (51-62%) across all variations
   - Outcode partially absorbed importance from Postal Code
   - Location features collectively account for ~25-30%

3. **Model Behavior**
   - Simple models benefited most from outcode addition
   - One-hot encoding showed potential feature redundancy
   - Outcode provides meaningful signal for location-based pricing

### Recommendations

1. **Model Selection**
   - Use outcode in simpler models where dimensionality is a concern
   - Consider dropping outcode for one-hot encoded models

2. **Feature Engineering**
   - Investigate combining location features to reduce redundancy
   - Consider creating location clusters using outcode and postal code
   - Explore interaction terms between outcode and other features

3. **Next Steps**
   - Test reduced feature sets to optimize model complexity
   - Evaluate performance on specific price ranges
   - Consider ensemble approaches combining different encoding methods
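
One way to evaluate performance on specific price ranges is to group the absolute errors by quantile bands of the true price. The helper below is a hypothetical sketch (the name `mae_by_price_band` and the synthetic data are assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

def mae_by_price_band(y_true, y_pred, n_bands=4):
    """Mean absolute error within quantile bands of the true price."""
    frame = pd.DataFrame({"actual": np.asarray(y_true), "predicted": np.asarray(y_pred)})
    frame["abs_error"] = (frame["actual"] - frame["predicted"]).abs()
    # Band 0 holds the cheapest properties, the last band the most expensive
    frame["band"] = pd.qcut(frame["actual"], q=n_bands, labels=False)
    return frame.groupby("band")["abs_error"].agg(["mean", "count"])

# Synthetic example: skewed prices with predictions that are 10% too high
rng = np.random.default_rng(0)
toy_actual = rng.lognormal(mean=13.5, sigma=0.8, size=400)
toy_pred = toy_actual * 1.10

band_errors = mae_by_price_band(toy_actual, toy_pred)
print(band_errors)
```

With the notebook's own objects, a call along the lines of `mae_by_price_band(y_test, model.predict(X_test))` should show how much of the overall error comes from the most expensive properties.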

**Next, let's try one-hot encoding with different combinations of features to reduce dimensionality while maintaining model performance**

## Feature selection

Feature selection is the process of selecting a subset of relevant features (variables) from a larger set of features. 

This is important because:

- It can help improve model performance by reducing overfitting and improving generalization
- It can make the model more interpretable by focusing on the most important features
- It can reduce computational complexity and improve training speed

In this section, we'll explore different feature subsets within the one-hot encoded dataset. Note that this dataset has 665 features because the Location field, which contained NaN values, is one-hot encoded. This adds an additional layer of complexity to our feature selection process.

We'll try different variations of the location features, including removing the one-hot encoded Location field and adding the engineered Outcode field.
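
To see why one-hot encoding produces so many columns, and what happens to rows with a missing Location, here is a minimal sketch. It assumes the encoding was done with something like `pd.get_dummies`; the exact method used earlier in the notebook may differ, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Location (values are illustrative)
toy = pd.DataFrame({
    "Location": ["Putney", np.nan, "Mayfair", "Putney"],
    "Price": [900_000, 650_000, 4_500_000, 1_100_000],
})

# NaN is ignored by default, so the missing row gets a zero in every Location_ column
toy_onehot = pd.get_dummies(toy, columns=["Location"], prefix="Location")
print(toy_onehot)

location_cols = [c for c in toy_onehot.columns if c.startswith("Location_")]
print("Location columns:", location_cols)
```

Each distinct location becomes its own column, which is how a few hundred locations turn into a few hundred features.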

 Let's try different feature combinations based on the following subsets:
 
 1. Basic Features
    - Bedrooms only
 
 2. Minimal Features
    - Identical to Basic in this run, because the bathroom and reception counts are dropped during data preparation
 
 3. Minimal Extended Features
    3.1. Minimal + Area in sq ft
    3.2. Minimal + House Type
 
 4. Property Features (Core)
    - Minimal + Area + House Type
 
 5. Property Location Combinations
    5.1. Property + Outcode
    5.2. Property + Location
    5.3. Property + City/County
    5.4. Property + Full Postcode
 
 6. Property Double Location Combinations
    6.1. Property + Location + City
    6.2. Property + City + Outcode
    6.3. Property + City + Postcode
    6.4. Property + Outcode + Location
    6.5. Property + Postcode + Location
 
 7. Property Triple Location Combinations
    7.1. Property + Location + City + Outcode
    7.2. Property + Location + City + Postcode
    7.3. Property + Outcode + Postcode + Location
 
 8. Full Property Profile
    - All features combined
 
The feature groupings above test combinations in three logical steps:

 1. Core property metrics (bedrooms; the bathroom and reception counts are dropped during data preparation)
 2. Property classification combinations (adding house type and area)
 3. Location data at different scales (outcode, full location, city/county, postal code)




This helps identify which features and geographic granularity provide the best predictive power.

In [None]:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Define feature subsets
feature_subsets = {
    # Basic Features
    'Basic': ['No. of Bedrooms', ],
    
    # Minimal Extended Features
    'Minimal + Area': ['No. of Bedrooms',  'Area in sq ft'],
    'Minimal + House Type': ['No. of Bedrooms',  'House Type'],
    
    # Property Features (Core)
    'Property': ['No. of Bedrooms',  'Area in sq ft', 'House Type'],
    
    # Property Location Combinations
    'Property + Outcode': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode'],
    'Property + Location': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Location'],
    'Property + City': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County'],
    'Property + Postcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Postal Code'],
    
    # Property Double Location Combinations
    'Property + Location + City': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County'],
    'Property + City + Outcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County', 'Outcode'],
    'Property + City + Postcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County', 'Postal Code'],
    'Property + Outcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode', 'Location'],
    'Property + Postcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Postal Code', 'Location'],
    
    # Property Triple Location Combinations
    'Property + Location + City + Outcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Location', 'City/County', 'Outcode'],
    'Property + Location + City + Postcode': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County', 'Postal Code'],
    'Property + Outcode + Postcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode', 'Postal Code', 'Location'],
    
    # Full Property Profile
    'Full': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County', 'Postal Code', 'Outcode']
}

def get_feature_columns(df, feature_subset):
    """
    Gets all relevant columns including one-hot encoded ones for Location
    """
    columns = []
    for feature in feature_subset:
        if feature == 'Location':
            # Add all Location_ columns for one-hot encoded data
            location_cols = [col for col in df.columns if col.startswith('Location_')]
            columns.extend(location_cols)
        elif feature in df.columns:
            columns.append(feature)    
    return columns

def evaluate_feature_subsets(df, feature_subsets):
    """
    Evaluates model performance for different feature subsets
    """
    results = {}
    
    for subset_name, features in feature_subsets.items():
        print(f"\nEvaluating {subset_name} subset...")
        # Get relevant columns including one-hot encoded ones
        selected_columns = get_feature_columns(df, features)
        
        # Add Price column to selected columns
        selected_columns.append('Price')
        
        # Create subset of data
        df_subset = df[selected_columns]
        
        # Prepare data using existing function
        X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)
        
        # Train model
        model = DecisionTreeRegressor(random_state=42)
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Store results
        results[subset_name] = {
            'MAE': mae,
            'RMSE': rmse,
            'R2': r2,
            'Feature Count': X_train.shape[1]
        }
        
        print(f"Number of features: {X_train.shape[1]}")
        print(f"MAE: £{mae:,.2f}")
        print(f"RMSE: £{rmse:,.2f}")
        print(f"R2 Score: {r2:.4f}")
            
    return pd.DataFrame(results).T

def plot_results(results_df):
    """
    Creates plots comparing model performance across feature subsets
    """
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
    
    # R² plot
    results_df['R2'].plot(marker='o', ax=ax1)
    ax1.set_title('R² Score by Feature Subset')
    ax1.set_ylabel('R² Score')
    ax1.grid(True)
    ax1.tick_params(axis='x', rotation=45)
    
    # RMSE plot
    results_df['RMSE'].plot(marker='o', ax=ax2)
    ax2.set_title('RMSE by Feature Subset')
    ax2.set_ylabel('RMSE')
    ax2.grid(True)
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

# Run evaluation on one-hot encoded dataset
print("\nEvaluating one-hot encoded dataset...")

results = evaluate_feature_subsets(df_onehot_with_outcode, feature_subsets)
display(results)
# Plot results
plot_results(results)

# Print detailed results
print("\nOne-hot Encoded Dataset Results:")
print(results.sort_values('R2', ascending=False))

### Feature Subset Performance Analysis

Right! Which feature subset performs the best?

##### Feature Subset Performance (Ordered by R² Score) top 5:

1. **Property + Outcode + Postcode + Location** (Best)
   - MAE: £566,853
   - RMSE: £1,283,867
   - R² Score: 0.7615
   - Features: 662

2. **Property + Postcode + Location**
   - MAE: £599,924
   - RMSE: £1,481,158
   - R² Score: 0.6826
   - Features: 661

3. **Property + Location + City + Postcode**
   - MAE: £599,672
   - RMSE: £1,567,275
   - R² Score: 0.6446
   - Features: 662

4. **Property + Location + City + Outcode**
   - MAE: £593,725
   - RMSE: £1,587,228
   - R² Score: 0.6355
   - Features: 662

5. **Full Dataset**
   - MAE: £602,994
   - RMSE: £1,599,813
   - R² Score: 0.6297
   - Features: 663

### Key Findings

#### 1. Feature Importance
- The Basic subset alone (bedrooms only) performs poorly (R² = 0.1749)
- Adding area improves performance significantly (R² increases to 0.2919)
- Location features provide the biggest boost to performance
- The combination of granular location data (Postcode + Outcode) with property features performs best

#### 2. Model Evolution
- Basic → +Area: +11.70 percentage points of R² (0.1749 → 0.2919)
- +House Type: +7.66 percentage points additional improvement
- +Location features: +23.94 percentage points additional improvement
- Fine-grained location (Postcode + Outcode): +19.36 percentage points final improvement

#### 3. Optimal Feature Set
The best performing combination includes:
- Property characteristics (bedrooms, area, house type)
- Outcode (broader postal area)
- Full postal code
- Specific location details
- This combination achieves significantly better results than using either postcode or outcode alone

### Performance Metrics Analysis

#### MAE (Mean Absolute Error)
- Best: £566,853 (Property + Outcode + Postcode + Location)
- Worst: £1,068,158 (Basic subset)
- Moving from the Basic subset to the best subset reduces MAE by approximately 47%

#### RMSE (Root Mean Square Error)
- Best: £1,283,867 (Property + Outcode + Postcode + Location)
- Worst: £2,388,161 (Basic subset)
- Indicates presence of some large prediction errors even in best model, most likely due to the skew and large outliers in the data.

#### R² Score
- Best: 0.7615 (Property + Outcode + Postcode + Location)
- Worst: 0.1749 (Basic subset)
- Shows model explains 76.15% of price variance at best

### Implications & Recommendations

1. **Feature Selection Strategy**
   - Retain all granular location data
   - Include both outcode and full postal code
   - Keep property characteristics as baseline features

2. **Model Improvements**
   - Consider feature engineering for more location interactions
   - Investigate non-linear relationships, especially in property features
   - Possible benefit from ensemble methods

3. **Data Collection**
   - Focus on gathering more detailed location data
   - Get a better dataset: the room data in this one is poor quality, so we have lost it as a feature
   - Consider additional property features
   - Potential value in temporal data (sale dates, market conditions)

4. **Practical Application**
   - At present this model is probably not suitable for initial pricing estimates.
   - We may be able to separate the model into two, one for smaller and one for larger properties, giving us a more accurate model for the smaller properties.
   - Error margins should be communicated (±£567K on average)

### Limitations

1. **Error Magnitude**
   - Even the best model has a significant average error (£567K)
   - RMSE indicates some very large prediction errors

2. **Feature Complexity**
   - Large number of features (662) may lead to overfitting
   - Sparse matrix from location encoding

3. **Location Dependency**
   - Heavy reliance on location features
   - May perform poorly in areas with limited data

### Next Steps

1. Consider implementing:
   - Feature reduction techniques (PCA)
   - Ensemble methods (Random Forest, XGBoost)
   - Cross-validation for more robust evaluation

2. Explore:
   - Feature interaction effects
   - Non-linear transformations of numeric features
   - More sophisticated location encoding methods

3. Investigate:
   - Temporal aspects of pricing
   - Market condition indicators
   - Regional price trends

This updated analysis shows significant improvement over previous results, particularly in the R² score, but still indicates room for improvement in prediction accuracy.
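
As a starting point for the cross-validation suggested above, here is a minimal sketch using scikit-learn's `cross_val_score`. The data is a synthetic stand-in (one made-up "area" feature and a noisy price), so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing features and prices
rng = np.random.default_rng(0)
X_toy = rng.uniform(300, 5000, size=(300, 1))
y_toy = 800 * X_toy[:, 0] + rng.normal(0, 200_000, size=300)

# Five shuffled folds: each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X_toy, y_toy, cv=cv, scoring="r2"
)

print("R2 per fold:", np.round(scores, 3))
print(f"Mean R2: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a single train/test split, like the one used in this section, can flatter or penalise a model.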

## Comparing model types: Random Forest

Random forests are an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. Each tree in the forest is trained on a bootstrap sample of the data (and can be restricted to a random subset of features at each split), and the final prediction is made by averaging the predictions of all trees.

This approach helps reduce overfitting and improves generalisation compared to a single decision tree.

It's called an ensemble method because random forests exemplify the "wisdom of crowds" principle - where combining many simpler models (the trees) leads to better performance than any individual model alone. This makes them particularly effective for complex regression and classification tasks.
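
The averaging step can be verified directly. The sketch below fits a small forest on synthetic data (made up for the example) and compares the forest's prediction with the mean of the individual trees' predictions, which scikit-learn exposes through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two features, price driven mostly by the first
rng = np.random.default_rng(0)
X_toy = rng.uniform(300, 5000, size=(200, 2))
y_toy = 800 * X_toy[:, 0] + rng.normal(0, 100_000, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_toy, y_toy)

# Collect each tree's prediction for the first five rows
X_new = X_toy[:5]
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

print("Forest prediction:       ", forest.predict(X_new).round(0))
print("Mean of tree predictions:", per_tree.mean(axis=0).round(0))
```

Individual trees can disagree quite a bit on a given property; the forest's output is simply their average, which is what smooths out the errors of any single tree.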

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def get_feature_columns(df, feature_subset):
    """
    Gets all relevant columns including one-hot encoded ones for Location
    """
    columns = []
    for feature in feature_subset:
        if feature == 'Location':
            # Add all Location_ columns for one-hot encoded data
            location_cols = [col for col in df.columns if col.startswith('Location_')]
            columns.extend(location_cols)
        elif feature in df.columns:
            columns.append(feature)    
    return columns

def train_random_forest(X_train, X_test, y_train, y_test):
    """
    Trains and evaluates a Random Forest model
    """
    # Initialize the model
    rf_model = RandomForestRegressor(
        n_estimators=100,  # Number of trees
        max_depth=None,    # Let trees grow fully
        min_samples_split=2,
        min_samples_leaf=1,
        random_state=42
    )
    
    # Train the model
    rf_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = rf_model.predict(X_test)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Get feature importance and aggregate location features
    feature_importance = dict(zip(X_train.columns, rf_model.feature_importances_))
    
    # Aggregate location feature importance if present
    if any('Location_' in col for col in X_train.columns):
        location_importance = sum(
            importance for col, importance in feature_importance.items() 
            if 'Location_' in col
        )
        # Add aggregated location importance
        feature_importance['Location (aggregated)'] = location_importance
        # Remove individual location features from importance dict
        feature_importance = {k: v for k, v in feature_importance.items() 
                            if not k.startswith('Location_')}
    
    top_features = dict(sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)[:10])
    
    print("\nRandom Forest Results:")
    print(f"MAE: £{mae:,.2f}")
    print(f"RMSE: £{rmse:,.2f}")
    print(f"R2 Score: {r2:.4f}")
    
    print("\nTop 10 Most Important Features:")
    for feature, importance in top_features.items():
        print(f"{feature}: {importance:.4f}")
    
    return rf_model, mae, rmse, r2, feature_importance

# Get the selected features
selected_features = [
    'No. of Bedrooms',
    'Area in sq ft', 'House Type', 'Outcode', 'Location', 'Postal Code'
]

# Get relevant columns including one-hot encoded ones
selected_columns = get_feature_columns(df_onehot_with_outcode, selected_features)
selected_columns.append('Price')

df_subset = df_onehot_with_outcode[selected_columns]
# Prepare the data using the existing function
X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)

# Train and evaluate the random forest
rf_model, rf_mae, rf_rmse, rf_r2, rf_importance = train_random_forest(X_train, X_test, y_train, y_test)

# Compare with previous Decision Tree results
print("\nComparison with Decision Tree:")
print(f"{'Metric':<20} {'Decision Tree':<15} {'Random Forest':<15}")
print("-" * 50)
print(f"{'MAE':<20} £{566853:<14,.0f} £{rf_mae:<14,.0f}")
print(f"{'RMSE':<20} £{1283867:<14,.0f} £{rf_rmse:<14,.0f}")
print(f"{'R2 Score':<20} {0.7615:<14.3f} {rf_r2:<14.3f}")

Voilà! Our random forest on the property + location + postcode + outcode one-hot encoded dataset has a lower MAE than the single decision tree, yet a higher RMSE and a lower R² score.

```code
Comparison with Decision Tree:
Metric               Decision Tree   Random Forest  
--------------------------------------------------
MAE                  £566,853        £496,928       
RMSE                 £1,283,867      £1,424,932     
R2 Score             0.761          0.706  
```

Let's compare our feature subsets of the one-hot encoded dataset using the random forest model:

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Define feature subsets
feature_subsets = {
    # Basic Features
    'Basic': ['No. of Bedrooms', ],
    
    # Minimal Extended Features
    'Minimal + Area': ['No. of Bedrooms',  'Area in sq ft'],
    'Minimal + House Type': ['No. of Bedrooms',  'House Type'],
    
    # Property Features (Core)
    'Property': ['No. of Bedrooms',  'Area in sq ft', 'House Type'],
    
    # Property Location Combinations
    'Property + Outcode': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode'],
    'Property + Location': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Location'],
    'Property + City': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County'],
    'Property + Postcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Postal Code'],
    
    # Property Double Location Combinations
    'Property + Location + City': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County'],
    'Property + City + Outcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County', 'Outcode'],
    'Property + City + Postcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'City/County', 'Postal Code'],
    'Property + Outcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode', 'Location'],
    'Property + Postcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Postal Code', 'Location'],
    
    # Property Triple Location Combinations
    'Property + Location + City + Outcode': ['No. of Bedrooms',  'Area in sq ft', 'House Type', 'Location', 'City/County', 'Outcode'],
    'Property + Location + City + Postcode': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County', 'Postal Code'],
    'Property + Outcode + Postcode + Location': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode', 'Postal Code', 'Location'],
    
    # Full Property Profile
    'Full': ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Location', 'City/County', 'Postal Code', 'Outcode']
}

def get_feature_columns(df, feature_subset):
    """
    Gets all relevant columns including one-hot encoded ones for Location
    """
    columns = []
    for feature in feature_subset:
        if feature == 'Location':
            # Add all Location_ columns for one-hot encoded data
            location_cols = [col for col in df.columns if col.startswith('Location_')]
            columns.extend(location_cols)
        elif feature in df.columns:
            columns.append(feature)    
    return columns

def evaluate_feature_subsets_rf(df, feature_subsets):
    """
    Evaluates Random Forest model performance for different feature subsets
    """
    results = {}
    feature_importances = {}
    
    for subset_name, features in feature_subsets.items():
        print(f"\nEvaluating {subset_name} subset...")
        selected_columns = get_feature_columns(df, features)
        selected_columns.append('Price')
        
        df_subset = df[selected_columns]
        X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)
        
        model = RandomForestRegressor(
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        
        # Get feature importance
        importances = dict(zip(X_train.columns, model.feature_importances_))
        top_features = dict(sorted(importances.items(), key=lambda x: x[1], reverse=True)[:5])
        feature_importances[subset_name] = top_features
        
        # Store results
        results[subset_name] = {
            'Mean Absolute Error': mae,
            'Root Mean Squared Error': rmse,
            'R-squared': r2,
            'Feature Count': X_train.shape[1]
        }
        
        print(f"Number of features: {X_train.shape[1]}")
        print(f"MAE: £{mae:,.2f}")
        print(f"RMSE: £{rmse:,.2f}")
        print(f"R² Score: {r2:.4f}")
        print("\nTop 5 Most Important Features:")
        for feature, importance in top_features.items():
            print(f"{feature}: {importance:.4f}")
            
    return pd.DataFrame(results).T, feature_importances

def plot_results(results_df):
    """
    Creates plots comparing model performance across feature subsets
    """
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 15))
    
    # R² plot
    results_df['R-squared'].plot(marker='o', ax=ax1)
    ax1.set_title('R² Score by Feature Subset')
    ax1.set_ylabel('R² Score')
    ax1.grid(True)
    ax1.tick_params(axis='x', rotation=45)
    
    # RMSE plot
    results_df['Root Mean Squared Error'].plot(marker='o', ax=ax2)
    ax2.set_title('RMSE by Feature Subset')
    ax2.set_ylabel('RMSE')
    ax2.grid(True)
    ax2.tick_params(axis='x', rotation=45)
    
    # MAE plot
    results_df['Mean Absolute Error'].plot(marker='o', ax=ax3)
    ax3.set_title('MAE by Feature Subset')
    ax3.set_ylabel('MAE')
    ax3.grid(True)
    ax3.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

# Run evaluation
print("\nEvaluating Random Forest models on one-hot encoded dataset...")

results_rf, feature_importances = evaluate_feature_subsets_rf(df_onehot_with_outcode, feature_subsets)
display(results_rf)
plot_results(results_rf)

# Print detailed results
print("\nRandom Forest Results (sorted by R-squared):")
print(results_rf.sort_values('R-squared', ascending=False))

# Print feature importances for best performing model
best_model = results_rf.sort_values('R-squared', ascending=False).index[0]
print(f"\nFeature Importances for Best Model ({best_model}):")
for feature, importance in feature_importances[best_model].items():
    print(f"{feature}: {importance:.4f}")

 Based on the results, the "Property + Location + City + Outcode" combination model is the best for the following reasons:

 Best Performance Metrics:
 - Strong R-squared (0.71): Explains 71% of the variance in house prices
 - Lowest Mean Absolute Error (£491,578) among all models
 - Very competitive Root Mean Squared Error (£1,408,315)

 Model Characteristics:
 - Uses 662 features to capture complex relationships
 - Combines property fundamentals with comprehensive location data:
   - Basic property features (bedrooms, area)
   - Detailed location (specific area)
   - City/County level context
   - Outcode for broader geographical grouping

 Performance Comparison:
 - Significantly outperforms simpler models:
   - Basic: R² = 0.19, MAE = £1,059,008
   - Property only: R² = 0.52, MAE = £808,336
 - Marginally better than other complex models:
   - Full model (663 features): MAE = £492,159
   - Property + City + Postcode (5 features): MAE = £529,612

 Let's validate this model's performance without one-hot encoding to see if we can maintain accuracy with simpler feature representation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

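def evaluate_feature_subsets_on_original_data_with_random_forest(df, feature_subsets):
    """
    Evaluates a Random Forest on the original (not one-hot encoded) dataset
    using one fixed feature set. `feature_subsets` is accepted for
    consistency with the earlier function but is not used here.
    """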

    # selected_columns = ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'City/County', 'Postal Code', 'Location']
    # selected_columns = ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'City/County', 'Postal Code', 'Location', "Outcode"]
    selected_columns = ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Postal Code', 'Location', "Outcode"]

    selected_columns.append('Price')
    
    df_subset = df[selected_columns]
    X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)
    
    model = RandomForestRegressor(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Get feature importance
    importances = dict(zip(X_train.columns, model.feature_importances_))
    top_features = dict(sorted(importances.items(), key=lambda x: x[1], reverse=True)[:5])


    # Store results
    results = {
        'Mean Absolute Error': mae,
        'Root Mean Squared Error': rmse,
        'R-squared': r2,
        'Feature Count': X_train.shape[1]
    }
    
    print(f"Number of features: {X_train.shape[1]}")
    print(f"MAE: £{mae:,.2f}")
    print(f"RMSE: £{rmse:,.2f}")
    print(f"R² Score: {r2:.4f}")
    print("\nTop 5 Most Important Features:")
    for feature, importance in top_features.items():
        print(f"{feature}: {importance:.4f}")
        
    return results, model, X_test


# Run evaluation
print("\nEvaluating Random Forest models on original dataset...")

results_rf, random_forest_model, X_test = evaluate_feature_subsets_on_original_data_with_random_forest(df_original_with_outcode, feature_subsets)
display(results_rf)

# used to test saved model later in lesson
random_forest_test_sample = X_test.sample(10)
# display(random_forest_test_sample)

Our random forest model has a lower MAE on the one-hot encoded dataset than on the original dataset. Interestingly, the original dataset gives a similar (albeit slightly better) RMSE and R^2 score, yet a noticeably higher MAE.

One-hot encoded dataset - property + postcode + location + outcode:
 - Mean Absolute Error: £491,578
 - Root Mean Squared Error: £1,408,315
 - R-squared: 0.7065

Original dataset - property + postcode + location + outcode:
 - Number of features: 6
 - MAE: £520,728.92
 - RMSE: £1,399,084.58
 - R² Score: 0.7168




### Comparing models: XGBoost gradient boosting

XGBoost (eXtreme Gradient Boosting) is another powerful ensemble learning method that builds on the principles of gradient boosting. It sequentially creates decision trees where each new tree tries to correct the errors made by the previous trees.

What makes XGBoost particularly effective is its use of regularisation techniques to prevent overfitting, along with optimisations that make it computationally efficient. It has become one of the most popular algorithms for structured/tabular data due to its strong predictive performance and ability to handle complex relationships in data.

Like random forests, XGBoost combines multiple trees, but it does so in a more focused way: each new tree is fitted to the residual errors (more precisely, the gradients of the loss) left by the trees before it. This often results in better performance than random forests, especially for complex regression tasks like house price prediction.
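
To make the "each tree corrects the previous trees" idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, built by hand from scikit-learn's `DecisionTreeRegressor`. It is an illustration only: the data is synthetic, the `learning_rate` and `n_rounds` values are arbitrary, and real XGBoost adds regularisation, second-order gradient information and many optimisations that this omits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration (not the housing dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 5, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
initial_mse = float(np.mean((y - prediction) ** 2))

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)                          # the new tree learns the residuals
    prediction += learning_rate * tree.predict(X)   # take a small step towards them
    trees.append(tree)

final_mse = float(np.mean((y - prediction) ** 2))
print(f"Training MSE before boosting: {initial_mse:.1f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.1f}")
```

The small learning rate is the reason many shallow trees are needed: each one only nudges the prediction, which tends to make the ensemble less prone to overfitting than a few large steps would.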

Finally, let's look at how we could apply XGBoost to our original and our one-hot encoded datasets for the feature subset property + postcode + location + outcode.

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

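def evaluate_feature_subsets_on_original_data_with_xgboost(df, feature_subsets):
    """
    Evaluates an XGBoost model on the original (not one-hot encoded) dataset
    using one fixed feature set. `feature_subsets` is accepted for
    consistency with the earlier functions but is not used here.
    """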

    selected_columns = ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Postal Code', 'Location', "Outcode"]
    selected_columns.append('Price')
    
    df_subset = df[selected_columns]
    X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)
    
    model = XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Get feature importance
    importances = dict(zip(X_train.columns, model.feature_importances_))
    top_features = dict(sorted(importances.items(), key=lambda x: x[1], reverse=True)[:5])
    
    # Store results
    results = {
        'Mean Absolute Error': mae,
        'Root Mean Squared Error': rmse,
        'R-squared': r2,
        'Feature Count': X_train.shape[1]
    }
    
    print(f"Number of features: {X_train.shape[1]}")
    print(f"MAE: £{mae:,.2f}")
    print(f"RMSE: £{rmse:,.2f}")
    print(f"R² Score: {r2:.4f}")
    print("\nTop 5 Most Important Features:")
    for feature, importance in top_features.items():
        print(f"{feature}: {importance:.4f}")
        
    return results, model, X_test


# Run evaluation
print("\nEvaluating XGBoost models on original dataset...")

results_xgb, xgb_original_model, xgb_x_test_sample = evaluate_feature_subsets_on_original_data_with_xgboost(df_original_with_outcode, feature_subsets)
display(results_xgb)

# used to test saved model later in lesson
xgb_test_sample = xgb_x_test_sample.sample(10)
display(xgb_test_sample)

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def get_feature_columns(df, feature_subset):
    """
    Gets all relevant columns including one-hot encoded ones for Location
    """
    columns = []
    for feature in feature_subset:
        if feature == 'Location':
            # Add all Location_ columns for one-hot encoded data
            location_cols = [col for col in df.columns if col.startswith('Location_')]
            columns.extend(location_cols)
        elif feature in df.columns:
            columns.append(feature)    
    return columns

def train_xgboost(X_train, X_test, y_train, y_test):
    """
    Trains and evaluates an XGBoost model
    """
    # Initialize the model
    xgb_model = XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        n_jobs=-1
    )
    
    # Train the model
    xgb_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = xgb_model.predict(X_test)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Get feature importance and aggregate location features
    feature_importance = dict(zip(X_train.columns, xgb_model.feature_importances_))
    
    # Aggregate location feature importance if present
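    if any(col.startswith('Location_') for col in X_train.columns):
        # Use startswith here too, so the summed columns are exactly
        # the ones removed from the importance dict below
        location_importance = sum(
            importance for col, importance in feature_importance.items()
            if col.startswith('Location_')
        )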
        # Add aggregated location importance
        feature_importance['Location (aggregated)'] = location_importance
        # Remove individual location features from importance dict
        feature_importance = {k: v for k, v in feature_importance.items() 
                            if not k.startswith('Location_')}
    
    top_features = dict(sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)[:10])
    
    print("\nXGBoost Results:")
    print(f"MAE: £{mae:,.2f}")
    print(f"RMSE: £{rmse:,.2f}")
    print(f"R2 Score: {r2:.4f}")
    
    print("\nTop 10 Most Important Features:")
    for feature, importance in top_features.items():
        print(f"{feature}: {importance:.4f}")
    
    return xgb_model, mae, rmse, r2, feature_importance

# Get the selected features
selected_features = [
    'No. of Bedrooms',
    'Area in sq ft', 'House Type', 'Outcode', 'Location', 'Postal Code'
]

# Get relevant columns including one-hot encoded ones
selected_columns = get_feature_columns(df_onehot_with_outcode, selected_features)
selected_columns.append('Price')

df_subset = df_onehot_with_outcode[selected_columns]
# Prepare the data using the existing function
X_train, X_test, y_train, y_test = prepare_data_for_modeling(df_subset)

# Train and evaluate XGBoost
xgb_model, xgb_mae, xgb_rmse, xgb_r2, xgb_importance = train_xgboost(X_train, X_test, y_train, y_test)

Woahey! XGBoost is a powerful model that outperforms our random forest on both the original and one-hot encoded datasets.


Original dataset:

Features included: ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Postal Code', 'Location', 'Outcode']

- Number of features: 6
- MAE: £474,797.03
- RMSE: £1,296,308.09
- R² Score: 0.7569

One-hot encoded dataset:

Features included: ['No. of Bedrooms', 'Area in sq ft', 'House Type', 'Outcode', 'Location', 'Postal Code']

XGBoost Results:
- Number of features: 662
- MAE: £472,931.22
- RMSE: £1,301,992.45
- R2 Score: 0.7548


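Compared with the random forest, both XGBoost models have a lower MAE and RMSE and a higher R^2 score. Against the single decision tree the picture is mixed: XGBoost has a much lower MAE (roughly £473,000 to £475,000 versus £566,853), but a slightly higher RMSE and a slightly lower R^2 score (about 0.755 to 0.757 versus 0.761).

XGBoost on the one-hot encoded dataset has the lowest MAE of all the models so far, with the original dataset close behind while using only 6 features.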


## Discuss interpretability, Bias-Variance trade-off

## Cover hyperparameter tuning for decision trees and random forests

## Model persistence (saving and loading trained models)

When working with trained models, it's essential to know how to save and load them for future use. This allows you to:

- Save training time by reusing trained models
- Deploy models in production environments
- Share models with team members
- Version control your models

Let's look at different methods for saving our trained Decision Tree, Random Forest and XGBoost models:

In [336]:
import json
import os
import pickle
import joblib
import sklearn
import xgboost
from datetime import datetime

"""
Model Persistence Guide
----------------------
This script demonstrates different methods for saving and loading machine learning models.
Key methods covered:
1. joblib - Efficient for scikit-learn models with NumPy arrays
2. pickle - Python's native serialization
3. XGBoost's native format - Optimized for XGBoost models
4. Version control and metadata management

Each method has its pros and cons:
- joblib: Best for scikit-learn models with large NumPy arrays
- pickle: More flexible but slower for large NumPy arrays
- XGBoost native: Most efficient for XGBoost models
"""

# Create the directory for saved models (every path below uses '../models')
os.makedirs('../models', exist_ok=True)

###########################################
# 1. Decision Tree Model (scikit-learn)
###########################################

# Create comprehensive metadata dictionary
# This helps track model versions, performance, and dependencies
dt_model_info = {
    'model_version': '1.0',
    'model_type': 'DecisionTreeRegressor',
    'training_date': datetime.now().strftime('%Y-%m-%d'),
    'model_params': decision_tree_model.get_params(),
    # Decision tree metrics from the comparison table earlier in the lesson
    'performance': {
        'mae': 566853,
        'rmse': 1283867,
        'r2': 0.7615
    },
    'library_versions': {
        'sklearn': sklearn.__version__
    }
}

# Save using joblib - recommended for scikit-learn models
# joblib is optimized for numpy arrays and large datasets
joblib.dump(decision_tree_model, '../models/decision_tree_model.joblib')

# Save using pickle - alternative method
# Pickle is Python's native serialization protocol
with open('../models/decision_tree_model.pkl', 'wb') as f:
    pickle.dump(decision_tree_model, f)

###########################################
# 2. Random Forest Model (scikit-learn)
###########################################

rf_model_info = {
    'model_version': '1.0',
    'model_type': 'RandomForestRegressor',
    'training_date': datetime.now().strftime('%Y-%m-%d'),
    'model_params': random_forest_model.get_params(),
    'performance': {
        'mae': 520728.92,
        'rmse': 1399084.58,
        'r2': 0.7168
    },
    'library_versions': {
        'sklearn': sklearn.__version__
    }
}

# Save using joblib with compression
# compress=3 provides good balance between size and speed
joblib.dump(random_forest_model, '../models/random_forest_model.joblib', compress=3)

# Save using pickle with highest protocol for better performance
# HIGHEST_PROTOCOL is usually faster and more efficient
with open('../models/random_forest_model.pkl', 'wb') as f:
    pickle.dump(random_forest_model, f, protocol=pickle.HIGHEST_PROTOCOL)

###########################################
# 3. XGBoost Model
###########################################

xgb_model_info = {
    'model_version': '1.0',
    'model_type': 'XGBRegressor',
    'training_date': datetime.now().strftime('%Y-%m-%d'),
    'model_params': xgb_original_model.get_params(),
    # Metrics for XGBoost on the original dataset (the model saved here)
    'performance': {
        'mae': 474797.03,
        'rmse': 1296308.09,
        'r2': 0.7569
    },
    'library_versions': {
        'xgboost': xgboost.__version__,
        'sklearn': sklearn.__version__
    }
}

# Save using XGBoost's native format
# This is the recommended method for XGBoost models
# Advantages: smaller file size, faster loading, better version compatibility
xgb_original_model.save_model('../models/xgboost_model.json')

# Backup save using pickle
# It's good practice to have multiple save formats
with open('../models/xgboost_model.pkl', 'wb') as f:
    pickle.dump(xgb_original_model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Save metadata for all models
# Storing metadata separately allows easy model information lookup
# without loading the entire model
for model_name, info in [
    ('decision_tree', dt_model_info),
    ('random_forest', rf_model_info),
    ('xgboost', xgb_model_info)
]:
    with open(f'../models/{model_name}_info.json', 'w') as f:
        json.dump(info, f, indent=4)

### Loading saved models and making predictions

We can load the saved models and make predictions on a sample of the test data or new data like so:

In [None]:
import joblib
import pickle
import xgboost as xgb
import pandas as pd
import numpy as np

def load_and_predict(model_path, model_type, X_test):
    """
    Load a model and make predictions
    
    Args:
        model_path: Path to saved model
        model_type: Type of model ('joblib', 'pickle', or 'xgboost')
        X_test: Test data to predict on
        
    Returns:
        Model and predictions
    """
    if model_type == 'joblib':
        model = joblib.load(model_path)
    elif model_type == 'pickle':
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
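    elif model_type == 'xgboost':
        model = xgb.XGBRegressor()
        model.load_model(model_path)
    else:
        # Fail early instead of hitting an undefined `model` below
        raise ValueError(f"Unknown model_type: {model_type!r}")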
    
    predictions = model.predict(X_test)
    return model, predictions

# Load each model and make predictions
dt_model, dt_predictions = load_and_predict(
    '../models/decision_tree_model.joblib',
    'joblib',
    decision_tree_test_sample
)

rf_model, rf_predictions = load_and_predict(
    '../models/random_forest_model.joblib',
    'joblib',
    random_forest_test_sample
)

xgb_original_model, xgb_predictions = load_and_predict(
    '../models/xgboost_model.json',
    'xgboost',
    xgb_test_sample
)


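# Compare predictions (each model was given its own separately drawn sample
# of test rows, so a row may not be the same property across columns)
prediction_results = pd.DataFrame({
    'Decision Tree': dt_predictions,
    'Random Forest': rf_predictions,
    'XGBoost': xgb_predictions
})

display(prediction_results)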
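
The metadata files saved earlier record the library versions used at training time. One possible use, sketched below, is to compare them with the installed versions before trusting a loaded model, since pickled and joblib-saved scikit-learn models are not guaranteed to behave identically across library versions. The function name `find_version_mismatches` and the version strings are hypothetical; in practice the installed versions would come from `sklearn.__version__` and `xgboost.__version__`, and the saved ones from the `*_info.json` files.

```python
import json

def find_version_mismatches(saved_versions, installed_versions):
    """Return {library: (saved, installed)} for every library whose version differs."""
    return {
        lib: (saved, installed_versions.get(lib))
        for lib, saved in saved_versions.items()
        if installed_versions.get(lib) != saved
    }

# Hypothetical metadata, shaped like the 'library_versions' entry saved above.
model_info = json.loads('{"library_versions": {"sklearn": "1.3.0", "xgboost": "2.0.0"}}')

# Hypothetical installed versions (stand-ins for sklearn.__version__ etc.).
installed = {"sklearn": "1.4.2", "xgboost": "2.0.0"}

mismatches = find_version_mismatches(model_info["library_versions"], installed)
for lib, (saved, current) in mismatches.items():
    print(f"Warning: {lib} was {saved} at training time, but {current} is installed")
```

Whether a mismatch should block loading or only log a warning is a design choice; a strict check is the safer default for pickle and joblib files.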



## Limitations of Decision Trees

While powerful, decision trees have some limitations:

1. **Overfitting**: Deep trees can learn rules that are too specific to the training data.
2. **Instability**: Small changes in the data can result in very different trees. 
3. **Bias towards features with many levels**: Trees prefer to split on features with many distinct values.
4. **Difficulty capturing some relationships**: Trees struggle to model linear or smooth relationships.
5. **High variance**: Predictions can vary significantly based on the specific training data used.

Ensemble methods like random forests can mitigate some of these issues.
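
The fourth limitation is easy to see on a toy example. The sketch below fits a tree to perfectly linear synthetic data (not the housing dataset) and then asks for predictions outside the training range. Because a tree can only return values stored in its leaves, it cannot continue the trend.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, perfectly linear data: y = 2x for x = 0, 1, ..., 10 (illustration only).
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2 * X_train[:, 0]

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Inside the training range the tree reproduces points it has seen...
print(tree.predict([[5.0]]))

# ...but beyond it, every input lands in the same outermost leaf,
# so the prediction stops growing however large x becomes.
print(tree.predict([[20.0], [100.0]]))
```

The predictions for x = 20 and x = 100 are identical and cannot exceed the largest target seen during training. Random forests and XGBoost inherit this behaviour, which matters if, say, a property is larger than anything in the training data.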

## Ethical Considerations

When using machine learning for real-world applications like house price prediction, it's important to consider the potential ethical implications:

- **Bias**: If the training data contains historical biases, the model may perpetuate these biases in its predictions.

- **Transparency**: If the model is used to make important decisions (like mortgage approvals), there may be a legal or moral obligation to explain how it makes predictions.

- **Privacy**: The model uses detailed personal information, so it's crucial to ensure that data is collected, stored, and used responsibly.

As machine learning practitioners, it's our duty to strive for models that are fair, transparent, and respectful of privacy. This may involve techniques like bias auditing, model interpretability tools, and differential privacy.

## Machine Learning Model Deployment: From Development to Production

Moving machine learning models from development to production requires careful consideration of both software engineering and ML-specific challenges. This guide bridges the gap between theoretical understanding and practical implementation.

### 1. Model Serving Architecture: Beyond Basic REST APIs

The serving architecture is your model's interface with the world. While simple REST APIs work for basic use cases, production deployments need to handle concerns like:
- Request batching for efficiency
- Model versioning and rollbacks
- Load balancing and scaling
- Request prioritization
- Warm-up strategies to avoid cold starts
- Comprehensive error handling

Here's a practical implementation that addresses these concerns:

```python
# Advanced serving implementation with pools and warm-up
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, validator
from typing import Optional, Dict, List
import asyncio
from contextlib import asynccontextmanager

class PredictionRequest(BaseModel):
    features: Dict[str, float]
    request_id: str
    priority: Optional[int] = 1  # Support priority queuing

    @validator('features')
    def validate_features(cls, v):
        # Strict validation prevents issues downstream
        required_features = {'square_feet', 'bedrooms', 'location'}
        if missing := required_features - v.keys():
            raise ValueError(f"Missing required features: {missing}")
        return v

class ModelServer:
    """
    Production-ready model server with:
    - Model pooling for concurrent requests
    - Version management
    - Warm-up handling
    - Request queuing
    """
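    def __init__(self):
        # Version -> asyncio.Queue of loaded model instances (a simple pool)
        self.model_pool: Dict[str, asyncio.Queue] = {}
        self.request_queue = asyncio.Queue()
        self.is_warm = False  # Track warm-up state

    @asynccontextmanager
    async def get_model(self, version: str = 'latest'):
        # Wait until a model instance for this version is free
        model = await self.model_pool[version].get()
        try:
            # Ensure model is warm before first prediction
            # (assumes the model wrapper exposes an async warmup() method)
            if not self.is_warm:
                await model.warmup()
                self.is_warm = True
            yield model
        finally:
            # Return the instance to the pool for the next request
            self.model_pool[version].put_nowait(model)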
```

### 2. Advanced Model Optimization

Model optimization isn't just about making predictions faster - it's about finding the optimal balance between:
- Inference latency
- Memory usage
- Prediction accuracy
- Resource costs
- Maintenance complexity

Different use cases will prioritize these differently. For example, edge deployment might prioritize memory usage, while a high-throughput API might focus on latency.

Here's an implementation that considers these trade-offs:

```python
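class OptimizationError(Exception):
    """Raised when the optimized model misses its latency or accuracy targets."""


class ModelOptimizer: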
    """
    Comprehensive model optimization with validation at each step.
    Balances multiple optimization objectives while maintaining 
    model quality.
    """
    def __init__(self, model, validation_data):
        self.model = model
        self.validation_data = validation_data
        self.baseline_metrics = self.evaluate(model)

    def optimize(self, target_latency_ms: float, max_accuracy_drop: float = 0.01):
        """
        Multi-stage optimization pipeline that:
        1. Optimizes model structure (pruning, compression)
        2. Improves memory layout
        3. Optimizes inference paths
        While keeping the accuracy drop relative to the baseline
        within max_accuracy_drop
        """
        optimized_model = self.model

        # Stage 1: Structure Optimization
        # Prune model while monitoring accuracy impact
        optimized_model = self.optimize_structure(optimized_model)

        # Stage 2: Memory Layout
        # Improve cache efficiency and reduce memory footprint
        optimized_model = self.optimize_memory_layout(optimized_model)

        # Stage 3: Inference Optimization
        # Speed up common prediction paths
        optimized_model = self.optimize_inference(optimized_model)

        # Validate all requirements are met
        final_metrics = self.evaluate(optimized_model)
        if not self.meets_requirements(final_metrics, target_latency_ms, max_accuracy_drop):
            raise OptimizationError("Failed to meet optimization targets")

        return optimized_model, final_metrics
```
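
Of the trade-offs listed in this section, inference latency is usually the easiest to measure directly. Below is a minimal sketch, using only the standard library, of how a baseline could be collected before and after any optimization step. The names `measure_latency_ms` and `dummy_predict` are hypothetical, and `dummy_predict` merely stands in for a real `model.predict`.

```python
import statistics
import time

def measure_latency_ms(predict_fn, sample, n_runs=200):
    """Time repeated calls to predict_fn(sample) and summarise them in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return {
        "p50_ms": statistics.median(timings),
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
        "max_ms": timings[-1],
    }

# Stand-in for model.predict; any callable taking one argument works.
def dummy_predict(features):
    return sum(features) * 0.5

stats = measure_latency_ms(dummy_predict, [1200.0, 3.0, 1.0])
print(stats)
```

Reporting percentiles rather than the mean is a deliberate choice: a latency target is typically stated for the slow tail of requests, which an average would hide.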

### 3. Monitoring and Drift Detection

Production ML systems need three types of monitoring:
1. System metrics (latency, throughput, resource usage)
2. ML metrics (accuracy, predictions distribution)
3. Business metrics (user satisfaction, revenue impact)

Additionally, drift detection is crucial for maintaining model quality over time. Common types of drift include:
- Feature drift (input distributions change)
- Concept drift (relationship between features and target changes)
- Data quality drift (degradation in input quality)

Here's a monitoring implementation that covers these aspects:

```python
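import asyncio

# MetricsCollector, DriftDetector and PerformanceTracker are placeholder
# components that you would implement or map onto your monitoring stack.
class MLMonitor: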
    """
    Comprehensive monitoring system that tracks system health,
    model performance, and data quality.
    """
    def __init__(self):
        # Separate collectors for different metric types
        self.metrics = MetricsCollector()
        self.drift_detector = DriftDetector()
        self.performance_tracker = PerformanceTracker()

    async def monitor_prediction(self, features, prediction, actual=None):
        """
        Holistic monitoring of each prediction, covering:
        - System performance (latency, resource usage)
        - Prediction quality (confidence, distributions)
        - Data quality (missing values, ranges)
        """
        await asyncio.gather(
            self.track_prediction_metrics(features, prediction),
            self.check_drift(features),
            self.monitor_performance(),
            self.log_prediction(features, prediction)
        )

    async def check_drift(self, features):
        """
        Multi-faceted drift detection using statistical tests
        and distribution monitoring. Handles different types of
        drift with appropriate statistical methods.
        """
        drift_types = {
            'feature_drift': self.drift_detector.check_feature_drift(features),
            'concept_drift': self.drift_detector.check_concept_drift(features),
            'data_quality_drift': self.drift_detector.check_data_quality()
        }

        if any(drift_types.values()):
            await self.handle_drift(drift_types)
```
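
As one concrete way to check for feature drift, the sketch below computes the Population Stability Index (PSI) between a feature's training distribution and its distribution in new data. This is a swapped-in illustration rather than what the placeholder `DriftDetector` is assumed to do. The data is synthetic, and the function name and bin count are arbitrary choices.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between two 1-D samples, using quantile bins taken from the reference sample."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Clip new values into the reference range so every value falls into a bin.
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid log(0) and division by zero for empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
training_area = rng.normal(1500, 400, size=5000)  # synthetic "Area in sq ft" at training time
same_area = rng.normal(1500, 400, size=5000)      # new data from the same distribution
shifted_area = rng.normal(2000, 400, size=5000)   # new data with larger properties

psi_same = population_stability_index(training_area, same_area)
psi_shifted = population_stability_index(training_area, shifted_area)
print(f"PSI (no drift): {psi_same:.3f}")
print(f"PSI (drift):    {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, but any threshold should be tuned to the feature and the cost of a false alarm.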

### 4. Testing Framework

ML testing goes beyond traditional software testing. We need to verify:
- Statistical performance (accuracy, precision, recall)
- System performance (latency, throughput)
- Edge cases and failure modes
- Model behavior under load
- Drift detection accuracy

A comprehensive testing strategy should cover all these aspects while remaining maintainable and reliable.

```python
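import asyncio

# LoadGenerator and the individual test_* helpers are placeholders
# for your own load-testing and validation tooling.
class MLTestSuite: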
    """
    End-to-end testing framework for ML systems.
    Combines traditional software tests with ML-specific validation.
    """
    def __init__(self, model_service):
        self.model_service = model_service
        self.test_cases = self.load_test_cases()

    async def run_comprehensive_tests(self):
        """
        Full test suite that validates:
        - Model accuracy on test set
        - Performance under various loads
        - System reliability and error handling
        - Drift detection accuracy
        - Edge case handling
        """
        results = await asyncio.gather(
            self.test_accuracy(),      # Statistical performance
            self.test_performance(),   # System performance
            self.test_reliability(),   # Error handling
            self.test_edge_cases(),    # Boundary conditions
            self.test_drift_detection(),  # Drift handling
            self.test_load_handling()     # Load testing
        )
        
        return self.generate_test_report(results)

    async def test_load_handling(self):
        """
        Sophisticated load testing that simulates real-world scenarios:
        - Steady state load
        - Sudden spikes
        - Gradual ramp-up
        - Mixed load patterns
        """
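        # LoadGenerator is a placeholder for whatever load-testing tool you use
        async with LoadGenerator() as generator:
            patterns = [
                ('steady', 100, 60),     # Baseline load
                ('spike', 500, 10),      # Traffic spike
                ('ramp', (10, 200), 30)  # Gradual increase
            ]

            # Collect the analysis per pattern so it can feed the test report
            analysis = {}
            for pattern_type, load, duration in patterns:
                metrics = await generator.run_pattern(pattern_type, load, duration)
                analysis[pattern_type] = await self.analyze_load_metrics(metrics)
            return analysis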
```
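
The `LoadGenerator` used above is not defined in this practical. As a rough sketch of what its `run_pattern` method might do, the example below fires batches of concurrent requests at a stand-in prediction function and summarises the latencies. It assumes that `load` means requests per second and `duration` means seconds, which the original does not state; `fake_predict`, `requests_per_second`, and the `tick` parameter are likewise hypothetical.

```python
import asyncio
import statistics
import time


async def fake_predict(payload):
    """Stand-in for a call to the model service (assumption: about 1 ms per request)."""
    await asyncio.sleep(0.001)
    return {"price": 500_000}


def requests_per_second(pattern_type, load, elapsed, duration):
    """Target request rate at `elapsed` seconds into a pattern."""
    if pattern_type == "ramp":
        start, end = load
        return start + (end - start) * (elapsed / duration)
    return load  # 'steady' and 'spike' hold a constant rate


async def timed_call(predict, payload):
    """Await one prediction and return how long it took, in seconds."""
    start = time.perf_counter()
    await predict(payload)
    return time.perf_counter() - start


async def run_pattern(predict, pattern_type, load, duration, tick=0.1):
    """Send one batch of concurrent requests per tick and collect latencies.

    Simplification: the time spent waiting for each batch is not subtracted
    from the tick, so a pattern takes somewhat longer than `duration`.
    """
    latencies = []
    n_ticks = max(1, round(duration / tick))
    for i in range(n_ticks):
        rate = requests_per_second(pattern_type, load, i * tick, duration)
        batch = max(1, round(rate * tick))
        results = await asyncio.gather(
            *(timed_call(predict, {"area": 80}) for _ in range(batch))
        )
        latencies.extend(results)
        await asyncio.sleep(tick)
    return {
        "requests": len(latencies),
        "p50_ms": statistics.median(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
    }


async def main():
    # Scaled-down versions of the patterns above so the demo finishes quickly
    patterns = [
        ("steady", 100, 0.5),
        ("spike", 500, 0.2),
        ("ramp", (10, 200), 0.5),
    ]
    for pattern_type, load, duration in patterns:
        summary = await run_pattern(fake_predict, pattern_type, load, duration)
        print(pattern_type, summary)


if __name__ == "__main__":
    asyncio.run(main())  # in a notebook, use `await main()` instead
```

For real load testing, a dedicated tool will give more reliable numbers than a hand-rolled loop, but a sketch like this is enough to see how latency responds as the request rate changes.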

### 5. Production Deployment Pipeline

Deploying ML models safely requires more care than traditional software deployments. Key considerations include:
- Model versioning and artifact management
- Gradual rollouts with monitoring
- Automatic rollback capabilities
- Performance comparison with previous versions
- Handling model warmup and cold starts

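Here's a skeleton of such a deployment pipeline. Helper methods like `validate_model`, `shift_traffic`, and `rollback` depend on your infrastructure and are left unimplemented: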

```python
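import asyncio


class DeploymentError(Exception):
    """Raised when a validation or health check fails during rollout."""


class DeploymentFailed(Exception):
    """Raised to the caller after a failed deployment has been rolled back."""


class ModelDeployment: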
    """
    Production deployment manager that handles safe rollouts
    and monitoring of new model versions.
    """
    def __init__(self):
        self.current_model = None
        self.rollback_model = None  # Keep previous version for rollbacks

    async def deploy_new_version(self, new_model):
        """
        Careful deployment process:
        1. Validate new model thoroughly
        2. Deploy gradually with monitoring
        3. Enable quick rollback if needed
        4. Handle traffic shifting safely
        """
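        try:
            # Pre-deployment validation (there is no current model on first deploy)
            if self.current_model is not None:
                await self.health_check(self.current_model)
            await self.validate_model(new_model)
            
            # Gradual rollout with monitoring
            await self.gradual_rollout(new_model)
            
            # Post-deployment monitoring
            await self.monitor_deployment(new_model)

            # Promote the new model and keep the old one for rollbacks
            self.rollback_model, self.current_model = self.current_model, new_model
            
        except DeploymentError as e:
            await self.rollback()
            # Chain the original error so the root cause stays in the traceback
            raise DeploymentFailed(f"Deployment failed: {e}") from e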

    async def gradual_rollout(self, new_model):
        """
        Traffic shifting with health monitoring at each stage.
        Uses increasing traffic percentages with validation
        at each step.
        """
        # Each entry is (fraction of traffic, seconds to hold before checking health)
        traffic_splits = [
            (0.1, 300),  # Start with 10% traffic
            (0.5, 300),  # Increase to 50%
            (1.0, 300)   # Full deployment
        ]
        
        for traffic_fraction, duration in traffic_splits:
            await self.shift_traffic(new_model, traffic_fraction)
            await asyncio.sleep(duration)  # Allow metrics to stabilize
            
            # Continuous health checking
            if not await self.check_health_metrics():
                raise DeploymentError("Health check failed during rollout")
```
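
The `shift_traffic` step is left abstract above. One common way to split traffic is to hash each request ID into a stable bucket and compare it with the current traffic fraction. The sketch below illustrates that idea; `bucket_for` and `choose_model` are hypothetical helper names, and in practice the split is often handled by a load balancer or service mesh rather than application code.

```python
import hashlib


def bucket_for(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32


def choose_model(request_id: str, traffic_fraction: float) -> str:
    """Return 'new' for roughly `traffic_fraction` of request IDs, else 'current'."""
    return "new" if bucket_for(request_id) < traffic_fraction else "current"


# Simulate the three rollout stages on a fixed set of request IDs
request_ids = [f"req-{i}" for i in range(10_000)]
for fraction in (0.1, 0.5, 1.0):
    routed_to_new = sum(choose_model(r, fraction) == "new" for r in request_ids)
    print(f"fraction={fraction}: {routed_to_new / len(request_ids):.1%} routed to the new model")
```

Because the bucket depends only on the request ID, a request that reaches the new model at 10% keeps reaching it at 50% and 100%. That makes behaviour consistent across stages and makes problems easier to reproduce than with a random choice per request.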

### Key Takeaways and Best Practices

1. **System Design**
   - Plan for scalability from the start
   - Consider both ML and system metrics
   - Build in monitoring and testing from day one

2. **Model Optimization**
   - Balance multiple performance objectives
   - Validate optimizations thoroughly
   - Keep optimization pipeline maintainable

3. **Monitoring**
   - Monitor both technical and ML metrics
   - Implement comprehensive drift detection
   - Have clear incident response procedures

4. **Testing**
   - Go beyond accuracy metrics
   - Test under realistic conditions
   - Include performance and reliability tests

5. **Deployment**
   - Use gradual rollouts
   - Monitor deployments closely
   - Maintain rollback capabilities


## Conclusion

In this lesson, we've covered:

- The intuition behind decision trees and how they make predictions
- Different splitting criteria, including MSE and MAE
- Preprocessing data for decision tree models, handling missing values, and feature engineering
- Training and evaluating decision trees in scikit-learn
- The impact of different feature subsets on model performance
- Comparing decision trees to linear regression, random forests, and XGBoost gradient boosting
- The bias-variance trade-off and how it relates to model selection
- Interpreting decision tree models and analyzing feature importances
- Advanced techniques like hyperparameter tuning and ensemble methods
- The limitations of decision trees and ethical considerations in their use
- Saving and loading trained models, and what it takes to deploy them to production

Decision trees are a powerful and interpretable tool for regression and classification tasks. While they have limitations, they form the foundation for more advanced methods like random forests and gradient boosting.

Understanding decision trees is crucial for any machine learning practitioner. They provide a solid grounding in the core concepts of supervised learning, and their interpretability makes them invaluable for explaining predictions to stakeholders.

In the next lesson, we'll dive deeper into ensemble methods with random forests, seeing how they can improve upon the performance of single decision trees.

### Further Reading

- [Scikit-learn documentation on decision trees](https://scikit-learn.org/stable/modules/tree.html)
- [A visual introduction to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
- [An Introduction to Statistical Learning, Chapter 8: Tree-Based Methods](http://faculty.marshall.usc.edu/gareth-james/ISL/)
- [Elements of Statistical Learning, Chapter 9: Additive Models, Trees, and Related Methods](https://web.stanford.edu/~hastie/ElemStatLearn/)
- [Kaggle course on Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability)
- [Google's Machine Learning Crash Course, Descending into ML: Training and Loss](https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss)
- [Interpretable Machine Learning, A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)

These resources will help deepen your understanding of decision trees and their place in the broader machine learning landscape. They cover the mathematical underpinnings, practical considerations, and cutting-edge techniques in model interpretability and explainability.

Machine learning is a vast and rapidly evolving field, and there's always more to learn. I encourage you to actively experiment with these models, tune their parameters, and test them on different datasets. Hands-on experience is invaluable for building intuition and understanding.

As you progress in your machine learning journey, always keep the end goal in mind: creating models that are not only accurate, but also transparent, fair, and beneficial to society. The technical skills are important, but the ethical considerations are just as crucial.

I hope this lesson has provided a solid foundation for your exploration of decision trees and machine learning. Feel free to reach out if you have any further questions!