# Lesson 2C: ATLAS - Decision Tree Comparison

## Introduction

OK, so if you've gotten this far you might be wondering: What's the best type of decision tree for predicting house prices? How should we configure it? Which features give the best predictions without overfitting?

To answer these questions we'll need a pipeline for running and comparing lots of tree-based models at once. So let's build that!

In this lesson we're building ATLAS - an Automated Tree Learning Analysis System.

Why ATLAS? Well, comparing loads of models is heavy work, and like its mythological namesake, our pipeline carries that load.

We're going to explore:

1. How to compare lots of models without losing our sanity
2. Why feature engineering is still more art than science
3. The eternal trade-off between complexity and performance
4. What we've learned about model evaluation (mostly through making mistakes)

## Table of Contents

1. [Introduction](#introduction)
2. [Why build ATLAS?](#why-build-atlas)
3. [ATLAS architecture](#atlas-architecture)
   - [Understanding k-fold cross-validation in ATLAS](#understanding-k-fold-cross-validation-in-atlas)
   - [Core components](#core-components)
   - [Class design and model persistence](#class-design-and-model-persistence)
   - [System workflow](#system-workflow)
   - [Key challenges solved](#key-challenges-solved)
   - [Next steps](#next-steps)
4. [ATLAS implementation](#atlas-implementation)
5. [Running ATLAS: The pipeline in action](#running-atlas-the-pipeline-in-action)
6. [Unveiling the drivers of London house prices](#unveiling-the-drivers-of-london-house-prices)

## Why build ATLAS?
Let's be upfront - we're building this pipeline because:

1. Running models one at a time is tedious and error-prone
2. We keep forgetting which model performed best and with which features
3. Copying and pasting code between notebooks is a recipe for disaster
4. We want to spend time understanding results, not running experiments

Let's also acknowledge something else upfront - we're all using AI tools to help write code these days. That lets us build this comparison engine in a day, a task that would have taken much longer a year ago.

We want to encourage the use of these tools, as they are brilliant at the boilerplate stuff, but it's important to remember that we're the editors, proofreaders, and decision makers.

Every line gets our scrutiny, and every design choice serves a purpose. The interesting questions - the "why are we doing this?" and "what does this actually mean?" - that's where we humans get to shine.

This is where ATLAS comes in. Think of it as our research assistant that handles the repetitive work while we focus on what matters - understanding and interpreting the results.

In building this system, we face three critical challenges:

1. We need to compare many different approaches systematically
2. We need to ensure our comparisons are fair and reliable
3. We need to avoid common pitfalls that can invalidate our results

Let's dive in by looking at how ATLAS is put together, before setting up our tools and loading our data.

## ATLAS architecture

At its core, ATLAS is a pipeline that automates the process of:
```
Raw Data → Feature Engineering → Model Training → Evaluation → Results
```

But it does this across:
- Multiple feature combinations
- Multiple model types (Decision Trees, Random Forests, XGBoost)
- Multiple training/validation splits
- All while preventing common mistakes such as target variable leakage

### Understanding k-fold cross-validation in ATLAS

Cross-validation sits at the heart of ATLAS's validation strategy. It lets us systematically rotate which portion of our training data is used for validation, with 'k' referring to the number of groups we split our data into.

#### Why do we need it?
When evaluating our house price models, using a single train-validation split is risky - our results might depend heavily on which properties end up in which set. Cross-validation solves this by validating each model multiple times on different splits of the data.

#### How it works in practice
The general procedure is as follows:

1. Mix up the data randomly
2. Cut it into `k` equal chunks
3. For each chunk:
  - Use it to validate
  - Use everything else to train
  - Build a model and see how it does
  - Write down the score
4. Average all those scores together
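
The four steps above can be sketched in a few lines. This is a minimal illustration, assuming synthetic data in place of the London dataset and an arbitrary shallow tree; the variable names are ours, not part of ATLAS.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (an assumption, not the London dataset)
rng = np.random.default_rng(42)
X = rng.uniform(300, 3000, size=(200, 1))            # e.g. area in sq ft
y = 500 * X[:, 0] + rng.normal(0, 50_000, size=200)  # e.g. price

# Steps 1-2: shuffle the rows and cut them into k chunks
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Step 3: hold one chunk out, train on the rest, score on the held-out chunk
    model = DecisionTreeRegressor(max_depth=3, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_scores.append(mean_absolute_error(y[val_idx], predictions))

# Step 4: average the scores
mean_score = float(np.mean(fold_scores))
print(f"MAE per fold: {[round(s) for s in fold_scores]}")
print(f"Mean MAE: {mean_score:,.0f}")
```

A fresh model is created inside the loop so that nothing learned on one fold carries over to the next.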

Let's see how this works with our house prices.

Imagine you have 1000 house prices and want to test your model. You could:

1. **Simple split (not great):**
   ```
   800 houses for training → Train Model → Test on 200 houses
   ```
   Problem: Your results might depend heavily on which 200 houses you picked

2. **Use 5-fold cross-validation (much better):**
   ```
   Split 800 training houses into 5 folds of 160 each

   Fold 1: [Val][Train][Train][Train][Train]
   Fold 2: [Train][Val][Train][Train][Train]
   Fold 3: [Train][Train][Val][Train][Train]
   Fold 4: [Train][Train][Train][Val][Train]
   Fold 5: [Train][Train][Train][Train][Val]
   ```
   
   Now you:
   - Train 5 different times
   - Each time, use 4 folds (640 houses) for training
   - Validate on the remaining fold (160 houses)
   - Average the results

This gives you much more reliable performance estimates and tells you how much your model's performance varies.

For ATLAS, this means we can tell if a model really works across all kinds of London properties and hasn't just gotten lucky with one split.
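
To make the 800 → 640/160 arithmetic and the "how much does it vary" point concrete, here is a hedged sketch using scikit-learn's `cross_val_score`. The data is synthetic and the model settings are arbitrary assumptions, chosen only to keep the example fast.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for 800 training houses (an assumption for illustration)
rng = np.random.default_rng(0)
X_train = rng.uniform(300, 3000, size=(800, 1))
y_train = 500 * X_train[:, 0] + rng.normal(0, 50_000, size=800)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each validation fold holds 160 houses; the other 640 are used for training
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train)]

# scikit-learn reports errors as negative scores, so flip the sign
scores = -cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_train, y_train,
    cv=cv,
    scoring="neg_mean_absolute_error",
)

print(f"Validation fold sizes: {fold_sizes}")
print(f"MAE: {scores.mean():,.0f} ± {scores.std():,.0f}")
```

The standard deviation across folds is the number to watch: a model with a slightly worse mean but a much smaller spread is often the safer choice.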

### Core components

#### 1. PreProcessor: The data guardian
```python
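preprocessor = PreProcessor()
prepared_data = preprocessor.prepare_pre_split_features(raw_data)  # adds log_price and price_band
train_data, test_data = preprocessor.create_train_test_split(prepared_data)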
```

The PreProcessor's job is simple but crucial:
- Split data into training and test sets
- Ensure the splits represent all price ranges using stratified sampling
- Keep the test data untouched until final evaluation

#### 2. FeatureEncoder: The feature factory

The FeatureEncoder tackles our biggest challenge: how to use location information without leaking price data. Here's the problem:

##### The price leakage problem

Imagine you're predicting house prices in London. You know that houses in Chelsea are expensive, so you want to encode this information. A naive approach would be:

```python
# 🚫 BAD APPROACH - Price Leakage!
chelsea_average_price = all_data[all_data['area'] == 'Chelsea']['price'].mean()
data['chelsea_price_level'] = chelsea_average_price
```

This leaks price information because the feature is built from the entire dataset's prices, including the very properties we later validate and test on. Instead, ATLAS does this:

```python
# ✅ GOOD APPROACH - No Leakage
def encode_location(train_fold, validation_fold):
    # Calculate price levels using ONLY training data
    area_prices = train_fold.groupby('area')['price'].mean()

    # Apply to a copy of the validation data without leakage
    validation_fold = validation_fold.copy()
    validation_fold['area_price_level'] = validation_fold['area'].map(area_prices)
    return validation_fold
```

ATLAS also encodes area-based price information hierarchically:
```
City Level (e.g., London)
   ↓
Area Level (e.g., North London)
   ↓
Neighborhood Level (e.g., Islington)
```

For areas with limited data, we fall back to broader geographic averages.

#### 3. CrossValidator: The experiment runner

The CrossValidator is where everything comes together.

It trains and evaluates every combination of model and feature set, and it calls the feature encoders inside each training fold to prevent target variable leakage.

```python
class CrossValidator:
    def evaluate_all_combinations(self, train_data, test_data):
        results = []
        
        # PART 1: K-FOLD CROSS VALIDATION
        kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(train_data)):
            # Get this fold's data
            fold_train = train_data.iloc[train_idx]
            fold_val = train_data.iloc[val_idx]
            
            # Create features fresh for this fold
            feature_sets = encoder.create_fold_features(fold_train, fold_val)
            
            # Try each feature set and model combination
            for feature_set in feature_sets:
                for model_name, model in self.models.items():
                    # Train and evaluate
                    model.fit(feature_set.X_train, feature_set.y_train)
                    val_pred = model.predict(feature_set.X_val)
                    
                    # Record results
                    results.append({
                        'fold': fold_idx,
                        'feature_set': feature_set.name,
                        'model': model_name,
                        'performance': calculate_metrics(feature_set.y_val, val_pred)
                    })
        
        # PART 2: FINAL EVALUATION
        # Only best models get evaluated on test data
        best_models = select_best_models(results)
        final_results = evaluate_on_test_data(best_models, test_data)
        return final_results
```

This ensures each fold is truly independent, with its own feature encoding.

### Class design and model persistence

ATLAS is built with deployment in mind, not just experimentation. Each encoder can easily be extended to save its state:

```python
import pickle


class FeatureEncoder:
    def __init__(self):
        self.encoding_maps = {}  # Stores encoding information
        
    def save(self, path):
        # Save encoding_maps for future use
        with open(path, 'wb') as f:
            pickle.dump(self.encoding_maps, f)
            
    @classmethod
    def load(cls, path):
        # Load saved encoder
        with open(path, 'rb') as f:
            encoder = cls()
            encoder.encoding_maps = pickle.load(f)
            return encoder
```

This means when you deploy your model, you can:
1. Save all the preprocessing steps
2. Load them in production
3. Apply the exact same transformations to new data
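
As a rough illustration of that save/load round trip, the sketch below uses a hypothetical `TinyEncoder` (a stand-in we made up, not the real `FeatureEncoder`) with a single area-level encoding map and toy prices. It only aims to show that a restored encoder produces the same features as the original.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


class TinyEncoder:
    """Hypothetical stand-in for FeatureEncoder, reduced to one encoding map."""

    def __init__(self):
        self.encoding_maps = {}

    def fit(self, train: pd.DataFrame) -> "TinyEncoder":
        # Learn mean price per area from training data only
        self.encoding_maps["area"] = train.groupby("area")["price"].mean().to_dict()
        self.encoding_maps["global_mean"] = float(train["price"].mean())
        return self

    def transform(self, data: pd.DataFrame) -> pd.Series:
        # Unseen areas fall back to the global training mean
        return data["area"].map(self.encoding_maps["area"]).fillna(
            self.encoding_maps["global_mean"]
        )

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.encoding_maps, f)

    @classmethod
    def load(cls, path):
        encoder = cls()
        with open(path, "rb") as f:
            encoder.encoding_maps = pickle.load(f)
        return encoder


# Toy data (made-up prices)
train = pd.DataFrame({
    "area": ["Chelsea", "Chelsea", "Putney"],
    "price": [2_000_000, 3_000_000, 700_000],
})
new_data = pd.DataFrame({"area": ["Putney", "Chelsea", "Camden"]})

encoder = TinyEncoder().fit(train)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "encoder.pkl"
    encoder.save(path)
    restored = TinyEncoder.load(path)

before = encoder.transform(new_data)
after = restored.transform(new_data)
print(after.tolist())
```

Only the plain dictionary of learned values is pickled, which keeps the saved file independent of the class definition itself.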

### System workflow

Let's follow how ATLAS processes our housing data:

1. **Initial split** (PreProcessor)
   ```
   Raw Data → Training Data (80%) + Test Data (20%)
   ```
   Test data remains locked away until final evaluation.

2. **Cross-validation split** (CrossValidator)
   ```
   Training Data → 5 Folds
   Each fold gets a turn as validation data
   ```
   This gives us reliable performance estimates.

3. **Feature engineering** (FeatureEncoder)
   ```
   For each fold:
      Create features using only training portion
      Apply those features to validation portion
   ```
   This prevents data leakage while giving us multiple feature combinations to try.

4. **Model training and evaluation**
   ```
   Cross-validation phase:
     For each fold:
         For each feature set:
             For each model type:
                 Train on fold training data
                 Evaluate on fold validation data
                 Record performance metrics

   Final evaluation phase:
     For best performing models:
         Train on full training data
         Evaluate on held-out test data
   ```
   This gives us comprehensive comparison results.

### Key challenges solved

The architecture of ATLAS directly addresses common machine learning challenges:

1. **The feature selection problem**
   
   Instead of guessing what drives house prices, ATLAS methodically evaluates a range of feature combinations across different models.

2. **The reliability problem**
   
   Rather than trusting a single train/test split, ATLAS uses cross-validation for robust estimates and keeps a separate test set for final validation.

3. **The leakage problem**

   By encoding features separately for each fold, ATLAS ensures our models never see future price information during training.

4. **The deployment problem**

   All components can easily be extended to save their state, ensuring our deployed models use exactly the same transformations we tested.

This systematic approach means our predictions are based on evidence, not intuition.


### Next steps

Now that we understand how ATLAS works, let's see what it reveals about London house prices. We'll:
1. See ATLAS in action on real data
2. Learn how to interpret its results
3. Use those insights to choose the best model for our needs
4. Discuss the real-world implications of our models


## ATLAS implementation

### Required libraries

We'll use the following libraries to compare our decision trees - keeping it simple:

| Library | Purpose |
|---------|----------|
| NumPy | Numerical computations and array operations |
| Pandas | DataFrames, groupby |
| sklearn.tree | Decision trees, feature importance |
| sklearn.ensemble | Random forests, bagging, parallel training |
| XGBoost | Gradient boosting, early stopping, GPU support |
| sklearn.model_selection | Train-test split, cross-validation, parameter tuning |
| sklearn.preprocessing | Feature encoding, scaling, pipelines |
| sklearn.metrics | Error metrics, scoring, validation |
| typing | Type hints, TypedDict |
| dataclasses | Data structures, automated class creation |
| pickle | Model saving/loading |
| tqdm | Progress bars, ETA estimation |
| IPython.display | Rich notebook output |


#### Configuration
- Fixed random seeds for reproducibility
- Formatted DataFrame output
- Full column visibility



In [9]:
# Core data and analysis libraries
import numpy as np                                     # For numerical computations and array operations
import pandas as pd                                    # For data manipulation and analysis using DataFrames
from typing import TypedDict

# Machine Learning Framework
from sklearn.model_selection import (
    train_test_split,                                 # Splits data into training and test sets
    KFold,                                            # Performs k-fold cross-validation
)

from sklearn.preprocessing import (
    OneHotEncoder                                     # Converts categorical variables into binary features
)

# Tree-based Models
from sklearn.tree import DecisionTreeRegressor        # Basic decision tree implementation
from sklearn.ensemble import RandomForestRegressor    # Ensemble of decision trees
from xgboost import XGBRegressor                      # Gradient boosting implementation

# Model Evaluation Metrics
from sklearn.metrics import (
    mean_absolute_error,                             # Measures average magnitude of prediction errors
    r2_score,                                        # Measures proportion of variance explained by model
)

# Utilities and infrastructure
from typing import Dict, List, Tuple, Optional       # For type annotations
from dataclasses import dataclass                    # For creating data classes
import pickle                                        # For saving/loading objects
from pathlib import Path

# Progress Tracking
from tqdm.notebook import tqdm                       # For displaying progress bars
from IPython.display import display                  # For rich output in notebooks

# Display Configuration
pd.set_option('display.max_columns', None)           # Show all columns in DataFrames
pd.set_option('display.float_format',                # Format floating point numbers to 2 decimal places
              lambda x: '{:,.2f}'.format(x))

# Reproducibility Settings
RANDOM_STATE = 42                                    # Fixed seed for reproducible results
np.random.seed(RANDOM_STATE)

### Data loading, validation and FeatureSet dataclass implementation

In [3]:
def validate_housing_data(df: pd.DataFrame) -> None:
    """Validate housing data has correct columns and content"""
    # Check required columns exist
    required_columns = [
        'Price', 'Area in sq ft', 'No. of Bedrooms',
        'House Type', 'Outcode', 'Postal Code', 'Location', 'City/County'
    ]

    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Basic data validation
    if (df['Price'] <= 0).any():
        raise ValueError("Found non-positive prices")

    if (df['Area in sq ft'] <= 0).any():
        raise ValueError("Found non-positive areas")

    if ((df['No. of Bedrooms'] <= 0) | (df['No. of Bedrooms'] > 20)).any():
        raise ValueError("Invalid number of bedrooms")

    # Print summary
    print("Data validation complete!")
    print(f"Rows: {len(df)}")
    print(f"Price range: £{df['Price'].min():,.0f} - £{df['Price'].max():,.0f}")
    print(f"Area range: {df['Area in sq ft'].min():,.0f} - {df['Area in sq ft'].max():,.0f} sq ft")
    print(f"Bedrooms range: {df['No. of Bedrooms'].min()} - {df['No. of Bedrooms'].max()}")
    print(f"Missing locations: {df['Location'].isnull().sum()}")

# Load and validate data
# df_with_outcode = pd.read_csv('../data/df_with_outcode.csv')
url = "https://raw.githubusercontent.com/powell-clark/supervised-machine-learning/main/data/df_with_outcode.csv"
df_with_outcode = pd.read_csv(url)

validate_housing_data(df_with_outcode)
display(df_with_outcode.head())

@dataclass
class FeatureSet:
    X_train: pd.DataFrame
    X_val: pd.DataFrame
    y_train: pd.Series
    y_val: pd.Series
    name: str
    description: str

Data validation complete!
Rows: 3480
Price range: £180,000 - £39,750,000
Area range: 274 - 15,405 sq ft
Bedrooms range: 1 - 10
Missing locations: 916


| Unnamed: 0 | Price | House Type | Area in sq ft | No. of Bedrooms | Location | City/County | Postal Code | Outcode |
|---|---|---|---|---|---|---|---|---|
| 0 | 1675000 | House | 2716 | 5 | wimbledon | london | SW19 8NY | SW19 |
| 1 | 650000 | Flat / Apartment | 814 | 2 | clerkenwell | london | EC1V 3PA | EC1V |
| 2 | 735000 | Flat / Apartment | 761 | 2 | putney | london | SW15 1QL | SW15 |
| 3 | 1765000 | House | 1986 | 4 | putney | london | SW15 1LP | SW15 |
| 4 | 675000 | Flat / Apartment | 700 | 2 | putney | london | SW15 1PL | SW15 |


### PreProcessor

In Lesson 2B we learned that proper model evaluation requires careful data splitting. Since house prices follow a highly skewed distribution, we need to ensure our train and test sets have similar price distributions.

The PreProcessor class exists to:

- Add transformations before modeling, in this case transforming price to log price
- Create price bands for stratification using log-transformed prices
- Perform stratified train/test splits that preserve the price distribution
- Provide a foundation for any future preprocessing needs

#### Input Requirements
The DataFrame must already be a clean dataset of features ready for modeling.

In [4]:
class PreProcessor:
    """Handles initial data transformations and train/test splitting"""

    def __init__(self, random_state: int = RANDOM_STATE):
        self.random_state = random_state

    def prepare_pre_split_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Creates features that must be calculated before train/test split"""
        df_processed = df.copy()

        # Add log-transformed price
        df_processed['log_price'] = np.log(df_processed['Price'])

        # Create price bands for stratification
        df_processed['price_band'] = pd.qcut(df_processed['log_price'], q=10, labels=False)

        return df_processed

    def create_train_test_split(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Performs stratified train/test split using price bands"""
        train_data, test_data = train_test_split(
            df,
            test_size=0.2,
            stratify=df['price_band'],
            random_state=self.random_state
        )

        return train_data, test_data

print("PreProcessor class loaded successfully!")

PreProcessor class loaded successfully!
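
One way to convince ourselves that stratifying on price bands does what we want is to compare band shares across the two sets. The sketch below follows the same recipe as `PreProcessor` (log price, ten quantile bands, 80/20 stratified split) but on synthetic, right-skewed prices, which are an assumption standing in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed prices (an assumption standing in for the real data)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Price": rng.lognormal(mean=13.5, sigma=0.8, size=1000)})

# Same recipe as PreProcessor: log price, then 10 quantile bands
demo["log_price"] = np.log(demo["Price"])
demo["price_band"] = pd.qcut(demo["log_price"], q=10, labels=False)

train_demo, test_demo = train_test_split(
    demo, test_size=0.2, stratify=demo["price_band"], random_state=42
)

# Share of each price band in the two sets: the columns should match closely
band_shares = pd.DataFrame({
    "train": train_demo["price_band"].value_counts(normalize=True).sort_index(),
    "test": test_demo["price_band"].value_counts(normalize=True).sort_index(),
})
print(band_shares.round(3))
```

Without `stratify`, the handful of very expensive properties could easily land mostly in one set, which would distort any comparison between training and test error.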


### FeatureEncoder

Our FeatureEncoder solves several core challenges in house price prediction:
1. Converting raw data into model-ready features without data leakage
2. Engineering rich location-based price signals while preserving validation integrity
3. Generating multiple feature combinations
4. Consistent handling of training, validation, and test data

#### Geographic Encoding: A three-level challenge

The location fields in our housing data form a natural hierarchy:
```
Outcode (e.g., "SW1")
   ↓
Postal Code (e.g., "SW1A 1AA")
   ↓
Location (e.g., "Buckingham Palace")
```

Each level presents a tradeoff between specificity and data availability. Here's how we handle each:

#### 1. Outcode level (primary signal)
```python
def _encode_outcode_target(self, train_data, eval_data):
    if 'cv_fold' in train_data.columns:  # Cross-validation mode
        oof_predictions = pd.Series(index=train_data.index, dtype='float64')
        kf = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
        for train_idx, val_idx in kf.split(train_data):
            inner_train = train_data.iloc[train_idx]
            inner_val = train_data.iloc[val_idx]
            outcode_means = inner_train.groupby('Outcode')['log_price'].mean()
            oof_predictions.iloc[val_idx] = inner_val['Outcode'].map(outcode_means)
    else:  # Test/Production mode
        outcode_means = train_data.groupby('Outcode')['log_price'].mean()
        encoded = eval_data['Outcode'].map(outcode_means)
```
- Most robust due to larger sample sizes
- Different logic for CV vs test/production predictions
- Handles unseen outcodes via global mean

#### 2. Postal code level (more granular)
```python
def _encode_postcode_target(self, fold_train, fold_val, outcode_encoding):
    counts = fold_train['Postal Code'].value_counts()
    means = fold_train.groupby('Postal Code')['log_price'].mean()
    
    # Bayesian-style smoothing
    weight = counts / (counts + self.smoothing_factor)
    encoded = weight * means + (1 - weight) * outcode_encoding
```
- Adaptive trust in local estimates
- Smoothing against outcode baseline
- Handles data sparsity gracefully
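
To see what the smoothing weight does numerically, here is a small worked sketch. The toy log prices and the single `outcode_prior` value are assumptions for illustration; `smoothing_factor = 10` mirrors the `FeatureEncoder` default.

```python
import pandas as pd

# Toy training fold (made-up values): one busy postcode, one with a single sale
fold_train = pd.DataFrame({
    "Postal Code": ["SW15 1QL"] * 30 + ["SW15 1LP"],
    "log_price": [13.5] * 30 + [15.0],
})
outcode_prior = 14.0     # assumed mean log price for the SW15 outcode
smoothing_factor = 10    # mirrors the FeatureEncoder default

stats = fold_train.groupby("Postal Code")["log_price"].agg(["mean", "count"])

# weight -> 1 as a postcode gets more sales, -> 0 when data is scarce
stats["weight"] = stats["count"] / (stats["count"] + smoothing_factor)
stats["encoded"] = (
    stats["weight"] * stats["mean"] + (1 - stats["weight"]) * outcode_prior
)
print(stats.round(3))
```

The postcode with 30 sales gets a weight of 0.75 and stays close to its own mean, while the postcode with a single sale gets a weight of about 0.09 and is pulled almost all the way back to the outcode prior.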

#### 3. Location level (maximum detail)
```python
def _encode_location_target(self, fold_train, fold_val, postcode_encoding):
    counts = fold_train['Location'].value_counts()
    means = fold_train.groupby('Location')['log_price'].mean()
    
    # Handle rare locations
    low_freq_mask = (counts < self.min_location_freq)
    encoded[low_freq_mask] = postcode_encoding[low_freq_mask]
```
- Falls back to postal code for rare locations
- Minimum frequency threshold prevents unstable estimates
- Preserves granular information where reliable

#### Cross-validation safety mechanisms

The encoder implements three critical safeguards:

1. **Out-of-Fold Encoding**
```python
for train_idx, val_idx in kf.split(train_data):
    # Encode validation using only training data
    inner_train = train_data.iloc[train_idx]
    inner_val = train_data.iloc[val_idx]
    encoded = encode_features(inner_train, inner_val)
```
- Prevents target leakage during model selection
- Maintains fold independence
- Mimics real-world information availability

2. **Test Set Handling**
```python
if is_test_set:
    # Use all training data for stable estimates
    means = full_training_data.groupby('Location')['log_price'].mean()
    encoded = test_data['Location'].map(means).fillna(global_mean)
```
- Maximises encoding stability for final evaluation
- Uses full training data appropriately
- Ready for production use

3. **Hierarchical Fallbacks**
```python
def encode_location(self, data, means, fallback):
    encoded = data.map(means)
    return encoded.fillna(fallback)  # Use broader geography when needed
```
- Systematic fallback chain
- No missing values, as long as the final fallback (the global mean) is defined
- Maintains encoding stability
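
Chaining those fallbacks might look like the sketch below. It is a simplified illustration with toy data: the postcodes `SW15 9ZZ` and `N1 2AB` are hypothetical values chosen to be absent from the training rows, and the real encoder also applies smoothing and frequency thresholds that are left out here.

```python
import pandas as pd

# Toy training rows (made-up log prices)
train = pd.DataFrame({
    "Outcode":     ["SW15", "SW15", "SW15", "EC1V"],
    "Postal Code": ["SW15 1QL", "SW15 1QL", "SW15 1LP", "EC1V 3PA"],
    "log_price":   [13.5, 13.7, 14.4, 13.4],
})
# New rows: a known postcode, an unseen postcode in a known outcode,
# and a completely unseen outcode (hypothetical values)
new_rows = pd.DataFrame({
    "Outcode":     ["SW15", "SW15", "N1"],
    "Postal Code": ["SW15 1QL", "SW15 9ZZ", "N1 2AB"],
})

postcode_means = train.groupby("Postal Code")["log_price"].mean()
outcode_means = train.groupby("Outcode")["log_price"].mean()
global_mean = train["log_price"].mean()

# Most specific level first, then progressively broader geography
encoded = (
    new_rows["Postal Code"].map(postcode_means)
    .fillna(new_rows["Outcode"].map(outcode_means))
    .fillna(global_mean)
)
print(encoded.round(3).tolist())
```

The first row uses its postcode mean, the second falls back to the SW15 outcode mean, and the third falls all the way back to the global mean.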

#### Feature Set Generation

The encoder creates multiple feature combinations:

1. **Base Features**
```python
numeric_features = ['Area in sq ft', 'No. of Bedrooms']
```

2. **Property Features**
```python
house_features = one_hot_encode('House Type')
city_features = one_hot_encode('City/County')
```

3. **Geographic Variants**
```python
geo_features = {
    'target': location_encoded,
    'onehot': outcode_onehot,
    'price_sqft': price_per_sqft
}
```

4. **Progressive Combinations**
```python
feature_sets = [
    base_only,
    base_plus_house,
    base_plus_geo,
    all_features
]
```

#### Production Readiness

The current implementation calculates encodings on-the-fly:
```python
class FeatureEncoder:
    def _encode_outcode_target(self, train_data, eval_data):
        means = train_data.groupby('Outcode')['log_price'].mean()
        return eval_data['Outcode'].map(means)
```

It could be extended for persistence:
```python
class FeatureEncoder:
    def save_encodings(self, path):
        encodings = {
            'outcode_means': self.outcode_means,
            'global_mean': self.global_mean,
            'smoothing_factor': self.smoothing_factor
        }
        with open(path, 'wb') as f:
            pickle.dump(encodings, f)
```

The current design is appropriate because:
1. During experimentation, we need fresh calculations for valid cross-validation
2. Final model selection determines which encodings to persist
3. Encoding logic remains correct for both scenarios

#### Why This Architecture Succeeds

1. **Statistical Validity**
   - Proper handling of training/validation/test boundaries
   - Appropriate use of hierarchical information
   - Robust handling of sparse data

2. **Production Viability**
   - Clean separation of cross-validation and test logic
   - Ready for persistence extension
   - Systematic feature generation

3. **Engineering Quality**
   - Clear single responsibility for each method
   - Explicit handling of edge cases
   - Well-documented assumptions

The FeatureEncoder isn't just converting data - it's ensuring our entire modeling pipeline maintains statistical validity while remaining practical for production deployment.

In [10]:
@dataclass
class EncoderState:
    """State container for FeatureEncoder persistence"""
    # Parameters
    smoothing_factor: int
    min_location_freq: int
    random_state: int

    # Fitted encoders
    house_encoder: Optional[OneHotEncoder] = None
    city_country_encoder: Optional[OneHotEncoder] = None
    outcode_encoder: Optional[OneHotEncoder] = None

    # Geographic statistics
    outcode_means: Optional[Dict[str, float]] = None
    outcode_global_mean: Optional[float] = None
    postcode_means: Optional[Dict[str, float]] = None
    postcode_counts: Optional[Dict[str, int]] = None
    location_means: Optional[Dict[str, float]] = None
    location_counts: Optional[Dict[str, int]] = None
    price_per_sqft_means: Optional[Dict[str, float]] = None
    price_per_sqft_global_mean: Optional[float] = None


class FeatureEncoder:
    """Handles all feature engineering and encoding with fold awareness and persistence"""

    def __init__(self, smoothing_factor: int = 10, min_location_freq: int = 5, random_state: int = 42):
        self.smoothing_factor = smoothing_factor
        self.min_location_freq = min_location_freq
        self.random_state = random_state
        self.state = EncoderState(
            smoothing_factor=smoothing_factor,
            min_location_freq=min_location_freq,
            random_state=random_state
        )

    def _calculate_outcode_price_per_sqft(self,
                                        fold_train: pd.DataFrame,
                                        fold_val: pd.DataFrame) -> Dict[str, pd.Series]:
        """
        Calculate mean price per square foot using out-of-fold means for outcodes

        Args:
            fold_train: Training data for current fold
            fold_val: Validation data for current fold

        Returns:
            Dictionary containing train and validation series of outcode mean price per sqft
        """
        # Initialise empty series for OOF predictions
        oof_price_per_sqft = pd.Series(index=fold_train.index, dtype='float64')

        # Calculate OOF means for training data
        kf = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
        for train_idx, val_idx in kf.split(fold_train):
            inner_train = fold_train.iloc[train_idx]
            inner_val = fold_train.iloc[val_idx]

            # Calculate price per sqft for inner training set
            inner_price_per_sqft = inner_train['Price'] / inner_train['Area in sq ft']
            outcode_means = inner_price_per_sqft.groupby(inner_train['Outcode']).mean()
            global_mean = inner_price_per_sqft.mean()

            # Apply to inner validation set
            oof_price_per_sqft.iloc[val_idx] = (
                inner_val['Outcode']
                .map(outcode_means)
                .fillna(global_mean)
            )

        # Calculate means for validation data using full training set
        train_price_per_sqft = fold_train['Price'] / fold_train['Area in sq ft']
        outcode_means = train_price_per_sqft.groupby(fold_train['Outcode']).mean()
        global_mean = train_price_per_sqft.mean()

        val_price_per_sqft = (
            fold_val['Outcode']
            .map(outcode_means)
            .fillna(global_mean)
        )

        return {
            'train': oof_price_per_sqft,
            'val': val_price_per_sqft
        }

    def _encode_house_type(self,
                          fold_train: pd.DataFrame,
                          fold_val: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        """Create one-hot encoding for house type"""
        # Initialise encoder for this fold
        house_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

        # Fit on fold's training data
        train_encoded = pd.DataFrame(
            house_encoder.fit_transform(fold_train[['House Type']]),
            columns=house_encoder.get_feature_names_out(['House Type']),
            index=fold_train.index
        )

        # Transform validation data
        val_encoded = pd.DataFrame(
            house_encoder.transform(fold_val[['House Type']]),
            columns=house_encoder.get_feature_names_out(['House Type']),
            index=fold_val.index
        )

        return {
            'train': train_encoded,
            'val': val_encoded
        }

    def _encode_city_country(self,
                           fold_train: pd.DataFrame,
                           fold_val: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        """Create one-hot encoding for city/county"""
        # Initialise encoder for this fold
        city_country_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

        # Fit on fold's training data
        train_encoded = pd.DataFrame(
            city_country_encoder.fit_transform(fold_train[['City/County']]),
            columns=city_country_encoder.get_feature_names_out(['City/County']),
            index=fold_train.index
        )

        # Transform validation data
        val_encoded = pd.DataFrame(
            city_country_encoder.transform(fold_val[['City/County']]),
            columns=city_country_encoder.get_feature_names_out(['City/County']),
            index=fold_val.index
        )

        return {
            'train': train_encoded,
            'val': val_encoded
        }

    def _encode_outcode_onehot(self,
                              fold_train: pd.DataFrame,
                              fold_val: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        """Create one-hot encoding for outcodes"""
        outcode_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

        train_encoded = pd.DataFrame(
            outcode_encoder.fit_transform(fold_train[['Outcode']]),
            columns=outcode_encoder.get_feature_names_out(['Outcode']),
            index=fold_train.index
        )

        val_encoded = pd.DataFrame(
            outcode_encoder.transform(fold_val[['Outcode']]),
            columns=outcode_encoder.get_feature_names_out(['Outcode']),
            index=fold_val.index
        )

        return {
            'train': train_encoded,
            'val': val_encoded
        }

    def _encode_outcode_postcode_location_target_hierarchical(self,
                                                            fold_train: pd.DataFrame,
                                                            fold_val: pd.DataFrame
                                                            ) -> Tuple[Dict[str, pd.Series],
                                                                     Dict[str, pd.Series],
                                                                     Dict[str, pd.Series]]:
        """
        Create hierarchical target encoding for geographic features:
        - Outcode encoding
        - Postcode encoding using outcode as prior
        - Location encoding using postcode as prior

        Returns:
            Tuple of (outcode_encoding, postcode_encoding, location_encoding)
        """
        # 1. Outcode encoding
        outcode_encoding = self._encode_outcode_target(fold_train, fold_val)

        # 2. Postcode encoding using outcode as prior
        postcode_encoding = self._encode_postcode_target(
            fold_train,
            fold_val,
            outcode_encoding
        )

        # 3. Location encoding using postcode as prior
        location_encoding = self._encode_location_target(
            fold_train,
            fold_val,
            postcode_encoding
        )

        return outcode_encoding, postcode_encoding, location_encoding

    def _encode_outcode_target(self,
                             train_data: pd.DataFrame,
                             eval_data: pd.DataFrame) -> Dict[str, pd.Series]:
        """Create target encoding for outcodes"""
        if 'cv_fold' in train_data.columns:  # We're in cross-validation
            # Use out-of-fold encoding for training data
            oof_predictions = pd.Series(index=train_data.index, dtype='float64')

            kf = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
            for inner_train_idx, inner_val_idx in kf.split(train_data):
                inner_train = train_data.iloc[inner_train_idx]
                inner_val = train_data.iloc[inner_val_idx]

                outcode_means = inner_train.groupby('Outcode')['log_price'].mean()
                global_mean = inner_train['log_price'].mean()

                oof_predictions.iloc[inner_val_idx] = (
                    inner_val['Outcode']
                    .map(outcode_means)
                    .fillna(global_mean)
                )

            # For validation data, use means from all training data
            outcode_means = train_data.groupby('Outcode')['log_price'].mean()
            global_mean = train_data['log_price'].mean()

            val_encoded = (
                eval_data['Outcode']
                .map(outcode_means)
                .fillna(global_mean)
            )

            return {
                'train': oof_predictions,
                'val': val_encoded
            }

        else:  # We're encoding for the test set
            # Use all training data to encode test set
            outcode_means = train_data.groupby('Outcode')['log_price'].mean()
            global_mean = train_data['log_price'].mean()

            test_encoded = (
                eval_data['Outcode']
                .map(outcode_means)
                .fillna(global_mean)
            )

            return {
                'train': train_data['Outcode'].map(outcode_means).fillna(global_mean),
                'val': test_encoded
            }

    def _encode_postcode_target(self,
                              fold_train: pd.DataFrame,
                              fold_val: pd.DataFrame,
                              outcode_encoding: Dict[str, pd.Series]) -> Dict[str, pd.Series]:
        """Create hierarchical encoding for postcodes using outcode prior"""
        postcode_means = fold_train.groupby('Postal Code')['log_price'].mean()
        postcode_counts = fold_train['Postal Code'].value_counts()

        def encode_postcodes(df: pd.DataFrame, outcode_encoded: pd.Series) -> pd.Series:
            counts = df['Postal Code'].map(postcode_counts)
            means = df['Postal Code'].map(postcode_means)

            # Handle unseen categories using outcode encoding
            means = means.fillna(outcode_encoded)
            counts = counts.fillna(0)

            # Calculate smoothed values
            weight = counts / (counts + self.smoothing_factor)
            return weight * means + (1 - weight) * outcode_encoded

        return {
            'train': encode_postcodes(fold_train, outcode_encoding['train']),
            'val': encode_postcodes(fold_val, outcode_encoding['val'])
        }

    def _encode_location_target(self,
                              fold_train: pd.DataFrame,
                              fold_val: pd.DataFrame,
                              postcode_encoding: Dict[str, pd.Series]) -> Dict[str, pd.Series]:
        """Create hierarchical encoding for locations using postcode prior"""
        location_means = fold_train.groupby('Location')['log_price'].mean()
        location_counts = fold_train['Location'].value_counts()

        def encode_locations(df: pd.DataFrame, postcode_encoded: pd.Series) -> pd.Series:
            counts = df['Location'].map(location_counts)
            means = df['Location'].map(location_means)

            # Handle missing and unseen locations using postcode encoding
            means = means.fillna(postcode_encoded)
            counts = counts.fillna(0)

            # Use postcode encoding for low-frequency locations
            low_freq_mask = (counts < self.min_location_freq) | counts.isna()

            # Calculate smoothed values
            weight = counts / (counts + self.smoothing_factor)
            encoded = weight * means + (1 - weight) * postcode_encoded

            # Replace low frequency locations with postcode encoding
            encoded[low_freq_mask] = postcode_encoded[low_freq_mask]

            return encoded

        return {
            'train': encode_locations(fold_train, postcode_encoding['train']),
            'val': encode_locations(fold_val, postcode_encoding['val'])
        }

    def create_fold_features(self, fold_train: pd.DataFrame, fold_val: pd.DataFrame) -> List[FeatureSet]:
        """Create all feature set variations for a fold"""

        house_features = self._encode_house_type(fold_train, fold_val)
        city_country_features = self._encode_city_country(fold_train, fold_val)

        # Geographic features with hierarchical target encoding
        # (unpacked as outcode, postcode, location)
        outcode_target_hierarchical, postcode_target_hierarchical, location_target_hierarchical = (
            self._encode_outcode_postcode_location_target_hierarchical(fold_train, fold_val)
        )

        outcode_onehot = self._encode_outcode_onehot(fold_train, fold_val)
        outcode_price_per_sqft = self._calculate_outcode_price_per_sqft(fold_train, fold_val)

        feature_combinations = [
            # Base features
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': None,
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms',
                'desc': 'Area in sq ft, No. of Bedrooms'
            },
            # Single feature additions
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': house_features,
                'city': None,
                'geo_target': None,
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_house',
                'desc': 'Area in sq ft, No. of Bedrooms, House Type'
            },
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': city_country_features,
                'geo_target': None,
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_city',
                'desc': 'Area in sq ft, No. of Bedrooms, City/County'
            },
            # Individual geographic features - Target encoded
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {'outcode': outcode_target_hierarchical},
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_outcode_target',
                'desc': 'Area in sq ft, No. of Bedrooms, Outcode (Target)'
            },
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {'postcode': postcode_target_hierarchical},
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_postcode_target',
                'desc': 'Area in sq ft, No. of Bedrooms, Postcode (Target)'
            },
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {'location': location_target_hierarchical},
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_location_target',
                'desc': 'Area in sq ft, No. of Bedrooms, Location (Target)'
            },
            # One-hot encoded outcode
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': None,
                'geo_onehot': {'outcode': outcode_onehot},
                'price_sqft': None,
                'name': 'area_bedrooms_outcode_onehot',
                'desc': 'Area in sq ft, No. of Bedrooms, Outcode (One-hot)'
            },
            # Price per square foot
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': None,
                'geo_onehot': None,
                'price_sqft': outcode_price_per_sqft,
                'name': 'area_bedrooms_pricesqft',
                'desc': 'Area in sq ft, No. of Bedrooms, Price/sqft'
            },
            # Two feature combinations
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': house_features,
                'city': city_country_features,
                'geo_target': None,
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_house_city',
                'desc': 'Area in sq ft, No. of Bedrooms, House Type, City/County'
            },
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {
                    'outcode': outcode_target_hierarchical,
                    'postcode': postcode_target_hierarchical
                },
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_outcode_postcode_target',
                'desc': 'Area in sq ft, No. of Bedrooms, Outcode & Postcode (Target)'
            },
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {
                    'postcode': postcode_target_hierarchical,
                    'location': location_target_hierarchical
                },
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_postcode_location_target',
                'desc': 'Area in sq ft, No. of Bedrooms, Postcode & Location (Target)'
            },
            # Three feature combinations
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': house_features,
                'city': city_country_features,
                'geo_target': {'outcode': outcode_target_hierarchical},
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_house_city_outcode_target',
                'desc': 'Area in sq ft, No. of Bedrooms, House Type, City/County, Outcode (Target)'
            },
            # All geographic features
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': None,
                'city': None,
                'geo_target': {
                    'outcode': outcode_target_hierarchical,
                    'postcode': postcode_target_hierarchical,
                    'location': location_target_hierarchical
                },
                'geo_onehot': None,
                'price_sqft': None,
                'name': 'area_bedrooms_all_geo_target',
                'desc': 'Area in sq ft, No. of Bedrooms, All Geographic Features (Target)'
            },
            # Complex combinations
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': house_features,
                'city': None,
                'geo_target': {'outcode': outcode_target_hierarchical},
                'geo_onehot': None,
                'price_sqft': outcode_price_per_sqft,
                'name': 'area_bedrooms_house_outcode_target_pricesqft',
                'desc': 'Area in sq ft, No. of Bedrooms, House Type, Outcode (Target), Price/sqft'
            },
            # All features
            {
                'numeric': ['Area in sq ft', 'No. of Bedrooms'],
                'house': house_features,
                'city': city_country_features,
                'geo_target': {
                    'outcode': outcode_target_hierarchical,
                    'postcode': postcode_target_hierarchical,
                    'location': location_target_hierarchical
                },
                'geo_onehot': None,
                'price_sqft': outcode_price_per_sqft,
                'name': 'all_features',
                'desc': 'All Features Combined'
            }
        ]

        return [self._combine_features(
            fold_train,
            fold_val,
            combo['numeric'],
            combo['house'],
            combo['city'],
            combo['geo_target'],
            combo['geo_onehot'],
            combo['price_sqft'],
            combo['name'],
            combo['desc']
        ) for combo in feature_combinations]

    def _combine_features(self,
                         fold_train: pd.DataFrame,
                         fold_val: pd.DataFrame,
                         base_numeric: List[str],
                         house_features: Optional[Dict[str, pd.DataFrame]],
                         city_country_features: Optional[Dict[str, pd.DataFrame]],
                         geo_target_features: Optional[Dict[str, Dict[str, pd.Series]]],
                         geo_onehot_features: Optional[Dict[str, Dict[str, pd.DataFrame]]],
                         price_sqft_features: Optional[Dict[str, pd.Series]],
                         name: str,
                         description: str) -> FeatureSet:
        """
        Combine different feature types into a single feature set
        """
        # Start with base numeric features
        X_train = fold_train[base_numeric].copy()
        X_val = fold_val[base_numeric].copy()

        # Add house type features if provided
        if house_features:
            X_train = pd.concat([X_train, house_features['train']], axis=1)
            X_val = pd.concat([X_val, house_features['val']], axis=1)

        # Add city/county features if provided
        if city_country_features:
            X_train = pd.concat([X_train, city_country_features['train']], axis=1)
            X_val = pd.concat([X_val, city_country_features['val']], axis=1)

        # Add target-encoded geographic features if provided
        if geo_target_features:
            for feature_name, feature_dict in geo_target_features.items():
                X_train[feature_name] = feature_dict['train']
                X_val[feature_name] = feature_dict['val']

        # Add one-hot encoded geographic features if provided
        if geo_onehot_features:
            for feature_name, feature_dict in geo_onehot_features.items():
                X_train = pd.concat([X_train, feature_dict['train']], axis=1)
                X_val = pd.concat([X_val, feature_dict['val']], axis=1)

        # Add price per square foot features if provided
        if price_sqft_features:
            X_train['outcode_price_per_sqft'] = price_sqft_features['train']
            X_val['outcode_price_per_sqft'] = price_sqft_features['val']

        return FeatureSet(
            X_train=X_train,
            X_val=X_val,
            y_train=fold_train['log_price'],
            y_val=fold_val['log_price'],
            name=name,
            description=description
        )

    def fit(self, training_data: pd.DataFrame) -> 'FeatureEncoder':
        """Fit all encoders on full training data for production use"""
        # Fit categorical encoders
        self.state.house_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.state.house_encoder.fit(training_data[['House Type']])

        self.state.city_country_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.state.city_country_encoder.fit(training_data[['City/County']])

        self.state.outcode_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.state.outcode_encoder.fit(training_data[['Outcode']])

        # Calculate geographic statistics
        self.state.outcode_means = (
            training_data.groupby('Outcode')['log_price'].mean().to_dict()
        )
        self.state.outcode_global_mean = training_data['log_price'].mean()

        self.state.postcode_means = (
            training_data.groupby('Postal Code')['log_price'].mean().to_dict()
        )
        self.state.postcode_counts = (
            training_data['Postal Code'].value_counts().to_dict()
        )

        self.state.location_means = (
            training_data.groupby('Location')['log_price'].mean().to_dict()
        )
        self.state.location_counts = (
            training_data['Location'].value_counts().to_dict()
        )

        # Calculate price per sqft statistics
        price_per_sqft = training_data['Price'] / training_data['Area in sq ft']
        self.state.price_per_sqft_means = (
            price_per_sqft.groupby(training_data['Outcode']).mean().to_dict()
        )
        self.state.price_per_sqft_global_mean = price_per_sqft.mean()

        return self

    def save(self, path: str) -> None:
        """Save encoder state to disk"""
        if not hasattr(self, 'state'):
            raise ValueError("Encoder not fitted. Call fit() first.")

        with open(path, 'wb') as f:
            pickle.dump(self.state, f)

    @classmethod
    def load(cls, path: str) -> 'FeatureEncoder':
        """Load encoder state from disk"""
        with open(path, 'rb') as f:
            state = pickle.load(f)

        encoder = cls(
            smoothing_factor=state.smoothing_factor,
            min_location_freq=state.min_location_freq,
            random_state=state.random_state
        )
        encoder.state = state
        return encoder

    def create_production_features(self, data: pd.DataFrame) -> pd.DataFrame:
        """Create features for production use using fitted state"""
        if not hasattr(self, 'state'):
            raise ValueError("Encoder not fitted. Call fit() first.")

        # Start with base numeric features
        features = data[['Area in sq ft', 'No. of Bedrooms']].copy()

        # Add house type features
        features = pd.concat([
            features,
            pd.DataFrame(
                self.state.house_encoder.transform(data[['House Type']]),
                columns=self.state.house_encoder.get_feature_names_out(['House Type']),
                index=data.index
            )
        ], axis=1)

        # Add geographic target encodings
        features['outcode'] = data['Outcode'].map(self.state.outcode_means).fillna(self.state.outcode_global_mean)

        # Add postcode encoding with smoothing
        postcode_means = pd.Series(data['Postal Code'].map(self.state.postcode_means))
        postcode_counts = pd.Series(data['Postal Code'].map(self.state.postcode_counts))
        postcode_means = postcode_means.fillna(features['outcode'])
        postcode_counts = postcode_counts.fillna(0)
        weight = postcode_counts / (postcode_counts + self.smoothing_factor)
        features['postcode'] = weight * postcode_means + (1 - weight) * features['outcode']

        # Add location encoding with smoothing and frequency threshold
        location_means = pd.Series(data['Location'].map(self.state.location_means))
        location_counts = pd.Series(data['Location'].map(self.state.location_counts))
        location_means = location_means.fillna(features['postcode'])
        location_counts = location_counts.fillna(0)
        weight = location_counts / (location_counts + self.smoothing_factor)
        features['location'] = weight * location_means + (1 - weight) * features['postcode']
        low_freq_mask = (location_counts < self.min_location_freq) | location_counts.isna()
        features.loc[low_freq_mask, 'location'] = features.loc[low_freq_mask, 'postcode']

        # Add price per square foot
        features['price_per_sqft'] = (
            data['Outcode']
            .map(self.state.price_per_sqft_means)
            .fillna(self.state.price_per_sqft_global_mean)
        )

        return features
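
A quick aside on the smoothing used in the hierarchical encoders above. Each level blends a category's own mean with its parent's encoding, using `weight = count / (count + smoothing_factor)`. Here's a minimal sketch of that formula in plain Python so the effect is easy to see. The numbers are made up, `smoothed_encoding` is a hypothetical helper (not part of ATLAS), and the smoothing factor of 10 is an assumption purely for illustration.

```python
def smoothed_encoding(category_mean: float,
                      category_count: int,
                      parent_encoding: float,
                      smoothing_factor: float) -> float:
    """Blend a category mean with its parent's encoding.

    Mirrors the formula in _encode_postcode_target:
    weight = count / (count + smoothing_factor)
    """
    weight = category_count / (category_count + smoothing_factor)
    return weight * category_mean + (1 - weight) * parent_encoding


# Hypothetical log-price values for one outcode and its postcodes
outcode_encoding = 14.0
smoothing = 10.0  # assumed value, for illustration only

# A postcode seen once barely moves away from the outcode encoding
rare_postcode = smoothed_encoding(15.0, 1, outcode_encoding, smoothing)
# A postcode seen 90 times is trusted almost fully
common_postcode = smoothed_encoding(15.0, 90, outcode_encoding, smoothing)
# An unseen postcode (count 0) falls back to the outcode encoding
unseen_postcode = smoothed_encoding(outcode_encoding, 0, outcode_encoding, smoothing)

print(f"rare:   {rare_postcode:.3f}")
print(f"common: {common_postcode:.3f}")
print(f"unseen: {unseen_postcode:.3f}")
```

The takeaway: sparse categories lean on their parent, well-populated ones speak for themselves. That matters here because most locations in this dataset have very few samples.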

In [11]:
def test_encoder_persistence(df_with_outcode: pd.DataFrame):
    """Test that the FeatureEncoder can be saved and loaded correctly"""
    print("Testing FeatureEncoder persistence...")

    # Setup save path with proper folder creation
    save_dir = Path("../model/atlas")
    save_path = save_dir / "encoder.pkl"

    # Create directories if they don't exist
    save_dir.mkdir(parents=True, exist_ok=True)

    # Remove existing encoder file if it exists
    if save_path.exists():
        print(f"Found existing encoder at {save_path}, will overwrite")
        save_path.unlink()

    # Use PreProcessor for proper stratified splitting
    preprocessor = PreProcessor()

    # Create price bands and split data properly
    df_processed = preprocessor.prepare_pre_split_features(df_with_outcode)
    training_data, production_data = preprocessor.create_train_test_split(df_processed)

    print("\n1. Training phase...")
    # Create and fit encoder
    encoder = FeatureEncoder()
    encoder.fit(training_data)

    # Print encoding statistics
    print("\nEncoding Statistics:")
    print("-" * 40)
    print(f"Number of unique Outcodes: {len(encoder.state.outcode_means)}")
    print(f"Number of unique Postcodes: {len(encoder.state.postcode_means)}")
    print(f"Number of unique Locations: {len(encoder.state.location_means)}")

    # Print some price statistics
    print("\nPrice Statistics (from encoded values):")
    print("-" * 40)
    outcode_prices = pd.Series(encoder.state.outcode_means)
    print("Outcode price levels:")
    print(f"  Min: £{np.exp(outcode_prices.min()):,.0f}")
    print(f"  Max: £{np.exp(outcode_prices.max()):,.0f}")
    print(f"  Mean: £{np.exp(outcode_prices.mean()):,.0f}")
    print(f"  Median: £{np.exp(outcode_prices.median()):,.0f}")

    # Print location frequency statistics
    location_counts = pd.Series(encoder.state.location_counts)
    print("\nLocation frequency statistics:")
    print("-" * 40)
    print(f"Locations with fewer than {encoder.min_location_freq} samples: {(location_counts < encoder.min_location_freq).sum()}")
    print(f"Most common location: {location_counts.idxmax()} ({location_counts.max()} samples)")
    print(f"Median samples per location: {location_counts.median():.0f}")

    # Generate features before saving
    print("\n2. Creating features before saving...")
    features_before = encoder.create_production_features(production_data)

    # Print feature statistics
    print("\nFeature Statistics:")
    print("-" * 40)
    for col in ['outcode', 'postcode', 'location']:
        if col in features_before.columns:
            print(f"\n{col.title()} encoding statistics:")
            stats = features_before[col].describe()
            print(f"  Mean: {stats['mean']:.3f}")
            print(f"  Std: {stats['std']:.3f}")
            print(f"  Min: {stats['min']:.3f}")
            print(f"  Max: {stats['max']:.3f}")

    print("\n3. Saving encoder...")
    try:
        encoder.save(save_path)
        print(f"Saved encoder to {save_path}")
    except Exception as e:
        print(f"Error saving encoder: {e}")
        return False

    print("\n4. Loading encoder...")
    try:
        loaded_encoder = FeatureEncoder.load(save_path)
    except Exception as e:
        print(f"Error loading encoder: {e}")
        return False

    print("Creating features after loading...")
    features_after = loaded_encoder.create_production_features(production_data)

    print("\n5. Verifying results...")
    features_match = features_before.equals(features_after)
    print(f"Features match: {features_match}")

    if features_match:
        print("\nFeature columns:")
        for col in features_before.columns:
            print(f"- {col}")

        print("\nData statistics:")
        print(f"Training samples: {len(training_data)}")
        print(f"Production samples: {len(production_data)}")
        print(f"Features shape: {features_before.shape}")

        # Print sample correlations
        print("\nFeature correlations with target:")
        target = np.exp(production_data['log_price'])
        for col in ['outcode', 'postcode', 'location']:
            if col in features_before.columns:
                corr = features_before[col].corr(target)
                print(f"{col.title()}: {corr:.3f}")
    else:
        print("\nFeature differences:")
        diff_cols = []
        for col in features_before.columns:
            if not features_before[col].equals(features_after[col]):
                diff_cols.append(col)
        print(f"Columns with differences: {diff_cols}")

    print(f"\nEncoder saved successfully at {save_path}")
    return features_match

# Run the test
print("\n=== Testing FeatureEncoder Persistence ===")
success = test_encoder_persistence(df_with_outcode)
print(f"\nPersistence test {'passed' if success else 'failed'}!")
print("="*40 + "\n")


=== Testing FeatureEncoder Persistence ===
Testing FeatureEncoder persistence...

1. Training phase...

Encoding Statistics:
----------------------------------------
Number of unique Outcodes: 143
Number of unique Postcodes: 2351
Number of unique Locations: 444

Price Statistics (from encoded values):
----------------------------------------
Outcode price levels:
  Min: £325,000
  Max: £8,126,729
  Mean: £1,321,151
  Median: £1,215,566

Location frequency statistics:
----------------------------------------
Locations with fewer than 5 samples: 368
Most common location: putney (103 samples)
Median samples per location: 1

2. Creating features before saving...

Feature Statistics:
----------------------------------------

Outcode encoding statistics:
  Mean: 14.106
  Std: 0.454
  Min: 13.100
  Max: 15.911

Postcode encoding statistics:
  Mean: 14.105
  Std: 0.457
  Min: 13.082
  Max: 15.922

Location encoding statistics:
  Mean: 14.123
  Std: 0.453
  Min: 13.082
  Max: 15.911

3. Saving

### The CrossValidator: ATLAS's Experiment Runner

Remember how ATLAS is our mythological helper, carrying the heavy load of model comparison? Well, the CrossValidator is like its brain - planning experiments, running tests, and keeping track of what works best.

#### The Big Picture

Our CrossValidator handles three key responsibilities:
```
Raw Data → Train/Test Splitting → Model Training → Performance Tracking
```

And it does all of this:
- Across multiple data folds
- For different model types
- With various feature combinations
- While preventing common mistakes

## Understanding the Code

### 1. Model Management
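```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

class CrossValidator:
    def __init__(self, n_folds: int = 5):
        # One entry per model family; every model sees identical folds
        self.models = {
            'decision_tree': DecisionTreeRegressor(),
            'random_forest': RandomForestRegressor(n_estimators=100),
            'xgboost': XGBRegressor(n_estimators=100)
        }
```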

This setup gives us:
- A simple decision tree (our baseline)
- A random forest (for robustness)
- XGBoost (for high performance)

Each model is configured consistently, so the comparison is fair.

### 2. Experiment Design

The evaluation happens in two stages:

```
Stage 1: Cross-Validation
    ├── Split training data into 5 folds
    ├── Train on 4 folds, validate on 1
    ├── Repeat 5 times
    └── Average the results

Stage 2: Final Testing
    ├── Train on all training data
    ├── Test on held-out test set
    └── Compare with CV results
```
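
To make the two stages concrete, here's a library-free sketch. The hand-rolled `kfold_indices` function is a hypothetical stand-in for scikit-learn's `KFold` (which ATLAS actually uses), and the "model" just predicts the training mean, so treat this as an illustration of the flow rather than the real pipeline.

```python
import random
from statistics import mean
from typing import List, Tuple


def kfold_indices(n_samples: int, n_folds: int, seed: int) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once, then build (train, validation) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train_idx, val_idx))
    return splits


def mean_abs_error(y_true: List[float], prediction: float) -> float:
    """MAE of a constant prediction."""
    return mean(abs(y - prediction) for y in y_true)


# Toy targets: 20 training values and 5 held-out test values
y_train = [float(v) for v in range(20)]
y_test = [3.0, 8.0, 11.0, 15.0, 19.0]

# Stage 1: cross-validation on the training data only
cv_scores = []
for train_idx, val_idx in kfold_indices(len(y_train), n_folds=5, seed=42):
    prediction = mean(y_train[i] for i in train_idx)  # "fit" on 4 folds
    cv_scores.append(mean_abs_error([y_train[i] for i in val_idx], prediction))
cv_estimate = mean(cv_scores)

# Stage 2: refit on all training data, score once on the test set
final_prediction = mean(y_train)
test_score = mean_abs_error(y_test, final_prediction)

print(f"CV MAE: {cv_estimate:.2f}, test MAE: {test_score:.2f}")
```

If the CV estimate and the test score are far apart, that's usually a hint that something leaked or that the model is overfitting.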

### 3. Progress Tracking

We use nested progress bars:
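```python
from tqdm import tqdm

with tqdm(total=n_folds) as fold_pbar:                       # Outer bar: folds
    with tqdm(total=n_features * n_models) as feature_pbar:  # Inner bar: feature set x model
        feature_pbar.update(1)                               # Tick after each evaluation
    fold_pbar.update(1)                                      # Tick after each fold
```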

This shows:
- Overall progress through folds
- Detailed progress within each fold
- Estimated time remaining

### 4. Performance Metrics

We track four key metrics:

| Metric | Purpose | Calculation |
|--------|----------|------------|
| RMSE | Overall error magnitude | `sqrt(mean((y_true - y_pred)²))` |
| R² | Explained variance | `1 - (residual_var / total_var)` |
| MAE | Average error in pounds | `mean(abs(exp(y_true) - exp(y_pred)))` |
| % MAE | Relative error | `mean(abs((true_price - pred_price) / true_price))` |
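
Here's a rough sketch of how these four metrics could be computed from log-price targets. It assumes RMSE and R² are taken on the log scale (which is what the table's notation suggests) and that MAE and % MAE are taken after converting back to pounds. `regression_metrics` is a hypothetical helper written in plain Python, not the ATLAS implementation.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true_log: Sequence[float],
                       y_pred_log: Sequence[float]) -> Dict[str, float]:
    """Compute the four metrics from log-price targets and predictions."""
    n = len(y_true_log)
    residuals = [t - p for t, p in zip(y_true_log, y_pred_log)]

    # RMSE and R² on the log scale (assumption, see the note above)
    ss_res = sum(r ** 2 for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true_log) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true_log)
    r2 = 1 - ss_res / ss_tot

    # MAE and % MAE on the original price scale
    true_prices = [math.exp(t) for t in y_true_log]
    pred_prices = [math.exp(p) for p in y_pred_log]
    abs_errors = [abs(t - p) for t, p in zip(true_prices, pred_prices)]
    mae = sum(abs_errors) / n
    # Expressed as a fraction, as in the table; multiply by 100 for a percentage
    pct_mae = sum(e / t for e, t in zip(abs_errors, true_prices)) / n

    return {'rmse': rmse, 'r2': r2, 'mae': mae, 'pct_mae': pct_mae}


# Toy example: three houses, prices in pounds (made-up numbers)
true_log = [math.log(500_000), math.log(1_000_000), math.log(2_000_000)]
pred_log = [math.log(550_000), math.log(900_000), math.log(2_000_000)]
metrics = regression_metrics(true_log, pred_log)
print({name: round(value, 4) for name, value in metrics.items()})
```

Reporting MAE in pounds alongside a relative error is handy because a £50,000 miss means something very different on a £300,000 flat than on a £5,000,000 house.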

### 5. Results Collection

Results are stored in a structured format:
```python
{
    'fold': fold_idx,                    # Which fold (or 'final' for test)
    'feature_set': feature_set.name,     # What features were used
    'model': model_name,                 # Which model type
    'rmse': rmse_score,                  # Root Mean Squared Error
    'r2': r2_score,                      # R-squared value
    'mae': mean_absolute_error,          # Mean Absolute Error in pounds
    'pct_mae': percentage_error          # Percentage Mean Absolute Error
}
```
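
Once the records are collected, comparing combinations is mostly a group-and-average job. Below is a small sketch using only the standard library; in the notebook you'd more likely reach for a pandas `groupby`. The records and their scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical result records in the format shown above (numbers are made up)
results: List[dict] = [
    {'fold': 0, 'feature_set': 'area_bedrooms', 'model': 'decision_tree', 'rmse': 0.50},
    {'fold': 1, 'feature_set': 'area_bedrooms', 'model': 'decision_tree', 'rmse': 0.54},
    {'fold': 0, 'feature_set': 'area_bedrooms', 'model': 'xgboost', 'rmse': 0.40},
    {'fold': 1, 'feature_set': 'area_bedrooms', 'model': 'xgboost', 'rmse': 0.44},
    {'fold': 'final', 'feature_set': 'area_bedrooms', 'model': 'xgboost', 'rmse': 0.43},
]

# Group CV folds by (feature_set, model); the 'final' test rows stay out of the average
cv_scores: Dict[Tuple[str, str], List[float]] = defaultdict(list)
for row in results:
    if row['fold'] != 'final':
        cv_scores[(row['feature_set'], row['model'])].append(row['rmse'])

cv_summary = {key: mean(scores) for key, scores in cv_scores.items()}
best_combo = min(cv_summary, key=cv_summary.get)

# Print combinations from best to worst mean CV RMSE
for (feature_set, model), score in sorted(cv_summary.items(), key=lambda kv: kv[1]):
    print(f"{feature_set:<15} {model:<15} mean CV RMSE = {score:.3f}")
print(f"Best by CV: {best_combo}")
```

Keeping the `'final'` rows separate matters: model selection should rest on the CV averages, with the test-set score used only as a last sanity check.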

## Why This Design Works

1. **Consistency**: Each model gets the same treatment
2. **Fairness**: No data leakage between folds
3. **Completeness**: Multiple metrics for different needs
4. **Clarity**: Progress tracking and organised results

## Common Pitfalls Avoided

1. **Data Leakage**: Features are created independently for each fold
2. **Selection Bias**: Cross-validation provides robust estimates
3. **Overfitting**: Test set remains untouched until final evaluation
4. **Metric Misuse**: Multiple metrics give a more complete picture
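
The data leakage pitfall is the easiest one to fall into with target encoding, so here's a tiny illustration. The sketch uses plain Python and made-up numbers; `fit_target_means` and `encode` are hypothetical helpers that mimic the map-then-fill-with-global-mean pattern, not the ATLAS methods themselves.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (outcode, log_price)


def fit_target_means(rows: List[Row]) -> Tuple[Dict[str, float], float]:
    """Learn per-outcode mean log price plus a global fallback."""
    by_outcode: Dict[str, List[float]] = defaultdict(list)
    for outcode, log_price in rows:
        by_outcode[outcode].append(log_price)
    means = {code: mean(values) for code, values in by_outcode.items()}
    global_mean = mean(price for _, price in rows)
    return means, global_mean


def encode(rows: List[Row], means: Dict[str, float], global_mean: float) -> List[float]:
    """Map outcodes to learned means; unseen outcodes get the global mean."""
    return [means.get(outcode, global_mean) for outcode, _ in rows]


# Made-up fold: W8 appears only in the validation rows
fold_train = [('SW1', 15.0), ('SW1', 15.2), ('E1', 13.0), ('E1', 13.4)]
fold_val = [('SW1', 15.6), ('W8', 16.0)]

# Leak-free: statistics come from the training fold only
means, global_mean = fit_target_means(fold_train)
val_encoded = encode(fold_val, means, global_mean)

# Leaky (what to avoid): statistics computed with validation rows included
leaky_means, leaky_global = fit_target_means(fold_train + fold_val)
val_encoded_leaky = encode(fold_val, leaky_means, leaky_global)

print(val_encoded)        # W8 falls back to the training-fold global mean
print(val_encoded_leaky)  # W8's own price has leaked into its feature
```

In the leaky version the W8 feature is literally the answer, so any model would look brilliant on validation and then disappoint on new data.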

## Looking Forward

In the next section, we'll see ATLAS in action, running experiments and helping us understand what really drives house prices. But first, let's look at how to actually use this CrossValidator in practice.

Remember: The CrossValidator isn't just running models - it's conducting scientific experiments to help us understand what really works for predicting house prices.

```python
class CrossValidator:
    """Handles cross-validation and model evaluation"""

    def __init__(self, n_folds: int = 5, random_state: int = RANDOM_STATE):
        self.n_folds = n_folds
        self.random_state = random_state
        self.models = {
            'decision_tree': DecisionTreeRegressor(random_state=random_state),
            'random_forest': RandomForestRegressor(
                n_estimators=100,
                random_state=random_state
            ),
            'xgboost': XGBRegressor(
                n_estimators=100,
                random_state=random_state
            )
        }

    def evaluate_all_combinations(self,
                                train_data: pd.DataFrame,
                                test_data: pd.DataFrame) -> pd.DataFrame:
        """
        Evaluate all feature set and model combinations using:
        1. K-fold CV on training data
        2. Final evaluation on test set
        """
        results = []
        encoder = FeatureEncoder()

        # Calculate total iterations for progress tracking
        n_folds = self.n_folds
        n_models = len(self.models)

        # PART 1: K-FOLD CROSS VALIDATION ON TRAINING DATA ONLY
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=self.random_state)

        print("\nRunning cross-validation...")

        status_display = display('Starting fold 1...', display_id=True)

        # Create main progress bar for folds
        with tqdm(total=n_folds, desc="Folds") as fold_pbar:
            for fold_idx, (fold_train_idx, fold_val_idx) in enumerate(kf.split(train_data)):
                # Get this fold's train/val split
                fold_train = train_data.iloc[fold_train_idx].copy()
                fold_val = train_data.iloc[fold_val_idx].copy()

                # Mark as CV fold (for target encoding)
                fold_train['cv_fold'] = fold_idx
                fold_val['cv_fold'] = fold_idx

                # Create features for this fold
                feature_sets = encoder.create_fold_features(fold_train, fold_val)
                n_features = len(feature_sets)

                # Create nested progress bar for feature sets
                with tqdm(total=n_features * n_models,
                         desc=f"Fold {fold_idx + 1} Progress") as feature_pbar:

                    # Evaluate each feature set and model combination
                    for feature_set in feature_sets:
                        for model_name, model in self.models.items():
                            # Update status display
                            status_display.update(
                                f"Fold {fold_idx + 1}: {model_name} on {feature_set.name}"
                            )

                            model.fit(feature_set.X_train, feature_set.y_train)
                            fold_val_pred = model.predict(feature_set.X_val)

                            results.append({
                                'fold': fold_idx,
                                'feature_set': feature_set.name,
                                'description': feature_set.description,
                                'model': model_name,
                                'split_type': 'cv_fold',
                                'rmse': self._calculate_rmse(feature_set.y_val, fold_val_pred),
                                'r2': r2_score(feature_set.y_val, fold_val_pred),
                                'mae': mean_absolute_error(
                                    np.exp(feature_set.y_val),
                                    np.exp(fold_val_pred)
                                ),
                                'pct_mae': np.mean(np.abs(
                                    (np.exp(feature_set.y_val) - np.exp(fold_val_pred)) /
                                    np.exp(feature_set.y_val)
                                )) * 100,
                                'n_features': feature_set.X_train.shape[1]
                            })
                            feature_pbar.update(1)
                fold_pbar.update(1)

        # PART 2: FINAL EVALUATION ON TEST SET
        print("\nRunning final evaluation on test set...")
        status_display.update("Starting test set evaluation...")

        # Remove CV marking
        train_data = train_data.drop('cv_fold', axis=1, errors='ignore')

        # Create features using full training set and test set
        final_feature_sets = encoder.create_fold_features(train_data, test_data)

        # Create progress bar for final evaluation
        with tqdm(total=len(final_feature_sets) * len(self.models),
                 desc="Test Set Evaluation") as test_pbar:

            for feature_set in final_feature_sets:
                for model_name, model in self.models.items():
                    # Update status display
                    status_display.update(
                        f"Test Set: {model_name} on {feature_set.name}"
                    )

                    # Train on full training data
                    model.fit(feature_set.X_train, feature_set.y_train)
                    test_pred = model.predict(feature_set.X_val)

                    results.append({
                        'fold': 'final',
                        'feature_set': feature_set.name,
                        'description': feature_set.description,
                        'model': model_name,
                        'split_type': 'test',
                        'rmse': self._calculate_rmse(feature_set.y_val, test_pred),
                        'r2': r2_score(feature_set.y_val, test_pred),
                        'mae': mean_absolute_error(
                            np.exp(feature_set.y_val),
                            np.exp(test_pred)
                        ),
                        'pct_mae': np.mean(np.abs(
                            (np.exp(feature_set.y_val) - np.exp(test_pred)) /
                            np.exp(feature_set.y_val)
                        )) * 100,
                        'n_features': feature_set.X_train.shape[1]
                    })
                    test_pbar.update(1)

        return pd.DataFrame(results)

    def _calculate_rmse(self,
                       y_true: pd.Series,
                       y_pred: np.ndarray) -> float:
        """
        Calculate Root Mean Squared Error
        """
        return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

## Running ATLAS: The Pipeline in Action

ATLAS processes our house price data through a straightforward sequence:

```text
Raw Data → PreProcessor → Train/Test Split → CrossValidator → Results
```

### Running the Pipeline
```python
preprocessor = PreProcessor()
df_processed = preprocessor.prepare_pre_split_features(df_with_outcode)
train_data, test_data = preprocessor.create_train_test_split(df_processed)
validator = CrossValidator()
results = validator.evaluate_all_combinations(train_data, test_data)
```

For each model and feature combination, ATLAS:
1. Splits data appropriately
2. Creates features inside each fold
3. Trains and evaluates models
4. Collects performance metrics

The progress bars show:
- Overall completion (outer bar)
- Current fold progress (inner bar)
- Estimated time remaining

Each experiment's results include:
- Model type and feature set
- Cross-validation scores
- Final test set performance
- Performance metrics (RMSE, R², MAE, %MAE)
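
As a rough sketch of how these four metrics relate, the CrossValidator code applies `np.exp` before computing MAE, which suggests the target is log price. Under that assumption, RMSE and R² live on the log scale while MAE and %MAE are reported in pounds. The helper name `evaluate_log_predictions` and the example prices below are made up for illustration.

```python
import numpy as np


def evaluate_log_predictions(y_true_log, y_pred_log) -> dict:
    """Compute the four ATLAS-style metrics for log-price predictions."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)

    # RMSE and R² are computed directly on the log scale
    residuals = y_true_log - y_pred_log
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true_log - y_true_log.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # MAE and %MAE are computed in pounds after undoing the log transform
    y_true = np.exp(y_true_log)
    y_pred = np.exp(y_pred_log)
    abs_err = np.abs(y_true - y_pred)
    mae = float(abs_err.mean())
    pct_mae = float(np.mean(abs_err / y_true) * 100)

    return {"rmse": rmse, "r2": r2, "mae": mae, "pct_mae": pct_mae}


# Three made-up properties: two predictions are 10% off, one is exact
actual = np.log([500_000, 1_000_000, 2_000_000])
predicted = np.log([550_000, 900_000, 2_000_000])

metrics = evaluate_log_predictions(actual, predicted)
print(f"MAE: £{metrics['mae']:,.0f}, %MAE: {metrics['pct_mae']:.2f}%")
```

Because RMSE is on the log scale, a value such as 0.45 is not an error in pounds. MAE and %MAE are the metrics to quote when talking about money.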

This systematic approach lets us focus on interpreting results rather than managing experiments. Let's examine what ATLAS discovered about house price prediction.

```python
def ATLAS_pipeline(df_with_outcode: pd.DataFrame) -> pd.DataFrame:
    """Run complete pipeline from raw data to model comparison"""

    preprocessor = PreProcessor()

    # Create pre-split features
    df_processed = preprocessor.prepare_pre_split_features(df_with_outcode)

    # Create initial train/test split
    train_data, test_data = preprocessor.create_train_test_split(df_processed)

    # Run cross-validation evaluation
    validator = CrossValidator()
    results = validator.evaluate_all_combinations(train_data, test_data)

    return results

# Run the full pipeline (the printed time is a rough estimate, nothing is timed here)
print("Running ATLAS pipeline (estimated time: 2 minutes)...")

results = ATLAS_pipeline(df_with_outcode)


# Display key information about the results DataFrame
print("\nResults DataFrame Info:")
print(f"Shape: {results.shape}")
print("\nFirst few rows:")
print(results.head())

def display_results(results: pd.DataFrame) -> None:
    """
    Display model performance summary with cross-validation and test results.

    Args:
        results: DataFrame containing model evaluation results with columns:
                feature_set, model, split_type, r2, rmse, mae, pct_mae, description
    """
    print("\nModel Performance Summary:")
    print("-" * 170)

    # Print header
    header = "Features - Model".ljust(100) + " "
    header += "CV R²".ljust(15)
    header += "CV RMSE".ljust(15)
    header += "CV MAE (£)".ljust(20)
    header += "CV %Error".ljust(20)
    print(header)
    print("-" * 170)

    for (feature_set, model), group in results.groupby(['feature_set', 'model']):
        cv_results = group[group['split_type'] == 'cv_fold']
        test_results = group[group['split_type'] == 'test'].iloc[0]

        # Create feature_model string using description
        feature_model = f"{test_results['description']} - {model}"

        # Print CV results
        cv_line = feature_model.ljust(100) + " "
        cv_line += f"{cv_results['r2'].mean():.3f} ±{cv_results['r2'].std():.3f}".ljust(15)
        cv_line += f"{cv_results['rmse'].mean():.3f} ±{cv_results['rmse'].std():.3f}".ljust(15)
        cv_line += f"£{cv_results['mae'].mean():,.0f} ±{cv_results['mae'].std():,.0f}".ljust(20)
        cv_line += f"{cv_results['pct_mae'].mean():.1f} ±{cv_results['pct_mae'].std():.1f}%"
        print(cv_line)

        # Print test results (indented)
        test_line = "→ Test Results".ljust(100) + " "
        test_line += f"{test_results['r2']:.3f}".ljust(15)
        test_line += f"{test_results['rmse']:.3f}".ljust(15)
        test_line += f"£{test_results['mae']:,.0f}".ljust(20)
        test_line += f"{test_results['pct_mae']:.1f}%"
        print(test_line)

# Usage:
display_results(results)
```

```text
Running ATLAS pipeline (estimated time: 2 minutes)...

Running cross-validation...

'Test Set: xgboost on all_features'

Folds:   0%|          | 0/5 [00:00<?, ?it/s]
Fold 1 Progress:   0%|          | 0/45 [00:00<?, ?it/s]
Fold 2 Progress:   0%|          | 0/45 [00:00<?, ?it/s]
Fold 3 Progress:   0%|          | 0/45 [00:00<?, ?it/s]
Fold 4 Progress:   0%|          | 0/45 [00:00<?, ?it/s]
Fold 5 Progress:   0%|          | 0/45 [00:00<?, ?it/s]

Running final evaluation on test set...

Test Set Evaluation:   0%|          | 0/45 [00:00<?, ?it/s]

Results DataFrame Info:
Shape: (270, 10)

First few rows:
  fold          feature_set                                 description  \
0    0        area_bedrooms              Area in sq ft, No. of Bedrooms
1    0        area_bedrooms              Area in sq ft, No. of Bedrooms
2    0        area_bedrooms              Area in sq ft, No. of Bedrooms
3    0  area_bedrooms_house  Area in sq ft, No. of Bedrooms, House Type
4    0  area_bedrooms_house  Area in sq ft, No. of Bedrooms, House Type

           model split_type  rmse   r2        mae  pct_mae  n_features
0  decision_tree    cv_fold  0.53 0.53 903,982.30    42.14           2
1  random_forest    cv_fold  0.45 0.66 741,766.07    35.05           2
2        xgboost    cv_fold  0.45 0.66 764,810.56    33.97           2
3  decision_tree    cv_fold  0.52 0.55 883,891.13    41.00          10
4  random_forest    cv_fold  0.43 0.69 697,184.76    32.92          10

Model Performance Summary:
------------------------
```
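
Beyond the printed summary, the results DataFrame can be ranked directly. The sketch below assumes the column names shown in the output above (`feature_set`, `model`, `split_type`, `r2`). The helper `rank_by_cv` and the toy rows are hypothetical, and the numbers are invented for the example rather than taken from the real run.

```python
import pandas as pd


def rank_by_cv(results: pd.DataFrame, metric: str = "r2") -> pd.DataFrame:
    """Rank feature-set/model pairs by mean CV score (higher is better)."""
    cv = results[results["split_type"] == "cv_fold"]
    summary = (
        cv.groupby(["feature_set", "model"])[metric]
        .agg(cv_mean="mean", cv_std="std")
        .reset_index()
    )

    # Attach the single test-set score for each combination
    test = results.loc[
        results["split_type"] == "test", ["feature_set", "model", metric]
    ].rename(columns={metric: "test"})

    merged = summary.merge(test, on=["feature_set", "model"], how="left")
    return merged.sort_values("cv_mean", ascending=False).reset_index(drop=True)


# Toy results with two CV folds per model (illustrative values only)
toy_results = pd.DataFrame({
    "feature_set": ["area_bedrooms"] * 6,
    "model": ["decision_tree", "decision_tree",
              "random_forest", "random_forest",
              "decision_tree", "random_forest"],
    "split_type": ["cv_fold"] * 4 + ["test"] * 2,
    "r2": [0.50, 0.54, 0.64, 0.68, 0.53, 0.66],
})

ranked = rank_by_cv(toy_results)
print(ranked)
```

For error metrics such as `rmse` or `mae`, where lower is better, the sort order would need to be reversed.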

## Unveiling the Drivers of London House Prices: Lessons from ATLAS

Our exploration of London house price prediction, guided by the ATLAS pipeline, has yielded a wealth of insights. By systematically comparing a diverse set of models and feature combinations, we've gained a nuanced understanding of the key drivers of property value in this complex market. This analysis will delve into the performance of our top models, the impact of including price information, the effectiveness of our feature engineering techniques, and the implications of our findings for both model deployment and further research.

### Top Performing Models: A Closer Look

Among the myriad of models evaluated, three have distinguished themselves with their exceptional performance:

| Model                                                                          | CV R²         | Test R²      | CV MAE (£)       | Test MAE (£)  |
|-------------------------------------------------------------------------------|---------------|--------------|------------------|---------------|
| Random Forest with All Features                                               | 0.903 ±0.011  | 0.912        | £368,595 ±49,041 | £387,417      |
| XGBoost with All Features                                                     | 0.899 ±0.009  | 0.913        | £379,417 ±55,543 | £397,701      |
| Random Forest with Area, Bedrooms, House Type, Outcode (Target), and Price/sqft | 0.902 ±0.012 | 0.907        | £372,747 ±48,481 | £416,147      |

These models achieve high accuracy, with R² above 0.90 in both cross-validation and on the unseen test set. Note that R², like RMSE, is computed on the log-transformed price, so it measures the share of variance explained in log price rather than in pounds. Their mean absolute errors (MAE), converted back to pounds, range from around £370,000 to £420,000, which, while substantial in absolute terms, are quite reasonable given the high prices and wide price range in the London market.

The strong and consistent performance of these models across both cross-validation and the test set is a testament to the robustness of our modeling approach. It suggests that these models have successfully captured the underlying patterns and relationships in the data, rather than simply memorising noise or idiosyncrasies of the training set.

### The Price Information Paradox

One of the most striking findings from our experiments is the significant impact of including price-derived features, such as the average price per square foot at the outcode level. Models that incorporate this information consistently outperform those that don't, with improvements in MAE ranging from £40,000 to £50,000.

This improvement in accuracy is substantial and underscores the importance of considering current market conditions in property valuation. By providing the models with information about prevailing price levels in different areas, we enable them to make more context-aware predictions.

However, this boost in performance comes with an important caveat. By including current price information, our models risk amplifying feedback loops in the housing market. If such models were to be widely adopted and used to inform pricing decisions, they could potentially exacerbate both upward and downward price trends. In a rising market, the models would predict higher prices, which could in turn drive actual prices higher if used to set asking prices or guide bidding. Conversely, in a falling market, the models could contribute to a downward spiral.

This is a well-known challenge in the real estate industry, and one that major players like Zoopla and Rightmove actively monitor and manage. It highlights the importance of considering not just the accuracy of our models, but also their potential impact on the market they seek to predict.

### Models without Price Information: A Fundamental Perspective

Given the potential risks associated with including current price information, it's worth examining the performance of models that rely solely on fundamental property characteristics and location.

Among these models, several stand out:

| Model                                                                            | CV R²         | Test R²      | CV MAE (£)       | Test MAE (£)  |
|--------------------------------------------------------------------------------|---------------|--------------|------------------|---------------|
| XGBoost with Area, Bedrooms, and Outcode (One-hot)                              | 0.881 ±0.013  | 0.899        | £418,617 ±58,112 | £432,491      |
| Random Forest with Area, Bedrooms, Location (Target)                            | 0.822 ±0.031  | 0.815        | £506,152 ±84,254 | £593,446      |
| Random Forest with Area, Bedrooms, House Type, City/County, Outcode (Target)    | 0.855 ±0.023  | 0.887        | £447,107 ±54,538 | £449,116      |

While these models don't quite match the accuracy of those including price information, they still achieve impressive performance. With R² scores mostly in the 0.80 to 0.90 range and MAEs around £400,000 to £600,000, they demonstrate that a substantial portion of a property's value can be explained by its intrinsic characteristics and location.

These models provide a valuable perspective on the fundamental drivers of house prices, independent of current market conditions. They can help identify areas or property types that may be over- or under-valued relative to their inherent attributes. In practice, such models could be used in conjunction with price-aware models to provide a more comprehensive view of a property's value.

### The Importance of Validation: Ensuring Reliability

A crucial aspect of our modeling process that deserves highlighting is the rigor of our validation strategy. By employing shuffled k-fold cross-validation, we get performance estimates that are more reliable than a single split can offer.

K-fold cross-validation splits the training data into k folds (in our case, 5). The CrossValidator uses scikit-learn's `KFold` with `shuffle=True` and a fixed random seed, so rows are assigned to folds at random but reproducibly; the folds are not stratified by price. The model is then trained k times, each time using k-1 folds for training and the remaining fold for validation, and the performance is averaged across all k validation folds.

This approach has several advantages over a simple train-test split:
1. It provides a more robust estimate of model performance, as it averages over k different train-validation splits rather than relying on a single split.
2. The spread of scores across folds (the ± values in the tables above) shows how sensitive each model is to the particular split, which makes lucky or unlucky splits easier to spot.
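
A minimal sketch of the fold mechanics, using the same `KFold(n_splits=5, shuffle=True, ...)` call as the CrossValidator class. The ten-row array and the seed value 42 are placeholders for illustration (the notebook's actual `RANDOM_STATE` is defined elsewhere).

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder rows standing in for the training data
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

val_indices = []
for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Within a fold, training and validation rows never overlap
    assert set(train_idx).isdisjoint(val_idx)
    val_indices.extend(val_idx.tolist())
    print(f"Fold {fold_idx + 1}: train={len(train_idx)} rows, "
          f"val={len(val_idx)} rows")

# Across all folds, every row is used for validation exactly once
print(sorted(val_indices) == list(range(10)))  # True
```

Each row serves as validation data once and as training data four times, which is why the averaged score uses all of the training data without ever scoring a model on rows it was fitted to.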

Moreover, by maintaining a strict separation between our training and validation data within each fold, and between all the training folds and the final test set, we avoid the pitfalls of data leakage and overfitting. Data leakage occurs when information from the validation or test sets inadvertently leaks into the model training process, leading to overly optimistic performance estimates. Overfitting happens when a model learns to fit the noise or peculiarities of the training data, rather than the underlying patterns, leading to poor generalisation to new data.

Our models' strong and consistent performance across the cross-validation folds and on the unseen test set demonstrates that they have successfully learned genuine patterns in the data and can generalise well to new, unseen properties. This is crucial for real-world application, where the model will be applied to properties it has never seen before.

### Feature Engineering: The Art of Extracting Signal from Noise

Another key lesson from our analysis is the importance and nuance of feature engineering, particularly when dealing with geographical data.

Our geographical features presented a hierarchy of granularity:
- Outcode (e.g., "SW1")
- Postcode (e.g., "SW1A 1AA")
- Location (e.g., "Westminster")

Each level provided a different trade-off between specificity and data sparsity. While more granular levels (like location) can potentially provide more specific information, they also suffer from data sparsity, with many locations having very few or even just one property.

Our solution was a hierarchical target encoding scheme. For each level, we calculated the mean price in the training data. Then, when encoding a particular property, if the specific level (e.g., postcode) had sufficient data, we used its mean price. If not, we fell back to the mean price of the next higher level (e.g., outcode). This way, we extracted as much specific information as the data allowed, while still providing a robust fallback for sparse levels.
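
The fallback logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `FeatureEncoder` implementation: the function name `hierarchical_target_encode`, the `min_count` threshold, the column names and the toy prices are all hypothetical.

```python
import pandas as pd


def hierarchical_target_encode(train: pd.DataFrame,
                               to_encode: pd.DataFrame,
                               levels: list,
                               target: str,
                               min_count: int = 3) -> pd.Series:
    """Encode rows with the mean target of the most specific level
    that has at least `min_count` training rows.

    `levels` must be ordered from most specific to most general.
    """
    global_mean = train[target].mean()
    encoded = pd.Series(float("nan"), index=to_encode.index, dtype=float)

    for level in levels:
        stats = train.groupby(level)[target].agg(["mean", "count"])
        # Keep only groups with enough training rows to trust the mean
        reliable = stats.loc[stats["count"] >= min_count, "mean"]
        candidate = to_encode[level].map(reliable)
        # Fill only rows that no more specific level has already encoded
        encoded = encoded.fillna(candidate)

    # Final fallback: the overall training mean
    return encoded.fillna(global_mean)


# Toy training data (made-up codes and prices)
train = pd.DataFrame({
    "Postcode": ["A1", "A1", "A2", "B1", "B2"],
    "Outcode": ["A", "A", "A", "B", "B"],
    "Price": [100.0, 200.0, 600.0, 1000.0, 2000.0],
})
new_rows = pd.DataFrame({
    "Postcode": ["A1", "A2", "C9"],
    "Outcode": ["A", "A", "C"],
})

encoded = hierarchical_target_encode(
    train, new_rows, levels=["Postcode", "Outcode"],
    target="Price", min_count=2,
)
print(encoded.tolist())  # [150.0, 300.0, 780.0]
```

The first row uses its postcode mean, the second falls back to the outcode mean because its postcode has only one training row, and the third falls back to the global mean because neither level was seen in training. To avoid leakage, `train` should contain only the training rows of the current fold.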

This encoding scheme proved very effective, with models using these features achieving strong performance. It demonstrates that, with careful engineering, geographical information can be a powerful predictor of house prices, even without resorting to complex geospatial techniques.

Beyond geographical features, our experiments also highlighted the predictive power of even simple property attributes like area, number of bedrooms, and property type. Models using just these features achieved respectable performance, forming a strong baseline upon which more complex models could improve.

### Ethical Considerations, Human Impact, and Future Directions

As we marvel at the predictive power of our models, it's crucial that we also pause to consider the ethical implications of our work. Housing is not just a financial asset, but a fundamental human need. The prices predicted by our models have real consequences for real people - they can determine whether a family can afford to buy their dream home, whether a pensioner can comfortably retire, or whether a young professional can afford to live near their work.

With this in mind, we have a responsibility to ensure that our models are not just accurate, but also fair and unbiased. We must be vigilant to potential sources of bias in our data and algorithms, and work to mitigate them. For example, if our training data under-represents certain areas or demographic groups, our models may learn to undervalue these properties, perpetuating or even amplifying existing inequalities.

Moreover, we must consider the potential unintended consequences of our models' usage. If used improperly, such as to guide predatory pricing practices or to justify rent hikes, our models could harm the very people they're meant to serve. It's our responsibility to ensure that our models are used ethically and for the benefit of all stakeholders.

On a more positive note, our models also have the potential to empower individuals and promote transparency in the housing market. By providing accurate and unbiased valuations, they can help buyers and sellers make informed decisions, reducing information asymmetries and the potential for exploitation. They can also help policymakers and urban planners better understand the dynamics of the housing market, informing policies that promote affordability and social equity.

Our journey into London house price prediction has been one of technical exploration, but also one of growing awareness of the human implications of our work. We've seen the power of machine learning to uncover complex patterns and dynamics in the housing market, but also the potential pitfalls and ethical considerations that come with this power.

As we look to the future, several exciting directions beckon:

1. **Ensemble Methods**: Given the strong performance of multiple models, combining their predictions through techniques like stacking or blending could potentially yield even greater accuracy and robustness.

2. **Advanced Feature Engineering**: While our current features have proven effective, there's always room for refinement. Techniques like feature interaction, clustering, or dimensionality reduction could uncover additional predictive signals.

3. **Temporal Dynamics**: Our current models provide a static snapshot of the market. Incorporating temporal features like price trends, economic indicators, or seasonal effects could enable more dynamic and forward-looking predictions.

4. **Model Interpretability**: As powerful as our models are, their complexity can hinder interpretation. Techniques like feature importance analysis, partial dependence plots, or SHAP values could help shed light on how the models make their predictions, increasing transparency and trust.

5. **Application and Deployment**: Finally, the true test of our models will be in their application to real-world pricing decisions. This will require not just technical excellence, but also close collaboration with domain experts to ensure the models are used appropriately and responsibly.

As we embark on these future directions, let us proceed with a commitment to not just technical excellence, but also to social responsibility. Let us strive to build models that are not just accurate, but also fair, transparent, and beneficial to all. Let us engage closely with the communities impacted by our work, learning from their perspectives and ensuring that our models serve their needs.

In doing so, we have the potential to not just predict house prices, but to contribute to a housing market that is more efficient, more equitable, and more responsive to the needs of its participants. This is the ultimate promise and challenge of our work - to use the power of data and algorithms to build a better, fairer world for all.

As we conclude this phase of our journey, let us do so with gratitude for the insights gained, with humility in the face of the challenges ahead, and with hope for the positive impact we can make. Ultimately, the true measure of our success will not be just the accuracy of our predictions, but the positive impact we have on the lives of those touched by the housing market. The path forward is not always clear or easy, but with ATLAS as our guide and our values as our compass, we are confident that we will navigate it successfully, one home at a time.

### Thanks for Learning!

This notebook is part of the Supervised Machine Learning from First Principles series.

© 2025 Powell-Clark Limited. Licensed under Apache License 2.0.

If you found this helpful, please cite as:
```
Powell-Clark (2025). Supervised Machine Learning from First Principles.
GitHub: https://github.com/powell-clark/supervised-machine-learning
```

Questions or feedback? Contact emmanuel@powellclark.com