# PyCaret Baseline Models for Two Datasets

This notebook trains **baseline regression models** for two datasets using **PyCaret**.

Use it for your own project datasets; the defaults point at UK housing prices and UK electricity demand.

## Instructions
1. Edit the file paths and target column names in the **Configuration** cell.
2. Run the notebook top to bottom.
3. It will:
   - Load both `.parquet` files
   - Run PyCaret's `setup` and `compare_models`
   - Tune the best model
   - Save the final model for each dataset
   - Save a CSV file with all metrics for later use in your report


In [1]:
import sys
print(sys.executable)  # just to be sure

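# Install PyCaret (and XGBoost) into THIS interpreter
!"{sys.executable}" -m pip install pycaret xgboost

# Do not pin pycaret==2.3.10 or scikit-learn==0.23.2 here: on Python 3.11 pip
# falls back to source builds (scipy<=1.5.4, scikit-learn 0.23.2) that fail,
# as the build errors in this cell's output show, and the uninstall step
# would remove the working PyCaret install.
# xgboost is added because create_model('xgboost') reports the estimator as
# unavailable in the training cell, most likely because the package is missing.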


c:\Users\User\AppData\Local\Programs\Python\Python311\python.exe



[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting pycaret==2.3.10
  Using cached pycaret-2.3.10-py3-none-any.whl.metadata (12 kB)
Collecting scipy<=1.5.4 (from pycaret==2.3.10)
  Using cached scipy-1.5.4.tar.gz (25.2 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [73 lines of output]
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.8" and platform_system == "AIX"' don't match your environment
      Collecting wheel
        Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
      Collecting setuptools
        Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)
      Collecting Cython>=0.29.18
        Using cached cython-3.2.1-cp311-cp3

Collecting scikit-learn==0.23.2
  Using cached scikit-learn-0.23.2.tar.gz (7.2 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [116 lines of output]
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX" and platform_python_implementation == "CPython"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX" and platform_python_implementation != "CPython"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.8" and platform_system == "AIX"' don't match your environment
      Collecting setuptools
        Using cached setuptools
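
Since the install step can leave the environment in a partly working state, it may help to confirm what the running interpreter can actually import before moving on. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and the package list is an assumption based on the estimators this notebook touches.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report the interpreter version and any packages that still need installing.
print("Python:", sys.version.split()[0])
print("Missing:", missing_packages(["pycaret", "xgboost", "lightgbm"]))
```

An empty list means every package was found; anything listed should be installed into the interpreter printed by `sys.executable`.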

In [2]:
from pycaret.regression import (
    setup,
    compare_models,
    tune_model,
    finalize_model,
    save_model,
    load_model,
    predict_model,
    pull,
    create_model,
)

import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype, is_bool_dtype

import joblib
from pathlib import Path



In [3]:
# ==============================
# Configuration: EDIT THIS CELL
# ==============================

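# We use two cleaned parquet files:
# 1) part-0.parquet                  -> housing dataset
# 2) electricity_all_cleaned.parquet -> electricity demand dataset
#
# IMPORTANT:
#   - Set the 'file' and 'target' values in each entry of `datasets` below.
#     'target' must be the **exact** name of the column you want to predict.
#   - Keep 'name' as 'housing' / 'electricity': the training cell selects
#     its feature engineering by that name.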

datasets = [
    {
        'name': 'housing',
        'file': '../data/housing/part-0.parquet',                 # <-- housing parquet filename
        'target': 'price'         # <-- change to housing target column
    },
    {
        'name': 'electricity',
        'file': '../data/electricity/intermediate/electricity_all_cleaned.parquet',  # <-- electricity parquet filename
        'target': 'demand_mw'       # <-- change to electricity target column
    }
]

random_seed = 123  # for reproducibility
train_size = 0.8   # 80% train, 20% test
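
A quick sanity check of the configuration can catch a mistyped path or a missing key before the long training cell starts. This is a minimal sketch under the assumption that each entry needs exactly the `name`, `file` and `target` keys used above; `check_datasets` and the `example` entry are hypothetical and not part of the notebook.

```python
from pathlib import Path

REQUIRED_KEYS = {"name", "file", "target"}


def check_datasets(datasets, train_size):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if not 0.0 < train_size < 1.0:
        problems.append(f"train_size must be between 0 and 1, got {train_size}")
    seen = set()
    for i, ds in enumerate(datasets):
        absent = REQUIRED_KEYS - set(ds)
        if absent:
            problems.append(f"entry {i} is missing keys: {sorted(absent)}")
            continue
        if ds["name"] in seen:
            problems.append(f"duplicate dataset name: {ds['name']}")
        seen.add(ds["name"])
        if not Path(ds["file"]).is_file():
            problems.append(f"{ds['name']}: file not found: {ds['file']}")
    return problems


# Hypothetical entry, used only to show the call.
example = [{"name": "housing", "file": "missing.parquet", "target": "price"}]
for problem in check_datasets(example, train_size=0.8):
    print(problem)
```

In the notebook itself the call would be `check_datasets(datasets, train_size)`. The check does not open the parquet files, so a wrong target column name would still only surface at load time.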


In [4]:
# Quick preview of both datasets (first 5 rows each)
for ds in datasets:
    print(f"\n=== Preview: {ds['name']} ({ds['file']}) ===")
    df = pd.read_parquet(ds['file'])
    display(df.head())
    print('Columns:', list(df.columns))


=== Preview: housing (../data/housing/part-0.parquet) ===


Unnamed: 0,transaction_unique_identifier,price,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build
0,{2F772747-249E-3534-E050-A8C0630513CD},146500.0,2016-03-22,T,N,F,DEREHAM,BRECKLAND,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
1,{2F772747-249F-3534-E050-A8C0630513CD},150000.0,2016-03-24,S,N,F,NORWICH,NORWICH,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
2,{2F772747-24A0-3534-E050-A8C0630513CD},425000.0,2016-03-24,T,N,F,NORWICH,NORWICH,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
3,{2F772747-26BB-3534-E050-A8C0630513CD},460000.0,2016-03-21,D,N,F,REDHILL,TANDRIDGE,SURREY,A,A,2016.0,3.0,SURREY,0.0
4,{2F772747-26BC-3534-E050-A8C0630513CD},400000.0,2016-03-04,S,N,F,GUILDFORD,WAVERLEY,SURREY,A,A,2016.0,3.0,SURREY,0.0


Columns: ['transaction_unique_identifier', 'price', 'date_of_transfer', 'property_type', 'oldnew', 'duration', 'towncity', 'district', 'county', 'ppdcategory_type', 'record_status__monthly_file_only', 'year', 'month', 'region', 'is_new_build']

=== Preview: electricity (../data/electricity/intermediate/electricity_all_cleaned.parquet) ===


Unnamed: 0,ts,demand_mw
0,2001-01-01 00:00:00,38631.0
1,2001-01-01 00:30:00,39808.0
2,2001-01-01 01:00:00,40039.0
3,2001-01-01 01:30:00,39339.0
4,2001-01-01 02:00:00,38295.0


Columns: ['ts', 'demand_mw']
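
The electricity file holds only `ts` and `demand_mw`, so the training cell derives its calendar features from the timestamp. The sketch below isolates that derivation so it can be tried on a few rows; `add_calendar_features` is a hypothetical helper name, the first two timestamps come from the preview above, and the third is an invented Saturday added only to show the weekend flag.

```python
import pandas as pd


def add_calendar_features(frame, ts_col="ts"):
    """Return a copy of `frame` with year, month, day, hour and is_weekend columns."""
    out = frame.copy()
    ts = pd.to_datetime(out[ts_col])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    out["is_weekend"] = (ts.dt.weekday >= 5).astype(int)
    return out


sample = pd.DataFrame(
    {
        "ts": [
            "2001-01-01 00:00:00",
            "2001-01-01 00:30:00",
            "2001-01-06 13:30:00",  # hypothetical row, not taken from the dataset
        ]
    }
)
print(add_calendar_features(sample))
```

Because the data is half-hourly, the two readings within each hour share identical feature values when only `hour` is kept. Adding a minute or half-hour feature might be worth testing.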


In [5]:
for ds in datasets:
    print(f"\n=== {ds['name']} ===")
    print("Path:", ds['file'])
    df = pd.read_parquet(ds['file'])
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))


=== housing ===
Path: ../data/housing/part-0.parquet
Shape: (100420, 15)
Columns: ['transaction_unique_identifier', 'price', 'date_of_transfer', 'property_type', 'oldnew', 'duration', 'towncity', 'district', 'county', 'ppdcategory_type', 'record_status__monthly_file_only', 'year', 'month', 'region', 'is_new_build']

=== electricity ===
Path: ../data/electricity/intermediate/electricity_all_cleaned.parquet
Shape: (434928, 2)
Columns: ['ts', 'demand_mw']


In [None]:
# Train PyCaret + XGBoost models for both datasets
# and save them into backend/app/models/ with consistent feature sets.

all_results = []

# Ensure output directory exists
models_dir = Path("backend/app/models")
models_dir.mkdir(parents=True, exist_ok=True)

for ds in datasets:
    print(f"\n============================")
    print(f"Working on dataset: {ds['name']}")
    print(f"File: {ds['file']}")
    print(f"Target: {ds['target']}")
    print(f"============================\n")

    # 1) Load raw data
    df_raw = pd.read_parquet(ds['file'])
    print("Loaded df_raw with shape:", df_raw.shape)
    print("Original dtypes:\n", df_raw.dtypes)

    # 2) Basic dtype cleaning to avoid pd.NA issues
    df = df_raw.copy()
    for col in df.columns:
        if is_string_dtype(df[col]):
            df[col] = df[col].astype("object")
    for col in df.columns:
        if is_bool_dtype(df[col]):
            df[col] = df[col].astype("float64")
    df = df.replace({pd.NA: np.nan})

    # 3) Feature engineering to enforce required inputs
    if ds['name'] == 'housing':
        # expected final features:
        # region, property_type, tenure, year, month, is_new_build
        # map tenure from 'duration' if needed
        if 'tenure' not in df.columns and 'duration' in df.columns:
            df['tenure'] = df['duration']

        required_features = ['region', 'property_type', 'tenure', 'year', 'month', 'is_new_build']

    elif ds['name'] == 'electricity':
        # expected final features:
        # year, month, day, hour, is_weekend
        # derive from timestamp column (assume 'ts')
        if 'ts' in df.columns:
            df['ts'] = pd.to_datetime(df['ts'])
            df['year'] = df['ts'].dt.year
            df['month'] = df['ts'].dt.month
            df['day'] = df['ts'].dt.day
            df['hour'] = df['ts'].dt.hour
            df['is_weekend'] = (df['ts'].dt.weekday >= 5).astype(int)

        required_features = ['year', 'month', 'day', 'hour', 'is_weekend']

    else:
        raise ValueError(f"Unknown dataset name: {ds['name']}")

    # Ensure all required features exist
    missing = [c for c in required_features if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required features for {ds['name']}: {missing}")

    # Build modeling dataframe with just features + target
    model_df = df[required_features + [ds['target']]].copy()
    print("Model df shape:", model_df.shape)
    print("Model df columns:", list(model_df.columns))

    # 4) PyCaret setup on this reduced dataframe
    exp = setup(
        data=model_df,
        target=ds['target'],
        session_id=random_seed,
        train_size=train_size,
        verbose=False
    )


    # 5) Compare models and keep the top 3 (best first)
    best_models = compare_models(n_select=3)
    compare_results = pull().copy()
    compare_results['dataset'] = ds['name']
    compare_results['stage'] = 'baseline_compare'
    all_results.append(compare_results)

    print('\nTop 3 models for', ds['name'])
    display(compare_results.head(3))

    # 6) Tune the best overall model
    best_model = best_models[0]
    tuned_best = tune_model(best_model)
    tuned_results = pull().copy()
    tuned_results['dataset'] = ds['name']
    tuned_results['stage'] = 'tuned_best'
    all_results.append(tuned_results)

    print('\nTuned best overall model for', ds['name'])
    display(tuned_results)

    # 7) Finalize and save tuned PyCaret model (full pipeline)
    final_best = finalize_model(tuned_best)
    pycaret_name = f"{ds['name']}_pycaret_best"
    # PyCaret's save_model will create e.g. '{name}.pkl' in the current folder
    save_model(final_best, pycaret_name)
    print(f"Saved PyCaret model for {ds['name']} as: {pycaret_name}.pkl")

    # Also dump the pipeline with joblib into backend/app/models/
    pycaret_pipeline_path = models_dir / f"{ds['name']}_pycaret_pipeline.joblib"
    joblib.dump(final_best, pycaret_pipeline_path)
    print(f"Saved PyCaret pipeline to {pycaret_pipeline_path}")

    # 8) Explicit XGBoost model via PyCaret (create_model('xgboost'))
    try:
        xgb_model = create_model('xgboost')
        tuned_xgb = tune_model(xgb_model)
        tuned_xgb_results = pull().copy()
        tuned_xgb_results['dataset'] = ds['name']
        tuned_xgb_results['stage'] = 'tuned_xgboost'
        all_results.append(tuned_xgb_results)

        print('\nTuned XGBoost model for', ds['name'])
        display(tuned_xgb_results)

        final_xgb = finalize_model(tuned_xgb)

        # 9) Build XGB bundle dict like in AWS notebook
        xgb_bundle = {
            'model_type': 'xgboost_pycaret',
            'dataset': ds['name'],
            'target': ds['target'],
            'features': required_features,
            'model': final_xgb,
        }

        xgb_bundle_path = models_dir / f"{ds['name']}_xgb_bundle.joblib"
        joblib.dump(xgb_bundle, xgb_bundle_path)
        print(f"Saved XGB bundle for {ds['name']} to {xgb_bundle_path}")

    except Exception as e:
        print(f"Could not create/tune XGBoost model for {ds['name']}: {e}")


Working on dataset: housing
File: ../data/housing/part-0.parquet
Target: price

Loaded df_raw with shape: (100420, 15)
Original dtypes:
 transaction_unique_identifier       string[python]
price                                      float64
date_of_transfer                    string[python]
property_type                       string[python]
oldnew                              string[python]
duration                            string[python]
towncity                            string[python]
district                            string[python]
county                              string[python]
ppdcategory_type                    string[python]
record_status__monthly_file_only    string[python]
year                                       float64
month                                      float64
region                              string[python]
is_new_build                               float64
dtype: object
Model df shape: (100420, 7)
Model df columns: ['region', 'property_type', 'tenure', 'year', 'month', 'is_new_build', 'price']

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,75997.0833,11212412316.0297,105881.028,0.5156,0.5512,1.2084,0.239
gbr,Gradient Boosting Regressor,77440.4579,11499802761.6669,107229.2678,0.5032,0.5598,1.2373,0.478
rf,Random Forest Regressor,76372.8323,11654756337.9841,107949.1104,0.4965,0.551,1.1639,0.759
et,Extra Trees Regressor,76837.8875,11943540012.9944,109276.8364,0.4841,0.5608,1.157,0.641
dt,Decision Tree Regressor,77067.0733,12088672423.3313,109935.5836,0.4778,0.566,1.1505,0.071
ridge,Ridge Regression,80947.4434,12398177903.7889,111337.3622,0.4645,0.5758,1.2883,0.065
lar,Least Angle Regression,80947.4043,12398177680.6542,111337.3613,0.4645,0.5758,1.2884,0.062
llar,Lasso Least Angle Regression,80947.4132,12398178339.0945,111337.3641,0.4645,0.5758,1.2883,0.061
lasso,Lasso Regression,80947.4189,12398177830.3704,111337.3618,0.4645,0.5758,1.2883,0.529
lr,Linear Regression,80947.4043,12398177680.654,111337.3613,0.4645,0.5758,1.2884,0.54



Top 3 models for housing


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec),dataset,stage
lightgbm,Light Gradient Boosting Machine,75997.0833,11212410000.0,105881.028,0.5156,0.5512,1.2084,0.239,housing,baseline_compare
gbr,Gradient Boosting Regressor,77440.4579,11499800000.0,107229.2678,0.5032,0.5598,1.2373,0.478,housing,baseline_compare
rf,Random Forest Regressor,76372.8323,11654760000.0,107949.1104,0.4965,0.551,1.1639,0.759,housing,baseline_compare


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,75865.8805,11226754280.6867,105956.3791,0.5148,0.5407,1.1101
1,77038.0933,11540692464.3635,107427.615,0.5052,0.5507,1.1675
2,76731.9283,11607812687.2477,107739.5595,0.5102,0.5245,0.8946
3,74022.8953,10647284743.1334,103185.6809,0.5322,0.5385,1.2197
4,76081.628,11252447705.3445,106077.5551,0.5169,0.5923,1.7514
5,74396.4457,10897260839.8784,104389.9461,0.529,0.5327,0.9579
6,75013.8356,10974464691.9164,104759.0793,0.5185,0.5589,1.3606
7,75628.8656,11107034130.8547,105389.9147,0.509,0.5466,0.9918
8,75226.4876,11102672559.3696,105369.2202,0.5221,0.5567,1.1166
9,76214.4058,11218804086.8147,105918.8561,0.5224,0.5388,1.0566


Fitting 10 folds for each of 10 candidates, totalling 100 fits

Tuned best overall model for housing


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE,dataset,stage
0,75865.8805,11226750000.0,105956.3791,0.5148,0.5407,1.1101,housing,tuned_best
1,77038.0933,11540690000.0,107427.615,0.5052,0.5507,1.1675,housing,tuned_best
2,76731.9283,11607810000.0,107739.5595,0.5102,0.5245,0.8946,housing,tuned_best
3,74022.8953,10647280000.0,103185.6809,0.5322,0.5385,1.2197,housing,tuned_best
4,76081.628,11252450000.0,106077.5551,0.5169,0.5923,1.7514,housing,tuned_best
5,74396.4457,10897260000.0,104389.9461,0.529,0.5327,0.9579,housing,tuned_best
6,75013.8356,10974460000.0,104759.0793,0.5185,0.5589,1.3606,housing,tuned_best
7,75628.8656,11107030000.0,105389.9147,0.509,0.5466,0.9918,housing,tuned_best
8,75226.4876,11102670000.0,105369.2202,0.5221,0.5567,1.1166,housing,tuned_best
9,76214.4058,11218800000.0,105918.8561,0.5224,0.5388,1.0566,housing,tuned_best


Transformation Pipeline and Model Successfully Saved
Saved PyCaret model for housing as: housing_pycaret_best.pkl
Saved PyCaret pipeline to backend\app\models\housing_pycaret_pipeline.joblib
Could not create/tune XGBoost model for housing: Estimator xgboost not available. Please see docstring for list of available estimators.

Working on dataset: electricity
File: ../data/electricity/intermediate/electricity_all_cleaned.parquet
Target: demand_mw

Loaded df_raw with shape: (434928, 2)
Original dtypes:
 ts           datetime64[ns]
demand_mw           float64
dtype: object
Model df shape: (434928, 6)
Model df columns: ['year', 'month', 'day', 'hour', 'is_weekend', 'demand_mw']


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,714.3594,1063314.8303,1031.1383,0.9854,0.0328,0.0226,21.831
et,Extra Trees Regressor,802.6399,1265342.0122,1124.843,0.9826,0.0356,0.0253,28.708
dt,Decision Tree Regressor,819.2971,1408079.8277,1186.5691,0.9806,0.0375,0.0257,0.197
knn,K Neighbors Regressor,1065.8069,2114526.825,1454.1104,0.9709,0.0457,0.0335,0.398
lightgbm,Light Gradient Boosting Machine,1372.1603,3236292.3243,1798.9524,0.9554,0.058,0.0437,0.565
gbr,Gradient Boosting Regressor,1766.0579,5322889.1043,2307.0484,0.9267,0.0731,0.0559,4.158
ada,AdaBoost Regressor,3261.1891,15845661.4974,3980.3284,0.7817,0.1269,0.1067,3.25
llar,Lasso Least Angle Regression,4908.9466,36516353.1933,6042.8459,0.4971,0.1821,0.1539,0.049
ridge,Ridge Regression,4908.8885,36516348.0348,6042.8455,0.4971,0.1821,0.1539,0.053
br,Bayesian Ridge,4908.8939,36516348.1455,6042.8455,0.4971,0.1821,0.1539,0.075



Top 3 models for electricity


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec),dataset,stage
rf,Random Forest Regressor,714.3594,1063315.0,1031.1383,0.9854,0.0328,0.0226,21.831,electricity,baseline_compare
et,Extra Trees Regressor,802.6399,1265342.0,1124.843,0.9826,0.0356,0.0253,28.708,electricity,baseline_compare
dt,Decision Tree Regressor,819.2971,1408080.0,1186.5691,0.9806,0.0375,0.0257,0.197,electricity,baseline_compare


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 10 folds for each of 10 candidates, totalling 100 fits
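
The XGBoost bundle written by the training cell is a plain dictionary that stores the feature list next to the model, so a consumer can rebuild the input in the right column order. Below is a minimal sketch of that use, assuming the stored object exposes a scikit-learn style `predict`. `predict_with_bundle` and `ConstantModel` are hypothetical names, and the stub only stands in for the estimator that `joblib.load` would return.

```python
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for the fitted estimator stored under bundle['model']."""

    def __init__(self, value):
        self.value = value
        self.seen_columns = None

    def predict(self, X):
        # Remember which columns arrived, then return one constant per row.
        self.seen_columns = list(X.columns)
        return [self.value] * len(X)


def predict_with_bundle(bundle, rows):
    """Select bundle['features'] in order from `rows` and call the model's predict."""
    missing = [c for c in bundle["features"] if c not in rows.columns]
    if missing:
        raise ValueError(f"Missing features for {bundle['dataset']}: {missing}")
    X = rows[bundle["features"]]
    preds = bundle["model"].predict(X)
    return pd.Series(preds, index=rows.index, name=bundle["target"])


# In real use the bundle would come from joblib.load(<bundle path>).
bundle = {
    "dataset": "electricity",
    "target": "demand_mw",
    "features": ["year", "month", "day", "hour", "is_weekend"],
    "model": ConstantModel(0.0),
}
rows = pd.DataFrame(
    {"hour": [0], "day": [1], "month": [1], "year": [2001], "is_weekend": [0]}
)
print(predict_with_bundle(bundle, rows))
```

Keeping the feature list in the bundle means the caller does not have to know the training column order, and extra columns in `rows` are dropped.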


In [None]:
if len(all_results) > 0:
    metrics_df = pd.concat(all_results, ignore_index=True)
    display(metrics_df)
    metrics_df.to_csv('model_comparison_results.csv', index=False)
    print("Saved all metrics to 'model_comparison_results.csv'.")
else:
    print("No results collected – check the training cell for errors.")


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec),dataset,stage
0,Elastic Net,95549.1452,15429740000.0,124210.6433,0.3335,0.6471,1.3821,0.436,housing,baseline_compare
1,K Neighbors Regressor,104035.9555,18471260000.0,135903.7469,0.2021,0.6752,1.34,0.89,housing,baseline_compare
2,Decision Tree Regressor,121107.1751,23149500000.0,152145.9342,-0.0,0.751,1.548,0.404,housing,baseline_compare
3,Gradient Boosting Regressor,121129.3661,23151610000.0,152152.8357,-0.0001,0.7511,1.5486,1.663,housing,baseline_compare
4,Light Gradient Boosting Machine,121057.797,23151080000.0,152151.0781,-0.0001,0.7507,1.5458,0.602,housing,baseline_compare
5,Random Forest Regressor,121108.3538,23150510000.0,152149.2254,-0.0001,0.751,1.548,3.844,housing,baseline_compare
6,Ridge Regression,121111.5543,23151250000.0,152151.628,-0.0001,0.751,1.548,0.485,housing,baseline_compare
7,Passive Aggressive Regressor,121113.8062,23152020000.0,152154.1771,-0.0002,0.751,1.548,0.387,housing,baseline_compare
8,Least Angle Regression,121113.806,23152020000.0,152154.1765,-0.0002,0.751,1.548,0.409,housing,baseline_compare
9,Extra Trees Regressor,121112.0807,23152260000.0,152155.0098,-0.0002,0.751,1.548,2.491,housing,baseline_compare


Saved all metrics to 'model_comparison_results.csv'.
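
Concatenating with `ignore_index=True` discards each table's index, which is why the combined table above no longer shows model IDs or fold labels. The sketch below moves the index into a column before stacking. It assumes the tables returned by `pull()` carry those labels in a flat index, as the displayed outputs suggest. `stack_results` and `row_id` are hypothetical names, and the sample numbers are copied from the housing outputs above.

```python
import pandas as pd


def stack_results(tables):
    """Concatenate (stage, table) pairs, keeping each table's index as 'row_id'."""
    parts = []
    for stage, table in tables:
        # rename_axis returns a new object, so the input table is left untouched.
        part = table.rename_axis("row_id").reset_index()
        part["row_id"] = part["row_id"].astype(str)
        part["stage"] = stage
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


compare = pd.DataFrame(
    {"MAE": [75997.0833, 77440.4579], "R2": [0.5156, 0.5032]},
    index=["lightgbm", "gbr"],
)
tuned = pd.DataFrame(
    {"MAE": [75865.8805, 77038.0933], "R2": [0.5148, 0.5052]},
    index=pd.Index([0, 1], name="Fold"),
)
combined = stack_results([("baseline_compare", compare), ("tuned_best", tuned)])
print(combined)
```

With the labels kept as a column, the saved CSV can still be filtered by model ID or fold.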


In [None]:
# Interactive prediction for a single house

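# This cell loads a saved housing model ("housing_best_model.pkl"),
# asks you for feature values, and prints the predicted price.
# NOTE: the training cell above saves "housing_pycaret_best.pkl" instead, and that
# model expects only region, property_type, tenure, year, month and is_new_build.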

from pycaret.regression import load_model, predict_model
import pandas as pd
from pandas.api.types import is_bool_dtype

print("Loading trained housing model 'housing_best_model' ...")
model = load_model("housing_best_model")

# Always reload the housing data: after the training loop, `df` holds the
# LAST dataset processed (electricity), so it cannot be reused here.
housing_cfg = next(d for d in datasets if d.get('name') == 'housing')
df = pd.read_parquet(housing_cfg['file'])
feature_df = df.drop(columns=[housing_cfg['target']])

row = {}

print("\nPlease provide values for the following columns.")
print("Press ENTER to accept the suggested default value for each column.\n")

for col in feature_df.columns:
    series = feature_df[col]

    # Choose a sensible default value from the training data.
    # Check bool first: is_numeric_dtype() is also True for boolean columns.
    is_bool = is_bool_dtype(series)
    is_num = pd.api.types.is_numeric_dtype(series) and not is_bool
    if is_bool:
        # Use most frequent boolean
        default = bool(series.mode().iloc[0])
    elif is_num:
        default = float(series.median())
    else:
        # For categorical/string, use the most frequent category
        default = str(series.mode().iloc[0])

    user_in = input(f"Enter value for '{col}' (default={default}): ").strip()

    if user_in == "":
        value = default
    elif is_bool:
        value = user_in.lower() in ["1", "true", "yes", "y"]
    elif is_num:
        # Attempt to cast to a number, falling back to the default
        try:
            value = float(user_in)
        except ValueError:
            print(f"  Could not parse number, falling back to default for '{col}'.")
            value = default
    else:
        value = user_in

    row[col] = value

input_df = pd.DataFrame([row])
print("\nYou entered the following values:")
display(input_df)

pred_df = predict_model(model, data=input_df)

# Try to find the prediction column name used by this PyCaret version
pred_col = None
for cand in ["Label", "prediction_label", "Prediction", "prediction"]:
    if cand in pred_df.columns:
        pred_col = cand
        break

if pred_col is None:
    print("Could not automatically find the prediction column.")
    print("Available columns in pred_df:", list(pred_df.columns))
else:
    pred_price = pred_df[pred_col].iloc[0]
    print(f"\nEstimated {datasets[0]['target']} for this house: {pred_price:.2f}")
    display(pred_df)


Loading trained housing model 'housing_best_model' ...
Transformation Pipeline and Model Successfully Loaded

Please provide values for the following columns.
Press ENTER to accept the suggested default value for each column.


You entered the following values:


Unnamed: 0,transaction_unique_identifier,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build
0,{2AC10E4F-AD4E-1AF6-E050-A8C063052BA1},2016-03-31,T,N,F,LONDON,BIRMINGHAM,GREATER LONDON,A,A,2016.0,5.0,GREATER LONDON,0.0




Estimated price for this house: 325792.73


Unnamed: 0,transaction_unique_identifier,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build,prediction_label
0,{2AC10E4F-AD4E-1AF6-E050-A8C063052BA1},2016-03-31,T,N,F,LONDON,BIRMINGHAM,GREATER LONDON,A,A,2016.0,5.0,GREATER LONDON,0.0,325792.734599
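
The `input()` prompts stop the notebook when it is run top to bottom without someone at the keyboard. A non-interactive variant can build the same one-row frame from per-column defaults plus explicit overrides. This is a sketch only: `build_row` is a hypothetical helper, and the small frame is invented to show the call.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def build_row(feature_df, overrides=None):
    """Return a one-row DataFrame of per-column defaults, updated with `overrides`."""
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(feature_df.columns)
    if unknown:
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    row = {}
    for col in feature_df.columns:
        series = feature_df[col]
        # Bool is tested first because is_numeric_dtype is also True for it.
        if is_bool_dtype(series):
            row[col] = bool(series.mode().iloc[0])
        elif is_numeric_dtype(series):
            row[col] = float(series.median())
        else:
            row[col] = str(series.mode().iloc[0])
    row.update(overrides)
    return pd.DataFrame([row])


# Invented frame for illustration; in the notebook this would be feature_df.
features = pd.DataFrame(
    {
        "region": ["NORFOLK", "NORFOLK", "SURREY"],
        "property_type": ["T", "T", "D"],
        "year": [2016.0, 2016.0, 2016.0],
    }
)
print(build_row(features, {"region": "SURREY"}))
```

The returned frame can be passed to `predict_model(model, data=...)` in place of `input_df`.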
