# PyCaret Baseline Models for Two Datasets

This notebook trains **baseline regression models** for two datasets using **PyCaret**.

You should use it for your project datasets (e.g. UK housing & UK electricity demand).

## Instructions
1. Edit the file paths and target column names in the **Configuration** cell.
2. Run the notebook top to bottom.
3. It will:
   - Load both `.parquet` files
   - Run PyCaret's `setup` and `compare_models`
   - Tune the best model
   - Save the final model for each dataset
   - Save a CSV file with all metrics for later use in your report


In [1]:
import sys
print(sys.executable)  # just to be sure

# install pycaret into THIS interpreter
!"{sys.executable}" -m pip install pycaret
!pip uninstall -y pycaret
!pip install pycaret==2.3.10
!pip install scikit-learn==0.23.2


c:\Users\User\AppData\Local\Programs\Python\Python311\python.exe



[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting pycaret==2.3.10
  Using cached pycaret-2.3.10-py3-none-any.whl.metadata (12 kB)
Collecting scipy<=1.5.4 (from pycaret==2.3.10)
  Using cached scipy-1.5.4.tar.gz (25.2 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [73 lines of output]
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.8" and platform_system == "AIX"' don't match your environment
      Collecting wheel
        Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
      Collecting setuptools
        Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)
      Collecting Cython>=0.29.18
        Using cached cython-3.2.1-cp311-cp3

Collecting scikit-learn==0.23.2
  Using cached scikit-learn-0.23.2.tar.gz (7.2 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [116 lines of output]
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX" and platform_python_implementation == "CPython"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system != "AIX" and platform_python_implementation != "CPython"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system != "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.6" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_system == "AIX"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.8" and platform_system == "AIX"' don't match your environment
      Collecting setuptools
        Using cached setuptools

In [2]:
from pycaret.regression import (
    setup,
    compare_models,
    tune_model,
    finalize_model,
    save_model,
    load_model,
    predict_model,
    pull,
)

import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype, is_bool_dtype


In [3]:
# ==============================
# Configuration: EDIT THIS CELL
# ==============================

# We use two cleaned parquet files:
# 1) part-0.parquet                 -> housing dataset
# 2) electricity_all_cleaned.parquet -> electricity demand dataset
#
# IMPORTANT:
#   - Change TARGET_COLUMN_HOUSING and TARGET_COLUMN_ELECTRICITY
#     to the **exact** column names you want to predict.

datasets = [
    {
        'name': 'housing',
        'file': '../data/housing/part-0.parquet',                 # <-- housing parquet filename
        'target': 'price'         # <-- change to housing target column
    },
    {
        'name': 'electricity',
        'file': '../data/electricity/intermediate/electricity_all_cleaned.parquet',  # <-- electricity parquet filename
        'target': 'demand_mw'       # <-- change to electricity target column
    }
]

random_seed = 123  # for reproducibility
train_size = 0.8   # 80% train, 20% test


In [4]:
# Quick preview of both datasets (first 5 rows each)
for ds in datasets:
    print(f"\n=== Preview: {ds['name']} ({ds['file']}) ===")
    df = pd.read_parquet(ds['file'])
    display(df.head())
    print('Columns:', list(df.columns))


=== Preview: housing (../data/housing/part-0.parquet) ===


Unnamed: 0,transaction_unique_identifier,price,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build
0,{2F772747-249E-3534-E050-A8C0630513CD},146500.0,2016-03-22,T,N,F,DEREHAM,BRECKLAND,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
1,{2F772747-249F-3534-E050-A8C0630513CD},150000.0,2016-03-24,S,N,F,NORWICH,NORWICH,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
2,{2F772747-24A0-3534-E050-A8C0630513CD},425000.0,2016-03-24,T,N,F,NORWICH,NORWICH,NORFOLK,A,A,2016.0,3.0,NORFOLK,0.0
3,{2F772747-26BB-3534-E050-A8C0630513CD},460000.0,2016-03-21,D,N,F,REDHILL,TANDRIDGE,SURREY,A,A,2016.0,3.0,SURREY,0.0
4,{2F772747-26BC-3534-E050-A8C0630513CD},400000.0,2016-03-04,S,N,F,GUILDFORD,WAVERLEY,SURREY,A,A,2016.0,3.0,SURREY,0.0


Columns: ['transaction_unique_identifier', 'price', 'date_of_transfer', 'property_type', 'oldnew', 'duration', 'towncity', 'district', 'county', 'ppdcategory_type', 'record_status__monthly_file_only', 'year', 'month', 'region', 'is_new_build']

=== Preview: electricity (../data/electricity/intermediate/electricity_all_cleaned.parquet) ===


Unnamed: 0,ts,demand_mw
0,2001-01-01 00:00:00,38631.0
1,2001-01-01 00:30:00,39808.0
2,2001-01-01 01:00:00,40039.0
3,2001-01-01 01:30:00,39339.0
4,2001-01-01 02:00:00,38295.0


Columns: ['ts', 'demand_mw']


In [5]:
for ds in datasets:
    print(f"\n=== {ds['name']} ===")
    print("Path:", ds['file'])
    df = pd.read_parquet(ds['file'])
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))


=== housing ===
Path: ../data/housing/part-0.parquet
Shape: (100420, 15)
Columns: ['transaction_unique_identifier', 'price', 'date_of_transfer', 'property_type', 'oldnew', 'duration', 'towncity', 'district', 'county', 'ppdcategory_type', 'record_status__monthly_file_only', 'year', 'month', 'region', 'is_new_build']

=== electricity ===
Path: ../data/electricity/intermediate/electricity_all_cleaned.parquet
Shape: (434928, 2)
Columns: ['ts', 'demand_mw']


In [6]:


all_results = []

# JUST FIRST DATASET FOR NOW (housing only)
for ds in datasets[:1]:
    print(f"\n============================")
    print(f"Working on dataset: {ds['name']}")
    print(f"File: {ds['file']}")
    print(f"Target: {ds['target']}")
    print(f"============================\n")

    # Load data
    df = pd.read_parquet(ds['file'])
    print("Loaded df with shape:", df.shape)
    print("Original dtypes:\n", df.dtypes)

    if ds['target'] not in df.columns:
        raise ValueError(
            f"Target column '{ds['target']}' not found in columns: {list(df.columns)}.\n"
            "Please edit the configuration cell with the correct target column name."
        )

    # ======== FIX FOR pd.NA / nullable dtypes ========
    df = df.copy()

    # 1) Convert new-style pandas 'string' dtype to old 'object'
    for col in df.columns:
        if is_string_dtype(df[col]):
            df[col] = df[col].astype("object")

    # 2) Convert nullable boolean to float (0/1 with NaN)
    for col in df.columns:
        if is_bool_dtype(df[col]):
            df[col] = df[col].astype("float64")

    # 3) Replace pandas NA with np.nan
    df = df.replace({pd.NA: np.nan})

    print("\nDtypes after cleaning:\n", df.dtypes)
    # ================================================

    # Setup PyCaret experiment
    exp = setup(
        data=df,
        target=ds['target'],
        session_id=random_seed,
        train_size=train_size,
        normalize=True,
        verbose=False
    )

    # Compare models (get top 3)
    top3 = compare_models(n_select=3)
    compare_results = pull().copy()
    compare_results['dataset'] = ds['name']
    compare_results['stage'] = 'baseline_compare'
    all_results.append(compare_results)

    print('\nTop 3 models for', ds['name'])
    display(compare_results.head(3))

    # Take the best model and tune it
    best_model = top3[0]
    tuned_best = tune_model(best_model)
    tuned_results = pull().copy()
    tuned_results['dataset'] = ds['name']
    tuned_results['stage'] = 'tuned_best'
    all_results.append(tuned_results)

    print('\nTuned best model for', ds['name'])
    display(tuned_results)

    # Finalize and save the tuned model
    final_model = finalize_model(tuned_best)
    model_filename = f"{ds['name']}_best_model"
    save_model(final_model, model_filename)
    print(f"Saved final model for {ds['name']} as: {model_filename}.pkl")



Working on dataset: housing
File: ../data/housing/part-0.parquet
Target: price

Loaded df with shape: (100420, 15)
Original dtypes:
 transaction_unique_identifier       string[python]
price                                      float64
date_of_transfer                    string[python]
property_type                       string[python]
oldnew                              string[python]
duration                            string[python]
towncity                            string[python]
district                            string[python]
county                              string[python]
ppdcategory_type                    string[python]
record_status__monthly_file_only    string[python]
year                                       float64
month                                      float64
region                              string[python]
is_new_build                               float64
dtype: object

Dtypes after cleaning:
 transaction_unique_identifier        object
price             

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
en,Elastic Net,95549.1452,15429735373.3877,124210.6433,0.3335,0.6471,1.3821,0.436
knn,K Neighbors Regressor,104035.9555,18471255654.4,135903.7469,0.2021,0.6752,1.34,0.89
dt,Decision Tree Regressor,121107.1751,23149503576.764,152145.9342,-0.0,0.751,1.548,0.404
gbr,Gradient Boosting Regressor,121129.3661,23151609943.726,152152.8357,-0.0001,0.7511,1.5486,1.663
lightgbm,Light Gradient Boosting Machine,121057.797,23151082342.0769,152151.0781,-0.0001,0.7507,1.5458,0.602
rf,Random Forest Regressor,121108.3538,23150510433.0816,152149.2254,-0.0001,0.751,1.548,3.844
ridge,Ridge Regression,121111.5543,23151245839.8274,152151.628,-0.0001,0.751,1.548,0.485
par,Passive Aggressive Regressor,121113.8062,23152021523.7603,152154.1771,-0.0002,0.751,1.548,0.387
lar,Least Angle Regression,121113.806,23152021331.7234,152154.1765,-0.0002,0.751,1.548,0.409
et,Extra Trees Regressor,121112.0807,23152263594.0394,152155.0098,-0.0002,0.751,1.548,2.491



Top 3 models for housing


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec),dataset,stage
en,Elastic Net,95549.1452,15429740000.0,124210.6433,0.3335,0.6471,1.3821,0.436,housing,baseline_compare
knn,K Neighbors Regressor,104035.9555,18471260000.0,135903.7469,0.2021,0.6752,1.34,0.89,housing,baseline_compare
dt,Decision Tree Regressor,121107.1751,23149500000.0,152145.9342,-0.0,0.751,1.548,0.404,housing,baseline_compare


Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,93302.5859,14893609773.0531,122039.378,0.3563,0.627,1.1957
1,93720.1036,14958987908.3454,122306.9414,0.3587,0.6338,1.2965
2,94400.916,15379833770.7861,124015.4578,0.3511,0.6192,1.1821
3,92421.4445,14374614305.8233,119894.1796,0.3685,0.6291,1.4818
4,93618.1054,14990850278.9206,122437.1279,0.3563,0.6765,2.0748
5,92876.6879,14652056259.6161,121045.6784,0.3667,0.6293,1.2117
6,91915.8487,14531395790.2471,120546.2392,0.3624,0.6447,1.4142
7,92273.6394,14472444184.8073,120301.4721,0.3602,0.6315,1.1571
8,93526.2372,14964487521.9223,122329.4221,0.3558,0.6456,1.258
9,93466.7919,15093778802.9304,122856.741,0.3575,0.6315,1.3391


Fitting 10 folds for each of 10 candidates, totalling 100 fits

Tuned best model for housing


Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE,dataset,stage
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,93302.5859,14893610000.0,122039.378,0.3563,0.627,1.1957,housing,tuned_best
1,93720.1036,14958990000.0,122306.9414,0.3587,0.6338,1.2965,housing,tuned_best
2,94400.916,15379830000.0,124015.4578,0.3511,0.6192,1.1821,housing,tuned_best
3,92421.4445,14374610000.0,119894.1796,0.3685,0.6291,1.4818,housing,tuned_best
4,93618.1054,14990850000.0,122437.1279,0.3563,0.6765,2.0748,housing,tuned_best
5,92876.6879,14652060000.0,121045.6784,0.3667,0.6293,1.2117,housing,tuned_best
6,91915.8487,14531400000.0,120546.2392,0.3624,0.6447,1.4142,housing,tuned_best
7,92273.6394,14472440000.0,120301.4721,0.3602,0.6315,1.1571,housing,tuned_best
8,93526.2372,14964490000.0,122329.4221,0.3558,0.6456,1.258,housing,tuned_best
9,93466.7919,15093780000.0,122856.741,0.3575,0.6315,1.3391,housing,tuned_best


Transformation Pipeline and Model Successfully Saved
Saved final model for housing as: housing_best_model.pkl


In [7]:
if len(all_results) > 0:
    metrics_df = pd.concat(all_results, ignore_index=True)
    display(metrics_df)
    metrics_df.to_csv('model_comparison_results.csv', index=False)
    print("Saved all metrics to 'model_comparison_results.csv'.")
else:
    print("No results collected – check the training cell for errors.")


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec),dataset,stage
0,Elastic Net,95549.1452,15429740000.0,124210.6433,0.3335,0.6471,1.3821,0.436,housing,baseline_compare
1,K Neighbors Regressor,104035.9555,18471260000.0,135903.7469,0.2021,0.6752,1.34,0.89,housing,baseline_compare
2,Decision Tree Regressor,121107.1751,23149500000.0,152145.9342,-0.0,0.751,1.548,0.404,housing,baseline_compare
3,Gradient Boosting Regressor,121129.3661,23151610000.0,152152.8357,-0.0001,0.7511,1.5486,1.663,housing,baseline_compare
4,Light Gradient Boosting Machine,121057.797,23151080000.0,152151.0781,-0.0001,0.7507,1.5458,0.602,housing,baseline_compare
5,Random Forest Regressor,121108.3538,23150510000.0,152149.2254,-0.0001,0.751,1.548,3.844,housing,baseline_compare
6,Ridge Regression,121111.5543,23151250000.0,152151.628,-0.0001,0.751,1.548,0.485,housing,baseline_compare
7,Passive Aggressive Regressor,121113.8062,23152020000.0,152154.1771,-0.0002,0.751,1.548,0.387,housing,baseline_compare
8,Least Angle Regression,121113.806,23152020000.0,152154.1765,-0.0002,0.751,1.548,0.409,housing,baseline_compare
9,Extra Trees Regressor,121112.0807,23152260000.0,152155.0098,-0.0002,0.751,1.548,2.491,housing,baseline_compare


Saved all metrics to 'model_comparison_results.csv'.


In [10]:
# Interactive prediction for a single house

# This cell loads the saved housing model ("housing_best_model.pkl"),
# asks you for feature values, and prints the predicted price.

from pycaret.regression import load_model, predict_model
import pandas as pd
from pandas.api.types import is_bool_dtype

print("Loading trained housing model 'housing_best_model' ...")
model = load_model("housing_best_model")

# Try to reuse the housing dataframe `df` from the training cell.
# If it is not available, we reload the housing dataset.
try:
    feature_df = df.drop(columns=[datasets[0]['target']])
except Exception:
    housing_cfg = [d for d in datasets if d.get('name') == 'housing'][0]
    df = pd.read_parquet(housing_cfg['file'])
    feature_df = df.drop(columns=[housing_cfg['target']])

row = {}

print("\nPlease provide values for the following columns.")
print("Press ENTER to accept the suggested default value for each column.\n")

for col in feature_df.columns:
    series = feature_df[col]

    # Choose a sensible default value from the training data
    if pd.api.types.is_numeric_dtype(series):
        default = float(series.median())
    elif is_bool_dtype(series):
        # Use most frequent boolean
        default = bool(series.mode().iloc[0])
    else:
        # For categorical/string, use the most frequent category
        default = str(series.mode().iloc[0])

    user_in = input(f"Enter value for '{col}' (default={default}): ").strip()

    if user_in == "":
        value = default
    else:
        # Attempt to cast to the right type
        if pd.api.types.is_numeric_dtype(series):
            try:
                value = float(user_in)
            except ValueError:
                print(f"  Could not parse number, falling back to default for '{col}'.")
                value = default
        elif is_bool_dtype(series):
            value = user_in.lower() in ["1", "true", "yes", "y"]
        else:
            value = user_in

    row[col] = value

input_df = pd.DataFrame([row])
print("\nYou entered the following values:")
display(input_df)

pred_df = predict_model(model, data=input_df)

# Try to find the prediction column name used by this PyCaret version
pred_col = None
for cand in ["Label", "prediction_label", "Prediction", "prediction"]:
    if cand in pred_df.columns:
        pred_col = cand
        break

if pred_col is None:
    print("Could not automatically find the prediction column.")
    print("Available columns in pred_df:", list(pred_df.columns))
else:
    pred_price = pred_df[pred_col].iloc[0]
    print(f"\nEstimated {datasets[0]['target']} for this house: {pred_price:.2f}")
    display(pred_df)


print(f"\nEstimated {datasets[0]['target']} for this house: {pred_price:.2f}")
display(pred_df)


Loading trained housing model 'housing_best_model' ...
Transformation Pipeline and Model Successfully Loaded

Please provide values for the following columns.
Press ENTER to accept the suggested default value for each column.


You entered the following values:


Unnamed: 0,transaction_unique_identifier,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build
0,{2AC10E4F-AD4E-1AF6-E050-A8C063052BA1},2016-03-31,T,N,F,LONDON,BIRMINGHAM,GREATER LONDON,A,A,2016.0,5.0,GREATER LONDON,0.0



Estimated price for this house: 325792.73


Unnamed: 0,transaction_unique_identifier,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build,prediction_label
0,{2AC10E4F-AD4E-1AF6-E050-A8C063052BA1},2016-03-31,T,N,F,LONDON,BIRMINGHAM,GREATER LONDON,A,A,2016.0,5.0,GREATER LONDON,0.0,325792.734599



Estimated price for this house: 325792.73


Unnamed: 0,transaction_unique_identifier,date_of_transfer,property_type,oldnew,duration,towncity,district,county,ppdcategory_type,record_status__monthly_file_only,year,month,region,is_new_build,prediction_label
0,{2AC10E4F-AD4E-1AF6-E050-A8C063052BA1},2016-03-31,T,N,F,LONDON,BIRMINGHAM,GREATER LONDON,A,A,2016.0,5.0,GREATER LONDON,0.0,325792.734599
