# Load Featured Dataset, Prepare Target, and Train-Test Split

This cell loads the preprocessed, feature-engineered Airbnb dataset, log-transforms the target price with `log1p` to reduce skew, and splits the data into training (80%) and test (20%) sets for the models that follow.
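To make the effect of the log transform concrete, here is a minimal sketch on made-up prices (the values are hypothetical, not taken from the dataset). It shows that `np.log1p` compresses a large outlier and that `np.expm1` recovers the original values.

```python
import numpy as np

# Hypothetical nightly prices with one expensive outlier (not from the dataset)
prices = np.array([50.0, 80.0, 120.0, 200.0, 1200.0])

log_prices = np.log1p(prices)     # log(1 + price), defined even when price == 0
recovered = np.expm1(log_prices)  # inverse of log1p

# Ratio of the largest value to the median, before and after the transform
ratio_raw = prices.max() / np.median(prices)
ratio_log = log_prices.max() / np.median(log_prices)

print(f"max/median on the price scale: {ratio_raw:.2f}")
print(f"max/median on the log scale:   {ratio_log:.2f}")
print(f"round trip recovers prices:    {np.allclose(recovered, prices)}")
```

The outlier sits ten times above the median on the price scale but less than twice above it on the log scale, which is why squared-error losses are less dominated by a few expensive listings after the transform.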


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Step 1: Load the featured dataset
df = pd.read_csv("../data/cleaned/airbnb_featured_data.csv")

# Step 2: Prepare target variable
# Log-transform 'price' to reduce skewness (common for price prediction)
df['log_price'] = np.log1p(df['price'])

# Step 3: Define features (drop original price and any non-feature columns)
X = df.drop(columns=['price', 'log_price'])  # All features except price
y = df['log_price']  # Target variable for training

# Step 4: Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Quick sanity checks
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Sample features:\n{X_train.head()}")
print(f"Sample targets:\n{y_train.head()}")


Training set shape: (35436, 18)
Test set shape: (8859, 18)
Sample features:
       host_identity_verified  instant_bookable  construction_year  \
4955                        1                 0             2004.0   
23316                       0                 1             2005.0   
43055                       0                 0             2021.0   
15516                       1                 1             2007.0   
11671                       1                 0             2015.0   

       minimum_nights  number_of_reviews  reviews_per_month  \
4955              1.0                2.0               0.05   
23316             1.0               12.0               2.95   
43055             1.0               24.0               2.47   
15516             1.0               16.0               1.84   
11671             1.0                8.0               0.41   

       review_rate_number  availability_365  cancellation_policy_moderate  \
4955                  3.0              22.0    

# Baseline Model Training and Evaluation

We will train a simple Random Forest Regressor on the prepared dataset to establish a baseline performance. The target variable is the log-transformed price to stabilize variance.

**Steps:**
- Initialize the Random Forest model with 100 trees and a fixed random state for reproducibility.
- Train the model on the training set.
- Predict on the test set.
- Evaluate model performance using RMSE (Root Mean Squared Error) and the R² score on the log-price scale.
- Convert predictions back to the original price scale to interpret error in actual price units.

This baseline will help gauge how well the current features predict Airbnb listing prices.
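The evaluation steps listed above can be sketched as one small helper. The function names `rmse`, `r2`, and `report` are hypothetical and not part of the notebook; `r2` follows the usual 1 - SS_res / SS_tot definition. The toy targets are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two array-likes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def report(y_true_log, y_pred_log):
    """Metrics on the log scale, plus RMSE after converting back with expm1."""
    return {
        "rmse_log": rmse(y_true_log, y_pred_log),
        "r2_log": r2(y_true_log, y_pred_log),
        "rmse_price": rmse(np.expm1(y_true_log), np.expm1(y_pred_log)),
    }

# Hypothetical log-price targets
y_true_log = np.log1p([100.0, 200.0, 400.0])

# Predicting the mean of the log targets for every row gives R² = 0 by construction
y_mean_log = np.full(y_true_log.shape, y_true_log.mean())
print(report(y_true_log, y_mean_log))
```

A constant mean prediction is a useful reference point: any model with an R² near zero is doing little better than ignoring the features.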


In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Train model
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Compute RMSE as the square root of MSE, since the 'squared' argument is not supported by the installed scikit-learn
rmse_log = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE (log scale): {rmse_log:.4f}")
print(f"Test R² score: {r2:.4f}")

# Back to original price scale
y_test_price = np.expm1(y_test)
y_pred_price = np.expm1(y_pred)
rmse_price = mean_squared_error(y_test_price, y_pred_price) ** 0.5

print(f"Test RMSE (original price scale): {rmse_price:.2f}")



Test RMSE (log scale): 0.6142
Test R² score: 0.2975
Test RMSE (original price scale): 304.55


# Second Model: XGBoost Regressor

We use the XGBoost regressor, a powerful gradient boosting framework that often outperforms Random Forest on structured datasets. This step includes:

- Initializing XGBoost with default parameters.
- Training on the same training data.
- Evaluating performance using RMSE and R² on the test set.
- Comparing results with the Random Forest baseline.

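XGBoost often trains quickly and can be more accurate than Random Forest, but it usually needs tuning to get there. With default parameters on this dataset it scores below the Random Forest baseline (log-scale RMSE 0.7072 vs 0.6142).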


In [8]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Initialize XGBoost regressor
xgb_reg = xgb.XGBRegressor(random_state=42, n_jobs=-1, verbosity=0)

# Train
xgb_reg.fit(X_train, y_train)

# Predict
y_pred_xgb = xgb_reg.predict(X_test)

# Evaluate on log scale
rmse_log_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost Test RMSE (log scale): {rmse_log_xgb:.4f}")
print(f"XGBoost Test R² score: {r2_xgb:.4f}")

# Convert back to original price scale
y_test_price = np.expm1(y_test)
y_pred_price_xgb = np.expm1(y_pred_xgb)
rmse_price_xgb = mean_squared_error(y_test_price, y_pred_price_xgb) ** 0.5

print(f"XGBoost Test RMSE (original price scale): {rmse_price_xgb:.2f}")


XGBoost Test RMSE (log scale): 0.7072
XGBoost Test R² score: 0.0687
XGBoost Test RMSE (original price scale): 339.73


# Hyperparameter Tuning: Random Forest Regressor

We will optimize key Random Forest hyperparameters to improve performance:

- Number of trees (`n_estimators`)
- Maximum tree depth (`max_depth`)
- Minimum samples per leaf (`min_samples_leaf`)

We’ll use randomized search with cross-validation to efficiently explore combinations.

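The aim is better accuracy without changing the model family. On this dataset the gain turns out to be small (log-scale test RMSE 0.6101 vs 0.6142 for the baseline).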
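To see why randomized search is cheaper than a full grid here, the sketch below counts the combinations in the search space used by the tuning cell and draws 20 of them. It only illustrates the idea with the standard library; it is not scikit-learn's sampler, so the candidates it draws will differ from those `RandomizedSearchCV` evaluates.

```python
import itertools
import random

# Same value lists as the search space in the tuning cell
param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_leaf": [1, 2, 4, 6, 8],
}

# Enumerate every combination a full grid search would have to fit
keys = list(param_dist)
grid = [dict(zip(keys, values)) for values in itertools.product(*param_dist.values())]

n_iter, cv = 20, 3
rng = random.Random(42)
candidates = rng.sample(grid, n_iter)  # draw without replacement

print(f"Full grid size:         {len(grid)}")
print(f"Sampled candidates:     {len(candidates)}")
print(f"Model fits with cv={cv}:  {len(candidates) * cv}")
print(f"Fits for a full grid:   {len(grid) * cv}")
```

With these settings the search fits 60 models instead of the 450 a full grid with 3-fold cross-validation would need.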


In [9]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define model
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

# Hyperparameter space
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_leaf': [1, 2, 4, 6, 8]
}

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    n_iter=20,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Run search on training data
random_search.fit(X_train, y_train)

# Best parameters
print("Best parameters found:", random_search.best_params_)

# Evaluate on test set
best_rf = random_search.best_estimator_
y_pred_tuned = best_rf.predict(X_test)

rmse_log_tuned = mean_squared_error(y_test, y_pred_tuned) ** 0.5
r2_tuned = r2_score(y_test, y_pred_tuned)

print(f"Tuned RF Test RMSE (log scale): {rmse_log_tuned:.4f}")
print(f"Tuned RF Test R² score: {r2_tuned:.4f}")

# Back to original price scale

y_test_price = np.expm1(y_test)
y_pred_price_tuned = np.expm1(y_pred_tuned)
rmse_price_tuned = mean_squared_error(y_test_price, y_pred_price_tuned) ** 0.5

print(f"Tuned RF Test RMSE (original price scale): {rmse_price_tuned:.2f}")


Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best parameters found: {'n_estimators': 500, 'min_samples_leaf': 1, 'max_depth': 40}
Tuned RF Test RMSE (log scale): 0.6101
Tuned RF Test R² score: 0.3068
Tuned RF Test RMSE (original price scale): 302.34
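For a side-by-side view, the test metrics printed by the three model cells above can be collected into one table. The numbers below are copied from those outputs; the column name `rmse_log_change_pct` is a label chosen here for illustration.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above
results = pd.DataFrame(
    [
        {"model": "Random Forest (baseline)", "rmse_log": 0.6142, "r2": 0.2975, "rmse_price": 304.55},
        {"model": "XGBoost (defaults)", "rmse_log": 0.7072, "r2": 0.0687, "rmse_price": 339.73},
        {"model": "Random Forest (tuned)", "rmse_log": 0.6101, "r2": 0.3068, "rmse_price": 302.34},
    ]
).set_index("model")

# Relative change in log-scale RMSE versus the baseline (negative means better)
baseline_rmse = results.loc["Random Forest (baseline)", "rmse_log"]
results["rmse_log_change_pct"] = ((results["rmse_log"] / baseline_rmse - 1) * 100).round(2)

print(results.sort_values("rmse_log"))
```

Tuning lowers the log-scale RMSE by less than 1% relative to the baseline, while default XGBoost is about 15% worse, so the remaining error is more likely limited by the features than by the choice of hyperparameters.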


# Save the Trained Model and Artifacts for Deployment

We save the tuned Random Forest model (`best_rf`) and the list of training feature columns using `joblib`, so the model can be loaded later for inference without retraining.

- Save the trained model object.
- Save the list of feature names (to keep the feature order consistent).
- Provide a loading and prediction example for deployment.
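The saved feature list matters because the model expects the same columns, in the same order, as at training time. The sketch below shows one way to enforce that at inference; `align_features` is a hypothetical helper, and the three-column feature list is shortened for illustration.

```python
import pandas as pd

def align_features(record, feature_cols):
    """Return a one-row DataFrame whose columns follow feature_cols exactly."""
    missing = [col for col in feature_cols if col not in record]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Extra keys are dropped; column order follows the saved list
    return pd.DataFrame([record])[list(feature_cols)]

# Shortened stand-in for the list loaded from feature_columns.joblib
feature_cols = ["minimum_nights", "availability_365", "neighbourhood_freq"]

# Keys arrive in a different order, with one field the model never saw
record = {"neighbourhood_freq": 500, "extra_field": 1, "minimum_nights": 2, "availability_365": 150}

row = align_features(record, feature_cols)
print(row.columns.tolist())
```

Failing loudly on a missing feature is a deliberate choice in this sketch: silently filling a default would produce a prediction that looks valid but is based on incomplete input.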


In [10]:
import joblib
import os

# Ensure models directory exists
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)

# File paths
model_path = os.path.join(model_dir, "random_forest_tuned_model.joblib")
features_path = os.path.join(model_dir, "feature_columns.joblib")

# Save model and features
joblib.dump(best_rf, model_path)
joblib.dump(X_train.columns.tolist(), features_path)

print(f"Model saved to {model_path}")
print(f"Feature columns saved to {features_path}")


Model saved to models\random_forest_tuned_model.joblib
Feature columns saved to models\feature_columns.joblib


In [11]:
import joblib

# Load feature columns saved during training
features_path = "models/feature_columns.joblib"
feature_cols = joblib.load(features_path)

print(f"Number of features used in training: {len(feature_cols)}")
print("Features:")
for f in feature_cols:
    print(f"- {f}")


Number of features used in training: 18
Features:
- host_identity_verified
- instant_bookable
- construction_year
- minimum_nights
- number_of_reviews
- reviews_per_month
- review_rate_number
- availability_365
- cancellation_policy_moderate
- cancellation_policy_strict
- room_type_Hotel room
- room_type_Private room
- room_type_Shared room
- last_review_year
- last_review_month
- days_since_last_review
- property_age
- neighbourhood_freq


In [12]:
import joblib
import numpy as np
import pandas as pd

# Load model and features
model_path = "models/random_forest_tuned_model.joblib"
features_path = "models/feature_columns.joblib"
model = joblib.load(model_path)
feature_cols = joblib.load(features_path)

# Create a sample input with plausible, hand-picked values for each feature
sample_data = {
    'host_identity_verified': [1],        # binary 0 or 1
    'instant_bookable': [1],               # binary 0 or 1
    'construction_year': [2015],           # year in float
    'minimum_nights': [2],                 # float
    'number_of_reviews': [10],             # float
    'reviews_per_month': [0.5],            # float
    'review_rate_number': [4.0],           # rating 1-5 float
    'availability_365': [150],             # float days available in year
    'cancellation_policy_moderate': [False], # bool
    'cancellation_policy_strict': [True],    # bool
    'room_type_Hotel room': [False],          # bool
    'room_type_Private room': [True],         # bool
    'room_type_Shared room': [False],          # bool
    'last_review_year': [2022],               # int
    'last_review_month': [5],                  # int
    'days_since_last_review': [100],           # int
    'property_age': [7],                        # float
    'neighbourhood_freq': [500]                 # int
}

# Build DataFrame with correct feature order
X_sample = pd.DataFrame(sample_data)[feature_cols]

# Predict log price and convert back to original price scale
log_price_pred = model.predict(X_sample)[0]
price_pred = np.expm1(log_price_pred)

print(f"Predicted log price: {log_price_pred:.4f}")
print(f"Predicted price: ${price_pred:.2f}")


Predicted log price: 6.1678
Predicted price: $476.15


In [13]:
# Assuming your final cleaned DataFrame with both columns is named df_cleaned
neighbourhood_freq_dict = (
    df_cleaned[['neighbourhood', 'neighbourhood_freq']]
    .drop_duplicates()
    .set_index('neighbourhood')['neighbourhood_freq']
    .to_dict()
)

# Optional: Save to file
import joblib
joblib.dump(neighbourhood_freq_dict, "models/neighbourhood_freq_dict.joblib")


NameError: name 'df_cleaned' is not defined

In [14]:
import pandas as pd
import joblib

# Load the final featured dataset (make sure the path is correct)
df_cleaned = pd.read_csv("../data/cleaned/airbnb_featured_data.csv")

# Extract the neighbourhood → frequency mapping
neighbourhood_freq_dict = (
    df_cleaned[['neighbourhood', 'neighbourhood_freq']]
    .drop_duplicates()
    .set_index('neighbourhood')['neighbourhood_freq']
    .to_dict()
)

# Save the dictionary for use in Streamlit
joblib.dump(neighbourhood_freq_dict, "models/neighbourhood_freq_dict.joblib")

print(f"Saved {len(neighbourhood_freq_dict)} neighbourhoods to models/neighbourhood_freq_dict.joblib")


KeyError: "['neighbourhood'] not in index"

In [15]:
import pandas as pd

df = pd.read_csv("../data/cleaned/airbnb_featured_data.csv")
print(df.columns.tolist())


['host_identity_verified', 'instant_bookable', 'construction_year', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'review_rate_number', 'availability_365', 'cancellation_policy_moderate', 'cancellation_policy_strict', 'room_type_Hotel room', 'room_type_Private room', 'room_type_Shared room', 'last_review_year', 'last_review_month', 'days_since_last_review', 'property_age', 'neighbourhood_freq']


In [16]:
import pandas as pd
import joblib

# Load full cleaned dataset BEFORE dropping 'neighbourhood'
df_full = pd.read_csv("../data/cleaned/airbnb_cleaned.csv")  # this should have 'neighbourhood' column

# Load final featured dataset with neighbourhood_freq
df_featured = pd.read_csv("../data/cleaned/airbnb_featured_data.csv")

# Align rows by position (not a key-based merge); assumes both files hold the same rows in the same order
df_full = df_full.reset_index(drop=True)
df_featured = df_featured.reset_index(drop=True)
df_merged = pd.concat([df_full['neighbourhood'], df_featured['neighbourhood_freq']], axis=1)

# Drop duplicates and build mapping
neighbourhood_freq_dict = (
    df_merged
    .drop_duplicates()
    .set_index('neighbourhood')['neighbourhood_freq']
    .to_dict()
)

# Save for Streamlit app
joblib.dump(neighbourhood_freq_dict, "models/neighbourhood_freq_dict.joblib")
print("✅ Saved: models/neighbourhood_freq_dict.joblib")


✅ Saved: models/neighbourhood_freq_dict.joblib
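The positional concatenation in the cell above only works if both CSV files keep their rows in the same order. A sketch of an alternative is to rebuild the mapping from the `neighbourhood` column alone. This assumes `neighbourhood_freq` is the number of listings per neighbourhood, which the notebook does not confirm; the neighbourhood names and the `lookup_freq` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'neighbourhood' column of the cleaned dataset
df_full = pd.DataFrame(
    {"neighbourhood": ["Harlem", "Harlem", "Chelsea", "Harlem", "SoHo"]}
)

# Assumption: neighbourhood_freq is the listing count per neighbourhood
neighbourhood_freq_dict = df_full["neighbourhood"].value_counts().to_dict()

def lookup_freq(name, mapping, default=0):
    """Return the stored frequency, or a default for neighbourhoods not seen in training."""
    return int(mapping.get(name, default))

print(neighbourhood_freq_dict)
print(lookup_freq("Harlem", neighbourhood_freq_dict))
print(lookup_freq("Unknown Place", neighbourhood_freq_dict))
```

If the featured dataset used a different encoding (for example a relative frequency), the same pattern applies with `value_counts(normalize=True)`, and the default for unseen neighbourhoods should be chosen to match.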
