# Multimodal Model Training & Evaluation

## Notebook Overview
This notebook implements the **model training and evaluation pipeline** for the satellite imagery–based property valuation project. It compares traditional tabular-only regression with a multimodal approach that integrates engineered tabular features and CNN-extracted visual embeddings.

## Modeling Strategy
The training process is carried out in two stages:

1. **Tabular-Only Baseline Model**
   - A Random Forest regressor is trained using the engineered tabular features.
   - This model serves as a strong baseline to quantify the predictive power of numerical data alone.

2. **Multimodal Regression Model**
   - Tabular features are fused with visual feature embeddings extracted from satellite images using a CNN.
   - The combined feature representation is used to train a multimodal regression model based on CatBoost.
   - This approach enables the model to leverage both structural property attributes and surrounding environmental context.

## Evaluation & Output
Model performance is evaluated with **RMSE** (on the price scale, after inverting the log transform) and **R²** (on the log-price scale) to compare the tabular-only and multimodal models. The trained multimodal model is saved for inference on the test dataset and for generating the final price predictions.
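
As a minimal sketch of this evaluation scheme, the helper below computes RMSE after mapping log-space values back to prices with `np.expm1`, and R² directly on the log values. It assumes the targets were stored as `log1p(price)`, which the use of `expm1` in the evaluation cells suggests. The function name `evaluate_log_predictions` and the toy prices are illustrative and not part of the project code.

```python
import numpy as np

def evaluate_log_predictions(y_true_log, y_pred_log):
    """Return (RMSE on the price scale, R² on the log scale).

    Both inputs are assumed to hold log1p(price) values.
    """
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)

    # RMSE on the price scale: invert log1p with expm1 first
    price_true = np.expm1(y_true_log)
    price_pred = np.expm1(y_pred_log)
    rmse_price = float(np.sqrt(np.mean((price_true - price_pred) ** 2)))

    # R² on the log scale: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true_log - y_pred_log) ** 2)
    ss_tot = np.sum((y_true_log - y_true_log.mean()) ** 2)
    r2_log = float(1.0 - ss_res / ss_tot)

    return rmse_price, r2_log

# Toy example (hypothetical prices, not project data)
y_log = np.log1p(np.array([100_000.0, 200_000.0, 400_000.0]))

# A perfect prediction gives zero error and R² = 1
rmse_perfect, r2_perfect = evaluate_log_predictions(y_log, y_log)

# Predicting the mean log-price everywhere gives R² = 0
rmse_mean, r2_mean = evaluate_log_predictions(y_log, np.full(3, y_log.mean()))
```

Reporting RMSE in price units keeps the error interpretable, while R² on the log scale matches the space in which the models are actually trained.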

---



# TABULAR ONLY MODEL TRAINING
> ## RANDOM FOREST

In [None]:
import numpy as np
import pandas as pd
import os

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Mount drive (Colab)
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [None]:
PROCESSED_DIR = "/content/drive/MyDrive/multimodal-real-estate/data/processed"

# Load engineered tabular data
X_train = pd.read_csv(f"{PROCESSED_DIR}/X_train.csv")
X_val   = pd.read_csv(f"{PROCESSED_DIR}/X_val.csv")

# Load log-price targets
y_train = pd.read_csv(f"{PROCESSED_DIR}/y_train.csv").squeeze()
y_val   = pd.read_csv(f"{PROCESSED_DIR}/y_val.csv").squeeze()

print("X_train shape:", X_train.shape)
print("X_val shape  :", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_val shape  :", y_val.shape)


X_train shape: (12967, 32)
X_val shape  : (3242, 32)
y_train shape: (12967,)
y_val shape  : (3242,)


In [None]:
assert X_train.shape[1] == X_val.shape[1]
assert len(X_train) == len(y_train)
assert len(X_val) == len(y_val)

print("✅ Tabular data alignment confirmed")


✅ Tabular data alignment confirmed


In [None]:
from sklearn.ensemble import RandomForestRegressor


In [None]:
rf_model = RandomForestRegressor(
    n_estimators=400,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)


In [None]:
rf_model.fit(X_train, y_train)


In [None]:
# Predict (log space)
y_val_pred_log_rf = rf_model.predict(X_val)

# Convert back to price
y_val_pred_price_rf = np.expm1(y_val_pred_log_rf)
y_val_true_price = np.expm1(y_val)

# RMSE (price scale)
mse_rf = mean_squared_error(y_val_true_price, y_val_pred_price_rf)
rmse_rf = np.sqrt(mse_rf)

# R² (log space)
r2_rf = r2_score(y_val, y_val_pred_log_rf)

print("RANDOM FOREST TABULAR RESULTS")
print(f"RMSE (price) : {rmse_rf:,.2f}")
print(f"R² (log)    : {r2_rf:.4f}")


RANDOM FOREST TABULAR RESULTS
RMSE (price) : 139,195.04
R² (log)    : 0.8751


# MULTIMODAL (TABULAR + IMAGE) TRAINING
> ## CATBOOST

In [None]:
! pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
   99.2/99.2 MB 9.2 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [None]:
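# Load multimodal arrays
# NOTE: BASE_DIR is not defined anywhere in this notebook. Set it to the
# directory that holds the .npy arrays before running this cell.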
X_img_train = np.load(f"{BASE_DIR}/X_img_train.npy")
X_img_val   = np.load(f"{BASE_DIR}/X_img_val.npy")

X_tab_train = np.load(f"{BASE_DIR}/X_tab_train.npy")
X_tab_val   = np.load(f"{BASE_DIR}/X_tab_val.npy")

y_train_mm  = np.load(f"{BASE_DIR}/y_train_mm.npy")
y_val_mm    = np.load(f"{BASE_DIR}/y_val_mm.npy")

# Shape checks
print("TRAIN SHAPES")
print("X_img_train:", X_img_train.shape)
print("X_tab_train:", X_tab_train.shape)
print("y_train_mm :", y_train_mm.shape)

print("\nVAL SHAPES")
print("X_img_val:", X_img_val.shape)
print("X_tab_val:", X_tab_val.shape)
print("y_val_mm :", y_val_mm.shape)

# Alignment assertions
assert X_img_train.shape[0] == X_tab_train.shape[0] == y_train_mm.shape[0]
assert X_img_val.shape[0]   == X_tab_val.shape[0]   == y_val_mm.shape[0]

# NaN checks
assert not np.isnan(X_tab_train).any()
assert not np.isnan(X_tab_val).any()
assert not np.isnan(y_train_mm).any()
assert not np.isnan(y_val_mm).any()

print("\n✅ Multimodal alignment and integrity CONFIRMED")


TRAIN SHAPES
X_img_train: (12967, 512)
X_tab_train: (12967, 32)
y_train_mm : (12967,)

VAL SHAPES
X_img_val: (3242, 512)
X_tab_val: (3242, 32)
y_val_mm : (3242,)

✅ Multimodal alignment and integrity CONFIRMED


In [None]:
# Combine tabular + image features
X_train_mm = np.hstack([X_tab_train, X_img_train])
X_val_mm   = np.hstack([X_tab_val, X_img_val])

print("X_train_mm shape:", X_train_mm.shape)
print("X_val_mm shape  :", X_val_mm.shape)


X_train_mm shape: (12967, 544)
X_val_mm shape  : (3242, 544)


In [None]:
from catboost import CatBoostRegressor

cat_model = CatBoostRegressor(
    iterations=1200,
    learning_rate=0.05,
    depth=8,
    loss_function="RMSE",
    eval_metric="RMSE",
    random_seed=42,
    verbose=100
)

cat_model.fit(
    X_train_mm, y_train_mm,
    eval_set=(X_val_mm, y_val_mm),
    use_best_model=True
)


0:	learn: 0.5051571	test: 0.5069081	best: 0.5069081 (0)	total: 725ms	remaining: 14m 29s
100:	learn: 0.1724331	test: 0.1808036	best: 0.1808036 (100)	total: 49.7s	remaining: 9m 1s
200:	learn: 0.1534203	test: 0.1716805	best: 0.1716805 (200)	total: 1m 40s	remaining: 8m 18s
300:	learn: 0.1391231	test: 0.1678399	best: 0.1678399 (300)	total: 2m 29s	remaining: 7m 25s
400:	learn: 0.1277476	test: 0.1671806	best: 0.1671699 (399)	total: 3m 19s	remaining: 6m 38s
500:	learn: 0.1174360	test: 0.1674220	best: 0.1671699 (399)	total: 4m 8s	remaining: 5m 46s
600:	learn: 0.1078641	test: 0.1679382	best: 0.1671699 (399)	total: 4m 59s	remaining: 4m 58s
700:	learn: 0.0989049	test: 0.1689224	best: 0.1671699 (399)	total: 5m 48s	remaining: 4m 7s
800:	learn: 0.0906140	test: 0.1695811	best: 0.1671699 (399)	total: 6m 38s	remaining: 3m 18s
900:	learn: 0.0832171	test: 0.1707512	best: 0.1671699 (399)	total: 7m 27s	remaining: 2m 28s
1000:	learn: 0.0763951	test: 0.1718145	best: 0.1671699 (399)	total: 8m 17s	remaining: 1m

<catboost.core.CatBoostRegressor at 0x7d5f604facf0>
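
The log above shows validation RMSE reaching its minimum at iteration 399 and rising afterwards while training RMSE keeps falling. `use_best_model=True` keeps the model from the best iteration, but the remaining iterations are still computed. CatBoost's `fit` accepts an `early_stopping_rounds` argument for stopping sooner. The sketch below does not call CatBoost. It only illustrates a simple patience rule in plain Python, applied to the checkpoint values copied from the log. The function `patience_stop` and the patience of 2 checkpoints are assumptions for illustration, and because only every 100th iteration is logged, the sketch reports 400 rather than the true best iteration 399.

```python
# Validation RMSE at the logged checkpoints (every 100 iterations),
# copied from the training log above.
val_rmse = {
    0: 0.5069081, 100: 0.1808036, 200: 0.1716805, 300: 0.1678399,
    400: 0.1671806, 500: 0.1674220, 600: 0.1679382, 700: 0.1689224,
    800: 0.1695811, 900: 0.1707512, 1000: 0.1718145,
}

def patience_stop(history, patience):
    """Return (best_iteration, stop_iteration) under a simple patience rule.

    history  : mapping iteration -> validation loss, in increasing iteration order
    patience : number of consecutive non-improving checkpoints tolerated
    """
    best_iter, best_loss, bad_checks = None, float("inf"), 0
    for iteration, loss in history.items():
        if loss < best_loss:
            # New best: remember it and reset the counter
            best_iter, best_loss, bad_checks = iteration, loss, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return best_iter, iteration
    return best_iter, None  # the rule never triggered

best_iter, stop_iter = patience_stop(val_rmse, patience=2)
print(f"best checkpoint: {best_iter}, stop at: {stop_iter}")
# prints "best checkpoint: 400, stop at: 600"
```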

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Predict (log space)
y_val_pred_log = cat_model.predict(X_val_mm)

# Convert back to price scale
y_val_pred_price = np.expm1(y_val_pred_log)
y_val_true_price = np.expm1(y_val_mm)

# RMSE on price scale (sqrt of MSE, which avoids the version-dependent `squared=False` argument)
mse = mean_squared_error(y_val_true_price, y_val_pred_price)
rmse = np.sqrt(mse)

# R² on log scale
r2 = r2_score(y_val_mm, y_val_pred_log)

print("CATBOOST MULTIMODAL RESULTS")
print(f"RMSE (price) : {rmse:,.2f}")
print(f"R² (log)    : {r2:.4f}")


CATBOOST MULTIMODAL RESULTS
RMSE (price) : 114,039.08
R² (log)    : 0.8987
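
To put the two result blocks side by side, the short calculation below derives the absolute and relative RMSE reduction and the R² gain from the printed validation metrics. The variable names are illustrative. Note that the comparison changes two things at once (Random Forest to CatBoost, and tabular-only to tabular plus image features), so the difference should not be attributed to the image embeddings alone.

```python
# Validation metrics copied from the printed results above
rmse_rf, r2_rf = 139_195.04, 0.8751   # Random Forest, tabular only
rmse_mm, r2_mm = 114_039.08, 0.8987   # CatBoost, tabular + image

rmse_drop = rmse_rf - rmse_mm              # absolute reduction in price units
rmse_drop_pct = 100 * rmse_drop / rmse_rf  # reduction relative to the baseline
r2_gain = r2_mm - r2_rf                    # gain in log-space R²

print(f"RMSE reduction : {rmse_drop:,.2f} ({rmse_drop_pct:.1f}%)")
print(f"R² gain (log)  : {r2_gain:.4f}")
# RMSE reduction : 25,155.96 (18.1%)
# R² gain (log)  : 0.0236
```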


In [None]:
MODEL_DIR = "/content/drive/MyDrive/multimodal-real-estate/models"
os.makedirs(MODEL_DIR, exist_ok=True)

MODEL_PATH = f"{MODEL_DIR}/catboost_multimodal.cbm"

cat_model.save_model(MODEL_PATH)

print("✅ CatBoost multimodal model saved at:")
print(MODEL_PATH)


✅ CatBoost multimodal model saved at:
/content/drive/MyDrive/multimodal-real-estate/models/catboost_multimodal.cbm
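
For the inference step on the test dataset, one possible sketch is shown below. It reloads the saved `.cbm` file, fuses tabular and image features in the same column order used for training, and maps log-space predictions back to prices. The helper names `load_catboost_model` and `predict_prices` are hypothetical, and the stand-in model at the end exists only so the snippet runs without the saved file.

```python
import numpy as np

def load_catboost_model(path):
    """Reload a CatBoost regressor saved with save_model (cbm format)."""
    from catboost import CatBoostRegressor  # imported lazily; requires catboost
    model = CatBoostRegressor()
    model.load_model(path)
    return model

def predict_prices(model, X_tab, X_img):
    """Fuse features as in training (tabular first, then image) and return prices."""
    X = np.hstack([np.asarray(X_tab), np.asarray(X_img)])
    y_pred_log = model.predict(X)   # predictions are in log1p(price) space
    return np.expm1(y_pred_log)     # convert back to the price scale

# Stand-in for a trained model, used here only to make the sketch runnable.
class _ConstantLogModel:
    def predict(self, X):
        # Predict log1p(250000) for every row
        return np.full(X.shape[0], np.log1p(250_000.0))

X_tab_demo = np.zeros((3, 32))    # same widths as the arrays in this notebook
X_img_demo = np.zeros((3, 512))
prices = predict_prices(_ConstantLogModel(), X_tab_demo, X_img_demo)

# With the real model (path taken from the save cell above):
# model = load_catboost_model(MODEL_PATH)
# test_prices = predict_prices(model, X_tab_test, X_img_test)
```

The column order matters because the model was trained on a plain NumPy array, so features are identified by position rather than by name.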
