# Price Prediction on Test Data

## Notebook Overview
This notebook performs **final price prediction** on the test dataset using the trained multimodal regression model. It applies the same preprocessing and feature extraction steps used during training to ensure consistency and reproducibility.

## Prediction Pipeline
The prediction process follows these steps:
- Satellite images for the test dataset are programmatically downloaded using latitude and longitude coordinates
- Visual features are extracted from the test images with the same CNN-based feature extractor used for the training images
- Tabular features go through the same feature engineering transformations that were applied during training
- The saved multimodal CatBoost model is loaded and used to predict property prices based on the fused tabular and image features
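
The fusion in the last step is a plain column-wise concatenation. As a minimal sketch, assuming 32 engineered tabular features and 512-dimensional image embeddings (the shapes reported later in this notebook) and using random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 4  # toy batch; the real test set is much larger

# Stand-ins for the engineered tabular matrix and the CNN image embeddings.
X_tab = rng.normal(size=(n_rows, 32)).astype(np.float32)
X_img = rng.normal(size=(n_rows, 512)).astype(np.float32)

# Early fusion: concatenate along the feature axis, tabular columns first.
X_fused = np.hstack([X_tab, X_img])
print(X_fused.shape)
```

The column order (tabular first, then image) has to match whatever order was used when the model was trained, because a model fed a NumPy array identifies features by position only.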

## Output
The final output of this notebook is a CSV file containing predicted property prices for the test dataset, formatted strictly according to the submission guidelines.


# Downloading Test Images
> Using data_fetcher.py
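
The script itself is not reproduced in this notebook. The sketch below is only an illustration of the kind of skip-if-present loop that would produce `Downloaded / Skipped / Failed` counters. The function `download_missing` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever imagery API the real script calls. The `id_<id>.jpg` file naming is the one the later cells look for.

```python
import os
import tempfile

def download_missing(coords, image_dir, fetch):
    """Save one image per (id, lat, long) row, skipping files that already exist.

    `fetch(lat, lon) -> bytes` is a hypothetical callable, not the real API.
    """
    os.makedirs(image_dir, exist_ok=True)
    counts = {"downloaded": 0, "skipped": 0, "failed": 0}
    for house_id, lat, lon in coords:
        path = os.path.join(image_dir, f"id_{house_id}.jpg")
        if os.path.exists(path):
            counts["skipped"] += 1
            continue
        try:
            data = fetch(lat, lon)
        except Exception:
            counts["failed"] += 1
            continue
        with open(path, "wb") as fh:
            fh.write(data)
        counts["downloaded"] += 1
    return counts

# Toy run with a fake fetcher; the repeated id is skipped the second time.
coords = [(1, 47.4388, -122.162), (2, 47.6784, -122.285), (1, 47.4388, -122.162)]
with tempfile.TemporaryDirectory() as tmp:
    counts = download_missing(coords, tmp, fetch=lambda lat, lon: b"fake-jpeg-bytes")
print(counts)
```

Under this reading, a row is counted as skipped whenever its image file is already on disk, which would happen for an id that appears more than once in the coordinate file.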

In [None]:
import pandas as pd

# Load test dataset
test_df = pd.read_excel(
    "/content/drive/MyDrive/multimodal-real-estate/data/raw/test2.xlsx"
)

# Create coords file
test_coords = test_df[["id", "lat", "long"]].copy()

# Save for image downloader
test_coords_path = "/content/drive/MyDrive/multimodal-real-estate/data/raw/test_image_coords.csv"
test_coords.to_csv(test_coords_path, index=False)

print("test_image_coords.csv created")
print("Shape:", test_coords.shape)


test_image_coords.csv created
Shape: (5404, 3)


In [None]:
import os

TEST_IMAGE_DIR = "/content/drive/MyDrive/multimodal-real-estate/data/raw/images/test"
os.makedirs(TEST_IMAGE_DIR, exist_ok=True)

print("Test image directory ready:", TEST_IMAGE_DIR)


Test image directory ready: /content/drive/MyDrive/multimodal-real-estate/data/raw/images/test


In [None]:
!python /content/drive/MyDrive/multimodal-real-estate/data_fetcher.py


Loading coordinate CSV...
Total rows in CSV: 5404
Saving images to: /content/drive/MyDrive/multimodal-real-estate/data/raw/images/test
--------------------------------------------------
[100/5404] Downloaded: 100 | Skipped: 0 | Failed: 0
[200/5404] Downloaded: 200 | Skipped: 0 | Failed: 0
[300/5404] Downloaded: 300 | Skipped: 0 | Failed: 0
[400/5404] Downloaded: 400 | Skipped: 0 | Failed: 0
[500/5404] Downloaded: 500 | Skipped: 0 | Failed: 0
[600/5404] Downloaded: 600 | Skipped: 0 | Failed: 0
[700/5404] Downloaded: 699 | Skipped: 1 | Failed: 0
[800/5404] Downloaded: 799 | Skipped: 1 | Failed: 0
[900/5404] Downloaded: 899 | Skipped: 1 | Failed: 0
[1000/5404] Downloaded: 999 | Skipped: 1 | Failed: 0
[1100/5404] Downloaded: 1099 | Skipped: 1 | Failed: 0
[1200/5404] Downloaded: 1199 | Skipped: 1 | Failed: 0
[1300/5404] Downloaded: 1299 | Skipped: 1 | Failed: 0
[1400/5404] Downloaded: 1399 | Skipped: 1 | Failed: 0
[1500/5404] Downloaded: 1498 | Skipped: 2 | Failed: 0
[1600/5404] Downloaded:

# Test Images Feature Extraction
> Using the same CNN-based feature extractor

In [1]:
# Paths
BASE_PATH = "/content/drive/MyDrive/multimodal-real-estate"

DATA_PROCESSED = f"{BASE_PATH}/data/processed"
DATA_RAW = f"{BASE_PATH}/data/raw"

IMAGE_DIR_TEST = f"{DATA_RAW}/images/test"

print("Processed path:", DATA_PROCESSED)
print("Raw path:", DATA_RAW)
print("Test image dir:", IMAGE_DIR_TEST)


Processed path: /content/drive/MyDrive/multimodal-real-estate/data/processed
Raw path: /content/drive/MyDrive/multimodal-real-estate/data/raw
Test image dir: /content/drive/MyDrive/multimodal-real-estate/data/raw/images/test


In [None]:
import pandas as pd

# Load test2 Excel file
test2_df = pd.read_excel(f"{DATA_RAW}/test2.xlsx")

print("Test2 shape:", test2_df.shape)
print("\nColumns:")
print(test2_df.columns.tolist())

test2_df.head()


Test2 shape: (5404, 20)

Columns:
['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2591820310,20141006T000000,4,2.25,2070,8893,2.0,0,0,4,8,2070,0,1986,0,98058,47.4388,-122.162,2390,7700
1,7974200820,20140821T000000,5,3.0,2900,6730,1.0,0,0,5,8,1830,1070,1977,0,98115,47.6784,-122.285,2370,6283
2,7701450110,20140815T000000,4,2.5,3770,10893,2.0,0,2,3,11,3770,0,1997,0,98006,47.5646,-122.129,3710,9685
3,9522300010,20150331T000000,3,3.5,4560,14608,2.0,0,2,3,12,4560,0,1990,0,98034,47.6995,-122.228,4050,14226
4,9510861140,20140714T000000,3,2.5,2550,5376,2.0,0,0,3,9,2550,0,2004,0,98052,47.6647,-122.083,2250,4050


In [None]:
# Load training tabular columns (ground truth for order & selection)
X_train_cols = pd.read_csv(f"{DATA_PROCESSED}/X_train.csv").columns.tolist()

print("Training feature columns (model expects):")
print(X_train_cols)
print("\nNumber of features:", len(X_train_cols))


Training feature columns (model expects):
['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'lat', 'long', 'sqft_lot', 'sqft_above', 'sqft_basement', 'condition', 'grade', 'view', 'waterfront', 'sqft_living15', 'sqft_lot15']

Number of features: 15


In [None]:
# Create tabular test2 features using training column order
X_tab_test2 = test2_df[X_train_cols].copy()

print("X_tab_test2 shape:", X_tab_test2.shape)
print("\nFirst row (sanity check):")
X_tab_test2.head(1)


X_tab_test2 shape: (5404, 15)

First row (sanity check):


Unnamed: 0,bedrooms,bathrooms,sqft_living,floors,lat,long,sqft_lot,sqft_above,sqft_basement,condition,grade,view,waterfront,sqft_living15,sqft_lot15
0,4,2.25,2070,2.0,47.4388,-122.162,8893,2070,0,4,8,0,0,2390,7700


In [None]:
import os

# Get all image filenames (without .jpg)
available_images = set(
    fname.replace(".jpg", "")
    for fname in os.listdir(IMAGE_DIR_TEST)
    if fname.endswith(".jpg")
)

missing_ids = []

for i in test2_df["id"].unique():
    id_int = f"id_{int(i)}"
    id_float = f"id_{float(i)}"

    if id_int not in available_images and id_float not in available_images:
        missing_ids.append(i)

print("Total unique test2 IDs:", test2_df["id"].nunique())
print("Available images:", len(available_images))
print("Missing images:", len(missing_ids))
print("Sample missing IDs:", missing_ids[:10])


Total unique test2 IDs: 5396
Available images: 5396
Missing images: 0
Sample missing IDs: []
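
The counts above show 5396 unique IDs, while the test file has 5404 rows, so a few IDs occur more than once and those rows share one image. A minimal sketch of how such repeats could be listed with pandas, on a toy frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "sqft_living": [2070, 2900, 2070, 3770],
})

n_rows = len(df)
n_unique = df["id"].nunique()

# keep=False marks every occurrence of a repeated id, not just the later ones.
repeated = df[df["id"].duplicated(keep=False)].sort_values("id")

print("rows:", n_rows, "unique ids:", n_unique)
print(repeated)
```

Since predictions are written out one per row, repeated IDs simply appear more than once in the final file.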


In [None]:
from torchvision import transforms

image_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

print("Image transforms ready")


Image transforms ready
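
To make the arithmetic of this transform concrete: `ToTensor` rescales 8-bit pixel values to the range [0, 1], and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation (here the usual ImageNet statistics). A NumPy-only sketch for a single white pixel, leaving out the resize step:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# One white RGB pixel as stored in an 8-bit image.
pixel_uint8 = np.array([255, 255, 255], dtype=np.uint8)

scaled = pixel_uint8 / 255.0        # value range after ToTensor
normalized = (scaled - mean) / std  # per-channel result of Normalize
print(normalized)
```

Using the same statistics at training and prediction time matters because the pretrained network expects inputs in this normalized range.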


In [None]:
import os
import torch
from torch.utils.data import Dataset
from PIL import Image

class TestHouseDataset(Dataset):
    def __init__(self, X, image_dir, transform=None):
        self.X = X.reset_index(drop=True)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        row = self.X.iloc[idx]
        house_id = int(row["id"])

        # Tabular features
        tabular_features = torch.tensor(
            row.drop("id").values,
            dtype=torch.float32
        )

        # Image path handling (robust)
        img_path_1 = os.path.join(self.image_dir, f"id_{house_id}.jpg")
        img_path_2 = os.path.join(self.image_dir, f"id_{float(house_id)}.jpg")
        img_path = img_path_1 if os.path.exists(img_path_1) else img_path_2

        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image)

        return image, tabular_features


In [None]:
from torch.utils.data import DataLoader

# Attach id column for image loading
X_test2_mm = X_tab_test2.copy()
X_test2_mm["id"] = test2_df["id"]

test2_dataset = TestHouseDataset(
    X=X_test2_mm,
    image_dir=IMAGE_DIR_TEST,
    transform=image_transforms
)

test2_loader = DataLoader(
    test2_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=2,
    pin_memory=True
)

print("Test2 dataset size:", len(test2_dataset))


Test2 dataset size: 5404


In [None]:
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

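# Load ImageNet-pretrained ResNet-18 (the `weights` argument replaces the deprecated `pretrained=True`)
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)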

# Remove final classification layer
resnet = nn.Sequential(*list(resnet.children())[:-1])

# Freeze all weights
for param in resnet.parameters():
    param.requires_grad = False

resnet = resnet.to(device)
resnet.eval()

print("ResNet-18 loaded and frozen")


Using device: cpu




Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


100%|██████████| 44.7M/44.7M [00:00<00:00, 146MB/s]


ResNet-18 loaded and frozen


In [None]:
import numpy as np

def extract_test_embeddings(model, dataloader, device):
    model.eval()
    all_img_features = []
    all_tab_features = []

    with torch.no_grad():
        for images, tabular in dataloader:
            images = images.to(device)

            # Forward pass through ResNet
            features = model(images)
            features = features.view(features.size(0), -1)  # (B, 512)

            all_img_features.append(features.cpu().numpy())
            all_tab_features.append(tabular.numpy())

    X_img = np.vstack(all_img_features)
    X_tab = np.vstack(all_tab_features)

    return X_tab, X_img


In [None]:
X_tab_test2_np, X_img_test2_np = extract_test_embeddings(
    model=resnet,
    dataloader=test2_loader,
    device=device
)

print("Tabular embeddings shape:", X_tab_test2_np.shape)
print("Image embeddings shape:", X_img_test2_np.shape)




Tabular embeddings shape: (5404, 15)
Image embeddings shape: (5404, 512)


In [None]:
import numpy as np

# Paths
TEST2_TAB_PATH = f"{DATA_PROCESSED}/X_tab_test2.npy"
TEST2_IMG_PATH = f"{DATA_PROCESSED}/X_img_test2.npy"

# Save arrays
np.save(TEST2_TAB_PATH, X_tab_test2_np)
np.save(TEST2_IMG_PATH, X_img_test2_np)

print("✅ Saved TEST2 embeddings to Drive")
print("Tabular:", TEST2_TAB_PATH)
print("Images :", TEST2_IMG_PATH)


✅ Saved TEST2 embeddings to Drive
Tabular: /content/drive/MyDrive/multimodal-real-estate/data/processed/X_tab_test2.npy
Images : /content/drive/MyDrive/multimodal-real-estate/data/processed/X_img_test2.npy


# Same Feature Engineering on Tabular Data

In [4]:
import pandas as pd
import numpy as np
import os

from google.colab import drive
drive.mount("/content/drive")

# Paths
RAW_DIR = "/content/drive/MyDrive/multimodal-real-estate/data/raw"
PROCESSED_DIR = "/content/drive/MyDrive/multimodal-real-estate/data/processed"
MODEL_DIR = "/content/drive/MyDrive/multimodal-real-estate/models"

# Load raw test2 data
test2_df = pd.read_excel(f"{RAW_DIR}/test2.xlsx")

print("Test2 shape:", test2_df.shape)
print("\nColumns:")
print(test2_df.columns.tolist())

test2_df.head()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Test2 shape: (5404, 20)

Columns:
['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2591820310,20141006T000000,4,2.25,2070,8893,2.0,0,0,4,8,2070,0,1986,0,98058,47.4388,-122.162,2390,7700
1,7974200820,20140821T000000,5,3.0,2900,6730,1.0,0,0,5,8,1830,1070,1977,0,98115,47.6784,-122.285,2370,6283
2,7701450110,20140815T000000,4,2.5,3770,10893,2.0,0,2,3,11,3770,0,1997,0,98006,47.5646,-122.129,3710,9685
3,9522300010,20150331T000000,3,3.5,4560,14608,2.0,0,2,3,12,4560,0,1990,0,98034,47.6995,-122.228,4050,14226
4,9510861140,20140714T000000,3,2.5,2550,5376,2.0,0,0,3,9,2550,0,2004,0,98052,47.6647,-122.083,2250,4050


In [None]:
import numpy as np

# Small constant added to denominators to avoid division by zero
EPS = 1e-6

# Ratios & utilization
test2_df["basement_ratio"] = test2_df["sqft_basement"] / (test2_df["sqft_living"] + EPS)
test2_df["above_ratio"] = test2_df["sqft_above"] / (test2_df["sqft_living"] + EPS)
test2_df["lot_utilization"] = test2_df["sqft_living"] / (test2_df["sqft_lot"] + EPS)

# Neighborhood comparisons
test2_df["living_vs_neighbors"] = test2_df["sqft_living"] / (test2_df["sqft_living15"] + EPS)
test2_df["lot_vs_neighbors"] = test2_df["sqft_lot"] / (test2_df["sqft_lot15"] + EPS)

# Temporal features
CURRENT_YEAR = 2015
test2_df["house_age"] = CURRENT_YEAR - test2_df["yr_built"]
test2_df["is_renovated"] = (test2_df["yr_renovated"] > 0).astype(int)
test2_df["years_since_renovation"] = np.where(
    test2_df["yr_renovated"] > 0,
    CURRENT_YEAR - test2_df["yr_renovated"],
    0
)

# Room configuration
test2_df["bath_per_bed"] = test2_df["bathrooms"] / (test2_df["bedrooms"] + EPS)
test2_df["rooms_total"] = test2_df["bedrooms"] + test2_df["bathrooms"]

# Quality interactions
test2_df["grade_sqft_interaction"] = test2_df["grade"] * test2_df["sqft_living"]
test2_df["view_grade_interaction"] = test2_df["view"] * test2_df["grade"]

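# Geographic features: the reference point is downtown Seattle;
# the distance below is a plain Euclidean distance in degrees, not kilometres
CITY_CENTER_LAT = 47.6062
CITY_CENTER_LON = -122.3321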

test2_df["dist_to_city_center"] = np.sqrt(
    (test2_df["lat"] - CITY_CENTER_LAT)**2 +
    (test2_df["long"] - CITY_CENTER_LON)**2
)

test2_df["lat_long_interaction"] = test2_df["lat"] * test2_df["long"]

# Log transforms
test2_df["log_sqft_living"] = np.log1p(test2_df["sqft_living"])
test2_df["log_sqft_lot"] = np.log1p(test2_df["sqft_lot"])
test2_df["log_sqft_lot15"] = np.log1p(test2_df["sqft_lot15"])
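
The `EPS` term in the ratio features is what keeps the later NaN/inf check from failing. A small sketch on made-up rows, assuming a listing with zero bedrooms could occur:

```python
import numpy as np
import pandas as pd

EPS = 1e-6
toy = pd.DataFrame({"bedrooms": [3, 0], "bathrooms": [2.0, 1.0]})

unguarded = toy["bathrooms"] / toy["bedrooms"]        # second row becomes inf
guarded = toy["bathrooms"] / (toy["bedrooms"] + EPS)  # second row stays finite

print(np.isinf(unguarded).tolist())
print(np.isfinite(guarded).tolist())
```

The guarded value is finite but very large (about one million here), so the guard trades an invalid number for an extreme one. That is acceptable only because the same formula was applied to the training data.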


In [None]:
DROP_COLS = [
    "id",
    "date",
    "zipcode",
    "yr_built",
    "yr_renovated"
]

X_test2 = test2_df.drop(columns=DROP_COLS, errors="ignore")

print("X_test2 shape after FE:", X_test2.shape)


X_test2 shape after FE: (5404, 32)


In [None]:
# Load engineered TRAIN features
X_train_ref = pd.read_csv(f"{PROCESSED_DIR}/X_train.csv")

train_cols = list(X_train_ref.columns)
test_cols  = list(X_test2.columns)

print("Train feature count:", len(train_cols))
print("Test2 feature count:", len(test_cols))


Train feature count: 32
Test2 feature count: 32


In [None]:
missing_in_test = set(train_cols) - set(test_cols)
extra_in_test   = set(test_cols) - set(train_cols)

print("Missing in test2:", missing_in_test)
print("Extra in test2  :", extra_in_test)


Missing in test2: set()
Extra in test2  : set()


In [None]:
X_test2 = X_test2[train_cols]

print("✅ Column order aligned with training data")


✅ Column order aligned with training data
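
Column order matters here because the features are later converted to a NumPy array, at which point the model can only identify them by position. A toy sketch of the check-then-reorder pattern, with made-up column lists:

```python
import pandas as pd

train_cols = ["bedrooms", "bathrooms", "sqft_living"]
test = pd.DataFrame({"sqft_living": [2070], "bedrooms": [4], "bathrooms": [2.25]})

# Fail loudly if the two column sets differ before reordering.
missing = set(train_cols) - set(test.columns)
extra = set(test.columns) - set(train_cols)
assert not missing and not extra, (missing, extra)

# Selecting with the training list reorders the columns to match it.
aligned = test[train_cols]
print(aligned.to_numpy().tolist())
```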


In [None]:
assert not X_test2.isna().any().any()
assert not np.isinf(X_test2.values).any()

print("✅ No NaNs or infs in test2 features")


✅ No NaNs or infs in test2 features


In [None]:
# Convert to numpy (float32 for CatBoost)
X_tab_test2 = X_test2.values.astype(np.float32)

print("X_tab_test2 shape:", X_tab_test2.shape)

# Replace existing file (same name)
np.save(f"{PROCESSED_DIR}/X_tab_test2.npy", X_tab_test2)

print("✅ X_tab_test2.npy regenerated and replaced")


X_tab_test2 shape: (5404, 32)
✅ X_tab_test2.npy regenerated and replaced


# Predicting Test Prices
> ## Using the Trained CatBoost Model

In [2]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.2/99.2 MB 7.7 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [5]:
from catboost import CatBoostRegressor
import numpy as np
import os

# Load image embeddings for test2
X_img_test2 = np.load(f"{PROCESSED_DIR}/X_img_test2.npy")
print("X_img_test2 shape:", X_img_test2.shape)

# Load tabular test2 (just regenerated)
X_tab_test2 = np.load(f"{PROCESSED_DIR}/X_tab_test2.npy")
print("X_tab_test2 shape:", X_tab_test2.shape)

# Sanity alignment
assert X_img_test2.shape[0] == X_tab_test2.shape[0]

# Load trained CatBoost model
MODEL_PATH = f"{MODEL_DIR}/catboost_multimodal.cbm"

cat_model = CatBoostRegressor()
cat_model.load_model(MODEL_PATH)

print("✅ CatBoost model loaded successfully")


X_img_test2 shape: (5404, 512)
X_tab_test2 shape: (5404, 32)
✅ CatBoost model loaded successfully


In [6]:
import pandas as pd
import numpy as np

# Fuse tabular + image features
X_test2_mm = np.hstack([X_tab_test2, X_img_test2])
print("X_test2_mm shape:", X_test2_mm.shape)

# Predict in log space
y_test2_pred_log = cat_model.predict(X_test2_mm)

# Convert back to price scale
y_test2_pred_price = np.expm1(y_test2_pred_log)

print("Prediction sanity check:")
print("Min :", y_test2_pred_price.min())
print("Max :", y_test2_pred_price.max())
print("Mean:", y_test2_pred_price.mean())


X_test2_mm shape: (5404, 544)
Prediction sanity check:
Min : 159381.86482453733
Max : 4018244.998555967
Mean: 567674.441840937
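
The `np.expm1` call only recovers prices if the training target was `log1p(price)`, which is an assumption here since the training notebook is not shown. Under that assumption the two transforms are exact inverses up to floating-point error, as this round trip on made-up prices illustrates:

```python
import numpy as np

prices = np.array([159_000.0, 567_000.0, 4_018_000.0])

y_log = np.log1p(prices)     # assumed training-time target transform
recovered = np.expm1(y_log)  # inverse applied to the model's predictions

print(np.allclose(recovered, prices))
```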


In [7]:
# Reload raw test2 to get IDs (safe & simple)
test2_df = pd.read_excel(f"{RAW_DIR}/test2.xlsx")

prediction_df = pd.DataFrame({
    "id": test2_df["id"].values,
    "predicted_price": y_test2_pred_price
})

# Save prediction file
PRED_DIR = "/content/drive/MyDrive/multimodal-real-estate/predictions"
os.makedirs(PRED_DIR, exist_ok=True)

pred_path = f"{PRED_DIR}/23115047_final.csv"
prediction_df.to_csv(pred_path, index=False)

print("✅ 23115047_final.csv saved at:")
print(pred_path)

prediction_df.head()


✅ 23115047_final.csv saved at:
/content/drive/MyDrive/multimodal-real-estate/predictions/23115047_final.csv


Unnamed: 0,id,predicted_price
0,2591820310,396554.0
1,7974200820,951223.2
2,7701450110,1225626.0
3,9522300010,2142465.0
4,9510861140,753755.3
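
Before submitting, a few cheap checks on the prediction frame can catch obvious problems. The sketch below is not taken from the submission guidelines, which are not reproduced here. The helper `validate_predictions` is a hypothetical name, and it only assumes the two column names used above.

```python
import numpy as np
import pandas as pd

def validate_predictions(pred_df, n_expected):
    """Return a list of problems found; an empty list means the frame looks sane."""
    if list(pred_df.columns) != ["id", "predicted_price"]:
        return ["unexpected columns"]
    problems = []
    if len(pred_df) != n_expected:
        problems.append("row count mismatch")
    prices = pred_df["predicted_price"].to_numpy(dtype=float)
    if not np.isfinite(prices).all():
        problems.append("non-finite prices")
    elif (prices <= 0).any():
        problems.append("non-positive prices")
    return problems

good = pd.DataFrame({"id": [1, 2], "predicted_price": [396554.0, 951223.2]})
bad = pd.DataFrame({"id": [1, 2], "predicted_price": [396554.0, float("nan")]})

print(validate_predictions(good, 2))
print(validate_predictions(bad, 3))
```

In this notebook the call would be `validate_predictions(prediction_df, len(test2_df))`.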
