# Satellite Imagery-Based Property Valuation
## Data Preprocessing & Exploratory Data Analysis (EDA)

### Objective
This notebook performs:
- Loading and inspection of tabular housing data
- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)
- Feature scaling for model training

Satellite images are fetched separately from ESRI World Imagery,
using each property's latitude and longitude.
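
As a rough illustration of how a latitude/longitude pair can be mapped to an imagery tile, the sketch below converts coordinates to standard Web Mercator tile indices and builds a tile URL for the public ESRI World Imagery service. The helper names `latlon_to_tile` and `tile_url`, the default zoom level, and the sample coordinate are assumptions made for this example; the project's own fetching logic may work differently (for instance, by requesting an image for a bounding box).

```python
import math

# Tile endpoint of the public ESRI World Imagery map service.
ESRI_TILE_URL = (
    "https://server.arcgisonline.com/ArcGIS/rest/services/"
    "World_Imagery/MapServer/tile/{z}/{y}/{x}"
)


def latlon_to_tile(lat, lon, zoom):
    """Convert WGS84 coordinates to Web Mercator tile indices (x, y)."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that edge coordinates (e.g. lon == 180) stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tile_url(lat, lon, zoom=18):
    """Build the URL of the tile that contains the given coordinate."""
    x, y = latlon_to_tile(lat, lon, zoom)
    return ESRI_TILE_URL.format(z=zoom, y=y, x=x)


# Arbitrary sample coordinate, used only to show the call.
print(tile_url(47.5112, -122.257))
```

Only the URL is constructed here; downloading the tile (and respecting the service's terms of use) is left out of the sketch.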

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

%matplotlib inline
sns.set(style="whitegrid")

In [None]:
# Load training and test datasets
train_df = pd.read_excel("data/train(1).xlsx")
test_df  = pd.read_excel("data/test2.xlsx")

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

train_df.head()

In [None]:
# Column dtypes and non-null counts
train_df.info()

In [None]:
# Missing values per column, most affected columns first
train_df.isnull().sum().sort_values(ascending=False)

In [None]:
# Summary statistics for the numeric columns
train_df.describe()


In [None]:
plt.figure(figsize=(8,5))
sns.histplot(train_df["price"], bins=50, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()


In [None]:
plt.figure(figsize=(8,5))
sns.histplot(np.log1p(train_df["price"]), bins=50, kde=True)
plt.title("Log-Transformed House Price Distribution")
plt.xlabel("log(1 + price)")
plt.ylabel("Count")
plt.show()


In [None]:
plt.figure(figsize=(14,10))
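# numeric_only=True (pandas >= 1.5) skips non-numeric columns, which raise an error in pandas >= 2.0
corr = train_df.corr(numeric_only=True)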
sns.heatmap(corr, cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
plt.figure(figsize=(7,5))
sns.scatterplot(
    x=train_df["sqft_living"],
    y=train_df["price"],
    alpha=0.4
)
plt.title("Price vs Living Area")
plt.xlabel("Sqft Living")
plt.ylabel("Price")
plt.show()



In [None]:
plt.figure(figsize=(8,6))
plt.scatter(
    train_df["long"],
    train_df["lat"],
    c=np.log1p(train_df["price"]),
    cmap="viridis",
    s=10
)
plt.colorbar(label="Log Price")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Geographic Distribution of House Prices")
plt.show()


In [None]:
tabular_features = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
    "waterfront", "view", "condition", "grade", "sqft_above",
    "sqft_basement", "yr_built", "yr_renovated", "zipcode",
    "sqft_living15", "sqft_lot15", "lat", "long"
]

target = "price"

# Note: zipcode is a categorical code but is cast to float and treated as numeric here
X_train = train_df[tabular_features].astype(float)
y_train = train_df[target]

X_test = test_df[tabular_features].astype(float)


In [None]:
scaler = StandardScaler()

# Fit on the training data only, then reuse the same statistics for the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

print("Scaled Train Shape:", X_train_scaled.shape)
print("Scaled Test Shape:", X_test_scaled.shape)


In [None]:
import os
import joblib

os.makedirs("models", exist_ok=True)
joblib.dump(scaler, "models/tabular_scaler.pkl")

print("Scaler saved to models/tabular_scaler.pkl")


## ✅ Preprocessing Summary

- Loaded and validated tabular housing data
- Checked missing values and statistics
- Performed Exploratory Data Analysis (EDA):
  - Price distribution
  - Log-transformed target analysis
  - Correlation heatmap
  - Price vs. living area
  - Geographic price visualization
- Selected relevant tabular features
- Standardized features with StandardScaler (zero mean, unit variance)
- Saved scaler for use in model training
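
To make the last two points concrete, here is a minimal, self-contained sketch of the fit, save, reload, and transform round trip. It uses synthetic data and a temporary directory rather than the real dataset and `models/` folder, so the array shapes and values are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))

# Fit on the "training" data only.
scaler = StandardScaler().fit(X_fit)

# Persist the fitted scaler and load it back, as a training script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tabular_scaler.pkl")
    joblib.dump(scaler, path)
    reloaded = joblib.load(path)

# The reloaded scaler applies the statistics learned from X_fit.
X_new_scaled = reloaded.transform(X_new)
print(X_new_scaled.shape)
```

Reloading the saved scaler, rather than fitting a new one, keeps the feature scaling identical between preprocessing and any later training or inference step.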

### Next Steps
- Satellite images are downloaded using `src/data_fetcher.py`
- Multimodal model training is performed using `src/train.py`
