# Real Estate Valuation in Sri Lanka: A Machine Learning Approach

## 1. Introduction and Problem Identification
**Problem**: Real estate and land valuation in Sri Lanka is often opaque, relying heavily on subjective appraisals and manual comparisons. There is a lack of accessible, data-driven tools for ordinary citizens to estimate land prices based on objective parameters like location and utility access.

**Objective**: To develop a machine learning model that predicts the price per perch of land in Sri Lanka, and to provide an explainable AI (XAI) interface that shows how the model derives its valuations.

**Algorithm Selection**: While simpler models provide a baseline (Section 5 uses a Random Forest), this project uses **CatBoostRegressor** (Categorical Boosting). CatBoost is a gradient boosting algorithm that handles categorical variables (like `District` and `City`) natively, without one-hot encoding. This avoids a sparse, high-dimensional feature space and often improves accuracy on data with high-cardinality categories.
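
To give a feel for what "native" categorical handling can mean, here is a minimal sketch of ordered target statistics, the general idea CatBoost builds on: each row's category is replaced by a smoothed mean of the target computed only from rows that come earlier in a random order. This is a simplified illustration, not CatBoost's actual implementation (which, among other details, uses several permutations). The function name, the `prior_weight` parameter, and the prices are made up for the example.

```python
import random

def ordered_target_encode(categories, targets, prior_weight=1.0, seed=0):
    """Encode each row using only rows that precede it in a random order."""
    n = len(categories)
    prior = sum(targets) / n  # global mean acts as the prior
    order = list(range(n))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now reveal this row's own target to later rows
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded

districts = ["Colombo", "Kandy", "Colombo", "Galle", "Kandy", "Colombo"]
prices = [900.0, 300.0, 1100.0, 250.0, 350.0, 1000.0]  # made-up values
encoded = ordered_target_encode(districts, prices)
for d, e in zip(districts, encoded):
    print(f"{d:8s} -> {e:8.1f}")
```

Because a row never sees its own target, the encoding adds a single numeric column per categorical feature without leaking the label into that row's features.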

## 2. Dataset Collection & Loading
The dataset was compiled for this project by web scraping real estate listings. Each record contains the land size, the location, and binary indicators for essential utilities.

In [None]:
# Install required libraries if running in Google Colab
try:
    import google.colab
    print('Running in Google Colab. Installing dependencies...')
    !pip install catboost shap
except ImportError:
    print('Not running in Google Colab. Continuing...')


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_theme(style="whitegrid")

# Load Custom Dataset
import os
file_path = 'data/processed/cleaned_land_data1.csv'
if not os.path.exists(file_path):
    print("Dataset not found at 'data/processed/cleaned_land_data1.csv'.")
    print("If you are in Google Colab, please upload 'cleaned_land_data1.csv' now.")
    try:
        from google.colab import files
        uploaded = files.upload()
        if uploaded:
            file_path = list(uploaded.keys())[0]
    except ImportError:
        pass

df = pd.read_csv(file_path)
print(f"Dataset Shape: {df.shape}")
df.head()

## 3. Exploratory Data Analysis (EDA)
Let's explore the distribution of land prices across major districts and understand the feature correlations.

In [None]:
plt.figure(figsize=(12, 6))
top_districts = df['District'].value_counts().nlargest(10).index
sns.boxplot(x='District', y='Price per perch', data=df[df['District'].isin(top_districts)])
plt.title('Distribution of Price per Perch by Top 10 Districts')
plt.yscale('log') # Log scale due to extreme premium outliers in Colombo
plt.ylabel('Price per Perch (Log Scale LKR)')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
# Calculate correlation matrix for numerical columns
num_cols = df.select_dtypes(include=[np.number])
sns.heatmap(num_cols.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Features')
plt.show()

## 4. Data Preprocessing
We enforce consistent data types, separate the target variable (`Price per perch`), and split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Enforce column dtypes (casts values; raises if a value cannot be converted)
df['District'] = df['District'].astype(str)
df['City'] = df['City'].astype(str)
df['Land size'] = df['Land size'].astype(float)
df['Availability of electricity'] = df['Availability of electricity'].astype(int)
df['Availability of tap water'] = df['Availability of tap water'].astype(int)

# Define Target (y) and Features (X)
y = df['Price per perch'].astype(float)
X = df[['District', 'City', 'Land size', 'Availability of electricity', 'Availability of tap water']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training instances: {len(X_train)} | Testing instances: {len(X_test)}")

## 5. Baseline Model Comparison (Random Forest)
To justify the use of our advanced boosting technique (CatBoost), we must establish a baseline. We will use a standard `RandomForestRegressor`. Since scikit-learn's Random Forest cannot handle strings directly, we first encode `District` and `City` with an `OrdinalEncoder`; categories not seen during training are mapped to `-1`.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['District', 'City'])
    ],
    remainder='passthrough'
)

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

rf_pipeline.fit(X_train, y_train)
rf_pred = rf_pipeline.predict(X_test)

print("--- Random Forest Baseline Metrics ---")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, rf_pred)):.2f}")
print(f"MAE:  {mean_absolute_error(y_test, rf_pred):.2f}")
print(f"R2:   {r2_score(y_test, rf_pred):.4f}")
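
The three metrics printed above can also be computed by hand, which makes their differences easier to see. The sketch below uses made-up numbers, not values from our dataset: both prediction sets have the same MAE, but the one with a single large miss has double the RMSE, because squaring weights large errors more heavily. This matters for land prices, where a few premium listings can dominate the RMSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

y_true = [100.0, 200.0, 300.0, 400.0]
even_errors = [110.0, 190.0, 310.0, 390.0]   # every prediction off by 10
one_big_miss = [100.0, 200.0, 300.0, 360.0]  # three exact, one off by 40

rmse_even, mae_even, r2_even = regression_metrics(y_true, even_errors)
rmse_miss, mae_miss, r2_miss = regression_metrics(y_true, one_big_miss)
print(f"Even errors:  RMSE={rmse_even:.1f}  MAE={mae_even:.1f}  R2={r2_even:.3f}")
print(f"One big miss: RMSE={rmse_miss:.1f}  MAE={mae_miss:.1f}  R2={r2_miss:.3f}")
```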

## 6. Advanced Model Training & Hyperparameter Tuning (CatBoost)
We now train the `CatBoostRegressor`. We use `RandomizedSearchCV` to tune the hyperparameters (`iterations`, `learning_rate`, `depth`) by sampling candidate combinations and scoring each with 3-fold cross-validation, rather than relying on arbitrary guesses.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor

# Despite the variable name, these are column names, not positional indices
categorical_features_indices = ['District', 'City']

# Define base model
cb_base = CatBoostRegressor(
    random_seed=42, 
    verbose=0
)

# Hyperparameter Selection Grid
param_grid = {
    'iterations': [200, 500],
    'learning_rate': [0.05, 0.1, 0.2],
    'depth': [4, 6, 8]
}

print("Starting RandomizedSearchCV cross-validation...")
random_search = RandomizedSearchCV(
    estimator=cb_base,
    param_distributions=param_grid,
    n_iter=5,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train, cat_features=categorical_features_indices)
best_params = random_search.best_params_
print(f"\nOptimal tuned parameters found: {best_params}")
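
To make the search budget concrete, the sketch below enumerates the grid used above and draws 5 combinations from it. It only illustrates the idea of sampling without replacement from a finite grid; it does not reproduce which 5 candidates scikit-learn actually picks. The name `search_space` is ours.

```python
import itertools
import random

search_space = {
    "iterations": [200, 500],
    "learning_rate": [0.05, 0.1, 0.2],
    "depth": [4, 6, 8],
}

# Every possible combination: 2 * 3 * 3 = 18
names = list(search_space)
all_combos = [
    dict(zip(names, values))
    for values in itertools.product(*search_space.values())
]

# Randomized search evaluates only a sample of them (n_iter=5 above)
rng = random.Random(42)
sampled = rng.sample(all_combos, k=5)

print(f"{len(sampled)} of {len(all_combos)} combinations would be evaluated")
for combo in sampled:
    print(combo)
```

With `cv=3`, each sampled candidate is fitted 3 times, so the search costs 15 fits plus one final refit of the best candidate, compared with 54 fits for an exhaustive grid search.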

### 6.1 Final Model Evaluation
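Using the identified optimal hyperparameters, we assemble the final model. We also apply early stopping (`early_stopping_rounds=50`) to limit overfitting. Early stopping is monitored on a validation set carved out of the training data, so the test set stays untouched until the final evaluation.

In [None]:
final_model = CatBoostRegressor(
    iterations=best_params['iterations'],
    learning_rate=best_params['learning_rate'],
    depth=best_params['depth'],
    cat_features=categorical_features_indices,
    random_seed=42,
    verbose=0
)

# Hold out part of the training data for early stopping, so the test set
# is not used to choose the number of boosting rounds
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit with early stopping monitored on the validation set
final_model.fit(X_fit, y_fit, eval_set=(X_val, y_val), early_stopping_rounds=50)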

cb_pred = final_model.predict(X_test)
cb_rmse = np.sqrt(mean_squared_error(y_test, cb_pred))
cb_mae = mean_absolute_error(y_test, cb_pred)
cb_r2 = r2_score(y_test, cb_pred)

print("--- Tuned CatBoost Final Metrics ---")
print(f"RMSE: {cb_rmse:.2f}")
print(f"MAE:  {cb_mae:.2f}")
print(f"R2:   {cb_r2:.4f}")
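
The patience rule behind `early_stopping_rounds` can be sketched in a few lines: training stops once the validation loss has gone a fixed number of rounds without a new best, and the best round is kept. This is a simplified illustration with made-up loss values and a patience of 3 for brevity (the notebook uses 50); CatBoost's own overfitting detector offers additional options.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_index, stopped_at) for a sequence of validation losses."""
    best_idx = 0
    best_loss = float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_idx = loss, i
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop here
            return best_idx, i
    return best_idx, len(val_losses) - 1

losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 6.0]  # made-up values
best_idx, stopped_at = best_iteration_with_patience(losses, patience=3)
print(f"Best round: {best_idx}, stopped at round: {stopped_at}")
```

Note that the loop stops before reaching the later value of 6.0. A larger patience trades longer training for a lower chance of stopping on a temporary plateau.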

## 7. Model Explainability (XAI) using SHAP
A valuation tool is only useful if users can see why it produced a given price. We apply **SHapley Additive exPlanations (SHAP)** to attribute each prediction of the CatBoost model to its input features.

SHAP values measure the contribution of each feature to a single prediction. For any one property, the base value plus the sum of its SHAP values equals the model's predicted price.
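
To illustrate the idea behind these values, the sketch below computes exact Shapley values for a tiny hand-written pricing function by enumerating every subset of features. Everything in it is hypothetical: the toy model, its three features (a district premium, electricity, tap water), and the use of a single baseline row in place of averaging over background data. `TreeExplainer` uses a much faster tree-specific algorithm instead of this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal gain from adding feature i to the coalition
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_model(x):
    # district premium + electricity + tap water + bonus when both utilities exist
    return x[0] + 50 * x[1] + 30 * x[2] + 20 * x[1] * x[2]

baseline = [200.0, 0, 0]   # reference property
instance = [500.0, 1, 1]   # property being explained

def value_fn(coalition):
    # Features in the coalition take the instance's value, the rest the baseline's
    x = [instance[j] if j in coalition else baseline[j] for j in range(3)]
    return toy_model(x)

phi = shapley_values(value_fn, n_features=3)
base_value = toy_model(baseline)
print("Shapley values:", [round(p, 2) for p in phi])
print("Base + sum:", round(base_value + sum(phi), 2),
      "| Prediction:", toy_model(instance))
```

The interaction bonus of 20 is split evenly between the two utility features, and the base value plus the three contributions recovers the prediction for the instance.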

In [None]:
import shap
shap.initjs() # Initialize JavaScript visualization for SHAP inside notebooks

# Create TreeExplainer tailored for Gradient Boosting trees
explainer = shap.TreeExplainer(final_model)

# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Summary Plot: Global Interpretability
# show=False defers rendering so the title is applied to the SHAP figure
shap.summary_plot(shap_values, X_test, show=False)
plt.title("Global Feature Importance (SHAP Summary Plot)")
plt.show()

### 7.1 Local Explainability (XAI)
The global view shows which features drive prices overall, with `District` and `City` expected to dominate. But how does the model decide the price of *one specific property*?

Below is a SHAP force plot. It shows the base value (the model's average prediction, which serves as the starting point) and how the specific features of an individual property push the value up (red) or down (blue) to arrive at the final predicted price.

In [None]:
# Local XAI for the very first instance in our test array
instance_idx = 0

print(f"Explaining prediction for:\n{X_test.iloc[instance_idx]}")
print(f"\nPredicted Price: LKR {cb_pred[instance_idx]:,.2f}")

shap.force_plot(
    explainer.expected_value, 
    shap_values[instance_idx,:], 
    X_test.iloc[instance_idx,:]
)

## 8. Conclusion
- We framed land valuation as a regression problem that maps location, land size, and utility access to price per perch.
- We evaluated a Random Forest baseline and compared it against a CatBoostRegressor tuned with RandomizedSearchCV, using RMSE, MAE, and R2 on a held-out test set.
- Through SHAP (global and local XAI), we examined how the model forms its predictions, which are anchored mainly by location and adjusted marginally by utility access.