# **Project Title:** House Price Prediction for Properties in Major Cities of Pakistan


`Dataset`: [Pakistan House Price Dataset](https://www.kaggle.com/datasets/jillanisofttech/pakistan-house-price-dataset/data)  


# **About Dataset**
## Context
The dataset was scraped from zameen.com, Pakistan's top real estate platform. It contains property listings from five major cities of Pakistan.  
The aim of this project is to perform Exploratory Data Analysis (EDA) to uncover insights and to use Machine Learning models to predict property prices from the given attributes.  

### **Content**
#### Column Descriptions:
`property_id`: Unique identifier for each property.  
`location_id`: Unique identifier for each location within a city.  
`page_url`: The URL of the webpage where the property was published.  
`property_type`: Categorization of the property into six types: House, FarmHouse, Upper Portion, Lower Portion, Flat, or Room.  
`price`: The price of the property, which is the target (dependent) variable in this dataset.  
`city`: The city where the property is located. The dataset includes five cities: Lahore, Karachi, Faisalabad, Rawalpindi, and Islamabad.  
`province_name`: The state or province where the city is located.  
`location`: The area or neighbourhood within the city (for example, G-10 or DHA Defence).  
`latitude` and `longitude`: Geographic coordinates of the listing; they vary between properties within the same city.  




---

In [None]:
#importing libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

from sklearn.model_selection import GridSearchCV, cross_val_score

import joblib

import warnings
warnings.filterwarnings('ignore')


In [3]:
df = pd.read_csv('cleaned_data.csv')

In [4]:
df.head()

| Unnamed: 0 | property_type | price | location | city | province_name | latitude | longitude | baths | purpose | bedrooms | date_added | Area Category | area_sqft | price_per_sqft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Flat | 10000000 | G-10 | Islamabad | Islamabad Capital | 33.67989 | 73.01264 | 2 | For Sale | 2 | 2019-02-04 | 0-5 Marla | 1088.0 | 9191.176471 |
| 1 | Flat | 6900000 | E-11 | Islamabad | Islamabad Capital | 33.700993 | 72.971492 | 3 | For Sale | 3 | 2019-05-04 | 5-10 Marla | 1523.2 | 4529.936975 |
| 2 | House | 16500000 | G-15 | Islamabad | Islamabad Capital | 33.631486 | 72.926559 | 6 | For Sale | 5 | 2019-07-17 | 5-10 Marla | 2176.0 | 7582.720588 |
| 3 | House | 43500000 | Bani Gala | Islamabad | Islamabad Capital | 33.707573 | 73.151199 | 4 | For Sale | 4 | 2019-04-05 | 1-5 Kanal | 10890.0 | 3994.490358 |
| 4 | House | 7000000 | DHA Defence | Islamabad | Islamabad Capital | 33.492591 | 73.301339 | 3 | For Sale | 3 | 2019-07-10 | 5-10 Marla | 2176.0 | 3216.911765 |


### Splitting Dataset

In [5]:
#splitting dataset
X = df.drop(['price', 'Area Category', 'date_added'], axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [6]:
cat_features = X.select_dtypes(include=['object']).columns
cat_features

Index(['property_type', 'location', 'city', 'province_name', 'purpose'], dtype='object')

Target encoding on `location`: replace each location name with its mean price, computed on the training set only.

In [7]:
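# Compute mean price per location on TRAINING data only (avoids target leakage)
location_mean_price = X_train.join(y_train).groupby('location')['price'].mean()

# Work on explicit copies so the assignments below do not touch views of df
X_train = X_train.copy()
X_test = X_test.copy()

# Map to both train and test
X_train['location'] = X_train['location'].map(location_mean_price)

# Locations unseen in training map to NaN; fill them with the overall training mean.
# Assigning the result back avoids chained `inplace=True`, which newer pandas ignores.
mean_price_overall = y_train.mean()
X_test['location'] = X_test['location'].map(location_mean_price).fillna(mean_price_overall)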
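The target-encoding step can be checked on a toy example. This is a minimal sketch: the data and the helper name `target_encode` are made up for illustration, and only the train-only mean and the fallback to the overall mean mirror the cell above.

```python
import pandas as pd


def target_encode(train_col, train_target, other_col):
    """Hypothetical helper: encode a category by its mean target value.

    Means are learned from the training data only; categories that appear
    only in `other_col` fall back to the overall training mean.
    """
    means = train_target.groupby(train_col).mean()
    fallback = train_target.mean()
    encoded_train = train_col.map(means)
    encoded_other = other_col.map(means).fillna(fallback)
    return encoded_train, encoded_other


# Toy data (not taken from the dataset)
train_loc = pd.Series(['G-10', 'G-10', 'E-11', 'E-11'])
train_price = pd.Series([10.0, 20.0, 30.0, 50.0])
test_loc = pd.Series(['G-10', 'Bani Gala'])  # 'Bani Gala' is unseen in training

enc_train, enc_test = target_encode(train_loc, train_price, test_loc)
print(enc_train.tolist())  # per-location training means
print(enc_test.tolist())   # unseen location gets the overall mean
```

Because the means come from the training rows alone, the test prices never influence the encoded feature.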

In [8]:


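# Define which columns get which scaler
numeric_features_1 = ['bedrooms', 'baths']        # assumed roughly normal
numeric_features_2 = ['area_sqft', 'location']    # skewed, with outliers
categorical_features = ['city', 'property_type', 'purpose', 'province_name']

# Define transformers. Columns not listed above (e.g. latitude, longitude,
# price_per_sqft) are dropped, since ColumnTransformer defaults to remainder='drop'.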
preprocessor = ColumnTransformer(
    transformers=[
        ('num_standard', StandardScaler(), numeric_features_1),
        ('num_robust', RobustScaler(), numeric_features_2),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)


In [None]:
# Create the full pipeline (preprocessor + regressor) and try several regressors
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42, n_estimators=200, learning_rate=0.1),
    "LightGBM": LGBMRegressor(random_state=42, n_estimators=200, learning_rate=0.1),
    "CatBoost": CatBoostRegressor(verbose=0, random_state=42, iterations=300, learning_rate=0.1),
    # "SVR": SVR(kernel='rbf', C=100, epsilon=0.1),
    "KNN": KNeighborsRegressor(n_neighbors=5)
    
}

results = {}
best_model = None
best_model_name = None
best_r2 = float('-inf')  # R² can be negative, so start below any possible score

for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', model)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f"{name}: R² = {r2}, MSE = {mse}")
    
    # Update best model if this one is better
    if r2 > best_r2:
        best_r2 = r2
        best_model = model
        best_model_name = name

print(f"\nBest Model: {best_model_name} with R² = {best_r2}")



Linear Regression: R² = 0.29661971493129113, MSE = 973776149405006.1
Random Forest: R² = 0.8334710855149872, MSE = 230546531590618.56
Gradient Boosting: R² = 0.7608725235041893, MSE = 331053682086469.56
XGBoost: R² = 0.8147498369216919, MSE = 256464610394112.0
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004670 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 561
[LightGBM] [Info] Number of data points in the train set: 118365, number of used features: 21
[LightGBM] [Info] Start training from score 18475012.395472
LightGBM: R² = 0.849196395426509, MSE = 208776044047956.84
CatBoost: R² = 0.8176077977290005, MSE = 252508038936012.28
KNN: R² = 0.7371494815658972, MSE = 363896417262913.9

Best Model: LightGBM with R² = 0.849196395426509


### Hyperparameter Tuning

In [15]:
best_model

| Parameter | Value |
|---|---|
| boosting_type | 'gbdt' |
| num_leaves | 31 |
| max_depth | -1 |
| learning_rate | 0.1 |
| n_estimators | 200 |
| subsample_for_bin | 200000 |
| objective | |
| class_weight | |
| min_split_gain | 0.0 |
| min_child_weight | 0.001 |


In [None]:
# # choose the best parameters for LightGBM with an exhaustive grid search
# param_grid = {
#     'num_leaves': [31, 50, 100],
#     'learning_rate': [0.01, 0.05, 0.1],
#     'n_estimators': [100, 300, 500],
#     'max_depth': [-1, 10, 20, 30],
#     'subsample': [0.6, 0.8, 1.0],
#     'feature_fraction': [0.6, 0.8, 1.0],
# }

# pipeline = Pipeline(steps=[
#     ('preprocessor', preprocessor),
#     ('regressor', best_model)
# ])


# grid_search = GridSearchCV(
#     estimator=pipeline,
#     param_grid={'regressor__' + k: v for k, v in param_grid.items()},
#     scoring='r2',
#     cv=3,
#     n_jobs=-1,
#     verbose=2
# )

# grid_search.fit(X_train, y_train)

# print("Best Parameters:", grid_search.best_params_)
# print("Best R² Score:", grid_search.best_score_)

# best_model = grid_search.best_estimator_



Fitting 3 folds for each of 972 candidates, totalling 2916 fits


KeyboardInterrupt: 

Switching to `RandomizedSearchCV` because `GridSearchCV` was taking too long to run (2,916 fits), so the grid search was interrupted.
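The cost difference between the two searches is simple arithmetic. The grid below is copied from the notebook, and the snippet only counts candidates without fitting anything.

```python
import math

parameters = {
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'max_depth': [-1, 10, 20, 30],
    'subsample': [0.6, 0.8, 1.0],
    'feature_fraction': [0.6, 0.8, 1.0],
}
cv_folds = 3

# Exhaustive grid: every combination of every value
n_candidates = math.prod(len(values) for values in parameters.values())
grid_fits = n_candidates * cv_folds

# Randomized search: a fixed budget of sampled combinations
n_iter = 100
random_fits = n_iter * cv_folds

print(n_candidates, grid_fits, random_fits)  # 972 2916 300
```

With `n_iter=100`, the randomized search evaluates about a tenth of the grid, so it may miss the best combination but finishes in roughly a tenth of the time.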

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint  # for distributions (optional)

# parameters = {
#     'num_leaves': randint(31, 150),
#     'learning_rate': uniform(0.01, 0.1),
#     'n_estimators': randint(100, 800),
#     'max_depth': [-1, 10, 20, 30],
#     'subsample': uniform(0.6, 0.4),        # 0.6 to 1.0
#     'feature_fraction': uniform(0.6, 0.4)  # 0.6 to 1.0
# }

parameters = {
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'max_depth': [-1, 10, 20, 30],
    'subsample': [0.6, 0.8, 1.0],  # note: LightGBM only applies this when subsample_freq > 0
    'feature_fraction': [0.6, 0.8, 1.0],
}

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', best_model)
])


random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions={'regressor__' + k: v for k, v in parameters.items()},
    scoring='r2',
    cv=3,
    n_iter=100,        # sample 100 of the 972 possible combinations
    n_jobs=-1,
    verbose=2,
    random_state=42
)

random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best R² Score:", random_search.best_score_)

# best_estimator_ is the full pipeline (preprocessor + tuned regressor), refit on all of X_train
best_model = random_search.best_estimator_


Fitting 3 folds for each of 100 candidates, totalling 300 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003586 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 561
[LightGBM] [Info] Number of data points in the train set: 118365, number of used features: 21
[LightGBM] [Info] Start training from score 18475012.395472
Best Parameters: {'regressor__subsample': 1.0, 'regressor__num_leaves': 100, 'regressor__n_estimators': 100, 'regressor__max_depth': -1, 'regressor__learning_rate': 0.1, 'regressor__feature_fraction': 1.0}
Best R² Score: 0.8302525458809796


In [23]:

joblib.dump(best_model, 'model.pkl')


['model.pkl']

In [None]:
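## for loading

# model = joblib.load('model.pkl')  # same filename as in joblib.dump above
# new_data['location'] must already be target-encoded with location_mean_price
# y_pred = model.predict(new_data)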
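The saved pipeline does not include the `location` target-encoding map, because that step ran outside the pipeline. One option, sketched here with toy stand-ins rather than the notebook's fitted objects, is to store the model together with the map and the fallback mean in a single file. The bundle keys and the filename `model_bundle.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the notebook's fitted objects (illustrative values only)
location_mean_price = pd.Series({'G-10': 15.0, 'E-11': 40.0})
mean_price_overall = 27.5
model = LinearRegression().fit(
    pd.DataFrame({'location': [15.0, 40.0]}), [15.0, 40.0]
)

# Save everything needed at prediction time in one bundle
bundle = {
    'model': model,
    'location_mean_price': location_mean_price,
    'mean_price_overall': mean_price_overall,
}
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')  # hypothetical filename
joblib.dump(bundle, path)

# Later: load the bundle, encode raw location names, then predict
loaded = joblib.load(path)
new_data = pd.DataFrame({'location': ['G-10', 'Bani Gala']})
new_data['location'] = (
    new_data['location']
    .map(loaded['location_mean_price'])
    .fillna(loaded['mean_price_overall'])  # unseen location falls back to the mean
)
y_pred = loaded['model'].predict(new_data)
print(y_pred)
```

In the notebook, `model` would be the tuned pipeline and the other two entries would be `location_mean_price` and `mean_price_overall` from the target-encoding cell.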