# 🧠 Model Development — Housing Prices in India

---

### 🔗 **Notebook Context**

This notebook is the **third stage** of the *Housing Prices in India* project.  
In the previous notebooks, we:
1. Performed **Exploratory Data Analysis (EDA)** to understand the dataset.  
2. Conducted **Data Cleaning and Feature Engineering** to prepare high-quality inputs for modeling.

In this notebook, we’ll build, train, and evaluate three regression models — **Linear Regression**, **Ridge Regression**, and **Lasso Regression** — to predict housing prices in India.  
We’ll compare performance, analyze feature importance, and save the trained models for deployment.

---

## 🎯 **Objectives**

1. Load the cleaned dataset  
2. Split data into training and testing sets  
3. Train Linear, Ridge, and Lasso regression models  
4. Evaluate model performance (R², MAE, RMSE)  
5. Save the trained models into the `models/` directory  
6. Summarize key insights and next steps  

---

## 📦 **1. Setup & Data Loading**

In [92]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer, LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler,
)

sns.set(style="whitegrid")


In [93]:
df = pd.read_csv('../data/cleaned-feature-engineered-data.csv')
df.head()

| | posted_by | under_construction | rera_approved | num_of_rooms | bhk_or_rk | ready_to_move | resale | longitude | latitude | price | avg_price_per_unit_area | avg_price_per_room | area_per_room |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Owner | No | No | 2 | BHK | Yes | Yes | 12.96991 | 77.59796 | 55.0 | 0.0423 | 27.5 | 650.118204 |
| 1 | Dealer | No | No | 2 | BHK | Yes | Yes | 12.274538 | 76.644605 | 51.0 | 0.04 | 25.5 | 637.5 |
| 2 | Owner | No | No | 2 | BHK | Yes | Yes | 12.778033 | 77.632191 | 43.0 | 0.04608 | 21.5 | 466.579861 |
| 3 | Owner | No | Yes | 2 | BHK | Yes | Yes | 28.6423 | 77.3445 | 62.5 | 0.06721 | 31.25 | 464.960571 |
| 4 | Dealer | Yes | No | 2 | BHK | No | Yes | 22.5922 | 88.484911 | 60.5 | 0.06056 | 30.25 | 499.504623 |


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29050 entries, 0 to 29049
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   posted_by                29050 non-null  object 
 1   under_construction       29050 non-null  object 
 2   rera_approved            29050 non-null  object 
 3   num_of_rooms             29050 non-null  int64  
 4   bhk_or_rk                29050 non-null  object 
 5   ready_to_move            29050 non-null  object 
 6   resale                   29050 non-null  object 
 7   longitude                29050 non-null  float64
 8   latitude                 29050 non-null  float64
 9   price                    29050 non-null  float64
 10  avg_price_per_unit_area  29050 non-null  float64
 11  avg_price_per_room       29050 non-null  float64
 12  area_per_room            29050 non-null  float64
dtypes: float64(6), int64(1), object(6)
memory usage: 2.9+ MB


## **Data Splitting**
Now, split the data into a **training set** (80%) and a **test set** (20%).

In [95]:
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]

Training set shape: (23240, 12), Test set shape: (5810, 12)
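One thing worth checking before modeling is whether any feature is computed from the target. In the five rows printed by `df.head()` above, `avg_price_per_room` multiplied by `num_of_rooms` reproduces `price` exactly, which would make that column a form of target leakage. The sketch below runs the check on those five rows only; the `sample` frame is a hand-copied assumption, and whether the relationship holds for the full dataset is not confirmed here.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), copied by hand for illustration.
sample = pd.DataFrame({
    "num_of_rooms": [2, 2, 2, 2, 2],
    "price": [55.0, 51.0, 43.0, 62.5, 60.5],
    "avg_price_per_room": [27.5, 25.5, 21.5, 31.25, 30.25],
})

# If the product of two features rebuilds the target, those features leak it.
reconstructed = sample["avg_price_per_room"] * sample["num_of_rooms"]
leaks_target = bool(np.allclose(reconstructed, sample["price"]))
print(f"avg_price_per_room * num_of_rooms == price: {leaks_target}")
```

If the same check passes on the full `df`, dropping price-derived columns from `X` before the split would likely give a more honest estimate of test performance.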


See the exploratory data analysis notebook for the distributions of the numerical features. The target, `price`, is log-transformed separately when the model is defined; the features are preprocessed as follows:

- I apply a simple imputer to fill in missing feature values (median for numerical features, most frequent value for categorical ones).
- I apply a log transformation to `num_of_rooms`, since values as high as 20 skew its distribution.
- To `longitude`, `latitude`, and the remaining numerical features, I apply a **MinMax** scaler.
- For all binary (yes/no) features, I apply a **One-Hot** encoder with `drop='if_binary'`, which yields a single 0/1 column per feature.
- For all categorical features that are not binary, I apply a **One-Hot** encoder.
- Finally, I apply a **standard scaler** to all features so that each has mean 0 and standard deviation 1.
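Because min-max scaling and standardization are both affine transformations (a shift followed by a rescale), applying a standard scaler after a MinMax scaler gives the same values as applying the standard scaler alone. A minimal sketch on synthetic data (the array `x` is made up for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(10.0, 90.0, size=(200, 2))  # synthetic stand-in for two numeric columns

# Chain both scalers, then compare against standardization on its own.
both = make_pipeline(MinMaxScaler(), StandardScaler()).fit_transform(x)
standard_only = StandardScaler().fit_transform(x)

same_result = bool(np.allclose(both, standard_only))
print(f"MinMax + Standard equals Standard alone: {same_result}")
```

The MinMax step is therefore harmless here but does not change what the models see.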

In [96]:
binary_cols = [col for col in X.select_dtypes(include=['object', 'category']).columns if X[col].nunique() == 2]
multi_class_cols = [col for col in X.select_dtypes(include=['object', 'category']).columns if X[col].nunique() > 2]


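def log_feature_names_out(transformer, input_features) -> list[str]:
    # FunctionTransformer calls this as (transformer, input_features).
    # log1p is one-to-one, so the output names are the input names.
    return list(input_features)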


log_transformer = FunctionTransformer(func=np.log1p,
                                      inverse_func=np.expm1,
                                      feature_names_out=log_feature_names_out)


log_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                               ('transformer', log_transformer),
                               ('scaler', StandardScaler())])
minmax_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                  ('scaler', MinMaxScaler()),
                                  ('standard_scaler', StandardScaler())])
onehot_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                  ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='if_binary')),
                                  ('standard_scaler', StandardScaler())])
onehot_binary_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                         ('onehot', OneHotEncoder(drop='if_binary', sparse_output=False)),
                                         ('standard_scaler', StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ('log_transformer', log_pipeline, ['num_of_rooms']),
        ('minmax_transformer', minmax_pipeline, ['longitude', 'latitude', 'avg_price_per_unit_area', 'avg_price_per_room', 'area_per_room']),
        ('onehot_encoder', onehot_pipeline, multi_class_cols),
        ('onehot_encoder_binary', onehot_binary_pipeline, binary_cols)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
print(f"Processed training set shape: {X_train_processed.shape}, Processed test set shape: {X_test_processed.shape}")

Processed training set shape: (23240, 14), Processed test set shape: (5810, 14)


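Since the target (`price`) also contains very large values that skew its distribution, we need to transform the target as well. `TransformedTargetRegressor` fits the model on `log1p(price)` and applies `expm1` to the predictions, so they come back on the original price scale.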

In [97]:
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1
)


In [98]:
full_pipeline_lr = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

full_pipeline_lr.fit(X_train, y_train)

In [99]:
full_pipeline_lr.predict(X_test.iloc[0:5])

array([ 40.2511535 ,  81.50058126,  50.69469041, 145.20596566,
        45.52409148])

In [100]:
y_test.iloc[0:5].values

array([ 80. ,  64.7,  85. , 220. ,  46. ])
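The predictions above are on the original price scale because `TransformedTargetRegressor` fits the regressor on `func(y)` and applies `inverse_func` to its output. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up) suggests this matches doing the two steps by hand:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
# Synthetic, strictly positive, right-skewed target.
y_demo = np.exp(1.0 + X_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=300))

# Wrapped version: the log transform and its inverse are handled internally.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X_demo, y_demo)

# Manual version: fit on log1p(y), then map predictions back with expm1.
manual = LinearRegression().fit(X_demo, np.log1p(y_demo))
manual_pred = np.expm1(manual.predict(X_demo))

matches_manual = bool(np.allclose(wrapped.predict(X_demo), manual_pred))
print(f"Wrapped predictions match the manual two-step version: {matches_manual}")
```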

## **Hyperparameter Tuning**

Here, I tune three models on the training set using 5-fold cross-validated grid search:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
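The grid keys in the cells that follow, such as `model__regressor__alpha`, use scikit-learn's double-underscore convention: each `__` steps one level into a nested estimator. A small sketch with a simplified pipeline (the `StandardScaler` step stands in for the full preprocessor, and the data is synthetic) shows how the name resolves:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),  # stand-in for the real ColumnTransformer
    ("model", TransformedTargetRegressor(
        regressor=Ridge(), func=np.log1p, inverse_func=np.expm1
    )),
])

# 'model' -> pipeline step, 'regressor' -> wrapped estimator, 'alpha' -> its parameter
param_name = "model__regressor__alpha"
is_valid_param = param_name in demo_pipeline.get_params()

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = np.exp(1.0 + 0.4 * X_demo[:, 0] + rng.normal(scale=0.1, size=120))

search = GridSearchCV(demo_pipeline, {param_name: [0.1, 1.0, 10.0]},
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)
print(is_valid_param, search.best_params_)
```

Listing `demo_pipeline.get_params().keys()` is a quick way to find the exact key for any parameter before building a grid.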

In [101]:
best_parameters = {}

In [102]:
# ----- Linear Regression Tuning -----
param_grid_lr = {
    'model__regressor__fit_intercept': [True, False],    
}

grid_serach_lr = GridSearchCV(full_pipeline_lr,
                              param_grid_lr,
                              cv=5,
                              scoring='neg_mean_squared_error',
                              n_jobs=-1)
grid_serach_lr.fit(X_train, y_train)
grid_serach_lr.best_params_
best_parameters['linear_regression'] = grid_serach_lr.best_params_

In [103]:
full_pipeline_lasso = Pipeline([
    ('preprocessor', preprocessor),
    ('model', TransformedTargetRegressor(
        regressor=Lasso(),
        func=np.log1p,
        inverse_func=np.expm1
    ))
])

param_grid_lasso = {
    'model__regressor__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10_000.0, 100_000.0],
    'model__regressor__fit_intercept': [True, False],
}

grid_serach_lasso = GridSearchCV(full_pipeline_lasso,
                              param_grid_lasso,
                              cv=5,
                              scoring='neg_mean_squared_error',
                              n_jobs=-1)
grid_serach_lasso.fit(X_train, y_train)
grid_serach_lasso.best_params_
best_parameters['lasso_regression'] = grid_serach_lasso.best_params_

  model = cd_fast.enet_coordinate_descent(


In [104]:
full_pipeline_ridge = Pipeline([
    ('preprocessor', preprocessor),
    ('model', TransformedTargetRegressor(
        regressor=Ridge(),
        func=np.log1p,
        inverse_func=np.expm1
    ))
])

param_grid_ridge = {
    'model__regressor__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10_000.0, 100_000.0],
    'model__regressor__fit_intercept': [True, False],
}

grid_serach_ridge = GridSearchCV(full_pipeline_ridge,
                              param_grid_ridge,
                              cv=5,
                              scoring='neg_mean_squared_error',
                              n_jobs=-1)
grid_serach_ridge.fit(X_train, y_train)
grid_serach_ridge.best_params_
best_parameters['ridge_regression'] = grid_serach_ridge.best_params_

Now we measure the metrics on the test data: mean squared error, R² score, explained variance score, and mean absolute error.

In [105]:
metrics = {}
# Evaluate Linear Regression
y_pred_lr = grid_serach_lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
explained_variance_lr = explained_variance_score(y_test, y_pred_lr)
mean_absolute_error_lr = mean_absolute_error(y_test, y_pred_lr)

y_pred_lasso = grid_serach_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
explained_variance_lasso = explained_variance_score(y_test, y_pred_lasso)
mean_absolute_error_lasso = mean_absolute_error(y_test, y_pred_lasso)

y_pred_ridge = grid_serach_ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
explained_variance_ridge = explained_variance_score(y_test, y_pred_ridge)
mean_absolute_error_ridge = mean_absolute_error(y_test, y_pred_ridge)

metrics['Mean Squared Error'] = [mse_lr, mse_lasso, mse_ridge]
metrics['R2 Score'] = [r2_lr, r2_lasso, r2_ridge]
metrics['Explained Variance Score'] = [explained_variance_lr, explained_variance_lasso, explained_variance_ridge]
metrics['Mean Absolute Error'] = [mean_absolute_error_lr, mean_absolute_error_lasso, mean_absolute_error_ridge]
metrics_df = pd.DataFrame(metrics, index=['Linear Regression', 'Lasso Regression', 'Ridge Regression'])
metrics_df

| | Mean Squared Error | R2 Score | Explained Variance Score | Mean Absolute Error |
|---|---|---|---|---|
| Linear Regression | 2058857.0 | -2.76001 | -2.730684 | 161.174746 |
| Lasso Regression | 554066.8 | -0.01187 | 0.0 | 111.358986 |
| Ridge Regression | 569146.6 | -0.03941 | 0.000809 | 148.400765 |
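Because MSE is in squared price units, its square root (RMSE) is easier to compare with the MAE column. A quick sketch using the MSE values from the table above (copied by hand, so subject to the table's rounding):

```python
import numpy as np
import pandas as pd

# MSE values copied from the metrics table above.
mse = pd.Series({
    "Linear Regression": 2058857.0,
    "Lasso Regression": 554066.8,
    "Ridge Regression": 569146.6,
})

# RMSE is the square root of MSE, in the same units as price.
rmse = np.sqrt(mse).rename("Root Mean Squared Error")
print(rmse.round(2))
```

An RMSE several times larger than the MAE for every model suggests that a small number of very large errors dominate the squared-error metrics.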


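Lasso has the lowest MSE and MAE of the three, but none of the models is useful yet: every R² score is negative, meaning each model does worse than always predicting the mean test price. Lasso's explained variance of 0.0 is what a constant prediction would produce, which suggests the selected penalty shrank every coefficient to zero. Let's look at the parameters the grid search selected for Lasso.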

In [106]:
grid_serach_lasso.best_params_

{'model__regressor__alpha': 1.0, 'model__regressor__fit_intercept': True}
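To see how features influence a fitted Lasso model, its coefficients can be read from the `regressor_` attribute of the fitted `TransformedTargetRegressor`. The sketch below uses synthetic data and a simplified pipeline (all `*_demo` names are hypothetical) to illustrate what can happen at `alpha=1.0`: with standardized features, a penalty larger than the standard deviation of the log-transformed target shrinks every coefficient to zero, and the model predicts a single constant.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import explained_variance_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
# Synthetic target whose log has a standard deviation well below 1.0.
y_demo = np.exp(3.0 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.1, size=500))

lasso_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", TransformedTargetRegressor(
        regressor=Lasso(alpha=1.0), func=np.log1p, inverse_func=np.expm1
    )),
])
lasso_demo.fit(X_demo, y_demo)

# The fitted Lasso lives on the wrapper's `regressor_` attribute.
coefs = lasso_demo.named_steps["model"].regressor_.coef_
preds = lasso_demo.predict(X_demo)

all_zero = bool(np.all(coefs == 0.0))
constant_prediction = bool(np.allclose(preds, preds[0]))
ev = explained_variance_score(y_demo, preds)
print(coefs, all_zero, constant_prediction, round(ev, 6))
```

On the real pipeline, the equivalent lookup would be `grid_serach_lasso.best_estimator_.named_steps['model'].regressor_.coef_`; if those coefficients are all zero, a smaller `alpha` range is probably worth exploring.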

---

<p align="center">
  <b>👨🏽‍💻 Authored by:</b><br>
  <a href="https://github.com/mobadara">
    <img src="https://img.shields.io/badge/GitHub-mobadara-black?logo=github" alt="GitHub"/>
  </a>
  <a href="https://linkedin.com/in/obadara-m">
    <img src="https://img.shields.io/badge/LinkedIn-Muyiwa%20Obadara-blue?logo=linkedin" alt="LinkedIn"/>
  </a>
  <a href="https://x.com/m_obadara">
    <img src="https://img.shields.io/badge/Twitter-@m__obadara-1DA1F2?logo=x" alt="Twitter"/>
  </a>
</p>

<p align="center">
  <i>Exploring the intersection of Data Science, AI, and real-world impact — one dataset at a time.</i>
</p>

---
