Introduction

House prices reflect a blend of structural attributes (area, bedrooms, bathrooms, stories), accessibility (main road), and amenities (air conditioning, hot water heating, guest room, basement, furnishing). Accurately modeling price supports budget planning, valuation, and policy decisions. Different regression models capture different relationships: linear models favor interpretability, polynomial regression captures smooth nonlinearities, KNN learns local patterns, and decision trees learn rule-like structures. We will prepare the data, train five models, and compare them on a held-out test set.

Statement of the problem

Given a tabular dataset of houses with numerical and categorical features, predict the target variable "price" as accurately and robustly as possible. We want:

A reproducible preprocessing pipeline suitable for all five models.

Comparable evaluation metrics across models: R², MAE, RMSE.

Practical interpretation to guide model choice (performance vs. interpretability vs. risk of overfitting).
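
For reference, the three metrics can be computed directly from their textbook definitions. The sketch below is illustrative only: the helper name `regression_metrics` and the numbers in `demo_true` and `demo_pred` are made up for the example and do not come from the housing data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    return r2, mae, rmse

# Made-up values, for illustration only
demo_true = [3.0, 5.0, 7.0, 9.0]
demo_pred = [2.0, 5.0, 8.0, 9.0]
demo_r2, demo_mae, demo_rmse = regression_metrics(demo_true, demo_pred)
print(demo_r2, demo_mae, demo_rmse)
```

MAE and RMSE are in the same units as price, while R² is unitless, which is why all three are reported side by side.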

In [2]:
import pandas as pd


In [3]:
from google.colab import files
uploaded = files.upload()   # This will open a file picker


Saving Housing Price.csv to Housing Price.csv


In [4]:
df = pd.read_csv("Housing Price.csv")


In [5]:
df


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


Preprocessing Data

Step 1: Inspect the Data

In [6]:

df.head()

df.info()

df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


Unnamed: 0,0
price,0
area,0
bedrooms,0
bathrooms,0
stories,0
mainroad,0
guestroom,0
basement,0
hotwaterheating,0
airconditioning,0


Step 2: Encode Categorical Variables


Binary columns (yes/no)

In [7]:
binary_cols = ["mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "prefarea"]
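# Map "yes"/"no" to 1/0 column by column (DataFrame.applymap is deprecated in recent pandas)
df[binary_cols] = df[binary_cols].apply(lambda col: col.astype(str).str.strip().str.lower().eq("yes").astype(int))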




Multi-category column (furnishingstatus)

In [8]:
df = pd.get_dummies(df, columns=["furnishingstatus"], prefix="furnish", drop_first=True)
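
To see what `drop_first=True` does, here is a small sketch on a toy frame (`toy_furnish` is a hypothetical stand-in, not the real dataset). With three categories, only two indicator columns are kept, and the dropped category becomes the baseline encoded as all zeros.

```python
import pandas as pd

# Toy stand-in for the furnishingstatus column
toy_furnish = pd.DataFrame(
    {"furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"]}
)
toy_encoded = pd.get_dummies(
    toy_furnish, columns=["furnishingstatus"], prefix="furnish", drop_first=True
)

# The alphabetically first category ("furnished") is dropped and acts as the baseline
print(list(toy_encoded.columns))
print(toy_encoded.astype(int))
```

Dropping one level avoids perfectly collinear dummy columns, which matters mainly for the linear models.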


Step 3: Define Features and Target

In [9]:
X = df.drop(columns=["price"])
y = df["price"]


Step 4: Train-Test Split

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Step 5: Feature Scaling (for models like Linear, Polynomial, KNN)

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
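
The scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal sketch with toy numbers (`toy_train`, `toy_test`, and `toy_scaler` are illustrative names) shows that the test values are standardized with the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1000.0], [2000.0], [3000.0]])  # toy "area" values
toy_test = np.array([[4000.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training statistics

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled)
```

Because the test point lies outside the training range, its scaled value falls well above the training values, which is expected.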


Final Check

In [13]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (436, 13)
Test shape: (109, 13)


**1. Simple Linear Regression**


*Idea:* Fit a straight line to a single predictor (e.g., area) to estimate price.

**Use when**: You want a simple, interpretable baseline and a quick demonstration.

Formula:

$$\hat{y} = \beta_0 + \beta_1 \cdot \text{area}$$
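
As a worked example of the formula, the slope and intercept of a one-feature least-squares fit have a closed form. The sketch below uses made-up points (`area_toy`, `price_toy`) that lie exactly on a line and checks the closed form against `LinearRegression`; it is not the notebook's actual fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on price = 1 + 2 * area
area_toy = np.array([1.0, 2.0, 3.0, 4.0])
price_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form ordinary least squares for a single predictor
beta1 = np.sum((area_toy - area_toy.mean()) * (price_toy - price_toy.mean())) / np.sum(
    (area_toy - area_toy.mean()) ** 2
)
beta0 = price_toy.mean() - beta1 * area_toy.mean()

# scikit-learn should recover the same coefficients
toy_line = LinearRegression().fit(area_toy.reshape(-1, 1), price_toy)
print(beta0, beta1, toy_line.intercept_, toy_line.coef_[0])
```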

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_area_train = X_train[["area"]]
X_area_test = X_test[["area"]]

scaler_area = StandardScaler()
X_area_train_scaled = scaler_area.fit_transform(X_area_train)
X_area_test_scaled = scaler_area.transform(X_area_test)

lr_single = LinearRegression()
lr_single.fit(X_area_train_scaled, y_train)
y_pred_lr_single = lr_single.predict(X_area_test_scaled)


Evaluation

In [16]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from math import sqrt

print("R²:", r2_score(y_test, y_pred_lr_single))
print("MAE:", mean_absolute_error(y_test, y_pred_lr_single))
print("RMSE:", sqrt(mean_squared_error(y_test, y_pred_lr_single)))


R²: 0.27287851871974633
MAE: 1474748.1337969352
RMSE: 1917103.7021424235


**Pros**
Easy to interpret

Fast and simple

**Cons**
Ignores other features

Assumes linearity

**2. Multiple Linear Regression**

Models the relationship between multiple independent variables and one dependent variable.

Formula:

$$\hat{y} = \beta_0 + \sum_{i} \beta_i x_i$$
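
When the features are standardized, each coefficient is the change in predicted price per one standard deviation of that feature, so magnitudes can be compared across features. The sketch below uses synthetic data with assumed coefficients (500 per unit of area, 200,000 per bedroom); these numbers are invented for illustration and are not estimates from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
synth_X = pd.DataFrame({
    "area": rng.uniform(1000, 10000, size=200),
    "bedrooms": rng.integers(1, 6, size=200),
})
# Synthetic target built from assumed coefficients plus small noise
synth_y = 500 * synth_X["area"] + 200_000 * synth_X["bedrooms"] + rng.normal(0, 1000, size=200)

synth_X_scaled = StandardScaler().fit_transform(synth_X)
synth_lr = LinearRegression().fit(synth_X_scaled, synth_y)

# Coefficients on standardized features, largest magnitude first
coef_table = pd.Series(synth_lr.coef_, index=synth_X.columns).sort_values(key=abs, ascending=False)
print(coef_table)
```

Dividing a standardized coefficient by the feature's standard deviation recovers the effect per original unit.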

In [17]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)


Evaluation

In [18]:
print("R²:", r2_score(y_test, y_pred_lr))
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("RMSE:", sqrt(mean_squared_error(y_test, y_pred_lr)))


R²: 0.6529242642153176
MAE: 970043.4039201642
RMSE: 1324506.9600914402


**Pros**
Uses all features

Coefficients are interpretable

**Cons**
Sensitive to multicollinearity

Assumes linear relationships

**3. Polynomial Regression**
Used when the data follows a curved (nonlinear) pattern.

Formula:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots$$

In [19]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression())
])

poly_model.fit(X_train_scaled, y_train)
y_pred_poly = poly_model.predict(X_test_scaled)


Evaluation

In [20]:
print("R²:", r2_score(y_test, y_pred_poly))
print("MAE:", mean_absolute_error(y_test, y_pred_poly))
print("RMSE:", sqrt(mean_squared_error(y_test, y_pred_poly)))


R²: 0.6237689217365154
MAE: 1034749.2706758833
RMSE: 1379016.466162188


**Pros**
Captures nonlinear trends

Still interpretable at low degrees

**Cons**
Can overfit at high degrees

Requires scaling

**4. K-Nearest Neighbors (KNN) Regression**
Predicts a value from the most similar training points (its nearest neighbors).

There is no explicit training phase; the model simply stores the training data.
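
With `weights="distance"`, each neighbor's target is weighted by the inverse of its distance to the query point, so closer houses count more. The sketch below works this out by hand on three made-up points (`knn_X`, `knn_y`, and `query` are illustrative) and compares the result with `KNeighborsRegressor`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

knn_X = np.array([[0.0], [1.0], [4.0]])   # made-up one-dimensional feature
knn_y = np.array([10.0, 20.0, 40.0])      # made-up targets
query = np.array([[2.0]])

# Manual inverse-distance weighting over all three neighbors
distances = np.abs(knn_X - query).ravel()              # distances 2, 1, 2
weights = 1.0 / distances                              # weights 0.5, 1, 0.5
manual_pred = np.sum(weights * knn_y) / np.sum(weights)

# Same computation through scikit-learn
toy_knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(knn_X, knn_y)
sk_pred = toy_knn.predict(query)[0]
print(manual_pred, sk_pred)
```

A uniform average of the same three targets would be about 23.3; the distance-weighted prediction is pulled toward the nearest point instead.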

In [21]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)


Evaluation

In [22]:
print("R²:", r2_score(y_test, y_pred_knn))
print("MAE:", mean_absolute_error(y_test, y_pred_knn))
print("RMSE:", sqrt(mean_squared_error(y_test, y_pred_knn)))


R²: 0.6155312432500738
MAE: 992579.195933982
RMSE: 1394031.6849064424


**Pros**
Captures local patterns

No assumptions about data

**Cons**
Sensitive to scaling

Slower on large datasets

**5. Decision Tree Regression**
Splits data into regions and predicts mean value per region.
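
The "regions" can be printed as explicit rules. The sketch below fits a depth-1 tree on four made-up points (`tree_X`, `tree_y` are illustrative, not housing data) and uses `export_text` to show the single split; each leaf predicts the mean target of the training points that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

tree_X = np.array([[1000.0], [1500.0], [4000.0], [4500.0]])  # made-up areas
tree_y = np.array([100.0, 120.0, 300.0, 320.0])              # made-up prices

# A depth-1 tree (a "stump") makes exactly one split
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(tree_X, tree_y)
print(export_text(stump, feature_names=["area"]))

# Each prediction is the mean target of the leaf the point lands in
stump_pred = stump.predict(np.array([[1200.0], [4200.0]]))
print(stump_pred)
```

The same `export_text` call can be applied to a deeper tree, though the rule list grows quickly with depth.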

In [23]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=6, random_state=42)
dt.fit(X_train, y_train)  # No scaling needed
y_pred_dt = dt.predict(X_test)


Evaluation

In [24]:
print("R²:", r2_score(y_test, y_pred_dt))
print("MAE:", mean_absolute_error(y_test, y_pred_dt))
print("RMSE:", sqrt(mean_squared_error(y_test, y_pred_dt)))


R²: 0.4868260105767255
MAE: 1226189.555033347
RMSE: 1610550.8306615497


**Pros**
Captures nonlinear interactions

Easy to visualize and interpret

**Cons**
Can overfit

Sensitive to small data changes

**Final Comparison Table**

In [25]:
results = pd.DataFrame({
    "Model": ["Simple Linear", "Multiple Linear", "Polynomial", "KNN", "Decision Tree"],
    "R²": [r2_score(y_test, y_pred_lr_single),
           r2_score(y_test, y_pred_lr),
           r2_score(y_test, y_pred_poly),
           r2_score(y_test, y_pred_knn),
           r2_score(y_test, y_pred_dt)],
    "MAE": [mean_absolute_error(y_test, y_pred_lr_single),
            mean_absolute_error(y_test, y_pred_lr),
            mean_absolute_error(y_test, y_pred_poly),
            mean_absolute_error(y_test, y_pred_knn),
            mean_absolute_error(y_test, y_pred_dt)],
    "RMSE": [sqrt(mean_squared_error(y_test, y_pred_lr_single)),
             sqrt(mean_squared_error(y_test, y_pred_lr)),
             sqrt(mean_squared_error(y_test, y_pred_poly)),
             sqrt(mean_squared_error(y_test, y_pred_knn)),
             sqrt(mean_squared_error(y_test, y_pred_dt))]
})
results.sort_values(by="R²", ascending=False)


Unnamed: 0,Model,R²,MAE,RMSE
1,Multiple Linear,0.652924,970043.4,1324507.0
2,Polynomial,0.623769,1034749.0,1379016.0
3,KNN,0.615531,992579.2,1394032.0
4,Decision Tree,0.486826,1226190.0,1610551.0
0,Simple Linear,0.272879,1474748.0,1917104.0




**Summary**
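Performance: Multiple Linear Regression scored best on the test set (R² 0.653, lowest MAE and RMSE), followed by Polynomial and KNN; the single-feature Simple Linear model scored lowest (R² 0.273).

Interpretability: Linear and Decision Tree are easiest to explain.

Flexibility: Polynomial and KNN capture nonlinearities.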