# Housing Price Prediction

In [None]:
from google.colab import files
uploaded = files.upload()


Saving Housing Price.csv to Housing Price.csv


# STEP 1 : Import Libraries

1. pandas → for loading and handling the dataset

2. numpy → for numerical operations

3. matplotlib / seaborn → for visualization

4. sklearn → for preprocessing, splitting data, and machine learning models

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# STEP 2 : Load and Understand the Dataset



In [None]:

df = pd.read_csv("Housing Price.csv")
df.head()





Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


# STEP 3 — Preprocessing (Encoding categorical columns)

Binary (yes/no) columns are label-encoded to 0/1.
The three-level furnishingstatus column (furnished / semi‑furnished / unfurnished) is one‑hot encoded, with the first level dropped as the baseline.

In [None]:
df_copy = df.copy()

binary_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
               'airconditioning', 'prefarea']

le = LabelEncoder()

for col in binary_cols:
    df_copy[col] = le.fit_transform(df_copy[col])



In [None]:

df_copy = pd.get_dummies(df_copy, columns=['furnishingstatus'], drop_first=True)

df_copy.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,False,False
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,False,False
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,True,False
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,False,False
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,False,False
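
To make the two encodings concrete, here is a minimal sketch on a toy frame. The three rows are hypothetical values chosen for illustration, not rows from the dataset; only the column names match.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values) with one binary and one three-level column
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# LabelEncoder sorts the classes, so "no" -> 0 and "yes" -> 1
le_toy = LabelEncoder()
toy["mainroad"] = le_toy.fit_transform(toy["mainroad"])
classes = list(le_toy.classes_)

# drop_first=True drops the first level ("furnished"); a furnished row is
# then represented by False in both remaining dummy columns
toy = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True)

print(classes)
print(toy)
```

Dropping one level avoids a redundant column: the two remaining dummies already determine the third.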


# STEP 4 — Split into train/test (unseen data)

Target column = price

In [None]:

X = df_copy.drop("price", axis=1)
y = df_copy["price"]


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

((436, 13), (109, 13))

# STEP 5: Apply Models

**A) Simple Linear Regression (1 feature)**

We first predict price from a single input variable, area.

In [None]:
X_train_simple = X_train[['area']]
X_test_simple  = X_test[['area']]

lr_simple = LinearRegression()
lr_simple.fit(X_train_simple, y_train)

pred_simple = lr_simple.predict(X_test_simple)

**B) MULTIPLE Linear Regression (many features)**

This model uses all 13 input features (area, bedrooms, bathrooms, etc.).

In [None]:

lr_multi = LinearRegression()
lr_multi.fit(X_train, y_train)

pred_multi = lr_multi.predict(X_test)


**C) Polynomial Regression**

Polynomial regression expands the features into polynomial terms (squares and pairwise products) and then fits a linear model on the expanded set.
We use degree = 2.

In [None]:
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)

pred_poly = poly_model.predict(X_poly_test)

**D) KNN Regression**

We use k = 5 neighbors.

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

pred_knn = knn.predict(X_test)

# STEP 6 — Evaluate Models

We already have predictions from four models:
1. **pred_simple** (simple linear regression)

2. **pred_multi** (multiple linear regression)

3. **pred_poly** (polynomial regression)

4. **pred_knn** (KNN regression)

We evaluate them using the **4 standard regression metrics**:

1. **MAE** : Mean Absolute Error

2. **MSE** : Mean Squared Error

3. **RMSE** : Root Mean Squared Error

4. **R² Score** : Proportion of the variance in the target that the model explains

In [None]:
def evaluate_model(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    return mae, mse, rmse, r2
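
To show what these four metrics compute, the sketch below works them out by hand with NumPy on four made-up values and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                      # [1, 0, -1, -2]

mae_manual = np.mean(np.abs(errors))         # mean of absolute errors
mse_manual = np.mean(errors ** 2)            # mean of squared errors
rmse_manual = np.sqrt(mse_manual)            # same units as the target

ss_res = np.sum(errors ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, r2_manual)

# The hand-computed values agree with scikit-learn
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

RMSE is usually the easiest of the error metrics to interpret because it is expressed in the same units as price.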



*   **Evaluate Simple Linear Regression**




In [None]:
print("Simple Linear Regression:")
evaluate_model(y_test, pred_simple)


Simple Linear Regression:


(1474748.1337969352,
 3675286604768.185,
 np.float64(1917103.7021424233),
 0.27287851871974644)

R² = 0.27 → only 27% of the price variation is explained (weak performance).

The error values are high: with only one feature, the model cannot capture the influence of the other variables on price.

*   **Evaluate Multiple Linear Regression**


In [None]:
print("Multiple Linear Regression:")
evaluate_model(y_test, pred_multi)


Multiple Linear Regression:


(970043.4039201637,
 1754318687330.6633,
 np.float64(1324506.9600914384),
 0.6529242642153185)

R² = 0.65 → explains 65% of the variation in housing price (best performance).

It has the lowest MAE and the lowest RMSE of the four models, making it the best performer on this dataset.



*   **Evaluate Polynomial Regression**




In [None]:
print("Polynomial Regression:")
evaluate_model(y_test, pred_poly)


Polynomial Regression:


(1042927.6357113257,
 1916484377876.3582,
 np.float64(1384371.4739463388),
 0.6208412813618335)

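R² = 0.62, close to multiple regression (decent, but likely overfitting slightly).

Test error is higher than for multiple linear regression.
A degree-2 expansion of 13 features produces 105 terms for only 436 training rows, so the model has room to fit noise.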
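
One way to check the overfitting reading is to compare R² on the training set with R² on the test set: a much larger gap for the polynomial model than for the linear one would support it. The sketch below does this on synthetic data (an assumption, since the notebook does not report training scores); the same loop could be run on `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the housing data (assumption): 545 rows, 13 features,
# a purely linear target plus noise
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(545, 13))
coef = rng.normal(size=13)
y_syn = X_syn @ coef + rng.normal(scale=2.0, size=545)

Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "poly (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    # For regressors, .score() returns R^2
    scores[name] = (model.score(Xtr, ytr), model.score(Xte, yte))
    train_r2, test_r2 = scores[name]
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

The polynomial model always fits the training data at least as well as the linear one, because its feature set contains the linear features; what matters is whether that advantage carries over to the test set.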



*   **Evaluate KNN Regression**




In [None]:
print("KNN Regression:")
evaluate_model(y_test, pred_knn)


KNN Regression:


(1296547.7064220184,
 3213839804128.4404,
 np.float64(1792718.5512869665),
 0.36417150272211063)

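R² = 0.36 (weak performance).

MAE, MSE, and RMSE are all high.
KNN is sensitive to feature scaling: the features here are unscaled, so area (in the thousands) dominates the distance calculation over the 0/1 and small-integer columns.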
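
The scaling sensitivity can be illustrated with a small sketch on synthetic data. The data below is an assumption for illustration only: one large-range feature standing in for area and one 0/1 flag, both of which affect the target. Standardizing inside a pipeline keeps the scaler fitted on training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): a large-range feature and a binary flag
rng = np.random.default_rng(0)
n = 500
area = rng.uniform(1000, 10000, size=n)
flag = rng.integers(0, 2, size=n)
price = 500 * area + 2_000_000 * flag + rng.normal(scale=200_000, size=n)

X_syn = np.column_stack([area, flag])
Xtr, Xte, ytr, yte = train_test_split(X_syn, price, test_size=0.2, random_state=42)

# Unscaled: distances are driven almost entirely by the large-range feature
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(Xtr, ytr)

# Scaled: each feature contributes on a comparable scale
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(Xtr, ytr)

r2_raw = knn_raw.score(Xte, yte)
r2_scaled = knn_scaled.score(Xte, yte)
print(f"unscaled R2 = {r2_raw:.3f}, scaled R2 = {r2_scaled:.3f}")
```

Whether scaling closes the gap to linear regression on the housing data is not shown here; that would need to be measured by refitting `knn` on scaled versions of `X_train` and `X_test`.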



# Final Conclusion
Among all models tested, Multiple Linear Regression achieved the best performance, with the lowest error (MAE and RMSE) and the highest R² score (~0.65). This suggests that a linear combination of features such as area, bedrooms, bathrooms, stories, and furnishing status captures most of the signal these models can extract, although about 35% of the price variation remains unexplained.
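
For a side-by-side view, the metrics printed in the evaluation cells can be collected into one table. The sketch below copies the reported values by hand, rounded; the row labels are chosen here for readability.

```python
import pandas as pd

# Values copied from the evaluation outputs, rounded
results = pd.DataFrame(
    {
        "MAE": [1_474_748, 970_043, 1_042_928, 1_296_548],
        "RMSE": [1_917_104, 1_324_507, 1_384_371, 1_792_719],
        "R2": [0.273, 0.653, 0.621, 0.364],
    },
    index=["Simple Linear", "Multiple Linear", "Polynomial (degree 2)", "KNN (k=5)"],
)

# Rank models from highest to lowest R^2
ranked = results.sort_values("R2", ascending=False)
best_model = ranked.index[0]

print(ranked)
print("Best by R2:", best_model)
```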