
## **Housing Price Prediction using Machine Learning**
## **1. Introduction**

Buying or selling a house depends on many factors such as area, location, and facilities. In this project, we use Machine Learning regression models to predict house prices from a dataset with 13 columns: the price and 12 input features.

## **2. Problem Statement**

The goal of this project is to predict house prices accurately. We compare different machine learning models to find the one with the best fit (highest R²) and the lowest error.

## **3. Methodology**

## Data Preprocessing

*   Convert Yes / No values into 1 / 0.
*   Convert Furnishing status into numbers (0, 1, 2).
*   Scale the data using StandardScaler.
*   Split data into training and testing sets.

## Model Building

We build and test the following regression models:

*   Simple Linear Regression (using Area only)
*   Multiple Linear Regression (using all features)
*   Polynomial Regression
*   K-Nearest Neighbors (KNN)
*   Decision Tree Regression

## Model Evaluation

We evaluate each model using:

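*   **R² Score:** the share of the variance in price that the model explains
*   **Adjusted R²:** R² penalized for the number of features used
*   **RMSE:** the typical size of the prediction error, in price units
*   **MAPE:** the average error as a percentage of the actual price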
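For reference, these four metrics are usually defined as follows, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean actual price, $n$ the number of test samples, and $p$ the number of predictors. These are the standard textbook definitions, written out as a reading aid; the symbols are not taken from the notebook itself.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\
\mathrm{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
\end{align}
```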

## **4. Conclusion**

*   Using only Area is not enough to predict house price.
*   Multiple Linear Regression gives the best results.
*   House price depends on size, location, and amenities.
*   Proper data preprocessing and scaling improve model performance.

## **1.1 Import Libraries**
First, we need to bring in the Python tools (libraries) that will help us work with data.
*   **Pandas:** For reading the data table.
*   **NumPy:** For mathematical calculations.
*   **Matplotlib & Seaborn:** For drawing charts and graphs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **Load the Dataset**


In [None]:

from google.colab import files

uploaded = files.upload()

In [None]:
df = pd.read_csv('Housing Price.csv')
print("Dataset loaded successfully!")

## **Data Overview**


In [None]:
df.head()

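|  | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |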


## **Data Cleaning: Check for Nulls**


In [None]:
df.isnull().sum()

|  | 0 |
|---|---|
| price | 0 |
| area | 0 |
| bedrooms | 0 |
| bathrooms | 0 |
| stories | 0 |
| mainroad | 0 |
| guestroom | 0 |
| basement | 0 |
| hotwaterheating | 0 |
| airconditioning | 0 |
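The cells that follow use `X_train`, `X_test`, `y_train`, and `y_test`, but the encoding and splitting steps listed in the Methodology are not shown in this notebook. The sketch below illustrates one way those steps could look. It runs on a small stand-in frame (`df_demo`, rows copied from the `df.head()` output) so that it is self-contained. The furnishing order in `furnish_map`, the `test_size=0.2` split, and `random_state=42` are assumptions, not values confirmed by the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the notebook's column names (rows copied from df.head()).
df_demo = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 12215000, 11410000],
    "area": [7420, 8960, 9960, 7500, 7420],
    "bedrooms": [4, 4, 3, 4, 4],
    "bathrooms": [2, 4, 2, 2, 1],
    "stories": [3, 4, 2, 2, 2],
    "mainroad": ["yes", "yes", "yes", "yes", "yes"],
    "guestroom": ["no", "no", "no", "no", "yes"],
    "basement": ["no", "no", "yes", "yes", "yes"],
    "hotwaterheating": ["no", "no", "no", "no", "no"],
    "airconditioning": ["yes", "yes", "no", "yes", "yes"],
    "parking": [2, 3, 2, 3, 2],
    "prefarea": ["yes", "no", "yes", "yes", "no"],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished",
                         "furnished", "furnished"],
})

# Step 1: convert yes / no columns into 1 / 0.
binary_cols = ["mainroad", "guestroom", "basement",
               "hotwaterheating", "airconditioning", "prefarea"]
for col in binary_cols:
    df_demo[col] = df_demo[col].map({"yes": 1, "no": 0})

# Step 2: convert furnishing status into 0, 1, 2 (this ordering is an assumption).
furnish_map = {"unfurnished": 0, "semi-furnished": 1, "furnished": 2}
df_demo["furnishingstatus"] = df_demo["furnishingstatus"].map(furnish_map)

# Step 3: separate the target from the 12 input features, then split.
X = df_demo.drop(columns="price")
y = df_demo["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)
print(X_train.shape, X_test.shape)
```

On the real data the same steps would be applied to `df` instead of `df_demo`, before the scaling cell.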


## **Feature scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## **Simple Linear Regression (Area Only)**


In [None]:
from sklearn.linear_model import LinearRegression
# Select only the first column (Area)
X_train_simple = X_train[:, 0:1]
X_test_simple = X_test[:, 0:1]

simple_model = LinearRegression()
simple_model.fit(X_train_simple, y_train)
print("Simple Linear Regression (Area only) Trained.")

Simple Linear Regression (Area only) Trained.


## **Evaluation: Simple Linear Regression**

*   **MAE (Mean Absolute Error):** The average absolute difference between predicted and actual price.
*   **MSE/RMSE:** Squares the errors before averaging, so large errors are penalized more heavily. RMSE is in the same units as price.
*   **R² Score:** The share of the variance in price that the independent variables explain (1.0 is perfect).

*   `y_test`: The actual real prices.
*   `y_pred_simple`: The prices our model guessed.
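To make these definitions concrete, here is a small sketch that computes MAE, RMSE, and R² by hand with NumPy and checks them against scikit-learn. The arrays `y_true` and `y_hat` are toy values chosen for easy arithmetic, not housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual and predicted values (illustrative only).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_hat

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's functions.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```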

In [None]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred_simple = simple_model.predict(X_test_simple)
print("MAE:", mean_absolute_error(y_test, y_pred_simple))
print("Simple Linear R2 Score:", r2_score(y_test, y_pred_simple))
print("Simple Linear RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_simple)))

MAE: 1164724.99112224
Simple Linear R2 Score: 0.3644537917042163
Simple Linear RMSE: 1520330.3218016648


In [None]:

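# Feature values for one hypothetical house, in the column order used for training
# (assumed here to follow the dataset: area, bedrooms, bathrooms, stories, mainroad,
# guestroom, basement, hotwaterheating, airconditioning, parking, prefarea, furnishingstatus).
my_house = np.array([[8000, 3, 2, 2, 1, 0, 0, 0, 1, 2, 1, 2]])
my_house_scaled = scaler.transform(my_house)  # reuse the scaler fitted on the training data
simple_pred = simple_model.predict(my_house_scaled[:, 0:1])  # area column only
print(f"Simple Model (Area only) predicts: {simple_pred[0]:,.2f}")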

## **Multiple Linear Regression**


In [None]:
multi_model = LinearRegression()
multi_model.fit(X_train, y_train)
y_pred_multi = multi_model.predict(X_test)
print("Multiple Linear Regression Trained.")

Multiple Linear Regression Trained.


## **Evaluation: Multiple Linear Regression**

In [None]:
print("Multiple Linear R2 Score:", r2_score(y_test, y_pred_multi))
print("Multiple Linear RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_multi)))

Multiple Linear R2 Score: 0.7728898145444733
Multiple Linear RMSE: 908830.0894893687


## **Polynomial Regression**


In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
print("Polynomial Regression Trained.")

Polynomial Regression Trained.


## **Evaluation: Polynomial Regression**


In [None]:
y_pred_poly = poly_model.predict(X_test_poly)
print("Polynomial Regression R2 Score:", r2_score(y_test, y_pred_poly))
print("Polynomial RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_poly)))

Polynomial Regression R2 Score: 0.7049584394631567
Polynomial RMSE: 1035871.7504454693


## **KNN Regression**

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train, y_train)
print("KNN Regression Model Trained.")

KNN Regression Model Trained.


## **Evaluation: KNN Regression**


In [None]:
y_pred_knn = knn_model.predict(X_test)
print("KNN R2 Score:", r2_score(y_test, y_pred_knn))
print("KNN RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_knn)))

KNN R2 Score: 0.6981737268207961
KNN RMSE: 1047714.3973141963


In [None]:
knn_pred = knn_model.predict(my_house_scaled)
print(f"KNN Model predicts: {knn_pred[0]:,.2f}")

KNN Model predicts: 8,463,000.00


## **Decision Tree Regression**


In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)
print("Decision Tree Model Trained.")

Decision Tree Model Trained.


## **Evaluation: Decision Tree**


In [None]:
y_pred_tree = tree_model.predict(X_test)
print("Decision Tree R2 Score:", r2_score(y_test, y_pred_tree))
print("Decision Tree RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_tree)))

Decision Tree R2 Score: 0.20249933274650422
Decision Tree RMSE: 1703059.8825540862


In [None]:
tree_pred = tree_model.predict(my_house_scaled)
print(f"Decision Tree predicts: {tree_pred[0]:,.2f}")

Decision Tree predicts: 8,575,000.00


## **Calculate All Evaluation Metrics**
*   **MAE/RMSE:** Lower is better (Less error).
*   **R² / Adj R²:** Higher is better (Better fit).
*   **MAPE:** Lower is better (Percentage error; scikit-learn reports it as a fraction, so 0.18 means 18%).


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error

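# Each tuple: (model name, test predictions, p), where p is the predictor count used for Adjusted R².
# Note: the recorded results use p = 13, although my_house above supplies 12 feature values,
# and the polynomial model is actually fitted on the expanded degree-2 feature set.
model_list = [
    ('Simple Linear', y_pred_simple, 1),
    ('Multiple Linear', y_pred_multi, 13),
    ('Polynomial', y_pred_poly, 13),
    ('KNN', y_pred_knn, 13),
    ('Decision Tree', y_pred_tree, 13)
]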

results = []
n = len(y_test)

for name, pred, p in model_list:
    r2 = r2_score(y_test, pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    results.append({
        'Model': name,
        'MAE': mean_absolute_error(y_test, pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, pred)),
        'R2': r2,
        'Adj_R2': adj_r2,
        'MAPE': mean_absolute_percentage_error(y_test, pred)
    })

final_comparison = pd.DataFrame(results)
final_comparison.sort_values(by='R2', ascending=False)

|  | Model | MAE | RMSE | R2 | Adj_R2 | MAPE |
|---|---|---|---|---|---|---|
| 1 | Multiple Linear | 725949.3 | 908830.1 | 0.77289 | 0.741812 | 0.177292 |
| 2 | Polynomial | 810040.3 | 1035872.0 | 0.704958 | 0.664584 | 0.191826 |
| 3 | KNN | 812300.6 | 1047714.0 | 0.698174 | 0.656871 | 0.196438 |
| 0 | Simple Linear | 1164725.0 | 1520330.0 | 0.364454 | 0.358514 | 0.28329 |
| 4 | Decision Tree | 1210128.0 | 1703060.0 | 0.202499 | 0.093368 | 0.272892 |
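The Adj_R2 column can be reproduced by hand from the R2 column. The sketch below assumes a test set of 109 rows; that size is not stated in the notebook and is inferred here because it is the value consistent with the reported figures. The helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n samples and p predictors (same formula as the loop above)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n_test = 109  # assumption: inferred from the reported results, not stated in the notebook

# R² values printed earlier in the notebook.
r2_multi = 0.7728898145444733
r2_simple = 0.3644537917042163

# p = 13 is what model_list passes for the multi-feature models.
print(round(adjusted_r2(r2_multi, n_test, 13), 6))   # 0.741812, the table value
print(round(adjusted_r2(r2_simple, n_test, 1), 6))   # 0.358514, the table value

# With p = 12 (the number of values in my_house) the penalty is slightly smaller.
print(round(adjusted_r2(r2_multi, n_test, 12), 6))
```

The gap between R² and Adjusted R² grows as `p` approaches `n`, which is why the Decision Tree's already low R² drops to about 0.09 after adjustment.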


## **Final Interpretation & Conclusion**
## **1. Model Performance Analysis**

We evaluated the models using R², Adjusted R², RMSE, and MAPE. The results are summarized below:

## **Simple Linear Regression**
This model was one of the two weakest (R² ≈ 0.36, MAPE ≈ 28%). It used only Area to predict house price, which resulted in high error and a poor fit. Only the Decision Tree scored lower on R² and RMSE. This shows that house size alone is not enough to predict price.

## **Multiple Linear Regression**
This model performed best. It achieved the highest R² (≈ 0.77) and the lowest MAPE (around 18%). Adding more features such as Air Conditioning, Number of Bathrooms, and Furnishing Status greatly improved prediction accuracy over the area-only model.

## **Polynomial Regression**
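This model came second, somewhat behind Multiple Linear Regression (R² ≈ 0.70 versus 0.77, RMSE ≈ 1.04M versus 0.91M). The degree-2 terms can capture non-linear (curved) relationships, but on this test set they did not improve on the plain linear fit.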
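One thing worth checking with polynomial regression is how many columns the degree-2 expansion produces, since a large expanded feature set is easier to overfit on a small training set. A minimal sketch, assuming 12 input features (the number of values in `my_house`); `X_dummy` is a placeholder array used only to count columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 12  # assumption: matches the 12 values supplied in my_house

# A single dummy row is enough to see the shape of the expanded feature matrix.
X_dummy = np.zeros((1, n_features))
n_poly_terms = PolynomialFeatures(degree=2).fit_transform(X_dummy).shape[1]

# 1 bias + 12 linear + 12 squared + 66 pairwise products
print(n_poly_terms)  # 91
```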

## **Decision Tree Regression**
Although flexible, this model had the lowest R² (≈ 0.20) and the highest RMSE on the test data. This suggests overfitting: a tree grown without a depth limit can fit the training data almost perfectly yet fail to generalize to new houses.
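One way to check the overfitting explanation is to compare R² on the training set with R² on the test set: a large gap between the two is the usual sign. The sketch below does this on synthetic data from `make_regression` (not the housing data), for an unrestricted tree and for one with `max_depth=4`. The depth value and the data settings are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with 12 features and added noise.
X, y = make_regression(n_samples=500, n_features=12, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for depth in (None, 4):  # None = grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # .score() returns R² for regressors.
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, "
          f"test R2={scores[depth][1]:.3f}")
```

Running the same comparison with the notebook's `tree_model`, `X_train`, and `X_test` would show whether the low test score here comes from memorizing the training set.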

## **K-Nearest Neighbors (KNN)**
KNN showed mid-range performance (R² ≈ 0.70), close to Polynomial Regression but worse than Multiple Linear Regression. This is likely because the dataset has many features, making distance-based calculations less reliable.

## **2. Best Model**

The Multiple Linear Regression model is the best choice for this dataset, with Polynomial Regression as the runner-up.

**Reason:**
It has the highest R² and the lowest MAE, RMSE, and MAPE of the five models, and it is easy to understand and interpret using coefficients.
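Interpreting a linear model through its coefficients can be sketched as follows. The data here is synthetic, and the three feature names and the price formula are made up for illustration; they are not the notebook's fitted values. Because the features are standardized first, each coefficient is the price change per one standard deviation of that feature, which makes the magnitudes comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["area", "bathrooms", "airconditioning"]  # illustrative subset

# Synthetic features on very different scales.
X_raw = np.column_stack([
    rng.uniform(2000, 10000, 200),   # area
    rng.integers(1, 4, 200),         # bathrooms: 1 to 3
    rng.integers(0, 2, 200),         # airconditioning: 0 or 1
])

# Made-up price formula plus noise, used only to have something to fit.
y = (400 * X_raw[:, 0] + 900_000 * X_raw[:, 1] + 800_000 * X_raw[:, 2]
     + rng.normal(0, 100_000, 200))

X_scaled = StandardScaler().fit_transform(X_raw)
model = LinearRegression().fit(X_scaled, y)

# Pair each coefficient with its feature name and sort by absolute size.
coef_table = pd.Series(model.coef_, index=feature_names)
coef_table = coef_table.sort_values(key=abs, ascending=False)
print(coef_table)
```

Applied to the notebook's `multi_model`, the same pairing of `multi_model.coef_` with the training column names would show which features move the predicted price the most.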

## **3. Key Learnings**

## Feature Importance:
House price depends on more than just area. Features like bathrooms, air conditioning, furnishing, and location play a major role.

## Data Scaling:
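Scaling puts features on a comparable range, because some features (like Area) had much larger values than others (like number of bedrooms). This matters most for distance-based models such as KNN. Ordinary linear regression predictions and decision tree splits do not depend on feature scale.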
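The effect on distance-based models can be seen with a tiny example. The sketch below takes three toy houses described by area and bedrooms only (illustrative values) and measures how much of the squared Euclidean distance between two of them comes from the bedroom difference, before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: area, bedrooms (toy values for illustration).
houses = np.array([
    [7420.0, 4.0],
    [7500.0, 1.0],
    [8960.0, 4.0],
])


def bedroom_share(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of the squared distance between a and b due to the bedrooms column."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()


# Raw units: an 80-unit area gap swamps a 3-bedroom gap.
raw_share = bedroom_share(houses[0], houses[1])

# After standardizing, each column has mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(houses)
scaled_share = bedroom_share(scaled[0], scaled[1])

print(f"bedroom share of squared distance, raw:    {raw_share:.4f}")
print(f"bedroom share of squared distance, scaled: {scaled_share:.4f}")
```

Without scaling, KNN would treat the first two houses as near-identical even though one has four bedrooms and the other has one.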