# Problem Statement

The real estate market is influenced by multiple factors such as property size, number of rooms, location-based amenities, furnishing status, and other housing features. Accurately estimating house prices is essential for buyers, sellers, and real estate companies to make informed decisions.
However, manually evaluating the impact of multiple variables on house prices is complex and error-prone.

# Objective
To develop a regression-based predictive model that estimates house prices based on various property features using data analysis and machine learning techniques. The model aims to learn relationships between independent variables and house prices to provide accurate and reliable predictions.


In [213]:
# Import pandas
import pandas as pd

In [214]:
# Load and read the dataset
df=pd.read_csv('/content/Housing.csv')

The dataset was successfully loaded and read.

In [215]:
df.head() # Display the first 5 rows of the dataset

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [216]:
df.shape # Display the dimensions of the dataset

(545, 13)

The dataset contains 545 rows and 13 columns.

In [217]:
df.info() # Display the concise summary of the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


Displayed the dataset structure to identify the numerical and categorical features and their data types: six columns are numerical (int64) and seven are categorical (object).

In [218]:
df.describe() # Display the summary statistics of the dataset

,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0


Displayed the descriptive statistics of the numerical features of the dataset before preprocessing.

In [219]:
# Check for missing values
missing_values=df.isnull().sum().sum()
print(missing_values)

0


No missing values found in the dataset.
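
The single total above collapses every column into one number. A per-column breakdown can show where gaps sit if any exist. The sketch below uses a small made-up DataFrame (the `toy` data and its values are illustrative assumptions, not rows from Housing.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'area'
toy = pd.DataFrame({
    "price": [100, 200, 300],
    "area": [1000.0, np.nan, 3000.0],
})

per_column = toy.isnull().sum()   # number of missing values in each column
total = int(per_column.sum())     # same total as isnull().sum().sum()

print(per_column)
print("Total missing:", total)
```

On the housing data, the same two lines applied to `df` would be expected to report zero for every column, consistent with the total of 0 above.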

In [220]:
# Check for duplicate values
duplicate=df.duplicated().sum()
print(duplicate)

0


No duplicate rows found in the dataset.

In [221]:
# Convert the prices into lakhs
df['price']=df['price']/100000

Converted house prices to lakhs (1 lakh = 100,000) so that prices and error metrics such as MAE and RMSE are easier to read.
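
One lakh is 100,000, so a price of 13,300,000 becomes 133.0 lakhs. The sketch below illustrates, on synthetic data, that dividing the target by a constant only rescales the predictions of an ordinary least-squares fit and leaves R² unchanged. The `area` and `price` arrays are made-up assumptions for illustration and are not drawn from the dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
area = rng.uniform(1500, 16000, size=(200, 1))           # synthetic areas
price = 500 * area[:, 0] + rng.normal(0, 500000, 200)    # synthetic prices in original units
price_lakhs = price / 100000                              # 1 lakh = 100,000

model_original = LinearRegression().fit(area, price)
model_lakhs = LinearRegression().fit(area, price_lakhs)

# R² is the same for both targets; only the unit of the predictions differs
r2_original = model_original.score(area, price)
r2_lakhs = model_lakhs.score(area, price_lakhs)
print(r2_original, r2_lakhs)
```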

In [222]:
# Rename the columns
df=df.rename(columns={'price':'prices_in_lakhs','area':'area_sqft','mainroad':'main_road','guestroom':'guest_room','hotwaterheating':'hot_water_heating','airconditioning':'air_conditioning','prefarea':'preferred_area','furnishingstatus':'furnishing_status'})
print(df.columns) # Display the names of the columns after renaming

Index(['prices_in_lakhs', 'area_sqft', 'bedrooms', 'bathrooms', 'stories',
       'main_road', 'guest_room', 'basement', 'hot_water_heating',
       'air_conditioning', 'parking', 'preferred_area', 'furnishing_status'],
      dtype='object')


Renamed the columns to snake_case, separating words with underscores (_) to improve readability.

In [223]:
# Convert the datatype of categorical columns
categorical_cols = ['main_road','guest_room','basement','hot_water_heating','air_conditioning','preferred_area','furnishing_status']
df[categorical_cols] = df[categorical_cols].astype('category')

Converted categorical variables to the 'category' datatype to clearly distinguish them from the numerical features and prepare them for encoding.
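
A side effect of the 'category' datatype is lower memory use for columns with few distinct values, because each value is stored as a small integer code plus one shared list of categories. A minimal sketch on made-up yes/no data (the `values` list is an assumption for illustration):

```python
import pandas as pd

# Hypothetical column with only two distinct strings
values = ["yes", "no"] * 500

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object  :", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
print("categories:", list(as_category.cat.categories))
```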

In [224]:
df.info() # Display the concise summary of the dataset after appropriate changes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   prices_in_lakhs    545 non-null    float64 
 1   area_sqft          545 non-null    int64   
 2   bedrooms           545 non-null    int64   
 3   bathrooms          545 non-null    int64   
 4   stories            545 non-null    int64   
 5   main_road          545 non-null    category
 6   guest_room         545 non-null    category
 7   basement           545 non-null    category
 8   hot_water_heating  545 non-null    category
 9   air_conditioning   545 non-null    category
 10  parking            545 non-null    int64   
 11  preferred_area     545 non-null    category
 12  furnishing_status  545 non-null    category
dtypes: category(7), float64(1), int64(5)
memory usage: 30.3 KB


Displayed the dataset structure after preprocessing to verify that numerical and categorical features are correctly represented.

In [225]:
df.describe() # Display the descriptive statistics of the dataset after appropriate changes

,prices_in_lakhs,area_sqft,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,47.667292,5150.541284,2.965138,1.286239,1.805505,0.693578
std,18.704396,2170.141023,0.738064,0.50247,0.867492,0.861586
min,17.5,1650.0,1.0,1.0,1.0,0.0
25%,34.3,3600.0,2.0,1.0,1.0,0.0
50%,43.4,4600.0,3.0,1.0,2.0,0.0
75%,57.4,6360.0,3.0,2.0,2.0,1.0
max,133.0,16200.0,6.0,4.0,4.0,3.0


Reviewed the summary statistics of the numerical features after preprocessing: only the price column changed scale (now in lakhs, with a mean of about 47.67), while the other features are unchanged.

In [226]:
# Perform one hot encoding on furnishing_status column
df = pd.get_dummies(df, columns=["furnishing_status"], drop_first=True)
print(df.head()) # Display the first 5 rows to verify the changes

   prices_in_lakhs  area_sqft  bedrooms  bathrooms  stories main_road  \
0           133.00       7420         4          2        3       yes   
1           122.50       8960         4          4        4       yes   
2           122.50       9960         3          2        2       yes   
3           122.15       7500         4          2        2       yes   
4           114.10       7420         4          1        2       yes   

  guest_room basement hot_water_heating air_conditioning  parking  \
0         no       no                no              yes        2   
1         no       no                no              yes        3   
2         no      yes                no               no        2   
3         no      yes                no              yes        3   
4        yes      yes                no              yes        2   

  preferred_area  furnishing_status_semi-furnished  \
0            yes                             False   
1             no                      

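Performed one-hot encoding on the 'furnishing_status' column with drop_first=True, which drops the first category ('furnished') as the baseline and leaves two indicator columns: 'furnishing_status_semi-furnished' and 'furnishing_status_unfurnished'. Displayed the first 5 rows to verify the changes in the dataset.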
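
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row column (the `toy` DataFrame is an assumption for illustration). The first category in sorted order, 'furnished', gets no column of its own and is represented by both indicators being False:

```python
import pandas as pd

# Hypothetical toy column with the three furnishing levels
toy = pd.DataFrame({
    "furnishing_status": ["furnished", "semi-furnished", "unfurnished", "furnished"]
})

encoded = pd.get_dummies(toy, columns=["furnishing_status"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids a redundant column: with an intercept in the model, the three indicators would always sum to 1.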

Changes verified successfully.

In [227]:
# Create a list of binary columns
binary_cols = ['basement', 'hot_water_heating', 'air_conditioning', 'preferred_area', 'main_road', 'guest_room']

# Encode binary columns as integers (yes -> 1, no -> 0)
for col in binary_cols:
    df[col] = df[col].map({'yes': 1, 'no': 0}).astype(int)

Encoded all binary columns as 1 (yes) and 0 (no).
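
A dictionary-based `map` turns any value that is not a key into NaN, so a quick check for unmapped values can catch spelling or capitalization differences. A minimal sketch, assuming a made-up column with one deliberately inconsistent entry:

```python
import pandas as pd

# Hypothetical column; 'Yes' is a deliberate inconsistency
toy = pd.DataFrame({"basement": ["yes", "no", "Yes"]})

mapped = toy["basement"].map({"yes": 1, "no": 0})

# Values that did not match a dictionary key become NaN after mapping
unmapped = toy.loc[mapped.isna(), "basement"].tolist()
print("Unmapped values:", unmapped)
```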

In [228]:
df.head() # Display the first 5 rows to verify the changes

,prices_in_lakhs,area_sqft,bedrooms,bathrooms,stories,main_road,guest_room,basement,hot_water_heating,air_conditioning,parking,preferred_area,furnishing_status_semi-furnished,furnishing_status_unfurnished
0,133.0,7420,4,2,3,1,0,0,0,1,2,1,False,False
1,122.5,8960,4,4,4,1,0,0,0,1,3,0,False,False
2,122.5,9960,3,2,2,1,0,1,0,0,2,1,True,False
3,122.15,7500,4,2,2,1,0,1,0,1,3,1,False,False
4,114.1,7420,4,1,2,1,1,1,0,1,2,0,False,False


Changes verified successfully.

In [229]:
# Create a list of all input features used to predict the target
features=[
    'area_sqft',
    'basement',
    'hot_water_heating',
    'air_conditioning',
    'parking',
    'preferred_area',
    'bedrooms',
    'bathrooms',
    'stories',
    'main_road',
    'guest_room',
    'furnishing_status_semi-furnished', # One-hot encoded column
    'furnishing_status_unfurnished'  # One-hot encoded column
    ]
target='prices_in_lakhs' # target variable

x=df[features]
y=df[target]

The above code selects the relevant input features and defines the target variable for the housing price prediction problem. The data is separated into independent variables (x) and the dependent variable (y) to prepare it for Linear Regression in the following steps.

In [230]:
# Import the required classes and functions
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Perform train-test split first (using the original 'x' and 'y')
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42
)

The dataset has been split into training (80%) and testing (20%) sets using the train_test_split function, with random_state=42 so that the split is reproducible.
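
The following sketch illustrates what the split does, using made-up arrays of 100 rows (the `X` and `y` arrays are assumptions for illustration). A `test_size` of 0.2 holds out 20 rows, and repeating the call with the same `random_state` returns the same rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same seed, same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print("Reproducible:", np.array_equal(X_test, X_test_again))
```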

In [231]:
from sklearn.linear_model import LinearRegression

# Create a Linear Regression Model
lr = LinearRegression()

# Train the model using training data
lr.fit(x_train, y_train)

The Linear Regression model has been trained on the training set.

The model assumes a linear relationship between the input features and house prices. It also assumes that the observations are independent, the errors have constant variance (homoscedasticity), and the residuals are normally distributed. Additionally, the model assumes there is no strong multicollinearity among the independent variables.
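
Some of these assumptions can be examined with simple residual diagnostics. The sketch below is one rough way to do so, shown on synthetic data rather than the housing data: the arrays `X` and `y`, the coefficients, and the halfway split of fitted values are all assumptions chosen for illustration, and the checks are informal rather than statistical tests.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))                      # synthetic, independent features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# With an intercept, least-squares training residuals average to numerically zero
print("Mean residual:", residuals.mean())

# Rough constant-variance check: residual spread for low vs high fitted values
order = np.argsort(fitted)
low_spread = residuals[order[:150]].std()
high_spread = residuals[order[150:]].std()
print("Residual std (low, high):", low_spread, high_spread)

# Rough multicollinearity check: largest absolute correlation between two features
corr = np.corrcoef(X, rowvar=False)
max_offdiag = np.abs(corr - np.eye(3)).max()
print("Largest |correlation| between features:", max_offdiag)
```

A large gap between the two spreads would hint at non-constant variance, and a pairwise correlation close to 1 would hint at multicollinearity.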

In [232]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Calculate and display the required metrics on the test set
def evaluate(model, name):
    y_pred = model.predict(x_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)  # RMSE is the square root of MSE
    r2 = model.score(x_test, y_test)  # score() returns R² for regressors

    # Display the required metrics
    print(f"\n{name}")
    print("MAE :", mae)
    print("MSE :", mse)
    print("RMSE:", rmse)
    print("R²  :", r2)

evaluate(lr, "Linear Regression")


Linear Regression
MAE : 9.70043403920164
MSE : 175.43186873306635
RMSE: 13.245069600914386
R²  : 0.6529242642153186


The performance of the Linear Regression model was evaluated on the test set using MAE, MSE, RMSE, and R². The MAE indicates that, on average, predicted prices differ from actual prices by about 9.7 lakhs, roughly 20% of the dataset's mean price of 47.67 lakhs. The RMSE (about 13.2 lakhs) is higher than the MAE, which suggests that a few large prediction errors are present; this is common in housing datasets because of high-value properties. The R² of about 0.65 means the model explains around 65% of the variation in test-set house prices, a moderate fit. Overall, the model serves as a reasonable baseline for predicting house prices from the selected features.
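
For reference, the four metrics can be computed by hand. The sketch below works through them on three made-up values (`y_true` and `y_pred` are assumptions for illustration, not model output) and compares the results with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                          # [1, 0, -2]
mae = np.mean(np.abs(errors))                     # (1 + 0 + 2) / 3
mse = np.mean(errors ** 2)                        # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                               # same unit as the target
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("R²  :", r2)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.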

In [233]:
# Define a sample house and predict its price
sample = {
    "area_sqft": 12000,
    "basement": 'yes',
    "hot_water_heating": 'yes',
    "air_conditioning": 'yes',
    "parking": 3,
    "preferred_area": 'yes',
    "furnishing_status_semi-furnished": False,
    "furnishing_status_unfurnished": True,
    "bedrooms": 4,
    "bathrooms": 3,
    "stories": 4,
    "main_road": 'yes',
    "guest_room": 'no',
}

import numpy as np
import pandas as pd

input_sample = pd.DataFrame([sample])

# Binary columns were mapped to 0/1 during data preparation for df
# Apply the same mapping to the sample input
binary_cols_for_sample = ['basement', 'hot_water_heating', 'air_conditioning', 'preferred_area', 'main_road', 'guest_room']
for col in binary_cols_for_sample:
    if col in input_sample.columns:
        input_sample[col] = input_sample[col].map({'yes': 1, 'no': 0})

# Ensure the order of columns in input_sample matches the features used for training
# The 'features' list is defined in an earlier cell and was used to create 'x' (and thus 'x_train').
input_sample = input_sample[features]

# Prediction
print("Predicted Price:", lr.predict(input_sample)[0])

Predicted Price: 114.37032544840768


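A sample house was created to test the trained model, and the same preprocessing and feature ordering used during training were applied to the input. The sample sits at the upper end of the dataset's ranges (for example, 12,000 sq ft against a 75th percentile of 6,360), so the predicted price of approximately 114.37 lakhs is close to the dataset's maximum of 133 lakhs. A single prediction illustrates the workflow for new properties; it does not by itself measure accuracy.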
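
The preprocessing applied to the sample can be wrapped in a small helper so that every new input goes through the same steps. The function below is a hypothetical sketch: the name `prepare_sample` and its error handling are assumptions and not part of the notebook, and the short `features` list is only for illustration.

```python
import pandas as pd

def prepare_sample(sample, features, binary_cols):
    """Turn a dict describing one house into a one-row DataFrame
    with binary columns encoded and columns in training order."""
    missing = [c for c in features if c not in sample]
    if missing:
        raise KeyError(f"Missing features: {missing}")

    row = pd.DataFrame([sample])
    for col in binary_cols:
        mapped = row[col].map({"yes": 1, "no": 0})
        if mapped.isna().any():
            raise ValueError(f"Unexpected value in {col!r}: {sample[col]!r}")
        row[col] = mapped.astype(int)

    # Reorder columns to match the order used for training
    return row[features]

# Illustrative feature list (shorter than the notebook's)
features = ["area_sqft", "basement", "parking"]
binary_cols = ["basement"]

prepared = prepare_sample(
    {"parking": 2, "basement": "yes", "area_sqft": 5000}, features, binary_cols
)
print(prepared)
```

With the notebook's own `features` and `binary_cols` lists, the returned DataFrame could be passed to `lr.predict` in the same way as `input_sample`.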

**Conclusion:**
In this project, a Linear Regression model was built to predict house prices using key property features and amenities. After data preprocessing and encoding, the model reached an R² of about 0.65 and an MAE of about 9.7 lakhs on the test set, and it generated a plausible prediction for a new input. This demonstrates the practical applicability of machine learning to real estate price estimation, with the model serving as a baseline.