# House Price Prediction

- [1 - Introduction](#Introduction)
    - [1.1 - Project Overview](#Project-Overview)
    - [1.2 - Problem Statement](#Problem-Statement)
    - [1.3 - Dataset Description](#Dataset-Description)

- [2 - Import Libraries](#Import-Libraries)

- [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)
    - [3.1 - Load the Dataset](#Load-the-Dataset)
    - [3.2 - Display Basic Information](#Display-Basic-Information)
    
- [4 - Data Preprocessing](#Data-Preprocessing)
    - [4.1 - Removing Irrelevant Features](#Removing-Irrelevant-Features)
    - [4.2 - Handle Missing Values](#Handle-Missing-Values)
    - [4.3 - Encoding Categorical Variables](#Encoding-Categorical-Variables)
    - [4.4 - Feature Engineering](#Feature-Engineering)
    - [4.5 - Outlier Removal](#Outlier-Removal)
    - [4.6 - Further Encoding Categorical Variables](#Futher-Encoding-Categorical-Variables)
    - [4.7 - Feature Scaling](#Feature-Scaling)


- [5 - Data Splitting](#Data-Splitting)
    - [5.1 - Split into Train and Test Sets](#Split-into-Train-and-Test-Sets)
    - [5.2 - Split Data into Features (X) and Target (y)](#Split-Data-into-Features-X-and-Target-y)


- [6 - Model Definition and Training](#Model-Definition)
    - [6.1 - Defining Models](#Defining-Models)
    - [6.2 - Finding Best Model](#Finding-Best-Model)
    - [6.3 - Fine-Tuning Model](#Fine-Tuning-Model)

- [7 - Model Evaluation](#Model-Evaluation)

- [8 - Conclusion](#Conclusion)

# 1 - Introduction

## 1.1 - Project Overview
The goal of this project is to develop a predictive model that can estimate the prices of houses in Bengaluru, India. Accurately predicting house prices is crucial for real estate agents, buyers, and sellers to make informed decisions. By analyzing various factors such as the size of the property, location, and available amenities, we aim to build a machine learning model that can effectively predict house prices based on historical data.

## 1.2 - Problem Statement
The real estate market in Bengaluru is dynamic and influenced by multiple factors, making it challenging to estimate property prices accurately. The primary objective of this project is to address the following questions:

- Can we build an accurate model to predict house prices using historical real estate data from Bengaluru?
- How can we interpret the model's predictions to provide actionable insights for real estate professionals and potential buyers?

By answering these questions, we aim to create a tool that can assist in making more accurate and informed real estate decisions.

## 1.3 - Dataset Description
The dataset used in this project is sourced from Kaggle and contains detailed information on various properties in Bengaluru, India.

### Bengaluru House Data
Each row in the dataset represents a property listing, and each column provides different attributes about the properties.

- **Number of Rows:** 13,320 (properties)
- **Number of Columns:** 9 (8 features plus the target)
- **Target Column:** "price"

### Data Composition
The dataset includes the following information:

- **Area Type:**
  - The type of area (e.g., Super built-up Area, Plot Area, Built-up Area).

- **Availability:**
  - The availability status of the property (e.g., Ready to Move, available from a specific date).

- **Location:**
  - The location of the property within Bengaluru.

- **Size:**
  - The size of the property in terms of the number of bedrooms (e.g., 2 BHK, 3 Bedroom).

- **Total Area:**
  - The total area of the property in square feet.

- **Number of Bathrooms:**
  - The number of bathrooms available in the property.

- **Number of Balconies:**
  - The number of balconies available in the property.

This dataset provides a comprehensive view of the real estate market in Bengaluru, allowing us to analyze and model the factors that influence house prices effectively.


# [2 - Import Libraries](#Import-Libraries)

In this section, we import the libraries needed for data manipulation, visualization, and building machine learning models with scikit-learn.


In [None]:
# Basic libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn for data preprocessing, building, training the model and evaluation
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)

## [3.1 - Load the Dataset](#Load-the-Dataset)

In this section, we will load the Bengaluru House dataset into a pandas DataFrame for further exploration and analysis.

In [None]:
# Load the dataset into a pandas DataFrame
data_path = './Bengaluru_House_Data.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset to verify loading
df.head()

## [3.2 - Display Basic Information](#Display-Basic-Information)

In this section, we will display basic information about the dataset to understand its structure and contents.


In [None]:
# Display the basic information about the dataset
df.info()
df.shape

# [4 - Data Preprocessing](#Data-Preprocessing)

## [4.1 - Removing Irrelevant Features](#Removing-Irrelevant-Features)

In this section, we remove features that we assume carry no decisive weight for the target (house price).

The following code removes the `availability` feature from the DataFrame, as we consider it irrelevant for the analysis.

In [None]:
# Remove irrelevant features from the dataframe
df.drop(['availability'], axis=1, inplace=True)
df.head()

## [4.2 - Handle Missing Values](#Handle-Missing-Values)

In this section, we will identify and handle missing values in the dataset to ensure the data is clean and ready for modeling.

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

5,502 of the 13,320 samples have no value for **society**, so we drop this feature as well.

In [None]:
df = df.drop(['society'], axis=1)
df.head()

Next, we remove all samples that have no value for the balcony feature.

In [None]:
df = df.dropna(subset=['balcony'])
df.isnull().sum()

Now let's drop the rows that have no value for the location.

In [None]:
df = df.dropna(subset=['location'])
df.isnull().sum()
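
The missing-value handling above can be sketched on a small stand-in table. The toy DataFrame below is hypothetical and only mirrors the column names used in this notebook, and the 40% cut-off for dropping a column is an illustrative assumption rather than a rule taken from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mirrors the notebook's column names
toy = pd.DataFrame({
    "location": ["Whitefield", None, "Hebbal", "Hebbal", "Yelahanka"],
    "society": [None, None, "Prestige", None, None],
    "balcony": [1.0, 2.0, np.nan, 1.0, 0.0],
    "price": [50.0, 80.0, 120.0, 65.0, 40.0],
})

# Share of missing values per column (0.0 = complete, 1.0 = empty)
missing_share = toy.isnull().mean()

# Drop columns that are mostly empty (assumed cut-off: more than 40% missing)
sparse_columns = missing_share[missing_share > 0.4].index.tolist()
cleaned = toy.drop(columns=sparse_columns)

# Drop the remaining rows that lack a value in the columns we rely on
cleaned = cleaned.dropna(subset=["balcony", "location"])

print("Dropped columns:", sparse_columns)
print("Remaining shape:", cleaned.shape)
```

Looking at the share of missing values instead of the raw count makes it easier to decide between dropping a column and dropping individual rows.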

## [4.3 - Encoding Categorical Variables](#Encoding-Categorical-Variables)

In this section, we start converting the non-numerical columns into a numerical format that our machine learning models can use. The `size` and `total_sqft` columns hold numbers stored as text (e.g., "2 BHK" or "2100 - 2850"), so we parse them into floats here. The truly categorical columns, `location` and `area_type`, are handled with **One-Hot Encoding** in section 4.6, which transforms each categorical variable into a set of binary columns (0 or 1) representing the presence or absence of each category.


In [None]:
# Convert size to a numerical feature by keeping the leading number (e.g., "2 BHK" -> 2.0)
df['size'] = df['size'].apply(lambda x: float(x.split(' ')[0]))
df.info()
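
The `lambda` above raises an `AttributeError` if any missing value remains in `size`, because a missing entry is a float and has no `split` method. A more forgiving variant, sketched here on a hypothetical Series, extracts the leading digits with a regular expression and leaves unparseable entries as NaN.

```python
import pandas as pd

# Hypothetical sample values in the style of the size column
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Pull out the leading integer; rows without a match become NaN
bedrooms = sizes.str.extract(r"^(\d+)", expand=False).astype(float)

print(bedrooms.tolist())
```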


The **total_sqft** feature is not numerical. Let's find out what its values look like.

In [None]:
# Helper that checks whether a value can be converted to a float
def isFloat(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Show the rows whose total_sqft value is NOT a plain float
df[~df['total_sqft'].apply(isFloat)].head(40)


We can see that the total_sqft column contains strings that describe the area as a range, such as:
2100 - 2850, 1005.03 - 1252.49

The following code converts such range strings into floats by taking the midpoint of the range.
For simplicity, we drop rows whose values come in any other format.

In [None]:
def convert_to_float(x):
    try:
        # Case 1: Range values, e.g., "2100 - 2850" -> midpoint of the range
        if '-' in x:
            parts = x.split('-')
            return (float(parts[0].strip()) + float(parts[1].strip())) / 2

        # Default case: Single float value, e.g., "2100"
        else:
            return float(x.strip())
    except (TypeError, ValueError, AttributeError):
        # Any other format is marked as missing and dropped afterwards
        return None

# Applying the function to the total_sqft column
df['total_sqft']= df['total_sqft'].apply(convert_to_float)

# For Edge Cases where the conversion failed delete the rows
df = df.dropna(subset=['total_sqft'])

print("NaN Values per Column")
print(df.isnull().sum())
df.head(40)
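
To make the conversion rules concrete, here is a self-contained sketch applied to a handful of sample strings. The helper name `range_midpoint` and the sample values are hypothetical; the logic mirrors the approach described above (plain number, midpoint of a range, everything else treated as missing).

```python
import pandas as pd

def range_midpoint(value):
    """Return a float for plain numbers, the midpoint for 'a - b' ranges, else None."""
    try:
        text = str(value)
        if "-" in text:
            low, high = text.split("-")  # raises ValueError unless exactly two parts
            return (float(low) + float(high)) / 2
        return float(text)
    except ValueError:
        return None

# Hypothetical sample values: a plain number, two ranges, and a value with a unit
samples = pd.Series(["1200", "2100 - 2850", "34.46Sq. Meter", "1005.03 - 1252.49"])
converted = samples.apply(range_midpoint)

print(converted.tolist())
```

The range `2100 - 2850` becomes its midpoint 2475.0, while the entry with a unit suffix cannot be parsed and ends up as a missing value.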

## [4.4 - Feature Engineering](#Feature-Engineering)

In this section, we introduce a new feature that **will assist in identifying and removing outliers**: the price per unit of area, stored as `square_meter_price`. Since `total_sqft` is given in square feet, this column holds a price per square foot despite its name. The factor 100000 assumes that `price` is quoted in lakh (1 lakh = 100,000 rupees). The feature lets us compare listings of very different sizes on a common scale during outlier detection and removal.

In [None]:
# Create new feature 'square_meter_price' (price per square foot, since total_sqft is in square feet)
df['square_meter_price'] = df['price'] * 100000 / df['total_sqft']

df.shape
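
As a worked example of the arithmetic, assume `price` is quoted in lakh rupees and `total_sqft` in square feet. The two listings below are made up for illustration, and the column name `price_per_sqft` is a hypothetical stand-in for the notebook's `square_meter_price`.

```python
import pandas as pd

LAKH = 100_000  # 1 lakh = 100,000 rupees

# Two made-up listings: price in lakh, area in square feet
toy = pd.DataFrame({"price": [60.0, 150.0], "total_sqft": [1200.0, 2500.0]})

# 60 lakh / 1200 sqft = 5000 rupees per sqft; 150 lakh / 2500 sqft = 6000 rupees per sqft
toy["price_per_sqft"] = toy["price"] * LAKH / toy["total_sqft"]

print(toy)
```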

## [4.5 - Outlier Removal](#Outlier-Removal)

In this section, we will identify and remove outliers using the newly engineered feature. Removing these anomalies ensures a cleaner dataset and improves the model's accuracy and reliability.

First, we remove properties whose attributes are implausible, namely listings with less than 300 square feet per bedroom.

In [None]:
# Remove all properties with less than 300 square feet per bedroom
print(df.square_meter_price.describe())
df_tmp = df[(df['total_sqft'] / df['size'] < 300)]
print(df_tmp[['total_sqft', 'size', 'bath','balcony']].head())

df = df[~(df['total_sqft'] / df['size'] < 300)]
df.square_meter_price.describe()



The following function removes outliers based on the `square_meter_price` column for each location in the dataset. It keeps only the data points within one standard deviation of the mean for each location, filtering out extreme values that could skew the analysis or model training.

In [None]:
df.square_meter_price.describe()

def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.square_meter_price)
        st = np.std(subdf.square_meter_price)
        reduced_df = subdf[(subdf.square_meter_price>(m-st)) & (subdf.square_meter_price<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out

In [None]:
# Remove outliers from the dataset using the method defined above
df_new = remove_pps_outliers(df)

print("Removed Samples: ", df.shape[0] - df_new.shape[0])

df = df_new
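
The same per-location filter can be written without an explicit loop by using `groupby(...).transform`. This is a sketch on made-up data, with the hypothetical helper name `filter_within_one_std`; unlike the loop above, it keeps the original row order. One side effect is worth noting: a location with a single listing has a standard deviation of 0, so the strict lower bound removes that listing in both versions.

```python
import pandas as pd

def filter_within_one_std(frame, group_col, value_col):
    """Keep rows whose value lies in (mean - std, mean + std] of their group."""
    grouped = frame.groupby(group_col)[value_col]
    mean = grouped.transform("mean")
    # Population standard deviation (ddof=0), matching np.std
    std = grouped.transform(lambda s: s.std(ddof=0))
    keep = (frame[value_col] > mean - std) & (frame[value_col] <= mean + std)
    return frame[keep].reset_index(drop=True)

# Made-up data: location A has one extreme value, location C has a single listing
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 3 + ["C"],
    "square_meter_price": [4000, 4100, 4200, 4300, 9000, 5000, 5100, 5200, 6000],
})

filtered = filter_within_one_std(toy, "location", "square_meter_price")
print(filtered)
print("Removed samples:", len(toy) - len(filtered))
```

In this toy data the filter removes the extreme value 9000 in location A, but also the two outer values in the tightly clustered location B and the only listing in location C, which shows how aggressive a one-standard-deviation band can be.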

## [4.6 - Further Encoding Categorical Variables](#Futher-Encoding-Categorical-Variables)
The next step is to convert the **location** feature to a numerical feature by using One Hot Encoding.

In [None]:
# Print the number of unique values in the location column
print(len(df.location.unique()))

In [None]:
# Number of locations with fewer than 10 entries
print(len(df['location'].value_counts()[df['location'].value_counts() < 10]))

To simplify the dataset and enhance the effectiveness of model training, we will replace all locations that appear fewer than 10 times with the label 'Other'.

In [None]:
# Get the locations with fewer than 10 occurrences
rare_locations = df['location'].value_counts()[df['location'].value_counts() < 10].index

# Replace rare locations with 'Other'
df.loc[df['location'].isin(rare_locations), 'location'] = 'Other'

df.head(40)

Now let's encode the location feature by using One Hot Encoding.

In [None]:
#Encoding the categorical feature
df = pd.get_dummies(df, columns=['location'])

df.head()
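
The two steps above, grouping rare categories and one-hot encoding, can be seen end to end on a tiny example. The data is made up, and the threshold of 2 is chosen only so that the effect is visible on six rows (the notebook uses 10). Passing `dtype=int` is an optional choice that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Made-up listings with one location that occurs only once
toy = pd.DataFrame({
    "location": ["Hebbal", "Hebbal", "Whitefield", "Whitefield", "Whitefield", "Yelahanka"],
    "price": [70.0, 75.0, 90.0, 95.0, 88.0, 40.0],
})

min_count = 2  # assumed threshold for this toy example
counts = toy["location"].value_counts()
rare = counts[counts < min_count].index

# Keep frequent locations, replace rare ones with the label 'Other'
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "Other")

# One binary column per remaining location
encoded = pd.get_dummies(toy, columns=["location"], dtype=int)
print(encoded.columns.tolist())
```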

The next step is to convert the **area_type** feature to a numerical feature by using One Hot Encoding.

In [None]:
# Check how many unique values are there in the area_type column
print(df['area_type'].nunique())

# Print the names of the unique area types
print(df['area_type'].unique())


#Encoding the categorical feature
df = pd.get_dummies(df, columns=['area_type'])

# Print the number of columns in the dataframe after encoding 
print(len(df.columns))

df.shape

## [4.7 - Feature Scaling](#Feature-Scaling)

In this section, we apply feature scaling (standardization) so that all features are on a comparable scale. Regularized models such as Ridge, Lasso, and Elastic Net penalize the size of the coefficients, which makes them sensitive to the scale of the input data. Scaling prevents features with larger ranges from dominating the learning process.

In [None]:
# Create a StandardScaler object
scaler = StandardScaler()

# Standardize the size, bath, and balcony columns
df[['size', 'bath', 'balcony']] = scaler.fit_transform(df[['size', 'bath', 'balcony']])

# Standardize the total_sqft, price, and square_meter_price columns
# (this call refits the scaler, so it now holds the statistics of these three columns)
df[['total_sqft', 'price', 'square_meter_price']] = scaler.fit_transform(df[['total_sqft', 'price', 'square_meter_price']])

df.head()
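
The cell above scales the full dataset before it is split. A common alternative, shown here only as a sketch on synthetic data and not what this notebook does, is to fit the scaler on the training rows and reuse those statistics for the test rows, so that no information from the test set enters preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric feature columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_sqft": rng.uniform(500, 3000, size=100),
    "bath": rng.integers(1, 5, size=100).astype(float),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)

# Apply the training statistics to the test rows (no refitting)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled.mean().round(6).tolist())
print(test_scaled.shape)
```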

# [5 - Data Splitting](#Data-Splitting)

## [5.1 - Split into Train and Test Sets](#Split-into-Train-and-Test-Sets)

In this section, we split the dataset into training and test sets.
This step is essential for evaluating the model's performance and checking that it generalizes to unseen data.
We will use K-fold cross-validation on the training set for hyperparameter tuning, so we do not need a separate validation set.


In [None]:
# First, split the data into training and test sets (80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Display the sizes of each set to verify the split
print("Training set size:", len(train_df))
print("Test set size:", len(test_df))

## [5.2 - Split Data into Features (X) and Target (y)](#Split-Data-into-Features-X-and-Target-y)

In this section, we divide our dataset into two main components: features (X) and the target variable (y). The features (X) consist of all the independent variables that are used as input to the model, while the target variable (y) represents the outcome we aim to predict, in this case the house price. This separation is crucial for training and evaluating the model effectively.



In [None]:
# Define the target column
target_column = 'price'

# square_meter_price is computed from price, so keeping it as a feature would leak the target
leaky_columns = ['square_meter_price']

# Split the training set into features (X_train) and target (y_train)
X_train = train_df.drop(columns=[target_column] + leaky_columns)
y_train = train_df[target_column]

# Split the test set into features (X_test) and target (y_test)
X_test = test_df.drop(columns=[target_column] + leaky_columns)
y_test = test_df[target_column]

# Display the first few rows of each to verify
print("Training features (X_train):")
print(X_train.head())
print("\nTraining target (y_train):")
print(y_train.head())

print("\nTest features (X_test):")
print(X_test.head())
print("\nTest target (y_test):")
print(y_test.head())

# [6 - Model Definition and Training](#Model-Definition)
## [6.1 - Defining Models](#Defining-Models)

In this section, we define the four linear models that we will compare, using scikit-learn: Linear Regression, Ridge, Lasso, and Elastic Net.

In [None]:
linear_model = LinearRegression()
ridge_model = Ridge()
lasso_model = Lasso()
elastic_net_model = ElasticNet()


## [6.2 - Finding Best Model](#Finding-Best-Model)

In this section we will train different Regression Models on the training set and select the best one.

Let's start by fitting the four models with their default settings and comparing their R² scores on the test set.

In [None]:
# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Ridge Regression
ridge_model = Ridge()
ridge_model.fit(X_train, y_train)

# Lasso Regression
lasso_model = Lasso()
lasso_model.fit(X_train, y_train)

# Elastic Net 
elastic_net_model = ElasticNet()
elastic_net_model.fit(X_train, y_train)

# Print the R² score of the default models on the test set
print("Linear Regression Test Score: ", linear_model.score(X_test, y_test))
print("Ridge Regression Test Score: ", ridge_model.score(X_test, y_test))
print("Lasso Regression Test Score: ", lasso_model.score(X_test, y_test))
print("Elastic Net Test Score: ", elastic_net_model.score(X_test, y_test))

This code performs hyperparameter tuning using GridSearchCV for Ridge, Lasso, and Elastic Net regression models by exploring different regularization strengths (alpha). For Ridge, it also compares solvers; for Elastic Net, it also tunes the l1_ratio, which balances L1 and L2 regularization. The best model for each type is selected based on cross-validated mean squared error on the training set.

In [None]:
# Ridge Regression with different solvers
ridge_params = {
    'alpha': [0.01, 0.1, 1, 10, 100],
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'saga']
}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train, y_train)
best_ridge = ridge_grid.best_estimator_

# Lasso Regression with different regularization strengths
lasso_params = {
    'alpha': [0.01, 0.1, 1, 10, 100],
}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X_train, y_train)
best_lasso = lasso_grid.best_estimator_

# Elastic Net with different regularization strengths and L1/L2 mixes
elastic_net_params = {
    'alpha': [0.01, 0.1, 1, 10, 100],
    'l1_ratio': [0.1, 0.5, 0.7, 1.0],  # l1_ratio=1 corresponds to Lasso
}
elastic_net_grid = GridSearchCV(ElasticNet(), elastic_net_params, cv=5, scoring='neg_mean_squared_error')
elastic_net_grid.fit(X_train, y_train)
best_elastic_net = elastic_net_grid.best_estimator_
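
Because the scoring is `neg_mean_squared_error`, `GridSearchCV` reports the negated MSE so that larger is always better. The sketch below, which uses synthetic data from `make_regression` rather than the housing data, shows how to read the winning parameters and turn the score back into a positive MSE.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for X_train, y_train
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = [0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

# best_params_ holds the winning combination; best_score_ is the mean negated MSE
best_alpha = grid.best_params_["alpha"]
cv_mse = -grid.best_score_

print(f"Best alpha: {best_alpha}")
print(f"Cross-validated MSE: {cv_mse:.2f}")
```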


This code makes predictions on the test set using Linear Regression, Ridge, Lasso, and Elastic Net models, calculates their Mean Squared Error (MSE), and identifies the best-performing model based on the lowest MSE.

In [None]:
# Predictions
linear_pred = linear_model.predict(X_test)
ridge_pred = best_ridge.predict(X_test)
lasso_pred = best_lasso.predict(X_test)
elastic_net_pred = best_elastic_net.predict(X_test)

# Calculate Mean Squared Error
linear_mse = mean_squared_error(y_test, linear_pred)
ridge_mse = mean_squared_error(y_test, ridge_pred)
lasso_mse = mean_squared_error(y_test, lasso_pred)
elastic_net_mse = mean_squared_error(y_test, elastic_net_pred)

# Print the MSE results
print(f"Linear Regression MSE: {linear_mse}")
print(f"Ridge Regression MSE: {ridge_mse} with alpha={best_ridge.alpha}, solver={best_ridge.solver}")
print(f"Lasso Regression MSE: {lasso_mse} with alpha={best_lasso.alpha}")
print(f"Elastic Net MSE: {elastic_net_mse} with alpha={best_elastic_net.alpha}, l1_ratio={best_elastic_net.l1_ratio}")

# Determine the best model
best_model_name = min(
    [('Linear Regression', linear_mse),
     ('Ridge Regression', ridge_mse),
     ('Lasso Regression', lasso_mse),
     ('Elastic Net', elastic_net_mse)],
    key=lambda x: x[1]
)[0]

print(f"The best model based on the test set is: {best_model_name}")
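
Choosing the winner by test-set MSE, as above, means the test set also takes part in model selection. One alternative, sketched here on synthetic data and not what this notebook does, is to compare the candidates by cross-validated MSE on the training data and keep the test set for the final evaluation only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for X_train, y_train
X_train, y_train = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Lasso Regression": Lasso(),
    "Elastic Net": ElasticNet(),
}

# Mean 5-fold MSE per model (the scorer returns negated values, hence the minus sign)
cv_mse = {
    name: -cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

best_name = min(cv_mse, key=cv_mse.get)
print(f"Best model by cross-validated MSE: {best_name}")
```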

In [None]:
# Evaluate the best model
best_ridge.score(X_test, y_test)

## [6.3 - Fine-Tuning Model](#Fine-Tuning-Model)


This step involves fine-tuning the Ridge Regression model by exploring a finer range of alpha values around the initially successful value of 1. By using GridSearchCV with cross-validation on the training set, we systematically search for the optimal alpha that minimizes the mean squared error, potentially improving the model's performance further.

In [None]:
# Define a finer grid for alpha
ridge_params = {
    'alpha': [0.1, 0.5, 1, 2, 5, 10],
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'saga']
}

# Initialize GridSearchCV
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')

# Fit the model on the training data
ridge_grid.fit(X_train, y_train)

# Get the best model
best_ridge_fine_tuned = ridge_grid.best_estimator_

# Evaluate the fine-tuned model on the test set
ridge_fine_tuned_pred = best_ridge_fine_tuned.predict(X_test)
ridge_fine_tuned_mse = mean_squared_error(y_test, ridge_fine_tuned_pred)

# Print results
print(f"Fine-tuned Ridge Regression MSE: {ridge_fine_tuned_mse}")
print(f"Best alpha after fine-tuning: {best_ridge_fine_tuned.alpha}")
print(f"Best solver after fine-tuning: {best_ridge_fine_tuned.solver}")

In [None]:
fine_tuned_score = best_ridge_fine_tuned.score(X_test, y_test)
print(f"Fine-tuned Ridge Regression Score: {fine_tuned_score}")
normal_score = best_ridge.score(X_test, y_test)
print(f"Normal Ridge Regression Score: {normal_score}")

print("Better Model: ", "Fine-tuned Ridge Regression" if fine_tuned_score > normal_score else "Normal Ridge Regression")

Comparing the fine-tuned Ridge Regression model with the Ridge model from the first grid search, we observe that fine-tuning does not lead to any significant improvement in the model's predictive performance. We therefore keep the model from the first grid search as the final model.

In [None]:
final_model = best_ridge

# [7 - Model Evaluation](#Model-Evaluation)

In this section, we evaluate the performance of the Ridge Regression model using the test dataset. We begin by calculating key metrics such as Mean Squared Error (MSE) and the R² score to quantify the model's predictive accuracy. Additionally, we visualize the model's predictions by plotting the actual vs. predicted values to assess how closely the predictions match the real outcomes. We also include a residual plot to examine the distribution of errors, helping us to identify any patterns or issues that may suggest further improvements are needed. This comprehensive evaluation provides insights into the effectiveness of the Ridge Regression model and guides potential fine-tuning or adjustments.


In [None]:
# Step 1: Make predictions using the final Ridge Regression model
ridge_pred = final_model.predict(X_test)

# Step 2: Calculate the Mean Squared Error (MSE)
ridge_mse = mean_squared_error(y_test, ridge_pred)

# Step 3: Calculate the R² Score
ridge_r2 = r2_score(y_test, ridge_pred)

# Step 4: Print the evaluation metrics
print("Ridge Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {ridge_mse}")
print(f"R² Score: {ridge_r2}")

# Step 5: Plotting Actual vs Predicted Values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, ridge_pred, color='blue', alpha=0.6)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Ridge Regression: Actual vs Predicted Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.grid(True)
plt.show()

# Step 6: Plotting Residuals
plt.figure(figsize=(10, 6))
residuals = y_test - ridge_pred
plt.scatter(ridge_pred, residuals, color='purple', alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals of Ridge Regression Model')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.grid(True)
plt.show()
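
Because `price` was standardized in section 4.7, the MSE above is expressed in standardized units rather than in the original price unit. A sketch of how to translate the error back, using made-up prices and made-up predictions: the RMSE in original units equals the standardized RMSE multiplied by the standard deviation that the scaler learned for the price column. In this notebook the last `fit_transform` call was on `total_sqft`, `price`, and `square_meter_price`, so the price entry would presumably be `scaler.scale_[1]`; the dedicated `price_scaler` below is a simplifying assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices and a scaler fitted on the price column alone
prices = np.array([[40.0], [55.0], [70.0], [120.0], [200.0]])
price_scaler = StandardScaler().fit(prices)

# Standardized targets and hypothetical model predictions in standardized units
y_true_scaled = price_scaler.transform(prices).ravel()
y_pred_scaled = y_true_scaled + np.array([0.1, -0.1, 0.05, -0.2, 0.3])

# RMSE in standardized units, then rescaled to the original price unit
rmse_scaled = np.sqrt(np.mean((y_true_scaled - y_pred_scaled) ** 2))
rmse_original = rmse_scaled * price_scaler.scale_[0]

# Cross-check: undo the scaling first, then compute the RMSE directly
y_pred_original = price_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
rmse_check = np.sqrt(np.mean((prices.ravel() - y_pred_original) ** 2))

print(f"RMSE (standardized units): {rmse_scaled:.4f}")
print(f"RMSE (original price units): {rmse_original:.4f}")
```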

# [8 - Conclusion](#Conclusion)

In this project, we aimed to predict house prices using various regression models, with a primary focus on Ridge Regression. The workflow included data preprocessing, model training, fine-tuning, and evaluation using different metrics and visualizations.

### Key Takeaways:
1. **Data Preprocessing**:
    - The data underwent necessary preprocessing steps such as handling missing values, encoding categorical variables, and feature scaling. These steps are critical in ensuring the model's accuracy and reliability.
    - It was observed that the dataset had outliers and some non-linear relationships that might have influenced the model’s performance. While the outlier removal step and Ridge Regression handled these to some extent, the remaining extreme data points call for further investigation or more sophisticated methods, such as more robust outlier detection and non-linear transformations.

2. **Model Training and Fine-Tuning**:
    - Several models were trained, including Linear Regression, Ridge, Lasso, and Elastic Net. Ridge Regression was identified as the best-performing model based on the initial Mean Squared Error (MSE) on the test set.
    - Fine-tuning was performed on Ridge Regression, primarily adjusting the `alpha` parameter to achieve a balance between model complexity and predictive power. However, the fine-tuning did not significantly improve the model’s performance, indicating that the settings found in the first grid search were already near-optimal for this dataset.

3. **Model Evaluation**:
    - The evaluation metrics (MSE and R² score) and the accompanying visualizations (Actual vs. Predicted Values, Residual Plots) provided insights into the model’s performance.
    - While the Ridge Regression model performed reasonably well, especially for lower range values, it showed weaknesses in predicting higher range values, as evidenced by the scattered residuals and points far from the perfect prediction line.
    - The residual plot suggested that while the model performed adequately for most data points, there were areas (particularly for higher predicted values) where the model's predictions deviated significantly from actual values.

### Conclusion:
Overall, the Ridge Regression model provided a solid starting point for predicting house prices, particularly in managing overfitting through regularization. However, the limitations observed in the model's performance indicate that further refinement could be beneficial. These could include exploring non-linear models, conducting more in-depth feature engineering, and possibly incorporating advanced techniques such as ensemble methods or neural networks to better capture the complexities of the data.

### Future Work:
- **Explore Non-Linear Models**: Given the patterns observed in the residuals, non-linear models like Random Forest, Gradient Boosting, or even Support Vector Machines could provide better performance.
- **Feature Engineering**: Further exploration of feature interactions, polynomial features, or other transformations could help the model better capture the relationship between features and the target variable.
- **Outlier Detection**: Refining the outlier detection and removal beyond the one-standard-deviation filter used here might help in improving the model’s robustness and accuracy.
- **Ensemble Methods**: Combining the strengths of multiple models through ensemble techniques like bagging, boosting, or stacking could further improve predictive performance.

This project has laid a strong foundation for predicting house prices, and with additional refinements, the models could become even more accurate and reliable.