# Predictive Analytics for Real Estate Prices

Author: Mohamed Oussama NAJI

Date: Jan 24, 2024

## Table of Contents
1. [Introduction](#introduction)
2. [Data Loading](#data-loading)
3. [Data Exploration](#data-exploration)
   - [Dataset Information](#dataset-information)
   - [Number of Samples and Columns](#number-of-samples-and-columns)
   - [Features](#features)
   - [Missing Data](#missing-data)
4. [Data Preparation](#data-preparation)
   - [Dependent Features](#dependent-features)
   - [Independent Features](#independent-features)
   - [Train-Test Split](#train-test-split)
5. [Model Selection and Training](#model-selection-and-training)
   - [Linear Regression](#linear-regression)
   - [Creating an Estimator](#creating-an-estimator)
   - [Training the Model](#training-the-model)
6. [Model Evaluation](#model-evaluation)
   - [Predictions](#predictions)
   - [Coefficients](#coefficients)
   - [Performance Metrics](#performance-metrics)
7. [Data Visualization](#data-visualization)
   - [House Age vs Price](#house-age-vs-price)
   - [Distance to MRT Station vs Price](#distance-to-mrt-station-vs-price)
   - [Number of Convenience Stores vs Price](#number-of-convenience-stores-vs-price)
8. [Results](#results)
9. [Conclusion](#conclusion)

## Introduction <a id="introduction"></a>

In this notebook, we fit a linear regression model to a real estate dataset to predict house price per unit area from features such as house age, distance to the nearest MRT station, and the number of nearby convenience stores. We load and explore the dataset, split it into training and test sets, train the model, evaluate it with the R-squared score, and visualize how individual features relate to price.

In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

## Data Loading <a id="data-loading"></a>

In [None]:
# Reading the CSV data into a pandas DataFrame
real_estate_data = pd.read_csv('Real estate.csv')

# Displaying the first 5 samples
real_estate_data.head()

## Data Exploration <a id="data-exploration"></a>

### Dataset Information <a id="dataset-information"></a>

In [None]:
real_estate_data.info()

### Number of Samples and Columns <a id="number-of-samples-and-columns"></a>

In [None]:
num_samples, num_columns = real_estate_data.shape
num_samples, num_columns

### Features <a id="features"></a>

In [None]:
features = real_estate_data.columns
features

### Missing Data <a id="missing-data"></a>

In [None]:
missing_data = real_estate_data.isnull().sum()
missing_data

## Data Preparation <a id="data-preparation"></a>

### Dependent Features <a id="dependent-features"></a>

In [None]:
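# The dependent variable (target) is the house price per unit area
y = real_estate_data['Y house price of unit area']
y

### Independent Features <a id="independent-features"></a>

In [None]:
# The independent variables (predictors) are all remaining columns.
# Note: this keeps every other column; if the CSV contains a row-index
# column, consider dropping it too so it is not used as a predictor.
X = real_estate_data.drop('Y house price of unit area', axis=1)
X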

### Train-Test Split <a id="train-test-split"></a>

In [None]:
# Hold out 20% of the samples for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
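
To make the effect of `test_size=0.2` concrete, the sketch below applies the same call to a small synthetic DataFrame. The 100-row size, the column names `a` and `b`, and the `demo_` variable names are illustrative assumptions, not part of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, two features, one target.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
demo_y = pd.Series(rng.normal(size=100), name="target")

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# With test_size=0.2, 80 rows go to training and 20 rows are held out.
print(demo_X_train.shape, demo_X_test.shape)
```

The training and test rows never overlap, which is what allows the test score to serve as an estimate of performance on unseen data.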

## Model Selection and Training <a id="model-selection-and-training"></a>

### Linear Regression <a id="linear-regression"></a>

In [None]:
from sklearn.linear_model import LinearRegression

### Creating an Estimator <a id="creating-an-estimator"></a>

In [None]:
model = LinearRegression()

### Training the Model <a id="training-the-model"></a>

In [None]:
model.fit(X_train, y_train)

## Model Evaluation <a id="model-evaluation"></a>

### Predictions <a id="predictions"></a>

In [None]:
# Predict the price per unit area for the held-out test samples
y_pred = model.predict(X_test)
y_pred

### Coefficients <a id="coefficients"></a>

In [None]:
# One coefficient per column of X, in the same order as X.columns
coefficients = model.coef_
coefficients
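
A bare coefficient array is hard to read without the feature names next to it. The sketch below shows one way to pair them, using synthetic data with known weights. The column names `feature_a` and `feature_b` and the `demo_` variables are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): two features with known true weights.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# Noise-free target, so the fitted coefficients should recover 3.0 and -2.0.
demo_y = 3.0 * demo_X["feature_a"] - 2.0 * demo_X["feature_b"] + 5.0

demo_model = LinearRegression().fit(demo_X, demo_y)

# Label each coefficient with the column it belongs to.
coef_table = pd.Series(demo_model.coef_, index=demo_X.columns, name="coefficient")
print(coef_table)
print("intercept:", demo_model.intercept_)
```

Each coefficient is expressed in units of the target per unit of its feature, so magnitudes are only directly comparable across features that share a scale.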

### Performance Metrics <a id="performance-metrics"></a>

In [None]:
# R-squared on the held-out test set (1.0 would be a perfect fit)
r2 = r2_score(y_test, y_pred)
r2
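
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up values and checks it against `r2_score`. It also reports MAE and RMSE, which are additional metrics not used in this notebook. All numbers and `demo_` names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted values (assumption), not taken from the dataset.
demo_y_true = np.array([30.0, 42.0, 25.0, 50.0, 38.0])
demo_y_pred = np.array([32.0, 40.0, 27.0, 47.0, 39.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((demo_y_true - demo_y_pred) ** 2)
ss_tot = np.sum((demo_y_true - demo_y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Error metrics expressed in the same units as the target.
mae = mean_absolute_error(demo_y_true, demo_y_pred)
rmse = np.sqrt(mean_squared_error(demo_y_true, demo_y_pred))

print("R2 (manual): ", r2_manual)
print("R2 (sklearn):", r2_score(demo_y_true, demo_y_pred))
print("MAE:", mae, "RMSE:", rmse)
```

Because MAE and RMSE are in the target's own units, they complement R-squared when judging how large a typical prediction error is.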

## Data Visualization <a id="data-visualization"></a>

### House Age vs Price <a id="house-age-vs-price"></a>

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x='X2 house age', y='Y house price of unit area', data=real_estate_data)
plt.title('House Age Vs Price')
plt.show()

### Distance to MRT Station vs Price <a id="distance-to-mrt-station-vs-price"></a>

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x='X3 distance to the nearest MRT station', y='Y house price of unit area', data=real_estate_data)
plt.title('Distance to MRT Station Vs Price')
plt.show()

### Number of Convenience Stores vs Price <a id="number-of-convenience-stores-vs-price"></a>


In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x='X4 number of convenience stores', y='Y house price of unit area', data=real_estate_data)
plt.title('Number of Convenience Stores Vs Price')
plt.show()

## Results <a id="results"></a>

The predictive analytics performed on the real estate dataset using linear regression yielded the following results:

- The model achieved an R-squared score of [R-squared value] on the test set, indicating that [percentage]% of the variance in house price per unit area is explained by the selected features.
- The coefficients of the linear regression model provide insights into the impact of each feature on house prices. [Interpret the coefficients and their significance]
- The visualizations revealed the following relationships:
  - House age has a [positive/negative] correlation with house prices, suggesting that [older/newer] houses tend to have [higher/lower] prices.
  - Distance to the nearest MRT station has a [positive/negative] correlation with house prices, indicating that houses closer to MRT stations tend to have [higher/lower] prices.
  - The number of convenience stores in the vicinity has a [positive/negative] correlation with house prices, implying that areas with more convenience stores tend to have [higher/lower] house prices.
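
The sign of each relationship listed above can be read off a Pearson correlation between the feature and the target. The following is a minimal sketch on synthetic data. The columns `distance`, `stores`, and `price` and the way they are generated are assumptions made for illustration, not values from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): price falls with distance, rises with stores.
rng = np.random.default_rng(0)
n = 300
distance = rng.uniform(0, 5000, size=n)
stores = rng.integers(0, 11, size=n)
price = 40 - 0.005 * distance + 1.5 * stores + rng.normal(scale=2.0, size=n)

demo = pd.DataFrame({"distance": distance, "stores": stores, "price": price})

# Pearson correlation of every feature column with the target column.
correlations = demo.drop(columns="price").corrwith(demo["price"])
print(correlations)
```

Applied to the notebook's DataFrame, the same pattern would use `real_estate_data` with `'Y house price of unit area'` as the target column.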

These results provide valuable insights into the factors influencing house prices in the given dataset and can be used to make informed predictions and decisions in the real estate market.


## Conclusion <a id="conclusion"></a>

In this notebook, we used linear regression to predict house price per unit area from the features of a real estate dataset. We explored the dataset, split it into training and test sets, trained the model, and evaluated its performance on the test set.

The R-squared score on the test set measures how much of the variance in house prices the linear regression model explains, and the model's coefficients indicate the direction and size of each feature's effect on the predicted price.

Furthermore, we visualized the relationships between house age, distance to MRT stations, and the number of convenience stores with house prices. These visualizations revealed interesting patterns and correlations that can aid in understanding the factors influencing house prices.

The results obtained from this analysis can be valuable for real estate professionals, investors, and buyers in making informed decisions and predictions regarding house prices. However, it is important to note that the model's performance may vary depending on the specific dataset and market conditions.

For future work, we can explore additional features, experiment with different machine learning algorithms, and incorporate more advanced techniques such as feature selection and regularization to further improve the predictive accuracy of the model.
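
As one possible starting point for the regularization idea, the sketch below compares plain linear regression with ridge regression using 5-fold cross-validation. The data are synthetic, and `alpha=1.0`, the scaling step, and the `demo_` names are assumptions rather than tuned choices for this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 5 features, noisy linear target.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 5))
true_weights = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
demo_y = demo_X @ true_weights + rng.normal(scale=1.0, size=200)

# Scale features first so the ridge penalty treats all coefficients comparably.
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

# Mean R-squared across 5 cross-validation folds for each candidate model.
cv_scores = {
    name: cross_val_score(estimator, demo_X, demo_y, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}
print(cv_scores)
```

Cross-validation averages the score over several splits, which makes the comparison less dependent on a single train-test partition.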