## House Price Prediction

## Case Study: House Price Prediction

### Background

Lagos is one of the fastest-growing cities in Africa, with a rapidly expanding population and a booming real estate market. The city's housing market is highly competitive, with a wide range of properties available at varying prices. However, with the high demand for housing, it can be challenging for buyers and sellers to accurately determine the fair market value of a property.

### Objective

The objective of this case study is to help Cressida Homes, a new entrant into the real estate market, develop a machine learning model that can accurately predict the price of a house in Lagos based on its features, such as size, number of bedrooms, and amenities.

### Data
The data used in this case study is a publicly available dataset from Kaggle, which contains information on various properties in Lagos, including their location, size, number of bedrooms, and price. The dataset contains 545 records and 13 variables.

### Methodology

The methodology used in this case study involves the following steps:

1. Data cleaning and preprocessing: The first step is to clean and preprocess the data, including handling missing values, removing outliers, and transforming variables as necessary.

2. Exploratory data analysis: Next, we will perform exploratory data analysis to gain insights into the data, such as identifying trends, patterns, and correlations between variables.

3. Feature engineering: Based on the insights gained from the exploratory data analysis, we will perform feature engineering to select the most relevant features for predicting house prices and transform them as necessary.

4. Model selection and training: We will then select a suitable machine learning algorithm for predicting house prices, such as linear regression or a decision tree, and train the model on the preprocessed data.

5. Model evaluation and fine-tuning: We will evaluate the performance of the model using various metrics, such as mean absolute error and mean squared error, and fine-tune the model as necessary to improve its accuracy.
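
For reference, the evaluation metrics named in the last step have the following standard definitions. Here $y_i$ is the actual price of house $i$, $\hat{y}_i$ is the predicted price, $\bar{y}$ is the mean actual price, and $n$ is the number of houses being evaluated. The coefficient of determination $R^2$ is included because the notebook also reports it.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

MAE is in the same units as the price, MSE is in squared price units and penalises large errors more heavily, and $R^2$ is unitless, with 1 indicating a perfect fit.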

In [None]:
# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore")

### Load the Data

In [None]:
# Load dataset

house_data = pd.read_csv(r"C:\Users\Chinazam\Downloads\Housing_Data.csv")
house_data

### Data Inspection and Cleaning

In [None]:
# Descriptive statistics

house_data.describe(include='all').T

In [None]:
# Information about our columns

house_data.info()

In [None]:
# Check for missing values

house_data.isnull().sum()

In [None]:
# Check for duplicates


house_data.duplicated().sum()
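
The methodology mentions removing outliers, but the cells above only check for missing values and duplicates. The following is a minimal sketch of one common approach, the interquartile range (IQR) rule, shown on a small made-up DataFrame. The helper name `drop_iqr_outliers`, the toy values, and the conventional multiplier `k=1.5` are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose `column` value lies within k * IQR of the quartiles.

    Hypothetical helper for illustration only.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive at both ends by default
    return df[df[column].between(lower, upper)]


# Toy data (made up): one price is far larger than the rest
toy = pd.DataFrame({"price": [100, 110, 120, 130, 140, 1000]})
cleaned = drop_iqr_outliers(toy, "price")
print(cleaned)
```

The same call could be tried on the real data as `drop_iqr_outliers(house_data, 'price')`, but whether dropping expensive houses is appropriate depends on the business question, so treat it as an option to evaluate rather than a required step.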

### EDA

In [None]:
# Check Price distribution

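# displot is a figure-level function that creates its own figure,
# so the size is set with height/aspect instead of plt.figure()
this_plot = sns.displot(house_data['price'], height=6, aspect=2)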


In [None]:
# Check for relationship between area and price

sns.scatterplot(x='area',y='price', data=house_data)
plt.ticklabel_format(style='plain')

In [None]:
# Check for spread of rooms

sns.countplot(x='bedrooms', data=house_data)


In [None]:
house_data['bedrooms'].value_counts()

In [None]:
# Check for spread of floors/stories

sns.countplot(x='stories', data=house_data)

In [None]:
house_data['stories'].value_counts()

In [None]:
# Check for spread of parking spaces

sns.countplot(x='parking', data=house_data)

In [None]:
house_data['parking'].value_counts()

In [None]:
# How many houses have hot water heating?

sns.countplot(x='hotwaterheating', data=house_data)

In [None]:
house_data['hotwaterheating'].value_counts()

In [None]:
# How many houses have airconditioning?

sns.countplot(x='airconditioning', data=house_data)

In [None]:
house_data['airconditioning'].value_counts()

In [None]:
# How many houses are in preferred areas?

sns.countplot(x='prefarea', data=house_data)

In [None]:
house_data['prefarea'].value_counts()

In [None]:
# What is the furnishing status?

sns.countplot(x='furnishingstatus', data=house_data)

In [None]:
house_data['furnishingstatus'].value_counts()

In [None]:
# Distribution of bedrooms and price

sns.catplot(data=house_data, x='bedrooms', y='price', kind="box")

In [None]:
# Distribution of stories and price

sns.catplot(data=house_data, x='stories', y='price', kind="box")

In [None]:
# Distribution of parking and price

sns.catplot(data=house_data, x='parking', y='price', kind="box")

In [None]:
# Distribution of prefarea and price

sns.catplot(data=house_data, x='prefarea', y='price', kind="box")

In [None]:
# Distribution of furnishing status and price

sns.catplot(data=house_data, x='furnishingstatus', y='price', kind="box")

In [None]:
sns.pairplot(house_data)

In [None]:
# Check for correlation in our features


plt.figure(figsize=(10,10))
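# numeric_only=True restricts the correlation to numeric columns;
# recent pandas versions raise an error on text columns otherwise
sns.heatmap(house_data.corr(numeric_only=True), annot=True)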

### Separating Features and Target

In [None]:
X = house_data.drop('price', axis=1)
X

In [None]:
X = pd.get_dummies(X)
X

In [None]:
y = house_data['price']
y
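
`pd.get_dummies` turns each text column into one 0/1 indicator column per category, which is what lets the regression models below work with columns such as `furnishingstatus`. The sketch below shows the effect on a small made-up frame; the category values are assumptions for illustration. It also shows the optional `drop_first=True` argument, which drops one indicator per column and is sometimes preferred for linear models because the full set of indicators is perfectly collinear.

```python
import pandas as pd

# Toy data (made up) with one numeric and two text columns
toy = pd.DataFrame({
    "area": [3000, 4500, 6000],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
    "prefarea": ["yes", "no", "yes"],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# drop_first=True removes the first category of each text column
encoded_reduced = pd.get_dummies(toy, drop_first=True)
print(list(encoded_reduced.columns))
```

The notebook keeps every indicator column, which is fine for tree-based models; for the linear model, comparing results with and without `drop_first=True` would be a reasonable check.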

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)



In [None]:
# Import algorithms


from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor



### Linear Regression Model

In [None]:
# Create an instance of the Linear Regression model

lr = LinearRegression()

# Fit your model
lr.fit(X_train, y_train)


# Make predictions
lr_pred = lr.predict(X_test)



In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    

In [None]:
mae = mean_absolute_error(y_test, lr_pred)
mse = mean_squared_error(y_test, lr_pred)
r_square = r2_score(y_test, lr_pred)

print(mae)
print(mse)
print(r_square)

### Using other regression models

In [None]:
dr = DecisionTreeRegressor()
rr = RandomForestRegressor()


# Create a list of the model instances to compare
models = [lr, dr, rr]

In [None]:
# Create a function to train a model and evaluate metrics
def trainer(model,X_train,y_train,X_test,y_test):
    # Fit the model on the training data
    model.fit(X_train,y_train)
    # Predict on the test set with the fitted model
    prediction = model.predict(X_test)
    # Print evaluation metrics. scikit-learn expects (y_true, y_pred);
    # the order does not change MAE or MSE, but it does change R-squared
    print('\nFor {}, Mean Absolute Error is {} \n'.format(model.__class__.__name__,mean_absolute_error(y_test,prediction)))
    print('\nFor {}, Mean Squared Error is {} \n'.format(model.__class__.__name__,mean_squared_error(y_test,prediction)))
    print('\nFor {}, R_Square is {} \n'.format(model.__class__.__name__,r2_score(y_test,prediction)))
    print('------------------------------------------------------------------------------------------------------')

In [None]:
#loop through each model, training in the process
for model in models:
    trainer(model,X_train,y_train,X_test,y_test)
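
scikit-learn's metric functions take the true values first and the predictions second. For MAE and MSE the order makes no difference, but R-squared is not symmetric, because its denominator is the spread of whichever argument is treated as the true values. The pure-Python sketch below illustrates this on made-up numbers; `mae_manual` and `r2_manual` are hypothetical helpers that follow the standard definitions, not scikit-learn functions.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error, computed from its definition."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (made up)
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 5.0, 6.0, 10.0]

# MAE is the same in either order
print(mae_manual(actual, predicted), mae_manual(predicted, actual))

# R-squared differs when the arguments are swapped
print(r2_manual(actual, predicted), r2_manual(predicted, actual))
```

When comparing models on R-squared, it is therefore worth confirming that every call passes `y_test` first.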
    

In [None]:
# Linear Regression has the best metrics from the lot 

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
models = [lr, dr, rr]




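# Create a function that cross-validates a model and reports R-squared per fold.
# X_test and y_test are accepted to keep the same signature as trainer() but are not used.
def trainer_with_cv(model,X_train,y_train,X_test,y_test):
    scores = cross_val_score(model, X_train, y_train, scoring='r2', cv=5)
    # Print the R-squared score from each of the 5 folds
    print('\nFor {}, Cross-Validation Scores are {} \n'.format(model.__class__.__name__,scores))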
    print('------------------------------------------------------------------------------------------------------')


In [None]:
for model in models:
    trainer_with_cv(model,X_train,y_train,X_test,y_test)

### Using K-Fold Cross Validation 

In [None]:
from sklearn.model_selection import KFold
from numpy import mean
from numpy import std


# Perform a 10-fold split and evaluate the mean cross-validation score
folds = KFold(n_splits=10, random_state=1, shuffle=True)

def trainer_with_kfold_cv(model,X_train,y_train,X_test,y_test):
    '''Cross-validate a model on the training data using the 10-fold splitter above.

    Expects a scikit-learn estimator plus the training features and target.
    X_test and y_test are accepted for a uniform signature but are not used.
    '''
    # Evaluate the model on each fold
    scores = cross_val_score(model, X_train, y_train, scoring='r2', cv=folds, n_jobs=-1)
    # Per-fold cross-validation scores
    print('\nFor {}, Cross-Validation Scores are {} \n'.format(model.__class__.__name__,scores))
    # Report the mean and standard deviation of R-squared across folds
    print('R_Square: %.3f (std %.3f)' % (mean(scores), std(scores)))



In [None]:
for model in models:
    trainer_with_kfold_cv(model,X_train,y_train,X_test,y_test)
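
To make the splitting step concrete, the sketch below shows how `KFold` partitions row indices: every row lands in the test fold exactly once, and the remaining rows form the training fold. It uses ten made-up row indices and five folds so the output stays readable; the names `toy_indices` and `toy_folds` are chosen to avoid overwriting the notebook's own `folds` object.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up row indices standing in for the training rows
toy_indices = np.arange(10)
toy_folds = KFold(n_splits=5, shuffle=True, random_state=1)

# Count how often each row is used for testing
test_counts = np.zeros(len(toy_indices), dtype=int)

for fold_number, (train_idx, test_idx) in enumerate(toy_folds.split(toy_indices), start=1):
    test_counts[test_idx] += 1
    print("Fold {}: train={} test={}".format(fold_number, train_idx, test_idx))

# Each row should have been in a test fold exactly once
print(test_counts)
```

With `shuffle=True` the rows are shuffled before splitting, whereas passing `cv=5` to `cross_val_score` for a regressor uses consecutive, unshuffled folds, which is one reason the two sets of scores above can differ.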