# Melbourne Dataset: House price prediction using Regression

In this study, we follow **CRISP-DM (Cross-Industry Standard Process for Data Mining)**, which consists of the following steps:
1. Business understanding
2. Data understanding
3. Data Preparation
4. Modeling
5. Evaluation of results
6. Deployment

Let’s skip the first and last steps: we already know that the business requirement is to predict house prices in Melbourne, and we are not deploying the model as an app here.

**Data understanding:**

Data understanding means collecting the data, checking that it is correct, identifying what types of data we have, and deciding whether it can answer our business questions. It also covers exploring and visualizing the data with plots, graphs, and charts to uncover patterns that are not obvious from the raw values.
Data availability is not a concern here, since we chose the dataset with the objective of predicting house prices using regression. Although regression is a classical prediction technique, it remains a powerful way to predict a target from the independent features we have.
Our dataset consists of 13,580 rows and 21 columns. Each row holds information about one house: its price and address, number of rooms, seller, bedrooms, bathrooms, car parking spaces, land size, and so on.
Let’s start with some business questions that we can answer from the data.
1. What is the trend of the house prices?
2. What is the highest and lowest price of a house?
3. What are the highest correlated variables to price?
4. What are the key features affecting the price of a house?
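
As a rough illustration of how the first three questions could be approached with pandas, the sketch below computes a monthly median price, the price range, and the numeric columns ranked by correlation with `Price`. The helper names `price_overview` and `monthly_median` are hypothetical, the small `toy` frame is made up for demonstration, and the day-first date format is assumed from values such as `26/08/2017` in the data.

```python
import pandas as pd

def price_overview(df, target="Price"):
    """Return the price range and numeric features ranked by |correlation| with the target."""
    numeric = df.select_dtypes(include="number")
    # Pearson correlation of each numeric column with the target
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return {
        "lowest": df[target].min(),
        "highest": df[target].max(),
        "correlations": corr.reindex(order),
    }

def monthly_median(df, date_col="Date", target="Price"):
    """Median target value per calendar month of sale."""
    # Assumption: dates are day/month/year strings
    dates = pd.to_datetime(df[date_col], format="%d/%m/%Y")
    return df[target].groupby(dates.dt.to_period("M")).median()

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Date": ["3/12/2016", "4/02/2016", "26/08/2017", "26/08/2017"],
    "Rooms": [1, 2, 3, 4],
    "Distance": [12.0, 9.0, 2.0, 5.0],
    "Price": [300000.0, 500000.0, 900000.0, 800000.0],
})

overview = price_overview(toy)
trend = monthly_median(toy)
print(overview["lowest"], overview["highest"])
print(overview["correlations"])
print(trend)
```

On the real data, the same functions would be called with `data_house` once it has been loaded. Correlation only captures linear relationships between numeric columns, so the fourth question (key features) would still need the modeling step.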

**Data Preparation:**

One of the first things to do is check for missing values in the data and treat them according to the type of data. Since we have both categorical and numerical columns, the missing-value treatment differs between them.
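
One common way to make the treatment depend on the column type is sketched below: numeric gaps are filled with the column median and non-numeric gaps with the most frequent value. This is an illustrative alternative, not necessarily the treatment applied later in this notebook. The function name `impute_by_dtype` is hypothetical and the `toy` frame is made up.

```python
import pandas as pd

def impute_by_dtype(df):
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: the column has at least one non-missing value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Car": [1.0, None, 2.0, 3.0],
    "CouncilArea": ["Yarra", "Yarra", None, "Moreland"],
})
filled = impute_by_dtype(toy)
print(filled)
```

The function works on a copy, so the original frame keeps its gaps and different strategies can be compared side by side.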

## Data Exploration 

### Imports

In [2]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Reading the datasets

In [None]:
# HOUSING_PATH = os.path.join('Kaggle', 'melb')

# def load_data (HOUSING_PATH, data):
#     csv_path = os.path.join(HOUSING_PATH, data)
#     return pd.read_csv(csv_path)

In [3]:
data_house = pd.read_csv('melb_data.csv')

In [4]:
data_house.shape

(13580, 21)

In [5]:
data_house.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [6]:
data_house.tail()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
13575,Wheelers Hill,12 Strada Cr,4,h,1245000.0,S,Barry,26/08/2017,16.7,3150.0,...,2.0,2.0,652.0,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392.0
13576,Williamstown,77 Merrett Dr,3,h,1031000.0,SP,Williams,26/08/2017,6.8,3016.0,...,2.0,2.0,333.0,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380.0
13577,Williamstown,83 Power St,3,h,1170000.0,S,Raine,26/08/2017,6.8,3016.0,...,2.0,4.0,436.0,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380.0
13578,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26/08/2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0
13579,Yarraville,6 Agnes St,4,h,1285000.0,SP,Village,26/08/2017,6.3,3013.0,...,1.0,1.0,362.0,112.0,1920.0,,-37.81188,144.88449,Western Metropolitan,6543.0


In [7]:
data_house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [8]:
data_house.shape

(13580, 21)

In [9]:
data_house.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rooms,13580.0,2.937997,0.955748,1.0,2.0,3.0,3.0,10.0
Price,13580.0,1075684.0,639310.724296,85000.0,650000.0,903000.0,1330000.0,9000000.0
Distance,13580.0,10.13778,5.868725,0.0,6.1,9.2,13.0,48.1
Postcode,13580.0,3105.302,90.676964,3000.0,3044.0,3084.0,3148.0,3977.0
Bedroom2,13580.0,2.914728,0.965921,0.0,2.0,3.0,3.0,20.0
Bathroom,13580.0,1.534242,0.691712,0.0,1.0,1.0,2.0,8.0
Car,13518.0,1.610075,0.962634,0.0,1.0,2.0,2.0,10.0
Landsize,13580.0,558.4161,3990.669241,0.0,177.0,440.0,651.0,433014.0
BuildingArea,7130.0,151.9676,541.014538,0.0,93.0,126.0,174.0,44515.0
YearBuilt,8205.0,1964.684,37.273762,1196.0,1940.0,1970.0,1999.0,2018.0


In [18]:
[feature for feature in data_house.columns if data_house[feature].isnull().any()]

['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']

In [10]:
data_house.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64
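
Raw counts are easier to weigh as a share of the 13,580 rows: `BuildingArea` is missing in roughly 47% of rows and `YearBuilt` in roughly 40%, while `Car` is missing in under 1%. A minimal sketch of such a report follows. The helper name `missing_report` is hypothetical and the `toy` frame is made up.

```python
import pandas as pd

def missing_report(df):
    """Count and percentage of missing values for each column that has any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with gaps, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "y", "z"],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(data_house)` would give the same view for the housing data.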

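In [None]:
# Fill the missing values in the Car column
car_miss = data_house['Car'].isnull().sum()
car_m = data_house['Car'].mean()
print('There are {} missing values in this column.'.format(car_miss))
print('The mean of this column is {}.'.format(round(car_m, 0)))
# Assign back rather than calling fillna(inplace=True) on an extracted Series,
# which may not update data_house; the rounded mean equals the original 2.0
data_house['Car'] = data_house['Car'].fillna(round(car_m, 0))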

In [None]:
# Fill the missing values in the BuildingArea column with the rounded mean
ba_miss = data_house['BuildingArea'].isnull().sum()
ba_m = data_house['BuildingArea'].mean()
print('There are {} missing values in this column.'.format(ba_miss))
print('The mean of this column is {}.'.format(round(ba_m, 0)))
# Assign back so that data_house itself is updated
data_house['BuildingArea'] = data_house['BuildingArea'].fillna(round(ba_m, 0))

In [None]:
# Count the missing values in the CouncilArea column
ca = data_house['CouncilArea']
ca_miss = ca.isna().sum()
print('There are {} missing values in this column.'.format(ca_miss))

In [None]:
data_house.dropna(subset=['CouncilArea'], inplace=True)

In [None]:
data_house.sort_values('YearBuilt', ascending=True)

In [None]:
# numeric_only=True skips text columns, which cannot be averaged
data_house.groupby('YearBuilt').mean(numeric_only=True)

In [None]:
data_house['Price'].astype('int64').dtype

In [None]:
data_house.dtypes

In [None]:
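# Columns that hold whole numbers but are stored as floats
int_cols = ['Postcode', 'Bedroom2', 'Bathroom', 'Car', 'YearBuilt']
# Nullable 'Int32' keeps missing YearBuilt values as <NA>; assigning back preserves the other columns
data_house[int_cols] = data_house[int_cols].astype('Int32')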
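
A float column that still contains missing values, as `YearBuilt` does here, cannot be cast to a plain NumPy integer dtype. The sketch below shows the failure on a made-up three-value series, along with pandas' nullable `Int32` dtype as one possible alternative. Filling or dropping the gaps first would be another option.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
years = pd.Series([1900.0, np.nan, 2014.0], name="YearBuilt")

try:
    years.astype("int32")  # plain integers cannot represent NaN
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable extension dtype stores the gap as <NA> instead of raising
nullable = years.astype("Int32")

print(plain_cast_ok)
print(nullable)
```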