# Boston House Price Estimation Project 
<br>
<font size="3">
    
- **Overview:** In this project we are going to build a machine learning model using *Linear Regression* which will eventually help us to predict the price of the house given explanatory variables that cover many aspects of residential houses such as size, no. of bedrooms, age etc.


- **Dataset Used:**  Boston Dataset available on sklearn.datasets


- **Aim:** Building a regression model which can give us a good prediction on the price of the house based on given features.
</font>

# Table of Contents
- [ 1 - Understand the problem statement ](#1)
- [ 2 - About the data used ](#2)
- [ 3 - Algorithm used](#3)
- [ 4 - Data Collection](#4)
- [ 5 - Data Preprocessing](#5)
- [ 6 - Exploratory Data Analysis(EDA)](#6)
- [ 7 - Feature Observation](#7)
- [ 8 - Feature Selection](#8)
- [ 9 - Model Building](#9)
- [ 10 - Model Performances](#10)
- [ 11 - Prediction and Final Score](#11)
- [ 12 - Output](#12)

## Understanding the problem:
<br>

What is the first thing we look out for while buying a house? Price right!!!. Everything else comes next. But the interesting part is what influences the price of a house. Most of the time it isn't just one specific feature rather a multiple factors like location, size, age, crime rate in town, AQI, etc. Therefore it wouldn't be wisest move to estimate the price of a house based on just one or two feature. But what if I say we can build a model based on previous records which can best predict the price of the house when provided some features. Cool right!, that is exactly what we are going to do this project.

## Dataset used:
<br>

For building a prediction model, the very first thing we need is data, to train and test the model. **No data means No model**. Now you guys must be wondering, where will we get the data for our model. 

We are going to use Boston House data available on sklearn.datasets as well as kaggle. The dataset used in this project comes from the UCI Machine Learning Repository. This data was collected in 1978 and each of the 506 entries represents aggregate information about 14 features of homes from various suburbs located in Boston.

- Features available on the data can be summarized as follows:
    - 1. **CRIM** per capital crime rate by town

    - 2. **ZN** proportion of residential land zoned for lots over 25,000 sq.ft.

    - 3. **INDUS** proportion of non-retail business acres per town

    - 4. **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

    - 5. **NOX** nitric oxides concentration (parts per 10 million)

    - 6. **RM** average number of rooms per dwelling

    - 7. **AGE** proportion of owner-occupied units built prior to 1940

    - 8. **DIS** weighted distances to five Boston employment centers

    - 9. **RAD** index of accessibility to radial highways

    - 10. **TAX** full-value property-tax rate per 10,000 USD

    - 11. **PTRATIO** pupil-teacher ratio by town

    - 12. **Black** 1000(Bk — 0.63)² where Bk is the proportion of blacks by town

    - 13. **LSTAT** % lower status of the population
    
    - 14. **MEDV**  This is the median value of owner-occupied homes in $1000.

## Algorithm Used:

Linear Regression algorithm is implemented on the Boston house dataset. Python Libraries used are:

- **sklearn:** For linear regression model and to fetch the dataset.
- **numpy:**  To perform operations on the data
- **pandas:** To work with the data efficiently
- **matplotlib:** For visualizing the data and the result.

## Data collection:

Let's import all the required python libraries and the dataset.

In [3]:
import numpy as np
import matplotlib as plt
import copy, math
from sklearn.datasets import load_boston
import pandas as pd

In [5]:
#loading the data
boston = load_boston()

In [14]:
#data = np.array(boston.data)
#print(type(boston))
#converting the data into pandas DF
df = pd.DataFrame(boston.data)

In [21]:
# Let's look first few enteries in the dataset to understand what kind of data is present.
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


**NOTE:** We can see the columns do not have labels in the above table, so using it as it is will not give a good sense of which column represents which feature and hence create problem in feature selection.

In [35]:
print(boston.keys())
print()
print("Feature names:", boston.feature_names)

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])

Feature names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


As we can see boston has a different list of feature_names which contains names of all the features present in dataset.

Let's extract this list and add it to the dataframe as column label.

In [38]:
#extracting the feature_names
feature_names = boston.feature_names

#add the extracted as feature label
df.columns = feature_names

#updated dataframe with name of the features
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


## Extracting the target variable:

We also need a `target variable` (i.e the actual price of the house) which we can use to train & test our model by comparing the predicted value with it.

Since we don't already have this variable in our DataFrame, let's extract it from the original dataset and add it to our dataframe.

In [42]:
#extracting the target variable and adding it to the dataframe
df["Price"] = boston.target

#updated dataframe with the price column
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Price
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


As you can see a new column has been appended at the end of the table. Later we will break the above table into two to implement our Linear Regression model.

In [15]:
#shape of the data i.e dimensions (number of examples, number of features)
df.shape

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

(506, 13)

### Observation:
This means there are 506 instances and 13 features in this data.

## Checking for null values
Now let's check if there are any missing values in the data.

In [20]:
#returns the number of example for which 
df.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64

In [None]:
df.shape()

In [None]:
df.info()

As we can, in our data the columns have not assigned any names. To do that we first need to extract the feature names from the data and then assign it in dataFrame.

In [25]:
df.columns = boston.feature_names

In [26]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


Now we can see the `features name`.

Let's have a look at the datatypes of different features present in the data.

In [27]:
df.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object

In [31]:
# Identifying the unique number of values in the dataset
df.nunique()

CRIM       504
ZN          26
INDUS       76
CHAS         2
NOX         81
RM         446
AGE        356
DIS        412
RAD          9
TAX         66
PTRATIO     46
B          357
LSTAT      455
dtype: int64