### Project to build a house price prediction model for Bangalore

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

Set the matplotlib figure size and turn on inline display

In [6]:
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (20, 10)

Import the scikit-learn modules needed to compare linear models with decision trees

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

Read in the housing price dataset

In [17]:
df = pd.read_csv("./Bengaluru_House_Data.csv")
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


### Exploratory Data Analysis

In [18]:
df.shape

(13320, 9)

In [31]:
# Check the column names and data types - note that total_sqft is not interpreted as numeric
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB
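
Because `total_sqft` is stored as `object`, at least some of its entries are not plain numbers. One way to locate them, as a minimal sketch, is to coerce the column with `pd.to_numeric(..., errors="coerce")` and look at the entries that turn into `NaN`. The sample values below are hypothetical stand-ins, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values (assumption: the real column mixes plain numbers with other text)
sample = pd.Series(["1056", "2600", "1133 - 1384", "34.46Sq. Meter"], name="total_sqft")

# Entries that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(sample, errors="coerce")

# Keep only the original strings that failed to parse
non_numeric = sample[as_number.isna()]
print(non_numeric.tolist())
```

On the real frame the same idea would apply to `df["total_sqft"]`; what the non-numeric entries actually look like has to be checked against the data.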


In [25]:
# Print summary statistics for all numeric columns of the dataframe
df.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


In [28]:
# Use include = "all" to describe all features, including categorical ones
# This adds categorical statistics, e.g. the number of unique values and the most frequent value with its count
df.describe(include = "all")

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
count,13320,13320,13319,13304,7818,13320.0,13247.0,12711.0,13320.0
unique,4,81,1305,31,2688,2117.0,,,
top,Super built-up Area,Ready To Move,Whitefield,2 BHK,GrrvaGr,1200.0,,,
freq,8790,10581,540,5199,80,843.0,,,
mean,,,,,,,2.69261,1.584376,112.565627
std,,,,,,,1.341458,0.817263,148.971674
min,,,,,,,1.0,0.0,8.0
25%,,,,,,,2.0,1.0,50.0
50%,,,,,,,2.0,2.0,72.0
75%,,,,,,,3.0,2.0,120.0


In [29]:
# Count the number of missing values in each column
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64
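
The raw counts are easier to compare as a share of all rows; for example, `society` is missing in 5502 of 13320 rows, roughly 41%. A small sketch of computing this per column with `isnull().mean()` follows, using a hypothetical four-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "society": ["Coomee", np.nan, np.nan, "Soiewre"],
    "balcony": [1.0, 3.0, np.nan, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# Percentage of missing values per column, largest first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```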

In [30]:
# Examining area type
df.groupby("area_type")["area_type"].agg("count")

area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64
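
The labels in this output appear to contain a double space (for example `Built-up  Area`). Assuming that spacing is really in the data and not a display artifact, one possible clean-up is to collapse repeated whitespace before grouping. The series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of area_type labels with doubled spaces
labels = pd.Series([
    "Super built-up  Area",
    "Plot  Area",
    "Built-up  Area",
    "Super built-up  Area",
])

# Collapse runs of whitespace to a single space and trim the ends
clean = labels.str.replace(r"\s+", " ", regex=True).str.strip()

# Count occurrences of each normalized label
counts = clean.value_counts()
print(counts)
```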

Availability should not affect the house price, since the house costs the same regardless of when the occupant can move in, so we can remove this feature. <br>
Society is missing for about 5.5k observations. Keeping it would cost us a large share of the data once rows with null values are removed, so, at least for a first pass, we drop it as well. <br>
Balcony should remain, as it could affect the house price. Rather than drop observations with missing balcony information, we can examine them and impute appropriate values, e.g. null = zero.

In [36]:
# Drop the columns; errors="ignore" makes the cell safe to re-run
# (re-running it after the columns were already dropped raised a KeyError)
df = df.drop(columns=['availability', 'society'], errors="ignore")

In [37]:
df.shape

(13320, 7)
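
The remaining part of the plan above, treating a missing balcony count as zero, could be sketched as follows. The toy frame is hypothetical. Using zero as the fill value is the assumption stated in the text, and dropping the rows that still have missing values afterwards is only one possible follow-up, not a step this notebook has taken so far.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the real data
toy = pd.DataFrame({
    "location": ["Uttarahalli", "Kothanur", None],
    "bath": [2.0, np.nan, 2.0],
    "balcony": [3.0, np.nan, np.nan],
    "price": [62.0, 51.0, 40.0],
})

# Assumption from the text: a null balcony value means no balcony
toy["balcony"] = toy["balcony"].fillna(0)

# Possible follow-up: drop rows that still have missing values in other columns
cleaned = toy.dropna()
print(toy["balcony"].tolist())
print(cleaned.shape)
```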