# Data Science Regression Project: Predicting Home Prices in Bangalore

The dataset was downloaded from Kaggle: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

In [27]:
# pandas is a data-analysis library; its main structure is the DataFrame (a labeled table)
import pandas as pd

In [28]:
# NumPy = Numerical Python; provides arrays and fast math on them, including matrix multiplication
import numpy as np

In [29]:
# matplotlib = plotting library; pyplot is its plotting interface
from matplotlib import pyplot as plt

In [4]:
%matplotlib inline

In [5]:
import matplotlib

In [6]:
# Default figure size (width, height) in inches for every plot
matplotlib.rcParams["figure.figsize"] = (20, 10)

## Data Load: Load Bangalore home prices into a DataFrame

In [30]:
# pd.read_csv reads a CSV file into a DataFrame
df1 = pd.read_csv("bengaluru_house_prices.csv")
# df1.head() shows the first five rows, so the entire DataFrame is not printed
df1.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [32]:
# head(n) shows the first n rows
df1.head(6)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0
5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.0


In [34]:
# tail() shows the last five rows of the DataFrame
df1.tail()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0
13319,Super built-up Area,Ready To Move,Doddathoguru,1 BHK,,550,1.0,1.0,17.0


In [12]:
# shape is (rows, columns)
df1.shape

(13320, 9)

In [10]:
df1.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

In [38]:
# Show the unique values in this column
df1['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

In [39]:
# Another way to see the categories: group by the column and count the rows in each group
df1.groupby('area_type')['area_type'].agg('count')

area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64

In [40]:
# value_counts() counts the rows in each unique category of 'area_type', sorted from most to least frequent
df1['area_type'].value_counts()

Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: area_type, dtype: int64

In [19]:
df1['availability'].unique()

array(['19-Dec', 'Ready To Move', '18-May', '18-Feb', '18-Nov', '20-Dec',
       '17-Oct', '21-Dec', '19-Sep', '20-Sep', '18-Mar', '20-Feb',
       '18-Apr', '20-Aug', '18-Oct', '19-Mar', '17-Sep', '18-Dec',
       '17-Aug', '19-Apr', '18-Jun', '22-Dec', '22-Jan', '18-Aug',
       '19-Jan', '17-Jul', '18-Jul', '21-Jun', '20-May', '19-Aug',
       '18-Sep', '17-May', '17-Jun', '21-May', '18-Jan', '20-Mar',
       '17-Dec', '16-Mar', '19-Jun', '22-Jun', '19-Jul', '21-Feb',
       'Immediate Possession', '19-May', '17-Nov', '20-Oct', '20-Jun',
       '19-Feb', '21-Oct', '21-Jan', '17-Mar', '17-Apr', '22-May',
       '19-Oct', '21-Jul', '21-Nov', '21-Mar', '16-Dec', '22-Mar',
       '20-Jan', '21-Sep', '21-Aug', '14-Nov', '19-Nov', '15-Nov',
       '16-Jul', '15-Jun', '17-Feb', '20-Nov', '20-Jul', '16-Sep',
       '15-Oct', '15-Dec', '16-Oct', '22-Nov', '15-Aug', '17-Jan',
       '16-Nov', '20-Apr', '16-Jan', '14-Jul'], dtype=object)

In [21]:
df1['society'].unique()

array(['Coomee ', 'Theanmp', nan, ..., 'SJovest', 'ThhtsV ', 'RSntsAp'],
      dtype=object)

In [22]:
df1['balcony'].unique()

array([ 1.,  3., nan,  2.,  0.])
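The unique() outputs above show stray whitespace: the area_type values contain a double space ('Plot  Area') and some society values end with a trailing space ('Coomee '). A minimal sketch of one way to normalize such strings, assuming a small hand-built Series (the names raw and clean are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy values copied from the unique() outputs above;
# note the double spaces and the trailing spaces.
raw = pd.Series(["Super built-up  Area", "Plot  Area", "Coomee ", "ThhtsV "])

# Collapse any run of whitespace into one space, then trim both ends.
clean = raw.str.replace(r"\s+", " ", regex=True).str.strip()

print(clean.tolist())
```

Normalizing before grouping or encoding keeps 'Plot  Area' and 'Plot Area' from being treated as two different categories.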

Out of the 9 columns, determine which ones are needed and which ones are not.

Columns: 
1. area_type (text, ex: Super built-up Area, Plot Area)
2. availability (text, ex: Ready To Move, 19-Dec) 
3. location (text, ex: Electronic City Phase II, Chikka Tirupathi)  
4. size (text/number, ex: 2 BHK, 4 Bedroom) 
5. society (text, ex: Coomee, Theanmp) 
6. total_sqft (number, ex: 1056, 2600) 
7. bath (number, ex: 2.0, 5.0) 
8. balcony (number, ex: 1.0, 3.0) 
9. price (number, ex: 39.07, 120.00) 

Of these, the columns most likely to matter are:

3. location - price varies dramatically with location (ex: a sought-after neighborhood)
4. size - the number of bedrooms is an important factor in a home's price
6. total_sqft - a good indicator of property size and worth 
7. bath - the number of bathrooms affects the price 
9. price - the target value the model will learn to predict from the columns above 

The columns that likely matter less:
1. area_type - knowing the area type does not seem to matter in this scenario 
2. availability - knowing the date when the home becomes available does not help 
5. society - the name of the housing society; it has missing values (NaN) and is unlikely to help 
8. balcony - this could affect pricing, but how strongly it correlates with price is unclear 
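One way to back these choices with numbers is to count the missing values in each column. The sketch below rebuilds a few columns of the five rows printed by df1.head() as a small frame (the name sample is illustrative); on the real data the same calls would be run on df1.

```python
import numpy as np
import pandas as pd

# A few columns of the five rows shown by df1.head(); blanks become NaN.
sample = pd.DataFrame({
    "location": ["Electronic City Phase II", "Chikka Tirupathi", "Uttarahalli",
                 "Lingadheeranahalli", "Kothanur"],
    "size": ["2 BHK", "4 Bedroom", "3 BHK", "3 BHK", "2 BHK"],
    "society": ["Coomee", "Theanmp", np.nan, "Soiewre", np.nan],
    "balcony": [1.0, 3.0, 3.0, 1.0, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0, 51.0],
})

# Number of missing values per column.
missing = sample.isnull().sum()

# Fraction of rows that are missing, per column.
missing_share = sample.isnull().mean()

print(missing)
print(missing_share)
```

A column with a large share of missing values is a natural candidate to drop, since filling it in would mean guessing most of its contents.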

In [25]:
df2 = df1.drop(['area_type', 'society', 'balcony', 'availability'], axis="columns")
df2.shape

(13320, 5)

In [26]:
df1.shape

(13320, 9)

After dropping the columns that are not helpful, the column count falls from 9 to 5 while the row count stays at 13320:

df1.shape: (13320, 9)

df2.shape: (13320, 5)

Given this dataset, what must we do first to ensure the data is clean, organized, and usable?

1. Remove any 0 or NaN/null values 
2. Make sure the values in each column are uniform; some values are given as ranges or use different labels
    - ex: 2100 - 2850, (9 Bedroom vs 4 BHK)
3. Using business logic, remove any data that could be an outlier 
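The three steps above can be sketched on a toy frame. Everything here is an assumption for illustration rather than the notebook's actual cleaning code: the rows are hypothetical (not taken from the dataset), the helper names parse_bedrooms and sqft_to_number are made up, a range is replaced by its midpoint, and the "at least 300 sqft per bedroom" rule is only one example of business logic.

```python
import numpy as np
import pandas as pd

def parse_bedrooms(size):
    """Return the leading integer of strings like '2 BHK' or '4 Bedroom'."""
    return int(size.split()[0])

def sqft_to_number(value):
    """'1056' -> 1056.0; '2100 - 2850' -> midpoint 2475.0; anything else -> NaN."""
    parts = str(value).split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except (TypeError, ValueError):
        return np.nan

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "location": ["Uttarahalli", "Kothanur", "Whitefield", None],
    "size": ["3 BHK", "2 BHK", "9 Bedroom", "4 BHK"],
    "total_sqft": ["1440", "2100 - 2850", "1170", "3600"],
    "bath": [2.0, 2.0, 2.0, 5.0],
    "price": [62.0, 51.0, 38.0, 400.0],
})

# Step 1: drop rows that contain NaN/null values.
cleaned = toy.dropna().copy()

# Step 2: make the columns uniform (bedroom count as a number, sqft as a float).
cleaned["bedrooms"] = cleaned["size"].apply(parse_bedrooms)
cleaned["total_sqft"] = cleaned["total_sqft"].apply(sqft_to_number)

# Step 3: assumed business rule: keep homes with at least 300 sqft per bedroom.
cleaned = cleaned[cleaned["total_sqft"] / cleaned["bedrooms"] >= 300]

print(cleaned)
```

The threshold in step 3 is a judgment call; it should be adjusted to whatever a domain expert considers a plausible minimum for the market being modeled.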