# Predicting House Prices Using Machine Learning
This notebook uses Python-based machine learning and data science libraries to build a model that predicts the price of a house from its attributes.

We're going to take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modeling
6. Experimentation

## 1. Problem Definition

In a statement: given the attributes of a house in Bengaluru (such as location, size, total square footage, and the number of bathrooms and balconies), can we predict its price?

## 2. Data

The data comes from the Bengaluru House Price dataset on Kaggle.

Link: https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data

## 3. Evaluation

## 4. Features

Information about the columns/features.

#### Create data dictionary

    Area type
    Availability
    Location 
    Size
    Society
    Balcony
    Total sqft
    Bath





## Preparing the tools
We're going to use `pandas` and `numpy` for data manipulation and analysis, and `matplotlib` for plotting.



In [8]:
# Import all the tools we need


# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


In [9]:
dic = {'name': ['oitik', 'ridoy', 'golam'],
       'roll': [71, 35, 113]}
dic

{'name': ['oitik', 'ridoy', 'golam'], 'roll': [71, 35, 113]}

In [12]:
df = pd.DataFrame(dic)

In [13]:
df

Unnamed: 0,name,roll
0,oitik,71
1,ridoy,35
2,golam,113


In [39]:
df = pd.read_csv('data/Bengaluru_House_Data.csv')

In [15]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [16]:
df.shape

(13320, 9)

In [40]:
df.groupby('area_type')['area_type'].count()

area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64

In [41]:
df.drop(['area_type', 'society', 'availability'], axis=1, inplace=True)

In [42]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0


In [43]:
df.isnull().sum()

location        1
size           16
total_sqft      0
bath           73
balcony       609
price           0
dtype: int64

In [34]:
median = df['balcony'].median()

In [45]:
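# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['balcony'] = df['balcony'].fillna(df['balcony'].median())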

In [46]:
df.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
balcony        0
price          0
dtype: int64

In [47]:
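# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['bath'] = df['bath'].fillna(df['bath'].median())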

In [48]:
df.isnull().sum()

location       1
size          16
total_sqft     0
bath           0
balcony        0
price          0
dtype: int64

In [53]:
df.dropna(inplace=True)

In [54]:
df.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
balcony       0
price         0
dtype: int64

In [55]:
df.shape

(13303, 6)
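
The missing-value handling above (median fill for the numeric columns, then dropping the few rows that are still incomplete) can be sketched on a small frame. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "location": ["Whitefield", None, "Hebbal", "Kothanur"],
    "bath": [2.0, 3.0, np.nan, 2.0],
    "balcony": [1.0, np.nan, 2.0, 3.0],
})

# Fill numeric gaps with the column median, assigning the result back
# instead of using inplace=True on a column selection.
for col in ["bath", "balcony"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Drop the rows that still contain a missing value (here: the missing location).
toy = toy.dropna()

print(toy.isnull().sum())
print(toy.shape)
```

Filling first and dropping second keeps as many rows as possible: only rows with a gap in a column that was not imputed are removed.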

In [58]:
df['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

In [59]:
df['bhk'] = df['size'].apply(lambda x: int(x.split(' ')[0]))

In [60]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0,4
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0,3
4,Kothanur,2 BHK,1200,2.0,1.0,51.0,2
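
As an alternative to the `split`-based lambda, the leading number can also be pulled out with the vectorized `Series.str.extract`. This is only a sketch on a few `size` strings taken from the output above, not the notebook's own method; the names `sizes` and `bhk` are illustrative.

```python
import pandas as pd

# A few size strings copied from the unique() output above.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "43 Bedroom"])

# Capture the leading run of digits and convert it to an integer.
bhk = sizes.str.extract(r"^(\d+)", expand=False).astype(int)

print(bhk.tolist())
```

Both approaches assume every `size` value starts with a number, which holds for the values listed above once rows with a missing `size` have been dropped.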


In [61]:
df[df.bhk>10]

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk
459,1 Giri Nagar,11 BHK,5000,9.0,3.0,360.0,11
1718,2Electronic City Phase II,27 BHK,8000,27.0,0.0,230.0,27
1768,1 Ramamurthy Nagar,11 Bedroom,1200,11.0,0.0,170.0,11
3379,1Hanuman Nagar,19 BHK,2000,16.0,2.0,490.0,19
3609,Koramangala Industrial Layout,16 BHK,10000,16.0,2.0,550.0,16
3853,1 Annasandrapalya,11 Bedroom,1200,6.0,3.0,150.0,11
4684,Munnekollal,43 Bedroom,2400,40.0,0.0,660.0,43
4916,1Channasandra,14 BHK,1250,15.0,0.0,125.0,14
6533,Mysore Road,12 Bedroom,2232,6.0,2.0,300.0,12
7979,1 Immadihalli,11 BHK,6000,12.0,2.0,150.0,11


In [65]:
df['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

In [66]:
def is_float(x):
    '''
    Return True if x can be converted to a float, False otherwise.
    '''
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [69]:
df[~df['total_sqft'].apply(is_float)].head(10)

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,0.0,186.0,4
56,Devanahalli,4 Bedroom,3010 - 3410,2.0,2.0,192.0,4
81,Hennur Road,4 Bedroom,2957 - 3450,2.0,2.0,224.5,4
122,Hebbal,4 BHK,3067 - 8156,4.0,0.0,477.0,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,0.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,0.0,43.49,2
188,KR Puram,2 BHK,1015 - 1540,2.0,0.0,56.8,2
224,Devanahalli,3 BHK,1520 - 1740,2.0,2.0,74.82,3
410,Kengeri,1 BHK,34.46Sq. Meter,1.0,0.0,18.5,1
549,Hennur Road,2 BHK,1195 - 1440,2.0,0.0,63.77,2


In [70]:
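def convert(x):
    '''
    Convert a total_sqft value to a float.
    Ranges such as '2100 - 2850' become their average; values that
    cannot be parsed (e.g. '34.46Sq. Meter') become None.
    '''
    try:
        vals = str(x).split('-')
        if len(vals) == 2:
            return (float(vals[0]) + float(vals[1])) / 2
        return float(x)
    except (TypeError, ValueError):
        return None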

In [74]:
convert(df['total_sqft'][30]), (2100+2850)/2

(2475.0, 2475.0)
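
To make the three cases explicit (plain number, range, unparseable text), here is a small self-contained sketch run on values that appear in the outputs above. The helper name `to_sqft` is hypothetical and simply mirrors the logic of `convert`.

```python
def to_sqft(value):
    """Return a float for plain numbers, the midpoint for 'low - high' ranges, else None."""
    parts = str(value).split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except (TypeError, ValueError):
        return None

# Sample inputs copied from the total_sqft column shown earlier.
for raw in ["1056", "2100 - 2850", "34.46Sq. Meter"]:
    print(repr(raw), "->", to_sqft(raw))
```

Values with a unit suffix come back as `None`, so they end up as missing values in the column and need separate handling if those rows are to be kept.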

In [75]:
df1 = df.copy()

In [83]:
# Convert total_sqft on df itself; the outputs and later cells use the numeric column on df
df['total_sqft'] = df['total_sqft'].apply(convert)

In [82]:
df.loc[30]

location      Yelahanka
size              4 BHK
total_sqft       2475.0
bath                4.0
balcony             0.0
price             186.0
bhk                   4
Name: 30, dtype: object

In [84]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,1.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,3.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,3.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,1.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,1.0,51.0,2


In [85]:
df2 = df.copy()

In [86]:
# price appears to be quoted in lakh rupees (1 lakh = 100,000), hence the factor below
df['price_per_sqft'] = df['price']*100000/df['total_sqft']

In [87]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,1.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,3.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,3.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,1.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,1.0,51.0,2,4250.0
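
As a quick sanity check of the new column, the first row can be recomputed by hand. The sketch assumes, as the factor of 100000 suggests, that `price` is in lakh rupees; the variable names are illustrative.

```python
price_lakh = 39.07    # price of the first row in the output above
total_sqft = 1056.0   # total_sqft of the same row

# Assumption: price is in lakh rupees, so multiply by 100,000 to get rupees.
price_per_sqft = price_lakh * 100_000 / total_sqft

print(round(price_per_sqft, 2))  # 3699.81
```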


In [91]:
len(df.location.unique())

1304

In [92]:
df.location = df.location.apply(lambda x: x.strip())

In [93]:
df['location']

0        Electronic City Phase II
1                Chikka Tirupathi
2                     Uttarahalli
3              Lingadheeranahalli
4                        Kothanur
                   ...           
13315                  Whitefield
13316               Richards Town
13317       Raja Rajeshwari Nagar
13318             Padmanabhanagar
13319                Doddathoguru
Name: location, Length: 13303, dtype: object

In [96]:
location_stats = df.groupby('location')['location'].count().sort_values(ascending=False)

In [97]:
location_stats

location
Whitefield               540
Sarjapur  Road           397
Electronic City          304
Kanakpura Road           273
Thanisandra              237
                        ... 
1 Giri Nagar               1
Kanakapura Road,           1
Kanakapura main  Road      1
Karnataka Shabarimala      1
whitefiled                 1
Name: location, Length: 1293, dtype: int64

In [99]:
len(location_stats[location_stats<=10])

1052

In [100]:
location_stats_less = location_stats[location_stats<=10]

In [101]:
location_stats_less

location
Basapura                 10
1st Block Koramangala    10
Gunjur Palya             10
Kalkere                  10
Sector 1 HSR Layout      10
                         ..
1 Giri Nagar              1
Kanakapura Road,          1
Kanakapura main  Road     1
Karnataka Shabarimala     1
whitefiled                1
Name: location, Length: 1052, dtype: int64

In [102]:
df.head(3)

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,1.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,3.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,3.0,62.0,3,4305.555556


In [103]:
# `x in location_stats_less` checks the Series index, which holds the location names
df.location = df.location.apply(lambda x: 'other' if x in location_stats_less else x)


In [104]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,1.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,3.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,3.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,1.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,1.0,51.0,2,4250.0


In [106]:
len(df.location.unique())

242
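
The grouping of rare locations into a single `other` label can also be written without `apply`. The following is a minimal sketch on made-up data, using a threshold of 2 in place of the notebook's 10; `locations`, `rare`, and `grouped` are illustrative names.

```python
import pandas as pd

# Hypothetical locations; the threshold of 2 stands in for the notebook's 10.
locations = pd.Series(
    ["Whitefield"] * 4 + ["Hebbal"] * 3 + ["Kengeri", "Yelahanka", "Yelahanka"]
)

# Count listings per location and collect the labels at or below the threshold.
counts = locations.value_counts()
rare = counts[counts <= 2].index

# Keep frequent labels and replace the rare ones with 'other'.
grouped = locations.where(~locations.isin(rare), "other")

print(grouped.value_counts().to_dict())
```

Collapsing rare labels this way reduces the number of distinct categories, just as the notebook goes from 1293 stripped location names to 242.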