# Real estate price prediction project

The data describes houses in Bengaluru. Each record has the area type, availability, location, size, society, total area of the estate in square feet, number of bathrooms, number of balconies and price.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("data/bengaluru-house-data.csv")
display(df)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.00
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.00
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.00
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.00
...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.00
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.00
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.00
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.00


## Cleaning the data by first removing the fields that do not seem to contribute to the price prediction

Fields such as area type, availability, number of balconies and society do not seem to contribute to predicting the price of an estate, so it is better to create a new data frame without them.

### Getting the number of estates grouped by area type

The agg() method applies a function, or a group of functions, along one axis of the data frame; for DataFrame.agg the default is axis=0, the index (row) axis. Called on a grouped column as below, agg("count") returns the number of non-null values in each group.

In [3]:
df.groupby("area_type")["area_type"].agg("count")

area_type
Built-up  Area          2418
Carpet  Area              87
Plot  Area              2025
Super built-up  Area    8790
Name: area_type, dtype: int64
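As a minimal sketch of what agg() does on grouped data, the example below uses a small invented data frame (the rows and the names `toy`, `counts` and `price_summary` are illustrative, not taken from the data set). It shows the single-function form used above and the list form that returns one column per function.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "area_type": ["Plot Area", "Carpet Area", "Plot Area", "Plot Area"],
    "price": [120.0, 45.0, 80.0, 95.0],
})

# One function: number of rows per group, as in the cell above.
counts = toy.groupby("area_type")["area_type"].agg("count")

# A list of functions: one output column per function.
price_summary = toy.groupby("area_type")["price"].agg(["count", "mean"])

print(counts)
print(price_summary)
```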

### Creating a new data frame named df_dropped that does not contain the seemingly non-contributing fields mentioned above

It can also be observed here that the dtype of the `total_sqft` column is `object` (a generic Python object type), which indicates that the data in that column might not be uniform.

In [4]:
df_dropped = df.drop(["area_type", "availability", "balcony", "society"], axis="columns").copy()
display(df_dropped)

print(df_dropped.shape)
print(df_dropped.dtypes)

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.00
2,Uttarahalli,3 BHK,1440,2.0,62.00
3,Lingadheeranahalli,3 BHK,1521,3.0,95.00
4,Kothanur,2 BHK,1200,2.0,51.00
...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,4.0,231.00
13316,Richards Town,4 BHK,3600,5.0,400.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,2.0,60.00
13318,Padmanabhanagar,4 BHK,4689,4.0,488.00


(13320, 5)
location       object
size           object
total_sqft     object
bath          float64
price         float64
dtype: object


### Counting null records for each field

It can be observed that the number of null records is very small compared to the size of the data (at most 73 in any one column, out of 13320 rows), so these records are dropped.

In [5]:
df_dropped.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [6]:
df_dropped_null_free = df_dropped.dropna().copy()
display(df_dropped_null_free)
print(df_dropped_null_free.shape)
print(df_dropped_null_free.isnull().sum())

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.00
2,Uttarahalli,3 BHK,1440,2.0,62.00
3,Lingadheeranahalli,3 BHK,1521,3.0,95.00
4,Kothanur,2 BHK,1200,2.0,51.00
...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,4.0,231.00
13316,Richards Town,4 BHK,3600,5.0,400.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,2.0,60.00
13318,Padmanabhanagar,4 BHK,4689,4.0,488.00


(13246, 5)
location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
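To put a number on "very small", the arithmetic below uses the two shapes printed above (13320 rows before and 13246 rows after dropna()). The variable names are illustrative.

```python
rows_before = 13320  # shape of df_dropped
rows_after = 13246   # shape of df_dropped_null_free

dropped = rows_before - rows_after
share = dropped / rows_before

print(f"{dropped} rows dropped ({share:.2%} of the data)")
```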


Below we obtain all the unique values present in the `size` field. Each value is a number followed by a label, mostly `BHK` or `Bedroom` (plus one `RK`), so each cell can be split into two tokens: the numeric part, which is the one we need, and the label.

In [7]:
df_dropped_null_free["size"].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

The dtype of the `size` column is `object`, which indicates that the data in this field might not be uniform.

In [8]:
print("dtype of size column:", df_dropped_null_free["size"].dtypes)

dtype of size column: object


### Observe the size field

The size field consists of non-uniform values such as `5 Bedroom`, `4 BHK` etc. The part we are interested in right now is the leading number, which we treat as the BHK value.

So we split each string on the space, which gives two tokens at indices 0 and 1. The token at index 0 is the BHK value; we convert it to `int` and store it in a new `bhk` column.

It is recommended to use the .loc method to set values. `When setting values in a pandas object, care must be taken to avoid what is called chained indexing`. Refer to the pandas docs: [returning-a-view-versus-a-copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)

In [9]:
df_dropped_null_free.loc[:, "bhk"] = df_dropped_null_free["size"].apply(lambda x: int(x.split(" ")[0]))
display(df_dropped_null_free)
print(df_dropped_null_free.dtypes)

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.00,4
2,Uttarahalli,3 BHK,1440,2.0,62.00,3
3,Lingadheeranahalli,3 BHK,1521,3.0,95.00,3
4,Kothanur,2 BHK,1200,2.0,51.00,2
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,4.0,231.00,5
13316,Richards Town,4 BHK,3600,5.0,400.00,4
13317,Raja Rajeshwari Nagar,2 BHK,1141,2.0,60.00,2
13318,Padmanabhanagar,4 BHK,4689,4.0,488.00,4


location       object
size           object
total_sqft     object
bath          float64
price         float64
bhk             int64
dtype: object
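As a sketch of the tokenization step, the example below parses a few of the `size` values listed earlier in two ways: the split used in the cell above, and a vectorized alternative with `Series.str.extract`. The alternative is a suggestion, not something the notebook uses, and the names `sizes`, `bhk_split` and `bhk_extract` are illustrative.

```python
import pandas as pd

# A few values taken from the unique() output of the size column.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "43 Bedroom"])

# Approach used in the notebook: split on the space, keep token 0.
bhk_split = sizes.apply(lambda x: int(x.split(" ")[0]))

# Alternative: capture the leading digits with a regular expression.
bhk_extract = sizes.str.extract(r"^(\d+)", expand=False).astype(int)

print(bhk_split.tolist())
print(bhk_extract.tolist())
```

Both approaches assume every value starts with an integer, which holds for the values listed above.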


Looking at the `bhk` values below, a few are unusually large, so let's inspect the estates with more than 20 bedrooms. One record has 43 bedrooms but a total area of only 2400 square feet, which is implausibly small for that many rooms.

In [10]:
df_dropped_null_free["bhk"].unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)

In [11]:
# Equivalent: df_dropped_null_free[df_dropped_null_free["bhk"] > 20]
df_dropped_null_free[df_dropped_null_free.bhk > 20]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
1718,2Electronic City Phase II,27 BHK,8000,27.0,230.0,27
4684,Munnekollal,43 Bedroom,2400,40.0,660.0,43
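One way to make the implausibility concrete is to divide the total area by the number of bedrooms for the two records above. The values are copied from the output; the dictionary and variable names are illustrative.

```python
# index -> (total_sqft, bhk), copied from the two rows shown above
records = {1718: (8000, 27), 4684: (2400, 43)}

sqft_per_bedroom = {
    idx: round(sqft / bhk, 1) for idx, (sqft, bhk) in records.items()
}

for idx, value in sqft_per_bedroom.items():
    print(f"row {idx}: {value} sqft per bedroom")
```

Row 4684 comes out at under 60 square feet per bedroom, which supports treating it as a data error or an outlier.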


Among the unique `total_sqft` values below, note `1133 - 1384`. The data is non-uniform: most cells hold a single number (stored as a string), but this cell holds a range.

In [12]:
# Equivalent: df_dropped_null_free["total_sqft"].unique()
df_dropped_null_free.total_sqft.unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

To deal with records that hold a range, we can either drop them, since there are few of them compared to the whole data set, or replace each range with the average of its two ends as a single float value.

Here we choose to convert the ranges to single float values by taking the average of the two.

The is_float() function below checks whether a value can be parsed as a float. Applying it to `total_sqft` and negating the result selects all the records whose value is not a plain number, which includes the ranges.

In [13]:
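def is_float(x):
    # True if x can be parsed as a float, False otherwise (e.g. "2100 - 2850").
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True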

In [14]:
df_dropped_null_free[~df_dropped_null_free["total_sqft"].apply(is_float)]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,186.000,4
122,Hebbal,4 BHK,3067 - 8156,4.0,477.000,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,43.490,2
188,KR Puram,2 BHK,1015 - 1540,2.0,56.800,2
...,...,...,...,...,...,...
12975,Whitefield,2 BHK,850 - 1060,2.0,38.190,2
12990,Talaghattapura,3 BHK,1804 - 2273,3.0,122.000,3
13059,Harlur,2 BHK,1200 - 1470,2.0,72.760,2
13265,Hoodi,2 BHK,1133 - 1384,2.0,59.135,2


`Compare the row below with the output of cell 21, which shows the same row after conversion.`

In [15]:
df_dropped_null_free.loc[30]

location        Yelahanka
size                4 BHK
total_sqft    2100 - 2850
bath                  4.0
price               186.0
bhk                     4
Name: 30, dtype: object

Below is the function that converts the range type cells into single float values.

In [16]:
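def convert_sqft_to_num(x):
    # A range such as "2100 - 2850" is replaced by the average of its two ends.
    tokens = x.split("-")
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Anything else that cannot be parsed (e.g. "25sq. m") becomes None.
        return None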

In [17]:
# Testing the function above: the average of 25 and 30.
convert_sqft_to_num("25 - 30")

27.5

In [18]:
# Returns None for an unparseable value (so nothing is displayed), which is fine.
convert_sqft_to_num("25sq. m")

In [19]:
df_dropped_null_free_tokenized = df_dropped_null_free.copy()
df_dropped_null_free_tokenized["total_sqft"] = df_dropped_null_free_tokenized["total_sqft"].apply(convert_sqft_to_num)
df_dropped_null_free_tokenized

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.00,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.00,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.00,3
4,Kothanur,2 BHK,1200.0,2.0,51.00,2
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453.0,4.0,231.00,5
13316,Richards Town,4 BHK,3600.0,5.0,400.00,4
13317,Raja Rajeshwari Nagar,2 BHK,1141.0,2.0,60.00,2
13318,Padmanabhanagar,4 BHK,4689.0,4.0,488.00,4


Now we can see that the dtype of the `total_sqft` field is `float64`.

In [20]:
df_dropped_null_free_tokenized.dtypes

location       object
size           object
total_sqft    float64
bath          float64
price         float64
bhk             int64
dtype: object

In [21]:
df_dropped_null_free_tokenized.loc[30]

location      Yelahanka
size              4 BHK
total_sqft       2475.0
bath                4.0
price             186.0
bhk                   4
Name: 30, dtype: object

<mark>**Note**: The total_sqft column now contains null values, one for each entry that convert_sqft_to_num() could not parse.</mark>
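A minimal sketch of where those nulls come from, and one way to count and drop them, is given below. The function is restated so the example is self-contained, the three sample values are illustrative, and dropping the rows is only one possible choice.

```python
import pandas as pd

def convert_sqft_to_num(x):
    # Average a range, parse a plain number, return None otherwise.
    tokens = x.split("-")
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None

sqft = pd.Series(["1056", "2100 - 2850", "25sq. m"])
converted = sqft.apply(convert_sqft_to_num)

# The unparseable value is now missing.
print(converted.isnull().sum())

# One option (an assumption, not the notebook's stated choice): drop them.
cleaned = converted.dropna()
print(cleaned.tolist())
```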

In [22]:
df5 = df_dropped_null_free_tokenized.copy()
display(df5)

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.00,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.00,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.00,3
4,Kothanur,2 BHK,1200.0,2.0,51.00,2
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453.0,4.0,231.00,5
13316,Richards Town,4 BHK,3600.0,5.0,400.00,4
13317,Raja Rajeshwari Nagar,2 BHK,1141.0,2.0,60.00,2
13318,Padmanabhanagar,4 BHK,4689.0,4.0,488.00,4


Making sure the price and area columns are numeric (anything non-numeric becomes NaN) before creating a price per square foot column.

In [23]:
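# Work on df5 throughout: df still holds the raw total_sqft strings, so
# dividing by df["total_sqft"] would give NaN for every range value.
df5["price"] = pd.to_numeric(df5["price"], errors="coerce")
df5["total_sqft"] = pd.to_numeric(df5["total_sqft"], errors="coerce")
# The factor 100000 converts the price to single currency units (assumed: lakh to rupees).
df5["price_per_sqft"] = (df5["price"] * 100000) / df5["total_sqft"]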
df5

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.00,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.00,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.00,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.00,2,4250.000000
...,...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453.0,4.0,231.00,5,6689.834926
13316,Richards Town,4 BHK,3600.0,5.0,400.00,4,11111.111111
13317,Raja Rajeshwari Nagar,2 BHK,1141.0,2.0,60.00,2,5258.545136
13318,Padmanabhanagar,4 BHK,4689.0,4.0,488.00,4,10407.336319
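The calculation can be checked by hand for two rows. The sketch below uses row 0 from the table above and row 30, whose area was converted to 2475.0 in cell 21. The factor 100000 suggests the price is quoted in lakh rupees; that reading is an assumption, and the helper name `price_per_sqft` is illustrative.

```python
def price_per_sqft(price, total_sqft):
    # price is multiplied by 100000 as in the notebook (assumed: lakh rupees).
    return price * 100000 / total_sqft

row_0 = price_per_sqft(39.07, 1056.0)    # row 0 in the table above
row_30 = price_per_sqft(186.0, 2475.0)   # row 30 after range conversion

print(round(row_0, 6))
print(round(row_30, 2))
```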


Checking how many unique locations we have

In [24]:
len(df5["location"].unique())
# This many distinct categories risks the curse of dimensionality.

1304

Finding the number of data points for a particular location

In [25]:
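# Note: the stripped names are written to df, not df5, so the counts below are
# computed on the unstripped df5["location"] (see " Banaswadi" in the output).
# Assign to df5["location"] instead for the stripping to affect the counts.
df["location"] = df5["location"].apply(lambda x: x.strip())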
location_stats = df5.groupby("location")["location"].agg("count").sort_values(ascending=False)
location_stats

location
Whitefield             534
Sarjapur  Road         392
Electronic City        302
Kanakpura Road         266
Thanisandra            233
                      ... 
 Banaswadi               1
Kanakadasa Layout        1
Kanakapur main road      1
Kanakapura  Rod          1
whitefiled               1
Name: location, Length: 1304, dtype: int64
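The output above still lists ` Banaswadi` with a leading space, which is what a name looks like when it has not been stripped before counting. The sketch below shows the effect on a small invented data frame (the rows and the names `toy`, `raw_counts` and `clean_counts` are illustrative): stripping the column in place merges names that differ only by surrounding whitespace.

```python
import pandas as pd

# Toy data, invented for illustration: note the stray spaces.
toy = pd.DataFrame(
    {"location": ["Whitefield", " Whitefield", "Hoodi ", "Hoodi", "Whitefield"]}
)

# Counting before stripping treats " Whitefield" and "Whitefield" as different.
raw_counts = toy.groupby("location")["location"].agg("count")

# Strip in place on the same frame that is counted afterwards.
toy["location"] = toy["location"].str.strip()
clean_counts = (
    toy.groupby("location")["location"].agg("count").sort_values(ascending=False)
)

print(len(raw_counts), len(clean_counts))
print(clean_counts)
```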

In [26]:
len(location_stats[location_stats <= 10])

1063

In [27]:
location_stats_less_than_10 = location_stats[location_stats <= 10]
location_stats_less_than_10

location
Dodsworth Layout         10
1st Block Koramangala    10
Nagappa Reddy Layout     10
Ganga Nagar              10
Dairy Circle             10
                         ..
 Banaswadi                1
Kanakadasa Layout         1
Kanakapur main road       1
Kanakapura  Rod           1
whitefiled                1
Name: location, Length: 1063, dtype: int64

In [28]:
df5["location"] = df5["location"].apply(lambda x: "other" if x in location_stats_less_than_10 else x)
# 1304 locations - 1063 rare ones + 1 "other" bucket
len(df5.location.unique())

242

In [29]:
df5.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
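The idea of folding rare locations into a single "other" category can be sketched on a small invented data frame. The threshold of 1 is chosen only to fit the toy data (the notebook uses 10), and the names `toy`, `stats` and `rare` are illustrative. `Series.isin` with `Series.where` is used here as a vectorized alternative to apply().

```python
import pandas as pd

# Toy data, invented for illustration: A and B are common, C and D are rare.
toy = pd.DataFrame({"location": ["A"] * 3 + ["B"] * 2 + ["C", "D"]})

stats = toy["location"].value_counts()
rare = stats[stats <= 1]  # toy threshold; the notebook uses <= 10

# Keep the name where it is not rare, otherwise replace it with "other".
is_rare = toy["location"].isin(rare.index)
toy["location"] = toy["location"].where(~is_rare, "other")

print(toy["location"].value_counts())
print(toy["location"].nunique())
```

Writing the result back to the same column is what reduces the number of distinct categories.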
