### Summary

In this project, we'll demonstrate our data science skills by predicting house prices in various locations within Bangalore, India. We're using a dataset from Kaggle, which you can find here: [Kaggle Dataset](https://www.kaggle.com/datasets/nithinthuruthipally/bengaluru-house-prices).

In this notebook, we'll use Python to perform a range of tasks including data importation, cleaning, wrangling, and other manipulations to prepare a well-structured dataset. Then, we'll apply machine learning techniques to find the best model for predicting house prices.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline 
import matplotlib

In [2]:
#Data importation
df = pd.read_csv("bengaluru_house_prices.csv")
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [3]:
df.shape

(13320, 9)

Next, we drop the columns that we do not expect to contribute significantly to the price prediction: `availability`, `society`, `area_type`, and `balcony`. The following code removes them:

In [4]:
dataframe = df.drop(columns =["availability", "area_type","balcony", "society"])
#Print the first 3 rows
dataframe.head(3)

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0


In [5]:
dataframe.isnull().sum() #Always check for missing values

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [6]:
#Dealing with missing values
dataframe = dataframe.dropna()

#Double checking missing values
dataframe.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

In [7]:
dataframe.shape

(13246, 5)

We will analyze each column to understand its contents and characteristics. The dataset contains 13,246 observations, but we will initially examine the first 3 to 5 rows for a quick overview. Additionally, the `.unique()` method is a useful tool for identifying all unique values in a column.

In [8]:
dataframe['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

We create the `bedroom` column by splitting each value of the `size` column on the space and keeping the first element, which is the number of bedrooms.

In [9]:
dataframe['bedroom'] = dataframe['size'].apply(lambda x: x.split(' ')[0])

In [10]:
dataframe['bedroom'] = dataframe['bedroom'].apply(lambda x: int(x))
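The same extraction can also be sketched with pandas' vectorized string methods instead of two `apply` calls. The `sizes` Series below is a small illustrative sample in the same format as the `size` column, not data read from the file.

```python
import pandas as pd

# Illustrative values in the same format as the 'size' column.
sizes = pd.Series(['2 BHK', '4 Bedroom', '1 RK', '11 BHK'])

# Capture the leading run of digits and convert it to an integer.
bedrooms = sizes.str.extract(r'^(\d+)', expand=False).astype(int)

print(bedrooms.tolist())
```

This variant is only an alternative; the `split`-based cells above produce the same integers for values of this shape.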

In [11]:
#drop the old size column
dataframe.drop(columns='size', inplace=True)

In [12]:
dataframe.head()

Unnamed: 0,location,total_sqft,bath,price,bedroom
0,Electronic City Phase II,1056,2.0,39.07,2
1,Chikka Tirupathi,2600,5.0,120.0,4
2,Uttarahalli,1440,2.0,62.0,3
3,Lingadheeranahalli,1521,3.0,95.0,3
4,Kothanur,1200,2.0,51.0,2


In [13]:
#Double-check: the column now holds only the number of bedrooms for each observation
dataframe['bedroom'].unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)

In [14]:
dataframe[dataframe.bedroom>20]

Unnamed: 0,location,total_sqft,bath,price,bedroom
1718,2Electronic City Phase II,8000,27.0,230.0,27
4684,Munnekollal,2400,40.0,660.0,43


In [15]:
dataframe.shape

(13246, 5)

In [16]:
#Now we check the column 'total_sqft'
dataframe['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

The values in this column are messy: the numbers are stored as strings, and some entries are ranges such as `'1133 - 1384'` rather than single values. To deal with this, we first write a helper function that tells us whether a value can be converted to a float, so we can separate the plain numbers from everything else.

In [17]:
def is_float(x):
    # Return True if x can be converted to a float, False otherwise.
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [20]:
# We are identifying all values in the data that cannot be converted to float.
dataframe[~ dataframe['total_sqft'].apply(lambda x: is_float(x))]

Unnamed: 0,location,total_sqft,bath,price,bedroom
30,Yelahanka,2100 - 2850,4.0,186.000,4
122,Hebbal,3067 - 8156,4.0,477.000,4
137,8th Phase JP Nagar,1042 - 1105,2.0,54.005,2
165,Sarjapur,1145 - 1340,2.0,43.490,2
188,KR Puram,1015 - 1540,2.0,56.800,2
...,...,...,...,...,...
12975,Whitefield,850 - 1060,2.0,38.190,2
12990,Talaghattapura,1804 - 2273,3.0,122.000,3
13059,Harlur,1200 - 1470,2.0,72.760,2
13265,Hoodi,1133 - 1384,2.0,59.135,2


In [21]:
dataframe.shape

(13246, 5)

We will create a function that returns the average of the minimum and maximum values of an interval.

In [22]:
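def convert_total_sqft_to_number(x):
    value = x.split('-')
    # A range such as '2100 - 2850' is replaced by its midpoint.
    if len(value) == 2:
        return (float(value[0]) + float(value[1]))/2
    # Anything else is either a plain number or an unparseable string.
    try:
        return float(x)
    except ValueError:
        return None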

In [None]:
dataframe2 = dataframe.copy()
dataframe2['total_sqft'] = dataframe2['total_sqft'].apply(convert_total_sqft_to_number)
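To see what the conversion does, here is a minimal standalone sketch that restates the same logic under the hypothetical name `to_sqft` and runs it on three kinds of input: a plain number, a range, and a string that cannot be parsed. The sample strings are illustrative; the last one is an assumed example of an unparseable value.

```python
def to_sqft(x):
    """Standalone restatement of convert_total_sqft_to_number."""
    parts = x.split('-')
    if len(parts) == 2:
        # Range such as '2100 - 2850': use the midpoint.
        return (float(parts[0]) + float(parts[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None

samples = ['1056', '2100 - 2850', '34.46Sq. Meter']
converted = [to_sqft(s) for s in samples]
print(converted)
```

Values that come back as `None` become missing values in the column, which is why the next cells count and drop them.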

In [None]:
dataframe2.head(3)

In [None]:
dataframe2.shape

In [None]:
dataframe2.total_sqft.isnull().sum()

In [None]:
# dropna() returns a new DataFrame, so the result must be assigned back
dataframe2 = dataframe2.dropna()

In [None]:
dataframe3 = dataframe2.copy()

In [None]:
dataframe3['price_per_sqft'] = dataframe3['price']*100000/dataframe3['total_sqft']
dataframe3.head(3)
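The factor of 100000 converts the price from lakh rupees (1 lakh = 100,000) to rupees, so `price_per_sqft` is expressed in rupees per square foot. A quick arithmetic check, using the first row of the preview shown earlier (39.07 lakh for 1056 sq ft):

```python
price_lakh = 39.07     # price in lakh rupees (first row of the preview)
total_sqft = 1056.0    # area in square feet

# Convert lakh to rupees, then divide by the area.
price_per_sqft = price_lakh * 100000 / total_sqft
print(round(price_per_sqft, 2))
```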

In [None]:
dataframe3['location'].unique()

In [None]:
len(dataframe3['location'].unique())

In [None]:
df3 = dataframe3.copy()  # continue under a shorter name
df3.location = df3.location.apply(lambda x: x.strip())
loc_stats = df3.groupby('location')['location'].agg('count').sort_values(ascending=False)
loc_stats

In [None]:
len(loc_stats[loc_stats<=10])

In [None]:
location_less_than_10 = loc_stats[loc_stats<=10]
location_less_than_10.size

In [None]:
len(df3.location.unique())

In [None]:
df3.location = df3.location.apply(lambda x: 'Other' if x in location_less_than_10 else x)

In [None]:
len(df3.location.unique())
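The effect of collapsing rare locations into a single `'Other'` label can be sketched on a toy frame. The data and the threshold of 1 are made up for illustration (the notebook uses 10), and `isin`/`where` stand in for the `apply` call above.

```python
import pandas as pd

# Toy data: 'A' appears three times (with stray spaces), 'B' and 'C' once each.
toy = pd.DataFrame({'location': ['A', 'A ', ' A', 'B', 'C']})
toy['location'] = toy['location'].str.strip()

counts = toy['location'].value_counts()
rare = counts[counts <= 1].index   # threshold of 1 here; the notebook uses 10

# Keep frequent locations as they are and relabel the rare ones as 'Other'.
toy['location'] = toy['location'].where(~toy['location'].isin(rare), 'Other')

print(toy['location'].tolist())
```

Grouping rare categories this way keeps the number of one-hot columns manageable later on.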

In [None]:
df3.head(10)

In [None]:
df3[df3.total_sqft/df3.bedroom<300].head()

In [None]:
df4 = df3[~(df3.total_sqft/df3.bedroom<300)]
df4.head(3)

In [None]:
df4.price_per_sqft.describe()

In [None]:
def removal_outliers(df):
    # For each location, keep only the rows whose price_per_sqft lies within
    # one standard deviation of that location's mean.
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        sd = np.std(subdf.price_per_sqft)
        reducedf = subdf[(subdf.price_per_sqft>(m-sd))&(subdf.price_per_sqft<=(m+sd))]
        df_out = pd.concat([df_out, reducedf], ignore_index=True)
    return df_out

df5 = removal_outliers(df4)
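The per-location filter can be illustrated on made-up numbers. This sketch uses `groupby(...).transform` rather than the explicit loop, and assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default; the toy values are not from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 5,
    'price_per_sqft': [100, 110, 90, 105, 500,       # 500 is far above A's other values
                       1000, 1100, 900, 1050, 10],   # 10 is far below B's other values
})

grouped = toy.groupby('location')['price_per_sqft']
mean = grouped.transform('mean')
std = grouped.transform(lambda s: s.std(ddof=0))   # population std per location

# Keep rows within one standard deviation of their own location's mean.
kept = toy[(toy.price_per_sqft > mean - std) & (toy.price_per_sqft <= mean + std)]
print(kept)
```

Because the bounds are computed per location, a price that is normal in an expensive area is not treated as an outlier just because it is high overall.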

In [None]:
df4.shape

In [None]:
df5.shape

In [None]:
df5.head(3)

In [None]:
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.bedroom==2)]
    bhk3 = df[(df.location==location) & (df.bedroom==3)]
    matplotlib.rcParams['figure.figsize'] = (12,6)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 bedroom', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 bedroom', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()
    
plot_scatter_chart(df5,"Hebbal")

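We should also remove properties where, for the same location, the price of (for example) a 3-bedroom apartment is less than that of a 2-bedroom apartment with the same square-foot area. To do this, for each location we build a dictionary of `price_per_sqft` statistics (mean, standard deviation, and count) per bedroom count. We then drop every property whose `price_per_sqft` is below the mean of the properties with one fewer bedroom in the same location, provided that smaller group has more than 5 observations.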

In [None]:
def remove_bedroom_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bedroom_stats = {}
        for bedroom, bedroom_df in location_df.groupby('bedroom'):
            bedroom_stats[bedroom] = {
                'mean': np.mean(bedroom_df.price_per_sqft),
                'std': np.std(bedroom_df.price_per_sqft),
                'count': bedroom_df.shape[0]
            }
        for bedroom, bedroom_df in location_df.groupby('bedroom'):
            stats = bedroom_stats.get(bedroom-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bedroom_df[bedroom_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
df6 = remove_bedroom_outliers(df5)
# df8 = df7.copy()
df6.shape
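The statistics dictionary built inside `remove_bedroom_outliers` can be inspected on a toy frame. The numbers below are invented for illustration, and this sketch skips the `count > 5` guard that the function applies.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['A'] * 4,
    'bedroom': [2, 2, 3, 3],
    'price_per_sqft': [4000.0, 6000.0, 4500.0, 7000.0],
})

# Mean, population std, and count of price_per_sqft per bedroom count.
stats = {
    bedroom: {
        'mean': group.price_per_sqft.mean(),
        'std': group.price_per_sqft.std(ddof=0),
        'count': len(group),
    }
    for bedroom, group in toy.groupby('bedroom')
}

# 3-bedroom rows priced below the 2-bedroom mean are the removal candidates.
three = toy[toy.bedroom == 3]
flagged = three[three.price_per_sqft < stats[2]['mean']]

print(stats[2])
print(flagged.index.tolist())
```

Here the 3-bedroom property at 4500 per square foot falls below the 2-bedroom mean of 5000, so its index is the one flagged.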

In [None]:
plot_scatter_chart(df6,"Hebbal")

In [None]:
#import matplotlib
matplotlib.rcParams["figure.figsize"] = (12,6)
plt.hist(df6.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

In [None]:
df6.bath.unique()

In [None]:
df6[df6.bath>10]

In [None]:
plt.hist(df6.bath,rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
df7 = df6[df6.bath<df6.bedroom+2]
df7.shape

In [None]:
df7.head(3)

In [None]:
df8 = df7.drop('price_per_sqft', axis=1)

In [None]:
dummies = pd.get_dummies(df8.location, dtype=int)
dummies.head(3)

In [None]:
df9 = pd.concat([df8, dummies.drop('Other', axis='columns')], axis=1).drop('location', axis=1)
df9.head(2)
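A small sketch of the encoding step on made-up rows: each location becomes a 0/1 column, and the `'Other'` column is dropped. Presumably this is because k categories need only k-1 indicator columns, so a row with all zeros then stands for `'Other'`.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['Hebbal', 'Other', 'Whitefield'],
    'total_sqft': [1200.0, 900.0, 1500.0],
})

dummies = pd.get_dummies(toy.location, dtype=int)

# Dropping 'Other' makes it the baseline: all-zero indicators mean 'Other'.
encoded = pd.concat([toy.drop(columns='location'),
                     dummies.drop(columns='Other')], axis=1)
print(encoded)
```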

In [None]:
X = df9.drop('price', axis='columns')
y = df9.price

In [None]:
X.shape

In [None]:
y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [None]:
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)

##### Use cross-validation (ShuffleSplit) to measure the R² score of our LinearRegression model

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)
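To make the mechanics concrete, here is a minimal sketch of the same call on synthetic data (the `X_demo` and `y_demo` arrays are generated here, not taken from the housing data). `ShuffleSplit` draws independent random train/test splits, and `cross_val_score` returns one score per split, which for a regressor is R² by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data: y is linear in X plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Five independent random 80/20 splits; test sets may overlap across splits.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print(scores.shape, scores.mean())
```

Averaging the five scores gives a steadier estimate than the single train/test split used earlier.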

##### Find best model using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                'copy_X' : [True, False],
                'fit_intercept' : [True, False],
                'n_jobs' : [1,2,3],
                'positive' : [True, False]

            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['squared_error','friedman_mse'],  # 'mse' was renamed 'squared_error' in scikit-learn 1.0
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,y)
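What `GridSearchCV` reports for a single model can be sketched on synthetic data. The arrays and the `alpha` grid below are assumptions chosen for illustration, not the grid used above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic data: one of the four features has a true coefficient of zero.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=150)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
gs = GridSearchCV(Lasso(), {'alpha': [0.01, 0.1, 1.0]}, cv=cv)
gs.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated R^2 of the best parameter setting.
print(gs.best_params_, round(gs.best_score_, 3))
```

The function above repeats this for each candidate model and collects `best_score_` and `best_params_` into one table for comparison.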

##### linear_regression has a better score 

##### Test the model on a few properties

In [None]:
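def predict_price(location,sqft,bath,bedroom):
    # Build a zero feature vector in the same column order as X:
    # total_sqft, bath, bedroom, then one indicator column per location.
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bedroom

    # Set the indicator for the location. Locations without their own
    # column (including 'Other') keep all indicators at zero.
    matches = np.where(X.columns==location)[0]
    if len(matches) > 0:
        x[matches[0]] = 1

    return lr_clf.predict([x])[0]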

In [None]:
predict_price('Neeladri Nagar', 1000,2,2)

In [None]:
predict_price('Neeladri Nagar', 1000,3,3)

In [None]:
predict_price('1st Phase JP Nagar',1000, 2, 2)

In [None]:
predict_price('1st Phase JP Nagar',1000, 3,3)

In [None]:
predict_price('Indira Nagar',1000, 2, 2)
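The feature vector that `predict_price` passes to the model can be examined on its own. The sketch below uses a hypothetical five-column layout and a hypothetical helper `build_features`; the real `X` has one indicator column per retained location.

```python
import numpy as np
import pandas as pd

# Hypothetical column layout: three numeric features, then location indicators.
columns = pd.Index(['total_sqft', 'bath', 'bedroom', 'Hebbal', 'Whitefield'])

def build_features(location, sqft, bath, bedroom):
    x = np.zeros(len(columns))
    x[:3] = [sqft, bath, bedroom]
    # Unknown locations leave every indicator at zero (the 'Other' baseline).
    if location in columns[3:]:
        x[columns.get_loc(location)] = 1
    return x

known = build_features('Whitefield', 1000, 2, 2)
unknown = build_features('Somewhere Else', 1000, 2, 2)
print(known, unknown)
```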

##### Export the tested model to a pickle file

In [None]:
import pickle
with open('banglore_home_prices_model.pickle','wb') as f:
    pickle.dump(lr_clf,f)

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))
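A consumer of these two files would load them back before predicting. The round trip below is a minimal sketch with a stand-in model, stand-in column list, and hypothetical file names in a temporary directory; in the notebook the real objects are `lr_clf` and `X.columns`.

```python
import json
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model (y = 2x) and column list for illustration only.
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]),
                               np.array([2.0, 4.0, 6.0]))
columns = {'data_columns': ['total_sqft']}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'model.pickle')      # hypothetical file names
    columns_path = os.path.join(tmp, 'columns.json')
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    with open(columns_path, 'w') as f:
        json.dump(columns, f)

    # Reload both artifacts as a consumer of the exported files would.
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)['data_columns']

prediction = loaded_model.predict(np.array([[4.0]]))[0]
print(loaded_columns, round(prediction, 2))
```

Saving the column order alongside the model matters because the feature vector must be rebuilt in exactly that order at prediction time.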