In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd

HOUSE = '/kaggle/input/house-price-prediction-dataset/csvdata.csv'
df = pd.read_csv(filepath_or_buffer=HOUSE, index_col=[0])
df.head()

Unnamed: 0,City,Price,Area,Location,No. of Bedrooms
0,Bangalore,30000000,3340,JP Nagar Phase 1,4
1,Bangalore,7888000,1045,Dasarahalli on Tumkur Road,2
2,Bangalore,4866000,1179,Kannur on Thanisandra Main Road,2
3,Bangalore,8358000,1675,Doddanekundi,3
4,Bangalore,6845000,1670,Kengeri,3


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29135 entries, 0 to 7718
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             29135 non-null  object
 1   Price            29135 non-null  int64 
 2   Area             29135 non-null  int64 
 3   Location         29135 non-null  object
 4   No. of Bedrooms  29135 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.3+ MB


We don't have many columns to work with. The obvious target variable is Price; let's hope there is plenty of signal in the three, or maybe four, independent variables.

In [3]:
df.nunique()

City                  6
Price              4924
Area               2452
Location           1776
No. of Bedrooms       9
dtype: int64

In [4]:
from plotly.express import histogram
histogram(data_frame=df, x='Price', facet_col='City', log_y=True, height=1200, facet_col_wrap=3)

In [5]:
from plotly.express import scatter
scatter(data_frame=df, x='Area', y='Price', log_y=True, facet_col='City', height=1200, facet_col_wrap=3, color='No. of Bedrooms', hover_name='Location')

In [6]:
scatter(data_frame=df, x='Area', y='No. of Bedrooms', color='City', trendline='ols')

Not surprisingly, the number of bedrooms tracks the area: the per-city OLS trendlines have an R2 of anywhere from 0.39 to 0.65. This suggests that we really only have about two and a half independent variables for determining the price.

In [7]:
from math import log
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, init='random', random_state=2024, n_iter=10000, verbose=1,)
sample_df = df.sample(n=2000, random_state=2024).copy()
sample_df[['tx', 'ty']] = tsne.fit_transform(X=sample_df[['Area', 'No. of Bedrooms']])
sample_df['log Price'] = sample_df['Price'].apply(func=log)
scatter(data_frame=sample_df, x='tx', y='ty', color='log Price', hover_name='Location', hover_data=['Area', 'No. of Bedrooms', 'Price'], height=900).show()
scatter(data_frame=sample_df, x='tx', y='ty', color='City', hover_name='Location', hover_data=['Area', 'No. of Bedrooms', 'Price'], height=900).show()

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 2000 samples in 0.001s...
[t-SNE] Computed neighbors for 2000 samples in 0.024s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2000
[t-SNE] Computed conditional probabilities for sample 2000 / 2000
[t-SNE] Mean sigma: 3.187933
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.684887
[t-SNE] KL divergence after 10000 iterations: 0.063338


As the histograms above show, our dataset is dominated by low-priced houses, so price plots tend to be mostly monochrome. The t-SNE plot above manages to get most of the high-priced properties into one cluster, but it is difficult to tell whether we should expect high or low accuracy from our regression model.
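The monochrome effect is the reason the t-SNE plot is colored by log Price rather than raw Price. A minimal sketch of the idea, assuming a made-up, right-skewed price list (the `prices` values and the `share_in_bottom_tenth` helper are hypothetical): it measures how many points land in the lowest tenth of a linear color range before and after taking logs.

```python
from math import log

# Hypothetical right-skewed prices: many cheap listings, one very expensive.
prices = [2_000_000] * 6 + [5_000_000] * 3 + [300_000_000]

def share_in_bottom_tenth(values):
    """Fraction of values in the lowest 10% of the [min, max] color range."""
    lo, hi = min(values), max(values)
    cutoff = lo + 0.1 * (hi - lo)
    return sum(v <= cutoff for v in values) / len(values)

raw_share = share_in_bottom_tenth(prices)
log_share = share_in_bottom_tenth([log(p) for p in prices])

print('raw prices in bottom tenth of range: {:.0%}'.format(raw_share))
print('log prices in bottom tenth of range: {:.0%}'.format(log_share))
```

The more points that share the bottom of the range, the more of the plot is drawn in a single color; the log transform spreads the cheap listings across more of the scale.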

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(df[['Area', 'No. of Bedrooms']], df['Price'].values, test_size=0.2, random_state=2024)
model = LinearRegression().fit(X=X_train, y=y_train)

y_pred = model.predict(X=X_test)
print('R2: {:6.4f}'.format(r2_score(y_true=y_test, y_pred=y_pred)))
scatter(x=y_test, y=y_pred, log_x=True, log_y=True).show()


R2: 0.0940


Our regression model without the city data gives rather poor results: an R2 of about 0.09 on the test set, so Area and No. of Bedrooms alone explain only about 9% of the variance in price.
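For reference, R2 is the fraction of variance in the test targets that the model explains, 1 - SS_res / SS_tot, which is the quantity `r2_score` reports. A minimal hand-rolled sketch (the `r2` helper and the toy numbers are illustrative only, not taken from the dataset):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]      # toy targets, not real prices
perfect = list(y_true)             # predictions that match exactly
baseline = [6.0] * len(y_true)     # always predict the mean of y_true
reversed_order = [9.0, 7.0, 5.0, 3.0]  # predictions in the wrong order

print('perfect:  {:.2f}'.format(r2(y_true, perfect)))
print('baseline: {:.2f}'.format(r2(y_true, baseline)))
print('reversed: {:.2f}'.format(r2(y_true, reversed_order)))
```

Always predicting the mean scores zero and a worse model goes negative, so a score near 0.09 means the regression does only slightly better than predicting the average price for every house.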

In [9]:
t_df = pd.concat(objs=[pd.get_dummies(data=df[['City']]), df[['Area', 'No. of Bedrooms', 'Price']]], axis=1)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(t_df.drop(columns=['Price']), t_df['Price'].values, test_size=0.2, random_state=2024)
t_model = LinearRegression().fit(X=Xt_train, y=yt_train)

yt_pred = t_model.predict(X=Xt_test)
print('R2: {:6.4f}'.format(r2_score(y_true=yt_test, y_pred=yt_pred)))
scatter(x=yt_test, y=yt_pred, log_x=True, log_y=True).show()

R2: 0.1299


Introducing dummy variables for the cities improves our R2 by about 38% in relative terms (from 0.0940 to 0.1299) but by only about 3.6 percentage points in absolute terms.
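The relative and absolute figures follow directly from the two printed scores. A short worked check, with the variable names chosen here purely for illustration:

```python
r2_without_city = 0.0940  # printed by cell 8
r2_with_city = 0.1299     # printed by cell 9

# Absolute gain is the plain difference; relative gain divides by the baseline.
absolute_gain = r2_with_city - r2_without_city
relative_gain = absolute_gain / r2_without_city

print('absolute gain: {:.4f}'.format(absolute_gain))
print('relative gain: {:.1%}'.format(relative_gain))
```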

In [10]:
histogram(x=t_df.columns[:-1], y=t_model.coef_)

We are clearly missing some important data here.