# Predicting House Prices
### Objectives
- Understand the problem of overfitting when predicting house prices
- Apply regularization techniques (L1, L2) to reduce overfitting

Data source: https://github.com/dipalira/Melbourne-Housing-Data-Kaggle

### Import Required Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

### Read the dataset (CSV file)

In [5]:
df = pd.read_csv('Melbourne_housing_FULL.csv')
df.head()
df.shape
# df.columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         34857 non-null  object 
 1   Address        34857 non-null  object 
 2   Rooms          34857 non-null  int64  
 3   Type           34857 non-null  object 
 4   Price          27247 non-null  float64
 5   Method         34857 non-null  object 
 6   SellerG        34857 non-null  object 
 7   Date           34857 non-null  object 
 8   Distance       34856 non-null  float64
 9   Postcode       34856 non-null  float64
 10  Bedroom2       26640 non-null  float64
 11  Bathroom       26631 non-null  float64
 12  Car            26129 non-null  float64
 13  Landsize       23047 non-null  float64
 14  BuildingArea   13742 non-null  float64
 15  YearBuilt      15551 non-null  float64
 16  CouncilArea    34854 non-null  object 
 17  Lattitude      26881 non-null  float64
 18  Longti

In [None]:
df.shape

In [None]:
df.nunique()

In [6]:
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method',
               'SellerG', 'Regionname', 'Propertycount',
               'Distance','CouncilArea', 'Bedroom2',
               'Bathroom','Car','Landsize','BuildingArea',
                'Price']
df = df[cols_to_use].copy()  # work on a copy so later column assignments do not raise SettingWithCopyWarning
df.shape

(34857, 15)

### Missing value handling

In [7]:
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [8]:
cols_to_fill_zero = ['Propertycount', 'Distance','Bedroom2','Bathroom','Car']
df[cols_to_fill_zero] = df[cols_to_fill_zero].fillna(0)
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        0
Distance             0
CouncilArea          3
Bedroom2             0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [None]:
df.info()

In [10]:
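# Fill missing area values with the column mean, then drop the few rows
# that still contain missing values (mostly rows without a Price).
df['Landsize'] = df['Landsize'].fillna(df['Landsize'].mean())
df['BuildingArea'] = df['BuildingArea'].fillna(df['BuildingArea'].mean())

df.dropna(inplace=True)
df.isna().sum()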

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64
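
Mean imputation, as in the `fillna(... .mean())` lines of the cell above, can be sketched on a small made-up frame. The values below are illustrative assumptions, not rows from the dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Landsize": [200.0, np.nan, 400.0],
    "BuildingArea": [100.0, 150.0, np.nan],
})

# Replace each missing value with the mean of its own column.
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```

Mean imputation keeps more rows than dropping them. To avoid leaking test information, the mean would normally be computed on the training split only.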

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27244 entries, 1 to 34856
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         27244 non-null  object 
 1   Rooms          27244 non-null  int64  
 2   Type           27244 non-null  object 
 3   Method         27244 non-null  object 
 4   SellerG        27244 non-null  object 
 5   Regionname     27244 non-null  object 
 6   Propertycount  27244 non-null  float64
 7   Distance       27244 non-null  float64
 8   CouncilArea    27244 non-null  object 
 9   Bedroom2       27244 non-null  float64
 10  Bathroom       27244 non-null  float64
 11  Car            27244 non-null  float64
 12  Landsize       27244 non-null  float64
 13  BuildingArea   27244 non-null  float64
 14  Price          27244 non-null  float64
dtypes: float64(8), int64(1), object(6)
memory usage: 3.3+ MB


### Encoding Categorical Features

In [13]:
df = pd.get_dummies(df,drop_first = True)
df.head()
df.columns

Index(['Rooms', 'Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'Price', 'Suburb_Aberfeldie',
       ...
       'CouncilArea_Moorabool Shire Council',
       'CouncilArea_Moreland City Council',
       'CouncilArea_Nillumbik Shire Council',
       'CouncilArea_Port Phillip City Council',
       'CouncilArea_Stonnington City Council',
       'CouncilArea_Whitehorse City Council',
       'CouncilArea_Whittlesea City Council',
       'CouncilArea_Wyndham City Council', 'CouncilArea_Yarra City Council',
       'CouncilArea_Yarra Ranges Shire Council'],
      dtype='object', length=745)
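
The effect of `drop_first = True` is easier to see on a tiny example. The sketch below uses a made-up frame (the values are assumptions for illustration) and compares the encoded columns with and without that option.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset.
toy = pd.DataFrame({"Rooms": [2, 3, 1, 4], "Type": ["h", "u", "t", "h"]})

# Without drop_first: one indicator column per category (h, t, u).
full = pd.get_dummies(toy)
# With drop_first: the first category (h) is dropped and acts as the baseline.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per categorical column removes the exact linear dependence among the indicator columns, which matters for an unregularized linear model.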

### Separating Features, Target, Train, Test

In [14]:
X = df.drop('Price',axis = 1)
y = df[['Price']]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 2)

print(df.shape,'\n', X_train.shape, X_test.shape, '\n', y_train.shape, y_test.shape)

(27244, 745) 
 (19070, 744) (8174, 744) 
 (19070, 1) (8174, 1)


### Linear Regression: Fit the model

In [15]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [16]:
lr.score(X_train, y_train)

0.6827792395792723

In [17]:
lr.score(X_test, y_test)

0.13853683161537644

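The training R² (about 0.68) is far higher than the test R² (about 0.14), so the model is overfitting: it fits the training data well but generalizes poorly to unseen data.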
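
One plausible reason for this gap is the large number of features (744, most of them dummy columns) relative to the information in the data. The sketch below reproduces the same pattern on synthetic data from `make_regression`. The data and the `*_demo` names are assumptions for illustration and are unrelated to the housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative assumption): few rows, many features.
X_demo, y_demo = make_regression(n_samples=120, n_features=100,
                                 n_informative=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=2)

# With more features than training rows, least squares can fit the
# training set almost perfectly, noise included.
demo_lr = LinearRegression().fit(X_tr, y_tr)
train_r2 = demo_lr.score(X_tr, y_tr)
test_r2 = demo_lr.score(X_te, y_te)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large difference between the two scores is the usual symptom of overfitting. The regularized models in the following sections aim to narrow it.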

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
# mse_test = mean_squared_error(y_test,lr.predict(X_test))
# mse_train = mean_squared_error(y_train,lr.predict(X_train))
# print(round(mse_train,2), np.round(mse_test,2))

In [None]:
r2_score(y_test, lr.predict(X_test))

## L1 Regularization: LASSO 

In [23]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 50, max_iter = 1000, tol = 0.1)
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)

0.6636280170612746

In [24]:
lasso.score(X_train, y_train)

0.6767149418617553
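
The L1 penalty can set coefficients exactly to zero, which acts as a built-in form of feature selection. The sketch below illustrates this on synthetic data. The data, the `alpha` value, and the `*_demo` names are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): 50 features, only 5 informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0, random_state=0)

demo_lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Count the coefficients that the L1 penalty did not shrink to zero.
n_nonzero = int(np.sum(demo_lasso.coef_ != 0))
print(f"{n_nonzero} of {X_demo.shape[1]} coefficients are non-zero")
```

Increasing `alpha` strengthens the penalty and typically zeroes out more coefficients.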

## L2 Regularization: Ridge

In [29]:
from sklearn.linear_model import Ridge
rr = Ridge(alpha = 5, max_iter = 100, tol = 0.1)
rr.fit(X_train,y_train)
rr.score(X_test, y_test)

0.6743855819360092

In [26]:
rr.score(X_train, y_train)

0.6622376739684328
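
Ridge shrinks coefficients toward zero without removing them. A minimal NumPy sketch of the closed-form solution $w = (X^\top X + \alpha I)^{-1} X^\top y$ shows the shrinkage. It assumes no intercept term and uses synthetic data, and `ridge_closed_form` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))                 # synthetic features (assumption)
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])     # made-up coefficients
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; no intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)))
         for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:>6}: ||w|| = {n:.3f}")
```

The coefficient norm decreases as `alpha` grows, and `alpha = 0` recovers ordinary least squares.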

## Further Considerations

- Feature Selection
- Feature Scaling
- Categorical Encoding
- Missing Value Imputation
- Hyperparameter Tuning
- Cross Validation
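
Several of the items above (feature scaling, hyperparameter tuning, cross validation) can be combined in one scikit-learn pipeline. The following is a minimal sketch on synthetic data. The alpha grid, the fold count, and the `*_demo` names are assumptions for illustration, not tuned values for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 noise=15.0, random_state=0)

# Scaling inside the pipeline is refit on each training fold,
# so no information from the validation fold leaks into it.
pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names the step after the lowercased class: "ridge".
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print("best alpha:", best_alpha)
print("mean CV R^2:", round(search.best_score_, 3))
```

On the housing data, the same pattern would apply with `X_train` and `y_train` in place of the synthetic arrays, keeping the test split aside for a final check.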