# House Price Prediction

### Problem Statement:
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and resell them at a higher price.
The company is looking at prospective properties to buy to enter the market.
The company wants to know:
   - Which variables are significant in predicting the price of a house, and

   - How well those variables describe the price of a house.

## Reading and Understanding the Data

In [1]:
# Importing required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
# Importing train.csv and reading the data.

house_df = pd.read_csv("train.csv", na_values="NAN", keep_default_na=False)
house_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


Understanding the columns.

In [3]:
# Getting the count of the rows and cols in the data set.
house_df.shape

(1460, 81)

In [4]:
#Getting more information about each column

print(house_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Id             1460 non-null   int64 
 1   MSSubClass     1460 non-null   int64 
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1460 non-null   object
 4   LotArea        1460 non-null   int64 
 5   Street         1460 non-null   object
 6   Alley          1460 non-null   object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64 
 18  OverallCond    1460 non-null

In [5]:
# Observing the statistical data.

house_df.describe()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,10516.828082,6.099315,5.575342,1971.267808,1984.865753,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,9981.264932,1.382997,1.112799,30.202904,20.645407,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,9478.5,6.0,5.0,1973.0,1994.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,11601.5,7.0,6.0,2000.0,2004.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,215245.0,10.0,9.0,2010.0,2010.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [6]:
house_df['GarageYrBlt'] = pd.to_numeric(house_df['GarageYrBlt'], errors ='coerce') 

In [None]:
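# Plotting a heat map of the correlations between the numeric columns.
# Object columns are excluded first, since corr() raises an error on them in recent pandas versions.

plt.figure(figsize = (20, 15))
cor = house_df.select_dtypes(include='number').corr()
sns.heatmap(cor, annot = True, cmap="YlGnBu")
plt.show()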

In [None]:
# Heat map for the top 10 features related to SalePrice

corrmat = house_df.select_dtypes(include='number').corr()
plt.figure(figsize=(20,10))
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(house_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values, cmap="YlGnBu")
plt.show()

Understanding the top 10 features that impact the sales price of the houses.
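
As a compact illustration of the same idea, the sketch below ranks numeric columns by their correlation with `SalePrice`. The toy frame is built from the five rows shown by `house_df.head()` above; the names `toy`, `corr_with_price` and `top` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame using the first five rows printed by house_df.head()
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "OverallQual": [7, 6, 7, 7, 8],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "MoSold": [2, 5, 9, 2, 12],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],  # object column, excluded below
})

# Correlation of every numeric column with the target, target itself removed
corr_with_price = (
    toy.select_dtypes(include="number").corr()["SalePrice"].drop("SalePrice")
)

# Keep the k most positively correlated columns (k = 2 here, 10 in the notebook)
top = corr_with_price.sort_values(ascending=False).head(2)
print(top)
```

Five rows are far too few to draw conclusions from; the point is only the mechanics of selecting the strongest correlations before plotting them.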

In [None]:
#Total categorical features in the data.
categorical_features=[feature for feature in house_df.columns if house_df[feature].dtypes=='O']

print(categorical_features)
print(len(categorical_features),"Categorical columns")

Understanding the categorical data and the way to encode them.

In [None]:
#Total numerical features in the data.

numeric_features = house_df.dtypes[house_df.dtypes != "object"].index
print(numeric_features)
print(len(numeric_features),"Numerical columns")

Understanding the numerical data

In [None]:
house_df = house_df.drop("Id", axis=1) # Dropping the Id column, as it has no bearing on the sale price.


In [None]:
# Identifying the year related columns

date_feature=[colVal for colVal in numeric_features if 'Year' in colVal or 'Yr' in colVal]
date_feature

Understanding the relation of the year columns with respect to sales price of the house.

In [None]:
# Getting the unique values.
for uniqueVal in date_feature:
    print(len(house_df[uniqueVal].unique()),'Unique value {}'.format(uniqueVal))

In [None]:
house_df.groupby('YearBuilt')['SalePrice'].median().plot()
plt.show()

In [None]:
house_df.groupby('YearRemodAdd')['SalePrice'].median().plot()
plt.show()

In [None]:
house_df.groupby('GarageYrBlt')['SalePrice'].median().plot()
plt.show()

In [None]:
house_df.groupby('YrSold')['SalePrice'].median().plot()
plt.show()

The plots above do not show a clear relationship between the raw year columns and the sale price, so we derive some new columns from them.

# Data preparation

Let us derive the new columns and then encode the categorical values.

In [None]:
# Creating new age-related columns from the year columns.

house_df['AgeWhenSold'] = house_df['YrSold'] - house_df['YearBuilt']
house_df['YearsSinceRemod'] = house_df['YrSold'] - house_df['YearRemodAdd']
house_df = house_df.drop(['YrSold','YearBuilt', 'YearRemodAdd'], axis=1)
house_df.head()
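
A small worked example of the derived columns, using the year values of the first three rows shown earlier. The frame `toy` is hypothetical; the non-negativity check at the end is an added sanity check, not something the notebook does.

```python
import pandas as pd

# Year values of the first three houses from house_df.head()
toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2008],
    "YearBuilt": [2003, 1976, 2001],
    "YearRemodAdd": [2003, 1976, 2002],
})

# Same derivation as in the cell above
toy["AgeWhenSold"] = toy["YrSold"] - toy["YearBuilt"]
toy["YearsSinceRemod"] = toy["YrSold"] - toy["YearRemodAdd"]
toy = toy.drop(["YrSold", "YearBuilt", "YearRemodAdd"], axis=1)

# Sanity check (an addition): a house cannot be sold before it is built
ages_valid = bool((toy["AgeWhenSold"] >= 0).all())
print(toy)
```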

In [None]:
# grouping by frequency 

fq = house_df.groupby('MSZoning').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('MSZoning')] = house_df['MSZoning'].map(fq)   
# drop original column. 
house_df = house_df.drop(['MSZoning'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10) 
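
The frequency-encoding steps above are repeated for many columns in the cells that follow. One possible way to factor them out is sketched here; `freq_encode` is a hypothetical helper, not part of the notebook, and it uses `value_counts(normalize=True)`, which gives the same result as `groupby(...).size()/len(df)` as long as the column has no real NaN values.

```python
import pandas as pd

def freq_encode(df, column):
    """Replace `column` with the relative frequency of each of its categories."""
    freq = df[column].value_counts(normalize=True)
    out = df.copy()
    out["{}_freq_encode".format(column)] = out[column].map(freq)
    return out.drop(columns=[column])

toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250, 9550],
})
encoded = freq_encode(toy, "MSZoning")
print(encoded)
```

With such a helper, each of the later cells would reduce to a single call such as `house_df = freq_encode(house_df, 'LotShape')`, with the bar plot drawn separately if it is still wanted.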

In [None]:
# Converting the categorical variable 'Street' to get the dummy variables.

street = pd.get_dummies(house_df['Street'], drop_first = True)
house_df= pd.concat([house_df,street], axis = 1)
# drop original column. 
house_df = house_df.drop(['Street'], axis = 1)  
house_df.head(10) 

In [None]:
# grouping by frequency

fq = house_df.groupby('LotShape').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('LotShape')] = house_df['LotShape'].map(fq)   
# drop original column. 
house_df = house_df.drop(['LotShape'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10) 

In [None]:
# grouping by frequency 

fq = house_df.groupby('LandContour').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('LandContour')] = house_df['LandContour'].map(fq)   
# drop original column. 
house_df = house_df.drop(['LandContour'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Converting the categorical variable 'Utilities' to get the dummy variables.
utilities = pd.get_dummies(house_df['Utilities'], drop_first = True)
house_df= pd.concat([house_df,utilities], axis = 1)
# drop original column. 
house_df = house_df.drop(['Utilities'], axis = 1)  
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('LotConfig').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('LotConfig')] = house_df['LotConfig'].map(fq)   
# drop original column. 
house_df = house_df.drop(['LotConfig'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('LandSlope').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('LandSlope')] = house_df['LandSlope'].map(fq)   
# drop original column. 
house_df = house_df.drop(['LandSlope'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Neighborhood').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Neighborhood')] = house_df['Neighborhood'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Neighborhood'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Condition1').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Condition1')] = house_df['Condition1'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Condition1'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Condition2').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Condition2')] = house_df['Condition2'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Condition2'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['Alley'] = house_df.Alley.replace({"NA":"NoAlleyAccess"})
# grouping by frequency 
fq = house_df.groupby('Alley').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Alley')] = house_df['Alley'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Alley'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('BldgType').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('BldgType')] = house_df['BldgType'].map(fq)   
# drop original column. 
house_df = house_df.drop(['BldgType'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('HouseStyle').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('HouseStyle')] = house_df['HouseStyle'].map(fq)   
# drop original column. 
house_df = house_df.drop(['HouseStyle'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('RoofStyle').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('RoofStyle')] = house_df['RoofStyle'].map(fq)   
# drop original column. 
house_df = house_df.drop(['RoofStyle'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('RoofMatl').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('RoofMatl')] = house_df['RoofMatl'].map(fq)   
# drop original column. 
house_df = house_df.drop(['RoofMatl'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Exterior1st').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Exterior1st')] = house_df['Exterior1st'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Exterior1st'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Exterior2nd').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Exterior2nd')] = house_df['Exterior2nd'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Exterior2nd'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('MasVnrType').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('MasVnrType')] = house_df['MasVnrType'].map(fq)   
# drop original column. 
house_df = house_df.drop(['MasVnrType'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Encoding the ExterQual variable.
ExterQual = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1}
house_df['ExterQualEnc'] = house_df.ExterQual.map(ExterQual)
# drop original column. 
house_df = house_df.drop(['ExterQual'], axis = 1)  
house_df.head(10)

In [None]:
# Encoding the ExterCond variable.
ExterCond = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1}
house_df['ExterCondEnc'] = house_df.ExterCond.map(ExterCond)
# drop original column. 
house_df = house_df.drop(['ExterCond'], axis = 1)  
house_df.head(10)
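
The quality columns all share the same ordered scale, so the mapping can be written once and reused. The sketch below assumes a hypothetical helper `ordinal_encode` and a shared `QUALITY_SCALE` dictionary; the optional `missing_label` argument stands in for the "NA means the feature is absent" handling used for the basement, fireplace and garage columns.

```python
import pandas as pd

# Shared ordered scale: Excellent > Good > Typical/Average > Fair > Poor
QUALITY_SCALE = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

def ordinal_encode(df, column, mapping, missing_label=None):
    """Map `column` to integers; `missing_label` (e.g. "NA") is encoded as 0."""
    scale = dict(mapping)
    if missing_label is not None:
        scale[missing_label] = 0
    out = df.copy()
    out[column + "Enc"] = out[column].map(scale)
    return out.drop(columns=[column])

toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "BsmtQual": ["Gd", "NA", "TA"],
})
toy = ordinal_encode(toy, "ExterQual", QUALITY_SCALE)
toy = ordinal_encode(toy, "BsmtQual", QUALITY_SCALE, missing_label="NA")
print(toy)
```

Any label that is not in the mapping becomes NaN after `map`, so checking `toy.isna().sum()` afterwards is a cheap way to catch unexpected categories.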

In [None]:
# grouping by frequency 
fq = house_df.groupby('Foundation').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Foundation')] = house_df['Foundation'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Foundation'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['BsmtQual'] = house_df.BsmtQual.replace({"NA":"NoBasement"})
# Encoding the BsmtQual variable.
BsmtQual = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1,'NoBasement':0}
house_df['BsmtQualEnc'] = house_df.BsmtQual.map(BsmtQual)
# drop original column. 
house_df = house_df.drop(['BsmtQual'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['BsmtCond'] = house_df.BsmtCond.replace({"NA":"NoBasement"})
# Encoding the BsmtCond variable.
BsmtCond = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1,'NoBasement':0}
house_df['BsmtCondEnc'] = house_df.BsmtCond.map(BsmtCond)
# drop original column. 
house_df = house_df.drop(['BsmtCond'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['BsmtExposure'] = house_df.BsmtExposure.replace({"NA":"NoBasement"})
# Encoding the BsmtExposure variable.
BsmtExposure = {'Gd':4 , 'Av': 3 , 'Mn':2, 'No':1,'NoBasement':0}
house_df['BsmtExposureEnc'] = house_df.BsmtExposure.map(BsmtExposure)
# drop original column. 
house_df = house_df.drop(['BsmtExposure'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['BsmtFinType1'] = house_df.BsmtFinType1.replace({"NA":"NoBasement"})
# Encoding the BsmtFinType1 variable.
BsmtFinType1 = {'GLQ':6,'ALQ':5,'BLQ':4 , 'Rec': 3 , 'LwQ':2, 'Unf':1,'NoBasement':0}
house_df['BsmtFinType1Enc'] = house_df.BsmtFinType1.map(BsmtFinType1)
# drop original column. 
house_df = house_df.drop(['BsmtFinType1'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.
house_df['BsmtFinType2'] = house_df.BsmtFinType2.replace({"NA":"NoBasement"})
# Encoding the BsmtFinType2 variable.
BsmtFinType2 = {'GLQ':6,'ALQ':5,'BLQ':4 , 'Rec': 3 , 'LwQ':2, 'Unf':1,'NoBasement':0}
house_df['BsmtFinType2Enc'] = house_df.BsmtFinType2.map(BsmtFinType2)
# drop original column. 
house_df = house_df.drop(['BsmtFinType2'], axis = 1)  
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Heating').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Heating')] = house_df['Heating'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Heating'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Encoding the HeatingQC variable.
HeatingQC = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1}
house_df['HeatingQCEnc'] = house_df.HeatingQC.map(HeatingQC)
# drop original column. 
house_df = house_df.drop(['HeatingQC'], axis = 1)  
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('CentralAir').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('CentralAir')] = house_df['CentralAir'].map(fq)   
# drop original column. 
house_df = house_df.drop(['CentralAir'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Electrical').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Electrical')] = house_df['Electrical'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Electrical'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Encoding the KitchenQual variable.
KitchenQual = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1}
house_df['KitchenQualEnc'] = house_df.KitchenQual.map(KitchenQual)
# drop original column. 
house_df = house_df.drop(['KitchenQual'], axis = 1)  
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('Functional').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Functional')] = house_df['Functional'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Functional'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['FireplaceQu'] = house_df.FireplaceQu.replace({"NA":"NoFireplace"})
# Encoding the FireplaceQu variable.
FireplaceQu = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1,'NoFireplace':0}
house_df['FireplaceQuEnc'] = house_df.FireplaceQu.map(FireplaceQu)
# drop original column. 
house_df = house_df.drop(['FireplaceQu'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['GarageType'] = house_df.GarageType.replace({"NA":"NoGarage"})
# grouping by frequency 
fq = house_df.groupby('GarageType').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('GarageType')] = house_df['GarageType'].map(fq)   
# drop original column. 
house_df = house_df.drop(['GarageType'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['GarageFinish'] = house_df.GarageFinish.replace({"NA":"NoGarage"})
# grouping by frequency 
fq = house_df.groupby('GarageFinish').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('GarageFinish')] = house_df['GarageFinish'].map(fq)   
# drop original column. 
house_df = house_df.drop(['GarageFinish'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['GarageQual'] = house_df.GarageQual.replace({"NA":"NoGarage"})
# Encoding the GarageQual variable.
GarageQual = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1,'NoGarage':0}
house_df['GarageQualEnc'] = house_df.GarageQual.map(GarageQual)
# drop original column. 
house_df = house_df.drop(['GarageQual'], axis = 1)  
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['GarageCond'] = house_df.GarageCond.replace({"NA":"NoGarage"})
# Encoding the GarageCond variable.
GarageCond = {'Ex':5 , 'Gd': 4 , 'TA':3, 'Fa':2, 'Po':1,'NoGarage':0}
house_df['GarageCondEnc'] = house_df.GarageCond.map(GarageCond)
# drop original column. 
house_df = house_df.drop(['GarageCond'], axis = 1)  
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('PavedDrive').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('PavedDrive')] = house_df['PavedDrive'].map(fq)   
# drop original column. 
house_df = house_df.drop(['PavedDrive'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['PoolQC'] = house_df.PoolQC.replace({"NA":"NoPool"})
# Encoding the PoolQC variable.
PoolQC = {'Ex':4 , 'Gd': 3 , 'TA':2, 'Fa':1,'NoPool':0}
house_df['PoolQCEnc'] = house_df.PoolQC.map(PoolQC)
# drop original column. 
house_df.drop(['PoolQC'], axis = 1, inplace=True)  
house_df.head(10)

In [None]:
# Replace the NA values.

house_df['Fence'] = house_df.Fence.replace({"NA":"NoFence"})
# grouping by frequency 
fq = house_df.groupby('Fence').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('Fence')] = house_df['Fence'].map(fq)   
# drop original column. 
house_df = house_df.drop(['Fence'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('MiscFeature').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('MiscFeature')] = house_df['MiscFeature'].map(fq)   
# drop original column. 
house_df = house_df.drop(['MiscFeature'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('SaleType').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('SaleType')] = house_df['SaleType'].map(fq)   
# drop original column. 
house_df = house_df.drop(['SaleType'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# grouping by frequency 
fq = house_df.groupby('SaleCondition').size()/len(house_df)    
# mapping values to dataframe 
house_df.loc[:, "{}_freq_encode".format('SaleCondition')] = house_df['SaleCondition'].map(fq)   
# drop original column. 
house_df = house_df.drop(['SaleCondition'], axis = 1)  
fq.plot.bar(stacked = True)   
house_df.head(10)

In [None]:
# As we saw from the heat map, there is a very high correlation between GarageYrBlt and YearBuilt,
# so we drop GarageYrBlt, which also has many missing values.
house_df = house_df.drop(['GarageYrBlt'], axis = 1)

Handling the missing values in the columns below.

In [None]:
house_df['MasVnrArea'] = pd.to_numeric(house_df['MasVnrArea'], errors ='coerce')
house_df['LotFrontage'] = pd.to_numeric(house_df['LotFrontage'], errors ='coerce') 
house_df.head()

In [None]:
# Filling the missing values with the column mean.
house_df['MasVnrArea'] = house_df['MasVnrArea'].fillna(house_df['MasVnrArea'].mean())
house_df['LotFrontage'] = house_df['LotFrontage'].fillna(house_df['LotFrontage'].mean())
house_df.head(10)
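
The two steps above (coerce to numeric, then fill with the mean) can be traced on a tiny example. The frame `toy` is hypothetical; its values are the first `LotFrontage` entries from the data plus one "NA" string, which is how missing values appear after reading the file with `keep_default_na=False`.

```python
import pandas as pd

toy = pd.DataFrame({"LotFrontage": ["65", "NA", "80", "68"]})

# Strings that cannot be parsed as numbers ("NA" here) become NaN
toy["LotFrontage"] = pd.to_numeric(toy["LotFrontage"], errors="coerce")
n_missing = int(toy["LotFrontage"].isna().sum())

# mean() skips NaN, so the fill value is (65 + 80 + 68) / 3 = 71.0
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())
print(n_missing, toy["LotFrontage"].tolist())
```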

In [None]:
# Verifying that all columns are of numeric type.
house_df.info()

## Popping the target variable 'SalePrice' from the dataframe into y; the remaining columns form X.

In [None]:
y = house_df.pop('SalePrice')
X = house_df

### Rescaling the variables so that the coefficients obtained are all on the same scale

In [None]:
# scaling the features
from sklearn.preprocessing import scale

# storing column names in cols, since column names are (annoyingly) lost after 
# scaling (the df is converted to a numpy array)
cols = X.columns
X = pd.DataFrame(scale(X))
y = pd.DataFrame(scale(y))

X.columns = cols
X.columns

### Splitting the Data into Training and Testing Sets

In [None]:
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.7,
                                                    test_size = 0.3, random_state=100)
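
The notebook scales the full data set before splitting. A commonly used variant, sketched below on synthetic data, splits first and fits the scaler on the training rows only, so that no statistics from the test rows influence the training features. This is an alternative for comparison, not the approach taken above; `X_demo`, `y_demo` and the other names are hypothetical, and `StandardScaler` is swapped in for `scale` because it can be fitted once and reused.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(loc=50, scale=10, size=(100, 3)),
                      columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=100))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100)

# Fit on the training rows only, then apply the same transform to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)
print(X_tr_s.shape, X_te_s.shape)
```

Passing `columns=` and `index=` when rebuilding the frame also keeps the column names, which avoids having to restore them by hand afterwards.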

## Model Building and Evaluation

In [None]:
# Running RFE with the number of features to select equal to 50
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm  
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed as a keyword argument in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=50)             # running RFE
rfe = rfe.fit(X_train, y_train)
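
To make the behaviour of RFE easier to see, here is a minimal sketch on synthetic data where only two of six features actually drive the target. All names (`X_demo`, `y_demo`, `selector`, `selected`) are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly refits the model and drops the feature with the smallest coefficient
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_)  # indices of the retained features
print(selected, selector.ranking_)
```

`support_` is a boolean mask over the input columns and `ranking_` assigns rank 1 to every retained feature, which is what the `zip(...)` listing in the next cell displays for the housing data.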

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# Columns among the top 50 features selected by RFE.

col = X_train.columns[rfe.support_]
col

In [None]:
# Creating X_train and X_test dataframe with RFE selected variables
X_train_rfe = X_train[col]
X_test_rfe = X_test[col]

In [None]:
from sklearn.metrics import r2_score
# linear regression
lm1 = LinearRegression()
lm1.fit(X_train_rfe, y_train)

# predict
y_train_pred = lm1.predict(X_train_rfe)
#r2_score for train data.
r2_score(y_true=y_train, y_pred=y_train_pred)

In [None]:
#r2_score for test data.
y_test_pred = lm1.predict(X_test_rfe)
r2_score(y_true=y_test, y_pred=y_test_pred)

As we can observe, after RFE the r2_score values for the train and test data are very close, which suggests the model is not overfitting.
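
For reference, `r2_score` computes 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on four made-up values; `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, r2_score(y_true_demo, y_pred_demo))
```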

## Ridge Regression

In [None]:
#Ridge


# list of alphas to tune

params = {'alpha': [0.001, 0.01, 1.0, 5.0, 10.0]}
ridge = Ridge()
# cross validation
folds = 5
model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
model_cv.fit(X_train_rfe, y_train) 

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results = cv_results.sort_values(by="mean_test_score",ascending=False)
cv_results.head(20)

In [None]:
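# plotting mean test and train scores against alpha
# (cast to float, not int: an integer cast would turn alpha = 0.001 and 0.01 into 0)
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')
# sort by alpha so that the lines are drawn from left to right
cv_results = cv_results.sort_values(by='param_alpha')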

# plotting
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()

From the graph above we can see that the test score rises gradually with alpha over the tested range; it might fall again for higher values of alpha.

In [None]:
alpha = 10
ridge = Ridge(alpha=alpha)

ridge.fit(X_train_rfe, y_train)
ridge.coef_

As the graph and table show, the mean test score is highest for alpha = 10, so we take that as the optimal value.
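
Instead of reading the best alpha off the table, it can also be taken directly from the fitted search object. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `search`); only `best_params_`, `best_score_` and `best_estimator_` are relied on, all of which are standard `GridSearchCV` attributes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the housing features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
true_coefs = np.array([1.5, 0.0, -2.0, 0.5, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=120)

alphas = [0.001, 0.01, 1.0, 5.0, 10.0]
search = GridSearchCV(Ridge(), {"alpha": alphas},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]   # alpha with the highest mean test score
best_model = search.best_estimator_         # Ridge refitted on all rows with that alpha
print(best_alpha, search.best_score_)
```

Because the scoring is the negative mean absolute error, higher (closer to zero) is better, which is why the highest mean test score identifies the preferred alpha.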

In [None]:
s=pd.Series(ridge.coef_[0], index=X_train_rfe.columns)
s.sort_values(ascending=False)

From the Ridge regression we can see that the top 5 features which have a positive impact on the price are OverallQual, GrLivArea, 2ndFlrSF, 1stFlrSF, GarageCars.

In [None]:
# predict
y_train_pred = ridge.predict(X_train_rfe)
print(r2_score(y_true=y_train, y_pred=y_train_pred))
y_test_pred = ridge.predict(X_test_rfe)
print(r2_score(y_true=y_test, y_pred=y_test_pred))

As we can observe, with Ridge regression the r2_score values for the train and test data are again very close.

## Lasso Regression

In [None]:
#Lasso

lasso = Lasso()

# cross validation
model_cv = GridSearchCV(estimator = lasso, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

model_cv.fit(X_train_rfe, y_train) 

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
#cv_results = cv_results[cv_results['param_alpha']<=200]
cv_results = cv_results.sort_values(by="mean_test_score",ascending=False)
cv_results.head()

In [None]:
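# plotting mean test and train scores against alpha
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')
# sort by alpha so that the lines are drawn from left to right
cv_results = cv_results.sort_values(by='param_alpha')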

# plotting
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')

plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='lower right')
plt.show()

From the above graph we can see that the train and test lines are very close, so the gap between the train and test error is small.

In [None]:
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_rfe, y_train)

lasso.coef_

As the table shows, the mean test score is highest for alpha = 0.01, so we take that as the optimal value.
We can also see that Lasso has performed feature selection: a few features have a coefficient of 0.
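
The feature-selection effect of Lasso can be shown on synthetic data in which only two of five features matter. Names such as `X_demo`, `coefs`, `kept` and `dropped` are hypothetical, and the alpha used here is chosen for the toy data, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only features "a" and "c" influence the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y_demo = 2.0 * X_demo["a"] - 1.0 * X_demo["c"] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X_demo, y_demo)
coefs = pd.Series(model.coef_, index=X_demo.columns)

kept = coefs[coefs != 0].index.tolist()      # features Lasso retained
dropped = coefs[coefs == 0].index.tolist()   # features shrunk exactly to zero
print(kept, dropped)
```

The L1 penalty sets the coefficients of uninformative features exactly to zero, whereas Ridge only shrinks them towards zero; the same filter on `lasso.coef_` would list the features removed in the housing model.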

In [None]:
s=pd.Series(lasso.coef_, index=X_train_rfe.columns)
s.sort_values(ascending=False)

From the Lasso regression we can see that the top 5 features which have a positive impact on the price are OverallQual, GrLivArea, BsmtExposure, KitchenQual, GarageCars.

In [None]:
# predict
y_train_pred = lasso.predict(X_train_rfe)
print(r2_score(y_true=y_train, y_pred=y_train_pred))
y_test_pred = lasso.predict(X_test_rfe)
print(r2_score(y_true=y_test, y_pred=y_test_pred))

Across the Linear, Ridge and Lasso regressions, the r2_score differs very little between the train and test data.

Surprise Housing should consider the points below when pricing houses for sale:

- Overall material and finish of the house.
- Size of garage in car capacity.
- Living area above ground.
- Kitchen quality.
- Walkout or garden level walls.

Overall quality is the most important feature in both the Ridge and Lasso models: the sale price increases with better quality.