## Predicting sale prices and practicing feature engineering with Linear Regression

**House Prices - Advanced Regression Techniques**

##### data source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv

`Adegoke Olanrewaju`

##### importing the necessary libraries

In [1]:
import numpy as np

import pandas as pd

import sklearn

##### loading the dataset from a local path

In [2]:
dataset_master = pd.read_csv('/Users/OLALYTICS/dsp-olanrewaju-adegoke/data/train.csv')

training_data_csv = dataset_master.copy()

training_data_csv.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


#### Exploratory Data Analysis

In [3]:
training_data_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### Selecting the features (continuous and categorical variables) and the target (the labels)

In [4]:
features = training_data_csv[['MSZoning','HouseStyle','YearBuilt','TotalBsmtSF','MiscVal']]

target = training_data_csv['SalePrice']

In [5]:
# merging the features and target to obtain a combined dataset

combined_dataset = features.join(target)

combined_dataset.head()

Unnamed: 0,MSZoning,HouseStyle,YearBuilt,TotalBsmtSF,MiscVal,SalePrice
0,RL,2Story,2003,856,0,208500
1,RL,1Story,1976,1262,0,181500
2,RL,2Story,2001,920,0,223500
3,RL,2Story,1915,756,0,140000
4,RL,2Story,2000,1145,0,250000


In [6]:
# checking for missing values

combined_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   MSZoning     1460 non-null   object
 1   HouseStyle   1460 non-null   object
 2   YearBuilt    1460 non-null   int64 
 3   TotalBsmtSF  1460 non-null   int64 
 4   MiscVal      1460 non-null   int64 
 5   SalePrice    1460 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 68.6+ KB


In [7]:
# checking for NaN

combined_dataset.isna().sum()

MSZoning       0
HouseStyle     0
YearBuilt      0
TotalBsmtSF    0
MiscVal        0
SalePrice      0
dtype: int64

In [8]:
# checking for duplicate rows and dropping them
# (drop_duplicates keeps the original index labels, so dropped rows leave gaps in the index)

combined_dataset.drop_duplicates(inplace = True)

combined_dataset.duplicated(keep = 'first').sum()

0

In [9]:
# defining the categorical and continuous features

cat_feat = ['MSZoning','HouseStyle']

cont_feat = ['YearBuilt','TotalBsmtSF','MiscVal']


In [10]:
# redefining the cleaned features; target still holds the original SalePrice column from In [4]

feature = combined_dataset[cat_feat + cont_feat]

target.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

### Feature engineering - OneHotEncoding for categorical variables and Scaling for continuous variables

In [11]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [12]:
# Encoding the categorical variables

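# scikit-learn >= 1.2 uses sparse_output; older releases used sparse = False instead
onehot_encoder = OneHotEncoder(drop = 'first', sparse_output = False)

cat_feat_encoded = onehot_encoder.fit_transform(feature[cat_feat])

# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
cat_feat_encoded_final = pd.DataFrame(cat_feat_encoded, columns = onehot_encoder.get_feature_names_out(cat_feat))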

cat_feat_encoded_final.head()


Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
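To make the effect of `drop = 'first'` concrete, here is a minimal sketch on a toy column. It uses `pandas.get_dummies` as a stand-in for `OneHotEncoder`, and the four rows are made up for illustration. The point is only that k categories become k - 1 indicator columns, with the dropped category encoded as all zeros.

```python
import pandas as pd

# Toy column with three zoning categories (illustrative values, not the real data).
toy = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})

# drop_first=True removes the first category in sorted order ("FV"),
# so three categories become two indicator columns.
encoded = pd.get_dummies(toy, columns=["MSZoning"], drop_first=True, dtype=float)

print(list(encoded.columns))
print(encoded)
```

The row that held `FV` ends up with zeros in both columns, which is how the dropped category stays recoverable without a column of its own.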


In [13]:
# Scaling the continuous variables

std_scaler = StandardScaler()

cont_feat_scaled = std_scaler.fit_transform(feature[cont_feat])

cont_feat_scaled_final = pd.DataFrame(cont_feat_scaled, columns = cont_feat)

cont_feat_scaled_final.head()

Unnamed: 0,YearBuilt,TotalBsmtSF,MiscVal
0,1.0518,-0.459288,-0.087718
1,0.157486,0.466175,-0.087718
2,0.985555,-0.313402,-0.087718
3,-1.863001,-0.687235,-0.087718
4,0.952432,0.199477,-0.087718
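As a rough check on what `StandardScaler` computes, the sketch below standardizes the five `YearBuilt` values shown earlier by hand with NumPy. The numbers will not match the table above, because here the mean and standard deviation come from five rows only rather than the full column.

```python
import numpy as np

# YearBuilt values from the first five rows of the dataset.
years = np.array([2003.0, 1976.0, 2001.0, 1915.0, 2000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
z = (years - years.mean()) / years.std(ddof=0)

print(z.round(3))
```

After the transformation the values have mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.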


In [14]:
# clean dataset for the features and target

In [15]:
final_dataset = pd.concat([cont_feat_scaled_final, cat_feat_encoded_final, target], axis = 1)

final_dataset.tail()

Unnamed: 0,YearBuilt,TotalBsmtSF,MiscVal,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,SalePrice
1455,0.223732,1.104425,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,175000
1456,-1.00181,0.215434,4.951368,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,210000
1457,-0.703705,0.046753,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,266500
1458,-0.206864,0.452498,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,142125
1459,,,,,,,,,,,,,,,147500


In [16]:
# checking for NaN and dropping the affected rows

final_dataset.isna().sum()

final_dataset.dropna(inplace = True)

final_dataset.tail()

Unnamed: 0,YearBuilt,TotalBsmtSF,MiscVal,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,SalePrice
1454,0.919309,-0.23818,-0.087718,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,185000
1455,0.223732,1.104425,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,175000
1456,-1.00181,0.215434,4.951368,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,210000
1457,-0.703705,0.046753,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,266500
1458,-0.206864,0.452498,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,142125
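A likely explanation for the all-NaN row 1459 is index alignment. The encoded and scaled frames were rebuilt from NumPy arrays, so they carry a fresh 0..n-1 index, while `target` still carries the original labels from before the duplicate was dropped. The toy frames below are hypothetical and only illustrate the mechanism: `pd.concat(axis=1)` matches rows by label, not by position.

```python
import pandas as pd

raw = pd.DataFrame({"x": [1, 2, 2, 3], "y": [10, 20, 20, 30]})

# Row 2 duplicates row 1, so it is dropped; the remaining labels are 0, 1, 3.
deduped = raw.drop_duplicates()
print(list(deduped.index))

# Rebuilding from a NumPy array resets the labels to 0, 1, 2.
rebuilt = pd.DataFrame(deduped[["x"]].to_numpy(), columns=["x"])

# Joining against the un-deduplicated target aligns on labels:
# label 3 has no x (NaN), and label 2 pairs x=3 with the wrong y=20.
misaligned = pd.concat([rebuilt, raw["y"]], axis=1)
print(misaligned)

# Taking y from the deduplicated frame and resetting its index keeps the pairs intact.
aligned = pd.concat([rebuilt, deduped["y"].reset_index(drop=True)], axis=1)
print(aligned)
```

If the same thing happened here, dropping the NaN row removes the symptom but can leave some rows after the dropped duplicate paired with a neighbouring `SalePrice`. Taking the target from the deduplicated frame and resetting both indexes would avoid that.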


In [17]:
# selecting the X and y from the final_dataset

X = final_dataset.drop('SalePrice', axis = 1)

X.head()

Unnamed: 0,YearBuilt,TotalBsmtSF,MiscVal,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl
0,1.0518,-0.459288,-0.087718,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.157486,0.466175,-0.087718,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.985555,-0.313402,-0.087718,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-1.863001,-0.687235,-0.087718,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.952432,0.199477,-0.087718,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [18]:
y = final_dataset['SalePrice']

y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

### train and test split for training the model

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
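In this notebook the encoder and scaler were fitted on the full dataset before the split, so the test rows influence the scaling statistics. A common alternative is to wrap the preprocessing and the model in a `Pipeline`, so that everything is fitted on the training portion only. The sketch below runs on synthetic data: the column values, the price formula, and the names `toy`, `toy_price`, `X_tr`, and `X_te` are all made up for illustration, and the encoder uses `handle_unknown="ignore"` rather than `drop='first'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (not the real values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "MSZoning": rng.choice(["RL", "RM", "FV"], size=n),
    "YearBuilt": rng.integers(1900, 2010, size=n),
    "TotalBsmtSF": rng.integers(0, 2000, size=n),
})
toy_price = (
    50_000
    + 60 * toy["TotalBsmtSF"]
    + 500 * (toy["YearBuilt"] - 1900)
    + rng.normal(0, 5_000, size=n)
)

# Encode the categorical column and scale the continuous ones in one step.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["MSZoning"]),
    ("num", StandardScaler(), ["YearBuilt", "TotalBsmtSF"]),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# Split first, then fit: the scaler and encoder only ever see the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_price, test_size=0.3, random_state=42)
model.fit(X_tr, y_tr)
toy_predictions = model.predict(X_te)

print(toy_predictions[:3])
```

Because the pipeline stores the fitted transformers, calling `model.predict` on new rows applies the same encoding and scaling learned from the training data.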

#### fitting the model using LinearRegression

In [21]:
from sklearn.linear_model import LinearRegression

LinReg = LinearRegression()

LinReg.fit(X_train, y_train)

LinearRegression()

#### predicting house prices using the LinearRegression model

In [22]:
y_predicted = LinReg.predict(X_test)

y_predicted[:5]

array([192276.55632359, 175534.90911437, 195615.78440997, 216474.36860449,
       182424.33150523])

##### evaluating the model using the root of mean_squared_log_error (RMSLE)

In [23]:
from sklearn.metrics import mean_squared_log_error

def compute_rmsle(y_test: np.ndarray, y_predicted: np.ndarray, precision: int = 2) -> float:
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_predicted))
    return round(rmsle, precision)

In [24]:
print('The root mean squared log error (RMSLE) of the model is:', compute_rmsle(y_test, y_predicted))


The root mean squared log error (RMSLE) of the model is: 0.43


##### evaluating the model using the root of mean_squared_error (RMSE) to see the difference

In [25]:
from sklearn.metrics import mean_squared_error

def compute_rmse(y_test: np.ndarray, y_predicted: np.ndarray, precision: int = 2) -> float:
    # root mean squared error, in the same units as SalePrice
    rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
    return round(rmse, precision)

In [26]:
print('The root mean squared error (RMSE) of the model is:', compute_rmse(y_test, y_predicted))


The root mean squared error (RMSE) of the model is: 80450.83
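The two scores differ by orders of magnitude because they measure different things. The square root of `mean_squared_error` is in the units of the target (here, currency), while the square root of `mean_squared_log_error` compares `log(1 + y)` values and so reflects relative error. The sketch below computes both by hand on two made-up price pairs; the values are illustrative assumptions, not model output.

```python
import numpy as np

# Hypothetical true and predicted prices: each prediction is off by about 10%.
y_true = np.array([100_000.0, 200_000.0])
y_pred = np.array([110_000.0, 180_000.0])

# RMSE: root of the mean squared difference, in the units of the target.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# RMSLE: the same formula applied to log(1 + y), which tracks relative error.
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(f"RMSE:  {rmse:.2f}")
print(f"RMSLE: {rmsle:.4f}")
```

One practical caveat: a linear model can predict negative prices, and `mean_squared_log_error` raises an error on negative inputs, so the log-based metric is only usable when all predictions are non-negative.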
