## Week 5 - Exercise 2

Author: Khushee Kapoor

Last Updated: 22/4/22

### Setting Up

To start, we have imported the following libraries:

- NumPy: to work with the data
- Pandas: to manipulate the dataframe
- MatPlotLib: for data visualization
- Seaborn: for data visualization

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

Next, we read the dataset and store it into a dataframe using the read_csv() function from the Pandas library.

In [2]:
# reading the dataset
df = pd.read_csv('HousePrice.csv')

After that, we view the first few rows of the dataframe to get a glimpse of it. To do this, we use the head() function from the Pandas library.

In [3]:
# viewing the first 5 rows
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### Q1. Perform the preprocessing if required, scale the train and test data using standard scaler.

To solve Question 1, we first view the dimensions of the dataframe by using the shape attribute.

In [4]:
# viewing the dimensions of the dataframe
df.shape

(1460, 81)

As we can see, there are 1460 rows and 81 columns in the dataset. Next, we check for missing values. To do that, we use the isnull() and sum() functions from the Pandas library. We also chain the loc indexer with a lambda function to show only the columns that have missing values, which keeps the output short.

In [5]:
# checking for missing values
df.isnull().sum().loc[lambda x: x>0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

As we can see, the columns 'Alley', 'PoolQC', 'Fence', and 'MiscFeature' have more than 50% missing values (each is missing in over 730 of the 1460 rows). Hence we drop them using the drop() function. We deal with the rest of the missing values and perform the remaining preprocessing in the pipeline that follows.

In [6]:
# dropping the columns with many missing values
df = df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])
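The 50% rule used above can also be checked programmatically by turning the missing counts into fractions. The following is a minimal sketch on a small made-up dataframe; the `toy` frame and the `threshold` and `sparse_cols` names are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
})

threshold = 0.5  # assumed cut-off, matching the 50% rule

# isnull().mean() gives the fraction of missing values per column
missing_fraction = toy.isnull().mean()

# keep the names of the columns whose missing fraction exceeds the threshold
sparse_cols = missing_fraction[missing_fraction > threshold].index.tolist()
print(sparse_cols)

# drop those columns in one call
toy_reduced = toy.drop(columns=sparse_cols)
```

Deriving the list this way avoids typing the column names by hand, which may help if the threshold is changed later.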

### Q2. Split the dataset into train size of 70% and test size of 30%, apply the Ridge and Lasso regression, and fit the model containing all independent variables.

Next, we split the data into the independent variables (x) and the dependent variable (y). We then use the train_test_split() function from the sklearn library to divide the dataset into a training set (70%) and a testing set (30%).

In [7]:
# splitting the data into independent and dependent variables
x = df.drop(columns=['SalePrice'])
y = df['SalePrice']

# dividing the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=105)
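Passing a fixed random_state, as in the cell above, should make the split repeatable between runs. A small sketch on made-up arrays (the `x_toy` and `y_toy` names are illustrative) shows that two calls with the same seed return the same rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 10 samples with 2 features each (values are made up)
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# two independent splits that share the same seed
a_train, a_test, _, _ = train_test_split(x_toy, y_toy, train_size=0.7, random_state=105)
b_train, b_test, _, _ = train_test_split(x_toy, y_toy, train_size=0.7, random_state=105)

# the same seed yields the same partition
print(np.array_equal(a_train, b_train), np.array_equal(a_test, b_test))

# every row lands in exactly one of the two sets
print(len(a_train) + len(a_test) == len(x_toy))
```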

Next, we segregate the numerical and categorical columns to easily perform the appropriate preprocessing on the variables.

In [8]:
# selecting the numerical columns
numerical_cols = [cname for cname in x.columns if x[cname].dtype in ['int64', 'float64']]

# selecting the categorical columns
categorical_cols = [cname for cname in x.columns if x[cname].dtype == 'object']
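As an alternative to the list comprehensions above, Pandas offers select_dtypes() for the same task. The sketch below uses a made-up `toy` frame. Note that it is not an exact equivalent: include='number' matches every numeric width, not only int64 and float64, and exclude='number' matches any non-numeric column, not only 'object'.

```python
import pandas as pd

# toy frame with two numerical columns and one categorical column (made-up values)
toy = pd.DataFrame({
    'LotArea': [8450, 9600],
    'LotFrontage': [65.0, 80.0],
    'MSZoning': ['RL', 'RM'],
})

# numerical columns: any integer or float dtype
num_cols = toy.select_dtypes(include='number').columns.tolist()

# categorical columns: everything that is not numeric
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(num_cols)
print(cat_cols)
```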

After that, we form the following pipeline:

                                      Independent Variables
                                    ___________|__________
                                   |                      |
                               Numerical             Categorical
                                   |                      |
                                Imputing               Imputing
                                   |                      |
                                Scaling            One Hot Encoding
                                   |______________________|
                                               |
                     Dependent Variable ---  Model
                                               |
                                             Output

To do this, we use the following libraries:

- Pipeline: to create the pipeline
- ColumnTransformer: to aggregate the preprocessing steps for the numerical and categorical columns
- StandardScaler: to scale the numerical values
- SimpleImputer: to impute the missing categorical values using the most frequent value and the missing numerical values with the mean of the column
- OneHotEncoder: to one-hot-encode the categorical values
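To make the scaling step concrete, here is a minimal NumPy sketch of what standard scaling computes: each value has the column mean subtracted and is divided by the column standard deviation, with both statistics taken from the training data only. The arrays are made up for illustration; StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# made-up training and test values for a single numerical column
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [5.0]])

# statistics are learned from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# the same training statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Placing the scaler inside a pipeline that is fitted on the training set gives the same guarantee: the test set is scaled with the training mean and standard deviation, never with its own.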

In [9]:
# importing libraries for preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


# imputing and scaling the numerical columns 
numerical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])


# imputing and one-hot encoding the categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    
])


# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
      ])
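The handle_unknown='ignore' option matters when new data contains a category that was not present during fitting. A small sketch with made-up zoning codes (the array names are illustrative) shows that such a category is encoded as a row of zeros instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# made-up categorical column for fitting
train_zone = np.array([['RL'], ['RM'], ['RL']], dtype=object)

# new data: 'FV' was never seen during fitting
new_zone = np.array([['RM'], ['FV']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_zone)

# the encoder returns a sparse matrix by default, so convert it for display
encoded = encoder.transform(new_zone).toarray()

print(encoder.categories_[0])  # categories learned from the training column
print(encoded)                 # the unseen category becomes an all-zero row
```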

For the first pipeline, we build a Lasso model using the Lasso class from the sklearn library. We then combine the preprocessor and the model into a single pipeline.

In [10]:
# importing the Lasso module
from sklearn.linear_model import Lasso

# building model for prediction
lasso = Lasso(random_state=105)


# bundle preprocessing and modeling code in a pipeline
lasso_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor), 
    ('lasso', lasso)
])

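Then we fit the lasso pipeline with the training data and evaluate it on the testing data by calculating the mean absolute error, the residual sum of squares and the r2-score. The mean absolute error and the r2-score come from built-in functions of the sklearn library, while the residual sum of squares is computed with NumPy.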

In [11]:
# fitting the pipeline
lasso_pipe.fit(x_train, y_train)

# computing and printing the mean absolute error
from sklearn.metrics import mean_absolute_error
print(str.format('Mean Absolute Error: {:.2f}', mean_absolute_error(y_test, lasso_pipe.predict(x_test))))

# computing and printing the residual sum of squares
print(str.format('Residual Sum of Squares: {:.2f}', np.sum(np.square(y_test - lasso_pipe.predict(x_test)))))

# computing and printing the r2-score
print(str.format('R2 Score: {:.2f}', lasso_pipe.score(x_test, y_test)))

Mean Absolute Error: 19132.53
Residual Sum of Squares: 515289678437.34
R2 Score: 0.78




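As we can see, on the testing set the Lasso model achieves:

- Mean Absolute Error: 19132.53
- Residual Sum of Squares: 515289678437.34
- R2 Score: 0.78

which means that the independent variables explain about 78% of the variance in the sale price, i.e. they have moderate explanatory power.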
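For reference, the three metrics can be written out by hand. The sketch below uses made-up values (`y_true` and `y_pred` are illustrative) and computes the R2 score as one minus the ratio of the residual sum of squares to the total sum of squares, which is the definition used by the score() method of sklearn regressors.

```python
import numpy as np

# made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred

# mean absolute error: average size of the residuals
mae = np.mean(np.abs(residuals))

# residual sum of squares: total squared error of the model
rss = np.sum(np.square(residuals))

# total sum of squares: squared error of always predicting the mean
tss = np.sum(np.square(y_true - y_true.mean()))

# R2: share of the variance explained by the model
r2 = 1 - rss / tss

print(mae, rss, r2)
```

Because the residual sum of squares grows with the number of test rows, it is mainly useful for comparing models evaluated on the same test set.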

Next, for the second pipeline, we build a Ridge model using the Ridge class from the sklearn library. We then combine the preprocessor and the model into a single pipeline.

In [12]:
# importing the Ridge module
from sklearn.linear_model import Ridge

# building model for prediction
ridge = Ridge(random_state=105)


# bundle preprocessing and modeling code in a pipeline
ridge_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor), 
    ('ridge', ridge)
])

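Then we fit the ridge pipeline with the training data and evaluate it on the testing data by calculating the mean absolute error, the residual sum of squares and the r2-score. The mean absolute error and the r2-score come from built-in functions of the sklearn library, while the residual sum of squares is computed with NumPy.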

In [13]:
# fitting the pipeline
ridge_pipe.fit(x_train, y_train)

# computing and printing the mean absolute error
from sklearn.metrics import mean_absolute_error
print(str.format('Mean Absolute Error: {:.2f}', mean_absolute_error(y_test, ridge_pipe.predict(x_test))))

# computing and printing the residual sum of squares
print(str.format('Residual Sum of Squares: {:.2f}', np.sum(np.square(y_test - ridge_pipe.predict(x_test)))))

# computing and printing the r2-score
print(str.format('R2 Score: {:.2f}', ridge_pipe.score(x_test, y_test)))

Mean Absolute Error: 18304.91
Residual Sum of Squares: 444963788838.02
R2 Score: 0.81


As we can see, on the testing set the Ridge model achieves:

- Mean Absolute Error: 18304.91
- Residual Sum of Squares: 444963788838.02
- R2 Score: 0.81

which means that the independent variables explain about 81% of the variance in the sale price, i.e. they again have moderate explanatory power. The Ridge model performs slightly better than the Lasso model on all three metrics.
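The results of the two models on the 30% testing split can be placed side by side in a small table. The sketch below simply copies the figures reported above into a dataframe; the `results` and `best` names are illustrative.

```python
import pandas as pd

# metrics copied from the outputs of the two pipelines on the testing split
results = pd.DataFrame(
    {
        'Mean Absolute Error': [19132.53, 18304.91],
        'Residual Sum of Squares': [515289678437.34, 444963788838.02],
        'R2 Score': [0.78, 0.81],
    },
    index=['Lasso', 'Ridge'],
)

print(results)

# model with the lowest mean absolute error
best = results['Mean Absolute Error'].idxmin()
print(best)
```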

### Q3. Make predictions on test data “HousePriceTest.csv” and tabulate performance of both models on unseen data.

To solve Question 3, we first read the dataset and store it into a dataframe using the read_csv() function from the Pandas library.

In [14]:
# reading the dataset
test = pd.read_csv('HousePriceTest.csv')

After that, we view the first few rows of the dataframe to get a glimpse of it. To do this, we use the head() function from the Pandas library.

In [15]:
# viewing the first 5 rows
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


As we can see, there are 80 columns in this dataset, which is 1 less than the training set. The missing column, SalePrice, is the one we need to predict. To do that, we call the predict() method of both pipelines.

In [16]:
# predicting using the lasso pipeline
lasso_pred = lasso_pipe.predict(test)

# predicting using the ridge pipeline
ridge_pred = ridge_pipe.predict(test)
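Before predicting on a new file, it can be worth confirming that every column the preprocessor was fitted on is present. The preprocessor selects columns by name, so extra columns (such as the four dropped earlier) should simply be ignored, while a missing one would be expected to raise an error. A minimal sketch with made-up column lists follows; in the notebook, the fitted list would be numerical_cols + categorical_cols and the new list would be test.columns.

```python
# columns the preprocessor was fitted on (illustrative subset)
fitted_cols = ['LotArea', 'LotFrontage', 'MSZoning']

# columns available in the new data (illustrative, with extras and one gap)
new_cols = ['Id', 'LotArea', 'MSZoning', 'Alley']

# columns the pipeline needs but the new data lacks
missing = [c for c in fitted_cols if c not in new_cols]

# columns in the new data that the pipeline never uses
extra = [c for c in new_cols if c not in fitted_cols]

print('missing:', missing)
print('extra:', extra)
```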

To see the difference between the predictions made by the two models, we take the absolute difference between the two sets of predictions and find its average using the abs() and mean() functions from the NumPy library.

In [17]:
# finding mean difference between predictions
np.mean(np.abs(lasso_pred - ridge_pred))

7994.793854680997

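As we can see, the mean absolute difference between the two sets of predictions is about 7994.79 dollars. Since this dataset has no SalePrice column, this tells us how much the models disagree, not which of them is closer to the true prices.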
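The mean absolute difference measures the size of the disagreement but not its direction. As a hedged illustration on made-up predictions (`pred_a` and `pred_b` are illustrative names), the signed mean difference can be reported next to it to see whether one model tends to predict higher than the other:

```python
import numpy as np

# made-up predictions from two models for the same three houses
pred_a = np.array([100.0, 200.0, 300.0])
pred_b = np.array([110.0, 190.0, 330.0])

# size of the disagreement, ignoring direction
mean_abs_diff = np.mean(np.abs(pred_a - pred_b))

# direction of the disagreement: negative means pred_a is lower on average
mean_signed_diff = np.mean(pred_a - pred_b)

print(mean_abs_diff, mean_signed_diff)
```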