# Project 2: Ames Housing Data and Kaggle Challenge
---
Book 1: Data Cleaning & Exploratory Data Analysis<br>
Book 2: Preprocessing & Features Engineering<br>
**Book 3: ML Modelling, Conclusion & Recommendation**<br>
Author: Lee Wan Xian

## Contents:
- [Model Preparation](#Model-Preparation)
- [Model Instantiate & Fitting](#Model-Instantiate-&-Fitting)
- [Model Evaluation](#Model-Evaluation)
- [Conclusion & Recommendation](#Conclusion-&-Recommendation)

## Python Libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error

## Data Import

In [2]:
# Import data files (Train)
dftrainraw = pd.read_csv('../datasets/train.csv')
dftrain_clean = pd.read_csv('../output/train_clean.csv')

# Import data files (Test)
dfrealtestraw = pd.read_csv('../datasets/test.csv')
dfrealtest_clean = pd.read_csv('../output/realtest_clean.csv')

## Model Preparation

Define the predictor variables as $X$ and the outcome variable (`saleprice`) as $y$ for the train dataset. For the realtest dataset only $X$ is defined, since its sale prices are what the model has to predict.

In [3]:
# create a list of predictor column headers as the features (X-variables)

features = ['mas_vnr_area', 'exter_qual', 'bsmt_qual', 'bsmtfin_type_1',
       'heating_qc', 'kitchen_qual', 'totrms_abvgrd', 'fireplace_qu',
       'age', 'property_area', 'overall_score',
       'ms_zoning_A (agr)', 'ms_zoning_C (all)', 'ms_zoning_FV',
       'ms_zoning_I (all)', 'ms_zoning_RH', 'ms_zoning_RL', 'ms_zoning_RM',
       'neighborhood_Blmngtn', 'neighborhood_Blueste', 'neighborhood_BrDale',
       'neighborhood_BrkSide', 'neighborhood_ClearCr', 'neighborhood_CollgCr',
       'neighborhood_Crawfor', 'neighborhood_Edwards', 'neighborhood_Gilbert',
       'neighborhood_Greens', 'neighborhood_GrnHill', 'neighborhood_IDOTRR',
       'neighborhood_Landmrk', 'neighborhood_MeadowV', 'neighborhood_Mitchel',
       'neighborhood_NAmes', 'neighborhood_NPkVill', 'neighborhood_NWAmes',
       'neighborhood_NoRidge', 'neighborhood_NridgHt', 'neighborhood_OldTown',
       'neighborhood_SWISU', 'neighborhood_Sawyer', 'neighborhood_SawyerW',
       'neighborhood_Somerst', 'neighborhood_StoneBr', 'neighborhood_Timber',
       'neighborhood_Veenker', 'house_style_1.5Fin', 'house_style_1.5Unf',
       'house_style_1Story', 'house_style_2.5Fin', 'house_style_2.5Unf',
       'house_style_2Story', 'house_style_SFoyer', 'house_style_SLvl',
       'foundation_BrkTil', 'foundation_CBlock', 'foundation_PConc',
       'foundation_Slab', 'foundation_Stone', 'foundation_Wood',
       'ms_subclass_20', 'ms_subclass_30', 'ms_subclass_40', 'ms_subclass_45',
       'ms_subclass_50', 'ms_subclass_60', 'ms_subclass_70', 'ms_subclass_75',
       'ms_subclass_80', 'ms_subclass_85', 'ms_subclass_90', 'ms_subclass_120',
       'ms_subclass_150', 'ms_subclass_160', 'ms_subclass_180',
       'ms_subclass_190', 'mas_vnr_type_BrkCmn', 'mas_vnr_type_BrkFace',
       'mas_vnr_type_NA', 'mas_vnr_type_None', 'mas_vnr_type_Stone',
       'mas_vnr_type_CBlock']

In [4]:
# Set X-variables & y-variable from train_clean df

X = dftrain_clean[features]
y = dftrain_clean['saleprice']

In [5]:
# Set X variables for realtest_clean df

X_realtest = dfrealtest_clean[features]

### Train-test Split

Perform a train-test split on $X$ and $y$ so that 75% of the observations are used to train the model and the remaining 25% are held out to test the trained model.

In [6]:
# test_size defaults to 0.25, which gives the 75/25 split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
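
The 75/25 proportion comes from the default `test_size` of `train_test_split`, which is 0.25 when neither `test_size` nor `train_size` is given. A minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are illustrative and are not the Ames data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# No test_size given, so the default of 0.25 applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```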

### Scaling

Scale all values in $X$ to a comparable range. We use `sklearn`'s `RobustScaler`, which centres each feature on its median and scales it by a quantile range, because the outliers in the dataframe would skew the sample mean and variance that mean-based scaling relies on.

In [7]:
rbs = RobustScaler(quantile_range=(15.0,85.0))
X_train = rbs.fit_transform(X_train)
X_test = rbs.transform(X_test)

In [8]:
X_realtest_trf = rbs.transform(X_realtest)
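
To illustrate what the scaler does, the sketch below fits a `RobustScaler` with the same `quantile_range` on made-up data containing one extreme outlier, then reuses the fitted statistics on new rows (as is done for `X_test` and `X_realtest`). The arrays `train_demo` and `new_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50, scale=10, size=(200, 1))
train_demo[0, 0] = 10_000.0  # one extreme outlier
new_demo = np.array([[50.0], [60.0]])

scaler = RobustScaler(quantile_range=(15.0, 85.0))
train_scaled = scaler.fit_transform(train_demo)  # learn median and quantile range
new_scaled = scaler.transform(new_demo)          # reuse them, no refitting

# center_ is the training median, scale_ is the 15th-to-85th percentile range
print(scaler.center_, scaler.scale_)
```

Because the median and percentiles barely move when a single value is extreme, the outlier has little influence on how the remaining rows are scaled.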

## Model Instantiate & Fitting

### Baseline Model

We define the baseline model as one that always predicts the mean `saleprice` of the training data.

In [9]:
# Setting the baseline model as y_mean
y_mean = np.mean(y_train)
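
As an alternative sketch (not what this notebook uses), the same mean-only baseline can be expressed with scikit-learn's `DummyRegressor`. The toy arrays are made up; they also show why the baseline $R^2$ on held-out data can dip slightly below zero when the training mean differs from the test mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Hypothetical targets; the features are ignored by DummyRegressor
y_train_demo = np.array([100.0, 200.0, 300.0])
y_test_demo = np.array([150.0, 250.0, 350.0])
X_train_demo = np.zeros((3, 1))
X_test_demo = np.zeros((3, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)  # training mean for every row

print(baseline_pred)
print(r2_score(y_test_demo, baseline_pred))
```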

### Linear Regression

In [10]:
# Instantiate linear reg model
lr = LinearRegression()

# fit linear reg model to train data
lr.fit(X_train, y_train)

### Ridge Regression

In [11]:
# Create hyperparameters dict for Ridge
ridge_params = {
    "alpha": np.logspace(0, 10, 500),
    "fit_intercept": [True, False],
    "positive": [True, False],
    "max_iter": [1000, 15000],
}

In [12]:
# Instantiate grid search over the Ridge hyperparameters
ridge_grid = GridSearchCV(Ridge(),
                          ridge_params,
                          n_jobs=-1,
                          cv=10,
                          verbose=1
                         )

# fit Ridge model to train data
ridge_grid.fit(X_train, y_train)

Fitting 10 folds for each of 4000 candidates, totalling 40000 fits
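
The 4000 candidates are the product of the grid sizes: 500 alphas x 2 `fit_intercept` settings x 2 `positive` settings x 2 `max_iter` values. The sketch below runs the same kind of search on a much smaller, made-up grid and synthetic data (`X_demo`, `y_demo` and `demo_params` are assumptions) to show how the candidate count arises and how the winning setting can be read back.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

# 4 alphas x 2 intercept settings = 8 candidates
demo_params = {"alpha": np.logspace(0, 3, 4), "fit_intercept": [True, False]}
demo_grid = GridSearchCV(Ridge(), demo_params, cv=5)
demo_grid.fit(X_demo, y_demo)

n_candidates = len(demo_grid.cv_results_["params"])
print(n_candidates, demo_grid.best_params_)
print(demo_grid.best_score_)  # mean cross-validated R^2 of the best candidate
```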


### Lasso Regression

In [13]:
# Create hyperparameters dict for Lasso
lasso_params = {
    "alpha": np.logspace(0, 10, 500),
    "fit_intercept": [True, False],
    "positive": [True, False],
    "max_iter": [1000, 15000],
}

In [14]:
# Instantiate grid search over the Lasso hyperparameters
lasso_grid = GridSearchCV(Lasso(),
                          lasso_params,
                          n_jobs=-1,
                          cv=10,
                          verbose=1
                         )

# fit Lasso model to train data
lasso_grid.fit(X_train, y_train)

Fitting 10 folds for each of 4000 candidates, totalling 40000 fits


### ElasticNet Regression

In [15]:
# Instantiate ElasticNetCV model
elastic = ElasticNetCV(l1_ratio = [.1, .5, .7, .75, .8, .85, .9, .95, .99, 1],
                       n_alphas = 500,
                       max_iter = 15000,
                       cv = 10
                      )

# fit ElasticNetCV model to train data
elastic.fit(X_train, y_train)

## Model Evaluation

### Model Metrics

**Baseline - y_mean**

In [16]:
# Create y_predictions from baseline model using a function
def y_baseline_pred(array):
    # Predict the training mean (y_mean) for every observation in `array`
    return [y_mean] * len(array)

In [17]:
# Show the R square score for baseline model
print(f'R_squared score is {r2_score(y_test, y_baseline_pred(y_test))}')

R_squared score is -0.00023828527031999336


In [18]:
# Show the RMSE for baseline model
print(f'RMSE is {mean_squared_error(y_test, y_baseline_pred(y_test), squared=False)}')

RMSE is 79061.81923093161


**Linear Regression**

In [19]:
# Mean 10-fold cross-validated R square score on train data
cross_val_score(lr, X_train, y_train, cv=10).mean()

-2.447159268420406e+18

In [20]:
# Mean 10-fold cross-validated R square score on test data
# (cross_val_score refits a clone of lr on folds of the test set)
cross_val_score(lr, X_test, y_test, cv=10).mean()

-2.2705010160622044e+22

In [21]:
# Show the RMSE for Linear regression
y_pred_lr = lr.predict(X_test) # generate y prediction
mean_squared_error(y_test, y_pred_lr, squared=False) # calculate RMSE

339946158182472.56

**Ridge Regression**

In [22]:
# R square score on train data
ridge_grid.score(X_train, y_train)

0.8529911855619441

In [23]:
# R square score on test data
ridge_grid.score(X_test, y_test)

0.8731315294256781

In [24]:
# Show the RMSE for ridge regression
y_pred_ridge = ridge_grid.predict(X_test) # generate y prediction
mean_squared_error(y_test, y_pred_ridge, squared=False) # calculate RMSE

28157.359043985216

**Lasso Regression**

In [25]:
# R square score on train data
lasso_grid.score(X_train, y_train)

0.854653889445648

In [26]:
# R square score on test data
lasso_grid.score(X_test, y_test)

0.871862790774903

In [27]:
# Show the RMSE for lasso regression
y_pred_lasso = lasso_grid.predict(X_test) # generate y prediction
mean_squared_error(y_test, y_pred_lasso, squared=False) # calculate RMSE

28297.801577736373

**ElasticNet Regression**

In [28]:
# R square score on train data
elastic.score(X_train, y_train)

0.8533856231632482

In [29]:
# R square score on test data
elastic.score(X_test, y_test)

0.8729836991942456

In [30]:
# Show the RMSE for ElasticNet regression
y_pred_elastic = elastic.predict(X_test) # generate y prediction
mean_squared_error(y_test, y_pred_elastic, squared=False) # calculate RMSE

28173.759088168445

### Chosen Model

Model|$R^2$ Train|$R^2$ Test|Root Mean Square Error (RMSE)
---|---|---|---
Baseline|n.a.|-0.000238|79061.82
Linear|-2.4472e+18|-2.2705e+22|3.3995e+14
Ridge|0.85299|0.87313|28157.36
Lasso|0.85465|0.87186|28297.80
ElasticNet|0.85339|0.87298|28173.76

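Based on the metrics table above, the linear regression model performs far worse than the baseline model. Its $R^2$ scores are hugely negative and its RMSE is about ten orders of magnitude larger than that of the other models, so it fails to generalize and its predictions are unusable. Note that the linear $R^2$ scores are 10-fold cross-validation means, whereas the scores for the other models come from a single fit.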
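
One possible contributor to scores this extreme (an assumption, not verified against the notebook's data) is that `features` keeps every level of each one-hot encoded variable, for example all seven `ms_zoning_*` columns. The dummy columns of one variable then sum to one and duplicate the intercept, which leaves an unregularized linear model without a unique solution. The sketch below uses a made-up `zone` column to show the resulting rank deficiency and how dropping one level removes it.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with three levels
demo = pd.DataFrame({"zone": ["RL", "RM", "FV", "RL", "RM", "FV"]})

# All levels kept, plus an intercept column
full = pd.get_dummies(demo["zone"], dtype=float)
full.insert(0, "intercept", 1.0)

# First level dropped, plus an intercept column
reduced = pd.get_dummies(demo["zone"], drop_first=True, dtype=float)
reduced.insert(0, "intercept", 1.0)

# Compare the number of columns with the matrix rank
print(full.shape[1], np.linalg.matrix_rank(full.to_numpy()))
print(reduced.shape[1], np.linalg.matrix_rank(reduced.to_numpy()))
```

Regularized models such as Ridge, Lasso and ElasticNet remain well behaved on such a design matrix because the penalty term selects among the otherwise equivalent solutions.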

Among the Ridge, Lasso and ElasticNet models, none shows signs of underfitting or overfitting. Their $R^2$ scores on both the train and test data are high (0.85 to 0.87), and each model's train score is slightly lower than its test score, so the models perform at least as well on unseen data as on the data they were fitted to. Hence, all 3 models strike a good balance between bias and variance. The gap between the train and test $R^2$ scores is smallest for Lasso (0.017), followed by ElasticNet (0.020) and Ridge (0.020).
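
The train-test gaps can be checked directly from the scores printed in the cells above. A small sketch of that arithmetic:

```python
# (train R^2, test R^2) copied from the notebook outputs above
scores = {
    "Ridge": (0.8529911855619441, 0.8731315294256781),
    "Lasso": (0.854653889445648, 0.871862790774903),
    "ElasticNet": (0.8533856231632482, 0.8729836991942456),
}

# Gap = test score minus train score
gaps = {name: test - train for name, (train, test) in scores.items()}

# Print from smallest to largest gap
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {gap:.4f}")
```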

Looking at RMSE, the Ridge model is the best of the 3. RMSE is the square root of the mean squared difference between the actual and predicted values, so it can be read as the typical size of a prediction error in the same unit as the sale price. Thus, the lower the RMSE, the closer the model's predictions tend to be to the actual sale price.
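
As a worked example of the metric, the sketch below computes RMSE by hand with NumPy on three made-up sale prices (`actual` and `predicted` are illustrative values, not notebook data).

```python
import numpy as np

# Hypothetical actual and predicted sale prices
actual = np.array([200_000.0, 150_000.0, 300_000.0])
predicted = np.array([210_000.0, 140_000.0, 330_000.0])

errors = actual - predicted           # -10000, 10000, -30000
rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error

print(rmse)
```

The result is in the same unit as the prices, and the single 30,000 miss pulls it well above the two 10,000 misses because errors are squared before averaging. In recent scikit-learn releases the `squared=False` argument used in this notebook has been superseded by a dedicated `root_mean_squared_error` function, so taking `np.sqrt` of the mean squared error is a version-independent way to get the same number.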

Hence, we chose the ElasticNet regression model as the most suitable model for predicting house prices, for the following reasons:

1. Second best model in terms of RMSE
2. $R^2$ scores are in a good range (within 0.85 to 0.88)
3. Difference in $R^2$ score for train dataset vs test dataset is the second smallest amongst the models (0.020)
4. $R^2$ score for train dataset is lower than that of test dataset

A strength of the ElasticNet regression model is that it combines the characteristics of both lasso and ridge regression, so it can reduce the impact of various features without eliminating them unnecessarily. A weakness is that it is computationally expensive to tune, since both the regularization strength and the L1 ratio have to be searched.

In [31]:
print(f'The optimal value for regularization strength in ElasticNet model is {elastic.alpha_}')
print(f'The optimal scaling of L1 to L2 penalty in ElasticNet model is {elastic.l1_ratio_}')

The optimal value for regularization strength in ElasticNet model is 43.10537110544543
The optimal scaling of L1 to L2 penalty in ElasticNet model is 1.0
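
An `l1_ratio` of 1.0 means the L2 part of the penalty vanishes, so the selected ElasticNet behaves as a Lasso model. The sketch below checks this equivalence on synthetic data; `X_demo`, `y_demo` and the `alpha` value are assumptions chosen for illustration, not the notebook's fitted values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X_demo, y_demo = make_regression(n_samples=100, n_features=8,
                                 noise=5.0, random_state=0)

# Same alpha; ElasticNet with a pure L1 penalty versus plain Lasso
enet = ElasticNet(alpha=1.0, l1_ratio=1.0, max_iter=15000).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0, max_iter=15000).fit(X_demo, y_demo)

print(np.allclose(enet.coef_, lasso.coef_))
```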


In [32]:
# Show the key features that affect the property's sale price
chosen_model_coefs = pd.Series(elastic.coef_, index=X.columns)

# Display the key features that have strong predictive power
print(f'Top 7 features with strong positive influence on sale price')
print(chosen_model_coefs.sort_values(ascending=False).head(7))
print('')
print(f'Top 7 features with strong negative influence on sale price')
print(chosen_model_coefs.sort_values(ascending=False).tail(7))

Top 7 features with strong positive influence on sale price
neighborhood_GrnHill    106631.766923
neighborhood_StoneBr     80611.951747
neighborhood_NridgHt     46388.993096
property_area            40291.741447
neighborhood_NoRidge     38025.659838
foundation_Slab          30800.003397
overall_score            18112.627275
dtype: float64

Top 7 features with strong negative influence on sale price
neighborhood_Sawyer    -10627.193133
neighborhood_NAmes     -11333.435029
neighborhood_IDOTRR    -13091.149102
neighborhood_Edwards   -16274.008191
neighborhood_OldTown   -17423.637943
ms_subclass_160        -22705.614447
ms_subclass_120        -26693.563105
dtype: float64


From the chosen model, the key features that have the most significant effect on sale price are the neighborhood, property area, building class and overall condition/quality of the property.<br> Based on the coefficients, a property in the 2-STORY PUD - 1946 & NEWER or 1-STORY PUD (Planned Unit Development) - 1946 & NEWER building class is likely to have a lower sale price. A property located in the Green Hills, Stone Brook or Northridge Heights neighborhood is likely to have a higher sale price.
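
Since the sign only gives the direction of an effect, it can also help to rank coefficients by absolute size and to count how many features the L1 penalty kept. A sketch using a handful of the coefficients printed above, plus one made-up zero-coefficient entry (`dropped_example`) standing in for a feature the model eliminated:

```python
import pandas as pd

demo_coefs = pd.Series({
    "neighborhood_GrnHill": 106631.766923,
    "property_area": 40291.741447,
    "overall_score": 18112.627275,
    "ms_subclass_120": -26693.563105,
    "neighborhood_OldTown": -17423.637943,
    "dropped_example": 0.0,  # hypothetical feature shrunk to zero
})

# Order by magnitude while keeping the original signs
ranked = demo_coefs.reindex(demo_coefs.abs().sort_values(ascending=False).index)

# Number of features with a non-zero coefficient
n_kept = int((demo_coefs != 0).sum())

print(ranked)
print(n_kept)
```

The coefficients are expressed per unit of the scaled features, so their sizes are comparable across features only to the extent that the scaling puts those features on a similar footing.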

## Conclusion & Recommendation

### Conclusion

**Prediction Model**

The ElasticNet regression model, with regularization strength = 43.1 and L1 ratio = 1.0 (i.e. a pure Lasso penalty), is the most suitable model for predicting property prices. It explains more than 85% of the variability in Sale Price based on the key features of the property. The reasons are as follows:

1. Second best model in terms of RMSE
2. $R^2$ scores are in a good range (within 0.85 to 0.88)
3. Difference in $R^2$ score for train dataset vs test dataset is the second smallest amongst the models (0.020)
4. $R^2$ score for train dataset is lower than that of test dataset

**Key Features**

Relating back to the problem statement, the key features that have the most significant effect on Sale Price are the neighborhood, property area, building class and overall condition and quality of the property.<br> For instance, a location in the Green Hills, Stone Brook, Northridge or Northridge Heights neighborhood adds value to the property. In addition, the property area has a positive relationship with the value of the property: the bigger the property area, the higher the sale price. If the property falls under the 2-STORY PUD - 1946 & NEWER or 1-STORY PUD (Planned Unit Development) - 1946 & NEWER building class, or is located in the Old Town, Edwards or Iowa DOT and Rail Road neighborhood, the sale price would most likely be lower.

**Future Improvements**

Further improvements can be made to the prediction model by considering the variables below as predictor features:
1. Annual income of the buyer as the wealthier buyers are able to tolerate higher prices
2. Macroeconomic factors (i.e. interest rate of home loans)
3. Number of years of lease left for leasehold properties
4. Crime rate of the neighborhood where the property resides

### Recommendation

We recommend that the client use our <font color="blue">ElasticNet regression model</font> for predicting property prices in Ames, Iowa. The same modelling approach can be extended to other cities, provided that the model is refitted on local data (the neighborhood features are specific to Ames) and the categorical features are properly encoded into numerical data beforehand.

The client should take note of <font color="blue">the neighborhood where the property resides and the building class of the property</font> to gauge its sale price. It is also worth noting that <font color="blue">the property area and overall condition/quality of the property</font> have a strong effect on sale prices.

Thus, the client should promote houses located in the Green Hills, Stone Brook, Northridge or Northridge Heights neighborhoods to high-income buyers. For low-income buyers, the client can promote houses that fall under the 2-STORY PUD - 1946 & NEWER or 1-STORY PUD (Planned Unit Development) - 1946 & NEWER building class.

### Kaggle Scores

In [33]:
# Generate y variable predictions based on the models (baseline, linear, ridge, lasso, elasticnet)

SalePrice_baseline = y_baseline_pred(dfrealtestraw['Id'])
SalePrice_linear = lr.predict(X_realtest_trf)
SalePrice_ridge = ridge_grid.predict(X_realtest_trf)
SalePrice_lasso = lasso_grid.predict(X_realtest_trf)
SalePrice_elastic = elastic.predict(X_realtest_trf)

In [34]:
# Create the dataframes to upload into Kaggle for scoring
# Baseline
dfresult_baseline = pd.DataFrame({
    'Id': dfrealtestraw['Id'],
    'SalePrice': SalePrice_baseline
})

# Linear
dfresult_linear = pd.DataFrame({
    'Id': dfrealtestraw['Id'],
    'SalePrice': SalePrice_linear
})

# Ridge
dfresult_ridge = pd.DataFrame({
    'Id': dfrealtestraw['Id'],
    'SalePrice': SalePrice_ridge
})

# Lasso
dfresult_lasso = pd.DataFrame({
    'Id': dfrealtestraw['Id'],
    'SalePrice': SalePrice_lasso
})

# ElasticNet
dfresult_elastic = pd.DataFrame({
    'Id': dfrealtestraw['Id'],
    'SalePrice': SalePrice_elastic
})

In [35]:
# Export the dataframes into csv files

dfresult_baseline.to_csv('../output/result_baseline.csv', index=False)
dfresult_linear.to_csv('../output/result_linear.csv', index=False)
dfresult_ridge.to_csv('../output/result_ridge.csv', index=False)
dfresult_lasso.to_csv('../output/result_lasso.csv', index=False)
dfresult_elastic.to_csv('../output/result_elastic.csv', index=False)

Below are the Kaggle results for our models' predictions.

![image info](../image/kaggle_result.jpg)