<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Ames Housing Data and Kaggle Challenge

# Modeling and Evaluation

### Import libraries

In [1]:
# perform all necessary library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNet, ElasticNetCV
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt
from IPython.display import display_html
import statsmodels.api as sm

pd.options.display.max_columns = None
pd.options.display.max_rows = 50

### Import csv files containing features and target data

In [2]:
train_cleaned = pd.read_csv('./data/train_cleaned.csv', index_col=0)
test_cleaned = pd.read_csv('./data/test_cleaned.csv', index_col=0)

In [3]:
X = train_cleaned.loc[:,train_cleaned.columns !='saleprice']
y = train_cleaned['saleprice']

## Baseline - Null model

In [4]:
# Step 1: Instantiate model
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X, y)

# Step 2: Train model
null_pred = dummy_regr.predict(X)

# Step 3: Calculate model performance based on R_squared
print('R-squared score :', dummy_regr.score(X, y))

# Step 4: Calculate model RMSE for Null Model
print('RMSE score :', metrics.mean_squared_error(y, null_pred, squared=False))

# Step 5: Calculate cross-val score for Null Model
null_cv = cross_val_score(dummy_regr, X, y, cv=5)
print('Cross-val score :', null_cv.mean())

R-squared score : 0.0
RMSE score : 78952.14805343687
Cross-val score : -0.0015439541293237546


## Model Preparation

### Model prep: Train/Test split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Model prep: Scaling

In [6]:
ss = StandardScaler()
ss.fit(X_train)        #done only once! set same scale factor for both train & test data
X_train_sc = ss.transform(X_train)  #scaled train data!
X_test_sc = ss.transform(X_test)    #scaled test data!

## Create, Train and Test Various Models

### Simple Linear Regression model

In [7]:
# Step 1: Instantiate model
lr = LinearRegression()
lr.fit(X_train_sc, y_train)            #model to learn from scaled data

# Step 2: Train model
lr_pred = lr.predict(X_test_sc)

# Step 3: Calculate model performance based on R_squared
        # r-squared - to be as close to 1 as possible
R_squared = r2_score(lr_pred, y_test)                  # calculate model performance against actual data
print('R-squared score :', R_squared)

# Step 4: Calculate model RMSE 
        # RMSE - to be as close to 0 as possible
print('RMSE score :', np.sqrt(metrics.mean_squared_error(y_test, lr_pred)))

# Step 5: Calculate cross-val score for Linear Regression
lr_cv = cross_val_score(lr, X_train_sc, y_train, cv=5)
print('Cross-val score :', lr_cv.mean())

R-squared score : 0.7643499248984708
RMSE score : 34224.70304842655
Cross-val score : 0.7968483130561045


### Ridge Regression (L2) model

In [8]:
# Step 1: Instantiate model
ridge = RidgeCV(alphas=np.linspace(0.5, 5, 300))
ridge.fit(X_train_sc, y_train)             #model to learn from scaled data

# Step 2: Train model
ridge_pred = ridge.predict(X_test_sc)

# Step 3: Calculate model performance
R_squared = r2_score(ridge_pred, y_test)                  # calculate model performance against actual data
print('R-squared score :', R_squared)

# Step 4: Calculate model RMSE 
        # RMSE - to be as close to 0 as possible
print('RMSE score :', np.sqrt(metrics.mean_squared_error(y_test, ridge_pred)))

# Step 5: Calculate cross-val score for Ridge Regression
ridge_cv = cross_val_score(ridge, X_train_sc, y_train, cv=5)
print('Cross-val score :', ridge_cv.mean())

R-squared score : 0.7636436357553351
RMSE score : 34224.46059507788
Cross-val score : 0.7968650290888906


### Lasso Regression (L1) model

In [9]:
# Step 1: Instantiate model
lasso = LassoCV(n_alphas=300)
lasso.fit(X_train_sc, y_train)         #model to learn from scaled data

# Step 2: Train model
lasso_pred = lasso.predict(X_test_sc)

# Step 3: Calculate model performance based on R_squared
        # r-squared - to be as close to 1 as possible
R_squared_lasso = r2_score(y_test, lasso_pred)
print('R-squared score :', R_squared_lasso)

# Step 4: Calculate model RMSE 
        # RMSE - to be as close to 0 as possible
print('RMSE score :', np.sqrt(metrics.mean_squared_error(y_test, lasso_pred)))

# Step 5: Calculate cross-val score for Lasso Regression
lasso_cv = cross_val_score(lasso, X_train_sc, y_train, cv=5)
print('Cross-val score :', lasso_cv.mean())

R-squared score : 0.8083214052992227
RMSE score : 34220.393314003835
Cross-val score : 0.7968656613672936


## Review Model Scores and Evaluate Models

**Model Selected** <br>
The model I selected to run on the next test set is the Lasso Regression model. It has the lowest RMSE mean of 34220.39 (rounded off to 2dp), and the highest r-squared value of 0.808 (rounded off to 3sf). It also gives a cross-val score of 0.797 (rounded off to 3sf). I will explain further on the meaningfulness of each metric and how the process has culminated in my choosing the Lasso Regression (L1) model to predict sale prices for unseen data in the test set.

|Model|RMSE|R-squared|Cross-val|Selected model|
|:---|---:|---:|---:|---:|
|Dummy|78952.15|0.0|-0.00154|    
|Simple Linear|34224.70|0.764|0.797|    
|Ridge|34224.46|0.763|0.797|
|**Lasso**|**34220.39**|**0.808**|**0.797**|**x**

**Regression Metrics** <br>
I used three metrics to evaluate the performance of all the three models. <br>

The first metric is the Root Mean Squared Error (RMSE), which measures how spread out these residuals are to the model. In other words, it shows how close the observed data points are to the model's predicted values. The Lasso Regression (L1) model returned an RMSE score of 34,220. This suggests that the observed data points in the test set are rather loosely fit around the model's predicted values. That said, the Lasso Regression (L1) model returned the lowest RMSE score, which means that the test-split data fits best around it compared to the other models.

The second metric is coefficient of determination, r-squared, which measures the extent to which variance in the dependent variable can be explained by the independent variable. In other words, the r-squared value shows how well the data fit the regression model. The Lasso Regression (L1) model returned an r-squared value of 0.808 which suggests that that 80.8% of the variability in a house's sale price is explained by the x-variables in the model. The x-variables in the model selected refer to neighborhood ranking, overall quality of the house, amount of above grade living area, garage area, garage finish, and number of floors/storeys the floor has. <br>

The third metric is the cross-val score. A cross_val_score is received when we perform k-fold cross validation. By dividing the test sample into k=5 groups, we introduce some variation to our model, and takes the accuracy score of each test fold is calculated and the mean of these accuracy scores makes up the cross-val score. The Lasso Regression (L1) model with k-value of 5, returned a cross-val score of 0.79687 (to 5 sf), which means that the accuracy of the Lasso (L1) model with Cross Validation is almost 80%.


## Tune, Train and Score the selected model

In [10]:
# Step 1: scale test_cleaned data
# ss.fit(test_cleaned)
test_cleaned_sc = ss.transform(test_cleaned)
 
# Step 2: run model on test_cleaned data to get saleprice predictions 
results = lasso.predict(test_cleaned_sc)

# Step 3: Calculate model R-squared
        # R-squared - 
R_squared_lasso = r2_score(y_test, lasso_pred)            # calculate model performance against actual data
print('Test R-squared score :', R_squared_lasso)

# Step 4: Calculate model RMSE 
        # RMSE - to be as close to 0 as possible
print('Test RMSE score :', np.sqrt(metrics.mean_squared_error(y_test, lasso_pred)))

# Step 5: Calculate cross-val score for Lasso Regression
        # 
lasso_cv = cross_val_score(lasso, X_train_sc, y_train, cv=5)
print('Test Cross-val score :', lasso_cv.mean())

Test R-squared score : 0.8083214052992227
Test RMSE score : 34220.393314003835
Test Cross-val score : 0.7968656613672936


In [11]:
# create new dataframe # merge 
test = pd.read_csv("./data/test.csv")
df_final = test[['Id']].merge(pd.DataFrame(results), left_index=True, right_index=True)

# save as csv
df_final.to_csv("./data/predictions.csv", header=["Id", "SalePrice"], index=False)

In [12]:
df_final.shape

(878, 2)

![Screenshot%202022-11-18%20at%209.13.29%20AM.png](attachment:Screenshot%202022-11-18%20at%209.13.29%20AM.png)

## Conclusion and Recommendations

There is no perfect model as building any machine-learning model necessitates dealing with the bias-variance trade-off. For this project, Lasso Regression was the best model according to RMSE scores. It inevitably traded some variance for bias, which ultimately lead to a smaller error overall. As the three models instantiated focused on only 6 features, future iterations of this project can experiment with nominal features such as exterior 1 or  masonry veneer type. More research into the data collection methods especially regarding the columns exterior 1 and exterior 2 can shed some light into how to quantify it appropriately to be fed into the eventual machine-learning model. The current dataset have marked a number of duplicate observations between the two columns (exterior 1 and 2), which obscured my initial intention to include it, as external research proved that exterior renovation can have significant impact on a property's resale value <a href="https://www.familyhandyman.com/list/which-exterior-renovation-adds-most-value/" target="_blank">(source)</a>. 

In addition, the next iteration of this project can also explore the use of non-linear regression models for sale price prediction. As non-linear regression can fit a wider range of curves, there is a possibility of developing a model with a better fit than my efforts with linear regression techniques.

To find out the model's generalisability, further research is needed to capture more up-to-date data, from property markets aside from Ames, Iowa to shed light on how different economic situations and climates can influence house features and its sale prices.
