# Project - House Prices Prediction

### Notebook 3 of 3

1. Data Importing and Cleaning
2. Exploratory Data Analysis and Feature Engineering
3. **Model Training, Evaluation and Implementation** (This Notebook)

In [1]:
#Import libraries for data cleaning and exploration

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm

In [2]:
# Import Datasets

house_train_df = pd.read_csv("datasets/train.csv")
house_test_df = pd.read_csv("datasets/test.csv")
house_train_df_cleaned = pd.read_csv("datasets/train_cleaned_2.csv")
house_test_df_cleaned = pd.read_csv("datasets/test_cleaned_2.csv")

## Model Selection

We will use regression models because Sale Price is a continuous variable. The models we will evaluate are Linear Regression, Ridge and Lasso. We will compare them on RMSE and R-squared to identify the best of the three.
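As a reminder of what the two metrics measure, here is a minimal sketch that computes them from their definitions. The helper names (`rmse_from_definition`, `r_squared_from_definition`) and the four prices are hypothetical and for illustration only; the notebook itself relies on scikit-learn for these numbers.

```python
import numpy as np

def rmse_from_definition(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared_from_definition(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Hypothetical sale prices and predictions (not from the dataset)
y_true_demo = np.array([100_000, 150_000, 200_000, 250_000])
y_pred_demo = np.array([110_000, 140_000, 210_000, 240_000])

rmse_demo = rmse_from_definition(y_true_demo, y_pred_demo)
r2_demo = r_squared_from_definition(y_true_demo, y_pred_demo)
print(rmse_demo, r2_demo)
```

RMSE is read in the units of Sale Price, while R-squared is unitless, which is why the two are reported together.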

# Run OLS Model 

In [3]:
# Define response and feature
X = house_train_df_cleaned.drop('SalePrice', axis=1)
y = house_train_df_cleaned['SalePrice']

In [4]:
X_ols = sm.add_constant(X, prepend=True)
results = sm.OLS(y, X_ols).fit()


In [5]:
# OLS Results
results.summary()

| | | | |
|---|---|---|---|
|Dep. Variable|SalePrice|R-squared|0.917|
|Model|OLS|Adj. R-squared|0.913|
|Method|Least Squares|F-statistic|215.6|
|Date|Mon, 13 Jun 2022|Prob (F-statistic)|0.0|
|Time|11:16:10|Log-Likelihood|-23470.0|
|No. Observations|2049|AIC|47140.0|
|Df Residuals|1948|BIC|47710.0|
|Df Model|100| | |
|Covariance Type|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|---|---|---|---|---|---|---|
|const|-6.82e+04|2.38e+04|-2.867|0.004|-1.15e+05|-2.16e+04|
|Lot Area|0.6462|0.104|6.230|0.000|0.443|0.850|
|Mas Vnr Area|39.5275|4.919|8.035|0.000|29.880|49.175|
|Bsmt Exposure|5082.7446|666.963|7.621|0.000|3774.709|6390.780|
|Heating QC|1120.7299|724.629|1.547|0.122|-300.400|2541.860|
|TotRms AbvGrd|146.8929|681.708|0.215|0.829|-1190.061|1483.847|
|Garage Area|28.3505|4.324|6.556|0.000|19.869|36.832|
|Misc Val|-0.3698|1.250|-0.296|0.767|-2.820|2.081|
|overall_score|1223.2482|80.772|15.145|0.000|1064.840|1381.656|

| | | | |
|---|---|---|---|
|Omnibus|461.163|Durbin-Watson|1.979|
|Prob(Omnibus)|0.0|Jarque-Bera (JB)|4084.793|
|Skew|0.801|Prob(JB)|0.0|
|Kurtosis|9.729|Cond. No.|1.16e+16|


Result:

- An adjusted R-squared of 0.913
- Some variables (Heating QC, Total Rooms Above Ground, Miscellaneous Value) are not statistically significant at the 5% level

# Evaluate Models

In [6]:
#Import Libraries

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import metrics
from sklearn.pipeline import Pipeline

In [7]:
# perform Train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [8]:
# Scale data using MinMaxScaler (fit on the train split only, then apply to the test split)

mc = MinMaxScaler(feature_range=(0,1))
S_train = mc.fit_transform(X_train)
S_test = mc.transform(X_test)

In [9]:
# Fit a separate scaler on the full training data, for use with the final model
final_mc = MinMaxScaler(feature_range=(0,1))
X_scaled = final_mc.fit_transform(X)

Because our categorical variables are one-hot encoded into 0/1 columns, a min-max scaler is used so that the numeric features are also scaled to between 0 and 1. In addition, not all variables are normally distributed, so standard (z-score) scaling might not be suitable.
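The effect of min-max scaling on a mix of dummy and numeric columns can be seen in a small sketch. The two-column arrays below are hypothetical (a 0/1 dummy and an area in square feet) and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features: column 0 is a one-hot dummy, column 1 is an area in square feet
X_train_demo = np.array([[0.0, 1000.0],
                         [1.0, 2000.0],
                         [0.0, 3000.0]])
X_test_demo = np.array([[1.0, 2500.0],
                        [0.0, 3500.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
S_train_demo = scaler_demo.fit_transform(X_train_demo)  # min and max are learned from train only
S_test_demo = scaler_demo.transform(X_test_demo)        # the train min and max are reused here

print(S_train_demo)
print(S_test_demo)
```

The dummy column is left unchanged, while the area column is mapped onto the same 0 to 1 range. Test values larger than the training maximum land above 1, which is expected when the scaler is fit on the training split only.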

We will train three models on our data and select the best one

## Linear Regression

1. Cross validation Score

In [10]:
# Fit on train dataset
linear_reg_cv = LinearRegression()
linear_reg_cv = linear_reg_cv.fit(X_train, y_train)

In [11]:
lr_cross_val = cross_val_score(linear_reg_cv, X_train, y_train, cv=5)

In [12]:
lr_cross_val.mean()

0.8918015416726787

2. Train-Test Model

In [13]:
linear_reg = LinearRegression()
linear_reg = linear_reg.fit(X_train, y_train)

In [14]:
# Train R-squared
linear_reg.score(X_train, y_train)

0.9142891314975099

In [15]:
y_pred_lr = linear_reg.predict(X_test)

3. Model metrics scores

In [16]:
linear_reg.score(X_test, y_test)

0.9161585513320083

In [17]:
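# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred_lr))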
rmse

23022.928647426434

Summary

- The cross-validation score (0.892) and the test score (0.916) do not differ much; the cross-validation score is slightly lower
- RMSE of around 23,000

## Ridge 

1. Cross Validation Score

In [18]:
r_alphas = np.logspace(0, 5, 100)
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=3)
ridge_cv = ridge_cv.fit(S_train, y_train)

In [19]:
# get optimal Alpha
ridge_cv.alpha_

1.1233240329780274

In [20]:
# R-squared of the refit model on the training data (the mean CV score itself is stored in ridge_cv.best_score_)
ridge_cv.score(S_train, y_train)

0.9128778880649232

2. Train-Test Model

In [21]:
# Use the alpha derived above
ridge_model = Ridge(alpha=ridge_cv.alpha_)
ridge_model.fit(S_train, y_train)

Ridge(alpha=1.1233240329780274)

In [22]:
ridge_model.score(S_train, y_train)

0.9128778880649232

In [23]:
y_pred_ridge = ridge_model.predict(S_test)

3. Model metrics score

In [24]:
# R-squared for test
ridge_model.score(S_test, y_test)

0.9171444780777653

In [25]:
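# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge))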
rmse

22887.160192306328

Summary

- Training R-squared (0.913) and test R-squared (0.917) are almost identical
- RMSE of around 22,900
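For a cross-validated R-squared that is directly comparable with the Linear Regression figure, one option is to wrap the scaler and the Ridge model in a `Pipeline` and pass it to `cross_val_score`, so that the scaler is refit inside each fold. The sketch below is not the notebook's own workflow: it uses synthetic data and a fixed `alpha=1.0` as stand-ins, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the housing features (not the project data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Scaling happens inside the pipeline, so each fold fits its own scaler
ridge_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

cv_scores = cross_val_score(ridge_pipe, X_demo, y_demo, cv=5, scoring="r2")
mean_cv_r2 = cv_scores.mean()
print(cv_scores, mean_cv_r2)
```

Keeping the scaler inside the pipeline means no information from a validation fold is used when scaling the corresponding training folds.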

## Lasso Model

1. Cross validation score

In [26]:
# get optimal alpha from cv
l_alphas = np.logspace(-3, 0, 100)
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=100000)
lasso_cv = lasso_cv.fit(S_train, y_train)

In [27]:
lasso_cv.alpha_

1.0

In [28]:
lasso_cv.coef_

array([ 6.17495455e+04,  5.56018054e+04,  2.20927687e+04,  5.75255689e+03,
        7.33789554e+03,  3.13475167e+04,  7.23774258e+03,  1.05188109e+05,
        3.00240417e+04,  6.39719718e+03,  6.75873946e+04,  7.78257889e+04,
        1.34353507e+05, -3.82475383e+03,  3.88011134e+04,  1.64000288e+04,
        1.95702578e+04,  1.46499207e+04, -7.42307431e+04,  1.04248624e+04,
        1.16554768e+04,  2.91657955e+04,  1.25581614e+04,  4.20545021e+03,
        1.10973504e+04,  2.95943123e+04, -1.31406764e+04, -2.97347463e+03,
       -2.31600038e+04, -3.88878340e+04,  0.00000000e+00, -3.30658256e+04,
       -2.77487079e+04, -2.55485678e+04,  3.40550753e+03,  2.61847648e+04,
        0.00000000e+00,  1.74331780e+04,  1.43566376e+04,  1.34243638e+04,
        4.31855065e+03, -1.56205837e+03, -3.67653636e+04,  2.21447720e+03,
       -4.51440781e+02,  6.03872931e+03, -1.32990611e+04, -2.00492362e+04,
       -2.17846311e+04,  7.91466171e+02, -2.29525596e+04, -2.00843978e+04,
       -4.34382321e+03,  

Only a few coefficients have been zeroed out by the Lasso model: two of the values in the (truncated) output above are exactly 0
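Rather than reading the printed array, the number of coefficients that Lasso has set to exactly zero can be counted directly. This is a minimal sketch on synthetic data, where only the first two of six features drive the target; the `_demo` names and `alpha=0.5` are assumptions for illustration. In the notebook, the same check would be applied to `lasso_cv.coef_`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only the first two features drive the target; the other four are pure noise
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5, max_iter=100_000).fit(X_demo, y_demo)

n_zero = int(np.sum(lasso_demo.coef_ == 0))        # how many features were dropped
zero_idx = np.flatnonzero(lasso_demo.coef_ == 0)   # which feature positions were dropped
print(n_zero, zero_idx)
```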

In [29]:
lasso_cv.score(S_train, y_train)

0.9142747247246612

2. Train-Test Model

In [30]:
lasso_model = Lasso(alpha=lasso_cv.alpha_, max_iter=100000)
lasso_model.fit(S_train, y_train)

Lasso(max_iter=100000)

In [31]:
lasso_model.score(S_train, y_train)

0.9142747247246612

In [32]:
y_pred_lasso = lasso_model.predict(S_test)

3. Model metrics score

In [33]:
lasso_model.score(S_test, y_test)

0.9166083168005401

In [34]:
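# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso))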
rmse

22961.092628324528

Summary

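- Training R-squared (0.914) and test R-squared (0.917) are almost identical
- RMSE of around 23,000
- The selected alpha (1.0) is the largest value in the search grid, so a wider grid might select a stronger penalty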

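|Model|CV R-Squared|Test R-Squared|RMSE|Alpha|
|---|---|---|---|---|
|Linear Reg|0.892|0.916|23023|NA|
|Ridge|0.913|0.917|22887|1.12|
|Lasso|0.914|0.917|22961|1.0|

Note: for Ridge and Lasso, the first column is the R-squared of the refit model on the training data (`.score(S_train, y_train)`), not a mean cross-validated score, so it is not directly comparable with the Linear Regression figure.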

Based on the R-squared and RMSE metrics, the Ridge model performed best, with the highest test R-squared (0.917) and the lowest RMSE (22887).

- The three models differ very little in terms of R-squared and RMSE
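A comparison table like the one above can also be assembled programmatically, which avoids copying numbers by hand. The sketch below fits the three model types on one synthetic split and collects test R-squared and RMSE in a DataFrame; the data, the alpha values and the `_demo` names are placeholders, not the notebook's tuned settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Placeholder hyperparameters, chosen only for illustration
models = {
    "Linear Reg": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01, max_iter=100_000),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    rows.append({
        "Model": name,
        "Test R-Squared": r2_score(y_te, y_hat),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, y_hat))),
    })

comparison = pd.DataFrame(rows).set_index("Model")
best_model = comparison["RMSE"].idxmin()  # model with the lowest test RMSE
print(comparison)
print(best_model)
```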

## Train Final Model

We refit the Ridge model, using the alpha selected above, on the entire training dataset

In [35]:
final_model = Ridge(alpha=ridge_cv.alpha_)
final_model.fit(X_scaled, y)

Ridge(alpha=1.1233240329780274)

In [36]:
final_model.score(X_scaled, y)

0.9160216751239636

In [37]:
final_model.coef_

array([ 69883.45164596,  59979.58525026,  21382.31808703,   5360.61259496,
         9540.0136788 ,  40085.31321736,  -2361.53586506, 100537.71245104,
        28944.4304586 ,   8433.96954929,  76065.22113162,  74414.37123138,
       121719.73738366,   2655.03073198,  39188.72832682,  20354.56146337,
        21003.28419806,  19423.42066272, -60594.92054932,   7228.19892505,
         3426.41534695,  20113.59471415,   9244.5425287 ,   5199.91725387,
         9956.55840507,  17387.06230098,  -4018.63575218,  -1271.63026323,
       -13297.82773334, -24639.86726472,  -9739.82508099, -20730.45510917,
       -11063.0908002 , -14598.71943632,  -8044.18375976,  11819.14231889,
        12988.10938908,   4626.09192748,   1412.31728833,  -2053.52612627,
         6429.58470894,   -696.76556002, -15275.64631775,   2030.39468698,
         3977.87091367,  13254.98405567,  -8332.50123183, -13848.56053725,
       -17366.29531772,   4366.16138883, -16498.27688535, -15512.00556116,
        -1036.17929993,  

In [38]:
final_model_coefs = pd.Series(final_model.coef_, index=X.columns)

In [39]:
# Coefficients on final model
final_model_coefs.apply(abs).sort_values(ascending=False).head(10)

gr_liv_area_score    121719.737384
overall_score        100537.712451
x3_GrnHill            79141.243754
total_sf              76065.221132
bsmt_fin_score        74414.371231
Lot Area              69883.451646
house_age             60594.920549
Mas Vnr Area          59979.585250
x3_StoneBr            47190.502469
Garage Area           40085.313217
dtype: float64

The most important variables in the model are above-ground living area and overall score (Quality and Condition). Lot area, finished basement area (basement SF with finishing), neighbourhood and house age are also important features in determining Sale Price.
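Because the features are min-max scaled, each coefficient is the change in predicted price as a feature moves from its training minimum to its training maximum, which is what makes the magnitudes comparable across features. To read a coefficient per original unit instead, it can be divided by the feature's range, which `MinMaxScaler` stores in `data_range_`. The sketch below uses a single hypothetical feature with a known slope of 100 per square foot; the `_demo` names and the near-zero `alpha` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
# Hypothetical feature: living area in square feet, with a true slope of 100 per square foot
area_demo = rng.uniform(500, 3000, size=(100, 1))
price_demo = 100.0 * area_demo[:, 0] + 50_000 + rng.normal(scale=1_000, size=100)

scaler_demo = MinMaxScaler()
area_scaled_demo = scaler_demo.fit_transform(area_demo)

# A near-zero penalty keeps the example close to an ordinary least squares fit
model_demo = Ridge(alpha=1e-6).fit(area_scaled_demo, price_demo)

coef_scaled = model_demo.coef_[0]                           # price change from min area to max area
coef_per_unit = coef_scaled / scaler_demo.data_range_[0]    # price change per square foot
print(coef_scaled, coef_per_unit)
```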

# Transform test dataset and generate predictions

In [40]:
# Define feature variables for test dataset
X_final = house_test_df_cleaned

In [41]:
# Scale 
X_final_scaled = final_mc.transform(X_final)

In [42]:
# generate predictions
y_pred_final = final_model.predict(X_final_scaled)
y_pred_final = pd.Series(y_pred_final, name="SalePrice")

In [43]:
y_pred_final

0      138898.380124
1      170899.650726
2      211745.671624
3       99872.344033
4      171597.434851
           ...      
873    184763.276596
874    224245.784922
875    133032.858125
876    117622.996318
877    119488.299449
Name: SalePrice, Length: 878, dtype: float64

In [44]:
# get ID column from test dataset
id_col = house_test_df['Id']

In [45]:
# Prediction submission in correct format (concat along columns already returns a DataFrame)
submission = pd.concat([id_col, y_pred_final], axis=1)

In [46]:
#export to csv

submission.to_csv("Output/submission.csv", index = False)

Ridge: RMSE on test dataset is 21648

# Conclusion and Recommendation

### Conclusion

A regression model (Ridge) is useful for predicting Sale Prices and can be used to predict new house prices, as it has an RMSE of 21648 on the test dataset (simulating new data).

The most important variables in the model are above-ground living area and overall score (Quality and Condition). Lot area, finished basement area (basement SF with finishing), neighbourhood and house age are also important features in determining Sale Price.

Relating this back to our problem statement and the key questions we wanted to answer:
1. Neighbourhood is a strong influencer on Sale Price - possibly because 
2. House Size is in fact the most important feature from our model
3. The inner house area above the ground floor seems to be the most valuable area
4. The overall finishing (Quality and Condition) also influences Sale Price quite significantly
5. House age is also a good predictor of Sale Price


Limitations of our findings

There are other social, economic and political factors that are likely to heavily influence house prices. We are not able to control for these variables with the information provided in our dataset. Some of these factors are:
- Demographics
- Interest rate
- The Economy
- Government Policies


### Recommendation

1. We can build a Ridge Regression model to help predict prices of houses that will be added to the listing from the property agency
2. The most important data to collect are above ground living area and overall finishing (Quality and Condition), Lot Area, Basement SF & Finishing, neighbourhood and house age.
3. More external data, such as demographics and interest rates, should be collected and added to our prediction model