# Summary 
---

# House Price Prediction with Regression

#### This dataset was collected from Kaggle. It contains 79 explanatory variables (features) describing (almost) every aspect of residential homes in Ames, Iowa, and the competition challenges you to predict the final price of each home. It consists of 2919 records (rows) and 81 columns.

#### Our objective is to predict house prices using one basic machine learning algorithm, Linear Regression. We will also use regression with regularization such as Ridge and Lasso to try to improve our prediction accuracy. 

#### Note: This is a continuation of our previous data analysis. We have already performed data wrangling, cleaning, EDA and feature selection and are ready for the next step i.e. Modelling.

---

In [51]:
# Load Necessary Libraries

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.preprocessing import RobustScaler

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

In [52]:
df_train = pd.read_csv('linear_regression_train.csv')
df_test = pd.read_csv('linear_regression_test.csv')

features = list(df_train.drop(columns = 'LogSalePrice').columns)

---
# Train-Test Split dataset
---

#### Before we can start modelling the data, we need to confirm that our dataset is split into training and test sets. We will train the models on the training set and evaluate their predictions on the test set. Furthermore, we can use cross-validation on our training set when applying regularization techniques.
#### Splitting the data into training and test sets is important to prevent data leakage. As our data is already split, we only need to separate the outcome and explanatory variables.
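For a dataset that has not been split yet, the split and the separation of outcome and explanatory variables might look like the sketch below. The DataFrame is synthetic, and its column names, values, 80/20 ratio and seed are assumptions for illustration only; they are not taken from the files used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an unsplit dataset (hypothetical columns and values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=100),
    "OverallQual": rng.integers(1, 11, size=100),
})
df["LogSalePrice"] = 11 + 0.0005 * df["GrLivArea"] + 0.1 * df["OverallQual"]

# Separate explanatory variables from the outcome
X = df.drop(columns="LogSalePrice")
y = df["LogSalePrice"]

# Hold out 20% of the rows; a fixed random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```

Any preprocessing that learns from the data (scaling, imputation) should then be fit on `X_tr` only and merely applied to `X_te`.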

In [53]:
# Separating outcome and explanatory variables into their resp. datasets

X_train = pd.DataFrame(df_train[features])
y_train = pd.DataFrame(df_train['LogSalePrice'])
X_test = pd.DataFrame(df_test[features])
y_test = pd.DataFrame(df_test['LogSalePrice'])

In [54]:
# Selecting only numeric variables for scaling i.e. without OHE variables
# (kept as a list: column order stays deterministic and pandas accepts it as an indexer)

ohe_features = ['GarageType_2Types', 'GarageType_Attchd', 'GarageType_Basment',
 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Detchd', 'SaleCondition_Abnorml',
 'SaleCondition_AdjLand', 'SaleCondition_Alloca', 'SaleCondition_Family', 'SaleCondition_Normal',
 'SaleCondition_Partial']
scale_features = [f for f in features if f not in ohe_features]

# Feature Scaling (Robust Scaling)

#### We know our dataset is skewed and still has a number of outliers in its features, hence we will apply robust scaling, which is less sensitive to outliers than min-max or standard scaling.
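For reference, `RobustScaler` with its default settings subtracts each column's median and divides by its interquartile range (75th minus 25th percentile). The small check below uses made-up numbers with one extreme value and compares the scaler's output with that formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme outlier (values are made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version: (x - median) / IQR
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

Because the centre is the median, the median of every scaled column is 0, and the outlier does not compress the remaining values the way min-max scaling would.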

In [55]:
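rs = RobustScaler()
rs.fit(X_train[scale_features])  # fit on the training set only
# Note: only the scaled numeric columns are kept from here on; the OHE columns are not passed to the models
X_train = rs.transform(X_train[scale_features])
X_test = rs.transform(X_test[scale_features])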

In [56]:
pd.DataFrame(X_train).describe()

statistic,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
count,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,...,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0,1455.0
mean,0.125239,0.700417,0.383505,-0.175322,-0.143335,-0.134021,0.01859,-0.037816,0.478255,-0.39244,...,-0.125182,0.417978,-0.436426,0.039145,0.026291,-0.22469,-0.233677,0.434137,-0.041458,-0.410879
std,0.839946,1.729798,0.503088,1.080718,1.532272,0.8163,0.089414,0.657316,0.505739,0.638579,...,1.151036,0.513648,0.551271,0.16196,0.408794,0.508096,0.748474,0.499667,0.732253,1.446801
min,-1.991952,0.0,0.0,-7.170639,-16.111618,-3.0,-0.693147,-2.195652,0.0,-1.0,...,-4.632837,0.0,-2.0,0.0,0.0,-0.769489,-2.0,0.0,-3.249866,-4.813013
25%,-0.392354,0.0,0.0,-0.564281,-0.467817,-1.0,0.0,-0.413043,0.0,-1.0,...,-0.531291,0.0,-1.0,0.0,0.0,-0.769489,-1.0,0.0,-0.566359,-0.593842
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.607646,0.0,1.0,0.435719,0.532183,0.0,0.0,0.586957,1.0,0.0,...,0.468709,1.0,0.0,0.0,0.0,0.230511,0.0,1.0,0.433641,0.406158
max,4.458753,6.315358,2.0,4.148223,12.128781,5.0,0.693147,0.804348,1.316713,2.0,...,4.696462,1.441655,1.0,1.098612,6.605298,0.719911,2.0,1.158032,2.572957,1.231924


# Modelling

#### We will build three models (linear, ridge and lasso regression) and evaluate their performance with the R-squared metric. Additionally, we will gain insights on the features that are strong predictors of house prices.

## Linear Regression

In [57]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [58]:
# Predicted values

yr_hat = lr.predict(X_test)
yr_hat

array([[12.2313391 ],
       [12.28313879],
       [12.1464138 ],
       ...,
       [11.80869761],
       [11.91590374],
       [11.52882309]])

In [59]:
# Evaluating the algorithm with the test set (score returns R-squared for regressors)

lr_score = lr.score(X_test, y_test)
print("R2: ", lr_score)

R2:  0.8352608416622371


#### Our basic Linear Regression performed quite well, with an R-squared of about 0.84 on the test set.
#### Note: scikit-learn's LinearRegression fits ordinary least squares with a direct solver, not gradient descent.
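For a regressor, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification-style accuracy. A minimal sketch on synthetic data (the data and the `_demo` names are assumptions for illustration) that reproduces the value by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: three features plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50
model = LinearRegression().fit(X_demo[:150], y_demo[:150])
y_true = y_demo[150:]
y_pred = model.predict(X_demo[150:])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X_demo[150:], y_true)))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the outcome, and negative values are possible on held-out data.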

In [60]:
# cross validation to find 'validate' score across multiple samples; for a regressor, cv = 5 uses plain (unstratified) KFold

lr_cv = cross_val_score(lr, X_train, y_train, cv = 5, scoring= 'r2')
print("Cross-validation results: ", lr_cv)
print("R2: ", lr_cv.mean())

Cross-validation results:  [0.85110989 0.83913766 0.85731319 0.85012893 0.83686478]
R2:  0.8469108918952589


For this train-test split, the model does not appear to be over-fitting: the cross-validation scores are close to one another, and their mean (about 0.85) is close to the test-set score (about 0.84). It may still be slightly over-fitted, but we can't really tell from the R-squared metric alone. If it is over-fitted, we can try data transforms or feature engineering to improve its performance. Our main objective for now, however, is to spot-check a few algorithms and fine-tune the model later on.
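One way to make this check more concrete is to compare the R-squared on the data the model was fit on with the mean cross-validated R-squared; a large gap between the two would point to over-fitting. The sketch below does this on synthetic data with an explicit `KFold` splitter. The data, the shuffling and the seed are assumptions for illustration, not the settings used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: five features, two of which carry no signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=300)

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# R-squared on held-out folds versus R-squared on the data used for fitting
cv_scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring="r2")
train_score = model.fit(X_demo, y_demo).score(X_demo, y_demo)

print("Train R2:  ", round(train_score, 3))
print("CV R2 mean:", round(cv_scores.mean(), 3))
print("Gap:       ", round(train_score - cv_scores.mean(), 3))
```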

To help prevent the over-fitting that may result from plain linear regression, we can use regression models with regularization. Let's look at ridge and lasso next.

## Regularization

The alpha parameter in ridge and lasso controls how strongly the regression model is regularized. These algorithms differ from linear regression in that they add a penalty on the size of the coefficients, which mostly affects features that contribute little to the prediction. Ridge reduces their effects (i.e., shrinks their coefficients towards zero) while keeping all the input features. Lasso can remove them by making their coefficients exactly zero. In short, Lasso (L1 regularization) can eliminate non-significant features, thus performing feature selection, while Ridge (L2 regularization) cannot.
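The difference can be illustrated on synthetic data in which only two of ten features carry signal. The data, the alpha values and the `_demo` names in this sketch are assumptions chosen for illustration; they are not tuned for the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```

Ridge leaves small but non-zero coefficients on the noise features, while Lasso sets coefficients exactly to zero once the penalty outweighs a feature's contribution.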

## Ridge Regression

In [61]:
# set alpha to a default value of 1 as baseline  

ridge = Ridge(alpha = 1)  
ridge.fit(X_train, y_train)

ridge_cv = cross_val_score(ridge, X_train, y_train, cv = 5, scoring = 'r2')
print ("Cross-validation results: ", ridge_cv)
print ("R2: ", ridge_cv.mean())

Cross-validation results:  [0.85167273 0.83869145 0.8570851  0.85018822 0.83686158]
R2:  0.8468998158408972


## Lasso Regression

In [62]:
# set alpha to almost zero as baseline

lasso = Lasso(alpha = .001)  
lasso.fit(X_train, y_train)

lasso_cv = cross_val_score(lasso, X_train, y_train, cv = 5, scoring = 'r2')
print ("Cross-validation results: ", lasso_cv)
print ("R2: ", lasso_cv.mean())

Cross-validation results:  [0.85310785 0.83912763 0.85520906 0.84922742 0.84087026]
R2:  0.8475084449253029


#### Note: Alpha is the regularization parameter. The alpha values chosen for ridge and lasso serve as a starting point and are not likely the best. To determine the best alpha for each model, we can use scikit-learn's GridSearchCV: we feed it a range of alpha values, and it evaluates each one with cross-validation and reports the best.
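A minimal sketch of such a search with scikit-learn's `GridSearchCV` is shown below. It runs on synthetic data, and the alpha grid and `_demo` names are assumptions for illustration rather than recommended values for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the scaled training features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate alphas on a log scale; each is scored with 5-fold cross-validation
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV R2:", search.best_score_)
```

The same pattern applies to `Lasso`; after fitting, `search.best_estimator_` holds a model refit on all the supplied data with the selected alpha.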

# Conclusion

#### We first fit a basic Linear Regression model, and it performed quite well, with an R-squared of about 0.84 on our test set (about 0.85 in 5-fold cross-validation).
#### We then checked ridge and lasso regression with a single baseline alpha each, and their cross-validated performance was almost identical (R-squared of about 0.85).

# Suggestion for Next Steps

* Data preprocessing. Try different types of data transforms to expose the data structure better, so we may be able to improve model accuracy.
* Check different thresholds for collinearity when removing features.
* Use VIF to detect collinearity in the Linear Regression model.
* Use dimensionality reduction techniques (e.g., PCA) to deal with collinear features. (This will take away our ability to make inferences from the data but can help increase the prediction score.)
* Try different scalers for better performance.
* Try a Polynomial model for prediction.
* Try Generalized Linear Models for prediction.
* Try GridSearchCV to identify optimal parameters.
* Try other models like KNN, Random Forest, SVM, etc., and combine the fine-tuned models with ensembles.
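
As an illustration of the VIF suggestion above, the variance inflation factor of a feature can be computed as 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The helper `vif` below is a hypothetical function written for this sketch using scikit-learn only, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return the variance inflation factor of each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R-squared from regressing column j on the remaining columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)  # nearly a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here the two nearly identical columns get very large values, while the independent column stays close to 1.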