# Assignment 2
Kyle Gallagher  |  MSDS422 - Fulton

## Problem Question:
Use many explanatory variables for your predictions. Employ at least two regression modeling methods selected from those discussed in Chapter 4 of the Géron (2017) textbook: linear regression, ridge regression, lasso regression, and elastic net. Evaluate these methods within a cross-validation design using the root mean-squared error (RMSE) as an index of prediction error.  Submit your models to Kaggle.com for evaluation on the test set.  Python scikit-learn should be your primary environment for conducting this research. Note that it is not necessary to employ polynomial regression in this assignment.

Regarding the management problem, imagine that you are advising a real estate brokerage firm in its attempt to employ machine learning methods. The firm wants to use machine learning to complement conventional methods for assessing the market value of residential real estate. Of the modeling methods examined in your study, which would you recommend to management, and why?

# Analysis and Insights
## Data Prep
I started by importing the training data and running some high-level checks: looking at a sample of rows, checking the data types, and reviewing summary statistics.
After that I split the data into training and test sets (80/20) and separated the target variable (`SalePrice`). I then built a data pipeline using sklearn. The pipeline identifies the numeric and categorical fields and applies transformations to each group separately.
After transformation I applied two modeling methods: ordinary linear regression and ridge regression.

## Results
The plain linear model generalized poorly: its test RMSE (about 38,200) was nearly double its training RMSE (about 20,400). Ridge regression improved on this, with the size of the gain depending on the alpha used. I evaluated both models using a training/test split as well as 5-fold cross-validation. Interestingly, an alpha of 100 gave the lowest test-set RMSE (about 25,700), while an alpha of 10 gave the lowest mean RMSE in 5-fold CV (about 33,800). I submitted both to Kaggle and they had identical results on the holdout set.
On Kaggle the evaluation score came out to 0.16576 (root mean squared log error), putting me ahead of 63% of participants.
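
Kaggle's score is on a log scale, so it is not directly comparable to the dollar-valued RMSE figures in the appendix. As a rough sketch of the metric, assuming the log(1 + y) form used by scikit-learn's `mean_squared_log_error` (the `rmsle` helper and the prices here are made up for illustration, not taken from my submission):

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error: RMSE between log(1 + y) values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A linear model can predict negative prices, which log1p cannot handle
    # below -1. Clipping at zero is an assumption made for this sketch.
    y_pred = np.clip(y_pred, 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Made-up prices for illustration only
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]
score = rmsle(actual, predicted)
print(score)
```

Because the error is measured on logs, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one.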

## Next Steps
Not bad results for a first pass. With more time, these are the next steps I'd take to further improve results:
* run a loop over a range of alpha values to find the best setting for ridge regression
* test other regression methods (elastic net, lasso, etc.)
* add an ordinal transformation to the pipeline (I only used one-hot encoding, but some fields would benefit from an ordinal transformation)
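
The first item could look roughly like the sketch below. It runs on synthetic data from `make_regression` rather than the housing data, and the alpha grid is a hypothetical choice, so the numbers it prints say nothing about the real problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the transformed housing features (assumption)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=1234)

alphas = [0.1, 1, 10, 100]  # hypothetical grid of regularization strengths
cv_rmse = {}
for alpha in alphas:
    # scikit-learn returns negated MSE, so flip the sign before the square root
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[alpha] = np.sqrt(-scores).mean()

# Pick the alpha with the lowest mean cross-validated RMSE
best_alpha = min(cv_rmse, key=cv_rmse.get)
for alpha, rmse in cv_rmse.items():
    print(f"alpha={alpha}: mean CV RMSE = {rmse:.2f}")
print("Best alpha:", best_alpha)
```

scikit-learn also ships `RidgeCV`, which takes an `alphas` argument and does a similar search internally; the explicit loop just makes the per-alpha scores easy to inspect.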

## Analysis
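Advice for the real estate brokerage firm: for now I'd recommend the ridge regression model. Its regularization shrinks the coefficients, which likely helps here because one-hot encoding produces many features relative to the number of training rows. In the appendix results, ridge lowered both the test-set RMSE and the 5-fold CV RMSE compared with the conventional linear regression model.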

# Appendix - Code and Output

## Adding Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
%matplotlib inline

## Importing Training Data

In [2]:
# Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
df = pd.read_csv('./Data/train.csv', sep=',', engine='python')

## High Level Summary Statistics

In [33]:
# Getting a look at the first 5 rows
df.head()

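| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |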


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [87]:
# Getting summary statistics for all object variables
# None appear to have huge unique counts so we should be fine with one hot encoding
with pd.option_context('display.max_columns', 50):
    print(df.describe(include='object'))

       MSZoning Street Alley LotShape LandContour Utilities LotConfig  \
count      1460   1460    91     1460        1460      1460      1460   
unique        5      2     2        4           4         2         5   
top          RL   Pave  Grvl      Reg         Lvl    AllPub    Inside   
freq       1151   1454    50      925        1311      1459      1052   

       LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle  \
count       1460         1460       1460       1460     1460       1460   
unique         3           25          9          8        5          8   
top          Gtl        NAmes       Norm       Norm     1Fam     1Story   
freq        1382          225       1260       1445     1220        726   

       RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType ExterQual  \
count       1460     1460        1460        1460       1452      1460   
unique         6        8          15          16          4         4   
top        Gable  CompShg     VinylS

In [41]:
# Getting summary statistics for all number variables
with pd.option_context('display.max_columns', 50):
    print(df.describe(include='number'))

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726   
std       1.112799    30.202904     20.645407   181.066207   456.098091   
min       1.000

## Splitting into Training and Test

In [3]:
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 1234)

In [4]:
housing_train = train_set.drop("SalePrice", axis = 1)
housing_labels_train = train_set.loc[:,'SalePrice']

In [5]:
housing_test = test_set.drop("SalePrice", axis = 1)
housing_labels_test = test_set.loc[:,'SalePrice']

## Feature Creation and Transformation

In [8]:
# Setting up a pipeline as shown in "Hands-On Machine Learning with Scikit-Learn" 

num_attribs = list(housing_train.select_dtypes([np.number]))
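# Using the string "object" because the np.object alias was removed from newer NumPy releases
cat_attribs = list(housing_train.select_dtypes(include=["object"]))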

# Numeric pipeline: imputing with median and applying standard scaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "median")),
    ('std_scaler', StandardScaler())
])

# Categorical pipeline: imputing with the mode, then applying one hot encoding
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "most_frequent")),
    ('1h_encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Full pipeline containing the numeric and categorical pipelines
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

# Fitting then transforming the training data
housing_train = full_pipeline.fit_transform(housing_train)
# Only Transforming the test data
housing_test = full_pipeline.transform(housing_test)
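
The ordinal transformation mentioned under Next Steps could be sketched as follows. This is only an illustration on a toy column: the quality ordering (Po < Fa < TA < Gd < Ex) is my assumption based on the Kaggle data description, and I have not wired it into `full_pipeline`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for one quality-rated column (assumption)
toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa", "TA"]})

# Assumed ordering from worst to best; each label maps to its position
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

ord_encoder = OrdinalEncoder(categories=[quality_order])
encoded = ord_encoder.fit_transform(toy[["ExterQual"]])
print(encoded.ravel())  # [2. 3. 4. 1. 2.]
```

Compared with one-hot encoding, this keeps the ranking information in a single numeric column, which a linear model can then use as a trend.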

## Modeling
Linear Regression:

In [9]:
lin_reg = LinearRegression()
lin_reg.fit(housing_train, housing_labels_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
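# Checking predictions on the training and test sets
# RMSE is taken as the square root of MSE so the cell does not rely on the
# `squared` argument, which newer scikit-learn releases no longer accept
housing_predictions = lin_reg.predict(housing_train)
lin_rmse = np.sqrt(mean_squared_error(housing_labels_train, housing_predictions))

housing_predictions = lin_reg.predict(housing_test)
lin_rmse_test = np.sqrt(mean_squared_error(housing_labels_test, housing_predictions))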

print("-------RMSE---------")
print("TRAIN:", lin_rmse)
print("TEST:", lin_rmse_test)

-------RMSE---------
TRAIN: 20423.474830378134
TEST: 38231.66517138771


In [11]:
# Calculating 5 fold Cross Validation score
scores = cross_val_score(lin_reg, housing_train, housing_labels_train, scoring = "neg_mean_squared_error", cv = 5)
lin_reg_scores = np.sqrt(-scores)
print("Mean: ", lin_reg_scores.mean())
print("Standard Deviation: ", lin_reg_scores.std())

Mean:  37178.761227011695
Standard Deviation:  9056.19621467214
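
One caveat with the cross-validation cell above: the data passed to `cross_val_score` had already been imputed, scaled, and encoded using the whole training split, so every validation fold influenced the preprocessing. A sketch of one way to avoid that is to put the preprocessing and the model in a single `Pipeline`, so each fold refits the transformers. The small frame below is synthetic and only borrows a few column names from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame standing in for the housing data (assumption)
rng = np.random.default_rng(1234)
n = 100
toy = pd.DataFrame({
    "LotArea": rng.normal(10000, 2000, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Street": rng.choice(["Pave", "Grvl"], n),
})
toy.loc[::10, "LotArea"] = np.nan  # inject some missing values
target = 20000 * toy["OverallQual"] + rng.normal(0, 5000, n)

# Same preprocessing steps as the notebook, applied to the toy columns
preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std_scaler", StandardScaler())]),
     ["LotArea", "OverallQual"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("1h_encoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["Street"]),
])

# Preprocessing and model together, so each CV fold fits its own transformers
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=10))])
scores = cross_val_score(model, toy, target,
                         scoring="neg_mean_squared_error", cv=5)
fold_rmse = np.sqrt(-scores)
print("Mean:", fold_rmse.mean())
print("Standard Deviation:", fold_rmse.std())
```

With this layout the raw (untransformed) frame is what gets passed to `cross_val_score`, and the same fitted pipeline can later be applied directly to the Kaggle test file.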


Ridge Regression:

In [39]:
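# This cell was re-run with alpha set to 100, 10, and 1; the comments on the
# cells below note which alpha produced each output
ridge_reg = Ridge(alpha = 1)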
ridge_reg.fit(housing_train, housing_labels_train)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [13]:
# Calculating 5 fold Cross Validation score (alpha = 100)
scores = cross_val_score(ridge_reg, housing_train, housing_labels_train, scoring = "neg_mean_squared_error", cv = 5)
ridge_scores = np.sqrt(-scores)
print("Mean: ", ridge_scores.mean())
print("Standard Deviation: ", ridge_scores.std())

Mean:  34387.992961940225
Standard Deviation:  11785.581554488794


In [25]:
# Calculating 5 fold Cross Validation score (alpha = 10)
scores = cross_val_score(ridge_reg, housing_train, housing_labels_train, scoring = "neg_mean_squared_error", cv = 5)
ridge_scores = np.sqrt(-scores)
print("Mean: ", ridge_scores.mean())
print("Standard Deviation: ", ridge_scores.std())

Mean:  33825.14438975966
Standard Deviation:  10297.983874710744


In [29]:
# Calculating 5 fold Cross Validation score (alpha = 1)
scores = cross_val_score(ridge_reg, housing_train, housing_labels_train, scoring = "neg_mean_squared_error", cv = 5)
ridge_scores = np.sqrt(-scores)
print("Mean: ", ridge_scores.mean())
print("Standard Deviation: ", ridge_scores.std())

Mean:  36256.71136303083
Standard Deviation:  9133.698128611952


In [14]:
# Checking predictions (alpha = 100)
# RMSE is taken as the square root of MSE so the cell does not rely on the
# `squared` argument, which newer scikit-learn releases no longer accept
housing_predictions = ridge_reg.predict(housing_train)
ridge_rmse = np.sqrt(mean_squared_error(housing_labels_train, housing_predictions))

ridge_predictions_test = ridge_reg.predict(housing_test)
ridge_rmse_test = np.sqrt(mean_squared_error(housing_labels_test, ridge_predictions_test))

print("-------RMSE---------")
print("TRAIN:", ridge_rmse)
print("TEST:", ridge_rmse_test)

-------RMSE---------
TRAIN: 29475.741235660455
TEST: 25733.81210736826


In [38]:
# Checking predictions (alpha = 10)
# Note: this reports plain RMSE in dollars; Kaggle scores on root mean squared log error
housing_predictions = ridge_reg.predict(housing_train)
ridge_rmse = np.sqrt(mean_squared_error(housing_labels_train, housing_predictions))

ridge_predictions_test = ridge_reg.predict(housing_test)
ridge_rmse_test = np.sqrt(mean_squared_error(housing_labels_test, ridge_predictions_test))

print("-------RMSE---------")
print("TRAIN:", ridge_rmse)
print("TEST:", ridge_rmse_test)

-------RMSE---------
TRAIN: 25083.36022160616
TEST: 27073.313543383276


In [40]:
# Checking predictions (alpha = 1)
# Note: this reports plain RMSE in dollars; Kaggle scores on root mean squared log error
housing_predictions = ridge_reg.predict(housing_train)
ridge_rmse = np.sqrt(mean_squared_error(housing_labels_train, housing_predictions))

ridge_predictions_test = ridge_reg.predict(housing_test)
ridge_rmse_test = np.sqrt(mean_squared_error(housing_labels_test, ridge_predictions_test))

print("-------RMSE---------")
print("TRAIN:", ridge_rmse)
print("TEST:", ridge_rmse_test)

-------RMSE---------
TRAIN: 21766.978659486078
TEST: 30223.787871174423
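
I did not get to lasso or elastic net, but a side-by-side comparison under the same 5-fold design could be sketched like this. The data is synthetic (`make_regression`) and the alpha and `l1_ratio` values are arbitrary placeholders, so the ranking it prints is not evidence about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the transformed housing features (assumption)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=1234)

# Hyperparameter values are placeholders, not tuned settings
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10),
    "lasso": Lasso(alpha=1.0),
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    fold_rmse = np.sqrt(-scores)
    # Store mean and spread of the per-fold RMSE for each method
    results[name] = (fold_rmse.mean(), fold_rmse.std())
    print(f"{name}: mean={results[name][0]:.2f}, std={results[name][1]:.2f}")
```

Keeping every method on the same folds and the same metric makes the means directly comparable, which is what the management recommendation would need.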


## Submission Steps:

In [33]:
# Reading in test data
submit = pd.read_csv('./Data/test.csv', sep=',', engine='python')

In [34]:
# Applying Transformations
housing_submit = full_pipeline.transform(submit)

In [35]:
# Getting predictions for submission
final_predictions = ridge_reg.predict(housing_submit)

In [36]:
# Packaging submission up
housing_id = np.array(submit['Id']).astype(int)
my_solution = pd.DataFrame(final_predictions, housing_id, columns = ['SalePrice'])
my_solution.to_csv("submission_1.csv", index_label = ["Id"])