## Task

Your task is to build the best model you can using [this dataset](https://docs.google.com/spreadsheets/d/1JCXULApcqesUDbAzKFYGAWvphWx0K36J0bGyiniprWw/edit?usp=sharing). The goal is to predict house prices from the other columns. Use linear regression to make predictions and evaluate your model. Be sure to develop a baseline that predicts the mean of your training y values. Find the MAE, MSE, RMSE, and R^2.

The data comes from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction) and was scraped from the [King County government website](https://data.kingcounty.gov/).

In [37]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn import set_config
set_config(display='diagram')

In [2]:
# load the dataset; you can assume the data is already cleaned
path = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vR33clYD7L7KarKwDrJr1GzW7GQRbTIzITBBHA7J-luNwIJylRrQR74p_k6AHJE-OfI5y3L2KmFIWo7/pub?output=csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,180000,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
# the 'zipcode' column is a nominal categorical variable  
# convert the datatype of 'zipcode' to 'object'

df['zipcode'] = df['zipcode'].astype('object')

In [4]:
# assign 'price' to y and other features to X

y = df['price']
# note: 'Unnamed: 0' (it looks like a leftover row index from the CSV) stays in X as a feature
X = df.drop(columns = 'price')

In [5]:
# split into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [13]:
# one-hot encode the categorical columns (zipcode) and scale the numeric columns
# Use 2 preprocessing pipelines, one for numeric and one for categorical
# Combine the pipelines in a ColumnTransformer with the appropriate column selectors
num_sel = make_column_selector(dtype_include = 'number')
cat_sel = make_column_selector(dtype_include = 'object')

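# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on 1.4+ use sparse_output = False instead
ohe = OneHotEncoder(handle_unknown='ignore', sparse = False)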
scaler = StandardScaler()

num_tuple = (scaler, num_sel)
cat_tuple = (ohe, cat_sel)

transformer = make_column_transformer(num_tuple, cat_tuple, remainder = 'passthrough')
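
To see what a transformer like this produces, here is a minimal sketch on a made-up three-row frame. The values and the `toy`, `toy_ct`, and `unseen` names are illustrative assumptions, not part of the dataset or the assignment. It passes `sparse_threshold=0` to the ColumnTransformer to get a dense array, which avoids depending on the encoder's `sparse`/`sparse_output` argument name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# made-up rows, only for illustration
toy = pd.DataFrame({'sqft_living': [1000, 2000, 3000],
                    'zipcode': [98178, 98125, 98178]})
toy['zipcode'] = toy['zipcode'].astype('object')  # treat zipcode as categorical

toy_ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include='object')),
    sparse_threshold=0,  # always return a dense array
)

# one scaled numeric column plus one indicator column per zipcode seen in fit
toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)

# a zipcode that was not seen during fit is encoded as all zeros
unseen = pd.DataFrame({'sqft_living': [2000], 'zipcode': [98001]})
unseen['zipcode'] = unseen['zipcode'].astype('object')
unseen_out = toy_ct.transform(unseen)
print(unseen_out)
```

Because the encoder uses `handle_unknown='ignore'`, a test-set zipcode that never appeared in training does not raise an error; its indicator columns are simply all zero.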


In [38]:
# instantiate a baseline model using the 'mean' strategy
dummy = DummyRegressor(strategy = 'mean')

# put your ColumnTransformer and the baseline model into a pipeline
base_pipe = make_pipeline(transformer, dummy)

# fit your pipe onto the training data
base_pipe.fit(X_train, y_train)

In [26]:
# define a function that takes true and predicted values as arguments
# and prints all 4 metrics
def eval_model(true, pred):
    mae = mean_absolute_error(true, pred)
    mse = mean_squared_error(true, pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, pred)

    # ',' in the format spec adds thousands separators; '.2f' rounds to 2 decimal places
    print(f'MAE: {mae:,.2f}, \nMSE: {mse:,.2f},\nRMSE: {rmse:,.2f},\nR2: {r2:,.2f}')
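
As a sanity check on what the four metrics measure, the sketch below computes them directly from their definitions with NumPy. The `true` and `pred` arrays are made-up values chosen so the arithmetic is easy to follow by hand; they are not from the housing data.

```python
import numpy as np

# made-up values, only to check the formulas by hand
true = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 330.0, 360.0])

resid = true - pred
mae = np.mean(np.abs(resid))                 # average absolute error, in the units of y
mse = np.mean(resid ** 2)                    # average squared error, in squared units of y
rmse = np.sqrt(mse)                          # square root of MSE, back in the units of y
ss_res = np.sum(resid ** 2)                  # squared error left over after the model
ss_tot = np.sum((true - true.mean()) ** 2)   # squared error of predicting the mean of true
r2 = 1 - ss_res / ss_tot                     # share of the variance in y that is explained

print(f'MAE: {mae:,.2f}, MSE: {mse:,.2f}, RMSE: {rmse:,.2f}, R2: {r2:,.3f}')
# MAE: 22.50, MSE: 675.00, RMSE: 25.98, R2: 0.946
```

MSE and RMSE weight large misses more heavily than MAE does, which is why the RMSE here (25.98) is larger than the MAE (22.50).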

In [32]:
# find MAE, MSE, RMSE and R2 on the baseline model for both the train and test data

print('Train Evaluation')

eval_model(y_train, base_pipe.predict(X_train))

print('\nTest Evaluation')

eval_model(y_test, base_pipe.predict(X_test))

Train Evaluation
MAE: 231,043.36, 
MSE: 129,980,641,701.74,
RMSE: 360,528.28,
R2: 0.00

Test Evaluation
MAE: 240,408.87, 
MSE: 149,877,865,995.95,
RMSE: 387,140.63,
R2: -0.00
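
The baseline's R2 of 0.00 on train and -0.00 on test follows from how R2 is defined: it compares a model's squared error with that of predicting the mean of the evaluated targets. The sketch below illustrates this with made-up numbers (the `y_tr` and `y_te` arrays and the `r2` helper are assumptions for illustration; the toy gap between train and test is deliberately much larger than in the real data).

```python
import numpy as np

y_tr = np.array([1.0, 2.0, 3.0, 4.0])   # made-up "training" targets
y_te = np.array([2.0, 4.0, 6.0])        # made-up "test" targets
const = y_tr.mean()                     # what a mean-strategy baseline predicts everywhere

def r2(true, pred):
    """R^2 computed from its definition: 1 - SS_res / SS_tot."""
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_train = r2(y_tr, np.full_like(y_tr, const))   # exactly 0: prediction equals the train mean
r2_test = r2(y_te, np.full_like(y_te, const))    # negative: the train mean is not the test mean
rmse_train = np.sqrt(np.mean((y_tr - const) ** 2))

print(r2_train, r2_test)
print(rmse_train, np.std(y_tr))  # baseline train RMSE equals the standard deviation of y_tr
```

On the training set the baseline RMSE is the (population) standard deviation of the training prices, so any useful model should come in well below it.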


In [None]:
# instantiate a linear regression model
# put your ColumnTransformer and linear regression model into a pipeline
# fit your pipe on the training data

lin_reg = make_pipeline(transformer, LinearRegression())
lin_reg.fit(X_train, y_train)

In [27]:
# find MAE, MSE, RMSE, and R2 of the linear regression model on both train and test data

print('Train Evaluation')

eval_model(y_train, lin_reg.predict(X_train))

print('\nTest Evaluation')

eval_model(y_test, lin_reg.predict(X_test))

Train Evaluation
MAE: 95,023.19, 
MSE: 24,963,388,337.70,
RMSE: 157,998.06,
R2: 0.81

Test Evaluation
MAE: 97,610.86, 
MSE: 28,793,626,852.23,
RMSE: 169,686.85,
R2: 0.81
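
One way to read these numbers is relative to the baseline. The sketch below reuses the test-set metrics printed in the two evaluation outputs of this notebook and computes how much lower each error is for the linear regression; the dictionary names are illustrative.

```python
# test-set metrics copied from the printed outputs in this notebook
baseline = {'MAE': 240_408.87, 'MSE': 149_877_865_995.95, 'RMSE': 387_140.63}
linreg = {'MAE': 97_610.86, 'MSE': 28_793_626_852.23, 'RMSE': 169_686.85}

# fractional reduction in each error metric relative to the baseline
reduction = {name: 1 - linreg[name] / baseline[name] for name in baseline}
for name, frac in reduction.items():
    print(f'{name}: {frac:.1%} lower than the baseline')
```

The MSE reduction comes out at roughly 0.81, in line with the reported test R2. The match is approximate rather than exact because the baseline predicts the training mean, not the test mean.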


# Evaluate:

* What does your model's MAE score mean in the context of this dataset?
* What does the MSE score mean?
* What does the RMSE score mean?
* How about the R^2?

# BONUS!

* Is your model overfit?  How do you know?
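
One possible way to approach the bonus question is to compare train and test metrics: a model that is much better on the training data than on the test data is likely overfit. The sketch below uses the linear regression metrics printed earlier in this notebook; the dictionary names are illustrative, and there is no fixed threshold for how large a gap counts as overfitting.

```python
# linear regression metrics copied from the printed output in this notebook
train = {'MAE': 95_023.19, 'RMSE': 157_998.06, 'R2': 0.81}
test = {'MAE': 97_610.86, 'RMSE': 169_686.85, 'R2': 0.81}

# relative increase in error when moving from train to test
gap = {name: test[name] / train[name] - 1 for name in ('MAE', 'RMSE')}
r2_drop = train['R2'] - test['R2']

for name, frac in gap.items():
    print(f'{name} is {frac:.1%} higher on test than on train')
print(f'R2 drop from train to test: {r2_drop:.2f}')
```

Here the test errors are only a few percent higher than the training errors and the rounded R2 is the same on both sets, which suggests the model is not badly overfit.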