## Modeling
- **Establish your baseline score.**
- Fit linear regression. Look at your coefficients. Are any of them wildly overblown?
- Fit lasso/ridge/elastic net with default parameters.
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This does not have to be your best performing Kaggle model, but rather the model that best answers your problem statement.)
- Refine and interpret your production model.
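
The regularized-model step in the checklist above can be sketched as follows. This is a minimal illustration, not the notebook's actual workflow: the data is synthetic, and the names `X_demo`, `y_demo`, `models`, and `cv_rmse` are assumptions introduced for the example. Features are standardized first because the lasso, ridge, and elastic net penalties depend on feature scale.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the housing data used below).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Each model is left at its default parameters, as the checklist suggests.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "elastic_net": ElasticNet(),
}

cv_rmse = {}
for name, model in models.items():
    # Scale inside the pipeline so each CV fold is scaled on its own training part.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()  # flip the sign back to a positive RMSE

for name, score in cv_rmse.items():
    print(f"{name}: {score:.3f}")
```

Comparing cross-validated RMSE across the four models gives a starting point for deciding which one is worth tuning.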

In [33]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression  # LinearRegression class from the sklearn.linear_model module
from sklearn import metrics

In [34]:
# Set up the data
data = "../data/"
train = "datasets/clean_train.csv"
test = "datasets/test.csv"

In [35]:
# Read in the data
train_df = pd.read_csv(data + train)
test_df = pd.read_csv(data + test)

In [50]:
# The "null model" predicts the mean of the target for every row; store it in a "baseline" column
train_df['baseline'] = train_df['SalePrice'].mean()
# Baseline score: the square root of the MSE, i.e. the RMSE of the null model
rmse = metrics.mean_squared_error(train_df['SalePrice'], train_df['baseline'])**(1/2)
print('"Null-Model" RMSE (baseline score):', rmse)

"Null-Model" RMSE (baseline score): 79242.76900782275

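The baseline cell above takes the square root of the mean squared error, so the score is a root mean squared error in dollars. A roughly equivalent way to build the null model is scikit-learn's `DummyRegressor`. The sketch below uses made-up prices; `y_demo`, `X_demo`, and `null_model` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical sale prices standing in for train_df['SalePrice'].
y_demo = np.array([120000.0, 150000.0, 180000.0, 210000.0, 340000.0])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# strategy="mean" always predicts the mean of the training target.
null_model = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
baseline_pred = null_model.predict(X_demo)

baseline_rmse = np.sqrt(mean_squared_error(y_demo, baseline_pred))
print("Null-model RMSE:", baseline_rmse)
```

On the data it was fitted to, the RMSE of a mean-only model equals the population standard deviation of the target, which is a quick sanity check on the baseline number.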

In [36]:
features = ['Year Remod/Add', 'Year Built', '1st Flr SF', 'Total Bsmt SF', 'Garage Area', 'Gr Liv Area', 'Overall Qual']

In [37]:
# Set up the features (X) and the target (y_actual) from train_df to feed into a linear regression.
X = train_df[features]
y_actual = train_df['SalePrice']

In [38]:
# Verify that X and y_actual have the same number of observations (n = number of rows)
print('X:        ', X.shape) # X.shape equals (n, p), where p = number of features
print('y_actual: ', y_actual.shape) # y_actual.shape equals (n,), a one-dimensional Series

X:         (2049, 7)
y_actual:  (2049,)


In [39]:
# Check that everything is copacetic.
X.head(3)

   Year Remod/Add  Year Built  1st Flr SF  Total Bsmt SF  Garage Area  Gr Liv Area  Overall Qual
0            2005        1976         725          725.0        475.0         1479             6
1            1997        1996         913          913.0        559.0         2122             7
2            2007        1953        1057         1057.0        246.0         1057             5


In [40]:
# Instantiate linear regression model
lm = LinearRegression()

In [41]:
# Fit the linear regression to chosen features.
lm.fit(X, y_actual)

LinearRegression()

In [43]:
# The `lm` object contains our model's coefficients
pd.Series(lm.coef_, index=features)

Year Remod/Add      313.096954
Year Built          238.379913
1st Flr SF           17.657896
Total Bsmt SF        17.925420
Garage Area          50.627231
Gr Liv Area          43.846471
Overall Qual      20246.462076
dtype: float64
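
Raw coefficients like these are in dollars per unit of each feature, so a 1 to 10 quality rating and a square-footage column are not directly comparable. One common way to judge whether a coefficient is really "overblown" is to refit on standardized features. The sketch below uses synthetic data; the column names `area_sqft` and `quality` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names and values are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "area_sqft": rng.normal(1500, 500, size=300),  # wide numeric range
    "quality": rng.integers(1, 11, size=300),      # integer rating from 1 to 10
})
price = 50 * demo["area_sqft"] + 20000 * demo["quality"] + rng.normal(0, 10000, size=300)

# Fit once on the raw columns and once on standardized columns.
raw = LinearRegression().fit(demo, price)
scaled = LinearRegression().fit(StandardScaler().fit_transform(demo), price)

coef_table = pd.DataFrame(
    {"raw": raw.coef_, "standardized": scaled.coef_}, index=demo.columns
)
print(coef_table)
```

The standardized column reads as "change in price per one standard deviation of the feature", which puts both features on a comparable footing.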

In [44]:
# And the y-intercept.
lm.intercept_

-1162782.7051056402
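
The intercept is the predicted price when every feature is zero, including the year columns, so a large negative value here is an extrapolation far outside the data rather than a sign of a problem by itself. As a small illustration of how the intercept and coefficients combine into a prediction, the sketch below rebuilds `predict` by hand on synthetic data (`X_demo`, `y_demo`, and `model` are assumed names).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a known intercept of 4.0.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# A linear model's prediction is intercept + sum(coefficient * feature value).
manual_pred = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual_pred, model.predict(X_demo)))
```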

In [30]:
# Create predictions using the `lm` object.
y_pred = lm.predict(X)

In [31]:
# Evaluate the model on the training data: RMSE between actual and predicted SalePrice
metrics.mean_squared_error(y_actual, y_pred)**(1/2)

36297.670599703146
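
This score is computed on the same rows the model was fitted to, so it is likely an optimistic estimate of performance on unseen houses. A minimal sketch of a hold-out check, assuming synthetic data and hypothetical names (`X_demo`, `y_demo`, `train_rmse`, `val_rmse`), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the housing data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Hold out 25% of the rows so the model is scored on data it has not seen.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)
train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print(f"train RMSE: {train_rmse:.3f}")
print(f"validation RMSE: {val_rmse:.3f}")
```

A validation RMSE much larger than the training RMSE would suggest overfitting; similar values suggest the training score is a fair estimate.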