<br> 

# Linear Data Housing Appraiser (LinDHA) Mk. 1

![mk1](../images/tony_stark_buildingmk1.jpg)

Let's try a basic regression to predict sale price. We'll use all the numeric variables whose correlation with ```SalePrice``` is above 0.2. If two of these variables are highly correlated with each other, we'll keep one and omit the other to keep the model simple and easy to run.
- ```1st Flr SF``` correlates strongly with ```Total Bsmt SF```, so we drop ```1st Flr SF```.
- ```TotRms AbvGr``` correlates heavily with ```Gr Liv Area```, so we drop ```TotRms AbvGr```.
- ```Garage Yr Blt``` correlates strongly with ```Year Built```, so we drop ```Garage Yr Blt```.
- ```Garage Cars``` correlates strongly with ```Garage Area```, so we drop ```Garage Cars```.
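
The selection rule described above can be sketched in a few lines of pandas. This is only an illustration on synthetic data, not the code used to build the feature list: the 0.8 cutoff for "highly correlated", the use of absolute correlation, and the keep-the-first-seen rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the Ames set)
rng = np.random.default_rng(42)
n = 500
area = rng.normal(1500, 400, n)
demo = pd.DataFrame({
    'Gr Liv Area': area,
    'TotRms AbvGr': area / 250 + rng.normal(0, 0.3, n),  # nearly redundant with area
    'Misc Val': rng.normal(0, 1, n),                      # unrelated to price
})
demo['SalePrice'] = 100 * area + rng.normal(0, 20000, n)

corr = demo.corr()

# Step 1: keep features whose absolute correlation with the target exceeds 0.2
target_corr = corr['SalePrice'].drop('SalePrice')
candidates = target_corr[target_corr.abs() > 0.2].index.tolist()

# Step 2: skip any candidate that is highly correlated with a feature already kept
PAIR_THRESHOLD = 0.8  # assumed cutoff for "highly correlated"
selected = []
for col in candidates:
    if all(abs(corr.loc[col, kept]) < PAIR_THRESHOLD for kept in selected):
        selected.append(col)

print(selected)
```

In practice the choice of which member of a correlated pair to keep is a judgment call, as in the list above; the loop simply keeps whichever one it encounters first.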


In [None]:
# all the good stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# the stars of the show
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score



pd.options.display.max_columns = 100
pd.options.display.max_rows = 3000

# custom helper module that provides the Project class used below
import PaulBettany as jarvis

In [None]:
# import data sets
train = pd.read_csv('../data/train.csv')
total = pd.read_csv('../data/ames-cleaned.csv', index_col='Id')

In [None]:
# list out features we want to use
features = ['Lot Area', 'Overall Qual', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1',
            'Total Bsmt SF', '2nd Flr SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Full Bath', 
            'Half Bath', 'Bedroom AbvGr', 'Fireplaces', 'Garage Area', 'Wood Deck SF', 
            'Open Porch SF']

len(features)

In [None]:
# create a Project with the data we will use
cols = features + ['SalePrice']
n_train = len(train.index)
lindhamk1 = jarvis.Project(total[cols].iloc[:n_train], total[cols].iloc[n_train:],
                           target='SalePrice', name='LinDHA Mk 1')

In [None]:
# check to see if we have the correct features
lindhamk1.data.head()

In [None]:
# attach a model to the Project
lindhamk1.model = LinearRegression()

# set the random seed of the project
lindhamk1.seed = 42

# prepare data for training and testing
lindhamk1.prepare_data()

In [None]:
# preview model performance before building
lindhamk1.cross_val()

In [None]:
# build the Mk. 1
lindhamk1.prototype()

In [None]:
# grade the model with our metrics
lindhamk1.grade()

- Using only these 17 numerical features, chosen for their correlation with sale price, we were able to get a mean absolute error of around 20,000 dollars.
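
For readers without the helper module, the seed, split, cross-validate, fit, and grade steps can be approximated with plain scikit-learn. The sketch below is an assumption about what such a workflow looks like, not the internals of `jarvis.Project`, and it runs on synthetic data, so its numbers are not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (hypothetical; stands in for the 17 housing features)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 17))
true_coefs = rng.uniform(5000, 20000, size=17)
y = 180000 + X @ true_coefs + rng.normal(0, 25000, size=1000)

# Hold out a test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()

# Preview performance with 5-fold cross-validation on the training split
# (the default score for a regressor is R^2)
cv_r2 = cross_val_score(model, X_train, y_train, cv=5)

# Fit, then grade with mean absolute error on both splits
model.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))

print(f'CV R^2: {cv_r2.mean():.3f}')
print(f'MAE train: {mae_train:,.0f}  MAE test: {mae_test:,.0f}')
```

Comparing the training and test MAE side by side is a quick check: similar values on both splits suggest the model is not overfitting, whatever else may be wrong with it.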

In [None]:
lindhamk1.plot_residuals(slice='both')


################################## Code to Export Presentation Graphic ##########################
#
#plt.figure(figsize = (15,8));
# 
#plt.scatter(x=lindhamk1.y_trainpred, y=lindhamk1.train_errors, s=5);
#plt.axhline(y=0, c='red', linestyle='--');
#plt.xticks(c='white');
#plt.yticks(c='white');
#plt.title('Mk 1 (Training Residuals)', c='white', fontsize=30);
#plt.xlabel('Predicted Price', c='white', fontsize=25);
#plt.ylabel('Error', c='white', fontsize=25);
#
#plt.savefig('../images/mk1-residuals.png');
#
##########################################################################################

- Quite clearly, the issue is **underfitting**. There seems to be some inherent curvature in the errors/residuals that our model has not been able to capture.
- Of course, this is to be expected since we only used 17 numerical features to try to fit a dataset with thousands of points. In other words, using only 17 numerical features (with no polynomial/interaction terms) means our model lacks the sophistication and **capacity** to properly express the relationship in the data.

With these considerations, it might be a good idea to introduce more features. This is what we will do in the Mk. 2 model in the next notebook.
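
To make the idea of curvature in the residuals concrete, here is a minimal sketch on synthetic data. It fits a straight line to a curved relationship, then refits with a squared term added. The `curvature_score` helper is a hypothetical diagnostic invented for this example (the correlation between the residuals and the squared, centered predictions), not something from the notebook or from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a genuinely curved relationship (hypothetical)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x**2 + rng.normal(0, 5, size=500)

def fit_and_residuals(features, target):
    """Fit a linear regression and return (predictions, residuals)."""
    model = LinearRegression().fit(features, target)
    pred = model.predict(features)
    return pred, target - pred

def curvature_score(pred, resid):
    """Hypothetical diagnostic: correlation between the residuals and the
    squared, centered predictions. Values near 0 suggest no leftover bend."""
    centered_sq = (pred - pred.mean()) ** 2
    return np.corrcoef(centered_sq, resid)[0, 1]

# Straight-line model: too little capacity for the curved data
pred_lin, resid_lin = fit_and_residuals(x.reshape(-1, 1), y)

# Same model family with a squared term added as an extra feature
pred_quad, resid_quad = fit_and_residuals(np.column_stack([x, x**2]), y)

print(f'curvature (linear only):     {curvature_score(pred_lin, resid_lin):.2f}')
print(f'curvature (with x^2 added):  {curvature_score(pred_quad, resid_quad):.2f}')
```

The underfit model leaves a strong U-shaped pattern in its residuals, while the model with the extra term leaves residuals that look like noise. That is the same pattern to look for in the residual plots above.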

In [None]:
# take a look at the parameters the model learned
lindhamk1.parameters

Interestingly, we see some negative coefficients attached to variables that we didn't expect: ```2nd Flr SF```, ```Full Bath```, ```Half Bath```, and ```Bedroom AbvGr```. The model is telling us that increasing the number of bathrooms/bedrooms and 2nd floor square footage will somehow drive the price down.

This does not make any sense, so either:
1) This really is how real estate works (and we are learning something new).
2) The LINE-M assumptions (linearity, independence of errors, normality of errors, equal variance of errors, and no multicollinearity) are being violated, and the true relationship between the features and the target is being muddied.

Without any domain knowledge, we can't rule out hypothesis 1 for certain, but it seems much more likely that the LINE-M assumptions are not being met. However, since this doesn't technically hurt the model's predictive power, we will push on without worrying about how to fix the LINE-M violations.
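
One violation that commonly produces counterintuitive coefficient signs is multicollinearity (the M in LINE-M, as the acronym is usually taught). A rough way to check for it is the variance inflation factor (VIF). The sketch below computes VIFs with scikit-learn on synthetic columns that only borrow the housing column names; the `variance_inflation_factors` helper and the data are assumptions for illustration, and no VIFs were computed in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(frame):
    """Return the VIF of each column: 1 / (1 - R^2), where R^2 comes from
    regressing that column on all of the other columns."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Synthetic features (hypothetical): two overlapping area measures, one unrelated count
rng = np.random.default_rng(42)
n = 500
gr_liv = rng.normal(1500, 400, n)
demo = pd.DataFrame({
    'Gr Liv Area': gr_liv,
    '2nd Flr SF': 0.4 * gr_liv + rng.normal(0, 40, n),
    'Fireplaces': rng.integers(0, 3, n).astype(float),
})

vif = variance_inflation_factors(demo)
print(vif.round(1))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that a coefficient's sign and size are unstable, even when the model's predictions remain fine.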

In [None]:
lindhamk1.build_model()

In [None]:
# save our work
lindhamk1.save(csv=False, pkl_path='../saved-files/lindhamk1.pkl')