# Model Validation
* Goal: measure the quality of the models you've built
    * For most applications, the relevant measure is PREDICTIVE ACCURACY

* Summarize model quality:
    * If you compare predicted and actual home values for 10,000 houses, you will find a mix of good and bad predictions
    * Scanning a list of 10,000 values by eye is impractical, so we summarize the errors in a single metric.

# Mean Absolute Error (MAE)

* error = (actual - predicted)

### MAE:
* take abs(error)
* take avg() of those abs(error)'s
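
The two steps above can be sketched in plain Python. The helper name `mean_abs_error` and the prices are made up for illustration; `mean_absolute_error` from scikit-learn, used later in these notes, computes the same quantity.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all pairs (hypothetical helper)."""
    # error = actual - predicted, then take the absolute value of each error
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    # average the absolute errors
    return sum(abs_errors) / len(abs_errors)


# Made-up prices for illustration only (not taken from any dataset)
actual_prices = [200_000, 150_000, 320_000, 410_000]
predicted_prices = [210_000, 140_000, 300_000, 450_000]

mae = mean_abs_error(actual_prices, predicted_prices)
print(mae)  # absolute errors are 10000, 10000, 20000, 40000 -> 20000.0
```

Read as: on average, these predictions are off by about 20,000.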

In [16]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [17]:
melb_fp = './input/melb_data.csv'
melb_data = pd.read_csv(melb_fp)

# drop rows with missing values in any column
filtered_melb_data = melb_data.dropna(axis=0)

# target + features
y = filtered_melb_data.Price
melb_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melb_data[melb_features]

melb_model = DecisionTreeRegressor()

# fit
melb_model.fit(X, y)


DecisionTreeRegressor()

In [18]:
# MAE
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melb_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

# Problem with "in-sample" scores
* The calculation above is an "in-sample" score
    * We used a single "sample" of houses for both building the model and validating it

* The model's practical value comes from making predictions on NEW data
* SOLUTION:
    * Exclude some data from the model-building process, then use it to test the model's accuracy. This held-out data is called **validation data**

In [125]:
from sklearn.model_selection import train_test_split

# split data into train and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

# define
melb_model = DecisionTreeRegressor()

# fit
melb_model.fit(train_X, train_y)

# predicted values on validation data
val_predictions = melb_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

29268.701369863014
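
The gap between an in-sample score and a validation score can be reproduced on synthetic data. This is a minimal sketch, not the housing data: the feature matrix, the coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=500)

# Hold out part of the rows as validation data
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

# An unconstrained tree can memorize the rows it was trained on
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(tr_X, tr_y)

# Score once on the training rows and once on the held-out rows
in_sample_mae = mean_absolute_error(tr_y, demo_model.predict(tr_X))
validation_mae = mean_absolute_error(va_y, demo_model.predict(va_X))

print("in-sample MAE: ", in_sample_mae)
print("validation MAE:", validation_mae)
```

The in-sample error should come out at or near zero because the tree has seen every training row, while the validation error reflects how the model behaves on rows it has not seen.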


# Exercise

In [127]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = './input/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


In [128]:
from sklearn.model_selection import train_test_split

# split data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
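
As a rough sketch of what this call does when no `test_size` is given: scikit-learn holds out 25% of the rows by default, and a fixed `random_state` makes the split repeatable. The toy arrays and the `demo_` names below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 2 features
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100)

# With neither test_size nor train_size set, 25% of the rows are held out
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)
print(demo_train_X.shape, demo_val_X.shape)

# Repeating the call with the same random_state selects the same rows
_, demo_val_X_again, _, _ = train_test_split(demo_X, demo_y, random_state=1)
print(np.array_equal(demo_val_X, demo_val_X_again))
```

Fixing `random_state` is what keeps the validation MAE comparable from one run of the notebook to the next.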

In [131]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(random_state=1)

In [None]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

In [132]:
# print the first few validation predictions
print(val_predictions[0:5])
# print the feature rows for the same homes (val_y.head() would show their actual prices)
print(val_X.head())

[186500. 184000. 130000.  92000. 157900.]
      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
258     12435       2001       963       829         2             3   
267      8400       1939      1052       720         2             4   
288      9819       1967       900         0         1             3   
649      1936       1970       630         0         1             1   
1233    12160       1959      1188         0         1             3   

      TotRmsAbvGrd  
258              7  
267              8  
288              5  
649              3  
1233             6  


In [133]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

# print the validation MAE
print(val_mae)

29268.701369863014
