#1 How models work
Machine learning models make predictions. The running example throughout the course is predicting the selling price of a home.

**Decision tree** models make predictions by modeling a series of choices leading to a final decision. Given some **training data** in which outcomes are known, a decision tree algorithm can break the data into a series of decisions leading to multiple outcomes that are each a prediction. Each prediction is called a **leaf** of the tree. Once the decision tree has established a series of breaks, it can be applied to new data--the **testing data**--in which outcomes are unknown to generate predictions about those outcomes.
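
As a rough illustration, a fitted decision tree behaves like a set of nested if/else checks that end in leaves. The split variables, thresholds, and leaf prices in this sketch are hypothetical values chosen for illustration, not results from any dataset.

```python
def predict_price(rooms, landsize):
    """Toy decision tree with two levels of splits and four leaves.

    All thresholds and leaf values are made up for illustration.
    """
    if rooms <= 2:
        if landsize <= 300:
            return 600_000    # leaf: fewer rooms, smaller lot
        return 750_000        # leaf: fewer rooms, larger lot
    if landsize <= 300:
        return 900_000        # leaf: more rooms, smaller lot
    return 1_200_000          # leaf: more rooms, larger lot


# Each call follows one path from the top of the tree to a single leaf.
print(predict_price(rooms=2, landsize=150))
print(predict_price(rooms=4, landsize=500))
```

A real decision tree algorithm chooses the split variables and thresholds from the training data rather than having them written by hand.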

#2 Explore data

In [1]:
# load pandas
import pandas as pd

In [2]:
# import the Melbourne housing data (downloaded from https://www.kaggle.com/dansbecker/melbourne-housing-snapshot)
melbourne_data = pd.read_csv('melb_data.csv')
melbourne_data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [3]:
# load training data: Iowa housing data
iowa_data = pd.read_csv('train.csv')

In [4]:
# summary statistics
iowa_data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


#3 First machine learning model

##3.1 Select prediction target and features

In [5]:
# examine columns of Melbourne data
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The data have too many variables. How do we select those that matter to the prediction problem?

Step 1: use intuition about how the housing market works. What variables seem most important to home values? (**Your theory of home prices informs your model of home prices.**)

In [6]:
# Select a column as the prediction target
y = melbourne_data.Price

In [7]:
# Select features to use in predicting Price (note the typos in the latitude and longitude column names)
melbourne_features = ['Rooms','Bathroom','Landsize','Lattitude','Longtitude']
X = melbourne_data[melbourne_features]

In [8]:
X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,13580.0,13580.0,13580.0,13580.0,13580.0
mean,2.937997,1.534242,558.416127,-37.809203,144.995216
std,0.955748,0.691712,3990.669241,0.07926,0.103916
min,1.0,0.0,0.0,-38.18255,144.43181
25%,2.0,1.0,177.0,-37.856822,144.9296
50%,3.0,1.0,440.0,-37.802355,145.0001
75%,3.0,2.0,651.0,-37.7564,145.058305
max,10.0,8.0,433014.0,-37.40853,145.52635


In [9]:
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
0,2,1.0,202.0,-37.7996,144.9984
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
3,3,2.0,94.0,-37.7969,144.9969
4,4,1.0,120.0,-37.8072,144.9941


##3.2 Build the model
Building a model proceeds in four steps:

1. Define
    + Model type
    + Some other parameters depending on model type
2. Fit to training data features and outcomes
3. Predict outcomes from test data features
4. Evaluate the model's prediction accuracy

In [10]:
# define the model as a decision tree
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)

Many machine learning models involve some randomness. Specifying random_state=1 sets the random number generator to a known starting point, enabling reproducibility when others use the same machine learning algorithm on the same data.
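
The effect of fixing a starting point can be seen with Python's built-in random module. This is only an analogy for what a seed does; it is not a description of how scikit-learn handles randomness internally.

```python
import random

# Two generators seeded with the same value produce the same sequence.
rng_a = random.Random(1)
rng_b = random.Random(1)
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]
print(draws_a == draws_b)  # True

# A generator seeded with a different value produces a different sequence.
rng_c = random.Random(2)
draws_c = [rng_c.random() for _ in range(3)]
print(draws_a == draws_c)  # False
```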

In [11]:
# fit the model
melbourne_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

Fitting melbourne_model to the training data features X and known outcomes y produces a fitted decision tree that can be applied to test data to generate predicted outcomes. However, we don't have test data for the Melbourne data, so we apply the fitted model to the first 5 rows of the training data to see how it works.

In [12]:
# Predict house prices for first 5 rows of training data
print('Making predictions for the following 5 houses:')
print(X.head())
print('The predictions are')
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]


##3.3 Prediction model for Iowa housing data

###3.3.1 Define the model

In [13]:
# print features in the iowa data
iowa_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [14]:
# create target for training data: sale price
y = iowa_data['SalePrice']

In [15]:
# create feature set for training data
features = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
X = iowa_data[features]

In [16]:
# examine feature data to ensure no gross errors
display(X.describe())
display(X.head())
display(X.tail())

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,10516.828082,1971.267808,1162.626712,346.992466,1.565068,2.866438,6.517808
std,9981.264932,30.202904,386.587738,436.528436,0.550916,0.815778,1.625393
min,1300.0,1872.0,334.0,0.0,0.0,0.0,2.0
25%,7553.5,1954.0,882.0,0.0,1.0,2.0,5.0
50%,9478.5,1973.0,1087.0,0.0,2.0,3.0,6.0
75%,11601.5,2000.0,1391.25,728.0,2.0,3.0,7.0
max,215245.0,2010.0,4692.0,2065.0,3.0,8.0,14.0


Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
0,8450,2003,856,854,2,3,8
1,9600,1976,1262,0,2,3,6
2,11250,2001,920,866,2,3,6
3,9550,1915,961,756,1,3,7
4,14260,2000,1145,1053,2,4,9


Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
1455,7917,1999,953,694,2,3,7
1456,13175,1978,2073,0,2,3,7
1457,9042,1941,1188,1152,2,4,9
1458,9717,1950,1078,0,1,2,5
1459,9937,1965,1256,0,1,3,6


###3.3.2 Specify and fit the model

In [17]:
# create decision tree from scikit-learn
iowa_model = DecisionTreeRegressor(random_state=1)

In [18]:
# fit the model to training data features X and training data outcomes y
iowa_model.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

###3.3.3 Predict outcomes from training data features (we again don't have test data lacking outcomes)

In [19]:
predictions = iowa_model.predict(X)
print(predictions)

[208500. 181500. 223500. ... 266500. 142125. 147500.]


###3.3.4 Evaluate model prediction accuracy
Because we generated predictions from training data where outcomes are known, we can assess the model's accuracy by comparing predicted prices to actual prices.

In [20]:
pred=pd.DataFrame(predictions)
display(pred.head())
display(y.head())


Unnamed: 0,0
0,208500.0
1,181500.0
2,223500.0
3,140000.0
4,250000.0


0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

#4 Model validation
Model validation measures the quality of the model. The first form of a model is rarely the highest quality, and a model can usually be improved by iterating through building, fitting, validation, and rebuilding.

In most applications, the **quality measure** for a model is **predictive accuracy**. Are the model's predictions equal to or "close to" what actually happens in the process being modeled?

Model validation goes beyond simply comparing predictions to outcomes. Measures of predictive accuracy have been developed. One common measure is **mean absolute error (MAE)**.

The MAE is calculated by subtracting the predicted outcome for each prediction from its actual outcome, summing the absolute value of each calculation, and dividing the sum by the number of predictions.
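
The calculation described above can be sketched in plain Python. The function name and the three example prices are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all predictions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical prices: the absolute errors are 10,000, 10,000 and 30,000.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 330_000]
print(mean_absolute_error_by_hand(actual, predicted))  # 50,000 / 3, about 16,667
```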

The sklearn package has a function to calculate MAE: mean_absolute_error(actual outcomes, predicted outcomes).

In [23]:
# calculate MAE for iowa_model
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y, predictions)

62.35433789954339

##4.1 The danger of "in-sample" model validation
The MAE calculated above for the iowa_model is not useful because the predictions were generated from the same training data used to fit the model.

This is bad because in-sample evaluation rewards arbitrary patterns in the training data that might not exist outside the training data.

Because the point of a machine learning model is to make predictions when outcomes are not known, we need to assess the model's performance using data in which the model has not already seen the outcome.

How can we do this? We don't use all of our data when we fit the model. That way, we have additional data to test the model on that has not been used to train the model. Data set aside in this way are called **validation data**.

##4.2 Validation data
We now have three types of data for machine learning:

1. training data
2. test data
3. validation data

Training and validation data have features and known outcomes. Test data have features but unknown outcomes.

In [24]:
# split iowa training data into training and validation data
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
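
When no split size is given, train_test_split holds out 25% of the rows by default. What such a split does can be sketched by hand as a seeded shuffle followed by a cut. The helper below is a hypothetical illustration, not scikit-learn's implementation, and it will not reproduce the same rows that train_test_split selects.

```python
import random


def holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle row indices with a seeded generator, then cut them in two."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(rows) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]


rows = list(range(20))  # stand-in for 20 rows of data
train_rows, val_rows = holdout_split(rows)
print(len(train_rows), len(val_rows))  # 15 5
```

Every row lands in exactly one of the two sets, so the validation rows are never seen during fitting.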

In [27]:
# fit iowa housing decision tree to training data
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [29]:
# predict iowa house prices using validation data
val_predictions = iowa_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

32966.449315068494


The MAE using in-sample prediction was about 62.

The MAE using validation prediction is about 33,000.

Quite a difference, with the model being far closer in prediction when using in-sample prediction than validation prediction. That's because in-sample prediction already has the answers.

In [32]:
# average iowa home price
iowa_data['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

The mean house price in the full data is about $180,921. An MAE of about $33,000 means the model's predictions are off, on average, by about 18% of the mean price.
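
The ratio of the validation MAE to the mean price can be checked directly from the two figures printed above:

```python
validation_mae = 32966.449315068494  # validation MAE printed earlier
mean_price = 180921.195890           # mean of SalePrice from describe()

relative_error_pct = 100 * validation_mae / mean_price
print(round(relative_error_pct, 1))  # 18.2
```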

#5 Underfitting and overfitting

##5.1 Experimenting with models
For a decision tree, one of the more important options is the **tree depth**, the number of decision splits between the top level (all data) and prediction leaves.

The more leaves that are generated, the sparser the data in each leaf.

The greater the tree depth, the greater the number of leaves and the sparser the data in each leaf. At the extreme, we could generate a decision tree where each leaf has one outcome in it, and the tree would make predictions that are very accurate on the training data.

However, such a model would likely perform poorly with new data. A model with high performance in training data and low performance in validation data is **overfit**.

Conversely, we might **underfit** a model by failing to capture enough important elements. Such a model would make poor predictions in both the training and validation data.

Both overfitting and underfitting produce poor validation scores. Somewhere between an overfit and an underfit model is an optimal point at which the validation error is lowest. That is usually the desired configuration of the model.
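
The two extremes can be mimicked on a tiny, made-up dataset with one feature. In this sketch the "underfit" model is a tree with a single leaf (it always predicts the training mean), and the "overfit" model stands in for a tree with one training row per leaf (it returns the price of the closest training row). All sizes and prices are hypothetical, and the two functions are stand-ins rather than real decision trees.

```python
from statistics import mean

# Hypothetical (size, price) rows, prices in thousands.
train = [(50, 100), (60, 130), (70, 135), (80, 170)]
val = [(54, 118), (66, 128), (76, 160)]


def underfit(size):
    """One leaf: ignore the feature and predict the training mean."""
    return mean(price for _, price in train)


def overfit(size):
    """One leaf per training row: return the closest memorized price."""
    return min(train, key=lambda row: abs(row[0] - size))[1]


def mae(model, rows):
    return mean(abs(price - model(size)) for size, price in rows)


for name, model in [("underfit", underfit), ("overfit", overfit)]:
    print(name, "train MAE:", mae(model, train), "validation MAE:", mae(model, val))
```

The memorizing model has a training MAE of exactly zero, yet its validation MAE is not zero, which is the gap between in-sample and validation performance described above.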

For a decision tree, the max_leaf_nodes argument specifies the maximum number of leaves in the tree. This can be used to assess different model configurations.

In [33]:
# define a function to assess model with various max_leaf_nodes values
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X,train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

In [34]:
# for loop to assess different max_leaf_node values
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print('Max leaf nodes: %d \t\t Mean Absolute Error: %d' %(max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 Mean Absolute Error: 35190
Max leaf nodes: 50 		 Mean Absolute Error: 27825
Max leaf nodes: 500 		 Mean Absolute Error: 32662
Max leaf nodes: 5000 		 Mean Absolute Error: 33382


For the Iowa home price data used above, the MAE is least with 50 nodes, suggesting that is better than 5, 500, or 5000.
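
The same choice can be made programmatically by keeping the validation MAE for each candidate and taking the smallest. The dictionary below simply copies the four results printed above; the variable names are illustrative.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above.
val_mae_by_leaves = {5: 35190, 50: 27825, 500: 32662, 5000: 33382}

# Pick the key (leaf count) whose MAE is smallest.
best_leaves = min(val_mae_by_leaves, key=val_mae_by_leaves.get)
print(best_leaves)  # 50
```

Only four candidate values were tried, so 50 is the best of those four rather than a proven optimum.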

After determining the optimal number of leaves, the model can be fit on the full training data using the optimal leaf number.

In [39]:
# train model on full training data
final_iowa_model = DecisionTreeRegressor(max_leaf_nodes=50)
final_iowa_model.fit(X, y)

final_iowa_pred = final_iowa_model.predict(X)
print(mean_absolute_error(y, final_iowa_pred))

20288.66665281336


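Before tuning, the validation MAE was about $33,000. With max_leaf_nodes=50, the validation MAE was about $27,800. The MAE of about $20,000 printed above is an in-sample figure, computed on the same data used to fit the final model, so it is not directly comparable to either validation MAE.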

#6 Random forests

#7 Handling missing values

#8 Using categorical data with one hot encoding

#9 XGBoost

#10 Partial dependence plots

#11 Pipelines

#12 Cross-validation

#13 Data leakage