# House Price Estimation
## Gradient Boosting
- Ensemble of decision trees
- Predicts numeric values (regression), here the sale price of a house
- Decision trees are models with branching decision points
- Creating large decision trees can overfit data
- Better approach is to create lots of simple decision trees and combine them at the end
- Instead of creating independent trees, boosting builds trees in sequence, with each new tree concentrating on the cases the earlier trees predicted poorly.
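
The boosting idea in the last bullet can be sketched by hand for a squared-error loss: each new shallow tree is fit to the residuals left by the trees built so far, and its prediction is added in, scaled by a learning rate. This is a minimal illustration on synthetic data, not scikit-learn's actual implementation; the data and the `n_trees` and `learning_rate` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed data, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y_demo, y_demo.mean())
errors = [np.mean(np.abs(y_demo - prediction))]

for _ in range(n_trees):
    residuals = y_demo - prediction      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)          # the new tree targets those mistakes
    prediction += learning_rate * tree.predict(X_demo)
    errors.append(np.mean(np.abs(y_demo - prediction)))

print('MAE before boosting: {:.3f}'.format(errors[0]))
print('MAE after {} trees: {:.3f}'.format(n_trees, errors[-1]))
```

Each tree on its own is a weak model (depth 2), but the training error shrinks as their scaled corrections accumulate.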

In [94]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
import pandas as pd

In [95]:
df = pd.read_csv("../data/ml_house_data_set.csv")
print(df.dtypes)
df

year_built               int64
stories                  int64
num_bedrooms             int64
full_bathrooms           int64
half_bathrooms           int64
livable_sqft             int64
total_sqft               int64
garage_type             object
garage_sqft              int64
carport_sqft             int64
has_fireplace             bool
has_pool                  bool
has_central_heating       bool
has_central_cooling       bool
house_number             int64
street_name             object
unit_number            float64
city                    object
zip_code                 int64
sale_price             float64
dtype: object


Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,house_number,street_name,unit_number,city,zip_code,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,42670,Lopez Crossing,,Hallfort,10907,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,5194,Gardner Park,,Hallfort,10907,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,4366,Harding Islands,,Lake Christinaport,11203,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,3302,Michelle Highway,,Lake Christinaport,11203,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,582,Jacob Cape,,Lake Christinaport,11203,207897.0
5,2005,1,3,2,0,1621,1672,attached,430,0,True,False,True,True,78445,Michelle Highway,,Lake Christinaport,11203,196559.0
6,1979,1,3,2,1,2285,2365,detached,532,0,True,False,True,True,246,Harris Estates,,Morrisport,10924,434697.0
7,1958,1,5,2,0,1745,1741,none,0,0,False,False,False,False,35725,Jessica Isle,,Lake Christinaport,11203,64887.0
8,1958,1,5,2,0,1747,1745,none,0,0,False,False,False,False,35725,Jessica Isle,,Lake Christinaport,11203,143636.0
9,1961,1,1,1,0,998,1161,none,0,242,False,False,False,False,73327,Kurt Crescent,,Lake Christinaport,11203,81896.0


## How much data
- Aim for 10x more data points than features as a rough rule
- More data is almost always better but not always necessary

## Feature Engineering
- Use features that correlate strongly with output values
- Useless features can harm accuracy
- You can add or drop features
- Can combine multiple features into a single one:
    - Replace multiple unit measurements with one
    - Use binning to replace exact measurements with a broader category
- One-hot encoding: replace a categorical feature with one 0/1 column per category, e.g. is_brisbane (0/1) and is_sydney (0/1)
- Curse of dimensionality: as the number of dimensions (features) in the data increases, the number of data points required to build a good model grows exponentially.
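
Binning and one-hot encoding from the list above can be sketched with pandas. The toy data, the bin edges, and the `era` column name are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data; not taken from the house dataset
toy = pd.DataFrame({
    'city': ['brisbane', 'sydney', 'perth', 'sydney'],
    'year_built': [1958, 1979, 2004, 2015],
})

# Binning: replace the exact year with a broader era category.
# right=False makes each bin include its left edge: [1900, 1970), [1970, 2000), ...
toy['era'] = pd.cut(
    toy['year_built'],
    bins=[1900, 1970, 2000, 2100],
    labels=['pre_1970', '1970_1999', '2000_on'],
    right=False,
)

# One-hot encoding: one indicator column per category value
encoded = pd.get_dummies(toy, columns=['city', 'era'])
print(encoded.columns.tolist())
```

Note that one-hot encoding adds a column for every category, so a feature with many distinct values makes the curse of dimensionality worse.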

## Engineering Home Prices Dataset
- garage_type has 3 categories that can be converted with one-hot encoding
    - Attached
    - None
    - Detached
- Features with True/False values are ok as they will be treated as 1/0s
- House and unit number are probably irrelevant and can be dropped
- Street name, city and zip code can be useful, but will include redundant information. If we know the zip code, we already know the city. Street name can be useful, but will increase the complexity of the model as we will have a feature for every street in the dataset.
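
The claim that the zip code already determines the city can be checked before dropping a column. The sketch below uses a few hypothetical rows that mirror the `city` and `zip_code` columns; on the real data the same `groupby` would run against `df`.

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's city and zip_code columns
addresses = pd.DataFrame({
    'city': ['Hallfort', 'Hallfort', 'Lake Christinaport', 'Morrisport'],
    'zip_code': [10907, 10907, 11203, 10924],
})

# Count the distinct cities seen for each zip code
cities_per_zip = addresses.groupby('zip_code')['city'].nunique()

# If every zip code maps to exactly one city, city adds nothing beyond zip_code
zip_determines_city = bool((cities_per_zip == 1).all())
print(zip_determines_city)
```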

In [96]:
del df['house_number']
del df['unit_number']
del df['street_name']
del df['zip_code']

In [99]:
# Pandas now creates uint8 instead of bool column and gradient boosting algorithm is having trouble with it

features_df = pd.get_dummies(df, columns=['garage_type', 'city'])

# remove y
del features_df['sale_price']

features_df.dtypes

year_built                   int64
stories                      int64
num_bedrooms                 int64
full_bathrooms               int64
half_bathrooms               int64
livable_sqft                 int64
total_sqft                   int64
garage_sqft                  int64
carport_sqft                 int64
has_fireplace                 bool
has_pool                      bool
has_central_heating           bool
has_central_cooling           bool
garage_type_attached         uint8
garage_type_detached         uint8
garage_type_none             uint8
city_Amystad                 uint8
city_Brownport               uint8
city_Chadstad                uint8
city_Clarkberg               uint8
city_Coletown                uint8
city_Davidfort               uint8
city_Davidtown               uint8
city_East Amychester         uint8
city_East Janiceville        uint8
city_East Justin             uint8
city_East Lucas              uint8
city_Fosterberg              uint8
city_Hallfort       

In [100]:
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X = features_df.to_numpy()
y = df['sale_price'].to_numpy()

X

array([[1978, 1, 4, ..., 0, 0, 0],
       [1958, 1, 3, ..., 0, 0, 0],
       [2002, 1, 3, ..., 0, 0, 0],
       ..., 
       [1983, 1, 1, ..., 0, 0, 0],
       [1981, 1, 3, ..., 0, 0, 0],
       [1980, 1, 3, ..., 0, 0, 0]], dtype=object)

In [101]:
y

array([  270897.,   302404.,  2519996., ...,    98280.,    98278.,
         186480.])

## Shuffle and split data into train/test datasets
- 70% data for training
- 30% data for testing

## GradientBoostingRegressor
- n_estimators: How many decision trees to build; more trees take longer to train
- learning_rate: How much each additional decision tree contributes to the overall prediction
- max_depth: How many layers deep each decision tree can go
- min_samples_leaf: Minimum number of training samples that must fall into a leaf, so a tree cannot make a decision based on only a handful of houses. Can prevent outliers from influencing the model
- max_features: Fraction of features randomly considered each time a branch is created in a decision tree (0.1 = 10%)
- loss: The loss function used to measure error during training

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = ensemble.GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=6,
    min_samples_leaf=9,
    max_features=0.1,
    loss='huber'
)

model.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='huber', max_depth=6,
             max_features=0.1, max_leaf_nodes=None,
             min_impurity_split=1e-07, min_samples_leaf=9,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=1000, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

## Saving and Loading Model
```
import joblib
# Persist the trained model to disk, then load it back
joblib.dump(model, 'trained_house_classifier_model.pkl')
model = joblib.load('trained_house_classifier_model.pkl')
```

## Testing accuracy with mean absolute error

In [107]:
# Mean absolute error on the training set
mae = mean_absolute_error(y_train, model.predict(X_train))
mae

48531.808854775212

In [108]:
# Mean absolute error on the held-out test set
mae = mean_absolute_error(y_test, model.predict(X_test))
mae

59006.933054274341
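
The cells above call `mean_absolute_error`, which averages the absolute difference between each actual and predicted value, so the result is in the same units as `sale_price`. A small worked example with made-up prices (assumed values, not from the dataset) shows the arithmetic:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only
actual = np.array([200000.0, 310000.0, 150000.0])
predicted = np.array([210000.0, 300000.0, 180000.0])

# Absolute misses are 10000, 10000 and 30000; their mean is 50000 / 3
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand)
print(mae_sklearn)
```

Read this way, the test result above means the model's predictions miss the sale price by roughly 59,000 on average for houses it has not seen.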

## Overfitting
- Model fits the training data too closely
- Memorises data without learning the underlying pattern well enough to generalize
- Models that are too complex overfit. To reduce complexity: create fewer decision trees, make each one shallower, or prefer simpler settings
- If reducing the complexity of the model doesn't work, we may not have enough data

## Underfitting
- Model is too simple to capture the trend
- Make the model more complex by building more decision trees or making each one deeper

## Identifying under/overfitting
- Compare the error on the training set with the error on the test set
- Overfitting: low error on training, high error on testing
- Underfitting: high error on both training and testing
- Good fit: low error on both training and testing
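
These rules can be tried out by training models of different complexity and printing both errors side by side. The sketch below uses synthetic data and three hypothetical settings (`simple`, `moderate`, `complex`); the exact numbers depend on the data, so treat it as an illustration rather than a benchmark.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data (assumed, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
# Three hypothetical settings: too simple, moderate, very complex
for name, n_estimators, max_depth in [
        ('simple', 1, 1), ('moderate', 100, 2), ('complex', 1000, 8)]:
    demo_model = GradientBoostingRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    demo_model.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, demo_model.predict(X_tr))
    test_mae = mean_absolute_error(y_te, demo_model.predict(X_te))
    results[name] = (train_mae, test_mae)
    print('{:<9} train MAE {:.3f}  test MAE {:.3f}'.format(
        name, train_mae, test_mae))
```

The simple model should show high error on both sets, while the complex one should drive training error close to zero and leave a much wider gap between training and test error.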

## Grid Search
- List a range of settings for each parameter and try every combination to see which gives the best result

In [None]:
model = ensemble.GradientBoostingRegressor()

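# learning_rate takes small fractions; max_depth and min_samples_leaf take whole numbers here
param_grid = {
    'n_estimators': [500, 1000, 3000],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'max_depth': [4, 6],
    'min_samples_leaf': [3, 5, 9, 17],
    'max_features': [1.0, 0.3, 0.1],
    # scikit-learn 1.0 renamed 'ls' and 'lad' to 'squared_error' and 'absolute_error'
    'loss': ['ls', 'lad', 'huber']
}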

gs_cv = GridSearchCV(model, param_grid, n_jobs=4)
gs_cv.fit(X_train, y_train)

gs_cv.best_params_
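
The grid above covers 864 parameter combinations, each trained once per cross-validation fold, so a full run can take a long time. A much smaller search on synthetic data shows the mechanics; the data, the grid values, and the choice of `neg_mean_absolute_error` scoring are assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic data (assumed, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1])
          + rng.normal(scale=0.2, size=200))

# A deliberately small, hypothetical grid so the search finishes quickly
small_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4],
    'learning_rate': [0.1, 0.05],
}
n_candidates = len(ParameterGrid(small_grid))  # 2 * 2 * 2 combinations

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    small_grid,
    scoring='neg_mean_absolute_error',  # negated so that higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

print(n_candidates)
print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best candidate
```

Without an explicit `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for regressors is R² rather than mean absolute error.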

## Feature Importance

In [122]:
feature_labels = list(features_df.columns.values)

importance = model.feature_importances_
feature_indexes_by_importance = importance.argsort()

# argsort is ascending, so features printed later are more important
for index in feature_indexes_by_importance:
    print('{} - {:.2f}%'.format(feature_labels[index], (importance[index] * 100.0)))

city_Martinezfort - 0.00%
city_Julieberg - 0.00%
city_New Michele - 0.00%
city_New Robinton - 0.00%
city_Davidtown - 0.05%
city_Lake Jennifer - 0.07%
city_Rickytown - 0.07%
city_West Terrence - 0.10%
city_Fosterberg - 0.10%
city_West Brittanyview - 0.11%
city_Amystad - 0.13%
city_East Justin - 0.13%
city_Port Daniel - 0.13%
city_South Stevenfurt - 0.14%
city_Toddshire - 0.16%
city_Leahview - 0.17%
city_Clarkberg - 0.19%
city_Davidfort - 0.19%
city_West Gerald - 0.19%
city_Jenniferberg - 0.20%
city_Joshuafurt - 0.22%
city_Port Adamtown - 0.23%
city_Brownport - 0.24%
city_West Lydia - 0.24%
city_Scottberg - 0.26%
city_Wendybury - 0.29%
city_East Lucas - 0.29%
city_Lake Dariusborough - 0.30%
city_Lake Carolyn - 0.30%
city_East Janiceville - 0.30%
city_Port Jonathanborough - 0.32%
city_West Gregoryview - 0.35%
city_Lake Christinaport - 0.35%
city_Morrisport - 0.37%
city_Richardport - 0.37%
city_East Amychester - 0.37%
city_Hallfort - 0.44%
city_Justinport - 0.46%
city_Jeffreyhaven - 0.54%
