# Part 6 - Advanced Regression Techniques
In this notebook we will investigate some popular advanced regression techniques:  
* XGBoost
* Random Forest
* MultiLayer Perceptron (Neural Network)
  
We will use the exact same dataset and features as before and compare the results with our Linear Regressor.  
You will be happy to learn that the same procedure for training a Linear Regressor applies to nearly all other regression models!

In [1]:
import time
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline

## Let's load the data and remind ourselves of the contents

In [2]:
df = pd.read_csv('./data/sf/data_clean_engineered.csv')
df.head()

Unnamed: 0,bath,bed,sqft,price,property_type_apartment,property_type_auction,property_type_coming,property_type_condo,property_type_coop,property_type_house,...,postal_code_94121,postal_code_94122,postal_code_94123,postal_code_94124,postal_code_94127,postal_code_94131,postal_code_94132,postal_code_94133,postal_code_94134,postal_code_94501
0,2.0,3.0,1520.0,1995000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1.0,1.0,566.0,625000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1.0,914.0,1196000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1.5,1.0,1022.0,935000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2.0,2.0,1912.0,2750000.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
features = [feature for feature in df.columns if feature != 'price']
X = df[features]
y = df['price']
X_np = X.values
y_np = y.values.reshape((len(df), 1))

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=123) # split 70% train, 30% validation

In [5]:
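def evaluate_model(model, X, y):
    """Print the mean squared error and the model's score on (X, y)."""
    y_pred = model.predict(X) # predict y values from input X
    mse = mean_squared_error(y_true=y, y_pred=y_pred)
    print("Mean Squared Error: {}".format(mse))
    # for scikit-learn style regressors, score() returns R^2 (the coefficient
    # of determination), which is what the "Accuracy" label reports here
    print("Accuracy: {}%".format(model.score(X, y)*100.0))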
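
The value printed as "Accuracy" comes from `model.score`, which for scikit-learn style regressors is the R^2 coefficient of determination. The sketch below checks this on synthetic data (the `*_demo` names and the data are illustrative assumptions, not part of the notebook's dataset) and also computes the root mean squared error, which is in the same units as the target and so easier to read than the raw MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (assumption): 3 features, linear target plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_demo_pred = demo_model.predict(X_demo)

# score() on a regressor and r2_score() compute the same quantity
r2_from_score = demo_model.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, y_demo_pred)

# RMSE is the square root of the MSE, expressed in the target's units
rmse = np.sqrt(mean_squared_error(y_demo, y_demo_pred))

print("R^2 via score():   {}".format(r2_from_score))
print("R^2 via r2_score(): {}".format(r2_from_metric))
print("RMSE: {}".format(rmse))
```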

# TODO: LightGBM

## XGBoost
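Like Random Forest, XGBoost combines many decision trees into a single prediction. The difference is in how the trees are built: Random Forest trains its trees independently and averages them, while XGBoost uses gradient boosting, adding trees one at a time so that each new tree corrects the errors of the trees before it.  
What is really useful about XGBoost (in addition to performing very well across many datasets) is that you can inspect the "feature importance" to get an idea of which features the model relies on to generate its predictions.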
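
To make the idea of combining weak learners more concrete, here is a minimal sketch of gradient boosting for squared error, built by hand from shallow scikit-learn decision trees. It is a simplified illustration, not XGBoost's actual implementation (which adds regularization and other refinements); the toy data, `learning_rate`, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption): noisy sine wave
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # Each weak learner is fit to the current residuals (errors so far)
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Add a damped version of the new tree's output to the running prediction
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print("Training MSE before boosting: {}".format(mse_history[0]))
print("Training MSE after 50 trees:  {}".format(mse_history[-1]))
```

Each individual depth-2 tree is a poor model on its own; the training error falls because every round targets what the previous rounds got wrong.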

Import the xgboost library and fit our regressor the same way as before.

In [6]:
from xgboost import XGBRegressor
xgb_regressor = XGBRegressor()
xgb_model = xgb_regressor.fit(X_train, y_train)
evaluate_model(xgb_model, X_val, y_val)

Mean Squared Error: 218177234338.2799
Accuracy: 75.99880238118249%


### Visualize the Feature Importance that XGBRegressor has assigned

In [7]:
# create a dataframe of feature importances
feature_importances = pd.DataFrame(columns=X.columns)
feature_importances.loc[0] = xgb_model.feature_importances_
# melt columns so we can easily sort and visualize
df_melt = pd.melt(feature_importances, value_vars=X.columns).sort_values(by='value', ascending=False)
df_melt

Unnamed: 0,variable,value
2,sqft,0.393841
0,bath,0.139384
14,postal_code_94105,0.045381
33,postal_code_94133,0.038898
8,property_type_house,0.037277
6,property_type_condo,0.035656
10,property_type_new,0.034036
1,bed,0.030794
26,postal_code_94121,0.029173
29,postal_code_94124,0.027553
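
The same ranking can be produced more compactly with a `pd.Series`. The helper below, `top_importances`, is a hypothetical name introduced for illustration; it assumes only that the fitted model exposes a `feature_importances_` attribute, as `XGBRegressor` and scikit-learn's tree ensembles do. A small Random Forest on synthetic data stands in for the notebook's model so the sketch runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_importances(model, feature_names, n=10):
    """Return the n largest feature importances as a sorted Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n)

# Synthetic stand-in data (assumption): only column 'a' drives the target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y_toy = 5.0 * X_toy['a'] + rng.normal(scale=0.1, size=200)

toy_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)
top = top_importances(toy_model, X_toy.columns, n=3)
print(top)
```

With the notebook's own objects, `top_importances(xgb_model, X.columns).plot.barh()` should draw the ranking as a horizontal bar chart.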


### Retrain on entire dataset and save model to disk

In [8]:
xgb_model = xgb_regressor.fit(X, y)
with open('./models/sf/xgb.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)
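
For completeness, here is a sketch of the reverse step: loading a pickled model and predicting with it. It uses a small linear model and a temporary directory (both assumptions, so the example does not touch the files written above), but the `pickle.load` pattern is the same for any of the models saved in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model (assumption) so the round trip runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -0.5])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    # Save: open in binary write mode, as in the cell above
    with open(model_path, 'wb') as f:
        pickle.dump(demo_model, f)
    # Load: open in binary read mode
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(demo_model.predict(X_demo), restored_model.predict(X_demo))
print("Predictions match after reload: {}".format(same_predictions))
```

Only unpickle files from sources you trust, and load them with the same library versions that wrote them where possible.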

## Random Forest  
https://en.wikipedia.org/wiki/Random_forest

In [9]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
rf_model = rf_regressor.fit(X_train, y_train)
evaluate_model(rf_model, X_val, y_val)

Mean Squared Error: 232236896432.18262
Accuracy: 74.45212988167529%
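
To see what "combining" means for a Random Forest, the sketch below checks that scikit-learn's `RandomForestRegressor` prediction is the average of its individual trees' predictions, which are available through the fitted `estimators_` attribute. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=150)

# random_state makes the fitted forest reproducible between runs
forest = RandomForestRegressor(n_estimators=25, random_state=123).fit(X_toy, y_toy)

# Predict with every tree separately, then average across trees
per_tree = np.array([tree.predict(X_toy) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches_forest = np.allclose(manual_average, forest.predict(X_toy))
print("Forest prediction equals mean of tree predictions: {}".format(matches_forest))
```

Note that the notebook's `RandomForestRegressor()` call sets no `random_state`, so its scores can differ slightly between runs.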


### Retrain on entire dataset and save model to disk

In [10]:
rf_model = rf_regressor.fit(X, y)
with open('./models/sf/random_forest.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

## MultiLayer Perceptron
https://en.wikipedia.org/wiki/Multilayer_perceptron

In [11]:
from sklearn.neural_network import MLPRegressor
mlp_regressor = MLPRegressor(max_iter=20000, random_state=123, solver='lbfgs')
mlp_model = mlp_regressor.fit(X_train, y_train)
evaluate_model(mlp_model, X_val, y_val)

Mean Squared Error: 301642137841.24176
Accuracy: 66.81701194697015%
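
Neural networks are generally sensitive to the scale of their inputs, and here `sqft` is in the hundreds or thousands while the one-hot columns are 0 or 1. One option worth trying (an assumption, not something evaluated in this notebook) is to standardize the features inside a scikit-learn pipeline. The sketch below shows the pattern on synthetic data with two features on very different scales; it makes no claim about how much this would change the score above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): a large-scale and a small-scale feature
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.uniform(500, 4000, size=300),   # e.g. something like square footage
    rng.integers(1, 6, size=300),       # e.g. something like a room count
])
y_toy = 0.5 * X_toy[:, 0] + 100.0 * X_toy[:, 1] + rng.normal(scale=10, size=300)

# The scaler is fit on the training data only, then applied before the MLP
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(max_iter=2000, random_state=123, solver='lbfgs'),
)
scaled_mlp.fit(X_toy, y_toy)
toy_pred = scaled_mlp.predict(X_toy)
print("R^2 on the toy training data: {}".format(scaled_mlp.score(X_toy, y_toy)))
```

Because the pipeline exposes the same `fit`, `predict`, and `score` methods, it could be passed to `evaluate_model` unchanged.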


### Retrain on entire dataset and save model to disk

In [12]:
mlp_model = mlp_regressor.fit(X, y)
with open('./models/sf/mlp.pkl', 'wb') as f:
    pickle.dump(mlp_model, f)

## As you can see, once our data pipeline is established, it is quite easy to implement various regressors!
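
Because every regressor shares the same `fit` / `predict` / `score` interface, the comparison can be written as a single loop. The sketch below uses synthetic data and two scikit-learn models (the data, the `models` dictionary, and the `results` layout are assumptions for illustration); other regressors with the same interface, such as `XGBRegressor`, could be added as further entries.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

# Same 70/30 split convention as earlier in the notebook
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.30, random_state=123)

models = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=123),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        'mse': mean_squared_error(y_va, model.predict(X_va)),
        'r2': model.score(X_va, y_va),  # R^2 on the validation split
    }

for name, metrics in results.items():
    print("{}: MSE={}, R^2={}".format(name, metrics['mse'], metrics['r2']))
```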