#### REGRESSION TREE METHODS

If we want to know not just which general features or variables predict Y, but which specific conditions on the predictors do (such as being male, being young, or having an income below $10,000), then a regression tree is appropriate.

It involves stratifying or segmenting the predictor space into several simple regions.
To make a prediction for a given observation, the method typically uses the mean (for regression) or the mode (for classification) of the training observations in the region to which it belongs.
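
To make the idea of "regions" concrete, here is a minimal sketch of a tree with a single split, written by hand with NumPy. The toy data and the helper names `best_split` and `predict` are made up for illustration; scikit-learn's own implementation is more general and is not reproduced here.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on one feature x that minimises the total
    squared error when each side is predicted by its own mean."""
    best_threshold, best_sse = None, np.inf
    # Candidate thresholds: midpoints between consecutive sorted unique values
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = t, sse
    return best_threshold

# Toy data (made up): y jumps from about 1 to about 5 once x exceeds 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

threshold = best_split(x, y)
left_mean = y[x <= threshold].mean()
right_mean = y[x > threshold].mean()

def predict(new_x):
    # Prediction = mean of the training observations in the matching region
    return left_mean if new_x <= threshold else right_mean

print(threshold, left_mean, right_mean)
```

A full regression tree repeats this search inside each region, over all features, until a stopping rule such as a maximum depth is reached.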

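* Idea: when the data has numerous features that interact in complicated, nonlinear ways, it is difficult to fit one global model.
We then need a stratified model that makes simple predictions within regions defined by specific predictor values.
* Node = a point in the tree where a decision (split) is made; a leaf is a terminal node that holds the prediction.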

Two models: Regression Tree and Random Forest


#### SIMPLE REGRESSION TREE

##### Step 1: Import Packages and read dataset

In [None]:
import numpy as np
import pandas as pd

from IPython.display import display
#http://python.6.x6.nabble.com/IPython-User-ipython-notebook-how-to-display-image-not-from-pylab-td4497427.html

# plotting modules
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set(style='whitegrid', context='notebook')

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

In [None]:
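# Path to the CSV file (left unspecified in the original notebook)
path = '~/'
# Name the DataFrame `data`, which is the name used in the following steps
data = pd.read_csv(path)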

##### Step 2: Define Y and Xs

In [None]:
## Define y
y = data['log_inc']

## Define X (exclude inc, incsq, log_inc)
columns_ = data.columns.tolist()
exclude_cols = ['inc', 'incsq', 'log_inc'] 
X = data[[i for i in columns_ if i not in exclude_cols]] 
## Print shapes of y and X
print(y.shape, X.shape)


##### Step 3 : Split Train and Test Samples 

In [None]:
## Train test split 70/30
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Print shapes of X(s) and y(s)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


##### Step 4 : Call and Build a regression Tree model

In [None]:
#Build a regression tree
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()

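# First, we specify the candidate values for each parameter in a dictionary
# that maps parameter names (strings) to lists of values.
params = {"max_depth": [3, 5, 10, 20],
          # "auto" is no longer accepted by recent scikit-learn trees; "sqrt" is used here as a second option
          "max_features": [None, "sqrt"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7],
          # 'mse' was renamed 'squared_error' in scikit-learn 1.0
          "criterion": ["squared_error"]
         }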


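* max_depth specifies the maximum depth of the tree, that is, how many successive splits are allowed between the root and a leaf. For us, the candidates range from 3 to 20.
A very shallow tree (for example, a max_depth of 1) tends to suffer from high bias (underfitting).
A very deep tree (for example, a max_depth of 10 or more) tends to suffer from high variance (overfitting).

* max_features defines the number of features (independent variables) considered when looking for the best split.
None means all features are considered at every split; a smaller value adds randomness and can reduce overfitting.

* min_samples_leaf: the minimum number of samples required to be at a leaf node.
* min_samples_split: the minimum number of samples required to split an internal node.
* max_leaf_nodes: the maximum number of leaves in the tree.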
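
The bias/variance effect of max_depth can be checked with a small experiment. The sketch below uses synthetic data (a noisy sine curve, an assumption made only for illustration, not the income dataset of this notebook) and compares train and test R^2 for several depths.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

depth_scores = {}
for depth in [1, 3, 10, 20]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(Xtr, ytr)
    # .score returns R^2 for regressors
    depth_scores[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))
    print("max_depth=%2d  train R^2=%.2f  test R^2=%.2f" % ((depth,) + depth_scores[depth]))
```

The train R^2 can only go up as the tree gets deeper, while the test R^2 typically peaks at a moderate depth; a widening gap between the two is the sign of overfitting.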



##### Step 5: Cross Validate the method 

In [None]:
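### Here we cross-validate using grid search
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV
dtr_gs = GridSearchCV(dtr, params, n_jobs=-1, cv=5, verbose=1)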


* n_jobs=-1 runs the search in parallel on all available CPU cores.
* cv is cross-validation: with cv=5 the training sample is split into 5 equal folds, and each parameter combination is trained on 4 folds and validated on the remaining one, 5 times in turn.
* verbose specifies how many messages the search displays. Higher verbosity means that more progress messages are printed as the search goes on.

* Grid search is a technique for finding good values for model parameters that cannot be learned directly from the data (hyperparameters). Here we ask the computer to try every combination in params and keep the one with the best cross-validated score.
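
What the grid search does internally can be approximated by hand. The following sketch loops over every parameter combination and scores each with 5-fold cross-validation. The synthetic data and the reduced grid `small_grid` are assumptions chosen to keep the example fast; `GridSearchCV` does the same bookkeeping (plus a final refit on all the training data) in one call.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

# A deliberately small grid: 3 x 2 = 6 combinations
small_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

results = []
for combo in ParameterGrid(small_grid):
    model = DecisionTreeRegressor(random_state=0, **combo)
    # 5-fold CV: fit on 4 folds, score (R^2 by default) on the held-out fold, 5 times
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results.append((fold_scores.mean(), combo))

# Keep the combination with the highest mean cross-validated score
best_score, best_combo = max(results, key=lambda r: r[0])
print(len(results), "combinations tried; best:", best_combo, "mean R^2 = %.3f" % best_score)
```

The cost grows quickly: every combination is fitted once per fold, so the number of fits is the number of combinations times cv.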

##### Step 6 : Fit your best model found in gridsearch

In [None]:
#Fit the tree model : Now we need to bring everything together and build a model on the train data
dtr_gs.fit(X_train, y_train)

# Print best estimator, best parameters, and best score (best fit to explain Y)
# dtr_best is the regression tree with the best parameters found by the search

dtr_best = dtr_gs.best_estimator_
print("best estimator", dtr_best)
print("\n==========\n")
print("best parameters", dtr_gs.best_params_)
print("\n==========\n")
# For a regressor, the default score is the mean cross-validated R^2
print("best score", dtr_gs.best_score_)


##### Step 7: Define a function that we will use to print all features (independent variables) by importance.
The features with the most explanatory power appear first.

In [None]:
# Define a function to print feature importances using the best model

def feature_importance(X, best_model):
    # One row per feature, sorted from most to least important
    importance_df = pd.DataFrame({'feature': X.columns, 'importance': best_model.feature_importances_})
    return importance_df.sort_values('importance', ascending=False)

#Using the function
feature_importance(X, dtr_best)
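
To get a feel for what these importances mean, here is a small sketch on made-up data in which only one column (named `signal` here, a hypothetical name) actually drives the target. The importances of a fitted tree sum to 1, and the informative column should receive almost all of the weight.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up data: the target depends on "signal" only; the other columns are pure noise
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    "signal": rng.uniform(0, 10, 200),
    "noise_1": rng.uniform(0, 10, 200),
    "noise_2": rng.uniform(0, 10, 200),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.5, size=200)

toy_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_toy, y_toy)

# Same table as feature_importance() builds: one row per feature, sorted
toy_importance = pd.DataFrame({
    "feature": X_toy.columns,
    "importance": toy_tree.feature_importances_,
}).sort_values("importance", ascending=False)
print(toy_importance)
```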


##### Step 8 : Predict on the Test sample. The model was trained on the train data; now we need to test it on unseen data

In [None]:
#Predict on the Test Data
y_pred_dtr= dtr_best.predict(X_test)
y_pred_dtr


##### Step 9 : Evaluate the performance of your model (MSE in train and test data, R2 in train and test data)
You need to know if your model performed well on the test data. 
Evaluation using MSE and R^2
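
For reference, with observed values $y_i$, predictions $\hat{y}_i$, sample mean $\bar{y}$ and $n$ observations, the two metrics are usually defined as follows (these are the standard definitions, which `mean_squared_error` and `r2_score` follow):

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

A lower MSE is better. An R^2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and the value can be negative on test data when the model does worse than that.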

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


# Define a function that prints the MSE and R^2 at once, given the name of the method and the best model

def rsquare_meansquare_error(train_y, test_y, train_X, test_X, model_name, best_model):
    # First we need to predict on the train and test data
    y_train_pred = best_model.predict(train_X)
    y_test_pred = best_model.predict(test_X)

    # MSE on the train and test data
    print('MSE ' + model_name + ' train data: %.2f, test data: %.2f' % (
        mean_squared_error(train_y, y_train_pred),
        mean_squared_error(test_y, y_test_pred)))

    # R^2 on the train and test data
    print('R^2 ' + model_name + ' train data: %.2f, test data: %.2f' % (
        r2_score(train_y, y_train_pred),
        r2_score(test_y, y_test_pred)))


In [None]:
#Using function
rsquare_meansquare_error(y_train, y_test, X_train, X_test, "Regression tree", dtr_best)


Use the MSE and R^2 on the train and test samples, together with the same metrics for another model such as a random forest, to judge whether your model is good. For a model that generalizes well, the train and test results should be close; a train score much better than the test score indicates overfitting.

#### For Tree Visualization

In [None]:
# Visualize your tree using the "best" parameters/estimator
# REQUIREMENTS:
# pip install pydotplus
# brew install graphviz   (macOS; use your platform's package manager otherwise)

# Use graphviz to make a chart of the regression tree decision points:
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
# dtr_best was defined before in Step 6

## Graph
export_graphviz(dtr_best, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=list(X.columns))

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())



The tree is drawn from the top down. The split at the root is the one that reduces the error the most on the full training sample, so the feature used there usually has the most explanatory power; splits further down refine the prediction within smaller and smaller regions.
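
If graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same decision rules. The sketch below fits a small tree on made-up data (the feature names `x0` and `x1` are hypothetical); in the notebook you would pass `dtr_best` and `list(X.columns)` instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: the target jumps by 10 when the first feature exceeds 5
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = np.where(X_toy[:, 0] > 5, 10.0, 0.0) + rng.normal(scale=0.1, size=200)

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per node; indentation shows the depth, leaves show the predicted value
rules = export_text(toy_tree, feature_names=["x0", "x1"])
print(rules)
```

The first line of the output is the root split, which here should be on `x0` because that feature carries the jump in the target.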

#### RANDOM FOREST

A random forest builds on bagging (“Bootstrap Aggregating”). Instead of training one tree on the whole sample, it trains M different trees, each on a bootstrap sample of the observations (drawn at random with replacement), and then averages their predictions to form the ensemble. In addition, each split considers only a random subset of the features.
Random forests often have very good predictive accuracy, because averaging many trees reduces variance.
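
The mechanism can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, with an arbitrary M = 25 and max_features = 2 chosen as assumptions; `RandomForestRegressor` wraps the same idea in one estimator and should be preferred in practice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

M = 25  # number of trees (arbitrary choice for this sketch)
trees = []
for m in range(M):
    # Bootstrap sample: draw n row indices at random, with replacement
    idx = rng.randint(0, len(X_demo), size=len(X_demo))
    # max_features < n_features makes each split consider a random subset of features
    tree = DecisionTreeRegressor(max_features=2, random_state=m)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Ensemble prediction = average of the individual tree predictions
all_preds = np.array([t.predict(X_demo) for t in trees])
bagged_pred = all_preds.mean(axis=0)
print(all_preds.shape, bagged_pred.shape)
```

Each individual tree overfits its own bootstrap sample, but the trees make different errors, so their average is more stable than any single one.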

The commands follow the same process as before, with only a few changes.


##### Step 1: Import Packages and read dataset

##### Step 2: Define Y and Xs (Same)

In [None]:
## Define y
y = data['log_inc']

## Define X (exclude inc, incsq, log_inc)
columns_ = data.columns.tolist()
exclude_cols = ['inc', 'incsq', 'log_inc'] 
X = data[[i for i in columns_ if i not in exclude_cols]] 
## Print shapes of y and X
print(y.shape, X.shape)


##### Step 3 : Split Train and Test Samples (Same)

In [None]:
## Train test split 70/30
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Print shapes of X(s) and y(s)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


##### Step 4 : Call and Build a random forest model (different)

In [None]:
# Build a random forest regressor
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor( )

params = {'max_depth':[3,4,5], 
          'max_features':[2,3,4], 
          'max_leaf_nodes':[5,6,7], 
          'min_samples_split':[3,4],
         'n_estimators': [100]
         }



##### Step 5: Cross Validate the method

In [None]:
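### Here we cross-validate using grid search
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV
estimator_rfr = GridSearchCV(forest, params, n_jobs=-1, cv=5, verbose=1)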


##### Step 6 : Fit your best model found in gridsearch (different)

In [None]:
# Fit the random forest model: now we bring everything together and build a model on the train data
# (the search must be fitted before best_estimator_ is available)
estimator_rfr.fit(X_train, y_train)

# Print best estimator, best parameters, and best score (best fit to explain Y)
# rfr_best is the random forest regressor with the best parameters found by the search

rfr_best = estimator_rfr.best_estimator_
print("best estimator", rfr_best)
print("\n==========\n")
print("best parameters", estimator_rfr.best_params_)
print("\n==========\n")
print("best score", estimator_rfr.best_score_)



##### Step 7: Print all features (independent variables) by importance.
Just call the function defined before; the features with the most explanatory power appear first.

In [None]:
#Using the function defined in Simple Reg Tree to Print feature Importance
feature_importance(X, rfr_best)



##### Step 8 : Predict on the Test sample. The model was trained on the train data; now we need to test it on unseen data

In [None]:
#Predict on the Test Data
y_pred_rfdtr= rfr_best.predict(X_test)
y_pred_rfdtr

##### Step 9 : Evaluate the performance of your model (MSE in train and test data, R2 in train and test data)
You need to know if your model performed well on the test data. 
Evaluation using MSE and R^2.

We already created the function above

In [None]:
#Evaluate the performance of your model (MSE in train and test data, R2 in train and test data) using function created above
rsquare_meansquare_error(y_train, y_test, X_train, X_test, "Random Forest Regression tree", rfr_best)


The last step is to compare the MSE and R^2 of the simple regression tree and the random forest.
Prefer the model whose train and test values are closest to each other and whose test R^2 is high (equivalently, whose test MSE is low).
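
One possible way to line the two models up is a small comparison table. The helper `compare_models` below is hypothetical (it is not part of scikit-learn), and the data is synthetic; in the notebook you would pass `dtr_best` and `rfr_best` with the real train and test samples.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def compare_models(models, X_train, X_test, y_train, y_test):
    """Return one row per fitted model with train R^2, test R^2 and their gap."""
    rows = []
    for name, model in models.items():
        r2_train = r2_score(y_train, model.predict(X_train))
        r2_test = r2_score(y_test, model.predict(X_test))
        rows.append({"model": name, "r2_train": r2_train,
                     "r2_test": r2_test, "gap": r2_train - r2_test})
    return pd.DataFrame(rows)

# Synthetic data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

comparison = compare_models(
    {
        "Regression tree": DecisionTreeRegressor(max_depth=5, random_state=0).fit(Xtr, ytr),
        "Random forest": RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0).fit(Xtr, ytr),
    },
    Xtr, Xte, ytr, yte,
)
print(comparison)
```

A small `gap` together with a high `r2_test` is what the rule above asks for.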