## Random Forest

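 - Ensemble of Decision Trees
 - Each tree is trained with the bagging method (bootstrap aggregating: repeated sampling with replacement)
     - Sample observations: each tree is fit on a bootstrap sample of the training rows
     - Sample predictors: each split considers a random subset of m of the p predictors. Common choices are m=sqrt(p) for classification and m=p/3 for regression problems.
 - Sampling the predictors decorrelates the trees, so averaging their predictions reduces variance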
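
The two sampling steps above can be sketched with NumPy alone. This is only an illustration under stated assumptions, not scikit-learn's internal implementation: `draw_tree_sample` is a hypothetical helper, the data is a toy array, and the feature subset is drawn once per call, whereas a random forest redraws it at every split.

```python
import numpy as np

def draw_tree_sample(X, y, rng):
    """Hypothetical helper: one bootstrap sample plus one random feature subset."""
    n, p = X.shape
    rows = rng.integers(0, n, size=n)             # sample observations with replacement
    m = max(1, int(np.sqrt(p)))                   # m = sqrt(p), the classification choice
    cols = rng.choice(p, size=m, replace=False)   # sample predictors without replacement
    return X[rows][:, cols], y[rows], rows, cols

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 9))      # toy data: 100 observations, 9 predictors
y_toy = rng.integers(0, 2, size=100)

X_boot, y_boot, rows, cols = draw_tree_sample(X_toy, y_toy, rng)

# Fraction of observations never drawn into this bootstrap sample ("out-of-bag")
oob_fraction = 1 - len(np.unique(rows)) / len(rows)
print(X_boot.shape, sorted(cols), round(oob_fraction, 2))
```

Because sampling is with replacement, some rows appear more than once and others are left out of a given tree's sample entirely.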
 

Random Forest

 - Samples both observations and features of the training data
 

Bagging

 - Samples only observations at random
 - Each Decision Tree considers all features and selects the best one when splitting a node
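
The difference between the two approaches can be tried out side by side. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the Titanic data used later, and the settings (50 trees, 16 features) are arbitrary choices, so the scores say nothing general about which method is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=16,
                                     n_informative=4, random_state=0)

# Bagging: bootstrap the observations; every split may use all 16 features
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=50, random_state=0)

# Random Forest: bootstrap the observations and consider sqrt(16) = 4 features per split
forest_clf = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                    random_state=0)

bag_acc = cross_val_score(bag_clf, X_demo, y_demo, cv=5, scoring='accuracy').mean()
forest_acc = cross_val_score(forest_clf, X_demo, y_demo, cv=5, scoring='accuracy').mean()
print("Bagging: {:.3f}  Random Forest: {:.3f}".format(bag_acc, forest_acc))
```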

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [7]:
df = sns.load_dataset('titanic')
df.dropna(inplace=True)
# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
X = df[['pclass', 'sex', 'age']].copy()
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
# Encode 'sex' as 0/1; ravel() flattens the (n, 1) result into a single column
X['sex'] = lb.fit_transform(X['sex']).ravel()
y = df['survived']
  


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    Print the accuracy score, classification report, and confusion matrix of a classifier.
    With train=True, also report 10-fold cross-validated accuracy on the training set.
    '''
    if train:
        # Training performance
        pred = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, pred)))
        print("Classification Report:\n {} \n".format(classification_report(y_train, pred)))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_train, pred)))

        # cross_val_score clones clf and refits it on each of the 10 folds
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD \t\t {0:.4f}".format(np.std(res)))

    else:
        # Test performance
        pred = clf.predict(X_test)
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, pred)))
        print("Classification Report:\n {} \n".format(classification_report(y_test, pred)))
        print("Confusion Matrix:  \n {} \n".format(confusion_matrix(y_test, pred)))

In [10]:
rf_clf = RandomForestClassifier(random_state=42)

In [11]:
rf_clf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [12]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9764

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.99      0.98      0.98        84

   micro avg       0.98      0.98      0.98       127
   macro avg       0.97      0.98      0.97       127
weighted avg       0.98      0.98      0.98       127
 

Confusion Matrix:  
 [[42  1]
 [ 2 82]] 

Average Accuracy: 	 0.8067
Accuracy SD 		 0.1399


In [13]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7818

Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.62      0.62        16
           1       0.85      0.85      0.85        39

   micro avg       0.78      0.78      0.78        55
   macro avg       0.74      0.74      0.74        55
weighted avg       0.78      0.78      0.78        55
 

Confusion Matrix:  
 [[10  6]
 [ 6 33]] 



### Grid Search

In [14]:
from sklearn import pipeline
from sklearn.model_selection import GridSearchCV

In [15]:
rf_clf = RandomForestClassifier(random_state=42)

In [16]:
params_grid = {'max_depth': [3, None],
               'min_samples_split': [2, 3, 10],
               'min_samples_leaf': [1, 3, 10],
               'criterion': ['gini', 'entropy']}

In [17]:
grid_search = GridSearchCV(rf_clf, params_grid,
                          n_jobs=-1, cv=5,
                          verbose=1, scoring='accuracy')

In [18]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:   16.8s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   18.6s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_depth': [3, None], 'min_samples_split': [2, 3, 10], 'min_samples_leaf': [1, 3, 10], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [19]:
grid_search.best_score_

0.8110236220472441

In [20]:
grid_search.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [23]:
print_score(grid_search, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9685

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95        43
           1       0.99      0.96      0.98        84

   micro avg       0.97      0.97      0.97       127
   macro avg       0.96      0.97      0.97       127
weighted avg       0.97      0.97      0.97       127
 

Confusion Matrix:  
 [[42  1]
 [ 3 81]] 

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    1.5s finished

Average Accuracy: 	 0.7430
Accuracy SD 		 0.1313

The fitting log is printed once per cross-validation fold (10 times in total) because cross_val_score refits the entire grid search on each fold. Only the first occurrence is shown.


In [24]:
print_score(grid_search, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7636

Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.56      0.58        16
           1       0.82      0.85      0.84        39

   micro avg       0.76      0.76      0.76        55
   macro avg       0.71      0.70      0.71        55
weighted avg       0.76      0.76      0.76        55
 

Confusion Matrix:  
 [[ 9  7]
 [ 6 33]] 



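The grid search did slightly worse on the test set (0.7636 vs. 0.7818 accuracy). With only 55 test samples, that gap is a single prediction (42 vs. 43 correct), so it is well within noise. The dataset is small and has only three features, which limits what tuning can gain. A more extensive grid search might still help.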
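
As a rough sketch of what a more extensive search could look like, the grid below also varies `n_estimators` and `max_features`. The specific values in `wider_grid` are illustrative assumptions, not tuned recommendations, and synthetic data stands in for the Titanic features so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# Hypothetical wider grid: 2 * 2 * 3 * 2 = 24 candidates
wider_grid = {'n_estimators': [50, 100],
              'max_features': ['sqrt', None],
              'max_depth': [3, 5, None],
              'min_samples_leaf': [1, 3]}

wider_search = GridSearchCV(RandomForestClassifier(random_state=42), wider_grid,
                            cv=5, scoring='accuracy', n_jobs=1)
wider_search.fit(X_tr, y_tr)

print(wider_search.best_params_)
print("CV accuracy:   {:.4f}".format(wider_search.best_score_))
print("Test accuracy: {:.4f}".format(wider_search.score(X_te, y_te)))
```

Fixing `random_state` in `train_test_split` and stratifying on the target keeps the comparison between runs reproducible, which matters when the test set is small.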

***

## Extra-Trees (Extremely Randomized Trees) Ensemble

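 - Like Random Forest, Extra-Trees is an ensemble built upon Decision Trees
 - A Decision Tree chooses each split by searching for the feature threshold that best optimizes a criterion such as gini or entropy
 - Extra-Trees instead draws a random threshold for each candidate feature and keeps the best of those random splits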
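
The difference in how a threshold is chosen can be sketched for a single feature. This is a simplified illustration, not scikit-learn's implementation: `gini` and `split_impurity` are hypothetical helpers, the data is a toy feature with noisy labels, and only one feature is considered.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = x <= threshold
    n = len(y)
    return left.sum() / n * gini(y[left]) + (~left).sum() / n * gini(y[~left])

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)      # one toy feature
y = (x > 6).astype(int)               # class depends on the feature ...
flip = rng.random(200) < 0.1
y[flip] = 1 - y[flip]                 # ... with roughly 10% label noise

# Decision Tree style: search all candidate thresholds for the best one
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2   # midpoints between sorted values
best_threshold = min(candidates, key=lambda t: split_impurity(x, y, t))

# Extra-Trees style: draw one threshold uniformly from the feature's range
random_threshold = rng.uniform(x.min(), x.max())

best_imp = split_impurity(x, y, best_threshold)
random_imp = split_impurity(x, y, random_threshold)
print("best:   {:.2f} -> {:.3f}".format(best_threshold, best_imp))
print("random: {:.2f} -> {:.3f}".format(random_threshold, random_imp))
```

The random split is never purer than the searched one, but it is much cheaper to compute and adds extra randomness across the trees of the ensemble.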

In [25]:
from sklearn.ensemble import ExtraTreesClassifier

In [26]:
xt_clf = ExtraTreesClassifier(random_state=42)

In [27]:
xt_clf.fit(X_train, y_train)



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

In [29]:
print_score(xt_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9764

Classification Report:
               precision    recall  f1-score   support

           0       0.93      1.00      0.97        43
           1       1.00      0.96      0.98        84

   micro avg       0.98      0.98      0.98       127
   macro avg       0.97      0.98      0.97       127
weighted avg       0.98      0.98      0.98       127
 

Confusion Matrix:  
 [[43  0]
 [ 3 81]] 

Average Accuracy: 	 0.8399
Accuracy SD 		 0.1222


In [30]:
print_score(xt_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7091

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.56      0.53        16
           1       0.81      0.77      0.79        39

   micro avg       0.71      0.71      0.71        55
   macro avg       0.66      0.67      0.66        55
weighted avg       0.72      0.71      0.71        55
 

Confusion Matrix:  
 [[ 9  7]
 [ 9 30]] 



Extra-Trees scored lower on the test set than the Random Forest here (0.7091 vs. 0.7818 accuracy). A grid search over its hyperparameters might give better results.