# Random Forests

## PART 1 : THEORY

In [None]:
################## ENSEMBLES OF MODELS ###############

## A widely used and effective method in machine learning is to combine learning models into ensembles. 

## An ensemble takes multiple individual learning models and combines them to produce an aggregate model that 
## is more powerful than any of its individual models alone. Why are ensembles effective? 
## One reason is that if we have DIFFERENT LEARNING MODELS, each of them might perform well individually,
## but they'll tend to make different kinds of mistakes on the data set. Typically, this happens because each 
## individual model might overfit to a different part of the data. 

## By COMBINING DIFFERENT individual models into an ensemble, we can AVERAGE OUT their individual mistakes 
## to reduce the risk of overfitting while maintaining strong prediction performance.
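
The averaging-out effect can be illustrated with a small simulation. This is only a sketch under an idealized assumption: 25 hypothetical binary classifiers, each correct 70% of the time, with errors that are fully independent of one another. Real models are usually correlated, so the gain in practice tends to be smaller. The numbers and variable names here are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, p_correct = 25, 10000, 0.7  # illustrative values

# correct[i, j] is True when hypothetical model i classifies sample j correctly.
# Each entry is drawn independently, which is the idealized assumption.
correct = rng.random((n_models, n_samples)) < p_correct

# Average accuracy of the individual models
individual_accuracy = correct.mean(axis=1).mean()

# The ensemble is correct on a sample when more than half of the models are correct
majority_correct = correct.sum(axis=0) > n_models / 2
ensemble_accuracy = majority_correct.mean()

print('Mean individual accuracy: {:.3f}'.format(individual_accuracy))
print('Majority-vote accuracy:   {:.3f}'.format(ensemble_accuracy))
```

Under the independence assumption the majority vote is right much more often than any single model, because it is unlikely that most models err on the same sample.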

In [None]:
################# RANDOM FORESTS A QUICK LOOK #########

## A random forest is an ensemble of decision trees. As the name suggests, the trees are made to differ 
## from one another by introducing random variation into the process of building each decision tree. 
## This random variation during tree building happens in two ways. 

## First, the data used to build each tree is selected randomly, and second, the features considered in each 
## split test are also randomly selected.

<img src="rpics/r1.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## Random Forest Process
## To create a random forest model you first decide on how many trees to build. This is set using the n_estimators 
## parameter for both RandomForestClassifier and RandomForestRegressor. Each tree is built from a
## different random sample of the data called the bootstrap sample. 
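
As a minimal sketch of the step above, the snippet below builds a forest with a chosen number of trees. The synthetic dataset and the value 25 are assumptions made for illustration only; they do not come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)

# n_estimators sets how many trees are built; bootstrap=True (the default)
# means each tree is fit on its own bootstrap sample of the training rows
forest = RandomForestClassifier(n_estimators=25, bootstrap=True,
                                random_state=0).fit(X_demo, y_demo)

# The fitted trees are stored in the estimators_ attribute
print('Number of trees in the forest:', len(forest.estimators_))
```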

<img src="rpics/r2.png" alt="Drawing" style="width: 700px;"/>

In [None]:
#################### BOOTSTRAPPING : Bootstrap Samples #################

## Bootstrap samples are commonly used in statistics and machine learning. If your training set has 
## N instances or samples in total, a bootstrap sample of size N is created by just 
## REPEATEDLY  picking one of the N dataset rows at RANDOM with REPLACEMENT, that is, allowing for 
## the possibility of picking the same row again at each selection. 

## You repeat this random selection process N times. The resulting bootstrap sample has N rows just like 
## the original training set but with possibly some rows from the original dataset missing and others 
## occurring multiple times just due to the nature of the random selection with replacement. 
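
The sampling procedure described above can be sketched with NumPy. The ten row indices stand in for a hypothetical training set; `rng.choice` with `replace=True` performs the N random draws with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
rows = np.arange(N)  # stand-in for the N row indices of a training set

# Draw N row indices at random, with replacement
bootstrap_rows = rng.choice(rows, size=N, replace=True)

# Rows that were never picked are missing from this bootstrap sample
missing_rows = np.setdiff1d(rows, bootstrap_rows)

print('Bootstrap sample:', np.sort(bootstrap_rows))
print('Rows left out:   ', missing_rows)
```

The sample always has N entries, but because of the replacement some indices typically repeat and others do not appear at all.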

<img src="rpics/r3.png" alt="Drawing" style="width: 900px;"/>

In [None]:
############### MAX_FEATURES_PARAMETER ###################

## The random forest model is quite sensitive to the max_features parameter. If max_features is set to one, 
## the random forest is limited to performing a split on the SINGLE FEATURE that was selected randomly 
## instead of being able to take the BEST SPLIT over SEVERAL VARIABLES. 

## This means the trees in the forest will likely be very different from each other and possibly with many 
## levels in order to produce a good fit to the data. 

## On the other hand, if max_features is HIGH, close to the total number of features that each instance has, 
## the trees in the forest will tend to be similar and probably will require fewer levels to fit the data 
## using the most informative features.
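
One way to explore this is to fit two forests that differ only in max_features and compare the depth of their trees. The sketch below uses a synthetic dataset chosen for illustration (not from the notebook); the depths you see depend on the data and the random seed, so treat it as an experiment rather than a guaranteed outcome.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset with 10 features, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

mean_depths = {}
# Compare a single random feature per split with all features per split
for max_features in (1, X_demo.shape[1]):
    forest = RandomForestClassifier(n_estimators=50,
                                    max_features=max_features,
                                    random_state=0).fit(X_demo, y_demo)
    # get_depth() reports the depth of each fitted decision tree
    depths = [tree.get_depth() for tree in forest.estimators_]
    mean_depths[max_features] = np.mean(depths)
    print('max_features = {:2d}: mean tree depth = {:.1f}'
          .format(max_features, mean_depths[max_features]))
```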

<img src="rpics/r4.png" alt="Drawing" style="width: 700px;"/>

In [None]:
############# Prediction using Random Forests ######

## Once a random forest model is trained, it predicts the target value for new instances by first making a 
## prediction with every tree in the random forest. 

## For regression tasks the overall prediction is then typically the mean of the individual tree predictions. 

## For classification the overall prediction is based on a weighted vote. 
## Each tree gives a probability for each possible target class label; the probabilities for each class 
## are then averaged across all the trees, and the class with the highest probability is the final predicted class. 
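
The averaging described above can be checked by hand against scikit-learn's own predictions. This is a sketch on synthetic data (the datasets and the choice of 21 trees are assumptions for illustration); it reproduces the forest's output from the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: average the per-tree class probabilities
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=21, random_state=0).fit(X_c, y_c)

tree_probs = np.array([tree.predict_proba(X_c) for tree in clf_demo.estimators_])
avg_probs = tree_probs.mean(axis=0)
# The predicted class is the one with the highest averaged probability
manual_labels = clf_demo.classes_[np.argmax(avg_probs, axis=1)]

probs_match = np.allclose(avg_probs, clf_demo.predict_proba(X_c))
labels_match = np.array_equal(manual_labels, clf_demo.predict(X_c))

# Regression: average the per-tree predictions
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg_demo = RandomForestRegressor(n_estimators=21, random_state=0).fit(X_r, y_r)

tree_preds = np.array([tree.predict(X_r) for tree in reg_demo.estimators_])
regression_match = np.allclose(tree_preds.mean(axis=0), reg_demo.predict(X_r))

print('Averaged probabilities match predict_proba:', probs_match)
print('Arg-max labels match predict:', labels_match)
print('Mean of tree predictions matches regressor:', regression_match)
```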

<img src="rpics/r5.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## A random Forest Example

## Here we're showing the training data plotted in terms of two feature values with height on the 
## x axis and width on the y axis. As usual, there are four categories of fruit to be predicted. 

## Because the number of features is restricted to just two in this very simple example, 
## the randomness in creating the tree ensemble is coming mostly from the bootstrap sampling of 
## the training data. 

## You can see that the decision boundaries overall have the box like shape that we associate with 
## decision trees but with some additional detail variation to accommodate specific local changes 
## in the training data.

<img src="rpics/r6.png" alt="Drawing" style="width: 700px;"/>

In [None]:
## CLASSIFIER PARAMETERS

<img src="rpics/r8.png" alt="Drawing" style="width: 700px;"/>

In [None]:
###### PROS AND CONS

<img src="rpics/r7.png" alt="Drawing" style="width: 700px;"/>

## PART 2 : PRACTICAL EXAMPLES

In [None]:
## DIFFERENT REGRESSION AND CLASSIFICATION PROBLEMS

## Regression problem with one input variable
## Complex regression problem with one input variable

## Binary classification problem with two informative features
## Binary classification problem with non-linearly separable classes

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# from adspy_shared_utilities import load_crime_dataset


cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])

# fruits dataset
fruits = pd.read_table('fruit_data_with_colors.txt')

feature_names_fruits = ['height', 'width', 'mass', 'color_score']
X_fruits = fruits[feature_names_fruits]
y_fruits = fruits['fruit_label']
target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon']

X_fruits_2d = fruits[['height', 'width']]
y_fruits_2d = fruits['fruit_label']


# synthetic dataset for simple regression
from sklearn.datasets import make_regression
plt.figure()
plt.title('Sample regression problem with one input variable')
X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,
                            n_informative=1, bias = 150.0,
                            noise = 30, random_state=0)
plt.scatter(X_R1, y_R1, marker= 'o', s=50)
plt.show()

# synthetic dataset for more complex regression
from sklearn.datasets import make_friedman1
plt.figure()
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples = 100, n_features = 7,
                           random_state=0)

plt.scatter(X_F1[:, 2], y_F1, marker= 'o', s=50)
plt.show()



# synthetic dataset for classification (binary)
plt.figure()
plt.title('Sample binary classification problem with two informative features')
X_C2, y_C2 = make_classification(n_samples = 100, n_features=2,
                                n_redundant=0, n_informative=2,
                                n_clusters_per_class=1, flip_y = 0.1,
                                class_sep = 0.5, random_state=0)
plt.scatter(X_C2[:, 0], X_C2[:, 1], marker= 'o',
           c=y_C2, s=50, cmap=cmap_bold)
plt.show()


# more difficult synthetic dataset for classification (binary)
# with classes that are not linearly separable
X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2,
                       centers = 8, cluster_std = 1.3,
                       random_state = 4)
y_D2 = y_D2 % 2
plt.figure()
plt.title('Sample binary classification problem with non-linearly separable classes')
plt.scatter(X_D2[:,0], X_D2[:,1], c=y_D2,
           marker= 'o', s=50, cmap=cmap_bold)
plt.show()

# Breast cancer dataset for classification
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

In [None]:
## Important: run the helper function cell (under THE HELPER FUNCTION below) first, and only then run the next cell!

########## PLOTTING DECISION BOUNDARIES #######

## Notice that we did not have to perform scaling or other pre-processing as we did with a number of other 
## supervised learning methods. 
## This is one advantage of using random forests. 

## Also note that we passed in a fixed value for the random_state parameter in order to make the 
## results reproducible. 
## If we didn't set the random_state parameter, the model would likely be DIFFERENT EACH TIME 
## due to the randomized nature of the random forest algorithm. 

## So, on the positive side, random forests are widely used because they're very powerful. 
## They give excellent prediction performance on a wide variety of problems, 
## and they don't require careful scaling of the feature data or extensive parameter tuning.
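
The effect of fixing random_state can be sketched as follows. The data is generated with the same make_blobs settings as the notebook's complex binary dataset; the helper function name and the choice of 20 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Same settings as the notebook's non-linearly separable binary dataset
X_demo, y_demo = make_blobs(n_samples=100, n_features=2, centers=8,
                            cluster_std=1.3, random_state=4)
y_demo = y_demo % 2

def fit_and_predict_proba(seed):
    """Fit a forest with the given random_state and return its class probabilities."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    return forest.fit(X_demo, y_demo).predict_proba(X_demo)

# Two fits with the same seed build the same trees, so the outputs are identical
same_seed_match = np.array_equal(fit_and_predict_proba(0),
                                 fit_and_predict_proba(0))
print('Same random_state gives identical probabilities:', same_seed_match)
```

With random_state left unset, repeated fits may produce different trees and therefore slightly different probabilities and decision boundaries.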

In [5]:
import numpy  # the helper functions refer to the module by its full name
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#import plot_class_regions_for_classifier_subplot
#import plot_class_regions_for_classifier

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2,
                                                   random_state = 0)
fig, subaxes = plt.subplots(1, 1, figsize=(6, 6))

# Fix random_state so the fitted forest (and the plot) is reproducible
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
title = 'Random Forest Classifier, complex binary dataset, default settings'
plot_class_regions_for_classifier_subplot(clf, X_train, y_train, X_test,
                                         y_test, title, subaxes)

plt.show()

In [15]:
fruits.shape

(59, 7)

In [16]:
fruits.head(6)

Unnamed: 0,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79
5,2,mandarin,mandarin,80,5.8,4.3,0.77


In [18]:
X_fruits.head(5)

Unnamed: 0,height,width,mass,color_score
0,7.3,8.4,192,0.55
1,6.8,8.0,180,0.59
2,7.2,7.4,176,0.6
3,4.7,6.2,86,0.8
4,4.6,6.0,84,0.79


In [20]:
y_fruits.head(3)

0    1
1    1
2    1
Name: fruit_label, dtype: int64

In [22]:
X_fruits_2d.head(3)

Unnamed: 0,height,width
0,7.3,8.4
1,6.8,8.0
2,7.2,7.4


In [24]:
y_fruits_2d.head(3)

0    1
1    1
2    1
Name: fruit_label, dtype: int64

In [11]:
#### THIS RUNS ABOUT A MINUTE

import matplotlib.pyplot as plt
import matplotlib
import matplotlib.patches as mpatches


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from adspy_shared_utilities import plot_class_regions_for_classifier_subplot

X_train, X_test, y_train, y_test = train_test_split(X_fruits.to_numpy(),
                                                   y_fruits.to_numpy(),
                                                   random_state = 0)
fig, subaxes = plt.subplots(6, 1, figsize=(6, 32))

title = 'Random Forest, fruits dataset, default settings'
pair_list = [[0,1], [0,2], [0,3], [1,2], [1,3], [2,3]]

for pair, axis in zip(pair_list, subaxes):
    X = X_train[:, pair]
    y = y_train
    
    clf = RandomForestClassifier().fit(X, y)
    plot_class_regions_for_classifier_subplot(clf, X, y, None,
                                             None, title, axis,
                                             target_names_fruits)
    
    axis.set_xlabel(feature_names_fruits[pair[0]])
    axis.set_ylabel(feature_names_fruits[pair[1]])
    
plt.tight_layout()
plt.show()

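# Note: n_estimators=10 was the default before scikit-learn 0.22;
# the default has been 100 since then.
clf = RandomForestClassifier(n_estimators = 10,
                            random_state=0).fit(X_train, y_train)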

print('Random Forest, Fruit dataset, default settings')
print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Random Forest, Fruit dataset, default settings
Accuracy of RF classifier on training set: 1.00
Accuracy of RF classifier on test set: 0.80


In [None]:
## Random Forests on a real-world dataset

In [6]:
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

clf = RandomForestClassifier(max_features = 8, random_state = 0)
clf.fit(X_train, y_train)

print('Breast cancer dataset')
print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Breast cancer dataset
Accuracy of RF classifier on training set: 1.00
Accuracy of RF classifier on test set: 0.97


In [None]:
### PROS and CONS

<img src="rpics/r7.png" alt="Drawing" style="width: 700px;"/>

## THE HELPER FUNCTION

In [13]:
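# Imports needed by the helper functions in this cell
import numpy
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap

def plot_class_regions_for_classifier_subplot(clf, X, y, X_test, y_test, title, subplot, target_names = None, plot_decision_regions = True):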

    numClasses = numpy.amax(y) + 1
    color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
    color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
    cmap_light = ListedColormap(color_list_light[0:numClasses])
    cmap_bold  = ListedColormap(color_list_bold[0:numClasses])

    h = 0.03
    k = 0.5
    x_plot_adjust = 0.1
    y_plot_adjust = 0.1
    plot_symbol_size = 50

    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    x2, y2 = numpy.meshgrid(numpy.arange(x_min - k, x_max + k, h), numpy.arange(y_min - k, y_max + k, h))
    # numpy.c_ Translates slice objects to concatenation along the second axis
    # e.g. np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
    # ravel() Returns a contiguous flattened array.
    # x = np.array([[1, 2, 3], [4, 5, 6]])
    # np.ravel(x) = [1 2 3 4 5 6]
    P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
    P = P.reshape(x2.shape)

    if plot_decision_regions:
        subplot.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)

    subplot.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
    subplot.set_xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
    subplot.set_ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)

    if (X_test is not None):
        subplot.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
        train_score = clf.score(X, y)
        test_score  = clf.score(X_test, y_test)
        title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)

    subplot.set_title(title)

    if (target_names is not None):
        legend_handles = []
        for i in range(0, len(target_names)):
            patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
            legend_handles.append(patch)
        subplot.legend(loc=0, handles=legend_handles)


def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None, title=None, target_names = None, plot_decision_regions = True):

    numClasses = numpy.amax(y) + 1
    color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
    color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
    cmap_light = ListedColormap(color_list_light[0:numClasses])
    cmap_bold  = ListedColormap(color_list_bold[0:numClasses])

    h = 0.03
    k = 0.5
    x_plot_adjust = 0.1
    y_plot_adjust = 0.1
    plot_symbol_size = 50

    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    x2, y2 = numpy.meshgrid(numpy.arange(x_min-k, x_max+k, h), numpy.arange(y_min-k, y_max+k, h))
    # numpy.c_ Translates slice objects to concatenation along the second axis
    # e.g. np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
    # ravel() Returns a contiguous flattened array.
    # x = np.array([[1, 2, 3], [4, 5, 6]])
    # np.ravel(x) = [1 2 3 4 5 6]
    P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
    P = P.reshape(x2.shape)
    plt.figure()
    if plot_decision_regions:
        plt.contourf(x2, y2, P, cmap=cmap_light, alpha = 0.8)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor = 'black')
    plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
    plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)

    if (X_test is not None):
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size, marker='^', edgecolor = 'black')
        train_score = clf.score(X, y)
        test_score  = clf.score(X_test, y_test)
        title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)

    if (target_names is not None):
        legend_handles = []
        for i in range(0, len(target_names)):
            patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
            legend_handles.append(patch)
        plt.legend(loc=0, handles=legend_handles)

    if (title is not None):
        plt.title(title)
    plt.show()

def plot_fruit_knn(X, y, n_neighbors, weights):
    # Imported here so the helper does not depend on an earlier cell
    from sklearn import neighbors

    if isinstance(X, (pd.DataFrame,)):
        # DataFrame.as_matrix() was removed from pandas; to_numpy() replaces it
        X_mat = X[['height', 'width']].to_numpy()
        y_mat = y.to_numpy()
    elif isinstance(X, (np.ndarray,)):
        # When X was scaled it is already a matrix; keep the first two columns
        X_mat = X[:, :2]
        y_mat = np.asarray(y)
        print(X_mat)

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF','#AFAFAF'])
    cmap_bold  = ListedColormap(['#FF0000', '#00FF00', '#0000FF','#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.

    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, mesh_step_size),
                         numpy.arange(y_min, y_max, mesh_step_size))
    # numpy.c_ Translates slice objects to concatenation along the second axis
    # e.g. np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
    # ravel() Returns a contiguous flattened array.
    # x = np.array([[1, 2, 3], [4, 5, 6]])
    # np.ravel(x) = [1 2 3 4 5 6]

    Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor = 'black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])


    plt.xlabel('height (cm)')
    plt.ylabel('width (cm)')

    plt.show()
