# <font color="blue">Lesson 5 - Ensemble Models</font>

## Random Forest
We'll use sklearn to create a random forest model on the iris dataset. First we'll do a small amount of pre-processing, and then we'll fit our model and evaluate it.

### Read Data into Pandas Data Frames

We can load the iris data using sklearn, which returns an object with the following attributes for pulling out features and targets:
- iris.data = array of features from the iris dataset
- iris.target = array of targets (species labels, encoded as integers) from the iris dataset

We'll use pandas to cast these immediately into dataframes that are easy to work with.

In [15]:
# Load the library with the iris dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Set random seed
np.random.seed(0)

# Create an object called iris with the iris data
iris = load_iris()

# Create a dataframe with the four feature variables
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# View the top 5 rows
features.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


## Add Species Targets

In [16]:
targets = pd.DataFrame({"species":iris.target})

# View the top 5 rows
targets.head()

Unnamed: 0,species
0,0
1,0
2,0
3,0
4,0
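The `species` column holds integer codes rather than names. As a minimal sketch, assuming you want readable labels alongside the codes, the `target_names` attribute of the loaded dataset can be indexed by those codes. The variable name `species_names` is only illustrative, and the result is kept separate so the `targets` dataframe used later is unchanged.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target_names is an array of species names ordered by integer code,
# so indexing it with the codes yields one name per observation
species_names = pd.Series(iris.target_names[iris.target], name="species_name")

# Count how many observations belong to each species
print(species_names.value_counts())
```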


## Create Training and Test Data
We can easily split our dataset into training and test sets using sklearn's `train_test_split` function. 

In [17]:
# split dataset into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,targets, 
                                                    test_size=30, 
                                                    random_state=42)
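Because `test_size` is given as an integer, it is read as an absolute number of rows rather than a fraction. A quick check of the resulting shapes, sketched below with the same arguments as above, confirms the 120/30 split. Passing `stratify=targets` is an optional extra (not used in this lesson) that would keep the class proportions equal in both sets.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
targets = pd.DataFrame({"species": iris.target})

# An integer test_size means "this many rows", not a proportion
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=30, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# How many test rows fall in each species class
print(y_test["species"].value_counts().sort_index())
```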

## Train the Model
Now that we have parsed and split up our dataset, we'll use the [sklearn RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with the following arguments to fit our model: 
- n_estimators = number of trees in the forest (default 10 in the sklearn version used here; 100 since version 0.22)
- criterion = function used to measure the quality of a split, either "gini" (the default) or "entropy"
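To give a feel for the two `criterion` options, here is a small sketch of the textbook impurity formulas applied to a vector of class proportions. The helper names `gini` and `entropy` are illustrative and are not sklearn functions; the base-2 logarithm is an assumption made for this example.

```python
import numpy as np

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    """Shannon entropy in bits; zero proportions contribute nothing."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A node holding the three iris species in equal parts is maximally impure
mixed = [1 / 3, 1 / 3, 1 / 3]
# A node holding a single species is pure
pure = [1.0, 0.0, 0.0]

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a good split is one that lowers them.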

Because sklearn's random forest classifier expects the target data to be a one-dimensional array, we'll use the [numpy ravel function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) to flatten our single-column dataframe. 
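The reshaping step can be seen in isolation on a toy single-column dataframe (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# A single-column dataframe is two-dimensional: (rows, 1)
y = pd.DataFrame({"species": [0, 1, 2, 1]})
print(y.shape)     # (4, 1)

# np.ravel flattens it into a one-dimensional numpy array
flat = np.ravel(y)
print(flat.shape)  # (4,)
```

Selecting the column directly, for example `y["species"]`, would also give a one-dimensional object.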

In [18]:
# Create a random forest Classifier
clf = RandomForestClassifier(n_jobs=2, 
                             random_state=0,
                            n_estimators=20)

# take the training features and learn how they relate
# to the training y (the species)
clf.fit(X_train, np.ravel(y_train))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

Let's use the fitted model's `feature_importances_` attribute to see which of our features have the most influence: 

In [19]:
# view feature importances, higher = more important
print(clf.feature_importances_)

[0.14264044 0.02926386 0.3947516  0.4333441 ]


Which feature does each number correspond to? To pair each feature name with its importance value, we'll use the [built-in zip](https://docs.python.org/3.3/library/functions.html#zip) function.

In [20]:
# View a list of the features and their importance scores
list(zip(features, clf.feature_importances_))

[('sepal length (cm)', 0.1426404433850623),
 ('sepal width (cm)', 0.029263857201866056),
 ('petal length (cm)', 0.3947516007880515),
 ('petal width (cm)', 0.4333440986250202)]
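As an alternative to `zip`, a pandas Series indexed by feature name keeps the labels attached and can be sorted. This is a sketch that rebuilds the model from scratch so it runs on its own; the exact scores may differ slightly from those above depending on your sklearn version.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
targets = pd.DataFrame({"species": iris.target})
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=30, random_state=42
)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, np.ravel(y_train))

# Label each importance score with its feature name, largest first
importances = pd.Series(
    clf.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)
```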

### Consider this:
Which feature is the best predictor of iris species? 

### Apply Classifier to testing data to make predictions

In [21]:
clf.predict(X_test)

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])
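Predictions on the test set are most useful when compared with the held-out labels in `y_test`. A minimal sketch of that comparison, using sklearn's `accuracy_score` and rebuilding the model so the block is self-contained (the variable names `test_preds` and `test_accuracy` are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
targets = pd.DataFrame({"species": iris.target})
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=30, random_state=42
)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, np.ravel(y_train))

# Predict only on rows the model has never seen
test_preds = clf.predict(X_test)

# Fraction of held-out rows whose predicted class matches the true class
test_accuracy = accuracy_score(y_test["species"], test_preds)
print(test_accuracy)
```

Accuracy measured this way reflects how the model handles unseen data, which is the reason for holding out a test set in the first place.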

## Evaluate Model
To evaluate the model, we'll create two numpy arrays that we can compare and later plot: one containing the predictions, the other containing the actual values. Note that here we predict on the full dataset, which includes the rows the model was trained on.

In [22]:
# Predict the species class for every observation in the full dataset
preds = clf.predict(features)

In [23]:
# View the PREDICTED species for the first five observations
preds[0:5]

array([0, 0, 0, 0, 0])

In [24]:
# Store the ACTUAL species for all observations
actual = np.array(targets['species'])

Display (or print) the full arrays: `preds` and `actual`

In [27]:
print(preds)
print(actual)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


### Confusion Matrix

We can use the pandas_ml (pandas machine learning) package to create a confusion matrix.
Learn more at [Python Software Foundation: pandas_ml](https://pypi.org/project/pandas_ml/).

Usage: `confusion_matrix = ConfusionMatrix(y_true, y_pred)`

In [28]:
# Install the pandas machine learning package
!pip install pandas_ml

Collecting pandas_ml
  Downloading https://files.pythonhosted.org/packages/ae/72/6d90debfcb9ea74ec00927fa7ed0204dcc560b1f9ffcd8b239daa7fd106d/pandas_ml-0.6.1-py3-none-any.whl (100kB)
Collecting enum34 (from pandas_ml)
  Downloading https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl
Installing collected packages: enum34, pandas-ml
Successfully installed enum34-1.1.6 pandas-ml-0.6.1
You are using pip version 19.0.2, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [30]:
from pandas_ml import ConfusionMatrix

confusion_matrix = ConfusionMatrix(actual, preds)
print("Confusion matrix:\n%s" % confusion_matrix)

Confusion matrix:
Predicted   0   1   2  __all__
Actual                        
0          50   0   0       50
1           0  50   0       50
2           0   0  50       50
__all__    50  50  50      150
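The matrix above covers all 150 rows, including the 120 the model was trained on, so a perfect diagonal is not surprising. As a hedged alternative, in case `pandas_ml` is unavailable in your environment, the sketch below uses sklearn's own `metrics.confusion_matrix` on the 30 held-out rows only. The model is rebuilt so the block runs on its own, and the names `cm` and `cm_df` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
targets = pd.DataFrame({"species": iris.target})
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=30, random_state=42
)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, np.ravel(y_train))

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(
    y_test["species"], clf.predict(X_test), labels=[0, 1, 2]
)

# Wrap in a dataframe so rows and columns show species names
cm_df = pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names)
print(cm_df)
```

Off-diagonal entries, if any, show which species the model confuses with one another on unseen data.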


We can also visualize these results using matplotlib: 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

confusion_matrix.plot()

### Consider this:
What is the accuracy of this model?