# <font color="blue">Lesson 5 - Ensemble Models</font>

## Random Forest
We'll use scikit-learn (sklearn) to build a random forest model on the iris dataset. First we'll do a small amount of pre-processing, and then we'll train our model and evaluate its predictions.

### Read Data into Pandas Data Frames

We can load the iris data with sklearn's `load_iris` function. The object it returns exposes the features and targets through the following attributes:
- iris.data = array of features from the iris dataset
- iris.target = array of targets (species labels) from the iris dataset

We'll use pandas to cast these immediately into dataframes that are easy to work with.
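As a quick orientation, the sketch below prints the attributes of the object returned by `load_iris()` that this lesson relies on. The names `iris_bunch` and `preview` are illustrative only and are not used elsewhere in the lesson.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris_bunch = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris_bunch.data.shape)
# Integer class labels, one per row
print(iris_bunch.target.shape)
# Column names for the features, and the species names behind the integer labels
print(iris_bunch.feature_names)
print(iris_bunch.target_names)

# Wrapping the feature array in a dataframe attaches the column names
preview = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
print(preview.head())
```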

In [None]:
# Load the library with the iris dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Set random seed
np.random.seed(0)

# Create an object called iris with the iris data
iris = load_iris()

# Create a dataframe with the four feature variables
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# View the top 5 rows
features.head()

## Add Species Targets

In [None]:
targets = pd.DataFrame({"species":iris.target})

# View the top 5 rows
targets.???()

## Create Training and Test Data
We can easily split our dataset into training and test sets using sklearn's `train_test_split` function.

In [None]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# An integer test_size holds out that many rows (30 of the 150 here)
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    test_size=30,
                                                    random_state=42)
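The `test_size` argument accepts either an integer (a row count) or a float (a fraction of the dataset). The sketch below, which loads the data with `return_X_y=True` for brevity and uses illustrative variable names, shows that both forms hold out 30 rows on the 150-row iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Integer test_size: an absolute number of rows to hold out
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X, y, test_size=30, random_state=42)

# Float test_size: a fraction of the dataset (0.2 * 150 = 30 rows)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_tr_a.shape, X_te_a.shape)
print(X_tr_b.shape, X_te_b.shape)
```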

## Train the Model
Now that we have parsed and split up our dataset, we'll use the [sklearn RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with the following arguments to fit our model: 
- n_estimators = number of trees in the forest (default 100 since scikit-learn 0.22; 10 in earlier versions)
- criterion = function used to measure the quality of a split, either gini (the default) or entropy

Because sklearn's random forest classifier expects the target data to be a one-dimensional array, we'll use the [numpy ravel method](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) to flatten our single-column targets dataframe.
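To see what the reshaping does, here is a minimal sketch on a small stand-in dataframe. The values in `y_frame` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small stand-in for the targets dataframe (hypothetical values)
y_frame = pd.DataFrame({"species": [0, 1, 2, 1]})

# A single-column dataframe is still two-dimensional: rows x columns
print(y_frame.shape)

# np.ravel flattens it into a one-dimensional numpy array
y_flat = np.ravel(y_frame)
print(y_flat.shape)
print(y_flat)
```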

In [None]:
# Create a random forest classifier
clf = RandomForestClassifier(n_jobs=2,
                             random_state=0,
                             n_estimators=20)

# take the training features and learn how they relate
# to the training y (the species)
clf.fit(X_train, np.ravel(y_train))

Let's use the classifier's `feature_importances_` attribute to see which of our features have the most influence:

In [None]:
# view feature importances, higher = more important
print(clf.feature_importances_)

Which feature does each number correspond to? We want to pair each feature name with its importance value, so we'll use the [built-in zip](https://docs.python.org/3.3/library/functions.html#zip) function.

In [None]:
# View a list of the features and their importance scores
list(zip(features, clf.feature_importances_))
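An alternative view, sketched below, puts the importances in a pandas Series indexed by feature name so they can be sorted. To keep the example self-contained it fits a fresh classifier (here called `model`) on the full dataset, which is an assumption of this sketch and differs from the training split used in the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

iris = load_iris()

# Fit on the full dataset purely for illustration
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(iris.data, iris.target)

# Label each importance score with its feature name
importances = pd.Series(model.feature_importances_, index=iris.feature_names)

# Sort so the most influential feature appears first
print(importances.sort_values(ascending=False))
```

The scores are normalized, so they sum to 1 across the features.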

### Consider this:
Which feature is the best predictor of iris species? 

### Apply Classifier to testing data to make predictions

In [None]:
clf.predict(X_test)
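The predictions are integer class codes. One way to translate them into species names is to index `iris.target_names` with the predicted codes, as in this sketch. The array `example_preds` is a hypothetical set of predictions, not output from the model above.

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

# Hypothetical predicted class codes
example_preds = np.array([0, 2, 1, 1, 0])

# target_names is a numpy array, so it can be indexed by an array of codes
predicted_names = iris.target_names[example_preds]
print(predicted_names)
```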

## Evaluate Model
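To evaluate the model, we'll create two numpy arrays: one containing the predictions and the other containing the actual values. Note that the cells below predict on the full `features` dataframe, which includes the rows used for training, so scores computed from these arrays will be more optimistic than scores measured on the test set alone.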

In [None]:
# Predict the species class (0, 1, or 2) for every row in the dataset
preds = clf.predict(features)

In [None]:
# View the PREDICTED species for the first five observations
preds[0:5]

In [None]:
# Store the ACTUAL species for every observation as a numpy array
actual = np.array(targets['species'])

Display (or print) the full arrays: `preds` and `actual`

### Confusion Matrix

We can use the pandas_ml (pandas machine learning) package to create a confusion matrix.
Learn more at <a href="https://pypi.org/project/pandas_ml/">Python Software Foundation: pandas_ml</a>

Usage: `confusion_matrix = ConfusionMatrix(y_true, y_pred)`

In [None]:
# Install the pandas machine learning package
!pip install pandas_ml

In [None]:
from pandas_ml import ConfusionMatrix
confusion_matrix = ConfusionMatrix(???, ???)
print("Confusion matrix:\n%s" % confusion_matrix)
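If `pandas_ml` does not install or import cleanly in your environment (it may not work alongside recent pandas and scikit-learn releases), a similar table can be built with scikit-learn and pandas alone. The sketch below is an alternative to the approach in this lesson, not a replacement for the exercise; the arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

species = load_iris().target_names

# Hypothetical actual and predicted class codes
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Label the rows and columns with species names for readability
cm_table = pd.DataFrame(cm, index=species, columns=species)
cm_table.index.name = "Actual"
cm_table.columns.name = "Predicted"
print(cm_table)
```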

We can also visualize these results using matplotlib: 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

confusion_matrix.plot()

### Consider this:
What is the accuracy of this model?
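One way to check your answer is sklearn's `accuracy_score`. The self-contained sketch below rebuilds the split and classifier with the same settings used in this lesson (under the illustrative name `model`) and compares accuracy on the held-out test rows with accuracy on the full dataset. Expect the full-dataset figure to look at least as good, because it includes rows the model was trained on.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=42)

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

# Fraction of correct predictions on rows the model has not seen
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Fraction of correct predictions on all rows, training rows included
full_accuracy = accuracy_score(y, model.predict(X))

print("Test accuracy:", test_accuracy)
print("Full-dataset accuracy:", full_accuracy)
```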