<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Ensembles and Random Forests - Practice
 
_Author: B Rhodes (DC)_

---

## Introduction

In this notebook, you'll build two tree-based ensemble models: bagged trees (a "bag of trees") and a random forest. 
We'll be working with a dataset to predict salaries based on census data. 

### Objectives

You will be able to: 

- Use `scikit-learn` to train a random forest model.  
- Determine the performance of the model.
- Identify, visualize and interpret the important features from an ensemble model.


## Import data
Below we will use information derived from census data to predict whether someone makes more or less than $50k/year. Our goal is to determine which factors best predict an individual's salary.

Let's get our standard imports.

In [None]:
## imports


The dataset is stored in the file `'salaries_census.csv'`.  

**Steps**: 
1. Import the dataset from the file above and store it in a DataFrame. The data file includes an index, so be sure to set the `index_col` parameter to `0`.  
2. Verify that everything loaded correctly.
3. Perform a little EDA.
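
As a rough illustration of steps 1 and 2, the sketch below reads a small in-memory CSV instead of the real file. The three rows and the label spellings are made up for the example; only `index_col=0` comes from the instructions above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first (unnamed) column is a saved index.
csv_text = """,Age,Sex,Target
0,39,Male,<=50K
1,50,Female,>50K
2,38,Male,<=50K
"""

# index_col=0 tells pandas to use the first column as the row index
# rather than loading it as an extra feature column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Quick checks that the load worked as expected.
print(toy_df.head())
toy_df.info()
```

With the real data you would pass the `path` variable to `pd.read_csv` in place of the `StringIO` object.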

In [None]:
# Import the data
path = '../data/salaries_census.csv'

# load the data into a dataframe 


### **Task**: Perform some EDA

In [None]:
# check the .info() or .dtypes


We have six (6) features and one (1) dependent (target) variable:

- `Age`: continuous 

- `Education`: Categorical. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool 

- `Occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces 

- `Relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried 

- `Race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black 

- `Sex`: Female, Male 

- `Target`: `<= 50k` and `>50k`. Use `.map()` to convert these to `<= 50k`: 0 and `>50k`: 1.
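
The mapping step can be sketched on a toy Series. The label strings below are copied from the description above and may not match the file exactly (spacing and capitalization are assumptions), so it is worth inspecting the real column with `.unique()` first.

```python
import pandas as pd

# Toy target column; the exact label strings in the real file are an assumption.
target = pd.Series(['<= 50k', '>50k', '<= 50k', '>50k', '>50k'], name='Target')

# Dictionary that sends each category to its numeric code.
label_map = {'<= 50k': 0, '>50k': 1}
target_encoded = target.map(label_map)

# Labels missing from the mapping become NaN, so confirm none slipped through.
print(target_encoded.isna().sum())
print(target_encoded.tolist())
```

If the NaN count is not zero, the keys in the dictionary do not match the strings in the data.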

In [None]:
## map the target categories to 0 & 1 - reassign the target column.


**Task**: Assign feature and target variables. The easiest way is to assign the `'Target'` column to one variable, then drop that column from the dataset and assign the remaining columns to another variable.

*Hint*: Use conventional variable names for features and target.
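
A minimal sketch of this split on a made-up DataFrame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with two features and an already-encoded target column.
toy_df = pd.DataFrame({
    'Age': [39, 50, 38, 53],
    'Sex': ['Male', 'Female', 'Male', 'Female'],
    'Target': [0, 1, 0, 1],
})

# Target goes into y; everything else stays in X.
y = toy_df['Target']
X = toy_df.drop(columns='Target')

print(X.columns.tolist())
print(y.tolist())
```

`drop` returns a new DataFrame here, so `toy_df` itself still contains the `Target` column.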

In [None]:
# Split the outcome and predictor variables




In the cell below, examine the data type of each column:  

In [None]:
# Your code here


**Task**: Create dummy variables to deal with the categorical features. Check the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) to recall how to do this.  
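
For reference, here is how `pd.get_dummies` behaves on a small made-up frame. Numeric columns pass through unchanged, while each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy features: one numeric column and two categorical columns.
X_toy = pd.DataFrame({
    'Age': [39, 50, 38],
    'Sex': ['Male', 'Female', 'Male'],
    'Race': ['White', 'Black', 'White'],
})

# Only the object-typed columns are one-hot encoded by default.
X_dummies = pd.get_dummies(X_toy)

print(X_dummies.columns.tolist())
print(X_dummies.shape)
```

Each new column is named `<original column>_<category>`, which keeps the feature importance plots later in the notebook readable.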

In [None]:
# Create dummy variables


Perform a train/test split with a 75/25 split. Set the `random_state` to 42.  
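
A sketch of a 75/25 split on synthetic arrays. The `_toy` names are placeholders chosen so the example does not overwrite the variables you create for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features and a balanced binary target.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# test_size=0.25 holds out a quarter of the rows; random_state makes the split repeatable.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print(X_train_toy.shape, X_test_toy.shape)
```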

In [None]:
# Perform a test train split.


## Build a Decision Tree for Comparison

We'll begin by fitting a decision tree classifier (single tree) to provide a baseline and have something to compare to the ensemble methods.  

### Build the tree

**Task**: Instantiate and fit a decision tree classifier with the following parameters `criterion='gini'` and `max_depth=5`.
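
The sketch below fits a tree with these two parameters on synthetic data from `make_classification`. The data, the variable names, and the `random_state` values are assumptions added for reproducibility, not part of the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the census data.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

# criterion and max_depth match the task; random_state is an extra for repeatability.
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
tree_clf.fit(X_tr, y_tr)

# The fitted tree never grows deeper than max_depth.
print(tree_clf.get_depth())
```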

In [None]:
# Instantiate and fit a DecisionTreeClassifier


### Feature Importance

Let's explore the importance of each feature used in our decision tree model. The trained classifier has an attribute `feature_importances_` that shows the relative importance of each feature. Display a sorted list of tuples in the form (feature, feature importance). *Hint*: use `zip()`.
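
One way to pair names with importances, sketched on synthetic data. The feature names `f0` to `f5` are hypothetical; with the real data you would use the columns of your training DataFrame.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names for a six-feature synthetic problem.
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

# zip pairs each name with its importance; sort by importance, largest first.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```

The importances are normalized, so they sum to 1 across all features, and features the tree never split on get an importance of 0.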

In [None]:
# List the feature importances


**Use the function below to visualize the feature importances.**

In [None]:
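import matplotlib.pyplot as plt
import numpy as np

def plot_feature_importances(model):
    """Plot a horizontal bar chart of a fitted model's feature importances."""
    # Relies on X_train (a DataFrame) having been defined in an earlier cell
    n_features = X_train.shape[1]
    plt.figure(figsize=(8, 8))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X_train.columns.values)
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')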


### Model performance

Check the model performance on the test data. 

In the cells below:

* Use the model to generate predictions on the test set  
* Print out a `confusion_matrix` of the test set predictions 
* Print out a `classification_report` of the test set predictions 
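
These three steps can be sketched as follows on synthetic data (the data and variable names are placeholders, not the notebook's own):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a fitted tree standing in for the notebook's model.
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

# Predictions on the held-out rows.
y_pred = clf.predict(X_te)

# Rows of the matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print('Confusion Matrix:')
print(cm)

print()
print('Classification Report')
print(classification_report(y_te, y_pred))

# Accuracy is the share of predictions on the diagonal of the confusion matrix.
accuracy = clf.score(X_te, y_te)
print(accuracy)
```

Both metric functions take the true labels first and the predictions second; swapping them transposes the confusion matrix.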

In [None]:
# Test set predictions


# Confusion matrix and classification report
print('Confusion Matrix:')

print()
print("Classification Report")


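Now, let's check the model's accuracy. Complete the cell below so that it displays the test set accuracy of the model. 

In [None]:
# Test accuracy score: assign it to dt_score_test (a placeholder name, following the pattern used in later cells)

print("Testing Accuracy for Decision Tree Classifier: {:.4}%".format(dt_score_test * 100))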

## Ensemble Method #1: Bagged Trees

Use the bagging method (aka bag of trees) for our first ensemble method. 

**Tasks**:
1. Instantiate and fit a BaggingClassifier ([check the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)).  
    1. Set the estimator to a `DecisionTreeClassifier` with the same values from above for `criterion` and `max_depth`.  
    2. Also set the `n_estimators` parameter for our `BaggingClassifier` to `20`. 
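
A minimal sketch of this setup on synthetic data. The base estimator is passed positionally because its keyword name differs between scikit-learn releases (`base_estimator` in older versions, `estimator` in newer ones); the data and `random_state` values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the census data.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

# 20 depth-limited trees, each trained on a bootstrap sample of the training rows.
bag = BaggingClassifier(
    DecisionTreeClassifier(criterion='gini', max_depth=5),
    n_estimators=20,
    random_state=42,
)
bag.fit(X_tr, y_tr)

# .score() returns mean accuracy for classifiers.
train_score = bag.score(X_tr, y_tr)
test_score = bag.score(X_te, y_te)
print(train_score, test_score)
```

A large gap between the two scores would suggest the ensemble is overfitting the training data.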

In [None]:
# Instantiate a BaggingClassifier

# Fit to the training data


**Task**: Use the `.score()` method to check the accuracy of the model on the training data. 

In [None]:
# Training accuracy score


print("Training Accuracy for Bagging Classifier: {:.4}%".format(bag_score_train * 100))


**Task**: Use the `.score()` method to check the accuracy of the model on the testing data. 

In [None]:
# Test accuracy score

print("Testing Accuracy for Bagging Classifier: {:.4}%".format(bag_score_test * 100))



## Random forests

Another popular ensemble method is the **_Random Forest_**. Let's fit a random forest classifier next and see how it measures up compared to all the others. 

### Fit a random forest model

In the cell below, instantiate a `RandomForestClassifier` named `forest`, setting the number of estimators to `100` and the max depth to `5`. Then, fit the model to our training data. 
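
For comparison, here is the same configuration fitted to synthetic data. The name `forest_toy`, the data, and `random_state` are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

# 100 trees, each limited to depth 5.
forest_toy = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest_toy.fit(X_tr, y_tr)

# Mean accuracy on the training and test rows.
rf_train = forest_toy.score(X_tr, y_tr)
rf_test = forest_toy.score(X_te, y_te)
print(rf_train, rf_test)
```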

In [None]:
# Instantiate and fit a RandomForestClassifier


# fit the classifier


Now, let's check the training and testing accuracy of the model using its `.score()` method: 

In [None]:
# Training accuracy score

print("Training Accuracy for Random Forest Classifier: {:.4}%".format(rf_score_train * 100))

In [None]:
# Test accuracy score

print("Test Accuracy for Random Forest Classifier: {:.4}%".format(rf_score_test * 100))

### Feature importance

In [None]:
# Plot the feature importance
plot_feature_importances(forest)

### Look at the trees in your forest

Let's create a forest with some small trees. You'll learn how to access trees in your forest!

In the cell below, create another `RandomForestClassifier` and name it `forest_2`. Set the number of estimators to 5, the `max_features` to 10, and the `max_depth` to 2.

In [None]:
# Instantiate and fit a RandomForestClassifier


# Fit the classifier


Setting `max_features` to a smaller value makes the trees in your forest differ more from one another! The trees in your forest are stored in the `.estimators_` attribute.

In the cell below, get the first tree from `forest_2.estimators_` and store it in `rf_tree_1`
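
The sketch below shows how individual trees can be pulled out of a fitted forest, using synthetic data with 12 features so that `max_features=10` is valid. The names `forest_small` and `first_tree` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with more features than max_features.
X_syn, y_syn = make_classification(n_samples=500, n_features=12, random_state=42)

forest_small = RandomForestClassifier(
    n_estimators=5, max_features=10, max_depth=2, random_state=42
)
forest_small.fit(X_syn, y_syn)

# .estimators_ is a list of fitted DecisionTreeClassifier objects.
first_tree = forest_small.estimators_[0]

# Indices of the features this tree actually split on.
used_features = np.flatnonzero(first_tree.feature_importances_)
print(type(first_tree).__name__)
print(used_features)
```

A tree of depth 2 has at most three internal nodes, so it can use at most three distinct features no matter how many were available at each split.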

In [None]:
# Get the first tree from forest_2


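Use the `plot_feature_importances()` function to visualize which features this tree actually used for its splits. (In scikit-learn, subspace sampling draws a random subset of `max_features` candidate features at each split rather than once per tree.) 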

In [None]:
# Feature importance
plot_feature_importances(rf_tree_1)

Assign the second tree to `rf_tree_2`, and visualize it using `plot_feature_importances()`. 

In [None]:
# Second tree from forest_2


In [None]:
# Feature importance
plot_feature_importances(rf_tree_2)

**Question**: What can you conclude about these two trees and what can you say about how random forests work?

## Summary

Above, we built a single decision tree as a baseline and two tree-based ensembles: bagged trees and a random forest. We demonstrated how to visualize and interpret feature importances, and compared individual trees from a random forest to see how the features they relied on differ. 