# DS106 Machine Learning : Lesson Eight Companion Notebook

### Table of Contents <a class="anchor" id="DS106L8_toc"></a>

* [Table of Contents](#DS106L8_toc)
    * [Page 1 - Introduction](#DS106L8_page_1)
    * [Page 2 - What is a Decision Tree?](#DS106L8_page_2)
    * [Page 3 - Classification Basics](#DS106L8_page_3)
    * [Page 4 - Decision Trees in Python](#DS106L8_page_4)
    * [Page 5 - Random Forest in Python](#DS106L8_page_5)
    * [Page 6 - Hyperparameter Tuning](#DS106L8_page_6)
    * [Page 7 - Hyperparameter Tuning in Python](#DS106L8_page_7)
    * [Page 8 - Feature Importance](#DS106L8_page_8)
    * [Page 9 - Pulling Data from an API](#DS106L8_page_9)
    * [Page 10 - Key Terms](#DS106L8_page_10)
    * [Page 11 - Lesson 3 Hands-On](#DS106L8_page_11)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L8_page_1"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Decision Trees and Random Forests
VimeoVideo('244082619', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO106-ML-L03overview.zip)**.

# Introduction

The next machine learning topic you will tackle is decision trees and random forests. By the end of this module, you will be able to: 

* Understand the theory behind decision trees and random forests
* Compute decision trees and random forests in Python

This lesson will culminate with a hands-on in which you predict the survivors of the Titanic in Python using both decision tree and random forest methodology.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is a Decision Tree?<a class="anchor" id="DS106L8_page_2"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is a Decision Tree?

*Decision trees* are effectively flow charts.  Start with a question, and based on the answer to that question, ask another question until you reach a terminal answer.  Each question you ask is considered a *Node* or a point of divergence and an *Edge* is the path from one Node to another.  The first Node, or the place you start, is called the *Root Node*.  Finally, the possible outcomes on the far end are the *Leaves*.  

Below you have a very simple decision tree for discerning shapes.  It is called a decision tree because it is very similar to a tree in that it "branches off."  Think about a tree, and turn it on its head, and you would have the general shape of your decision tree. 

![Three horizontal lines. From top to bottom, the lines are labeled root node, node, and leaves. An upside-down tree is placed on the lines on the left. Above the root node line is a single point with two lines extending downward from it. The line on the left is labeled triangle. The line on the right is labeled square. In the node section, there are two points, each of which connects to the lines descending from the root node section. The point on the left has two lines descending from it, one line labeled straight and the other labeled rotated. The point on the right has two lines descending from it, one line labeled straight and one line labeled rotated. Each line descends into the leaves section and connects to a shape: a straight triangle, a rotated triangle, a straight square, a rotated square. The line from the top point to rotated square is green and is labeled edge.](Media/106.L5.9.gif)
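
The shape tree in the figure can be written out as a few nested questions. This is only a hand-written sketch to show the flow-chart idea; the function name `classify_shape` and its arguments are made up for illustration, and a real decision tree learns its questions from data rather than having them typed in.

```python
def classify_shape(sides: int, rotated: bool) -> str:
    """Walk a tiny hand-written decision tree and return the Leaf it lands on."""
    # Root Node: the first question separates triangles from squares.
    if sides == 3:
        # Second-level Node: is the shape rotated or straight?
        return "rotated triangle" if rotated else "straight triangle"
    if sides == 4:
        return "rotated square" if rotated else "straight square"
    raise ValueError("this toy tree only knows triangles and squares")


# Follows the Edge highlighted in the figure, ending at the "rotated square" Leaf.
print(classify_shape(4, True))
```

Each `if` is a Node, each branch taken is an Edge, and each returned string is a Leaf.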

---

## What is a Random Forest? 

A decision tree by itself may not be terribly accurate, because of variance in a dataset, which brings you to the next concept: *Random Forests*. Random Forests are a collection of decision trees where the Nodes, or classification points/questions, are randomized.  For every Node, the program considers a random subset of *M* variables and chooses the best one of them to split on.  By convention, M is usually the square root of *P*, which is the total number of variables.
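
To make the *M* and *P* idea concrete, here is a minimal sketch using the four iris predictor columns that appear later in this lesson. It only illustrates drawing a random subset of variables for one Node; it is not how any particular library implements it, and the seed is there just so the draw is repeatable.

```python
import math
import random

# The four iris predictors used later in this lesson.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

P = len(features)                # total number of variables
M = max(1, int(math.sqrt(P)))    # number of variables considered at each Node

rng = random.Random(76)          # seeded only so the example is repeatable
candidates = rng.sample(features, M)

print(f"P = {P}, M = {M}")
print("Variables considered at this Node:", candidates)
```

With four variables, only two are on the table at any one Node, which is what keeps the individual trees from all looking the same.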

Perhaps you're wondering how Random Forest fits into the context of machine learning for data scientists. Well, it's a supervised technique, used to predict the target variable or classification simply by learning decision rules from the data.  

![A random forest of multiple decision trees showing how different choices at each node lead to different paths.](https://upload.wikimedia.org/wikipedia/commons/c/c7/Randomforests_ensemble.gif)

---

## Why use Random Forests over other Methods?

A decision tree regressor is a more powerful model compared to the linear regression seen in this course. The power of decision trees lies in their ability to identify nonlinear relationships within the data. 

Other tree-based ensemble methods can produce trees that are too highly correlated with one another.  If you average things that are already a lot alike, you are not reducing variability very much.  Random forests randomize which variables each split is allowed to consider, which makes the trees less alike, so averaging them reduces variance and keeps your tree outcomes from overlapping excessively. Random forests are also an awesome choice because there are no distributional assumptions to test for, you don't need to scale any variables, and the model itself doesn't need a lot of playing around to get a good fit right off the bat.

---

## Weaknesses of Decision Trees and Random Forests

Decision trees are not without their downsides. While decision trees are great for complex datasets, they tend to overfit the data and as a result may not generalize well. This would make it extremely difficult to, for example, predict which new emails are spam before they reach the inbox. Additionally, decision trees can become unstable in a noisy environment. At each decision layer, the noisy data can alter the shape of the tree and the decision paths in it.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>It is important to note that there is no one-size-fits-all when it comes to modeling and regression. Each model, whether it be linear, logarithmic, or a decision tree, has strengths and weaknesses depending on the data it will be generalized to.</p>
    </div>
</div>

Random forests also have a catch - they can get very complex very fast! When you start adding in all the variability in predictor variables, category levels, number of decision trees, and any model tuning you might want to do, you can end up requiring a lot of processing! Remember combinations/permutations from probability? Well, things can get huge quickly, and that will increase your needs for processing power and time. 

---

## When to Use Random Forests? 

Random forest is one of the first tools many data scientists use when trying to get a baseline for how accurately they can predict a categorical variable with machine learning.  If they are not satisfied with the accuracy, then they try to adjust their dataset or work with other tests.  Besides being an invaluable tool, which many jobs require, it is something fun to tell your friends, “I have walked the Random Forest and mastered its wiles!” 

![A darkly colored and hairy creature with bright eyes walking in a dense forest.](Media/106.L5.11.jpg)

Enough talking, more importing and dataset setup!  

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Classification Basics<a class="anchor" id="DS106L8_page_3"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Classification Basics

Both decision trees and random forests are used in this lesson as classification models, which means that they are meant to correctly categorize, or classify, the y variable. 

Remember back to the confusion matrix in your binary logistic regression, where you saw predictions in a table against the actual values? You'll use that in machine learning, too. But, instead of being limited to *binary* outcomes (only two options like Y/N), you can have more than two outcomes. 

For any one category, if cases were correctly classified as belonging to it, this is called a *true positive* or *TP* for short. When cases were correctly classified as not belonging to it, this is called a *true negative* or *TN*. Basically, anytime you have a T in there, you've done a good job. 

If the model incorrectly classified something as belonging to the category, then this is called a *false positive (FP)*, and if the model incorrectly classified something as not belonging to the category when it really does, then this is called a *false negative (FN)*. So when you've got an F, you're not doing so hot.   
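
With more than two outcomes, TP, FP, FN, and TN are counted one category at a time, treating that category as "positive" and everything else as "negative." The sketch below does that by hand on six made-up flowers; the lists and the helper name `count_outcomes` are invented for illustration only.

```python
# Six made-up flowers: what they really are, and what a model guessed.
actual    = ["setosa", "versicolor", "virginica", "versicolor", "setosa", "virginica"]
predicted = ["setosa", "virginica",  "virginica", "versicolor", "setosa", "versicolor"]


def count_outcomes(actual, predicted, positive):
    """Count TP, FP, FN, and TN when `positive` is the category of interest."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1   # said yes, and it was
        elif p == positive:
            counts["FP"] += 1   # said yes, but it was something else
        elif a == positive:
            counts["FN"] += 1   # said no, but it really was
        else:
            counts["TN"] += 1   # said no, and it wasn't
    return counts


print(count_outcomes(actual, predicted, "versicolor"))
```

For versicolor this gives one of each kind of mistake (one FP, one FN), one TP, and three TN.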

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Decision Trees in Python<a class="anchor" id="DS106L8_page_4"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Decision Trees in Python

Now that you understand the basic concepts behind decision trees and random forests, you'll learn how to complete them in Python.  You will start with decision trees. 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/528497271"> recorded live workshop </a> that goes over the material on decision trees and random forests. </p>
    </div>
</div>

---

## Import Packages

Start by importing packages, as usual: 

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

---

## Load in Data

Then load in data.  How about importing your trusted old friend, ```iris```? You can utilize the code here:

```python
iris = sns.load_dataset('iris')
```

And if that didn't work, you can **[access the iris dataset here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Iris.zip)**.

---

## Data Wrangling 

Excellent! The next step is to specify your x and y variables using subsetting. ```y``` is the column you are predicting, and ```x``` is everything you are using to predict it. 

```python
x = iris.drop('species', axis=1)
y = iris['species']
```

---

## Train Test Split

As you've done before, you will split the data into training and testing sets.  The training sets are used to create your initial model, and the testing sets are what you'll use to determine the fit of the model. Note that just for following along, you will set the ```random_state``` to 76, which is not necessary, but it will give you the same split as the example.

```python
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=76)
```
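
If you are curious what a seeded split is doing conceptually, here is a plain-Python sketch: shuffle the row positions with a seeded random number generator, then slice off 30% for testing. The helper `simple_train_test_split` is hypothetical and is not scikit-learn's implementation, so it will not pick the same rows as `train_test_split()`; it only shows why a fixed seed gives the same split every time.

```python
import random


def simple_train_test_split(rows, test_size=0.3, seed=76):
    """Shuffle row positions with a seeded RNG, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # same seed, same shuffle
    n_test = round(test_size * len(rows))       # 30% of 150 rows is 45
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(150))                         # stand-in for the 150 iris rows
train, test = simple_train_test_split(rows)
print(len(train), len(test))                    # 105 45
```

The 45 test rows are where the `support` total of 45 in the classification reports later on this page comes from.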

---

## Create Initial Decision Tree

Before you jump into the Random Forest, try a single decision tree.  To do this, utilize the ```DecisionTreeClassifier()``` function and then ```fit()``` the model. Once more, to keep everyone on the same page, the random_state is 76. 

```python
decisionTree = DecisionTreeClassifier(random_state=76)
decisionTree.fit(x_train, y_train)
```

Here is the result you will receive: 

```text
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=76,
            splitter='best')
```

---

## Assess the Model

Now that the data is fit, the next step is to create a set of predictions and interpret the results. You can start by using the ```predict()``` function, and then you'll utilize the same confusion matrix and classification report coding as you did last lesson. 

```python
treePredictions = decisionTree.predict(x_test)
```

![The results of using the predict function. Column headings are precision, recall, F 1 score, and support. Row headings are setosa, versicolor, and virginica. A final row is labeled average forward slash total. Row one, one point zero zero, one point zero zero, one point zero zero, nineteen. Row two, zero point eight three, zero point seven seven, zero point eight zero, thirteen. Row three, zero point seven nine, zero point eight five, zero point eight one, thirteen. Final row, zero point eight nine, zero point eight nine, zero point eight nine, forty five.](Media/106.L5.4.png)

---

### Reading the Confusion Matrix 

Now go ahead and print out the confusion matrix: 

```python
print(confusion_matrix(y_test, treePredictions))
```

Here is your result: 

```text
[[19  0  0]
 [ 0 10  3]
 [ 0  2 11]]
```

It's time you were enlightened about the mysteries of the confusion matrix so that it becomes less confusing! 

This matrix will always be square. Along the top you have the levels of the variable you are predicting, crossed by the same levels down the left hand side.  The levels appear in the same order as the rows of the classification report: setosa, versicolor, virginica.  

The rows represent the actual values, and the columns represent the predicted values. 

In order to help you visualize all this, here's what that table would look like if it had headers: 

<table class="table table-striped">
    <tr>
        <th></th>
        <th>setosa (predicted)</th>
        <th>versicolor (predicted)</th>
        <th>virginica (predicted)</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>setosa (actual)</td>
        <td>19</td>
        <td>0</td>
        <td>0</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>versicolor (actual)</td>
        <td>0</td>
        <td>10</td>
        <td>3</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>virginica (actual)</td>
        <td>0</td>
        <td>2</td>
        <td>11</td>
    </tr>
</table>

It instantly becomes more readable! This matrix shows how your predictions matched up with reality. In the upper left corner is the number 19.  Because this is marked setosa both in the column and the row, that means that these are the predictions that were right! So, 19 irises were classified as the species setosa and actually were a setosa.  Moving to the next column, same row, is the number 0.  This means that there were no irises in the setosa species that were accidentally classified as versicolor.  Similarly, in the third column, there were no setosa irises accidentally classified in our model as virginica.

If you move on to the second row, for versicolor, you can see that no versicolor irises were accidentally misclassified as setosa.  In the next column, you can note that 10 irises were versicolor, and actually were classified as versicolor.  Then in the last column, you see that there were three versicolor irises that were misclassified as virginica species instead.

For the third row, there were no virginica irises misclassified as setosa, there were two misclassified as versicolor, and there were eleven properly classified as virginica.  

So this decision tree model is really good at predicting the species of setosa, but misclassified a few of the versicolor species as virginica and vice versa.

In summary, any time you have a cell in which the column and the row headers are the same, those are the cases that were correctly classified.  Ideally, if your model was 100% accurate, all other cells would be zero.  But very few models are perfectly accurate, so there are bound to be some errors. Looking at where your model is misclassifying cases is a great way to learn more about your model and a jumping off point to begin tweaking it to predict better next time.
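
As a quick check on that summary, the diagonal of the matrix can be added up by hand. This sketch simply retypes the matrix printed above as a list of lists (the variable names are for illustration) and works out the overall accuracy.

```python
labels = ["setosa", "versicolor", "virginica"]
matrix = [[19, 0, 0],
          [0, 10, 3],
          [0, 2, 11]]

# Cells where the row and column match are the correct classifications.
correct = {label: matrix[i][i] for i, label in enumerate(labels)}
total = sum(sum(row) for row in matrix)
accuracy = sum(correct.values()) / total

print(correct)
print(f"{sum(correct.values())} of {total} correct, accuracy = {accuracy:.2f}")
```

Forty of the 45 test irises sit on the diagonal, which is the 0.89 accuracy you will see in the classification report.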

---

### How Well Does your Model Fit?

Now print out the classification report, so you can examine how well the model fits the data! 

```python
print(classification_report(y_test, treePredictions))
```

And here is the resulting table:

```text
            precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.83      0.77      0.80        13
   virginica       0.79      0.85      0.81        13

   micro avg       0.89      0.89      0.89        45
   macro avg       0.87      0.87      0.87        45
weighted avg       0.89      0.89      0.89        45
```

The first thing to look at is `precision`. Although Python will do all the calculations for you, if you are someone who just has to know how everything works, here is the formula for precision:

![PrecisionFormula](Media/precision.png)

The total number of true positives is divided by the total number of predicted positives, whether true or false.

The next thing to look at is `recall`. The formula for recall is:

![RecallFormula](Media/recall.png)

So instead of putting the true positives over all positives, you put it over the sum of the true positives and the false negatives.

Then, you can look at the `f1-score`. This is a measure that is slightly more complex, as it is derived using both precision and recall.

![f1](Media/f1.png)

So setosa was predicted with 100% precision, while versicolor was predicted with 83% precision and virginica was predicted with 79% precision! Not too shabby, especially considering that you can predict the species of the flower with 89% accuracy overall.  Note, this is unusually high for a single tree, but the dataset is very clean and simple.
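
If you are someone who just has to see the arithmetic, the three measures can be recomputed by hand from the confusion matrix printed earlier. This sketch relies on scikit-learn's convention that `confusion_matrix(y_test, treePredictions)` puts actual classes in the rows and predicted classes in the columns; the variable names are for illustration.

```python
labels = ["setosa", "versicolor", "virginica"]
matrix = [[19, 0, 0],
          [0, 10, 3],
          [0, 2, 11]]

scores = {}
for k, label in enumerate(labels):
    tp = matrix[k][k]                          # actually this class, predicted as it
    fp = sum(row[k] for row in matrix) - tp    # predicted as this class, actually another
    fn = sum(matrix[k]) - tp                   # actually this class, predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    scores[label] = (precision, recall, f1)
    print(f"{label:>10}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

The printed values line up with the `precision`, `recall`, and `f1-score` columns of the classification report above.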

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Random Forest in Python<a class="anchor" id="DS106L8_page_5"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Random Forest in Python

Okay, that was one tree. To make a forest, just run that code a few hundred times and average the results.  Do this now… What? Why are you about to throw your computer out the window?! You think there is a better way?  Fine.  It’s ```RandomForestClassifier()```.  
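
The "combine many trees" idea can be sketched in a few lines. Below, each entry in `votes` stands for one tree's guess about a single flower (the votes are made up), and the forest's answer is whichever species gets the most votes. This is a simplification: scikit-learn's `RandomForestClassifier()` averages each tree's predicted class probabilities rather than counting hard votes, but the spirit is the same.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class that the most trees voted for."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Five imaginary trees each classify the same iris.
votes = ["versicolor", "virginica", "versicolor", "versicolor", "virginica"]
print(forest_vote(votes))   # versicolor wins 3 votes to 2
```

A single tree that guessed virginica would have been wrong on its own, but it gets outvoted by the rest of the forest.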

---

## Import Packages

You will need the following packages to complete random forests in Python:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
```

---

## Load in Data

You will continue to use the ```iris``` dataset for random forest creation.  You can utilize the code here:

```python
iris = sns.load_dataset('iris')
```

And if that didn't work, you can **[access the iris dataset here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Iris.zip)**.

---

## Data Wrangling 

Once more, subset your data:  

```python
x = iris.drop('species', axis=1)
y = iris['species']
```

---

## Train Test Split

And then ```train_test_split()``` the heck out of that data! 

```python
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=76)
```

---

## Initial Random Forest Model

And at last you are ready to kick it into high gear by creating your random forest model.  You'll use the function ```RandomForestClassifier()```, with the arguments ```n_estimators=``` to specify how many decision trees you want the random forest to stem from, and of course ```random_state=``` just to follow along with this content:

```python
forest = RandomForestClassifier(n_estimators=500, random_state=76)
forest.fit(x_train, y_train)
```

This is exactly like when you set up a single tree, except now you specify the number of trees you want to make with ```n_estimators=```, which you have set to 500.  This means you will be modeling your data with 500 decision trees.  If you have a very large dataset, you may want to set ```n_estimators=``` smaller to decrease the time it takes to run the ```RandomForestClassifier()``` function.  In general, a higher ```n_estimators=``` gives more stable predictions, although the gains level off after a point. You'll then ```fit()``` to the data. Here is the output:

```text
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=76, verbose=0, warm_start=False)
```

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>When you are doing this normally, you can skip making the single tree; the decision tree is mostly here for comparison.  That means that you can run this code immediately after splitting your data into training and testing sets.</p>
    </div>
</div>

---

## Evaluate Model Fit
 
The final step is to create your prediction set and print a report! Again, you'll make use of both the confusion matrix and the classification report.

```python
forestPredictions = forest.predict(x_test)
print(confusion_matrix(y_test, forestPredictions))
print(classification_report(y_test, forestPredictions))
```

And here are the results for the confusion matrix:

```text
[[19  0  0]
 [ 0 11  2]
 [ 0  0 13]]
```

Looks like the random forest did better than the decision tree, which is to be expected.  All setosa and virginica irises were correctly classified, and only two of the versicolor irises were misclassified as virginica. 

Now, looking at the classification report, again the random forest model did a better job: 

```text
             precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.85      0.92        13
   virginica       0.87      1.00      0.93        13

   micro avg       0.96      0.96      0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
```

That is it.  The new model is 96% accurate, and in general it will become more and more accurate the larger your dataset is.  There is 100% precision for both setosa and versicolor irises, but only 87% precision for virginica. Go forth and explore the Random Forest!

![A path through a forest.](Media/106.L5.12.jpeg)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Hyperparameter Tuning<a class="anchor" id="DS106L8_page_6"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Hyperparameter Tuning

In machine learning, the components of the model you're running are called *parameters*. This is very different from how you use the term in statistics, where a parameter just means a value that describes a population, so the context is important when you encounter the word parameter! 

There are parameters that the model will handle itself, based on your training data. There is nothing you can do with these. But some settings are chosen by you before the model is trained, and you can mess around with them to help improve your model fit. These adjustable settings are called *hyperparameters*. 

The process of "messing with" hyperparameters is called *tuning*. Think about a tune-up for your car.  You want to fix the little things so that overall, your car runs better and smoother. That's exactly what is going on with tuning your machine learning model as well.

---

## Hyperparameters for Decision Trees and Random Forests

There are four hyperparameters for decision trees and random forests that are important:

* Maximum depth
* Number of estimators
* Maximum number of features
* Minimum number of samples for a leaf

You will learn about each of them more below!

---

### Maximum Depth

The maximum depth is how far down the "roots" of your tree go. How many levels of Nodes do you allow between the Root Node and the Leaves?

![Maximum Depth](Media/depth.png)

---

### Number of Estimators

The number of estimators is how many trees make up a forest. Generally, more trees give you a more stable accuracy, but more trees also increase computation time.

![Number of Estimators](Media/estimators.png)

---

### Maximum Number of Features

A *feature* is one of your x variables. The maximum number of features limits how many of them the tree is allowed to consider each time it chooses a split at a decision point.

---

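### Minimum Number of Samples

The minimum number of samples is the fewest data points that are allowed to end up in a leaf. A split is only made if each resulting leaf keeps at least that many data points, which is why this has a minimum instead of a maximum. 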
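
Here is how those four hyperparameters line up with argument names in scikit-learn's `RandomForestClassifier()`. The values below are arbitrary placeholders chosen for illustration, not recommendations, and nothing is fit yet; the point is only to see which argument is which.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    max_depth=5,            # Maximum depth
    n_estimators=100,       # Number of estimators
    max_features="sqrt",    # Maximum number of features considered at each split
    min_samples_leaf=2,     # Minimum number of samples for a leaf
    random_state=76,        # only so results are repeatable
)

# get_params() reports the hyperparameters the model was created with.
params = forest.get_params()
for name in ["max_depth", "n_estimators", "max_features", "min_samples_leaf"]:
    print(name, "=", params[name])
```

A single `DecisionTreeClassifier()` accepts the same arguments except `n_estimators`, since it is only one tree.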

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Hyperparameter Tuning in Python<a class="anchor" id="DS106L8_page_7"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Hyperparameter Tuning in Python

Now that you know a little bit about hyperparameter tuning, you'll go ahead and do it in Python! Huzzah!

---

## Load in Libraries

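In addition to everything else you used for random forests, you'll also need the following imports. `accuracy_score` was already imported on the decision trees page, but it is repeated here in case you are starting from the random forest imports:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
```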

---

## Number of Estimators

The first thing you want to do is determine how many trees you should be using, which is the number of estimators. The code below will help you find the best number of estimators based on the accuracy of the model. Remember that an accuracy of 1 is the highest you can get, so the closer you get to one, the better.

You can create an array that contains the most likely number of estimators, which is what is shown in the first line. While you could put any numbers in this array, these typically get used frequently in ML. Then you'll create an empty list named `results` that will end up filled using a for loop! 

And lastly, on to the for loop itself! This iterates over your `n_estimators_array`, creates a random forest for each value, calculates its accuracy, and adds that accuracy to your `results` list. So you don't have to test everything manually! The very last line in the for loop prints out each result as it becomes available. Depending on how fast your computer is, this code may take a minute (you are doing 11 random forests, after all!) and you can see the results come up in real time. Pretty cool!

```python
n_estimators_array = [1, 4, 5, 8, 10, 20, 50, 75, 100, 250, 500]
results = []
for n in n_estimators_array:
    forest = RandomForestClassifier(n_estimators=n, random_state=76)
    forest.fit(x_train, y_train)
    result = accuracy_score(y_test, forest.predict(x_test))
    results.append(result) 
    print(n, ':', result)
```

Here are the results:

```text
1 : 0.9111111111111111
4 : 0.9555555555555556
5 : 0.9333333333333333
8 : 0.9555555555555556
10 : 0.9777777777777777
20 : 0.9555555555555556
50 : 0.9555555555555556
75 : 0.9555555555555556
100 : 0.9555555555555556
250 : 0.9555555555555556
500 : 0.9555555555555556
```

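So it looks like the best accuracy arises when you use only 10 trees instead of the 500 you used earlier! Good to know. Keep in mind that this is based on a single test set of 45 irises, so each step in accuracy here is just one flower.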
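
Rather than scanning the printout by eye, you can also pull the winner out of the `results` list directly. The sketch below retypes the accuracies printed above (rounded to four decimals) so that it runs on its own; in your notebook you would use the `results` list you already built.

```python
n_estimators_array = [1, 4, 5, 8, 10, 20, 50, 75, 100, 250, 500]
# Accuracies copied from the output above, rounded to four decimals.
results = [0.9111, 0.9556, 0.9333, 0.9556, 0.9778,
           0.9556, 0.9556, 0.9556, 0.9556, 0.9556, 0.9556]

best_accuracy = max(results)
# index() returns the first position of the best accuracy, so ties go to fewer trees.
best_n = n_estimators_array[results.index(best_accuracy)]

print(best_n, best_accuracy)   # 10 0.9778
```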

If you wanted a visual representation of this, that can be done too with your good old `plt.plot()` function! 

```python
plt.plot(n_estimators_array, results)
```

Here is the resulting plot:

![Number of Estimators Plot for Iris Data](Media/estimatorPlot.png)

You really get a sense from this graph that things have completely stagnated before 100 trees, so it certainly is a waste of processing power to request 500!

---

## Tuning the Remaining Three

If you're wondering if there is an easier way to find the best hyperparameter values without having to go through each, then guess what? There is! You can automate it and find 'em all in one whack with `RandomizedSearchCV`. Although you'll be doing this just with random forests right now, it will work with any estimator in the `sklearn` library! If you're not convinced yet that a randomized grid search is the bee's knees, then to sweeten the pot, they've thrown in cross-validation for the accuracy calculations as well!

Below you are creating lists with all the hyperparameter values you want to trial. There is one for each of the remaining three hyperparameters, named: `max_features`, `max_depth`, and `min_samples_leaf`. Then, you'll create a dictionary with the hyperparameter names as the keys and the list variables as the values. This is called a *grid* and is aptly named `random_grid`. 

```python
# Number of features to consider at every split
# Note: 'auto' was removed in scikit-learn 1.3; on newer versions use 'sqrt',
# which is what 'auto' meant for classifiers
max_features = ['auto', None, 'log2']
# Maximum number of levels in tree
max_depth = [10, 20, 30, 40, 50, 60, 70, 80, 90, None]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Create the random grid
random_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_leaf': min_samples_leaf}
print(random_grid)
```

Phew! If you run that code, the result you get is just a print out of the options: 

```text
{'max_features': ['auto', None, 'log2'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, None], 'min_samples_leaf': [1, 2, 4]}
```

So nothing has actually happened yet - but you're prepared for the next move, which is to try out a random forest for the combinations of hyperparameter values in the `random_grid`. Since you know that you only want ten trees, the first line sets up a random forest model with that. 

The next line of code gives you a random search of the `random_grid` you created using the function `RandomizedSearchCV()`. The arguments for that function include the `estimator=`, which is what you've named your latest iteration of the random forest with only ten estimators, the `param_distributions=` argument, which is where you plug in the `random_grid` dictionary, `n_iter=`, which is the number of hyperparameter combinations to try, and lastly, the `cv=` argument, which allows you to choose how many folds you'd like in your cross validation. The `random_state=` argument is not required to run the code, but including it means the search samples the same combinations each time. Because the random forest itself is not given a `random_state=` here, your results may still differ slightly from those in the lesson.

```python
rf = RandomForestClassifier(n_estimators=10)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 90, cv = 3, random_state=42)
```
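
It can help to count how many combinations the grid actually holds. The sketch below retypes the grid from above and counts them with `itertools.product`; it is a side calculation for illustration, not something `RandomizedSearchCV()` needs you to run.

```python
from itertools import product

random_grid = {
    "max_features": ["auto", None, "log2"],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, None],
    "min_samples_leaf": [1, 2, 4],
}

# Every possible pairing of one value from each list.
combinations = list(product(*random_grid.values()))
n_combinations = len(combinations)
n_fits = n_combinations * 3          # cv=3 folds per combination

print(n_combinations)                # 90
print(n_fits)                        # 270
```

Since the grid holds 3 x 10 x 3 = 90 combinations and every hyperparameter is given as a list, `n_iter = 90` means the search ends up trying each combination once, which works out to 270 cross-validation fits before the best model is refit on the full training set.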

With that created, it's time to fit! 

```python
rf_random.fit(x_train, y_train)
```

Here are the results you should receive:

```text
RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(n_estimators=10),
                   n_iter=90,
                   param_distributions={'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, None],
                                        'max_features': ['auto', None, 'log2'],
                                        'min_samples_leaf': [1, 2, 4]},
                   random_state=42)
```

It basically just tells you what it did, which is not particularly helpful. What would be helpful is knowing which hyperparameter values produced the best accuracy. But that isn't possible, is it?

It is! Try this line of code out!

```python
rf_random.best_params_
```

And now you are getting somewhere:

```text
{'min_samples_leaf': 4, 'max_features': 'auto', 'max_depth': 30}
```

This means that the model with the best accuracy has at least 4 samples per leaf, leaves the max features setting on auto, and has a maximum depth of 30 levels. Pretty nifty! Now all you need to do is run one last random forest that actually has those hyperparameters! This is relatively plug-and-play here, since in your random grid search code, you used the same names as the function arguments. 

```python
# On scikit-learn 1.3 or newer, use max_features="sqrt" in place of "auto"
forest = RandomForestClassifier(n_estimators=10, min_samples_leaf=4, max_features="auto", max_depth=30)
forest.fit(x_train, y_train)
```

Running the above code just tells you the details of the model:

```text
RandomForestClassifier(max_depth=30, min_samples_leaf=4, n_estimators=10)
```

But if you want to know how well it performs (and of course you do!) you can use the same prediction and classification report info as before, but with your new and improved model: 

```python
forestPredictions = forest.predict(x_test)
print(confusion_matrix(y_test, forestPredictions))
print(classification_report(y_test, forestPredictions))
```

Here is the resultant information: 

```text
[[19  0  0]
 [ 0 10  3]
 [ 0  0 13]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.77      0.87        13
   virginica       0.81      1.00      0.90        13

    accuracy                           0.93        45
   macro avg       0.94      0.92      0.92        45
weighted avg       0.95      0.93      0.93        45
```

Looking good here! Overall accuracy is 93%, and the weighted average precision is 95%. 
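The numbers in the classification report can be re-derived by hand from the confusion matrix, which is a handy way to check your reading of it. The sketch below copies the matrix from the output above and uses plain Python; rows are the actual classes and columns are the predicted classes.

```python
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
labels = ["setosa", "versicolor", "virginica"]
matrix = [
    [19, 0, 0],
    [0, 10, 3],
    [0, 0, 13],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

for i, label in enumerate(labels):
    actual = sum(matrix[i])                    # row total
    predicted = sum(row[i] for row in matrix)  # column total
    recall = matrix[i][i] / actual
    precision = matrix[i][i] / predicted
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The three versicolor flowers that were predicted as virginica are what pull versicolor's recall down to 0.77 and virginica's precision down to 0.81.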

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Feature Importance<a class="anchor" id="DS106L8_page_8"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Feature Importance

When you have more than one x variable in a machine learning model, how do you know which one of them is important? Are they all equally important? Do some help predict y more than others? If you were doing regression, you could find out by looking at the individual *p* values related to each variable, or you could do a stepwise regression of some sort to tease out how much each matters. But those options aren't available for decision trees and random forests. However, something called *feature importance* is.

Each variable in machine learning can also be referred to as a *feature*. So, determining the feature importance just means figuring out which variables make the most difference to the prediction of y.

---

## Feature Importance in Python

It's a pretty quick and easy line of code to get feature importances! They are outputs of your model, and so you just need to call them in a format that is useful. You'll create a new variable called `feature_importances` that is formatted as a pandas series, using the function `pd.Series()`. Then, you can call `forest.feature_importances_`, which is created by default when you fit `forest`. Lastly, for readability, you can label the values with the argument `index=` and put in `x.columns` so that the column names from your dataset show on the left.

```python
feature_importances = pd.Series(forest.feature_importances_, index=x.columns)
feature_importances
```

Then it's up to you to call the variable, and once you have, voila! You get details on how important each of the features is:

```text
sepal_length    0.068363
sepal_width     0.005359
petal_length    0.687949
petal_width     0.238329
dtype: float64
```

The larger the value, the more that feature contributed to the model's predictions; together, the importances add up to 1. Wouldn't it be nice to see each of them in order of feature importance, rather than in column order? Well, that can be arranged! The `sort_values()` function will sort them. The `inplace=True` argument, like always, modifies `feature_importances` itself rather than returning a sorted copy, and `ascending=False` means that this goes from largest to smallest, which is exactly what you'd like to see! Then all you need to do is print it out and do a little happy dance. 

```python
feature_importances.sort_values(inplace=True, ascending=False)
print(feature_importances)
```
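If you would rather keep the original column order around, here is a small sketch of the same sort without `inplace=True`. It rebuilds the series from the importance values printed earlier in this lesson, so it runs without a fitted model; the variable name `ranked` is just an illustration.

```python
import pandas as pd

# Importance values copied from the lesson output above.
feature_importances = pd.Series(
    [0.068363, 0.005359, 0.687949, 0.238329],
    index=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# Without inplace=True, sort_values() returns a new, sorted Series
# and leaves feature_importances in its original order.
ranked = feature_importances.sort_values(ascending=False)
print(ranked)

# Importances from a scikit-learn forest are normalized, so they sum to about 1.
total_importance = ranked.sum()
print(round(total_importance, 6))
```

Here the two petal measurements account for over 90% of the total importance, which is why they dominate the ranking.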

But wait! There's more! If you're a visual person, you can also graph this. A simple bar graph will do if you aren't showing it off to anyone:

```python
feature_importances.plot(kind='barh', figsize=(7,6))
```

And here is the resulting image:

![Feature Importance for Iris Data](Media/feature_importance.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Pulling Data from an API<a class="anchor" id="DS106L8_page_9"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Pulling Data from an API

Although you can utilize built-in datasets or upload your own datasets in Python, there is a third option: to pull data from an *application programming interface*, known as an *API*. This is an interface that web developers have written that allows you to access the data from their website programmatically. Everything from NASA to YouTube has an API you can reach from Python, and it can be a great way to access lots of data easily. In this tutorial, you will be utilizing an API from Quandl, a financial and housing data site. 

---

## Accessing the Quandl API

As with many APIs, Quandl has its own Python package, which allows you to connect with their API. In order to access it, you will need to run a ```pip install``` in your terminal, since this isn't a package that comes standard with Anaconda Python: 

```bash
pip install quandl
```

---

## Exploring the Quandl Website

Go to **[the Quandl website](https://www.quandl.com)**, and in the top search bar, look for ```FMAC/HPI_AK```: 

![The Quandl website. Atop the page is a search bar containing F M A C forward slash H P I underscore A K. The main banner reads, welcome to your new home page, from here you can view your subscriptions and browse data of interest. Below the main banner is a section highlighting a featured research and analytics firm.](Media/quandl1.png)

You should see something like this:

![The Quandl website after searching for F M A C forward slash H P I underscore A K. The results page reads, There are one databases with data on F M A C forward slash H P I underscore A K. Below is the result, Freddie Mac, data from Freddie Macs Primary Mortgage Market Survey and other region specific historical mortgage rates. There are three links to data, house price indices fairbanks alaska, house price indices anchorage alaska, and house price indices alaska. ](Media/quandl2.png)

Go ahead and click on the last one; to do so you'll click on the text to the right of the dataset name. Once you do so, you'll see a page dedicated to just that dataset, looking like this:

![The Quandl website after searching for F M A C forward slash H P I underscore A K and selecting the search result house price indices alaska. The main section of the page shows a chart with a line rising from left to right. The y axis is home price and the x axis is years. A description of the data is on the left, and options to export the data are on the right.](Media/quandl3.png)

While you could easily download this and save it as a ```.csv```, doing so for all 50 states would be tedious, and in this case you've been asked by a client to compare housing price indexes across the country.

Notice the Quandl Code in the upper right hand corner of your screen. It says FMAC/HPI_AK, where AK is the state abbreviation for Alaska. You can use this Quandl code to do the extraction. Wouldn't it be great if you could get Python to do the tedious work for you?

Start by importing the ```quandl``` package, now that it is installed:

```python
import quandl
```

Then, use the command below to get the Alaska data:

```python
Alaska = quandl.get("FMAC/HPI_AK")
Alaska.head()
```
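Since the Quandl code ends in a state abbreviation, one way to let Python do the tedious work is to build the codes in a loop. This is a hedged sketch: it assumes every state follows the same `FMAC/HPI_` pattern as Alaska, uses a hypothetical short list of abbreviations, and wraps the download in a hypothetical helper called `fetch_all` so that nothing is requested until you call it.

```python
# Hypothetical short list of state abbreviations; extend it to all 50 states.
state_abbreviations = ["AK", "AL", "AR", "AZ"]

# Build one Quandl code per state, following the FMAC/HPI_<STATE> pattern.
quandl_codes = [f"FMAC/HPI_{abbr}" for abbr in state_abbreviations]
print(quandl_codes)


def fetch_all(codes):
    """Download each dataset and return a dict keyed by Quandl code."""
    import quandl  # imported here so the list-building above runs without it
    return {code: quandl.get(code) for code in codes}


# Uncomment to download (needs the quandl package and network access):
# state_data = fetch_all(quandl_codes)
```

Keeping the results in a dictionary keyed by code makes it easy to pull out one state's data later, for example `state_data["FMAC/HPI_AK"]`.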

If you want to pull multiple datasets, or larger datasets, then you may end up with an error: because you're pulling so much data, Quandl wants to make sure you're not a bot! If that happens, you can sign up for a free account with Quandl, and then go to the ```ACCOUNT SETTINGS``` section, located under the little profile person:

![The Quandl website showing a users A P I Key in the user profile section. There is a link to request a new A P I key. Below are fields for the user to enter first name, last name, school email, name of college or university, and how the user will be using this data.](Media/quandl4.png)

You'll see at the top that there's an API key, and you can use that in your Jupyter Notebook to access data if necessary, like this:

```python
quandl.ApiConfig.api_key = "YOUR_API_KEY"
```

Your specific API key from the ```ACCOUNT SETTINGS``` section of the webpage will go between the double quotes. Then you can proceed to import more and/or bigger datasets!
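If you plan to share your notebook, you may prefer not to type the key into it at all. One common approach, sketched below, is to read it from an environment variable. The variable name `QUANDL_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something Quandl requires.

```python
import os


def load_api_key(variable_name="QUANDL_API_KEY"):
    """Return the API key stored in an environment variable, or None if unset."""
    return os.environ.get(variable_name)


api_key = load_api_key()
if api_key is None:
    print("No API key found; set QUANDL_API_KEY before requesting data.")
else:
    # With the quandl package imported, this is the same assignment as above:
    # quandl.ApiConfig.api_key = api_key
    print("API key loaded.")
```

This keeps the key out of the notebook file itself, so it won't be exposed if you zip up and submit or publish your work.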

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>Want to learn more about API integration in Python? Check out <a href="https://medium.com/quick-code/absolute-beginners-guide-to-slaying-apis-using-python-7b380dc82236">this Medium article!</a></p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Key Terms<a class="anchor" id="DS106L8_page_10"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Decision Tree</td>
        <td>A flow chart of questions about the data that leads to a predicted outcome.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Node</td>
        <td>Question asked in a decision tree.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Root Node</td>
        <td>The first question asked in a decision tree.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Leaves</td>
        <td>Possible outcomes in a decision tree.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Random Forest</td>
        <td>A collection of decision trees, each built from a random selection of the data and variables, whose predictions are combined.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>M </td>
        <td>Number of variables chosen in a model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>P</td>
        <td>Total number of variables from which to choose.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.tree</td>
        <td>Performs decision trees.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.ensemble</td>
        <td>Performs random forests.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DecisionTreeClassifier()</td>
        <td>Performs decision trees.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>predict()</td>
        <td>Makes predictions using your fitted model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>RandomForestClassifier()</td>
        <td>Performs random forest models.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Application Programming Interface (API)</td>
        <td>Allows you to access data from a website.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Lesson 3 Hands-On<a class="anchor" id="DS106L8_page_11"></a>

[Back to Top](#DS106L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Decision Trees and Random Forests Hands-On

In this Hands-On exercise, you will create a project which will solidify your understanding of decision trees and random forests. This Hands-On will be completed in Python, using your text editor or IDE of choice (e.g. VSCode, Jupyter Notebooks, Spyder, etc.). 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Part I

Create a decision tree model that predicts survival, using the ```Titanic``` dataset from seaborn. You will need to import the data using this code: 

```python
import seaborn as sns

Titanic = sns.load_dataset('titanic')
```

If seaborn isn't working for you, **[click here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Titanic.zip)** to download the data.

You will need to perform some data wrangling before charging ahead. Make sure to complete the following wrangling tasks: 

* Recode string data
* Remove missing data
* Drop any variables that are redundant and will add to multicollinearity. 

Once you have created a decision tree model, interpret the confusion matrix and classification report.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Decision trees and random forests cannot have any string or missing data, so you will need to recode some variables and use the .dropna() method! </p>
    </div>
</div>

---

## Part II

Now create a random forest model of the Titanic dataset that predicts survival.  Interpret the confusion matrix and classification report.  How did the predictive value change from the decision tree? 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

