![A Cloud Guru](acg_logo.png)
<hr/>

<center><h1>Create a Random Forest Classifier Using scikit-learn</h1></center>

|![Sparky](sparky.png)|![Penny](penny.png)|
|:-------------------:|:-------------------:|
|Sparky the Dog|Penny the Cat|

<center><h2>Are You Likely to Be a Dog Person or a Cat Person?</h2></center>

# Introduction

In this lab, we take a quick look at __scikit-learn__, a machine learning framework.

Designed to be native to Python, scikit-learn contains various classification, regression, and clustering algorithms, including __random forests__, which we use in this lab.

# Scenario

You run a local pet store, and you want to know what kind of products to recommend to new customers. You've contracted a survey company to collect information on 199 of your current customers, including whether they consider themselves 'dog people' or 'cat people'. The survey asked the following questions:

- Do you like walking?
- Do you like running?
- What is your favorite color?
- How many miles do you walk in a day?
- Do you like dogs or cats?

Clearly this is a fictitious dataset. If this were real, you'd fire the survey company!

We're going to use a random forest algorithm trained with our survey data to build a model that classifies new customers as dog or cat people.

We do this using standard Python libraries such as NumPy, Pandas, and matplotlib along with scikit-learn.

## scikit-learn
### Machine Learning in Python

- Simple and efficient tools for data mining and data analysis.
- Accessible to everybody and reusable in various contexts.
- Built on NumPy, SciPy, and matplotlib.
- Open source and commercially usable - BSD license.

_(Source: https://scikit-learn.org/)_

# How to Use This Lab

All of the code is provided for you in this lab as our solution to the tasks presented. You could simply execute the notebook to get a result, but that's not really very hands-on and it won't teach you anything but how to execute cells in a Jupyter notebook. To get the most from this lab, you should understand what the code in each cell is trying to accomplish, and then take the time to experiment: make changes, break it, fix it, and learn! You can always pull the code down again to get a clean copy.

# 1) Import Libraries

In [None]:
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sn
#Use a magic command to display the graphs better
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Set the random seed to have reproducible results
np.random.seed(42)

# These two imports help us render a graph
from subprocess import call
from IPython.display import Image

# 2) Prepare the Data

We have a dataset created from a set of people who either liked dogs or cats. Clearly this is a fictional dataset and some of the data is useless. But let's see what we can learn from it.

First we load the data. Pandas provides a very convenient way of loading CSV data, which is what our survey results are stored as.

In [None]:
# TODO: Import the data.csv file into a Pandas DataFrame
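
As a minimal sketch of this step, the cell below loads CSV text with `pd.read_csv`. To keep it runnable on its own, it reads from an in-memory string instead of `data.csv`; the header names and the three rows are hypothetical stand-ins, not the lab's real survey data.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: these header names and rows are
# illustrative assumptions, not the real survey results.
csv_text = """likes_walking,likes_running,color_type,distance_walked,dogs_cats
1,0,red,3.2,1
0,0,blue,0.8,0
1,1,green,5.1,1
"""

# pd.read_csv accepts a file path or any file-like object.
# In the lab, you would pass the path instead: pd.read_csv('data.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
```

Pandas infers a data type for each column while loading, which is why checking `df.dtypes` is a sensible next step.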

Let's take a look at the first few rows of our data:

In [None]:
df.head(10)

Now we should check to see what data types it's using.

In [None]:
df.dtypes

For our first data preparation step, let's change the names of the columns, including marking ```dogs_cats``` as our label.

In [None]:
df.columns = ['walk', 'run', 'color', 'distance', 'label']

The data is not shuffled very well. The first few rows contain a lot of similar information, so they don't teach us much. Rather than inspect the remaining rows individually, we want to describe the data as a whole. Let's have Pandas tell us more about the data statistically.

In [None]:
df.describe()

For our next data preparation step, let's format the data in the columns so the model will understand it better. 

Although the `walk`, `run`, and `label` columns are stored as integers, each one actually represents a binary value, 1 or 0. For `walk` and `run`, this corresponds to True or False for liking that activity. For `label`, it represents whether you prefer dogs (1) or cats (0). Instead of using integers, let's change these fields to the boolean data type.

`color` is not free-form text. It actually represents one of three categories (red, green, or blue). Instead of treating it as text, let's change it to a categorical data type.

Note that we are not modifying `distance`. The survey question asks for a number, and the column already holds one. No change needed!

In [None]:
# TODO: Set the Data types for each of our columns
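
One way to approach this step, shown on a few hypothetical rows rather than the real survey data, is to use `astype` for the boolean columns and a `CategoricalDtype` for the color column. The order of the categories here is an assumption.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical rows that use the column names set earlier in this lab
df = pd.DataFrame({
    'walk': [1, 0, 1],
    'run': [0, 0, 1],
    'color': ['red', 'blue', 'green'],
    'distance': [3.2, 0.8, 5.1],
    'label': [1, 0, 1],
})

# The 1/0 integers become True/False
for column in ['walk', 'run', 'label']:
    df[column] = df[column].astype(bool)

# The text becomes one of three known categories (this ordering is an assumption)
color_type = CategoricalDtype(categories=['red', 'green', 'blue'])
df['color'] = df['color'].astype(color_type)

print(df.dtypes)
```

Declaring the categories explicitly means every color keeps a slot even if it never shows up in a particular sample of rows.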

In [None]:
df.dtypes

The algorithm will not understand color as a list of strings. With the data as it is now, we can expect to get the following error when we try to fit the data:
`ValueError: could not convert string to float: 'green'`
So we use the `get_dummies` function to perform one-hot encoding directly on the DataFrame.

In [None]:
# TODO: Perform one-hot encoding of our `color` column
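
Here is a small sketch of what one-hot encoding does, again on hypothetical rows. It assumes the color column is named `color` and is already categorical; `pd.get_dummies` then replaces it with one indicator column per category.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical rows; the category ordering is an assumption
color_type = CategoricalDtype(categories=['red', 'green', 'blue'])
df = pd.DataFrame({
    'walk': [True, False, True],
    'distance': [3.2, 0.8, 5.1],
    'color': pd.Series(['red', 'blue', 'green'], dtype=color_type),
})

# Replace 'color' with one indicator column per category
encoded = pd.get_dummies(df, columns=['color'])

print(list(encoded.columns))
```

Each row ends up with exactly one of the `color_*` columns switched on, which gives the algorithm numbers to work with instead of strings.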

With data formatting complete, let's take a quick look at what our data looks like now.

In [None]:
df.head(10)

<a id='section3'></a>

The last data preparation task is to split the data into training and testing sets. We use scikit-learn's `train_test_split` to split off 20% of the data for testing so we have 80% for training. We also set `random_state` so that we'll have a random, but repeatable, data split.

In [None]:
# TODO: Split the dataset into 80% Training, and 20% Testing
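
A possible shape for this step is sketched below on ten hypothetical rows. The names `X` and `y` and the value `random_state=42` are assumptions chosen for illustration; the key parts are separating the label from the features and passing `test_size=0.2`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical rows, so an 80/20 split gives 8 training and 2 testing rows
df = pd.DataFrame({
    'walk': [True, False] * 5,
    'distance': [0.5 * i for i in range(10)],
    'label': [True, True, False, False, True, False, True, False, True, False],
})

# Features are every column except the label
X = df.drop(columns=['label'])
y = df['label']

# random_state=42 is an assumed value; any fixed integer makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

The features and labels are split together, so each row of `X_train` still lines up with its entry in `y_train`.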

# 3) Create the Model

The time has come to create the model. We're going to create a `RandomForestClassifier` model. We set one parameter, but there are many more options for tweaking your model. You can read about them in the [Random Forest Classifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
model = RandomForestClassifier(max_depth=5)

Now we `fit` the model on the training data.

In [None]:
# TODO: Fit the model to use our training data
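
Fitting could look something like the sketch below. The training data here is a small hypothetical stand-in, and `random_state=42` is an added assumption to make the forest repeatable; the lab's own model only sets `max_depth`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data standing in for X_train and y_train
X_train = pd.DataFrame({
    'walk': [True, False] * 20,
    'distance': np.linspace(0, 10, 40),
})
y_train = pd.Series([True, False] * 20, name='label')

# random_state is an assumption added for repeatability
model = RandomForestClassifier(max_depth=5, random_state=42)

# fit() learns from the features and labels and returns the fitted model
model.fit(X_train, y_train)

print(len(model.estimators_), model.classes_)
```

After fitting, `model.estimators_` holds the individual decision trees that make up the forest.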

With 199 survey responses, this trains very quickly. Now, let's see what we got.

# 4) Evaluate the Model

First, we grab one estimator (a single decision tree) from the trained model.

In [None]:
estimator = model.estimators_[0]
feature_names = [i for i in X_train.columns]

Then we use ```export_graphviz``` to get a graphical representation of __one of__ the trees in our forest. __After this code block has run, wait a moment for the graph to appear.__

In [None]:
export_graphviz(estimator, out_file='tree.dot',
                feature_names = feature_names,
                rounded = True,
                filled = True)

# Use CLI Graphviz to create a png file from our graph data:
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

Image(filename = 'tree.png')
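
If the Graphviz `dot` command is not available in your environment, a plain-text view of the same tree is one alternative. This sketch swaps in scikit-learn's `export_text` in place of `export_graphviz`, and it trains on hypothetical data so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Hypothetical training data standing in for X_train and y_train
X_train = pd.DataFrame({
    'walk': [True, False] * 10,
    'distance': np.linspace(0, 10, 20),
})
y_train = pd.Series([True, False] * 10)

# random_state is an assumption added for repeatability
model = RandomForestClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# Take one tree from the forest and print its decision rules as indented text
estimator = model.estimators_[0]
tree_text = export_text(estimator, feature_names=list(X_train.columns))

print(tree_text)
```

The text output shows the same splits as the image would, one line per decision, without needing any external tools.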

Now let's run our testing data through the model.

In [None]:
# TODO: Use the Testing dataset to produce the first predictions
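
A sketch of the prediction step follows, using hypothetical data in place of the lab's training and testing sets. The name `y_predict` matches the variable the confusion matrix cell expects.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the training and testing sets
X_train = pd.DataFrame({
    'walk': [True, False] * 10,
    'distance': np.linspace(0, 10, 20),
})
y_train = pd.Series([True, False] * 10)
X_test = pd.DataFrame({'walk': [True, False], 'distance': [2.5, 7.5]})

# random_state is an assumption added for repeatability
model = RandomForestClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# predict() returns one predicted label per row of X_test
y_predict = model.predict(X_test)

print(y_predict)
```

Because the label was trained as a boolean, the predictions come back as booleans too.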

To see how it performed, let's use ```confusion_matrix``` to create a confusion matrix for the test data predictions.

In [None]:
conf_matrix = confusion_matrix(y_test, y_predict)
conf_matrix

That's not very easy to read. Let's add some style to it using the `seaborn` library.

In [None]:
# Unpack the 2x2 matrix; scikit-learn orders it as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = conf_matrix.ravel()
# Rearrange so rows are predictions and columns are actual values, dogs first
cleaned_conf_matrix = np.array([[tp, fp], [fn, tn]])
y_true = ["Dog", "Cat"]
df_cm = pd.DataFrame(cleaned_conf_matrix, columns=y_true, index=y_true)
df_cm.index.name = 'Predicted'
df_cm.columns.name = 'Actual'

plt.figure(figsize = (8,5))
plt.title('Confusion Matrix')
sn.set(font_scale=1.4)  # label size
sn.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # annotation font size

And from the confusion matrix, let's calculate **sensitivity** and **specificity**.

In [None]:
sensitivity = tp / (tp + fn)
print('Sensitivity : ', sensitivity )

specificity = tn / (tn + fp)
print('Specificity : ', specificity)
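
To make the arithmetic concrete, here is a tiny worked example with five hypothetical labels (not the lab's data). It shows the order in which `ravel()` returns the four counts and how the two rates follow from them.

```python
from sklearn.metrics import confusion_matrix

# Five hypothetical customers: True means dog person, False means cat person
y_true = [True, True, True, False, False]
y_pred = [True, True, False, False, True]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Sensitivity: the share of actual dog people that we caught
sensitivity = tp / (tp + fn)
# Specificity: the share of actual cat people that we caught
specificity = tn / (tn + fp)

print(tn, fp, fn, tp)
print(sensitivity, specificity)
```

Here two of the three dog people and one of the two cat people are classified correctly, so sensitivity is 2/3 and specificity is 1/2.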

Now let's look at the __ROC__ graph.

In [None]:
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Dog vs Cat people')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

And calculate the __AUC__. Quick quiz: what does AUC stand for?

In [None]:
auc(fpr, tpr)
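
As a quick cross-check, scikit-learn also offers `roc_auc_score`, which computes the same number directly from the labels and scores. The four labels and scores below are hypothetical.

```python
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Four hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Route 1: build the curve, then measure the area beneath it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# Route 2: compute the score in a single call
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)
```

Both routes give 0.75 for this data. A value of 0.5 corresponds to the dashed diagonal on the ROC graph, which is what random guessing looks like.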

# 5) Predict for Yourself!

Change the values for the survey questions below and see what the model predicts for you!

* `walk`/`run` should be set to either `0` for no, or `1` for yes
* `distance` should be set to a `float` number
* The `color_*` should be set to `1` for your favorite, and `0` for the rest

In [None]:
data = pd.DataFrame([{
    'walk': 1,
    'run': 0,
    'distance': 8.5,
    'color_red': 1,
    'color_green': 0,
    'color_blue': 0
}])

prediction = model.predict(data)
print(prediction)
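
Recent scikit-learn releases check that the column names passed to `predict` match the ones seen during `fit`, so a mismatch in names or order can produce a warning or an error. If that happens, one option is to reorder the columns with `reindex`. The column order below is hypothetical; in the lab you would use `list(X_train.columns)`.

```python
import pandas as pd

# Hypothetical training column order; in the lab, use list(X_train.columns)
feature_order = ['walk', 'run', 'distance', 'color_red', 'color_green', 'color_blue']

# The same survey answers, deliberately listed in a different order
data = pd.DataFrame([{
    'distance': 8.5,
    'color_blue': 0,
    'walk': 1,
    'run': 0,
    'color_red': 1,
    'color_green': 0,
}])

# Reorder the columns to match the order used in training
aligned = data.reindex(columns=feature_order)

print(list(aligned.columns))
```

Passing `aligned` to `model.predict` then presents the features in the order the model learned them.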

Okay, that's not very interesting. Because we trained the model on a boolean label, the result is a boolean value: `True` for a dog person and `False` for a cat person. So, let's render this a bit more visually:

In [None]:
if prediction[0]:
    animal = 'dog'
    pic = 'sparky.png'
else:
    animal = 'cat'
    pic = 'penny.png'
    
img=mpimg.imread(pic)
imgplot = plt.imshow(img)
plt.grid(False)
plt.xticks([])
plt.yticks([])
plt.title("We predict that you're a %s person!" % animal)
plt.show()

# Conclusion
Did it get the answer right for you? Probably not, but the data the model is trained on is essentially random, so the results are as well. The Random Forest algorithm did its best to find a pattern and split the data along the nodes, but it's trying to find a signal in data that is only noise.

In this lab, we've demonstrated how you can turn survey data into predictions using scikit-learn. Using real data from a properly designed survey, you can very quickly get pretty good results.

Go ahead. Try a few more predictions for fun. You know you want to!