<h1><center>Decision Trees Workshop</center></h1>

### Abstract

This lab demonstrates how to use the decision tree machine learning algorithm.  In this example, we use historical data on patients and their responses to medication to build a decision tree that suggests a suitable drug for a new patient.

First, we need to import the following libraries:
- numpy (as np)
- pandas
- DecisionTreeClassifier from sklearn.tree

In [3]:
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

### The dataset
We have data on a set of patients and the drug each one responded to.<br>
We now need to build a model that predicts which drug might be appropriate for a future patient with the same illness.<br>
Because the target can take more than two values (for example drugX, drugY, and drugC), we will use a multiclass classifier.

### Downloading the Data
We will use pd.read_csv to load the data directly from a URL and then display its structure.

In [4]:
my_data = pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv")
my_data[0:5]

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


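Let's see how much data we are dealing with. Note that __size__ returns the number of cells (rows × columns), not the number of rows; for this dataset, 200 rows × 6 columns gives 1200.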

In [5]:
my_data.size

1200

### Pre-processing

Using my_data, we need to declare the following variables:
- X: feature matrix (independent variables of my_data)
- y: response vector (target)

To build X, leave out the column that contains the target (Drug), since it is the value we want to predict.

In [7]:
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.113999999999999],
       [28, 'F', 'NORMAL', 'HIGH', 7.797999999999999],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

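Since some of these features are categorical rather than numeric, such as __Sex__ or __BP__, we need to convert them.  Here we use __LabelEncoder__ from sklearn.preprocessing, which maps each category to an integer code. (__pandas.get_dummies()__ is an alternative that produces one-hot columns instead.)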

In [9]:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)
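
The encoded output above shows HIGH mapped to 0, LOW to 1, and NORMAL to 2, even though the encoder was fitted with the labels in a different order. The following is a minimal pure-Python sketch of that behaviour, assuming codes are assigned in the sorted order of the distinct labels. The helper `fit_label_codes` is hypothetical and only illustrates the idea; it is not part of sklearn.

```python
# Pure-Python sketch of how a label encoder maps categories to integers.
# Assumption: codes follow the sorted order of the distinct labels,
# which is what the encoded output above reflects.

def fit_label_codes(labels):
    """Return a {label: code} mapping with codes assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

bp_codes = fit_label_codes(['LOW', 'NORMAL', 'HIGH'])
sex_codes = fit_label_codes(['F', 'M'])
chol_codes = fit_label_codes(['NORMAL', 'HIGH'])

print(bp_codes)    # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
print(sex_codes)   # {'F': 0, 'M': 1}
print(chol_codes)  # {'HIGH': 0, 'NORMAL': 1}

# Encode the BP column of the first five patients shown above.
bp_column = ['HIGH', 'LOW', 'LOW', 'NORMAL', 'LOW']
print([bp_codes[v] for v in bp_column])  # [0, 1, 1, 2, 1]
```

Keep in mind that these integer codes imply an ordering (HIGH < LOW < NORMAL) that has no medical meaning; the tree only uses them as values to split on.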

Now, let's populate the target variable.

In [10]:
y = my_data["Drug"]
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

<hr>

## Setting up the Decision Tree
We will evaluate our decision tree with a __train/test split__.  Let's import __train_test_split__ from sklearn.model_selection.

In [11]:
from sklearn.model_selection import train_test_split

train_test_split will return four arrays.  We will name them:<br>
X_trainset, X_testset, y_trainset, y_testset<br>
<br>
train_test_split will need the parameters:<br>
X, y, test_size = 0.3, and random_state = 3<br>
<br>
X and y are the arrays to be split, test_size is the fraction of the data held out for testing (here 30%), and random_state fixes the shuffle so that we obtain the same split on every run.

In [12]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

Now, let's confirm the shapes of our datasets.

In [13]:
print(X_trainset.shape)
print(y_trainset.shape)
print(X_testset.shape)
print(y_testset.shape)

(140, 5)
(140,)
(60, 5)
(60,)
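
The 140/60 shapes above follow directly from test_size = 0.3 applied to 200 rows. As a rough illustration of what a seeded split does, here is a minimal standard-library sketch. The helper `simple_train_test_split` is hypothetical, and it does not reproduce the exact rows that sklearn selects; it only shows the shuffle-then-cut idea and why a fixed seed makes the split repeatable.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=3):
    """Shuffle row indices with a fixed seed and cut off a test fraction.

    Hypothetical helper for illustration; it does not reproduce the exact
    rows that scikit-learn's train_test_split selects.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same order
    n_test = round(len(rows) * test_size)     # 200 * 0.3 -> 60
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))                       # stand-in for 200 patients
train, test = simple_train_test_split(rows)
print(len(train), len(test))                  # 140 60

# Re-running with the same seed gives the same split.
train_again, test_again = simple_train_test_split(rows)
print(test == test_again)                     # True
```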


<hr>

### Modeling
We will first need to create an instance of the DecisionTreeClassifier called __drugTree__.<br>
Inside of the classifier, specify _criterion = "entropy"_ so that splits are chosen by information gain. We also set _max_depth = 4_ to limit the depth of the tree.

In [15]:
drugTree = DecisionTreeClassifier(criterion = "entropy", max_depth = 4)
drugTree  # displays the classifier and the parameters we set

DecisionTreeClassifier(criterion='entropy', max_depth=4)
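
To make the entropy criterion concrete, here is a small standard-library sketch of Shannon entropy and the information gain of one candidate split. The function names and the four-patient node are hypothetical and chosen only for illustration; sklearn performs the equivalent computation internally over all candidate thresholds.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left) + weight_right * entropy(right))

# Hypothetical node with four patients, split on some threshold.
parent = ['drugY', 'drugY', 'drugX', 'drugX']
left, right = ['drugY', 'drugY'], ['drugX', 'drugX']

print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A perfectly mixed two-class node has an entropy of 1 bit, and a split that produces two pure children removes all of it, which is why the gain here equals 1.0.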

Next, we will fit the model using the training feature matrix and the training response vector.

In [16]:
drugTree.fit(X_trainset, y_trainset)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

<hr>

### Prediction
Let's use the test set to make some predictions and store them in __predTree__.

In [17]:
predTree = drugTree.predict(X_testset)

We can print out predTree and y_testset to visually compare the predicted to actual values

In [18]:
print(predTree[0:5])
print(y_testset[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object


<hr>

### Evaluation
Now, let's evaluate the accuracy of the model using __metrics__ imported from sklearn

In [21]:
from sklearn import metrics
import matplotlib.pyplot as plt
print(f'DecisionTrees\'s Accuracy: {metrics.accuracy_score(y_testset, predTree)*100:.2f}%')

DecisionTrees's Accuracy: 98.33%


__Accuracy classification score__ computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.<br>
In multilabel classification, the function returns the subset accuracy.  If the entire set of predicted labels for a sample strictly matches the true set, the subset accuracy is 1; otherwise it is 0. In a single-label problem like this one, the score is simply the fraction of test samples predicted correctly.

### Practice
Calculate accuracy without sklearn

In [22]:
total_entries = len(predTree)
correct_values = 0
for i, entry in enumerate(y_testset):
    if entry == predTree[i]:
        correct_values += 1
accuracy = correct_values/total_entries
print(f'{accuracy*100:.2f}%')

98.33%
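
The same calculation can be written more compactly by pairing actual and predicted labels with zip. The lists below are hypothetical stand-ins for y_testset and predTree; they assume 60 test samples with a single misclassification, which is consistent with the 98.33% reported above (59 of 60 correct).

```python
# Hypothetical labels standing in for y_testset and predTree:
# 60 test samples with a single misclassification.
actual = ['drugX'] * 59 + ['drugY']
predicted = ['drugX'] * 60

# Booleans sum as 0/1, so this counts the matching pairs.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f'{accuracy*100:.2f}%')   # 98.33%
```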


<hr>

### Visualization
Let's visualize the tree

In [33]:
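# Install the pydotplus and graphviz Python packages if they are not already installed.
# Note: pip only installs the Python bindings; the Graphviz executables (e.g. dot)
# must also be installed on the system and available on the PATH.
!pip install pydotplus
!pip install graphviz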



In [28]:
import io
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline 

In [34]:
dot_data = io.StringIO()
filename = "drugtree.png"
featureNames = my_data.columns[0:5]
# class_names must follow the sorted class order used by the fitted tree
tree.export_graphviz(drugTree, feature_names=featureNames, out_file=dot_data,
                     class_names=np.unique(y_trainset), filled=True,
                     special_characters=True, rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')

InvocationException: GraphViz's executables not found
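
The error above usually means that the Graphviz system executables are not installed or not on the PATH. If installing them is not an option, one possible alternative is sklearn's export_text, which prints the learned rules as plain text and needs no external program. The sketch below fits a tree on a tiny hypothetical dataset (two features, six made-up patients) so that it runs on its own; the data and the names `X_toy`, `y_toy`, and `toy_tree` are assumptions for illustration, not the workshop data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (Age, Na_to_K); not the workshop dataset.
X_toy = [[23, 25.4], [47, 13.1], [47, 10.1], [28, 7.8], [61, 18.0], [35, 16.2]]
y_toy = ['drugY', 'drugX', 'drugX', 'drugX', 'drugY', 'drugY']

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text returns the learned rules as an indented string.
rules = export_text(toy_tree, feature_names=['Age', 'Na_to_K'])
print(rules)
```

In this notebook, the equivalent call would be along the lines of `export_text(drugTree, feature_names=list(my_data.columns[0:5]))`. For a graphical view without Graphviz, `sklearn.tree.plot_tree` draws the tree with matplotlib.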