# **Drugs A, B, C, X, Y for Decision Trees**

The code below is taken from Pablo M Gomez's submission on [kaggle.com](https://www.kaggle.com/pablomgomez21/decision-trees-practice).

You are encouraged to follow the link above and review the full code. In this lab, you will explore the data and prepare it for scikit-learn algorithms.

**About the data set**

Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure (BP), Cholesterol, and sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem: you can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient, that is, to prescribe a drug to a new patient.

**Import libraries**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import required sklearn libraries for Decision Tree Classifier
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Acquire data

In [2]:
# Read in the data using pandas' read_csv method
my_data = pd.read_csv("SupervisedLearning/DrugSelection/drug200.csv")

#TODO: Write code to inspect the first five rows of the dataframe
my_data.head()

   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY


In [10]:
print(my_data.to_string())

     Age Sex      BP Cholesterol  Na_to_K   Drug
0     23   F    HIGH        HIGH   25.355  drugY
1     47   M     LOW        HIGH   13.093  drugC
2     47   M     LOW        HIGH   10.114  drugC
3     28   F  NORMAL        HIGH    7.798  drugX
4     61   F     LOW        HIGH   18.043  drugY
5     22   F  NORMAL        HIGH    8.607  drugX
6     49   F  NORMAL        HIGH   16.275  drugY
7     41   M     LOW        HIGH   11.037  drugC
8     60   M  NORMAL        HIGH   15.171  drugY
9     43   M     LOW      NORMAL   19.368  drugY
10    47   F     LOW        HIGH   11.767  drugC
11    34   F    HIGH      NORMAL   19.199  drugY
12    43   M     LOW        HIGH   15.376  drugY
13    74   F     LOW        HIGH   20.942  drugY
14    50   F  NORMAL        HIGH   12.703  drugX
15    16   F    HIGH      NORMAL   15.516  drugY
16    69   M     LOW      NORMAL   11.455  drugX
17    43   M    HIGH        HIGH   13.972  drugA
18    23   M     LOW        HIGH    7.298  drugC
19    32   F    HIGH

# Inspect data

In [3]:
#TODO: Write code to inspect the shape of the data frame
my_data.shape

(200, 6)

In [4]:
#TODO: Write code to display information about the data frame
my_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


In [6]:
#TODO: Write code to display statistics about the data frame
my_data.describe(include='all')

              Age  Sex    BP  Cholesterol    Na_to_K   Drug
count       200.0  200   200          200      200.0    200
unique        NaN    2     3            2        NaN      5
top           NaN    M  HIGH         HIGH        NaN  drugY
freq          NaN  104    77          103        NaN     91
mean       44.315  NaN   NaN          NaN  16.084485    NaN
std     16.544315  NaN   NaN          NaN   7.223956    NaN
min          15.0  NaN   NaN          NaN      6.269    NaN
25%          31.0  NaN   NaN          NaN    10.4455    NaN
50%          45.0  NaN   NaN          NaN    13.9365    NaN
75%          58.0  NaN   NaN          NaN      19.38    NaN


# Clean data

**Correcting**

In [0]:
#TODO: Write code to drop rows having missing values
# I couldn't find any missing values: the info() output above reports 200 non-null entries
# for every column. Neither the documentation nor the kaggle.com submission mentions a
# special encoding for missing values, so I am led to believe that there are none and
# that no rows need to be dropped.
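
The conclusion above can also be checked in code. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from drug200.csv); the same two calls could be applied to `my_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for my_data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [23, 47, np.nan, 28],
    "BP": ["HIGH", "LOW", "LOW", None],
    "Drug": ["drugY", "drugC", "drugC", "drugX"],
})

# Count missing values per column; all zeros would mean there is nothing to drop.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()
print(toy_clean.shape)
```

On a frame with no missing values, such as the one described by the `info()` output, `dropna()` returns all rows unchanged.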

**Converting**

Declare two variables:

* X: feature matrix with the data
* y: response vector with target information


In [11]:
#TODO: Write code to declare X
# Hint: remove the column containing the target of this prediction problem
# Note: To run the next section, X is expected to be an array. 
# You can get an array from a data frame with: X = X.values
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values

#TODO: Write code to inspect the first five rows of X
# Note: If X is an array, instead of using the head() function,
# you will need to use array functionality to output the first five values.
X[0:5] 

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

As you may have noticed, some features in this dataset are categorical, such as **Sex** or **BP**. Unfortunately, scikit-learn's decision trees do not handle categorical (string) variables directly, so these features must be converted to numerical values. The next cell does this with **LabelEncoder**, which replaces each category with an integer code. An alternative is **pandas.get_dummies()**, which converts a categorical variable into dummy/indicator variables.

**Note:** If you run this block once, in order to run it again, you will need to redeclare X in the previous block or it will throw errors trying to convert data it has already converted.
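
For comparison, here is a minimal sketch of the `pandas.get_dummies()` approach mentioned above. It works on a data frame rather than an array, and the `toy` frame is an assumption made up for illustration, not the lab's data.

```python
import pandas as pd

# Toy frame with two categorical columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [23, 47, 28],
    "Sex": ["F", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL"],
})

# Replace each listed categorical column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Sex", "BP"])
print(encoded.columns.tolist())
print(encoded)
```

Indicator columns avoid implying an order between categories, at the cost of a wider feature matrix.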

In [12]:
# Define a label encoder for the sex feature to be 0 or 1
# X is expected to be an array here. If it's a dataframe, get the array version by running:
# X = X.values

le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 

#TODO: Write code to encode the BP feature in X[:,2]
# Note: LabelEncoder sorts its classes alphabetically, so 'HIGH', 'LOW', 'NORMAL' become 0, 1, 2
le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])

# Define a label encoder for the Chol feature to be 0 or 1
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
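
In the output above, HIGH blood pressure is encoded as 0 and LOW as 1, because `LabelEncoder` stores its classes in sorted order regardless of the order passed to `fit`. The sketch below illustrates this, and shows one possible way to get a LOW < NORMAL < HIGH encoding with an explicit mapping (the `bp_order` dictionary and the sample values are assumptions for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoder ignores the order given to fit() and keeps classes sorted.
le = LabelEncoder()
le.fit(["LOW", "NORMAL", "HIGH"])
print(list(le.classes_))
label_codes = le.transform(["LOW", "NORMAL", "HIGH"])
print(label_codes)

# Hypothetical explicit mapping that preserves the natural order of the levels.
bp_order = {"LOW": 0, "NORMAL": 1, "HIGH": 2}
bp = pd.Series(["HIGH", "LOW", "LOW", "NORMAL", "LOW"])
ordinal_codes = bp.map(bp_order)
print(ordinal_codes.tolist())
```

A decision tree can still separate the categories under either encoding, since it splits on thresholds, but an order-preserving encoding may make the learned splits easier to interpret.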

In [13]:
#TODO: Write code to declare y
# Hint: this is the column containing the target of this prediction problem
y = my_data["Drug"]
#TODO: Write code to inspect the first five rows of y
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

# Earn Your Wings

Use a decision tree classifier on the cleaned data set to predict y for the given data. Report the accuracy score. Add comments in your code to explain each step that you take in your implementation.

In [19]:
from sklearn import metrics

# Split data into training (70%) and test (30%) sets; random_state makes the split reproducible
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
print("X_trainsetX SHAPE:", str(X_trainset.shape))
print("y_trainsetX SHAPE:", str(y_trainset.shape))
print("X_testsetX SHAPE:", str(X_testset.shape))
print("y_testsetX SHAPE:", str(y_testset.shape))

# Create an instance of the DecisionTreeClassifier that splits on entropy (information gain)
# and is limited to a depth of 4 to keep the tree small
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Train the DecisionTreeClassifier on the training set
drugTree.fit(X_trainset, y_trainset)

# Test the DecisionTreeClassifier on the reserved test set and report its accuracy
predTree = drugTree.predict(X_testset)
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

X_trainsetX SHAPE: (140, 5)
y_trainsetX SHAPE: (140,)
X_testsetX SHAPE: (60, 5)
y_testsetX SHAPE: (60,)
DecisionTrees's Accuracy:  0.9833333333333333
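
Accuracy is a single number. To see which classes are confused, and to predict a drug for a new patient as described in the introduction, something like the following sketch could be used. It is self-contained and runs on synthetic data with a made-up labelling rule (`X_toy`, `y_toy`, and `new_patient` are assumptions for illustration, not the drug200 data), so its numbers say nothing about the real data set.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix X (not the real drug200 data).
rng = np.random.default_rng(0)
n = 200
X_toy = np.column_stack([
    rng.integers(15, 75, size=n),    # Age
    rng.integers(0, 2, size=n),      # Sex, encoded as 0/1
    rng.integers(0, 3, size=n),      # BP, encoded as 0/1/2
    rng.integers(0, 2, size=n),      # Cholesterol, encoded as 0/1
    rng.uniform(6.0, 38.0, size=n),  # Na_to_K
])
# Made-up labelling rule, used only so the toy tree has something to learn.
y_toy = np.where(X_toy[:, 4] > 15.0, "drugY", "drugX")

# Same split proportions and tree settings as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=3)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)

# Evaluate on the held-out rows.
pred = tree.predict(X_te)
accuracy = metrics.accuracy_score(y_te, pred)
# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = metrics.confusion_matrix(y_te, pred, labels=["drugX", "drugY"])
print("Accuracy:", accuracy)
print(cm)

# Predict for one hypothetical new patient: Age, Sex, BP, Cholesterol, Na_to_K.
new_patient = np.array([[40, 0, 1, 0, 20.0]])
predicted = tree.predict(new_patient)[0]
print("Predicted drug:", predicted)
```

With the lab's own objects, the same calls would take `drugTree`, `X_testset`, and `y_testset`, and the new patient's features would need to be encoded with the same label encoders as `X`.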
