# Drug Analysis

## Import Data

We have collected data about a set of patients, all of whom suffered from the same illness. During the course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

drug_data = pd.read_csv('drug200.csv')
drug_data.head(10)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
5,22,F,NORMAL,HIGH,8.607,drugX
6,49,F,NORMAL,HIGH,16.275,drugY
7,41,M,LOW,HIGH,11.037,drugC
8,60,M,NORMAL,HIGH,15.171,drugY
9,43,M,LOW,NORMAL,19.368,drugY


## We want to split the data into two sets (70% train / 30% test) and evaluate the results

In [2]:
drug_data.shape

(200, 6)

# Preprocessing 

## 1. We want to convert the string columns into integer labels (Sex, BP, Cholesterol)

In [3]:
from sklearn import preprocessing

label = preprocessing.LabelEncoder()
new_sex = label.fit_transform(drug_data['Sex'])
#print(f"sex_label = {new_sex}")

new_BP = label.fit_transform(drug_data['BP'])
#print(f"BP_label = {new_BP}")

new_Cholesterol = label.fit_transform(drug_data['Cholesterol'])
#print(f"Cholesterol_label = {new_Cholesterol}")

In [4]:
drug_data['Cholesterol'].unique()

array(['HIGH', 'NORMAL'], dtype=object)
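
As a rough illustration of what the encoder does, the sketch below reproduces the mapping in plain Python: the distinct labels are sorted and numbered from 0, which is how `LabelEncoder` orders its classes. The helper `encode_labels` is a hypothetical name, not part of scikit-learn, and the sample values are the first five `BP` entries from the table above.

```python
def encode_labels(values):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# First five BP values from drug_data.head(10)
bp_sample = ["HIGH", "LOW", "LOW", "NORMAL", "LOW"]
bp_codes, bp_mapping = encode_labels(bp_sample)

print(bp_codes)
print(bp_mapping)
```

Keeping the mapping around makes it possible to encode a new patient the same way later. In In [3] a single encoder is refit for each column, so only the last column's mapping survives on the `label` object.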

## 2. Convert the DataFrame into arrays so we can train on it

- Explanatory Variable (X)

In [5]:
X = drug_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:3]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.113999999999999]], dtype=object)

In [6]:
X[:,1] = new_sex
X[:,2] = new_BP
X[:,3] = new_Cholesterol
X[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)

- Response Variable (y)

In [7]:
y = drug_data['Drug'].values
y[0:3]

array(['drugY', 'drugC', 'drugC'], dtype=object)

## 3. Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
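
Some drugs have only a handful of patients (the classification reports further down show test supports as low as 4), so a plain random split can leave very few of them in the test set. One optional variation, not used in the rest of this notebook, is a stratified split that keeps the class proportions the same in both parts. The sketch below uses made-up toy labels rather than the drug data; `X_toy` and `y_toy` are hypothetical names.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 20% class "a" and 80% class "b"
X_toy = [[i] for i in range(100)]
y_toy = ["a"] * 20 + ["b"] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# Both parts keep the 20/80 class ratio
print(Counter(y_tr))
print(Counter(y_te))
```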

## 4. Training the Machine Learning Model

### Model 1 = Gini Decision Tree Model (Random_state = 1)

In [9]:
Model1 = DecisionTreeClassifier(criterion="gini", max_depth = 4)
Model1

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [10]:
Model1.fit(X_train, y_train)
y_pred_G = Model1.predict(X_test)
print(f"The prediction output is {y_pred_G[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY'], meanwhile the real output we have is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY']


### Model 2 = Entropy Decision Tree Model (Random_state = 1)

In [11]:
Model2 = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
Model2

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [12]:
Model2.fit(X_train, y_train)
y_pred_E = Model2.predict(X_test)
print(f"The prediction output is {y_pred_E[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY'], meanwhile the real output we have is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY']


## 5. Evaluation using accuracy and classification report

In [13]:
from sklearn.metrics import accuracy_score
acc1 = accuracy_score(y_test, y_pred_G)
acc2 = accuracy_score(y_test, y_pred_E)
print(f"Accuracy for Gini = {acc1} & Accuracy for Entropy = {acc2}")

Accuracy for Gini = 0.9666666666666667 & Accuracy for Entropy = 0.9666666666666667


In [14]:
from sklearn.metrics import classification_report, confusion_matrix

cls_report1 = classification_report(y_test, y_pred_G)
cls_report2 = classification_report(y_test, y_pred_E)

print(f"{cls_report1}")
print(f"{cls_report2}")

              precision    recall  f1-score   support

       drugA       0.67      1.00      0.80         4
       drugB       1.00      0.67      0.80         6
       drugC       1.00      1.00      1.00         4
       drugX       1.00      1.00      1.00        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.97        60
   macro avg       0.93      0.93      0.92        60
weighted avg       0.98      0.97      0.97        60

              precision    recall  f1-score   support

       drugA       0.67      1.00      0.80         4
       drugB       1.00      0.67      0.80         6
       drugC       1.00      1.00      1.00         4
       drugX       1.00      1.00      1.00        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.97        60
   macro avg       0.93      0.93      0.92        60
weighted avg       0.98      0.97      0.97        60



In [15]:
# Use a separate name so the imported confusion_matrix function is not shadowed
cfm_G = confusion_matrix(y_test, y_pred_G)
print(cfm_G)

[[ 4  0  0  0  0]
 [ 2  4  0  0  0]
 [ 0  0  4  0  0]
 [ 0  0  0 19  0]
 [ 0  0  0  0 27]]
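
The numbers in the classification report can be read straight off this matrix. The sketch below recomputes accuracy, precision, and recall by hand from the matrix printed above, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
labels = ["drugA", "drugB", "drugC", "drugX", "drugY"]
cm = [
    [4, 0, 0, 0, 0],
    [2, 4, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 19, 0],
    [0, 0, 0, 0, 27],
]

n = len(labels)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(cm[r][i] for r in range(n))  # column sum
    actually_i = sum(cm[i])                           # row sum
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / actually_i

print(f"accuracy = {accuracy:.2f}")
for name in labels:
    print(f"{name}: precision = {precision[name]:.2f}, recall = {recall[name]:.2f}")
```

The two drugB patients predicted as drugA explain both the 0.67 precision for drugA and the 0.67 recall for drugB.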


In this project, both decision tree models (Model1 and Model2) yield the same result.

In theory, Gini impurity and entropy usually lead to very similar trees. Entropy needs a logarithm for each class, whereas Gini impurity only needs squares, so Gini is slightly cheaper to compute.

__First result: Gini and Entropy yield the same result. One possible factor is that our dataset is small (only 200 records).__

> Next, I found that changing the random_state of train_test_split (that is, using a different train/test split) gives a slightly higher accuracy

### Model 3 = Decision Tree Model (Random_state = 3)

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 3)

In [17]:
# Gini and Entropy gave the same accuracy, so we use just one of them here
Model = DecisionTreeClassifier(criterion="gini", max_depth = 4)
Model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [18]:
Model.fit(X_train, y_train)
y_pred_new = Model.predict(X_test)
print(f"The prediction output is {y_pred_new[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugY' 'drugX' 'drugX' 'drugX' 'drugX'], meanwhile the real output we have is ['drugY' 'drugX' 'drugX' 'drugX' 'drugX']


In [19]:
acc1 = accuracy_score(y_test, y_pred_new)
print(f"Accuracy for Gini = {acc1}")

Accuracy for Gini = 0.9833333333333333


In [20]:
cls_report = classification_report(y_test, y_pred_new)

print(f"{cls_report}")

              precision    recall  f1-score   support

       drugA       1.00      1.00      1.00         7
       drugB       1.00      1.00      1.00         5
       drugC       1.00      1.00      1.00         5
       drugX       1.00      0.95      0.98        21
       drugY       0.96      1.00      0.98        22

    accuracy                           0.98        60
   macro avg       0.99      0.99      0.99        60
weighted avg       0.98      0.98      0.98        60



In [21]:
from sklearn.metrics import classification_report, confusion_matrix
cfm = confusion_matrix(y_test, y_pred_new)
print(cfm)

[[ 7  0  0  0  0]
 [ 0  5  0  0  0]
 [ 0  0  5  0  0]
 [ 0  0  0 20  1]
 [ 0  0  0  0 22]]


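As we can see, changing the split's random_state from 1 to 3 raises the accuracy by about 0.01 (0.97 to 0.98), which corresponds to a single test sample out of 60.

The confusion matrix agrees: only one drugX sample is misclassified as drugY, and every other prediction lies on the diagonal.

__Second result: Using the 'Decision Tree Model' with a different 'random_state' for the split, we obtain slightly better accuracy. The data is not more random; the model is simply trained and tested on a different partition, so a difference this small may be due to chance.__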
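
To judge whether a change of about 0.01 is meaningful, one rough check is the binomial standard error of an accuracy measured on 60 test samples. This is a back-of-the-envelope sketch that treats each test prediction as an independent trial, which is an assumption; the variable names are illustrative.

```python
from math import sqrt

n_test = 60
acc_split1 = 58 / n_test  # 0.9667, split with random_state = 1
acc_split3 = 59 / n_test  # 0.9833, split with random_state = 3

# Standard error of a proportion: sqrt(p * (1 - p) / n)
std_err = sqrt(acc_split1 * (1 - acc_split1) / n_test)
difference = acc_split3 - acc_split1

print(f"standard error = {std_err:.3f}")
print(f"difference     = {difference:.3f}")
```

The difference between the two splits is smaller than one standard error, so the two accuracies are hard to tell apart on a test set of this size.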

> Next, I want to model it using Random Forest, which normally yields better accuracy

### Model 4 = Random Forest Model (Random_state = 1)

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

In [23]:
from sklearn.ensemble import RandomForestClassifier

rand_forest = RandomForestClassifier(n_estimators=200)
rand_forest.fit(X_train, y_train)

y_pred = rand_forest.predict(X_test)
y_pred

array(['drugX', 'drugY', 'drugX', 'drugC', 'drugY', 'drugX', 'drugX',
       'drugY', 'drugY', 'drugY', 'drugX', 'drugC', 'drugY', 'drugY',
       'drugA', 'drugA', 'drugX', 'drugX', 'drugB', 'drugY', 'drugX',
       'drugX', 'drugX', 'drugY', 'drugB', 'drugX', 'drugX', 'drugY',
       'drugX', 'drugX', 'drugC', 'drugY', 'drugY', 'drugY', 'drugA',
       'drugY', 'drugA', 'drugY', 'drugY', 'drugY', 'drugB', 'drugY',
       'drugY', 'drugX', 'drugB', 'drugY', 'drugX', 'drugX', 'drugY',
       'drugA', 'drugY', 'drugY', 'drugY', 'drugY', 'drugY', 'drugY',
       'drugX', 'drugX', 'drugX', 'drugA'], dtype=object)

In [24]:
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy for Random Forest = {acc}")

Accuracy for Random Forest = 0.95


In [25]:
cls_report = classification_report(y_test, y_pred)

print(f"{cls_report}")

              precision    recall  f1-score   support

       drugA       0.67      1.00      0.80         4
       drugB       1.00      0.67      0.80         6
       drugC       1.00      0.75      0.86         4
       drugX       0.95      1.00      0.97        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.95        60
   macro avg       0.92      0.88      0.89        60
weighted avg       0.96      0.95      0.95        60



__Third result: In this case, the random forest gives lower accuracy than the decision tree model (0.95 vs. 0.97). The gap is a single test sample out of 60, so it may be due to chance rather than underfitting.__

Normally, a __'Random Forest model'__ gives better accuracy than a single __'Decision Tree model'__.

> So we can try the __cross-validation method (KFold from scikit-learn)__ and check the accuracy

### Model 5 : Cross-Validation on Random Forest

In [26]:
from sklearn.metrics import classification_report, confusion_matrix
cfm = confusion_matrix(y_test, y_pred)
print(cfm)

[[ 4  0  0  0  0]
 [ 2  4  0  0  0]
 [ 0  0  3  1  0]
 [ 0  0  0 19  0]
 [ 0  0  0  0 27]]


In [27]:
from sklearn.model_selection import KFold
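# Keyword arguments: recent scikit-learn versions no longer accept shuffle/random_state positionally
kfold = KFold(n_splits=10, shuffle=True, random_state=1)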

for train, test in kfold.split(X):
    print('train: %s, test: %s' % (X[train], X[test]))
    
for train, test in kfold.split(y):
    print('train: %s, test: %s' % (y[train], y[test]))

train: [[23 0 0 0 25.355]
 [47 1 1 0 13.093]
 [47 1 1 0 10.113999999999999]
 [28 0 2 0 7.797999999999999]
 [22 0 2 0 8.607000000000001]
 [49 0 2 0 16.275]
 [41 1 1 0 11.037]
 [60 1 2 0 15.171]
 [43 1 1 1 19.368]
 [47 0 1 0 11.767000000000001]
 [43 1 1 0 15.376]
 [74 0 1 0 20.941999999999997]
 [50 0 2 0 12.703]
 [16 0 0 1 15.515999999999998]
 [69 1 1 1 11.455]
 [43 1 0 0 13.972000000000001]
 [32 0 0 1 25.974]
 [57 1 1 1 19.128]
 [63 1 2 0 25.916999999999998]
 [47 1 1 1 30.568]
 [48 0 1 0 15.036]
 [33 0 1 0 33.486]
 [28 0 0 1 18.809]
 [31 1 0 0 30.366]
 [49 0 2 1 9.381]
 [39 0 1 1 22.697]
 [18 0 2 1 8.75]
 [74 1 0 0 9.567]
 [49 1 1 1 11.014000000000001]
 [65 0 0 1 31.875999999999998]
 [32 1 0 1 9.445]
 [39 1 1 1 13.937999999999999]
 [39 0 2 1 9.709]
 [15 1 2 0 9.084]
 [58 0 0 1 14.239]
 [50 1 2 1 15.79]
 [23 1 2 0 12.26]
 [50 0 2 1 12.295]
 [66 0 2 1 8.107000000000001]
 [37 0 0 0 13.091]
 [68 1 1 0 10.290999999999999]
 [23 1 2 0 31.686]
 [28 0 1 0 19.796]
 [58 0 0 0 19.416]
 [67 1 2 1 10

In [28]:
# train and test still hold the indices of the last fold from the loop above
X_train_cv = X[train]
X_test_cv = X[test]
y_train_cv = y[train]
y_test_cv = y[test]
print(f'length of train X: {len(X_train_cv)}, length of test X: {len(X_test_cv)}')


length of train X: 180, length of test X: 20


In [29]:
from sklearn.ensemble import RandomForestClassifier

rand_forest = RandomForestClassifier(n_estimators=200)
rand_forest.fit(X_train_cv, y_train_cv)

y_pred_cv = rand_forest.predict(X_test_cv)
y_pred_cv

array(['drugY', 'drugY', 'drugX', 'drugY', 'drugX', 'drugX', 'drugA',
       'drugX', 'drugY', 'drugY', 'drugA', 'drugX', 'drugA', 'drugB',
       'drugA', 'drugX', 'drugA', 'drugY', 'drugY', 'drugY'], dtype=object)

In [30]:
acc = accuracy_score(y_test_cv, y_pred_cv)
print(f"Accuracy for Random Forest (last fold) = {acc}")

Accuracy for Random Forest (last fold) = 1.0


In [31]:
cls_report = classification_report(y_test_cv, y_pred_cv)

print(f"{cls_report}")

              precision    recall  f1-score   support

       drugA       1.00      1.00      1.00         5
       drugB       1.00      1.00      1.00         1
       drugX       1.00      1.00      1.00         6
       drugY       1.00      1.00      1.00         8

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



In [32]:
from sklearn.metrics import classification_report, confusion_matrix
cfm = confusion_matrix(y_test_cv, y_pred_cv)
print(cfm)

[[5 0 0 0]
 [0 1 0 0]
 [0 0 6 0]
 [0 0 0 8]]


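__Fourth result: The accuracy on the held-out fold is 100%. Note that this comes from a single fold of 20 samples, none of which is a drugC patient; averaging the accuracy over all 10 folds would give a more reliable estimate and a better guard against overfitting.__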
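
The cells above evaluate only the last fold produced by the loop. A common way to score every fold and average the results is `cross_val_score`. The sketch below is a hedged illustration: it uses scikit-learn's bundled iris data as a stand-in (an assumption, so that the block runs on its own), and `X_demo`, `y_demo`, and the choice of 50 trees are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (assumption): iris replaces the encoded drug X and y
X_demo, y_demo = load_iris(return_X_y=True)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
forest = RandomForestClassifier(n_estimators=50, random_state=1)

# One accuracy score per fold; the model is refit for each fold
scores = cross_val_score(forest, X_demo, y_demo, cv=kfold, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

With the notebook's data, `X` and `y` would take the place of `X_demo` and `y_demo`, and the mean and spread over the 10 folds would replace the single-fold figure.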

## Conclusion

- Decision Tree and Random Forest models are alternative models for classification tasks.
- __Gini and Entropy yield the same result__. One possible factor is that our dataset has only 200 records, which is considered small, so there is no big difference.
- Using the 'Decision Tree Model' with a different 'random_state' for the train/test split, __we obtain slightly better accuracy__ (0.98 vs. 0.97), a difference of one test sample.
- In this case, __the random forest gives lower accuracy than the decision tree model__. The gap is a single test sample, so it is more likely due to chance than to underfitting.
- __The accuracy on the held-out cross-validation fold is 100%__. All counts in the confusion matrix lie on the diagonal, and the classification report shows precision and recall of 1.00 for every class present in that fold. Since this is one fold of 20 samples, the figure should be read with caution.

Finally, all these models yield an accuracy of at least 95% on their test sets, which is promising, although the dataset is small.

Given a patient's 'Age', 'Sex', 'Blood pressure', 'Cholesterol', and 'Sodium-to-potassium ratio', we can predict which drug is likely to suit the patient.