# Drug Analysis

## Import Data

We have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

drug_data = pd.read_csv('drug200.csv')
drug_data.head(10)

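|   | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |
| 5 | 22 | F | NORMAL | HIGH | 8.607 | drugX |
| 6 | 49 | F | NORMAL | HIGH | 16.275 | drugY |
| 7 | 41 | M | LOW | HIGH | 11.037 | drugC |
| 8 | 60 | M | NORMAL | HIGH | 15.171 | drugY |
| 9 | 43 | M | LOW | NORMAL | 19.368 | drugY |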


## We want to split the data into 2 datasets (70% training / 30% test) and evaluate the results

In [2]:
drug_data.shape

(200, 6)

# Preprocessing 

## 1. We want to convert the string columns into integer labels (Sex, BP, Cholesterol)

In [3]:
from sklearn import preprocessing

label = preprocessing.LabelEncoder()
new_sex = label.fit_transform(drug_data['Sex'])
#print(f"sex_label = {new_sex}")

new_BP = label.fit_transform(drug_data['BP'])
#print(f"BP_label = {new_BP}")

new_Cholesterol = label.fit_transform(drug_data['Cholesterol'])
#print(f"Cholesterol_label = {new_Cholesterol}")

In [4]:
drug_data['Cholesterol'].unique()

array(['HIGH', 'NORMAL'], dtype=object)
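
To see which integer each category receives, one can inspect `classes_` after fitting. The sketch below uses a small hand-written sample (an assumption, not the contents of `drug200.csv`) and a dedicated encoder named `bp_encoder` (a hypothetical name) so the mapping stays available after other columns are encoded. `LabelEncoder` sorts the labels, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample, not the real drug200.csv contents
bp_values = ["HIGH", "LOW", "LOW", "NORMAL", "LOW"]

bp_encoder = LabelEncoder()
bp_codes = bp_encoder.fit_transform(bp_values)

# classes_ lists the original labels in sorted order; the position is the code
bp_mapping = {label: int(code) for code, label in enumerate(bp_encoder.classes_)}
print(bp_mapping)
print(bp_codes.tolist())
```

Keeping one encoder per column also makes it possible to call `inverse_transform` later and recover the original strings.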

## 2. Convert the DataFrame into an array so we can train on it

- Explanatory Variable (X)

In [5]:
X = drug_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:3]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.113999999999999]], dtype=object)

In [6]:
X[:,1] = new_sex
X[:,2] = new_BP
X[:,3] = new_Cholesterol
X[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)

In [7]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.29159102, -1.040833  , -1.11016894, -0.97043679,  1.28652212],
       [ 0.16269866,  0.96076892,  0.10979693, -0.97043679, -0.4151454 ],
       [ 0.16269866,  0.96076892,  0.10979693, -0.97043679, -0.82855818],
       [-0.988614  , -1.040833  ,  1.32976279, -0.97043679, -1.14996267],
       [ 1.0110343 , -1.040833  ,  0.10979693, -0.97043679,  0.27179427]])
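
As a rough check of what `StandardScaler` does, the sketch below compares it with the formula written by hand: subtract each column's mean and divide by its population standard deviation. It uses only the Age and Na_to_K values of the first five rows shown earlier, so the numbers will differ from the output above, which was computed over all 200 rows. Scaling is not expected to change a decision tree's splits, since they depend only on the ordering of values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: Age and Na_to_K of the first five rows
sample = np.array([[23, 25.355],
                   [47, 13.093],
                   [47, 10.114],
                   [28, 7.798],
                   [61, 18.043]])

scaled = StandardScaler().fit_transform(sample)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
```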

- Response Variable (y)

In [8]:
y = drug_data['Drug'].values
y[0:3]

array(['drugY', 'drugC', 'drugC'], dtype=object)

## 3. Split the data

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
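
With 200 rows and `test_size = 0.3`, the split leaves 140 rows for training and 60 for testing. The sketch below confirms the shapes on stand-in data (arbitrary values, not the real features). The `stratify` argument is an optional addition that is not used in this notebook; it keeps the class proportions the same in both parts, which can help when some classes are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (200); values are arbitrary
X_demo = np.arange(200 * 5).reshape(200, 5)
y_demo = np.array(["drugA", "drugB", "drugC", "drugX", "drugY"] * 40)

# stratify is an optional extra, not part of the original split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```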

## 4. Training the Machine Learning Model

### Model 1 = Gini Decision Tree Model (max_depth = 4, split random_state = 1)

In [10]:
Model1 = DecisionTreeClassifier(criterion="gini", max_depth = 4)
Model1

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [11]:
Model1.fit(X_train, y_train)
y_pred_G = Model1.predict(X_test)
print(f"The prediction output is {y_pred_G[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY'], meanwhile the real output we have is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY']


### Model 2 = Entropy Decision Tree Model (max_depth = 5, split random_state = 1)

In [12]:
Model2 = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
Model2

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [13]:
Model2.fit(X_train, y_train)
y_pred_E = Model2.predict(X_test)
print(f"The prediction output is {y_pred_E[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY'], meanwhile the real output we have is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY']


## 5. Evaluation using accuracy and classification report (max_depth = 4 and 5)

In [14]:
from sklearn.metrics import accuracy_score
acc1 = accuracy_score(y_test, y_pred_G)
acc2 = accuracy_score(y_test, y_pred_E)
print(f"Accuracy for Gini = {acc1} & Accuracy for Entropy = {acc2}")

Accuracy for Gini = 0.9833333333333333 & Accuracy for Entropy = 0.9833333333333333


In [15]:
from sklearn.metrics import classification_report, confusion_matrix

cls_report1 = classification_report(y_test, y_pred_G)
cls_report2 = classification_report(y_test, y_pred_E)

print(f"{cls_report1}")
print(f"{cls_report2}")

              precision    recall  f1-score   support

       drugA       0.80      1.00      0.89         4
       drugB       1.00      0.83      0.91         6
       drugC       1.00      1.00      1.00         4
       drugX       1.00      1.00      1.00        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.98        60
   macro avg       0.96      0.97      0.96        60
weighted avg       0.99      0.98      0.98        60

              precision    recall  f1-score   support

       drugA       0.80      1.00      0.89         4
       drugB       1.00      0.83      0.91         6
       drugC       1.00      1.00      1.00         4
       drugX       1.00      1.00      1.00        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.98        60
   macro avg       0.96      0.97      0.96        60
weighted avg       0.99      0.98      0.98        60



In [16]:
# Use a separate name so the imported confusion_matrix function is not overwritten
cfm1 = confusion_matrix(y_test, y_pred_G)
print(cfm1)

[[ 4  0  0  0  0]
 [ 1  5  0  0  0]
 [ 0  0  4  0  0]
 [ 0  0  0 19  0]
 [ 0  0  0  0 27]]
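
The figures in the classification report can be recomputed from this matrix by hand. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes, in sorted label order) and derives accuracy, precision, and recall with plain Python.

```python
# Confusion matrix printed above (rows = true class, columns = predicted class)
labels = ["drugA", "drugB", "drugC", "drugX", "drugY"]
cm = [
    [4, 0, 0, 0, 0],
    [1, 5, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 19, 0],
    [0, 0, 0, 0, 27],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Precision: diagonal over column sum; recall: diagonal over row sum
precision = {}
recall = {}
for i, name in enumerate(labels):
    col_sum = sum(row[i] for row in cm)
    row_sum = sum(cm[i])
    precision[name] = cm[i][i] / col_sum
    recall[name] = cm[i][i] / row_sum

print(round(accuracy, 4))
print(round(precision["drugA"], 2), round(recall["drugB"], 2))
```

The single off-diagonal entry (one drugB sample predicted as drugA) accounts for the drugA precision of 0.80 and the drugB recall of 0.83 in the report.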


In this project, the two decision tree models (Model1 and Model2) yield the same results on the test set.

In theory, the two criteria usually select similar splits. Entropy requires computing a logarithm, whereas Gini impurity uses a simpler computation and is therefore slightly cheaper to evaluate.
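
To make the two criteria concrete, here is a minimal sketch of both impurity measures written from their standard definitions. The helper names `gini` and `entropy` are hypothetical, and the class counts are those of the 60 test samples from the report above, used only as an example of a mixed node.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts of the 60 test samples (drugA, drugB, drugC, drugX, drugY)
mixed_node = [4, 6, 4, 19, 27]
pure_node = [0, 0, 0, 0, 27]

print(round(gini(mixed_node), 3), round(entropy(mixed_node), 3))
print(gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and grow as the classes become more mixed, which is why they tend to rank candidate splits similarly.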

__First result: Gini and Entropy yield the same result. A possible factor is that our dataset contains only 200 records.__

> Next, I check how changing the max_depth of our decision tree model affects the predictions

### Model 3 = Decision Tree Model (Max_depth = 3)

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

In [18]:
# Gini and entropy gave the same accuracy, so we use only Gini here
Model = DecisionTreeClassifier(criterion="gini", max_depth = 3)
Model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [19]:
Model.fit(X_train, y_train)
y_pred_new = Model.predict(X_test)
print(f"The prediction output is {y_pred_new[0:5]}, meanwhile the real output we have is {y_test[0:5]}")

The prediction output is ['drugX' 'drugY' 'drugX' 'drugX' 'drugY'], meanwhile the real output we have is ['drugX' 'drugY' 'drugX' 'drugC' 'drugY']


In [20]:
acc1 = accuracy_score(y_test, y_pred_new)
print(f"Accuracy for Gini = {acc1}")

Accuracy for Gini = 0.9166666666666666


In [21]:
cls_report = classification_report(y_test, y_pred_new)

print(f"{cls_report}")

              precision    recall  f1-score   support

       drugA       0.80      1.00      0.89         4
       drugB       1.00      0.83      0.91         6
       drugC       0.00      0.00      0.00         4
       drugX       0.83      1.00      0.90        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.92        60
   macro avg       0.73      0.77      0.74        60
weighted avg       0.86      0.92      0.89        60





In [22]:
from sklearn.metrics import classification_report, confusion_matrix
cfm = confusion_matrix(y_test, y_pred_new)
print(cfm)

[[ 4  0  0  0  0]
 [ 1  5  0  0  0]
 [ 0  0  0  4  0]
 [ 0  0  0 19  0]
 [ 0  0  0  0 27]]


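As we can see, when we lower the max depth to 3, the accuracy drops from 0.98 to 0.92 and the macro-average f1-score drops from 0.96 to 0.74.

The confusion matrix shows where the loss comes from: all 4 drugC samples are predicted as drugX, so drugC gets a precision and recall of 0.

__Second result: Using the 'Decision Tree Model' with different 'max_depth', the deeper trees (max_depth = 4 and 5) yield higher accuracy here than the shallower tree (max_depth = 3).__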
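
One way to explore the effect of depth more systematically is to sweep over several `max_depth` values. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, since `drug200.csv` is not bundled here), so its scores say nothing about the drug data itself. Setting `random_state` on the tree is an added choice to make runs repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the drug data: 200 rows, 5 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

depth_scores = {}
for depth in range(1, 8):
    # random_state fixes tie-breaking between equally good splits
    tree = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_te, tree.predict(X_te))

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: test accuracy={score:.3f}")
```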

> Next, I want to model it using Random Forest, which often yields better accuracy

### Model 4 = Random Forest Model (Random_state = 1)

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

In [24]:
from sklearn.ensemble import RandomForestClassifier

rand_forest = RandomForestClassifier(n_estimators=200)
rand_forest.fit(X_train, y_train)

y_pred = rand_forest.predict(X_test)
y_pred

array(['drugX', 'drugY', 'drugX', 'drugC', 'drugY', 'drugX', 'drugX',
       'drugY', 'drugY', 'drugY', 'drugX', 'drugC', 'drugY', 'drugY',
       'drugA', 'drugA', 'drugX', 'drugX', 'drugB', 'drugY', 'drugX',
       'drugX', 'drugX', 'drugY', 'drugB', 'drugX', 'drugX', 'drugY',
       'drugX', 'drugX', 'drugC', 'drugY', 'drugY', 'drugY', 'drugA',
       'drugY', 'drugA', 'drugY', 'drugY', 'drugY', 'drugB', 'drugY',
       'drugY', 'drugX', 'drugB', 'drugY', 'drugX', 'drugX', 'drugY',
       'drugB', 'drugY', 'drugY', 'drugY', 'drugY', 'drugY', 'drugY',
       'drugX', 'drugX', 'drugX', 'drugA'], dtype=object)

In [25]:
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy for Random Forest = {acc}")

Accuracy for Random Forest = 0.9666666666666667


In [26]:
cls_report = classification_report(y_test, y_pred)

print(f"{cls_report}")

              precision    recall  f1-score   support

       drugA       0.80      1.00      0.89         4
       drugB       1.00      0.83      0.91         6
       drugC       1.00      0.75      0.86         4
       drugX       0.95      1.00      0.97        19
       drugY       1.00      1.00      1.00        27

    accuracy                           0.97        60
   macro avg       0.95      0.92      0.93        60
weighted avg       0.97      0.97      0.97        60



In [27]:
from sklearn.metrics import classification_report, confusion_matrix
cfm = confusion_matrix(y_test, y_pred)
print(cfm)

[[ 4  0  0  0  0]
 [ 1  5  0  0  0]
 [ 0  0  3  1  0]
 [ 0  0  0 19  0]
 [ 0  0  0  0 27]]


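__Third result: In this case, the random forest gives slightly lower accuracy than the decision tree model (max_depth = 5): 58 versus 59 of the 60 test samples are correct. This could possibly indicate underfitting, although a gap of one sample may also be noise from the small test set.__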

Normally, a __'Random Forest model'__ is expected to give accuracy at least as good as a single __'Decision Tree model'__.
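
Because a random forest is built from random bootstrap samples, its accuracy can shift by a sample or two between runs unless `random_state` is set. The sketch below fixes the seed and also reports the out-of-bag (OOB) estimate as a second check. It runs on synthetic stand-in data (an assumption), and `oob_score=True` is an addition not used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the drug data: 200 rows, 5 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# random_state makes the forest repeatable; oob_score scores each tree on rows it did not see
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
forest.fit(X_tr, y_tr)

test_accuracy = forest.score(X_te, y_te)
print(f"OOB accuracy  = {forest.oob_score_:.3f}")
print(f"Test accuracy = {test_accuracy:.3f}")
```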

> So we can try the __cross-validation method on 4 different models (K-fold from scikit-learn)__ and check the accuracy

### Model 5: Decision Tree (K-fold) on different models

Let's create different models that each use **Age** and **Sex** plus only one of **BP**, **Cholesterol**, or **Na_to_K**.

In [28]:
# Model depend on BP
X_BP = drug_data[['Age', 'Sex', 'BP']].values
X_BP[:,1] = new_sex
X_BP[:,2] = new_BP
X_BP= preprocessing.StandardScaler().fit(X_BP).transform(X_BP)

# Model depend on Cholesterol
X_Chol = drug_data[['Age', 'Sex', 'Cholesterol']].values
X_Chol[:,1] = new_sex
X_Chol[:,2] = new_Cholesterol
X_Chol= preprocessing.StandardScaler().fit(X_Chol).transform(X_Chol)

# Model depend on Na_to_K
X_NaK = drug_data[['Age', 'Sex', 'Na_to_K']].values
X_NaK[:,1] = new_sex
X_NaK= preprocessing.StandardScaler().fit(X_NaK).transform(X_NaK)
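
Since `StandardScaler` works column by column, the same reduced matrices can also be obtained by selecting columns from one scaled matrix. The sketch below illustrates this on randomly generated stand-in values (an assumption, not the real data); `subsets` and `X_by_subset` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the encoded feature matrix (columns: Age, Sex, BP, Cholesterol, Na_to_K)
rng = np.random.default_rng(1)
X_encoded = np.column_stack([
    rng.integers(15, 75, size=200),   # Age
    rng.integers(0, 2, size=200),     # Sex code
    rng.integers(0, 3, size=200),     # BP code
    rng.integers(0, 2, size=200),     # Cholesterol code
    rng.uniform(6, 38, size=200),     # Na_to_K
]).astype(float)

X_scaled = StandardScaler().fit_transform(X_encoded)

# Column indices for each reduced feature set
subsets = {"BP": [0, 1, 2], "Cholesterol": [0, 1, 3], "Na_to_K": [0, 1, 4]}
X_by_subset = {name: X_scaled[:, cols] for name, cols in subsets.items()}

# Scaling the subset separately gives the same values, since scaling is per column
X_bp_separate = StandardScaler().fit_transform(X_encoded[:, [0, 1, 2]])
print(np.allclose(X_by_subset["BP"], X_bp_separate))
```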


In [29]:
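from sklearn.model_selection import KFold
# shuffle and random_state must be passed as keywords in recent scikit-learn versions
kfold = KFold(n_splits=10, shuffle=True, random_state=1)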

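# Note: each loop overwrites its variables on every iteration,
# so only the last of the 10 folds is kept for training and testing
for train, test in kfold.split(X_BP):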
    X_train_BP = X_BP[train]
    X_test_BP = X_BP[test]
    # print('train: %s, test: %s' % (X[train], X[test]))

for train, test in kfold.split(X_Chol):
    X_train_Chol = X_Chol[train]
    X_test_Chol = X_Chol[test]

for train, test in kfold.split(X_NaK):
    X_train_NaK = X_NaK[train]
    X_test_NaK = X_NaK[test]

for train, test in kfold.split(X):
    X_train_cv = X[train]
    X_test_cv = X[test]
    
for train, test in kfold.split(y):
    y_train_cv = y[train]
    y_test_cv = y[test]
    # print('train: %s, test: %s' % (y[train], y[test]))

In [30]:
# Use the best max_depth 5
Model1 = DecisionTreeClassifier(criterion="gini", max_depth = 5)
Model2 = DecisionTreeClassifier(criterion="gini", max_depth = 5)
Model3 = DecisionTreeClassifier(criterion="gini", max_depth = 5)
Model4 = DecisionTreeClassifier(criterion="gini", max_depth = 5)

Model1.fit(X_train_BP, y_train_cv)
Model2.fit(X_train_Chol, y_train_cv)
Model3.fit(X_train_NaK, y_train_cv)
Model4.fit(X_train_cv, y_train_cv)


y1_pred = Model1.predict(X_test_BP)
y2_pred = Model2.predict(X_test_Chol)
y3_pred = Model3.predict(X_test_NaK)
y4_pred = Model4.predict(X_test_cv)


In [31]:
column_name = drug_data.columns[2:5].to_list()
column_name.append("all")
column_name

['BP', 'Cholesterol', 'Na_to_K', 'all']

In [32]:
y_pred = [y1_pred, y2_pred, y3_pred, y4_pred]

a= []
b= []
nl = "\n"

for i,k,s in zip(y_pred, range(0,4), column_name):
    a.append(accuracy_score(y_test_cv,i)) 
    b.append(confusion_matrix(y_test_cv,i))
    print(f"Accuracy for {s}: {a[k]}, which has confusion matrix {nl} {b[k]}{nl}")
    

Accuracy for BP: 0.5, which has confusion matrix 
 [[1 0 0 0 4]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 1 3 2]
 [0 0 0 3 5]]

Accuracy for Cholesterol: 0.3, which has confusion matrix 
 [[0 0 0 2 3]
 [0 0 0 1 0]
 [0 0 0 0 0]
 [0 0 1 1 4]
 [0 1 0 2 5]]

Accuracy for Na_to_K: 0.7, which has confusion matrix 
 [[1 0 4 0]
 [0 0 1 0]
 [0 1 5 0]
 [0 0 0 8]]

Accuracy for all: 1.0, which has confusion matrix 
 [[5 0 0 0]
 [0 1 0 0]
 [0 0 6 0]
 [0 0 0 8]]



__Fourth result: The accuracy on the last fold of the k-fold split (20 test samples) is__
- 100% with all inputs,
- 70% with Na_to_K,
- 50% with BP,
- 30% with Cholesterol.
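
The loops in the K-fold cell keep only the last of the 10 folds, so each figure above comes from a single test fold. A fuller cross-validation scores every fold and averages the results. The sketch below shows one way to do that with `cross_val_score`, using synthetic stand-in data (an assumption), so the printed scores do not describe the drug data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the drug data: 200 rows, 5 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=1,
)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=1)

# One accuracy value per fold; each fold serves as the test set exactly once
fold_scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="accuracy")

print(fold_scores.round(2))
print(f"mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a more stable estimate than any single fold of 20 samples.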

## Conclusion

- Decision Tree and Random Forest models are alternative models for classification datasets.
- __Gini and Entropy yield the same result__. A possible factor is that our data consists of only 200 records, which is considered small, so there is no big difference.
- Using the 'Decision Tree Model' with a different 'max_depth' can __change the accuracy of the model__.
- In this case, __the random forest gives slightly lower accuracy than the decision tree model with max_depth = 5__ (0.97 versus 0.98), which is possibly __because of underfitting__ or simply the small test set.
- __The accuracy given by the k-fold split for all inputs is 100%__ on the evaluated fold of 20 samples. K-fold cross-validation is a common way to check for overfitting. The confusion matrix has all of its counts on the diagonal.

Finally, with all inputs, the optimized model could yield 100% prediction accuracy. However, as a limitation of this project, the perfect result could be due to the small dataset.

Given the patient's 'Age', 'Sex', 'Blood-Pressure rate', 'Cholesterol rate', and 'Sodium-Potassium rate', we can predict which drug to use to treat the patient.