<h2>In this notebook I will use a dataset containing basic patient information, such as age and sex, along with a prescribed fictional drug.
For this dataset, I will investigate which ML model most accurately predicts which drug should be prescribed for a new patient.</h2>

In [3]:
#starting off by importing necessary libraries for data analysis and exploration.

import pandas as pd
import numpy as np

df = pd.read_csv(r'G:\Data Science & PowerBI\drug200.csv')  #raw string so the backslashes are not treated as escapes
df.head(10)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
5,22,F,NORMAL,HIGH,8.607,drugX
6,49,F,NORMAL,HIGH,16.275,drugY
7,41,M,LOW,HIGH,11.037,drugC
8,60,M,NORMAL,HIGH,15.171,drugY
9,43,M,LOW,NORMAL,19.368,drugY


<h3>This preview of the dataframe shows that the data contains non-numerical, categorical columns. To predict which drug a patient should be prescribed (for example drug Y, C or X), I will apply several classification models and determine which one is best suited, based on the accuracy score. Since it is a relatively 'simple' data set, I will apply the following models:
    <ul>
        <li>decision tree</li>
        <li>nearest neighbor</li>
        <li>logistic regression</li>
    </ul>
</h3>

In [4]:
#before I start fitting the models, I will convert the categorical columns (Sex, BP, Cholesterol) into numerical data
#let's start by checking which unique values each column has

sex = df['Sex'].unique()
BP = df['BP'].unique()
Cholesterol = df['Cholesterol'].unique()

print('The column Sex contains the values:', sex)
print('The column BP contains the values:', BP)
print('The column Cholesterol contains the values:', Cholesterol)

The column Sex contains the values: ['F' 'M']
The column BP contains the values: ['HIGH' 'LOW' 'NORMAL']
The column Cholesterol contains the values: ['HIGH' 'NORMAL']


In [5]:
#now let's convert these unique values to integer codes so that the models can be fitted.
#assigning the mapped column back avoids inplace=True on a column and stores real integers instead of the strings '0', '1', '2'
df['Sex'] = df['Sex'].map({'F': 0, 'M': 1})
df['BP'] = df['BP'].map({'HIGH': 0, 'LOW': 1, 'NORMAL': 2})
df['Cholesterol'] = df['Cholesterol'].map({'HIGH': 0, 'NORMAL': 1})

df.head(5)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,0,0,0,25.355,drugY
1,47,1,1,0,13.093,drugC
2,47,1,1,0,10.114,drugC
3,28,0,2,0,7.798,drugX
4,61,0,1,0,18.043,drugY
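
The numeric codes above place BP on a scale where HIGH=0, LOW=1 and NORMAL=2, which implies an ordering that blood pressure levels do not actually follow. A minimal sketch of one alternative, one-hot encoding with `pd.get_dummies`, is given here on a hypothetical toy frame (`toy` and `encoded` are illustrative names and are not used elsewhere in this notebook). Decision trees are usually less sensitive to this choice, while distance-based and linear models may be affected.

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns as the dataset
toy = pd.DataFrame({
    'Sex': ['F', 'M', 'M', 'F'],
    'BP': ['HIGH', 'LOW', 'NORMAL', 'HIGH'],
    'Cholesterol': ['HIGH', 'HIGH', 'NORMAL', 'NORMAL'],
})

# One indicator column per category value, so no artificial order is implied
encoded = pd.get_dummies(toy, columns=['Sex', 'BP', 'Cholesterol'], dtype=int)
print(encoded.columns.tolist())
```

The same call could be applied to the real dataframe before defining the feature array, at the cost of a few extra columns.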


In [6]:
#because all three models require a feature array X as input, I will define it here.
#the target y is needed to fit the models, so I will define it here as well

x = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
y = df['Drug'].values

<h3>Decision Tree</h3>

In [7]:
#before fitting the model, I will first split the data into training and testing sets
#note: with random_state=None the split is different on every run, so the scores can vary between runs

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=None)

#after splitting, I fit the model on the training set
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
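
Because the split above is unseeded, rerunning the notebook can produce different scores. A minimal sketch of a reproducible, stratified split follows, using hypothetical synthetic data (`X_toy`, `y_toy` and the seed 42 are assumptions chosen only for illustration). The `stratify` argument keeps the class proportions roughly equal in both sets, which may matter when some drugs are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature array and drug labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = np.array(['drugA', 'drugB', 'drugC'] * 20)

# A fixed random_state makes the split repeatable; stratify preserves class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```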

In [8]:
#by calling the function below, a prediction is made for the test set. I want to compare the predicted outcome with the actual outcome
prediction = clf.predict(X_test)

#named columns make it explicit which values are predicted and which are actual
data = pd.DataFrame({'predicted': prediction, 'actual': y_test})
data.head(5)

Unnamed: 0,predicted,actual
0,drugX,drugX
1,drugB,drugB
2,drugC,drugC
3,drugC,drugC
4,drugB,drugB


In [9]:
#How accurate is the prediction?

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, prediction)
print('The accuracy of this decision tree in predicting a Drug is:', score * 100, '%')

The accuracy of this decision tree in predicting a Drug is: 100.0 %
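
A score from a single train/test split can be optimistic or pessimistic depending on which rows land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses `cross_val_score` on hypothetical synthetic data from `make_classification` (the data, the 5 folds and the seeds are assumptions for illustration, not results from this dataset).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical three-class data standing in for the patient features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Each of the 5 folds is used once as the test set; the rest is used for fitting
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)
print('mean accuracy:', scores.mean(), 'std:', scores.std())
```

Reporting the mean together with the spread across folds gives a better sense of how stable the accuracy is.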


<h3>Nearest Neighbor</h3>

In [10]:
#the train/test split is already done.
#defining X and y is already done.


from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=10)

#fit the new model
neigh.fit(X_train, y_train)

#Compare predicted and actual data
prediction_knn = neigh.predict(X_test)

data2 = pd.DataFrame({'predicted': prediction_knn, 'actual': y_test})
data2.head(5)

Unnamed: 0,predicted,actual
0,drugA,drugX
1,drugB,drugB
2,drugX,drugC
3,drugX,drugC
4,drugY,drugB


In [11]:
#Determine the accuracy of the model
score2 = accuracy_score(y_test, prediction_knn)
print('The accuracy of this nearest neighbor model(k=10) in predicting a Drug is:', score2 * 100, '%')

The accuracy of this nearest neighbor model(k=10) in predicting a Drug is: 62.121212121212125 %
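
Nearest neighbor models compare rows by distance, so a feature with a wide range (such as Age) can outweigh features coded as 0/1. A possible remedy is to standardize the features before fitting. The sketch below illustrates the idea on hypothetical toy data in which the label depends only on the small-range feature; whether scaling helps on this dataset has not been tested here, and all names in the block are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one wide-range feature (like Age) and one 0/1 code
rng = np.random.default_rng(0)
age_like = rng.uniform(15, 75, size=200)
flag_like = rng.integers(0, 2, size=200)
X_demo = np.column_stack([age_like, flag_like])
# The label depends only on the small-range feature
y_demo = np.where(flag_like == 1, 'drugA', 'drugB')

X_fit, X_eval = X_demo[:150], X_demo[150:]
y_fit, y_eval = y_demo[:150], y_demo[150:]

# Same k as in the notebook, once on raw features and once behind a scaler
raw_knn = KNeighborsClassifier(n_neighbors=10).fit(X_fit, y_fit)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=10)
).fit(X_fit, y_fit)

raw_acc = raw_knn.score(X_eval, y_eval)
scaled_acc = scaled_knn.score(X_eval, y_eval)
print('unscaled:', raw_acc, 'scaled:', scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training rows only, so no information from the test rows leaks into the scaling.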


<h3>Logistic regression</h3>

In [12]:
#the train/test split is already done.
#defining X and y is already done.

# import libraries
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

# train logistic regression model on the training set
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)
print(model)

LogisticRegression(C=100.0, solver='liblinear')


In [13]:
#make predictions using the logistic regression model
prediction_lr = model.predict(X_test)
data3 = pd.DataFrame({'predicted': prediction_lr, 'actual': y_test})
data3.head(5)

Unnamed: 0,predicted,actual
0,drugX,drugX
1,drugB,drugB
2,drugY,drugC
3,drugX,drugC
4,drugB,drugB


In [16]:
#what is this model's accuracy?
score3 = accuracy_score(y_test, prediction_lr)
print('The accuracy of this logistic regression model in predicting a Drug is:', score3 * 100, '%')

The accuracy of this logistic regression model in predicting a Drug is: 93.93939393939394 %
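
Accuracy alone does not show which drugs get confused with each other. A confusion matrix breaks the result down per class. The sketch below uses only the five preview rows of the logistic regression output, so it is an illustration of the technique rather than an evaluation of the model; the `labels` list assumes the five drug names seen in the outputs of this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# The five preview rows from the logistic regression comparison
actual = ['drugX', 'drugB', 'drugC', 'drugC', 'drugB']
predicted = ['drugX', 'drugB', 'drugY', 'drugX', 'drugB']
labels = ['drugA', 'drugB', 'drugC', 'drugX', 'drugY']

# Rows are the actual drugs, columns are the predicted drugs
cm = confusion_matrix(actual, predicted, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

Passing the full `y_test` and `prediction_lr` arrays instead of the preview rows would give the matrix for the whole test set.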


<h1>I can conclude, based on this analysis, that with an accuracy of 100.0% on this test split the decision tree is the best-suited model for predicting which drug should be prescribed to a patient, based on their age, sex, BP, cholesterol and Na_to_K.
With an accuracy of 93.9%, the logistic regression model is second best, and the nearest neighbor model comes in third with 62.1%. Because the train/test split is not seeded, these figures can vary between runs.</h1>
