# Pima Indians Classification Analysis using KNN, Decision Trees, Logistic Regression and SVM



In this analysis, I use the sklearn library to compare the KNN, decision tree, logistic regression and SVM algorithms and see which one predicts the occurrence of diabetes most accurately.

# Pre-processing the data

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets, preprocessing

#Reading in the data
df = pd.read_csv('/Users/lawrence/Google Drive/proj/ML/diabetes.csv')
df.head()



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
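The zeros in the SkinThickness and Insulin columns of the preview deserve a second look. A reading of zero is not plausible for these measurements, so they may be placeholders for values that were never recorded. That is my assumption, not something the file states. A minimal sketch of one way to handle it, using only the five rows shown above and a hypothetical `zero_as_missing` column list:

```python
import numpy as np
import pandas as pd

# Stand-in frame built from the five preview rows (subset of columns).
demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ["Glucose", "SkinThickness", "Insulin", "BMI"]
cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count the gaps per column, then fill each gap with the column median.
missing_counts = cleaned.isna().sum()
imputed = cleaned.fillna(cleaned.median())

print(missing_counts)
print(imputed)
```

Median imputation is only one option. Dropping the affected rows or leaving the zeros in place are also defensible, and the rest of this notebook uses the data as read.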


In [2]:
df.columns
#Convert to numpy array for sklearn
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']].values
y = df['Outcome'].values


#Standardise the features to zero mean and unit variance
#(note: the scaler is fitted on all rows, before the train/test split)
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))


In [3]:
#Split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2)
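Because the scaler above is fitted before the split, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that ordering on synthetic stand-in data from `make_classification` (768 rows, 8 features), so the names `X_demo`, `y_demo` and the accuracy it prints are illustrative and not results for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# fit() learns mean/std from X_tr only; score() reuses them on X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
model.fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print("Test set accuracy:", test_acc)
```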

# Training with K-Nearest Neighbours (KNN)

In [4]:
from sklearn.neighbors import KNeighborsClassifier

k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)

In [5]:
#Make predictions
yhat = neigh.predict(X_test)
yhat[0:5]

array([0, 0, 0, 1, 0])

In [6]:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy:  0.8175895765472313
Test set Accuracy:  0.7792207792207793
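Accuracy on its own does not say which kind of mistake the classifier makes. A confusion matrix separates missed positives from false alarms. The sketch below uses small made-up label arrays (`y_true` and `y_pred` are hypothetical, not the notebook's predictions) to show how accuracy and recall fall out of the four cells.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration: 6 negatives and 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
recall = tp / (tp + fn)          # share of true positives that were found

print(cm)
print("Accuracy:", accuracy, "Recall:", recall)
```

The same two calls should work on `y_test` and `yhat` from the cell above.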


In [7]:
#Testing different values of K
Ks = 20
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    #Train model and predict
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    #Standard error of the accuracy estimate
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc #K=4 and K=16 tie for the highest test accuracy


array([0.72727273, 0.76623377, 0.75974026, 0.77922078, 0.74675325,
       0.73376623, 0.75324675, 0.72727273, 0.72727273, 0.72727273,
       0.72077922, 0.74675325, 0.74675325, 0.75974026, 0.76623377,
       0.77922078, 0.74675325, 0.77272727, 0.77272727])
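The loop above picks K by looking at test-set accuracy, which means the test set is no longer a fully independent check. One alternative is to choose K by cross-validation on the training rows alone. This is a sketch on synthetic stand-in data (`X_demo`, `y_demo` from `make_classification`), so the K it selects says nothing about the diabetes data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

k_values = range(1, 20)
cv_means = []
for k in k_values:
    # Scaling sits inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_means.append(scores.mean())

# Pick the K with the highest mean cross-validated accuracy.
best_k = k_values[int(np.argmax(cv_means))]
print("Best K by 5-fold CV:", best_k)
```

The held-out test set is then used once, at the end, to score the chosen K.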

# Using Decision Trees

In [8]:
from sklearn.tree import DecisionTreeClassifier

diabTree = DecisionTreeClassifier(criterion="entropy", max_depth = 3)

#fit returns the fitted estimator, so the cell output shows its parameters
diabTree.fit(X_train,y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=3)

In [9]:
predTree = diabTree.predict(X_test)
print("Decision Tree Accuracy: ", metrics.accuracy_score(y_test, predTree))

Decision Tree Accuracy:  0.7337662337662337
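A depth-3 tree is small enough to read directly, which helps when judging whether its splits make sense. The sketch below prints the learned rules with `export_text`. It runs on synthetic stand-in data, and the column names are borrowed from the diabetes file purely as labels, so the thresholds shown carry no meaning for the real measurements.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names borrowed from the diabetes file, used here only as labels.
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# random_state fixes the tie-breaking between equally good splits.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```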


# Using Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.03, solver='liblinear').fit(X_train,y_train)

In [11]:
yhat = LR.predict(X_test)  #Make prediction on test set
print("Logistic Regression Accuracy: ", metrics.accuracy_score(y_test, yhat))

Logistic Regression Accuracy:  0.7792207792207793
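In sklearn, `C` is the inverse of the regularisation strength, so `C=0.03` is a fairly strong penalty that shrinks the coefficients towards zero. The sketch below illustrates the effect by fitting the same model at three values of `C` on synthetic stand-in data and comparing the size of the coefficient vector. The values 1.0 and 100.0 are arbitrary comparison points I chose, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

norms = {}
for C in [0.03, 1.0, 100.0]:
    lr = LogisticRegression(C=C, solver='liblinear').fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector; smaller means more shrinkage.
    norms[C] = np.linalg.norm(lr.coef_)
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")

# Class probabilities for the first three rows (from the last model fitted).
proba = lr.predict_proba(X_demo[:3])
print(proba)
```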


# Using SVM

In [12]:
from sklearn import svm

clf = svm.SVC(kernel='rbf') #Create classifier
clf.fit(X_train, y_train) #Fit training data

yhat = clf.predict(X_test) #Make new prediction on test set
print("SVM Accuracy: ", metrics.accuracy_score(y_test, yhat)) #evaluate accuracy

SVM Accuracy:  0.7402597402597403
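The SVM above runs with the default `C` and `gamma`, and an RBF kernel is usually sensitive to both. A small grid search over these two parameters is one way to check whether the defaults hold it back. This is a sketch on synthetic stand-in data with a grid I picked for illustration, so neither the best parameters nor the score transfer to the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Illustrative grid; "svc__" routes each parameter to the SVC step.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

# 5-fold cross-validation on the training rows only.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)
svm_test_acc = search.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print("Test accuracy with best parameters:", svm_test_acc)
```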


# Conclusion

All four algorithms predict diabetes reasonably well, with accuracies of roughly 73-78% on the test set. Logistic regression and KNN (K=4) perform slightly better, at about 77.9% each, than SVM (74.0%) and the decision tree (73.4%).
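The comparison above rests on a single train/test split, so gaps of a few percentage points could shift with a different `random_state`. Scoring every model with the same cross-validation folds gives a mean and a spread for each one. Here is a sketch on synthetic stand-in data, with the model settings copied from the cells above, apart from `random_state=0` on the tree, which I added for repeatability. The printed numbers are not results for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

models = {
    "KNN (K=4)": KNeighborsClassifier(n_neighbors=4),
    "Decision tree": DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                            random_state=0),
    "Logistic regression": LogisticRegression(C=0.03, solver="liblinear"),
    "SVM (RBF)": SVC(kernel="rbf"),
}

results = {}
for name, estimator in models.items():
    # Scaling inside the pipeline keeps each fold's statistics separate.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```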