# Classification of Pima Indians Diabetes Data
This project compares the performance of Naive Bayes, SVM, and Decision Tree classifiers on the Pima Indians Diabetes dataset. Naive Bayes and Decision Trees are coded "from scratch" with NumPy.

## Dataset Description

The dataset has 768 samples with the following features:
1. Number of times pregnant
2. Plasma glucose concentration from an oral glucose tolerance test
3. Blood pressure (mmHg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (μU/mL)
6. Body Mass Index
7. Diabetes Pedigree Function
8. Age
9. Outcome: 0 or 1 (target variable)

Importing the necessary modules:

In [1]:
import NaiveBayes
import SVM
import DecisionTree
import DTscratch
import datatools
import crossvalidate
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Loading in the dataset:

In [2]:
np.set_printoptions(formatter={'float': lambda x: "{0:0.3f}".format(x)}) # set the print format
filename = "pima-indians-diabetes.data.csv"
labels, features = datatools.loadDataset(filename)
print("Size of Dataset: ",len(labels))

Size of Dataset:  768


In [3]:
print("Number of Samples with Diabetes: ", np.sum(labels == 1))
print("Number of Samples with No Diabetes: ", np.sum(labels == 0))

Number of Samples with Diabetes:  268
Number of Samples with No Diabetes:  500


### 10-fold Cross Validation with Naive Bayes:

In [4]:
CV_preds = crossvalidate.n_fold(features, labels, NaiveBayes, n_folds=10)
print("Naive Bayes 10 fold CV accuracy:", datatools.accuracy(labels, CV_preds))

Fold 1 out of 10
Fold 2 out of 10
Fold 3 out of 10
Fold 4 out of 10
Fold 5 out of 10
Fold 6 out of 10
Fold 7 out of 10
Fold 8 out of 10
Fold 9 out of 10
Fold 10 out of 10
Naive Bayes 10 fold CV accuracy: 65.10416666666666
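
The cross-validation step above can be sketched roughly as follows. This is only an illustration of the general n-fold idea, not the project's `crossvalidate.n_fold`: the function name `n_fold_predictions`, the `fit_predict` callable, and the `seed` parameter are hypothetical, and the toy data only mimics the 500/268 class balance.

```python
import numpy as np

def n_fold_predictions(features, labels, fit_predict, n_folds=10, seed=0):
    """Return one out-of-fold prediction per sample.

    fit_predict(train_X, train_y, test_X) is a hypothetical callable that
    trains a model on the training fold and returns labels for test_X.
    """
    n_samples = len(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle the sample indices once
    folds = np.array_split(order, n_folds)   # near-equal chunks of indices
    preds = np.empty(n_samples, dtype=labels.dtype)
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, predict on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds[test_idx] = fit_predict(
            features[train_idx], labels[train_idx], features[test_idx]
        )
    return preds

def majority_class(train_X, train_y, test_X):
    """Baseline model: always predict the most frequent training label."""
    values, counts = np.unique(train_y, return_counts=True)
    return np.full(len(test_X), values[np.argmax(counts)])

# Toy data with the same class balance as the dataset (500 vs 268)
toy_labels = np.array([0] * 500 + [1] * 268)
toy_features = np.zeros((768, 8))
cv_preds = n_fold_predictions(toy_features, toy_labels, majority_class)
accuracy = 100.0 * np.mean(cv_preds == toy_labels)
print("Majority-class baseline accuracy:", accuracy)
```

Because every sample is predicted exactly once by a model that never saw it, the predictions can be pooled and scored against the full label vector, which is how the accuracy and confusion matrices in this document are computed.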




Confusion matrix and classification report after using Naive Bayes:

In [5]:
print(confusion_matrix(labels, CV_preds))
print(classification_report(labels,CV_preds))

[[500   0]
 [268   0]]
              precision    recall  f1-score   support

           0       0.65      1.00      0.79       500
           1       0.00      0.00      0.00       268

    accuracy                           0.65       768
   macro avg       0.33      0.50      0.39       768
weighted avg       0.42      0.65      0.51       768





The confusion matrix and report show that the Naive Bayes classifier fails to separate the classes: it classifies every sample as "No Diabetes". Its accuracy of 65.1% is therefore just the majority-class baseline (500 of 768 samples).
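
A classifier that predicts a single class for every sample often points to a numerical problem, such as likelihoods underflowing to zero before normalization. Whether that is the cause here is not confirmed by the source. The following is a minimal sketch of a Gaussian Naive Bayes that works in log space, assuming Gaussian per-feature likelihoods; the function names and the synthetic data are hypothetical and are not the project's `NaiveBayes` module.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class feature means, variances, and log priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant guards against zero variance (an assumption of this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    return classes, means, variances, log_priors

def predict_gaussian_nb(model, X):
    """Return predicted labels and normalized posteriors, computed in log space."""
    classes, means, variances, log_priors = model
    diff = X[:, None, :] - means[None, :, :]          # (n_samples, n_classes, n_features)
    # Sum of per-feature Gaussian log-densities for each sample and class
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
    log_post = log_lik + log_priors[None, :]
    # Subtract the row maximum before exponentiating so the largest term is exp(0) = 1
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # normalizing
    return classes[np.argmax(post, axis=1)], post

# Synthetic, well-separated data (not the diabetes dataset)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 8)), rng.normal(5.0, 1.0, (30, 8))])
y = np.array([0] * 50 + [1] * 30)
model = fit_gaussian_nb(X, y)
preds, post = predict_gaussian_nb(model, X)

# A far-away point whose raw likelihoods would underflow to 0 in float64
far_preds, far_post = predict_gaussian_nb(model, np.full((1, 8), 1e3))
print("Training accuracy:", np.mean(preds == y))
print("Posterior for far point is finite:", np.all(np.isfinite(far_post)))
```

Multiplying eight small densities and then dividing by their sum can yield 0/0; summing log-densities and shifting by the maximum avoids that, so the posterior stays finite even for extreme inputs.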

### 10-fold Cross Validation with SVM
Let us see how the SVM performs using a linear kernel.

In [6]:
CV_preds = crossvalidate.n_fold(features, labels, SVM, 'linear', n_folds=10)
print("SVM 10 fold CV accuracy:", datatools.accuracy(labels, CV_preds))

Fold 1 out of 10
Fold 2 out of 10
Fold 3 out of 10
Fold 4 out of 10
Fold 5 out of 10
Fold 6 out of 10
Fold 7 out of 10
Fold 8 out of 10
Fold 9 out of 10
Fold 10 out of 10
SVM 10 fold CV accuracy: 77.08333333333334


Confusion matrix and classification report after using SVM with a linear kernel:

In [7]:
print(confusion_matrix(labels, CV_preds))
print(classification_report(labels,CV_preds))

[[439  61]
 [115 153]]
              precision    recall  f1-score   support

           0       0.79      0.88      0.83       500
           1       0.71      0.57      0.63       268

    accuracy                           0.77       768
   macro avg       0.75      0.72      0.73       768
weighted avg       0.77      0.77      0.76       768



From the confusion matrix, the SVM misclassifies 61 "No Diabetes" samples as "Diabetes" and 115 "Diabetes" samples as "No Diabetes". Overall, class separation is achieved, with high precision and recall for predicting "No Diabetes" (0.79 and 0.88). The false negative rate for "Diabetes" is noticeably high, however: 115 of the 268 positive samples (about 43%) are missed.
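
The figures in the classification report can be rederived from the confusion matrix by hand. The sketch below does this for the linear-kernel matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Derive common metrics from a 2x2 confusion matrix.

    cm is [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_no_diabetes": tn / (tn + fn),
        "recall_no_diabetes": tn / (tn + fp),
        "precision_diabetes": tp / (tp + fp),
        "recall_diabetes": tp / (tp + fn),
        "false_negative_rate": fn / (fn + tp),
    }

# Confusion matrix reported for the SVM with a linear kernel
linear_svm = [[439, 61], [115, 153]]
metrics = binary_metrics(linear_svm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The false negative rate is simply one minus the recall for "Diabetes", which is why a recall of 0.57 corresponds to roughly 43% of diabetic samples being missed.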

Let us see if using a non-linear kernel, the radial basis function (RBF), improves performance:

In [8]:
CV_preds = crossvalidate.n_fold(features, labels, SVM, 'rbf', n_folds=10)
print("SVM 10 fold CV accuracy:", datatools.accuracy(labels, CV_preds))

Fold 1 out of 10
Fold 2 out of 10
Fold 3 out of 10
Fold 4 out of 10
Fold 5 out of 10
Fold 6 out of 10
Fold 7 out of 10
Fold 8 out of 10
Fold 9 out of 10
Fold 10 out of 10
SVM 10 fold CV accuracy: 76.30208333333334


Confusion matrix and classification report after using SVM with an RBF kernel:

In [9]:
print(confusion_matrix(labels, CV_preds))
print(classification_report(labels,CV_preds))

[[455  45]
 [137 131]]
              precision    recall  f1-score   support

           0       0.77      0.91      0.83       500
           1       0.74      0.49      0.59       268

    accuracy                           0.76       768
   macro avg       0.76      0.70      0.71       768
weighted avg       0.76      0.76      0.75       768



Using an RBF kernel raises recall for "No Diabetes" (0.91 versus 0.88), but more "Diabetes" samples are incorrectly classified (137) than correctly classified (131). Overall accuracy also drops slightly (76.3% versus 77.1%), so the linear kernel performs better overall.

### 10-fold Cross Validation with Decision Tree
Let us see how the Decision Tree performs. The decision tree is set to have a max depth of 10 and a minimum of 5 samples before splitting a node.
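
The two stopping parameters mentioned above can be illustrated with a small recursive tree builder. This is a sketch under stated assumptions, not the project's `DTscratch` implementation: it assumes Gini impurity as the split criterion and reads "minimum of 5 samples" as "do not split a node with fewer than 5 samples". All function names and the synthetic data are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Find the (feature, threshold) that most reduces weighted Gini impurity."""
    best = (None, None, gini(y))
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:    # skip the max so the right side is non-empty
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=10, min_samples_split=5):
    values, counts = np.unique(y, return_counts=True)
    leaf = {"label": values[np.argmax(counts)]}
    # Stop when the node is too deep, too small, or already pure
    if depth >= max_depth or len(y) < min_samples_split or len(values) == 1:
        return leaf
    j, t, _ = best_split(X, y)
    if j is None:                            # no split reduces impurity
        return leaf
    left = X[:, j] <= t
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree(X[left], y[left], depth + 1, max_depth, min_samples_split),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples_split),
    }

def predict_one(tree, x):
    """Walk from the root to a leaf and return its label."""
    while "label" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["label"]

def tree_depth(tree):
    return 0 if "label" in tree else 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))

# Synthetic data where the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
tree = build_tree(X, y, max_depth=10, min_samples_split=5)
train_preds = np.array([predict_one(tree, x) for x in X])
no_split = build_tree(X, y, min_samples_split=len(y) + 1)  # too few samples to split
print("Depth:", tree_depth(tree), "Training accuracy:", np.mean(train_preds == y))
```

Both parameters limit how far the tree can grow: a smaller `max_depth` or a larger `min_samples_split` produces a simpler tree that is less likely to overfit individual folds.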

In [10]:
CV_preds = crossvalidate.n_fold(features, labels, DTscratch, 10, 5, n_folds=10)
print("Decision Tree 10 fold CV accuracy:", datatools.accuracy(labels, CV_preds))

Fold 1 out of 10
Fold 2 out of 10
Fold 3 out of 10
Fold 4 out of 10
Fold 5 out of 10
Fold 6 out of 10
Fold 7 out of 10
Fold 8 out of 10
Fold 9 out of 10
Fold 10 out of 10
Decision Tree 10 fold CV accuracy: 70.57291666666666


Confusion matrix and classification report after using a Decision Tree:

In [11]:
print(confusion_matrix(labels, CV_preds))
print(classification_report(labels,CV_preds))

[[386 114]
 [112 156]]
              precision    recall  f1-score   support

           0       0.78      0.77      0.77       500
           1       0.58      0.58      0.58       268

    accuracy                           0.71       768
   macro avg       0.68      0.68      0.68       768
weighted avg       0.71      0.71      0.71       768



The Decision Tree achieves decent classification performance, but its classification of "No Diabetes" is slightly worse than that of the SVM with a linear kernel (F1-score of 0.77 versus 0.83).

## Conclusion
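Of the algorithms compared above, the SVM with a linear kernel has the best classification performance, with a 10-fold cross-validation accuracy of 77.1%, compared with 76.3% for the RBF-kernel SVM, 70.6% for the Decision Tree, and 65.1% for Naive Bayes.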