# Diabetes Prediction Using Machine Learning

#### This notebook demonstrates how to build a machine learning model that predicts whether a person will test positive or negative for diabetes based on several physical measurements. The data was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

Import necessary libraries

In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

## Step 1: Load dataset into a dataframe
Loading the CSV file into a pandas dataframe puts the data in a form suitable for further analysis.

In [6]:
df = pd.read_csv('diabetes_data.csv')

## Step 2: Quickly study data
Studying data before moving forward is always a good idea. While doing this, you should determine what information is going to be used for the predictions and what information might not be necessary.

In [7]:
df

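| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |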


The **describe** method provides summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column. This is useful for getting a general sense of how your data is distributed.

In [8]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


## Step 3: Clean data
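**Cleaning** your data means removing unnecessary information or duplicate values and fixing any problems with your dataset. In this case, we make no changes and use the data as-is. Note that the summary statistics above show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI; such values are physiologically implausible and likely stand in for missing measurements, so a more careful analysis would handle them before training.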
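
Zeros in columns such as Glucose or BMI (see the `min` row of the summary statistics above) can be counted and, if desired, replaced. The sketch below is illustrative only: `sample` is a small made-up dataframe, not the real file, and filling with the column median is just one common choice that this tutorial does not apply.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the real dataset
sample = pd.DataFrame({
    'Glucose': [148, 85, 0, 89, 137],
    'BloodPressure': [72, 66, 64, 0, 40],
    'BMI': [33.6, 26.6, 23.3, 28.1, 0.0],
})

# Count how many zeros each column contains
zero_counts = (sample == 0).sum()
print(zero_counts)

# Treat zeros as missing, then fill each column with its median
cleaned = sample.replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

Columns where zero is a legitimate value, such as Pregnancies, should be left out of this kind of replacement.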

## Step 4: Split data
Before splitting the data, we must determine what we want to predict. In this case, we want to predict the Outcome column, so we place it in the output series (y) and place the remaining columns in the input dataframe (X).

In [9]:
y = df['Outcome']
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [10]:
X = df.drop(columns = 'Outcome')
list(X.columns)

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

## Step 5: Create training and testing datasets
When working with machine learning, it is important to use one portion of your data to train the model and a separate portion to test it, so that the model is evaluated on data it has not seen. To do this, we must first decide how much of the data to use for training and how much to reserve for testing. A common ratio is **80%** of the data for training and **20%** for testing.
Using the **train_test_split** function will create four new variables: 
- X_train = input set used for training the model (80% of all inputs in original dataset)
- X_test = input set used for testing the model (20% of all inputs in original dataset)
- y_train = output set used for training the model (80% of all outputs in original dataset)
- y_test = output set used for measuring the accuracy of the model (20% of all outputs in original dataset)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print('Number of items in:')
print('- input training set:', X_train.shape)
print('- output training set:', y_train.shape)
print('- input testing set:', X_test.shape)
print('- output testing set:', y_test.shape)

Number of items in:
- input training set: (614, 8)
- output training set: (614,)
- input testing set: (154, 8)
- output testing set: (154,)
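
Because the two outcome classes are not equally common (the mean of Outcome in the summary statistics is about 0.35), a purely random split can leave the training and testing sets with slightly different class proportions. A minimal sketch of a stratified split follows, using the `stratify` parameter of `train_test_split`. The data here is synthetic (generated with `make_classification`) and only stands in for the real file; the names `X_demo`, `y_demo`, `X_tr`, and so on are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 768 rows, 8 features, roughly 65% / 35% classes
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=42
)

# stratify=y_demo keeps the class proportions nearly equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print('Positive rate in training set:', y_tr.mean())
print('Positive rate in testing set:', y_te.mean())
```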


## Step 6: Initialize machine learning model
Now it is time to create and initialize the machine learning model. Before doing so, it is important to choose which model is most fitting for the dataset being used. For this example, we will be using the **DecisionTreeClassifier** model.

In [15]:
#Create the model
model = DecisionTreeClassifier()

#Train the model
model.fit(X_train, y_train)

#Get predictions from the model
predictions = model.predict(X_test)
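
A `DecisionTreeClassifier` created without a `random_state` can build slightly different trees from run to run, so the scores reported in this notebook may not reproduce exactly. The sketch below, on synthetic stand-in data, shows that fixing the seed makes two separately trained trees agree; the seed value 0 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Two trees built with the same seed make the same predictions
tree_a = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
same = (tree_a.predict(X_te) == tree_b.predict(X_te)).all()
print('Identical predictions:', same)
```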

## Step 7: Compare predictions with actual values
Now that we have trained and tested the model, we can compare the predictions our model produced with the actual values stored in y_test.

In [263]:
print('Predictions:')
print(predictions[:10])
print('\nActual values:')
print(list(y_test)[:10])

Predictions:
[0 0 0 0 0 1 0 0 0 0]

Actual values:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


We can calculate an accuracy score, which tells us what fraction of the test patients the model classified correctly. This model was about **76.6%** accurate on the test set, compared with about 64.3% for always predicting the majority class (99 of the 154 test patients are non-diabetic).

In [268]:
score = accuracy_score(y_test, predictions)
print('Accuracy: ' + str(score*100) + '%')

Accuracy: 76.62337662337663%


We can also create a confusion matrix and a classification report to see where our model may have faltered.

In [275]:
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, predictions))

print('\nClassification Report:\n')
print(classification_report(y_test, predictions))

Confusion Matrix:

[[78 21]
 [15 40]]

Classification Report:

              precision    recall  f1-score   support

           0       0.84      0.79      0.81        99
           1       0.66      0.73      0.69        55

    accuracy                           0.77       154
   macro avg       0.75      0.76      0.75       154
weighted avg       0.77      0.77      0.77       154
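
To see how the report relates to the confusion matrix, the figures for class 1 (diabetic) can be recomputed by hand. The sketch below copies the matrix from the output above; in scikit-learn's layout, rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[78, 21],
               [15, 40]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```

The results, 0.77, 0.66, and 0.73, match the report. The 15 false negatives are diabetic patients the model labeled as non-diabetic, which is typically the more costly kind of error in a screening setting.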



**Cross validation** is another way to measure the accuracy of the model. It splits the data into several folds, trains the model on all but one fold, tests it on the held-out fold, and repeats until each fold has been used for testing once; the reported score is the average across folds. This gives a better estimate of the model's overall accuracy because it does not depend on a single train/test split.

In [21]:
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Average accuracy (10 fold cross validation):', str(np.mean(scores) * 100) + '%')

Average accuracy (10 fold cross validation): 71.35167464114834%
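
What `cross_val_score` does internally can be written out as an explicit loop. The sketch below uses synthetic stand-in data and a fixed `random_state` so that the manual loop and `cross_val_score` can be compared directly; for a classifier with an integer `cv`, scikit-learn uses `StratifiedKFold` without shuffling, which the loop mirrors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Train on nine folds, test on the tenth, and rotate
folds = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    fold_predictions = tree.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], fold_predictions))

# The built-in helper should give the same per-fold scores
reference = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, scoring='accuracy', cv=10
)

print('Mean accuracy (manual loop):', np.mean(fold_scores))
print('Matches cross_val_score:', np.allclose(fold_scores, reference))
```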


## Step 8: Test with other ML models
It is often a good idea to try other ML models to see which performs best on this particular dataset. Below are the results from several other models, each trained and tested on the same split with default settings.

In [293]:
models = [RandomForestClassifier(), GaussianNB(), MLPClassifier(), KNeighborsClassifier(), SVC()]

for i, model in enumerate(models):
    #Train the model and get its predictions on the test set
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    #Separate results with a blank line
    if i > 0:
        print()
    print(type(model).__name__ + ' Accuracy: ' + str(accuracy_score(y_test, predictions)*100) + '%')

RandomForestClassifier Accuracy: 74.02597402597402%

GaussianNB Accuracy: 76.62337662337663%

MLPClassifier Accuracy: 65.5844155844156%

KNeighborsClassifier Accuracy: 66.23376623376623%

SVC Accuracy: 76.62337662337663%
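
KNeighborsClassifier, SVC, and MLPClassifier are sensitive to feature scale, and the columns in this dataset span very different ranges (Insulin reaches 846, while DiabetesPedigreeFunction stays below 3). One option, not used for the results above, is to standardize the features inside a pipeline. The sketch below uses synthetic data with one column deliberately rescaled; whether scaling improves accuracy on the real dataset is not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale for illustration

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same classifier, with and without standardization
plain = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

print('Without scaling:', plain.score(X_te, y_te))
print('With scaling:', scaled.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into training.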


## Step 9: Enter your own test values
A common goal of machine learning is to use the trained model in a real-world setting. A model similar to the one we have created here could be used in the medical field to help screen patients for diabetes. Below is an example of how you can enter your own values to see whether the model predicts that a patient is diabetic.

In [321]:
#input() returns strings, so convert each value to a number
preg = float(input('Enter the number of pregnancies the patient has had: '))
gluc = float(input('Enter the glucose level of the patient: '))
bp = float(input('Enter the blood pressure of the patient: '))
st = float(input('Enter the skin thickness of the patient: '))
ins = float(input('Enter the insulin level of the patient: '))
bmi = float(input('Enter the BMI of the patient: '))
dpf = float(input('Enter the Diabetes Pedigree Function of the patient: '))
age = float(input('Enter the age of the patient: '))

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

#Build a one-row dataframe so the feature names match the training data
patient = pd.DataFrame([[preg, gluc, bp, st, ins, bmi, dpf, age]], columns = X.columns)
predictions = model.predict(patient)
if predictions[0] == 0:
    print('This patient is not diabetic.')
else:
    print('This patient is diabetic.')

Enter the number of pregnancies the patient has had: 4
Enter the glucose level of the patient: 156
Enter the blood pressure of the patient: 78
Enter the skin thickness of the patient: 37
Enter the insulin level of the patient: 0
Enter the BMI of the patient: 43
Enter the Diabetes Pedigree Function of the patient: .8
Enter the age of the patient: 60
This patient is diabetic.
