# Diabetes Prediction Machine Learning Model
This project uses the PIMA Diabetes Dataset from Kaggle (download: https://www.dropbox.com/s/uh7o7uyeghqkhoy/diabetes.csv?dl=0) to predict whether a patient has diabetes from eight health-related variables. The dataset contains only female patients of Pima Indian heritage who are at least 21 years old.

## 1. Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## 2. Import the dataset (Data Collection and Analysis)

In [2]:
#load the diabetes dataset into a pandas DataFrame
## the r prefix makes the path a raw string, so the backslashes in the Windows path are not treated as escape characters
diabetes_dataset = pd.read_csv(r"C:\Users\haley\OneDrive\Documents\GitHub Projects\Diabetes Predition ML\diabetes.csv")

In [3]:
#print the first 5 rows of the dataset
## for 20 rows use diabetes_dataset.head(20)
diabetes_dataset.head()

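|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |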


In [4]:
#number of rows and columns in this dataset
diabetes_dataset.shape

(768, 9)

In [5]:
#statistical measures of data
diabetes_dataset.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
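
The `min` row shows 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A living patient cannot have a glucose level, blood pressure or BMI of zero, so these zeros most likely stand for missing measurements. Below is a minimal sketch of how one might count them, using the five rows printed by `head()` as stand-in data. The list `zero_as_missing` is an assumption made for illustration and is not part of the original notebook.

```python
import pandas as pd

# Stand-in data: the five rows printed by diabetes_dataset.head() above.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a 0 in these columns means "not measured", not a real reading.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros in each of those columns.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

The same expression can be run on `diabetes_dataset` to see the counts for all 768 rows. Whether to impute or drop those rows is a separate modelling decision.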


In [6]:
#count of non-diabetic (0) and diabetic (1) cases. The class label is the Outcome column.
diabetes_dataset['Outcome'].value_counts()

# 0 = non-diabetic, 1 = diabetic

0    500
1    268
Name: Outcome, dtype: int64
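
The classes are imbalanced, which matters when reading accuracy scores later: a model that always answers "non-diabetic" would already be right for every one of the 500 negative cases. The sketch below computes that majority-class baseline from the counts above. The names `counts` and `baseline_accuracy` are illustrative.

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the real DataFrame, `diabetes_dataset['Outcome'].value_counts(normalize=True)` gives the same proportions directly.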

In [7]:
#statistics per outcome
diabetes_dataset.groupby('Outcome').mean()

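###The group means show that diabetic patients (Outcome = 1) have higher average Glucose (141.3 vs 110.0), Insulin (100.3 vs 68.8), BMI (35.1 vs 30.3) and Age (37.1 vs 31.2) than non-diabetic patients.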

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [8]:
#separate the data and labels
X = diabetes_dataset.drop(columns='Outcome')    #### DataFrame with only the feature columns. columns= already names what to drop, so axis=1 is not needed.
Y = diabetes_dataset['Outcome']    #### Series with only the Outcome labels

##print(X)
##print(Y)

## 3. Data Standardization

In [9]:
#Standardize the data. Each feature is rescaled to mean 0 and standard deviation 1. The values are not limited to the range -1 to 1.
scaler = StandardScaler()

# only do this for the input variables in X
scaler.fit(X)

# transform
standardized_data = scaler.transform(X)

In [10]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
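
To make the transformation concrete: `StandardScaler` replaces each value x with (x - mean) / std, column by column, using the population standard deviation. For example, the first Pregnancies value is 6; with the mean of 3.845 from `describe()` and a population standard deviation of about 3.367, the result is about 0.640, which matches the first entry printed above. The sketch below checks the formula on a small stand-in array taken from the `head()` output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data: the Pregnancies and Glucose values from the head() output.
data = np.array([[6, 148], [1, 85], [8, 183], [1, 89], [0, 137]], dtype=float)

# Manual z-score: subtract the column mean, then divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))      # do both approaches agree?
print(scaled.mean(axis=0).round(6))     # column means after scaling
print(scaled.std(axis=0).round(6))      # column standard deviations after scaling
print(scaled.min(), scaled.max())       # values can fall outside -1 to 1
```

Note that `describe()` reports the sample standard deviation (ddof=1), so its `std` row is slightly larger than the value the scaler divides by.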


In [11]:
# same X and Y as above, except that X now holds the standardized features
X = standardized_data
Y = diabetes_dataset['Outcome']

print(X)
print(Y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


## 4. Train/Test Split

In [12]:
#split the data into 4 variables: train and test features, train and test labels
## X and Y are the features and labels defined above that need to be split
## test_size is the fraction of data held out for testing. 0.2 means 20% of the rows go to the test set.
## stratify=Y keeps the ratio of 0s to 1s (about 65% to 35%) the same in the train and test sets.
## random_state seeds the random shuffle. Any integer works, and reusing the same value reproduces the same split, so different models can be compared on identical data.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)

In [13]:
# use .shape to count how many rows and columns are in the newly created variables.
print(X.shape, X_train.shape, X_test.shape)
#### (768, 8) the total, 80% in train (614, 8), 20% in test (154, 8)

(768, 8) (614, 8) (154, 8)
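
The effect of `stratify` can be checked by comparing the share of positive labels in each split. The sketch below uses stand-in labels with the same class counts as the Outcome column (500 zeros, 268 ones); the `features` array is only a placeholder so that the call has something to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the Outcome column.
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratification the share of 1s is nearly identical in every set.
print("All data :", round(labels.mean(), 3))
print("Train set:", round(y_tr.mean(), 3))
print("Test set :", round(y_te.mean(), 3))
```

In the notebook the same check is `Y_train.mean()` and `Y_test.mean()`, since the labels are 0 and 1.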


## 5. Train the model

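There are several types of SVM kernels (linear, polynomial, RBF, sigmoid, etc.). The sigmoid kernel is the one based on the hyperbolic tangent. A linear kernel is a good fit when the classes can be separated, at least approximately, by a straight line or hyperplane. A single SVM separates two classes, which matches the binary Outcome here; scikit-learn's `SVC` also handles more than two classes by combining several binary classifiers in a one-vs-one scheme.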

In [14]:
# create the SVM classifier (support vector classifier with a linear kernel)
classifier = svm.SVC(kernel='linear')

In [15]:
#training the support vector machine classifier
classifier.fit(X_train, Y_train)

SVC(kernel='linear')
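
The notebook fits the scaler on all 768 rows before splitting, so the test rows influence the scaling statistics. A common alternative is to fit the scaler on the training split only, which a scikit-learn pipeline does automatically. Below is a minimal sketch on synthetic data. The data from `make_classification` and the name `model` are placeholders, not the notebook's own objects, and the accuracy it prints says nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8 features and the binary Outcome label.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# fit() learns the scaling from the training rows only, then trains the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

# score() applies the training-set scaling to the test rows before predicting.
test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

With this layout, new patient data can be passed to `model.predict` unscaled, because the pipeline applies the stored scaling itself.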

## 6. Model Evaluation

### Accuracy Score

Training Data

In [16]:
#accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)    # argument order is (y_true, y_pred)

In [17]:
print('Accuracy Score of Training Data: ', training_data_accuracy)
#predicts the correct outcome 78.66% of the time

Accuracy Score of Training Data:  0.7866449511400652


Test Data

In [18]:
#accuracy score on the test data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)    # argument order is (y_true, y_pred)

In [19]:
print('Accuracy Score of Test Data: ', test_data_accuracy)
#predicts the correct outcome 77.27% of the time

Accuracy Score of Test Data:  0.7727272727272727
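
Accuracy is a single summary figure. Because about 65% of the patients are non-diabetic, it can hide how many diabetic patients the model misses. A confusion matrix separates the two kinds of error. The sketch below uses toy labels purely for illustration; they are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (not the notebook's predictions).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall: share of truly diabetic patients that the model flags.
recall = recall_score(y_true, y_pred)
# Precision: share of "diabetic" predictions that are correct.
precision = precision_score(y_true, y_pred)

print(cm)
print("Recall:", recall)
print("Precision:", precision)
```

In the notebook the equivalent call would be `confusion_matrix(Y_test, X_test_prediction)`.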


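The training accuracy is 78.66% and the test accuracy is slightly lower at 77.27%. The gap of about 1.4 percentage points is small (roughly two of the 154 test patients), so the model predicts unseen data almost as well as the data it was trained on. This does not indicate meaningful overfitting.

Overfitting is associated with high variance (the model fits noise and quirks of the training set, so its predictions change a lot from one training sample to another) and low bias (small error on the training data). It is more likely when the model is complex relative to the sample size, or when the data contains a lot of noise and irrelevant or inaccurate features.

As a rule of thumb, a training accuracy far above the test accuracy indicates overfitting.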
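
A single train/test split gives only one pair of accuracy figures, and a difference of a point or two can come from the particular split. Cross-validation repeats the split several times and shows how much the score varies. Below is a minimal sketch on synthetic data; `make_classification` and the name `model` are placeholders, so the printed scores do not describe the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8 features and the binary Outcome label.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

# Scaling sits inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, 5 times.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean:", round(scores.mean(), 3), "Std:", round(scores.std(), 3))
```

If the spread across folds is larger than the gap between training and test accuracy, that gap alone is weak evidence of overfitting.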

## 7. Test the ML model with new patient data
Input a new patient's data and run it through the machine learning model.

### New Patient 1

In [29]:
#The data below is for 1 new patient. The values correspond, in order, to the number of pregnancies, glucose, etc.
input_data = (4, 110, 92, 0, 0, 37.6, 0.191, 30)    #This is a tuple

#change from tuple to numpy array
input_data_array = np.asarray(input_data)

#Reshape the array because we are predicting for only one instance. The model expects a 2D array of shape (number of instances, number of features).
input_data_reshaped = input_data_array.reshape(1, -1) #1 means one row (one instance). -1 tells NumPy to infer the number of columns (8 features).

#standardize the input data
standardized_data = scaler.transform(input_data_reshaped)
print("Scaled data: ",standardized_data)

#the classifier makes the actual prediction
prediction = classifier.predict(standardized_data)
print(prediction)



#reformatting result based on outcome
if (prediction[0] == 0):
    print('The patient is NOT diabetic.')
else:
    print('The patient is diabetic.')
        

Scaled data:  [[ 0.04601433 -0.34096773  1.18359575 -1.28821221 -0.69289057  0.71168975
  -0.84827977 -0.27575966]]
[0]
The patient is NOT diabetic.
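
The `reshape(1, -1)` step is easiest to follow by looking at the array shapes before and after. A short sketch using the same patient values (the names `patient` and `patient_row` are illustrative):

```python
import numpy as np

# One patient: 8 feature values in the same column order as the training data.
patient = np.asarray((4, 110, 92, 0, 0, 37.6, 0.191, 30))
print(patient.shape)      # 1D array with 8 elements

# reshape(1, -1): 1 row, and -1 lets NumPy infer the column count (8 here).
patient_row = patient.reshape(1, -1)
print(patient_row.shape)  # 2D array with 1 row and 8 columns
```

scikit-learn estimators expect one row per instance, so a single patient has to be passed as a 1-row, 2D array.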




### New Patient 2

In [34]:
input_data2 = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

input_data_array2 = np.asarray(input_data2)


input_data_reshaped2 = input_data_array2.reshape(1, -1)

standardized_data2 = scaler.transform(input_data_reshaped2)
print("Scaled data: ",standardized_data2)


prediction2 = classifier.predict(standardized_data2)
print(prediction2)


if (prediction2[0] == 0):
    print('The patient is NOT diabetic.')
else:
    print('The patient is diabetic.')
        

Scaled data:  [[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
The patient is diabetic.
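
The two patient cells repeat the same steps: convert to an array, reshape, scale, predict, print. One way to avoid the repetition is a small helper function. The sketch below is hypothetical; `predict_patient` is not part of the original notebook, and the tiny two-feature training set exists only so that the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_patient(values, scaler, classifier):
    """Scale one patient's feature values and return a readable prediction."""
    row = np.asarray(values, dtype=float).reshape(1, -1)  # 1 row, n columns
    label = classifier.predict(scaler.transform(row))[0]
    if label == 1:
        return "The patient is diabetic."
    return "The patient is NOT diabetic."


# Tiny made-up training set (2 features) so the sketch is self-contained.
X_toy = np.array([[80, 22.0], [90, 24.0], [160, 35.0], [170, 38.0]])
y_toy = np.array([0, 0, 1, 1])

toy_scaler = StandardScaler().fit(X_toy)
toy_classifier = SVC(kernel="linear").fit(toy_scaler.transform(X_toy), y_toy)

result_low = predict_patient((85, 23.0), toy_scaler, toy_classifier)
result_high = predict_patient((165, 36.0), toy_scaler, toy_classifier)
print(result_low)
print(result_high)
```

In the notebook, the call would be `predict_patient(input_data, scaler, classifier)` with the fitted `scaler` and `classifier` from the earlier cells.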


