# Diabetes Prediction using Python and Machine Learning

In this project, we will build a Machine Learning model based on the Support Vector Machine (SVM) technique to predict whether a given (female) patient has diabetes, based on her clinical data.

### Importing the libraries

First of all, we will import the libraries needed for our project:

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

### Data Collection and Analysis

Then, we will load the Diabetes Dataset into a pandas DataFrame:

In [2]:
diabetes_dataset = pd.read_csv('diabetes.csv')
diabetes_dataset.head(5)

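|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |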


In [3]:
diabetes_dataset.shape

(768, 9)

Now, we are going to obtain statistical data about the dataset:

In [4]:
diabetes_dataset.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


Then, we are going to analyze the distribution of the outcomes:

    0 = Non-Diabetic
    1 = Diabetic

In [5]:
diabetes_dataset['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
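
The counts above are easier to compare as proportions. The following is a minimal sketch that rebuilds a label column with the same class counts (500 and 268) instead of reading the CSV file, so the name `outcome` is only an illustration:

```python
import pandas as pd

# Rebuild a label column with the class counts shown above (illustration only)
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# normalize=True returns the fraction of rows in each class instead of the count
proportions = outcome.value_counts(normalize=True)
print(proportions)
```

Roughly 65% of the patients are non-diabetic and 35% are diabetic, so the classes are moderately imbalanced. The diabetic fraction matches the mean of the Outcome column reported by describe().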

Now, we are going to obtain the mean value of all attributes of the dataset, grouping by the different outcomes.

In [6]:
diabetes_dataset.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


By analysing the mean of the attributes obtained, we can see that several variables, such as Age, Glucose and Insulin, have noticeably higher mean values in the diabetic group, which suggests that they are related to the outcome.
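
A difference in group means is only a hint. One way to quantify the relationship is the Pearson correlation of each attribute with Outcome. The sketch below uses only the five rows displayed by head() earlier, so the numbers it prints are not representative of the full 768-row dataset; with the real data, we would call `corr()` on `diabetes_dataset` instead of the stand-in `sample`:

```python
import pandas as pd

# The five rows displayed by head() above; the full dataset has 768 rows
sample = pd.DataFrame({
    'Pregnancies': [6, 1, 8, 1, 0],
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
    'DiabetesPedigreeFunction': [0.627, 0.351, 0.672, 0.167, 2.288],
    'Age': [50, 31, 32, 21, 33],
    'Outcome': [1, 0, 1, 0, 1],
})

# Pearson correlation of every attribute with the Outcome column
correlations = sample.corr()['Outcome'].drop('Outcome')
print(correlations.sort_values(ascending=False))
```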

Now, we are going to separate the features from the target (the Outcome column) and print both:

In [7]:
X = diabetes_dataset.drop(columns='Outcome')  # Features: every column except Outcome
Y = diabetes_dataset['Outcome']  # Target: 0 = Non-Diabetic, 1 = Diabetic
print(X)
print(Y)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


### Data Standardization

As the different variables have different scales, we need to standardize them so that no single variable dominates the SVM model just because it has a larger range:

In [8]:
scaler = StandardScaler()
X = np.asarray(X)
scaler.fit(X)
standardized_data = scaler.transform(X)
X = standardized_data
X

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])

Now, all of the features are on a similar scale.
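
To see what the standardization does, it can be reproduced by hand: each column has its mean subtracted and is divided by its standard deviation. The sketch below applies this to the Glucose and BMI values of the five patients shown earlier; the variable names are only illustrative:

```python
import numpy as np

# Glucose and BMI of the five patients displayed by head() earlier
data = np.array([
    [148, 33.6],
    [85, 26.6],
    [183, 23.3],
    [89, 28.1],
    [137, 43.1],
])

mean = data.mean(axis=0)  # per-column mean
std = data.std(axis=0)    # per-column population standard deviation (ddof=0), as StandardScaler uses

standardized = (data - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```

After the transformation, both columns have mean 0 and standard deviation 1, regardless of their original units.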

### Train Test Split

In this part, we need to split our data into training and test sets for our model.

The test set will contain 20% of the original dataset and the training set will contain the other 80%.

The stratify parameter keeps the proportion of each outcome in both sets the same as in the full dataset.

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state = 2)
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)
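
The effect of stratify can be checked by comparing the share of diabetic patients before and after the split. This is a small sketch on stand-in labels with the same class counts as our dataset (500 and 268); the `features` array is a placeholder and carries no clinical meaning:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 non-diabetic, 268 diabetic
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# The mean of a 0/1 array is the fraction of 1s (diabetic patients)
print(labels.mean(), y_train.mean(), y_test.mean())
```

The three fractions should be almost identical. Without stratify, the test set could by chance end up with noticeably more or fewer diabetic patients than the training set.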


### Training the model

Now we are going to train the SVM model with the training data obtained in the last section:

In [10]:
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)
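
In this notebook the scaler is fitted on the whole dataset before the split, so the test rows contribute to the scaling statistics. A common variation, sketched here and not the approach used above, is to put the scaler and the classifier in a scikit-learn pipeline so that the scaler only sees the training rows. The data is a synthetic stand-in generated with `make_classification`, and all names ending in `_demo`, `_tr` or `_te` are assumptions for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 768 samples, 8 features (assumption)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# fit() fits the scaler on the training rows only, then trains the SVM on the scaled rows
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_tr, y_tr)

# score() scales the test rows with the training statistics and returns the accuracy
print('Test accuracy:', model.score(X_te, y_te))
```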

### Model Evaluation

Now, we will evaluate our trained model:

In [11]:
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # accuracy_score expects (y_true, y_pred)
print('Accuracy score of the training data: ', training_data_accuracy)

Accuracy score of the training data:  0.7866449511400652


With this result we can see that the accuracy of the model on the training data is about 78.7%, which is good but leaves room for improvement.

Now, we are going to examine the accuracy score on a set of data that has not been used yet, the test data:

In [12]:
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # accuracy_score expects (y_true, y_pred)
print('Accuracy score of the test data: ', test_data_accuracy)

Accuracy score of the test data:  0.7727272727272727


Now, we can see that the accuracy of the model on the test data is about 77.3%. This is close to the training accuracy, which suggests that the model is not overfitting.
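
Because the two outcomes are not equally frequent (500 against 268), accuracy alone can hide how well the diabetic cases are detected. A confusion matrix and the recall of the diabetic class give a more detailed view. The labels below are hand-made for illustration and are not predictions of our model; with the real data we would pass `Y_test` and `X_test_prediction` instead:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made labels for illustration: 6 non-diabetic and 4 diabetic patients
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = accuracy_score(y_true, y_pred)
recall_diabetic = recall_score(y_true, y_pred)  # share of diabetic patients that were detected
print(accuracy, recall_diabetic)
```

In this toy case the accuracy is 70%, but only half of the diabetic patients are detected, which is the kind of detail the accuracy score does not show.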

### Making a Predictive System

Now, we are going to make a system that can predict whether a given patient has diabetes or not, based on the model we created.

For this, we are going to use the data of one patient whose Outcome we already know.

Example: 1,100,66,29,196,32,0.444,42,0

The 0 at the end shows us that this patient does not have diabetes. This value is the label, so it is not passed to our predictive system.

In [24]:
input_data = (1,100,66,29,196,32,0.444,42)
input_data_array = np.asarray(input_data) # Changing input data to array form
input_data_reshaped = input_data_array.reshape(1, -1) # Reshaping data as we are predicting for only one case
standardized_data = scaler.transform(input_data_reshaped) # Standardizing the data to coincide with our model
prediction = classifier.predict(standardized_data) # Predicting the Outcome
if prediction[0] == 0:
    print("Patient is not Diabetic")
else:
    print("Patient is Diabetic")

Patient is not Diabetic
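
The steps of the cell above can be wrapped in a function so that they are easy to reuse for other patients. This is a minimal sketch: `predict_patient` is a hypothetical helper name, and the scaler and classifier are fitted here on only the five rows shown by head() so that the example runs on its own. In the notebook, we would pass the `scaler` and `classifier` objects created earlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def predict_patient(scaler, classifier, input_data):
    """Return a readable prediction for one patient's eight feature values."""
    # Reshape to a single row, as we are predicting for only one case
    features = np.asarray(input_data, dtype=float).reshape(1, -1)
    prediction = classifier.predict(scaler.transform(features))
    return "Patient is not Diabetic" if prediction[0] == 0 else "Patient is Diabetic"

# Tiny stand-in training set: the five rows shown by head() (illustration only)
X_demo = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])
y_demo = np.array([1, 0, 1, 0, 1])

demo_scaler = StandardScaler().fit(X_demo)
demo_classifier = SVC(kernel='linear').fit(demo_scaler.transform(X_demo), y_demo)

result = predict_patient(demo_scaler, demo_classifier, (1, 100, 66, 29, 196, 32, 0.444, 42))
print(result)
```

A model trained on five rows is not meaningful, so the printed result here should not be compared with the prediction of the real model.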
