# Diabetes Prediction Using SVM

I explore a diabetes prediction algorithm using a [Diabetes dataset](https://www.dropbox.com/s/uh7o7uyeghqkhoy/diabetes.csv?dl=0). Using a Support Vector Machine (SVM), I aim to predict whether an individual has diabetes based on the attributes in the dataset's columns.
> This dataset is taken from [Kaggle](https://www.kaggle.com/) and describes a single, specific population, which may make the results hard to generalize.

>> "This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage."

## Overview

With diabetes becoming a growing health issue, I would like to predict, to a reasonable accuracy, whether an individual would be diagnosed with diabetes given certain health attributes. This project aims to predict whether a person has diabetes based on the following attributes:

- Pregnancies
- Glucose
- Blood Pressure
- Skin Thickness
- Insulin Level
- BMI
- Diabetes Pedigree
- Age 

## Workflow

For the purposes of this notebook, I will use a subset of the Diabetes dataset, available [here](https://www.dropbox.com/s/uh7o7uyeghqkhoy/diabetes.csv?dl=0).

1. Collect the diabetes data and understand the contents of the dataset
2. Preprocess the data
3. Split the data into training and test sets
4. Train an SVM model once the data subset is judged satisfactory

## Install Modules

In [2]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pandas



In [4]:
pip install scikit-learn



## Import Dependencies

In [5]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## Data Collection and Processing

In [9]:
# Loading the dataset to a pandas dataframe
diabetes_data = pd.read_csv('/Users/jeffshen/Desktop/Diabetes_Prediction/diabetes.csv')

In [10]:
# Display first ten rows of dataset
diabetes_data.head(10)

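| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |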


In [13]:
# Number of rows and columns
diabetes_data.shape

(768, 9)

In [14]:
# Statistically describe the dataset
diabetes_data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
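
The minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI in the summary above is physiologically implausible, so these zeros probably stand in for missing measurements. That reading is an assumption; the dataset description quoted earlier does not state it. A minimal sketch for counting such zeros, using only the ten rows displayed by head(10) (the names `sample` and `suspect_columns` are illustrative, not from the notebook):

```python
import pandas as pd

# The ten rows shown by diabetes_data.head(10), re-entered by hand so the
# example runs without the CSV file.
sample = pd.DataFrame({
    "Glucose":       [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin":       [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Assumption: a zero in any of these columns marks a missing measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros in each suspect column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

The same expression can be applied to diabetes_data itself to see how many rows are affected in the full dataset.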


In [22]:
diabetes_data["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
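
The classes are imbalanced (500 non-diabetic versus 268 diabetic), so a model that always predicts 0 already reaches a nontrivial accuracy. The arithmetic below is a small sketch of that majority-class baseline, which is a useful reference point for any accuracy score on this data; the variable names are illustrative.

```python
# Class counts reported by value_counts() above.
non_diabetic = 500
diabetic = 268
total = non_diabetic + diabetic

# Accuracy obtained by always predicting the majority class (0).
baseline_accuracy = non_diabetic / total
print(f"Majority-class baseline: {baseline_accuracy:.3f}")  # 0.651
```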

In the Outcome column, 1 marks a patient with diabetes and 0 marks a non-diabetic patient.

In [23]:
# Mean of each feature, grouped by outcome
diabetes_data.groupby("Outcome").mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [24]:
# Separate the features (x) from the label (y)
x = diabetes_data.drop(columns = "Outcome")
y = diabetes_data["Outcome"]

In [25]:
print(x,y)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


## Data Standardization

In [26]:
scaler = StandardScaler()

In [28]:
scaler.fit(x)
sdata = scaler.transform(x)
print(sdata)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
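
StandardScaler rescales each column to zero mean and unit variance using z = (x - mean) / std, where std is the population standard deviation (dividing by n). The describe() table reports the sample standard deviation (dividing by n - 1), so a small correction is needed to compare the two. The sketch below reproduces the first standardized value printed above from the Pregnancies statistics in that table; the variable names are illustrative.

```python
import math

# Pregnancies column: values taken from the describe() output.
n = 768
mean = 3.845052
sample_std = 3.369578  # pandas describe() uses ddof=1

# Convert to the population standard deviation (ddof=0) used by StandardScaler.
population_std = sample_std * math.sqrt((n - 1) / n)

# The first row has Pregnancies = 6.
z = (6 - mean) / population_std
print(round(z, 4))  # 0.6399
```

The result agrees with the 0.63994726 shown in the first entry of the scaled array.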


In [30]:
x = sdata
print(x)
print(y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


## Train-Test Split

In [31]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 2)

## Train SVM Model

In [32]:
classifier = svm.SVC(kernel = "linear")

In [49]:
# Fit the SVM classifier to the training data
classifier.fit(x_train, y_train)

SVC(kernel='linear')
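
In the cells above, the scaler is fitted on all 768 rows before the train-test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to fit the scaler on the training split only. The sketch below shows that variant with a scikit-learn pipeline; it is not the method used in this notebook, and the synthetic data from make_classification and the variable names are assumptions made so the example runs without the CSV file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (8 columns, binary label).
X, y = make_classification(n_samples=768, n_features=8, random_state=2)

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# The pipeline fits StandardScaler on the training rows only, then the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

# score() applies the training-set scaling to the test rows before predicting.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

Because the data here is synthetic, the printed accuracy says nothing about the diabetes dataset; only the order of operations is the point.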

## Model Evaluation 
### Accuracy Score

In [35]:
# Accuracy on the training data (accuracy_score expects y_true first, then y_pred)
x_train_prediction = classifier.predict(x_train)
train_accuracy = accuracy_score(y_train, x_train_prediction)
print("Accuracy score: ", train_accuracy)

Accuracy score:  0.7866449511400652


In [36]:
# Accuracy on the test data (accuracy_score expects y_true first, then y_pred)
x_test_prediction = classifier.predict(x_test)
test_accuracy = accuracy_score(y_test, x_test_prediction)
print("Accuracy score of test data: ", test_accuracy)

Accuracy score of test data:  0.7727272727272727
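
Accuracy is the fraction of predictions that match the true labels. With test_size = 0.2 on 768 rows, train_test_split holds out 154 rows and trains on 614, so the printed scores are consistent with 483 of 614 training rows and 119 of 154 test rows being classified correctly. These counts are inferred from the printed scores; the notebook does not print them. A quick check of the arithmetic:

```python
import math

n_total = 768
n_test = math.ceil(0.2 * n_total)  # train_test_split rounds the test size up
n_train = n_total - n_test

# Correct-prediction counts inferred from the printed accuracy scores.
train_correct = 483
test_correct = 119

train_accuracy = train_correct / n_train
test_accuracy = test_correct / n_test
print(train_accuracy, test_accuracy)
```

The gap between training and test accuracy is small, which suggests the linear SVM is not overfitting heavily on this split.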


## Predictive System for Diabetes

In [51]:
data = (6,148,72,35,0,33.6,0.627,50)

# convert input to numpy array
input_array = np.asarray(data)

# reshape numpy array to predict for one instance
input_reshape = input_array.reshape(1,-1)

standard_data = scaler.transform(input_reshape)
prediction = classifier.predict(standard_data)

if prediction[0] == 1:
    print("Has Diabetes")
else:
    print("Does Not Have Diabetes")


Has Diabetes
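
The prediction steps above can be wrapped in a small helper so that every new input is scaled with the same fitted scaler before classification. This is a sketch under assumptions: the function name `predict_diabetes` is hypothetical, and the four-row toy training set (copied from the first rows of the table shown earlier) exists only to make the example runnable, so its predictions carry no meaning.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_diabetes(features, scaler, classifier):
    """Return a label string for one patient's feature values."""
    # Reshape to (1, n_features): one sample, many features.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    # Reuse the already-fitted scaler; never refit it on the new input.
    scaled = scaler.transform(row)
    label = classifier.predict(scaled)[0]
    return "Has Diabetes" if label == 1 else "Does Not Have Diabetes"


# Toy training data (assumption): the first four rows of the dataset.
X_toy = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
])
y_toy = np.array([1, 0, 1, 0])

toy_scaler = StandardScaler().fit(X_toy)
toy_classifier = SVC(kernel="linear").fit(toy_scaler.transform(X_toy), y_toy)

result = predict_diabetes(
    (6, 148, 72, 35, 0, 33.6, 0.627, 50), toy_scaler, toy_classifier
)
print(result)
```

In the notebook itself, the same function could be called with the fitted scaler and classifier objects in place of the toy ones.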


