**DIABETES PREDICTION USING SVM (SUPERVISED MACHINE LEARNING ALGORITHM)**

In supervised machine learning (here, classification), we feed the data and its corresponding labels to the ML model, and the model learns the relationship between them.
SVM stands for Support Vector Machine.
Once the data is fed to the SVM model, the model treats each record as a point in feature space and finds a hyperplane that separates the two labels.
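To make the idea of a separating hyperplane concrete, here is a minimal sketch on a toy two-feature dataset. The points and the names `X_toy`, `y_toy` and `toy_clf` are made up for illustration and do not come from diabetes.csv. For a linear kernel, scikit-learn exposes the hyperplane as `coef_` (the weights w) and `intercept_` (the offset b), and the predicted label depends on the sign of w·x + b.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable 2-D points (hypothetical values, not from diabetes.csv)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 5.0], [7.0, 7.0], [6.5, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(X_toy, y_toy)

# With a linear kernel the hyperplane is w . x + b = 0
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]

# Points with a positive score fall on the side of label 1
scores = X_toy @ w + b
manual_labels = (scores > 0).astype(int)

print("w =", w, "b =", b)
print("labels from the sign of w.x + b:", manual_labels)
print("labels from predict():          ", toy_clf.predict(X_toy))
```

The two printed label arrays should agree, which shows that `predict()` on a linear SVM is just a check of which side of the hyperplane a point lies on.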

1. Data preprocessing: analyze the data and make it suitable for the model (standardize the features so that they all lie on a comparable scale).
2. Train-test split: split the data into training and testing sets, train the model on the training set, and measure its accuracy on the testing set.

In [None]:
#importing libraries
import numpy as np #for numpy arrays
import pandas as pd #for pandas dataframe
from sklearn.preprocessing import StandardScaler #to standardize the data
from sklearn.model_selection import train_test_split #to split the data
from sklearn import svm #support vector machine models
from sklearn.metrics import accuracy_score #to evaluate the predictions

In [None]:
#loading the diabetes dataset to pandas dataframe
diabetes_dataset = pd.read_csv("diabetes.csv")

In [None]:
diabetes_dataset.head()
#Outcome = 1 means the patient is diabetic; Outcome = 0 means non-diabetic

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
diabetes_dataset.shape

(768, 9)

In [None]:
#diabetes_dataset.isna()
diabetes_dataset.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [None]:
diabetes_dataset.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [None]:
#getting the statistical measures of the data
diabetes_dataset.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
diabetes_dataset['Outcome'].value_counts()
#value_counts() returns how many times each distinct value occurs in the column

0    500
1    268
Name: Outcome, dtype: int64

In [None]:
diabetes_dataset.groupby('Outcome').mean()
#the dataset is first grouped by the column Outcome and then for each group, the mean value of each column is shown

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [None]:
#separating data and labels
#columns= already selects the column axis (equivalent to axis=1), so no axis argument is passed
x=diabetes_dataset.drop(columns="Outcome")
y=diabetes_dataset["Outcome"]

In [None]:
x.head(10)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
5,5,116,74,0,0,25.6,0.201,30
6,3,78,50,32,88,31.0,0.248,26
7,10,115,0,0,0,35.3,0.134,29
8,2,197,70,45,543,30.5,0.158,53
9,8,125,96,0,0,0.0,0.232,54
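The `isna()` check reports no missing values, yet `describe()` shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, and several zeros are visible in the ten rows above. A value of 0 is not physiologically plausible for these measurements, so the zeros are probably placeholders for missing readings. That interpretation is an assumption, since the notebook does not document how the file was recorded. The sketch below rebuilds those ten rows by hand (the name `first_rows` is only for illustration) and counts the zeros per column.

```python
import pandas as pd

# The ten rows printed by x.head(10), restricted to the columns where 0 is suspicious
first_rows = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
        "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
        "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
        "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    }
)

# Comparing with 0 gives a boolean frame; summing counts the True values per column
zero_counts = (first_rows == 0).sum()
print(zero_counts)
```

Running the same expression on the full dataframe would show how widespread the zeros are before deciding whether to leave them, impute them, or drop the affected rows.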


In [None]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

**Data Standardization**
Standardization is part of data preprocessing: it rescales every feature to zero mean and unit variance so that all features lie in a similar range, which helps the ML model make better predictions.
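As a small worked example, `StandardScaler` replaces each value x in a column by (x - mean) / std, where the mean and standard deviation are computed per column. The sketch below checks this by hand on the first five Glucose and BMI values from the `head()` output earlier; the names `sample`, `manual` and `from_sklearn` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five (Glucose, BMI) pairs from the head() output
sample = np.array([[148, 33.6],
                   [85, 26.6],
                   [183, 23.3],
                   [89, 28.1],
                   [137, 43.1]])

# Manual standardization: subtract the column mean, divide by the column std.
# np.std uses the population formula (ddof=0), which is what StandardScaler uses too.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

from_sklearn = StandardScaler().fit_transform(sample)

print(np.allclose(manual, from_sklearn))
print(from_sklearn.mean(axis=0), from_sklearn.std(axis=0))
```

After scaling, each column has a mean of approximately 0 and a standard deviation of approximately 1, so Glucose (values in the hundreds) and BMI (values in the tens) end up on the same footing.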

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(x) #fit computes the mean and standard deviation of each feature column

StandardScaler(copy=True, with_mean=True, with_std=True)

In [None]:
standardized_data = scaler.transform(x) #transform uses the fitted mean and std to rescale every feature to a common range

In [None]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [None]:
x = standardized_data
y = diabetes_dataset["Outcome"]

In [None]:
print(x)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [None]:
print(y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


**SPLITTING THE DATASET INTO TRAINING AND TEST DATASET**

In [None]:
#random_state is set to a fixed non-negative integer so that the same split is reproduced on every run
#stratify=y keeps the proportion of diabetic and non-diabetic cases the same in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,
                                                    stratify=y,random_state=1)

In [None]:
print(x.shape, x_train.shape, x_test.shape)

(768, 8) (614, 8) (154, 8)


In [None]:
print(y.shape, y_train.shape, y_test.shape)

(768,) (614,) (154,)
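To illustrate what `stratify` does, the sketch below splits a synthetic label array with the same class counts as the `value_counts()` output (500 zeros, 268 ones) and compares the share of class 1 before and after the split. The `features` array is only a placeholder, and all names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset: 500 zeros, 268 ones
labels = np.array([0] * 500 + [1] * 268)
# Placeholder feature column; only the labels matter for this illustration
features = np.arange(len(labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=1
)

# For 0/1 labels, the mean is the fraction of class 1
print("share of class 1 overall:", labels.mean())
print("share of class 1 in train:", l_train.mean())
print("share of class 1 in test: ", l_test.mean())
print(l_train.shape, l_test.shape)
```

With stratification, the three shares should be nearly identical, so neither set ends up with an unusually high or low proportion of diabetic cases.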


**TRAINING MODEL**

In [None]:
#svm.SVC is scikit-learn's support vector classifier
#kernel is a parameter; we use a linear kernel
diabetes_classifier = svm.SVC(kernel = 'linear')
#diabetes_classifier now holds an untrained SVM model

In [None]:
#fitting the model with training data
diabetes_classifier.fit(x_train,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

**MODEL EVALUATION**

In [None]:
#ACCURACY SCORE 
#for training data
x_train_prediction = diabetes_classifier.predict(x_train)
x_train_accuracy_score = accuracy_score(y_train, x_train_prediction) #arguments are (true labels, predicted labels)

In [None]:
print("Accuracy score of the model on the training data : "+
      str(x_train_accuracy_score))

Accuracy score of the model on the training data : 0.7833876221498371


In [None]:
#ACCURACY SCORE 
#for testing data
x_test_prediction = diabetes_classifier.predict(x_test)
x_test_accuracy_score = accuracy_score(y_test, x_test_prediction) #arguments are (true labels, predicted labels)

In [None]:
print("Accuracy score of the model on the test data : "+str(x_test_accuracy_score))

Accuracy score of the model on the test data : 0.7792207792207793
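Because the classes are imbalanced (500 non-diabetic versus 268 diabetic records), it helps to compare the accuracy against a trivial baseline. The sketch below uses only the counts from the `value_counts()` output and the test accuracy printed above. A model that always predicted class 0 would be right for 500 of the 768 records.

```python
# Class counts taken from the value_counts() output
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Accuracy of always predicting the majority class
baseline_accuracy = max(counts.values()) / total
print(round(baseline_accuracy, 3))  # 0.651

# Test accuracy reported by the notebook
test_accuracy = 0.7792207792207793
print("improvement over baseline:", test_accuracy - baseline_accuracy)
```

The linear SVM therefore beats the majority-class baseline by roughly 13 percentage points. Accuracy alone does not show how the errors are split between the two classes, so a confusion matrix would be a natural next check.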


**PREDICTION**

In [None]:
#input_data = (4,110,92,0,0,37.6,0.191,30) #non diabetic case
input_data = (5,166,72,19,175,25.8,0.587,51) #diabetic case

#changing the input data to numpy array
input_data_as_array = np.asarray(input_data)

#reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_array.reshape(1,-1)
print(input_data_reshaped)

#standardize the data
std_data = scaler.transform(input_data_reshaped)
print(std_data)

#prediction
prediction = diabetes_classifier.predict(std_data)
print(prediction)

#prediction is a NumPy array with one element, not a plain integer, so we read prediction[0]
if (prediction[0]==0):
    print("Not diabetic")
elif (prediction[0]==1):
    print("Diabetic")

[[  5.    166.     72.     19.    175.     25.8     0.587  51.   ]]
[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
Diabetic
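One optional way to package the scale-then-predict steps is a scikit-learn `Pipeline`, which applies the scaler automatically to any new input and fits it only on the training rows. This is a sketch of an alternative arrangement, not what the notebook does: it trains on synthetic data from `make_classification` so that it runs without diabetes.csv, and the helper `describe_prediction` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8 diabetes features (not real patient data)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=1
)

# The pipeline fits the scaler on the training rows only, then trains the SVM
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)


def describe_prediction(raw_features):
    """Return a readable label for one unscaled record (hypothetical helper)."""
    sample = np.asarray(raw_features, dtype=float).reshape(1, -1)
    label = int(model.predict(sample)[0])
    return "Diabetic" if label == 1 else "Not diabetic"


print(describe_prediction(X_te[0]))
print("test accuracy on synthetic data:", model.score(X_te, y_te))
```

With the real data, the same pattern would take the unscaled feature dataframe and the Outcome column in place of `X_syn` and `y_syn`, and a raw tuple such as `input_data` could be passed to the helper directly.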
