
In [None]:
# importing the dependencies

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

# **About Dataset**

This dataset contains diabetes records for female patients only. It has the following columns:


*   Pregnancies
*   Glucose
*   BloodPressure
*   SkinThickness : Triceps skinfold thickness (mm)
*   Insulin
*   BMI : Body Mass Index
*   DiabetesPedigreeFunction : The likelihood of diabetes based on age and family history
*   Age
*   Outcome : 0 for non-diabetic and 1 for diabetic observations


In [None]:
# loading the dataset

dataset = pd.read_csv("/content/diabetes.csv")
dataset.head()

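|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |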


In [None]:
# number of rows and columns in the dataset

dataset.shape

(768, 9)

This means that there are 768 rows (i.e. 768 observations) and 9 columns: 8 features and the Outcome label.

In [None]:
# getting some statistical measurements

dataset.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


Here, count is the number of non-missing values in each column (768 everywhere), and mean is the average of each column over all 768 observations.
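The summary above also shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A zero is hard to interpret as a real measurement for these columns, so it may be a placeholder for a missing value; that is an assumption, not something the dataset states. The sketch below counts such zeros on the five rows shown earlier (the names `sample` and `suspect_columns` are only for illustration); the same expression could be applied to `dataset`.

```python
import pandas as pd

# First five rows of the dataset, copied from the dataset.head() output above.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns marks a missing measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros per column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

If the counts on the full dataset turn out to be large, imputing or dropping those rows before training might be worth trying.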

In [None]:
dataset['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

This shows that 500 observations are non-diabetic and 268 are diabetic, so the classes are imbalanced (about 65% versus 35%).
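The class counts also give a useful reference point: a model that always predicts the majority class would already be right for every non-diabetic observation. A minimal sketch, rebuilding the label column from the counts above (the names `outcome` and `majority_baseline` are illustrative):

```python
import pandas as pd

# Rebuild the Outcome column from the counts reported above.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class instead of raw counts.
proportions = outcome.value_counts(normalize=True)
print(proportions)

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.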

In [None]:
dataset.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


The output above shows the mean of each feature for each outcome. Note that every feature has a higher mean for Outcome 1 (diabetic) than for Outcome 0, with the largest gaps in Glucose and Insulin.

# **Standardizing the data**

In [None]:
# separating the features and the labels

X = dataset.drop(columns = 'Outcome')
Y = dataset['Outcome']

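Next, we will standardize the data. Standardization rescales each feature to zero mean and unit variance, so that features measured on large scales (such as Insulin) do not dominate features measured on small scales (such as DiabetesPedigreeFunction) when the model is trained.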
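As a small worked example of what standardization computes, the sketch below applies z = (x - mean) / std by hand to the five Glucose values shown earlier and compares the result with `StandardScaler`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose values from the first five rows, as a single-column array.
glucose = np.array([148, 85, 183, 89, 137], dtype=float).reshape(-1, 1)

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0, which is NumPy's default).
manual = (glucose - glucose.mean(axis=0)) / glucose.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(glucose)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```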

In [None]:
# Standardizing the data

scaler = StandardScaler()
scaler.fit(X)

# transforming the data

standardized_data = scaler.transform(X)

print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


After standardization, all features are on a comparable scale: each column has mean 0 and standard deviation 1.
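One caveat: the scaler above is fitted on all 768 rows, so the means and standard deviations it learns also reflect rows that later end up in the test set. A common alternative is to split first and fit the scaler on the training rows only, for example with a scikit-learn pipeline. The sketch below shows this pattern on synthetic data from `make_classification`, used here purely as a stand-in for the diabetes data; it is not the approach taken in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data (768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Split before any scaling happens.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it transforms the test rows.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

On a dataset of this size the difference is usually small, but the pipeline keeps the test set untouched until evaluation.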

In [None]:
# from here on, X holds the standardized features; the labels Y are unchanged

X = standardized_data
Y = dataset['Outcome']

# **Splitting the data**

In [None]:
# splitting data into train and test

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)


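Here, 20% of the data (154 of 768 rows) is held out as test data (test_size = 0.2). stratify = Y keeps the proportion of diabetic and non-diabetic observations the same in both sets, and random_state = 2 makes the split reproducible.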
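The effect of `stratify` can be checked by comparing the share of diabetic labels before and after the split. The sketch below rebuilds a label array from the class counts reported earlier and uses a placeholder feature column, so the names `X_demo` and `y_demo` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts (500 non-diabetic, 268 diabetic).
y_demo = np.array([0] * 500 + [1] * 268)

# Placeholder feature column; only the labels matter for this check.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Sizes of the two sets and the share of diabetic labels in each.
print(len(y_tr), len(y_te))
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```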

In [None]:
# creating a support vector classifier with a linear kernel

classifier = svm.SVC(kernel = 'linear')

In [None]:
# training the svm classifier
classifier.fit(X_train, Y_train)

# **Model Evaluation**

In [None]:
# model evaluation by finding the accuracy score on training data

X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print("Accuracy score of the training data: ", training_data_accuracy)

Accuracy score of the training data:  0.7866449511400652


The accuracy on the training data is about 79%, well above the roughly 65% that always predicting the majority class (non-diabetic) would achieve. Next, we check the accuracy on the test data.

In [None]:
# model evaluation by finding the accuracy score on test data

X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Accuracy score of the test data: ", test_data_accuracy)

Accuracy score of the test data:  0.7727272727272727


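The test accuracy (about 77%) is close to the training accuracy (about 79%), which suggests that the model is not overfitting to any large degree. Keep in mind that this estimate comes from a single split with 154 test rows, so it can vary with the choice of random_state.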
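Because a single train/test split gives only one accuracy estimate, k-fold cross-validation is a common way to see how much that estimate varies. The sketch below is one possible setup, run on synthetic data from `make_classification` as a stand-in for the diabetes data; the number of folds and the random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data (768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Scaling and the linear SVM are bundled so that each fold fits its own scaler.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

A small standard deviation across folds would support the reading that the train/test gap above is not due to a lucky split.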

# **Building a Predictive System**

In [None]:
# building a predictive system

input_data = (4, 110, 92, 0, 0, 37.6, 0.191, 30)

# converting the input tuple to a numpy array

input_data_as_np = np.asarray(input_data)

# reshaping the array since we are predicting only for 1 instance

input_data_reshaped = input_data_as_np.reshape(1, -1)

# standardizing the input_data

std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0] == 0):
  print('The person is not diabetic')
else:
  print('The person is diabetic')

[[ 0.04601433 -0.34096773  1.18359575 -1.28821221 -0.69289057  0.71168975
  -0.84827977 -0.27575966]]
[0]
The person is not diabetic




The input is standardized with the same scaler that was used for the training data, and the model predicts 0 (not diabetic), which is the correct Outcome for this observation.

In [None]:
# predicting for a second set of input values

input_data = (5, 160, 94, 23, 102, 40, 0.6, 40)

# converting the input tuple to a numpy array

input_data_as_np = np.asarray(input_data)

# reshaping the array since we are predicting only for 1 instance

input_data_reshaped = input_data_as_np.reshape(1, -1)

# standardizing the input_data

std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0] == 0):
  print('The person is not diabetic')
else:
  print('The person is diabetic')

[[0.3429808  1.22388954 1.28699125 0.15453319 0.19276481 1.01629594
  0.38694877 0.57511787]]
[1]
The person is diabetic
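The two prediction cells above repeat the same steps: convert the input to an array, reshape it to one row, standardize it, and predict. These steps could be wrapped in a small helper. The sketch below is one way to do that; `predict_outcome` is a hypothetical name, and the two-feature toy data exists only to make the example runnable on its own. In the notebook, the fitted `scaler` and `classifier` would be passed in instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_outcome(values, scaler, classifier):
    """Return a readable label for one observation (hypothetical helper)."""
    # Reshape to a single row, since the model expects a 2D array.
    row = np.asarray(values, dtype=float).reshape(1, -1)
    # Apply the same scaling used for training, then predict.
    label = classifier.predict(scaler.transform(row))[0]
    return "The person is diabetic" if label == 1 else "The person is not diabetic"


# Toy two-feature training data, only to make this sketch self-contained.
X_toy = np.array(
    [[80, 22], [90, 24], [85, 23], [170, 38], [180, 40], [175, 39]], dtype=float
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_scaler = StandardScaler().fit(X_toy)
toy_classifier = SVC(kernel="linear").fit(toy_scaler.transform(X_toy), y_toy)

print(predict_outcome((88, 23), toy_scaler, toy_classifier))
print(predict_outcome((178, 39), toy_scaler, toy_classifier))
```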


