The aim of my project is to build a predictive model that determines whether an individual has diabetes from their medical and demographic information. The features used are the number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, body mass index (BMI), diabetes pedigree function, and age.

Import libraries

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
import warnings
warnings.filterwarnings('ignore')

Import the Excel file


In [51]:
df = pd.read_excel("/content/diabetes (1).xlsx")

In [52]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [53]:
# Display the shape (rows, columns) of the DataFrame
df.shape

(768, 9)

In [54]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [55]:
# Overview of the DataFrame columns: non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [56]:
# Generate summary statistics for each column
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
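
The summary above shows a minimum of 0.0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A zero for glucose, blood pressure or BMI is not a plausible measurement, so these zeros may be placeholders for missing values; that is an assumption about the dataset, not something this notebook verifies. A minimal sketch of how the zeros could be counted, using a small stand-in frame (`sample`, a hypothetical name) built from the five rows shown by `df.head()`:

```python
import pandas as pd

# Stand-in for df: the first five rows printed by df.head() above
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count the entries in each column that are exactly zero
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full data the same expression would be `(df[cols] == 0).sum()` for whichever columns are of interest.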


In [57]:
# Count the null values in each column
df.isnull().sum()


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [58]:
# Count the duplicated rows
df.duplicated().sum()


0

In [59]:
df['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64


0 --> Non-Diabetic\
1 --> Diabetic
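
The classes are imbalanced (500 vs 268), which matters when reading accuracy scores later. As a rough reference point, a model that always predicts the majority class would already reach the accuracy computed below. This is a small sketch that reuses the counts printed above; `majority_baseline` is a name introduced here for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 500, 1: 268}, name="count")

# Share of each class in the dataset
proportions = counts / counts.sum()

# Accuracy of a model that always predicts the most frequent class
majority_baseline = proportions.max()

print(proportions.round(3))
print(round(majority_baseline, 3))
```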

In [60]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [61]:
# Separate the features (X) from the labels (Y)
X = df.drop(columns='Outcome')
Y = df['Outcome']

In [62]:
print(X)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


In [63]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


Data Standardization

In [64]:
from sklearn.preprocessing import StandardScaler

In [65]:
scaler = StandardScaler()


In [66]:
scaler.fit(X)

In [67]:
std_data =scaler.transform(X)

In [68]:
print(std_data)


[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
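
To make the transformation concrete: `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below reproduces that by hand with NumPy on the Glucose and Age values from the first five rows of `df.head()`. Because it uses only five rows, its numbers will not match the output above, which is computed from all 768 rows.

```python
import numpy as np

# Glucose and Age for the first five rows shown by df.head()
X_small = np.array(
    [[148, 50], [85, 31], [183, 32], [89, 21], [137, 33]], dtype=float
)

# Column-wise mean and population standard deviation (ddof=0)
mean = X_small.mean(axis=0)
std = X_small.std(axis=0)

# z-score: each column ends up with mean 0 and standard deviation 1
z = (X_small - mean) / std

print(z)
```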


In [69]:
X=std_data
Y=df['Outcome']

In [70]:
print(X)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [71]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


Train Test Split

In [72]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=15)

In [74]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)


In [75]:
print(Y.shape, Y_train.shape, Y_test.shape)

(768,) (614,) (154,)
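
In this notebook the scaler is fitted on all 768 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, not the diabetes data); `stratify` is an optional extra, not used in this notebook, that keeps the class ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape and class counts as the dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# Split first; stratify keeps the 0/1 ratio similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler_demo = StandardScaler()
X_tr_std = scaler_demo.fit_transform(X_tr)
X_te_std = scaler_demo.transform(X_te)

print(X_tr_std.shape, X_te_std.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```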


Implementing the Logistic Regression Model

In [76]:
# Import the Logistic Regression model and the accuracy metric
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [77]:
log=LogisticRegression()

In [78]:
# Pass the independent (X) and dependent (Y) training data to the model for training
log.fit(X_train,Y_train)

In [79]:
#accuracy on training data
y_train_prediction = log.predict(X_train)
training_data_accuracy = accuracy_score(Y_train,y_train_prediction)

print(training_data_accuracy)

0.7866449511400652


In [80]:
#accuracy on test data
y_test_prediction = log.predict(X_test)
test_data_accuracy = accuracy_score(Y_test,y_test_prediction)

print(test_data_accuracy)

0.7597402597402597


In [81]:
#Actual value and the predicted value
a = pd.DataFrame({'Actual value': Y_test, 'Predicted value':y_test_prediction})
a.sample(5)

,Actual value,Predicted value
421,0,0
671,0,0
388,1,1
582,0,0
518,0,0
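
Accuracy alone does not show which kind of mistake the model makes, and with imbalanced classes that distinction can matter. A confusion matrix, precision and recall break the predictions down further. The sketch below uses two short made-up label lists (`y_true_demo`, `y_pred_demo`) for illustration; in the notebook the same calls would take `Y_test` and `y_test_prediction`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (not results from this notebook)
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [0, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Precision: share of predicted positives that are correct
precision = precision_score(y_true_demo, y_pred_demo)

# Recall: share of actual positives that were found
recall = recall_score(y_true_demo, y_pred_demo)

print(cm)
print(round(precision, 3), round(recall, 3))
```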


# Implementing the Decision Tree Model

In [82]:
# Importing Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

In [83]:
# Create Decision Tree classifier object
classification = DecisionTreeClassifier(criterion="entropy", splitter="best", max_depth = 10,random_state = 52)

# Train Decision Tree classifier
classification = classification.fit(X_train,Y_train)

In [84]:
#accuracy on training data
y_train_prediction_dt = classification.predict(X_train)
training_data_accuracy_dt = accuracy_score(Y_train,y_train_prediction_dt)

print(training_data_accuracy_dt)

0.9706840390879479


In [85]:
#accuracy on test data
y_test_prediction_dt = classification.predict(X_test)
test_data_accuracy_dt = accuracy_score(Y_test,y_test_prediction_dt)

print(test_data_accuracy_dt)

0.6948051948051948
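
The gap between 0.97 on the training data and 0.69 on the test data suggests the tree is fitting noise. One way to explore this is to compare cross-validated accuracy for several values of `max_depth`. The sketch below does that on synthetic data from `make_classification`, so its scores say nothing about the diabetes data; the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features, like the dataset
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate depth
cv_scores = {}
for depth in (2, 4, 6, 10):
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=52
    )
    scores = cross_val_score(tree, X_syn, y_syn, cv=5)
    cv_scores[depth] = scores.mean()

for depth, score in cv_scores.items():
    print(depth, round(score, 3))
```

In the notebook, `X_train` and `Y_train` would replace the synthetic arrays, which keeps the test set untouched while choosing the depth.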


# Implementing the Random Forest Model

In [38]:
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

In [86]:
# Create Random Forest classifier object
classifier= RandomForestClassifier(n_estimators= 11,criterion="entropy",max_depth=11,random_state = 52)
classifier.fit(X_train, Y_train)

In [87]:
#accuracy on training data
y_train_prediction_rf = classifier.predict(X_train)
training_data_accuracy_rf = accuracy_score(Y_train,y_train_prediction_rf)

print(training_data_accuracy_rf)

0.9739413680781759


In [88]:
#accuracy on test data
y_test_prediction_rf = classifier.predict(X_test)
test_data_accuracy_rf = accuracy_score(Y_test,y_test_prediction_rf)

print(test_data_accuracy_rf)

0.7272727272727273


# Implementing the Support Vector Machines Model

In [89]:
#Importing SVM model
from sklearn import svm

In [90]:
# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

#Train the model using the training sets
clf.fit(X_train, Y_train)

In [91]:
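#accuracy on training data
# Predict with the SVM (`clf`). The value recorded below came from a run that
# called `classifier` (the random forest), so it repeats the random forest score.
y_train_prediction_svm = clf.predict(X_train)
training_data_accuracy_svm = accuracy_score(Y_train,y_train_prediction_svm)

print(training_data_accuracy_svm)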

0.9739413680781759


In [92]:
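#accuracy on test data
# Predict with the SVM (`clf`). The value recorded below came from a run that
# called `classifier` (the random forest), so it repeats the random forest score.
y_test_prediction_svm = clf.predict(X_test)
test_data_accuracy_svm = accuracy_score(Y_test,y_test_prediction_svm)

print(test_data_accuracy_svm)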

0.7272727272727273


When choosing a machine learning model for this classification problem, I compare both the train and test accuracy to understand how well each model generalizes to new data. Here’s an analysis of the accuracies recorded above:

Logistic Regression:
Train Accuracy: 0.79,
Test Accuracy: 0.76

Decision Tree:
Train Accuracy: 0.97,
Test Accuracy: 0.69

Random Forest:
Train Accuracy: 0.97,
Test Accuracy: 0.73

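SVM (Support Vector Machine):
Train Accuracy: 0.97,
Test Accuracy: 0.73
(These figures are identical to the Random Forest's because the recorded outputs of the SVM cells were produced by `classifier`, the random forest object, not by `clf`. They do not measure the SVM.)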

Analysis:
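Logistic Regression: The train and test accuracies are close (about 0.03 apart), indicating that the model generalizes well and does not overfit. Its train accuracy is lower than that of the tree-based models, but its test accuracy is the highest of the models measured, and the consistency between train and test accuracy is a good sign of model reliability. The Decision Tree and Random Forest reach 0.97 on the training data but only about 0.7 on the test data, a gap that points to overfitting.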
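
The comparison can be laid out as a small table with the train/test gap computed explicitly. This sketch copies the accuracies printed by the cells above; the SVM is left out because its recorded outputs were produced by the random forest object. The `results` frame and the `gap` column are names introduced here for illustration.

```python
import pandas as pd

# Accuracies copied from the printed outputs of the cells above
results = pd.DataFrame(
    {
        "train": [0.7866449511400652, 0.9706840390879479, 0.9739413680781759],
        "test": [0.7597402597402597, 0.6948051948051948, 0.7272727272727273],
    },
    index=["Logistic Regression", "Decision Tree", "Random Forest"],
)

# A large positive gap means the model does much better on data it has seen
results["gap"] = results["train"] - results["test"]

print(results.round(3).sort_values("gap"))
```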

--------------------------------------------------------------------------------------------------------------------------------------------------------

# Making a predictive system

In [93]:
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']

input_data = (5,120,72,19,175,25.8,0.587,51)

# changing the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

# standardize the input data
std_data = scaler.transform(input_data_reshaped)
print(std_data)

# Predict with `classifier`, the random forest fitted above
prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0] == 0):
  print('The person is not diabetic')
else:
  print('The person is diabetic')

[[ 0.3429808  -0.02799627  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[0]
The person is not diabetic
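
The steps of the predictive system can be wrapped in a small helper so that any fitted model can be used. This is a sketch under assumptions: `predict_diabetes`, `FEATURES`, `demo_scaler` and `demo_model` are hypothetical names, and the model here is trained on synthetic data with a made-up labelling rule, so its answer says nothing about real patients. Passing a DataFrame with the original column names keeps the input consistent with a scaler that was fitted on a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


def predict_diabetes(model, fitted_scaler, values):
    """Scale one row of feature values and return the predicted class (0 or 1)."""
    row = pd.DataFrame([values], columns=FEATURES)
    row_std = fitted_scaler.transform(row)
    return int(model.predict(row_std)[0])


# Synthetic stand-in data with a made-up labelling rule (illustration only)
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.uniform(0, 200, size=(200, 8)), columns=FEATURES)
labels = (train["Glucose"] > 120).astype(int)

# Fit the scaler and model on the synthetic data
demo_scaler = StandardScaler().fit(train)
demo_model = LogisticRegression().fit(demo_scaler.transform(train), labels)

result = predict_diabetes(
    demo_model, demo_scaler, (5, 120, 72, 19, 175, 25.8, 0.587, 51)
)
print('The person is diabetic' if result == 1 else 'The person is not diabetic')
```

In the notebook, the call would take one of the fitted models (for example `log`) together with `scaler`.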
