
# Exploratory Data Analysis and Logistic Regression for Diabetes Prediction

This notebook loads a diabetes dataset, explores it, preprocesses two of its columns, and fits a logistic regression model to predict the diabetes outcome.

## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# ProfileReport is not used below; the pandas_profiling package has since been renamed ydata-profiling
from pandas_profiling import ProfileReport
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

## Data Loading and Exploration

Read the diabetes dataset into a pandas DataFrame

In [2]:
df = pd.read_csv('/kaggle/input/diabetes-dataset/diabetes.csv')
df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Display the first five rows of the DataFrame

In [3]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Display summary statistics for each column in the DataFrame

In [4]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
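
The minimum of 0.0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI in the summary above suggests that zeros may be standing in for missing measurements. A minimal sketch of how one might count them, using only the five rows shown by df.head() (the name `sample` is illustrative and not part of the notebook):

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count, per column, how many entries are exactly zero
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full DataFrame the same expression would be applied to `df` restricted to these five columns.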


Display information about the DataFrame, including the number of rows and columns, data types, and missing values

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Display the column names of the DataFrame

In [6]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Display the shape of the DataFrame

In [7]:
df.shape

(768, 9)

## Data Preprocessing

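Replace all zeros in the SkinThickness column, which presumably mark missing measurements, with the column mean. The mean used here (20.536458) is computed with the zeros still included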

In [8]:
df["SkinThickness"]=df["SkinThickness"].replace(0,df["SkinThickness"].mean())
df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35.000000,0,33.6,0.627,50,1
1,1,85,66,29.000000,0,26.6,0.351,31,0
2,8,183,64,20.536458,0,23.3,0.672,32,1
3,1,89,66,23.000000,94,28.1,0.167,21,0
4,0,137,40,35.000000,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48.000000,180,32.9,0.171,63,0
764,2,122,70,27.000000,0,36.8,0.340,27,0
765,5,121,72,23.000000,112,26.2,0.245,30,0
766,1,126,60,20.536458,0,30.1,0.349,47,1


Replace all zeros in the Insulin column with the column mean, again computed with the zeros included (79.799479)

In [9]:
df["Insulin"]=df["Insulin"].replace(0,df["Insulin"].mean())
df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35.000000,79.799479,33.6,0.627,50,1
1,1,85,66,29.000000,79.799479,26.6,0.351,31,0
2,8,183,64,20.536458,79.799479,23.3,0.672,32,1
3,1,89,66,23.000000,94.000000,28.1,0.167,21,0
4,0,137,40,35.000000,168.000000,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48.000000,180.000000,32.9,0.171,63,0
764,2,122,70,27.000000,79.799479,36.8,0.340,27,0
765,5,121,72,23.000000,112.000000,26.2,0.245,30,0
766,1,126,60,20.536458,79.799479,30.1,0.349,47,1
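
Because the replacement value above is a mean that includes the zeros, it is pulled toward zero. One possible alternative, sketched here on the five rows from df.head() rather than on the full dataset, is to mark zeros as missing first and then fill with the mean of the remaining values. The names `toy`, `masked`, and `imputed` are illustrative, and this is not what the notebook itself does:

```python
import numpy as np
import pandas as pd

# SkinThickness and Insulin values of the first five rows, copied from df.head()
toy = pd.DataFrame({
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})

# Treat zeros as missing so they do not contribute to the mean
masked = toy.replace(0, np.nan)

# DataFrame.mean() skips NaN, so each fill value is the mean of the non-zero entries
imputed = masked.fillna(masked.mean())
print(imputed)
```

If the imputed values feed a model that is evaluated on held-out data, computing the fill values on the training split only would avoid leaking test-set information into preprocessing.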


## Correlation Analysis

Compute and display the pairwise Pearson correlation matrix of the DataFrame after imputation

In [10]:
corr=df.corr()
corr

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,0.013376,-0.018082,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.145378,0.390835,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.18089,0.074858,0.281805,0.041265,0.239528,0.065068
SkinThickness,0.013376,0.145378,0.18089,1.0,0.240361,0.501131,0.154961,0.026423,0.175026
Insulin,-0.018082,0.390835,0.074858,0.240361,1.0,0.189337,0.157806,0.038652,0.179185
BMI,0.017683,0.221071,0.281805,0.501131,0.189337,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.154961,0.157806,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,0.026423,0.038652,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.175026,0.179185,0.292695,0.173844,0.238356,1.0
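
The Outcome row of the matrix is the part most relevant to feature selection. As a small sketch, the values below are copied from that row and sorted so the strongest linear associations appear first (the name `outcome_corr` is illustrative):

```python
import pandas as pd

# Correlation of each feature with Outcome, copied from the matrix above
outcome_corr = pd.Series({
    "Pregnancies": 0.221898,
    "Glucose": 0.466581,
    "BloodPressure": 0.065068,
    "SkinThickness": 0.175026,
    "Insulin": 0.179185,
    "BMI": 0.292695,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
})

# Strongest correlation first
ranked = outcome_corr.sort_values(ascending=False)
print(ranked)
```

With the notebook's own objects, the same ranking could be obtained with `corr["Outcome"].drop("Outcome").sort_values(ascending=False)`.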


## Feature Selection and Data Splitting

Extract the Outcome column as the target Series y

In [11]:
y=df['Outcome']
y.shape
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Build the feature DataFrame X by dropping Outcome, Pregnancies, BloodPressure, SkinThickness, and DiabetesPedigreeFunction, which leaves Glucose, Insulin, BMI, and Age

In [12]:
X=df.drop(columns=['Outcome', 'Pregnancies', 'BloodPressure', 'SkinThickness', 'DiabetesPedigreeFunction'])
X

,Glucose,Insulin,BMI,Age
0,148,79.799479,33.6,50
1,85,79.799479,26.6,31
2,183,79.799479,23.3,32
3,89,94.000000,28.1,21
4,137,168.000000,43.1,33
...,...,...,...,...
763,101,180.000000,32.9,63
764,122,79.799479,36.8,27
765,121,112.000000,26.2,30
766,126,79.799479,30.1,47


Split X and y into training and testing sets, holding out 20% of the rows for testing

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 38)

Display the shapes of X_train, y_train, X_test, and y_test

In [14]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(614, 4)
(614,)
(154, 4)
(154,)
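
The split above is purely random, so the share of positive cases can differ between the training and testing sets. A hedged sketch of a stratified split follows. It uses a synthetic label vector with 268 positives out of 768, which is what the Outcome mean of 0.348958 implies, and the names `X_demo` and `y_demo` are placeholders rather than the notebook's variables:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 268 positive and 500 negative labels, one dummy feature
y_demo = np.array([1] * 268 + [0] * 500)
X_demo = np.arange(768).reshape(-1, 1)

# stratify=y_demo keeps the class proportions close to equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```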


## Logistic Regression Modeling

Create a LogisticRegression model and fit it to the training data

In [15]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

Compute the accuracy of the model on the training data

In [16]:
model.score(X_train, y_train)

0.757328990228013

Compute the accuracy of the model on the testing data

In [17]:
model.score(X_test, y_test)

0.7987012987012987
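
The test accuracy here (0.7987) is higher than the training accuracy (0.7573), which can happen by chance with a single 154-row test set. Cross-validation gives a less split-dependent estimate. The sketch below runs on synthetic data from `make_classification`, not on the diabetes data, and `max_iter=1000` is an assumed setting chosen only to give the solver room to converge:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with the same number of rows and features as X (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=4, n_informative=3, n_redundant=1, random_state=38
)

# Five-fold cross-validated accuracy: one score per held-out fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.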

Make predictions on the testing data using the model

In [18]:
predict = model.predict(X_test)
predict

array([0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

## Model Interpretation

This section uses the statsmodels library to obtain coefficient-level statistics. The code adds an intercept column to the training set with add_constant, builds a model with sm.OLS, fits it, and prints the regression summary, which reports coefficients, standard errors, p-values, and goodness-of-fit measures. Note that sm.OLS fits ordinary least squares to the 0/1 outcome (a linear probability model), not a logistic regression, so the summary describes a different model from the scikit-learn LogisticRegression fitted above. The statsmodels counterpart of logistic regression is sm.Logit.

In [19]:
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:                Outcome   R-squared:                       0.261
Model:                            OLS   Adj. R-squared:                  0.256
Method:                 Least Squares   F-statistic:                     53.68
Date:                Sun, 23 Apr 2023   Prob (F-statistic):           9.19e-39
Time:                        11:40:19   Log-Likelihood:                -324.02
No. Observations:                 614   AIC:                             658.0
Df Residuals:                     609   BIC:                             680.1
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8651      0.091     -9.462      0.0

Select a single row of the test set for inspection

In [20]:
Xtest = X_test.iloc[1]
Xtest

Glucose    197.000000
Insulin     79.799479
BMI         34.700000
Age         62.000000
Name: 579, dtype: float64
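
The selected row is not used further in the notebook. One natural use would be to ask a fitted model for the predicted class probabilities of that single observation. The sketch below assumes a small synthetic dataset and a model named `demo_model`, both invented here. Selecting with `iloc[[1]]` (double brackets) keeps the row as a one-row DataFrame, which is the 2-D shape scikit-learn expects:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(38)

# Synthetic features and labels (illustrative, not the diabetes data)
demo_X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI": rng.normal(32, 7, 200),
})
demo_y = (demo_X["Glucose"] + rng.normal(0, 20, 200) > 130).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)

# Double brackets return a one-row DataFrame instead of a Series
single_row = demo_X.iloc[[1]]
proba = demo_model.predict_proba(single_row)  # columns: P(class 0), P(class 1)
label = demo_model.predict(single_row)
print(proba, label)
```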

Print precision, recall, and F1 score for each class on the testing data

In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predict))

              precision    recall  f1-score   support

           0       0.81      0.91      0.86       101
           1       0.78      0.58      0.67        53

    accuracy                           0.80       154
   macro avg       0.79      0.75      0.76       154
weighted avg       0.80      0.80      0.79       154
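
The confusion-matrix counts behind this report can be recovered from figures already shown: 154 test rows, class supports of 101 and 53, a test accuracy of 0.7987012987012987, and 40 predicted positives (the number of 1s in the `predict` array). The arithmetic below is a sketch with illustrative variable names:

```python
# Figures read from the outputs above
n_test = 154          # rows in the test set
n_neg = 101           # support of class 0
n_pos = 53            # support of class 1
n_pred_pos = 40       # number of 1s in the `predict` array
n_correct = round(0.7987012987012987 * n_test)  # accuracy times test size

# Solve TP + TN = n_correct, TP + FP = n_pred_pos, TN + FP = n_neg
fp = (n_pred_pos + n_neg - n_correct) // 2
tp = n_pred_pos - fp
tn = n_neg - fp
fn = n_pos - tp

recall_1 = tp / (tp + fn)        # share of diabetic cases that were detected
baseline_acc = n_neg / n_test    # accuracy of always predicting class 0
print(tp, fp, fn, tn)
print(round(recall_1, 2), round(baseline_acc, 2))
```

The model misses 22 of the 53 positive cases, and always predicting class 0 would already reach about 66% accuracy on this test set, so accuracy alone overstates how well the positive class is identified.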



## Model Improvement

Drop the Insulin column from X, leaving Glucose, BMI, and Age

In [22]:
X=X.drop(columns=['Insulin'])
X

,Glucose,BMI,Age
0,148,33.6,50
1,85,26.6,31
2,183,23.3,32
3,89,28.1,21
4,137,43.1,33
...,...,...,...
763,101,32.9,63
764,122,36.8,27
765,121,26.2,30
766,126,30.1,47


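Split the reduced feature set with the same test size and random state as before, and display the resulting shapes

In [23]: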
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 38)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(614, 3)
(614,)
(154, 3)
(154,)


Fit a new LogisticRegression model to the reduced training data

In [24]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

Compute the accuracy of the refitted model on the training data

In [25]:
model.score(X_train, y_train)

0.760586319218241

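Compute the accuracy of the refitted model on the testing data. The result, 0.7857, is slightly below the 0.7987 obtained with Insulin included, so on this split dropping the feature does not improve test accuracy.

In [26]: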
model.score(X_test, y_test)

0.7857142857142857

Refit the statsmodels OLS model on the reduced training data and print its summary

In [27]:
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:                Outcome   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.257
Method:                 Least Squares   F-statistic:                     71.54
Date:                Sun, 23 Apr 2023   Prob (F-statistic):           1.18e-39
Time:                        11:40:19   Log-Likelihood:                -324.20
No. Observations:                 614   AIC:                             656.4
Df Residuals:                     610   BIC:                             674.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8609      0.091     -9.450      0.0