# Exploratory Data Analysis and Logistic Regression for Diabetes Prediction

This notebook loads a diabetes dataset, explores it, preprocesses two columns, and fits a logistic regression model to predict the diabetes outcome.

## Import necessary libraries

In [None]:
# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Newer releases of this package are published as ydata-profiling
# (from ydata_profiling import ProfileReport)
from pandas_profiling import ProfileReport

# Modeling and evaluation
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

## Data Loading and Exploration

Read the diabetes data set into a Pandas DataFrame

In [None]:
df = pd.read_csv('diabetes.csv')
df

Display the first five rows of the DataFrame

In [None]:
df.head()

Display summary statistics for each column in the DataFrame

In [None]:
df.describe()

Display information about the DataFrame, including the number of rows and columns, the data type of each column, and the non-null count per column

In [None]:
df.info()
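
The preprocessing step later in this notebook treats zeros in SkinThickness and Insulin as missing values, and df.info() does not report zeros as nulls. A minimal sketch of counting zeros per column is shown here; the `toy` frame and its values are made up for illustration and are not taken from diabetes.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for diabetes.csv; the values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
})

# Count zero entries per column; info() would report all of these as non-null.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the real data, the same expression applied to df would show which columns contain zeros that may stand for missing measurements.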

Display the column names of the DataFrame

In [None]:
df.columns

Display the shape of the DataFrame

In [None]:
df.shape

## Data Preprocessing

Replace all zeros in the SkinThickness column with the column mean (the mean is computed over all rows, zeros included)

In [None]:
df["SkinThickness"] = df["SkinThickness"].replace(0, df["SkinThickness"].mean())
df

Replace all zeros in the Insulin column with the column mean (the mean is computed over all rows, zeros included)

In [None]:
df["Insulin"] = df["Insulin"].replace(0, df["Insulin"].mean())
df
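
Because the column mean used above is computed with the zeros still present, the fill value is pulled toward zero. One possible alternative, sketched here on a made-up `toy` column rather than the real data, is to mark zeros as missing first and fill with the mean of the observed values only. This is an illustration of the idea, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({"Insulin": [0.0, 100.0, 0.0, 200.0]})

# Mean over all rows, zeros included (the value replace(0, col.mean()) would use).
mean_with_zeros = toy["Insulin"].mean()

# Treat zeros as missing, then fill with the mean of the observed (nonzero) values.
insulin = toy["Insulin"].replace(0, np.nan)
imputed = insulin.fillna(insulin.mean())

print(mean_with_zeros)
print(imputed.tolist())
```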

## Correlation Analysis

Display the correlation matrix of the DataFrame

In [None]:
corr = df.corr()
corr
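
The full correlation matrix can be hard to scan when the question is which features relate most strongly to the target. A small sketch of ranking features by the absolute value of their correlation with Outcome follows; the `toy` frame is made up and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [90, 100, 150, 160, 170, 95],
    "Age": [25, 50, 30, 45, 60, 35],
    "Outcome": [0, 0, 1, 1, 1, 0],
})

corr_toy = toy.corr()

# Keep only the target column, drop its self-correlation, and rank by magnitude.
target_corr = corr_toy["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(target_corr)
```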

## Feature Selection and Data Splitting

Extract the Outcome column as the target Series y and print its shape

In [None]:
y = df['Outcome']
print(y.shape)
y

Create a new DataFrame containing all the columns except Outcome, Pregnancies, BloodPressure, SkinThickness, and DiabetesPedigreeFunction

In [None]:
X = df.drop(columns=['Outcome', 'Pregnancies', 'BloodPressure', 'SkinThickness', 'DiabetesPedigreeFunction'])
X

Split X and y into training and testing sets, with 20% of the data in the testing set and a fixed random seed for reproducibility

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)

Display the shapes of the resulting training and testing sets

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
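
A plain random split does not guarantee that the two classes appear in the same proportion in the training and testing sets. train_test_split accepts a stratify argument for that purpose. The sketch below uses made-up `X_toy` and `y_toy` objects to show the effect; whether stratification helps on the real data is not something this notebook evaluates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 25% positive class.
X_toy = pd.DataFrame({"Glucose": range(100, 120)})
y_toy = pd.Series([0] * 15 + [1] * 5, name="Outcome")

# stratify=y_toy keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=38, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```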

## Logistic Regression Modeling

Create a LogisticRegression model and fit it to the training data

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
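
LogisticRegression() uses the lbfgs solver with max_iter=100 by default, and on unscaled features it can stop early with a ConvergenceWarning. One common remedy is to standardize the features in a pipeline. The following is a sketch on synthetic data (`X_toy`, `y_toy`, and `scaled_model` are illustrative names), not a change to the model fitted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on different scales; purely illustrative.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[120, 30], scale=[30, 10], size=(200, 2))
y_toy = (X_toy[:, 0] > 120).astype(int)

# Standardize each feature, then fit logistic regression on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_toy, y_toy)
train_acc = scaled_model.score(X_toy, y_toy)
print(train_acc)
```

Wrapping the scaler in the pipeline means the scaling parameters are learned from the training data only and reused when predicting on new rows.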

Print the accuracy of the model on the training data

In [None]:
model.score(X_train, y_train)

Print the accuracy of the model on the testing data

In [None]:
model.score(X_test, y_test)

Make predictions on the testing data using the model

In [None]:
predict = model.predict(X_test)
predict
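
Accuracy alone does not show which kind of mistake the model makes. The confusion_matrix and accuracy_score functions imported at the top of the notebook can break the predictions down further. The sketch below uses short made-up label lists in place of y_test and predict.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for y_test and the model's predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# For binary labels the matrix is [[tn, fp], [fn, tp]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```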

## Model Interpretation

In this section, the code uses the statsmodels library to obtain coefficient-level statistics for the selected features. It adds a constant (intercept) column to the training set with add_constant, builds an OLS (Ordinary Least Squares) model with sm.OLS, fits it, and prints the regression summary, which includes coefficients, standard errors, p-values, and goodness-of-fit measures. Note that OLS on a binary outcome is a linear probability model, not the logistic regression fitted above: its coefficients are changes in predicted probability rather than log-odds, so the summary is only a rough guide to which features matter.

In [None]:
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())

Select the second row of the testing set (positional index 1) as a single example observation

In [None]:
Xtest = X_test.iloc[1]
Xtest

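Print the classification report (precision, recall, and F1-score per class) for the predictions on the testing set

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predict))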

## Model Improvement

Drop the Insulin column from the feature set

In [None]:
X = X.drop(columns=['Insulin'])
X

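Split the reduced feature set with the same test size and random seed as before, then print the shapes of the resulting sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)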

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Fit a new LogisticRegression model on the reduced training data

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

Display the accuracy of the new model on the training data

In [None]:
model.score(X_train, y_train)

Display the accuracy of the new model on the testing data

In [None]:
model.score(X_test, y_test)
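
Comparing two feature sets on a single train/test split can be sensitive to which rows land in the testing set. Cross-validation averages the score over several splits. The sketch below compares a full and a reduced feature matrix on synthetic data; `X_full`, `y_toy`, and the score variables are illustrative names, and the data is not from diabetes.csv.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first of three features drives the outcome.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
y_toy = (X_full[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validated accuracy with all features and with the last one dropped.
scores_full = cross_val_score(LogisticRegression(), X_full, y_toy, cv=5)
scores_reduced = cross_val_score(LogisticRegression(), X_full[:, :2], y_toy, cv=5)

print(scores_full.mean(), scores_reduced.mean())
```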

Refit the OLS model on the reduced training set and print the regression summary

In [None]:
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())