# **Diabetes Prediction:**

> The dataset comprises health-related features: 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', and 'Age'. The objective is to predict the binary 'Outcome' label, which indicates whether a patient is diabetes positive (1) or negative (0).



## About the Data:
Data Overview: The data comes from [diabetes.csv](https://www.kaggle.com/datasets/mathchi/diabetes-data-set), hosted on Kaggle.

## **Import Required Libraries:**

In [None]:
# Ignore warning messages to prevent them from being displayed during code execution
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np    # Importing the NumPy library for linear algebra operations
import pandas as pd   # Importing the Pandas library for data processing and CSV file handling

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import seaborn as sns                   # Importing the Seaborn library for statistical data visualization
import matplotlib.pyplot as plt         # Importing the Matplotlib library for creating plots and visualizations
import plotly.express as px             # Importing the Plotly Express library for interactive visualizations

## **Exploratory Data Analysis:**


### **Load and Prepare Data:**

In [None]:
df=pd.read_csv('/kaggle/input/diabetes-data-set/diabetes.csv')

### **Understanding the Variables**

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.sample(5)

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.size

In [None]:
df.shape

### **Data Cleaning:**


In [None]:
df.shape

In [None]:
df=df.drop_duplicates()

In [None]:
df.shape

Check for null values:


In [None]:
df.isnull().sum()

There are no missing values in the data.

In [None]:
df.columns

**Check the number of zero values in the dataset**

In [None]:
print("No. of Zero Values in Glucose ", df[df['Glucose']==0].shape[0])

In [None]:
print("No. of Zero Values in Blood Pressure ", df[df['BloodPressure']==0].shape[0])

In [None]:
print("No. of Zero Values in SkinThickness ", df[df['SkinThickness']==0].shape[0])

In [None]:
print("No. of Zero Values in Insulin ", df[df['Insulin']==0].shape[0])

In [None]:
print("No. of Zero Values in BMI ", df[df['BMI']==0].shape[0])

**Replace zeros with the mean of each column**

In [None]:
df['Glucose']=df['Glucose'].replace(0, df['Glucose'].mean())
print('No of zero Values in Glucose ', df[df['Glucose']==0].shape[0])

In [None]:
df['BloodPressure']=df['BloodPressure'].replace(0, df['BloodPressure'].mean())
df['SkinThickness']=df['SkinThickness'].replace(0, df['SkinThickness'].mean())
df['Insulin']=df['Insulin'].replace(0, df['Insulin'].mean())
df['BMI']=df['BMI'].replace(0, df['BMI'].mean())
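
The column means used above are computed while the zeros are still present, so the placeholder zeros pull each replacement value down. A minimal sketch of one alternative, assuming that zeros in these columns stand for missing measurements: mask them as NaN first, then fill with the mean of the remaining values. The small `demo` frame and its numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros stand in for missing measurements.
demo = pd.DataFrame({
    'Glucose': [0, 100, 140],
    'BMI': [30.0, 0.0, 20.0],
})

cols = ['Glucose', 'BMI']

# Treat zeros as missing so they do not drag the column mean down.
masked = demo[cols].replace(0, np.nan)

# mean() skips NaN by default, so each fill value is the mean of the non-zero entries.
filled = masked.fillna(masked.mean())
demo[cols] = filled

print(demo)
```

Whether the mean or the median is the better fill value depends on how skewed each column is; swapping `masked.mean()` for `masked.median()` is the only change needed to try the latter.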

Verify that the zero values have been replaced:

In [None]:
df.describe()

## **Data Visualization:**

In [None]:
# Create two side-by-side subplots
f, ax = plt.subplots(1, 2, figsize=(10, 5))

# Pie chart for Outcome distribution
df['Outcome'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Outcome')
ax[0].set_ylabel(' ')

# Count plot for Outcome distribution
sns.countplot(x='Outcome', data=df, ax=ax[1])
ax[1].set_title('Outcome')

# Display class distribution, looking counts up by label so the result
# does not depend on the order in which value_counts() returns them
counts = df['Outcome'].value_counts()
N, P = counts.loc[0], counts.loc[1]
print('Negative (0):', N)
print('Positive (1):', P)

# Adding grid and showing plots
plt.grid()
plt.show()

*  *1 represents --> Diabetes Positive*
*  *0 represents --> Diabetes Negative*

### Histograms:

In [None]:
df.hist(bins=10, figsize=(10, 10))
plt.show()

### Scatter Plot:

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df, figsize =(20, 20))

### Pair plot:


In [None]:
sns.pairplot(data=df, hue='Outcome')
plt.show()

In [None]:
# Heatmap of the correlation matrix for the columns in the DataFrame
plt.figure(figsize=(12, 6))
sns.heatmap(df.corr(), annot=True, cmap='Reds')
plt.show()

In [None]:
# The mean of the binary 'Outcome' column is the fraction of positive cases
mean = df['Outcome'].mean()
mean

## **Split the DataFrame into X and y**

In [None]:
target_name='Outcome'

y=df[target_name]

X= df.drop(target_name, axis=1)

In [None]:
X.head()

In [None]:
y.head()

### **Feature Scaling**

In [None]:
# Standard Scaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
SSX = scaler.transform(X)


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(SSX, y, test_size=0.2, random_state=7)
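
Fitting `StandardScaler` on the full `X` before splitting lets the test rows influence the scaling statistics. A hedged sketch of the more conservative order, splitting first and fitting the scaler on the training portion only, is shown here on synthetic data; the names `X_demo`, `y_demo`, and `scaler_demo` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (illustrative only).
rng = np.random.default_rng(7)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # close to zero for every column
print(X_te_scaled.shape)
```

With this order the scaled test columns will generally not have exactly zero mean, which is expected: they are scaled with statistics the model could have known at training time.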

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

# **Classification Algorithms:**

## **Logistic Regression:**

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')  # binary target, so no multi_class setting is needed
lr.fit(X_train, y_train)

## **Decision Tree:**

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=7)  # fixed seed so the fitted tree is reproducible
dt.fit(X_train, y_train)

# **Making prediction:**



> Logistic Regression:



In [None]:
X_test.shape

In [None]:
lr_pred=lr.predict(X_test)

In [None]:
lr_pred.shape



> Decision Tree:



In [None]:
dt_pred=dt.predict(X_test)

In [None]:
dt_pred.shape

# **Model Evaluation:**

> Train Score and Test Score



In [None]:
# For Logistic Regression:
from sklearn.metrics import accuracy_score
print("Train Accuracy of Logistic Regression: ", lr.score(X_train, y_train)*100)
print("Accuracy (Test) Score of Logistic Regression: ", lr.score(X_test, y_test)*100)
print("Accuracy Score of Logistic Regression: ", accuracy_score(y_test, lr_pred)*100)

In [None]:
# For Decision Tree:
print("Train Accuracy of Decision Tree: ", dt.score(X_train, y_train)*100)
print("Accuracy (Test) Score of Decision Tree: ", dt.score(X_test, y_test)*100)
print("Accuracy Score of Decision Tree: ", accuracy_score(y_test, dt_pred)*100)

# **Confusion Matrix**



*   *Confusion Matrix of "Logistic Regression"*


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, lr_pred)
cm

In [None]:
sns.heatmap(confusion_matrix(y_test, lr_pred), annot=True, fmt="d")

In [None]:
TN =cm[0, 0]
FP =cm[0,1]
FN = cm[1,0]
TP  = cm[1,1]

In [None]:
TN, FP, FN, TP

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
cm = confusion_matrix(y_test, lr_pred)

print('TN - True Negative {}'.format(cm[0,0]))
print('FP - False Positive {}'.format(cm[0,1]))
print('FN - False Negative {}'.format(cm[1,0]))
print('TP - True Positive {}'.format(cm[1,1]))
print('Accuracy Rate: {}'.format(np.divide(np.sum([cm[0,0], cm[1,1]]), np.sum(cm))*100))
print('Misclassification Rate: {}'.format(np.divide(np.sum([cm[0,1], cm[1,0]]), np.sum(cm))*100))

In [None]:
77.27272727272727+22.727272727272727

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.clf()
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
classNames = ['0', '1']
plt.title('Confusion Matrix of Logistic Regression')
plt.ylabel('Actual (true) Values')
plt.xlabel('Predicted Values')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j, i, str(s[i][j]) + " = " + str(cm[i][j]))

plt.show()

In [None]:
pd.crosstab(y_test, lr_pred, margins=False)

In [None]:
pd.crosstab(y_test, lr_pred, margins=True)

In [None]:
pd.crosstab(y_test, lr_pred, rownames=['Actual values'], colnames=['Predicted values'], margins=True)

### **Precision:**
     PPV - Positive Predictive Value

Precision = True Positive / (True Positive + False Positive)  
Precision = TP / (TP + FP)  

In [None]:
TP, FP

In [None]:
Precision = TP/(TP+FP)
Precision

In [None]:
33/(33+11)

In [None]:
# Precision as a percentage (named precision_pct so it does not shadow sklearn's precision_score)

precision_pct = TP/float(TP+FP)*100
print('Precision Score: {0:0.4f}'.format(precision_pct))

In [None]:
from sklearn.metrics import precision_score
print("Precision Score is: ", precision_score(y_test, lr_pred)*100)
print("Micro Average Precision Score is: ", precision_score(y_test, lr_pred, average='micro')*100)
print("Macro Average Precision Score is: ", precision_score(y_test, lr_pred, average='macro')*100)
print("Weighted Average Precision Score is: ", precision_score(y_test, lr_pred, average='weighted')*100)
print("Per-class Precision Scores (average=None): ", precision_score(y_test, lr_pred, average=None)*100)

In [None]:
print('Classification Report of Logistic Regression: \n', classification_report(y_test, lr_pred, digits=4))

## **Recall**
    True Positive Rate (TPR)

Recall = True Positive / (True Positive + False Negative)  
Recall = TP / (TP + FN)  

In [None]:
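# Recall as a percentage (named recall_pct so it does not shadow sklearn's recall_score)
recall_pct = TP / float(TP+FN)*100
print('Recall Score: {0:0.4f}'.format(recall_pct))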

In [None]:
TP, FN

In [None]:
33/(33+24)

In [None]:
from sklearn.metrics import recall_score
print('Recall or Sensitivity_Score: ', recall_score(y_test, lr_pred)*100)

In [None]:
print("recall Score is: ", recall_score(y_test, lr_pred)*100)
print("Micro Average recall Score is: ", recall_score(y_test, lr_pred, average='micro')*100)
print("Macro Average recall Score is: ", recall_score(y_test, lr_pred, average='macro')*100)
print("Weighted Average recall Score is: ", recall_score(y_test, lr_pred, average='weighted')*100)
print("Per-class recall Scores (average=None): ", recall_score(y_test, lr_pred, average=None)*100)

In [None]:
print('Classification Report of Logistic Regression: \n', classification_report(y_test, lr_pred, digits=4))

**`   FPR - False Positive Rate`**

In [None]:
FPR = FP / float(FP + TN) * 100
print('False Positive Rate: {:.4f}'.format(FPR))

In [None]:
FP, TN

In [None]:
11/(11+86)

## **Specificity:**

In [None]:
specificity = TN /(TN+FP)*100
print('Specificity : {0:0.4f}'.format(specificity))
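
Specificity and the false positive rate are complements (specificity = 1 - FPR), and precision, recall, FPR, and specificity all derive from the same four confusion-matrix counts. A small sketch that makes this explicit; the helper name `rates_from_counts` is hypothetical, and the counts are the ones used in the hand calculations above (TN=86, FP=11, FN=24, TP=33).

```python
def rates_from_counts(tn, fp, fn, tp):
    """Return common binary-classification rates as fractions in [0, 1]."""
    return {
        'precision': tp / (tp + fp),    # of predicted positives, how many are correct
        'recall': tp / (tp + fn),       # of actual positives, how many are found
        'fpr': fp / (fp + tn),          # of actual negatives, how many are flagged
        'specificity': tn / (tn + fp),  # of actual negatives, how many are cleared
    }

rates = rates_from_counts(tn=86, fp=11, fn=24, tp=33)
for name, value in rates.items():
    print(f'{name}: {value:.4f}')
```

Returning fractions rather than percentages keeps the helper consistent with what scikit-learn's metric functions return; multiply by 100 when printing if percentages are preferred.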

In [None]:
from sklearn.metrics import f1_score
print('F1 Score (binary, positive class): ', f1_score(y_test, lr_pred)*100)

In [None]:
print("Micro Average f1 Score is: ", f1_score(y_test, lr_pred, average='micro')*100)
print("Macro Average f1 Score is: ", f1_score(y_test, lr_pred, average='macro')*100)
print("Weighted Average f1 Score is: ", f1_score(y_test, lr_pred, average='weighted')*100)
print("Per-class f1 Scores (average=None): ", f1_score(y_test, lr_pred, average=None)*100)
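
The `average` options differ only in how the per-class scores are combined. A minimal sketch on made-up labels (the `y_true_demo` and `y_pred_demo` arrays are illustrative, not taken from the dataset) showing that the macro average is the plain mean of the per-class F1 scores, the weighted average weights each class by its support, and the micro average equals accuracy in a single-label problem:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels: four negatives and two positives.
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 0])

# One F1 score per class, in label order [0, 1].
per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# Number of true samples in each class, used as the weights.
support = np.bincount(y_true_demo)

macro_by_hand = per_class.mean()
weighted_by_hand = np.average(per_class, weights=support)

print('per class      :', per_class)
print('macro (manual) :', macro_by_hand)
print('macro (sklearn):', f1_score(y_true_demo, y_pred_demo, average='macro'))
print('weighted (manual) :', weighted_by_hand)
print('weighted (sklearn):', f1_score(y_true_demo, y_pred_demo, average='weighted'))
print('micro (sklearn):', f1_score(y_true_demo, y_pred_demo, average='micro'))
print('accuracy       :', accuracy_score(y_true_demo, y_pred_demo))
```

On an imbalanced target such as `Outcome`, the macro average gives the minority class the same say as the majority class, while the weighted average leans toward the majority.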

## **Classification Report of Logistic Regression:**

In [None]:
from sklearn.metrics import classification_report
print('Classification Report of Logistic Regression: \n', classification_report(y_test, lr_pred, digits=4))

## **ROC Curve & ROC AUC**

In [None]:
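# Area under the ROC curve, computed from predicted probabilities rather than hard 0/1 labels
from sklearn.metrics import roc_auc_score
lr_proba = lr.predict_proba(X_test)[:, 1]
lr_auc = roc_auc_score(y_test, lr_proba)
print("ROC AUC score of Logistic Regression is ", lr_auc)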

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Use the positive-class probabilities so the curve is traced over many thresholds
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='Chance')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve of Logistic Regression")
plt.legend()
plt.grid()
plt.show()

## Confusion Matrix:
*    Confusion matrix of "Decision Tree"

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, dt_pred)
cm

In [None]:
sns.heatmap(confusion_matrix(y_test, dt_pred), annot=True, fmt="d")

In [None]:
TN =cm[0, 0]
FP =cm[0,1]
FN = cm[1,0]
TP  = cm[1,1]

In [None]:
TN, FP, FN, TP

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
cm = confusion_matrix(y_test, dt_pred)

print('TN - True Negative {}'.format(cm[0,0]))
print('FP - False Positive {}'.format(cm[0,1]))
print('FN - False Negative {}'.format(cm[1,0]))
print('TP - True Positive {}'.format(cm[1,1]))
print('Accuracy Rate: {}'.format(np.divide(np.sum([cm[0,0], cm[1,1]]), np.sum(cm))*100))
print('Misclassification Rate: {}'.format(np.divide(np.sum([cm[0,1], cm[1,0]]), np.sum(cm))*100))

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.clf()
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
classNames = ['0', '1']
plt.title('Confusion Matrix of Decision Tree')
plt.ylabel('Actual (true) Values')
plt.xlabel('Predicted Values')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j, i, str(s[i][j]) + " = " + str(cm[i][j]))

plt.show()

## Precision:


In [None]:
# Precision as a percentage (named precision_pct so it does not shadow sklearn's precision_score)

precision_pct = TP/float(TP+FP)*100
print('Precision Score: {0:0.4f}'.format(precision_pct))

In [None]:
from sklearn.metrics import precision_score

print("Precision Score is:", precision_score(y_test, dt_pred) * 100)
print("Micro Average Precision Score is:", precision_score(y_test, dt_pred, average='micro') * 100)
print("Macro Average Precision Score is:", precision_score(y_test, dt_pred, average='macro') * 100)
print("Weighted Average Precision Score is:", precision_score(y_test, dt_pred, average='weighted') * 100)
print("Per-class Precision Scores (average=None):", precision_score(y_test, dt_pred, average=None) * 100)

## Recall:

In [None]:
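# Recall as a percentage (named recall_pct so it does not shadow sklearn's recall_score)
recall_pct = TP / float(TP+FN)*100
print('Recall Score: {0:0.4f}'.format(recall_pct))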

In [None]:
from sklearn.metrics import recall_score
print('Recall or Sensitivity_Score: ', recall_score(y_test, dt_pred)*100)

In [None]:
print("recall Score is: ", recall_score(y_test, dt_pred)*100)
print("Micro Average recall Score is: ", recall_score(y_test, dt_pred, average='micro')*100)
print("Macro Average recall Score is: ", recall_score(y_test, dt_pred, average='macro')*100)
print("Weighted Average recall Score is: ", recall_score(y_test, dt_pred, average='weighted')*100)
print("Per-class recall Scores (average=None): ", recall_score(y_test, dt_pred, average=None)*100)

## FPR

In [None]:
FPR = FP / float(FP + TN) * 100
print('False Positive Rate: {:.4f}'.format(FPR))

## Specificity:

In [None]:
specificity = TN /(TN+FP)*100
print('Specificity : {0:0.4f}'.format(specificity))

In [None]:
from sklearn.metrics import f1_score
print('F1 Score (binary, positive class): ', f1_score(y_test, dt_pred)*100)

In [None]:
print("Micro Average f1 Score is: ", f1_score(y_test, dt_pred, average='micro')*100)
print("Macro Average f1 Score is: ", f1_score(y_test, dt_pred, average='macro')*100)
print("Weighted Average f1 Score is: ", f1_score(y_test, dt_pred, average='weighted')*100)
print("Per-class f1 Scores (average=None): ", f1_score(y_test, dt_pred, average=None)*100)

## **Classification Report of Decision Tree:**

In [None]:
from sklearn.metrics import classification_report
print('Classification Report of Decision Tree: \n', classification_report(y_test, dt_pred, digits=4))

## **ROC Curve & ROC AUC**

In [None]:
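# Area under the ROC curve, computed from predicted probabilities rather than hard 0/1 labels
from sklearn.metrics import roc_auc_score
dt_proba = dt.predict_proba(X_test)[:, 1]
dt_auc = roc_auc_score(y_test, dt_proba)
print("ROC AUC score of Decision Tree is ", dt_auc)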

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Use the positive-class probabilities as scores for the curve
fpr, tpr, thresholds = roc_curve(y_test, dt.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='Chance')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve of Decision Tree")
plt.legend()
plt.grid()
plt.show()