<a href="https://colab.research.google.com/github/halaalduh/Diabetes-Prediction-using-Healthcare-Dataset/blob/main/Copy_of_Untitled8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv("Healthcare-Diabetes.csv")
df.head()

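|   | Id | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |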


In [18]:
# Basic info about dataset
print("Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())


Shape: (2768, 10)

Data Types:
Id                            int64
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Missing Values:
Id                          0
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


# **Handling Missing Values**

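Zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are not realistic measurements and stand in for missing data, so they were replaced with NaN. Glucose, BloodPressure, and BMI were then filled with the column mean, and SkinThickness and Insulin with the column median. This removes the unrealistic zeros and leaves every record complete.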

In [19]:
# Columns where zero is invalid
cols_with_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
for col in cols_with_zero:
    df[col] = df[col].replace(0, np.nan)

# Check missing values again
df.isnull().sum()

Id                             0
Pregnancies                    0
Glucose                       18
BloodPressure                125
SkinThickness                800
Insulin                     1330
BMI                           39
DiabetesPedigreeFunction       0
Age                            0
Outcome                        0
dtype: int64


In [20]:
# Imputation strategy
df['Glucose'] = df['Glucose'].fillna(df['Glucose'].mean())
df['BloodPressure'] = df['BloodPressure'].fillna(df['BloodPressure'].mean())
df['BMI'] = df['BMI'].fillna(df['BMI'].mean())
df['SkinThickness'] = df['SkinThickness'].fillna(df['SkinThickness'].median())
df['Insulin'] = df['Insulin'].fillna(df['Insulin'].median())
df.isnull().sum()

Id                          0
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
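
The notebook does not say why some columns are filled with the mean and others with the median. One common rationale, assumed here, is that the median is less affected by skewed values. A minimal sketch on made-up numbers (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with two missing entries (hypothetical data).
insulin = pd.Series([50.0, 60.0, 70.0, 80.0, 600.0, np.nan, np.nan])

# mean() and median() skip NaN by default.
fill_mean = insulin.mean()      # 172.0, pulled upward by the single value 600
fill_median = insulin.median()  # 70.0, unaffected by how extreme 600 is

mean_filled = insulin.fillna(fill_mean)
median_filled = insulin.fillna(fill_median)

print("filled with mean:  ", mean_filled.tolist())
print("filled with median:", median_filled.tolist())
```

On data like this, mean imputation places the filled entries above four of the five observed values, while median imputation places them in the middle of the observed values.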


# **Outlier Handling**

Outliers were handled with the IQR (interquartile range) method: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were capped at those limits rather than dropped.
Capping was applied to Insulin and DiabetesPedigreeFunction.
This limits the influence of extreme values while keeping every row in the dataset.

In [21]:
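def cap_outliers(series):
    """Cap values outside the 1.5 * IQR fences at the nearest fence."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # clip() replaces values below `lower` or above `upper` with the bound itself
    return series.clip(lower=lower, upper=upper)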

df['Insulin'] = cap_outliers(df['Insulin'])
df['DiabetesPedigreeFunction'] = cap_outliers(df['DiabetesPedigreeFunction'])
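
To make the capping rule concrete, here is a small worked sketch on made-up values. The helper name `iqr_bounds` and its `k` parameter are hypothetical and not part of the notebook; the capping step uses pandas' `Series.clip`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme (hypothetical data).
values = pd.Series([10, 12, 14, 16, 18, 20, 100], dtype=float)

lower, upper = iqr_bounds(values)   # Q1 = 13, Q3 = 19, IQR = 6
capped = values.clip(lower=lower, upper=upper)

print(lower, upper)      # 4.0 28.0
print(capped.tolist())   # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 28.0]
```

Only the value 100 lies outside the fences, so it is pulled down to the upper fence of 28 and the other values are left unchanged.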


# **Noise Removal**

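Applied a moving average (a rolling mean over a window of three rows) to Glucose, BloodPressure, and BMI to damp random fluctuations. Each smoothed value is the mean of the current row and up to two preceding rows, so it blends measurements from adjacent records.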

In [22]:
# Noise removal using rolling mean (smoothing)
for col in ['Glucose', 'BloodPressure', 'BMI']:
    df[col] = df[col].rolling(window=3, min_periods=1).mean()

df.head()

|   | Id | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 148.0 | 72.0 | 35.0 | 126.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 2 | 1 | 116.5 | 69.0 | 29.0 | 126.0 | 30.1 | 0.351 | 31 | 0 |
| 2 | 3 | 8 | 138.666667 | 67.333333 | 29.0 | 126.0 | 27.833333 | 0.672 | 32 | 1 |
| 3 | 4 | 1 | 119.0 | 65.333333 | 23.0 | 105.0 | 26.0 | 0.167 | 21 | 0 |
| 4 | 5 | 0 | 136.333333 | 56.666667 | 35.0 | 145.0 | 31.5 | 1.194 | 33 | 1 |
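
As a check on how the smoothing works, the sketch below applies the same rolling call to the first five raw Glucose readings from the initial `df.head()` output (148, 85, 183, 89, 137). None of them is zero, so imputation leaves them unchanged.

```python
import pandas as pd

# First five raw Glucose readings, copied from the initial df.head() output.
glucose = pd.Series([148, 85, 183, 89, 137], dtype=float)

# window=3 averages the current row with up to two preceding rows;
# min_periods=1 lets the first rows use however many values exist so far.
smoothed = glucose.rolling(window=3, min_periods=1).mean()

print(smoothed.round(6).tolist())
# [148.0, 116.5, 138.666667, 119.0, 136.333333]
```

These match the Glucose column shown after smoothing. For example, 116.5 is the mean of 148 and 85, the readings in rows 0 and 1.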


# **Discretization**

Converted continuous values into categories for easier interpretation.
Age was grouped into decade-wide ranges (20–29, 30–39, and so on, with a final 60+ group), and Glucose was classified as Normal (up to 99), Prediabetes (above 99 up to 125), or Diabetes (above 125).
The binned columns support group-level comparisons alongside the original numeric features.

In [23]:
# ===== Discretization (Binning) =====
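# Discretize Age into categories.
# pd.cut closes intervals on the right by default, so these bins are
# (20, 30], (30, 40], ...: Age 50 is labelled "40-49", and ages of 20 or
# below, or above 80, become NaN. Passing right=False would align bins and labels.
df['Age_Bin'] = pd.cut(df['Age'],
                       bins=[20, 30, 40, 50, 60, 80],
                       labels=["20-29","30-39","40-49","50-59","60+"])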

# Discretize Glucose into categories
df['Glucose_Bin'] = pd.cut(df['Glucose'],
                           bins=[0, 99, 125, 200],
                           labels=["Normal","Prediabetes","Diabetes"])

# Show first rows with new bins
df[['Age', 'Age_Bin', 'Glucose', 'Glucose_Bin']].head()

|   | Age | Age_Bin | Glucose | Glucose_Bin |
|---|---|---|---|---|
| 0 | 50 | 40-49 | 148.0 | Diabetes |
| 1 | 31 | 30-39 | 116.5 | Prediabetes |
| 2 | 32 | 30-39 | 138.666667 | Diabetes |
| 3 | 21 | 20-29 | 119.0 | Prediabetes |
| 4 | 33 | 30-39 | 136.333333 | Diabetes |
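
`pd.cut` closes each interval on the right by default, which is why Age 50 is labelled 40-49 in the output above. The sketch below compares that default with one possible alternative: left-closed intervals and an open-ended last bin. The ages are illustrative, and the `np.inf` upper edge is an assumption rather than the notebook's choice.

```python
import numpy as np
import pandas as pd

ages = pd.Series([21, 30, 50, 81])  # illustrative ages, not rows from the dataset
labels = ["20-29", "30-39", "40-49", "50-59", "60+"]

# Default behaviour: intervals are (20, 30], (30, 40], ..., (60, 80].
default_bins = pd.cut(ages, bins=[20, 30, 40, 50, 60, 80], labels=labels)

# Alternative: intervals are [20, 30), [30, 40), ..., [60, inf).
left_closed = pd.cut(ages, bins=[20, 30, 40, 50, 60, np.inf],
                     labels=labels, right=False)

print(pd.DataFrame({"Age": ages,
                    "default": default_bins,
                    "left_closed": left_closed}))
```

With the default, 30 falls in 20-29, 50 falls in 40-49, and 81 is left as NaN. With `right=False` and the open upper edge, the same ages map to 30-39, 50-59, and 60+.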


# **Normalization**

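Applied MinMaxScaler to rescale every numeric feature to the range 0 to 1, leaving Outcome and the binned columns unchanged. Putting the features on a common scale keeps attributes with large raw ranges, such as Insulin, from dominating scale-sensitive models.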

In [24]:
# Drop Id column (not useful)
df = df.drop(columns=['Id'])

# Apply Min-Max Normalization [0,1] only to numeric columns
scaler = MinMaxScaler()

# Select only numeric columns
num_cols = df.select_dtypes(include=[np.number]).drop(columns=['Outcome']).columns

df[num_cols] = scaler.fit_transform(df[num_cols])

df.head()



|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_Bin | Glucose_Bin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352941 | 0.680645 | 0.526316 | 0.271845 | 0.525 | 0.288945 | 0.491935 | 0.483333 | 1 | 40-49 | Diabetes |
| 1 | 0.058824 | 0.375806 | 0.467105 | 0.213592 | 0.525 | 0.201005 | 0.244624 | 0.166667 | 0 | 30-39 | Prediabetes |
| 2 | 0.470588 | 0.590323 | 0.434211 | 0.213592 | 0.525 | 0.144054 | 0.532258 | 0.183333 | 1 | 30-39 | Diabetes |
| 3 | 0.058824 | 0.4 | 0.394737 | 0.15534 | 0.0 | 0.09799 | 0.079749 | 0.0 | 0 | 20-29 | Prediabetes |
| 4 | 0.0 | 0.567742 | 0.223684 | 0.271845 | 1.0 | 0.236181 | 1.0 | 0.2 | 1 | 30-39 | Diabetes |
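
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column value x to (x - min) / (max - min). The sketch below applies that formula by hand to a few illustrative ages. It assumes the Age column runs from 21 to 81, which is consistent with the scaled Age values in the output above but is not stated in the notebook.

```python
import numpy as np

# Illustrative ages; 21 and 81 are assumed to be the column minimum and maximum.
ages = np.array([21.0, 31.0, 50.0, 81.0])

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(np.round(scaled, 6))
```

Under that assumption, ages 21, 31, and 50 map to 0.0, 0.166667, and 0.483333, the same scaled values that appear for those ages in the table.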


In [25]:
# Save preprocessed dataset
df.to_csv("Preprocessed_dataset.csv", index=False)
print("Preprocessed dataset saved as Preprocessed_dataset.csv")

Preprocessed dataset saved as Preprocessed_dataset.csv
