In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
marshalpatel3558_diabetes_prediction_dataset_path = kagglehub.dataset_download('marshalpatel3558/diabetes-prediction-dataset')

print('Data source import complete.')


# Diabetes Prediction Analysis

In this project we analyse people's health data and try to predict the likelihood of diabetes. Diabetes, also known as diabetes mellitus, is a group of common endocrine diseases characterized by sustained high blood sugar levels. It is caused either by the pancreas not producing enough of the hormone insulin or by the body's cells becoming unresponsive to insulin's effects. Classic symptoms include polydipsia (excessive thirst), polyuria (excessive urination), polyphagia (excessive hunger), weight loss, and blurred vision. If left untreated, the disease can lead to various health complications, including disorders of the cardiovascular system, eyes, kidneys, and nerves. Diabetes accounts for approximately 4.2 million deaths every year, with an estimated 1.5 million caused by either untreated or poorly treated diabetes.

The number of people living with diabetes has increased sharply in recent decades, from 200 million in 1990 to 830 million in 2022. It affects one in seven adults, with type 2 diabetes accounting for more than 95% of cases. These numbers have already exceeded earlier projections of 783 million adults by 2045. Prevalence continues to rise, most dramatically in low- and middle-income nations. Rates are similar in women and men, and diabetes is the seventh leading cause of death globally. Global expenditure on diabetes-related healthcare is an estimated US$760 billion a year.

Detecting diabetes early therefore matters, because it allows treatment to start sooner. We analyse the data and try to predict whether a person is likely to have diabetes. This is a classification problem.

## About the Dataset

The provided dataset contains information on diabetes risk factors and associated health metrics. It is real-world data scraped using Perplexity AI. Below is a detailed description of the dataset:

| **Column Name**                      | **Type**       | **Description**                                                                 |
|-------------------------------------|----------------|---------------------------------------------------------------------------------|
| Age                                 | Numerical      | Age of the individual in years.                                                |
| Sex                                 | Categorical    | Gender of the individual (e.g., Male, Female).                                 |
| Ethnicity                           | Categorical    | Ethnic background (e.g., White, Asian, Black, Hispanic).                       |
| BMI (Body Mass Index)               | Numerical      | A measure of body fat based on weight and height.                              |
| Waist Circumference                 | Numerical      | Measurement of waist size in centimeters.                                      |
| Fasting Blood Glucose               | Numerical      | Blood glucose levels after fasting, measured in mg/dL.                         |
| HbA1c                               | Numerical      | Glycated hemoglobin percentage, indicating average blood sugar levels.         |
| Blood Pressure Systolic             | Numerical      | Systolic blood pressure (top number), measured in mmHg.                        |
| Blood Pressure Diastolic            | Numerical      | Diastolic blood pressure (bottom number), measured in mmHg.                    |
| Cholesterol Total                   | Numerical      | Total cholesterol level in mg/dL.                                              |
| Cholesterol HDL                     | Numerical      | "Good" cholesterol level in mg/dL.                                             |
| Cholesterol LDL                     | Numerical      | "Bad" cholesterol level in mg/dL.                                              |
| GGT (Gamma-Glutamyl Transferase)    | Numerical      | Liver enzyme level indicative of liver function or damage.                     |
| Serum Urate                         | Numerical      | Uric acid levels in the blood, measured in mg/dL.                              |
| Physical Activity Level             | Categorical    | Level of physical activity (e.g., Low, Moderate, High).                        |
| Dietary Intake Calories             | Numerical      | Daily calorie intake in kilocalories.                                          |
| Alcohol Consumption                 | Categorical    | Alcohol consumption level (e.g., None, Moderate, Heavy).                       |
| Smoking Status                      | Categorical    | Smoking habits (e.g., Never, Former, Current).                                 |
| Family History of Diabetes          | Binary (0 or 1)| 1 = Family history present; 0 = No family history of diabetes.                 |
| Previous Gestational Diabetes       | Binary (0 or 1)| 1 = History of gestational diabetes; 0 = No history.                           |


Dataset: [https://www.kaggle.com/datasets/marshalpatel3558/diabetes-prediction-dataset](https://www.kaggle.com/datasets/marshalpatel3558/diabetes-prediction-dataset)

## Importing the necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


import missingno as msno
plt.style.use('fivethirtyeight')

## Loading the Dataset

In [None]:
df = pd.read_csv('/kaggle/input/diabetes-prediction-dataset/diabetes_dataset.csv')
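
The hard-coded `/kaggle/input/...` path only exists inside a Kaggle kernel. A minimal sketch of a fallback lookup follows; the helper name `resolve_csv_path` and the second candidate location are hypothetical and would need adjusting to the actual environment.

```python
import os

def resolve_csv_path(candidates):
    """Return the first path in `candidates` that exists, or None."""
    for path in candidates:
        if path and os.path.exists(path):
            return path
    return None

# Hypothetical candidate locations: the Kaggle kernel path used above, then a
# local copy in the working directory (assumed, adjust to your setup)
csv_path = resolve_csv_path([
    '/kaggle/input/diabetes-prediction-dataset/diabetes_dataset.csv',
    os.path.join(os.getcwd(), 'diabetes_dataset.csv'),
])
print(csv_path)
```

Outside Kaggle, the directory returned by `kagglehub.dataset_download` joined with the file name could be added as another candidate, assuming the CSV sits at the top level of that directory.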

### Exploring the Dataset

In [None]:
df.head()

In [None]:
df.drop(columns='Unnamed: 0', inplace=True)


In [None]:
df.shape

In [None]:
df.columns

### Creating a new column for the prediction

In [None]:
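# Label a row as diabetic when fasting glucose is at least 126 mg/dL
# or HbA1c is at least 6.5% (the usual diagnostic cut-offs)
df['Outcome'] = ((df['Fasting_Blood_Glucose'] >= 126) | (df['HbA1c'] >= 6.5)).astype(int)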
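
A threshold rule like this can be sanity-checked on a few hand-made rows before it is trusted on the full data. The sketch below uses made-up values (not taken from the dataset) that stay away from the exact cut-offs, and then looks at the class balance of the resulting label.

```python
import pandas as pd

# Toy rows with made-up values (not taken from the dataset)
toy = pd.DataFrame({
    'Fasting_Blood_Glucose': [95.0, 130.0, 110.0, 140.0],
    'HbA1c': [5.4, 6.0, 7.1, 8.0],
})

# A row is flagged when either marker reaches its cut-off
toy['Outcome'] = (
    (toy['Fasting_Blood_Glucose'] >= 126) | (toy['HbA1c'] >= 6.5)
).astype(int)

# Share of each class; a heavily skewed split would make accuracy misleading
class_share = toy['Outcome'].value_counts(normalize=True).sort_index()
print(toy)
print(class_share)
```

Running the same `value_counts(normalize=True)` on `df['Outcome']` shows how balanced the real label is, which matters when reading the accuracy figures later on.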

### Basic Statistics

In [None]:
# Statistical info
df.describe().style.background_gradient(cmap = "Blues")

In [None]:
# Dtypes info
df.info()

In [None]:
# Null values per column
df.isnull().sum()

In [None]:
# Returns true for a column having null values, else false
df.isnull().any()

We can see that Alcohol_Consumption has a considerable number of null values.

In [None]:
(df.isnull().sum() / len(df)) * 100

About 33% of the values in Alcohol_Consumption are null.

In [None]:
# Unique values per column (NaN counted as a value, as before)
df.nunique(dropna=False)

### Preprocessing

In [None]:
# Fill missing Alcohol_Consumption values with the most frequent category
most_common = df['Alcohol_Consumption'].mode()[0]
df['Alcohol_Consumption'] = df['Alcohol_Consumption'].fillna(most_common)
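
The dataset description lists `None` as one of the Alcohol_Consumption levels. If the CSV stores that level as the literal text `None`, recent pandas versions treat it as a missing value by default, which might explain the nulls seen here. This is an assumption that has not been checked against the file. The sketch below, on a made-up CSV snippet, shows how to keep such text as a regular category when reading.

```python
import io
import pandas as pd

# Made-up CSV snippet; the real file's contents are not verified here
csv_text = "Age,Alcohol_Consumption\n40,None\n52,Moderate\n61,Heavy\n33,None\n"

# Only treat empty fields as missing, so the text "None" stays a category
toy = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])
print(toy['Alcohol_Consumption'].value_counts())
```

If that turns out to be the cause, filling the gaps with the mode would relabel non-drinkers as the most common drinking level, so it is worth checking before imputing.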

In [None]:
df['Age'].value_counts()

In [None]:
# Bucket ages into decades; right=False makes each bin [lower, upper),
# so fractional ages such as 29.5 also land in the expected group
age_bins = [-np.inf, 20, 30, 40, 50, 60, 70, 80, np.inf]
age_labels = ['Under 20', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+']
df['Age Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False).astype(str)

### Exploratory Data Analysis

In [None]:
# Create a Matrix Plot
msno.matrix(df)
plt.show()

The matrix plot visualizes missing values. None remain in Alcohol_Consumption because they were filled in the preprocessing step.

In [None]:
df.columns

In [None]:
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

In [None]:
print(numeric_cols)
print(categorical_cols)

In [None]:
# Create count plots for the categorical columns
for col in categorical_cols:
    sns.countplot(x=col, data=df)
    plt.title(f"Chart for {col}")
    plt.xticks(rotation=45)
    # No plt.legend() here: the bars carry no labels, so it would only warn
    plt.tight_layout()
    plt.show()

In [None]:
# Create box plots
fig, ax = plt.subplots(ncols=4, nrows=4, figsize=(20, 10))
ax = ax.flatten()

for index, col in enumerate(numeric_cols):
    sns.boxplot(y=col, data=df, ax=ax[index])

plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
plt.show()


In [None]:

# Create histograms (with KDE) for the numeric columns
fig, ax = plt.subplots(ncols=4, nrows=3, figsize=(18, 12))
ax = ax.flatten()

numeric_df = df.select_dtypes(include=['int64', 'float64'])
# zip stops at the shorter sequence, so columns beyond the 12 axes are skipped
for axis, (col, value) in zip(ax, numeric_df.items()):
    sns.histplot(value, ax=axis, kde=True)
    axis.set_title(col)

plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
plt.show()

In [None]:
# Bar height is the mean Fasting_Blood_Glucose per ethnicity
sns.barplot(x='Ethnicity', y='Fasting_Blood_Glucose', data=df)
plt.title('Ethnicity vs Fasting_Blood_Glucose')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Bar height is the mean Fasting_Blood_Glucose per sex
sns.barplot(x='Sex', y='Fasting_Blood_Glucose', data=df)
plt.title('Sex vs Fasting_Blood_Glucose')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
grouped_df = df.groupby('Age Group')['Fasting_Blood_Glucose'].mean().reset_index()

age_order = ['Under 20', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+']
grouped_df['Age Group'] = pd.Categorical(grouped_df['Age Group'], categories=age_order, ordered=True)
grouped_df = grouped_df.sort_values('Age Group')

sns.barplot(x='Age Group', y='Fasting_Blood_Glucose', data=grouped_df)
plt.title('Age Group vs Average Fasting Blood Glucose')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Correlation Matrix

In [None]:
corr = df.corr(numeric_only=True)
plt.figure(figsize=(16, 12))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
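
A heatmap with this many columns can be hard to scan. One possible complement is to rank features by the absolute value of their correlation with the target. The sketch below does this on synthetic data; the column names `a`, `b`, and `target` are placeholders, not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: `target` depends on `a` only, `b` is unrelated noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
toy['target'] = (toy['a'] > 0).astype(int)

# Correlation of every feature with the target, strongest first
ranked = (
    toy.corr(numeric_only=True)['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Applied to the notebook's data, the same expression with `'Outcome'` in place of `'target'` would list the columns most linearly related to the label.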

In [None]:
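from sklearn.preprocessing import LabelEncoder

# Fit one encoder per column and keep it, so the integer codes
# can be mapped back to the original labels later
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])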
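
Label encoding replaces each category with an integer, which linear models such as logistic regression may read as an ordering that is not really there. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a toy frame; the column name `Smoking_Status` is assumed from the dataset description and the values are made up.

```python
import pandas as pd

# Toy frame with an assumed column name and made-up values
toy = pd.DataFrame({'Smoking_Status': ['Never', 'Former', 'Current', 'Never']})

# One 0/1 indicator column per category instead of a single integer code
one_hot = pd.get_dummies(toy, columns=['Smoking_Status'], dtype=int)
print(one_hot)
```

Tree-based models are usually less sensitive to this choice, so the integer codes may be acceptable for them.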

## Input Split

In [None]:
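y = df['Outcome']

# Outcome is computed from Fasting_Blood_Glucose and HbA1c, so leaving them
# in X would let every model recover the label directly (target leakage)
X = df.drop(columns=['Outcome', 'Fasting_Blood_Glucose', 'HbA1c'])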
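
Because `Outcome` is computed directly from `Fasting_Blood_Glucose` and `HbA1c`, any model that receives those two columns can recover the label by thresholding them. The sketch below illustrates the effect on synthetic data (random values, not the dataset): the same classifier is scored once with the label-defining columns and once with unrelated features only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
glucose = rng.normal(110, 25, n)    # synthetic fasting glucose, mg/dL
hba1c = rng.normal(6.0, 1.0, n)     # synthetic HbA1c, %
noise = rng.normal(0, 1, (n, 3))    # synthetic features unrelated to the label
y_demo = ((glucose >= 126) | (hba1c >= 6.5)).astype(int)

def holdout_accuracy(X, y):
    """Fit a decision tree on 70% of the rows and score it on the other 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    return DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

acc_with_leak = holdout_accuracy(np.column_stack([glucose, hba1c, noise]), y_demo)
acc_without = holdout_accuracy(noise, y_demo)
print(f"with label-defining columns: {acc_with_leak:.3f}")
print(f"without them:                {acc_without:.3f}")
```

A near-perfect score in the first case reflects the labelling rule rather than predictive skill, which is why scores obtained with those columns in `X` should be read with caution.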

## Model Building

In [None]:
# Classify function
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

def classify(model, X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

    # Scaling lives inside the pipeline so it is fitted on training data only
    pipe = make_pipeline(StandardScaler(), model)

    # Train model
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)

    print("Test Accuracy:", pipe.score(X_te, y_te) * 100)
    print("Classification Report:\n", classification_report(y_te, y_pred))

    # Cross-validation: the pipeline is cloned and refitted on each fold,
    # so the scaler never sees the validation fold
    acc_scores = cross_val_score(pipe, X, y, cv=5)
    print("CV Accuracy Score:", np.mean(acc_scores) * 100)

    f1_scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_macro')
    print("F1 Macro Score:", np.mean(f1_scores))


In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)

In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
classify(model, X, y)

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, X, y)

In [None]:
# Extra Tree
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
classify(model, X, y)

In [None]:
# XGBoost
import xgboost as xgb
model = xgb.XGBClassifier()
classify(model, X, y)

In [None]:
# LightGBM
import lightgbm
model = lightgbm.LGBMClassifier()
classify(model, X, y)
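
Each call above prints its own scores, which makes the models awkward to compare side by side. A possible way to collect the cross-validated scores into one table is sketched below. The helper `compare_models` is hypothetical, and the demo runs on synthetic data from `make_classification` rather than on the notebook's `X` and `y`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def compare_models(models, X, y, cv=5):
    """Return one row per model with mean CV accuracy and macro F1."""
    rows = []
    for name, model in models.items():
        # Scale inside the pipeline so each fold is scaled independently
        pipe = make_pipeline(StandardScaler(), model)
        rows.append({
            'model': name,
            'cv_accuracy': cross_val_score(pipe, X, y, cv=cv).mean(),
            'cv_f1_macro': cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro').mean(),
        })
    table = pd.DataFrame(rows)
    return table.sort_values('cv_f1_macro', ascending=False).reset_index(drop=True)

# Synthetic demo data, only to show the shape of the output
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
results = compare_models({
    'LogisticRegression': LogisticRegression(),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
}, X_demo, y_demo)
print(results)
```

Passing the notebook's own models, `X`, and `y` to the same helper would give a single ranked table instead of six separate printouts.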