# Diabetes Prediction and Analysis

In this project, the researchers aim to investigate and predict the likelihood of diabetes in individuals using a publicly available dataset. The approach involves exploratory data analysis, a comparison of key health metrics between diabetic and non-diabetic groups, and the development of machine learning models for prediction. The primary objective is to identify health indicators significantly linked to diabetes and to build reliable models that can support early detection.


## Dataset used

    Clinical health records (custom/preprocessed)
    Key features: gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level
    Target Variable: diabetes
    

Key Analytics Questions Solved:

    Q1: How does diabetes prevalence vary across age groups and genders?
    Q2: What is the average BMI, glucose, and HbA1c level in diabetic vs non-diabetic people?
    Q3: Which features are most important for predicting diabetes?
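
Q1 and Q2 both reduce to group-wise aggregations in pandas. The sketch below is a minimal illustration, not the notebook's actual code: the six-row frame is made up, and the age bin edges and labels are assumptions chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical miniature sample using the dataset's column names;
# the values are invented for illustration only.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "age": [72.0, 55.0, 30.0, 35.0, 68.0, 47.0],
    "bmi": [29.0, 27.0, 22.0, 24.0, 33.0, 31.0],
    "HbA1c_level": [6.8, 5.9, 5.2, 5.0, 7.1, 6.6],
    "blood_glucose_level": [180, 130, 100, 90, 220, 200],
    "diabetes": [1, 0, 0, 0, 1, 1],
})

# Q1: bin ages into ranges (bin edges are an assumption), then average the
# 0/1 target, which gives the share of diabetic individuals per group.
age_bins = [0, 20, 40, 60, 80]
age_labels = ["0-20", "21-40", "41-60", "61-80"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, include_lowest=True)
prevalence = df.groupby(["age_group", "gender"], observed=True)["diabetes"].mean()

# Q2: average BMI, HbA1c, and glucose for non-diabetic (0) vs diabetic (1).
metric_means = df.groupby("diabetes")[["bmi", "HbA1c_level", "blood_glucose_level"]].mean()

print(prevalence)
print(metric_means)
```

On the real data, the same two `groupby` calls would be applied to `data_set` instead of the toy frame.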

Project Methodology:

    Age binning to group population by age ranges
    Gender-wise and age-group-wise visual comparison using seaborn.countplot
    Boxplot visualization of key metrics: BMI, glucose, HbA1c
    Summary statistics comparison between diabetic and non-diabetic individuals
    Feature importance analysis using three machine learning classifiers
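
The last step above can be sketched with a tree-based model. This section does not name the three classifiers, so the use of scikit-learn's `RandomForestClassifier` here is an assumption, and the data is synthetic: the label is constructed from `HbA1c_level` alone so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the real dataset). The label depends only on
# HbA1c_level, so that feature should come out on top.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.uniform(0, 80, n),
    "bmi": rng.uniform(15, 45, n),
    "HbA1c_level": rng.uniform(3.5, 9.0, n),
    "blood_glucose_level": rng.uniform(80, 300, n),
})
y = (X["HbA1c_level"] > 6.5).astype(int)

# Fit the forest and read the impurity-based importances, one per feature.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 across features, which makes them easy to compare between models, though they can favor high-cardinality numeric features.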


### Definition of Terms

    gender 
        -> refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes.

    age 
        -> an important factor, as diabetes is more commonly diagnosed in older adults. Age ranges from 0 to 80 in this dataset.

    hypertension 
        -> a medical condition in which blood pressure in the arteries is persistently elevated. The value is 0 or 1, where 0 means the individual does not have hypertension and 1 means they do.

    heart_disease
         -> another medical condition associated with an increased risk of developing diabetes. The value is 0 or 1, where 0 means the individual does not have heart disease and 1 means they do.

    smoking_history
         -> considered a risk factor for diabetes; it can also exacerbate diabetes-related complications. The dataset has six categories: not current, former, No Info, current, never, and ever.

    bmi (Body Mass Index)
         -> a measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. The range of BMI in the dataset is from 10.16 to 71.55. BMI less than 18.5 is underweight, 18.5-24.9 is normal, 25-29.9 is overweight, and 30 or more is obese. 

    HbA1c_level (Hemoglobin A1c)
         -> a measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. An HbA1c level of 6.5% or higher generally indicates diabetes.

    blood_glucose_level
         -> refers to the amount of glucose in the bloodstream at a given time. High blood glucose levels are a key indicator of diabetes.

    diabetes 
        -> target variable being predicted, with values of 1 indicating the presence of diabetes and 0 indicating the absence of diabetes.
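
The BMI ranges in the definitions above can be turned into a categorical column with `pd.cut`. The sketch below is a minimal illustration on made-up values. The variable names are hypothetical, and the HbA1c flag uses `>= 6.5` as an assumed cutoff; change it to `>` if a strict threshold is intended.

```python
import pandas as pd

# Made-up BMI values placed on and around the category boundaries.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2], name="bmi")

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25),
# [25, 30), [30, inf), matching "less than 18.5" and "30 or more".
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)

# Flag HbA1c readings at or above the assumed 6.5% cutoff.
hba1c = pd.Series([5.0, 6.5, 7.2], name="HbA1c_level")
hba1c_flag = hba1c >= 6.5

print(pd.concat([bmi, bmi_category.rename("bmi_category")], axis=1))
print(hba1c_flag)
```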


## General Problem Statement



In [21]:
## Import Basic Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
data_set = pd.read_csv('diabetes_prediction_dataset.csv') # Load the dataset
print("Shape of Dataset (Rows, Columns):", data_set.shape, "\n")
print("First 5 Rows:\n", data_set.head(), "\n")


Shape of Dataset (Rows, Columns): (100000, 9) 

First 5 Rows:
    gender   age  hypertension  heart_disease smoking_history    bmi  \
0  Female  80.0             0              1           never  25.19   
1  Female  54.0             0              0         No Info  27.32   
2    Male  28.0             0              0           never  27.32   
3  Female  36.0             0              0         current  23.45   
4    Male  76.0             1              1         current  20.14   

   HbA1c_level  blood_glucose_level  diabetes  
0          6.6                  140         0  
1          6.6                   80         0  
2          5.7                  158         0  
3          5.0                  155         0  
4          4.8                  155         0   



In [31]:
# Check for Missing Values
print("Missing values per column:")
print(data_set.isnull().sum())

# Find data types
print("\nData types of each column:")
print(data_set.dtypes)


Missing values per column:
gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

Data types of each column:
gender                  object
age                    float64
hypertension             int64
heart_disease            int64
smoking_history         object
bmi                    float64
HbA1c_level            float64
blood_glucose_level      int64
diabetes                 int64
dtype: object
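
The dtype listing shows two text columns, `gender` and `smoking_history`, which most classifiers cannot consume directly. One common option, shown here as an assumption rather than as the notebook's actual preprocessing, is one-hot encoding with `pd.get_dummies`. The three-row frame is a made-up sample.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "smoking_history": ["never", "No Info", "current"],
    "age": [80.0, 54.0, 28.0],
    "diabetes": [0, 0, 0],
})

# The two object-typed columns from the dtype listing.
categorical_cols = ["gender", "smoking_history"]

# Replace each text column with one indicator column per category.
encoded = pd.get_dummies(sample, columns=categorical_cols)
print(encoded.columns.tolist())
```

Numeric columns pass through unchanged, so the encoded frame can be split into features and the `diabetes` target afterwards.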
