# Diabetes Prediction Dataset 📊

This notebook is designed for analyzing and training a model using the **Diabetes Prediction Dataset**. The dataset is sourced from Kaggle and can be accessed at the following link:

- [Diabetes Prediction Dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset)

---

## 🔧 Setup Instructions

### **1. Install Required Libraries**
To run this notebook, ensure you have the following Python libraries installed. You can install them using pip:

```bash
pip install pandas numpy scikit-learn matplotlib seaborn plotly
```

In [27]:
# Import the libraries needed for loading, analysis, and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [28]:
# Load the dataset and preview the first 5 rows with head()
diabetes = pd.read_csv('diabetes_prediction.csv')
diabetes.head(5)

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


# Data Exploration 

**We start with summary statistics.**

In [29]:
# Shape of the dataset: (number of rows, number of columns)
diabetes.shape

(100000, 9)

In [30]:
# Column names
diabetes.columns

Index(['gender', 'age', 'hypertension', 'heart_disease', 'smoking_history',
       'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes'],
      dtype='object')

In [31]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [32]:
# Check for null values in each column
diabetes.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
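
`isnull()` only counts real `NaN`/`None` entries, so a zero here does not rule out missing information stored as text. The preview above already shows `No Info` in `smoking_history`. The sketch below uses a small sample copied from those first five rows and assumes that `No Info` is a placeholder for an unknown value; that interpretation is an assumption, not something the dataset documentation is quoted on here.

```python
import pandas as pd

# Illustrative sample: values copied from the first five rows shown earlier.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Male"],
    "smoking_history": ["never", "No Info", "never", "current", "current"],
})

# isnull() reports zero because "No Info" is an ordinary string, not NaN.
null_counts = sample.isnull().sum()

# Assumption: "No Info" marks an unknown smoking history.
placeholder = "No Info"
n_placeholder = sample["smoking_history"].eq(placeholder).sum()

print(null_counts)
print(f"rows with placeholder '{placeholder}': {n_placeholder}")
```

On the full DataFrame the same check would be `diabetes['smoking_history'].eq('No Info').sum()`.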

In [33]:
# Summary statistics (count, mean, std, min, quartiles) for every column
diabetes.describe(include='all')

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
count,100000,100000.0,100000.0,100000.0,100000,100000.0,100000.0,100000.0,100000.0
unique,3,,,,6,,,,
top,Female,,,,No Info,,,,
freq,58552,,,,35816,,,,
mean,,41.885856,0.07485,0.03942,,27.320767,5.527507,138.05806,0.085
std,,22.51684,0.26315,0.194593,,6.636783,1.070672,40.708136,0.278883
min,,0.08,0.0,0.0,,10.01,3.5,80.0,0.0
25%,,24.0,0.0,0.0,,23.63,4.8,100.0,0.0
50%,,43.0,0.0,0.0,,27.32,5.8,140.0,0.0
75%,,60.0,0.0,0.0,,29.58,6.2,159.0,0.0


In [34]:
# Class balance of the target: how many people have diabetes and how many do not,
# as absolute counts and as percentages.
values_Count = diabetes['diabetes'].value_counts()  # absolute count per class
percentage = diabetes['diabetes'].value_counts(normalize=True) * 100  # share per class, in percent

In [35]:
# Combine both results into one DataFrame so they are easier to read
summary = pd.DataFrame({'count': values_Count, 'percentage': percentage})
summary

diabetes,count,percentage
0,91500,91.5
1,8500,8.5
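
The split is strongly imbalanced, which matters when a model is evaluated later. As a rough illustration, the sketch below computes the accuracy of a trivial baseline that always predicts the majority class, using the counts from the summary table above. The variable names are illustrative and are not used elsewhere in the notebook.

```python
# Class counts copied from the summary table above.
class_counts = {0: 91500, 1: 8500}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
```

Any trained model should be compared against this baseline rather than against 50%, and metrics such as recall or precision on the positive class are usually more informative for data like this.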


In [36]:
# Pairwise correlation between all numeric columns, including the target 'diabetes'
numeric_diabetes = diabetes.select_dtypes(include=['number']) 
numeric_diabetes.corr()

,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes
age,1.0,0.251171,0.233354,0.337396,0.101354,0.110672,0.258008
hypertension,0.251171,1.0,0.121262,0.147666,0.080939,0.084429,0.197823
heart_disease,0.233354,0.121262,1.0,0.061198,0.067589,0.070066,0.171727
bmi,0.337396,0.147666,0.061198,1.0,0.082997,0.091261,0.214357
HbA1c_level,0.101354,0.080939,0.067589,0.082997,1.0,0.166733,0.40066
blood_glucose_level,0.110672,0.084429,0.070066,0.091261,0.166733,1.0,0.419558
diabetes,0.258008,0.197823,0.171727,0.214357,0.40066,0.419558,1.0
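
The full matrix is hard to scan when only the relationship with the target matters. A minimal sketch of ranking the features by their correlation with `diabetes` follows; the values are copied by hand from the `diabetes` column of the matrix above so that the example runs on its own.

```python
import pandas as pd

# 'diabetes' column of the correlation matrix above (target row omitted).
target_corr = pd.Series({
    "age": 0.258008,
    "hypertension": 0.197823,
    "heart_disease": 0.171727,
    "bmi": 0.214357,
    "HbA1c_level": 0.40066,
    "blood_glucose_level": 0.419558,
})

# Strongest linear association with the target first.
ranked = target_corr.sort_values(ascending=False)
print(ranked)
```

On the real data the same ranking can be obtained with `numeric_diabetes.corr()['diabetes'].drop('diabetes').sort_values(ascending=False)`. Note that this is Pearson correlation, which only captures linear association and is a coarse measure for a binary target.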


In [None]:
# The minimum age is 0.08, which looks implausible. Inspect the rows with age < 1 to decide whether they are valid data or noise.
age = diabetes[diabetes['age'] < 1]
age

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
155,Female,0.08,0,0,No Info,14.43,6.5,160,0
218,Female,0.56,0,0,No Info,16.85,5.0,140,0
241,Male,0.88,0,0,No Info,17.49,6.0,140,0
268,Female,0.16,0,0,No Info,12.15,6.6,100,0
396,Male,0.16,0,0,No Info,14.35,6.5,126,0
...,...,...,...,...,...,...,...,...,...
99452,Male,0.32,0,0,No Info,15.93,5.7,100,0
99536,Female,0.40,0,0,No Info,16.66,3.5,140,0
99629,Female,0.64,0,0,No Info,17.58,6.1,140,0
99778,Female,0.32,0,0,No Info,12.26,5.8,126,0
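
Reading a truncated printout is error-prone, so it can help to summarize the subset instead. The sketch below is one way to do that, run on a small sample built from rows shown in this notebook (indices 0, 1, 2 from the preview and 155, 218, 241, 268 from the table above). The `check` dictionary is a hypothetical helper for illustration.

```python
import pandas as pd

# Rows copied from the outputs shown in this notebook.
sample = pd.DataFrame(
    {
        "age": [80.0, 54.0, 28.0, 0.08, 0.56, 0.88, 0.16],
        "hypertension": [0, 0, 0, 0, 0, 0, 0],
        "heart_disease": [1, 0, 0, 0, 0, 0, 0],
        "smoking_history": ["never", "No Info", "never",
                            "No Info", "No Info", "No Info", "No Info"],
        "diabetes": [0, 0, 0, 0, 0, 0, 0],
    },
    index=[0, 1, 2, 155, 218, 241, 268],
)

infants = sample[sample["age"] < 1]

# Summarize the subset instead of reading individual rows.
check = {
    "rows": len(infants),
    "diabetes_positive": int(infants["diabetes"].sum()),
    "hypertension_positive": int(infants["hypertension"].sum()),
    "heart_disease_positive": int(infants["heart_disease"].sum()),
    "smoking_history_counts": infants["smoking_history"].value_counts().to_dict(),
}
print(check)
```

Applied to the full DataFrame, replacing `sample` with `diabetes` gives the same summary for every row with `age < 1`.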


In [43]:
# In the rows with age < 1, hypertension and heart_disease are 0, smoking_history is mostly 'No Info',
# and, most importantly, the target variable diabetes is always 0, so we drop these rows.
diabetes = diabetes[diabetes['age'] >= 1]
diabetes.describe(include='all')

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
count,99089,99089.0,99089.0,99089.0,99089,99089.0,99089.0,99089.0,99089.0
unique,3,,,,6,,,,
top,Female,,,,never,,,,
freq,58119,,,,35058,,,,
mean,,42.266093,0.075538,0.039782,,27.413036,5.529091,138.104835,0.085781
std,,22.266528,0.264259,0.195449,,6.586258,1.071261,40.761804,0.280042
min,,1.0,0.0,0.0,,10.01,3.5,80.0,0.0
25%,,24.0,0.0,0.0,,23.77,4.8,100.0,0.0
50%,,43.0,0.0,0.0,,27.32,5.8,140.0,0.0
75%,,60.0,0.0,0.0,,29.64,6.2,159.0,0.0
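
A quick arithmetic check on the effect of the filter, using only numbers reported in the outputs above (100000 rows before, 99089 after, 8500 positive cases). It assumes that no positive case was removed, which matches the observation that every dropped row had `diabetes = 0`.

```python
rows_before = 100000  # from diabetes.shape
rows_after = 99089    # 'count' row of the describe() output after filtering
positives = 8500      # diabetes == 1, from the class summary

removed = rows_before - rows_after
share_removed = removed / rows_before

# Positive rate after the filter, assuming no positive row was dropped.
positive_rate_after = positives / rows_after

print(f"rows removed: {removed} ({share_removed:.2%})")
print(f"positive rate after filtering: {positive_rate_after:.6f}")
```

The resulting rate agrees with the `diabetes` mean of 0.085781 in the table above, which is consistent with all removed rows being negative cases.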


# Data Visualization 

**We will now visualize the findings from the statistics above to confirm them.**