# Exploratory Data Analysis: Diabetes Dataset

## Goal
The objective of this project is to explore the Diabetes dataset, clean the data if necessary, perform statistical analysis, and visualize key relationships using Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn. The aim is to uncover meaningful patterns and insights about factors influencing diabetes outcomes.

## Why This Project?
- **Data Cleaning**: Practice handling missing or inconsistent data using Pandas.
- **Statistical Analysis**: Use NumPy and Pandas to compute summary statistics (e.g., mean, median) and identify trends.
- **Data Visualization**: Create insightful visualizations with Matplotlib and Seaborn to highlight relationships between features.
- **Feature Relationships**: Investigate how variables like age, BMI, glucose levels, and pregnancies correlate with diabetes outcomes.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Set visualization style for better aesthetics
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Step 1: Load the Dataset
We start by loading the Diabetes dataset, which contains health-related features for predicting diabetes outcomes.

In [4]:
df = pd.read_csv('diabetes.csv')
df.head(5)

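|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |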


## Step 2: Exploring the Data
Let's examine the dataset's structure, check for missing values, and summarize the key statistics to understand the data better.

In [5]:
print('Dataset Shape:', df.shape)
print(df.info())
print('\nMissing Values:')
print(df.isnull().sum())
print('\nSummary Statistics:')
print(df.describe())

Dataset Shape: (768, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None

Missing Values:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

### Initial Insights
- **Dataset Size**: The dataset contains 768 records and 9 columns: eight features (`Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`) and the target `Outcome` (0 for no diabetes, 1 for diabetes).
- **No Missing Values**: No column contains null values, but a missing measurement may still be encoded as an implausible value, such as a zero in `Glucose`, `BMI`, or `BloodPressure`.
- **Potential Issues**: `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` have minimum values of 0. These measurements cannot realistically be zero in a living person, so the zeros most likely stand for missing or invalid readings.
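
Before cleaning, it helps to know how many zeros each column holds. The sketch below counts them per column. It uses a small made-up frame (`toy`, a hypothetical stand-in for `df`) so that it runs on its own; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89, 137],
    'BloodPressure': [72, 66, 0, 66, 40],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 0.0, 43.1],
})

suspect_cols = ['Glucose', 'BloodPressure', 'Insulin', 'BMI']

# Count zeros per column and express them as a share of all rows
zero_counts = (toy[suspect_cols] == 0).sum()
zero_share = (toy[suspect_cols] == 0).mean().round(2)

print(pd.DataFrame({'zeros': zero_counts, 'share': zero_share}))
```

Running the same two lines on `df` would show which columns are most affected, which matters when choosing how to impute them.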


## Step 3: Data Cleaning
Although there are no null values, zeros in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` are biologically implausible and most likely mark missing measurements. We'll replace these zeros with the median of the respective column, a simple imputation that is robust to outliers.

In [None]:
# Columns with implausible zero values
columns_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with the column median.
# Note: the median is computed on the raw column, i.e. with the zeros still included.
for col in columns_with_zeros:
    df[col] = df[col].replace(0, df[col].median())

# Verify cleaning
print('Summary Statistics After Cleaning:')
print(df[columns_with_zeros].describe())

### Cleaning Insights
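- **Zero Values Handled**: Zeros in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` were replaced with their respective column medians, so these columns no longer contain impossible zero readings.
- **Preserving Data**: Imputing keeps all 768 rows instead of dropping them, and the median is robust to outliers. It is not neutral, though: every replaced row receives the same value, which shrinks the column's variance and can weaken correlations. Because each median is computed with the zeros still included, the fill value is also pulled downward in columns that contain many zeros.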
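
A common alternative, sketched here on made-up values, is to mark zeros as `NaN` first and take the median of the observed values only. This is not what the cell above does; the `toy` frame and its numbers are hypothetical and serve only to show how the two medians can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; values are illustrative only.
toy = pd.DataFrame({'Insulin': [0, 0, 0, 94, 168, 88]})

# Median with zeros included (what replace(0, median) on the raw column uses)
median_with_zeros = toy['Insulin'].median()

# Treat zeros as missing first, then take the median of the observed values
observed = toy['Insulin'].replace(0, np.nan)
median_observed = observed.median()  # NaN is skipped by default

# Fill the missing entries with the median of the observed values
imputed = observed.fillna(median_observed)

print(median_with_zeros, median_observed)
print(imputed.tolist())
```

With half of the toy values at zero, the first median falls between a zero and a real reading, while the second reflects actual measurements only. Switching the notebook to this approach would change the downstream statistics, so the outputs shown in later steps would need to be regenerated.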


## Step 4: Statistical Analysis with NumPy and Pandas
Let's perform some calculations to understand key trends in the data, focusing on relationships between features and the `Outcome`.

In [27]:
# Average age of patients
avg_age = df['Age'].mean()
print(f'Average Age of Patients: {avg_age:.2f} years')

# Diabetes rate by number of pregnancies
diabetes_by_pregnancies = df.groupby('Pregnancies')['Outcome'].mean()
print('\nDiabetes Rate by Number of Pregnancies:\n', diabetes_by_pregnancies)

# Average BMI of patients with diabetes
avg_bmi = df[df['Outcome'] == 1]['BMI'].mean()
print(f'Average BMI of Diabetes Patients: {avg_bmi:.2f}')

# Average glucose levels of patients with diabetes
avg_glucose = df[df['Outcome'] == 1]['Glucose'].mean()
print(f'Average Glucose Levels of Diabetes Patients: {avg_glucose:.2f}')

# Correlation matrix to explore feature relationships
print('\nCorrelation Matrix:')
print(df.corr())

Average Age of Patients: 33.24 years

Diabetes Rate by Number of Pregnancies:
 Pregnancies
0     0.342342
1     0.214815
2     0.184466
3     0.360000
4     0.338235
5     0.368421
6     0.320000
7     0.555556
8     0.578947
9     0.642857
10    0.416667
11    0.636364
12    0.444444
13    0.500000
14    1.000000
15    1.000000
17    1.000000
Name: Outcome, dtype: float64

Average BMI of Diabetes Patients: 35.14
Average Glucose Levels of Diabetes Patients: 141.26

Correlation Matrix:
                          Pregnancies   Glucose  BloodPressure  SkinThickness  \
Pregnancies                1.000000  0.127911      0.208522       0.013376   
Glucose                    0.127911  1.000000      0.219666       0.228043   
BloodPressure              0.208522  0.219666      1.000000       0.192816   
SkinThickness              0.013376  0.228043      0.192816       1.000000   
Insulin                   -0.037013  0.494650      0.075426       0.178799   
BMI                        0.021546  0.2

### Analysis Insights
- **Average Age**: The average age of patients is approximately 33 years, suggesting a relatively young cohort.
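- **Pregnancies and Diabetes**: Diabetes rates tend to rise with the number of pregnancies. For 7 or more pregnancies the rate is at or above 50% in most groups (the exceptions are 10 pregnancies at 42% and 12 at 44%), compared with 18% to 37% for 0 to 6 pregnancies. The 100% rates at 14, 15, and 17 pregnancies probably rest on very few patients and should be read with caution.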
- **BMI and Glucose**: Patients with diabetes have a higher average BMI (35.14, indicating obesity) and elevated glucose levels (141.26 mg/dL), which are known risk factors for diabetes.
- **Correlations**:
  - `Glucose` has the strongest correlation with `Outcome` (0.49), indicating it is a key predictor of diabetes.
  - `BMI` (0.31) and `Age` (0.24) also show moderate positive correlations with diabetes.
  - `Pregnancies` and `Age` are correlated (0.54): older patients tend to have had more pregnancies, so part of the apparent effect of pregnancies on diabetes risk may reflect age.
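
Group rates are easier to judge when the group size is printed next to them. The following sketch reports both in one table; `toy` is a hypothetical stand-in for `df` with invented values, and the column names `rate` and `n` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'Pregnancies': [0, 0, 0, 0, 1, 1, 1, 7, 7, 14],
    'Outcome':     [0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Report the diabetes rate together with the number of patients behind it
rate_by_preg = (
    toy.groupby('Pregnancies')['Outcome']
       .agg(rate='mean', n='count')
)
print(rate_by_preg)
```

A rate of 1.0 backed by a single patient carries far less weight than the same rate over dozens of patients. On the real frame, `df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)` would similarly rank the features by their correlation with the outcome.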


## Step 5: Data Visualization with Matplotlib and Seaborn
Visualizations will help us better understand the relationships between features and the diabetes outcome.

In [29]:
# Pairplot for Age, BMI, Glucose, and Outcome
sns.pairplot(df[['Age', 'BMI', 'Glucose', 'Outcome']], hue='Outcome', palette='Set1')
plt.suptitle('Pairplot of Age, BMI, Glucose, and Diabetes Outcome', y=1.02)
plt.show()

# Boxplot for Glucose by Outcome
plt.figure(figsize=(8, 5))
sns.boxplot(x='Outcome', y='Glucose', data=df, palette='Set2')
plt.title('Glucose Levels by Diabetes Outcome')
plt.xlabel('Diabetes Outcome (0 = No, 1 = Yes)')
plt.ylabel('Glucose Level (mg/dL)')
plt.show()

# Heatmap of correlations
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Diabetes Features')
plt.show()

### Visualization Insights
- **Pairplot**: The scatter plots show that higher `Glucose` and `BMI` values are more common in patients with diabetes (Outcome=1). `Age` also appears to play a role, with older patients more likely to have diabetes.
- **Glucose Boxplot**: Patients with diabetes have visibly higher glucose levels, with a median around 140 mg/dL compared to ~110 mg/dL for non-diabetic patients. Outliers are present, but the trend is clear.
- **Correlation Heatmap**: The heatmap confirms that `Glucose`, `BMI`, and `Age` have the strongest positive correlations with diabetes outcome. `DiabetesPedigreeFunction` also shows a modest correlation (0.17), which is consistent with family history contributing to risk but does not establish it.
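
The medians and outliers read off the boxplot can also be computed directly. The sketch below does so on invented values (`toy` is a hypothetical stand-in for `df`). The helper `iqr_outliers` is introduced here for illustration and applies the 1.5 × IQR rule that Matplotlib and Seaborn use for boxplot whiskers by default.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'Glucose': [90, 100, 105, 110, 200, 120, 135, 140, 150, 160],
    'Outcome': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Median glucose per outcome group (the line inside each box)
group_medians = toy.groupby('Outcome')['Glucose'].median()


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]


# Outliers among the non-diabetic group only
outliers_group0 = iqr_outliers(toy.loc[toy['Outcome'] == 0, 'Glucose'])

print(group_medians)
print(outliers_group0.tolist())
```

Applying the helper per group, rather than to the whole column, matches what the boxplot shows, since each box has its own quartiles.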


## Step 6: Key Takeaways and Next Steps
- **Key Drivers**: Glucose, BMI, and age show the strongest associations with diabetes in this dataset. Glucose is the clearest signal (correlation of 0.49 with `Outcome`); BMI (0.31) and age (0.24) are weaker but point in the same direction.
- **Pregnancy Impact**: Women with more pregnancies (especially 7 or more) show higher diabetes rates. Hormonal or metabolic changes are one possible explanation, but pregnancy count is also correlated with age (0.54), so this analysis cannot separate the two.
- **Data Quality**: Replacing implausible zeros with medians was necessary for a meaningful analysis, although median imputation compresses the spread of the affected columns.
- **Next Steps**:
  - Explore feature engineering (e.g., BMI categories, age groups) to enhance predictive modeling.
  - Build a machine learning model (e.g., logistic regression, random forest) to predict diabetes outcomes.
  - Investigate outliers further to determine if they represent true anomalies or data errors.
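
As a starting point for the feature-engineering step, the sketch below bins BMI and age with `pd.cut`. The `toy` frame is hypothetical, the BMI bands use the commonly cited cut-offs of 18.5, 25, and 30, and the age bands and the column names `BMI_category` and `Age_group` are arbitrary choices made for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'BMI': [17.9, 22.0, 27.5, 33.6, 43.1],
    'Age': [21, 29, 35, 50, 66],
})

# BMI bands with cut-offs at 18.5, 25, and 30;
# right=False makes each bin include its lower edge, e.g. [25, 30).
toy['BMI_category'] = pd.cut(
    toy['BMI'],
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
    right=False,
)

# Age bands are an arbitrary choice for illustration.
toy['Age_group'] = pd.cut(
    toy['Age'],
    bins=[20, 30, 40, 50, 120],
    labels=['20-29', '30-39', '40-49', '50+'],
    right=False,
)

print(toy)
```

The resulting categorical columns can be passed to `groupby` to compare diabetes rates across bands, or one-hot encoded before fitting a model.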
