## Exploratory Data Analysis - Diabetes Dataset
### Introduction
This notebook presents an exploratory data analysis (EDA) of a diabetes dataset from Kaggle. The goal is to identify key patterns, feature distributions, and relationships that may inform predictive modeling for diabetes diagnosis.

**Dataset:** Diabetes Dataset (Kaggle)

**Objective:** Explore and visualize the dataset to gain insights and guide preprocessing and feature engineering.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**EDA Steps:**
1. Load Libraries and Data
2. Dataset Overview
3. Missing Values Analysis
4. Target Variable Analysis
5. Feature Distributions
6. Relationship Between Features and Outcome
7. Correlation Matrix

### 1. Load Libraries and Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

df = pd.read_csv("../data/raw/diabetes.csv")
df.columns = df.columns.str.strip()

print(f"Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
df.head()

### 2. Dataset Overview

In [None]:
print(f"Dataset size: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
print("\nColumn names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\nData types:")
print(df.dtypes)

print("\nQuick statistics:")
df.describe()

### 3. Missing Values Analysis

In [None]:
zero_cols = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
zero_count = (df[zero_cols] == 0).sum()
print("\nPotential missing values represented as 0:")
print(zero_count)
print(f"Total rows: {len(df)}")

zero_count.plot(kind='bar', color='lightgreen')
plt.title("Potential Missing Values (0) per Column")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()
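
The zero counts above suggest a preprocessing step. As a minimal sketch, and assuming that a 0 in these columns really does stand for a missing measurement, the zeros can be converted to `NaN` and then imputed. The `toy` frame and its values are hypothetical stand-ins for the real dataset, and median imputation is only one simple option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})
zero_cols = ["Glucose", "BloodPressure", "BMI"]

# Treat 0 as missing only in the columns where 0 is not a plausible measurement.
cleaned = toy.copy()
cleaned[zero_cols] = toy[zero_cols].replace(0, np.nan)

# Share of missing values per column, in percent.
missing_pct = cleaned[zero_cols].isna().mean().mul(100).round(1)
print(missing_pct)

# Median imputation: each NaN is filled with the median of its own column.
imputed = cleaned.fillna(cleaned[zero_cols].median())
print(imputed)
```

Converting to `NaN` first keeps the zeros from distorting the medians, and it leaves `Outcome` untouched because that column is not in `zero_cols`.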

### 4. Target Variable Analysis

In [None]:
outcome_summary = pd.DataFrame({
    "Count": df['Outcome'].value_counts(),
    "Ratio": df['Outcome'].value_counts(normalize=True).round(3)
})
print("Outcome Summary:")
print(outcome_summary)


plt.figure(figsize=(5,3))
sns.countplot(x='Outcome', data=df, palette='Set2')
plt.title('Outcome (0 = no diabetes, 1 = diabetes)')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.show()
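
The class counts above can also be turned into numbers that are useful later in modeling. The sketch below computes an imbalance ratio and per-class weights with the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The label series `y` is a hypothetical example, not the real `Outcome` column.

```python
import pandas as pd

# Hypothetical labels: 6 negatives and 4 positives.
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="Outcome")

counts = y.value_counts().sort_index()

# Imbalance ratio: majority class count divided by minority class count.
imbalance_ratio = counts.max() / counts.min()

# "Balanced" weights: rarer classes receive proportionally larger weights.
class_weights = len(y) / (counts.size * counts)

print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print(class_weights.round(3).to_dict())
```

Whether to use such weights, resampling, or nothing at all depends on the model and the evaluation metric, so this is only a starting point.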

### 5. Feature Distributions

In [None]:
df.hist(bins=20, figsize=(12, 10), color='skyblue', edgecolor='black')
plt.suptitle("Distribution of Numerical Features", y=1.02)
plt.tight_layout()
plt.show()
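
Histograms show skew visually, but it can help to quantify it. The following sketch ranks columns by sample skewness and checks how a `log1p` transform changes one of them. The `toy` values are hypothetical, and the log transform is shown as one common option rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one right-skewed column and one symmetric column.
toy = pd.DataFrame({
    "Insulin": [15, 40, 80, 94, 168, 300, 846],
    "BloodPressure": [62, 66, 70, 72, 74, 78, 82],
})

# Sample skewness per column; values far from 0 indicate asymmetry.
skewness = toy.skew().sort_values(ascending=False)
print(skewness.round(2))

# log1p(x) = log(1 + x), so it stays defined when a value is 0.
log_insulin = np.log1p(toy["Insulin"])
print(f"Insulin skew before: {toy['Insulin'].skew():.2f}")
print(f"Insulin skew after log1p: {log_insulin.skew():.2f}")
```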

### 6. Relationship Between Features and Outcome

In [None]:
features = ["Glucose", "BloodPressure", "BMI", "Age"]
plt.figure(figsize=(12, 8))

for i, col in enumerate(features, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x='Outcome', y=col, data=df, palette='Set3')
    plt.title(f'{col} vs Outcome')
    plt.xlabel('Outcome')
    plt.ylabel(col)

plt.tight_layout()
plt.suptitle("Feature vs Outcome Comparison", y=1.02)
plt.show()
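
The boxplots can be complemented with the numbers behind them. This sketch computes per-class medians and counts points outside the 1.5 × IQR whiskers, which is the default rule boxplots use to draw outliers. The `toy` frame and the helper `count_iqr_outliers` are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data with one extreme BMI value.
toy = pd.DataFrame({
    "Glucose": [85, 89, 95, 110, 137, 148, 168, 183],
    "BMI": [26.6, 28.1, 25.0, 30.5, 67.1, 33.6, 38.0, 23.3],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})
features = ["Glucose", "BMI"]

# Median of each feature per outcome class (the center line of each box).
group_medians = toy.groupby("Outcome")[features].median()
print(group_medians)


def count_iqr_outliers(series: pd.Series) -> int:
    """Count values beyond 1.5 * IQR from the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(outside.sum())


# Outlier counts over the whole column, not split by class.
outlier_counts = toy[features].apply(count_iqr_outliers)
print(outlier_counts)
```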

### 7. Correlation Matrix

In [None]:
corr_matrix = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', 
            center=0, square=True, linewidths=0.5, cbar_kws={'shrink': 0.8})
plt.title("Correlation Matrix of Features", fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\nTop 5 correlations with Outcome:")
outcome_corr = corr_matrix['Outcome'].drop('Outcome').sort_values(key=abs, ascending=False)
for feature, corr in outcome_corr.head(5).items():
    print(f"{feature:25}: {corr:6.3f}")
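
`df.corr()` computes Pearson correlation by default, which measures linear association and is sensitive to outliers. With a binary target it is equivalent to the point-biserial correlation. As a hedged cross-check, the sketch below compares it with Spearman rank correlation, obtained here by ranking the data and applying Pearson to the ranks. The `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Glucose": [85, 89, 95, 110, 137, 148, 168, 183],
    "Age": [21, 50, 24, 29, 33, 45, 31, 52],
    "Outcome": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target.
pearson = toy.corr()["Outcome"].drop("Outcome")

# Spearman correlation: Pearson applied to ranks (ties receive average ranks).
spearman = toy.rank().corr()["Outcome"].drop("Outcome")

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison.round(3))
```

A large gap between the two columns for a feature can hint at a non-linear relationship or at outliers influencing the Pearson value.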

### Conclusion
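This EDA highlighted several key findings:
- The dataset shows a moderate class imbalance (about 65% non-diabetic, 35% diabetic).
- Glucose, BMI, and Age are among the features most correlated with the diabetes outcome.
- Several columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) contain zeros that may represent missing values and should be handled before modeling.
- Several features have skewed distributions and outliers that may affect modeling.

These insights will guide feature selection and preprocessing for predictive modeling.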