# Exploratory Data Analysis - Diabetes Dataset

**Authors:**  
Filip Kobus, Łukasz Jarzęcki, Paweł Skierkowski  
**Date:** 22.12.25  
**Team 3**

## Config

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_distributions(df: pd.DataFrame, n_cols: int = 4, bins: int = 50):
    cols = df.columns

    n_rows = int(np.ceil(len(cols) / n_cols))

    plt.figure(figsize=(3 * n_cols, 2 * n_rows))

    for i, col in enumerate(cols, start=1):
        plt.subplot(n_rows, n_cols, i)
        if df[col].dtypes == 'object':
            df[col].hist(bins=bins)
            plt.title(col)
            plt.xticks(rotation=45, ha='right')
        else:
            sns.histplot(data=df, x=col, kde=True, bins=bins)
            plt.title(f'{col}\nSkewness: {round(df[col].skew(), 2)}')

    # Apply the layout once, after all subplots are drawn
    plt.tight_layout()
    plt.show()


def plot_target_vs_category(df: pd.DataFrame, target: str, n_cols: int = 4):
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    cat_cols = [c for c in cat_cols if c != target]

    if len(cat_cols) == 0:
        print('No categorical columns.')
        return

    n_rows = int(np.ceil(len(cat_cols) / n_cols))

    plt.figure(figsize=(5 * n_cols, 4 * n_rows))

    for i, col in enumerate(cat_cols, start=1):
        plt.subplot(n_rows, n_cols, i)

        sns.violinplot(data=df, x=col, y=target)

        plt.title(f'{col} vs {target}')
        plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()


def plot_numerical_boxplots(df, n_cols: int = 3):
    num_cols = df.select_dtypes(include=['number']).columns

    if len(num_cols) == 0:
        print('No numerical columns.')
        return

    n_rows = int(np.ceil(len(num_cols) / n_cols))
    plt.figure(figsize=(5 * n_cols, 4 * n_rows))

    for i, col in enumerate(num_cols, start=1):
        plt.subplot(n_rows, n_cols, i)
        sns.boxplot(data=df, x=col)
        plt.title(col)

    plt.tight_layout()
    plt.show()

In [None]:
diabetes = pd.read_csv('diabetes_dataset.csv').drop(['diabetes_stage'], axis=1)

We load the dataset and drop the 'diabetes_stage' column. This categorical variable records the patient's diabetes stage (No Diabetes, Prediabetes, Type 1, Type 2).

Our goal is to build a regression model that predicts 'diabetes_risk_score', a continuous measure of diabetes risk. Using 'diabetes_stage' as a feature would be data leakage, since it is essentially another representation of the outcome we are trying to predict. The stage is likely determined from the same health indicators that contribute to the risk score, so including it would give the model information it would not have in a real prediction scenario.

## Basic Data Overview

In [None]:
diabetes.head()

Looking at the first few rows, we can see a good variety in the data - different ages, genders, ethnicities, and health profiles. The data types look appropriate with categorical variables stored as objects and numerical measurements as numeric types. We have both our target variable 'diabetes_risk_score' and the binary 'diagnosed_diabetes' column still present at this stage.

## Data types

In [None]:
diabetes.info()

The dataset contains 100,000 records with 30 features (after dropping diabetes_stage). We have categorical variables like gender, ethnicity, education, and employment stored as objects. Numerical measurements like age, BMI, blood pressure, and lab values are stored as integers or floats. Binary variables like family history and medical conditions are stored as integers (0/1). This is a good starting point, though we'll need to encode the categorical variables properly before modeling.

## Missing and duplicated data

In [None]:
print(f'Missing values: {diabetes.isna().sum().sum()}')

In [None]:
print(f'Duplicated rows: {diabetes.duplicated().sum()}')

The dataset has no missing values or duplicate records, so we can proceed directly with the analysis.
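The single totals above do not distinguish missing cells from affected rows. A minimal sketch of a finer breakdown, run on a small made-up frame (`toy` and its values are illustrative stand-ins, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `diabetes`; values are illustrative only.
toy = pd.DataFrame(
    {
        'age': [34, 51, np.nan, 51],
        'bmi': [22.1, 30.4, 27.9, 30.4],
        'gender': ['Female', 'Male', None, 'Male'],
    }
)

# Total number of missing cells across the whole frame
missing_cells = int(toy.isna().sum().sum())
# Number of rows that contain at least one missing cell
rows_with_missing = int(toy.isna().any(axis=1).sum())
# Missing cells per column, useful for spotting a single problematic feature
missing_per_column = toy.isna().sum()
# Rows that are exact copies of an earlier row
duplicated_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(missing_cells, rows_with_missing, duplicated_rows)
```

On the real dataset all of these counts should be zero if the statement above holds.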

## Numerical data description

In [None]:
diabetes.describe().drop('count', axis=0).T

Looking at the summary statistics, the data appears reasonable. Age ranges from 18 to 90 with a mean around 50. BMI averages 25.6, just inside the overweight range (25 to 30) under standard classifications. Blood pressure values look normal, with systolic averaging 116 and diastolic 75. The diabetes risk score ranges from 2.7 to 67.2 with a mean of 30.2, showing good spread in our target variable. About 60% of patients are diagnosed with diabetes, so the binary label is somewhat imbalanced but not extremely so. The binary history variables show that roughly 22% have a family history of diabetes, 25% have a hypertension history, and 8% have a cardiovascular history.

## Data distribution

In [None]:
plot_distributions(diabetes.drop('diagnosed_diabetes', axis=1))

Most numerical features show approximately symmetric distributions with skewness values close to 0. Variables like age, BMI, blood pressure readings, cholesterol levels, and glucose measurements are nearly normally distributed, which is good for many modeling approaches.

The binary variables (family history, hypertension history, cardiovascular history) show high positive skewness, which is expected since most patients don't have these conditions. Cardiovascular history has particularly high skewness at 3.12, indicating it's relatively rare in the dataset.

Physical activity shows the highest skewness among continuous variables at 1.39, with many people exercising minimally and fewer exercising heavily. Insulin levels show moderate right skew (0.42) with most patients having lower levels and some having particularly high values. Our target variable diabetes_risk_score also shows moderate right skew (0.51), suggesting some patients have particularly high risk scores that pull the distribution to the right.
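A common way to reduce right skew like this before modeling is a log transform. The sketch below is only an illustration on synthetic data: the series is a randomly generated stand-in, not the real physical activity column, and whether a transform is actually needed depends on the chosen model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for a feature such as physical activity
activity = pd.Series(
    rng.lognormal(mean=4.0, sigma=0.6, size=5_000),
    name='activity_stand_in',
)

# Sample skewness before and after a log1p transform
skew_raw = activity.skew()
skew_log = np.log1p(activity).skew()

print(f'raw skewness:   {skew_raw:.2f}')
print(f'log1p skewness: {skew_log:.2f}')
```

`np.log1p` computes log(1 + x), so it stays defined for zero values, which matters for features that can be exactly 0.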

The categorical variables differ in how balanced they are: gender is fairly balanced between male and female, while ethnicity is dominated by the Asian and White groups. Education and income levels are concentrated in the middle categories.

Looking at variable types for future encoding, we can identify several groups:
- Education level, income level, and smoking status are ordinal variables with a natural ordering and should be encoded using OrdinalEncoder.
- Employment status, gender, and ethnicity are nominal categories without inherent ordering and will need OneHotEncoder. Gender in particular should not be treated as strictly binary.
- The binary history variables (family history, hypertension, cardiovascular) are already encoded as 0/1 and need no further encoding.
- Alcohol consumption per week can be treated as a continuous numeric variable given its distribution.

## Looking for outliers and values that are logical or input errors

In [None]:
# blood pressure
impossible_bp = diabetes[
    (diabetes['systolic_bp'] <= diabetes['diastolic_bp'])
    | (diabetes['systolic_bp'] < 60)
    | (diabetes['systolic_bp'] > 300)
]
print(
    f'BP (systolic must be higher than diastolic, out of possible range): '
    f'{len(impossible_bp)} rows.'
)

# cholesterol
cholesterol_error = diabetes[
    diabetes['cholesterol_total']
    < (diabetes['hdl_cholesterol'] + diabetes['ldl_cholesterol']) - 20
]
print(f'Cholesterol (Total < HDL + LDL - 20): {len(cholesterol_error)} rows.')

# screen time, sleep hours, age
sleep_error = diabetes[
    (diabetes['sleep_hours_per_day'] <= 0) | (diabetes['sleep_hours_per_day'] > 24)
]

age_error = diabetes[(diabetes['age'] < 0) | (diabetes['age'] > 110)]

screentime_error = diabetes[
    (diabetes['screen_time_hours_per_day'] < 0)
    | (diabetes['screen_time_hours_per_day'] > 24)
]

print(f'Sleep hours per day error (<=0 or >24h): {len(sleep_error)} rows.')
print(f'Age error: {len(age_error)} rows.')
print(f'Screen time error: {len(screentime_error)} rows.')

# WHR (Waist to hip ratio)
whr_error = diabetes[
    (diabetes['waist_to_hip_ratio'] < 0.5) | (diabetes['waist_to_hip_ratio'] > 2.5)
]
print(f'WHR error: {len(whr_error)} rows.')

# lab errors (impossible zeros)
lab_cols = ['insulin_level', 'cholesterol_total', 'triglycerides']
for col in lab_cols:
    zeros = len(diabetes[diabetes[col] == 0])
    if zeros > 0:
        print(f'ERROR in {col}: {zeros}')

We identified 190 records with data quality issues. Of these, 154 rows have impossible blood pressure readings (systolic ≤ diastolic or out of physiological range), and 36 rows have total cholesterol below the sum of HDL and LDL cholesterol by more than the tolerance of 20 used in the check. Both patterns indicate measurement or recording errors, so these rows need to be removed before modeling.
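Per-rule counts only add up to the number of affected records when no row breaks more than one rule. A minimal sketch of how that could be checked, using boolean masks on a made-up frame (the values in `toy` are illustrative; the column names follow the checks above):

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame(
    {
        'systolic_bp': [120, 80, 118, 95],
        'diastolic_bp': [80, 85, 76, 95],
        'cholesterol_total': [190, 210, 150, 140],
        'hdl_cholesterol': [55, 60, 70, 65],
        'ldl_cholesterol': [110, 120, 120, 100],
    }
)

# One boolean mask per validation rule
rules = {
    'bp': toy['systolic_bp'] <= toy['diastolic_bp'],
    'cholesterol': toy['cholesterol_total']
    < toy['hdl_cholesterol'] + toy['ldl_cholesterol'] - 20,
}
violations = pd.DataFrame(rules)

per_rule = violations.sum()                              # rows failing each rule
flagged_rows = int(violations.any(axis=1).sum())         # rows failing any rule
overlap_rows = int((violations.sum(axis=1) > 1).sum())   # rows failing several rules

print(per_rule)
print(f'flagged: {flagged_rows}, overlapping: {overlap_rows}')
```

In this toy example each rule flags two rows but only three distinct rows are affected, because one row fails both checks.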

## Data Cleaning - Removing Invalid Records

In [None]:
# Drop every row flagged by at least one of the checks above
index_to_drop = (
    set(impossible_bp.index)
    | set(cholesterol_error.index)
    | set(sleep_error.index)
    | set(age_error.index)
    | set(screentime_error.index)
    | set(whr_error.index)
)

diabetes_clean = diabetes.drop(index=list(index_to_drop)).copy()

print(f'Data rows before: {len(diabetes)}, after: {len(diabetes_clean)}')

After removing 190 invalid records (0.19% of the dataset), we retained 99,810 clean records for analysis and modeling.

## Checking for outliers (Boxplots)

In [None]:
plot_numerical_boxplots(diabetes_clean.drop(['diagnosed_diabetes'], axis=1))

After removing the impossible values, the remaining outliers visible in the boxplots appear to represent natural variation rather than data errors. Variables like insulin level, triglycerides, and postprandial glucose show pronounced outliers on the high end, likely representing patients with metabolic syndrome or advanced diabetes. We decided to keep these outliers because they are valid patient profiles the model should learn from. Instead, we will limit their influence during preprocessing with scaling methods that are robust to extreme values.
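The boxplot whiskers follow the usual 1.5 × IQR rule, and the same quartiles can drive a robust scaling step. The sketch below is one possible approach, not the final preprocessing choice; the series values are made up for illustration.

```python
import pandas as pd

# Made-up values standing in for a lab feature with one extreme reading
insulin = pd.Series(
    [4.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 40.0],
    name='insulin_level',
)

q1, q3 = insulin.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles, as drawn by a boxplot
is_outlier = (insulin < q1 - 1.5 * iqr) | (insulin > q3 + 1.5 * iqr)

# Robust scaling: center on the median and divide by the IQR,
# so the extreme value does not distort the scale of the other points
robust_scaled = (insulin - insulin.median()) / iqr

print(f'outliers: {int(is_outlier.sum())}')
print(robust_scaled.round(2).tolist())
```

scikit-learn's `RobustScaler` applies the same median and IQR scaling with its default settings, which makes it convenient inside a pipeline.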

## Correlation

In [None]:
corr = diabetes_clean.drop(['diagnosed_diabetes'], axis=1).corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(25, 25))
sns.heatmap(
    corr,
    annot=True,
    vmin=-1,
    vmax=1,
    mask=mask,
    linewidths=0.5,
)

The features most strongly correlated with the diabetes risk score are family history of diabetes (r=0.73), age (r=0.50), and fasting glucose (r=0.47). Physical activity shows a moderate negative correlation (r=-0.35), which is consistent with a protective effect, although correlation alone does not establish one. Interestingly, lifestyle factors like sleep hours and alcohol consumption show negligible correlations with the target variable.

We identified several multicollinearity issues among the predictor variables. Postprandial glucose and HbA1c are very highly correlated (r=0.93), as are total cholesterol and LDL cholesterol (r=0.91). Fasting glucose correlates strongly with both HbA1c (r=0.70) and postprandial glucose (r=0.59). BMI and waist-to-hip ratio also show high correlation (r=0.77). These redundancies mean that including all of these variables together in a linear model could lead to unstable coefficient estimates and inflated standard errors.
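Pairwise correlations can be complemented by the variance inflation factor (VIF), which measures how much a coefficient's variance grows because of the other predictors. For standardized predictors, the VIFs are the diagonal of the inverse correlation matrix. A minimal sketch on synthetic data (the columns `hba1c` and `glucose_postprandial` are hypothetical names and the values are randomly generated):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Two synthetic features sharing a common signal, plus one independent feature
base = rng.normal(size=n)
toy = pd.DataFrame(
    {
        'hba1c': base + rng.normal(scale=0.3, size=n),
        'glucose_postprandial': base + rng.normal(scale=0.3, size=n),
        'sleep_hours_per_day': rng.normal(size=n),
    }
)

corr = toy.corr()

# VIF_i is the i-th diagonal element of the inverse correlation matrix
vif = pd.Series(
    np.diag(np.linalg.inv(corr.to_numpy())),
    index=corr.columns,
    name='VIF',
)

print(vif.round(2))
```

A VIF near 1 indicates little redundancy. Values above roughly 5 to 10 are a common rule-of-thumb warning sign, though the threshold is a convention rather than a strict rule.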

For modeling, we have several options to handle this multicollinearity. For regularized linear models, we can keep all variables: Ridge shrinks the coefficients of correlated features together, while Lasso tends to keep one feature from a correlated group and shrink the others to zero. For standard linear regression or other models sensitive to multicollinearity, we should consider dropping one variable from each highly correlated pair. Alternatively, tree-based models like Random Forest or XGBoost are largely unaffected by multicollinearity in terms of predictive performance and can use all features, although feature importance may be split between correlated features. We could also apply PCA to create uncorrelated components, though this would sacrifice interpretability.
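The option of dropping one variable from each highly correlated pair could look like the sketch below. The helper name `correlated_to_drop`, the threshold of 0.9, and the toy columns are assumptions for illustration. Which member of a pair to keep is a modeling decision that this simple rule does not make for you.

```python
import numpy as np
import pandas as pd


def correlated_to_drop(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy frame: 'b' is almost a copy of 'a', while 'c' is unrelated
toy = pd.DataFrame(
    {
        'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        'b': [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
        'c': [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    }
)

to_drop = correlated_to_drop(toy, threshold=0.9)
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

The rule always drops the later column of a pair, so the column order of the input frame determines which variable survives.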

## Distribution of the target across the categories of each categorical variable

In [None]:
plot_target_vs_category(diabetes_clean, 'diabetes_risk_score', n_cols=3)

The violin plots show that gender, ethnicity, education, income, employment, and smoking status have nearly identical diabetes risk score distributions across their categories. This suggests these categorical factors have minimal individual impact compared to the continuous health metrics.
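The visual impression from the violin plots can be backed by a number. One option is eta squared, the share of the target's variance explained by the group means of a categorical variable. The helper `eta_squared` and the toy frame below are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def eta_squared(df: pd.DataFrame, category: str, target: str) -> float:
    """Share of the target's variance explained by the category's group means."""
    grand_mean = df[target].mean()
    groups = df.groupby(category, observed=True)[target]
    # Between-group sum of squares: group size times squared mean offset
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean
    ss_total = ((df[target] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


# Toy frame with made-up values: the score depends on 'family_history', not 'gender'
toy = pd.DataFrame(
    {
        'gender': ['F', 'M', 'F', 'M'],
        'family_history': ['no', 'no', 'yes', 'yes'],
        'diabetes_risk_score': [10.0, 12.0, 30.0, 32.0],
    }
)

eta_gender = eta_squared(toy, 'gender', 'diabetes_risk_score')
eta_family = eta_squared(toy, 'family_history', 'diabetes_risk_score')

print(f'gender: {eta_gender:.3f}, family_history: {eta_family:.3f}')
```

Values near 0 match the "nearly identical distributions" reading, while values near 1 would mean the category almost fully determines the target.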

## Diabetes Risk Analysis - Age and BMI

In [None]:
# grouping by age and BMI
bmi_bins = [0, 18.5, 25, 30, 35, 40, np.inf]
bmi_labels = [
    'Underweight',
    'Normal',
    'Overweight',
    'Obese I',
    'Obese II',
    'Obese III+',
]

age_bins = [0, 30, 40, 50, 60, 70, 80, np.inf]
age_labels = ['<30', '30-40', '40-50', '50-60', '60-70', '70-80', '80+']

diabetes_clean['bmi_group'] = pd.cut(
    diabetes_clean['bmi'], bins=bmi_bins, labels=bmi_labels
)
diabetes_clean['age_group'] = pd.cut(
    diabetes_clean['age'], bins=age_bins, labels=age_labels
)

# Share of diagnosed patients in each BMI group x age group cell
risk_matrix = diabetes_clean.pivot_table(
    index='bmi_group',
    columns='age_group',
    values='diagnosed_diabetes',
    aggfunc='mean',
    observed=False,
)

plt.figure(figsize=(10, 8))
sns.heatmap(risk_matrix, annot=True, fmt='.0%', cmap='YlOrRd')
plt.title('Diabetes risk (Age and BMI)')
plt.ylabel('BMI category')
plt.xlabel('Age group')
plt.gca().invert_yaxis()
plt.show()

The heatmap shows a clear diagonal pattern: the share of diagnosed patients increases with both age and BMI, which is consistent with a combined effect on disease prevalence. The highest rates appear in the older age groups with higher obesity levels, reaching 80-83% in the oldest and most obese categories. Interestingly, even young individuals with normal or low BMI show relatively high diabetes rates (43-51%), suggesting the dataset may include Type 1 diabetes cases or other risk factors beyond age and BMI alone.
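Rates in sparsely populated cells, such as the extreme age and BMI combinations, can be noisy. A minimal sketch of pairing each rate with its sample size and masking small cells (the toy data and the `min_count` threshold are assumptions chosen for illustration):

```python
import pandas as pd

# Toy frame with made-up values, using the same column names as the pivot above
toy = pd.DataFrame(
    {
        'bmi_group': ['Normal', 'Normal', 'Normal', 'Normal', 'Obese I'],
        'age_group': ['<30', '<30', '<30', '<30', '80+'],
        'diagnosed_diabetes': [0, 1, 0, 1, 1],
    }
)

# Mean diagnosis rate and number of patients per (BMI group, age group) cell
cells = toy.groupby(['bmi_group', 'age_group'])['diagnosed_diabetes'].agg(
    ['mean', 'count']
)
rates = cells['mean'].unstack('age_group')
counts = cells['count'].unstack('age_group')

# Hide rates that are based on too few patients to be trustworthy
min_count = 3
reliable = rates.where(counts >= min_count)

print(counts)
print(reliable)
```

Here the single patient in the 'Obese I' / '80+' cell yields a rate of 100%, which the mask removes because it rests on one observation.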

## Summary

Based on the exploratory findings, we hypothesize that `diabetes_risk_score` is primarily driven by biological and genetic markers such as age, BMI, and family history, which far outweigh the influence of demographic variables. We further expect elevated glucose metrics to be positively associated with the risk score and physical activity to be an important negative predictor in the regression model, in line with its negative correlation with the target.