<a href="https://colab.research.google.com/github/nischithakn800-ux/import-export-dataset/blob/main/DIABETES_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Name** - **DIABETES RISK ANALYSIS**

**Project Type - Exploratory Data Analysis**

**Project Prepared By - Nischitha K N**

**Project Summary**

In this project, I performed an interactive exploratory data analysis (EDA) on the Pima Indians Diabetes dataset to uncover patterns and risk factors associated with diabetes. I began by cleaning the data, replacing invalid zero values with missing entries and imputing them using column medians. Using Plotly, I created interactive visualizations (histograms, box plots, and a correlation heatmap) that show how features such as glucose, BMI, age, and insulin differ between diabetic and non-diabetic individuals. On average, diabetics have higher values for these metrics, and the correlation analysis pointed to glucose and BMI as the features most strongly associated with the outcome. To quantify risk, I built a composite risk score as a weighted sum of the features; its distribution is shifted toward higher values in the diabetic group. I also examined how age and genetic predisposition (the diabetes pedigree function) relate to the outcome, which illustrates the multifactorial nature of diabetes. The analysis highlights the most influential health indicators and lays a foundation for future modeling, clinical decision support, or dashboard development.

**GitHub Link**




**Problem Statement**

Diabetes is a chronic health condition that affects millions worldwide and is influenced by a combination of genetic, physiological, and lifestyle factors. Early detection and risk assessment are critical for effective management and prevention. This project aims to analyze the Pima Indians Diabetes dataset to identify key health indicators associated with diabetes, explore patterns between diabetic and non-diabetic individuals, and develop a composite risk score using interactive visualizations. By uncovering meaningful relationships among features such as glucose levels, BMI, age, insulin, and genetic predisposition, the goal is to support data-driven insights that can inform clinical decision-making and lay the foundation for predictive modeling.

**Coding Section**

**Import Libraries**

In [None]:
# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff

# Statistical tests
from scipy.stats import ttest_ind, mannwhitneyu, pearsonr

# Feature scaling
from sklearn.preprocessing import MinMaxScaler

# Display settings for better visuals
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)


**Load and Preview Dataset**

In [None]:
# Load the dataset
df = pd.read_csv('diabetes.csv')

# Preview the first few rows
df.head()


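| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |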
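The preview contains zeros in `Insulin` and `SkinThickness`, which are not plausible physiological readings. The project summary describes replacing such zeros with missing values and imputing medians; the sketch below shows one way that step could look. The list of columns treated as "zero means missing" and the helper name `impute_zero_readings` are assumptions for illustration, not part of the original notebook, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

# Columns where a reading of 0 is treated as "not recorded" (an assumption;
# Pregnancies and Outcome are excluded because 0 is a valid value there).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zero_readings(frame: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a copy with zeros in `columns` replaced by each column's median."""
    cleaned = frame.copy()
    present = [c for c in columns if c in cleaned.columns]
    # Mark zeros as missing so they do not distort the median.
    cleaned[present] = cleaned[present].replace(0, np.nan)
    # Fill each column with the median of its remaining (non-missing) values.
    cleaned[present] = cleaned[present].fillna(cleaned[present].median())
    return cleaned


# Tiny made-up frame, for illustration only (not rows from diabetes.csv).
demo = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 1, 0, 1],
})
demo_clean = impute_zero_readings(demo)
print(demo_clean)
```

On the real data this would be called as `df = impute_zero_readings(df)` before any plotting. The function returns a copy, so the raw frame stays available for comparison.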


**Interactive Correlation Heatmap**

In [None]:
import plotly.figure_factory as ff

corr_matrix = df.corr().round(2)
fig = ff.create_annotated_heatmap(
    z=corr_matrix.values,
    x=list(corr_matrix.columns),
    y=list(corr_matrix.index),
    colorscale='Viridis',
    showscale=True,
    annotation_text=corr_matrix.values
)
fig.update_layout(title_text='Interactive Correlation Heatmap')
fig.show()


The correlation heatmap summarizes how each pair of variables in the diabetes dataset relates to each other.

It shows:

**Strength of relationships**: Each cell displays the Pearson correlation coefficient between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).

**Color-coded insights**: Viridis is a sequential colorscale, so cells near +1 appear yellow and the lowest values appear dark purple. The color tracks the signed value rather than the absolute strength.

**Annotated values**: Each cell includes the exact correlation value, so the strength and direction of each relationship can be read precisely.

**Interactivity**: Hovering over a cell shows the variable pair and its value, and the matrix can be zoomed and panned for closer inspection.

The heatmap helps identify which features are most associated with diabetes (the `Outcome` row or column, for example glucose or BMI), which pairs of features are strongly correlated with each other and therefore partly redundant, and which are worth exploring further. That makes it a useful guide for feature selection and for understanding the dataset's internal structure.
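Reading redundant pairs off a heatmap by eye gets harder as the number of columns grows. As a rough sketch, the same information can be extracted by ranking the upper triangle of the correlation matrix. The helper name `strongest_pairs` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Rank distinct column pairs by absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep only cells strictly above the diagonal so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Order by magnitude, but return the signed coefficients.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top_n)


# Made-up numbers, for illustration only.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exactly 2 * a
    "c": [5, 3, 4, 1, 2],
})
print(strongest_pairs(demo))
```

With the notebook's data, `strongest_pairs(df.drop(columns="Outcome"))` would list the feature pairs that carry overlapping information.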

**Composite Diabetes Risk Score Histogram**

In [None]:
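# Composite risk score: a hand-weighted sum of the raw (unscaled) features.
# The weights are illustrative, not clinically derived, and sum to 1.5. Because the
# features are not rescaled, wide-ranging columns such as Glucose and Insulin dominate.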
df['RiskScore'] = (df['Glucose'] * 0.5 +
                   df['BMI'] * 0.3 +
                   df['Age'] * 0.2 +
                   df['BloodPressure'] * 0.1 +
                   df['Pregnancies'] * 0.1 +
                   df['SkinThickness'] * 0.1 +
                   df['Insulin'] * 0.1 +
                   df['DiabetesPedigreeFunction'] * 0.1)

# Plot the score distribution, colored by outcome, with a rug plot of individual values
fig = px.histogram(df, x='RiskScore', nbins=30, title='Composite Diabetes Risk Score',
                   color='Outcome', marginal='rug', opacity=0.7,
                   color_discrete_map={0: 'blue', 1: 'red'})
fig.show()

Higher risk scores are more common in the red group (diabetic), and lower scores are mostly in the blue group (non-diabetic).

The score therefore separates the two groups on average, although the histogram alone does not quantify how well.

The two distributions also overlap: some non-diabetic individuals have high scores, and some diabetic individuals have low ones. That is expected in real-world data.
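Because the score above sums raw values, features measured on large scales contribute most of it. One possible refinement, sketched here under assumptions, is to min-max scale each feature to [0, 1] and normalize the weights so they sum to 1, which keeps the score itself in [0, 1]. The relative weights are copied from the cell above; the function name `scaled_risk_score` and the demo frame are hypothetical. The scaling is done in pandas and follows the same formula as the default `MinMaxScaler` range.

```python
import pandas as pd

# Same relative weights as the risk-score cell; illustrative, not clinically derived.
WEIGHTS = {
    "Glucose": 0.5, "BMI": 0.3, "Age": 0.2, "BloodPressure": 0.1,
    "Pregnancies": 0.1, "SkinThickness": 0.1, "Insulin": 0.1,
    "DiabetesPedigreeFunction": 0.1,
}


def scaled_risk_score(frame: pd.DataFrame, weights=WEIGHTS) -> pd.Series:
    """Weighted mean of min-max scaled features; the result lies in [0, 1]."""
    cols = [c for c in weights if c in frame.columns]
    values = frame[cols].astype(float)
    # Guard against division by zero for constant columns.
    span = (values.max() - values.min()).replace(0, 1.0)
    scaled = (values - values.min()) / span
    # Normalize the weights of the columns actually present so they sum to 1.
    w = pd.Series({c: weights[c] for c in cols})
    w = w / w.sum()
    return scaled.mul(w, axis=1).sum(axis=1)


# Made-up rows, for illustration only.
demo = pd.DataFrame({"Glucose": [100, 150, 200], "BMI": [20.0, 30.0, 40.0]})
demo_scores = scaled_risk_score(demo)
print(demo_scores)
```

A scaled score is easier to interpret, but it would change the `RiskScore` values reported elsewhere in the notebook, so it is best stored in a separate column.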

**Diabetes Outcome Distribution**

In [None]:
fig = px.histogram(df, x='Outcome', color='Outcome', title='Diabetes Outcome Distribution',
                   color_discrete_map={0: 'blue', 1: 'yellow'})
fig.show()


The histogram displays two bars:

One for Outcome = 0 (non-diabetic), colored blue

One for Outcome = 1 (diabetic), colored yellow

Each bar’s height represents the number of individuals in that category.
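The bar heights can also be read as numbers, which makes any class imbalance explicit. This is a small sketch; the helper name `class_balance` and the demo series are made up for illustration.

```python
import pandas as pd


def class_balance(outcome: pd.Series) -> pd.DataFrame:
    """Count and share of each class in a label column."""
    counts = outcome.value_counts().sort_index()
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})


# Made-up labels, for illustration only.
demo_outcome = pd.Series([0, 0, 0, 1, 1], name="Outcome")
balance = class_balance(demo_outcome)
print(balance)
```

In the notebook this would be called as `class_balance(df["Outcome"])`.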

**Interactive Histograms of Key Features**

In [None]:
for col in ['Glucose', 'BMI']:
    fig = px.histogram(df, x=col, nbins=30, marginal="box", title=f'Distribution of {col}')
    fig.show()


**Glucose distribution**: Most individuals have moderate glucose levels, with a tail of higher values on the right. Points beyond the whiskers of the marginal box plot are outliers. Very high readings may indicate diabetes or prediabetes, while any readings of 0 are not physiologically possible and represent missing measurements.

**BMI distribution**: BMI tends to be roughly bell-shaped with some right skew. The box plot highlights individuals with very high BMI, a known risk factor for diabetes.
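Box plots conventionally flag points that fall more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule directly so the outliers can be listed rather than only seen. The function name `iqr_outliers` and the demo values are hypothetical, and the fences may differ slightly from the plotted whiskers because quartile conventions vary between libraries.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up readings, for illustration only.
demo_values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])
flagged = iqr_outliers(demo_values)
print(flagged)
```

For the notebook's data, `iqr_outliers(df["Glucose"])` and `iqr_outliers(df["BMI"])` would return the rows behind the outlier markers.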

**Box Plots by Diabetes Outcome**

In [None]:
for col in ['Glucose', 'BMI']:
    fig = px.box(df, x='Outcome', y=col, color='Outcome', title=f'{col} by Diabetes Outcome',
                 color_discrete_map={0: 'blue', 1: 'red'})
    fig.show()


Each plot compares the distribution of a health feature across the two groups:

**Glucose by outcome**: Diabetic individuals tend to have higher glucose levels. The red box (diabetic group) is shifted upward relative to the blue box, which marks glucose as a strong indicator of diabetes in this dataset.

**BMI by outcome**: Diabetic individuals tend to have higher BMI values, with a higher median in the red box. This indicates that body mass relative to height is associated with diabetes risk.

**Compare Group Means**

In [None]:
diabetic = df[df['Outcome'] == 1]
non_diabetic = df[df['Outcome'] == 0]

mean_diff = pd.DataFrame({
    'Diabetic': diabetic.mean(),
    'Non-Diabetic': non_diabetic.mean(),
    'Difference': diabetic.mean() - non_diabetic.mean()
})
mean_diff[['Diabetic', 'Non-Diabetic']]


| | Diabetic | Non-Diabetic |
|---|---|---|
| Pregnancies | 4.865672 | 3.298 |
| Glucose | 141.257463 | 109.98 |
| BloodPressure | 70.824627 | 68.184 |
| SkinThickness | 22.164179 | 19.664 |
| Insulin | 100.335821 | 68.792 |
| BMI | 35.142537 | 30.3042 |
| DiabetesPedigreeFunction | 0.5505 | 0.429734 |
| Age | 37.067164 | 31.19 |
| Outcome | 1.0 | 0.0 |
| RiskScore | 108.459005 | 86.356033 |


The `mean_diff` DataFrame has three columns, of which the output above shows the first two:

**Diabetic**: The average value of each feature for individuals diagnosed with diabetes (Outcome = 1)

**Non-Diabetic**: The average value of each feature for individuals without diabetes (Outcome = 0)

**Difference**: The diabetic mean minus the non-diabetic mean (computed but not displayed). Every feature has a higher mean in the diabetic group, so all differences are positive.
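A difference in means does not by itself show whether the gap is larger than chance variation would produce. The notebook imports `ttest_ind` and `mannwhitneyu` without using them; the sketch below shows how they could be applied to one feature. The helper name `compare_groups` and the demo frame are hypothetical, and Welch's variant of the t-test (`equal_var=False`) is a choice made here, not something the original specifies.

```python
import pandas as pd
from scipy.stats import mannwhitneyu, ttest_ind


def compare_groups(frame: pd.DataFrame, feature: str, group_col: str = "Outcome") -> dict:
    """Compare one feature between the group_col == 1 and group_col == 0 rows."""
    pos = frame.loc[frame[group_col] == 1, feature].dropna()
    neg = frame.loc[frame[group_col] == 0, feature].dropna()
    # Welch's t-test: does not assume equal variances in the two groups.
    t_stat, t_p = ttest_ind(pos, neg, equal_var=False)
    # Mann-Whitney U: rank-based, so it does not assume normality.
    u_stat, u_p = mannwhitneyu(pos, neg, alternative="two-sided")
    return {
        "mean_diff": pos.mean() - neg.mean(),
        "welch_t": t_stat, "welch_p": t_p,
        "mannwhitney_u": u_stat, "mannwhitney_p": u_p,
    }


# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Glucose": [90, 95, 100, 105, 140, 150, 160, 170],
})
result = compare_groups(demo, "Glucose")
print(result)
```

Running `compare_groups(df, col)` for each feature would attach a test statistic and p-value to every row of the means table.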

**Age Distribution by Diabetes Outcome**

In [None]:
fig = px.histogram(df, x='Age', color='Outcome', nbins=30, title='Age Distribution by Diabetes Outcome',
                   color_discrete_map={0: 'blue', 1: 'red'}, marginal='box')
fig.show()


**Age spread**: The histogram shows how individuals of different ages are distributed across the dataset.

**Group comparison**: Red bars (diabetic group) that are relatively more frequent in older age ranges, and blue bars that dominate the younger ranges, indicate that older individuals are more likely to be diabetic. The group means computed earlier point the same way: about 37.1 years for diabetics versus 31.2 for non-diabetics.

**Box plot insights**: The marginal box plots show the median age, interquartile range, and outliers for each group, which makes it easy to check whether diabetics tend to be older.
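Stacked histogram bars show counts, which makes the share of diabetics within each age range hard to judge by eye. A sketch of the same comparison as a table follows. The age bands, the function name `prevalence_by_age`, and the demo rows are assumptions chosen for illustration.

```python
import pandas as pd


def prevalence_by_age(frame: pd.DataFrame, bins=(20, 30, 40, 50, 60, 120)) -> pd.DataFrame:
    """Group size and share of Outcome == 1 within each age band."""
    # Left-closed bands such as [20, 30), [30, 40), ...
    bands = pd.cut(frame["Age"], bins=list(bins), right=False)
    grouped = frame.groupby(bands, observed=True)["Outcome"]
    # The mean of a 0/1 column is the share of ones.
    return pd.DataFrame({"n": grouped.size(), "diabetic_share": grouped.mean()})


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Age": [21, 25, 33, 38, 45, 52],
    "Outcome": [0, 0, 1, 0, 1, 1],
})
by_age = prevalence_by_age(demo)
print(by_age)
```

Bands with few individuals give unstable shares, so the `n` column should be read alongside `diabetic_share`.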

**Insulin Levels by Outcome**

In [None]:
fig = px.box(df, x='Outcome', y='Insulin', color='Outcome', title='Insulin Levels by Diabetes Outcome',
             color_discrete_map={0: 'blue', 1: 'red'})
fig.show()


**Median insulin levels**: Compare the line inside each box. Mean insulin is higher in the diabetic group (about 100.3 versus 68.8 in the group-means table), but the dataset preview shows insulin readings recorded as 0, and such placeholder values pull the medians and lower quartiles toward zero. The medians are only meaningful once those zeros are treated as missing.

**Spread and variability**: A taller box and longer whiskers in the diabetic group indicate more variable insulin readings.

**Outliers**: Dots above the whiskers are individuals with very high insulin levels, which may reflect insulin resistance.

**Overlap**: Some non-diabetic individuals also have high insulin levels, so insulin alone is not a perfect predictor, although it remains informative.
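The dataset preview shows insulin values of 0, and zeros of that kind can shift a group's median considerably. As a quick check, the medians can be computed with and without them. The function name `median_by_outcome` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd


def median_by_outcome(frame: pd.DataFrame, feature: str = "Insulin") -> pd.DataFrame:
    """Per-group medians, with zeros kept and with zeros treated as missing."""
    raw = frame.groupby("Outcome")[feature].median()
    masked = frame[feature].replace(0, np.nan).groupby(frame["Outcome"]).median()
    return pd.DataFrame({"median_raw": raw, "median_zeros_excluded": masked})


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [0, 40, 60, 0, 0, 180],
})
medians = median_by_outcome(demo)
print(medians)
```

In this made-up example the ordering of the two groups flips once zeros are excluded, which is why the box plot should be interpreted with the zero readings in mind.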

**Diabetes Pedigree Function Distribution**

In [None]:
fig = px.histogram(df, x='DiabetesPedigreeFunction', color='Outcome', nbins=30,
                   title='Diabetes Pedigree Function by Outcome',
                   color_discrete_map={0: 'blue', 1: 'red'}, marginal='rug')
fig.show()


The Diabetes Pedigree Function is a score that summarizes family history of diabetes; higher values indicate a stronger hereditary predisposition.

The histogram shows how this value is spread across both groups:

Red bars (diabetic group) that extend further toward higher values suggest that people with a stronger family history are more likely to be affected. The group means computed earlier (about 0.55 for diabetics versus 0.43 for non-diabetics) are consistent with this.

Heavy overlap between the groups means this feature alone does not strongly separate diabetic from non-diabetic individuals.

The rug plot adds detail by marking where each individual falls along the x-axis.

**Feature Importance via Correlation with Outcome**

In [None]:
correlations = df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
fig = px.bar(x=correlations.index, y=correlations.values,
             title='Feature Correlation with Diabetes Outcome',
             labels={'x': 'Feature', 'y': 'Correlation Coefficient'},
             color=correlations.values, color_continuous_scale='RdBu')
fig.show()


Each bar represents a feature (Glucose, BMI, Age, and so on, plus the `RiskScore` column created earlier).

The height and direction of the bar show how strongly that feature is correlated with diabetes:

A positive value means that higher values of the feature are associated with a higher likelihood of diabetes.

A negative value means that higher values of the feature are associated with a lower likelihood of diabetes.

In the RdBu color scale the lowest coefficients are red and the highest are blue. Because the scale spans the observed range of coefficients rather than being centered on zero, the color reflects relative strength, not sign.
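Correlation coefficients alone do not indicate how much of each value could be due to sampling noise. The notebook imports `pearsonr` without using it; the sketch below shows how it could add a p-value to each coefficient. The function name `outcome_correlations` and the demo frame are hypothetical.

```python
import pandas as pd
from scipy.stats import pearsonr


def outcome_correlations(frame: pd.DataFrame, target: str = "Outcome") -> pd.DataFrame:
    """Pearson r and p-value between each column and a binary target."""
    rows = []
    for col in frame.columns.drop(target):
        # Drop rows where either value is missing before testing.
        pair = frame[[col, target]].dropna()
        r, p = pearsonr(pair[col], pair[target])
        rows.append({"feature": col, "r": r, "p_value": p})
    return pd.DataFrame(rows).set_index("feature").sort_values("r", ascending=False)


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Glucose": [85, 89, 100, 137, 148, 183],
    "Noise": [3, 1, 2, 2, 3, 1],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
table = outcome_correlations(demo)
print(table)
```

With a 0/1 target, Pearson's r is the point-biserial correlation, so its sign matches the sign of the difference in group means.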

**CONCLUSION**

In conclusion, this project provided an interactive exploration of the Pima Indians Diabetes dataset and uncovered meaningful patterns among key health indicators. Through data cleaning, visualization, and comparison of group means and correlations, I identified glucose, BMI, and age as the features most strongly associated with diabetes, and also examined the role of insulin levels and genetic predisposition. Comparing diabetic and non-diabetic groups across multiple features showed how these variables relate to disease risk. The composite risk score combined multiple factors into a single metric whose average is clearly higher in the diabetic group (about 108.5 versus 86.4). Overall, the analysis lays a foundation for future predictive modeling, clinical decision support, or public health interventions.