<center>
<h1>Activity: Diabetes Factors Visualizations<br>using Plotly</h1>
<h2>Johann Sebastian Catalla, BSCS-III</h2>
Professor: Dean Rodrigo Belleza Jr. <br>
As a partial requirement for the semifinal period of the course <br>CSDATA01: Data Science Fundamentals<br><br>
</center>

---

<b>Data Dictionary</b>
<ul>
    <li>Pregnancies: Number of times pregnant</li>
    <li>Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
    <li>BloodPressure: Diastolic blood pressure (mm Hg)</li>
    <li>SkinThickness: Triceps skin fold thickness (mm)</li>
    <li>Insulin: 2-Hour serum insulin (mu U/ml)</li>
    <li>BMI: Body mass index (weight in kg/(height in m)^2)</li>
    <li>DiabetesPedigreeFunction: Diabetes pedigree function</li>
    <li>Age: Age (years)</li>
    <li>Outcome: Class variable (0 or 1)</li>
</ul>
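The BMI entry above is the only field defined by a formula, so a small worked example may help. This is a minimal sketch; the function name `bmi` and the example weight and height are hypothetical and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person, not a row from the dataset.
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1))  # 22.9
```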

In [24]:
import pandas as pd
df = pd.read_csv('diabetes.csv')
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [25]:
import plotly.express as px
import plotly.graph_objects as go

In [26]:
# 1. Histogram: Distribution of Glucose Levels
fig1 = px.histogram(df, x='Glucose', color='Outcome', 
                    nbins=10, title="Distribution of Glucose Levels by Outcome",
                    labels={'Glucose': 'Glucose Levels', 'Outcome': 'Diabetes Outcome'})
fig1.update_layout(bargap=0.2)
fig1.show()

The histogram shows two distinct distributions for individuals with and without diabetes (Outcome = 1 and 0).
Individuals with diabetes tend to have higher glucose levels, with peaks around 150 and above.
Individuals without diabetes have a wider range of glucose levels but tend to cluster around lower levels, typically under 120.
This indicates that higher glucose levels might be a strong predictor of diabetes.
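The visual impression from the histogram can be cross-checked with a per-group summary. The sketch below uses only the nine rows visible in the DataFrame preview above, so its numbers are illustrative and will differ from those of the full dataset; with the real data, the same `groupby` call could be run on `df` directly.

```python
import pandas as pd

# The nine rows visible in the DataFrame preview (Glucose and Outcome only).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 101, 122, 121, 126],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Count, mean, and median glucose for each outcome group.
glucose_by_outcome = sample.groupby("Outcome")["Glucose"].agg(["count", "mean", "median"])
print(glucose_by_outcome)
```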

In [27]:
# 2. Scatter Plot: Relationship Between BMI and Glucose Levels by Outcome
fig2 = px.scatter(df, x='BMI', y='Glucose', color='Outcome',
                  title="Relationship Between BMI and Glucose Levels by Outcome",
                  labels={'BMI': 'Body Mass Index', 'Glucose': 'Glucose Levels'},
                  hover_data=['Age', 'DiabetesPedigreeFunction'])
fig2.show()

Higher glucose levels are more prevalent among individuals with diabetes (Outcome = 1), particularly at BMIs over 30.
Individuals with lower BMI and glucose levels are more likely to be non-diabetic (Outcome = 0). Taken together, glucose level and BMI appear to be informative factors for diabetes status.
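Because `Outcome` is stored as the numbers 0 and 1, Plotly Express treats it as a continuous color scale in scatter plots. One possible way to get a two-entry legend instead is to map the values to text labels first. The column name `OutcomeLabel` below is a hypothetical helper, and the rows are the first five from the preview above.

```python
import pandas as pd

sample = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Glucose": [148, 85, 183, 89, 137],
    "Outcome": [1, 0, 1, 0, 1],
})

# OutcomeLabel is a hypothetical helper column, not part of the original dataset.
labelled = sample.assign(
    OutcomeLabel=sample["Outcome"].map({0: "No Diabetes", 1: "Diabetes"})
)
print(labelled["OutcomeLabel"].tolist())
```

Passing `color='OutcomeLabel'` to `px.scatter` should then draw each group as its own discrete trace.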

In [28]:
# 3. Box Plot: Pregnancies and Diabetes Outcome
fig3 = px.box(df, x='Outcome', y='Pregnancies', color='Outcome',
              title="Distribution of Pregnancies by Diabetes Outcome",
              labels={'Pregnancies': 'Number of Pregnancies', 'Outcome': 'Diabetes Outcome'})
fig3.update_xaxes(ticktext=["No Diabetes", "Diabetes"], tickvals=[0, 1])
fig3.show()

Individuals with diabetes (Outcome = 1) tend to have a slightly higher median number of pregnancies compared to non-diabetic individuals.
The range of pregnancies for diabetics also appears more spread out, indicating more variability.
This suggests that a higher number of pregnancies may be associated with a slightly higher risk of diabetes, possibly through gestational diabetes or related factors; the plot alone does not establish a cause.
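The medians and spread read off the box plot can also be computed directly. This is a rough sketch on the nine preview rows only, so the values are illustrative. Plotly's box trace may use a slightly different quartile convention than pandas, so the numbers can differ a little from what the figure's hover labels show.

```python
import pandas as pd

sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 10, 2, 5, 1],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Quartiles per outcome group, one row per group.
quartiles = (
    sample.groupby("Outcome")["Pregnancies"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)
# Interquartile range as a simple measure of spread.
quartiles["IQR"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```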

In [29]:
# 4. Violin Plot: Age Distribution of Diabetic vs. Non-Diabetic Individuals
fig4 = px.violin(df, x='Outcome', y='Age', color='Outcome', box=True, points='all',
                 title="Age Distribution by Diabetes Outcome",
                 labels={'Age': 'Age (years)', 'Outcome': 'Diabetes Outcome'})
fig4.update_xaxes(ticktext=["No Diabetes", "Diabetes"], tickvals=[0, 1])
fig4.show()

Diabetic individuals (Outcome = 1) tend to be older, with most of the distribution concentrated between roughly 30 and 50 years.
Non-diabetic individuals (Outcome = 0) have a broader age distribution that includes more young individuals.
This implies that age might play a role in the likelihood of developing diabetes, with older individuals at a higher risk.
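One way to make the age effect concrete is to group ages into bands and compute the share of diabetic individuals in each band. The band edges and labels below are illustrative assumptions, and the nine preview rows are far too few to draw conclusions from; the sketch only shows the mechanics.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33, 63, 27, 30, 47],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Band edges are an illustrative assumption, not taken from the notebook.
bands = pd.cut(sample["Age"], bins=[20, 30, 50, 90], labels=["21-30", "31-50", "51+"])

# The mean of a 0/1 column is the share of rows with Outcome = 1 in each band.
rate_by_band = sample.groupby(bands, observed=True)["Outcome"].agg(["count", "mean"])
print(rate_by_band)
```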


In [30]:
# 5. Heatmap: Correlation Matrix
correlation = df.corr()
correlation_matrix = correlation.values.astype(float)
fig5 = px.imshow(correlation_matrix, x=correlation.columns, y=correlation.columns,
                 color_continuous_scale='Viridis', title="Correlation Matrix for Diabetes Dataset",
                 labels=dict(color="Correlation"))
fig5.update_traces(text=correlation.values.round(2), texttemplate="%{text}")

fig5.show()

Glucose has the strongest positive correlation with Outcome (diabetes status) of any feature in the matrix, indicating that higher glucose levels are associated with diabetes.
BMI and Age show smaller positive correlations with Outcome, consistent with these factors being related to diabetes risk.
Other variables, such as BloodPressure and SkinThickness, have weaker correlations with Outcome, suggesting a less direct linear relationship with diabetes.
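Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `df.corr()`. The sketch below spells out that calculation in plain Python for Glucose and Outcome on the nine preview rows. The helper name `pearson` is hypothetical, and the result on this small sample will not match the value in the full correlation matrix.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


glucose = [148, 85, 183, 89, 137, 101, 122, 121, 126]
outcome = [1, 0, 1, 0, 1, 0, 0, 0, 1]
r = pearson(glucose, outcome)
print(round(r, 2))
```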

In [31]:
# 6. Scatter Plot: Relationship Between Age and Glucose Levels by Outcome
fig6 = px.scatter(df, x='Age', y='Glucose', color='Outcome',
                  title="Relationship Between Age and Glucose Levels by Outcome",
                  labels={'Age': 'Age (years)', 'Glucose': 'Glucose Levels'},
                  hover_data=['BMI', 'DiabetesPedigreeFunction'])

fig6.show()

This scatter plot shows how glucose levels vary with age for diabetic (Outcome = 1) and non-diabetic (Outcome = 0) individuals.
Diabetic individuals tend to have higher glucose levels across a broad age range.
Younger individuals (under 30) tend to have lower glucose levels and are mostly non-diabetic, while diabetic cases become more frequent at older ages, which is consistent with age being a risk factor.

In [32]:
# 7. Box Plot: Comparison of Insulin Levels by Outcome
fig7 = px.box(df, x='Outcome', y='Insulin', color='Outcome',
              title="Distribution of Insulin Levels by Diabetes Outcome",
              labels={'Insulin': 'Insulin Levels', 'Outcome': 'Diabetes Outcome'})
fig7.update_xaxes(ticktext=["No Diabetes", "Diabetes"], tickvals=[0, 1])
fig7.show()

Diabetic individuals (Outcome = 1) appear to have more variability in insulin levels compared to non-diabetic individuals.
Some individuals with diabetes exhibit very high insulin levels (outliers), which may reflect irregular insulin production or usage.
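The preview above shows several rows with `Insulin` equal to 0, which is implausible for a 2-hour serum insulin reading. Assuming those zeros stand for values that were not recorded (an assumption, not something the notebook states), they could be masked before comparing groups, since they otherwise pull the boxes toward zero. A minimal sketch on the nine preview rows:

```python
import pandas as pd

sample = pd.DataFrame({
    "Insulin": [0, 0, 0, 94, 168, 180, 0, 112, 0],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Assumption: 0 encodes "not recorded" rather than a real measurement.
insulin_clean = sample["Insulin"].mask(sample["Insulin"] == 0)

n_missing = int(insulin_clean.isna().sum())
# median() skips NaN by default, so only recorded values are compared.
median_by_outcome = insulin_clean.groupby(sample["Outcome"]).median()
print(n_missing)
print(median_by_outcome)
```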


In [33]:
# 8. Scatter Plot with Regression Line: Pregnancies vs. Glucose
fig8 = px.scatter(df, x='Pregnancies', y='Glucose', trendline='ols', color='Outcome',
                  title="Pregnancies vs. Glucose Levels with Regression Line",
                  labels={'Pregnancies': 'Number of Pregnancies', 'Glucose': 'Glucose Levels'})
fig8.show()

The regression line shows a slight positive relationship between the number of pregnancies and glucose levels.
Diabetic individuals (Outcome = 1) generally exhibit higher glucose levels irrespective of the number of pregnancies.
This plot supports the idea that while pregnancies may contribute to diabetes risk, glucose levels are more directly indicative of diabetes.
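The `trendline='ols'` option fits an ordinary least squares line (Plotly Express relies on the `statsmodels` package for this). For a single predictor, the slope and intercept have a closed form, sketched below in plain Python. The helper `ols_fit` is hypothetical and is not Plotly's implementation; the data are the nine preview rows, so the fitted values are only illustrative.

```python
def ols_fit(xs, ys):
    """Slope and intercept of the least squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


pregnancies = [6, 1, 8, 1, 0, 10, 2, 5, 1]
glucose = [148, 85, 183, 89, 137, 101, 122, 121, 126]
slope, intercept = ols_fit(pregnancies, glucose)
print(slope, intercept)
```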

In [34]:
# 9. Pie Chart: Outcome Distribution
fig9 = px.pie(df, names='Outcome', title="Proportion of Diabetic vs. Non-Diabetic Individuals",
              color='Outcome', color_discrete_map={0: 'blue', 1: 'orange'},
              labels={'Outcome': 'Diabetes Outcome'})
fig9.update_traces(textinfo='percent+label')
fig9.show()

The pie chart shows the class balance, that is, the share of diabetic (Outcome = 1) and non-diabetic (Outcome = 0) individuals in the dataset.
If one class clearly outnumbers the other, models trained on this data may favor the majority class and generalize poorly to the minority class.
Checking this balance first helps in choosing evaluation metrics and in interpreting model predictions fairly.
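The proportions shown in the pie chart can be obtained as numbers with `value_counts`. The sketch below uses only the `Outcome` values of the nine preview rows, so the shares are illustrative; on the full data the same calls could be applied to `df['Outcome']`.

```python
import pandas as pd

outcome = pd.Series([1, 0, 1, 0, 1, 0, 0, 0, 1], name="Outcome")

# Absolute counts and relative shares per class, ordered by class label.
counts = outcome.value_counts().sort_index()
shares = outcome.value_counts(normalize=True).sort_index()
print(counts)
print(shares.round(3))
```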
