In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", 0)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Information

<b>Context</b>
- This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.

<b>Content</b>
- Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

<b>Data Attributes</b>
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

In [None]:
data = pd.read_csv("/content/drive/MyDrive/pandas/Copy of diabetes.csv")


In [None]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


# Step 1: Define Business Questions — Think Like a Data Analyst, Not Just a Coder
Before touching the data, ask which questions actually matter. Here are some examples to get you started, but you should come up with your own ideas:

- Which features (e.g., glucose level, BMI) are strongest predictors of diabetes?

- Are there notable differences in average values of health indicators between diabetic and non-diabetic groups?

- Does age or number of pregnancies influence diabetes likelihood?

- What health factors could a medical clinic focus on to identify at-risk patients early?

- Can we segment patients into meaningful risk groups for targeted interventions?

Keep in mind: without a business or health context, data analysis can be pointless or misleading.

# Step 2: Explore the Dataset with Pandas
You should:

- Load the data into a DataFrame.

- Check basic info: shape, data types, missing or zero values that may require cleaning or imputation.

- Compute summary stats (mean, median, std) grouped by Outcome.

- Compute a correlation matrix to find relationships.

- Optionally, build simple pivot tables to cross-examine features such as Age vs. Outcome.
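
The checklist above could be sketched roughly as follows. The small DataFrame is made-up illustrative data (not rows from the real dataset) so that the snippet runs on its own; in the notebook you would apply the same calls to `data` loaded from the CSV.

```python
import pandas as pd

# Hypothetical toy rows that reuse a few of the dataset's column names.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "BMI":     [33.6, 26.6, 23.3, 0.0, 43.1, 25.6],
    "Age":     [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Basic structure: shape and data types.
print(toy.shape)
print(toy.dtypes)

# Zeros that may stand in for missing measurements.
zero_counts = (toy[["Glucose", "BMI"]] == 0).sum()
print(zero_counts)

# Summary statistics grouped by Outcome.
summary = toy.groupby("Outcome").agg(["mean", "median", "std"])
print(summary)

# Correlation of each feature with Outcome (dropping the trivial self-correlation).
corr_with_outcome = toy.corr()["Outcome"].drop("Outcome")
print(corr_with_outcome.sort_values(ascending=False))

# Simple pivot: mean Glucose for each Outcome value.
pivot = toy.pivot_table(values="Glucose", index="Outcome", aggfunc="mean")
print(pivot)
```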

# Step 3: Write the Analytical Report — Tell a Story, Not Just Numbers

- Objective: What questions are you answering?

- Data Description: Summarize the dataset, any cleaning steps.

- Analysis: Show key findings with code snippets or charts.

- Insights: What do these findings mean for the business or healthcare decision-making? For example:

    - "Higher glucose levels strongly associate with diabetes outcome, so glucose screening is critical."

    - "Older age groups show higher diabetes incidence—suggesting targeted age-based preventive programs."

- Limitations & Next Steps: What can’t this data tell us? What additional data or analysis is needed?



# Sample

Business Question:

<b>Which factors in this dataset are most strongly associated with the presence of diabetes, and how could a health clinic use this information to identify high-risk patients?</b>



In [None]:
print(data.info())
print(data.describe())

zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zero_cols:
    print(f'Zeros in {col}:', (data[col] == 0).sum())


grouped = data.groupby('Outcome').mean()
print(grouped)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count  768.000000   768.000000  768.000000     768.000000     768.000000   
mean   3.845052     120.894531  69.105469      20.536458      79.799479    
std    

In [None]:
corr = data.corr()
print(corr['Outcome'].sort_values(ascending=False))

Outcome                     1.000000
Glucose                     0.466581
BMI                         0.292695
Age                         0.238356
Pregnancies                 0.221898
DiabetesPedigreeFunction    0.173844
Insulin                     0.130548
SkinThickness               0.074752
BloodPressure               0.065068
Name: Outcome, dtype: float64


Key Findings:

- Glucose has the highest positive correlation with diabetes outcome (~0.47).

- BMI (~0.29), Age (~0.24), and Pregnancies (~0.22) show weaker but still positive correlations.

- Several features have zero values that likely indicate missing data rather than true zeros (e.g., Insulin, SkinThickness).

- DiabetesPedigreeFunction (~0.17) correlates more weakly still, but could be a factor when combined with others.
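
Because zeros that really mean "not measured" are still counted as numbers, they can shift a correlation noticeably. A minimal sketch of that effect, using made-up Insulin values (hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: half of the Insulin readings are recorded as 0.
toy = pd.DataFrame({
    "Insulin": [0, 0, 94, 168, 0, 88, 543, 0],
    "Outcome": [1, 0, 0, 1, 0, 0, 1, 1],
})

# Correlation with zeros treated as real measurements.
corr_raw = toy["Insulin"].corr(toy["Outcome"])

# Correlation after marking zeros as missing; Series.corr skips NaN pairs.
masked = toy["Insulin"].replace(0, np.nan)
corr_masked = masked.corr(toy["Outcome"])
n_used = masked.notna().sum()

print(f"raw: {corr_raw:.3f}, zeros masked: {corr_masked:.3f}, rows used: {n_used}")
```

The two coefficients differ, and the masked one rests on fewer rows, so both the value and the sample size behind it are worth reporting.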



# Note on Using LLMs (ChatGPT, Claude, etc.) for This Exercise
Feel free to use large language models (LLMs) to help generate ideas on what business questions to ask, what analyses to try, or how to interpret your results. They can be excellent for:

- Suggesting relevant features or hypotheses you might not think of.

- Providing example code snippets or pandas commands.

- Offering ways to structure your report or explain results clearly.

BUT, here’s the reality check:
- <font color="red">LLMs don’t “know” your specific dataset. They can’t see your data, they don’t check your numbers, and they don’t debug your code.</font> Their advice is often generic or based on patterns learned from many datasets, but:

- They may suggest complicated techniques or models that don’t fit your data or problem well.

- They often don’t spot data quality issues like missing values disguised as zeros or subtle biases.

- They can generate plausible-sounding but inaccurate interpretations if you don’t verify.

<font color="red">Trust yourself more than the LLM. </font>
- Double-check every suggestion, run your own exploratory analysis, and critically evaluate any output or recommendation.
- The power to deliver real insights lies in your understanding of the data and domain, not in blindly copying AI outputs.


Most importantly:

- Sometimes, you might want to uncover certain business insights, but the data simply doesn’t contain that information or signal. <b>That’s a completely valid outcome!</b>

# Now It’s Your Turn
Explore the dataset and bring forward at least 5 key business insights related to diabetes risk or health indicators.

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("/content/drive/MyDrive/pandas/Copy of diabetes.csv")

In [None]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50,1
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31,0
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1


In [None]:
data.shape

(768, 9)

In [None]:
data.dtypes

Unnamed: 0,0
Pregnancies,int64
Glucose,float64
BloodPressure,float64
SkinThickness,float64
Insulin,float64
BMI,float64
DiabetesPedigreeFunction,float64
Age,int64
Outcome,int64


In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pregnancies,768.0,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,768.0,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,768.0,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,768.0,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,768.0,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,768.0,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,768.0,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,768.0,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,768.0,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
(data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']]==0).sum()

Unnamed: 0,0
Glucose,5
BloodPressure,35
SkinThickness,227
Insulin,374
BMI,11


In [None]:
# Changing 0 value to NaN

In [None]:
import numpy as np

cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for col in cols_with_missing:
    data[col] = data[col].replace(0, np.nan)

data.isna().sum()

Unnamed: 0,0
Pregnancies,0
Glucose,5
BloodPressure,35
SkinThickness,227
Insulin,374
BMI,11
DiabetesPedigreeFunction,0
Age,0
Outcome,0


In [None]:
# Filling NaN values with the column mean

In [None]:
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Assign the result back to the column instead of calling
# fillna(..., inplace=True) on a column selection, which pandas warns
# will stop working in pandas 3.0.
for col in cols_with_missing:
    data[col] = data[col].fillna(data[col].mean())


In [None]:
data.isnull().sum()

Unnamed: 0,0
Pregnancies,0
Glucose,0
BloodPressure,0
SkinThickness,0
Insulin,0
BMI,0
DiabetesPedigreeFunction,0
Age,0
Outcome,0
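
The cells above fill missing values with the column mean. If a median fill is preferred instead (the median is less affected by extreme readings), a minimal sketch could look like this. The values are made up for illustration, and this is an alternative, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN where a measurement is missing.
toy = pd.DataFrame({
    "Insulin": [np.nan, 94.0, 168.0, np.nan, 846.0],
    "BMI":     [33.6, np.nan, 23.3, 28.1, 43.1],
})
cols = ["Insulin", "BMI"]

# Per-column medians, computed from the observed (non-NaN) values only.
medians = toy[cols].median()

# fillna with a Series fills each column with its own matching value.
filled = toy.copy()
filled[cols] = filled[cols].fillna(medians)

print(medians)
print(filled)
```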


In [None]:
# Outcome

In [None]:
data['Outcome'].value_counts()

Outcome,count
0,500
1,268


In [None]:
data.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,110.710121,70.935397,27.768651,142.210761,30.888434,0.429734,31.19
1,4.865672,142.165573,75.147324,31.736944,180.431548,35.384757,0.5505,37.067164


I used descriptive statistics (mean, median, min, max, std) with the groupby method, plus a pivot table, to compare average health indicators by age and diabetes outcome. Zeros standing in for missing measurements were replaced with NaN and then filled with the column mean. The analysis focuses on differences between diabetic and non-diabetic individuals across ages.


In [None]:
grouped_age = data.groupby('Outcome')['Age'].agg(['mean', 'median', 'min', 'max','std'])
grouped_age

Outcome,mean,median,min,max,std
0,31.19,27.0,21,81,11.667655
1,37.067164,36.0,21,70,10.968254


# Analysis of Age
The statistics for Age show that it is a relevant factor in diabetes prediction.

**Age Range**: Both groups start at the same minimum age of **21**. However, the non-diabetic group (Outcome=0) has a higher maximum age of **81**, compared to **70** for the diabetic group (Outcome=1).

**Average Age**: The average age for the diabetic group (Outcome=1) is **37.07**, noticeably older than the **31.19** average for the non-diabetic group. This suggests that older individuals are more likely to have diabetes.

**Median**: The median age of diabetic individuals **(36)** is higher than that of non-diabetics **(27)**, indicating that diabetes is more common in older age groups.

**Standard Deviation (std)**: The spread is similar for both groups (**11.67** vs. **10.97**), so the variation in ages is comparable and does not change the main insight: the mean and median ages are both higher for diabetic individuals.
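
One way to check the "older means more likely" reading more directly is to compute the share of diabetic patients per age band. A minimal sketch, assuming made-up rows and arbitrary bin edges (both are illustrative choices, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age":     [21, 24, 29, 33, 41, 50, 62, 70],
    "Outcome": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Assumed bin edges; pd.cut uses right-closed intervals by default,
# so (20, 30] covers ages 21 through 30.
bins = [20, 30, 40, 50, 81]
labels = ["21-30", "31-40", "41-50", "51+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of diabetic patients in each band;
# the count shows how many patients that share is based on.
rate_by_band = toy.groupby("AgeBand", observed=False)["Outcome"].agg(["mean", "count"])
print(rate_by_band)
```

Reporting the count next to the rate makes it easier to spot bands where the rate rests on only a handful of patients.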



In [None]:
grouped_insulin = data.groupby('Outcome')['Insulin'].agg(['mean', 'median', 'min', 'max','std'])
grouped_insulin

Outcome,mean,median,min,max,std
0,142.210761,155.548223,15.0,744.0,75.463785
1,180.431548,155.548223,14.0,846.0,95.747538


# Analysis of Insulin
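The statistics for **Insulin** suggest it is a relevant factor in diabetes prediction: the mean is higher for the diabetic group (**180.43** vs. **142.21**). However, both groups share the same median (**155.55**), which is the value used to fill the 374 missing readings, so this comparison is heavily shaped by the imputation and should be read with caution.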
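
Both groups show an identical Insulin median in the table above. That is what a single fill value produces when more than half of each group is imputed. A small sketch of the effect on made-up numbers (hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 of 5 Insulin readings are missing in each group.
toy = pd.DataFrame({
    "Insulin": [np.nan, np.nan, np.nan, 90.0, 150.0,
                np.nan, np.nan, np.nan, 200.0, 400.0],
    "Outcome": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Medians of the observed values only (NaN is skipped).
medians_observed = toy.groupby("Outcome")["Insulin"].median()

# Fill every gap with the overall mean of the observed values.
fill_value = toy["Insulin"].mean()
imputed = toy["Insulin"].fillna(fill_value)

# After filling, the fill value is the middle element in both groups.
medians_after = imputed.groupby(toy["Outcome"]).median()

print(medians_observed)
print(medians_after)
```

Comparing the observed-only medians with the post-fill medians shows how much of a group difference the imputation can hide.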



In [None]:
grouped_glucose = data.groupby('Outcome')['Glucose'].agg(['mean', 'median', 'min', 'max','std'])
grouped_glucose

Outcome,mean,median,min,max,std
0,110.710121,107.5,44.0,197.0,24.71706
1,142.165573,140.0,78.0,199.0,29.54175


In [None]:
grouped_pregnancy = data.groupby('Outcome')['Pregnancies'].agg(['mean', 'median', 'min', 'max','std'])
grouped_pregnancy

Outcome,mean,median,min,max,std
0,3.298,2.0,0,13,3.017185
1,4.865672,4.0,0,17,3.741239


In [None]:
grouped_BMI = data.groupby('Outcome')['BMI'].agg(['mean', 'median', 'min', 'max','std'])
grouped_BMI

Outcome,mean,median,min,max,std
0,30.888434,30.4,18.2,57.3,6.504779
1,35.384757,34.25,22.9,67.1,6.595065


In [None]:
grouped_bloodpressure = data.groupby('Outcome')['BloodPressure'].agg(['mean', 'median', 'min', 'max','std'])
grouped_bloodpressure

Outcome,mean,median,min,max,std
0,70.935397,72.0,24.0,122.0,11.931033
1,75.147324,74.0,30.0,114.0,11.945712


In [None]:
grouped_DPF = data.groupby('Outcome')['DiabetesPedigreeFunction'].agg(['mean', 'median', 'min', 'max','std'])
grouped_DPF

Outcome,mean,median,min,max,std
0,0.429734,0.336,0.078,2.329,0.299085
1,0.5505,0.449,0.088,2.42,0.372354


In [None]:
grouped_skinthickness = data.groupby('Outcome')['SkinThickness'].agg(['mean', 'median', 'min', 'max','std'])
grouped_skinthickness

Outcome,mean,median,min,max,std
0,27.768651,29.15342,7.0,60.0,8.559606
1,31.736944,29.15342,7.0,99.0,8.647599
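
The per-feature cells above repeat the same aggregation for each column. As a possible shortcut, the same statistics can be requested for several features in one `groupby` call. The sketch below uses three columns from the first four rows shown by the initial `data.head()`, so that it runs without the CSV.

```python
import pandas as pd

# Three columns from the first four rows of the original data.head() output.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0],
    "BMI":     [33.6, 26.6, 23.3, 28.1],
    "Age":     [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

stats = ["mean", "median", "min", "max", "std"]
features = [col for col in toy.columns if col != "Outcome"]

# One call yields a table with (feature, statistic) column pairs.
summary = toy.groupby("Outcome")[features].agg(stats)

# Transposing puts one (feature, statistic) pair per row, which is easier to scan.
print(summary.T)

# A single feature can still be pulled out on its own.
print(summary["Glucose"])
```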


# Correlation with Outcome

In [None]:
corr = data.corr()
print(corr['Outcome'].sort_values(ascending=False))

Outcome                     1.000000
Glucose                     0.492928
BMI                         0.311924
Age                         0.238356
Pregnancies                 0.221898
SkinThickness               0.215299
Insulin                     0.214411
DiabetesPedigreeFunction    0.173844
BloodPressure               0.166074
Name: Outcome, dtype: float64
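
Pearson correlation can be pulled around by extreme values, so it may be worth comparing it with a rank-based (Spearman) coefficient. The sketch below computes Spearman as Pearson on ranks, using three columns from the five rows of the original `data.head()` output. With so few rows the numbers only illustrate the mechanics.

```python
import pandas as pd

# Three columns from the five rows shown by the original data.head().
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
    "Outcome": [1, 0, 1, 0, 1],
})

# Pearson correlation with Outcome, without the trivial Outcome-vs-Outcome entry.
pearson = toy.corr()["Outcome"].drop("Outcome")

# Spearman correlation is the Pearson correlation of the ranked values.
spearman = toy.rank().corr()["Outcome"].drop("Outcome")

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
comparison = comparison.sort_values("pearson", ascending=False)
print(comparison)
```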


In [None]:
pivot_age = pd.pivot_table(data, index='Age', columns='Outcome', aggfunc='mean')
pivot_age

Age,BMI (Outcome=0),BMI (Outcome=1),BloodPressure (Outcome=0),BloodPressure (Outcome=1),DiabetesPedigreeFunction (Outcome=0),DiabetesPedigreeFunction (Outcome=1),Glucose (Outcome=0),Glucose (Outcome=1),Insulin (Outcome=0),Insulin (Outcome=1),Pregnancies (Outcome=0),Pregnancies (Outcome=1),SkinThickness (Outcome=0),SkinThickness (Outcome=1)
21,28.656421,37.56,66.765607,70.8,0.415828,0.6426,107.735979,139.4,135.908717,222.328934,1.12069,0.6,24.046338,34.830684
22,29.575655,35.045455,65.724845,72.363636,0.389574,0.658273,103.973336,153.818182,127.305068,191.662667,1.442623,2.181818,23.987334,32.873971
23,30.693548,35.085714,69.058399,74.343598,0.431581,0.469571,109.032258,122.857143,164.202882,135.741842,1.516129,1.857143,26.756783,27.922894
24,32.117302,38.775,67.958031,68.800648,0.389737,0.41175,113.131579,140.5,137.00334,147.012056,2.052632,1.125,28.757029,34.394177
25,30.269337,38.328571,64.741634,68.028942,0.483294,0.885143,95.382353,145.785714,123.650791,183.91008,1.735294,1.857143,27.38904,36.450489
26,32.682299,45.95,68.016207,70.300648,0.41364,0.412875,114.08,131.125,120.251574,174.080584,2.12,1.5,27.310684,37.788355
27,30.570833,36.0875,75.25,77.300648,0.430167,0.5965,110.708333,129.0,153.588621,141.71764,2.458333,2.875,28.932532,30.432532
28,32.788,35.78,73.416207,70.040518,0.43732,0.5154,109.8,145.2,137.881218,204.164467,3.44,2.0,28.784547,31.146026
29,33.5,33.592308,73.462824,72.954245,0.37375,0.452154,114.1875,143.615385,132.96764,213.903163,2.8125,3.923077,31.32671,32.958481
30,31.08,32.826244,73.6,67.135061,0.287267,0.567167,116.866667,135.833333,159.625719,175.440778,3.733333,3.333333,28.917807,27.884473
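
A pivot with one row per individual age gets long, and each cell may rest on only a few patients. A possible alternative is to pivot on age bands. The sketch below uses made-up rows and assumed bin edges (both hypothetical) and keeps a single feature for readability.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age":     [21, 22, 25, 28, 33, 38, 45, 52],
    "Glucose": [100.0, 140.0, 95.0, 150.0, 110.0, 145.0, 120.0, 160.0],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Assumed bin edges; intervals are right-closed, so (20, 30] covers 21 through 30.
toy["AgeBand"] = pd.cut(toy["Age"], bins=[20, 30, 40, 81],
                        labels=["21-30", "31-40", "41+"])

# Mean Glucose per age band, split by Outcome.
pivot_band = pd.pivot_table(
    toy,
    values="Glucose",
    index="AgeBand",
    columns="Outcome",
    aggfunc="mean",
    observed=False,
)
print(pivot_band)
```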
