# Step 0: Exploring the data

In [0]:
%sql
SELECT * FROM workspace.default.diabetes LIMIT 10;


Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1


In [0]:
%sql
SELECT COUNT(*) AS total_patients FROM workspace.default.diabetes;



total_patients
768


In [0]:
%sql
SELECT COUNT(*) AS diabetic_cases
FROM workspace.default.diabetes
WHERE Outcome = 1;


diabetic_cases
268


# Step 1: Analyze and Visualize Patterns


## 1. Diabetic vs Non-Diabetic Counts

In [0]:
%sql
SELECT Outcome, COUNT(*) AS count
FROM workspace.default.diabetes
GROUP BY Outcome;


Outcome,count
0,500
1,268
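The counts above work out to 268 of 768 patients, or roughly 34.9%, with diabetes. As a minimal sketch, the same query can return that share directly. The column aliases `patient_count` and `pct_of_total` are illustrative names, not part of the original notebook.

```sql
-- Share of each outcome as a percentage of all patients.
-- SUM(COUNT(*)) OVER () adds up the per-group counts across the whole result,
-- so each row is divided by the total number of patients.
SELECT
  Outcome,
  COUNT(*) AS patient_count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_total
FROM workspace.default.diabetes
GROUP BY Outcome
ORDER BY Outcome;
```

Computing the percentage in SQL keeps the figure in step with the table if rows are added or cleaned later.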


## 2. Average Glucose by Outcome

In [0]:
%sql
SELECT Outcome, AVG(Glucose) AS avg_glucose
FROM workspace.default.diabetes
GROUP BY Outcome;


Outcome,avg_glucose
0,109.98
1,141.25746268656715


## 3. Average BMI by Outcome

In [0]:
%sql
SELECT Outcome, AVG(BMI) AS avg_bmi
FROM workspace.default.diabetes
GROUP BY Outcome;


Outcome,avg_bmi
0,30.30419999999996
1,35.14253731343278
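The two averages above can also be gathered in one pass and rounded for readability. This is a sketch rather than the notebook's own query, and the alias `patient_count` is an illustrative name. Rounded to two decimals, the values shown above would read 109.98 and 141.26 for glucose, and 30.30 and 35.14 for BMI.

```sql
-- One row per outcome with the group size and both rounded averages,
-- so glucose and BMI can be compared side by side.
SELECT
  Outcome,
  COUNT(*) AS patient_count,
  ROUND(AVG(Glucose), 2) AS avg_glucose,
  ROUND(AVG(BMI), 2) AS avg_bmi
FROM workspace.default.diabetes
GROUP BY Outcome
ORDER BY Outcome;
```

Including the group size next to each average makes it clear that the two groups are not the same size (500 versus 268).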


## 4. Age Distribution (Diabetic Only)

In [0]:
%sql
SELECT Age, COUNT(*) AS count
FROM workspace.default.diabetes
WHERE Outcome = 1
GROUP BY Age
ORDER BY Age;


Age,count
21,5
22,11
23,7
24,8
25,14
26,8
27,8
28,10
29,13
30,6
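The rows above cover ages 21 to 30 only and add up to 90 of the 268 diabetic cases, so they are a partial view. Raw counts of diabetic patients also depend on how many patients of each age are in the table. A sketch of a rate-based alternative follows: it groups all patients into age bands and reports the diabetic share within each band. The band boundaries and the names `banded`, `age_band`, `youngest_age`, and `diabetic_pct` are assumptions chosen for illustration, and the query assumes `Outcome` is stored as a numeric 0/1 column.

```sql
-- Diabetic rate per age band, computed over all patients (not only Outcome = 1).
WITH banded AS (
  SELECT
    Age,
    Outcome,
    CASE
      WHEN Age < 30 THEN 'under 30'
      WHEN Age < 40 THEN '30-39'
      WHEN Age < 50 THEN '40-49'
      ELSE '50 and over'
    END AS age_band
  FROM workspace.default.diabetes
)
SELECT
  age_band,
  MIN(Age) AS youngest_age,                      -- used only to sort the bands
  COUNT(*) AS patients,
  SUM(Outcome) AS diabetic_cases,                -- Outcome is 0/1, so the sum is a count
  ROUND(100.0 * AVG(Outcome), 1) AS diabetic_pct -- mean of a 0/1 column is the rate
FROM banded
GROUP BY age_band
ORDER BY youngest_age;
```

Comparing `diabetic_pct` across bands is one way to check whether diabetes is in fact more common among older patients in this dataset.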


# Step 2: Project Summary: Diabetes Dataset Analysis (SQL)
**Author**: Harika Gangu

**Platform**: Databricks Community Edition  
**Dataset**: Pima Indians Diabetes Dataset (from Kaggle)

## Objective:
Explore factors associated with diabetes using SQL analysis in Databricks.

## Dataset:
Pima Indians Diabetes dataset (768 records, 9 columns). Key columns include glucose level, BMI, age, and the outcome label (0 = non-diabetic, 1 = diabetic).

## Key Insights:
- About 35% of the patients in the dataset (268 of 768) are diabetic.
- Diabetic patients show higher average glucose (141.3 vs. 110.0) and higher average BMI (35.1 vs. 30.3) than non-diabetic patients.
- Diabetes is more common in patients over age 30.
- Visualizations show a clear difference in glucose and BMI between diabetic and non-diabetic groups.

## Tools Used:
- Databricks Community Edition
- SQL Warehouses
- Pie and bar charts for visualization

## Limitations:
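- No predictive modeling was done because the analysis was limited to SQL.
- Several columns contain 0 where a value is most likely missing (for example, `Insulin`, `SkinThickness`, `BloodPressure`, and `BMI` in the preview rows), which pulls averages over those columns downward.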
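
The zero-value limitation can be sized with a quick check. The sketch below counts zeros in the columns where 0 is not a plausible measurement and compares the BMI average with and without those rows. All aliases are illustrative names. `Pregnancies` is left out on the assumption that 0 is a legitimate value there.

```sql
-- Count zero entries per measurement column and show their effect on one average.
SELECT
  SUM(CASE WHEN Glucose = 0 THEN 1 ELSE 0 END) AS zero_glucose,
  SUM(CASE WHEN BloodPressure = 0 THEN 1 ELSE 0 END) AS zero_blood_pressure,
  SUM(CASE WHEN SkinThickness = 0 THEN 1 ELSE 0 END) AS zero_skin_thickness,
  SUM(CASE WHEN Insulin = 0 THEN 1 ELSE 0 END) AS zero_insulin,
  SUM(CASE WHEN BMI = 0 THEN 1 ELSE 0 END) AS zero_bmi,
  ROUND(AVG(BMI), 2) AS avg_bmi_with_zeros,
  -- NULLIF turns 0 into NULL, and AVG skips NULLs
  ROUND(AVG(NULLIF(BMI, 0)), 2) AS avg_bmi_without_zeros
FROM workspace.default.diabetes;
```

If the zero counts are large for a column, averages over that column are better computed with the `NULLIF` form.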