# DSCD 611: Programming for Data Scientists I
# FINAL PROJECT: Predictive Analytics for Early Diabetes Detection

**Group:** Group B15  
**Group Leader:** Edward Tsatsu Akorlie (ID: 22424530)  
**Members:** Daniel K. Adotey, Kwame Ofori-Gyau, Francis A. Sarbeng, Caleb A. Mensah  
**Institution:** Department of Computer Science, University of Ghana – Legon  
**Instructors:** Clifford Broni-Bediako and Michael Soli  
**Date:** 14th November 2025

---

## 1. Introduction and Background
Diabetes mellitus is a chronic condition that has reached epidemic proportions globally. According to the WHO, early diagnosis is an important step in preventing debilitating complications. This project uses the **PIMA Indians Diabetes Dataset** to explore metabolic patterns and predict the likelihood of diabetes.

### Project Objectives
- Perform comprehensive Exploratory Data Analysis (EDA).
- Address four research questions covering prevalence, glucose, BMI, and age.
- Build a robust predictive model using Random Forest.

### 1.1 Import Libraries
We start by importing the necessary data science libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### 1.2 Load Dataset
The dataset is loaded from the local `Data/` directory.

In [None]:
df = pd.read_csv('Data/PIMA_Diabetes_Source.csv')
df.head()

## 2. Exploratory Data Analysis (EDA)
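Data discovery helps identify patterns and anomalies before modeling. We address four fundamental research questions. The plots in this section use the raw data, so the zero placeholders handled in Section 3 are still present in columns such as Glucose and BMI.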

### Q1: What is the prevalence of diabetes in this population?
The class balance of the target variable (`Outcome`) affects how accuracy and other metrics should be read, so we examine its distribution first.

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(x='Outcome', data=df, hue='Outcome', palette='viridis', legend=False)
plt.title('Prevalence of Diabetes (0 = Non-diabetic, 1 = Diabetic)')
plt.show()

counts = df['Outcome'].value_counts(normalize=True) * 100
print(f"Distribution:\nNon-diabetic: {counts[0]:.1f}%\nDiabetic: {counts[1]:.1f}%")

### Q2: How significantly do plasma glucose levels vary between cohorts?
Glucose is a primary clinical marker. We compare its distribution between non-diabetic and diabetic patients and report the mean for each group.

In [None]:
sns.histplot(data=df, x='Glucose', hue='Outcome', kde=True, bins=30, alpha=0.5)
plt.title('Glucose Concentration by Health Outcome')
plt.show()
print(df.groupby('Outcome')['Glucose'].mean())
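
A difference in group means is easier to interpret when it is scaled by the spread of the data. The sketch below computes Cohen's d, a standardized mean difference, on a small made-up sample. The `cohens_d` helper and the toy values are assumptions for illustration and are not part of the project code; in the notebook the two inputs would be the `Glucose` values for each `Outcome` group.

```python
import numpy as np
import pandas as pd

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical toy values, not taken from the PIMA dataset
toy = pd.DataFrame({
    'Glucose': [90, 100, 110, 130, 140, 150],
    'Outcome': [0, 0, 0, 1, 1, 1],
})

d = cohens_d(
    toy.loc[toy['Outcome'] == 1, 'Glucose'],
    toy.loc[toy['Outcome'] == 0, 'Glucose'],
)
print(f"Cohen's d (diabetic vs non-diabetic): {d:.2f}")
```

By a common rule of thumb, values near 0.2 are read as small, 0.5 as medium, and 0.8 or more as large.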

### Q3: What is the relationship between BMI and Glucose, and how does it shift across outcomes?
Obesity and blood sugar are often linked. We use a scatter plot to observe clustering.

In [None]:
sns.scatterplot(x='BMI', y='Glucose', hue='Outcome', data=df, alpha=0.7)
plt.title('Glucose vs BMI Relationship Across Outcomes')
plt.show()
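
The scatter plot shows clustering visually; a per-group correlation coefficient can put a number on it. The following is a minimal sketch on made-up rows (the `toy` frame is hypothetical), comparing the Pearson correlation of BMI and Glucose within each outcome group against the pooled value.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the PIMA dataset
toy = pd.DataFrame({
    'BMI':     [22.0, 25.0, 28.0, 31.0, 30.0, 34.0, 38.0, 42.0],
    'Glucose': [85, 95, 90, 105, 130, 150, 145, 170],
    'Outcome': [0, 0, 0, 0, 1, 1, 1, 1],
})

# Pearson correlation between BMI and Glucose, computed separately per outcome group
per_group_r = {
    outcome: group['BMI'].corr(group['Glucose'])
    for outcome, group in toy.groupby('Outcome')
}
# Correlation over all rows, ignoring the outcome label
overall_r = toy['BMI'].corr(toy['Glucose'])

for outcome, r in per_group_r.items():
    print(f"Outcome {outcome}: r = {r:.2f}")
print(f"All rows:  r = {overall_r:.2f}")
```

If the pooled correlation is much larger than the within-group values, most of the apparent relationship comes from the gap between the two groups rather than from a trend inside each one.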

### Q4: Does age show an observable association with diabetes risk?
We investigate whether older patients have a higher prevalence of diabetes in this population by comparing the age distribution and median age of each outcome group.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Outcome', y='Age', data=df, hue='Outcome', palette='muted', legend=False)
plt.title('Age Distribution by Health Outcome')
plt.show()
print(df.groupby('Outcome')['Age'].median())
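
Prevalence by age group can also be tabulated directly. The sketch below bins ages with `pd.cut` and takes the mean of `Outcome` in each band, which equals the share of diabetic cases. The band edges, labels, and toy rows are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the PIMA dataset
toy = pd.DataFrame({
    'Age':     [21, 24, 29, 33, 38, 41, 47, 52, 58, 63],
    'Outcome': [0,  0,  1,  0,  1,  0,  1,  1,  0,  1],
})

# Assumed band edges; right=False makes each band include its lower edge only
bins = [20, 30, 40, 50, 120]
labels = ['20-29', '30-39', '40-49', '50+']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# The mean of a 0/1 outcome is the prevalence within the band; size gives the band's row count
prevalence = toy.groupby('AgeBand', observed=True)['Outcome'].agg(['mean', 'size'])
print(prevalence)
```

Reporting the group size next to the rate helps flag bands with too few patients to support a conclusion.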

## 3. Data Cleaning and Preprocessing
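Before modeling, we handle zeros that act as placeholders for missing values in the physiological columns. A reading of 0 for glucose, blood pressure, skin thickness, insulin, or BMI is not physiologically plausible, so these entries are treated as missing and imputed with the column median.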
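
Counting the placeholders first shows how much of each column the imputation will touch. This is a minimal sketch on a made-up frame; the `toy` rows are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89],
    'BloodPressure': [72, 0, 64, 66],
    'SkinThickness': [35, 29, 0, 23],
    'Insulin':       [0, 0, 0, 94],
    'BMI':           [33.6, 26.6, 23.3, 0.0],
    'Outcome':       [1, 0, 1, 0],
})
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count zeros per column and express them as a percentage of all rows
zero_counts = (toy[cols] == 0).sum()
zero_share = (zero_counts / len(toy) * 100).round(1)
print(pd.DataFrame({'zeros': zero_counts, 'percent': zero_share}))
```

A column where a large share of values is missing will be dominated by the imputed median, which is worth keeping in mind when interpreting its role in the model.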

In [None]:
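masked_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Mark zero placeholders as missing
for col in masked_cols:
    df[col] = df[col].replace(0, np.nan)

X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Dataset partitioning (80/20 stratified split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Median imputation: medians come from the training split only,
# so no information from the test set leaks into preprocessing
train_medians = X_train[masked_cols].median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Standardization (zero mean, unit variance). Tree-based models such as Random Forest
# do not require it, but it keeps the features ready for scale-sensitive models.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)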
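
One way to keep imputation and scaling confined to the training data is to wrap them in a scikit-learn `Pipeline`, so every cross-validation fold refits them on its own training portion. The sketch below is an alternative arrangement under stated assumptions, not the workflow used in this notebook: the data is synthetic, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 120 rows, 3 numeric features, binary target driven by feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100, scale=15, size=(120, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=10, size=120) > 100).astype(int)

# Blank out roughly 10% of entries to mimic missing measurements
missing_mask = rng.random(X_demo.shape) < 0.1
X_demo[missing_mask] = np.nan

# Each step is fitted on the training portion of a fold and only applied to the held-out portion
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')
print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"Mean ROC-AUC: {scores.mean():.3f}")
```

Cross-validated scores also give a sense of how much the metric varies between splits, which a single 80/20 split cannot show.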

## 4. Modeling and Evaluation
We train a **Random Forest Classifier** and evaluate it on the held-out test set using a classification report (precision, recall, F1) and the ROC-AUC score.

In [None]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)
y_proba = clf.predict_proba(X_test_scaled)[:, 1]  # probability of the positive (diabetic) class

print("--- Classification Performance ---")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")
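
Two further checks are commonly run on a fitted forest: the confusion matrix, which separates false negatives from false positives, and the impurity-based feature importances, which indicate which inputs the trees relied on. The sketch below runs both on synthetic data; the `signal` and `noise` columns and all variable names are hypothetical. In the notebook the same calls would apply to `clf`, `y_test`, `y_pred`, and `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature and one pure-noise feature
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
labels = (demo['signal'] + 0.3 * rng.normal(size=n) > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, labels, test_size=0.25, random_state=42, stratify=labels
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xd_train, yd_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yd_test, model.predict(Xd_test))
print(cm)

# Impurity-based importances sum to 1 across features
importances = pd.Series(model.feature_importances_, index=demo.columns).sort_values(ascending=False)
print(importances)
```

In a screening context the false-negative cell (diabetic patients predicted as non-diabetic) usually deserves the closest attention.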

## 5. Summary and Conclusion
Based on the analysis of the **PIMA Indians Diabetes Dataset**, we conclude that **Glucose**, **BMI**, and **Age** are associated with the diabetes outcome: the exploratory plots show differences between the diabetic and non-diabetic groups on all three. The Random Forest model provides a baseline for early screening, and the classification report and ROC-AUC score above indicate how well it identifies risk patterns from the routine clinical measurements recorded during the study.