# Lab - Diabetes Classification Using PIMA Indian Dataset
![Diabetes Classification](diabetes_classification.png)

Estimated time needed: **30** minutes


## Lab Objectives

In this lab you will:

1. **Load the PIMA Indian diabetes dataset** from a CSV file
2. **Display the top 5 rows** to understand the data structure
3. **Identify input features (X) and target variable (y)** for prediction
4. **Choose a machine learning algorithm** suitable for medical diagnosis
5. **Train the model** using the dataset
6. **Print the accuracy** and evaluate model performance
7. **Make predictions** on new patient data


## About the Dataset

The PIMA Indian Diabetes Dataset contains medical information about female patients of Pima Indian heritage, aged 21 and older. This dataset is widely used in medical machine learning research to predict diabetes onset.

**Features include:**
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration (2 hours into an oral glucose tolerance test)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (μU/mL)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function (genetic factor)
- Age: Age in years
- Outcome: 0 (no diabetes) or 1 (diabetes)

---

## Step 1: Import Required Libraries

First, you need to import the Python libraries that will help you work with data and machine learning. Think of these as specialized medical tools for data analysis.

In [None]:
# Import pandas for data manipulation (like Excel for Python)
import pandas as pd

In [None]:
# Import numpy for numerical operations
import numpy as np

In [None]:
# Import machine learning tools from scikit-learn
from sklearn.model_selection import train_test_split

In [None]:
# Import our machine learning algorithm
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Import evaluation metrics for measuring model performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Import matplotlib for creating graphs and charts
import matplotlib.pyplot as plt

In [None]:
# Import seaborn for better-looking statistical plots
import seaborn as sns

In [None]:
# Set up the plotting style for better visualization
plt.style.use('default')
sns.set_palette("husl")

In [None]:
# Confirm all libraries are loaded successfully
print("All libraries imported successfully!")
print("Ready to begin diabetes classification analysis.")

## Step 2: Load Dataset from CSV File

You will load the PIMA Indian diabetes dataset directly from a CSV file.

In [None]:
# Define the csv_file path where the dataset is stored on the disk
csv_file = "diabetes.csv"

In [None]:
# Load the dataset from csv file into a pandas DataFrame
diabetes_data = pd.read_csv(csv_file)
# Confirm dataset loaded successfully
print("Dataset loaded successfully")
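
Before relying on the loaded data, it can help to confirm that the file has the columns listed in "About the Dataset". The sketch below is illustrative only: it reads two sample rows from an in-memory string instead of `diabetes.csv`, and the names `sample_csv`, `sample_df`, and `expected_columns` are assumptions introduced for this example.

```python
import io
import pandas as pd

# Two illustrative rows in the column layout described in "About the Dataset"
# (an in-memory stand-in, not read from diabetes.csv)
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
sample_df = pd.read_csv(sample_csv)

expected_columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Fail early with a clear message if a column is absent
missing = [col for col in expected_columns if col not in sample_df.columns]
if missing:
    raise ValueError(f"CSV is missing expected columns: {missing}")

# Inspect the inferred data types for each column
print(sample_df.dtypes)
```

The same check could be run on `diabetes_data` by replacing `sample_df` with it.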

In [None]:
# Display basic information about the dataset
print(f"Dataset contains {len(diabetes_data)} patient records.")
print(f"Dataset contains {len(diabetes_data.columns)} columns ({len(diabetes_data.columns) - 1} features plus the Outcome label).")

## Step 3: Display Top 5 Rows

Let's examine the first few patient records to understand what our data looks like. This is similar to reviewing the first few patient files in a medical study.

In [None]:
# Display the first 5 rows of the dataset
print("=== FIRST 5 PATIENT RECORDS ===")
# Show the actual patient data
diabetes_data.head()

In [None]:
# Display summary information about the dataset
print("\n=== DATASET INFORMATION ===")
print(f"Total number of patients: {len(diabetes_data)}")

In [None]:
# Count features and diabetes cases
print(f"Number of features per patient: {len(diabetes_data.columns) - 1}")
print(f"Number of patients with diabetes: {sum(diabetes_data['Outcome'] == 1)}")

In [None]:
# Count patients without diabetes
print(f"Number of patients without diabetes: {sum(diabetes_data['Outcome'] == 0)}")

## Additional Data Exploration

Let's get a better understanding of our patient data by looking at basic statistics and checking for any data quality issues.

In [None]:
# Display basic statistical information about each feature
print("=== BASIC STATISTICS FOR ALL FEATURES ===")
diabetes_data.describe()

In [None]:
# Check for missing values (important in medical data)
print("\n=== MISSING VALUES CHECK ===")
missing_values = diabetes_data.isnull().sum()

In [None]:
# Report on missing values status
if missing_values.sum() == 0:
    print("No null (NaN) values found. Zeros in some columns may still stand for missing measurements.")
else:
    print("Missing values found:", missing_values[missing_values > 0])
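
A null check alone can understate missing data in this dataset, because missing measurements are commonly recorded as `0` and not as `NaN`. A glucose level, blood pressure, skin thickness, insulin level, or BMI of zero is not physiologically plausible. The sketch below counts such zeros on a small hypothetical sample; the values and the names `sample` and `zero_as_missing` are assumptions for illustration, not rows taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; values are illustrative only
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Columns where a zero most likely means "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally convert zeros to NaN so that isnull() reports them
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Running the zero count on `diabetes_data` would show how many measurements are affected before you decide whether to impute or keep them.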

## Step 4: Identify Input Features (X) and Target Variable (y)

For a prediction task, separate the data into two parts:
- **Input features (X)**: The medical measurements you use to make predictions (like symptoms and test results)
- **Target variable (y)**: What you want to predict (diabetes diagnosis: yes or no)

In [None]:
# X contains all the medical measurements (features) you will use for prediction
X = diabetes_data.drop(columns=['Outcome'])

In [None]:
# y contains the diagnosis results (0 = no diabetes, 1 = diabetes)
y = diabetes_data['Outcome']

In [None]:
# Display information about input features
print("=== INPUT FEATURES (X) ===")
print("These are the medical measurements you will use to predict diabetes:")

In [None]:
# List all feature names
print(list(X.columns))

In [None]:
# Display shape information for X
print(f"\nShape of X (features): {X.shape}")
print(f"This means you have {X.shape[0]} patients and {X.shape[1]} measurements per patient.")

In [None]:
# Display information about target variable
print("\n=== TARGET VARIABLE (y) ===")
print("This is what you want to predict (diabetes diagnosis):")

In [None]:
# Display shape and unique values for y
print(f"Shape of y (target): {y.shape}")
print(f"Unique values in y: {sorted(y.unique())} (0=No Diabetes, 1=Diabetes)")

In [None]:
# Show examples of features and targets
print("\n=== FIRST 5 EXAMPLES ===")
print("Features for first 5 patients:")

In [None]:
# Display first 5 feature rows
X.head()

In [None]:
# Display corresponding diagnoses
print("\nCorresponding diagnoses:")
print(y.head().values)

## Step 5: Choose Machine Learning Algorithm

For this lab, we'll use a **Random Forest Classifier**. This algorithm is:
- **Reliable**: Combines many decision trees, which usually reduces overfitting compared with a single tree
- **Interpretable**: Reports feature importance scores that show which measurements the model relied on most
- **Robust**: Needs no feature scaling and works reasonably well on small tabular datasets like this one
- **Handles complexity**: Can capture non-linear relationships and interactions between features
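
To make "combines multiple decision trees" concrete, the sketch below checks that a scikit-learn random forest's predicted probabilities equal the average of the probabilities from its individual trees. It uses synthetic data from `make_classification`, not the diabetes dataset, and the names `X_demo`, `y_demo`, and `forest` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data with 8 features (stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities for the first 5 samples
per_tree = np.stack([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_])

# Average over the tree axis and compare with the forest's own output
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X_demo[:5]))
print("Forest probabilities equal the tree average:", matches)
```

Averaging many trees trained on different bootstrap samples is what smooths out the errors of any single tree.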


In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Display data split information
print("=== DATA SPLIT INFORMATION ===")
print(f"Training set size: {len(X_train)} patients ({len(X_train)/len(diabetes_data)*100:.1f}%)")

In [None]:
# Display testing set size
print(f"Testing set size: {len(X_test)} patients ({len(X_test)/len(diabetes_data)*100:.1f}%)")

In [None]:
# Calculate training set diagnosis distribution
train_diabetes_count = sum(y_train == 1)
train_no_diabetes_count = sum(y_train == 0)

In [None]:
# Display training set distribution
print("\n=== TRAINING SET DIAGNOSIS DISTRIBUTION ===")
print(f"Patients with diabetes: {train_diabetes_count}")
print(f"Patients without diabetes: {train_no_diabetes_count}")

In [None]:
# Calculate testing set diagnosis distribution
test_diabetes_count = sum(y_test == 1)
test_no_diabetes_count = sum(y_test == 0)

In [None]:
# Display testing set distribution
print("\n=== TESTING SET DIAGNOSIS DISTRIBUTION ===")
print(f"Patients with diabetes: {test_diabetes_count}")
print(f"Patients without diabetes: {test_no_diabetes_count}")
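
The `stratify=y` argument is what keeps the share of diabetes cases similar in the training and testing sets. A minimal sketch on synthetic labels (65 negatives and 35 positives, chosen only for illustration) shows the effect; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 65 negatives and 35 positives
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the 35% positive rate
print(f"Positive rate in training set: {y_tr.mean():.2f}")
print(f"Positive rate in testing set:  {y_te.mean():.2f}")
```

Without `stratify`, a small test set could by chance contain noticeably more or fewer positive cases than the full dataset.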

In [None]:
# Initialize the Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)

In [None]:
# Display algorithm selection information
print("\n=== ALGORITHM SELECTED ===")
print("Random Forest Classifier initialized with:")

In [None]:
# Display model configuration details
print("- 100 decision trees (expert diagnosticians)")
print("- Maximum depth of 10 levels")
print("- Random state set for reproducible results")

## Step 6: Train the Model

Now you will train the machine learning model using the training data. This is like teaching a medical diagnostic system using historical patient records and their known outcomes.

In [None]:
# Display training start message
print("=== TRAINING THE MODEL ===")
print("Training in progress... (This may take a few seconds)")
# Train the model using the training data
model.fit(X_train, y_train)
# Confirm training completion
print("Model training completed successfully!")

In [None]:
print("\nThe algorithm has learned to recognize patterns in:")
# Display all features the model learned from
for i, feature in enumerate(X.columns, 1):
    print(f"{i}. {feature}")

In [None]:
# Display training summary
print("\n=== TRAINING SUMMARY ===")
print(f"Number of patients used for training: {len(X_train)}")
# Display additional training details
print(f"Number of medical features analyzed: {len(X.columns)}")
print(f"Number of decision trees created: {model.n_estimators}")
# Confirm model readiness
print("The model is now ready to make diabetes predictions!")

## Step 7: Print Accuracy and Evaluate Performance

Let's test our trained model on patients it has never seen before and measure how accurate its diagnoses are.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

In [None]:
# Display model performance results
print("=== MODEL PERFORMANCE RESULTS ===")
print(f"Overall Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

In [None]:
# Explain accuracy meaning
print(f"This means the model correctly diagnosed {accuracy*100:.1f}% of patients.")
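
Accuracy is easier to interpret next to a baseline. As a rough sketch, a classifier that always predicts the majority class achieves an accuracy equal to that class's share of the data. The class counts below (500 without diabetes, 268 with diabetes) are the ones commonly reported for this dataset and are an assumption about your copy of `diabetes.csv`; `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Assumed class distribution: 500 negatives, 268 positives
labels_demo = [0] * 500 + [1] * 268
baseline = majority_baseline(labels_demo)
print(f"Majority-class baseline accuracy: {baseline:.1%}")
```

A model is only useful to the extent that it beats this baseline, so compare `accuracy` against `majority_baseline(list(y_test))` on your own split.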

In [None]:
# Calculate detailed prediction statistics
correct_predictions = sum(y_pred == y_test)
total_predictions = len(y_test)
incorrect_predictions = total_predictions - correct_predictions

In [None]:
# Display detailed results
print(f"\n=== DETAILED RESULTS ===")
print(f"Total patients tested: {total_predictions}")

In [None]:
# Display correct and incorrect predictions
print(f"Correct diagnoses: {correct_predictions}")
print(f"Incorrect diagnoses: {incorrect_predictions}")

In [None]:
# Display classification report header
print("\n=== DETAILED CLASSIFICATION REPORT ===")
print("This report shows precision, recall, and F1-score for each diagnosis:")

In [None]:
# Generate and display classification report
target_names = ['No Diabetes (0)', 'Diabetes (1)']
report = classification_report(y_test, y_pred, target_names=target_names)
print(report)

## Confusion Matrix Visualization

A confusion matrix helps medical professionals understand exactly where the diagnostic system makes correct and incorrect predictions.

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

In [None]:
# Create figure for confusion matrix visualization
plt.figure(figsize=(8, 6))
# Create heatmap for confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted: No Diabetes', 'Predicted: Diabetes'],
            yticklabels=['Actual: No Diabetes', 'Actual: Diabetes'])
# Add title and labels to confusion matrix
plt.title('Confusion Matrix: Diabetes Prediction Results', fontsize=14, fontweight='bold')
plt.ylabel('Actual Diagnosis', fontsize=12)
plt.xlabel('Predicted Diagnosis', fontsize=12)
# Display the confusion matrix plot
plt.tight_layout()
plt.show()

In [None]:
# Extract confusion matrix values
tn, fp, fn, tp = cm.ravel()
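
The unpacking order `tn, fp, fn, tp` follows from how scikit-learn lays out a binary confusion matrix: rows are actual classes, columns are predicted classes, and `ravel()` reads the matrix row by row. A tiny hand-checkable sketch, using made-up labels and `_demo` names that are not part of the lab data:

```python
from sklearn.metrics import confusion_matrix

# Five made-up patients: actual labels and predicted labels
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Row-major flattening gives [TN, FP, FN, TP]
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()
print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
```

Counting by hand: one correct negative, one false alarm, one missed case, and two correct positives.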

In [None]:
# Display confusion matrix explanation
print("=== CONFUSION MATRIX EXPLANATION ===")
print(f"True Negatives (TN): {tn} - Correctly identified patients WITHOUT diabetes")

In [None]:
# Continue confusion matrix explanation
print(f"True Positives (TP): {tp} - Correctly identified patients WITH diabetes")
print(f"False Positives (FP): {fp} - Incorrectly diagnosed diabetes (false alarm)")

In [None]:
# Complete confusion matrix explanation
print(f"False Negatives (FN): {fn} - Missed diabetes cases (dangerous!)")

In [None]:
# Calculate sensitivity and specificity
sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

In [None]:
# Display clinical performance metrics
print(f"\n=== CLINICAL PERFORMANCE METRICS ===")
print(f"Sensitivity (True Positive Rate): {sensitivity:.3f} ({sensitivity*100:.1f}%)")

In [None]:
# Explain sensitivity
print(f"  - This is the percentage of diabetes cases correctly identified")
print(f"Specificity (True Negative Rate): {specificity:.3f} ({specificity*100:.1f}%)")

In [None]:
# Explain specificity
print("  - This is the percentage of patients without diabetes correctly identified")
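
The metrics in this step, along with the precision and F1-score from the classification report, all derive from the same four confusion-matrix counts. The sketch below works through the arithmetic with hypothetical counts (they are not results from this lab); `clinical_metrics` is an illustrative helper name.

```python
def clinical_metrics(tn, fp, fn, tp):
    """Derive common diagnostic metrics from confusion-matrix counts."""
    def safe_div(numerator, denominator):
        # Avoid division by zero when a class has no samples
        return numerator / denominator if denominator else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall for the diabetes class
    specificity = safe_div(tn, tn + fp)   # recall for the no-diabetes class
    precision = safe_div(tp, tp + fp)     # how many flagged patients truly have diabetes
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    accuracy = safe_div(tp + tn, tn + fp + fn + tp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical counts for a test set of 154 patients
metrics_demo = clinical_metrics(tn=85, fp=15, fn=20, tp=34)
for name, value in metrics_demo.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy can look acceptable while sensitivity stays low, which matters here because false negatives are missed diabetes cases.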

## Feature Importance Analysis

Let's see which medical features are most important for diabetes prediction according to our model.

In [None]:
# Get feature importance scores
feature_importance = model.feature_importances_
feature_names = X.columns

In [None]:
# Create DataFrame for feature importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

In [None]:
# Sort features by importance
importance_df = importance_df.sort_values('Importance', ascending=False)

In [None]:
# Display feature importance ranking header
print("=== FEATURE IMPORTANCE RANKING ===")
print("Most important medical features for diabetes prediction:")

In [None]:
# Display each feature with its importance
for i, (_, row) in enumerate(importance_df.iterrows(), 1):
    print(f"{i}. {row['Feature']}: {row['Importance']:.3f} ({row['Importance']*100:.1f}%)")

In [None]:
# Create figure for feature importance visualization
plt.figure(figsize=(10, 6))
# Create bar plot for feature importance
sns.barplot(data=importance_df, x='Importance', y='Feature', palette='viridis', hue='Importance')
# Add title and labels to feature importance plot
plt.title('Medical Feature Importance for Diabetes Prediction', fontsize=14, fontweight='bold')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Medical Features', fontsize=12)
# Display the feature importance plot
plt.tight_layout()
plt.show()

In [None]:
# Get top 3 most important features
top_3_features = importance_df.head(3)

# Display clinical insights header
print(f"\n=== CLINICAL INSIGHTS ===")
print(f"The top 3 most predictive features are:")

# Display top 3 features with their contributions
for i, (_, row) in enumerate(top_3_features.iterrows(), 1):
    print(f"{i}. {row['Feature']} (importance score: {row['Importance']*100:.1f}%)")
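
The scores in `feature_importances_` are impurity-based: they sum to 1 across features and describe how much each feature reduced impurity during training, not how much it contributes to any single prediction. Permutation importance on held-out data is one alternative view. The sketch below compares the two on synthetic data; the dataset and the names `X_demo`, `forest`, and `perm` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances are normalized to sum to 1
importance_total = forest.feature_importances_.sum()
print(f"Sum of impurity-based importances: {importance_total:.3f}")

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: mean accuracy drop = {perm.importances_mean[idx]:.3f}")
```

If the two rankings disagree strongly on your data, treat the clinical interpretation of any single ranking with caution.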

## Step 8: Make Predictions on New Patients

Now let's use our trained model to make predictions for new patients. This simulates how the system would work in a clinical setting.

In [None]:
# Create example patient data for demonstration
new_patients = pd.DataFrame({
    'Pregnancies': [1, 0, 3, 8],
    'Glucose': [85, 180, 110, 190],
    'BloodPressure': [66, 90, 70, 92]
})

In [None]:
# Add remaining patient features
new_patients['SkinThickness'] = [29, 35, 25, 40]
new_patients['Insulin'] = [0, 200, 80, 250]
new_patients['BMI'] = [26.6, 34.5, 28.2, 38.1]

In [None]:
# Complete patient features
new_patients['DiabetesPedigreeFunction'] = [0.351, 0.627, 0.314, 0.835]
new_patients['Age'] = [31, 45, 28, 55]

In [None]:
# Display new patient data header
print("=== NEW PATIENT DATA ===")
print("Medical information for 4 new patients:")

In [None]:
# Display the new patient data
new_patients.head()

In [None]:
# Make predictions for new patients
predictions = model.predict(new_patients)
prediction_probabilities = model.predict_proba(new_patients)

In [None]:
# Display predictions header
print("\n=== DIABETES PREDICTIONS ===")
print()

In [None]:
# Display prediction for first patient
patient_id = 1
prediction = predictions[0]
prob_no_diabetes = prediction_probabilities[0][0]
prob_diabetes = prediction_probabilities[0][1]

In [None]:
# Format and display first patient results
diagnosis = "DIABETES DETECTED" if prediction == 1 else "NO DIABETES"
confidence = max(prob_no_diabetes, prob_diabetes)
print(f"Patient {patient_id}: {diagnosis} (Confidence: {confidence:.1%})")

In [None]:
# Display detailed results for each patient
for i in range(len(new_patients)):
    patient_id = i + 1
    prediction = predictions[i]
    prob_no_diabetes = prediction_probabilities[i][0]
    prob_diabetes = prediction_probabilities[i][1]
    
    diagnosis = "DIABETES DETECTED" if prediction == 1 else "NO DIABETES"
    confidence = max(prob_no_diabetes, prob_diabetes)
    
    print(f"Patient {patient_id}:")
    print(f"  Diagnosis: {diagnosis}")
    print(f"  Confidence: {confidence:.1%}")
    print(f"  Probability of No Diabetes: {prob_no_diabetes:.1%}")
    print(f"  Probability of Diabetes: {prob_diabetes:.1%}")
    print()
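
The diagnosis printed above is simply the class with the higher predicted probability, which for two classes amounts to a 0.5 cutoff. In a screening context you might lower that cutoff to catch more potential cases at the cost of more false alarms. The sketch below illustrates both points on synthetic data; the 0.3 threshold and all `_demo` and `screening_` names are hypothetical choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)

proba = forest.predict_proba(X_demo)

# predict() returns the class with the highest probability
default_pred = forest.classes_[proba.argmax(axis=1)]
same_as_predict = bool((default_pred == forest.predict(X_demo)).all())
print("argmax of predict_proba matches predict():", same_as_predict)

# Hypothetical screening rule: flag anyone with at least 30% predicted risk
screening_threshold = 0.3
screening_pred = (proba[:, 1] >= screening_threshold).astype(int)
print(f"Flagged at default cutoff: {int(default_pred.sum())}")
print(f"Flagged at {screening_threshold} cutoff: {int(screening_pred.sum())}")
```

Column 1 of `predict_proba` corresponds to class `1` because `classes_` is sorted, which is why the lab reads the diabetes probability from index 1.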

## Interactive Prediction Function

Here's a function that medical professionals can use to make predictions for individual patients by entering their medical measurements.

In [None]:
# Define prediction function for individual patients
def predict_diabetes_for_patient(pregnancies, glucose, blood_pressure, skin_thickness, 
                               insulin, bmi, diabetes_pedigree, age):
    # Create DataFrame with patient data
    patient_data = pd.DataFrame({
        'Pregnancies': [pregnancies], 'Glucose': [glucose], 'BloodPressure': [blood_pressure],
        'SkinThickness': [skin_thickness], 'Insulin': [insulin], 'BMI': [bmi],
        'DiabetesPedigreeFunction': [diabetes_pedigree], 'Age': [age]
    })
    
    # Make prediction and get probabilities
    prediction = model.predict(patient_data)[0]
    probabilities = model.predict_proba(patient_data)[0]
    
    # Format and display results
    diagnosis = "DIABETES DETECTED" if prediction == 1 else "NO DIABETES DETECTED"
    confidence = max(probabilities)
    
    print("=== PATIENT DIAGNOSIS REPORT ===")
    print(f"Diagnosis: {diagnosis}")
    print(f"Confidence Level: {confidence:.1%}")
    print(f"Probability of No Diabetes: {probabilities[0]:.1%}")
    print(f"Probability of Diabetes: {probabilities[1]:.1%}")
    
    if prediction == 1:
        print("\nCLINICAL RECOMMENDATION: Further medical evaluation recommended.")
    else:
        print("\nCLINICAL RECOMMENDATION: Continue routine monitoring.")
    
    return prediction, probabilities

In [None]:
# Display example usage header
print("=== EXAMPLE: PREDICTING FOR A SAMPLE PATIENT ===")
print("Patient Profile: 35-year-old female, 2 pregnancies, glucose=120, BMI=28.5")

In [None]:
# Use the prediction function with example patient data
result = predict_diabetes_for_patient(
    pregnancies=2, glucose=120, blood_pressure=75, skin_thickness=25,
    insulin=100, bmi=28.5, diabetes_pedigree=0.4, age=35
)

## Summary and Clinical Applications

Let's summarize what we've accomplished in this lab and discuss clinical applications.

In [None]:
# Display lab summary header
print("=== LAB SUMMARY ===")
print("\nSuccessfully completed all lab objectives:")

In [None]:
# Summarize dataset loading accomplishment
print("\n1. DATASET LOADING:")
print(f"   - Loaded {len(diabetes_data)} patient records from CSV file")
print(f"   - Dataset contains {len(diabetes_data.columns)-1} medical features")

In [None]:
# Summarize data exploration
print("\n2. DATA EXPLORATION:")
print("   - Displayed and analyzed patient data structure")
print("   - Identified data quality and distribution")

In [None]:
# Summarize feature identification
print("\n3. FEATURE IDENTIFICATION:")
print(f"   - Input features (X): {len(X.columns)} medical measurements")
print("   - Target variable (y): Diabetes diagnosis (0/1)")

In [None]:
# Summarize algorithm selection and training
print("\n4. ALGORITHM SELECTION:")
print("   - Chose Random Forest Classifier")
print(f"   - Configured with {model.n_estimators} trees and a maximum depth of {model.max_depth}")

In [None]:
# Summarize model training
print("\n5. MODEL TRAINING:")
print(f"   - Trained on {len(X_train)} patient records")
print(f"   - Used {model.n_estimators} decision trees for robust predictions")

In [None]:
# Summarize performance evaluation
print("\n6. PERFORMANCE EVALUATION:")
print(f"   - Overall accuracy: {accuracy:.1%}")
print(f"   - Sensitivity: {sensitivity:.1%} (diabetes detection rate)")
print(f"   - Specificity: {specificity:.1%} (healthy patient identification)")

In [None]:
# Summarize prediction capability
print("\n7. PREDICTION CAPABILITY:")
print("   - Successfully demonstrated predictions on new patients")
print("   - Provided probability scores for clinical decision-making")

In [None]:
# Display clinical applications
print("\n=== CLINICAL APPLICATIONS ===")
print("\nThis diabetes prediction model can assist medical professionals by:")

In [None]:
# List clinical applications
print("\n• SCREENING: Identify high-risk patients during routine checkups")
print("• PRIORITIZATION: Focus resources on patients most likely to have diabetes")
print("• EARLY DETECTION: Catch diabetes cases before symptoms become severe")

In [None]:
# Continue listing clinical applications
print("• DECISION SUPPORT: Provide data-driven insights alongside clinical judgment")
print("• POPULATION HEALTH: Monitor diabetes prevalence in patient populations")

In [None]:
# Display important clinical notes
print("\n=== IMPORTANT CLINICAL NOTES ===")
print("\nThis model is a DECISION SUPPORT TOOL, not a replacement for clinical judgment")
print("Always combine AI predictions with comprehensive medical evaluation")

In [None]:
# Complete clinical notes
print("Consider patient history, symptoms, and additional tests for final diagnosis")
print("Regular model retraining with new data is recommended for optimal performance")

In [None]:
# Display final completion message
print("\n" + "="*60)
print("LAB COMPLETED SUCCESSFULLY!")
print("You now have a working diabetes prediction system.")
print("="*60)

## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/)


Copyright © 2025 Skillup Corporation. All rights reserved.
