# Week 1, Class 1: Introduction to Machine Learning in Healthcare
## Hands-on Lab: Google Colab Setup and Medical Data Exploration

**Course:** AI/ML in Medicine and Healthcare  
**Module:** Week 1 - Foundations  
**Lab Type:** Individual Work

---

## Learning Objectives
By the end of this lab, you will be able to:
1. Navigate and use Google Colab effectively
2. Mount Google Drive for file persistence
3. Load and explore a medical dataset
4. Perform basic data analysis with NumPy and pandas
5. Create visualizations with matplotlib

---

## Part 1: Welcome to Google Colab! 🚀

Google Colab is a free cloud-based Jupyter notebook environment. It provides:
- Free GPU access (with limits)
- Pre-installed ML libraries
- Easy sharing and collaboration
- Integration with Google Drive

### Colab Basics
- **Run a cell:** Shift+Enter or click the play button
- **Add cell:** Click "+Code" or "+Text" buttons
- **Save:** File → Save or Ctrl+S
- **Share:** Click "Share" button (top right)


In [1]:
# Let's start with a simple test
print("Hello, AI/ML in Medicine!")
print("You're running Python in the cloud! ☁️")

# Check Python version
import sys
print(f"\nPython version: {sys.version}")

Hello, AI/ML in Medicine!
You're running Python in the cloud! ☁️

Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]


### Mounting Google Drive

⚠️ **IMPORTANT:** Mount your Google Drive to save your work persistently!

Without this, any files you write to the Colab runtime's local disk are deleted when the session ends.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted successfully!")
print("Your files are now accessible at: /content/drive/MyDrive/")

---

## Part 2: Essential Libraries for Medical ML 📚

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---

## Part 3: Loading the Diabetes Dataset 🏥

We'll use the UCI Pima Indians Diabetes dataset:
- **768 patient records**
- **8 features:** pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree, age
- **Target:** diabetes diagnosis (0 = no, 1 = yes)

Its small size and all-numeric features make this dataset a convenient starting point for learning ML fundamentals!

In [None]:
# Load diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']

df = pd.read_csv(url, names=column_names)

print("✓ Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"  - {df.shape[0]} patients")
print(f"  - {df.shape[1]} columns (8 features + 1 target)")
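
The CSV file has no header row, which is why the cell above passes `names=column_names`. The sketch below shows what would happen without it. It uses an in-memory string with two illustrative rows in the same nine-column layout instead of downloading the file, so the variable names `sample_csv`, `with_names`, and `without_names` are assumptions made for this example only.

```python
import io

import pandas as pd

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']

# Two illustrative rows in the same header-free, nine-column layout
sample_csv = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"

# With names=..., every line is treated as data
with_names = pd.read_csv(io.StringIO(sample_csv), names=column_names)

# Without names=..., pandas uses the first patient as the header row
without_names = pd.read_csv(io.StringIO(sample_csv))

print("With names:   ", with_names.shape)
print("Without names:", without_names.shape)
```

If the patient count ever comes out one lower than expected, a header-handling problem like this is a likely cause.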

### First Look at the Data

In [None]:
# Display first few rows
print("First 5 patients in the dataset:\n")
df.head()

In [None]:
# Get basic information about the dataset
print("Dataset Information:")
print("="*50)
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:\n")
df.describe()

### 🤔 Observation Questions

Look at the statistical summary above and answer:
1. What's the average glucose level?
2. What's the age range of patients?
3. Do you notice any unusual values? (Hint: Can blood pressure be 0?)

**Write your observations here:**
-
-
-
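
One way to follow up on the hint in question 3 is to count zeros in the columns where a value of 0 is not physiologically plausible. The sketch below is only a starting point: the choice of columns in `ZERO_AS_MISSING`, the helper names, and the tiny `demo` frame are assumptions made for illustration, not part of the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Assumed list of columns where 0 most likely stands for "not recorded"
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


def count_implausible_zeros(frame, columns=ZERO_AS_MISSING):
    """Return the number of zeros in each listed column."""
    return (frame[columns] == 0).sum()


def zeros_to_nan(frame, columns=ZERO_AS_MISSING):
    """Return a copy in which zeros in the listed columns become NaN."""
    cleaned = frame.copy()
    cleaned[columns] = cleaned[columns].replace(0, np.nan)
    return cleaned


# Tiny made-up frame, used only to demonstrate the helpers
demo = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'SkinThickness': [35, 29, 0],
    'Insulin': [0, 0, 94],
    'BMI': [33.6, 26.6, 0.0],
})

zero_counts = count_implausible_zeros(demo)
cleaned_demo = zeros_to_nan(demo)

print(zero_counts)
print(cleaned_demo.mean())  # NaN values are skipped, so zeros no longer pull the mean down
```

In the notebook you could call `count_implausible_zeros(df)` on the full dataset and compare `df.describe()` with `zeros_to_nan(df).describe()`.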

---

## Part 4: Data Exploration with NumPy 🔢

In [None]:
# Convert to NumPy arrays for practice
glucose = df['Glucose'].values
bmi = df['BMI'].values

print("NumPy Array Operations:")
print("="*50)
print(f"Glucose - Mean: {np.mean(glucose):.2f}, Std: {np.std(glucose):.2f}")
print(f"BMI - Mean: {np.mean(bmi):.2f}, Std: {np.std(bmi):.2f}")
print(f"\nGlucose range: {np.min(glucose):.0f} to {np.max(glucose):.0f}")
print(f"BMI range: {np.min(bmi):.1f} to {np.max(bmi):.1f}")
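
You may notice that the standard deviations printed here differ slightly from the `std` row of `df.describe()`. `np.std` divides by n by default (`ddof=0`), while pandas divides by n - 1 (`ddof=1`). The sketch below shows the difference on a short made-up series; the numbers are illustrative and do not come from the dataset.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up numbers

population_std = np.std(values.values)            # NumPy default, ddof=0: divide by n
sample_std = values.std()                         # pandas default, ddof=1: divide by n - 1
sample_std_numpy = np.std(values.values, ddof=1)  # NumPy asked to match pandas

print(f"NumPy default:  {population_std:.4f}")
print(f"pandas default: {sample_std:.4f}")
print(f"NumPy, ddof=1:  {sample_std_numpy:.4f}")
```

With 768 patients the two values are very close, but it helps to know which one you are reporting.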

In [None]:
# Array operations - vectorization is powerful!
# Let's categorize BMI

# BMI categories: <18.5 (underweight), 18.5 to <25 (normal), 25 to <30 (overweight), >=30 (obese)
# Note: a recorded BMI of 0 is almost certainly a missing value; it is counted as underweight here
underweight = np.sum(bmi < 18.5)
normal = np.sum((bmi >= 18.5) & (bmi < 25))
overweight = np.sum((bmi >= 25) & (bmi < 30))
obese = np.sum(bmi >= 30)

print("BMI Distribution:")
print(f"  Underweight (<18.5): {underweight}")
print(f"  Normal (18.5-24.9): {normal}")
print(f"  Overweight (25-29.9): {overweight}")
print(f"  Obese (>=30): {obese}")
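
The same binning can also be written with `pd.cut`, which assigns a category label to each value instead of only counting. This is a minimal sketch on made-up BMI values; `bmi_values` and the label strings are assumptions for the example. Setting `right=False` makes each interval include its lower edge and exclude its upper edge, which matches the `>=` and `<` comparisons in the cell above.

```python
import numpy as np
import pandas as pd

bmi_values = pd.Series([0.0, 17.9, 18.5, 24.9, 25.0, 29.9, 30.0, 43.1])  # made-up values

bmi_category = pd.cut(
    bmi_values,
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,  # intervals are [low, high)
    labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
)

# sort=False keeps the categories in clinical order rather than by count
category_counts = bmi_category.value_counts(sort=False)
print(category_counts)
```

Note that a BMI of 0.0 still lands in the underweight bin, so missing values should be handled before interpreting the counts.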

---

## Part 5: Visualizing Medical Data 📊

Visualization is crucial in medical ML for:
- Understanding data distributions
- Identifying outliers and anomalies
- Communicating insights to clinicians
- Debugging models

In [None]:
# Distribution of diabetes outcomes
plt.figure(figsize=(8, 6))
outcome_counts = df['Outcome'].value_counts().sort_index()  # order 0, 1 so counts line up with the bar labels
plt.bar(['No Diabetes', 'Diabetes'], outcome_counts.values, color=['green', 'red'], alpha=0.7)
plt.title('Distribution of Diabetes Diagnosis', fontsize=14, fontweight='bold')
plt.ylabel('Number of Patients')
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(outcome_counts.values):
    plt.text(i, v + 10, str(v), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("Class distribution:")
print(f"  No diabetes: {outcome_counts[0]} ({outcome_counts[0]/len(df)*100:.1f}%)")
print(f"  Diabetes: {outcome_counts[1]} ({outcome_counts[1]/len(df)*100:.1f}%)")

In [None]:
# Glucose distribution by diabetes status
plt.figure(figsize=(12, 5))

# Subplot 1: Histogram
plt.subplot(1, 2, 1)
plt.hist(df[df['Outcome'] == 0]['Glucose'], bins=20, alpha=0.7, label='No Diabetes', color='green')
plt.hist(df[df['Outcome'] == 1]['Glucose'], bins=20, alpha=0.7, label='Diabetes', color='red')
plt.xlabel('Glucose Level (mg/dL)')
plt.ylabel('Frequency')
plt.title('Glucose Distribution by Diabetes Status')
plt.legend()
plt.grid(alpha=0.3)

# Subplot 2: Box plot
plt.subplot(1, 2, 2)
df.boxplot(column='Glucose', by='Outcome', ax=plt.gca())
plt.xlabel('Diabetes Status (0=No, 1=Yes)')
plt.ylabel('Glucose Level (mg/dL)')
plt.title('Glucose Levels: Box Plot Comparison')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

In [None]:
# Scatter plot: BMI vs Age, colored by diabetes status
plt.figure(figsize=(10, 6))

# Plot non-diabetic patients
no_diabetes = df[df['Outcome'] == 0]
plt.scatter(no_diabetes['Age'], no_diabetes['BMI'],
            alpha=0.6, s=50, c='green', label='No Diabetes', edgecolors='black', linewidth=0.5)

# Plot diabetic patients
diabetes = df[df['Outcome'] == 1]
plt.scatter(diabetes['Age'], diabetes['BMI'],
            alpha=0.6, s=50, c='red', label='Diabetes', edgecolors='black', linewidth=0.5)

plt.xlabel('Age (years)', fontsize=12)
plt.ylabel('BMI (kg/m²)', fontsize=12)
plt.title('Age vs BMI: Diabetes Status', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Observation: Look for patterns! Do diabetic patients tend to cluster in certain regions?")

---

## Part 6: Correlation Analysis 🔗

Understanding feature correlations is crucial for:
- Feature selection
- Understanding relationships
- Avoiding multicollinearity

In [None]:
# Compute correlation matrix
correlation_matrix = df.corr()

# Visualize with heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlations with outcome
print("\nCorrelations with Diabetes Outcome:")
print("="*50)
outcome_corr = correlation_matrix['Outcome'].sort_values(ascending=False)
for feature, corr in outcome_corr.items():
    if feature != 'Outcome':
        print(f"{feature:20s}: {corr:+.3f}")

### 🤔 Analysis Questions

Based on the correlation heatmap:
1. Which feature has the strongest correlation with diabetes outcome?
2. Are there any highly correlated features (excluding the diagonal)?
3. Would you consider removing any features? Why?

**Your answers:**
1.
2.
3.
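
To check your answer to question 2 without reading every cell of the heatmap, you could list the feature pairs whose absolute correlation exceeds a cutoff. The helper below is a sketch: the name `strongest_pairs`, the 0.5 cutoff, and the three made-up columns are assumptions chosen for illustration, and what counts as "highly correlated" depends on the application.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame, threshold=0.5):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]  # also discards any masked NaN entries
    return pairs.sort_values(key=np.abs, ascending=False)


# Made-up data: feature_b roughly doubles feature_a, feature_c is unrelated
demo = pd.DataFrame({
    'feature_a': [1, 2, 3, 4, 5, 6],
    'feature_b': [2, 4, 6, 8, 10, 13],
    'feature_c': [5, 1, 4, 2, 6, 3],
})

result = strongest_pairs(demo)
print(result)
```

In the notebook, `strongest_pairs(df.drop(columns='Outcome'))` would restrict the search to feature-feature pairs.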

---

## Part 7: Your First ML Insight! 🎯

Let's make a simple observation about glucose levels and diabetes risk.

In [None]:
# Compare average glucose between groups
avg_glucose_no_diabetes = df[df['Outcome'] == 0]['Glucose'].mean()
avg_glucose_diabetes = df[df['Outcome'] == 1]['Glucose'].mean()

print("Average Glucose Levels:")
print("="*50)
print(f"No Diabetes: {avg_glucose_no_diabetes:.1f} mg/dL")
print(f"Diabetes:    {avg_glucose_diabetes:.1f} mg/dL")
print(f"\nDifference:  {avg_glucose_diabetes - avg_glucose_no_diabetes:.1f} mg/dL")
print(f"({(avg_glucose_diabetes/avg_glucose_no_diabetes - 1)*100:.1f}% higher in diabetic patients)")

# Simple rule-based prediction
threshold = 120  # mg/dL
rule_based_predictions = (df['Glucose'] > threshold).astype(int)
accuracy = (rule_based_predictions == df['Outcome']).mean()

print(f"\n🎯 Simple Rule: 'Predict diabetes if glucose > {threshold}'")
print(f"   Accuracy: {accuracy*100:.1f}% (measured on the full dataset, with no held-out test set)")
print("\n💡 This hand-picked rule is classification at its simplest!")
print("   We'll soon let models learn rules like this from the data.")
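
Accuracy alone can mislead when the classes are imbalanced, because a rule that always predicts "no diabetes" already scores the majority-class share. In a clinical setting it is usually more informative to separate the two kinds of error. The sketch below computes confusion-matrix counts, sensitivity, and specificity for the same kind of threshold rule. The function name `rule_metrics` and the demo values are assumptions for illustration.

```python
import numpy as np


def rule_metrics(glucose, outcome, threshold=120):
    """Confusion-matrix counts and rates for the rule 'glucose > threshold'.

    Assumes both classes are present, otherwise a rate would divide by zero.
    """
    glucose = np.asarray(glucose)
    outcome = np.asarray(outcome)
    predicted = glucose > threshold
    actual = outcome == 1

    tp = int(np.sum(predicted & actual))    # diabetic, flagged
    tn = int(np.sum(~predicted & ~actual))  # non-diabetic, not flagged
    fp = int(np.sum(predicted & ~actual))   # non-diabetic, flagged
    fn = int(np.sum(~predicted & actual))   # diabetic, missed

    return {
        'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
        'accuracy': (tp + tn) / len(outcome),
        'sensitivity': tp / (tp + fn),  # share of diabetic patients flagged
        'specificity': tn / (tn + fp),  # share of non-diabetic patients not flagged
    }


# Made-up values for illustration only
demo_glucose = [90, 110, 125, 130, 150, 100, 140, 115]
demo_outcome = [0, 0, 0, 1, 1, 1, 1, 0]

metrics = rule_metrics(demo_glucose, demo_outcome)
print(metrics)
```

In the notebook, `rule_metrics(df['Glucose'], df['Outcome'], threshold)` applies the same calculation to the full dataset, and trying a few thresholds shows the trade-off between missed cases and false alarms.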

---

## Part 8: Save Your Work 💾

In [None]:
# Create a summary report
summary = {
    'Total Patients': len(df),
    'Diabetic': outcome_counts[1],
    'Non-diabetic': outcome_counts[0],
    'Avg Age': df['Age'].mean(),
    'Avg Glucose': df['Glucose'].mean(),
    'Avg BMI': df['BMI'].mean(),
}

summary_df = pd.DataFrame([summary])

# Save to Google Drive (update path with your folder)
# output_path = '/content/drive/MyDrive/AI_ML_Healthcare/Week1_Summary.csv'
# summary_df.to_csv(output_path, index=False)
# print(f"✓ Summary saved to: {output_path}")

print("\nData Summary:")
print(summary_df.T)
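
When you uncomment the save step, `to_csv` will fail if the target folder does not exist yet. A small helper can create the folder first. This is a sketch: `save_summary` is a hypothetical helper, and the demo writes to a temporary directory so it runs anywhere. In Colab you would pass a folder under `/content/drive/MyDrive/` after mounting Drive.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_summary(summary_df, folder, filename='Week1_Summary.csv'):
    """Create the folder if needed, write the CSV, and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    output_path = folder / filename
    summary_df.to_csv(output_path, index=False)
    return output_path


# Made-up summary values for illustration
demo_summary = pd.DataFrame([{'Total Patients': 3, 'Avg Age': 41.0}])

with tempfile.TemporaryDirectory() as tmp:
    saved = save_summary(demo_summary, Path(tmp) / 'AI_ML_Healthcare')
    reloaded = pd.read_csv(saved)

print(reloaded)
```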

---

## 🎓 Wrap-Up and Reflection

### What You've Learned:
✓ Set up and navigated Google Colab  
✓ Loaded a real medical dataset  
✓ Performed exploratory data analysis  
✓ Created meaningful visualizations  
✓ Identified relationships in data  
✓ Made your first (simple) prediction!  

### Next Steps:
1. Save this notebook to your Google Drive
2. Try modifying the visualizations
3. Explore other features in the dataset
4. Read Chapter 1 of the textbook

### 📝 Reflection Questions:
1. What surprised you most about the data?
2. What challenges do you anticipate in building ML models for healthcare?
3. What questions do you have for the next class?

**Your reflections:**
-
-
-

---

## 🏠 Optional Homework Challenge

Try these extensions:
1. Explore the relationship between Age and Pregnancies
2. Create a histogram for each feature to show its distribution
3. Identify potential data quality issues (zeros in blood pressure, etc.)
4. Compare BMI distributions between diabetic and non-diabetic patients

---

**Great job completing your first lab! 🎉**  
See you in Class 2!