# 🧠 Exercise 2: Machine Learning Fundamentals

**Week 2 | AI in Healthcare Curriculum**

---

## Learning Objectives

By completing this exercise, you will:

- 🎯 Understand how ML models learn from data
- 🎯 Train your own classification model
- 🎯 Understand train/test splitting and why it matters
- 🎯 See how changing parameters affects model behaviour

---

## ⏱️ Estimated Time: 2 hours

---

## Context

Last week, you used pre-trained AI models. This week, you'll train one yourself from scratch. This demystifies the "learning" in machine learning.

**Our scenario:** We'll build a model to predict which patients might deteriorate within 24 hours, using vital signs data.

## Part 1: Setup and Data Loading

In [None]:
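import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix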
# Load the synthetic dataset
url = "https://raw.githubusercontent.com/harl00/AIinHealthcare/main/data/AI_in_HealthCare_Dataset.csv"
ed_data = pd.read_csv(url)
ed_data.head()

### Creating Our Dataset

In real healthcare AI, training data comes from electronic health records. For this exercise, we'll create realistic synthetic data.

**What we're simulating:**
- 1000 patient encounters
- Vital signs at a point in time
- Whether the patient deteriorated in the next 24 hours

In [None]:
# Generate synthetic patient data
# This function creates realistic-looking vital signs data

def generate_patient_data(n_patients=1000):
    """
    Generate synthetic patient vital signs data.

    In real life, this would come from your EHR system.
    We're creating synthetic data that has realistic patterns.
    """

    np.random.seed(42)

    # Generate vital signs with realistic distributions
    data = {
        'patient_id': [f'P{i:04d}' for i in range(n_patients)],
        'age': np.random.normal(60, 18, n_patients).clip(18, 95).astype(int),
        'heart_rate': np.random.normal(82, 18, n_patients).clip(40, 180).astype(int),
        'respiratory_rate': np.random.normal(17, 5, n_patients).clip(8, 40).astype(int),
        'systolic_bp': np.random.normal(125, 22, n_patients).clip(70, 200).astype(int),
        'diastolic_bp': np.random.normal(75, 12, n_patients).clip(40, 120).astype(int),
        'temperature': np.round(np.random.normal(37.0, 0.7, n_patients).clip(35, 40), 1),
        'oxygen_saturation': np.random.normal(96, 3, n_patients).clip(80, 100).astype(int),
    }

    df = pd.DataFrame(data)

    # Create outcome based on vital sign patterns
    # This simulates a realistic relationship between vitals and deterioration
    risk_score = (
        0.03 * (df['heart_rate'] - 80) +
        0.08 * (df['respiratory_rate'] - 16) +
        -0.02 * (df['systolic_bp'] - 120) +
        0.5 * (df['temperature'] - 37) +
        -0.1 * (df['oxygen_saturation'] - 96) +
        0.02 * (df['age'] - 60) +
        np.random.normal(0, 0.5, n_patients)  # Random noise
    )

    # Convert to binary outcome (deteriorated or not)
    df['deteriorated'] = (risk_score > 0.8).astype(int)

    return df

# Generate the data
patient_data = generate_patient_data(1000)

print("Patient Dataset Generated")
print("="*50)
print(f"Total patients: {len(patient_data)}")
print(f"\nOutcome distribution:")
print(f"  Did NOT deteriorate: {(patient_data['deteriorated']==0).sum()} ({(patient_data['deteriorated']==0).mean()*100:.1f}%)")
print(f"  DID deteriorate: {(patient_data['deteriorated']==1).sum()} ({(patient_data['deteriorated']==1).mean()*100:.1f}%)")
print("\nFirst 10 patients:")
patient_data.head(10)

### Exploring the Data

Before training a model, data scientists always explore the data first. Let's see what we're working with:

In [None]:
# Summary statistics
print("Summary Statistics:")
print("="*60)
patient_data.describe()

In [None]:
# Compare vital signs between groups
print("Vital Signs by Outcome Group:")
print("="*60)

comparison = patient_data.groupby('deteriorated').agg({
    'age': 'mean',
    'heart_rate': 'mean',
    'respiratory_rate': 'mean',
    'systolic_bp': 'mean',
    'temperature': 'mean',
    'oxygen_saturation': 'mean'
}).round(1)

comparison.index = ['Did NOT deteriorate', 'DID deteriorate']
comparison

### 💡 Observation

Look at the differences between the groups:
- Patients who deteriorated had **higher** heart rates, respiratory rates, and temperatures
- Patients who deteriorated had **lower** blood pressure and oxygen saturation

**This is exactly what an ML model will learn to detect!**

## Part 2: The Critical Concept - Train/Test Split

### Why Can't We Test on the Same Data We Train On?

Imagine studying for an exam by memorising the exact questions and answers. You'd score 100%... on those specific questions. But could you answer *new* questions?

ML models can "memorise" training data too. This is called **overfitting**.

**Solution:** Split the data:
- **Training set** - The model learns from this
- **Test set** - We evaluate on this (model has never seen it)
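
The memorisation problem can be made concrete with a small sketch. It is separate from the patient dataset: the features and labels below are random numbers invented for illustration, so there is no real pattern to learn. The names `X_demo`, `y_demo` and `memoriser` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 5 random features and coin-flip labels,
# so there is no genuine relationship between X and y.
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# With no depth limit, the tree keeps splitting until every
# training row is classified correctly: it memorises the data.
memoriser = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc_demo = memoriser.score(X_tr, y_tr)
test_acc_demo = memoriser.score(X_te, y_te)
print(f"Training accuracy: {train_acc_demo:.2f}")
print(f"Test accuracy:     {test_acc_demo:.2f}")
```

Training accuracy should come out at 100% even though the labels are pure noise, while test accuracy should sit close to chance. Only the held-out test set reveals that nothing useful was learned.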

In [None]:
# Prepare features (X) and labels (y)

# Features: the vital signs we'll use to make predictions
feature_columns = ['age', 'heart_rate', 'respiratory_rate',
                   'systolic_bp', 'temperature', 'oxygen_saturation']

X = patient_data[feature_columns]  # Features
y = patient_data['deteriorated']    # Labels (what we're predicting)

print("Feature matrix (X):")
print(f"  Shape: {X.shape} (rows=patients, columns=features)")
print(f"  Features: {list(X.columns)}")
print(f"\nLabel vector (y):")
print(f"  Shape: {y.shape}")
print(f"  Values: 0 (no deterioration) or 1 (deterioration)")

In [None]:
# Split into training and test sets
# We'll use 80% for training, 20% for testing

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,      # 20% for testing
    random_state=42,     # For reproducibility
    stratify=y           # Keep same proportion of outcomes in both sets
)

print("Data Split Complete")
print("="*50)
print(f"\nTraining set: {len(X_train)} patients ({len(X_train)/len(X)*100:.0f}%)")
print(f"  - Did not deteriorate: {(y_train==0).sum()}")
print(f"  - Did deteriorate: {(y_train==1).sum()}")

print(f"\nTest set: {len(X_test)} patients ({len(X_test)/len(X)*100:.0f}%)")
print(f"  - Did not deteriorate: {(y_test==0).sum()}")
print(f"  - Did deteriorate: {(y_test==1).sum()}")

print("\n✅ The model will ONLY learn from training data.")
print("✅ We'll evaluate on test data the model has NEVER seen.")

## Part 3: Training Your First Model - Decision Tree

We'll start with a **Decision Tree** - one of the most interpretable ML models.

A decision tree asks a series of yes/no questions to classify patients:
- "Is heart rate > 100?" → If yes, go left; if no, go right
- Continue until reaching a conclusion

This is similar to how clinicians think through differential diagnoses!
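
As a rough illustration of these yes/no questions, the sketch below fits a tiny tree to made-up data and prints its rules as text using scikit-learn's `export_text`. The toy data and the rule "respiratory rate above 22" are assumptions chosen for the example, not clinical guidance. Note that scikit-learn phrases each question as `feature <= threshold`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical toy data: two vital signs for 200 patients.
toy_heart_rate = rng.integers(40, 180, size=200)
toy_resp_rate = rng.integers(8, 40, size=200)
X_toy = np.column_stack([toy_heart_rate, toy_resp_rate])

# Assumed toy rule: the outcome depends only on respiratory rate.
y_toy = (toy_resp_rate > 22).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned questions as indented text.
rules_text = export_text(toy_tree, feature_names=["heart_rate", "respiratory_rate"])
print(rules_text)
```

Because the toy outcome depends on a single vital sign, the tree needs only one question about `respiratory_rate` and never asks about `heart_rate`.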

In [None]:
# Train a Decision Tree classifier

# Create the model
# max_depth=4 limits how many questions deep the tree can go
tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)

# Train (fit) the model on training data
# This is where the "learning" happens!
tree_model.fit(X_train, y_train)

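print("✅ Decision Tree trained!")
# Questions are the internal (non-leaf) nodes; leaves are the final answers
n_questions = tree_model.tree_.node_count - tree_model.get_n_leaves()
print(f"\nThe model learned to ask {n_questions} different questions, ending in {tree_model.get_n_leaves()} leaves,")
print(f"to classify patients into {tree_model.n_classes_} categories.")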

### Visualising the Decision Tree

One huge advantage of decision trees is we can SEE what they learned:

In [None]:
# Visualise the decision tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model,
          feature_names=feature_columns,
          class_names=['No Deterioration', 'Deterioration'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree for Deterioration Prediction', fontsize=16)
plt.tight_layout()
plt.show()

print("\n📊 How to read this tree:")
print("  - Start at the top (root node)")
print("  - Each box shows a question and the split")
print("  - Follow left branch if condition is TRUE, right if FALSE")
print("  - Colour: orange = predicts no deterioration, blue = predicts deterioration")
print("  - Darker colour = more confident prediction")

### 💡 What Did the Model Learn?

Look at the tree and answer:
1. What's the first (most important) question the model asks?
2. Does this make clinical sense?
3. Can you trace through what happens to a patient with:
   - Respiratory rate = 25, Temperature = 38.0, Oxygen sat = 90?

## Part 4: Evaluating Model Performance

Now the crucial question: **How well does our model actually work?**

Remember: We must test on data the model has never seen!

In [None]:
# Make predictions on test set
y_pred = tree_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Model Performance on TEST DATA")
print("="*50)
print(f"\nAccuracy: {accuracy*100:.1f}%")
print(f"\nThis means the model correctly classified {accuracy*100:.1f}% of patients")
print(f"it had never seen before.")

### The Confusion Matrix

Accuracy alone doesn't tell the whole story. In healthcare, we care about:
- **False Negatives:** Missed deteriorating patients (dangerous!)
- **False Positives:** Unnecessary alerts (alert fatigue)
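
To see why accuracy alone can mislead, consider a worked example. The cohort below is hypothetical (the numbers are illustrative assumptions, not results from this notebook): deterioration is rare, and a naive "model" simply predicts "no deterioration" for everyone.

```python
# Hypothetical cohort: 1000 patients, of whom 50 deteriorate.
n_total = 1000
n_deteriorated = 50

# A naive "model" that always predicts "no deterioration":
tp_naive = 0                          # never raises an alert
fn_naive = n_deteriorated             # misses every deteriorating patient
tn_naive = n_total - n_deteriorated   # every stable patient is "correct"
fp_naive = 0                          # no false alarms

naive_accuracy = (tp_naive + tn_naive) / n_total
naive_sensitivity = tp_naive / (tp_naive + fn_naive)
naive_specificity = tn_naive / (tn_naive + fp_naive)

print(f"Accuracy:    {naive_accuracy * 100:.1f}%")
print(f"Sensitivity: {naive_sensitivity * 100:.1f}%")
print(f"Specificity: {naive_specificity * 100:.1f}%")
```

This naive model scores 95.0% accuracy while catching 0.0% of deteriorating patients, which is why the four outcomes need to be examined separately.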

A **confusion matrix** shows all four outcomes:

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualise it
fig, ax = plt.subplots(figsize=(8, 6))

# Create heatmap
im = ax.imshow(cm, cmap='Blues')

# Labels
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['Predicted: No', 'Predicted: Yes'], fontsize=12)
ax.set_yticklabels(['Actual: No', 'Actual: Yes'], fontsize=12)
ax.set_xlabel('Model Prediction', fontsize=14)
ax.set_ylabel('Actual Outcome', fontsize=14)
ax.set_title('Confusion Matrix: Deterioration Prediction', fontsize=16)

# Add numbers
for i in range(2):
    for j in range(2):
        color = 'white' if cm[i, j] > cm.max()/2 else 'black'
        ax.text(j, i, str(cm[i, j]), ha='center', va='center',
                fontsize=24, fontweight='bold', color=color)

# Add labels for each quadrant
labels = [['True Negative\n(Correct: No deterioration)', 'False Positive\n(Alert fatigue)'],
          ['False Negative\n(MISSED deterioration!)', 'True Positive\n(Correct: Caught deterioration)']]

for i in range(2):
    for j in range(2):
        ax.text(j, i+0.35, labels[i][j], ha='center', va='center',
                fontsize=9, color='darkgray')

plt.tight_layout()
plt.show()

# Calculate metrics
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)  # Also called "recall"
specificity = tn / (tn + fp)

print("\nDetailed Metrics:")
print(f"  True Negatives: {tn} (correctly said NO deterioration)")
print(f"  True Positives: {tp} (correctly caught deterioration)")
print(f"  False Negatives: {fn} (MISSED deterioration - dangerous!)")
print(f"  False Positives: {fp} (false alarms)")
print(f"\n  Sensitivity: {sensitivity*100:.1f}% (catches {sensitivity*100:.1f}% of deteriorating patients)")
print(f"  Specificity: {specificity*100:.1f}% (correctly identifies {specificity*100:.1f}% of stable patients)")

### 💡 Clinical Interpretation

**Question:** In a clinical early warning system, would you prefer:
- A) Higher sensitivity (catch more deteriorating patients, but more false alarms)
- B) Higher specificity (fewer false alarms, but might miss some deteriorations)

There's no universally "right" answer - it depends on the clinical context!
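
One common way to move along this trade-off is to change the probability threshold at which an alert is raised. The sketch below uses hand-written, hypothetical probabilities and outcomes (`prob_demo`, `y_true_demo`) rather than the notebook's model. In practice, scikit-learn classifiers can supply such probabilities via `predict_proba`.

```python
import numpy as np

# Hypothetical true outcomes (1 = deteriorated) and predicted probabilities.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
prob_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])


def sensitivity_specificity(y_true, prob, threshold):
    """Raise an alert when prob >= threshold, then score the alerts."""
    y_hat = (prob >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)


for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true_demo, prob_demo, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, a threshold of 0.3 gives sensitivity 1.00 and specificity 0.50, while a threshold of 0.7 gives sensitivity 0.50 and specificity 1.00. Lowering the threshold corresponds to option A, raising it to option B.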

## Part 5: Experimenting with Model Parameters

ML models have settings called **hyperparameters** that control how they learn.

For decision trees, one important parameter is `max_depth` - how many questions deep the tree can go.

Let's see what happens when we change it:

In [None]:
# Compare different tree depths

depths = [1, 2, 3, 4, 5, 10, 20, None]  # None = no limit
results = []

for depth in depths:
    # Train model
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on BOTH training and test data
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    depth_str = str(depth) if depth else 'Unlimited'
    results.append({
        'depth': depth_str,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc
    })

results_df = pd.DataFrame(results)

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
x_pos = range(len(depths))
width = 0.35

bars1 = ax.bar([x - width/2 for x in x_pos], results_df['train_accuracy']*100,
               width, label='Training Accuracy', color='steelblue')
bars2 = ax.bar([x + width/2 for x in x_pos], results_df['test_accuracy']*100,
               width, label='Test Accuracy', color='coral')

ax.set_xlabel('Maximum Tree Depth', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('Effect of Tree Depth on Model Performance', fontsize=14)
ax.set_xticks(x_pos)
ax.set_xticklabels(results_df['depth'])
ax.legend()
ax.set_ylim(50, 105)
ax.axhline(y=100, color='gray', linestyle=':', alpha=0.5)

plt.tight_layout()
plt.show()

print("\nResults:")
print(results_df.to_string(index=False))

### 💡 Understanding Overfitting

Look at the graph above. Notice:

1. **Training accuracy** keeps increasing as trees get deeper
2. **Test accuracy** increases at first, then may plateau or decrease

When training accuracy is MUCH higher than test accuracy, the model is **overfitting** - it memorised the training data instead of learning generalisable patterns.

**The goal is to find the "sweet spot"** where the model is complex enough to capture real patterns, but not so complex that it memorises noise.
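
One widely used way to look for this sweet spot without touching the test set is cross-validation. This technique is not used elsewhere in this exercise and is shown only as a sketch. The data below is a hypothetical stand-in for a training set, and the names `X_cv`, `y_cv` and `cv_means` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Hypothetical stand-in for a training set: two informative
# features, two pure-noise features, and a noisy outcome.
X_cv = rng.normal(size=(600, 4))
y_cv = ((X_cv[:, 0] + X_cv[:, 1] + rng.normal(0, 0.5, 600)) > 0).astype(int)

cv_means = {}
for d in [1, 2, 3, 4, 5, 10, None]:  # None = no depth limit
    cv_model = DecisionTreeClassifier(max_depth=d, random_state=42)
    # 5-fold cross-validation: train on 4/5 of the rows, score on
    # the remaining 1/5, and repeat so each fold is held out once.
    scores = cross_val_score(cv_model, X_cv, y_cv, cv=5)
    cv_means[d] = scores.mean()
    print(f"max_depth={d}: mean CV accuracy = {cv_means[d]:.3f}")

best_depth = max(cv_means, key=cv_means.get)
print(f"Best depth by cross-validation: {best_depth}")
```

Choosing the depth this way keeps the test set untouched until the final evaluation, so the reported test accuracy remains an honest estimate.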

## Part 6: The Effect of Training Data Size

ML models learn from data. What happens if we have less data?

In [None]:
# Train with different amounts of data

training_sizes = [50, 100, 200, 400, 600, 800]
size_results = []

for size in training_sizes:
    # Take the first `size` rows of the training data (positional slice)
    X_subset = X_train.iloc[:size]
    y_subset = y_train.iloc[:size]

    # Train model
    model = DecisionTreeClassifier(max_depth=4, random_state=42)
    model.fit(X_subset, y_subset)

    # Evaluate on full test set
    test_acc = model.score(X_test, y_test)

    size_results.append({
        'training_size': size,
        'test_accuracy': test_acc
    })

size_df = pd.DataFrame(size_results)

# Plot
plt.figure(figsize=(10, 5))
plt.plot(size_df['training_size'], size_df['test_accuracy']*100, 'b-o', linewidth=2, markersize=8)
plt.xlabel('Number of Training Patients', fontsize=12)
plt.ylabel('Test Accuracy (%)', fontsize=12)
plt.title('How Training Data Size Affects Model Performance', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

print("\n📈 Observations:")
print("  - More training data generally = better performance")
print("  - But returns diminish as you add more data")
print("  - This is why healthcare AI needs large datasets!")

## Part 7: Feature Importance

Which vital signs does the model think are most important for predicting deterioration?
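
As a small sketch of what importance scores look like, the example below fits a tree to made-up data in which one feature drives the outcome strongly, one weakly, and one not at all. The feature names and coefficients are assumptions for illustration. In scikit-learn, a tree's `feature_importances_` are based on how much each feature's splits reduce impurity, and they are normalised to sum to 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

# Hypothetical features: columns 0 and 2 influence the outcome, column 1 is noise.
imp_names = ["weak_signal", "pure_noise", "strong_signal"]
X_imp = rng.normal(size=(500, 3))
y_imp = ((2.0 * X_imp[:, 2] + 0.5 * X_imp[:, 0]) > 0).astype(int)

imp_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_imp, y_imp)

# One score per feature, in the same order as the columns of X_imp.
imp_scores = imp_tree.feature_importances_
for name, score in zip(imp_names, imp_scores):
    print(f"{name}: {score:.3f}")
print(f"Sum of importances: {imp_scores.sum():.3f}")
```

The feature with the strongest influence on the outcome should receive the largest score. These scores describe what this particular tree relies on, not a causal ranking of vital signs.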

In [None]:
# Train final model with good parameters
final_model = DecisionTreeClassifier(max_depth=4, random_state=42)
final_model.fit(X_train, y_train)

# Get feature importances
importances = final_model.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': importances
}).sort_values('Importance', ascending=True)

# Plot
plt.figure(figsize=(10, 5))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')
plt.xlabel('Importance Score', fontsize=12)
plt.title('Which Vital Signs Matter Most for Predicting Deterioration?', fontsize=14)
plt.tight_layout()
plt.show()

print("\nFeature Importances:")
for _, row in importance_df.sort_values('Importance', ascending=False).iterrows():
    print(f"  {row['Feature']}: {row['Importance']:.3f}")

print("\n💡 Higher importance = the model relies more on this feature")
print("   Does this match clinical intuition about deterioration?")

## Part 8: 🔧 Hands-On Experimentation

Now it's your turn! Use the cell below to experiment with different settings:

In [None]:
# ===== YOUR EXPERIMENT =====
# Modify these parameters and see what happens!

MY_TREE_DEPTH = 4           # Try: 1, 2, 3, 5, 10, None
MY_MIN_SAMPLES = 5          # Minimum patients needed to make a split (try: 2, 5, 10, 20)

# ============================

# Train YOUR model
my_model = DecisionTreeClassifier(
    max_depth=MY_TREE_DEPTH,
    min_samples_split=MY_MIN_SAMPLES,
    random_state=42
)
my_model.fit(X_train, y_train)

# Evaluate
train_acc = my_model.score(X_train, y_train)
test_acc = my_model.score(X_test, y_test)

# Predictions for confusion matrix
my_predictions = my_model.predict(X_test)
cm = confusion_matrix(y_test, my_predictions)
tn, fp, fn, tp = cm.ravel()

print("YOUR MODEL RESULTS")
print("="*50)
print(f"Settings: max_depth={MY_TREE_DEPTH}, min_samples_split={MY_MIN_SAMPLES}")
print(f"\nTraining Accuracy: {train_acc*100:.1f}%")
print(f"Test Accuracy: {test_acc*100:.1f}%")
print(f"\nConfusion Matrix:")
print(f"  True Negatives: {tn}")
print(f"  True Positives: {tp}")
print(f"  False Negatives: {fn} (missed deteriorations)")
print(f"  False Positives: {fp} (false alarms)")
print(f"\nSensitivity: {tp/(tp+fn)*100:.1f}%")
print(f"Specificity: {tn/(tn+fp)*100:.1f}%")

# Check for overfitting
if train_acc - test_acc > 0.1:
    print(f"\n⚠️  Warning: Model may be overfitting!")
    print(f"    Training accuracy is {(train_acc-test_acc)*100:.1f} percentage points higher than test accuracy")

## Part 9: Reflection Questions

Consider these questions and write your thoughts:

In [None]:
# ===== YOUR REFLECTIONS =====

reflections = """
1. Why is it important to test on data the model hasn't seen?
   Your answer:


2. What happens when a model is too complex (overfitting)?
   Your answer:


3. Why might a model perform differently on data from another hospital?
   Your answer:


4. The decision tree shows its "reasoning". Why might this be important
   for healthcare AI?
   Your answer:


5. If you were deploying this model, what test accuracy would you require?
   What sensitivity/specificity trade-off would you choose?
   Your answer:


"""

print(reflections)
print("\n✅ Reflections recorded in this cell (save the notebook to keep them).")

## 📝 Deliverable

Complete the guided notebook, including:
1. Running all experiments
2. Your own experimentation with parameters
3. Written reflections

Submit via LMS by the Week 2 deadline.

## 🏁 Summary

In this exercise, you learned:

✅ **Training data** is used to teach the model patterns

✅ **Test data** (never seen by model) is used to evaluate real performance

✅ **Decision trees** make predictions through a series of questions

✅ **Overfitting** occurs when models memorise rather than generalise

✅ **More data** generally improves performance

✅ **Feature importance** shows which inputs the model relies on most

**Key insight:** ML models learn patterns from data - they can only be as good as the data they're trained on!

---

**Next week:** We'll explore the data itself - where it comes from, what biases it might contain, and why that matters for healthcare AI.