# 4.3 Multilabel Classification Interactive Notebook

This notebook provides a hands-on implementation of multilabel classification techniques for semiconductor manufacturing defect detection. In multilabel problems, each sample can belong to multiple classes simultaneously (e.g., a wafer defect might be both 'scratch' AND 'particle contamination').

## Outline:
1. Import Required Libraries
2. Understanding Multilabel Classification
3. Generate Synthetic Wafer Defect Data
4. Data Exploration & Label Analysis
5. Binary Relevance Approach
6. Classifier Chains for Label Dependencies
7. Label Powerset Approach
8. Threshold Optimization
9. Comprehensive Metrics for Multilabel
10. Summary & Best Practices

> **Semiconductor Context**: Wafer defects often have multiple co-occurring issues (e.g., particle contamination + scratch, or film non-uniformity + pattern defect). Multilabel classification enables comprehensive defect characterization for root cause analysis.

## 1. Import Required Libraries

Import core libraries for multilabel classification, metrics, and visualization.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
%matplotlib inline

# Multilabel classification
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Multilabel metrics
from sklearn.metrics import (
    hamming_loss, 
    accuracy_score,
    f1_score, 
    precision_score, 
    recall_score,
    jaccard_score,
    classification_report,
    multilabel_confusion_matrix
)

# Typing
from typing import Tuple, Dict, List

# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Dataset paths
DATA_DIR = Path('../../../datasets').resolve()
SECOM_DATA_PATH = DATA_DIR / 'secom' / 'secom.data'
SECOM_LABELS_PATH = DATA_DIR / 'secom' / 'secom_labels.data'

print('✓ Libraries imported successfully')
print(f'Random seed: {RANDOM_SEED}')
print(f'Data directory: {DATA_DIR}')

# Helper function for section headers
def section(title: str):
    print(f"\n{'='*len(title)}\n{title}\n{'='*len(title)}")

## 2. Understanding Multilabel Classification

### Key Concepts:

**Multilabel vs Multiclass:**
- **Multiclass**: Each sample belongs to exactly ONE class (e.g., defect type: scratch OR particle OR film issue)
- **Multilabel**: Each sample can belong to MULTIPLE classes simultaneously (e.g., scratch AND particle AND film issue)
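
The difference shows up directly in how the targets are encoded. As a minimal sketch, the cell below turns a few made-up wafer annotations (hypothetical, for illustration only) into the binary indicator matrix that multilabel estimators expect, using scikit-learn's `MultiLabelBinarizer`. A multiclass target would have exactly one active class per wafer; here a row can contain zero, one, or several 1s.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical defect annotations for four wafers (illustrative only)
wafer_defects = [
    {'scratch'},                # a single defect
    {'scratch', 'particle'},    # two co-occurring defects
    set(),                      # clean wafer: no labels at all
    {'film', 'particle'},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(wafer_defects)

print(mlb.classes_)     # column names, sorted alphabetically
print(Y)                # one row per wafer, one column per defect type
print(Y.sum(axis=1))    # labels per wafer: [1 2 0 2]
```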

**Semiconductor Applications:**
- Wafer defect characterization (multiple co-occurring defects)
- Process anomaly detection (multiple process parameters out of spec)
- Equipment health monitoring (multiple component degradation)
- Yield loss attribution (multiple contributing factors)

**Common Approaches:**
1. **Binary Relevance**: Train independent binary classifier for each label
2. **Classifier Chains**: Model label dependencies by chaining classifiers
3. **Label Powerset**: Transform to multiclass problem (each unique label combination is a class)

**Evaluation Metrics:**
- **Hamming Loss**: Fraction of individual label assignments predicted incorrectly (lower is better)
- **Exact Match Ratio (Subset Accuracy)**: Fraction of samples with every label predicted correctly
- **F1 Score**: Harmonic mean of precision and recall, with micro, macro, or samples averaging
- **Jaccard Score**: Intersection over union of the predicted and true label sets
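
To make these definitions concrete, the sketch below computes three of the metrics by hand with NumPy on a tiny made-up example (the arrays are illustrative, not taken from the wafer data). On this input the values should agree with scikit-learn's `hamming_loss`, `accuracy_score`, and `jaccard_score(average='samples')`.

```python
import numpy as np

# Toy ground truth and predictions: 3 wafers x 4 defect labels (illustrative values)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],   # misses one label
                   [0, 1, 0, 0],   # fully correct
                   [1, 1, 1, 1]])  # one false alarm

# Hamming loss: fraction of individual label slots that are wrong
hamming = np.mean(y_true != y_pred)

# Exact match ratio (subset accuracy): fraction of rows where every label is right
exact_match = np.mean(np.all(y_true == y_pred, axis=1))

# Jaccard (samples): per-row |intersection| / |union|, then averaged over rows
intersection = np.sum((y_true == 1) & (y_pred == 1), axis=1)
union = np.sum((y_true == 1) | (y_pred == 1), axis=1)
jaccard = np.mean(intersection / union)

print(f"Hamming loss: {hamming:.4f}")      # 2 wrong slots out of 12 -> 0.1667
print(f"Exact match:  {exact_match:.4f}")  # 1 of 3 rows fully correct -> 0.3333
print(f"Jaccard:      {jaccard:.4f}")      # mean of 0.5, 1.0, 0.75 -> 0.7500
```

Note how one wrong label out of four already zeroes the exact-match credit for a wafer, while Hamming loss and Jaccard still give partial credit.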

## 3. Generate Synthetic Wafer Defect Data

Create a synthetic dataset with multiple co-occurring defect types based on realistic semiconductor process parameters.

In [None]:
def generate_wafer_defect_data(n_samples: int = 1000, seed: int = RANDOM_SEED) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Generate synthetic multilabel wafer defect dataset.
    
    Defect types:
    - Particle: High particle count, low humidity
    - Scratch: High velocity, mechanical stress
    - Film_Nonuniformity: Temperature variation, pressure issues
    - Pattern_Defect: Focus offset, dose variation
    - Contamination: Chemical residue, impurity levels
    
    Returns:
        X: Features DataFrame (process parameters)
        y: Labels DataFrame (binary indicators for each defect type)
    """
    rng = np.random.default_rng(seed)
    
    # Generate process parameters
    X = pd.DataFrame({
        'particle_count': rng.normal(100, 30, n_samples),
        'humidity': rng.normal(45, 10, n_samples),
        'temperature': rng.normal(25, 5, n_samples),
        'pressure': rng.normal(1.0, 0.15, n_samples),
        'velocity': rng.normal(50, 15, n_samples),
        'focus_offset': rng.normal(0, 0.5, n_samples),
        'dose_variation': rng.normal(0, 5, n_samples),
        'chemical_residue': rng.normal(10, 3, n_samples),
        'impurity_level': rng.normal(5, 2, n_samples),
        'mechanical_stress': rng.normal(20, 8, n_samples)
    })
    
    # Initialize labels
    y = pd.DataFrame({
        'Particle': np.zeros(n_samples, dtype=int),
        'Scratch': np.zeros(n_samples, dtype=int),
        'Film_Nonuniformity': np.zeros(n_samples, dtype=int),
        'Pattern_Defect': np.zeros(n_samples, dtype=int),
        'Contamination': np.zeros(n_samples, dtype=int)
    })
    
    # Define label rules with realistic correlations
    # Particle defects: high particle count OR low humidity
    particle_score = (X['particle_count'] - 100) / 30 - (X['humidity'] - 45) / 10
    y['Particle'] = (particle_score > 0.5).astype(int)
    
    # Scratch defects: high velocity AND mechanical stress
    scratch_score = (X['velocity'] - 50) / 15 + (X['mechanical_stress'] - 20) / 8
    y['Scratch'] = (scratch_score > 1.0).astype(int)
    
    # Film nonuniformity: temperature OR pressure deviations
    film_score = np.abs(X['temperature'] - 25) / 5 + np.abs(X['pressure'] - 1.0) / 0.15
    y['Film_Nonuniformity'] = (film_score > 1.5).astype(int)
    
    # Pattern defects: focus offset OR dose variation
    pattern_score = np.abs(X['focus_offset']) / 0.5 + np.abs(X['dose_variation']) / 5
    y['Pattern_Defect'] = (pattern_score > 1.2).astype(int)
    
    # Contamination: chemical residue AND impurity level
    contam_score = (X['chemical_residue'] - 10) / 3 + (X['impurity_level'] - 5) / 2
    y['Contamination'] = (contam_score > 1.0).astype(int)
    
    # Add label correlations (realistic co-occurrence)
    # Particle contamination often co-occurs with general contamination
    particle_contam_mask = (y['Particle'] == 1) & (rng.random(n_samples) > 0.7)
    y.loc[particle_contam_mask, 'Contamination'] = 1
    
    # Scratches can cause pattern defects
    scratch_pattern_mask = (y['Scratch'] == 1) & (rng.random(n_samples) > 0.6)
    y.loc[scratch_pattern_mask, 'Pattern_Defect'] = 1
    
    return X, y

# Generate dataset
X, y = generate_wafer_defect_data(n_samples=1000)

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"\nFeatures: {list(X.columns)}")
print(f"\nLabels: {list(y.columns)}")
print(f"\nFirst 5 samples:")
display(X.head())
print("\nFirst 5 labels:")
display(y.head())

## 4. Data Exploration & Label Analysis

Analyze label distributions, correlations, and co-occurrence patterns.

In [None]:
section('Label Distribution Analysis')

# Label frequencies
label_counts = y.sum(axis=0)
label_freq = (label_counts / len(y) * 100).round(2)

print("\nLabel Frequencies:")
print("="*50)
for label, count, freq in zip(y.columns, label_counts, label_freq):
    print(f"{label:20s}: {count:4d} samples ({freq:5.1f}%)")

# Labels per sample distribution
labels_per_sample = y.sum(axis=1)
print(f"\nLabels per Sample Statistics:")
print("="*50)
print(f"Mean:   {labels_per_sample.mean():.2f}")
print(f"Median: {labels_per_sample.median():.1f}")
print(f"Min:    {labels_per_sample.min()}")
print(f"Max:    {labels_per_sample.max()}")

# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Label frequencies
axes[0].bar(range(len(label_counts)), label_counts, color='steelblue')
axes[0].set_xticks(range(len(label_counts)))
axes[0].set_xticklabels(y.columns, rotation=45, ha='right')
axes[0].set_ylabel('Number of Samples')
axes[0].set_title('Defect Type Frequencies')
axes[0].grid(axis='y', alpha=0.3)

# Labels per sample histogram
axes[1].hist(labels_per_sample, bins=range(0, labels_per_sample.max()+2), 
             color='coral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of Defect Types per Wafer')
axes[1].set_ylabel('Number of Wafers')
axes[1].set_title('Distribution of Defects per Wafer')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Label co-occurrence matrix
print("\nLabel Co-occurrence Matrix:")
print("="*50)
cooccurrence = y.T.dot(y)
print(cooccurrence)

# Label correlation heatmap
plt.figure(figsize=(10, 8))
correlation = y.corr()
sns.heatmap(correlation, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Label Correlation Matrix\n(Positive = Tend to co-occur, Negative = Rarely co-occur)', 
          fontsize=12, pad=20)
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("- Positive correlation: Labels tend to occur together")
print("- Negative correlation: Labels rarely occur together")
print("- Zero correlation: Independent occurrence")

## 5. Binary Relevance Approach

**Binary Relevance** is the simplest multilabel approach: train an independent binary classifier for each label.

**Pros:**
- Simple to implement and understand
- Efficient training (can parallelize)
- Works with any binary classifier

**Cons:**
- Ignores label correlations
- May miss co-occurrence patterns

**When to use:** First baseline, labels independent, large number of labels
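
Written out by hand, Binary Relevance is just a loop over label columns. The sketch below is an illustration on a small synthetic dataset from `make_multilabel_classification` (a stand-in, not the wafer data); the names `X_demo`, `Y_demo`, and `models` are made up for this example. It also checks that the loop gives the same predictions as `MultiOutputClassifier` wrapping the same base estimator.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Small synthetic multilabel problem (stand-in for the wafer data)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=8, n_classes=3, random_state=0
)

# Binary Relevance by hand: one independent binary model per label column
models = []
for j in range(Y_demo.shape[1]):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo, Y_demo[:, j])   # label j is treated as its own binary task
    models.append(clf)

# Stack the per-label predictions back into an indicator matrix
Y_hat = np.column_stack([m.predict(X_demo) for m in models])

# The wrapper does the same thing with less code
wrapper = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, Y_demo)
same = np.array_equal(Y_hat, wrapper.predict(X_demo))

print(Y_hat.shape)
print(f"Manual loop matches MultiOutputClassifier: {same}")
```

Because no model ever sees another label, any co-occurrence structure has to be recovered indirectly through the features.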

In [None]:
section('Binary Relevance with Logistic Regression')

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_SEED
)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set:  {X_test.shape[0]} samples")

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Binary Relevance classifier
br_classifier = MultiOutputClassifier(
    LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)
)

print("\n🔄 Training Binary Relevance model...")
br_classifier.fit(X_train_scaled, y_train)

# Predictions
y_pred_br = br_classifier.predict(X_test_scaled)
y_pred_proba_br = np.array([est.predict_proba(X_test_scaled)[:, 1] 
                             for est in br_classifier.estimators_]).T

print("✓ Training complete\n")

# Evaluate
print("Binary Relevance Performance:")
print("="*50)
print(f"Hamming Loss:      {hamming_loss(y_test, y_pred_br):.4f}")
print(f"Exact Match Ratio: {accuracy_score(y_test, y_pred_br):.4f}")
print(f"F1 Score (micro):  {f1_score(y_test, y_pred_br, average='micro'):.4f}")
print(f"F1 Score (macro):  {f1_score(y_test, y_pred_br, average='macro'):.4f}")
print(f"F1 Score (samples):{f1_score(y_test, y_pred_br, average='samples'):.4f}")
print(f"Jaccard Score:     {jaccard_score(y_test, y_pred_br, average='samples'):.4f}")

# Per-label performance
print("\nPer-Label Performance:")
print("="*70)
print(f"{'Label':<25} {'Precision':>10} {'Recall':>10} {'F1-Score':>10}")
print("="*70)

for i, label in enumerate(y.columns):
    prec = precision_score(y_test.iloc[:, i], y_pred_br[:, i])
    rec = recall_score(y_test.iloc[:, i], y_pred_br[:, i])
    f1 = f1_score(y_test.iloc[:, i], y_pred_br[:, i])
    print(f"{label:<25} {prec:>10.3f} {rec:>10.3f} {f1:>10.3f}")

# Confusion matrices per label
cm_multilabel = multilabel_confusion_matrix(y_test, y_pred_br)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, (label, cm) in enumerate(zip(y.columns, cm_multilabel)):
    if idx < len(axes):
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                   xticklabels=['Pred 0', 'Pred 1'],
                   yticklabels=['True 0', 'True 1'])
        axes[idx].set_title(f'{label}\n(TN, FP, FN, TP)', fontsize=10)
        axes[idx].set_ylabel('True Label')
        axes[idx].set_xlabel('Predicted Label')

# Remove extra subplot
if len(y.columns) < len(axes):
    fig.delaxes(axes[-1])

plt.tight_layout()
plt.suptitle('Binary Relevance: Per-Label Confusion Matrices', 
             fontsize=14, y=1.01)
plt.show()

## 6. Classifier Chains for Label Dependencies

**Classifier Chains** models label dependencies by chaining classifiers sequentially. Each classifier uses predictions from previous classifiers as additional features.

**How it works:**
1. Train classifier for first label using features
2. Train classifier for second label using features + prediction from first classifier
3. Continue chaining for all labels

**Pros:**
- Captures label dependencies
- Often better than Binary Relevance when labels correlated

**Cons:**
- Order matters (need to tune or ensemble multiple orders)
- Error propagation through chain
- Slower training (can't parallelize)

**When to use:** Labels have known dependencies, medium number of labels
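
The chaining steps above can be sketched by hand as follows. This is an illustration on synthetic data from `make_multilabel_classification`, with made-up names (`X_demo`, `Y_demo`, `chain`); it uses the true labels as extra features during training, which is also what scikit-learn's `ClassifierChain` does when `cv=None`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic multilabel problem (stand-in for the wafer data)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=8, n_classes=3, random_state=0
)

# Training: classifier j sees the features plus the TRUE values of labels 0..j-1
chain = []
for j in range(Y_demo.shape[1]):
    X_aug = np.hstack([X_demo, Y_demo[:, :j]])
    chain.append(LogisticRegression(max_iter=1000).fit(X_aug, Y_demo[:, j]))

# Prediction: true labels are unavailable, so each step uses earlier PREDICTIONS
Y_hat = np.zeros(Y_demo.shape, dtype=int)
for j, clf in enumerate(chain):
    X_aug = np.hstack([X_demo, Y_hat[:, :j]])
    Y_hat[:, j] = clf.predict(X_aug)

print(Y_hat[:5])
```

The mismatch between training on true labels and predicting from earlier predictions is where error propagation comes from: a mistake on an early label becomes a wrong input feature for every later classifier.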

In [None]:
section('Classifier Chains')

# Choose the chain order by label frequency: most frequent labels first,
# so rarer labels can condition on better-supported predictions
label_order = y_train.sum(axis=0).sort_values(ascending=False).index.tolist()
print(f"Chain order (by frequency): {label_order}\n")

# ClassifierChain expects the order as column indices, not label names
chain_order = [y_train.columns.get_loc(label) for label in label_order]

# Train Classifier Chain
cc_classifier = ClassifierChain(
    LogisticRegression(max_iter=1000, random_state=RANDOM_SEED),
    order=chain_order,  # predictions are still returned in the original column order
    random_state=RANDOM_SEED
)

print("🔄 Training Classifier Chain...")
cc_classifier.fit(X_train_scaled, y_train)

# Predictions
y_pred_cc = cc_classifier.predict(X_test_scaled)
print("✓ Training complete\n")

# Evaluate
print("Classifier Chain Performance:")
print("="*50)
print(f"Hamming Loss:      {hamming_loss(y_test, y_pred_cc):.4f}")
print(f"Exact Match Ratio: {accuracy_score(y_test, y_pred_cc):.4f}")
print(f"F1 Score (micro):  {f1_score(y_test, y_pred_cc, average='micro'):.4f}")
print(f"F1 Score (macro):  {f1_score(y_test, y_pred_cc, average='macro'):.4f}")
print(f"F1 Score (samples):{f1_score(y_test, y_pred_cc, average='samples'):.4f}")
print(f"Jaccard Score:     {jaccard_score(y_test, y_pred_cc, average='samples'):.4f}")

# Compare with Binary Relevance
print("\n📊 Comparison: Classifier Chain vs Binary Relevance")
print("="*60)
print(f"{'Metric':<25} {'BR':>12} {'CC':>12} {'Improvement':>10}")
print("="*60)

br_f1_micro = f1_score(y_test, y_pred_br, average='micro')
cc_f1_micro = f1_score(y_test, y_pred_cc, average='micro')
print(f"{'F1 (micro)':<25} {br_f1_micro:>12.4f} {cc_f1_micro:>12.4f} {(cc_f1_micro-br_f1_micro)*100:>9.2f}%")

br_exact = accuracy_score(y_test, y_pred_br)
cc_exact = accuracy_score(y_test, y_pred_cc)
print(f"{'Exact Match':<25} {br_exact:>12.4f} {cc_exact:>12.4f} {(cc_exact-br_exact)*100:>9.2f}%")

br_hamming = hamming_loss(y_test, y_pred_br)
cc_hamming = hamming_loss(y_test, y_pred_cc)
print(f"{'Hamming Loss':<25} {br_hamming:>12.4f} {cc_hamming:>12.4f} {(br_hamming-cc_hamming)*100:>9.2f}%")

## 7. Label Powerset Approach

**Label Powerset** transforms multilabel into multiclass by treating each unique label combination as a separate class.

**Example:** If labels are {A, B, C}, possible classes are:
- {} (no labels)
- {A}, {B}, {C}
- {A,B}, {A,C}, {B,C}
- {A,B,C}

**Pros:**
- Naturally captures label dependencies
- Can use any multiclass classifier

**Cons:**
- Exponential number of classes (up to 2^n for n labels)
- Data sparsity (many combinations may not appear in training)
- Can't predict unseen combinations

**When to use:** Small number of labels (<10), strong label dependencies
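
A minimal Label Powerset sketch, assuming synthetic data from `make_multilabel_classification` and a `RandomForestClassifier` as the multiclass model (both are illustrative choices, and the names `combos`, `class_ids`, and `lp_model` are made up here): each distinct label row becomes one class id, a single multiclass model is trained on those ids, and predictions are decoded back into label rows.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic multilabel problem (stand-in for the wafer data)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=8, n_classes=3, random_state=0
)

# Encode: every distinct label combination (row) gets its own class id
combos, class_ids = np.unique(Y_demo, axis=0, return_inverse=True)
class_ids = np.ravel(class_ids)   # guard for NumPy versions that return a 2-D inverse

print(f"Distinct combinations: {len(combos)} of {2 ** Y_demo.shape[1]} possible")

# Train one ordinary multiclass model on the combination ids
lp_model = RandomForestClassifier(n_estimators=50, random_state=0)
lp_model.fit(X_demo, class_ids)

# Decode: look up the label row that belongs to each predicted class id
Y_hat = combos[lp_model.predict(X_demo)]
print(Y_hat[:5])
```

The decode step makes the main limitation visible: the model can only output rows that exist in `combos`, so a combination never seen in training can never be predicted.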

In [None]:
section('Label Powerset Analysis')

# Convert labels to label combinations
def labels_to_combinations(y_df):
    """Convert binary label matrix to label combination strings."""
    combinations = []
    for _, row in y_df.iterrows():
        active_labels = [label for label, val in row.items() if val == 1]
        combo = tuple(sorted(active_labels)) if active_labels else ()
        combinations.append(combo)
    return combinations

y_train_combos = labels_to_combinations(y_train)
y_test_combos = labels_to_combinations(y_test)

# Analyze label combination distribution
from collections import Counter
combo_counts = Counter(y_train_combos)

print(f"Total unique label combinations: {len(combo_counts)}")
print(f"Theoretical maximum (2^{len(y.columns)}): {2**len(y.columns)}\n")

print("Top 10 Most Frequent Label Combinations:")
print("="*60)
for combo, count in combo_counts.most_common(10):
    combo_str = ', '.join(combo) if combo else '(No defects)'
    freq = count / len(y_train) * 100
    print(f"{combo_str:<40} {count:>6} ({freq:>5.1f}%)")

# Visualize combination distribution
top_n = 15
top_combos = combo_counts.most_common(top_n)
combo_labels = [', '.join(c[0]) if c[0] else 'None' for c in top_combos]
combo_values = [c[1] for c in top_combos]

plt.figure(figsize=(14, 6))
plt.barh(range(len(combo_labels)), combo_values, color='teal')
plt.yticks(range(len(combo_labels)), combo_labels, fontsize=9)
plt.xlabel('Number of Samples', fontsize=11)
plt.title(f'Top {top_n} Label Combinations in Training Set', fontsize=13, pad=15)
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n⚠️ Note: Label Powerset works best when:")
print("   - Small number of labels (<10)")
print("   - Most combinations appear in training data")
print("   - Strong label dependencies exist")
print("\n   For this dataset with 5 labels, we have manageable complexity.")

## 8. Threshold Optimization

For multilabel classification, we can optimize decision thresholds per label to maximize specific metrics (e.g., F1 score).

**Default threshold**: 0.5 (predict 1 if probability > 0.5)

**Why optimize?**
- Imbalanced labels benefit from adjusted thresholds
- Can prioritize precision vs recall
- Semiconductor context: May want high recall (catch all defects) at cost of precision (more false alarms acceptable)
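
When recall is the priority, one option is to pick the threshold from a recall target instead of maximizing F1. The sketch below is one simple way to do that with NumPy; `threshold_for_recall` is a hypothetical helper written for this example, and the scores are made up. It assumes at least one positive sample and a target in (0, 1].

```python
import numpy as np

def threshold_for_recall(y_true, proba, target_recall=0.95):
    """Highest threshold whose recall (using `proba >= threshold`) meets the target.

    Assumes y_true contains at least one positive and 0 < target_recall <= 1.
    """
    pos_scores = np.sort(proba[y_true == 1])[::-1]      # positive-sample scores, high to low
    k = int(np.ceil(target_recall * len(pos_scores)))   # number of positives that must be caught
    return pos_scores[k - 1]

# Toy scores for one defect label (illustrative values)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.25, 0.1, 0.05, 0.02])

thr = threshold_for_recall(y_true, proba, target_recall=0.75)
y_pred = (proba >= thr).astype(int)

recall = y_pred[y_true == 1].mean()      # share of true defects that were flagged
precision = y_true[y_pred == 1].mean()   # share of flags that were true defects

print(f"Threshold: {thr:.2f}, recall: {recall:.2f}, precision: {precision:.2f}")
# Threshold: 0.40, recall: 0.75, precision: 0.75
```

With the default threshold of 0.5, the same scores give a recall of only 0.50, so lowering the threshold buys recall at the cost of admitting lower-scoring samples. As with any tuned threshold, the target should be calibrated on validation data rather than on the test set.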

In [None]:
section('Threshold Optimization')

def find_optimal_thresholds(y_true, y_pred_proba, metric='f1'):
    """
    Find optimal threshold per label to maximize specified metric.
    
    Args:
        y_true: True labels (n_samples, n_labels)
        y_pred_proba: Predicted probabilities (n_samples, n_labels)
        metric: 'f1', 'precision', or 'recall'
        
    Returns:
        optimal_thresholds: Array of thresholds per label
    """
    n_labels = y_true.shape[1]
    optimal_thresholds = np.zeros(n_labels)
    
    for i in range(n_labels):
        # Test thresholds from 0.1 to 0.9
        thresholds = np.arange(0.1, 1.0, 0.01)
        scores = []
        
        for thresh in thresholds:
            y_pred = (y_pred_proba[:, i] >= thresh).astype(int)
            
            if metric == 'f1':
                score = f1_score(y_true.iloc[:, i], y_pred, zero_division=0)
            elif metric == 'precision':
                score = precision_score(y_true.iloc[:, i], y_pred, zero_division=0)
            elif metric == 'recall':
                score = recall_score(y_true.iloc[:, i], y_pred, zero_division=0)
            
            scores.append(score)
        
        # Select threshold with best score
        best_idx = np.argmax(scores)
        optimal_thresholds[i] = thresholds[best_idx]
    
    return optimal_thresholds

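# Find optimal thresholds on out-of-fold TRAINING predictions.
# Tuning on the test set would leak information into the comparison below.
from sklearn.model_selection import cross_val_predict

print("Finding optimal thresholds to maximize F1 score...\n")
oof_proba = cross_val_predict(
    br_classifier, X_train_scaled, y_train, cv=5, method='predict_proba'
)  # list with one (n_samples, 2) array per label; the fitted model is not modified
y_train_proba_oof = np.array([p[:, 1] for p in oof_proba]).T
optimal_thresholds = find_optimal_thresholds(y_train, y_train_proba_oof, metric='f1')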

print("Optimal Thresholds per Label:")
print("="*50)
for label, thresh in zip(y.columns, optimal_thresholds):
    print(f"{label:<25} {thresh:.3f}")

# Apply optimal thresholds (broadcasts one threshold per label column)
y_pred_optimized = (y_pred_proba_br >= optimal_thresholds).astype(int)

# Compare performance
print("\n📊 Performance Comparison: Default (0.5) vs Optimized Thresholds")
print("="*70)
print(f"{'Metric':<25} {'Default':>12} {'Optimized':>12} {'Improvement':>10}")
print("="*70)

default_f1 = f1_score(y_test, y_pred_br, average='samples')
opt_f1 = f1_score(y_test, y_pred_optimized, average='samples')
print(f"{'F1 (samples)':<25} {default_f1:>12.4f} {opt_f1:>12.4f} {(opt_f1-default_f1)*100:>9.2f}%")

default_prec = precision_score(y_test, y_pred_br, average='samples', zero_division=0)
opt_prec = precision_score(y_test, y_pred_optimized, average='samples', zero_division=0)
print(f"{'Precision (samples)':<25} {default_prec:>12.4f} {opt_prec:>12.4f} {(opt_prec-default_prec)*100:>9.2f}%")

default_rec = recall_score(y_test, y_pred_br, average='samples', zero_division=0)
opt_rec = recall_score(y_test, y_pred_optimized, average='samples', zero_division=0)
print(f"{'Recall (samples)':<25} {default_rec:>12.4f} {opt_rec:>12.4f} {(opt_rec-default_rec)*100:>9.2f}%")

# Visualize threshold impact
fig, axes = plt.subplots(1, len(y.columns), figsize=(18, 4))

for i, (label, ax) in enumerate(zip(y.columns, axes)):
    thresholds = np.arange(0.1, 1.0, 0.01)
    f1_scores = []
    
    for thresh in thresholds:
        y_pred = (y_pred_proba_br[:, i] >= thresh).astype(int)
        f1 = f1_score(y_test.iloc[:, i], y_pred, zero_division=0)
        f1_scores.append(f1)
    
    ax.plot(thresholds, f1_scores, linewidth=2, color='navy')
    ax.axvline(0.5, color='red', linestyle='--', alpha=0.5, label='Default (0.5)')
    ax.axvline(optimal_thresholds[i], color='green', linestyle='--', alpha=0.7, label=f'Optimal ({optimal_thresholds[i]:.2f})')
    ax.set_xlabel('Threshold')
    ax.set_ylabel('F1 Score')
    ax.set_title(label, fontsize=10)
    ax.legend(fontsize=8)
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.suptitle('Threshold Optimization: F1 Score vs Threshold per Label', 
             fontsize=13, y=1.02)
plt.show()

## 9. Comprehensive Metrics for Multilabel

Understanding multilabel metrics is crucial for proper model evaluation.

### Key Metrics:

**Sample-based metrics** (average across samples):
- **Hamming Loss**: Fraction of wrong labels (lower is better)
- **Exact Match**: Fraction of samples where ALL labels are correct
- **F1 (samples)**: F1 computed per sample then averaged
- **Jaccard Score**: IoU of predicted and true label sets

**Label-based metrics** (average across labels):
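- **F1 (micro)**: Pool TP/FP/FN counts over all labels, then compute one F1 (frequent labels dominate)
- **F1 (macro)**: Compute F1 per label, then take the unweighted mean (every label counts equally)
- **F1 (weighted)**: Compute F1 per label, then average with weights proportional to each label's support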

**Semiconductor Application Guidelines:**
- Use **Exact Match** for critical inline inspection, where every label on a wafer must be predicted correctly
- Use **F1 (samples)** for overall quality assessment
- Use **Recall** when missing defects is costly (prioritize catching all defects)
- Use **Precision** when false alarms are expensive (unnecessary stops)
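
The choice of averaging matters most when labels are imbalanced. As a small illustration with made-up arrays, the example below has a common label that is always predicted correctly and a rare label that is always missed; micro-averaged F1 stays high while macro-averaged F1 exposes the failure.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy data: label 0 is common and predicted perfectly, label 1 is rare and always missed
y_true = np.array([[1, 0],
                   [1, 0],
                   [1, 0],
                   [1, 1],
                   [0, 0],
                   [1, 0]])
y_pred = np.array([[1, 0],
                   [1, 0],
                   [1, 0],
                   [1, 0],
                   [0, 0],
                   [1, 0]])

micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"F1 (micro): {micro:.4f}")   # pooled counts TP=5, FP=0, FN=1 -> 0.9091
print(f"F1 (macro): {macro:.4f}")   # mean of per-label F1 (1.0 and 0.0) -> 0.5000
```

For rare but critical defect types, macro averaging or the per-label scores are therefore the safer numbers to watch.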

In [None]:
section('Comprehensive Multilabel Metrics')

def calculate_all_metrics(y_true, y_pred, model_name='Model'):
    """
    Calculate comprehensive multilabel metrics.
    """
    metrics = {
        'model': model_name,
        'hamming_loss': hamming_loss(y_true, y_pred),
        'exact_match': accuracy_score(y_true, y_pred),
        'f1_micro': f1_score(y_true, y_pred, average='micro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'f1_weighted': f1_score(y_true, y_pred, average='weighted'),
        'f1_samples': f1_score(y_true, y_pred, average='samples'),
        'precision_micro': precision_score(y_true, y_pred, average='micro'),
        'precision_macro': precision_score(y_true, y_pred, average='macro'),
        'recall_micro': recall_score(y_true, y_pred, average='micro'),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'jaccard_samples': jaccard_score(y_true, y_pred, average='samples'),
        'jaccard_macro': jaccard_score(y_true, y_pred, average='macro')
    }
    return metrics

# Calculate metrics for all models
metrics_br = calculate_all_metrics(y_test, y_pred_br, 'Binary Relevance')
metrics_cc = calculate_all_metrics(y_test, y_pred_cc, 'Classifier Chain')
metrics_opt = calculate_all_metrics(y_test, y_pred_optimized, 'BR + Optimized Thresholds')

# Create comparison DataFrame
metrics_df = pd.DataFrame([metrics_br, metrics_cc, metrics_opt])
metrics_df = metrics_df.set_index('model')

print("Complete Metrics Comparison:")
print("="*100)
display(metrics_df.round(4))

# Identify best model per metric
print("\n🏆 Best Model per Metric:")
print("="*60)
for col in metrics_df.columns:
    if col == 'hamming_loss':
        best_model = metrics_df[col].idxmin()  # Lower is better
    else:
        best_model = metrics_df[col].idxmax()  # Higher is better
    best_value = metrics_df.loc[best_model, col]
    print(f"{col:<25} {best_model:<30} ({best_value:.4f})")

# Visualize key metrics
key_metrics = ['f1_samples', 'exact_match', 'precision_macro', 'recall_macro', 'jaccard_samples']
metrics_subset = metrics_df[key_metrics]

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(key_metrics))
width = 0.25

for i, (model, row) in enumerate(metrics_subset.iterrows()):
    ax.bar(x + i*width, row, width, label=model, alpha=0.8)

ax.set_xlabel('Metric', fontsize=11)
ax.set_ylabel('Score', fontsize=11)
ax.set_title('Multilabel Model Comparison: Key Metrics', fontsize=13, pad=15)
ax.set_xticks(x + width)
ax.set_xticklabels(key_metrics, rotation=15, ha='right')
ax.legend(fontsize=10)
ax.grid(axis='y', alpha=0.3)
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

## 10. Summary & Best Practices

### Approach Selection Guidelines:

**Binary Relevance:**
- ✅ Use when: Labels independent, large number of labels, need speed
- ❌ Avoid when: Strong label correlations exist

**Classifier Chains:**
- ✅ Use when: Label dependencies exist, medium number of labels
- ❌ Avoid when: Need very fast inference, uncertainty about dependencies

**Label Powerset:**
- ✅ Use when: Few labels (<10), strong dependencies, all combinations seen
- ❌ Avoid when: Many labels, sparse combinations, scalability needed

### Semiconductor Manufacturing Recommendations:

1. **Start with Binary Relevance** as baseline
2. **Analyze label correlations** from historical data
3. **Use Classifier Chains** if strong dependencies found
4. **Optimize thresholds** based on cost of false positives vs false negatives
5. **Monitor per-label performance** - some defect types harder to detect
6. **Use ensemble methods** (multiple chain orders) for production
7. **Track label distribution drift** - defect patterns may change over time
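
Recommendation 6 can be sketched as follows, assuming synthetic data from `make_multilabel_classification` and five chains (both arbitrary choices for illustration; `chains` and `mean_proba` are made-up names). Each `ClassifierChain` gets its own random label order, and the per-label probabilities are averaged before thresholding.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Small synthetic multilabel problem (stand-in for the wafer data)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=8, n_classes=4, random_state=0
)

# Several chains, each with a different random label order
chains = [
    ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=seed)
    for seed in range(5)
]
for chain in chains:
    chain.fit(X_demo, Y_demo)
    print(f"Chain order: {list(chain.order_)}")

# Average per-label probabilities across chains, then apply a 0.5 threshold
mean_proba = np.mean([chain.predict_proba(X_demo) for chain in chains], axis=0)
Y_hat = (mean_proba >= 0.5).astype(int)

print(Y_hat[:5])
```

Averaging over orders reduces the dependence on any single chain order, at the cost of training and serving several models. The sketch predicts on its own training data only to stay short; a real comparison needs a held-out set.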

### Key Takeaways:
- Multilabel classification enables comprehensive defect characterization
- Label correlations provide insights into process relationships
- Threshold optimization can significantly improve performance
- Choose approach based on label dependencies and constraints
- Always validate on representative test set with production-like label distribution

In [None]:
section('Notebook Complete!')

print("✅ You have completed the Multilabel Classification tutorial!")
print("\nKey Skills Acquired:")
print("  • Understanding multilabel vs multiclass problems")
print("  • Implementing Binary Relevance and Classifier Chains, and analyzing Label Powerset combinations")
print("  • Analyzing label correlations and co-occurrence patterns")
print("  • Optimizing decision thresholds per label")
print("  • Comprehensive multilabel metrics interpretation")
print("  • Applying multilabel classification to semiconductor defect detection")
print("\n📚 Next Steps:")
print("  • Explore assessment questions in assessments/module-4/4.3-questions.json")
print("  • Review 4.3-multilabel-fundamentals.md for deeper theory")
print("  • Check 4.3-multilabel-quick-ref.md for quick reference")
print("  • Try 4.3-multilabel-pipeline.py for production implementation")