# Wafer Defect Classification Tutorial

This tutorial demonstrates how to build, evaluate, and deploy a wafer defect classification system using classical machine learning models trained on synthetic data.

## Business Context

In semiconductor manufacturing, wafer defect detection is critical for:
- **Quality Control**: Detecting defects early keeps defective dies from reaching customers
- **Cost Reduction**: Catching process issues before they affect entire lots
- **Process Optimization**: Using defect patterns to improve the manufacturing process

## Learning Objectives

By the end of this tutorial, you will:
1. Understand semiconductor defect classification challenges
2. Build and compare multiple ML models for defect detection
3. Apply manufacturing-specific metrics (PWS, Estimated Loss)
4. Deploy models using a standardized CLI
5. Optimize model thresholds for precision/recall constraints

## Setup and Imports

In [None]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import our wafer defect pipeline
from wafer_defect_pipeline import (
    WaferDefectPipeline, 
    generate_synthetic_wafer_defects,
    load_dataset
)

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Data Generation and Exploration

Let's start by generating synthetic wafer defect data to understand the problem.

In [None]:
# Generate synthetic wafer defect data
print("Generating synthetic wafer defect data...")
df = generate_synthetic_wafer_defects(
    n_samples=2000,
    n_features=20,
    defect_rate=0.15,  # 15% defect rate
    noise_level=0.1,
    seed=42
)

print(f"Dataset shape: {df.shape}")
print(f"\nFeatures: {list(df.columns[:-1])}")
print(f"Target: {df.columns[-1]}")

# Display basic statistics
print("\n=== Dataset Summary ===")
print(df.describe())

In [None]:
# Analyze defect distribution
defect_counts = df['defect'].value_counts()
print("\n=== Defect Distribution ===")
print(f"Good wafers: {defect_counts[0]} ({defect_counts[0]/len(df)*100:.1f}%)")
print(f"Defective wafers: {defect_counts[1]} ({defect_counts[1]/len(df)*100:.1f}%)")

# Visualize class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Class distribution
defect_counts.plot(kind='bar', ax=ax1, color=['skyblue', 'salmon'])
ax1.set_title('Wafer Defect Distribution')
ax1.set_xlabel('Class (0=Good, 1=Defective)')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=0)

# Feature correlation with target
feature_cols = [col for col in df.columns if col != 'defect']
correlations = df[feature_cols].corrwith(df['defect']).abs().sort_values(ascending=False)
correlations.head(10).plot(kind='bar', ax=ax2, color='lightgreen')
ax2.set_title('Top 10 Features Correlated with Defects')
ax2.set_xlabel('Features')
ax2.set_ylabel('Absolute Correlation')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
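
With a defect rate near 15%, plain accuracy can look healthy while the classifier catches nothing, which is why the rest of this tutorial leans on recall, PR-AUC, and cost-based metrics. The minimal sketch below uses made-up labels (1700 good, 300 defective) rather than the tutorial's dataset, and a hypothetical "classifier" that labels every wafer as good.

```python
# Illustrative labels only: 1700 good wafers (0) and 300 defective wafers (1),
# i.e. a 15% defect rate. These are not drawn from the tutorial's dataset.
y_true = [0] * 1700 + [1] * 300

# A trivial "classifier" that labels every wafer as good
y_pred = [0] * len(y_true)

# Accuracy: fraction of wafers labeled correctly
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of truly defective wafers that were flagged
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_defects = sum(y_true)
recall = true_positives / actual_defects

print(f"Accuracy: {accuracy:.2%}")  # high, despite catching no defects
print(f"Recall:   {recall:.2%}")
```

The accuracy equals the share of good wafers, so any model worth deploying has to beat that baseline on recall-oriented metrics, not just on accuracy.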

## 2. Model Training and Comparison

Now let's train and compare different ML models for wafer defect classification.

In [None]:
# Prepare data
X = df.drop('defect', axis=1)
y = df['defect'].values

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Hold out a stratified test set so the comparison reflects unseen wafers
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Models to compare
models_to_test = [
    ('logistic', 'Logistic Regression'),
    ('rf', 'Random Forest'),
    ('gb', 'Gradient Boosting'),
    ('linear_svm', 'Linear SVM'),
    ('tree', 'Decision Tree')
]

results = []

print("\n=== Training Models ===")
for model_name, model_desc in models_to_test:
    print(f"\nTraining {model_desc}...")
    
    # Create and train pipeline
    pipeline = WaferDefectPipeline(
        model_name=model_name,
        handle_imbalance='class_weight'
    )
    
    # Fit on the training split only
    pipeline.fit(X_train, y_train)
    
    # Evaluate on the held-out split
    metrics = pipeline.evaluate(X_test, y_test)
    
    results.append({
        'Model': model_desc,
        'ROC-AUC': metrics['ROC_AUC'],
        'PR-AUC': metrics['PR_AUC'],
        'Precision': metrics['Precision'],
        'Recall': metrics['Recall'],
        'F1': metrics['F1'],
        'PWS': metrics['PWS'],
        'Estimated_Loss': metrics['Estimated_Loss']
    })
    
    print(f"  ROC-AUC: {metrics['ROC_AUC']:.3f}")
    print(f"  PR-AUC: {metrics['PR_AUC']:.3f}")
    print(f"  PWS: {metrics['PWS']:.1%}")

# Create results dataframe
results_df = pd.DataFrame(results)
print("\n=== Model Comparison ===")
print(results_df.round(3))
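
The pipelines above are created with `handle_imbalance='class_weight'`, but the tutorial does not show how the weights are derived. One common scheme is the "balanced" heuristic used by scikit-learn, where each class gets weight `n_samples / (n_classes * class_count)`. Whether `WaferDefectPipeline` uses exactly this formula is an assumption; the helper name `balanced_class_weights` and the labels below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels: 1700 good wafers (0) and 300 defective wafers (1)
labels = [0] * 1700 + [1] * 300
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

Under this scheme the rare defective class receives the larger weight, so that both classes contribute the same total weight to the training loss.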

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# ROC-AUC comparison
ax1 = axes[0, 0]
results_df.set_index('Model')['ROC-AUC'].plot(kind='bar', ax=ax1, color='skyblue')
ax1.set_title('ROC-AUC Comparison')
ax1.set_ylabel('ROC-AUC Score')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# PR-AUC comparison
ax2 = axes[0, 1]
results_df.set_index('Model')['PR-AUC'].plot(kind='bar', ax=ax2, color='lightgreen')
ax2.set_title('PR-AUC Comparison')
ax2.set_ylabel('PR-AUC Score')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)

# PWS comparison  
ax3 = axes[1, 0]
results_df.set_index('Model')['PWS'].plot(kind='bar', ax=ax3, color='salmon')
ax3.set_title('PWS (Prediction Within Spec) Comparison')
ax3.set_ylabel('PWS Score')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(True, alpha=0.3)

# F1 Score comparison
ax4 = axes[1, 1]
results_df.set_index('Model')['F1'].plot(kind='bar', ax=ax4, color='gold')
ax4.set_title('F1 Score Comparison')
ax4.set_ylabel('F1 Score')
ax4.tick_params(axis='x', rotation=45)
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find best model
best_model_idx = results_df['ROC-AUC'].idxmax()
best_model = results_df.loc[best_model_idx]
print("\n=== Best Performing Model ===")
print(f"Model: {best_model['Model']}")
print(f"ROC-AUC: {best_model['ROC-AUC']:.3f}")
print(f"PWS: {best_model['PWS']:.1%}")

## 3. Manufacturing-Specific Metrics Deep Dive

Let's explore the semiconductor manufacturing metrics in detail.

In [None]:
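# Train a Random Forest for detailed analysis (hard-coded here; substitute
# the best model from the comparison above if it differs)
best_pipeline = WaferDefectPipeline(
    model_name='rf',
    handle_imbalance='class_weight'
)
best_pipeline.fit(X, y)

# Get predictions and probabilities
# Note: these are computed on the same data the model was fit on, so the
# threshold curves below are optimistic; use held-out data for real decisions
y_pred = best_pipeline.predict(X)
y_proba = best_pipeline.predict_proba(X)[:, 1]  # Probability of defect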

# Analyze manufacturing costs at different thresholds
thresholds = np.linspace(0.10, 0.85, 16)  # 0.10, 0.15, ..., 0.85 (avoids float drift in np.arange)
threshold_analysis = []

print("=== Threshold Analysis for Manufacturing Optimization ===")
for threshold in thresholds:
    # Apply threshold
    y_pred_thresh = (y_proba >= threshold).astype(int)
    
    # Calculate metrics with manufacturing parameters
    metrics = WaferDefectPipeline.compute_metrics(
        y, y_pred_thresh, y_proba,
        cost_false_positive=10.0,  # Cost of scrapping good wafer
        cost_false_negative=100.0,  # Cost of shipping defective wafer
        tolerance=0.05
    )
    
    threshold_analysis.append({
        'Threshold': threshold,
        'Precision': metrics['Precision'],
        'Recall': metrics['Recall'],
        'F1': metrics['F1'],
        'PWS': metrics['PWS'],
        'Estimated_Loss': metrics['Estimated_Loss']
    })

threshold_df = pd.DataFrame(threshold_analysis)
print(threshold_df.head(10).round(3))
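
The threshold sweep above can be mirrored in plain Python to make the cost trade-off concrete. This sketch assumes the estimated loss is simply (false positives × `cost_false_positive`) + (false negatives × `cost_false_negative`); the actual formula inside `compute_metrics` is not shown in this tutorial. The function `estimated_loss` and the toy labels and scores are invented for illustration.

```python
def estimated_loss(y_true, y_score, threshold, cost_fp=10.0, cost_fn=100.0):
    """Assumed loss: cost_fp per false positive plus cost_fn per false negative."""
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    return fp * cost_fp + fn * cost_fn


# Toy labels (0 = good, 1 = defective) and defect scores, invented for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.12, 0.22, 0.33, 0.45, 0.62, 0.25, 0.55, 0.72, 0.91]

# Candidate thresholds 0.1, 0.2, ..., 0.9
thresholds = [round(0.1 * k, 1) for k in range(1, 10)]
losses = {t: estimated_loss(y_true, y_score, t) for t in thresholds}
best_threshold = min(losses, key=losses.get)

for t in thresholds:
    print(f"threshold={t:.1f}  loss={losses[t]:.0f}")
print(f"lowest-loss threshold: {best_threshold}")
```

Because a missed defect costs ten times as much as a scrapped good wafer in this setup, the lowest-loss threshold lands well below 0.5: it is cheaper to tolerate several false alarms than one escape.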

In [None]:
# Visualize threshold optimization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Precision vs Recall vs Threshold
ax1 = axes[0, 0]
ax1.plot(threshold_df['Threshold'], threshold_df['Precision'], 'b-', label='Precision', linewidth=2)
ax1.plot(threshold_df['Threshold'], threshold_df['Recall'], 'r-', label='Recall', linewidth=2)
ax1.plot(threshold_df['Threshold'], threshold_df['F1'], 'g--', label='F1', linewidth=2)
ax1.set_xlabel('Decision Threshold')
ax1.set_ylabel('Score')
ax1.set_title('Precision-Recall vs Threshold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# PWS vs Threshold
ax2 = axes[0, 1]
ax2.plot(threshold_df['Threshold'], threshold_df['PWS'], 'purple', linewidth=2)
ax2.set_xlabel('Decision Threshold')
ax2.set_ylabel('PWS (Prediction Within Spec)')
ax2.set_title('Manufacturing PWS vs Threshold')
ax2.grid(True, alpha=0.3)

# Estimated Loss vs Threshold
ax3 = axes[1, 0]
ax3.plot(threshold_df['Threshold'], threshold_df['Estimated_Loss'], 'orange', linewidth=2)
ax3.set_xlabel('Decision Threshold')
ax3.set_ylabel('Estimated Loss ($)')
ax3.set_title('Manufacturing Cost vs Threshold')
ax3.grid(True, alpha=0.3)

# Find optimal threshold (minimum loss)
optimal_idx = threshold_df['Estimated_Loss'].idxmin()
optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
optimal_loss = threshold_df.loc[optimal_idx, 'Estimated_Loss']

ax3.axvline(x=optimal_threshold, color='red', linestyle='--', alpha=0.7)
ax3.text(optimal_threshold + 0.05, optimal_loss, 
         f'Optimal: {optimal_threshold:.2f}', 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# ROC Curve
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y, y_proba)
ax4 = axes[1, 1]
ax4.plot(fpr, tpr, 'b-', linewidth=2, label='ROC Curve')
ax4.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
ax4.set_xlabel('False Positive Rate')
ax4.set_ylabel('True Positive Rate')
ax4.set_title('ROC Curve')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== Optimal Operating Point ===")
print(f"Optimal Threshold: {optimal_threshold:.3f}")
print("At this threshold:")
print(f"  Precision: {threshold_df.loc[optimal_idx, 'Precision']:.3f}")
print(f"  Recall: {threshold_df.loc[optimal_idx, 'Recall']:.3f}")
print(f"  PWS: {threshold_df.loc[optimal_idx, 'PWS']:.1%}")
print(f"  Estimated Loss: ${threshold_df.loc[optimal_idx, 'Estimated_Loss']:.2f}")
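
As a sanity check on the empirical optimum, decision theory gives a reference value. It holds only under two assumptions that this tutorial does not verify: the predicted probability $p$ is well calibrated, and the loss is linear in the error counts with per-error costs $c_{FP}$ and $c_{FN}$. Flagging a wafer costs $(1 - p)\,c_{FP}$ in expectation, and passing it costs $p\,c_{FN}$, so flagging is the cheaper action when:

$$
p\,c_{FN} \ge (1 - p)\,c_{FP}
\quad\Longleftrightarrow\quad
p \ge t^{*} = \frac{c_{FP}}{c_{FP} + c_{FN}}
$$

With the costs used above, $c_{FP} = 10$ and $c_{FN} = 100$:

$$
t^{*} = \frac{10}{10 + 100} \approx 0.091
$$

This value sits just below the lowest threshold in the sweep (0.10). If the empirical optimum lands at the low edge of the grid, it may be worth extending the sweep downward. A large gap between the empirical optimum and $t^{*}$ can also hint that the probabilities are not well calibrated.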

## 4. Model Deployment and CLI Usage

Now let's save the trained model and walk through the production CLI.

In [None]:
# Save the optimized model
model_path = Path('best_wafer_defect_model.joblib')

# Create pipeline with optimal threshold
production_pipeline = WaferDefectPipeline(
    model_name='rf',
    handle_imbalance='class_weight'
)
production_pipeline.fit(X, y)

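# Choose the lowest-cost threshold that still meets the minimum precision
# (tuned here on the training data; prefer a validation split in practice)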
production_pipeline.optimize_threshold(
    X, y, 
    min_precision=0.8,  # Require at least 80% precision
    cost_false_positive=10.0,
    cost_false_negative=100.0
)

# Save the model
production_pipeline.save(model_path)
print(f"Model saved to: {model_path}")
print(f"Optimized threshold: {production_pipeline.threshold:.3f}")
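
The `optimize_threshold` call combines a cost objective with a precision floor. Its internals are not shown in this tutorial, so the following is only a sketch of one plausible approach: among the candidate thresholds whose precision meets `min_precision`, pick the one with the lowest assumed cost (false positives × `cost_fp` + false negatives × `cost_fn`). The helpers `confusion_counts` and `pick_threshold` and the toy data are hypothetical.

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (tp, fp, fn) when scores >= threshold are flagged as defective."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    return tp, fp, fn


def pick_threshold(y_true, y_score, candidates, min_precision,
                   cost_fp=10.0, cost_fn=100.0):
    """Lowest-cost candidate whose precision meets min_precision, else None."""
    best = None  # (cost, threshold)
    for threshold in candidates:
        tp, fp, fn = confusion_counts(y_true, y_score, threshold)
        if tp + fp == 0:
            continue  # nothing flagged, so precision is undefined
        if tp / (tp + fp) < min_precision:
            continue  # violates the precision floor
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[0]:
            best = (cost, threshold)
    return None if best is None else best[1]


# Toy labels (0 = good, 1 = defective) and scores, invented for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.12, 0.22, 0.33, 0.45, 0.62, 0.25, 0.55, 0.72, 0.91]
candidates = [0.2, 0.4, 0.5, 0.7]

unconstrained = pick_threshold(y_true, y_score, candidates, min_precision=0.0)
constrained = pick_threshold(y_true, y_score, candidates, min_precision=0.8)
print(f"unconstrained: {unconstrained}, with precision >= 0.8: {constrained}")
```

On this toy data the precision floor pushes the chosen threshold higher and raises the cost, which is the trade-off to keep in mind when setting `min_precision`: the constraint is not free.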

In [None]:
# Demonstrate CLI usage (these commands would be run from the command line)
print("=== CLI Usage Examples ===")
print("\nTo train a new model:")
print("python wafer_defect_pipeline.py train --dataset synthetic_wafer --model rf --min-precision 0.8 --save model.joblib")

print("\nTo evaluate an existing model:")
print("python wafer_defect_pipeline.py evaluate --model-path model.joblib --dataset synthetic_wafer")

print("\nTo make predictions:")
import json  # str(dict) would emit single-quoted keys, which is not valid JSON
prediction_example = {
    "center_density": 0.12,
    "edge_density": 0.05,
    "pattern_uniformity": 0.85,
    "thickness_variation": 0.03
}
print(f"python wafer_defect_pipeline.py predict --model-path model.joblib --input-json '{json.dumps(prediction_example)}'")

# Simulate a prediction
print("\n=== Live Prediction Example ===")
sample_wafer = X.iloc[0:1]  # Take first wafer
prediction = production_pipeline.predict(sample_wafer)
probability = production_pipeline.predict_proba(sample_wafer)[0, 1]

print(f"Sample wafer features: {sample_wafer.iloc[0].to_dict()}")
print(f"Prediction: {'DEFECTIVE' if prediction[0] == 1 else 'GOOD'}")
print(f"Defect probability: {probability:.3f}")
print(f"Actual label: {'DEFECTIVE' if y[0] == 1 else 'GOOD'}")
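
The `predict` command takes its features as a JSON string on the command line, which makes shell quoting easy to get wrong. A minimal sketch of building that command safely from Python, using only the standard library, is shown below. The subcommand and flags are the ones printed above; the feature names and values are the same illustrative ones.

```python
import json
import shlex

features = {
    "center_density": 0.12,
    "edge_density": 0.05,
    "pattern_uniformity": 0.85,
    "thickness_variation": 0.03,
}

# json.dumps produces valid JSON with double-quoted keys
payload = json.dumps(features)

# shlex.quote wraps the payload so a POSIX shell passes it as one argument
command = " ".join([
    "python", "wafer_defect_pipeline.py", "predict",
    "--model-path", "model.joblib",
    "--input-json", shlex.quote(payload),
])
print(command)

# Round-trip check: splitting the command the way a POSIX shell would
# recovers the exact JSON payload
recovered = shlex.split(command)[-1]
assert json.loads(recovered) == features
```

When the command is launched from Python rather than pasted into a terminal, passing the arguments as a list to `subprocess.run` avoids shell quoting altogether.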

## 5. Key Takeaways

### Manufacturing Insights
1. **Threshold Optimization**: The optimal decision threshold balances false positive costs (scrapping good wafers) vs false negative costs (shipping defective wafers)
2. **PWS Metric**: Prediction Within Spec measures how well predictions align with manufacturing tolerance requirements
3. **Cost-Aware ML**: Manufacturing decisions should consider economic impact, not just accuracy

### Technical Insights
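1. **Model Selection**: Tree ensembles such as Random Forest and Gradient Boosting are often strong baselines for tabular defect data; confirm this against the model comparison rather than assuming it
2. **Imbalance Handling**: Class weights help address the natural imbalance in defect rates
3. **Feature Engineering**: Process parameters that correlate strongly with defects are good starting candidates, though linear correlation can miss nonlinear or interaction effects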

### Production Deployment
1. **Standardized CLI**: Consistent interface across all semiconductor ML projects
2. **Model Persistence**: Save/load functionality preserves optimal thresholds and preprocessing
3. **JSON Output**: Machine-readable results for integration with manufacturing systems

## Next Steps

To extend this baseline classifier:
1. **Real Data Integration**: Connect to actual wafer map datasets (e.g., WM-811K)
2. **Deep Learning**: Implement CNN models for spatial pattern recognition
3. **Feature Engineering**: Add domain-specific features (spatial statistics, pattern descriptors)
4. **Model Ensemble**: Combine multiple model predictions for improved performance
5. **Real-time Deployment**: Create API wrapper for production manufacturing lines
6. **Drift Monitoring**: Track model performance degradation over time

In [None]:
# Clean up
if model_path.exists():
    model_path.unlink()
    print("Cleaned up temporary model file")

print("\n🎉 Tutorial completed successfully!")
print("You now have the knowledge to build production-ready wafer defect classifiers.")