# Semiconductor Yield Regression - Solution Notebook

**Complete Reference Implementations for All Exercises**

This notebook provides worked solutions for every exercise in the yield regression tutorial. Use it as a reference for the expected implementations and best practices.

## About This Solution Notebook

This notebook contains:
- ✅ **Complete implementations** for all 4 exercises
- ✅ **Detailed explanations** of regression concepts and manufacturing context
- ✅ **Best practices** for production-ready ML pipelines
- ✅ **Debugging tips** and common pitfalls to avoid
- ✅ **Key takeaways** summarizing important concepts

## Business Context

In semiconductor manufacturing, **yield** (the percentage of manufactured units, typically dies on a wafer, that meet specifications) is a critical metric. Even small improvements in yield prediction can:
- Save millions in manufacturing costs
- Optimize process parameters faster
- Reduce scrap and rework
- Improve production planning

This project demonstrates how to build production-ready regression models for yield prediction.

## Learning Objectives

By working through these solutions, you will:
1. Generate and explore synthetic semiconductor yield data
2. Train and compare multiple regression models
3. Calculate manufacturing-specific metrics (PWS, Estimated Loss)
4. Deploy models using production-ready CLI patterns
5. Understand residual analysis and model validation

## Exercise Overview

| Exercise | Topic | Difficulty | Time |
|----------|-------|------------|------|
| 1 | Data Generation & Exploration | ★★ | 20 min |
| 2 | Model Training & Comparison | ★★ | 30 min |
| 3 | Manufacturing Metrics & Residuals | ★★★ | 30 min |
| 4 | Model Deployment & CLI | ★★ | 20 min |

**Total Estimated Time**: 100 minutes

## Setup and Imports

In [None]:
# Standard library imports
import json
import sys
from pathlib import Path

# Data manipulation and numerical computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Import our yield regression pipeline
from yield_regression_pipeline import (
    YieldRegressionPipeline,
    generate_yield_process,
    TARGET_COLUMN,
    RANDOM_SEED
)

# Set random seed for reproducibility
np.random.seed(RANDOM_SEED)

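# Configure plotting; Matplotlib < 3.6 names this style 'seaborn-darkgrid'
style_name = 'seaborn-v0_8-darkgrid'
if style_name not in plt.style.available:
    style_name = 'seaborn-darkgrid'
plt.style.use(style_name)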
sns.set_palette('husl')
%matplotlib inline

print("✅ Setup complete!")
print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---

## Exercise 1: Data Generation and Exploration

**Objective**: Generate synthetic yield data and explore relationships between process parameters and yield.

**Skills**: Data generation, exploratory data analysis, visualization

**Difficulty**: ★★ Intermediate

### What You'll Learn
- How to generate realistic semiconductor manufacturing data
- Feature engineering for process parameters
- Identifying correlations and distributions
- Visualization best practices for regression analysis

### Step 1.1: Generate Synthetic Yield Data

In [None]:
# Generate synthetic yield process data
df = generate_yield_process(n=800, seed=RANDOM_SEED)

print(f"Generated {len(df)} samples")
print(f"\nColumns: {list(df.columns)}")
print(f"\nShape: {df.shape}")

# Display first few rows
df.head()

**💡 Key Insight**: The synthetic data generator creates realistic semiconductor process data with:
- **4 base parameters**: temperature, pressure, flow, time
- **4 engineered features**: temp_centered, pressure_sq, flow_time_inter, temp_flow_inter
- **1 target**: yield_pct (0-100%)

Feature engineering is crucial in manufacturing - interaction terms and non-linear transformations capture real process physics.
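As a rough illustration of what such engineered features look like, the sketch below rebuilds the four columns from the base parameters. The formulas are assumptions inferred from the column names only (a centered temperature, a squared pressure, and two products); `generate_yield_process` may define them differently. The helper `add_engineered_features`, the centering value of 450, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(base: pd.DataFrame, temp_center: float) -> pd.DataFrame:
    """Hypothetical re-creation of the engineered columns (formulas assumed from names)."""
    out = base.copy()
    out["temp_centered"] = out["temperature"] - temp_center      # shift toward a nominal setpoint
    out["pressure_sq"] = out["pressure"] ** 2                    # non-linear pressure term
    out["flow_time_inter"] = out["flow"] * out["time"]           # flow x time interaction
    out["temp_flow_inter"] = out["temperature"] * out["flow"]    # temperature x flow interaction
    return out


# Illustrative base parameters; the means and spreads are made up for this demo
rng = np.random.default_rng(0)
demo_base = pd.DataFrame({
    "temperature": rng.normal(450, 15, 5),
    "pressure": rng.normal(2.5, 0.3, 5),
    "flow": rng.normal(120, 10, 5),
    "time": rng.normal(60, 5, 5),
})

demo_features = add_engineered_features(demo_base, temp_center=450.0)
print(demo_features.round(2))
```

Comparing these columns against `df` from Step 1.1 is a quick way to check whether the assumed formulas match the real generator.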

### Step 1.2: Explore Data Distributions

In [None]:
# Summary statistics
print("=" * 60)
print("PROCESS PARAMETER STATISTICS")
print("=" * 60)
print(df.describe().T)
print("\n" + "=" * 60)

# Check for missing values
print(f"\nMissing values: {df.isnull().sum().sum()}")

# Yield statistics
print(f"\nYield Statistics:")
print(f"  Mean: {df['yield_pct'].mean():.2f}%")
print(f"  Std:  {df['yield_pct'].std():.2f}%")
print(f"  Min:  {df['yield_pct'].min():.2f}%")
print(f"  Max:  {df['yield_pct'].max():.2f}%")

### Step 1.3: Visualize Yield Distribution

In [None]:
# Yield distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['yield_pct'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(df['yield_pct'].mean(), color='red', linestyle='--', 
                label=f'Mean: {df["yield_pct"].mean():.1f}%', linewidth=2)
axes[0].set_xlabel('Yield (%)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Yield Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(df['yield_pct'], vert=True)
axes[1].set_ylabel('Yield (%)', fontsize=12)
axes[1].set_title('Yield Box Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"✅ Yield appears roughly normally distributed around {df['yield_pct'].mean():.0f}%")
print("📊 This is typical for semiconductor manufacturing processes")
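A histogram alone can be misleading, so the normality claim is worth backing with a few numbers. The sketch below is one possible check using skewness, excess kurtosis, and the share of points within one standard deviation. It runs on a stand-in series (`demo_yield`, drawn from a normal distribution with an assumed mean of 70 and spread of 3); in the notebook you would pass `df['yield_pct']` instead.

```python
import numpy as np
import pandas as pd

# Stand-in for df['yield_pct']; the mean and spread here are illustrative assumptions
rng = np.random.default_rng(42)
demo_yield = pd.Series(rng.normal(loc=70.0, scale=3.0, size=800), name="yield_pct")

skewness = demo_yield.skew()         # close to 0 for a symmetric distribution
excess_kurtosis = demo_yield.kurt()  # close to 0 for a normal distribution (Fisher definition)

# Share of points within one standard deviation of the mean (about 68% if normal)
z_scores = (demo_yield - demo_yield.mean()) / demo_yield.std()
within_1sd = (z_scores.abs() <= 1).mean()

print(f"Skewness:        {skewness:.3f}")
print(f"Excess kurtosis: {excess_kurtosis:.3f}")
print(f"Within 1 sd:     {within_1sd:.1%}")
```

Values far from these references (for example a strongly negative skew from a ceiling near 100%) would suggest the normal approximation is poor.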

### Step 1.4: Analyze Process Parameter Correlations

In [None]:
# Compute the correlation matrix once and reuse it
corr_matrix = df.corr()

# Correlations with yield, excluding yield's trivial self-correlation (r = 1)
correlations = corr_matrix['yield_pct'].drop('yield_pct').sort_values(ascending=False)

print("=" * 60)
print("FEATURE CORRELATIONS WITH YIELD")
print("=" * 60)
print(correlations)
print("\n" + "=" * 60)

# Visualize correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Process Parameter Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Report measured values next to the expected ones instead of hard-coding them
print("\n💡 Observations:")
print(f"  - pressure_sq: r = {correlations['pressure_sq']:.2f} (expected strong negative, ~-0.85)")
print(f"  - time: r = {correlations['time']:.2f} (expected moderate positive, ~0.45)")
print("  - Interaction terms capture non-linear effects")
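One caveat when reading these numbers: Pearson's r measures linear association only. The sketch below uses made-up data in which yield peaks at the center of the pressure distribution. Raw pressure then shows r near zero even though pressure drives yield, while the squared distance from the optimum correlates strongly. The optimum of 2.5 and all coefficients are illustrative assumptions, not values taken from `generate_yield_process`.

```python
import numpy as np

rng = np.random.default_rng(0)
pressure = rng.normal(2.5, 0.3, 2000)

# Hypothetical process: yield peaks at pressure = 2.5 and falls off quadratically
demo_yield = 70.0 - 20.0 * (pressure - 2.5) ** 2 + rng.normal(0, 0.5, 2000)

# Linear correlation with the raw parameter misses the symmetric dependence
r_raw = np.corrcoef(pressure, demo_yield)[0, 1]

# Correlation with the squared distance from the optimum exposes it
r_centered_sq = np.corrcoef((pressure - 2.5) ** 2, demo_yield)[0, 1]

print(f"r(pressure, yield)           = {r_raw:.3f}")
print(f"r((pressure - 2.5)^2, yield) = {r_centered_sq:.3f}")
```

This is why a weak correlation in the table above should prompt a scatter plot before the feature is dismissed.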

### Step 1.5: Process Parameter vs Yield Scatter Plots

In [None]:
# Scatter plots for base parameters
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
base_params = ['temperature', 'pressure', 'flow', 'time']

for ax, param in zip(axes.flat, base_params):
    
    # Scatter plot
    ax.scatter(df[param], df['yield_pct'], alpha=0.5, s=30)
    
    # Trend line (polynomial fit)
    z = np.polyfit(df[param], df['yield_pct'], 2)
    p = np.poly1d(z)
    x_trend = np.linspace(df[param].min(), df[param].max(), 100)
    ax.plot(x_trend, p(x_trend), 'r--', linewidth=2, label='Trend')
    
    # Correlation annotation
    corr = df[[param, 'yield_pct']].corr().iloc[0, 1]
    ax.text(0.05, 0.95, f'r = {corr:.3f}', 
            transform=ax.transAxes, fontsize=11, 
            verticalalignment='top', 
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    ax.set_xlabel(param.replace('_', ' ').title(), fontsize=11)
    ax.set_ylabel('Yield (%)', fontsize=11)
    ax.set_title(f'{param.replace("_", " ").title()} vs Yield', 
                 fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Pressure shows quadratic relationship (optimal window exists)")
print("✅ Temperature and flow show weak linear relationships")
print("✅ Time shows moderate positive correlation")
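If a quadratic trend suggests an optimal window, the fitted coefficients can estimate where the optimum sits. For a fit `a*x**2 + b*x + c` with `a < 0`, the maximum is at `x = -b / (2*a)`. The sketch below applies this to synthetic data with an assumed optimum at 2.4; in the notebook the same arithmetic would use `np.polyfit(df['pressure'], df['yield_pct'], 2)`.

```python
import numpy as np

# Synthetic stand-in for pressure vs. yield; the true optimum (2.4) is an assumption
rng = np.random.default_rng(1)
pressure = rng.uniform(1.5, 3.5, 500)
demo_yield = 72.0 - 4.0 * (pressure - 2.4) ** 2 + rng.normal(0, 0.3, 500)

# np.polyfit returns coefficients from the highest degree down: a*x^2 + b*x + c
a, b, c = np.polyfit(pressure, demo_yield, 2)

is_maximum = a < 0                  # a concave parabola has a peak, not a trough
optimum = -b / (2 * a)              # vertex of the fitted parabola
peak_yield = np.polyval([a, b, c], optimum)

# The estimate is only meaningful if it falls inside the observed range
in_range = pressure.min() <= optimum <= pressure.max()

print(f"Concave fit: {is_maximum}, optimum inside data range: {in_range}")
print(f"Estimated optimal pressure: {optimum:.2f} (fitted yield {peak_yield:.1f}%)")
```

A vertex outside the observed range, or a positive `a`, means the data do not support an interior optimum and the trend line should be read as monotonic.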

### Exercise 1 Key Takeaways

**✅ Data Characteristics**:
- 800 samples with 8 features + 1 target
- Yield ranges from ~60% to ~77%
- Base parameters are approximately normally distributed (realistic for manufacturing); squared and interaction features need not be

**✅ Important Correlations**:
- **pressure_sq**: Strongest predictor (r ≈ -0.85), reflecting the quadratic pressure effect
- **time**: Moderate predictor (r ≈ 0.45), roughly linear relationship
- **Interaction terms**: Capture non-linear process physics

**✅ Manufacturing Insights**:
- Pressure has an optimal window (values that are too high or too low both reduce yield)
- Longer process time generally improves yield
- Temperature effects are subtle (they require interaction analysis)

**✅ Next Steps**:
- Ready for model training with clean, well-understood data
- Non-linear models (RF, GB) may outperform linear models, so compare both against a linear baseline rather than assuming
- Feature engineering already captures key physics
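
The interaction analysis mentioned in the takeaways can be sketched with a binned table: split two parameters into terciles and compare mean yield per cell. In the made-up data below, temperature affects yield only through its product with flow, so its overall correlation with yield is near zero while the table shows the effect reversing between low and high flow. The process formula and all coefficients are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 3000
demo = pd.DataFrame({
    "temperature": rng.normal(450, 15, n),
    "flow": rng.normal(120, 10, n),
})

# Hypothetical process: temperature matters only via its interaction with flow
demo["yield_pct"] = (
    70.0
    + 0.002 * (demo["temperature"] - 450) * (demo["flow"] - 120)
    + rng.normal(0, 0.2, n)
)

# Marginal correlation hides the effect
r_temp = demo["temperature"].corr(demo["yield_pct"])

# Mean yield per (temperature tercile, flow tercile) cell reveals it
labels = ["low", "mid", "high"]
demo["temp_bin"] = pd.qcut(demo["temperature"], 3, labels=labels)
demo["flow_bin"] = pd.qcut(demo["flow"], 3, labels=labels)
interaction_table = (
    demo.groupby(["temp_bin", "flow_bin"], observed=True)["yield_pct"]
    .mean()
    .unstack()
)

print(f"r(temperature, yield) = {r_temp:.3f}")
print(interaction_table.round(2))
```

If the rows of such a table move in opposite directions across columns, an interaction term like `temp_flow_inter` is doing real work in the model.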

---