# Module 10.2: Testing & Quality Assurance for Semiconductor ML Pipelines

This notebook demonstrates comprehensive testing strategies, coverage measurement, and quality gates for production ML systems in semiconductor manufacturing.

## Learning Objectives
- Understand testing patterns for CLI-based ML pipelines
- Implement coverage measurement with pytest-cov
- Apply code quality checks with flake8 and black
- Validate dataset paths and system health
- Create quality gates for production deployments

## Setup

In [None]:
import sys
import json
import subprocess
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set paths
MODULE_DIR = Path('.').resolve()
PROJECT_ROOT = MODULE_DIR.parent.parent.parent
PIPELINE_SCRIPT = MODULE_DIR / '10.2-testing-qa-pipeline.py'

print(f"Module directory: {MODULE_DIR}")
print(f"Project root: {PROJECT_ROOT}")
print(f"Pipeline script: {PIPELINE_SCRIPT}")

## 1. CLI Testing Fundamentals

### 1.1 Basic CLI Interaction

First, let's explore the QA pipeline CLI interface:

In [None]:
def run_pipeline_cmd(args, capture_output=True):
    """Execute the QA pipeline CLI and return results."""
    result = subprocess.run(
        [sys.executable, str(PIPELINE_SCRIPT)] + args,
        capture_output=capture_output,
        text=True,
        cwd=PROJECT_ROOT
    )
    if capture_output:
        try:
            return json.loads(result.stdout)
        except json.JSONDecodeError:
            return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}
    return result

# Test help command
help_result = run_pipeline_cmd(['--help'], capture_output=False)
print(f"Help command exit code: {help_result.returncode}")
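
The helper above waits indefinitely if the CLI hangs. Below is a minimal sketch of a timeout-aware variant. The name `run_cli_with_timeout` and the `"timeout"` status value are illustrative assumptions, not part of the pipeline script, and the child commands are throwaway one-liners used only to exercise both code paths.

```python
import json
import subprocess
import sys


def run_cli_with_timeout(cmd, timeout_s=30.0):
    """Run a command, parse JSON from stdout, and never block past timeout_s."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired
        return {"status": "timeout", "returncode": None}

    try:
        payload = json.loads(result.stdout)
    except json.JSONDecodeError:
        # Non-JSON output: keep the raw streams for debugging
        payload = {"stdout": result.stdout, "stderr": result.stderr}
    if not isinstance(payload, dict):
        payload = {"output": payload}

    # Always expose the exit code so callers can test error handling
    payload["returncode"] = result.returncode
    return payload


# Hypothetical stand-ins for the real CLI: one fast JSON emitter, one slow command
ok = run_cli_with_timeout(
    [sys.executable, "-c", "print('{\"status\": \"ok\"}')"]
)
slow = run_cli_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
print(ok["status"], slow["status"])
```

Returning the exit code alongside the parsed JSON lets a test assert on both the payload and the process status in one place.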

### 1.2 System Health Check

Let's start with a basic system health check using the `predict` command:

In [None]:
# Run system health check
health_check = run_pipeline_cmd(['predict'])

print("=" * 60)
print("SYSTEM HEALTH CHECK RESULTS")
print("=" * 60)
print(f"Status: {health_check['status']}")
print(f"Overall Health: {health_check['recommendation'].upper()}")
print(f"Target: {health_check['target']}")

# Display import test results
print("\n📦 Package Import Tests:")
import_tests = health_check['health_check']['import_tests']
for package, result in import_tests.items():
    status_icon = "✅" if result['status'] == 'pass' else "❌"
    print(f"  {status_icon} {package}: {result['status']}")
    if result['error']:
        print(f"    Error: {result['error']}")

# Display functionality tests
print("\n🔧 Basic Functionality Tests:")
func_tests = health_check['health_check']['basic_functionality']
for test, result in func_tests.items():
    status_icon = "✅" if result['status'] == 'pass' else "❌"
    print(f"  {status_icon} {test}: {result['status']}")
    if 'shape' in result:
        print(f"    Data shape: {result['shape']}")

## 2. Dataset Path Validation

### 2.1 Validate Dataset Accessibility

One of the most common issues in ML pipelines is incorrect dataset paths. Let's validate our dataset structure:

In [None]:
# Run dataset path validation
path_validation = run_pipeline_cmd(['evaluate', '--check-type', 'paths'])

print("=" * 60)
print("DATASET PATH VALIDATION")
print("=" * 60)

dataset_info = path_validation['checks']['dataset_paths']
print(f"Dataset root: {dataset_info['dataset_root']}")
print(f"All datasets valid: {dataset_info['all_valid']}\n")

# Create summary DataFrame
datasets_data = []
for name, info in dataset_info['datasets'].items():
    datasets_data.append({
        'Dataset': name,
        'Type': info['type'],
        'Exists': info['exists'],
        'Path': info['path']
    })

df_datasets = pd.DataFrame(datasets_data)
print("Dataset Status Summary:")
print(df_datasets.to_string(index=False))

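# Visualize dataset availability
# Count dataset types and availability
availability_counts = df_datasets['Exists'].value_counts()
type_counts = df_datasets['Type'].value_counts()

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Dataset availability pie chart
# Derive labels and colors from the counted values so they stay aligned
# with the data, including when every dataset exists (a single slice)
availability_labels = ['Available' if exists else 'Missing'
                       for exists in availability_counts.index]
availability_colors = ['lightgreen' if exists else 'lightcoral'
                       for exists in availability_counts.index]
ax1.pie(availability_counts.values, labels=availability_labels,
        autopct='%1.1f%%', startangle=90, colors=availability_colors)
ax1.set_title('Dataset Availability')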

# Dataset types bar chart
ax2.bar(type_counts.index, type_counts.values, color=['skyblue', 'lightblue'])
ax2.set_title('Dataset Types')
ax2.set_ylabel('Count')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### 2.2 Cross-Platform Path Compatibility Test

Let's demonstrate how to handle different path formats across platforms:

In [None]:
import os
from pathlib import Path

def test_path_normalization():
    """Test path handling across different platforms."""
    test_cases = [
        "~/datasets/secom/secom.data",          # Home directory
        "../../../datasets/secom/secom.data",   # Relative path
        "/tmp/test_dataset.csv",                 # Absolute path
        # Windows-style path: backslashes are separators only on Windows;
        # on POSIX they are treated as ordinary filename characters
        "datasets\\secom\\secom.data",
    ]
    
    results = []
    for path_str in test_cases:
        try:
            normalized = Path(path_str).expanduser().resolve()
            results.append({
                'Original': path_str,
                'Normalized': str(normalized),
                'Is_Absolute': normalized.is_absolute(),
                'Platform': os.name
            })
        except Exception as e:
            results.append({
                'Original': path_str,
                'Normalized': f"Error: {e}",
                'Is_Absolute': False,
                'Platform': os.name
            })
    
    return pd.DataFrame(results)

path_test_df = test_path_normalization()
print("Path Normalization Test Results:")
print(path_test_df.to_string(index=False))
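
On POSIX systems, `Path` does not split the Windows-style test case on backslashes, so it resolves to a single oddly named file. One possible remedy, sketched here under the assumption that inputs are relative or drive-less paths, is to parse with `PureWindowsPath`, which accepts both separators, and emit forward slashes. The helper name `to_portable` is illustrative.

```python
from pathlib import PureWindowsPath


def to_portable(path_str):
    """Return path_str with forward slashes, whichever separator it arrived with."""
    # PureWindowsPath does no filesystem access and treats both
    # "/" and "\\" as separators on every platform
    return PureWindowsPath(path_str).as_posix()


for raw in ["datasets\\secom\\secom.data", "../../../datasets/secom/secom.data"]:
    print(f"{raw!r} -> {to_portable(raw)!r}")
```

The result can then be passed to `Path(...).expanduser().resolve()` as in the cell above. Paths that carry a drive letter such as `C:` keep it, so they still need separate handling on non-Windows hosts.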

## 3. Code Quality Assessment

### 3.1 Linting and Formatting Checks

Let's run code quality checks on our module:

In [None]:
# Run code quality assessment
quality_check = run_pipeline_cmd(['evaluate', '--check-type', 'lint', '--target-path', str(MODULE_DIR)])

print("=" * 60)
print("CODE QUALITY ASSESSMENT")
print("=" * 60)

code_quality = quality_check['checks']['code_quality']

# Display flake8 results
print("🔍 Flake8 Linting Results:")
flake8_results = code_quality['flake8']
print(f"  Violations: {flake8_results['violations']}")
print(f"  Compliant: {flake8_results['compliant']}")
if flake8_results['violations'] > 0:
    print(f"  Output: {flake8_results['output']}")

# Display black results
print("\n🎨 Black Formatting Results:")
black_results = code_quality['black']
print(f"  Compliant: {black_results['compliant']}")
print(f"  Action: {black_results['action']}")

# Display overall score
print(f"\n📊 Overall Quality Score: {code_quality['overall_score']:.1f}/100")

# Visualize quality metrics
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

# Create quality metrics
metrics = {
    'Flake8 Score': max(0, 100 - flake8_results['violations'] * 5),
    'Black Score': 100.0 if black_results['compliant'] else 70.0,
    'Overall Score': code_quality['overall_score']
}

# Create bar chart
bars = ax.bar(metrics.keys(), metrics.values(), 
              color=['lightblue', 'lightgreen', 'orange'])
ax.set_ylabel('Score')
ax.set_title('Code Quality Metrics')
ax.set_ylim(0, 100)

# Add value labels on bars
for bar, value in zip(bars, metrics.values()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{value:.1f}', ha='center', va='bottom')

# Add quality threshold line
ax.axhline(y=80, color='red', linestyle='--', alpha=0.7, label='Quality Threshold')
ax.legend()

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
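
The chart above derives two of its bars with simple rules: each flake8 violation costs 5 points (floored at 0), and non-compliant formatting scores 70 instead of 100. The sketch below isolates that arithmetic so it can be unit tested. The function names are illustrative, and how the CLI computes its own `overall_score` is not shown in this notebook, so it is left out.

```python
def flake8_score(violations, penalty_per_violation=5):
    """Score out of 100: subtract a fixed penalty per violation, never below 0."""
    return max(0, 100 - violations * penalty_per_violation)


def black_score(compliant):
    """Score out of 100: full marks when formatted, a flat 70 otherwise."""
    return 100.0 if compliant else 70.0


for n in (0, 3, 25):
    print(f"{n} violations -> {flake8_score(n)}")
print(f"black compliant=False -> {black_score(False)}")
```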

### 3.2 Automatic Code Formatting

If there are formatting issues, we can automatically fix them:

In [None]:
# Check if we need to fix formatting issues
if not black_results['compliant']:
    print("🔧 Fixing formatting issues with Black...")
    fix_result = run_pipeline_cmd([
        'evaluate', '--check-type', 'lint', 
        '--target-path', str(MODULE_DIR),
        '--fix-issues'
    ])
    
    fixed_black = fix_result['checks']['code_quality']['black']
    print(f"Formatting fixed: {fixed_black['action']}")
    print(f"Now compliant: {fixed_black['compliant']}")
else:
    print("✅ Code is already properly formatted!")

## 4. Performance Benchmarking

### 4.1 Pipeline Performance Assessment

Let's benchmark key operations in our ML pipeline:

In [None]:
# Run performance benchmarks
perf_check = run_pipeline_cmd(['evaluate', '--check-type', 'performance'])

print("=" * 60)
print("PERFORMANCE BENCHMARKING")
print("=" * 60)

performance_data = perf_check['checks']['performance']
benchmarks = performance_data['benchmarks']

print(f"Overall Performance Score: {performance_data['overall_score']:.1f}/100\n")

# Create performance summary
perf_summary = []
for operation, metrics in benchmarks.items():
    perf_summary.append({
        'Operation': operation.replace('_', ' ').title(),
        'Execution Time (s)': f"{metrics['execution_time']:.4f}",
        'Status': metrics['status'],
        'Details': metrics.get('data_shape', metrics.get('r2_score', metrics.get('prediction', 'N/A')))
    })

df_performance = pd.DataFrame(perf_summary)
print("Performance Benchmark Results:")
print(df_performance.to_string(index=False))

# Visualize performance metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Execution times bar chart
operations = [op.replace('_', ' ').title() for op in benchmarks.keys()]
times = [metrics['execution_time'] for metrics in benchmarks.values()]

bars1 = ax1.bar(operations, times, color=['lightblue', 'lightgreen', 'lightcoral'])
ax1.set_ylabel('Execution Time (seconds)')
ax1.set_title('Operation Execution Times')
ax1.tick_params(axis='x', rotation=45)

# Add value labels
for bar, time in zip(bars1, times):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
             f'{time:.3f}s', ha='center', va='bottom')

# Performance score gauge
score = performance_data['overall_score']
ax2.pie([score, 100-score], labels=['Performance Score', ''], 
        colors=['lightgreen' if score > 70 else 'orange' if score > 50 else 'lightcoral', 'lightgray'],
        startangle=90, counterclock=False)
ax2.set_title(f'Performance Score: {score:.1f}/100')

plt.tight_layout()
plt.show()

### 4.2 Performance Trends Analysis

Let's run multiple benchmarks to analyze performance consistency:

In [None]:
# Run multiple performance benchmarks to check consistency
print("Running multiple performance benchmarks...")

benchmark_results = []
for i in range(5):
    perf_result = run_pipeline_cmd(['evaluate', '--check-type', 'performance'])
    benchmarks = perf_result['checks']['performance']['benchmarks']
    
    run_data = {'run': i + 1}
    for operation, metrics in benchmarks.items():
        run_data[f'{operation}_time'] = metrics['execution_time']
    benchmark_results.append(run_data)

df_trends = pd.DataFrame(benchmark_results)

# Plot performance trends
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

operations = ['data_generation_time', 'model_training_time', 'prediction_time']
labels = ['Data Generation', 'Model Training', 'Prediction']

for op, label in zip(operations, labels):
    ax.plot(df_trends['run'], df_trends[op], marker='o', label=label)

ax.set_xlabel('Benchmark Run')
ax.set_ylabel('Execution Time (seconds)')
ax.set_title('Performance Consistency Analysis')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate performance statistics
print("\nPerformance Statistics:")
for op, label in zip(operations, labels):
    times = df_trends[op]
    print(f"{label}:")
    print(f"  Mean: {times.mean():.4f}s")
    print(f"  Std:  {times.std():.4f}s")
    print(f"  Min:  {times.min():.4f}s")
    print(f"  Max:  {times.max():.4f}s")

## 5. Comprehensive Quality Assessment

### 5.1 Full QA Test Suite

Let's run a comprehensive quality assessment using the train command:

In [None]:
# Run comprehensive QA test suite
print("🚀 Running comprehensive QA test suite...\n")

qa_result = run_pipeline_cmd([
    'train', '--test-suite', 'smoke',
    '--coverage-threshold', '75',
    '--lint-threshold', '85',
    '--performance-threshold', '30'
])

print("=" * 60)
print("COMPREHENSIVE QA ASSESSMENT")
print("=" * 60)

print(f"Status: {qa_result['status']}")

# Display metrics
metrics = qa_result['metrics']
print("\n📊 Test Metrics:")
print(f"  Test Pass Rate: {metrics['test_pass_rate']:.1f}%")
print(f"  Coverage: {metrics['coverage_percentage']:.1f}%")
print(f"  Lint Score: {metrics['lint_score']:.1f}")
print(f"  Performance Score: {metrics['performance_score']:.1f}")
print(f"  Total Tests: {metrics['total_tests']}")
print(f"  Failed Tests: {metrics['failed_tests']}")
print(f"  Execution Time: {metrics['execution_time']:.2f}s")

# Display quality gates
gates = qa_result['quality_gates']
print("\n🚪 Quality Gates:")
for gate, passed in gates.items():
    status_icon = "✅" if passed else "❌"
    print(f"  {status_icon} {gate.replace('_', ' ').title()}: {passed}")

# Visualize QA metrics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Test results pie chart
passed_tests = metrics['total_tests'] - metrics['failed_tests']
ax1.pie([passed_tests, metrics['failed_tests']], 
        labels=['Passed', 'Failed'],
        colors=['lightgreen', 'lightcoral'],
        autopct='%1.1f%%')
ax1.set_title('Test Results')

# Quality metrics bar chart
quality_metrics = {
    'Test Pass Rate': metrics['test_pass_rate'],
    'Coverage': metrics['coverage_percentage'], 
    'Lint Score': metrics['lint_score'],
    'Performance': metrics['performance_score']
}

bars2 = ax2.bar(quality_metrics.keys(), quality_metrics.values(),
                color=['skyblue', 'lightgreen', 'orange', 'lightcoral'])
ax2.set_ylabel('Score')
ax2.set_title('Quality Metrics')
ax2.set_ylim(0, 100)
ax2.tick_params(axis='x', rotation=45)

# Add threshold line
ax2.axhline(y=80, color='red', linestyle='--', alpha=0.7, label='Target Threshold')
ax2.legend()

# Quality gates status
gate_names = list(gates.keys())
gate_status = [1 if gates[gate] else 0 for gate in gate_names]
colors = ['lightgreen' if status else 'lightcoral' for status in gate_status]

ax3.bar([name.replace('_', '\n') for name in gate_names], gate_status, color=colors)
ax3.set_ylabel('Passed (1) / Failed (0)')
ax3.set_title('Quality Gates Status')
ax3.set_ylim(0, 1.2)

# Execution time trend (simulated)
time_data = [metrics['execution_time']] * 5  # Simulate 5 runs
ax4.plot(range(1, 6), time_data, marker='o', color='purple')
ax4.set_xlabel('Run Number')
ax4.set_ylabel('Execution Time (s)')
ax4.set_title('Execution Time Trend')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
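
The `train` call above passes thresholds of 75, 85, and 30 for coverage, lint, and performance. A quality gate is, at its core, a comparison of each metric against its minimum. The sketch below shows that comparison in isolation. The gate names and the sample metric values are hypothetical, and the CLI's own gate logic may differ.

```python
def evaluate_quality_gates(metrics, thresholds):
    """Return {gate_name: passed} for every metric that has a minimum threshold."""
    return {
        f"{name}_gate": metrics[name] >= minimum
        for name, minimum in thresholds.items()
    }


# Thresholds mirror the CLI arguments used above; metric values are made up
thresholds = {"coverage_percentage": 75, "lint_score": 85, "performance_score": 30}
sample_metrics = {
    "coverage_percentage": 78.5,
    "lint_score": 82.0,
    "performance_score": 64.0,
}

sample_gates = evaluate_quality_gates(sample_metrics, thresholds)
for gate, passed in sample_gates.items():
    print(f"{gate}: {'PASS' if passed else 'FAIL'}")
print("Deployable:", all(sample_gates.values()))
```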

### 5.2 Quality Trends Dashboard

Let's create a comprehensive quality dashboard:

In [None]:
# Create comprehensive quality dashboard
def create_quality_dashboard(qa_metrics, path_validation, performance_data):
    """Create a comprehensive quality dashboard."""
    
    fig = plt.figure(figsize=(20, 12))
    gs = fig.add_gridspec(3, 4, height_ratios=[1, 1, 1], width_ratios=[1, 1, 1, 1])
    
    # 1. Overall Quality Score (large gauge)
    ax1 = fig.add_subplot(gs[0, :2])
    overall_score = (qa_metrics['test_pass_rate'] + qa_metrics['coverage_percentage'] + 
                    qa_metrics['lint_score'] + qa_metrics['performance_score']) / 4
    
    # Create gauge chart
    theta = np.linspace(0, np.pi, 100)
    r = np.ones_like(theta)
    
    ax1.plot(theta, r, 'k-', linewidth=3)
    
    # Color segments
    colors = ['red', 'orange', 'yellow', 'lightgreen', 'green']
    segments = np.linspace(0, np.pi, 6)
    
    for i in range(5):
        mask = (theta >= segments[i]) & (theta < segments[i+1])
        ax1.fill_between(theta[mask], 0, r[mask], color=colors[i], alpha=0.3)
    
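    # Add needle. The gauge is drawn on Cartesian axes with red on the left
    # and green on the right, so higher scores must map to larger x values
    needle_angle = np.pi * overall_score / 100
    ax1.plot([needle_angle, needle_angle], [0, 0.8], 'r-', linewidth=4)
    ax1.plot(needle_angle, 0, 'ro', markersize=10)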
    
    ax1.set_ylim(0, 1.2)
    ax1.set_xlim(0, np.pi)
    ax1.set_title(f'Overall Quality Score: {overall_score:.1f}/100', fontsize=16, fontweight='bold')
    ax1.set_xticks([])
    ax1.set_yticks([])
    ax1.axis('off')
    
    # 2. Test Results Summary
    ax2 = fig.add_subplot(gs[0, 2])
    passed = qa_metrics['total_tests'] - qa_metrics['failed_tests']
    ax2.pie([passed, qa_metrics['failed_tests']], 
            labels=['Passed', 'Failed'],
            colors=['lightgreen', 'lightcoral'],
            autopct='%1.0f')
    ax2.set_title('Test Results')
    
    # 3. Dataset Status
    ax3 = fig.add_subplot(gs[0, 3])
    datasets = path_validation['datasets']
    available = sum(1 for d in datasets.values() if d['exists'])
    total = len(datasets)
    ax3.pie([available, total - available],
            labels=['Available', 'Missing'],
            colors=['lightblue', 'lightcoral'],
            autopct='%1.0f')
    ax3.set_title('Dataset Status')
    
    # 4. Quality Metrics Comparison
    ax4 = fig.add_subplot(gs[1, :])
    metrics_data = {
        'Test Pass Rate': qa_metrics['test_pass_rate'],
        'Coverage': qa_metrics['coverage_percentage'],
        'Lint Score': qa_metrics['lint_score'],
        'Performance': qa_metrics['performance_score']
    }
    
    x_pos = np.arange(len(metrics_data))
    bars = ax4.bar(x_pos, metrics_data.values(), 
                   color=['skyblue', 'lightgreen', 'orange', 'lightcoral'])
    
    # Add threshold lines in data coordinates so each one spans its own bar
    # (bars are centered on x_pos with the default width of 0.8)
    thresholds = [95, 75, 85, 70]  # Different thresholds for different metrics
    for i, threshold in enumerate(thresholds):
        ax4.hlines(y=threshold, xmin=i - 0.4, xmax=i + 0.4,
                   colors='red', linestyles='--', alpha=0.7)
    
    ax4.set_ylabel('Score')
    ax4.set_title('Quality Metrics vs Thresholds')
    ax4.set_xticks(x_pos)
    ax4.set_xticklabels(metrics_data.keys(), rotation=45)
    ax4.set_ylim(0, 100)
    
    # Add value labels on bars
    for bar, value in zip(bars, metrics_data.values()):
        ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{value:.1f}', ha='center', va='bottom')
    
    # 5. Performance Breakdown
    ax5 = fig.add_subplot(gs[2, :2])
    perf_ops = list(performance_data['benchmarks'].keys())
    perf_times = [performance_data['benchmarks'][op]['execution_time'] for op in perf_ops]
    
    ax5.barh(perf_ops, perf_times, color=['lightblue', 'lightgreen', 'lightcoral'])
    ax5.set_xlabel('Execution Time (s)')
    ax5.set_title('Performance Breakdown')
    
    # 6. Quality Trend (simulated)
    ax6 = fig.add_subplot(gs[2, 2:])
    days = list(range(1, 8))
    quality_trend = [overall_score + np.random.normal(0, 2) for _ in days]
    
    ax6.plot(days, quality_trend, marker='o', linewidth=2, color='purple')
    ax6.axhline(y=80, color='red', linestyle='--', alpha=0.7, label='Target (80%)')
    ax6.set_xlabel('Day')
    ax6.set_ylabel('Quality Score')
    ax6.set_title('Quality Trend (7 days)')
    ax6.legend()
    ax6.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

# Create dashboard with current data
dashboard_fig = create_quality_dashboard(
    qa_result['metrics'],
    path_validation['checks']['dataset_paths'],
    perf_check['checks']['performance']
)

plt.show()

## 6. Test Automation and CI Integration

### 6.1 Simulated CI Pipeline

Let's simulate a continuous integration pipeline:

In [None]:
import time


def simulate_ci_pipeline():
    """Simulate a complete CI pipeline with quality gates."""

    print("🚀 Starting CI Pipeline Simulation...\n")

    pipeline_steps = [
        ("Code Checkout", lambda: {"status": "success", "message": "Code checked out successfully"}),
        ("Dependency Installation", lambda: {"status": "success", "message": "Dependencies installed"}),
        ("Linting (Flake8)", lambda: run_pipeline_cmd(['evaluate', '--check-type', 'lint'])),
        ("Code Quality (Black)", lambda: {"status": "success", "message": "Code formatting verified"}),
        ("Dataset Validation", lambda: run_pipeline_cmd(['evaluate', '--check-type', 'paths'])),
        ("Smoke Tests", lambda: run_pipeline_cmd(['evaluate', '--check-type', 'smoke'])),
        ("Performance Tests", lambda: run_pipeline_cmd(['evaluate', '--check-type', 'performance'])),
        ("Full Test Suite", lambda: run_pipeline_cmd(['train', '--test-suite', 'smoke'])),
    ]

    results = []

    for step_name, step_func in pipeline_steps:
        print(f"📋 Running: {step_name}...")

        try:
            start_time = time.time()
            result = step_func()
            execution_time = time.time() - start_time

            # Determine success based on result
            if isinstance(result, dict) and 'status' in result:
                success = result['status'] in ['success', 'trained', 'evaluated', 'predicted']
            elif isinstance(result, dict) and 'returncode' in result:
                # Output was not valid JSON: fall back to the process exit code
                # instead of silently treating the step as a pass
                success = result['returncode'] == 0
            else:
                success = True  # Assume success if no status field

            results.append({
                'Step': step_name,
                'Status': 'PASS' if success else 'FAIL',
                'Duration': f"{execution_time:.2f}s",
                'Details': result.get('message', 'Completed successfully')
            })

            status_icon = "✅" if success else "❌"
            print(f"  {status_icon} {step_name}: {'PASS' if success else 'FAIL'} ({execution_time:.2f}s)")

        except Exception as e:
            results.append({
                'Step': step_name,
                'Status': 'ERROR',
                'Duration': 'N/A',
                'Details': str(e)
            })
            print(f"  ❌ {step_name}: ERROR - {str(e)}")

    return results


ci_results = simulate_ci_pipeline()

# Display CI results
print("\n" + "="*60)
print("CI PIPELINE RESULTS")
print("="*60)

df_ci = pd.DataFrame(ci_results)
print(df_ci.to_string(index=False))

# Summary statistics
total_steps = len(ci_results)
passed_steps = sum(1 for r in ci_results if r['Status'] == 'PASS')
failed_steps = sum(1 for r in ci_results if r['Status'] == 'FAIL')
error_steps = sum(1 for r in ci_results if r['Status'] == 'ERROR')

print("\n📊 CI Pipeline Summary:")
print(f"  Total Steps: {total_steps}")
print(f"  Passed: {passed_steps}")
print(f"  Failed: {failed_steps}")
print(f"  Errors: {error_steps}")
print(f"  Success Rate: {(passed_steps/total_steps)*100:.1f}%")

# Visualize CI results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# CI step status
status_counts = df_ci['Status'].value_counts()
colors = {'PASS': 'lightgreen', 'FAIL': 'lightcoral', 'ERROR': 'orange'}
status_colors = [colors.get(status, 'gray') for status in status_counts.index]

ax1.pie(status_counts.values, labels=status_counts.index, colors=status_colors, autopct='%1.0f')
ax1.set_title('CI Pipeline Status Distribution')

# Step execution timeline
step_names = [step[:15] + '...' if len(step) > 15 else step for step in df_ci['Step']]
y_pos = np.arange(len(step_names))
bar_colors = [colors.get(status, 'gray') for status in df_ci['Status']]

# Start the order at 1 so the first step is drawn with a visible bar
ax2.barh(y_pos, range(1, len(step_names) + 1), color=bar_colors)
ax2.set_yticks(y_pos)
ax2.set_yticklabels(step_names)
ax2.set_xlabel('Execution Order')
ax2.set_title('CI Pipeline Execution Timeline')

plt.tight_layout()
plt.show()
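
A real CI runner decides whether to block a merge from the process exit code, not from a chart. As a minimal sketch, the step results can be reduced to a single exit code. The function name `ci_exit_code` and the sample results are illustrative; the dictionary keys match the `results` entries built above.

```python
def ci_exit_code(results):
    """Return 0 only when every step reports PASS; FAIL and ERROR both block."""
    blocking = [r["Step"] for r in results if r["Status"] != "PASS"]
    for step in blocking:
        print(f"Blocking step: {step}")
    return 1 if blocking else 0


# Hypothetical results in the same shape as ci_results
sample_results = [
    {"Step": "Linting (Flake8)", "Status": "PASS"},
    {"Step": "Smoke Tests", "Status": "ERROR"},
    {"Step": "Performance Tests", "Status": "FAIL"},
]
print("Exit code:", ci_exit_code(sample_results))
```

In a standalone script, the final line would typically be `sys.exit(ci_exit_code(ci_results))` so the CI system sees the failure.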

## 7. Testing Best Practices Summary

### 7.1 Key Takeaways

The demonstrations above point to the following testing best practices for semiconductor ML pipelines:

In [None]:
# Summarize testing best practices
best_practices = {
    "CLI Testing": [
        "Test all subcommands (train, evaluate, predict)",
        "Validate JSON output schemas",
        "Test error handling and exit codes",
        "Use subprocess.run with proper timeout handling"
    ],
    "Path Validation": [
        "Test relative path resolution from different module levels",
        "Ensure cross-platform compatibility",
        "Validate dataset accessibility before training",
        "Handle missing datasets gracefully"
    ],
    "Code Quality": [
        "Use flake8 for linting with complexity limits",
        "Apply black for consistent formatting",
        "Set quality thresholds as CI gates",
        "Automate formatting fixes where possible"
    ],
    "Performance Testing": [
        "Benchmark critical operations (train, predict)",
        "Set realistic performance thresholds",
        "Monitor performance trends over time",
        "Test prediction latency for real-time applications"
    ],
    "Coverage & Metrics": [
        "Target 80%+ overall coverage",
        "Require 95%+ coverage for critical business logic",
        "Include manufacturing-specific metrics (PWS, loss)",
        "Track quality metrics trends"
    ]
}

print("=" * 60)
print("SEMICONDUCTOR ML TESTING BEST PRACTICES")
print("=" * 60)

for category, practices in best_practices.items():
    print(f"\n🔧 {category}:")
    for practice in practices:
        print(f"  • {practice}")

# Create best practices summary visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

categories = list(best_practices.keys())
practice_counts = [len(practices) for practices in best_practices.values()]

bars = ax.bar(categories, practice_counts, 
              color=['lightblue', 'lightgreen', 'orange', 'lightcoral', 'lightpink'])

ax.set_ylabel('Number of Best Practices')
ax.set_title('Testing Best Practices by Category')
ax.tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, count in zip(bars, practice_counts):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
            str(count), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()
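
The coverage target above can be enforced with pytest-cov's `--cov-fail-under` option, which makes the test run exit non-zero when total coverage falls below the minimum. The sketch below only builds the command line. The helper name, the package name `my_pipeline`, and the `tests` directory are placeholders, and actually running the command requires pytest and pytest-cov to be installed.

```python
import sys


def build_coverage_command(package, fail_under=80, test_path="tests"):
    """Build a pytest-cov invocation that fails below the given coverage percentage."""
    return [
        sys.executable, "-m", "pytest", test_path,
        f"--cov={package}",               # measure coverage for this package
        "--cov-report=term-missing",      # list uncovered line numbers in the terminal
        f"--cov-fail-under={fail_under}",  # non-zero exit below the threshold
    ]


# "my_pipeline" is a placeholder package name
coverage_cmd = build_coverage_command("my_pipeline", fail_under=80)
print(" ".join(coverage_cmd[1:]))
```

The resulting list can be handed to `subprocess.run` in the same way as the pipeline CLI calls earlier in the notebook.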

### 7.2 Quality Gates Checklist

Here's a comprehensive checklist for implementing quality gates in production:

In [None]:
# Quality gates checklist
quality_gates_checklist = {
    "Pre-Deployment Gates": {
        "Unit Tests": "≥95% pass rate",
        "Integration Tests": "100% pass rate", 
        "Code Coverage": "≥80% overall, ≥95% critical paths",
        "Lint Score": "≥90 (flake8 compliance)",
        "Code Formatting": "100% black compliance",
        "Dataset Validation": "All required datasets accessible",
        "Performance Benchmarks": "Within acceptable thresholds"
    },
    "Manufacturing Specific": {
        "PWS Accuracy": "Prediction Within Spec ≥95%",
        "False Negative Rate": "<1% for critical defects",
        "Prediction Latency": "<100ms for real-time applications",
        "Model Stability": "Consistent results across runs",
        "Feature Validation": "All process parameters within expected ranges"
    },
    "Post-Deployment Monitoring": {
        "Data Drift Detection": "Monitor input distribution shifts",
        "Performance Degradation": "Alert on accuracy decline",
        "System Health": "Continuous availability monitoring",
        "Model Freshness": "Regular retraining schedules",
        "Audit Trail": "Complete prediction logging"
    }
}

print("=" * 60)
print("QUALITY GATES CHECKLIST")
print("=" * 60)

for category, gates in quality_gates_checklist.items():
    print(f"\n🚪 {category}:")
    for gate, threshold in gates.items():
        print(f"  ☐ {gate}: {threshold}")

print("\n" + "="*60)
print("Remember: Quality gates should be tailored to your specific")
print("manufacturing process and business requirements!")
print("="*60)
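
Two of the manufacturing-specific gates reduce to short calculations. The sketch below assumes that PWS (Prediction Within Spec) means the percentage of predictions falling inside the lower and upper spec limits, and that label 1 marks a defect. Both are assumptions made for illustration, as are the function names and the sample values.

```python
def prediction_within_spec(predictions, lower_spec, upper_spec):
    """Percentage of predictions inside [lower_spec, upper_spec] (assumed PWS definition)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    inside = sum(lower_spec <= p <= upper_spec for p in predictions)
    return 100.0 * inside / len(predictions)


def false_negative_rate(y_true, y_pred):
    """Share of actual defects (label 1) that were predicted as good (label 0)."""
    defects = sum(1 for t in y_true if t == 1)
    if defects == 0:
        return 0.0
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return missed / defects


# Hypothetical values for illustration
pws = prediction_within_spec([9.8, 10.1, 10.4, 11.2], lower_spec=9.5, upper_spec=10.5)
fnr = false_negative_rate([1, 1, 0, 1, 0], [1, 0, 0, 1, 0])
print(f"PWS: {pws:.1f}%  FNR: {fnr:.1%}")
print("PWS gate (>=95%):", pws >= 95, "| FNR gate (<1%):", fnr < 0.01)
```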

## Conclusion

This notebook demonstrated comprehensive testing and quality assurance strategies for semiconductor ML pipelines:

1. **CLI Testing Patterns**: How to test command-line interfaces with JSON output validation
2. **Dataset Path Validation**: Ensuring data accessibility across different environments
3. **Code Quality Assessment**: Automated linting and formatting checks
4. **Performance Benchmarking**: Measuring and monitoring pipeline performance
5. **Comprehensive QA Suites**: Running full test suites with quality gates
6. **CI/CD Integration**: Simulating continuous integration pipelines

### Key Benefits for Semiconductor Manufacturing:

- **Reliability**: Robust testing ensures consistent model performance
- **Maintainability**: Quality gates prevent code degradation
- **Scalability**: Automated testing supports rapid development cycles
- **Compliance**: Comprehensive testing supports regulatory requirements
- **Cost Reduction**: Early defect detection reduces production failures

The testing framework and quality gates demonstrated here provide a solid foundation for deploying production-ready ML systems in semiconductor manufacturing environments.