# Module 1.2: Statistical Foundations for Semiconductor Analysis

## Interactive Learning Notebook

Welcome to the hands-on statistical analysis module! This notebook will guide you through essential statistical concepts specifically applied to semiconductor manufacturing and testing.

### What You'll Learn
- Descriptive statistics with realistic (simulated) semiconductor process data
- Process capability analysis (Cp, Cpk calculations)
- Distribution analysis and yield prediction
- Hypothesis testing for process validation
- Control chart implementation for process monitoring

### Prerequisites
- Python fundamentals from Module 1.1
- Basic understanding of statistics
- NumPy and Pandas familiarity

Let's begin our statistical journey in semiconductor analysis!

In [None]:
# Import Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, t, chi2, shapiro, anderson
import warnings
warnings.filterwarnings('ignore')

# Set styling for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("✅ All libraries imported successfully!")
print("📊 Ready for statistical analysis of semiconductor data")

## 1. Generating Realistic Semiconductor Data

Before diving into statistics, let's create realistic semiconductor process data that mimics real-world manufacturing scenarios.
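
One detail of the generator is easy to misread: `np.random.lognormal` takes the mean and standard deviation of the underlying normal distribution (that is, of the log values), not of the leakage values themselves. The sketch below is a standalone illustration; the `default_rng` seed and the sample size of 100,000 are assumptions made for this example. It suggests that passing `np.log(10)` sets the median near 10 nA, while the mean lands slightly higher.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

median_target = 10.0  # nA, same median the generator uses
sigma_log = 0.3       # std dev of log(leakage), not of leakage itself

samples = rng.lognormal(mean=np.log(median_target), sigma=sigma_log, size=100_000)

sample_median = np.median(samples)
sample_mean = samples.mean()
# Theoretical mean of a log-normal distribution: exp(mu + sigma^2 / 2)
theoretical_mean = median_target * np.exp(sigma_log**2 / 2)

print(f"Sample median:    {sample_median:.2f} nA")
print(f"Sample mean:      {sample_mean:.2f} nA")
print(f"Theoretical mean: {theoretical_mean:.2f} nA")
```

Because the mean and median differ for skewed parameters like leakage, it helps to report both when summarizing such data.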

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

def generate_semiconductor_data():
    """Generate realistic semiconductor process data"""
    n_wafers = 100
    n_die_per_wafer = 200
    
    # Generate wafer-level data
    wafer_data = {
        'wafer_id': range(1, n_wafers + 1),
        'lot_id': np.repeat(['LOT_' + str(i) for i in range(1, 21)], 5),
        'process_tool': np.random.choice(['TOOL_A', 'TOOL_B', 'TOOL_C'], n_wafers),
        'operator': np.random.choice(['OP1', 'OP2', 'OP3', 'OP4'], n_wafers),
        'timestamp': pd.date_range('2024-01-01', periods=n_wafers, freq='6h')
    }
    
    # Generate electrical parameters with realistic distributions
    # Threshold voltage (mV) - normally distributed
    vth_mean = 650  # Target 650mV
    vth_std = 25    # Process variation
    vth_data = np.random.normal(vth_mean, vth_std, n_wafers)
    
    # Leakage current (nA) - log-normal distribution
    leakage_median = 10
    leakage_std = 0.3
    leakage_data = np.random.lognormal(np.log(leakage_median), leakage_std, n_wafers)
    
    # Critical dimension (nm) - normally distributed with tool bias
    cd_target = 100
    cd_std = 2
    tool_bias = {'TOOL_A': 0, 'TOOL_B': 1.5, 'TOOL_C': -1.2}
    cd_data = []
    for tool in wafer_data['process_tool']:
        cd_data.append(np.random.normal(cd_target + tool_bias[tool], cd_std))
    
    # Add all electrical parameters to wafer data
    wafer_data.update({
        'threshold_voltage_mv': vth_data,
        'leakage_current_na': leakage_data,
        'critical_dimension_nm': cd_data,
        'yield_percent': np.random.beta(20, 2, n_wafers) * 100  # High yield with some variation
    })
    
    return pd.DataFrame(wafer_data)

# Generate our dataset
df = generate_semiconductor_data()

print(f"📈 Generated semiconductor dataset with {len(df)} wafers")
print(f"🔬 Parameters: {list(df.select_dtypes(include=[np.number]).columns.drop('wafer_id'))}")
print("\n📋 First 5 rows:")
df.head()

## 2. Descriptive Statistics for Process Data

Let's start with fundamental descriptive statistics to understand our process parameters.
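
The statistics function in this section passes `ddof=1` to NumPy so that variance and standard deviation are sample estimates (dividing by n - 1) rather than population values (dividing by n). A small worked example, using five made-up threshold-voltage readings, shows the difference and how the coefficient of variation follows from it.

```python
import numpy as np

# Five hypothetical threshold-voltage readings (mV), made up for this example
vth = np.array([648.0, 652.0, 650.0, 655.0, 645.0])

mean = vth.mean()
pop_var = np.var(vth)             # ddof=0 (NumPy default): divides by n
sample_var = np.var(vth, ddof=1)  # ddof=1: divides by n - 1
sample_std = np.sqrt(sample_var)
cv_percent = sample_std / mean * 100  # coefficient of variation

print(f"Mean:                {mean:.1f} mV")
print(f"Population variance: {pop_var:.2f} mV^2")
print(f"Sample variance:     {sample_var:.2f} mV^2")
print(f"Sample std dev:      {sample_std:.3f} mV")
print(f"CV:                  {cv_percent:.4f} %")
```

With only a handful of wafers the two variance estimates differ noticeably (here 11.6 versus 14.5); the gap shrinks as the sample grows.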

In [None]:
def calculate_descriptive_stats(data, parameter_name):
    """Calculate comprehensive descriptive statistics"""
    stats_dict = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Mode': stats.mode(data, keepdims=True).mode[0],  # rarely informative for continuous data
        'Std Dev': np.std(data, ddof=1),
        'Variance': np.var(data, ddof=1),
        'Min': np.min(data),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'Q1 (25%)': np.percentile(data, 25),
        'Q3 (75%)': np.percentile(data, 75),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25),
        'Skewness': stats.skew(data),
        'Kurtosis': stats.kurtosis(data),  # excess (Fisher) kurtosis: 0 for a normal distribution
        'CV (%)': (np.std(data, ddof=1) / np.mean(data)) * 100
    }
    
    return pd.DataFrame(list(stats_dict.items()), columns=['Statistic', parameter_name])

# Calculate descriptive statistics for key parameters
parameters = ['threshold_voltage_mv', 'leakage_current_na', 'critical_dimension_nm', 'yield_percent']

print("📊 DESCRIPTIVE STATISTICS FOR SEMICONDUCTOR PARAMETERS")
print("=" * 60)

for param in parameters:
    print(f"\n🔍 {param.replace('_', ' ').title()}")
    stats_df = calculate_descriptive_stats(df[param], param)
    print(stats_df.to_string(index=False))

In [None]:
# Create comprehensive visualization of parameter distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Semiconductor Parameter Distributions', fontsize=16, fontweight='bold')

parameters_info = {
    'threshold_voltage_mv': {'title': 'Threshold Voltage (mV)', 'color': 'skyblue'},
    'leakage_current_na': {'title': 'Leakage Current (nA)', 'color': 'lightcoral'},
    'critical_dimension_nm': {'title': 'Critical Dimension (nm)', 'color': 'lightgreen'},
    'yield_percent': {'title': 'Yield (%)', 'color': 'gold'}
}

for i, (param, info) in enumerate(parameters_info.items()):
    row, col = i // 2, i % 2
    ax = axes[row, col]
    
    # Density-normalized histogram (no KDE is drawn)
    ax.hist(df[param], bins=20, alpha=0.7, color=info['color'], density=True, edgecolor='black')
    
    # Add normal distribution overlay
    mu, sigma = df[param].mean(), df[param].std()
    x = np.linspace(df[param].min(), df[param].max(), 100)
    normal_curve = stats.norm.pdf(x, mu, sigma)
    ax.plot(x, normal_curve, 'r-', linewidth=2, label=f'Normal(μ={mu:.2f}, σ={sigma:.2f})')
    
    # Add statistics annotations
    ax.axvline(mu, color='red', linestyle='--', alpha=0.8, label=f'Mean: {mu:.2f}')
    ax.axvline(np.median(df[param]), color='orange', linestyle='--', alpha=0.8, label=f'Median: {np.median(df[param]):.2f}')
    
    ax.set_title(info['title'], fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Process Capability Analysis

Process capability indices are crucial for assessing whether a process can consistently produce parts within specification limits.

### Key Capability Indices:
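- **Cp**: Potential capability: specification width divided by 6σ, ignoring where the process is centered
- **Cpk**: Actual capability: distance from the mean to the nearer specification limit divided by 3σ, so off-center processes are penalized
- **Pp**: Process performance: same formula as Cp, but with the overall (long-term) standard deviation in place of the within-subgroup σ
- **Ppk**: Process performance index: same formula as Cpk, using the overall (long-term) standard deviation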
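
The practical difference between the Cp and Pp families is the sigma estimate. The sketch below is a minimal illustration under stated assumptions: the process (subgroup means drifting with a 15 mV spread, 10 mV within-subgroup noise), the seed, and the helper name `capability` are all hypothetical. It estimates short-term sigma from the average subgroup range divided by the tabulated constant d2 = 2.326 for subgroups of five, and long-term sigma from all readings pooled together.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch

lsl, usl = 600.0, 700.0
n_subgroups, subgroup_size = 20, 5
d2 = 2.326  # tabulated control-chart constant for subgroups of 5

# Hypothetical process: each subgroup's mean wanders around 650 mV
subgroup_means = rng.normal(650.0, 15.0, size=(n_subgroups, 1))
data = subgroup_means + rng.normal(0.0, 10.0, size=(n_subgroups, subgroup_size))

# Short-term sigma from the average subgroup range (R-bar / d2)
ranges = data.max(axis=1) - data.min(axis=1)
sigma_within = ranges.mean() / d2

# Long-term sigma: sample standard deviation of all readings pooled together
sigma_overall = data.std(ddof=1)
grand_mean = data.mean()

def capability(mean, sigma):
    """Return (spread-only index, centering-aware index) for a sigma estimate."""
    spread = (usl - lsl) / (6 * sigma)
    centered = min(usl - mean, mean - lsl) / (3 * sigma)
    return spread, centered

cp, cpk = capability(grand_mean, sigma_within)
pp, ppk = capability(grand_mean, sigma_overall)

print(f"sigma_within = {sigma_within:.2f}, sigma_overall = {sigma_overall:.2f}")
print(f"Cp = {cp:.3f}, Cpk = {cpk:.3f}")
print(f"Pp = {pp:.3f}, Ppk = {ppk:.3f}")
```

When subgroup means drift, the overall sigma exceeds the within-subgroup sigma, so Pp and Ppk come out lower than Cp and Cpk. A large gap between the two families is a hint that the process is not stable over time.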

In [None]:
def calculate_process_capability(data, lsl, usl, target=None):
    """
    Calculate process capability indices
    
    Parameters:
    data: array-like, process data
    lsl: float, lower specification limit
    usl: float, upper specification limit
    target: float, target value (optional, defaults to midpoint)
    
    Returns:
    dict: capability indices and related statistics
    """
    if target is None:
        target = (lsl + usl) / 2
    
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    
    # Basic capability indices
    cp = (usl - lsl) / (6 * std)
    cpu = (usl - mean) / (3 * std)
    cpl = (mean - lsl) / (3 * std)
    cpk = min(cpu, cpl)
    
    # Process performance (Pp, Ppk) uses the overall standard deviation.
    # `std` above is already the overall sample standard deviation, so the
    # values coincide here; a strict Cp/Cpk would use a within-subgroup sigma.
    pp = cp
    ppk = cpk
    
    # Additional metrics
    percent_defective = (stats.norm.cdf(lsl, mean, std) + 
                        (1 - stats.norm.cdf(usl, mean, std))) * 100
    
    sigma_level = min(cpu, cpl) * 3
    
    results = {
        'Mean': mean,
        'Std Dev': std,
        'LSL': lsl,
        'USL': usl,
        'Target': target,
        'Cp': cp,
        'Cpu': cpu,
        'Cpl': cpl,
        'Cpk': cpk,
        'Pp': pp,
        'Ppk': ppk,
        'Percent Defective': percent_defective,
        'Sigma Level': sigma_level,
        'Process Centered': abs(mean - target) < (usl - lsl) * 0.05  # Within 5% of spec width
    }
    
    return results

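# Define specification limits for our parameters
# Note: the capability formulas assume normal data, so results for leakage
# (log-normal) and yield (beta-distributed) are indicative only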
specifications = {
    'threshold_voltage_mv': {'lsl': 600, 'usl': 700, 'target': 650},
    'critical_dimension_nm': {'lsl': 95, 'usl': 105, 'target': 100},
    'leakage_current_na': {'lsl': 0, 'usl': 50, 'target': 10},
    'yield_percent': {'lsl': 90, 'usl': 100, 'target': 95}
}

print("🎯 PROCESS CAPABILITY ANALYSIS")
print("=" * 50)

capability_results = {}
for param, specs in specifications.items():
    print(f"\n📊 {param.replace('_', ' ').title()}")
    print("-" * 40)
    
    results = calculate_process_capability(df[param], **specs)
    capability_results[param] = results
    
    # Display key results
    print(f"Cp:  {results['Cp']:.3f} | Cpk: {results['Cpk']:.3f}")
    print(f"Mean: {results['Mean']:.2f} | Std: {results['Std Dev']:.2f}")
    print(f"Defective: {results['Percent Defective']:.4f}%")
    print(f"Sigma Level: {results['Sigma Level']:.2f}")
    print(f"Centered: {'✅' if results['Process Centered'] else '❌'}")
    
    # Capability interpretation
    if results['Cpk'] >= 1.33:
        capability_status = "🟢 Excellent (Cpk ≥ 1.33)"
    elif results['Cpk'] >= 1.0:
        capability_status = "🟡 Adequate (1.0 ≤ Cpk < 1.33)"
    else:
        capability_status = "🔴 Poor (Cpk < 1.0)"
    
    print(f"Status: {capability_status}")

In [None]:
# Visualize process capability for threshold voltage
def plot_process_capability(data, specs, title):
    """Create a comprehensive process capability plot"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Left plot: Histogram with specification limits
    ax1.hist(data, bins=20, alpha=0.7, color='skyblue', density=True, edgecolor='black')
    
    # Add specification limits
    ax1.axvline(specs['lsl'], color='red', linestyle='-', linewidth=2, label=f'LSL: {specs["lsl"]}')
    ax1.axvline(specs['usl'], color='red', linestyle='-', linewidth=2, label=f'USL: {specs["usl"]}')
    ax1.axvline(specs['target'], color='green', linestyle='--', linewidth=2, label=f'Target: {specs["target"]}')
    ax1.axvline(np.mean(data), color='blue', linestyle='--', linewidth=2, label=f'Mean: {np.mean(data):.2f}')
    
    # Add normal distribution overlay
    mu, sigma = np.mean(data), np.std(data, ddof=1)
    x = np.linspace(data.min(), data.max(), 100)
    normal_curve = stats.norm.pdf(x, mu, sigma)
    ax1.plot(x, normal_curve, 'black', linewidth=2, alpha=0.8, label='Normal Fit')
    
    ax1.set_title(f'{title} - Process Capability', fontweight='bold')
    ax1.set_xlabel('Value')
    ax1.set_ylabel('Density')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Right plot: Capability metrics visualization
    results = calculate_process_capability(data, **specs)
    
    metrics = ['Cp', 'Cpk', 'Cpu', 'Cpl']
    values = [results[metric] for metric in metrics]
    colors = ['blue' if val >= 1.33 else 'orange' if val >= 1.0 else 'red' for val in values]
    
    bars = ax2.bar(metrics, values, color=colors, alpha=0.7, edgecolor='black')
    
    # Add benchmark lines
    ax2.axhline(y=1.0, color='orange', linestyle='--', alpha=0.8, label='Minimum (1.0)')
    ax2.axhline(y=1.33, color='green', linestyle='--', alpha=0.8, label='Target (1.33)')
    
    # Add value labels on bars
    for bar, value in zip(bars, values):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{value:.3f}', ha='center', va='bottom', fontweight='bold')
    
    ax2.set_title('Capability Indices', fontweight='bold')
    ax2.set_ylabel('Index Value')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, max(values) * 1.2)
    
    plt.tight_layout()
    plt.show()

# Plot capability for threshold voltage
plot_process_capability(df['threshold_voltage_mv'], 
                       specifications['threshold_voltage_mv'], 
                       'Threshold Voltage (mV)')

## 4. Distribution Analysis and Normality Testing

Understanding the underlying distribution of process parameters is crucial for proper statistical analysis and yield prediction.
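
Skewed parameters such as leakage current often look non-normal on their raw scale but close to normal after a log transform. The sketch below illustrates this with the Shapiro-Wilk test. It uses deliberately exaggerated hypothetical data (log-normal with sigma = 1.0, larger than the 0.3 used for leakage in this notebook) and an arbitrary seed, so the contrast is easy to see.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed for this sketch

# Strongly skewed hypothetical data: log-normal with sigma = 1.0
skewed = rng.lognormal(mean=np.log(10), sigma=1.0, size=200)

# Shapiro-Wilk on the raw values and on their logarithms
w_raw, p_raw = stats.shapiro(skewed)
w_log, p_log = stats.shapiro(np.log(skewed))

print(f"Raw scale: W = {w_raw:.4f}, p = {p_raw:.2e}")
print(f"Log scale: W = {w_log:.4f}, p = {p_log:.4f}")
```

If the log-scale data pass the normality checks, normal-theory methods (capability indices, yield from the normal CDF) are better applied to the transformed values and transformed specification limits.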

In [None]:
def comprehensive_normality_tests(data, parameter_name):
    """Perform multiple normality tests"""
    print(f"🔍 NORMALITY TESTING: {parameter_name}")
    print("-" * 50)
    
    # Shapiro-Wilk Test
    shapiro_stat, shapiro_p = shapiro(data)
    print(f"Shapiro-Wilk: W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
    
    # Anderson-Darling Test
    anderson_result = anderson(data, dist='norm')
    print(f"Anderson-Darling: A² = {anderson_result.statistic:.4f}")
    
    # Kolmogorov-Smirnov Test
    # The normal parameters are estimated from the same data, so this p-value
    # is optimistic (a Lilliefors-type correction would be stricter)
    mu, sigma = stats.norm.fit(data)
    ks_stat, ks_p = stats.kstest(data, 'norm', args=(mu, sigma))
    print(f"Kolmogorov-Smirnov: D = {ks_stat:.4f}, p = {ks_p:.4f}")
    
    # D'Agostino's Test
    dagostino_stat, dagostino_p = stats.normaltest(data)
    print(f"D'Agostino: χ² = {dagostino_stat:.4f}, p = {dagostino_p:.4f}")
    
    # Interpretation
    alpha = 0.05
    tests_passed = sum([
        shapiro_p > alpha,
        ks_p > alpha,
        dagostino_p > alpha
    ])
    
    print(f"\nResult: {tests_passed}/3 tests support normality (α = {alpha})")
    if tests_passed >= 2:
        print("✅ No strong evidence against normality")
    else:
        print("❌ Data may not be normally distributed")
    
    return {
        'shapiro_w': shapiro_stat,
        'shapiro_p': shapiro_p,
        'anderson_a2': anderson_result.statistic,
        'ks_d': ks_stat,
        'ks_p': ks_p,
        'dagostino_chi2': dagostino_stat,
        'dagostino_p': dagostino_p,
        'normal_likely': tests_passed >= 2
    }

# Test normality for all parameters
normality_results = {}
for param in parameters:
    normality_results[param] = comprehensive_normality_tests(df[param], param.replace('_', ' ').title())
    print("\n" + "="*60 + "\n")

In [None]:
# Create Q-Q plots for visual normality assessment
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Q-Q Plots for Normality Assessment', fontsize=16, fontweight='bold')

for i, param in enumerate(parameters):
    row, col = i // 2, i % 2
    ax = axes[row, col]
    
    # Create Q-Q plot
    stats.probplot(df[param], dist="norm", plot=ax)
    ax.set_title(f'{param.replace("_", " ").title()} Q-Q Plot', fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # Add R² value for linearity assessment
    # Fit line to Q-Q plot data
    theoretical_quantiles, sample_quantiles = stats.probplot(df[param], dist="norm")[:2]
    slope, intercept, r_value, _, _ = stats.linregress(theoretical_quantiles, sample_quantiles)
    ax.text(0.05, 0.95, f'R² = {r_value**2:.4f}', transform=ax.transAxes, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
            verticalalignment='top', fontweight='bold')

plt.tight_layout()
plt.show()

## 5. Hypothesis Testing for Process Validation

Hypothesis testing helps us make data-driven decisions about process changes, tool comparisons, and quality improvements.
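
To make the mechanics concrete, the sketch below computes Welch's t statistic by hand and checks it against `scipy.stats.ttest_ind` with `equal_var=False`, then labels the effect size with Cohen's conventional cut-offs (0.2, 0.5, 0.8). The two sets of readings and the helper name `effect_label` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical critical-dimension readings (nm) from two tools
tool_a = np.array([100.2, 99.8, 100.5, 99.9, 100.1, 100.4])
tool_b = np.array([101.6, 101.2, 102.0, 101.1, 101.9, 101.5])

# Welch's t statistic: difference in means over the unpooled standard error
se = np.sqrt(tool_a.var(ddof=1) / len(tool_a) + tool_b.var(ddof=1) / len(tool_b))
t_manual = (tool_a.mean() - tool_b.mean()) / se

t_scipy, p_scipy = stats.ttest_ind(tool_a, tool_b, equal_var=False)

# Cohen's d with a pooled standard deviation (group sizes are equal here)
pooled_std = np.sqrt((tool_a.var(ddof=1) + tool_b.var(ddof=1)) / 2)
cohens_d = (tool_a.mean() - tool_b.mean()) / pooled_std

def effect_label(d):
    """Label |d| using Cohen's conventional cut-offs of 0.2, 0.5, and 0.8."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

print(f"t (manual) = {t_manual:.4f}, t (SciPy) = {t_scipy:.4f}, p = {p_scipy:.2e}")
print(f"Cohen's d = {cohens_d:.2f} ({effect_label(cohens_d)})")
```

The p-value says whether a difference is detectable; the effect size says whether it is large enough to matter, which is usually the more relevant question for tool matching.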

In [None]:
def perform_t_test_analysis():
    """Compare process parameters between different tools"""
    print("🧪 HYPOTHESIS TESTING: Tool Comparison")
    print("=" * 50)
    
    # Compare threshold voltage between Tool A and Tool B
    tool_a_vth = df[df['process_tool'] == 'TOOL_A']['threshold_voltage_mv']
    tool_b_vth = df[df['process_tool'] == 'TOOL_B']['threshold_voltage_mv']
    
    print(f"Tool A samples: {len(tool_a_vth)}")
    print(f"Tool B samples: {len(tool_b_vth)}")
    print(f"Tool A mean: {tool_a_vth.mean():.2f} ± {tool_a_vth.std():.2f}")
    print(f"Tool B mean: {tool_b_vth.mean():.2f} ± {tool_b_vth.std():.2f}")
    
    # Test for equal variances (Levene's test)
    levene_stat, levene_p = stats.levene(tool_a_vth, tool_b_vth)
    equal_var = levene_p > 0.05
    
    print(f"\nLevene's test for equal variances: p = {levene_p:.4f}")
    print(f"Equal variances assumed: {'Yes' if equal_var else 'No'}")
    
    # Perform two-sample t-test
    t_stat, t_p = stats.ttest_ind(tool_a_vth, tool_b_vth, equal_var=equal_var)
    
    print(f"\nTwo-sample t-test:")
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {t_p:.4f}")
    
    # Interpretation
    alpha = 0.05
    if t_p < alpha:
        print(f"✅ Significant difference detected (p < {alpha})")
        print("🔧 Recommendation: Investigate tool calibration")
    else:
        print(f"❌ No significant difference (p ≥ {alpha})")
        print("✅ Tools appear to perform similarly")
    
    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(tool_a_vth) - 1) * tool_a_vth.var() + 
                         (len(tool_b_vth) - 1) * tool_b_vth.var()) / 
                        (len(tool_a_vth) + len(tool_b_vth) - 2))
    cohens_d = (tool_a_vth.mean() - tool_b_vth.mean()) / pooled_std
    
    print(f"Effect size (Cohen's d): {cohens_d:.4f}")
    
    if abs(cohens_d) < 0.2:
        effect_interpretation = "Negligible effect"
    elif abs(cohens_d) < 0.5:
        effect_interpretation = "Small effect"
    elif abs(cohens_d) < 0.8:
        effect_interpretation = "Medium effect"
    else:
        effect_interpretation = "Large effect"
    
    print(f"Effect interpretation: {effect_interpretation}")
    
    return {
        'tool_a_mean': tool_a_vth.mean(),
        'tool_b_mean': tool_b_vth.mean(),
        't_statistic': t_stat,
        'p_value': t_p,
        'cohens_d': cohens_d,
        'significant': t_p < alpha
    }

# Perform t-test analysis
t_test_results = perform_t_test_analysis()

In [None]:
def perform_anova_analysis():
    """Compare process parameters across all tools using ANOVA"""
    print("\n🧪 ANOVA: Multi-Tool Comparison")
    print("=" * 40)
    
    # Prepare data for ANOVA
    tool_groups = []
    for tool in df['process_tool'].unique():
        tool_data = df[df['process_tool'] == tool]['critical_dimension_nm']
        tool_groups.append(tool_data)
        print(f"{tool}: n={len(tool_data)}, mean={tool_data.mean():.2f}, std={tool_data.std():.2f}")
    
    # Perform one-way ANOVA
    f_stat, f_p = stats.f_oneway(*tool_groups)
    
    print(f"\nOne-way ANOVA Results:")
    print(f"F-statistic: {f_stat:.4f}")
    print(f"p-value: {f_p:.4f}")
    
    # Interpretation
    alpha = 0.05
    if f_p < alpha:
        print(f"✅ Significant differences between tools detected (p < {alpha})")
        print("🔧 Recommendation: Perform post-hoc analysis to identify which tools differ")
        
        # Perform pairwise t-tests with Bonferroni correction
        from itertools import combinations
        tools = df['process_tool'].unique()
        n_comparisons = len(list(combinations(tools, 2)))
        bonferroni_alpha = alpha / n_comparisons
        
        print(f"\nPost-hoc pairwise comparisons (Bonferroni α = {bonferroni_alpha:.4f}):")
        
        for tool1, tool2 in combinations(tools, 2):
            group1 = df[df['process_tool'] == tool1]['critical_dimension_nm']
            group2 = df[df['process_tool'] == tool2]['critical_dimension_nm']
            
            t_stat, t_p = stats.ttest_ind(group1, group2)
            significant = t_p < bonferroni_alpha
            
            print(f"{tool1} vs {tool2}: p = {t_p:.4f} {'*' if significant else ''}")
    else:
        print(f"❌ No significant differences between tools (p ≥ {alpha})")
        print("✅ All tools perform similarly for critical dimension")
    
    return {
        'f_statistic': f_stat,
        'p_value': f_p,
        'significant': f_p < alpha
    }

# Perform ANOVA analysis
anova_results = perform_anova_analysis()

## 6. Statistical Process Control (SPC) and Control Charts

Control charts are essential tools for monitoring process stability and detecting special cause variation.
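
The X-bar and R chart limits rely on tabulated constants (A2, D3, D4) that depend on subgroup size. As a sketch of where they come from, the snippet below derives them from the more basic constants d2 and d3, which describe the mean and spread of the range of a normal sample. The dictionary `D2_D3` and the helper `chart_constants` are names invented for this example; the d2 and d3 values are the standard tabulated ones for subgroup sizes 2 to 5.

```python
import math

# Tabulated constants for normally distributed data:
# d2 = expected subgroup range / sigma, d3 = std dev of subgroup range / sigma
D2_D3 = {2: (1.128, 0.853), 3: (1.693, 0.888), 4: (2.059, 0.880), 5: (2.326, 0.864)}

def chart_constants(n):
    """Derive 3-sigma X-bar/R chart constants for subgroup size n."""
    d2, d3 = D2_D3[n]
    a2 = 3 / (d2 * math.sqrt(n))          # X-bar limits: grand mean +/- A2 * R-bar
    d3_const = max(0.0, 1 - 3 * d3 / d2)  # R chart lower limit factor (floored at 0)
    d4_const = 1 + 3 * d3 / d2            # R chart upper limit factor
    return a2, d3_const, d4_const

A2, D3, D4 = chart_constants(5)
print(f"n = 5: A2 = {A2:.3f}, D3 = {D3:.3f}, D4 = {D4:.3f}")
```

For subgroups of five this reproduces A2 ≈ 0.577, D3 = 0, and D4 ≈ 2.114. If the subgroup size changes, the constants must change with it.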

In [None]:
def create_control_chart_data():
    """Create time-series data for control chart demonstration"""
    # Sort data by timestamp for time series analysis
    df_sorted = df.sort_values('timestamp').reset_index(drop=True)
    
    # Create subgroups (every 5 consecutive wafers)
    subgroup_size = 5
    n_subgroups = len(df_sorted) // subgroup_size
    
    subgroup_data = []
    for i in range(n_subgroups):
        start_idx = i * subgroup_size
        end_idx = start_idx + subgroup_size
        subgroup = df_sorted.iloc[start_idx:end_idx]
        
        subgroup_stats = {
            'subgroup': i + 1,
            'timestamp': subgroup['timestamp'].iloc[0],
            'mean': subgroup['threshold_voltage_mv'].mean(),
            'range': subgroup['threshold_voltage_mv'].max() - subgroup['threshold_voltage_mv'].min(),
            'std': subgroup['threshold_voltage_mv'].std(),
            'median': subgroup['threshold_voltage_mv'].median()
        }
        subgroup_data.append(subgroup_stats)
    
    return pd.DataFrame(subgroup_data)

def calculate_control_limits(data, chart_type='xbar'):
    """Calculate control limits for different chart types"""
    
    # Control chart constants for subgroup size = 5
    A2 = 0.577  # For X-bar chart
    D3 = 0.0    # For R chart (lower limit)
    D4 = 2.114  # For R chart (upper limit)
    
    if chart_type == 'xbar':
        center_line = data['mean'].mean()
        r_bar = data['range'].mean()
        
        ucl = center_line + A2 * r_bar
        lcl = center_line - A2 * r_bar
        
        return center_line, ucl, lcl
    
    elif chart_type == 'range':
        center_line = data['range'].mean()
        
        ucl = D4 * center_line
        lcl = D3 * center_line
        
        return center_line, ucl, lcl

# Create control chart data
control_data = create_control_chart_data()

print(f"📊 CONTROL CHART ANALYSIS")
print(f"Subgroups created: {len(control_data)}")
print(f"Subgroup size: 5 wafers")
print(f"Parameter: Threshold Voltage (mV)")

# Calculate control limits
xbar_cl, xbar_ucl, xbar_lcl = calculate_control_limits(control_data, 'xbar')
r_cl, r_ucl, r_lcl = calculate_control_limits(control_data, 'range')

print(f"\nX-bar Chart Limits:")
print(f"UCL: {xbar_ucl:.2f} mV")
print(f"CL:  {xbar_cl:.2f} mV")
print(f"LCL: {xbar_lcl:.2f} mV")

print(f"\nR Chart Limits:")
print(f"UCL: {r_ucl:.2f} mV")
print(f"CL:  {r_cl:.2f} mV")
print(f"LCL: {r_lcl:.2f} mV")

In [None]:
# Create X-bar and R control charts
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
fig.suptitle('Statistical Process Control Charts - Threshold Voltage', fontsize=16, fontweight='bold')

# X-bar Chart
ax1.plot(control_data['subgroup'], control_data['mean'], 'bo-', linewidth=2, markersize=6, label='Subgroup Means')
ax1.axhline(y=xbar_cl, color='green', linestyle='-', linewidth=2, label=f'Center Line ({xbar_cl:.2f})')
ax1.axhline(y=xbar_ucl, color='red', linestyle='--', linewidth=2, label=f'UCL ({xbar_ucl:.2f})')
ax1.axhline(y=xbar_lcl, color='red', linestyle='--', linewidth=2, label=f'LCL ({xbar_lcl:.2f})')

# Check for out-of-control points
ooc_points_xbar = control_data[(control_data['mean'] > xbar_ucl) | (control_data['mean'] < xbar_lcl)]
if not ooc_points_xbar.empty:
    ax1.scatter(ooc_points_xbar['subgroup'], ooc_points_xbar['mean'], 
               color='red', s=100, marker='x', linewidth=3, label='Out of Control')

ax1.set_title('X-bar Chart (Process Centering)', fontweight='bold')
ax1.set_xlabel('Subgroup Number')
ax1.set_ylabel('Mean Threshold Voltage (mV)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# R Chart
ax2.plot(control_data['subgroup'], control_data['range'], 'go-', linewidth=2, markersize=6, label='Subgroup Ranges')
ax2.axhline(y=r_cl, color='green', linestyle='-', linewidth=2, label=f'Center Line ({r_cl:.2f})')
ax2.axhline(y=r_ucl, color='red', linestyle='--', linewidth=2, label=f'UCL ({r_ucl:.2f})')
ax2.axhline(y=r_lcl, color='red', linestyle='--', linewidth=2, label=f'LCL ({r_lcl:.2f})')

# Check for out-of-control points
ooc_points_r = control_data[(control_data['range'] > r_ucl) | (control_data['range'] < r_lcl)]
if not ooc_points_r.empty:
    ax2.scatter(ooc_points_r['subgroup'], ooc_points_r['range'], 
               color='red', s=100, marker='x', linewidth=3, label='Out of Control')

ax2.set_title('R Chart (Process Variation)', fontweight='bold')
ax2.set_xlabel('Subgroup Number')
ax2.set_ylabel('Range (mV)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Analyze control chart results
print("\n🔍 CONTROL CHART ANALYSIS RESULTS")
print("=" * 40)

xbar_ooc = len(ooc_points_xbar)
r_ooc = len(ooc_points_r)

print(f"X-bar Chart: {xbar_ooc}/{len(control_data)} points out of control")
print(f"R Chart: {r_ooc}/{len(control_data)} points out of control")

if xbar_ooc == 0 and r_ooc == 0:
    print("✅ Process appears to be in statistical control")
    print("✅ No special cause variation detected")
else:
    print("⚠️ Special cause variation detected")
    if xbar_ooc > 0:
        print(f"  📍 Process centering issues in subgroups: {ooc_points_xbar['subgroup'].tolist()}")
    if r_ooc > 0:
        print(f"  📍 Process variation issues in subgroups: {ooc_points_r['subgroup'].tolist()}")

## 7. Yield Prediction and Modeling

Using statistical models to predict yield based on process parameters and specifications.
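
When parameters are independent and a die must pass every specification, the overall yield is the product of the per-parameter pass probabilities. As a cross-check on simulation, the sketch below computes that product analytically from the design values used in this notebook (Vth ~ Normal(650, 25) within 600 to 700 mV, CD ~ Normal(100, 2) within 95 to 105 nm, log-normal leakage with median 10 and log-sigma 0.3 below 50 nA). Independence is an assumption of the example, not a property of real processes.

```python
from math import log
from scipy import stats

# Pass probability for each parameter under its design distribution
vth_yield = stats.norm.cdf(700, 650, 25) - stats.norm.cdf(600, 650, 25)
cd_yield = stats.norm.cdf(105, 100, 2) - stats.norm.cdf(95, 100, 2)
# Log-normal leakage: log(leakage) ~ Normal(log(10), 0.3), upper limit only
leakage_yield = stats.norm.cdf(log(50), log(10), 0.3)

# Independent parameters: overall pass probability is the product
overall_yield = vth_yield * cd_yield * leakage_yield * 100

print(f"Vth yield:     {vth_yield * 100:.2f}%")
print(f"CD yield:      {cd_yield * 100:.2f}%")
print(f"Leakage yield: {leakage_yield * 100:.4f}%")
print(f"Overall yield: {overall_yield:.2f}%")
```

A Monte Carlo estimate drawn from the same distributions should agree with this figure to within sampling error; a larger gap would point to a bug or to correlated parameters.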

In [None]:
def predict_yield_from_distribution(data, lsl, usl):
    """Predict yield based on normal distribution assumption"""
    mu = np.mean(data)
    sigma = np.std(data, ddof=1)
    
    # Calculate probability of being within specifications
    prob_within_spec = stats.norm.cdf(usl, mu, sigma) - stats.norm.cdf(lsl, mu, sigma)
    yield_percent = prob_within_spec * 100
    
    # Calculate defect rates
    lower_defects = stats.norm.cdf(lsl, mu, sigma) * 100
    upper_defects = (1 - stats.norm.cdf(usl, mu, sigma)) * 100
    
    return {
        'predicted_yield': yield_percent,
        'lower_defects': lower_defects,
        'upper_defects': upper_defects,
        'total_defects': lower_defects + upper_defects
    }

def monte_carlo_yield_simulation(n_simulations=10000):
    """Monte Carlo simulation for yield prediction"""
    np.random.seed(42)
    
    # Simulate process parameters
    vth_sim = np.random.normal(650, 25, n_simulations)
    cd_sim = np.random.normal(100, 2, n_simulations)
    leakage_sim = np.random.lognormal(np.log(10), 0.3, n_simulations)
    
    # Apply specifications
    vth_pass = (vth_sim >= 600) & (vth_sim <= 700)
    cd_pass = (cd_sim >= 95) & (cd_sim <= 105)
    leakage_pass = leakage_sim <= 50
    
    # Overall yield (all parameters must pass)
    overall_pass = vth_pass & cd_pass & leakage_pass
    simulated_yield = np.mean(overall_pass) * 100
    
    # Individual parameter yields
    vth_yield = np.mean(vth_pass) * 100
    cd_yield = np.mean(cd_pass) * 100
    leakage_yield = np.mean(leakage_pass) * 100
    
    return {
        'overall_yield': simulated_yield,
        'vth_yield': vth_yield,
        'cd_yield': cd_yield,
        'leakage_yield': leakage_yield,
        'n_simulations': n_simulations
    }

print("🎯 YIELD PREDICTION ANALYSIS")
print("=" * 40)

# Predict yield for each parameter
yield_predictions = {}
for param, specs in specifications.items():
    if param != 'yield_percent':  # Skip actual yield column
        prediction = predict_yield_from_distribution(df[param], specs['lsl'], specs['usl'])
        yield_predictions[param] = prediction
        
        print(f"\n📊 {param.replace('_', ' ').title()}")
        print(f"Predicted Yield: {prediction['predicted_yield']:.2f}%")
        print(f"Lower Defects: {prediction['lower_defects']:.4f}%")
        print(f"Upper Defects: {prediction['upper_defects']:.4f}%")

# Monte Carlo simulation
print(f"\n🎲 MONTE CARLO SIMULATION")
print("-" * 30)
mc_results = monte_carlo_yield_simulation()

print(f"Simulations: {mc_results['n_simulations']:,}")
print(f"Overall Yield: {mc_results['overall_yield']:.2f}%")
print(f"VTH Yield: {mc_results['vth_yield']:.2f}%")
print(f"CD Yield: {mc_results['cd_yield']:.2f}%")
print(f"Leakage Yield: {mc_results['leakage_yield']:.2f}%")

# Compare with the yield column (generated independently of the parameters,
# so close agreement is not expected)
actual_yield = df['yield_percent'].mean()
print(f"\nActual Average Yield: {actual_yield:.2f}%")
print(f"Absolute Difference: {abs(mc_results['overall_yield'] - actual_yield):.2f} percentage points")

In [None]:
# Visualize yield prediction results
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Yield Prediction Analysis', fontsize=16, fontweight='bold')

# Plot 1: Individual parameter yield predictions
params = list(yield_predictions.keys())
yields = [yield_predictions[p]['predicted_yield'] for p in params]
param_names = [p.replace('_', ' ').title() for p in params]

bars1 = ax1.bar(param_names, yields, color=['skyblue', 'lightcoral', 'lightgreen'], alpha=0.7, edgecolor='black')
ax1.set_title('Predicted Yield by Parameter', fontweight='bold')
ax1.set_ylabel('Yield (%)')
ax1.set_ylim(90, 101)

# Add value labels on bars
for bar, yield_val in zip(bars1, yields):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{yield_val:.2f}%', ha='center', va='bottom', fontweight='bold')

ax1.grid(True, alpha=0.3)

# Plot 2: Monte Carlo vs Actual Yield Comparison
methods = ['Monte Carlo', 'Actual Data']
yield_values = [mc_results['overall_yield'], actual_yield]
colors = ['gold', 'coral']

bars2 = ax2.bar(methods, yield_values, color=colors, alpha=0.7, edgecolor='black')
ax2.set_title('Yield Prediction Validation', fontweight='bold')
ax2.set_ylabel('Overall Yield (%)')

# Add value labels
for bar, yield_val in zip(bars2, yield_values):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{yield_val:.2f}%', ha='center', va='bottom', fontweight='bold')

ax2.grid(True, alpha=0.3)

# Plot 3: Defect breakdown for threshold voltage
vth_prediction = yield_predictions['threshold_voltage_mv']
defect_types = ['Lower Defects', 'Upper Defects', 'Good Parts']
defect_values = [vth_prediction['lower_defects'], 
                vth_prediction['upper_defects'],
                vth_prediction['predicted_yield']]
colors = ['red', 'orange', 'green']

ax3.pie(defect_values, labels=defect_types, colors=colors, autopct='%1.2f%%', startangle=90)
ax3.set_title('Threshold Voltage Defect Breakdown', fontweight='bold')

# Plot 4: Process capability vs yield correlation
cp_values = [capability_results[p]['Cp'] for p in params]
yield_values_cp = [yield_predictions[p]['predicted_yield'] for p in params]

scatter = ax4.scatter(cp_values, yield_values_cp, c=['blue', 'red', 'green'], 
                     s=100, alpha=0.7, edgecolors='black')
ax4.set_xlabel('Process Capability (Cp)')
ax4.set_ylabel('Predicted Yield (%)')
ax4.set_title('Capability vs Yield Relationship', fontweight='bold')

# Add parameter labels
for i, param in enumerate(param_names):
    ax4.annotate(param, (cp_values[i], yield_values_cp[i]), 
                xytext=(5, 5), textcoords='offset points', fontsize=9)

ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Regression Analysis for Process Relationships

Understanding relationships between process parameters helps optimize manufacturing conditions and predict outcomes.
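
For a single predictor, `scipy.stats.linregress` returns the slope, its standard error, and the two-sided p-value in one call. The sketch below checks those values against the textbook formulas, in which the residual standard error uses n - 2 degrees of freedom. The data are hypothetical (a weak negative yield trend with threshold voltage, arbitrary seed) and serve only to exercise the arithmetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed for this sketch

# Hypothetical data: yield (%) falls slightly as threshold voltage (mV) rises
x = rng.normal(650, 25, size=100)
y = 95 - 0.02 * (x - 650) + rng.normal(0, 1.0, size=100)

fit = stats.linregress(x, y)

# Manual check: residual standard error uses n - 2 degrees of freedom
n = len(x)
residuals = y - (fit.intercept + fit.slope * x)
s_resid = np.sqrt(np.sum(residuals**2) / (n - 2))
se_slope = s_resid / np.sqrt(np.sum((x - x.mean())**2))
t_manual = fit.slope / se_slope
p_manual = 2 * stats.t.sf(abs(t_manual), n - 2)

print(f"Slope: {fit.slope:.4f} %/mV (SE = {fit.stderr:.4f})")
print(f"p-value: SciPy = {fit.pvalue:.3e}, manual = {p_manual:.3e}")
```

Note that RMSE computed as the square root of the mean squared error divides by n rather than n - 2, so it must be rescaled before being used as the residual standard error in these formulas.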

In [None]:
# Correlation analysis between parameters
correlation_matrix = df[parameters].corr()

print("🔗 CORRELATION ANALYSIS")
print("=" * 30)
print(correlation_matrix.round(4))

# Create correlation heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, 
            mask=mask,
            annot=True, 
            cmap='RdBu_r', 
            center=0,
            square=True,
            fmt='.3f',
            cbar_kws={"shrink": .8})
plt.title('Parameter Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Linear regression: Predict yield from threshold voltage
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X = df[['threshold_voltage_mv']].values
y = df['yield_percent'].values

# Fit linear regression
reg_model = LinearRegression()
reg_model.fit(X, y)
y_pred = reg_model.predict(X)

# Calculate metrics
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"\n📈 LINEAR REGRESSION: Yield vs Threshold Voltage")
print("-" * 50)
print(f"Slope: {reg_model.coef_[0]:.4f} %/mV")
print(f"Intercept: {reg_model.intercept_:.4f} %")
print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f} %")

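# Statistical significance test for the slope (H0: slope = 0)
# rmse divides by n, so the sqrt((n-2)/n) factor converts it to the residual standard error
n = len(X)
t_stat = reg_model.coef_[0] * np.sqrt((n-2) * np.sum((X - X.mean())**2)) / (rmse * np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), n-2)  # survival function keeps precision for very small p-values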

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("✅ Relationship is statistically significant")
else:
    print("❌ Relationship is not statistically significant")
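
The manual slope test in the cell above can be cross-checked against `scipy.stats.linregress`, which reports the slope, its standard error, and a two-sided p-value. The sketch below is a minimal check on synthetic stand-in data; the arrays `x` and `y` and their generating values are assumptions for illustration, not the notebook's `df`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (hypothetical values, not the notebook's df)
rng = np.random.default_rng(0)
x = rng.normal(450.0, 15.0, size=100)                            # threshold voltage, mV
y = 95.0 - 0.05 * (x - 450.0) + rng.normal(0.0, 1.0, size=100)   # yield, %

n = len(x)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Residual standard error uses n - 2 degrees of freedom
s_res = np.sqrt(np.sum(resid**2) / (n - 2))
se_slope = s_res / np.sqrt(np.sum((x - x.mean())**2))
t_manual = slope / se_slope
p_manual = 2 * stats.t.sf(abs(t_manual), n - 2)

# Reference values from SciPy
ref = stats.linregress(x, y)
t_ref = ref.slope / ref.stderr

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy : t = {t_ref:.4f}, p = {ref.pvalue:.3g}")
```

If the two lines agree, the hand-written formula is consistent with the library result; a mismatch usually points to a degrees-of-freedom slip (dividing by `n` instead of `n - 2`).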

In [None]:
# Visualize regression analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Regression plot
ax1.scatter(df['threshold_voltage_mv'], df['yield_percent'], alpha=0.6, color='blue', label='Data Points')
ax1.plot(X, y_pred, 'r-', linewidth=2, label=f'Regression Line (R² = {r2:.3f})')

# Add 95% confidence band for the fitted mean response
from scipy.stats import t
confidence = 0.95
alpha = 1 - confidence
t_val = t.ppf(1 - alpha/2, n-2)
# Residual standard error uses n-2 degrees of freedom (rmse divides by n)
s_res = np.sqrt(np.sum((y - y_pred)**2) / (n - 2))
se = s_res * np.sqrt(1/n + (X - X.mean())**2 / np.sum((X - X.mean())**2))
ci_lower = y_pred - t_val * se.flatten()
ci_upper = y_pred + t_val * se.flatten()

# Sort for plotting
sort_idx = np.argsort(X.flatten())
ax1.fill_between(X[sort_idx].flatten(), ci_lower[sort_idx], ci_upper[sort_idx],
                alpha=0.2, color='red', label=f'{confidence*100:.0f}% Confidence Interval')

ax1.set_xlabel('Threshold Voltage (mV)')
ax1.set_ylabel('Yield (%)')
ax1.set_title('Yield vs Threshold Voltage Regression', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Residual plot
residuals = y - y_pred
ax2.scatter(y_pred, residuals, alpha=0.6, color='green')
ax2.axhline(y=0, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Predicted Yield (%)')
ax2.set_ylabel('Residuals (%)')
ax2.set_title('Residual Plot', fontweight='bold')
ax2.grid(True, alpha=0.3)

# Add residual statistics
residual_std = np.std(residuals)
ax2.text(0.05, 0.95, f'Residual Std: {residual_std:.3f}%\nMean: {np.mean(residuals):.3f}%', 
         transform=ax2.transAxes, verticalalignment='top',
         bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))

plt.tight_layout()
plt.show()

## 9. Summary and Key Takeaways

Let's summarize our statistical analysis findings and provide actionable recommendations.

In [None]:
def generate_statistical_summary():
    """Generate comprehensive summary of statistical analysis"""
    
    print("📋 STATISTICAL ANALYSIS SUMMARY")
    print("=" * 50)
    
    print("\n🎯 PROCESS CAPABILITY ASSESSMENT")
    print("-" * 35)
    for param, results in capability_results.items():
        status = "🟢 Excellent" if results['Cpk'] >= 1.33 else "🟡 Adequate" if results['Cpk'] >= 1.0 else "🔴 Poor"
        print(f"{param.replace('_', ' ').title()}: Cpk = {results['Cpk']:.3f} {status}")
    
    print("\n📊 NORMALITY ASSESSMENT")
    print("-" * 25)
    for param, results in normality_results.items():
        status = "✅ Normal" if results['normal_likely'] else "❌ Non-normal"
        print(f"{param.replace('_', ' ').title()}: {status}")
    
    print("\n🧪 HYPOTHESIS TESTING RESULTS")
    print("-" * 30)
    print(f"Tool Comparison (t-test): {'Significant' if t_test_results['significant'] else 'Not significant'}")
    print(f"Multi-tool ANOVA: {'Significant' if anova_results['significant'] else 'Not significant'}")
    
    print("\n📈 PROCESS CONTROL STATUS")
    print("-" * 25)
    total_points = len(control_data)
    ooc_xbar = len(control_data[(control_data['mean'] > xbar_ucl) | (control_data['mean'] < xbar_lcl)])
    ooc_r = len(control_data[(control_data['range'] > r_ucl) | (control_data['range'] < r_lcl)])
    
    if ooc_xbar == 0 and ooc_r == 0:
        print("✅ Process in statistical control")
    else:
        print(f"⚠️ Out-of-control signals across {total_points} subgroups: X-bar = {ooc_xbar}, R = {ooc_r}")
    
    print("\n🎯 YIELD PREDICTIONS")
    print("-" * 20)
    print(f"Monte Carlo Simulation: {mc_results['overall_yield']:.2f}%")
    print(f"Actual Average Yield: {actual_yield:.2f}%")
    print(f"Prediction Error: {abs(mc_results['overall_yield'] - actual_yield):.2f} percentage points")
    
    print("\n🔗 KEY RELATIONSHIPS")
    print("-" * 20)
    print(f"Yield vs Threshold Voltage: R² = {r2:.3f}")
    print(f"Statistical Significance: {'Yes' if p_value < 0.05 else 'No'} (p = {p_value:.4f})")

def provide_recommendations():
    """Provide data-driven recommendations"""
    
    print("\n\n💡 RECOMMENDATIONS FOR PROCESS IMPROVEMENT")
    print("=" * 55)
    
    # Capability-based recommendations
    print("\n🎯 Process Capability Improvements:")
    for param, results in capability_results.items():
        if results['Cpk'] < 1.33:
            if not results['Process Centered']:
                print(f"  📍 {param.replace('_', ' ').title()}: Center process (Cpk = {results['Cpk']:.3f})")
            else:
                print(f"  📍 {param.replace('_', ' ').title()}: Reduce variation (Cpk = {results['Cpk']:.3f})")
    
    # Tool-specific recommendations
    print("\n🔧 Tool Management:")
    if t_test_results['significant']:
        print("  📍 Investigate tool calibration differences")
        print("  📍 Consider tool matching procedures")
    else:
        print("  ✅ Tools performing consistently")
    
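    # Control chart recommendations
    # Recompute the out-of-control counts here: the ones computed in
    # generate_statistical_summary() are local to that function
    print("\n📊 Process Monitoring:")
    ooc_xbar = ((control_data['mean'] > xbar_ucl) | (control_data['mean'] < xbar_lcl)).sum()
    ooc_r = ((control_data['range'] > r_ucl) | (control_data['range'] < r_lcl)).sum()
    if ooc_xbar > 0 or ooc_r > 0: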
        print("  📍 Investigate out-of-control signals")
        print("  📍 Review process parameters during OOC periods")
    else:
        print("  ✅ Continue current SPC monitoring")
    
    # Yield optimization
    print("\n📈 Yield Optimization:")
    worst_yield_param = min(yield_predictions.keys(), 
                          key=lambda x: yield_predictions[x]['predicted_yield'])
    print(f"  📍 Focus on {worst_yield_param.replace('_', ' ').title()} optimization")
    print(f"  📍 Current limiting yield: {yield_predictions[worst_yield_param]['predicted_yield']:.2f}%")
    
    print("\n🔬 Further Analysis Recommendations:")
    print("  📍 Implement real-time SPC monitoring")
    print("  📍 Conduct designed experiments for optimization")
    print("  📍 Establish automated capability studies")
    print("  📍 Develop predictive yield models")

# Generate summary and recommendations
generate_statistical_summary()
provide_recommendations()

## 10. Practice Exercises

Now it's your turn! Complete these exercises to reinforce your learning:

### Exercise 1: Process Capability Study
1. Create your own semiconductor dataset with different specification limits
2. Calculate Cp and Cpk for your parameters
3. Determine if the process is capable and centered
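
As a possible starting point for this exercise, the sketch below computes Cp and Cpk from their standard definitions. The helper `capability_indices`, the oxide-thickness data, and the specification limits are hypothetical and chosen only for illustration; it also assumes the data are roughly normal.

```python
import numpy as np

def capability_indices(values, lsl, usl):
    """Return (Cp, Cpk) for a sample, assuming roughly normal data."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sigma = values.std(ddof=1)  # sample standard deviation
    cp = (usl - lsl) / (6 * sigma)                     # potential capability
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)    # capability including centering
    return cp, cpk

rng = np.random.default_rng(1)
# Hypothetical gate-oxide thickness data (nm), deliberately off-center
thickness = rng.normal(loc=5.1, scale=0.05, size=500)
cp, cpk = capability_indices(thickness, lsl=4.8, usl=5.2)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```

A Cpk noticeably below Cp indicates that the process mean sits away from the midpoint of the specification window, so recentering would help before any variance reduction.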

### Exercise 2: Hypothesis Testing
1. Compare performance between two different process conditions
2. Use appropriate statistical tests
3. Interpret results and make recommendations
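
One way to approach this exercise is sketched below: check normality with Shapiro-Wilk, then compare the two conditions with Welch's t-test, which does not assume equal variances. The two samples (`cond_a`, `cond_b`) are synthetic and their means and spreads are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical threshold-voltage samples (mV) under two process conditions
cond_a = rng.normal(450.0, 10.0, size=40)
cond_b = rng.normal(456.0, 14.0, size=35)

# Check the normality assumption before relying on a t-test
for name, sample in (("A", cond_a), ("B", cond_b)):
    w_stat, p_norm = stats.shapiro(sample)
    print(f"Condition {name}: Shapiro-Wilk p = {p_norm:.3f}")

# Welch's t-test (equal_var=False) tolerates unequal variances and sample sizes
t_stat, p_val = stats.ttest_ind(cond_a, cond_b, equal_var=False)

# Rough effect size: mean difference over the root-mean of the two sample variances
avg_sd = np.sqrt((cond_a.var(ddof=1) + cond_b.var(ddof=1)) / 2)
effect = (cond_b.mean() - cond_a.mean()) / avg_sd

print(f"t = {t_stat:.3f}, p = {p_val:.4f}, effect size = {effect:.2f}")
```

If either sample looks clearly non-normal, a rank-based alternative such as `scipy.stats.mannwhitneyu` may be a better fit than the t-test.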

### Exercise 3: Control Chart Implementation
1. Create control charts for a new parameter
2. Simulate some out-of-control conditions
3. Detect and analyze the special causes
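
A minimal sketch for this exercise follows, assuming subgroups of five and the standard X-bar/R chart constants for that size (A2 = 0.577, D3 = 0, D4 = 2.114). The line-width data, the size of the injected mean shift, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_subgroups, subgroup_size = 30, 5
A2, D3, D4 = 0.577, 0.0, 2.114  # standard X-bar/R constants for subgroups of 5

# Hypothetical line-width data (nm); a +3 nm mean shift starts at subgroup 20
data = rng.normal(100.0, 2.0, size=(n_subgroups, subgroup_size))
data[20:] += 3.0

xbar = data.mean(axis=1)
ranges = data.max(axis=1) - data.min(axis=1)

# Control limits are estimated from the first 20 (unshifted) subgroups only
xbar_bar = xbar[:20].mean()
r_bar = ranges[:20].mean()
ucl_x, lcl_x = xbar_bar + A2 * r_bar, xbar_bar - A2 * r_bar
ucl_r, lcl_r = D4 * r_bar, D3 * r_bar

# Indices of subgroups that fall outside the limits
ooc_x = np.flatnonzero((xbar > ucl_x) | (xbar < lcl_x))
ooc_r = np.flatnonzero((ranges > ucl_r) | (ranges < lcl_r))
print("X-bar signals at subgroups:", ooc_x.tolist())
print("R signals at subgroups:", ooc_r.tolist())
```

Estimating the limits from a baseline period, rather than from all subgroups, keeps the simulated shift from inflating the limits it is supposed to trip.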

### Exercise 4: Yield Optimization
1. Use regression analysis to identify yield drivers
2. Predict yield improvement from process changes
3. Recommend optimization strategies
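
For this exercise, one option is a multiple regression on standardized predictors, so that coefficient magnitudes can be compared across parameters. The sketch below uses plain NumPy least squares; the parameters (`vth`, `leak`), the synthetic yield model, and the assumed 1 nA leakage reduction are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Hypothetical process parameters and a synthetic yield that depends on them
vth = rng.normal(450.0, 15.0, n)    # threshold voltage, mV
leak = rng.normal(10.0, 2.0, n)     # leakage current, nA
yield_pct = (95.0 - 0.04 * (vth - 450.0) - 0.8 * (leak - 10.0)
             + rng.normal(0.0, 1.0, n))

# Standardize predictors so coefficient magnitudes are comparable
X_raw = np.column_stack([vth, leak])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
design = np.column_stack([np.ones(n), X_std])  # intercept column + predictors

coef, *_ = np.linalg.lstsq(design, yield_pct, rcond=None)
for name, b in zip(["vth", "leak"], coef[1:]):
    print(f"{name}: {b:+.3f} % yield per 1 SD")

# Predicted change if mean leakage drops by 1 nA (an assumed process change)
delta = coef[2] * (-1.0 / X_raw[:, 1].std(ddof=1))
print(f"Predicted yield change: {delta:+.2f} percentage points")
```

The predictor with the largest absolute standardized coefficient is the strongest yield driver within this model; the prediction holds only inside the range of the observed data and says nothing about causation.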

Try these exercises with your own data or modify the existing dataset!

## Conclusion

Congratulations! You've completed Module 1.2 on Statistical Foundations for Semiconductor Analysis. 

### What You've Learned:
- ✅ Descriptive statistics for process characterization
- ✅ Process capability analysis (Cp, Cpk calculations)
- ✅ Distribution analysis and normality testing
- ✅ Hypothesis testing for process validation
- ✅ Statistical process control and control charts
- ✅ Yield prediction using statistical models
- ✅ Regression analysis for process relationships

### Key Skills Developed:
- Statistical analysis with Python
- Process capability assessment
- Quality control implementation
- Data-driven decision making
- Yield optimization strategies

### Next Steps:
- Module 1.3: Data Manipulation and Preprocessing
- Advanced statistical modeling
- Machine learning foundations
- Process optimization techniques

Keep practicing these statistical concepts - they form the foundation for all advanced semiconductor data analysis!