# 2.2 Outlier Detection in Semiconductor Manufacturing

## 📚 Learning Objectives

By the end of this notebook, you will be able to:
- **Differentiate** types of outliers in semiconductor manufacturing (process, measurement, contextual, collective)
- **Implement** statistical outlier detection methods (Z-score, Modified Z-score, IQR)
- **Apply** multivariate techniques (Mahalanobis Distance)
- **Leverage** machine learning models (Isolation Forest, Local Outlier Factor)
- **Monitor** process stability using time-series control (EWMA)
- **Design** contextual (recipe-aware) and physics-guided detection strategies
- **Evaluate** detection performance with precision/recall metrics

## 🗂 Dataset: SECOM (UCI ML Repository)
We'll use the real-world **SECOM** dataset (1567 wafers × 591 features) capturing inline process measurements with a binary pass/fail quality label (−1 = pass, 1 = fail). The data contains:
- High-dimensional, noisy sensor signals
- Missing values (NaN)
- Imbalanced target (few fails)

> If the dataset files (`secom.data`, `secom_labels.data`) are not present locally, the notebook explains how to obtain them and, by default, falls back to a smaller synthetic proxy.

## 🎯 Workflow Overview
1. Load & inspect dataset
2. Clean & preprocess (missing value handling, variance filtering, scaling)
3. Implement univariate detectors
4. Implement multivariate & ML-based detectors
5. Time-series style EWMA example (using an engineered aggregate signal)
6. Contextual (recipe/surrogate grouping) + physics-inspired validation
7. Evaluate the ensemble against the pass/fail labels (or the injected anomalies in the synthetic fallback)
8. Provide extension exercises

Let's start by importing the necessary libraries.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.spatial.distance import mahalanobis
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score
import os
import warnings
from pathlib import Path

# Configure plotting and environment
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (14, 8)
sns.set_palette("viridis")
warnings.filterwarnings('ignore')

DATA_DIR = Path('../../datasets').resolve()
SECOM_DATA_FILE = DATA_DIR / 'secom.data'
SECOM_LABEL_FILE = DATA_DIR / 'secom_labels.data'

print("✅ Libraries imported successfully!")
print(f"Looking for dataset in: {DATA_DIR}")
print(f"secom.data exists: {SECOM_DATA_FILE.exists()} | secom_labels.data exists: {SECOM_LABEL_FILE.exists()}")

## 🏭 Loading the SECOM Dataset (or Fallback Synthetic Data)

The **SECOM** dataset (UCI ID: 179) consists of 1567 samples and 591 continuous features measuring in-process signals. A separate labels file contains:
- Quality label (column 0): −1 = pass, 1 = fail
- Timestamp (column 1): DateTime string

If the raw files are not available in `datasets/` (resolved as `../../datasets` relative to this notebook), you may:
1. Download them manually from the UCI repository (https://archive.ics.uci.edu/dataset/179/secom) and place them in `datasets/`
2. Use `pip install ucimlrepo` and fetch them programmatically
3. Proceed with an automatically generated smaller synthetic proxy (the loader reports which mode it used)

We'll implement a loader that prefers local files but falls back gracefully.

In [None]:
def load_secom(local_data_path=SECOM_DATA_FILE, local_label_path=SECOM_LABEL_FILE, synthetic_if_missing=True):
    """Load SECOM dataset or optionally generate synthetic fallback.

    Returns
    -------
    X : pd.DataFrame
        Feature matrix (after basic cleaning)
    y : pd.Series | None
        Quality label (if available)
    meta : dict
        Metadata about loading mode
    """
    meta = {}
    if local_data_path.exists() and local_label_path.exists():
        # Load raw feature matrix (space-separated, NaNs present)
        X = pd.read_csv(local_data_path, sep=" ", header=None, na_values='NaN')
        labels = pd.read_csv(local_label_path, sep=" ", header=None, na_values='NaN')
        y = labels.iloc[:, 0].replace({-1:0, 1:1})  # Convert to 0=pass,1=fail for convenience
        # Basic cleaning: drop all-NaN columns
        all_nan_cols = X.columns[X.isna().all()]
        if len(all_nan_cols) > 0:
            X = X.drop(columns=all_nan_cols)
        meta['mode'] = 'secom_real'
        meta['dropped_all_nan_columns'] = len(all_nan_cols)
        return X, y, meta
    elif synthetic_if_missing:
        np.random.seed(42)
        n_samples, n_features = 600, 40
        base = np.random.normal(0, 1, size=(n_samples, n_features))
        # Inject correlation structure
        for i in range(1, n_features, 5):
            base[:, i] = base[:, 0] * 0.6 + np.random.normal(0, 0.5, size=n_samples)
        # Inject sparse anomalies
        anomaly_idx = np.random.choice(n_samples, size=25, replace=False)
        base[anomaly_idx] += np.random.normal(5, 1, size=(25, n_features))
        X = pd.DataFrame(base, columns=[f'f{i}' for i in range(n_features)])
        y = pd.Series((np.isin(range(n_samples), anomaly_idx)).astype(int), name='is_fail')
        meta['mode'] = 'synthetic'
        meta['note'] = 'SECOM files missing; using synthetic proxy.'
        return X, y, meta
    else:
        raise FileNotFoundError("SECOM dataset not found and synthetic fallback disabled.")

X_raw, y_raw, load_meta = load_secom()
print(f"📦 Data load mode: {load_meta['mode']}")
print(f"Shape: {X_raw.shape} | Label available: {y_raw is not None}")
X_raw.head().iloc[:3, :8]
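
The loader above reads only the label column. If you also want the timestamps (for example, to order wafers in time), a minimal sketch follows. The sample rows and the day-first timestamp format are assumptions for illustration; check them against your copy of `secom_labels.data`.

```python
import io
import pandas as pd

# Illustrative rows in the assumed layout of secom_labels.data: label, quoted timestamp
sample = io.StringIO(
    '-1 "19/07/2008 11:55:00"\n'
    '1 "19/07/2008 14:43:00"\n'
    '-1 "19/07/2008 12:32:00"\n'
)

labels = pd.read_csv(sample, sep=" ", header=None, names=["label", "timestamp"])
# Assumed day-first format; adjust if your file differs
labels["timestamp"] = pd.to_datetime(labels["timestamp"], format="%d/%m/%Y %H:%M:%S")

# Row labels sorted by time; the same order can be applied to the feature matrix
time_order = labels["timestamp"].sort_values().index
labels_sorted = labels.loc[time_order]
print(labels_sorted)
```

Reindexing the feature matrix with `time_order` would give the later time-series steps a real chronological axis instead of plain row order.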

## 🧹 Preprocessing & Feature Reduction

Given high dimensionality and missingness, we apply:
1. Drop columns with > 40% missing values
2. Impute remaining NaNs with the column median
3. Remove near-zero-variance features
4. Keep a manageable subset (the 50 highest-variance features) for demonstration
5. Standard-scale the retained features

We'll retain both the processed matrix and a scaled version for algorithms that need it.

In [None]:
def preprocess_features(X, max_missing_ratio=0.4, variance_threshold=1e-5, top_variance=50):
    Xc = X.copy()
    # 1. Drop high-missing columns
    missing_ratio = Xc.isna().mean()
    keep_cols = missing_ratio[missing_ratio <= max_missing_ratio].index
    Xc = Xc[keep_cols]
    # 2. Median impute
    medians = Xc.median()
    Xc = Xc.fillna(medians)
    # 3. Remove near-zero variance
    variances = Xc.var()
    keep_var_cols = variances[variances > variance_threshold].index
    Xc = Xc[keep_var_cols]
    # 4. Select top variance features for demo clarity
    top_cols = variances[keep_var_cols].sort_values(ascending=False).head(top_variance).index
    Xc = Xc[top_cols]
    # 5. Scale
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(Xc), columns=Xc.columns)
    meta = {
        'dropped_missing_cols': int((missing_ratio > max_missing_ratio).sum()),
        'remaining_features': Xc.shape[1]
    }
    return Xc, X_scaled, meta

X_proc, X_scaled, prep_meta = preprocess_features(X_raw)
print(f"🧪 Preprocessing complete. Features kept: {X_proc.shape[1]} | Dropped (missing): {prep_meta['dropped_missing_cols']}")
X_proc.head().iloc[:3, :8]
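
The function above fits medians and scaling on the full dataset. When new wafers arrive later, one option is to wrap the same steps in a scikit-learn `Pipeline` so the statistics learned on a reference window are reused unchanged. The sketch below uses made-up data (`X_demo`) and omits the missing-ratio filter and top-variance selection; it is an illustration, not a drop-in replacement for `preprocess_features`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data: 100 wafers x 6 sensors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
X_demo[:, 3] = 1.0                                   # a constant (zero-variance) sensor
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan     # roughly 5% missing readings

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("variance", VarianceThreshold(threshold=1e-5)),
    ("scale", StandardScaler()),
])

# Fit on a reference window only, then reuse the learned medians and scales
X_ref, X_new = X_demo[:70], X_demo[70:]
X_ref_scaled = prep.fit_transform(X_ref)
X_new_scaled = prep.transform(X_new)
print(X_ref_scaled.shape, X_new_scaled.shape)
```

Fitting on a reference window keeps later anomalies from influencing the medians and standard deviations they are judged against.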

## 📊 1. Univariate Statistical Outlier Detection

We implement three foundational methods:
- **Z-Score**: Assumes approximate normality
- **Modified Z-Score**: Robust (median + MAD)
- **IQR**: Non-parametric range-based

We'll apply each to one representative high-variance feature and compare results.

In [None]:
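def zscore_flags(series, threshold=3.0):
    # stats.zscore returns a plain array, so re-attach the index to keep flags aligned
    z = np.abs(stats.zscore(series, nan_policy='omit'))
    return pd.Series(z > threshold, index=series.index)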

def modified_zscore_flags(series, threshold=3.5):
    med = np.median(series)
    mad = np.median(np.abs(series - med))
    if mad == 0:
        return pd.Series([False]*len(series), index=series.index)
    mz = 0.6745 * (series - med) / mad
    return (np.abs(mz) > threshold)

def iqr_flags(series, k=1.5):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k*iqr, q3 + k*iqr
    return (series < lower) | (series > upper)

feature_example = X_proc.columns[0]
uni_df = pd.DataFrame({
    feature_example: X_proc[feature_example],
    'zscore': zscore_flags(X_proc[feature_example]),
    'modified_z': modified_zscore_flags(X_proc[feature_example]),
    'iqr': iqr_flags(X_proc[feature_example])
})
uni_df.head()
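
To see why the robust variants matter, consider a small made-up sample in which three extreme readings inflate the mean and standard deviation enough to hide themselves from the plain Z-score (the "masking" effect). This sketch recomputes the three rules inline on hypothetical values rather than on SECOM data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# 17 readings spread evenly around 10, plus three gross outliers at 50 (illustrative values)
values = pd.Series(np.concatenate([10 + 0.1 * np.arange(-8, 9), [50.0, 50.0, 50.0]]))

# Classic Z-score: mean and std are both pulled toward the outliers
z = np.abs(stats.zscore(values))

# Modified Z-score: median and MAD barely move
med = values.median()
mad = (values - med).abs().median()
mod_z = 0.6745 * (values - med).abs() / mad

# IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

counts = {
    "zscore": int((z > 3.0).sum()),
    "modified_z": int((mod_z > 3.5).sum()),
    "iqr": int(((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum()),
}
print(counts)  # {'zscore': 0, 'modified_z': 3, 'iqr': 3}
```

The outliers reach a Z-score of only about 2.4 because they raise the standard deviation to roughly 14, while the median-based and quantile-based rules still isolate all three.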

### Visualization: Comparing Univariate Methods
We'll overlay flagged points for the selected feature.

In [None]:
fig, ax = plt.subplots(figsize=(14,5))
ax.plot(uni_df[feature_example].values, label=feature_example, alpha=0.6)
for method, color in [('zscore','orange'),('modified_z','red'),('iqr','purple')]:
    idx = uni_df[uni_df[method]].index
    ax.scatter(idx, uni_df.loc[idx, feature_example], label=method, color=color, s=60)
ax.set_title(f'Univariate Outlier Flags - {feature_example}')
ax.legend()
plt.show()

print(uni_df[['zscore','modified_z','iqr']].sum().rename('flag_counts'))

## 🔬 2. Multivariate Statistical Detection (Mahalanobis)
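We compute the Mahalanobis distance over a small subset of features (by default the first 10 columns, which are the highest-variance ones after preprocessing) to limit numerical instability in high dimensions. The pseudo-inverse guards against a singular covariance matrix, and the 99th-percentile threshold flags about 1% of rows by construction.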

In [None]:
def mahalanobis_flags(X, feature_subset=None, threshold_percentile=99):
    if feature_subset is None:
        feature_subset = X.columns[:10]
    Xs = X[feature_subset]
    mean = Xs.mean().values
    cov = np.cov(Xs.values, rowvar=False)
    inv_cov = np.linalg.pinv(cov)
    dists = []
    for row in Xs.values:
        dists.append(mahalanobis(row, mean, inv_cov))
    dists = np.array(dists)
    th = np.percentile(dists, threshold_percentile)
    return pd.Series(dists>th, index=Xs.index), dists, th

maha_flags, maha_dists, maha_th = mahalanobis_flags(X_scaled)
print(f"Threshold: {maha_th:.3f} | Flagged: {maha_flags.sum()}")
plt.figure(figsize=(12,5))
plt.plot(maha_dists, label='Mahalanobis Distance')
plt.axhline(maha_th, color='red', linestyle='--', label='Threshold')
plt.title('Mahalanobis Distances (Top Features)')
plt.legend();
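
What the Mahalanobis distance adds over per-feature checks is sensitivity to broken correlations. The sketch below uses two synthetic, strongly correlated features (an assumption for illustration) and one injected point that looks modest on each axis but contradicts the correlation. It also shows a vectorized alternative to the row-by-row loop.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.95
x1 = rng.normal(size=500)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=500)
X_demo = np.column_stack([x1, x2])
X_demo = np.vstack([X_demo, [2.0, -2.0]])   # injected point: about 2 sigma on each axis, wrong sign

center = X_demo.mean(axis=0)
inv_cov = np.linalg.pinv(np.cov(X_demo, rowvar=False))
diff = X_demo - center

# Vectorized d_i = sqrt(diff_i^T @ inv_cov @ diff_i) for every row
maha = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

per_feature_z = np.abs(diff / X_demo.std(axis=0))
print("Injected point |z| per feature:", per_feature_z[-1].round(2))
print("Injected point Mahalanobis rank:", int((maha > maha[-1]).sum()) + 1)
```

Neither univariate Z-score crosses 3 for the injected point, yet it has the largest Mahalanobis distance in the sample because it lies far off the correlation axis.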

## 🤖 3. Machine Learning Based Detection
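Next we fit two unsupervised models on the scaled feature matrix:

- Isolation Forest (tree ensemble that isolates anomalies with few random splits)
- Local Outlier Factor (density-based, compares each point with its local neighborhood)

Both use `contamination=0.05`, so each flags roughly 5% of rows by construction. We use scaled features because LOF relies on distances; Isolation Forest itself is insensitive to feature scaling.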

In [None]:
iso = IsolationForest(contamination=0.05, random_state=42)
iso_preds = iso.fit_predict(X_scaled)
iso_flags = pd.Series(iso_preds==-1, index=X_scaled.index, name='iso_forest')

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_preds = lof.fit_predict(X_scaled)
lof_flags = pd.Series(lof_preds==-1, index=X_scaled.index, name='lof')

ml_flags = pd.concat([iso_flags, lof_flags], axis=1)
print(ml_flags.sum())

# Simple 2D projection for visualization (first two scaled features)
plt.figure(figsize=(10,6))
plt.scatter(X_scaled.iloc[:,0], X_scaled.iloc[:,1], c=iso_flags.map({True:'red', False:'gray'}), alpha=0.6)
plt.title('Isolation Forest Flags (Red = Outlier)')
plt.xlabel(X_scaled.columns[0]); plt.ylabel(X_scaled.columns[1]);
plt.show()
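
Hard flags depend on the chosen `contamination`. Both models also expose continuous scores, which can be used to rank wafers for review instead. A minimal sketch on made-up data with one injected anomaly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
X_demo[0] = 8.0   # one obvious anomaly, for illustration only

iso = IsolationForest(random_state=42).fit(X_demo)
iso_score = iso.score_samples(X_demo)           # lower = more anomalous

lof = LocalOutlierFactor(n_neighbors=20).fit(X_demo)
lof_score = lof.negative_outlier_factor_        # lower = more anomalous

# Rank rows from most to least anomalous under each model
iso_rank = np.argsort(iso_score)
lof_rank = np.argsort(lof_score)
print("Top 5 (Isolation Forest):", iso_rank[:5])
print("Top 5 (LOF):", lof_rank[:5])
```

Ranking by score lets you inspect the top-N candidates and choose a cutoff afterwards, rather than committing to a contamination rate up front.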

## ⏱️ 4. EWMA Monitoring (Synthetic Time Axis)
The loader above keeps only the pass/fail label and discards the timestamp column, so we treat row order as a stand-in time axis and build an aggregate signal (the mean of the first five scaled features) to illustrate how EWMA detects small shifts.
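
For reference, the control limits computed in this section follow the standard EWMA chart. It assumes independent observations with in-control mean $\mu$ and standard deviation $\sigma$, and a statistic started at $z_0 = \mu$:

$$
z_t = \lambda x_t + (1 - \lambda)\, z_{t-1}, \qquad z_0 = \mu
$$

$$
\mathrm{Var}(z_t) = \sigma^2 \, \frac{\lambda}{2 - \lambda} \left[ 1 - (1 - \lambda)^{2t} \right]
$$

$$
\mathrm{UCL}_t,\ \mathrm{LCL}_t = \mu \pm L\, \sigma \sqrt{\frac{\lambda}{2 - \lambda} \left[ 1 - (1 - \lambda)^{2t} \right]}
$$

As $t$ grows, the bracketed term tends to 1, so with $\lambda = 0.2$ and $L = 3$ the limits settle at $\mu \pm 3\sigma\sqrt{0.2 / 1.8} = \mu \pm \sigma$.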

In [None]:
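def ewma_series(x, lambda_param=0.2, start=None):
    """EWMA recursion z_t = lambda*x_t + (1 - lambda)*z_{t-1}, with z_0 = start."""
    x = np.asarray(x, dtype=float)
    # Start from the target mean (not the first sample) so the time-varying
    # limits below are valid from t = 1 onward
    prev = x.mean() if start is None else start
    ew = []
    for val in x:
        prev = lambda_param * val + (1 - lambda_param) * prev
        ew.append(prev)
    return np.array(ew)

lambda_param = 0.2  # smoothing weight
L = 3               # control-limit width in sigma units
agg_signal = X_scaled.iloc[:, :5].mean(axis=1).values
mu, sigma = np.mean(agg_signal), np.std(agg_signal)
ew = ewma_series(agg_signal, lambda_param, start=mu)
n = len(agg_signal)
t = np.arange(1, n + 1)
# Variance factor of the EWMA statistic at step t
var_factor = lambda_param / (2 - lambda_param) * (1 - (1 - lambda_param) ** (2 * t))
ucl = mu + L * sigma * np.sqrt(var_factor)
lcl = mu - L * sigma * np.sqrt(var_factor)
flags_ewma = (ew > ucl) | (ew < lcl)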

plt.figure(figsize=(14,5))
plt.plot(agg_signal, label='Aggregate Signal', alpha=0.4)
plt.plot(ew, label='EWMA', color='orange')
plt.plot(ucl, '--', color='red', label='UCL/LCL')
plt.plot(lcl, '--', color='red')
plt.scatter(np.where(flags_ewma)[0], ew[flags_ewma], color='red', s=50, label='EWMA Flag')
plt.title('EWMA Monitoring Example')
plt.legend();
print(f"EWMA flags: {flags_ewma.sum()}")
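
To illustrate the "small-shift" claim, the sketch below compares EWMA with a plain 3-sigma rule on a simulated signal that shifts by one standard deviation halfway through. The signal, the shift size, and the known in-control parameters are all assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=200)
x[100:] += 1.0                      # sustained +1 sigma shift from sample 100 on

mu0, sigma0 = 0.0, 1.0              # in-control parameters, assumed known here
lam, L = 0.2, 3.0

# EWMA statistic started at the in-control mean
z = np.empty_like(x)
prev = mu0
for i, val in enumerate(x):
    prev = lam * val + (1 - lam) * prev
    z[i] = prev

t = np.arange(1, len(x) + 1)
half_width = L * sigma0 * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
ewma_flags = np.abs(z - mu0) > half_width
shewhart_flags = np.abs(x - mu0) > L * sigma0   # classic 3-sigma rule on raw points

print("Flags after the shift  EWMA:", int(ewma_flags[100:].sum()),
      "| 3-sigma:", int(shewhart_flags[100:].sum()))
```

Because the EWMA averages out noise, its limits are much tighter than the raw 3-sigma band, so a persistent 1-sigma shift pushes it out of control repeatedly while individual points rarely exceed 3 sigma.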

## 🧠 5. Contextual & Ensemble Strategy
Without true recipe metadata, we simulate grouping via k-means on top features to create pseudo-contexts, then perform within-group Z-score detection. Finally, we build an ensemble consensus score across methods.
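
When real context labels are available, within-group scoring is a one-liner with `groupby(...).transform`. The sketch below uses a hypothetical `recipe` column and made-up `thickness` values to show a reading that is unremarkable globally but abnormal for its own recipe.

```python
import pandas as pd

# Hypothetical data: recipe A targets ~10, recipe B targets ~50
df = pd.DataFrame({
    "recipe": ["A"] * 13 + ["B"] * 12,
    "thickness": [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 9.9, 10.1, 10.0, 10.0, 9.8, 10.2, 14.0,
                  50, 52, 48, 51, 49, 50, 47, 53, 50, 51, 49, 50],
})

def zscore(s):
    # Population standard deviation, matching scipy.stats.zscore's default
    return (s - s.mean()) / s.std(ddof=0)

# Global rule: one mean and std for all wafers
df["global_flag"] = zscore(df["thickness"]).abs() > 2.5
# Contextual rule: mean and std computed separately per recipe
df["context_flag"] = df.groupby("recipe")["thickness"].transform(zscore).abs() > 2.5

print(df[["global_flag", "context_flag"]].sum())
print(df.loc[df["context_flag"], ["recipe", "thickness"]])
```

The 14.0 reading sits between the two recipe targets, so the global rule ignores it, while the per-recipe rule flags it as more than 3 standard deviations from recipe A's mean.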

In [None]:
from sklearn.cluster import KMeans

k = 4
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X_scaled.iloc[:, :10])
X_proc['cluster'] = cluster_labels

context_flags = pd.Series(False, index=X_proc.index)
feat_for_context = X_proc.columns[0]
for c in range(k):
    mask = X_proc['cluster']==c
    context_series = X_proc.loc[mask, feat_for_context]
    context_flags.loc[mask] = zscore_flags(context_series, threshold=2.5)

# Ensemble: combine methods
methods_df = pd.DataFrame({
    'uni_z': uni_df['zscore'],
    'uni_modz': uni_df['modified_z'],
    'uni_iqr': uni_df['iqr'],
    'maha': maha_flags,
    'iso': iso_flags,
    'lof': lof_flags,
    'ewma': pd.Series(flags_ewma, index=X_proc.index),
    'context': context_flags
}).fillna(False)

methods_df['consensus_score'] = methods_df.mean(axis=1)
# Flag consensus if >= 0.4 (arbitrary demo threshold)
methods_df['consensus_flag'] = methods_df['consensus_score'] >= 0.4

print(methods_df['consensus_flag'].value_counts())
methods_df.head()

if y_raw is not None:
    # Align labels to processed frame length (in synthetic fallback lengths match)
    y = y_raw.loc[methods_df.index] if len(y_raw)==len(methods_df) else y_raw.iloc[:len(methods_df)]
    def safe_metric(y_true, y_pred, name):
        if y_true.sum()==0 and y_pred.sum()==0:
            return 0.0
        return name(y_true, y_pred, zero_division=0)
    prec = safe_metric(y, methods_df['consensus_flag'], precision_score)
    rec = safe_metric(y, methods_df['consensus_flag'], recall_score)
    f1 = safe_metric(y, methods_df['consensus_flag'], f1_score)
    print(f"Consensus Performance -> Precision: {prec:.3f} | Recall: {rec:.3f} | F1: {f1:.3f}")
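
The same metrics can be tabulated per detector to see which methods contribute most to the consensus. The sketch below uses two hypothetical flag columns and a toy label vector; with the notebook's data you would loop over the detector columns instead.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and detector outputs (illustrative values only)
y_true = pd.Series([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
flags = pd.DataFrame({
    "method_a": [0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
    "method_b": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
}).astype(bool)

rows = []
for name in flags.columns:
    y_pred = flags[name].astype(int)
    rows.append({
        "method": name,
        "flagged": int(y_pred.sum()),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    })

report = pd.DataFrame(rows).set_index("method").round(3)
print(report)
```

Here `method_a` trades one false alarm for higher recall (precision and recall both 0.667), while `method_b` is perfectly precise but finds only one of the three failures.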

## ✅ Summary & Next Steps
We implemented a layered outlier detection workflow:
- Univariate screening (Z-score, plus the more robust Modified Z-score and IQR)
- Multivariate correlation-aware distance (Mahalanobis)
- Model-based detection (Isolation Forest, LOF)
- Temporal shift detection (EWMA example)
- Contextual grouping + ensemble consensus

### Key Principles
1. Combine complementary methods to reduce false positives
2. Use robust & context-aware approaches for manufacturing variability
3. Evaluate with precision/recall when labeled data exists
4. Start simple (univariate) → escalate to multivariate/ML → ensemble

### 🚀 Exercises
1. Adjust contamination in Isolation Forest & LOF (0.01–0.15) and observe trade-offs.
2. Replace the percentile threshold in Mahalanobis with a chi-square cutoff on the squared distance (df = number of features).
3. Modify consensus threshold (0.3–0.6). How does F1 change?
4. Implement One-Class SVM and add to the ensemble.
5. Engineer a physics-based rule: fit a linear relationship between the first two features on historical data and flag rows whose residual exceeds 3× the residual standard deviation.
6. Add a rolling window variant of Modified Z to simulate streaming detection.

### 🔜 Coming Up (2.3)
We will extend these foundations into **advanced statistical analysis** (ANOVA / multivariate process exploration) to attribute variation sources.

---
If running with synthetic fallback, consider downloading the real SECOM dataset for richer structure.
