# 2.1 Data Quality Analysis for Semiconductor Manufacturing

## 📚 Learning Objectives

By the end of this section, you will:
- Understand the fundamentals of data quality assessment
- Master data profiling techniques for semiconductor datasets
- Implement data validation frameworks
- Create comprehensive data quality reports
- Build automated data quality monitoring systems

## 🎯 What You'll Build

We'll work with the **SECOM dataset** - a real semiconductor manufacturing dataset with 1567 records and 590 process parameters. You'll learn to:

1. **Profile** the dataset to understand its characteristics
2. **Assess** data quality across multiple dimensions
3. **Identify** data quality issues and their impact
4. **Implement** automated validation frameworks
5. **Generate** actionable data quality reports

Let's start by importing the essential libraries and loading our data!

In [None]:
# Import essential libraries for data quality analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from typing import Dict, List, Tuple, Optional
import os
from pathlib import Path

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")
# Suppress library warnings to keep notebook output readable (remove this line when debugging)
warnings.filterwarnings('ignore')

# Dataset path resolution (per repo standard)
DATA_DIR = Path('../../../datasets').resolve()

print("✅ Libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

## 📊 Loading the SECOM Dataset

The **SECOM** (SEmiCOnductor Manufacturing) dataset contains:
- **1567 observations** of semiconductor manufacturing processes
- **590 process parameters** (sensors, measurements, etc.)
- **Binary target** indicating pass/fail status (in the UCI files, -1 = pass and 1 = fail)
- **Real-world** data with typical manufacturing data quality challenges

This dataset is perfect for learning data quality assessment because it contains:
- Missing values (common in manufacturing)
- Outliers (equipment malfunctions, measurement errors)
- Multicollinearity (related process parameters)
- Scale differences (various measurement units)
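
The last two challenges in this list, multicollinearity and scale differences, can be checked with a few pandas calls. Here is a minimal sketch on a made-up three-column frame. The names `toy` and `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions, not part of SECOM or of the framework built later in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few sensor columns
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "sensor_a": base,
    "sensor_b": base + rng.normal(scale=0.1, size=500),     # nearly a copy of sensor_a
    "sensor_c": rng.normal(loc=1000, scale=250, size=500),  # much larger scale
})

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_i, col_j, |r|) for column pairs whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]  # NaN entries (if any) fail the comparison and drop out
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

pairs = highly_correlated_pairs(toy)
scale_ratio = toy.std().max() / toy.std().min()

print("Highly correlated pairs:", pairs)
print(f"Largest / smallest standard deviation: {scale_ratio:.0f}x")
```

A large standard deviation ratio is a hint that scaling will be needed before distance-based or variance-based methods such as PCA. With 590 columns the correlation matrix is still cheap to compute, but the threshold should be tuned to the process.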

In [None]:
# Load the SECOM dataset
def load_secom_data():
    """
    Load the SECOM semiconductor manufacturing dataset.
    
    Returns:
        Tuple of (features_df, target_series, metadata_dict)
    """
    try:
        # Try to load from the standard datasets directory
        secom_dir = DATA_DIR / 'secom'
        features_path = secom_dir / 'secom.data'
        labels_path = secom_dir / 'secom_labels.data'
        
        # Load features and targets
        features_df = pd.read_csv(features_path, sep=" ", header=None)
        target_df = pd.read_csv(labels_path, sep=" ", header=None)
        
        # Clean column names
        features_df.columns = [f"sensor_{i:03d}" for i in range(len(features_df.columns))]
        target_series = target_df.iloc[:, 0]
        
        # Create metadata
        metadata = {
            "dataset_name": "SECOM",
            "description": "Semiconductor Manufacturing Process Data",
            "n_samples": len(features_df),
            "n_features": len(features_df.columns),
            "target_classes": target_series.value_counts().to_dict(),
            "source": "UCI Machine Learning Repository"
        }
        
        print(f"✅ SECOM dataset loaded successfully!")
        print(f"📏 Shape: {features_df.shape}")
        print(f"🎯 Target distribution: {dict(target_series.value_counts())}")
        
        return features_df, target_series, metadata
        
    except FileNotFoundError:
        print("❌ SECOM dataset not found. Creating synthetic data for demonstration...")
        return create_synthetic_secom_data()

def create_synthetic_secom_data():
    """Create synthetic semiconductor data that mimics SECOM characteristics.

    Note: the synthetic target is coded 0/1, whereas the real SECOM labels are -1/1.
    """
    np.random.seed(42)
    
    # Create synthetic features with manufacturing characteristics
    n_samples, n_features = 1567, 590
    
    # Base features with different distributions
    features = np.random.normal(0, 1, (n_samples, n_features))
    
    # Add manufacturing-specific patterns
    # Some sensors are highly correlated (process coupling)
    for i in range(0, min(50, n_features), 5):
        base_signal = np.random.normal(0, 1, n_samples)
        noise_level = 0.3
        for j in range(5):
            if i + j < n_features:
                features[:, i + j] = base_signal + np.random.normal(0, noise_level, n_samples)
    
    # Add missing values (realistic pattern)
    missing_rate = 0.1
    for col in range(n_features):
        n_missing = int(np.random.poisson(missing_rate * n_samples))
        if n_missing > 0:
            missing_indices = np.random.choice(n_samples, n_missing, replace=False)
            features[missing_indices, col] = np.nan
    
    # Add outliers (equipment malfunctions)
    outlier_rate = 0.02
    for col in range(n_features):
        n_outliers = int(np.random.poisson(outlier_rate * n_samples))
        if n_outliers > 0:
            outlier_indices = np.random.choice(n_samples, n_outliers, replace=False)
            outlier_values = np.random.choice([-1, 1], n_outliers) * np.random.exponential(5, n_outliers)
            features[outlier_indices, col] = outlier_values
    
    # Create DataFrame
    features_df = pd.DataFrame(features, columns=[f"sensor_{i:03d}" for i in range(n_features)])
    
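    # Create realistic target (some correlation with features)
    # Missing values are treated as 0 here; otherwise NaN would propagate through the
    # dot product and np.random.binomial would reject the resulting NaN probabilities.
    feature_importance = np.random.exponential(0.1, n_features)
    target_scores = np.dot(np.nan_to_num(features, nan=0.0), feature_importance)
    target_probs = 1 / (1 + np.exp(-target_scores / np.std(target_scores)))
    target_series = pd.Series(np.random.binomial(1, target_probs), name="target")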
    
    metadata = {
        "dataset_name": "Synthetic SECOM",
        "description": "Synthetic Semiconductor Manufacturing Process Data",
        "n_samples": n_samples,
        "n_features": n_features,
        "target_classes": target_series.value_counts().to_dict(),
        "source": "Synthetic (for demonstration)"
    }
    
    print(f"✅ Synthetic SECOM dataset created!")
    print(f"📏 Shape: {features_df.shape}")
    print(f"🎯 Target distribution: {dict(target_series.value_counts())}")
    
    return features_df, target_series, metadata

# Load the data
features_df, target_series, metadata = load_secom_data()

## 🔍 Basic Data Quality Assessment

Let's start with a fundamental data quality assessment. We'll examine:

1. **Data Completeness** - How much data is missing?
2. **Data Types** - Are the data types appropriate?
3. **Data Ranges** - What are the min/max values?
4. **Data Distribution** - How are values distributed?
5. **Data Consistency** - Are there any obvious inconsistencies?

In [None]:
def basic_data_quality_assessment(df: pd.DataFrame) -> Dict:
    """
    Perform basic data quality assessment on a DataFrame.
    
    Args:
        df: Input DataFrame
        
    Returns:
        Dictionary containing quality metrics
    """
    assessment = {}
    
    # Basic info
    assessment['shape'] = df.shape
    assessment['memory_usage'] = df.memory_usage(deep=True).sum()
    
    # Missing data analysis
    missing_count = df.isnull().sum()
    assessment['missing_data'] = {
        'total_missing': missing_count.sum(),
        'missing_percentage': (missing_count.sum() / (df.shape[0] * df.shape[1])) * 100,
        'columns_with_missing': (missing_count > 0).sum(),
        'columns_missing_percentage': (missing_count > 0).sum() / df.shape[1] * 100
    }
    
    # Data types
    assessment['data_types'] = df.dtypes.value_counts().to_dict()
    
    # Numeric columns analysis
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        assessment['numeric_summary'] = {
            'count': len(numeric_cols),
            'infinite_values': np.isinf(df[numeric_cols]).sum().sum(),
            'negative_values': (df[numeric_cols] < 0).sum().sum(),
            'zero_values': (df[numeric_cols] == 0).sum().sum()
        }
    
    # Duplicated rows
    assessment['duplicates'] = {
        'duplicate_rows': df.duplicated().sum(),
        'duplicate_percentage': df.duplicated().sum() / len(df) * 100
    }
    
    return assessment

# Perform basic assessment
print("🔍 Performing Basic Data Quality Assessment...")
quality_assessment = basic_data_quality_assessment(features_df)

# Display results
print("\n📊 BASIC DATA QUALITY REPORT")
print("=" * 50)
print(f"Dataset Shape: {quality_assessment['shape']}")
print(f"Memory Usage: {quality_assessment['memory_usage'] / 1024**2:.2f} MB")
print(f"\n📉 Missing Data:")
print(f"  Total Missing Values: {quality_assessment['missing_data']['total_missing']:,}")
print(f"  Missing Percentage: {quality_assessment['missing_data']['missing_percentage']:.2f}%")
print(f"  Columns with Missing: {quality_assessment['missing_data']['columns_with_missing']}")
print(f"\n📊 Data Types:")
for dtype, count in quality_assessment['data_types'].items():
    print(f"  {dtype}: {count} columns")
print(f"\n🔢 Numeric Data:")
if 'numeric_summary' in quality_assessment:
    print(f"  Numeric Columns: {quality_assessment['numeric_summary']['count']}")
    print(f"  Infinite Values: {quality_assessment['numeric_summary']['infinite_values']}")
    print(f"  Negative Values: {quality_assessment['numeric_summary']['negative_values']}")
    print(f"  Zero Values: {quality_assessment['numeric_summary']['zero_values']}")
print(f"\n🔄 Duplicates:")
print(f"  Duplicate Rows: {quality_assessment['duplicates']['duplicate_rows']}")
print(f"  Duplicate Percentage: {quality_assessment['duplicates']['duplicate_percentage']:.2f}%")
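
The assessment above covers completeness, data types, and duplicates, but it does not yet look at ranges or distributions from the checklist. The sketch below shows one way those could be profiled. `profile_ranges` and the toy values are hypothetical and only meant to illustrate the idea, in particular how constant columns (which carry no information) can be flagged.

```python
import numpy as np
import pandas as pd

def profile_ranges(df):
    """Per-column min, max, std, skewness, and a flag for constant columns."""
    numeric = df.select_dtypes(include=[np.number])
    profile = pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "std": numeric.std(),
        "skew": numeric.skew(),
        "n_unique": numeric.nunique(),  # NaN is not counted as a value
    })
    profile["is_constant"] = profile["n_unique"] <= 1
    return profile

# Made-up values for illustration only
toy = pd.DataFrame({
    "sensor_000": [1.0, 2.0, 3.0, np.nan],
    "sensor_001": [5.0, 5.0, 5.0, 5.0],   # constant column
    "sensor_002": [0.1, 0.2, 0.1, 40.0],  # one extreme value stretches the range
})

profile = profile_ranges(toy)
constant_cols = profile.index[profile["is_constant"]].tolist()

print(profile)
print("Constant columns:", constant_cols)
```

On the real data you would call `profile_ranges(features_df)` and sort by `std` or `skew` to find the columns that deserve a closer look.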

## 📈 Missing Data Analysis

Missing data is one of the most critical data quality issues in manufacturing. Let's dive deeper into understanding patterns of missingness in our semiconductor dataset.

### Types of Missing Data:
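1. **MCAR (Missing Completely at Random)** - The chance that a value is missing is unrelated to any data, observed or unobserved
2. **MAR (Missing at Random)** - The chance that a value is missing depends only on other observed data
3. **MNAR (Missing Not at Random)** - The chance that a value is missing depends on the unobserved value itself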

In semiconductor manufacturing, missing data often occurs due to:
- **Sensor failures** (equipment malfunction)
- **Maintenance windows** (scheduled downtime)
- **Process variations** (different measurement protocols)
- **Data transmission errors** (network issues)
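
One rough way to probe the missingness mechanism is to test whether "value is missing" is independent of something we do observe, such as the target. The sketch below uses a chi-square test from SciPy on made-up data. `missingness_vs_target` and both toy columns are hypothetical. A small p-value suggests the data are not MCAR; a large one does not prove MCAR, and MNAR cannot be detected from observed data alone.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def missingness_vs_target(column, target):
    """p-value of a chi-square independence test between 'is missing' and the target class."""
    table = pd.crosstab(column.isnull(), target)
    if table.shape[0] < 2 or table.shape[1] < 2:
        return float("nan")  # nothing missing (or everything), or only one class present
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value

# Made-up data: one column loses values at random, the other mostly when target == 1
rng = np.random.default_rng(42)
n = 1000
target = pd.Series(rng.integers(0, 2, size=n), name="target")
values = pd.Series(rng.normal(size=n))

random_gaps = values.mask(rng.random(n) < 0.2)
p_missing = np.where(target == 1, 0.5, 0.05)
target_linked_gaps = values.mask(rng.random(n) < p_missing)

p_random = missingness_vs_target(random_gaps, target)
p_linked = missingness_vs_target(target_linked_gaps, target)

print(f"Random gaps:        p = {p_random:.3f}")
print(f"Target-linked gaps: p = {p_linked:.2e}")
```

With 590 sensors, running this per column means many tests, so a multiple-testing correction would be sensible before drawing conclusions.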

In [None]:
def analyze_missing_patterns(df: pd.DataFrame, max_cols_to_show: int = 20) -> Dict:
    """
    Analyze missing data patterns in detail.
    
    Args:
        df: Input DataFrame
        max_cols_to_show: Maximum columns to show in detailed analysis
        
    Returns:
        Dictionary with missing pattern analysis
    """
    missing_analysis = {}
    
    # Calculate missing statistics per column
    missing_stats = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_percentage': (df.isnull().sum() / len(df)) * 100,
        'data_type': df.dtypes
    })
    
    # Sort by missing percentage
    missing_stats = missing_stats.sort_values('missing_percentage', ascending=False)
    missing_analysis['column_stats'] = missing_stats
    
    # Missing patterns by row
    missing_per_row = df.isnull().sum(axis=1)
    missing_analysis['row_stats'] = {
        'min_missing_per_row': missing_per_row.min(),
        'max_missing_per_row': missing_per_row.max(),
        'avg_missing_per_row': missing_per_row.mean(),
        'rows_with_no_missing': (missing_per_row == 0).sum(),
        'rows_with_all_missing': (missing_per_row == len(df.columns)).sum()
    }
    
    # Create missing pattern matrix (for visualization)
    if len(df.columns) <= max_cols_to_show:
        cols_to_analyze = df.columns[:max_cols_to_show]
    else:
        # Select top missing columns
        cols_to_analyze = missing_stats.head(max_cols_to_show).index
    
    missing_analysis['pattern_matrix'] = df[cols_to_analyze].isnull()
    
    return missing_analysis

# Analyze missing patterns
print("🔍 Analyzing Missing Data Patterns...")
missing_analysis = analyze_missing_patterns(features_df)

# Display top columns with missing data
print("\n📊 TOP 10 COLUMNS WITH MISSING DATA")
print("=" * 50)
top_missing = missing_analysis['column_stats'].head(10)
for idx, (col, row) in enumerate(top_missing.iterrows(), 1):
    # int() keeps the 'd' format safe even if iterrows() upcasts the count to float
    print(f"{idx:2d}. {col}: {int(row['missing_count']):4d} missing ({row['missing_percentage']:5.1f}%)")

# Display row statistics
print(f"\n📊 MISSING DATA BY ROW")
print("=" * 50)
row_stats = missing_analysis['row_stats']
print(f"Rows with no missing data: {row_stats['rows_with_no_missing']:,}")
print(f"Average missing per row: {row_stats['avg_missing_per_row']:.1f}")
print(f"Max missing in a single row: {row_stats['max_missing_per_row']}")

# Visualize missing patterns
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Missing data heatmap (top 20 columns)
top_20_missing_cols = missing_analysis['column_stats'].head(20).index
missing_matrix = features_df[top_20_missing_cols].isnull()
sns.heatmap(missing_matrix.iloc[:200], 
            xticklabels=True, yticklabels=False, 
            cbar=True, ax=axes[0,0])
axes[0,0].set_title('Missing Data Pattern (Top 20 Columns, First 200 Rows)')
axes[0,0].set_xlabel('Sensor Columns')

# 2. Missing percentage per column histogram
missing_pcts = missing_analysis['column_stats']['missing_percentage']
axes[0,1].hist(missing_pcts, bins=30, alpha=0.7, edgecolor='black')
axes[0,1].set_title('Distribution of Missing Percentages Across Columns')
axes[0,1].set_xlabel('Missing Percentage (%)')
axes[0,1].set_ylabel('Number of Columns')

# 3. Missing count per row histogram
missing_per_row = features_df.isnull().sum(axis=1)
axes[1,0].hist(missing_per_row, bins=30, alpha=0.7, edgecolor='black')
axes[1,0].set_title('Distribution of Missing Values per Row')
axes[1,0].set_xlabel('Number of Missing Values')
axes[1,0].set_ylabel('Number of Rows')

# 4. Missing vs Target correlation (if applicable)
if target_series is not None:
    missing_by_target = pd.DataFrame({
        'missing_count': missing_per_row,
        'target': target_series
    })
    missing_by_target.boxplot(column='missing_count', by='target', ax=axes[1,1])
    # pandas adds a figure-level "Boxplot grouped by target" title; clear it
    fig.suptitle('')
    axes[1,1].set_title('Missing Values by Target Class')
    axes[1,1].set_xlabel('Target Class')
    axes[1,1].set_ylabel('Missing Values per Row')

plt.tight_layout()
plt.show()

## 🎯 Data Quality Dimensions Framework

In manufacturing, we assess data quality across **6 critical dimensions**:

1. **Completeness** - Is all required data present?
2. **Accuracy** - Is the data correct and precise?
3. **Consistency** - Is the data uniform across sources?
4. **Validity** - Does the data conform to defined formats/rules?
5. **Uniqueness** - Are there inappropriate duplicates?
6. **Timeliness** - Is the data current and up-to-date?

Let's implement a comprehensive framework to assess all dimensions:

In [None]:
class DataQualityFramework:
    """
    Comprehensive data quality assessment framework for semiconductor manufacturing.
    """
    
    def __init__(self, df: pd.DataFrame, target: Optional[pd.Series] = None):
        self.df = df
        self.target = target
        self.quality_report = {}
        
    def assess_completeness(self) -> Dict:
        """Assess data completeness."""
        missing_data = self.df.isnull()
        
        completeness = {
            'overall_completeness': (1 - missing_data.sum().sum() / (self.df.shape[0] * self.df.shape[1])) * 100,
            'column_completeness': ((1 - missing_data.sum() / len(self.df)) * 100).to_dict(),
            'row_completeness': ((1 - missing_data.sum(axis=1) / len(self.df.columns)) * 100).describe().to_dict(),
            'complete_rows_percentage': ((missing_data.sum(axis=1) == 0).sum() / len(self.df)) * 100
        }
        
        return completeness
    
    def assess_accuracy(self) -> Dict:
        """Assess data accuracy using statistical methods."""
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        accuracy = {}
        
        # Outlier detection using IQR method
        outlier_counts = {}
        for col in numeric_cols:
            if self.df[col].notna().sum() > 0:
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                outliers = ((self.df[col] < lower_bound) | (self.df[col] > upper_bound)).sum()
                outlier_counts[col] = outliers
        
        accuracy['outlier_analysis'] = {
            'total_outliers': sum(outlier_counts.values()),
            'outlier_percentage': sum(outlier_counts.values()) / (self.df.shape[0] * len(numeric_cols)) * 100,
            'columns_with_outliers': sum(1 for count in outlier_counts.values() if count > 0),
            'top_outlier_columns': dict(sorted(outlier_counts.items(), key=lambda x: x[1], reverse=True)[:10])
        }
        
        # Statistical consistency checks
        accuracy['statistical_checks'] = {
            'infinite_values': np.isinf(self.df[numeric_cols]).sum().sum(),
            # Placeholders: these need domain-specific limits, so None marks them as not evaluated
            'negative_values_in_positive_cols': None,
            'values_outside_expected_range': None
        }
        
        return accuracy
    
    def assess_consistency(self) -> Dict:
        """Assess data consistency."""
        consistency = {}
        
        # Data type consistency
        dtype_counts = self.df.dtypes.value_counts()
        consistency['data_type_diversity'] = len(dtype_counts)
        consistency['dominant_data_type'] = dtype_counts.index[0]
        
        # Scale consistency (for numeric columns)
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) > 0:
            ranges = {}
            for col in numeric_cols:
                if self.df[col].notna().sum() > 0:
                    ranges[col] = self.df[col].max() - self.df[col].min()
            
            if ranges:
                range_values = list(ranges.values())
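                consistency['scale_analysis'] = {
                    'min_range': min(range_values),
                    'max_range': max(range_values),
                    'range_ratio': max(range_values) / min(range_values) if min(range_values) > 0 else np.inf,
                    # Heuristic: counts columns whose range exceeds the 90th percentile of all ranges,
                    # so roughly 10% of columns are flagged regardless of the data
                    'columns_needing_scaling': sum(1 for r in range_values if r > np.percentile(range_values, 90))
                }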
        
        return consistency
    
    def assess_validity(self) -> Dict:
        """Assess data validity against expected formats and rules."""
        validity = {}
        
        # Numeric validity
        # Note: np.isfinite is False for NaN as well as for +/-inf, so finite_percentage
        # also drops with missing values and overlaps with the completeness score.
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        validity['numeric_validity'] = {
            'finite_percentage': (np.isfinite(self.df[numeric_cols]).sum().sum() /
                                (self.df[numeric_cols].size)) * 100,
            'non_null_percentage': (self.df[numeric_cols].notna().sum().sum() /
                                  (self.df[numeric_cols].size)) * 100
        }
        
        # Domain-specific rules (example for semiconductor data)
        validity['domain_rules'] = {
            # True only if every column name follows the sensor_XXX format
            'sensors_in_expected_format': bool(
                self.df.columns.astype(str).str.fullmatch(r'sensor_\d{3}').all()
            ),
            # Not evaluated here: needs per-sensor engineering limits
            'reasonable_sensor_values': None
        }
        
        return validity
    
    def assess_uniqueness(self) -> Dict:
        """Assess data uniqueness."""
        uniqueness = {
            'duplicate_rows': self.df.duplicated().sum(),
            'duplicate_percentage': (self.df.duplicated().sum() / len(self.df)) * 100,
            'unique_rows': len(self.df) - self.df.duplicated().sum()
        }
        
        # Column-wise uniqueness
        column_uniqueness = {}
        for col in self.df.columns:
            unique_count = self.df[col].nunique()
            total_count = self.df[col].notna().sum()
            if total_count > 0:
                column_uniqueness[col] = unique_count / total_count
        
        uniqueness['column_uniqueness'] = column_uniqueness
        uniqueness['low_uniqueness_columns'] = [col for col, ratio in column_uniqueness.items() if ratio < 0.1]
        
        return uniqueness
    
    def assess_timeliness(self) -> Dict:
        """Assess data timeliness (placeholder for time-based analysis)."""
        # In a real scenario, this would analyze timestamps, data freshness, etc.
        timeliness = {
            'data_freshness': 'Not applicable (no timestamp columns)',
            'update_frequency': 'Not applicable (snapshot data)',
            'completeness_over_time': 'Not applicable (no timestamp columns)'
        }
        
        return timeliness
    
    def generate_comprehensive_report(self) -> Dict:
        """Generate a comprehensive data quality report."""
        print("🔍 Generating Comprehensive Data Quality Report...")
        
        self.quality_report = {
            'completeness': self.assess_completeness(),
            'accuracy': self.assess_accuracy(),
            'consistency': self.assess_consistency(),
            'validity': self.assess_validity(),
            'uniqueness': self.assess_uniqueness(),
            'timeliness': self.assess_timeliness()
        }
        
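        # Calculate overall quality score: unweighted mean of the four dimensions that
        # have a numeric score (consistency and timeliness are reported but not scored)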
        scores = {
            'completeness': self.quality_report['completeness']['overall_completeness'],
            'accuracy': 100 - self.quality_report['accuracy']['outlier_analysis']['outlier_percentage'],
            'validity': self.quality_report['validity']['numeric_validity']['finite_percentage'],
            'uniqueness': 100 - self.quality_report['uniqueness']['duplicate_percentage']
        }
        
        self.quality_report['overall_score'] = np.mean(list(scores.values()))
        self.quality_report['dimension_scores'] = scores
        
        return self.quality_report
    
    def print_quality_report(self):
        """Print a formatted quality report."""
        if not self.quality_report:
            self.generate_comprehensive_report()
        
        print("\n" + "="*60)
        print("📊 COMPREHENSIVE DATA QUALITY REPORT")
        print("="*60)
        
        print(f"\n🎯 OVERALL QUALITY SCORE: {self.quality_report['overall_score']:.1f}/100")
        
        print(f"\n📏 DIMENSION SCORES:")
        for dimension, score in self.quality_report['dimension_scores'].items():
            status = "✅" if score >= 80 else "⚠️" if score >= 60 else "❌"
            print(f"  {status} {dimension.title()}: {score:.1f}%")
        
        # Detailed findings
        print(f"\n📊 DETAILED FINDINGS:")
        
        # Completeness
        comp = self.quality_report['completeness']
        print(f"\n  📈 Completeness:")
        print(f"    Overall: {comp['overall_completeness']:.1f}%")
        print(f"    Complete rows: {comp['complete_rows_percentage']:.1f}%")
        
        # Accuracy
        acc = self.quality_report['accuracy']
        print(f"\n  🎯 Accuracy:")
        print(f"    Outlier percentage: {acc['outlier_analysis']['outlier_percentage']:.2f}%")
        print(f"    Columns with outliers: {acc['outlier_analysis']['columns_with_outliers']}")
        
        # Consistency
        cons = self.quality_report['consistency']
        print(f"\n  🔄 Consistency:")
        print(f"    Data type diversity: {cons['data_type_diversity']} types")
        if 'scale_analysis' in cons:
            print(f"    Scale range ratio: {cons['scale_analysis']['range_ratio']:.1f}")
        
        # Validity
        val = self.quality_report['validity']
        print(f"\n  ✅ Validity:")
        print(f"    Finite values: {val['numeric_validity']['finite_percentage']:.1f}%")
        
        # Uniqueness
        uniq = self.quality_report['uniqueness']
        print(f"\n  🔀 Uniqueness:")
        print(f"    Duplicate rows: {uniq['duplicate_rows']} ({uniq['duplicate_percentage']:.2f}%)")
        print(f"    Low uniqueness columns: {len(uniq['low_uniqueness_columns'])}")

# Create and run the quality assessment
print("🚀 Initializing Data Quality Framework...")
dq_framework = DataQualityFramework(features_df, target_series)
quality_report = dq_framework.generate_comprehensive_report()
dq_framework.print_quality_report()
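
The timeliness dimension is only a placeholder in the framework because `features_df` has no timestamp column. If a timestamp per record is available (the SECOM labels file is commonly described as carrying one next to the label; treat that as an assumption to verify against your copy), a first timeliness check could look like the sketch below. The function name, the two-hour expected interval, and the timestamps are all made up for illustration.

```python
import pandas as pd

def assess_timeliness_from_timestamps(timestamps, expected_interval, as_of):
    """Summarize data age and gaps between consecutive records."""
    ts = pd.to_datetime(timestamps).sort_values().reset_index(drop=True)
    gaps = ts.diff().dropna()  # time between consecutive records
    return {
        "first_record": ts.iloc[0],
        "last_record": ts.iloc[-1],
        "data_age": as_of - ts.iloc[-1],
        "median_gap": gaps.median(),
        "max_gap": gaps.max(),
        "gaps_over_expected": int((gaps > expected_interval).sum()),
    }

# Hypothetical timestamps, not taken from the dataset
timestamps = pd.Series([
    "2008-07-19 11:55", "2008-07-19 12:32", "2008-07-19 13:17",
    "2008-07-19 14:43", "2008-07-20 09:10",
])

timeliness = assess_timeliness_from_timestamps(
    timestamps,
    expected_interval=pd.Timedelta(hours=2),
    as_of=pd.Timestamp("2008-07-21 00:00"),
)

for key, value in timeliness.items():
    print(f"{key}: {value}")
```

In this toy series only the overnight gap exceeds the two-hour expectation, which is the kind of signal that would point to a maintenance window or a logging outage.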

## 📊 Data Quality Visualization Dashboard

Let's create comprehensive visualizations to understand our data quality assessment better. This dashboard will help identify patterns and prioritize data quality improvements.

In [None]:
def create_data_quality_dashboard(dq_framework: DataQualityFramework):
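    """Create a comprehensive data quality visualization dashboard."""
    # The dashboard reads dq_framework.quality_report, so build it if it is still empty
    if not dq_framework.quality_report:
        dq_framework.generate_comprehensive_report()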
    
    fig = plt.figure(figsize=(20, 16))
    
    # Create a grid layout
    gs = fig.add_gridspec(4, 4, height_ratios=[1, 1, 1, 1], width_ratios=[1, 1, 1, 1])
    
    # 1. Quality scores by dimension (horizontal bar chart)
    ax1 = fig.add_subplot(gs[0, :2])
    scores = dq_framework.quality_report['dimension_scores']
    dimensions = list(scores.keys())
    values = list(scores.values())
    
    # Create bar chart for quality scores
    bars = ax1.barh(dimensions, values, color=['green' if v >= 80 else 'orange' if v >= 60 else 'red' for v in values])
    ax1.set_xlim(0, 100)
    ax1.set_title('Data Quality Scores by Dimension', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Quality Score (%)')
    
    # Add value labels on bars
    for bar, value in zip(bars, values):
        ax1.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2, 
                f'{value:.1f}%', va='center', fontweight='bold')
    
    # 2. Missing Data Heatmap
    ax2 = fig.add_subplot(gs[0, 2:])
    missing_cols = dq_framework.df.isnull().sum().sort_values(ascending=False).head(20).index
    missing_data = dq_framework.df[missing_cols].isnull().iloc[:100]  # First 100 rows
    
    sns.heatmap(missing_data.T, cbar=True, ax=ax2, 
                cmap='RdYlBu_r', yticklabels=True, xticklabels=False)
    ax2.set_title('Missing Data Pattern (Top 20 Columns)', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Samples (First 100)')
    
    # 3. Completeness Analysis
    ax3 = fig.add_subplot(gs[1, 0])
    completeness_pct = (1 - dq_framework.df.isnull().sum() / len(dq_framework.df)) * 100
    ax3.hist(completeness_pct, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    ax3.set_title('Column Completeness Distribution')
    ax3.set_xlabel('Completeness (%)')
    ax3.set_ylabel('Number of Columns')
    ax3.axvline(completeness_pct.mean(), color='red', linestyle='--', label=f'Mean: {completeness_pct.mean():.1f}%')
    ax3.legend()
    
    # 4. Outlier Analysis
    ax4 = fig.add_subplot(gs[1, 1])
    numeric_cols = dq_framework.df.select_dtypes(include=[np.number]).columns[:10]  # First 10 numeric columns
    outlier_counts = []
    
    for col in numeric_cols:
        if dq_framework.df[col].notna().sum() > 0:
            Q1 = dq_framework.df[col].quantile(0.25)
            Q3 = dq_framework.df[col].quantile(0.75)
            IQR = Q3 - Q1
            outliers = ((dq_framework.df[col] < Q1 - 1.5*IQR) | (dq_framework.df[col] > Q3 + 1.5*IQR)).sum()
            outlier_counts.append(outliers)
        else:
            outlier_counts.append(0)
    
    ax4.bar(range(len(numeric_cols)), outlier_counts, color='coral', alpha=0.7)
    ax4.set_title('Outlier Count by Column (First 10)')
    ax4.set_xlabel('Column Index')
    ax4.set_ylabel('Number of Outliers')
    ax4.set_xticks(range(len(numeric_cols)))
    ax4.set_xticklabels([f'C{i}' for i in range(len(numeric_cols))], rotation=45)
    
    # 5. Data Distribution Overview
    ax5 = fig.add_subplot(gs[1, 2])
    sample_col = numeric_cols[0]  # Sample column for distribution
    ax5.hist(dq_framework.df[sample_col].dropna(), bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
    ax5.set_title(f'Distribution Example ({sample_col})')
    ax5.set_xlabel('Value')
    ax5.set_ylabel('Frequency')
    
    # 6. Correlation with Target (if available)
    ax6 = fig.add_subplot(gs[1, 3])
    if dq_framework.target is not None:
        missing_per_row = dq_framework.df.isnull().sum(axis=1)
        correlation_data = pd.DataFrame({
            'missing_count': missing_per_row,
            'target': dq_framework.target
        })
        
        # Box plot of missing values by target
        correlation_data.boxplot(column='missing_count', by='target', ax=ax6)
        # pandas adds a figure-level "Boxplot grouped by target" title; clear it
        fig.suptitle('')
        ax6.set_title('Missing Values by Target Class')
        ax6.set_xlabel('Target Class')
        ax6.set_ylabel('Missing Values per Row')
    else:
        ax6.text(0.5, 0.5, 'No Target Variable\nAvailable', 
                ha='center', va='center', transform=ax6.transAxes, fontsize=12)
        ax6.set_title('Target Correlation Analysis')
    
    # 7. Scale Analysis
    ax7 = fig.add_subplot(gs[2, :2])
    ranges = []
    means = []
    stds = []
    
    for col in numeric_cols:
        if dq_framework.df[col].notna().sum() > 0:
            ranges.append(dq_framework.df[col].max() - dq_framework.df[col].min())
            means.append(dq_framework.df[col].mean())
            stds.append(dq_framework.df[col].std())
        else:
            ranges.append(0)
            means.append(0)
            stds.append(0)
    
    x = range(len(numeric_cols))
    ax7_twin = ax7.twinx()
    
    bars1 = ax7.bar([i - 0.2 for i in x], ranges, 0.4, label='Range', alpha=0.7, color='lightblue')
    bars2 = ax7_twin.bar([i + 0.2 for i in x], stds, 0.4, label='Std Dev', alpha=0.7, color='orange')
    
    ax7.set_title('Scale Analysis (Range vs Standard Deviation)')
    ax7.set_xlabel('Column Index')
    ax7.set_ylabel('Range', color='blue')
    ax7_twin.set_ylabel('Standard Deviation', color='orange')
    ax7.set_xticks(x)
    ax7.set_xticklabels([f'C{i}' for i in x], rotation=45)
    ax7.legend(loc='upper left')
    ax7_twin.legend(loc='upper right')
    
    # 8. Data Quality Issues Summary
    ax8 = fig.add_subplot(gs[2, 2:])
    
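    # Calculate issue counts over the full dataset so the slices are comparable
    # (outlier_counts above only covers the first 10 columns shown in panel 4)
    issues = {
        'Missing Values': dq_framework.df.isnull().sum().sum(),
        'Potential Outliers': dq_framework.quality_report['accuracy']['outlier_analysis']['total_outliers'],
        'Duplicate Rows': dq_framework.df.duplicated().sum(),
        'Infinite Values': np.isinf(dq_framework.df.select_dtypes(include=[np.number])).sum().sum(),
        # Negative readings are only an issue for sensors that must be non-negative
        'Negative Values': (dq_framework.df.select_dtypes(include=[np.number]) < 0).sum().sum()
    }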
    
    issue_names = list(issues.keys())
    issue_counts = list(issues.values())
    colors = ['red', 'orange', 'yellow', 'purple', 'brown']
    
    wedges, texts, autotexts = ax8.pie(issue_counts, labels=issue_names, autopct='%1.1f%%', 
                                      colors=colors, startangle=90)
    ax8.set_title('Distribution of Data Quality Issues')
    
    # 9. Overall Quality Summary
    ax9 = fig.add_subplot(gs[3, :])
    ax9.axis('off')
    
    # Create summary text
    overall_score = dq_framework.quality_report['overall_score']
    summary_text = f"""
    📊 DATA QUALITY SUMMARY
    
    Overall Quality Score: {overall_score:.1f}/100
    
    Key Findings:
    • Total Records: {len(dq_framework.df):,}
    • Total Features: {len(dq_framework.df.columns):,}
    • Missing Values: {dq_framework.df.isnull().sum().sum():,} ({(dq_framework.df.isnull().sum().sum() / dq_framework.df.size * 100):.1f}%)
    • Complete Rows: {(dq_framework.df.isnull().sum(axis=1) == 0).sum():,} ({((dq_framework.df.isnull().sum(axis=1) == 0).sum() / len(dq_framework.df) * 100):.1f}%)
    • Duplicate Rows: {dq_framework.df.duplicated().sum():,}
    
    Recommendations:
    • Focus on columns with >50% missing values
    • Investigate outlier patterns for process insights
    • Consider data imputation strategies for missing values
    • Implement data validation rules for future data collection
    """
    
    ax9.text(0.05, 0.95, summary_text, transform=ax9.transAxes, fontsize=12,
            verticalalignment='top', fontfamily='monospace',
            bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
    
    plt.tight_layout()
    plt.show()

# Create the dashboard
print("📊 Creating Data Quality Dashboard...")
create_data_quality_dashboard(dq_framework)

## 🔧 Data Quality Improvement Recommendations

Based on our comprehensive analysis, let's generate specific, actionable recommendations for improving data quality:

In [None]:
def generate_data_quality_recommendations(dq_framework: DataQualityFramework) -> Dict:
    """
    Generate actionable data quality improvement recommendations.
    
    Args:
        dq_framework: Initialized DataQualityFramework with quality report
        
    Returns:
        Dictionary with recommendations by category
    """
    recommendations = {
        'high_priority': [],
        'medium_priority': [],
        'low_priority': [],
        'preventive_measures': []
    }
    
    report = dq_framework.quality_report
    df = dq_framework.df
    
    # High Priority Recommendations
    
    # Missing data issues
    completeness = report['completeness']['overall_completeness']
    if completeness < 80:
        recommendations['high_priority'].append({
            'issue': 'Low Overall Completeness',
            'severity': 'HIGH',
            'description': f'Only {completeness:.1f}% of data is complete',
            'action': 'Implement data imputation strategy or investigate data collection issues',
            'impact': 'Critical for model training and analysis reliability'
        })
    
    # High missing columns
    missing_stats = df.isnull().sum() / len(df) * 100
    high_missing_cols = missing_stats[missing_stats > 50].index.tolist()
    if high_missing_cols:
        recommendations['high_priority'].append({
            'issue': 'Columns with >50% Missing Data',
            'severity': 'HIGH',
            'description': f'{len(high_missing_cols)} columns have >50% missing values',
            'action': 'Consider removing these columns or investigate sensor/collection issues',
            'columns': high_missing_cols[:5],  # Show first 5
            'impact': 'These columns provide little analytical value'
        })
    
    # Outlier concentration
    outlier_pct = report['accuracy']['outlier_analysis']['outlier_percentage']
    if outlier_pct > 10:
        recommendations['high_priority'].append({
            'issue': 'High Outlier Concentration',
            'severity': 'HIGH',
            'description': f'{outlier_pct:.1f}% of values are potential outliers',
            'action': 'Investigate process anomalies and implement outlier handling strategy',
            'impact': 'High outlier rates may indicate process instability'
        })
    
    # Medium Priority Recommendations
    
    # Scale inconsistency
    if 'scale_analysis' in report['consistency']:
        range_ratio = report['consistency']['scale_analysis']['range_ratio']
        if range_ratio > 1000:
            recommendations['medium_priority'].append({
                'issue': 'Inconsistent Data Scales',
                'severity': 'MEDIUM',
                'description': f'Feature scales vary by factor of {range_ratio:.0f}',
                'action': 'Implement feature scaling (standardization or normalization)',
                'impact': 'May affect machine learning model performance'
            })
    
    # Low uniqueness columns
    low_unique_cols = report['uniqueness']['low_uniqueness_columns']
    if low_unique_cols:
        recommendations['medium_priority'].append({
            'issue': 'Low Uniqueness Columns',
            'severity': 'MEDIUM',
            'description': f'{len(low_unique_cols)} columns have <10% unique values',
            'action': 'Consider removing or engineering these features',
            'columns': low_unique_cols[:5],
            'impact': 'Limited information content for analysis'
        })
    
    # Duplicate rows
    duplicate_pct = report['uniqueness']['duplicate_percentage']
    if duplicate_pct > 1:
        recommendations['medium_priority'].append({
            'issue': 'Duplicate Rows Present',
            'severity': 'MEDIUM',
            'description': f'{duplicate_pct:.2f}% of rows are duplicates',
            'action': 'Investigate the source of the duplicates and drop the redundant rows',
            'impact': 'May bias analysis and model training'
        })
    
    # Low Priority Recommendations
    
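    # Moderate missing data: completeness below 80% is already reported as HIGH
    # above, so this check covers only the 80-95% band (95 is an adjustable cutoff)
    if 80 <= completeness < 95: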
        recommendations['low_priority'].append({
            'issue': 'Moderate Missing Data',
            'severity': 'LOW',
            'description': f'Overall completeness is {completeness:.1f}%',
            'action': 'Monitor missing data patterns and implement targeted imputation',
            'impact': 'May limit some analytical approaches'
        })
    
    # Preventive Measures
    recommendations['preventive_measures'] = [
        {
            'measure': 'Implement Real-time Data Quality Monitoring',
            'description': 'Set up automated checks for missing values, outliers, and data ranges',
            'benefit': 'Early detection of data quality issues'
        },
        {
            'measure': 'Establish Data Validation Rules',
            'description': 'Define acceptable ranges and formats for each sensor/parameter',
            'benefit': 'Prevent invalid data from entering the system'
        },
        {
            'measure': 'Create Data Quality Dashboards',
            'description': 'Build monitoring dashboards for operations teams',
            'benefit': 'Continuous visibility into data health'
        },
        {
            'measure': 'Implement Data Lineage Tracking',
            'description': 'Track data from sensors through processing pipelines',
            'benefit': 'Quick identification of data quality issue sources'
        },
        {
            'measure': 'Regular Data Quality Audits',
            'description': 'Schedule periodic comprehensive data quality assessments',
            'benefit': 'Proactive identification of emerging issues'
        }
    ]
    
    return recommendations

def print_recommendations(recommendations: Dict):
    """Print formatted recommendations."""
    
    print("\n" + "="*70)
    print("🎯 DATA QUALITY IMPROVEMENT RECOMMENDATIONS")
    print("="*70)
    
    # High Priority
    if recommendations['high_priority']:
        print(f"\n🚨 HIGH PRIORITY ACTIONS")
        print("-" * 30)
        for i, rec in enumerate(recommendations['high_priority'], 1):
            print(f"\n{i}. {rec['issue']}")
            print(f"   Description: {rec['description']}")
            print(f"   Action: {rec['action']}")
            print(f"   Impact: {rec['impact']}")
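            if 'columns' in rec:
                # Column labels may be integers (e.g. sensor indices), so cast before joining
                shown = ', '.join(str(col) for col in rec['columns'][:3])
                print(f"   Affected Columns: {shown}{'...' if len(rec['columns']) > 3 else ''}")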
    
    # Medium Priority
    if recommendations['medium_priority']:
        print(f"\n⚠️  MEDIUM PRIORITY ACTIONS")
        print("-" * 30)
        for i, rec in enumerate(recommendations['medium_priority'], 1):
            print(f"\n{i}. {rec['issue']}")
            print(f"   Description: {rec['description']}")
            print(f"   Action: {rec['action']}")
            print(f"   Impact: {rec['impact']}")
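            if 'columns' in rec:
                # Column labels may be integers (e.g. sensor indices), so cast before joining
                shown = ', '.join(str(col) for col in rec['columns'][:3])
                print(f"   Affected Columns: {shown}{'...' if len(rec['columns']) > 3 else ''}")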
    
    # Low Priority
    if recommendations['low_priority']:
        print(f"\n📋 LOW PRIORITY ACTIONS")
        print("-" * 30)
        for i, rec in enumerate(recommendations['low_priority'], 1):
            print(f"\n{i}. {rec['issue']}")
            print(f"   Description: {rec['description']}")
            print(f"   Action: {rec['action']}")
            print(f"   Impact: {rec['impact']}")
    
    # Preventive Measures
    print(f"\n🛡️  PREVENTIVE MEASURES")
    print("-" * 30)
    for i, measure in enumerate(recommendations['preventive_measures'], 1):
        print(f"\n{i}. {measure['measure']}")
        print(f"   Description: {measure['description']}")
        print(f"   Benefit: {measure['benefit']}")

# Generate and display recommendations
print("🎯 Generating Data Quality Improvement Recommendations...")
recommendations = generate_data_quality_recommendations(dq_framework)
print_recommendations(recommendations)
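
The printed output is convenient to read in a notebook, but a tabular form is easier to share or archive. The sketch below flattens a recommendations dictionary with the same keys as the one returned above into a pandas DataFrame. The helper name `recommendations_to_frame` and the sample entries are hypothetical and exist only for illustration; they are not part of the framework.

```python
import pandas as pd

def recommendations_to_frame(recommendations: dict) -> pd.DataFrame:
    """Flatten the prioritized recommendation lists into one table (illustrative helper)."""
    columns = ["priority", "issue", "action", "impact", "n_columns_listed"]
    rows = []
    for priority in ("high_priority", "medium_priority", "low_priority"):
        for rec in recommendations.get(priority, []):
            rows.append({
                "priority": priority.replace("_priority", ""),
                "issue": rec["issue"],
                "action": rec["action"],
                "impact": rec["impact"],
                # Only some recommendations carry an example column list
                "n_columns_listed": len(rec.get("columns", [])),
            })
    return pd.DataFrame(rows, columns=columns)

# Hypothetical sample input that mirrors the structure used in this section
sample = {
    "high_priority": [{
        "issue": "Columns with >50% Missing Data",
        "action": "Consider removing these columns",
        "impact": "These columns provide little analytical value",
        "columns": ["sensor_a", "sensor_b"],
    }],
    "medium_priority": [{
        "issue": "Duplicate Rows Present",
        "action": "Investigate and drop redundant rows",
        "impact": "May bias analysis and model training",
    }],
    "low_priority": [],
}

table = recommendations_to_frame(sample)
print(table.to_string(index=False))
```

From here, `table.to_csv(...)` would write the table to a file if a persistent report is needed.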

## 🚀 Next Steps and Learning Summary

Congratulations! You've completed **Section 2.1: Data Quality Analysis**. Let's summarize what you've learned and prepare for the next section.

### 🎓 What You've Mastered:

1. **Data Quality Assessment Framework** - You can now systematically evaluate data quality across 6 dimensions
2. **Missing Data Analysis** - You understand patterns of missingness and their implications
3. **Statistical Quality Checks** - You can identify outliers, inconsistencies, and validity issues
4. **Visualization Techniques** - You can create comprehensive dashboards for data quality monitoring
5. **Actionable Recommendations** - You can generate prioritized improvement plans

### 🔄 Key Takeaways for Semiconductor Manufacturing:

- **Missing data** often indicates sensor failures or maintenance windows
- **Outliers** may reveal process anomalies or equipment malfunctions
- **Scale differences** between sensors call for standardization or normalization before scale-sensitive analyses such as PCA
- **Real-time monitoring** is crucial for production environments
- **Data quality directly impacts** model performance and business decisions
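
To make the missing-data and scaling takeaways concrete, here is a minimal sketch that fills gaps with each column's median and then standardizes the result. The function name `impute_and_standardize`, the toy column names, and the choice of median imputation are assumptions made for illustration; the right strategy depends on why the values are missing.

```python
import numpy as np
import pandas as pd

def impute_and_standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute numeric columns, drop constant ones, and z-score the rest."""
    numeric = df.select_dtypes(include=[np.number])
    filled = numeric.fillna(numeric.median())
    std = filled.std(ddof=0)
    # Constant (or entirely empty) columns have no usable spread and would
    # cause a division by zero, so they are left out
    varying = std[std > 0].index
    return (filled[varying] - filled[varying].mean()) / std[varying]

# Toy data with very different scales, gaps, and one constant column
toy = pd.DataFrame({
    "pressure": [1000.0, 1010.0, np.nan, 990.0],
    "flow": [0.010, 0.012, 0.011, np.nan],
    "constant": [5.0, 5.0, 5.0, 5.0],
})

scaled = impute_and_standardize(toy)
print(scaled.round(3))
```

After this step every remaining column has zero mean and unit variance, so no single sensor dominates purely because of its units.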

### 📊 Production-Ready Skills:

You can now build:
- ✅ Automated data quality monitoring systems
- ✅ Comprehensive quality assessment reports  
- ✅ Data validation frameworks
- ✅ Quality improvement recommendation engines

### 🎯 Next Section Preview: **2.2 Outlier Detection**

In the next section, you'll learn:
- Advanced outlier detection algorithms (Isolation Forest, One-Class SVM)
- Time-series anomaly detection
- Domain-specific outlier rules for semiconductor manufacturing
- Real-time outlier detection and alerting systems

### 💡 Practice Exercise:

Try applying this framework to your own semiconductor datasets or explore the SECOM data further by:
1. Investigating specific sensors with high missing rates
2. Analyzing outlier patterns by process conditions
3. Creating custom validation rules for your domain
4. Building automated quality monitoring alerts
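
As a starting point for exercises 3 and 4, the sketch below checks columns against acceptable ranges and raises an alert flag when too many readings fall outside them. The column names, limits, and the 1% alert threshold are hypothetical placeholders; real limits should come from your process specifications.

```python
import numpy as np
import pandas as pd

# Hypothetical acceptable ranges per parameter: (lower limit, upper limit)
RANGE_RULES = {
    "chamber_pressure": (0.5, 2.0),
    "wafer_temp": (300.0, 450.0),
}

def check_ranges(df: pd.DataFrame, rules: dict, max_violation_pct: float = 1.0) -> pd.DataFrame:
    """Report the share of out-of-range readings per column and flag alerts."""
    rows = []
    for col, (low, high) in rules.items():
        if col not in df.columns:
            # A missing column is itself worth an alert
            rows.append({"column": col, "violation_pct": float("nan"), "alert": True})
            continue
        # Missing values are excluded here; completeness is assessed separately
        observed = df[col].dropna()
        n_out = ((observed < low) | (observed > high)).sum()
        pct = 100.0 * n_out / len(observed) if len(observed) else 0.0
        rows.append({"column": col, "violation_pct": pct, "alert": bool(pct > max_violation_pct)})
    return pd.DataFrame(rows)

# Toy readings: one pressure value exceeds its upper limit
readings = pd.DataFrame({
    "chamber_pressure": [1.0, 1.2, 2.5, np.nan],
    "wafer_temp": [350.0, 360.0, 400.0, 410.0],
})

report = check_ranges(readings, RANGE_RULES)
print(report)
```

Running a check like this on each new batch of data and notifying someone whenever `alert` is true is one simple way to turn validation rules into monitoring.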

Ready to dive deeper into outlier detection? Let's continue! 🚀