# XAI Data Quality Checks in TemporalScope

This tutorial demonstrates TemporalScope's data quality validation tools, which are intended for users who need auditing and monitoring inside automated model pipelines (e.g., Apache Airflow). TemporalScope provides the validation utilities; integrating them into a specific time series XAI workflow is left to the user.

Key Use Cases:
- Automated quality gates in production pipelines
- Model training data validation
- Continuous monitoring of data quality
- Audit trails for XAI workflows

## Engineering Design Overview

The data quality tools follow a clear separation between validation and transformation phases:

### Core Components

1. **Validation Phase (fit)**:
   - Validates TimeFrame or supported DataFrame type
   - Ensures data meets quality requirements
   - No Narwhals operations at this stage

2. **Transformation Phase (transform)**:
   - Uses pure Narwhals operations
   - Performs configured validation checks
   - Returns detailed validation results
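
The two-phase split can be sketched in plain Python. This is a minimal illustration, not TemporalScope's implementation: `MiniValidator`, its `min_samples` parameter, and the dict-of-lists input are hypothetical stand-ins for `DatasetValidator` and a real DataFrame.

```python
from typing import Dict, List, Optional

Columns = Dict[str, List[Optional[float]]]


class MiniValidator:
    """Hypothetical two-phase validator (illustration only)."""

    def __init__(self, min_samples: int = 3):
        self.min_samples = min_samples
        self.is_fitted_ = False

    def fit(self, data: Columns) -> "MiniValidator":
        # Phase 1: only check that the input has a supported type and shape.
        if not isinstance(data, dict) or not data:
            raise TypeError("expected a non-empty dict of columns")
        lengths = {len(col) for col in data.values()}
        if len(lengths) != 1:
            raise ValueError("all columns must have the same length")
        self.is_fitted_ = True
        return self

    def transform(self, data: Columns) -> Dict[str, bool]:
        # Phase 2: run the configured quality checks and return the results.
        if not self.is_fitted_:
            raise RuntimeError("call fit() before transform()")
        n_rows = len(next(iter(data.values())))
        return {
            "min_samples": n_rows >= self.min_samples,
            "no_missing": all(v is not None for col in data.values() for v in col),
        }

    def fit_transform(self, data: Columns) -> Dict[str, bool]:
        return self.fit(data).transform(data)


report = MiniValidator(min_samples=3).fit_transform(
    {"price": [1.0, 2.0, None], "volume": [10.0, 11.0, 12.0]}
)
print(report)  # {'min_samples': True, 'no_missing': False}
```

Keeping type checks in `fit()` means `transform()` can assume a well-formed input and contain only the quality checks themselves.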

### Engineering Design Assumptions

1. **Research-Backed Validation**:
   - Thresholds from key research papers
   - Customizable for domain requirements
   - Clear documentation of sources

2. **Backend Agnostic**:
   - Validation in fit() before operations
   - Pure Narwhals operations in transform()
   - Clean separation of concerns

3. **Pipeline Integration**:
   - Monitoring and alerting capabilities
   - Structured results for pipelines
   - Support for workflow systems

## Example 1: Basic Validation with TimeFrame

In [None]:
import pandas as pd
import numpy as np
from temporalscope.core.temporal_data_loader import TimeFrame
from temporalscope.datasets.dataset_validator import DatasetValidator

# Create sample dataset
np.random.seed(42)
n_samples = 1000

# Create data with numeric features (no special prefixes needed)
data = {
    'time': pd.date_range('2023-01-01', periods=n_samples),
    'price': np.random.normal(0, 1, n_samples),
    'volume': np.random.uniform(0, 10, n_samples),
    'sentiment': np.random.choice([0, 1], n_samples),
    'target': np.random.choice([0, 1, 2], n_samples)
}

# Create TimeFrame - it handles data validation internally
tf = TimeFrame(pd.DataFrame(data), time_col='time', target_col='target')

# Initialize validator
validator = DatasetValidator()

# Run validation checks
results = validator.fit_transform(tf)

# Print detailed report
validator.print_report(results)

## Example 2: Working with Different Backends

The validator works with any DataFrame backend supported by Narwhals:

In [None]:
import polars as pl
import modin.pandas as mpd

# Test with different backends
for backend in ["pandas", "polars", "modin"]:
    print(f"\nValidating with {backend} backend:")
    
    # Create TimeFrame with appropriate backend
    if backend == "pandas":
        df = pd.DataFrame(data)
    elif backend == "polars":
        df = pl.DataFrame(data)
    else:
        df = mpd.DataFrame(data)
    
    # TimeFrame handles backend conversion
    tf = TimeFrame(df, time_col='time', target_col='target')
    
    results = validator.fit_transform(tf)
    print(f"Validation Results ({backend}):")
    validator.print_report(results)

## Example 3: Pipeline Integration

Here's how to integrate validation into your automated pipelines (e.g., Apache Airflow):

In [None]:
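# ValidationResult is used below; this assumes it is exported by the same module as DatasetValidator
from temporalscope.datasets.dataset_validator import ValidationResult

def validate_for_pipeline(df, time_col, target_col):
    """Example validation step for an automated pipeline (e.g., an Airflow task).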
    
    This shows how to integrate validation into automated pipelines:
    1. Create TimeFrame with proper configuration
    2. Run validation checks
    3. Get structured results for monitoring
    4. Handle validation failures
    """
    # Create TimeFrame
    tf = TimeFrame(df, time_col=time_col, target_col=target_col)
    
    # Initialize validator with custom thresholds
    validator = DatasetValidator(
        min_samples=1000,
        max_feature_ratio=0.2,
        class_imbalance_threshold=3.0,
        enable_warnings=True
    )
    
    # Run validation
    results = validator.fit_transform(tf)
    
    # Get structured results for monitoring
    summary = ValidationResult.get_validation_summary(results)
    failed = ValidationResult.get_failed_checks(results)
    
    if failed:
        # Log each failed check
        for check_name, result in failed.items():
            log_entry = result.to_log_entry()
            print(f"Failed Check: {check_name}")
            print(f"Details: {log_entry}")
            
            # Example: Push to monitoring
            # monitoring.push_metrics("data_validation", log_entry)
            
            # Example: Fail pipeline on critical errors
            if result.severity == "ERROR":
                raise ValueError(f"Critical validation failure: {check_name}")
    
    return results, summary

# Example usage
results, summary = validate_for_pipeline(pd.DataFrame(data), time_col='time', target_col='target')
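
One design choice in the function above is that it raises on the first `ERROR`, so any failed checks that come later in the loop are never logged. A minimal sketch of the alternative, which records every failure and then raises once, follows. `CheckResult` and `enforce_quality_gate` are hypothetical stand-ins written for this illustration; they are not part of TemporalScope, and the `severity` and `to_log_entry()` names simply mirror the usage above.

```python
import json
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger("data_validation")


@dataclass
class CheckResult:
    """Hypothetical stand-in for the result of one validation check."""

    passed: bool
    message: str
    severity: str = "WARNING"  # "WARNING" or "ERROR"
    details: Dict[str, float] = field(default_factory=dict)

    def to_log_entry(self) -> Dict[str, object]:
        # Plain dict so the entry can be serialized for a monitoring system.
        return {
            "passed": self.passed,
            "message": self.message,
            "severity": self.severity,
            "details": self.details,
        }


def enforce_quality_gate(results: Dict[str, CheckResult]) -> List[str]:
    """Log every failed check, then raise once if any failure is critical.

    Returns the sorted names of failed checks when none of them is critical.
    """
    failed = {name: r for name, r in results.items() if not r.passed}
    for name, r in failed.items():
        logger.warning("Failed check %s: %s", name, json.dumps(r.to_log_entry()))
    critical = sorted(name for name, r in failed.items() if r.severity == "ERROR")
    if critical:
        raise ValueError(f"Critical validation failures: {', '.join(critical)}")
    return sorted(failed)


demo = {
    "samples": CheckResult(True, "enough samples"),
    "class_balance": CheckResult(
        False, "imbalance ratio too high", "WARNING", {"ratio": 4.2}
    ),
}
warnings_only = enforce_quality_gate(demo)
print(warnings_only)  # ['class_balance']
```

Raising after the loop keeps the audit trail complete, at the cost of doing a little more work before the pipeline fails.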

## Implementation Details

The validator handles different DataFrame backends through Narwhals operations:

1. **LazyFrame (Dask/Polars)**:
   - Uses collect() for scalar access
   - Avoids direct indexing
   - Handles lazy evaluation properly

2. **PyArrow**:
   - Uses nw.Int64 for numeric operations
   - Handles comparisons through Narwhals
   - Converts types before arithmetic operations

3. **All Backends**:
   - Uses @nw.narwhalify for backend conversions
   - Pure Narwhals operations throughout
   - Consistent behavior across supported types

## Research-Backed Thresholds

The validation thresholds are derived from key research:

1. **Grinsztajn et al. (2022)**:
   - Minimum samples (≥ 3,000)
   - Feature-to-sample ratio (d/n < 1/10)
   - Categorical cardinality (≤ 20)

2. **Shwartz-Ziv and Armon (2021)**:
   - Maximum samples (≤ 50,000)
   - Minimum features (≥ 4)

3. **Gorishniy et al. (2021)**:
   - Maximum features (< 500)
   - Numerical uniqueness (≥ 10)

These thresholds are recommendations that can be customized for your specific use case.
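
To make the thresholds concrete, the sketch below applies the sample and feature limits listed above to a dataset's shape. It is an illustration only: `check_shape_thresholds` and `DEFAULT_THRESHOLDS` are hypothetical names, not TemporalScope's API, and the default values are copied from the list above.

```python
from typing import Dict, Optional

# Defaults copied from the thresholds listed in this section.
DEFAULT_THRESHOLDS = {
    "min_samples": 3_000,
    "max_samples": 50_000,
    "min_features": 4,
    "max_features": 500,       # exclusive upper bound
    "max_feature_ratio": 0.1,  # d/n, exclusive upper bound
}


def check_shape_thresholds(
    n_samples: int,
    n_features: int,
    thresholds: Optional[Dict[str, float]] = None,
) -> Dict[str, bool]:
    """Return a pass/fail flag for each shape-based threshold."""
    t = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    return {
        "min_samples": n_samples >= t["min_samples"],
        "max_samples": n_samples <= t["max_samples"],
        "min_features": n_features >= t["min_features"],
        "max_features": n_features < t["max_features"],
        "feature_ratio": n_samples > 0
        and n_features / n_samples < t["max_feature_ratio"],
    }


# Shape of the Example 1 dataset: 1,000 rows and 3 feature columns
# (price, volume, sentiment), assuming time and target are not counted.
checks = check_shape_thresholds(n_samples=1_000, n_features=3)
failed = [name for name, ok in checks.items() if not ok]
print(failed)  # ['min_samples', 'min_features']

# Overriding defaults for a smaller domain-specific dataset.
relaxed = check_shape_thresholds(1_000, 3, {"min_samples": 1_000, "min_features": 3})
print(all(relaxed.values()))  # True
```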