# TemporalScope Tutorial: XAI Data Quality Validation

## Purpose

This tutorial demonstrates how to validate time series data quality using research-backed thresholds. Data quality is critical for explainable AI (XAI) because poor-quality data can produce misleading explanations and unreliable models.

## What You'll Learn

1. How to validate time series data using research-backed thresholds
2. Why each validation check matters for XAI
3. How to integrate validation into production pipelines

## Research-Backed Validation

Our validation thresholds come from key research papers:

1. **Grinsztajn et al. (2022)**:
   - Minimum 3,000 samples needed for reliable model training
   - Feature-to-sample ratio should be < 1/10 to prevent overfitting
   - Categorical features should have ≤ 20 unique values

2. **Shwartz-Ziv et al. (2021)**:
   - Maximum 50,000 samples for medium-sized datasets
   - At least 4 features needed for meaningful complexity

3. **Gorishniy et al. (2021)**:
   - Keep features under 500 to avoid dimensionality issues
   - Numerical features should have ≥ 10 unique values

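Treat these thresholds as starting heuristics rather than hard guarantees: they help ensure your data is suitable for XAI analysis, and you can adjust them for your domain.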
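
To make the thresholds above concrete, here is a minimal sketch that applies the size-related ones by hand with plain pandas. It is not how `DatasetValidator` is implemented; the function name `check_thresholds` and the dictionary keys are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Threshold values quoted in the list above; the key names are illustrative.
THRESHOLDS = {
    "min_samples": 3000,
    "max_samples": 50000,
    "min_features": 4,
    "max_features": 500,
    "max_feature_ratio": 0.1,
}


def check_thresholds(df: pd.DataFrame, time_col: str, target_col: str) -> dict:
    """Return a mapping of check name -> True (passed) or False (failed)."""
    n_samples = len(df)
    # Features are all columns except the time and target columns.
    n_features = len([c for c in df.columns if c not in (time_col, target_col)])
    ratio = n_features / n_samples if n_samples else float("inf")
    return {
        "min_samples": n_samples >= THRESHOLDS["min_samples"],
        "max_samples": n_samples <= THRESHOLDS["max_samples"],
        "min_features": n_features >= THRESHOLDS["min_features"],
        "max_features": n_features <= THRESHOLDS["max_features"],
        "feature_ratio": ratio < THRESHOLDS["max_feature_ratio"],
    }


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "time": pd.date_range("2023-01-01", periods=5000),
    "price": rng.normal(100, 10, 5000),
    "target": rng.integers(0, 2, 5000),
})
checks = check_thresholds(demo, "time", "target")
print(checks)
```

The demo frame has 5,000 rows but only one feature column, so it passes the sample-size and ratio checks and fails the four-feature minimum in this sketch.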

In [None]:
import pandas as pd
import numpy as np
import narwhals as nw

from temporalscope.datasets.dataset_validator import DatasetValidator, ValidationResult

# Create sample time series data
data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=5000),
    'price': np.random.normal(100, 10, 5000),
    'target': np.random.choice([0, 1], 5000)
})

# Initialize validator with research-backed thresholds
validator = DatasetValidator(
    time_col='time',
    target_col='target',
    min_samples=3000,
    max_feature_ratio=0.1,
    enable_warnings=True
)

# Run validation
results = validator.fit_transform(data)

print("Data Quality Validation Report:")
validator.print_report(results)

## Understanding Validation Checks

Each validation check has a specific purpose:

1. **Sample Size** (min_samples=3000):
   - WHY: Too few samples → unstable models
   - WHY: Too many samples → computational issues

2. **Feature Count**:
   - WHY: Too few features → oversimplified model
   - WHY: Too many features → curse of dimensionality

3. **Feature Ratio** (max_feature_ratio=0.1):
   - WHY: High ratio → risk of overfitting
   - WHY: Fewer samples per feature → less evidence to estimate each feature's effect

4. **Feature Variability**:
   - WHY: Low variability → uninformative features
   - WHY: Near-constant features give the model little to learn from
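
The feature variability check can be illustrated with a short pandas sketch. The helper name `low_variability_features` is hypothetical, and for simplicity it applies a single `min_unique_values` cutoff to every feature column rather than treating categorical and numerical features separately.

```python
import pandas as pd


def low_variability_features(df, exclude, min_unique_values=10):
    """Return {column: n_unique} for feature columns below the cutoff."""
    # nunique() ignores missing values by default.
    counts = df.drop(columns=list(exclude)).nunique()
    return {col: int(n) for col, n in counts.items() if n < min_unique_values}


frame = pd.DataFrame({
    "time": pd.date_range("2023-01-01", periods=100),
    "flag": [0, 1] * 50,                        # only 2 distinct values
    "reading": [i * 0.5 for i in range(100)],   # 100 distinct values
    "target": [0, 1] * 50,
})
flagged = low_variability_features(frame, exclude=("time", "target"))
print(flagged)  # {'flag': 2}
```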

In [None]:
# Create data with quality issues
problematic_data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=1000),
    'feature1': np.random.choice([1, 2], 1000),
    'feature2': np.random.normal(0, 1, 1000),
    'feature3': [None] * 100 + list(range(900)),
    'target': np.random.choice([0, 1], 1000)
})

try:
    # Initialize validator
    validator = DatasetValidator(
        time_col='time',
        target_col='target',
        min_samples=3000,
        min_unique_values=10,
        enable_warnings=True
    )
    
    # Attempt validation
    results = validator.fit_transform(problematic_data)
    
except ValueError as e:
    print("Validation Failed:")
    print(f"Error: {e}")
    print("\nWhy This Matters:")
    print("1. Too few samples (1000 < 3000) → unstable models")
    print("2. Missing values → unreliable predictions")
    print("3. Low feature variability → poor model learning")

## Production Pipeline Integration

The DatasetValidator uses Narwhals for backend-agnostic operations, making it suitable for production environments:

1. **Production Environment**:
   - Uses pandas + narwhals for efficient operations
   - Lightweight deployment with minimal dependencies
   - Consistent behavior in production

2. **Test Environment**:
   - Supports multiple backends via hatch
   - Validates across different DataFrame implementations
   - Ensures reliability across environments

3. **Core Operations**:
   - Uses @nw.narwhalify for backend conversions
   - Pure Narwhals operations throughout
   - Consistent behavior across supported types

In [None]:
@nw.narwhalify
def validate_production_data(df, time_col, target_col):
    """Production-ready validation function using Narwhals.
    
    Key Features:
    1. Backend-agnostic operations
    2. Proper error handling
    3. Detailed logging
    
    Args:
        df: Input DataFrame to validate
        time_col: Name of time column
        target_col: Name of target column
    
    Returns:
        Tuple of (results, summary)

    Raises:
        ValueError: If any failed check has severity "ERROR"
    """
    validator = DatasetValidator(
        time_col=time_col,
        target_col=target_col,
        min_samples=3000,
        max_feature_ratio=0.1
    )
    
    try:
        results = validator.fit_transform(df)
        summary = ValidationResult.get_validation_summary(results)
        failed = ValidationResult.get_failed_checks(results)
        
        if failed:
            for check_name, result in failed.items():
                log_entry = result.to_log_entry()
                print(f"Failed Check: {check_name}")
                print(f"Details: {log_entry}")
                
                if result.severity == "ERROR":
                    raise ValueError(f"Critical validation failure: {check_name}")
        
        return results, summary
        
    except Exception as e:
        print(f"Validation failed: {e}")
        raise

# Test the production validation
sample_data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=5000),
    'value': np.random.normal(0, 1, 5000),
    'target': np.random.choice([0, 1], 5000)
})

results, summary = validate_production_data(sample_data, 'time', 'target')
print("\nValidation Summary:")
print(f"Total Checks: {summary['total_checks']}")
print(f"Failed Checks: {summary['failed_checks']}")

## Best Practices

1. **Pipeline Integration**:
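   ```python
   # Airflow DAG example (a sketch; assumes a "load_data" task pushes a Parquet path to XCom)
   from datetime import datetime

   import pandas as pd
   from airflow import DAG
   from airflow.operators.python import PythonOperator

   def validate_from_path(path, time_col, target_col):
       # Hypothetical helper: templated op_kwargs render to strings, so load the frame here
       validate_production_data(pd.read_parquet(path), time_col, target_col)

   with DAG('data_validation_pipeline', start_date=datetime(2023, 1, 1), schedule=None) as dag:
       validate_task = PythonOperator(
           task_id='validate_dataframe',
           python_callable=validate_from_path,
           op_kwargs={
               'path': '{{ task_instance.xcom_pull(task_ids="load_data") }}',
               'time_col': 'timestamp',
               'target_col': 'target'
           }
       )
   ```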

2. **Monitoring Setup**:
   - Track validation metrics over time
   - Set up alerts for critical failures
   - Monitor feature drift

3. **Threshold Configuration**:
   - Start with research-backed defaults
   - Adjust based on domain requirements
   - Document threshold decisions

4. **Error Handling**:
   - Define clear failure policies
   - Set up fallback procedures
   - Maintain audit trails
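
One way to follow the threshold configuration and audit trail advice is to keep each threshold next to its source and rationale. The sketch below uses only the standard library; `ThresholdDecision`, `validator_kwargs`, and `audit_record` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThresholdDecision:
    name: str        # keyword argument name passed to the validator
    value: float     # threshold value in use
    source: str      # where the default came from
    rationale: str   # why it was kept or changed


DECISIONS = [
    ThresholdDecision("min_samples", 3000, "Grinsztajn et al. (2022)",
                      "research-backed default, unchanged"),
    ThresholdDecision("max_feature_ratio", 0.1, "Grinsztajn et al. (2022)",
                      "research-backed default, unchanged"),
]


def validator_kwargs(decisions):
    """Turn documented decisions into keyword arguments."""
    return {d.name: d.value for d in decisions}


def audit_record(decisions):
    """Serialize the decisions so they can be stored with each run."""
    return json.dumps([asdict(d) for d in decisions], indent=2)


kwargs = validator_kwargs(DECISIONS)
print(kwargs)
print(audit_record(DECISIONS))
```

The resulting dictionary can then be unpacked into the constructor used earlier, for example `DatasetValidator(time_col='time', target_col='target', **kwargs)`.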

## Further Reading

1. Grinsztajn et al. (2022) - Data quality thresholds
2. Shwartz-Ziv et al. (2021) - Dataset size guidelines
3. Gorishniy et al. (2021) - Feature complexity analysis