# TemporalScope Tutorial: XAI Data Quality Validation

## Purpose

This tutorial demonstrates how to validate time series data quality against thresholds drawn from published research. Data quality is critical for XAI (eXplainable AI) because poor-quality data can lead to misleading explanations and unreliable models.

## What You'll Learn

1. How to validate time series data using research-backed thresholds
2. Why each validation check matters for XAI
3. How to integrate validation into production pipelines

## Research-Backed Validation

Our validation thresholds come from key research papers:

1. **Grinsztajn et al. (2022)**:
   - Minimum 3,000 samples needed for reliable model training
   - Feature-to-sample ratio should be < 1/10 to prevent overfitting
   - Categorical features should have ≤ 20 unique values

2. **Shwartz-Ziv et al. (2021)**:
   - Maximum 50,000 samples for medium-sized datasets
   - At least 4 features needed for meaningful complexity

3. **Gorishniy et al. (2021)**:
   - Keep features under 500 to avoid dimensionality issues
   - Numerical features should have ≥ 10 unique values

These thresholds help ensure your data is suitable for XAI analysis.
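
To make the dataset-level thresholds concrete, the sketch below encodes them as plain Python constants and checks a dataset's shape against them. It does not use TemporalScope: the names `THRESHOLDS` and `check_dataset_shape` are hypothetical and only illustrate the arithmetic behind the numbers listed above.

```python
# Illustrative only: values copied from the list above; names are hypothetical.
THRESHOLDS = {
    "min_samples": 3_000,
    "max_samples": 50_000,
    "min_features": 4,
    "max_features": 500,
    "max_feature_ratio": 0.1,
}


def check_dataset_shape(n_samples: int, n_features: int) -> dict:
    """Return a pass/fail flag for each dataset-level threshold."""
    t = THRESHOLDS
    # Guard against division by zero for an empty dataset.
    ratio = n_features / n_samples if n_samples else float("inf")
    return {
        "min_samples": n_samples >= t["min_samples"],
        "max_samples": n_samples <= t["max_samples"],
        "min_features": n_features >= t["min_features"],
        "max_features": n_features <= t["max_features"],
        "feature_ratio": ratio < t["max_feature_ratio"],
    }


# 5,000 rows with 6 features: every check passes.
print(check_dataset_shape(5_000, 6))
# 1,000 rows with 200 features: too few samples and a ratio of 0.2.
print(check_dataset_shape(1_000, 200))
```

Returning one flag per check, rather than a single boolean, makes it easier to see which threshold a dataset violates.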

In [1]:
# Import required libraries
import pandas as pd  # For DataFrame operations
import numpy as np   # For numerical operations

# Import TemporalScope components
from temporalscope.datasets.dataset_validator import DatasetValidator, ValidationResult

# Note: We'll import other backends (polars, modin) when needed
# This keeps the initial setup simple and focused

## Example 1: Basic Data Validation

Let's start with a simple example to understand the core validation concepts.

### What We're Doing
1. Create a sample time series dataset
2. Set up the validator with research-backed thresholds
3. Run validation checks
4. Understand the validation results

### Why It Matters
- Poor data quality leads to unreliable XAI results
- Research-backed thresholds ensure meaningful analysis
- Early validation prevents downstream issues

In [2]:
# Step 1: Create a sample time series dataset
# - Using 5000 samples (meets minimum requirement of 3000)
# - One feature plus target, to keep the example small
#   (note: fewer than the 4-feature guideline listed above)
# - Daily timestamps for time series structure
data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=5000),  # Daily timestamps
    'price': np.random.normal(100, 10, 5000),           # Continuous feature
    'target': np.random.choice([0, 1], 5000)            # Binary target
})

# Step 2: Initialize validator with research-backed thresholds
validator = DatasetValidator(
    time_col='time',      # Column containing timestamps
    target_col='target',  # Column to predict
    
    # Research-backed thresholds from Grinsztajn et al. (2022)
    min_samples=3000,         # Minimum samples for reliable training
    max_feature_ratio=0.1,    # Prevent overfitting
    
    # Enable warnings to see detailed messages
    enable_warnings=True
)

# Step 3: Run validation
results = validator.fit_transform(data)

# Step 4: Print detailed report
print("Data Quality Validation Report:")
validator.print_report(results)

# Understanding the results:
# - ✓ means the check passed
# - ✗ means the check failed
# - Details show specific metrics and thresholds

### Understanding Validation Checks

Each validation check has a specific purpose:

1. **Sample Size** (min_samples=3000):
   - WHY: Too few samples → unstable models
   - WHY: Too many samples → computational issues
   
2. **Feature Count**:
   - WHY: Too few features → oversimplified model
   - WHY: Too many features → curse of dimensionality
   
3. **Feature Ratio** (max_feature_ratio=0.1):
   - WHY: Many features relative to samples → risk of overfitting
   - WHY: The 1/10 limit follows Grinsztajn et al. (2022)
   
4. **Feature Variability**:
   - WHY: Low variability → uninformative features
   - WHY: Affects model's learning capacity
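
The feature variability check can be approximated with plain pandas. The following is a minimal sketch, not TemporalScope's implementation: `unique_value_report` and its parameters are hypothetical names, and the defaults mirror the unique-value thresholds quoted earlier (at least 10 for numerical features, at most 20 for categorical ones).

```python
import numpy as np
import pandas as pd


def unique_value_report(df: pd.DataFrame,
                        min_numeric_unique: int = 10,
                        max_categorical_unique: int = 20) -> dict:
    """Map each column to True (passes) or False (fails) its cardinality check."""
    report = {}
    for col in df.columns:
        n_unique = df[col].nunique()  # missing values are not counted
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical features need enough distinct values to be informative.
            report[col] = n_unique >= min_numeric_unique
        else:
            # Categorical features should not have too many levels.
            report[col] = n_unique <= max_categorical_unique
    return report


rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "flag": rng.integers(1, 3, size=200),                 # only the values 1 and 2
    "price": rng.normal(100, 10, size=200),               # continuous
    "region": rng.choice(["north", "south"], size=200),   # 2 categories
})
print(unique_value_report(frame))
```

Here `flag` is numeric but takes only two values, so it fails the numerical check, while `price` and `region` pass.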

## Example 2: Handling Data Quality Issues

Now let's see what happens with problematic data and how to interpret the warnings.

### Common Issues
1. Too few samples
2. Missing values
3. Low feature variability
4. Improper feature-to-sample ratio

In [3]:
# Create data with quality issues
problematic_data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=1000),     # Too few samples
    'feature1': np.random.choice([1, 2], 1000),            # Low variability
    'feature2': np.random.normal(0, 1, 1000),              # Good variability
    'feature3': [None] * 100 + list(range(900)),           # Missing values
    'target': np.random.choice([0, 1], 1000)               # Binary target
})

try:
    # Initialize validator
    validator = DatasetValidator(
        time_col='time',
        target_col='target',
        min_samples=3000,           # Will fail (only 1000 samples)
        min_unique_values=10,       # Will fail for feature1
        enable_warnings=True
    )
    
    # Attempt validation
    results = validator.fit_transform(problematic_data)
    
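except ValueError as e:
    # Reached only if the validator rejects the data by raising
    print("Validation Failed:")
    print(f"Error: {e}")
    print("\nWhy This Matters:")
    print("1. Too few samples (1000 < 3000) → unstable models")
    print("2. Missing values → unreliable predictions")
    print("3. Low feature variability → poor model learning")
else:
    # No exception was raised: inspect the per-check results instead
    validator.print_report(results)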
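
Before handing data to a validator, it can help to see where the missing values are. The sketch below uses pandas only and is independent of TemporalScope. The function name `missing_value_report` and the `max_missing_fraction` parameter are hypothetical, chosen for illustration.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame,
                         max_missing_fraction: float = 0.0) -> pd.DataFrame:
    """Summarize missing values per column and flag columns above the limit."""
    fraction = df.isna().mean()  # share of missing entries in each column
    return pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_fraction": fraction,
        "exceeds_limit": fraction > max_missing_fraction,
    })


frame = pd.DataFrame({
    "feature2": [0.5, 1.2, -0.3, 0.8],
    "feature3": [None, None, 2.0, 3.0],
})
print(missing_value_report(frame))
```

The default limit of zero flags any column with a missing entry. A looser limit may be reasonable when imputation happens later in the pipeline.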

## Example 3: Production Pipeline Integration

Learn how to integrate validation into production workflows.

### Key Concepts
1. Quality gates in pipelines
2. Monitoring and alerting
3. Error handling and logging

In [4]:
def validate_production_data(df, time_col, target_col):
    """Production-ready validation function.
    
    Key Features:
    1. Proper error handling
    2. Detailed logging
    3. Monitoring integration
    
    Args:
        df: Input DataFrame to validate
        time_col: Name of time column
        target_col: Name of target column
    
    Returns:
        Tuple of (results, summary)

    Raises:
        ValueError: If any failed check has severity "ERROR"
    """
    # Initialize validator with production thresholds
    validator = DatasetValidator(
        time_col=time_col,
        target_col=target_col,
        min_samples=3000,     # Research-backed minimum
        max_feature_ratio=0.1  # Prevent overfitting
    )
    
    try:
        # Run validation
        results = validator.fit_transform(df)
        
        # Get monitoring metrics
        summary = ValidationResult.get_validation_summary(results)
        failed = ValidationResult.get_failed_checks(results)
        
        # Handle failures
        if failed:
            for check_name, result in failed.items():
                # Get structured log entry
                log_entry = result.to_log_entry()
                
                # Log failure details
                print(f"Failed Check: {check_name}")
                print(f"Details: {log_entry}")
                
                # Critical failures stop the pipeline
                if result.severity == "ERROR":
                    raise ValueError(f"Critical validation failure: {check_name}")
        
        return results, summary
        
    except Exception as e:
        print(f"Validation failed: {str(e)}")
        raise

# Test the production validation
sample_data = pd.DataFrame({
    'time': pd.date_range('2023-01-01', periods=5000),
    'value': np.random.normal(0, 1, 5000),
    'target': np.random.choice([0, 1], 5000)
})

results, summary = validate_production_data(sample_data, 'time', 'target')
print("\nValidation Summary:")
print(f"Total Checks: {summary['total_checks']}")
print(f"Failed Checks: {summary['failed_checks']}")

## Best Practices for Production

### 1. Pipeline Integration
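```python
# Airflow DAG example (sketch; assumes Airflow 2.x and that the upstream
# "load_data" task returns the path of a Parquet file)
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_from_path(path, time_col, target_col):
    # Templated XCom values arrive as strings, so pass a path, not a DataFrame
    df = pd.read_parquet(path)
    validate_production_data(df, time_col, target_col)  # defined above

with DAG('data_validation_pipeline', start_date=datetime(2023, 1, 1)) as dag:
    validate_task = PythonOperator(
        task_id='validate_dataframe',
        python_callable=validate_from_path,
        op_kwargs={
            'path': '{{ task_instance.xcom_pull(task_ids="load_data") }}',
            'time_col': 'timestamp',
            'target_col': 'target'
        }
    )
```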

### 2. Monitoring Setup
- Track validation metrics over time
- Set up alerts for critical failures
- Monitor feature drift
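
One way to track validation metrics over time is to keep a rolling window of failed-check counts and alert when a run is clearly worse than the recent baseline. The sketch below uses only the standard library. `ValidationMonitor`, `window`, and `alert_factor` are hypothetical names, and the count passed to `record` could come from `summary['failed_checks']` in the validation function above.

```python
from collections import deque
from statistics import mean


class ValidationMonitor:
    """Keep a rolling window of failed-check counts and flag unusual runs."""

    def __init__(self, window: int = 30, alert_factor: float = 2.0):
        self.history = deque(maxlen=window)  # oldest runs drop out automatically
        self.alert_factor = alert_factor

    def record(self, failed_checks: int) -> bool:
        """Store a run and return True if it should trigger an alert."""
        baseline = mean(self.history) if self.history else 0.0
        self.history.append(failed_checks)
        # Alert when failures exceed the rolling baseline by the chosen factor.
        # With no history or a zero baseline, any failure triggers an alert.
        return failed_checks > baseline * self.alert_factor


monitor = ValidationMonitor(window=5)
for failed in [1, 1, 1, 4]:
    alert = monitor.record(failed)
    print(f"failed_checks={failed} alert={alert}")
```

Comparing against a rolling baseline rather than a fixed number lets the alert adapt to datasets that routinely fail a few non-critical checks.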

### 3. Threshold Configuration
- Start with research-backed defaults
- Adjust based on domain requirements
- Document threshold decisions
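
Threshold decisions are easier to audit when the values and the reason for each override live in one place. The following is a minimal sketch using a frozen dataclass. `ValidationThresholds`, `rationale`, and the sensor-data override are hypothetical, and the field names simply mirror the `DatasetValidator` keyword arguments used earlier in this tutorial.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ValidationThresholds:
    """Thresholds plus a note explaining where the values come from."""
    min_samples: int = 3_000
    max_feature_ratio: float = 0.1
    min_unique_values: int = 10
    rationale: str = "research-backed defaults"


DEFAULTS = ValidationThresholds()

# Hypothetical domain override: labelled rows are scarce, so the minimum
# sample count is lowered and the reason is recorded next to the value.
SENSOR_THRESHOLDS = replace(
    DEFAULTS,
    min_samples=1_000,
    rationale="sensor data: labelled rows are scarce (hypothetical example)",
)

print(asdict(DEFAULTS))
print(asdict(SENSOR_THRESHOLDS))
```

Because the dataclass is frozen, an override has to be created explicitly with `replace`, which leaves the defaults untouched and keeps each deviation visible in code review.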

### 4. Error Handling
- Define clear failure policies
- Set up fallback procedures
- Maintain audit trails
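
A failure policy can be as simple as a lookup from check name to action, with every decision appended to an audit log. The sketch below is one possible shape under stated assumptions: `FailurePolicy`, `POLICIES`, `handle_failure`, and the check names are all hypothetical, not part of TemporalScope.

```python
import json
from datetime import datetime, timezone
from enum import Enum


class FailurePolicy(Enum):
    STOP = "stop"          # abort the pipeline
    FALLBACK = "fallback"  # continue with the last known-good dataset
    WARN = "warn"          # continue and record the failure


# Hypothetical mapping from check name to policy.
POLICIES = {
    "min_samples": FailurePolicy.STOP,
    "missing_values": FailurePolicy.FALLBACK,
}


def handle_failure(check_name: str, audit_log: list) -> FailurePolicy:
    """Look up the policy for a failed check and append an audit record."""
    policy = POLICIES.get(check_name, FailurePolicy.WARN)  # default: warn
    audit_log.append(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "check": check_name,
        "policy": policy.value,
    }))
    return policy


audit_log = []
for name in ["min_samples", "missing_values", "feature_variability"]:
    print(name, "->", handle_failure(name, audit_log).value)
```

Storing each audit record as a JSON string keeps the trail easy to ship to a file or log aggregator. In a real pipeline the list would typically be replaced by durable storage.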

## Further Reading

1. Grinsztajn et al. (2022) - Data quality thresholds
2. Shwartz-Ziv et al. (2021) - Dataset size guidelines
3. Gorishniy et al. (2021) - Feature complexity analysis