# Dataset Validation in TemporalScope

This tutorial shows how to use TemporalScope's dataset validation utilities to check a dataset against research-backed heuristics. The validator reports whether the dataset meets common size, dimensionality, and feature-cardinality criteria for machine learning tasks, based on findings from key research papers.

## Research Background

The default validation thresholds are drawn from three papers on machine learning for tabular data, attributed as follows:

1. Grinsztajn et al. (2022):
   - Minimum samples (≥ 3,000) for complex models
   - Feature-to-sample ratio (d/n < 1/10)
   - Categorical feature cardinality (≤ 20)

2. Shwartz-Ziv and Armon (2021):
   - Maximum samples (≤ 50,000) for medium-sized datasets
   - Minimum features (≥ 4) for meaningful complexity

3. Gorishniy et al. (2021):
   - Maximum features (< 500) to avoid dimensionality issues
   - Numerical feature uniqueness (≥ 10)

These findings provide valuable guidelines, but remember they're recommendations, not strict rules. Your specific use case may require different thresholds.
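
The sample-count and feature-count heuristics listed above reduce to a few comparisons on the dataset's shape. The following is a minimal sketch in plain Python, not TemporalScope's implementation: `HEURISTICS` and `check_shape` are hypothetical names introduced here for illustration, and the values are simply the ones from the list above.

```python
# Illustrative only: thresholds copied from the list above; all names are hypothetical.
HEURISTICS = {
    "min_samples": 3_000,
    "max_samples": 50_000,
    "min_features": 4,
    "max_features": 500,       # exclusive upper bound (d < 500)
    "max_feature_ratio": 0.1,  # exclusive upper bound (d/n < 1/10)
}


def check_shape(n_samples: int, n_features: int) -> dict:
    """Return a pass/fail flag for each shape-level heuristic."""
    return {
        "min_samples": n_samples >= HEURISTICS["min_samples"],
        "max_samples": n_samples <= HEURISTICS["max_samples"],
        "min_features": n_features >= HEURISTICS["min_features"],
        "max_features": n_features < HEURISTICS["max_features"],
        "feature_ratio": n_features / n_samples < HEURISTICS["max_feature_ratio"],
    }


# A dataset with 1,000 rows and 4 features misses only the minimum-sample heuristic.
report = check_shape(n_samples=1_000, n_features=4)
failed = [name for name, ok in report.items() if not ok]
print(failed)
```

If the validator's defaults match the values listed above, a dataset of 1,000 rows like the one built in this tutorial would be expected to trigger a minimum-sample warning.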

```python
import pandas as pd
import numpy as np
from temporalscope.datasets.dataset_validator import DatasetValidator

# Create sample dataset
np.random.seed(42)
n_samples = 1000

data = {
    'numeric_feature_1': np.random.normal(0, 1, n_samples),
    'numeric_feature_2': np.random.uniform(0, 10, n_samples),
    'categorical_feature': np.random.choice(['A', 'B', 'C'], n_samples),
    'binary_feature': np.random.choice([0, 1], n_samples),
    'target': np.random.choice([0, 1, 2], n_samples)
}

df = pd.DataFrame(data)
```

## Basic Validation

Let's start by running all validation checks with default thresholds:

```python
# Create validator with default settings
validator = DatasetValidator()

# Run validation checks
results = validator.validate(df, target_col='target')

# Print detailed report
validator.print_report(results)
```
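
Beyond the printed report, the results can also be inspected programmatically. The Integration Examples section of this tutorial indexes the results by check name and reads `passed`, `message`, and `details` from each entry; the sketch below assumes only that shape. `CheckResult` is a hypothetical stand-in defined here for illustration, not the class TemporalScope returns, and the messages are made up.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Hypothetical stand-in for one entry of the validation results."""
    passed: bool
    message: str = ""
    details: dict = field(default_factory=dict)


def failed_checks(results: dict) -> dict:
    """Return a {check_name: message} mapping for every check that did not pass."""
    return {name: r.message for name, r in results.items() if not r.passed}


# Made-up results that mimic the assumed structure.
example_results = {
    "sample_size": CheckResult(False, "1000 samples, below the minimum of 3000"),
    "feature_count": CheckResult(True, "4 features"),
}

for name, message in failed_checks(example_results).items():
    print(f"{name}: {message}")
```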

## Customizing Validation Thresholds

The default thresholds are based on research findings but may not apply to all use cases. Let's customize them for our needs:

```python
# Create validator with custom thresholds
custom_validator = DatasetValidator(
    min_samples=500,                # Minimum samples required
    max_samples=5000,               # Maximum samples allowed
    min_features=3,                 # Minimum features required
    max_features=10,                # Maximum features allowed
    max_feature_ratio=0.2,          # Maximum feature-to-sample ratio
    min_unique_values=5,            # Minimum unique values for numerical features
    max_categorical_values=5,       # Maximum unique values for categorical features
    class_imbalance_threshold=2.0   # Maximum class imbalance ratio
)

# Run validation with custom thresholds
custom_results = custom_validator.validate(df, target_col='target')
custom_validator.print_report(custom_results)
```
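
The `class_imbalance_threshold` parameter is described above as a maximum class imbalance ratio. One common way to define such a ratio is the count of the most frequent class divided by the count of the least frequent one. The sketch below assumes that definition, which may differ from what TemporalScope actually computes; `imbalance_ratio` is a hypothetical helper.

```python
from collections import Counter


def imbalance_ratio(labels) -> float:
    """Ratio of the largest class count to the smallest (assumed definition)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Three classes with 60, 30, and 10 examples.
labels = [0] * 60 + [1] * 30 + [2] * 10
ratio = imbalance_ratio(labels)      # 60 / 10
exceeds_threshold = ratio > 2.0      # compare against class_imbalance_threshold=2.0
print(ratio, exceeds_threshold)
```

Under this definition a perfectly balanced target has a ratio of 1.0, so a threshold of 2.0 tolerates the largest class being at most twice the size of the smallest.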

## Selective Validation

To run only specific validation checks, pass their names via `checks_to_run`:

```python
# Create validator with selected checks
selective_validator = DatasetValidator(
    checks_to_run=['sample_size', 'feature_count', 'class_balance']
)

# Run selected validation checks
selective_results = selective_validator.validate(df, target_col='target')
selective_validator.print_report(selective_results)
```
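
Selecting checks by name is commonly implemented as a lookup into a registry of check functions. The sketch below illustrates that general pattern only; it makes no claim about how `DatasetValidator` handles `checks_to_run` internally, and every name in it (`CHECKS`, `run_checks`, and the two check functions) is hypothetical.

```python
def check_sample_size(n_samples, n_features):
    """Pass when the dataset has at least 3,000 rows."""
    return n_samples >= 3_000


def check_feature_count(n_samples, n_features):
    """Pass when the feature count lies in [4, 500)."""
    return 4 <= n_features < 500


# Registry mapping check names to check functions.
CHECKS = {
    "sample_size": check_sample_size,
    "feature_count": check_feature_count,
}


def run_checks(n_samples, n_features, checks_to_run=None):
    """Run the named checks, or every registered check when none are named."""
    names = list(CHECKS) if checks_to_run is None else list(checks_to_run)
    unknown = [name for name in names if name not in CHECKS]
    if unknown:
        raise KeyError(f"Unknown checks: {unknown}")
    return {name: CHECKS[name](n_samples, n_features) for name in names}


print(run_checks(1_000, 4, checks_to_run=["feature_count"]))
```

Rejecting unknown names up front, as this sketch does, surfaces typos in the check list instead of silently skipping a check.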

## Working with Different Backends

The validator works with any DataFrame backend supported by Narwhals. For example, the same validator instance can check a Polars DataFrame:

```python
import polars as pl

# Convert the pandas DataFrame to a Polars DataFrame
pl_df = pl.from_pandas(df)

# Validate Polars DataFrame
polars_results = validator.validate(pl_df, target_col='target')
validator.print_report(polars_results)
```

## Integration Examples

The function below shows one way to integrate dataset validation into your workflow:

```python
def validate_and_preprocess(df, target_col=None):
    """Validate a dataset, print warnings for failed checks, and return the results."""
    # Create validator with domain-specific thresholds
    validator = DatasetValidator(
        min_samples=1000,
        max_feature_ratio=0.2,
        class_imbalance_threshold=3.0
    )

    # Run validation
    results = validator.validate(df, target_col=target_col)

    # Report failed checks
    if not results['sample_size'].passed:
        print("Warning: Dataset size issues detected")
        print(f"Details: {results['sample_size'].message}")

    if target_col and not results['class_balance'].passed:
        print("Warning: Class imbalance detected")
        print(f"Class counts: {results['class_balance'].details['class_counts']}")

    return results

# Example usage
validation_results = validate_and_preprocess(df, target_col='target')
```
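
In an automated pipeline it can be useful to stop when a check that matters for the model fails, rather than only printing a warning. The following sketch assumes the result structure used in the function above (entries keyed by check name, each with `passed` and `message`). `require_checks` is a hypothetical helper, and `SimpleNamespace` objects stand in for real results.

```python
from types import SimpleNamespace


def require_checks(results, required):
    """Raise ValueError if any required check is present and failed.

    Checks missing from `results` (for example, when only a subset was run)
    are skipped rather than treated as failures.
    """
    failures = [
        f"{name}: {results[name].message}"
        for name in required
        if name in results and not results[name].passed
    ]
    if failures:
        raise ValueError("Validation failed: " + "; ".join(failures))


# Stand-in results for illustration.
results = {
    "sample_size": SimpleNamespace(passed=True, message="ok"),
    "class_balance": SimpleNamespace(passed=False, message="ratio 6.0 exceeds 3.0"),
}

try:
    require_checks(results, required=["sample_size", "class_balance"])
except ValueError as err:
    print(err)
```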

## Best Practices

1. **Customize Thresholds**: Adjust validation thresholds based on your domain knowledge and specific requirements.

2. **Handle Failures Gracefully**: Use validation results to inform preprocessing decisions:
   - Sample size issues → Consider data collection or small dataset techniques
   - Feature ratio issues → Consider feature selection or dimensionality reduction
   - Class imbalance → Consider resampling or class weights

3. **Regular Validation**: Run validation checks:
   - When receiving new data
   - After major preprocessing steps
   - Before training models

4. **Documentation**: Keep track of your validation thresholds and reasoning:
```python
# Example validation configuration (keys match the DatasetValidator parameters)
validation_config = {
    'min_samples': 1000,               # Based on model complexity
    'max_features': 50,                # Based on available compute resources
    'class_imbalance_threshold': 3.0   # Based on domain knowledge
}
```
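
One lightweight way to keep thresholds and their reasoning together is to store both in a single structure and write it out next to the model artifacts. The sketch below uses only the standard library; the layout (a `value` and a `reason` per threshold) and the helper name `thresholds_only` are illustrative choices, not a TemporalScope convention.

```python
import json

# Each threshold carries the reason it was chosen (illustrative layout).
documented_config = {
    "min_samples": {"value": 1000, "reason": "Based on model complexity"},
    "max_features": {"value": 50, "reason": "Based on available compute resources"},
    "class_imbalance_threshold": {"value": 3.0, "reason": "Based on domain knowledge"},
}


def thresholds_only(config: dict) -> dict:
    """Strip the reasons, leaving a plain {parameter: value} mapping."""
    return {name: entry["value"] for name, entry in config.items()}


# Serialize for version control or experiment tracking.
serialized = json.dumps(documented_config, indent=2)
print(serialized)

kwargs = thresholds_only(documented_config)
```

The keys in `kwargs` use the parameter names shown in the Customizing Validation Thresholds section, so the mapping could be unpacked as `DatasetValidator(**kwargs)`.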

## Remember

The validation thresholds are research-backed recommendations, not strict requirements. They should be:
- Adjusted based on your specific use case
- Documented with clear reasoning

For more information, see the research papers cited in the Research Background section:
1. Grinsztajn et al. (2022) - Why do tree-based models still outperform deep learning on typical tabular data?
2. Shwartz-Ziv and Armon (2021) - Tabular data: Deep learning is not all you need
3. Gorishniy et al. (2021) - Revisiting deep learning models for tabular data