# Test Data Management Strategy for ActualGameSearch V2

This notebook documents our approach to handling test data in the ActualGameSearch V2 pipeline, ensuring robust testing without committing sensitive or large datasets to version control.

## Background

The ActualGameSearch V2 project processes large Steam game datasets, price information, and user reviews. While this data is essential for our ETL pipeline, we cannot and should not commit it to version control due to:

- **Size constraints**: Data files can be hundreds of MB to GB
- **Privacy concerns**: User review data may contain sensitive information
- **API terms and rate limits**: Steam API data is rate-limited to collect and should not be redistributed
- **Repository bloat**: Large binary files make git operations slow

This notebook outlines our solution: **fixture-based testing with synthetic data**.

## Section 1: Understanding Test Data Requirements

Let's analyze what types of data our tests need and categorize them by sensitivity, size, and generation complexity.

In [None]:
import pandas as pd
import json
from pathlib import Path
from typing import Dict, Any, List

# Define our test data categories
test_data_categories = {
    "steam_app_data": {
        "sensitivity": "Low",  # Public Steam data
        "size": "Large",  # 100k+ apps
        "generation_complexity": "Medium",  # Structured but varied
        "examples": ["steam_appid", "name", "type", "is_free", "detailed_description"]
    },
    "price_data": {
        "sensitivity": "Low",  # Historical pricing, publicly available
        "size": "Medium",  # Price points per app
        "generation_complexity": "Low",  # Simple numeric ranges
        "examples": ["min_price", "max_price", "sample_count"]
    },
    "review_data": {
        "sensitivity": "Medium",  # User-generated content
        "size": "Very Large",  # Millions of reviews
        "generation_complexity": "High",  # Natural language, sentiment
        "examples": ["review_text", "helpful_votes", "sentiment_score"]
    },
    "embeddings": {
        "sensitivity": "Low",  # Derived vectors
        "size": "Large",  # High-dimensional vectors
        "generation_complexity": "High",  # Requires ML models
        "examples": ["description_embedding", "review_embedding"]
    }
}

# Display our analysis
for category, details in test_data_categories.items():
    print(f"\n{category.upper()}:")
    for key, value in details.items():
        if key != "examples":
            print(f"  {key}: {value}")
        else:
            print(f"  {key}: {', '.join(value)}")

## Section 2: Setting Up Local Test Data Generation

Create scripts to generate test data locally using libraries like Faker and custom generators that match our data schema.

In [None]:
# Install faker if not already available
# pip install faker

from faker import Faker
import random
import numpy as np

fake = Faker()

def generate_steam_app_data(num_apps: int = 100) -> List[Dict[str, Any]]:
    """Generate synthetic Steam app data for testing."""
    
    game_types = ['game', 'dlc', 'demo', 'software']
    
    apps = []
    for _ in range(num_apps):
        # fake.unique avoids duplicate IDs, which would collide when joining with price data
        app_id = fake.unique.random_int(min=1000, max=999999)
        app_type = random.choice(game_types)
        
        # Generate realistic game names
        if app_type == 'dlc':
            name = f"{fake.catch_phrase()} - {fake.word().title()} DLC"
        elif app_type == 'demo':
            name = f"{fake.catch_phrase()} Demo"
        else:
            name = fake.catch_phrase()
        
        apps.append({
            'steam_appid': app_id,
            'name': name,
            'type': app_type,
            'is_free': random.choice([True, False]) if app_type != 'demo' else True,
            'detailed_description': fake.text(max_nb_chars=500),
            'short_description': fake.sentence(nb_words=10),
            'header_image': f"https://steamcdn-a.akamaihd.net/steam/apps/{app_id}/header.jpg",
            'release_date': fake.date_between(start_date='-10y', end_date='today').isoformat()
        })
    
    return apps

# Generate sample data
sample_apps = generate_steam_app_data(5)
print("Sample generated app data:")
for app in sample_apps[:2]:  # Show first 2
    print(f"\nApp ID: {app['steam_appid']}")
    print(f"Name: {app['name']}")
    print(f"Type: {app['type']}")
    print(f"Free: {app['is_free']}")
    print(f"Description: {app['short_description'][:50]}...")

In [None]:
def generate_price_data(app_ids: List[int]) -> Dict[int, Dict[str, Any]]:
    """Generate synthetic price data for given app IDs."""
    
    price_data = {}
    
    for app_id in app_ids:
        # Generate realistic price ranges
        base_price = round(random.uniform(0.99, 59.99), 2)
        
        # Some games go on sale
        if random.random() < 0.7:  # 70% chance of sales
            min_price = round(base_price * random.uniform(0.1, 0.8), 2)
        else:
            min_price = base_price
        
        # Free games
        if random.random() < 0.15:  # 15% free games
            min_price = 0.0
            base_price = 0.0
        
        price_data[app_id] = {
            'min': min_price,
            'max': base_price,
            'count': random.randint(1, 100)  # Number of price samples
        }
    
    return price_data

# Generate price data for our sample apps
app_ids = [app['steam_appid'] for app in sample_apps]
sample_prices = generate_price_data(app_ids)

print("Sample generated price data:")
for app_id, price_info in list(sample_prices.items())[:3]:
    print(f"App {app_id}: ${price_info['min']:.2f} - ${price_info['max']:.2f} ({price_info['count']} samples)")
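
Synthetic generators can drift from the real schema over time, so it may help to check generated records before they reach a test. The sketch below is one way to do that with the standard library only. `APP_SCHEMA`, `validate_records`, and `validate_prices` are hypothetical names, and the expected types are assumptions inferred from the generators above, not the pipeline's actual schema.

```python
from typing import Any, Dict, List

# Assumed schema: field name -> expected type, inferred from generate_steam_app_data.
APP_SCHEMA: Dict[str, type] = {
    "steam_appid": int,
    "name": str,
    "type": str,
    "is_free": bool,
    "detailed_description": str,
    "short_description": str,
}


def validate_records(records: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Return human-readable problems; an empty list means every record passed."""
    problems = []
    for index, record in enumerate(records):
        for field, expected in schema.items():
            if field not in record:
                problems.append(f"record {index}: missing '{field}'")
            # Exact type match, because bool is a subclass of int in Python.
            elif type(record[field]) is not expected:
                found = type(record[field]).__name__
                problems.append(
                    f"record {index}: '{field}' is {found}, expected {expected.__name__}"
                )
    return problems


def validate_prices(prices: Dict[int, Dict[str, Any]]) -> List[str]:
    """Check the invariants the price generator is meant to keep."""
    problems = []
    for app_id, entry in prices.items():
        if not 0 <= entry["min"] <= entry["max"]:
            problems.append(f"app {app_id}: expected 0 <= min <= max, got {entry}")
        if entry["count"] < 0:
            problems.append(f"app {app_id}: negative sample count")
    return problems


# A record with a string ID and missing fields, and a price entry with min > max.
app_problems = validate_records([{"steam_appid": "12345", "name": "Test Game 1"}], APP_SCHEMA)
price_problems = validate_prices({12345: {"min": 19.99, "max": 9.99, "count": 25}})
print(app_problems)
print(price_problems)
```

Returning a list of messages instead of raising on the first problem makes it easier to see every mismatch in one run.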

## Section 3: Creating Mock Data Factories

Build reusable data factories that can create consistent, reproducible test data on-demand for different test scenarios.

In [None]:
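class TestDataFactory:
    """Factory for generating consistent test data across different test scenarios."""

    # The "Test" prefix matches pytest's default class pattern; opt out of collection.
    __test__ = False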
    
    def __init__(self, seed: int = 42):
        """Initialize with a seed for reproducible data generation."""
        self.fake = Faker()
        Faker.seed(seed)
        random.seed(seed)
        np.random.seed(seed)
    
    def create_minimal_app_dataset(self) -> pd.DataFrame:
        """Create minimal app dataset for basic ETL testing."""
        apps = [
            {
                'steam_appid': 12345,
                'name': 'Test Game 1',
                'type': 'game',
                'is_free': False,
                'detailed_description': 'A detailed description of test game 1',
                'short_description': 'Short desc 1',
                'header_image': 'http://example.com/image1.jpg'
            },
            {
                'steam_appid': 67890,
                'name': 'Test Game 2',
                'type': 'game',
                'is_free': True,
                'detailed_description': 'A detailed description of test game 2',
                'short_description': 'Short desc 2',
                'header_image': 'http://example.com/image2.jpg'
            },
            {
                'steam_appid': 11111,
                'name': 'Test DLC',
                'type': 'dlc',
                'is_free': False,
                'detailed_description': 'A detailed description of test DLC',
                'short_description': 'Short desc DLC',
                'header_image': 'http://example.com/image3.jpg'
            }
        ]
        return pd.DataFrame(apps)
    
    def create_price_dataset(self, app_ids: List[int]) -> Dict[int, Dict[str, Any]]:
        """Create corresponding price data for given app IDs."""
        # Fixed price data for reproducible tests
        price_map = {
            12345: {'min': 9.99, 'max': 19.99, 'count': 25},
            67890: {'min': 0.0, 'max': 0.0, 'count': 1},
            11111: {'min': 4.99, 'max': 9.99, 'count': 10}
        }
        
        return {app_id: price_map.get(app_id, {'min': 0.0, 'max': 0.0, 'count': 0}) 
                for app_id in app_ids}
    
    def create_large_dataset(self, num_apps: int = 1000) -> pd.DataFrame:
        """Create larger dataset for performance testing."""
        apps = generate_steam_app_data(num_apps)
        return pd.DataFrame(apps)
    
    def save_fixtures(self, output_dir: Path):
        """Save test fixtures to files for use in tests."""
        output_dir.mkdir(exist_ok=True, parents=True)
        
        # Save minimal app data
        minimal_apps = self.create_minimal_app_dataset()
        minimal_apps.to_csv(output_dir / 'sample_expanded_apps.csv', index=False)
        
        # Save corresponding price data
        app_ids = minimal_apps['steam_appid'].tolist()
        price_data = self.create_price_dataset(app_ids)
        
        with open(output_dir / 'sample_price_minmax.json', 'w') as f:
            json.dump({str(k): v for k, v in price_data.items()}, f, indent=2)
        
        print(f"Fixtures saved to {output_dir}")

# Example usage
factory = TestDataFactory()
test_apps = factory.create_minimal_app_dataset()
print("\nMinimal test dataset:")
print(test_apps[['steam_appid', 'name', 'type', 'is_free']].to_string())
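
`TestDataFactory` seeds the global `random`, NumPy, and Faker generators. That makes the data reproducible, but it also changes random state for any other code running in the same process. A possible alternative, sketched here with the standard library only, is to give each generator its own `random.Random` instance. `generate_seeded_prices` is a hypothetical helper, not part of the pipeline, and it simplifies the sale and free-game logic of `generate_price_data`.

```python
import random
from typing import Any, Dict, List


def generate_seeded_prices(app_ids: List[int], seed: int = 42) -> Dict[int, Dict[str, Any]]:
    """Generate price data from a private RNG so global random state is untouched."""
    rng = random.Random(seed)
    prices = {}
    for app_id in app_ids:
        base_price = round(rng.uniform(0.99, 59.99), 2)
        # Lowest observed price is 10-80% of the base price.
        min_price = round(base_price * rng.uniform(0.1, 0.8), 2)
        prices[app_id] = {
            "min": min_price,
            "max": base_price,
            "count": rng.randint(1, 100),
        }
    return prices


first_run = generate_seeded_prices([12345, 67890, 11111])
second_run = generate_seeded_prices([12345, 67890, 11111])
print(first_run == second_run)  # same seed, same data
```

Because the seed is a parameter, two tests can ask for different but still deterministic datasets without interfering with each other.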

## Section 4: Environment-Based Data Configuration

Implement configuration systems that use different data sources for local development, CI/CD, and production testing environments.

In [None]:
import os
from enum import Enum
from dataclasses import dataclass
from typing import Optional

class Environment(Enum):
    LOCAL = "local"
    CI = "ci"
    STAGING = "staging"
    PRODUCTION = "production"

@dataclass
class TestDataConfig:
    """Configuration for test data based on environment."""
    __test__ = False  # Not a test class; keeps pytest from trying to collect it
    environment: Environment
    use_real_data: bool
    data_dir: Path
    fixtures_dir: Path
    max_test_records: Optional[int] = None
    cache_enabled: bool = True
    
    @classmethod
    def from_environment(cls) -> 'TestDataConfig':
        """Create config based on current environment variables."""
        env_name = os.getenv('TEST_ENV', 'local').lower()
        environment = Environment(env_name)
        
        # Base paths
        base_dir = Path(os.getenv('PROJECT_ROOT', '.'))
        data_dir = base_dir / 'pipeline' / 'data'
        fixtures_dir = base_dir / 'pipeline' / 'tests' / 'fixtures'
        
        if environment == Environment.LOCAL:
            return cls(
                environment=environment,
                use_real_data=os.getenv('USE_REAL_DATA', 'false').lower() == 'true',
                data_dir=data_dir,
                fixtures_dir=fixtures_dir,
                max_test_records=None,
                cache_enabled=True
            )
        
        elif environment == Environment.CI:
            return cls(
                environment=environment,
                use_real_data=False,  # Never use real data in CI
                data_dir=data_dir,
                fixtures_dir=fixtures_dir,
                max_test_records=100,  # Limit records for speed
                cache_enabled=False
            )
        
        else:
            # Staging/Production - minimal fixtures only
            return cls(
                environment=environment,
                use_real_data=False,
                data_dir=data_dir,
                fixtures_dir=fixtures_dir,
                max_test_records=10,
                cache_enabled=True
            )

class TestDataProvider:
    """Provides test data based on environment configuration."""
    __test__ = False  # Not a test class; keeps pytest from trying to collect it
    
    def __init__(self, config: Optional[TestDataConfig] = None):
        self.config = config or TestDataConfig.from_environment()
        self.factory = TestDataFactory()
    
    def get_app_data(self) -> pd.DataFrame:
        """Get app data appropriate for current environment."""
        if self.config.use_real_data and self._real_data_available():
            return self._load_real_app_data()
        else:
            return self._load_fixture_app_data()
    
    def get_price_data(self, app_ids: List[int]) -> Dict[int, Dict[str, Any]]:
        """Get price data appropriate for current environment."""
        if self.config.use_real_data and self._real_data_available():
            return self._load_real_price_data(app_ids)
        else:
            return self.factory.create_price_dataset(app_ids)
    
    def _real_data_available(self) -> bool:
        """Check if real data files exist."""
        required_files = [
            self.config.data_dir / 'expanded_sampled_apps.csv',
            self.config.data_dir / 'price_minmax.json'
        ]
        return all(f.exists() for f in required_files)
    
    def _load_real_app_data(self) -> pd.DataFrame:
        """Load real app data with optional record limiting."""
        df = pd.read_csv(self.config.data_dir / 'expanded_sampled_apps.csv')
        if self.config.max_test_records is not None:
            df = df.head(self.config.max_test_records)
        return df
    
    def _load_fixture_app_data(self) -> pd.DataFrame:
        """Load fixture app data."""
        fixture_file = self.config.fixtures_dir / 'sample_expanded_apps.csv'
        if fixture_file.exists():
            return pd.read_csv(fixture_file)
        else:
            # Generate on-the-fly if fixture doesn't exist
            return self.factory.create_minimal_app_dataset()
    
    def _load_real_price_data(self, app_ids: List[int]) -> Dict[int, Dict[str, Any]]:
        """Load real price data."""
        with open(self.config.data_dir / 'price_minmax.json', 'r') as f:
            all_prices = json.load(f)
        
        # Filter to requested app_ids; JSON keys are strings, so look up by str(app_id)
        return {app_id: all_prices.get(str(app_id), {'min': 0.0, 'max': 0.0, 'count': 0}) 
                for app_id in app_ids}

# Example usage
config = TestDataConfig.from_environment()
print(f"Current environment: {config.environment.value}")
print(f"Use real data: {config.use_real_data}")
print(f"Max test records: {config.max_test_records}")
print(f"Fixtures directory: {config.fixtures_dir}")

provider = TestDataProvider(config)
test_data = provider.get_app_data()
print(f"\nLoaded {len(test_data)} test records")
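
`Environment(env_name)` raises a bare `ValueError` when `TEST_ENV` holds an unexpected value such as `dev`. The sketch below shows one way to make that failure easier to diagnose. `resolve_environment` and its `environ` parameter are hypothetical; the enum mirrors the one defined above so the block runs on its own.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class Environment(Enum):
    LOCAL = "local"
    CI = "ci"
    STAGING = "staging"
    PRODUCTION = "production"


def resolve_environment(environ: Optional[Mapping[str, str]] = None) -> Environment:
    """Map TEST_ENV to an Environment, failing with a descriptive message."""
    # Accepting a mapping makes the function testable without touching os.environ.
    source = os.environ if environ is None else environ
    raw = source.get("TEST_ENV", "local").strip().lower()
    try:
        return Environment(raw)
    except ValueError:
        valid = ", ".join(member.value for member in Environment)
        raise ValueError(f"Unknown TEST_ENV {raw!r}; expected one of: {valid}") from None


print(resolve_environment({"TEST_ENV": "CI"}))
print(resolve_environment({}))
```

Failing loudly is a deliberate choice here: silently falling back to `local` could enable real-data loading on a machine where it was never intended.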

## Section 5: Test Data Fixtures and Snapshots

Create small, representative data fixtures that can be committed to version control and larger datasets that are generated or downloaded as needed.

In [None]:
# Let's examine our current fixture structure
workspace_root = Path.cwd()
fixtures_dir = workspace_root / 'pipeline' / 'tests' / 'fixtures'

print(f"Fixtures directory: {fixtures_dir}")
print(f"Exists: {fixtures_dir.exists()}")

if fixtures_dir.exists():
    print("\nCurrent fixture files:")
    for file in fixtures_dir.iterdir():
        if file.is_file():
            size_kb = file.stat().st_size / 1024
            print(f"  {file.name}: {size_kb:.1f} KB")
            
            # Show content preview for small files
            if size_kb < 5 and file.suffix in ['.csv', '.json']:
                print(f"    Preview: {file.read_text()[:100]}...")

# Guidelines for fixture design
fixture_guidelines = {
    "Size limits": {
        "Individual files": "< 10 KB each",
        "Total fixtures": "< 100 KB total",
        "Records per file": "< 50 records"
    },
    "Content guidelines": {
        "Representative data": "Cover edge cases and common scenarios",
        "Anonymized": "No real user data or sensitive information",
        "Deterministic": "Same data every time for reproducible tests",
        "Minimal but complete": "Just enough to test all code paths"
    },
    "File formats": {
        "CSV": "For tabular data (apps, reviews)",
        "JSON": "For structured data (price mappings, config)",
        "Text": "For samples of descriptions, reviews"
    }
}

print("\n=== FIXTURE DESIGN GUIDELINES ===")
for category, rules in fixture_guidelines.items():
    print(f"\n{category.upper()}:")
    for rule, description in rules.items():
        print(f"  • {rule}: {description}")

In [None]:
def create_comprehensive_fixtures():
    """Create a comprehensive set of test fixtures."""
    
    factory = TestDataFactory(seed=42)  # Deterministic
    
    # Ensure fixtures directory exists
    fixtures_dir.mkdir(exist_ok=True, parents=True)
    
    # 1. Minimal app dataset (already created)
    minimal_apps = factory.create_minimal_app_dataset()
    minimal_apps.to_csv(fixtures_dir / 'sample_expanded_apps.csv', index=False)
    
    # 2. Edge case apps (free games, DLC, different types)
    edge_cases = pd.DataFrame([
        {
            'steam_appid': 99999,
            'name': 'Free Game',
            'type': 'game',
            'is_free': True,
            'detailed_description': '',  # Empty description edge case (pd.read_csv reads it back as NaN)
            'short_description': 'Free to play game',
            'header_image': ''
        },
        {
            'steam_appid': 88888,
            'name': 'Very Long Game Name That Might Cause Issues With Text Processing And Database Storage',
            'type': 'software',
            'is_free': False,
            'detailed_description': 'x' * 1000,  # Very long description
            'short_description': 'Software application',
            'header_image': 'https://example.com/very/long/url/that/might/cause/issues/image.jpg'
        }
    ])
    edge_cases.to_csv(fixtures_dir / 'edge_case_apps.csv', index=False)
    
    # 3. Price data with edge cases
    all_app_ids = minimal_apps['steam_appid'].tolist() + edge_cases['steam_appid'].tolist()
    price_data = factory.create_price_dataset(all_app_ids)
    
    # Add edge cases for prices
    price_data[99999] = {'min': 0.0, 'max': 0.0, 'count': 0}  # Free game
    price_data[88888] = {'min': 199.99, 'max': 299.99, 'count': 5}  # Expensive software
    
    with open(fixtures_dir / 'sample_price_minmax.json', 'w') as f:
        json.dump({str(k): v for k, v in price_data.items()}, f, indent=2)
    
    # 4. Sample review data (for review processing tests)
    sample_reviews = [
        {
            'appid': 12345,
            'review_text': 'Great game! Highly recommended.',
            'voted_up': True,
            'votes_helpful': 15,
            'votes_funny': 2
        },
        {
            'appid': 12345,
            'review_text': 'Not worth the money. Boring gameplay.',
            'voted_up': False,
            'votes_helpful': 8,
            'votes_funny': 0
        },
        {
            'appid': 67890,
            'review_text': '',  # Empty review edge case (pd.read_csv reads it back as NaN)
            'voted_up': True,
            'votes_helpful': 0,
            'votes_funny': 0
        }
    ]
    
    pd.DataFrame(sample_reviews).to_csv(fixtures_dir / 'sample_reviews.csv', index=False)
    
    # 5. Configuration fixtures
    test_config = {
        'embedding_model': 'test-model',
        'batch_size': 10,
        'max_description_length': 500,
        'price_currency': 'USD'
    }
    
    with open(fixtures_dir / 'test_config.json', 'w') as f:
        json.dump(test_config, f, indent=2)
    
    print(f"Created comprehensive fixtures in {fixtures_dir}")
    
    # Show what we created
    total_size = 0
    for file in fixtures_dir.iterdir():
        if file.is_file():
            size = file.stat().st_size
            total_size += size
            print(f"  {file.name}: {size} bytes")
    
    print(f"\nTotal fixtures size: {total_size / 1024:.1f} KB")
    
    if total_size > 100 * 1024:  # 100 KB limit
        print("⚠️  WARNING: Fixtures exceed 100 KB size limit!")
    else:
        print("✅ Fixtures within size guidelines")

# Create the fixtures
create_comprehensive_fixtures()
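
The size limits from the guidelines can also be enforced automatically rather than printed as a warning. This is a minimal sketch that assumes the 10 KB per-file and 100 KB total limits listed above. `check_fixture_sizes` is a hypothetical helper that could be called from a test or a pre-commit hook.

```python
import tempfile
from pathlib import Path
from typing import List

MAX_FILE_BYTES = 10 * 1024    # "< 10 KB each" from the guidelines
MAX_TOTAL_BYTES = 100 * 1024  # "< 100 KB total" from the guidelines


def check_fixture_sizes(fixtures_dir: Path) -> List[str]:
    """Return one message per violated size limit; empty means within budget."""
    violations = []
    total = 0
    for path in sorted(fixtures_dir.iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        total += size
        if size >= MAX_FILE_BYTES:
            violations.append(f"{path.name}: {size} bytes (limit {MAX_FILE_BYTES})")
    if total >= MAX_TOTAL_BYTES:
        violations.append(f"total: {total} bytes (limit {MAX_TOTAL_BYTES})")
    return violations


# Demonstrate against a throwaway directory rather than the real fixtures.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "small.csv").write_text("steam_appid,name\n12345,Test Game 1\n")
    (demo_dir / "too_big.csv").write_bytes(b"x" * (11 * 1024))
    demo_violations = check_fixture_sizes(demo_dir)

print(demo_violations)
```

In a pytest suite this could reduce to a single assertion that `check_fixture_sizes(fixtures_dir)` returns an empty list.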

## Section 6: Data Mocking Strategies

Implement mocking patterns for external data dependencies using tools like pytest fixtures, unittest.mock, or dedicated mocking libraries.

In [None]:
# Example pytest fixtures for our pipeline tests

pytest_fixtures_code = '''
# Contents of pipeline/tests/conftest.py
import pytest
import pandas as pd
import json
from pathlib import Path
from unittest.mock import Mock, patch
from typing import Dict, Any

@pytest.fixture
def sample_app_data():
    """Provide sample app data for tests."""
    return pd.DataFrame([
        {
            'steam_appid': 12345,
            'name': 'Test Game 1',
            'type': 'game',
            'is_free': False,
            'detailed_description': 'A detailed description of test game 1',
            'short_description': 'Short desc 1'
        },
        {
            'steam_appid': 67890,
            'name': 'Test Game 2',
            'type': 'game', 
            'is_free': True,
            'detailed_description': 'A detailed description of test game 2',
            'short_description': 'Short desc 2'
        }
    ])

@pytest.fixture
def sample_price_data():
    """Provide sample price data for tests."""
    return {
        12345: {'min': 9.99, 'max': 19.99, 'count': 25},
        67890: {'min': 0.0, 'max': 0.0, 'count': 1}
    }

@pytest.fixture
def mock_steam_api():
    """Mock Steam API responses."""
    with patch('ags_pipeline.extract.steam_client.SteamClient') as mock_client:
        mock_instance = Mock()
        
        # Mock app details response
        mock_instance.get_app_details.return_value = {
            'success': True,
            'data': {
                'name': 'Mocked Game',
                'type': 'game',
                'is_free': False,
                'detailed_description': 'Mocked description'
            }
        }
        
        # Mock reviews response
        mock_instance.get_app_reviews.return_value = {
            'success': 1,
            'reviews': [
                {
                    'review': 'Great game!',
                    'voted_up': True,
                    'votes_helpful': 10
                }
            ]
        }
        
        mock_client.return_value = mock_instance
        yield mock_instance

@pytest.fixture
def mock_embedder():
    """Mock embedding model."""
    with patch('ags_pipeline.embed.nomic_embedder.NomicEmbedder') as mock_embedder:
        mock_instance = Mock()
        
        # Return fixed embeddings for reproducible tests
        mock_instance.embed_text.return_value = [0.1, 0.2, 0.3, 0.4, 0.5]  # 5D vector
        mock_instance.embed_batch.return_value = [
            [0.1, 0.2, 0.3, 0.4, 0.5],
            [0.6, 0.7, 0.8, 0.9, 1.0]
        ]
        
        mock_embedder.return_value = mock_instance
        yield mock_instance

@pytest.fixture
def temp_data_dir(tmp_path):
    """Create temporary data directory with test files."""
    data_dir = tmp_path / "data"
    data_dir.mkdir()
    
    # Create sample CSV
    sample_df = pd.DataFrame([
        {'steam_appid': 12345, 'name': 'Test Game', 'type': 'game'}
    ])
    sample_df.to_csv(data_dir / 'expanded_sampled_apps.csv', index=False)
    
    # Create sample price data
    price_data = {'12345': {'min': 9.99, 'max': 19.99, 'count': 25}}
    with open(data_dir / 'price_minmax.json', 'w') as f:
        json.dump(price_data, f)
    
    return data_dir

@pytest.fixture(autouse=True)
def mock_external_apis():
    """Automatically mock requests.get so no test issues a real HTTP GET request."""
    with patch('requests.get') as mock_get:
        # Default successful response
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.json.return_value = {'success': True, 'data': {}}
        mock_get.return_value = mock_response
        yield mock_get
'''

print("Example pytest fixtures for comprehensive mocking:")
print(pytest_fixtures_code)

In [None]:
# Example of how to use mocking in actual tests

test_example_code = '''
# Example tests using our mocking strategy
import pytest

def test_etl_with_mocked_data(sample_app_data, sample_price_data, temp_data_dir):
    """Test ETL pipeline with mocked data."""
    from ags_pipeline.io.price_minmax_loader import join_price_minmax
    
    # Use fixture data
    result_df = join_price_minmax(sample_app_data, sample_price_data, appid_col='steam_appid')
    
    # Verify expected columns exist
    assert 'price_min' in result_df.columns
    assert 'price_max' in result_df.columns
    assert 'price_samples_count' in result_df.columns
    
    # Verify data integrity
    assert len(result_df) == len(sample_app_data)
    
    # Test specific values (approx avoids brittle exact float comparisons)
    test_row = result_df[result_df['steam_appid'] == 12345].iloc[0]
    assert test_row['price_min'] == pytest.approx(9.99)
    assert test_row['price_max'] == pytest.approx(19.99)

def test_steam_api_integration(mock_steam_api):
    """Test Steam API integration with mocked responses."""
    from ags_pipeline.extract.steam_client import SteamClient
    
    client = SteamClient()
    app_data = client.get_app_details(12345)
    
    # Verify mock was called
    mock_steam_api.get_app_details.assert_called_once_with(12345)
    
    # Verify mocked response
    assert app_data['success'] is True
    assert app_data['data']['name'] == 'Mocked Game'

@pytest.mark.parametrize("app_type,expected_free", [
    ("game", False),
    ("demo", True),
    ("dlc", False)
])
def test_app_type_logic(app_type, expected_free):
    """Test business logic with parameterized data."""
    # Test logic without requiring real data
    from ags_pipeline.transform.field_locator import determine_if_free
    
    result = determine_if_free(app_type, price=0.0)
    assert result == expected_free
'''

print("Example test patterns using mocked data:")
print(test_example_code)
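
Patching by import path works when the target module layout is stable. Where the code under test accepts its collaborator as an argument, a `Mock` can be passed in directly instead. The sketch below assumes a hypothetical `AppNameResolver` class; only the `get_app_details` method name and the response shape are taken from the fixtures above.

```python
from typing import Any, Optional
from unittest.mock import Mock


class AppNameResolver:
    """Hypothetical consumer that receives its Steam client via the constructor."""

    def __init__(self, client: Any):
        self.client = client

    def resolve(self, app_id: int) -> Optional[str]:
        response = self.client.get_app_details(app_id)
        if not response.get("success"):
            return None
        return response["data"]["name"]


# spec restricts the mock to the listed attributes, so a typo raises AttributeError.
mock_client = Mock(spec=["get_app_details"])
mock_client.get_app_details.return_value = {
    "success": True,
    "data": {"name": "Mocked Game"},
}

resolver = AppNameResolver(mock_client)
resolved_name = resolver.resolve(12345)

# Verify both the result and how the collaborator was called.
mock_client.get_app_details.assert_called_once_with(12345)
print(resolved_name)
```

Injecting the client keeps the test independent of where `SteamClient` is imported, which is the usual source of patches that silently miss their target.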

## Summary: Test Data Management Best Practices

Our comprehensive test data strategy for ActualGameSearch V2 includes:

### ✅ What We've Implemented

1. **Fixture-based Testing**: Small, committed test fixtures (< 10 KB each) with representative data
2. **Environment-aware Configuration**: Different data sources for local, CI, staging, and production environments
3. **Mock External Dependencies**: Mocks for the Steam API client, the embedding model, and `requests.get`
4. **Reproducible Data Generation**: Seeded random data generation for consistent test results
5. **Edge Case Coverage**: Fixtures include edge cases like empty descriptions, very long names, free games

### 🎯 Key Benefits

- **Fast CI/CD**: Tests run quickly without downloading large datasets
- **Reliable**: No dependency on external APIs or data availability
- **Secure**: No sensitive user data in version control
- **Maintainable**: Small fixtures are easy to update and understand
- **Comprehensive**: Tests cover both happy paths and edge cases

### 📁 File Organization

```
pipeline/
├── tests/
│   ├── fixtures/              # Small test data files (committed)
│   │   ├── sample_expanded_apps.csv
│   │   ├── sample_price_minmax.json
│   │   ├── edge_case_apps.csv
│   │   ├── sample_reviews.csv
│   │   └── test_config.json
│   ├── conftest.py           # Pytest fixtures and mocks
│   └── test_*.py            # Test files using fixtures
├── data/                    # Real data files (not committed)
│   ├── expanded_sampled_apps.csv  # Large datasets
│   └── price_minmax.json          # Generated locally
└── src/ags_pipeline/        # Source code
```

### 🚀 Next Steps

1. **Expand fixture coverage** as new data types are added
2. **Add integration test tier** for testing with larger datasets locally
3. **Document data generation scripts** for reproducing test datasets
4. **Monitor fixture size** to keep within version control limits
5. **Add performance benchmarks** using generated large datasets

This approach ensures our tests are fast, reliable, and maintainable while still providing comprehensive coverage of our ETL pipeline functionality.