# Example Data Exploration

**Purpose**: Ad-hoc analysis and data exploration template

**Author**: {TEAM NAME OR INDIVIDUAL}

**Date**: {LAST UPDATE DATE}

## Overview
Use this notebook for:
- Exploring new data sources
- Prototyping transformations
- Analyzing data quality
- Creating visualizations
- Testing utility functions

## Performance Strategy for Exploration

**Key Principles:**
1. **Filter first, sample later** - Apply date filters and a `LIMIT` before any other operation
2. **Use recent data only** - Start with the last 7-30 days, not the full history
3. **Hard limits** - Cap exploration samples at 10k-100k rows
4. **Handle corrupt files** - Configure Spark to skip bad data
5. **Approximate counts** - Use `approx_count_distinct()` for speed
6. **Avoid full scans** - Never call `.count()` on unfiltered large tables
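
The first three principles can be combined in a single query. The sketch below uses a hypothetical helper, `build_sample_query`, which is not part of `example_project.utilities` and is not necessarily how `sample_by_date()` is implemented. It only builds a Spark SQL string in which the date filter and the `LIMIT` are part of the query itself, so nothing has to be read before they apply.

```python
from datetime import date, timedelta


def build_sample_query(table_path, date_column, days_back=7, max_rows=10_000, today=None):
    """Return a Spark SQL string that filters by date first, then caps the row count.

    Hypothetical helper for illustration. table_path and date_column are
    interpolated as-is, so they must be trusted identifiers, not user input.
    """
    if days_back <= 0 or max_rows <= 0:
        raise ValueError("days_back and max_rows must be positive")
    today = today or date.today()
    cutoff = today - timedelta(days=days_back)
    return (
        f"SELECT * FROM {table_path} "
        f"WHERE {date_column} >= DATE '{cutoff.isoformat()}' "
        f"LIMIT {max_rows}"
    )


# Fixed "today" so the result is reproducible
query = build_sample_query(
    "sandbox.demo.bets", "placed_ts", days_back=7, max_rows=100, today=date(2024, 1, 8)
)
print(query)
```

In a notebook the string could be passed to `spark.sql(query)`. Note that `LIMIT` without `ORDER BY` returns an arbitrary subset of the matching rows, not a random sample.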

---

**Note**: If you get import errors after updating utility functions, restart the Python kernel:
- In Databricks: Click **Run** → **Clear state and run all**
- Or run `dbutils.library.restartPython()`, which restarts the Python process and clears notebook state
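
When a full restart is inconvenient, dropping the cached modules and importing them again sometimes resolves stale imports. This is a minimal sketch, not a Databricks feature: `purge_cached_modules` is a hypothetical helper that only removes entries from the standard `sys.modules` cache. Objects already created from the old module versions keep their old code, so a restart remains the more reliable option.

```python
import sys


def purge_cached_modules(prefix):
    """Remove a package and its submodules from sys.modules.

    Returns the sorted list of removed module names. Hypothetical helper;
    the next import of the package will re-execute its source files.
    """
    removed = [
        name for name in list(sys.modules)
        if name == prefix or name.startswith(prefix + ".")
    ]
    for name in removed:
        del sys.modules[name]
    return sorted(removed)
```

After calling `purge_cached_modules("example_project")`, the `from example_project.utilities import ...` statement has to be run again so the notebook names point at the freshly loaded functions.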

In [0]:
# Setup
import sys
import os
from pyspark.sql import functions as F

# Add projects to path
# Notebooks do not define __file__, so fall back to the working directory
current_dir = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
projects_path = os.path.abspath(os.path.join(current_dir, "..", ".."))
if projects_path not in sys.path:
    sys.path.insert(0, projects_path)

from src.environment_config import EnvironmentConfig

# Import utility functions for exploration
from example_project.utilities import (
    check_data_quality,
    get_date_range,
    sample_by_date,
    categorize_sport_udf
)

print("✓ Setup complete - utilities imported")
print("✓ Available utilities: sample_by_date, check_data_quality, get_date_range, categorize_sport_udf")

In [0]:
# Initialize config with defaults for exploration
spark.conf.set("bundle.catalog", "sandbox")
spark.conf.set("bundle.schema", "analytics_engineering")
spark.conf.set("bundle.core_catalog", "core_views")

config = EnvironmentConfig()
print(f"Environment: {config.env}")
print(f"Source Catalog: {config.core_catalog}")
print(f"Output schema: {config.project_schema}")

## Read and Sample Data

In [0]:
# Configure Spark to handle corrupt files gracefully
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# Use utility function to sample by date for fast exploration
source_table = config.get_core_table_path("sportsbook", "bet_legs")

# Sample last 7 days with max 10k rows using utility function
df = sample_by_date(
    table_path=source_table,
    date_column="bet_placed_local_ts",
    days_back=7,
    max_rows=10000
)

print("✓ Sampled data using sample_by_date utility")
print("  - Last 7 days")
print("  - Max 10,000 rows")
# count() is acceptable here because the DataFrame is already capped by max_rows
print(f"  - Actual rows: {df.count():,}")
print("\nSchema (first 10 columns):")
df.select(df.columns[:10]).printSchema()

## Explore Source Data

In [0]:
# Data quality checks using utility function
print("=== DATA QUALITY ANALYSIS ===")

# Use utility function for comprehensive quality checks
quality_metrics = check_data_quality(
    df,
    columns_to_check=['bet_id', 'bet_placed_local_ts', 'leg_sport_name_reporting', 'bet_portion']
)

print("\n✓ Quality check complete using check_data_quality() utility")

# Get date range using utility function
date_range = get_date_range(df, 'bet_placed_local_ts')
print("\n=== DATE RANGE ===")
print(f"From: {date_range['min_date']}")
print(f"To: {date_range['max_date']}")

# Quick preview of data
print("\n=== SAMPLE DATA ===")
display(df.limit(5))

## Prototype Analysis

In [0]:
# Fast aggregations for exploration
print("=== TOP SPORTS BY BET COUNT ===")

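# Direct Spark aggregations - clear and performant
# Note: count("*") counts rows; if each row is a bet leg, use
# F.approx_count_distinct("bet_id") to count distinct bets instead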
sport_summary = df.groupBy("leg_sport_name_reporting") \
    .agg(
        F.count("*").alias("bet_count"),
        F.sum("bet_portion").alias("total_stake"),
        F.avg("bet_portion").alias("avg_stake")
    ) \
    .orderBy(F.desc("bet_count")) \
    .limit(10)  # Top 10 only

display(sport_summary)

# Test the sport categorization UDF
print("\n=== SPORT CATEGORIZATION (using UDF utility) ===")
categorized_df = df.select(
    'leg_sport_name_reporting',
    categorize_sport_udf(F.col('leg_sport_name_reporting')).alias('sport_category')
).distinct().orderBy('sport_category').limit(20)

display(categorized_df)
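
The logic behind `categorize_sport_udf` lives in `example_project.utilities` and is not shown in this notebook. As a rough illustration of the kind of function that can sit behind such a UDF, here is a hypothetical stand-in: the name `categorize_sport`, the mapping, and the category labels are all assumptions made for this sketch.

```python
# Hypothetical mapping: these categories are illustrative, not the project's real ones
SPORT_CATEGORIES = {
    "soccer": "Team Sports",
    "basketball": "Team Sports",
    "tennis": "Racket Sports",
    "table tennis": "Racket Sports",
}


def categorize_sport(sport_name):
    """Map a raw sport name to a coarse category.

    None and unknown names fall back to "Other"; matching ignores case
    and surrounding whitespace.
    """
    if sport_name is None:
        return "Other"
    return SPORT_CATEGORIES.get(sport_name.strip().lower(), "Other")


for name in ["Soccer", "  Tennis ", "Curling", None]:
    print(repr(name), "->", categorize_sport(name))
```

A plain function like this can be unit-tested without a cluster and then wrapped with `F.udf(categorize_sport, StringType())`. Python UDFs run row by row and are generally slower than built-in Spark functions, so for a small lookup like this a join against a mapping table or a `F.when` chain may be worth comparing.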

## Summary

**Optimized Utility Functions Demonstrated:**

✅ **`sample_by_date()`** - Fast data sampling with date filter and row limit
  - Uses SQL predicate pushdown for optimal performance
  - Applies LIMIT at query level
  - Perfect for exploration without loading full datasets

✅ **`check_data_quality()`** - Single-pass quality checks
  - Optimized: One aggregation query for all null counts
  - Checks first 10 columns by default (configurable)
  - Shows null percentages for context

✅ **`get_date_range()`** - Extract date ranges efficiently
  - Single aggregation for min/max dates

✅ **`categorize_sport_udf()`** - Custom UDF for transformations
  - Demonstrates domain-specific logic as reusable function
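
The single-pass idea behind `check_data_quality()` can be sketched as follows. This is not the utility's actual implementation: `null_count_exprs` and `null_percentages` are hypothetical helpers that build one aggregate expression per column, so that all null counts come back in one row from one query, and then convert that row into percentages.

```python
def null_count_exprs(columns):
    """Build one COUNT(*) plus one null-counting aggregate per column (Spark SQL syntax)."""
    exprs = ["COUNT(*) AS total_rows"]
    for col in columns:
        exprs.append(
            f"SUM(CASE WHEN `{col}` IS NULL THEN 1 ELSE 0 END) AS `{col}_nulls`"
        )
    return exprs


def null_percentages(result_row, columns):
    """Convert the single aggregated row (as a dict) into null percentages per column."""
    total = result_row["total_rows"]
    if total == 0:
        # An empty input has no nulls to report; SUM(...) would be None here
        return {col: 0.0 for col in columns}
    return {
        col: round(100.0 * result_row[f"{col}_nulls"] / total, 2)
        for col in columns
    }


columns = ["bet_id", "stake"]
print(null_count_exprs(columns))
# Example aggregated row, written by hand for illustration
print(null_percentages({"total_rows": 200, "bet_id_nulls": 0, "stake_nulls": 5}, columns))
```

With a DataFrame at hand, the aggregated row could be obtained with `df.selectExpr(*null_count_exprs(columns)).first().asDict()`, which scans the data once regardless of how many columns are checked.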

**Performance Best Practices:**

1. **Filter first with `sample_by_date()`** - Always start with recent, limited data
2. **Single-pass aggregations** - Combine multiple metrics in one query
3. **Direct Spark functions** - Use `.groupBy().agg()` for simple aggregations
4. **Limit results** - Pair `.orderBy()` with `.limit()` for top-N queries
5. **Handle corrupt files** - Set Spark configs to skip bad data

**When to Use Utilities vs Direct Code:**

| Use Utility Function | Use Direct Spark Code |
|---------------------|----------------------|
| Common patterns (sampling, quality checks) | One-off aggregations |
| Complex logic (UDFs, validations) | Simple groupBy operations |
| Reusable across projects | Notebook-specific analysis |
| Needs optimization (single-pass) | Already optimal |

**Exploration Workflow:**
```python
# 1. Sample with utility
df = sample_by_date(table_path, "date_col", days_back=7, max_rows=10000)

# 2. Quality check with utility
metrics = check_data_quality(df, columns_to_check=['col1', 'col2'])

# 3. Direct aggregations for analysis
summary = df.groupBy("category").agg(F.count("*"), F.sum("amount")).limit(10)

# 4. Use UDFs for complex transformations
df_transformed = df.withColumn("category", my_udf(F.col("raw_value")))
```

**Document your findings here:**
- Key insights from the data
- Data quality issues discovered
- Transformation ideas to implement
- Questions for stakeholders