# Notebook 01: Data Loading Pipeline

**Purpose**: Load → Label → Filter → Split → Sample Train Only → Save

**Pipeline Stages**:
1. Load raw Trino query logs from S3
2. Apply labeling (CPU/memory/error thresholds)
3. Apply data quality filters
4. Create time-based splits (train/val/test)
5. Apply boundary-focused sampling ONLY to training data
6. Save to separate paths (train_sampled, val_original, test_original)

**Key Features**:
- Continuous distance-based boundary sampling (POC v2.1.0)
- S3 checkpointing for fault tolerance
- Optional analysis at each stage
- Target 5:1 small:heavy ratio for TRAINING only
- Val/test kept at their original distribution (~36:1 across the filtered dataset) for realistic evaluation

**CRITICAL**: This notebook samples ONLY the training data. The validation and test sets
keep their original class distribution (~36:1 small:heavy across the filtered dataset;
individual splits vary with the time window) so that evaluation metrics reflect production
conditions. Evaluating on the rebalanced 5:1 mix would give overoptimistic results.
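
To see why the class ratio of the evaluation data matters, the sketch below works through how precision shifts when the same classifier (fixed recall and false-positive rate) is scored at a different small:heavy ratio. It is illustrative arithmetic only, not part of the pipeline. The function names are hypothetical, and the operating point (95% recall, 70% precision on 5:1 data) is an assumption taken from the expectations quoted in the summary report at the end of this notebook.

```python
def false_positive_rate(recall, precision, small_per_heavy):
    """FPR implied by a given recall and precision at a given small:heavy ratio."""
    # Per heavy query: TP = recall, FP = fpr * small_per_heavy,
    # and precision = TP / (TP + FP). Solve that for fpr.
    return recall * (1.0 - precision) / (precision * small_per_heavy)


def precision_at_ratio(recall, fpr, small_per_heavy):
    """Precision of the same classifier when evaluated at another class ratio."""
    true_pos = recall
    false_pos = fpr * small_per_heavy
    return true_pos / (true_pos + false_pos)


# Assumed operating point on the 5:1 sampled data: 95% recall, 70% precision.
fpr = false_positive_rate(recall=0.95, precision=0.70, small_per_heavy=5.0)

for ratio in (5.0, 36.0):
    p = precision_at_ratio(0.95, fpr, ratio)
    print(f"{ratio:>4.0f}:1 -> precision {p:.1%}")
```

Under these assumptions precision falls from 70% at 5:1 to roughly 24% at 36:1, in line with the 20-25% range the summary report anticipates.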

**Prerequisites**:
- Completed notebook 00 (setup)
- Spark configuration copied from notebook 00

**Duration**: ~30-45 minutes for full dataset

## 1. Configure Spark Session

**PASTE THE SPARK CONFIGURATION FROM NOTEBOOK 00 HERE**

In [1]:
%%configure -f
{
    "pyFiles": [
        "s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/code/query_predictor_latest.zip",
        "s3://uipds-108043591022/dataintelligence-dev/di-airflow-prod/dags/common/utils/ParseArgs.py"
    ],
    "driverMemory": "16G",
    "driverCores": 4,
    "executorMemory": "20G",
    "executorCores": 5,
    "conf": {
        "spark.driver.maxResultSize": "8G",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "5",
        "spark.dynamicAllocation.maxExecutors": "20"
    }
}
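
As a rough sizing aid, the sketch below computes the executor cores and executor heap memory this configuration can reach at the minimum and maximum of dynamic allocation. The helper names are hypothetical. The figures exclude the driver and any per-executor memory overhead, so the real cluster footprint is larger.

```python
def parse_gib(value):
    """Convert a Spark-style size string such as '20G' to a whole number of GiB."""
    if not value.upper().endswith("G"):
        raise ValueError(f"expected a size in G, got {value!r}")
    return int(value[:-1])


def executor_capacity(executor_memory, executor_cores, n_executors):
    """Total executor cores and heap memory (GiB) for a given executor count."""
    return {
        "cores": executor_cores * n_executors,
        "memory_gib": parse_gib(executor_memory) * n_executors,
    }


# Values copied from the %%configure cell above.
low = executor_capacity("20G", 5, n_executors=5)    # spark.dynamicAllocation.minExecutors
high = executor_capacity("20G", 5, n_executors=20)  # spark.dynamicAllocation.maxExecutors
print(f"min: {low['cores']} cores, {low['memory_gib']} GiB")
print(f"max: {high['cores']} cores, {high['memory_gib']} GiB")
```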


## 2. Import Dependencies

In [2]:
%%spark
import logging
import yaml
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date, hour as get_hour

# Import training modules
from query_predictor.training.data_loader import DataLoader
from query_predictor.training.checkpoint_manager import CheckpointManager
from query_predictor.training.boundary_sampler import BoundarySampler
from query_predictor.training.dataframe_analyzer import DataFrameAnalyzer

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("Dependencies imported")
print(f"Spark version: {spark.version}")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
3201,application_1758752217644_211229,pyspark,idle,Link,Link,pmannem,✔


SparkSession available as 'spark'.


Dependencies imported
Spark version: 3.5.4-amzn-0

## 3. Load Configuration

In [22]:
%%spark
import boto3
import yaml

# Download training configuration from S3
s3_client = boto3.client('s3')
s3_bucket = 'uip-datalake-bucket-prod'
s3_prefix = 'sf_trino/trino_query_predictor'
config_s3_key = f"{s3_prefix}/config/training_config_latest.yaml"
config_path = '/tmp/training_config.yaml'

print(f"Downloading config from S3: s3://{s3_bucket}/{config_s3_key}")
s3_client.download_file(s3_bucket, config_s3_key, config_path)
print(f"✅ Config downloaded to: {config_path}")

# Load configuration
with open(config_path) as f:
    config = yaml.safe_load(f)

print("✅ Configuration loaded")
print(f"  Date range: {config['data_loading']['start_date']} to {config['data_loading']['end_date']}")
print(f"  Target ratio: {config['boundary_sampling']['balance_ratio']}:1 (Small:Heavy)")
print(f"  Analysis mode: {'ENABLED (slower)' if config['analysis']['enabled'] else 'DISABLED (faster)'}")

# OPTIONAL: Override config parameters after loading
# Example: Use different input data path
# config['data_loading']['s3_base_path'] = 's3://your-bucket/your-path'
# Example: Change date range
# config['data_loading']['start_date'] = '2025-09-01'
# config['data_loading']['end_date'] = '2025-10-01'
# Example: Change sampling ratio
# config['boundary_sampling']['balance_ratio'] = 3.0
# Example: Disable analysis for faster execution
# config['analysis']['enabled'] = False
# Example: Use sample data for quick testing
# config['data_loading']['sample_fraction'] = 0.1  # Use 10% of data

Downloading config from S3: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/config/training_config_latest.yaml
✅ Config downloaded to: /tmp/training_config.yaml
✅ Configuration loaded
  Date range: 2025-08-01 to 2025-10-01
  Target ratio: 5.0:1 (Small:Heavy)
  Analysis mode: DISABLED (faster)
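
Because later cells index the config with plain `config[...][...]` lookups, a missing key only surfaces as a `KeyError` partway through the run. A minimal sketch of an up-front check is shown below. The helper `missing_config_keys` and the `REQUIRED_KEYS` table are hypothetical and list only the keys this notebook visibly reads. The real config file may contain more.

```python
REQUIRED_KEYS = {
    "data_loading": ["s3_base_path", "start_date", "end_date", "processed_output_path"],
    "boundary_sampling": ["enabled", "balance_ratio"],
    "time_splits": ["train_days", "val_days", "test_days"],
    "checkpointing": ["enabled", "s3_path"],
    "analysis": ["enabled"],
}


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return dotted names of required keys that are absent from the config dict."""
    missing = []
    for section, keys in required.items():
        section_values = config.get(section)
        if not isinstance(section_values, dict):
            # The whole section is absent (or not a mapping).
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in section_values)
    return missing


# Example with a deliberately incomplete config.
example = {
    "data_loading": {"start_date": "2025-08-01", "end_date": "2025-10-01"},
    "analysis": {"enabled": False},
}
print(missing_config_keys(example))
```

In the notebook this could be called right after `yaml.safe_load`, raising an error if the returned list is non-empty.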

## 4. Initialize Components

In [7]:
%%spark
# Initialize checkpoint manager
checkpoint_mgr = CheckpointManager(
    spark,
    s3_checkpoint_path=config['checkpointing']['s3_path'],
    enabled=config['checkpointing']['enabled']
)

# Initialize data loader
loader = DataLoader(spark, config['data_loading'])

# Initialize boundary sampler
sampler = BoundarySampler(config['boundary_sampling'])

# Initialize analyzer (optional)
analyzer = DataFrameAnalyzer(spark)

print("✅ Components initialized")
print(f"  Checkpointing: {'ENABLED' if config['checkpointing']['enabled'] else 'DISABLED'}")
print(f"  Boundary sampling: {'ENABLED' if config['boundary_sampling']['enabled'] else 'DISABLED'}")

✅ Components initialized
  Checkpointing: ENABLED
  Boundary sampling: ENABLED

## 5. Load Raw Data

In [8]:
%%spark
print("[STAGE 1/5] Loading raw data from S3...")

df_raw = loader.load_raw_data(
    s3_path=config['data_loading']['s3_base_path'],
    start_date=config['data_loading']['start_date'],
    end_date=config['data_loading']['end_date'],
    sample_fraction=config['data_loading'].get('sample_fraction')
)

raw_count = df_raw.count()
print(f"✅ Loaded {raw_count:,} raw queries")
print(f"   Columns: {len(df_raw.columns)}")

[STAGE 1/5] Loading raw data from S3...
✅ Loaded 306,813,447 raw queries
   Columns: 52

## 6. Optional Analysis - Raw Data

In [9]:
%%spark
if config['analysis']['enabled']:
    print("[ANALYSIS] Analyzing raw data...")
    stats_raw = analyzer.analyze_dataframe(df_raw, "Raw Data")
    analyzer.print_analysis_report(stats_raw)
else:
    print("[INFO] Analysis disabled, skipping raw data analysis")

[INFO] Analysis disabled, skipping raw data analysis

## 7. Apply Labeling

In [10]:
%%spark
print("\n[STAGE 2/5] Applying labeling...")

df_labeled = loader.apply_labeling(df_raw)

# Get distribution
labeled_count = df_labeled.count()
heavy_count = df_labeled.filter(F.col('is_heavy') == 1).count()
small_count = labeled_count - heavy_count

print(f"\n✅ Labeled {labeled_count:,} queries")
print(f"   Heavy: {heavy_count:,} ({heavy_count/labeled_count*100:.2f}%)")
print(f"   Small: {small_count:,} ({small_count/labeled_count*100:.2f}%)")
print(f"   Initial ratio: {small_count/heavy_count:.2f}:1 (Small:Heavy)")


[STAGE 2/5] Applying labeling...

✅ Labeled 306,813,447 queries
   Heavy: 3,462,933 (1.13%)
   Small: 303,350,514 (98.87%)
   Initial ratio: 87.60:1 (Small:Heavy)

## 8. Optional Analysis - After Labeling

In [11]:
%%spark
if config['analysis']['enabled']:
    print("[ANALYSIS] Analyzing labeled data...")
    stats_labeled = analyzer.analyze_dataframe(df_labeled, "After Labeling")
    analyzer.print_analysis_report(stats_labeled)
    
    comparison = analyzer.compare_dataframes(df_raw, df_labeled, "Labeling")
    analyzer.print_comparison_report(comparison)

## 9. Apply Filters and Checkpoint

In [12]:
%%spark
print("\n[STAGE 3/5] Applying data quality filters...")

df_filtered = loader.apply_filters(df_labeled)
# df_filtered = checkpoint_mgr.checkpoint(df_filtered, "01_filtered_data")

filtered_count = df_filtered.count()
removed_count = labeled_count - filtered_count

print(f"\n✅ Filtered dataset: {filtered_count:,} queries")
print(f"   Removed: {removed_count:,} queries ({removed_count/labeled_count*100:.1f}%)")

# Check class distribution after filtering
filtered_heavy = df_filtered.filter(F.col('is_heavy') == 1).count()
filtered_small = filtered_count - filtered_heavy
print(f"   Heavy: {filtered_heavy:,} ({filtered_heavy/filtered_count*100:.2f}%)")
print(f"   Small: {filtered_small:,} ({filtered_small/filtered_count*100:.2f}%)")


[STAGE 3/5] Applying data quality filters...

✅ Filtered dataset: 88,731,774 queries
   Removed: 218,081,673 queries (71.1%)
   Heavy: 2,349,368 (2.65%)
   Small: 86,382,406 (97.35%)
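
The "~36:1" figure used throughout this notebook is the small:heavy ratio of the filtered dataset, not of the raw labels. The short check below recomputes the ratios from the counts printed above and shows that filtering removes proportionally more small queries than heavy ones. The helper name is hypothetical.

```python
def small_to_heavy_ratio(total, heavy):
    """Small:heavy ratio given the total row count and the heavy row count."""
    if heavy == 0:
        raise ValueError("no heavy queries; ratio is undefined")
    return (total - heavy) / heavy


# Counts copied from the labeling and filtering outputs above.
labeled_total, labeled_heavy = 306_813_447, 3_462_933
filtered_total, filtered_heavy = 88_731_774, 2_349_368

before = small_to_heavy_ratio(labeled_total, labeled_heavy)
after = small_to_heavy_ratio(filtered_total, filtered_heavy)
heavy_kept = filtered_heavy / labeled_heavy
small_kept = (filtered_total - filtered_heavy) / (labeled_total - labeled_heavy)

print(f"Before filtering: {before:.2f}:1")
print(f"After filtering:  {after:.2f}:1")
print(f"Heavy rows kept:  {heavy_kept:.1%}")
print(f"Small rows kept:  {small_kept:.1%}")
```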

## 10. Optional Analysis - After Filtering

In [13]:
%%spark
if config['analysis']['enabled']:
    print("[ANALYSIS] Analyzing filtered data...")
    stats_filtered = analyzer.analyze_dataframe(df_filtered, "After Filtering")
    analyzer.print_analysis_report(stats_filtered)
    
    comparison = analyzer.compare_dataframes(df_labeled, df_filtered, "Filtering")
    analyzer.print_comparison_report(comparison)

## 11. Create Time-Based Splits (BEFORE Sampling)

Split data chronologically into train/val/test BEFORE sampling.

This ensures val and test maintain their original class distribution (~36:1 across the filtered dataset).

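In [None]:
%%spark
# Optional: daily record counts. Uses df_filtered because df_with_dates is only created in the split cell below.
daily_counts_df = (df_filtered.withColumn('queryDate', to_date(F.col('createTime')))
                   .groupBy('queryDate').count().orderBy('queryDate')
                   .withColumnRenamed('count', 'record_count'))
daily_counts_df.show(100)

In [None]:
%%spark
daily_counts_df.show(200)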

In [23]:
%%spark
print("\n[STAGE 4/6] Creating time-based splits...")

# Add date and hour columns for splitting
df_with_dates = df_filtered.withColumn('queryDate', to_date(F.col('createTime')))
df_with_dates = df_with_dates.withColumn('hour', get_hour(F.col('createTime')))

# Get date range from config
start_date = datetime.strptime(config['data_loading']['start_date'], '%Y-%m-%d')
end_date = datetime.strptime(config['data_loading']['end_date'], '%Y-%m-%d')

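# Calculate split boundaries by counting back from end_date.
# train_days is read for reference only: train spans everything from start_date up to val_start.
train_days_cfg = config['time_splits']['train_days']
val_days_cfg   = config['time_splits']['val_days']
test_days_cfg  = config['time_splits']['test_days']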

test_start = end_date - timedelta(days=test_days_cfg)
val_start  = test_start - timedelta(days=val_days_cfg)

print(f"  Date range: {start_date.date()} to {end_date.date()}")
print(f"Train: {start_date.date()} to {val_start.date()} ({(val_start - start_date).days} days)")
print(f"Val:   {val_start.date()} to {test_start.date()} ({val_days_cfg} days)")
print(f"Test:  {test_start.date()} to {end_date.date()} ({test_days_cfg} days)")
start_lit = F.to_date(F.lit(start_date.strftime('%Y-%m-%d')))
val_lit   = F.to_date(F.lit(val_start.strftime('%Y-%m-%d')))
test_lit  = F.to_date(F.lit(test_start.strftime('%Y-%m-%d')))
end_lit   = F.to_date(F.lit(end_date.strftime('%Y-%m-%d')))

# Train and val use half-open ranges [start, end); the test range also includes end_date itself.
train_df = df_with_dates.filter((F.col('queryDate') >= start_lit) & (F.col('queryDate') <  val_lit))
val_df   = df_with_dates.filter((F.col('queryDate') >= val_lit)   & (F.col('queryDate') <  test_lit))
test_df  = df_with_dates.filter((F.col('queryDate') >= test_lit)  & (F.col('queryDate') <= end_lit))

# Checkpoint splits
train_df = checkpoint_mgr.checkpoint(train_df, "01_train_split")
val_df = checkpoint_mgr.checkpoint(val_df, "01_val_split")
test_df = checkpoint_mgr.checkpoint(test_df, "01_test_split")

# Get counts
train_count = train_df.count()
val_count = val_df.count()
test_count = test_df.count()

print(f"\nSplits created:")
print(f"  Full: {filtered_count:,} queries")
print(f"  Train: {train_count:,} queries")
print(f"  Val:   {val_count:,} queries")
print(f"  Test:  {test_count:,} queries")
print(f"  Total: {train_count + val_count + test_count:,} queries")


[STAGE 4/6] Creating time-based splits...
  Date range: 2025-08-01 to 2025-10-01
Train: 2025-08-01 to 2025-09-17 (47 days)
Val:   2025-09-17 to 2025-09-24 (7 days)
Test:  2025-09-24 to 2025-10-01 (7 days)

Splits created:
  Full: 88,731,774 queries
  Train: 58,662,498 queries
  Val:   14,969,526 queries
  Test:  15,099,750 queries
  Total: 88,731,774 queries
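
The boundary arithmetic in the split cell can be reproduced without Spark. The sketch below recomputes the boundaries printed above and mirrors the three date filters in plain Python. The function names are hypothetical. The 7-day validation and test windows are taken from the printed output.

```python
from datetime import date, timedelta


def split_boundaries(end_date, val_days, test_days):
    """Return (val_start, test_start), counting back from end_date."""
    test_start = end_date - timedelta(days=test_days)
    val_start = test_start - timedelta(days=val_days)
    return val_start, test_start


def assign_split(query_date, start_date, val_start, test_start, end_date):
    """Mirror the notebook's filters: half-open train/val, end-inclusive test."""
    if start_date <= query_date < val_start:
        return "train"
    if val_start <= query_date < test_start:
        return "val"
    if test_start <= query_date <= end_date:
        return "test"
    return None  # outside the configured date range


start, end = date(2025, 8, 1), date(2025, 10, 1)
val_start, test_start = split_boundaries(end, val_days=7, test_days=7)
print(val_start, test_start, (val_start - start).days)

# Count how many calendar dates each split can contain.
counts = {}
for offset in range((end - start).days + 1):
    name = assign_split(start + timedelta(days=offset), start, val_start, test_start, end)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

Because the test filter is inclusive of `end_date`, the test window can cover eight calendar dates rather than seven if the loader returns rows dated on `end_date` itself. Whether it does depends on `DataLoader.load_raw_data`, which this notebook does not show.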

## 12. Report Original Distributions (Before Sampling)

Show class distributions for all three splits BEFORE sampling training data.

In [24]:
%%spark
print("\nOriginal class distributions (before sampling):")

# Calculate distributions for each split
train_heavy = train_df.filter(F.col('is_heavy') == 1).count()
train_small = train_count - train_heavy
train_ratio = train_small / train_heavy if train_heavy > 0 else 0

val_heavy = val_df.filter(F.col('is_heavy') == 1).count()
val_small = val_count - val_heavy
val_ratio = val_small / val_heavy if val_heavy > 0 else 0

test_heavy = test_df.filter(F.col('is_heavy') == 1).count()
test_small = test_count - test_heavy
test_ratio = test_small / test_heavy if test_heavy > 0 else 0

print(f"\nTrain split (BEFORE sampling):")
print(f"  Heavy: {train_heavy:,} ({train_heavy/train_count*100:.2f}%)")
print(f"  Small: {train_small:,} ({train_small/train_count*100:.2f}%)")
print(f"  Ratio: {train_ratio:.1f}:1 (Small:Heavy)")

print(f"\nValidation split (will remain unchanged):")
print(f"  Heavy: {val_heavy:,} ({val_heavy/val_count*100:.2f}%)")
print(f"  Small: {val_small:,} ({val_small/val_count*100:.2f}%)")
print(f"  Ratio: {val_ratio:.1f}:1 (Small:Heavy)")

print(f"\nTest split (will remain unchanged):")
print(f"  Heavy: {test_heavy:,} ({test_heavy/test_count*100:.2f}%)")
print(f"  Small: {test_small:,} ({test_small/test_count*100:.2f}%)")
print(f"  Ratio: {test_ratio:.1f}:1 (Small:Heavy)")

print("\n" + "="*70)
print("NEXT: Sample ONLY training data to 5:1")
print("      Val and test will keep ~36:1 for realistic evaluation")
print("="*70)


Original class distributions (before sampling):

Train split (BEFORE sampling):
  Heavy: 1,463,723 (2.50%)
  Small: 57,198,775 (97.50%)
  Ratio: 39.1:1 (Small:Heavy)

Validation split (will remain unchanged):
  Heavy: 304,049 (2.03%)
  Small: 14,665,477 (97.97%)
  Ratio: 48.2:1 (Small:Heavy)

Test split (will remain unchanged):
  Heavy: 581,596 (3.85%)
  Small: 14,518,154 (96.15%)
  Ratio: 25.0:1 (Small:Heavy)

NEXT: Sample ONLY training data to 5:1
      Val and test will keep ~36:1 for realistic evaluation

## 13. Apply Boundary Sampling (TRAINING DATA ONLY)

**CRITICAL**: Sample ONLY the training data to the 5:1 ratio. Val and test keep their original distributions (~36:1 across the filtered dataset; 48.2:1 and 25.0:1 respectively in this run).

In [25]:
%%spark
print("\n[STAGE 4/6] Applying boundary-focused sampling to TRAINING data only...")
print(f"  Target ratio: {config['boundary_sampling']['balance_ratio']}:1 (Small:Heavy)")
print(f"  Algorithm: Continuous distance-based (POC v2.1.0)")
print(f"  Val/Test: Keep at original ~36:1 distribution")

# Sample ONLY training data
train_sampled = sampler.sample(train_df, label_column='is_heavy')
train_sampled = checkpoint_mgr.checkpoint(train_sampled, "01_train_sampled")

# Training statistics after sampling
train_sampled_count = train_sampled.count()
train_sampled_heavy = train_sampled.filter(F.col('is_heavy') == 1).count()
train_sampled_small = train_sampled_count - train_sampled_heavy
train_sampled_ratio = train_sampled_small / train_sampled_heavy if train_sampled_heavy > 0 else 0

print(f"\nTraining data after sampling:")
print(f"  Samples: {train_sampled_count:,} queries")
print(f"  Heavy: {train_sampled_heavy:,} ({train_sampled_heavy/train_sampled_count*100:.2f}%)")
print(f"  Small: {train_sampled_small:,} ({train_sampled_small/train_sampled_count*100:.2f}%)")
print(f"  Ratio: {train_sampled_ratio:.2f}:1 (Small:Heavy)")
print(f"  Sampling efficiency: {train_sampled_count/train_count*100:.1f}% of original training data retained")

print(f"\nValidation/Test data (UNCHANGED):")
print(f"  Val:  {val_count:,} queries at ~{val_ratio:.1f}:1 ratio")
print(f"  Test: {test_count:,} queries at ~{test_ratio:.1f}:1 ratio")

print("\n" + "="*70)
print("RESULT: Training uses 5:1 for balanced learning")
print("        Val/Test use ~36:1 for realistic evaluation")
print("="*70)


[STAGE 4/6] Applying boundary-focused sampling to TRAINING data only...
  Target ratio: 5.0:1 (Small:Heavy)
  Algorithm: Continuous distance-based (POC v2.1.0)
  Val/Test: Keep at original ~36:1 distribution

Training data after sampling:
  Samples: 8,782,474 queries
  Heavy: 1,463,723 (16.67%)
  Small: 7,318,751 (83.33%)
  Ratio: 5.00:1 (Small:Heavy)
  Sampling efficiency: 15.0% of original training data retained

Validation/Test data (UNCHANGED):
  Val:  14,969,526 queries at ~48.2:1 ratio
  Test: 15,099,750 queries at ~25.0:1 ratio

RESULT: Training uses 5:1 for balanced learning
        Val/Test use ~36:1 for realistic evaluation
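
The internals of `BoundarySampler` are not shown in this notebook. To illustrate the general idea of distance-weighted downsampling, the sketch below keeps small-class rows with a probability that is higher near the decision boundary and always keeps rows within a guarantee threshold. Everything here is an assumption made for illustration: the linear weight function, the parameter values, the treatment of `distance` as a normalized distance to the labeling threshold, and the function names. The parameter names only echo the config keys printed in the summary report; this is not the library's algorithm.

```python
import random


def boundary_weight(distance, max_boost=3.0, min_multiplier=0.5):
    """Weight decaying linearly from max_boost (distance 0) to min_multiplier (distance >= 1)."""
    d = min(max(distance, 0.0), 1.0)
    return max_boost - (max_boost - min_multiplier) * d


def sample_small_indices(distances, n_target, guarantee_close_threshold=0.05, seed=0):
    """Pick roughly n_target small-class rows, favoring those near the boundary."""
    rng = random.Random(seed)
    # Rows very close to the boundary are always kept.
    kept = [i for i, d in enumerate(distances) if d <= guarantee_close_threshold]
    rest = [i for i, d in enumerate(distances) if d > guarantee_close_threshold]
    remaining = max(n_target - len(kept), 0)
    weights = [boundary_weight(distances[i]) for i in rest]
    total_weight = sum(weights)
    for i, w in zip(rest, weights):
        # Keep probability proportional to weight, scaled so the expected
        # number of additionally kept rows is about `remaining`.
        keep_prob = min(1.0, remaining * w / total_weight) if total_weight else 0.0
        if rng.random() < keep_prob:
            kept.append(i)
    return kept


# Synthetic example: 20,000 small rows, 400 heavy rows, target 5:1.
data_rng = random.Random(42)
distances = [data_rng.random() for _ in range(20_000)]
n_heavy = 400
kept = sample_small_indices(distances, n_target=5 * n_heavy)
near = sum(1 for i in kept if distances[i] < 0.3)
far = sum(1 for i in kept if distances[i] > 0.7)
print(len(kept), near, far)
```

All heavy rows would be kept as-is, so only the small class is thinned. The resulting count is close to, but not exactly, the target because each row is drawn independently.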

## 14. Get Sampling Statistics (Training Data)

In [26]:
%%spark
print("\n[INFO] Boundary sampling statistics (training data only):")

# Get detailed sampling stats for training data
sampling_stats = sampler.get_sampling_stats(train_df)

print("\nDistance Distribution:")
for stat, value in sampling_stats['distance_distribution'].items():
    print(f"  {stat:10s}: {value:.4f}")

print("\nSampling Efficiency by Distance:")
for range_name, count in sampling_stats['sampling_efficiency'].items():
    print(f"  {range_name:15s}: {count:,} queries")


[INFO] Boundary sampling statistics (training data only):

Distance Distribution:
  min       : 0.0000
  mean      : 0.9708
  median    : 0.9630
  p90       : 1.0000
  max       : 535.3801

Sampling Efficiency by Distance:
  very_close     : 363,855 queries
  close          : 582,718 queries
  moderate       : 55,654,933 queries
  far            : 1,837,983 queries
  very_far       : 223,009 queries

In [27]:
%%spark
print("\n[INFO] Boundary sampling statistics (Sampled training data only):")

# Get detailed sampling stats for training data
sampling_stats = sampler.get_sampling_stats(train_sampled)

print("\nDistance Distribution:")
for stat, value in sampling_stats['distance_distribution'].items():
    print(f"  {stat:10s}: {value:.4f}")

print("\nSampling Efficiency by Distance:")
for range_name, count in sampling_stats['sampling_efficiency'].items():
    print(f"  {range_name:15s}: {count:,} queries")


[INFO] Boundary sampling statistics (Sampled training data only):

Distance Distribution:
  min       : 0.0000
  mean      : 1.1761
  median    : 0.9388
  p90       : 1.0000
  max       : 535.3801

Sampling Efficiency by Distance:
  very_close     : 264,487 queries
  close          : 384,276 queries
  moderate       : 7,628,809 queries
  far            : 281,893 queries
  very_far       : 223,009 queries
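
Comparing the two bucket tables shows where the sampler spends its budget. The sketch below computes the per-bucket retention rate from the counts printed above. The helper name is hypothetical.

```python
# Bucket counts copied from the two outputs above (before and after sampling).
before = {
    "very_close": 363_855,
    "close": 582_718,
    "moderate": 55_654_933,
    "far": 1_837_983,
    "very_far": 223_009,
}
after = {
    "very_close": 264_487,
    "close": 384_276,
    "moderate": 7_628_809,
    "far": 281_893,
    "very_far": 223_009,
}


def retention_by_bucket(before_counts, after_counts):
    """Fraction of rows in each distance bucket that survived sampling."""
    return {name: after_counts[name] / before_counts[name] for name in before_counts}


retention = retention_by_bucket(before, after)
for name, rate in retention.items():
    print(f"{name:12s}: {rate:6.1%}")
```

The bucket counts sum to the full split sizes (58,662,498 before and 8,782,474 after), so they include heavy queries, all of which are retained. The rates therefore overstate the retention of small queries in any bucket that contains heavy ones.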

## 15. Save Processed Data to S3

Save three separate datasets:
- train_sampled: 5:1 ratio for training
- val_original: original distribution for validation (48.2:1 in this run)
- test_original: original distribution for testing (25.0:1 in this run)

In [28]:
%%spark
print("\n[STAGE 5/6] Saving processed data to S3...")

# Define output paths
date_range = f"{config['data_loading']['start_date']}_to_{config['data_loading']['end_date']}"
base_output_path = f"{config['data_loading']['processed_output_path']}/{date_range}"

train_sampled_path = f"{base_output_path}/train_sampled"
val_original_path = f"{base_output_path}/val_original"
test_original_path = f"{base_output_path}/test_original"

print(f"  Base output path: {base_output_path}")

# Select relevant columns for downstream processing
output_columns = [
    'queryId', 'query', 'createTime', 'endTime',
    'user', 'catalog', 'schema', 'clientInfo', 'queryDate', 'hour',
    'cpuTime', 'peakUserMemoryBytes', 'totalBytes',
    'queryType', 'errorName', 'sessionproperties',
    'cpu_time_seconds', 'memory_gb', 'is_heavy'
]

# Save train_sampled (5:1 ratio)
print(f"\n[1/3] Saving train_sampled...")
train_sampled.select(*output_columns) \
    .write \
    .mode("overwrite") \
    .partitionBy('is_heavy') \
    .parquet(train_sampled_path)
print(f"  Saved {train_sampled_count:,} queries to {train_sampled_path}")
print(f"  Distribution: {train_sampled_ratio:.1f}:1 (Small:Heavy)")

# Save val_original (~ 36:1 ratio)
print(f"\n[2/3] Saving val_original...")
val_df.select(*output_columns) \
    .write \
    .mode("overwrite") \
    .partitionBy('is_heavy') \
    .parquet(val_original_path)
print(f"  Saved {val_count:,} queries to {val_original_path}")
print(f"  Distribution: {val_ratio:.1f}:1 (Small:Heavy)")

# Save test_original (~36:1 ratio)
print(f"\n[3/3] Saving test_original...")
test_df.select(*output_columns) \
    .write \
    .mode("overwrite") \
    .partitionBy('is_heavy') \
    .parquet(test_original_path)
print(f"  Saved {test_count:,} queries to {test_original_path}")
print(f"  Distribution: {test_ratio:.1f}:1 (Small:Heavy)")

print(f"\nAll datasets saved to S3")
print(f"  Partitioned by: is_heavy")
print(f"  Format: Parquet")
print(f"\nIMPORTANT: Three separate datasets created:")
print(f"  1. train_sampled: Balanced 5:1 for training")
print(f"  2. val_original:  Natural ~36:1 for validation")
print(f"  3. test_original: Natural ~36:1 for testing")


[STAGE 5/6] Saving processed data to S3...
  Base output path: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01

[1/3] Saving train_sampled...
  Saved 8,782,474 queries to s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/train_sampled
  Distribution: 5.0:1 (Small:Heavy)

[2/3] Saving val_original...
  Saved 14,969,526 queries to s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/val_original
  Distribution: 48.2:1 (Small:Heavy)

[3/3] Saving test_original...
  Saved 15,099,750 queries to s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/test_original
  Distribution: 25.0:1 (Small:Heavy)

All datasets saved to S3
  Partitioned by: is_heavy
  Format: Parquet

IMPORTANT: Three separate datasets created:
  1. train_sampled: Balanced 5:1 for training
  2. val_original:  Natural ~36:1 for 
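
Downstream notebooks need the same three paths. A small helper, sketched below, rebuilds them from the config values using the naming scheme of the save cell. The function name is hypothetical, and the trailing-slash handling is an addition that the notebook itself does not perform.

```python
def build_output_paths(processed_output_path, start_date, end_date):
    """Return the three dataset paths for a given date range."""
    base = f"{processed_output_path.rstrip('/')}/{start_date}_to_{end_date}"
    return {
        name: f"{base}/{name}"
        for name in ("train_sampled", "val_original", "test_original")
    }


paths = build_output_paths(
    "s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data",
    "2025-08-01",
    "2025-10-01",
)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Since each dataset is written with `partitionBy('is_heavy')`, Spark stores the rows under `is_heavy=<value>` subdirectories and restores the column from the directory names when the path is read back with `spark.read.parquet`.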

## 16. Generate Summary Report

In [29]:
%%spark
summary_report = f"""
{'='*80}
DATA LOADING PIPELINE SUMMARY
{'='*80}

Execution Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Date Range: {config['data_loading']['start_date']} to {config['data_loading']['end_date']}

PIPELINE STAGES:
1. Raw Data Loaded:     {raw_count:,} queries
2. After Labeling:      {labeled_count:,} queries
3. After Filtering:     {filtered_count:,} queries (removed {labeled_count - filtered_count:,})
4. After Splitting:     Train: {train_count:,}, Val: {val_count:,}, Test: {test_count:,}
5. After Sampling:      Train: {train_sampled_count:,} (Val/Test UNCHANGED)
6. Saved to S3:         3 separate datasets

CRITICAL: CLASS DISTRIBUTION STRATEGY
{'='*80}
This pipeline uses SPLIT-THEN-SAMPLE to provide realistic evaluation:

TRAINING DATA (Sampled to 5:1):
- Samples:  {train_sampled_count:,} queries
- Heavy:    {train_sampled_heavy:,} ({train_sampled_heavy/train_sampled_count*100:.2f}%)
- Small:    {train_sampled_small:,} ({train_sampled_small/train_sampled_count*100:.2f}%)
- Ratio:    {train_sampled_ratio:.2f}:1 (Small:Heavy)
- Purpose:  Balanced learning with boundary sampling

VALIDATION DATA (Original Distribution ~36:1):
- Samples:  {val_count:,} queries
- Heavy:    {val_heavy:,} ({val_heavy/val_count*100:.2f}%)
- Small:    {val_small:,} ({val_small/val_count*100:.2f}%)
- Ratio:    {val_ratio:.1f}:1 (Small:Heavy)
- Purpose:  Realistic performance evaluation

TEST DATA (Original Distribution ~36:1):
- Samples:  {test_count:,} queries
- Heavy:    {test_heavy:,} ({test_heavy/test_count*100:.2f}%)
- Small:    {test_small:,} ({test_small/test_count*100:.2f}%)
- Ratio:    {test_ratio:.1f}:1 (Small:Heavy)
- Purpose:  Final realistic performance evaluation

RATIONALE:
- Training uses 5:1 sampled data for effective learning from imbalanced classes
- Val/Test use ~36:1 original data to reflect production performance
- This prevents overoptimistic metrics and ensures realistic evaluation
- Model learns from balanced examples but evaluated on real distribution

SAMPLING CONFIGURATION (Training Only):
- Algorithm: Continuous distance-based (POC v2.1.0)
- Target ratio: {config['boundary_sampling']['balance_ratio']}:1
- Max boost: {config['boundary_sampling']['boundary_sampling_max_boost']}x
- Min multiplier: {config['boundary_sampling']['boundary_sampling_min_multiplier']}x
- Guarantee threshold: {config['boundary_sampling']['guarantee_close_threshold']}
- Safety adjustment: {'ENABLED' if config['boundary_sampling']['enable_safety_adjustment'] else 'DISABLED'}

OUTPUT PATHS:
- train_sampled: {train_sampled_path}
- val_original:  {val_original_path}
- test_original: {test_original_path}
- Format: Parquet (partitioned by is_heavy)

CHECKPOINTS:
{chr(10).join([f'  - {name}' for name in checkpoint_mgr.list_checkpoints()])}

STATUS: PIPELINE COMPLETED SUCCESSFULLY

NEXT STEPS:
1. Run notebook 02_zero_cost_analysis.ipynb (optional)
2. Run notebook 03_feature_engineering.ipynb (will load from 3 separate paths)
3. Expect validation/test metrics to reflect realistic 36:1 distribution:
   - Precision will be lower (~20-25% vs ~70% on sampled data)
   - Recall should remain high (>95% with proper threshold)
   - This is EXPECTED and CORRECT for production deployment

{'='*80}
"""

print(summary_report)


DATA LOADING PIPELINE SUMMARY

Execution Date: 2025-10-16 22:02:14
Date Range: 2025-08-01 to 2025-10-01

PIPELINE STAGES:
1. Raw Data Loaded:     306,813,447 queries
2. After Labeling:      306,813,447 queries
3. After Filtering:     88,731,774 queries (removed 218,081,673)
4. After Splitting:     Train: 58,662,498, Val: 14,969,526, Test: 15,099,750
5. After Sampling:      Train: 8,782,474 (Val/Test UNCHANGED)
6. Saved to S3:         3 separate datasets

CRITICAL: CLASS DISTRIBUTION STRATEGY
This pipeline uses SPLIT-THEN-SAMPLE to provide realistic evaluation:

TRAINING DATA (Sampled to 5:1):
- Samples:  8,782,474 queries
- Heavy:    1,463,723 (16.67%)
- Small:    7,318,751 (83.33%)
- Ratio:    5.00:1 (Small:Heavy)
- Purpose:  Balanced learning with boundary sampling

VALIDATION DATA (Original Distribution ~36:1):
- Samples:  14,969,526 queries
- Heavy:    304,049 (2.03%)
- Small:    14,665,477 (97.97%)
- Ratio:    48.2:1 (Small:Heavy)
- Purpose:  Realistic performance evaluation

TES

## 17. Cleanup

In [30]:
%%spark
print("\n[CLEANUP] Releasing resources...")

# Unpersist DataFrames to free memory
for df_name, df in [('df_raw', df_raw), ('df_labeled', df_labeled), 
                      ('df_filtered', df_filtered), ('train_df', train_df),
                      ('val_df', val_df), ('test_df', test_df),
                      ('train_sampled', train_sampled)]:
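    try:
        df.unpersist()
        print(f"  Unpersisted {df_name}")
    except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
        print(f"  Could not unpersist {df_name}: {exc}")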

# Checkpoints remain in S3 for debugging (not deleted)
print(f"\nCleanup completed")
print(f"   Checkpoints preserved in S3: {config['checkpointing']['s3_path']}")
print(f"   Use checkpoint_mgr.cleanup(delete_from_s3=True) to delete checkpoints")

print("\n" + "="*80)
print("DATA LOADING COMPLETE!")
print("="*80)
print("\nREMINDER: Three separate datasets created with different distributions:")
print(f"  - Training: {train_sampled_ratio:.1f}:1 (sampled for balanced learning)")
print(f"  - Val/Test: ~{val_ratio:.1f}:1 (original for realistic evaluation)")
print("="*80)


[CLEANUP] Releasing resources...
  Unpersisted df_raw
  Unpersisted df_labeled
  Unpersisted df_filtered
  Unpersisted train_df
  Unpersisted val_df
  Unpersisted test_df
  Unpersisted train_sampled

Cleanup completed
   Checkpoints preserved in S3: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/checkpoints
   Use checkpoint_mgr.cleanup(delete_from_s3=True) to delete checkpoints

DATA LOADING COMPLETE!

REMINDER: Three separate datasets created with different distributions:
  - Training: 5.0:1 (sampled for balanced learning)
  - Val/Test: ~48.2:1 (original for realistic evaluation)