# Notebook 03: Feature Engineering

**Purpose**: Extract 78 base + 17 historical + TF-IDF features (vocabulary size set in config; 250 in this run, 345 total) with train-serve parity validation

**Pipeline**:
1. Load pre-split datasets from notebook 01 (train_sampled, val_original, test_original)
2. Extract 78 base features (production `FeatureExtractor`)
3. Extract 17 historical features from training data statistics
4. Build TF-IDF vocabulary (training data ONLY - prevent data leakage)
5. Extract TF-IDF features for all splits
6. Validate feature parity (training vs inference)
7. Save feature datasets + TF-IDF vectorizer to S3

**Key Features**:
- Production FeatureExtractor integration
- Historical feature computation with cold-start handling
- No data leakage (TF-IDF trained on training split only)
- Automated parity validation (<0.5% mismatch tolerance)
- S3 output for model training

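**IMPORTANT**: Training data is downsampled to a 5:1 small-to-heavy ratio; val/test data keep the original distribution (nominally ~36:1; the val split in this run measured 48.2:1).
This keeps evaluation metrics realistic, because they reflect production conditions.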
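
To illustrate why the class ratio matters for evaluation, the sketch below computes precision for a classifier with a fixed recall and false-positive rate at two small-to-heavy ratios. The function name `precision_at_ratio` and the operating point (80% recall, 5% false-positive rate) are illustrative assumptions, not measurements from this pipeline.

```python
def precision_at_ratio(recall, false_positive_rate, negatives_per_positive):
    """Precision when there are `negatives_per_positive` negatives per positive."""
    true_positives = recall  # per single positive example
    false_positives = false_positive_rate * negatives_per_positive
    return true_positives / (true_positives + false_positives)


# Assumed operating point, for illustration only.
recall, fpr = 0.80, 0.05

for ratio in (5, 36):
    precision = precision_at_ratio(recall, fpr, ratio)
    print(f"{ratio}:1 -> precision {precision:.3f}")
```

With the classifier held fixed, precision drops as the share of negatives grows, which is why val/test metrics on the original distribution are expected to be lower than on the sampled training data.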

**Prerequisites**:
- Notebook 00 completed (code package uploaded to S3)
- Notebook 01 completed (processed data available at 3 separate paths)

**Duration**: ~45-60 minutes

## 1. Spark Configuration

Copy the Spark configuration from notebook 00 output and paste below:

In [1]:
%%configure -f
{
    "pyFiles": [
        "s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/code/query_predictor_latest.zip",
        "s3://uipds-108043591022/dataintelligence-dev/di-airflow-prod/dags/common/utils/ParseArgs.py"
    ],
    "driverMemory": "16G",
    "driverCores": 4,
    "executorMemory": "20G",
    "executorCores": 5,
    "conf": {
        "spark.driver.maxResultSize": "8G",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "20"
    }
}




## 2. Import Dependencies and Validate Environment

In [2]:
import sys
import yaml
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, FloatType

# Import production modules
from query_predictor.core.featurizer.feature_extractor import FeatureExtractor
from query_predictor.training.spark_ml_tfidf_pipeline import SparkMLTfidfPipeline
from query_predictor.training.parity_validator import ParityValidator

print(f"Python version: {sys.version}")
print(f"PySpark version: {spark.version}")
print("All imports successful")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
3204,application_1758752217644_211576,pyspark,idle,Link,Link,pmannem,✔


SparkSession available as 'spark'.

Python version: 3.11.13 (main, Jul 30 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
PySpark version: 3.5.4-amzn-0
All imports successful

## 3. Load Configuration

In [3]:
import boto3
from query_predictor.training.checkpoint_manager import CheckpointManager


# Download training configuration from S3
s3_client = boto3.client('s3')
s3_bucket = 'uip-datalake-bucket-prod'
s3_prefix = 'sf_trino/trino_query_predictor'
config_s3_key = f"{s3_prefix}/config/training_config_latest.yaml"
config_path = '/tmp/training_config.yaml'

print(f"Downloading config from S3: s3://{s3_bucket}/{config_s3_key}")
s3_client.download_file(s3_bucket, config_s3_key, config_path)
print(f"✅ Config downloaded to: {config_path}")

# Load training configuration
with open(config_path) as f:
    config = yaml.safe_load(f)

# Initialize checkpoint manager
checkpoint_mgr = CheckpointManager(
    spark,
    s3_checkpoint_path=config['checkpointing']['s3_path'],
    enabled=config['checkpointing']['enabled']
)


print("✅ Configuration loaded")
print(f"\n📋 Feature Configuration:")
print(f"  Base features: {config['features']['base_feature_count']}")
print(f"  Historical features: {config['features']['historical_feature_count']}")
print(f"  TF-IDF vocab size: {config['features']['tfidf_vocab_size']}")
print(f"  Total features: {config['features']['total_features']}")
print(f"\n📅 Time Splits:")
print(f"  Train: {config['time_splits']['train_days']} days")
print(f"  Val: {config['time_splits']['val_days']} days")
print(f"  Test: {config['time_splits']['test_days']} days")
print(f"  Checkpointing: {'ENABLED' if config['checkpointing']['enabled'] else 'DISABLED'}")

# OPTIONAL: Override config parameters after loading
# Example: Change date range to use different processed data
# config['data_loading']['start_date'] = '2025-09-01'
# config['data_loading']['end_date'] = '2025-10-01'
# Example: Change time splits for different train/val/test ratio
# config['time_splits']['train_days'] = 20  # Shorter training period
# config['time_splits']['val_days'] = 5
# config['time_splits']['test_days'] = 5
# Example: Change TF-IDF vocabulary size
# config['features']['tfidf_vocab_size'] = 500  # Smaller vocabulary
# config['features']['total_features'] = 78 + 17 + 500  # Update total
# Example: Adjust parity validation tolerance
# config['validation']['parity_tolerance'] = 1e-5  # Stricter validation
# config['validation']['parity_samples'] = 50  # Fewer samples for faster validation


Downloading config from S3: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/config/training_config_latest.yaml
✅ Config downloaded to: /tmp/training_config.yaml
✅ Configuration loaded

📋 Feature Configuration:
  Base features: 78
  Historical features: 17
  TF-IDF vocab size: 250
  Total features: 345

📅 Time Splits:
  Train: 30 days
  Val: 7 days
  Test: 7 days
  Checkpointing: ENABLED
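
The optional overrides in the cell above change `tfidf_vocab_size` and `total_features` by hand, so the two can drift apart. A minimal sketch of a consistency check follows; the helper name `check_feature_counts` is hypothetical, while the config keys and values are the ones printed above.

```python
def check_feature_counts(features_cfg):
    """Return the summed feature count, raising if total_features disagrees."""
    expected = (
        features_cfg["base_feature_count"]
        + features_cfg["historical_feature_count"]
        + features_cfg["tfidf_vocab_size"]
    )
    if features_cfg["total_features"] != expected:
        raise ValueError(
            f"total_features={features_cfg['total_features']} "
            f"but components sum to {expected}"
        )
    return expected


# Values as printed by the configuration cell in this run.
features_cfg = {
    "base_feature_count": 78,
    "historical_feature_count": 17,
    "tfidf_vocab_size": 250,
    "total_features": 345,
}
print(check_feature_counts(features_cfg))
```

Running such a check right after any override would surface a stale `total_features` before the dimension assertions later in the notebook.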

## 4. Load Pre-Split Data from Notebook 01

Load three separate datasets created by notebook 01:
- train_sampled: 5:1 ratio for training
- val_original: ~36:1 ratio for validation  
- test_original: ~36:1 ratio for testing

In [4]:
# Load pre-split datasets from notebook 01
processed_path = config['data_loading']['processed_output_path']
date_range = f"{config['data_loading']['start_date']}_to_{config['data_loading']['end_date']}"
base_path = f"{processed_path}/{date_range}"

train_path = f"{base_path}/train_sampled"  # 5:1 sampled for training
val_path = f"{base_path}/val_original"      # ~36:1 original distribution
test_path = f"{base_path}/test_original"    # ~36:1 original distribution

print(f"Loading pre-split datasets from notebook 01...")
print(f"  Base path: {base_path}")
print(f"\n  Train (sampled):  {train_path}")
print(f"  Val (original):   {val_path}")
print(f"  Test (original):  {test_path}")

# Load splits
train_df = spark.read.parquet(train_path)
val_df = spark.read.parquet(val_path)
test_df = spark.read.parquet(test_path)

# Get counts
train_count = train_df.count()
val_count = val_df.count()
test_count = test_df.count()

print(f"\nDatasets loaded:")
print(f"  Train: {train_count:,} queries")
print(f"  Val:   {val_count:,} queries")
print(f"  Test:  {test_count:,} queries")

# Show class distributions
print(f"\nClass distributions:")
print("Train (sampled):")
train_df.groupBy('is_heavy').count().orderBy('is_heavy').show()

print("Val (original):")
val_df.groupBy('is_heavy').count().orderBy('is_heavy').show()

print("Test (original):")
test_df.groupBy('is_heavy').count().orderBy('is_heavy').show()

# Calculate ratios for reporting
train_heavy = train_df.filter(F.col('is_heavy') == 1).count()
train_ratio = (train_count - train_heavy) / train_heavy if train_heavy > 0 else 0

val_heavy = val_df.filter(F.col('is_heavy') == 1).count()
val_ratio = (val_count - val_heavy) / val_heavy if val_heavy > 0 else 0

test_heavy = test_df.filter(F.col('is_heavy') == 1).count()
test_ratio = (test_count - test_heavy) / test_heavy if test_heavy > 0 else 0

print(f"\nDistribution ratios (Small:Heavy):")
print(f"  Train: {train_ratio:.1f}:1 (sampled for balanced training)")
print(f"  Val:   {val_ratio:.1f}:1 (original for realistic evaluation)")
print(f"  Test:  {test_ratio:.1f}:1 (original for realistic evaluation)")

print("\nIMPORTANT: Val and test metrics will reflect production distribution (~36:1)")


Loading pre-split datasets from notebook 01...
  Base path: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01

  Train (sampled):  s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/train_sampled
  Val (original):   s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/val_original
  Test (original):  s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/processed_data/2025-08-01_to_2025-10-01/test_original

Datasets loaded:
  Train: 8,782,474 queries
  Val:   14,969,526 queries
  Test:  15,099,750 queries

Class distributions:
Train (sampled):
+--------+-------+
|is_heavy|  count|
+--------+-------+
|       0|7318751|
|       1|1463723|
+--------+-------+

Val (original):
+--------+--------+
|is_heavy|   count|
+--------+--------+
|       0|14665477|
|       1|  304049|
+--------+--------+

Test (original):
+--------+--------+
|is_h

## 5. Extract Base Features (78 features)

Use production `FeatureExtractor` to extract 78 base features.
This ensures train-serve parity.

In [9]:
# Initialize production feature extractor
base_extractor = FeatureExtractor(config)

print(f"FeatureExtractor initialized")
print(f"  Base features: {base_extractor.feature_count}")

# Create Spark UDF for distributed extraction
base_udf = base_extractor.create_spark_udf()

print("\nExtracting base features for all splits...")

# Extract for train
print("\n[1/3] Extracting train base features...")
train_base = train_df.withColumn(
    'base_features',
    base_udf(
        F.struct(
            F.col('query'),
            F.col('user'),
            F.col('catalog'),
            F.col('schema'),
            F.col('hour'),
            F.col('clientInfo')
        )
    )
)
# train_base = checkpoint_mgr.checkpoint(train_base, "03_train_base")

# Extract for val
print("[2/3] Extracting val base features...")
val_base = val_df.withColumn(
    'base_features',
    base_udf(
        F.struct(
            F.col('query'),
            F.col('user'),
            F.col('catalog'),
            F.col('schema'),
            F.col('hour'),
            F.col('clientInfo')
        )
    )
)
# val_base = checkpoint_mgr.checkpoint(val_base, "03_val_base")

# Extract for test
print("[3/3] Extracting test base features...")
test_base = test_df.withColumn(
    'base_features',
    base_udf(
        F.struct(
            F.col('query'),
            F.col('user'),
            F.col('catalog'),
            F.col('schema'),
            F.col('hour'),
            F.col('clientInfo')
        )
    )
)
# test_base = checkpoint_mgr.checkpoint(test_base, "03_test_base")

print("Base features extracted for all splits")


FeatureExtractor initialized
  Base features: 78

Extracting base features for all splits...

[1/3] Extracting train base features...
[2/3] Extracting val base features...
[3/3] Extracting test base features...
Base features extracted for all splits

## 6. Extract Historical Features (17 features)

Historical features summarize past behavior per user, catalog, and schema, and handle cold starts (entities not seen in training):
- User historical stats (6 features)
- Catalog historical stats (6 features)
- Schema historical stats (4 features)
- Cold-start indicator (1 feature)

Statistics are computed from training data only to prevent data leakage.
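
As a rough illustration of the idea, the sketch below computes per-user statistics from training rows only and falls back to the overall heavy rate, plus a cold-start flag, for users that never appeared in training. The helper names, the three-feature layout, and the fallback values are assumptions for illustration; the actual 17 features produced by `HistoricalFeatureExtractor` are not shown in this notebook.

```python
from collections import defaultdict


def compute_user_stats(train_rows):
    """Per-user query count and heavy rate, computed from training rows only."""
    counts = defaultdict(int)
    heavy = defaultdict(int)
    for row in train_rows:
        counts[row["user"]] += 1
        heavy[row["user"]] += row["is_heavy"]
    total = sum(counts.values())
    overall_heavy_rate = sum(heavy.values()) / total if total else 0.0
    per_user = {
        user: {"count": counts[user], "heavy_rate": heavy[user] / counts[user]}
        for user in counts
    }
    return per_user, overall_heavy_rate


def user_features(user, per_user, overall_heavy_rate):
    """Return [query_count, heavy_rate, is_cold_start] for one user."""
    stats = per_user.get(user)
    if stats is None:
        # Unseen user: fall back to the global rate and flag the cold start.
        return [0.0, overall_heavy_rate, 1.0]
    return [float(stats["count"]), stats["heavy_rate"], 0.0]


train_rows = [
    {"user": "alice", "is_heavy": 1},
    {"user": "alice", "is_heavy": 0},
    {"user": "bob", "is_heavy": 0},
]
per_user, overall = compute_user_stats(train_rows)
print(user_features("alice", per_user, overall))  # known user
print(user_features("carol", per_user, overall))  # cold start
```

Because the lookup table is built from the training split alone, val and test rows can only read statistics and never contribute to them.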

In [10]:
from query_predictor.training.historical_stats_computer import HistoricalStatsComputer
from query_predictor.core.featurizer.extractors.historical_extractor import HistoricalFeatureExtractor

print("Computing historical statistics from training data...")

# Initialize stats computer
stats_computer = HistoricalStatsComputer(version='1.0.0')

# Compute stats from training data
date_range = {
    'start': config['data_loading']['start_date'],
    'end': config['data_loading']['end_date']
}
stats_schema = stats_computer.compute(train_df, date_range)

print(f"\nHistorical stats computed:")
print(f"  Users: {len(stats_schema.users):,}")
print(f"  Catalogs: {len(stats_schema.catalogs):,}")
print(f"  Schemas: {len(stats_schema.schemas):,}")
print(f"  Overall heavy rate: {stats_schema.heavy_rate_overall:.2%}")

# Serialize to dict for HistoricalFeatureExtractor
stats_dict = stats_schema.to_dict()

# Initialize historical feature extractor with computed stats
historical_extractor = HistoricalFeatureExtractor(
    config={},
    historical_stats=stats_dict
)

print(f"\nHistoricalFeatureExtractor initialized")
print(f"  Historical features: {historical_extractor.feature_count}")

# Create Spark UDF for distributed extraction
historical_udf = historical_extractor.create_spark_udf()

print("\nExtracting historical features for all splits...")

# Extract for train
print("[1/3] Extracting train historical features...")
train_hist = train_base.withColumn(
    'historical_features',
    historical_udf(
        F.struct(
            F.col('user'),
            F.col('catalog'),
            F.col('schema')
        )
    )
)
# train_hist = checkpoint_mgr.checkpoint(train_hist, "03_train_hist")


# Extract for val
print("[2/3] Extracting val historical features...")
val_hist = val_base.withColumn(
    'historical_features',
    historical_udf(
        F.struct(
            F.col('user'),
            F.col('catalog'),
            F.col('schema')
        )
    )
)
# val_hist = checkpoint_mgr.checkpoint(val_hist, "03_val_hist")


# Extract for test
print("[3/3] Extracting test historical features...")
test_hist = test_base.withColumn(
    'historical_features',
    historical_udf(
        F.struct(
            F.col('user'),
            F.col('catalog'),
            F.col('schema')
        )
    )
)
# test_hist = checkpoint_mgr.checkpoint(test_hist, "03_test_hist")

print("Historical features extracted for all splits")


Computing historical statistics from training data...

Historical stats computed:
  Users: 16,181
  Catalogs: 36
  Schemas: 443
  Overall heavy rate: 16.67%

HistoricalFeatureExtractor initialized
  Historical features: 17

Extracting historical features for all splits...
[1/3] Extracting train historical features...
[2/3] Extracting val historical features...
[3/3] Extracting test historical features...
Historical features extracted for all splits

## 7. Build TF-IDF Vocabulary (TRAINING DATA ONLY)

**CRITICAL**: TF-IDF vocabulary is built ONLY on training queries to prevent data leakage.

**Approach**: Uses Spark ML (CountVectorizer + IDF) so that fitting runs fully distributed, with no collection of queries to the driver. This avoids driver memory limits and lets the fit scale with the cluster rather than with a single machine.

After fitting, the pipeline exposes the vocabulary and IDF weights so the same transform can be reproduced in sklearn-compatible inference.
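
The leakage-free pattern can be sketched in plain Python: document frequencies and IDF weights come from training queries only, and terms that appear only in val/test are ignored at transform time. This is a simplified stand-in for `SparkMLTfidfPipeline`, not its implementation; the tokenizer, the helper names, and the choice to rank the vocabulary by document frequency are assumptions. The IDF formula `log((N + 1) / (df + 1))` is the one documented for Spark ML's `IDF`.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z_][a-z0-9_]*")  # assumed tokenizer, for illustration


def tokenize(query):
    return TOKEN.findall(query.lower())


def fit_tfidf(train_queries, vocab_size, min_df=1, max_df=1.0):
    """Build vocabulary and IDF weights from training queries only.

    min_df is an absolute document count; max_df is a fraction of documents.
    """
    n_docs = len(train_queries)
    doc_freq = Counter()
    for query in train_queries:
        doc_freq.update(set(tokenize(query)))
    kept = [
        (term, df) for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    ]
    kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
    kept = kept[:vocab_size]
    vocab = {term: idx for idx, (term, _) in enumerate(kept)}
    idf = [math.log((n_docs + 1) / (df + 1)) for _, df in kept]
    return vocab, idf


def transform(query, vocab, idf):
    """TF-IDF vector for one query; out-of-vocabulary terms are dropped."""
    vector = [0.0] * len(vocab)
    for term, tf in Counter(tokenize(query)).items():
        idx = vocab.get(term)
        if idx is not None:
            vector[idx] = tf * idf[idx]
    return vector


train_queries = [
    "select a from t",
    "select b from t",
    "select sum(a) from u",
]
vocab, idf = fit_tfidf(train_queries, vocab_size=10, min_df=2, max_df=0.8)
val_vector = transform("select a, a from new_table", vocab, idf)
print(vocab, val_vector)
```

Here `select` and `from` are dropped by `max_df`, rare terms by `min_df`, and `new_table` never enters the vocabulary because it occurs only in the val query.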

In [11]:
# Initialize Spark ML TF-IDF pipeline
tfidf_config = {
    'tfidf_vocab_size': config['features']['tfidf_vocab_size'],
    'min_df': config['features']['min_df'],
    'max_df': config['features']['max_df']
}

tfidf_pipeline = SparkMLTfidfPipeline(tfidf_config)

print("\nCRITICAL: Building TF-IDF vocabulary on TRAINING DATA ONLY")
print("This prevents data leakage into val/test sets.\n")

print("Fitting TF-IDF vocabulary (distributed - no data collection to driver)...")
print("This may take 5-10 minutes...")

# Fit on DataFrame directly (NO COLLECT!)
tfidf_pipeline.fit_on_dataframe(train_hist, query_column='query')

print(f"\nTF-IDF vocabulary built successfully")
metadata = tfidf_pipeline.get_feature_metadata()
print(f"  Vocabulary size: {metadata['vocab_size']:,}")
print(f"  Min DF: {metadata['min_df']}")
print(f"  Max DF: {metadata['max_df']}")
print(f"  Method: {metadata['method']}")

# Display sample terms for inspection
print(f"\nSample vocabulary (first 20 terms):")
sample_terms = [name.replace('tfidf_', '') for name in metadata['feature_names'][:20]]
print(sample_terms)



CRITICAL: Building TF-IDF vocabulary on TRAINING DATA ONLY
This prevents data leakage into val/test sets.

Fitting TF-IDF vocabulary (distributed - no data collection to driver)...
This may take 5-10 minutes...

TF-IDF vocabulary built successfully
  Vocabulary size: 250
  Min DF: 100
  Max DF: 0.8
  Method: spark_ml_countvectorizer_optimized

Sample vocabulary (first 20 terms):
['string_literal', 'org_id', 'numeric', 'company_id', 'timestamp', 'cast', 'varchar', 'execute', 'double', 'using', 'statement1', 'sum', 'max', 'approx_percentile', 'cell', 'ts', '_time', 'coalesce', 'metric_name', 'false']

## 8. Extract TF-IDF Features

Transform queries to TF-IDF features using the fitted vocabulary.

In [12]:
# Create Spark UDF from fitted pipeline
tfidf_udf = tfidf_pipeline.create_spark_udf()

print("Extracting TF-IDF features for all splits...")
print("This may take 10-15 minutes...")

# Extract for train
print("\n[1/3] Extracting train TF-IDF features...")
train_tfidf = train_hist.withColumn('tfidf_features', tfidf_udf(F.col('query')))
# train_tfidf = checkpoint_mgr.checkpoint(train_tfidf, "03_train_tfidf")

# Extract for val
print("[2/3] Extracting val TF-IDF features...")
val_tfidf = val_hist.withColumn('tfidf_features', tfidf_udf(F.col('query')))
# val_tfidf = checkpoint_mgr.checkpoint(val_tfidf, "03_val_tfidf")

# Extract for test
print("[3/3] Extracting test TF-IDF features...")
test_tfidf = test_hist.withColumn('tfidf_features', tfidf_udf(F.col('query')))
# test_tfidf = checkpoint_mgr.checkpoint(test_tfidf, "03_test_tfidf")

print("\n✅ TF-IDF features extracted for all splits")


Extracting TF-IDF features for all splits...
This may take 10-15 minutes...

[1/3] Extracting train TF-IDF features...
[2/3] Extracting val TF-IDF features...
[3/3] Extracting test TF-IDF features...

✅ TF-IDF features extracted for all splits

## 9. Combine Features

Concatenate all features in a fixed order: base (78) + historical (17) + TF-IDF (vocabulary size from config; 250 in this run) = 345 features

In [13]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

@udf(returnType=ArrayType(FloatType()))
def combine_features(base, historical, tfidf):
    """Concatenate base + historical + tfidf features."""
    if base is None or historical is None or tfidf is None:
        return None
    return base + historical + tfidf

print("Combining features...")

# Combine for train
train_final = train_tfidf.withColumn(
    'features',
    combine_features(
        F.col('base_features'),
        F.col('historical_features'),
        F.col('tfidf_features')
    )
)

# Combine for val
val_final = val_tfidf.withColumn(
    'features',
    combine_features(
        F.col('base_features'),
        F.col('historical_features'),
        F.col('tfidf_features')
    )
)

# Combine for test
test_final = test_tfidf.withColumn(
    'features',
    combine_features(
        F.col('base_features'),
        F.col('historical_features'),
        F.col('tfidf_features')
    )
)

print("\nFeatures combined")
print(f"  Expected dimensions: {config['features']['total_features']}")
print(f"  Base: {config['features']['base_feature_count']}")
print(f"  Historical: {config['features']['historical_feature_count']}")
print(f"  TF-IDF: {config['features']['tfidf_vocab_size']}")


Combining features...

Features combined
  Expected dimensions: 345
  Base: 78
  Historical: 17
  TF-IDF: 250

In [14]:
# Sample and verify dimensions
print("Validating feature dimensions...")

sample_train = train_final.select('features', 'is_heavy').limit(1).collect()[0]
sample_val = val_final.select('features', 'is_heavy').limit(1).collect()[0]
sample_test = test_final.select('features', 'is_heavy').limit(1).collect()[0]

train_dim = len(sample_train['features'])
val_dim = len(sample_val['features'])
test_dim = len(sample_test['features'])
expected_dim = config['features']['total_features']

print(f"\n📊 Feature Dimensions:")
print(f"  Train: {train_dim}")
print(f"  Val:   {val_dim}")
print(f"  Test:  {test_dim}")
print(f"  Expected: {expected_dim}")

assert train_dim == expected_dim, f"Train dimension mismatch: {train_dim} != {expected_dim}"
assert val_dim == expected_dim, f"Val dimension mismatch: {val_dim} != {expected_dim}"
assert test_dim == expected_dim, f"Test dimension mismatch: {test_dim} != {expected_dim}"

print("\n✅ All dimensions validated")


Validating feature dimensions...

📊 Feature Dimensions:
  Train: 345
  Val:   345
  Test:  345
  Expected: 345

✅ All dimensions validated

## 10. Feature Parity Validation

In [21]:
print("\n" + "="*70)
print("FEATURE PARITY VALIDATION")
print("="*70)

# Initialize validator from config
validation_config = config.get('validation', {})
n_samples = validation_config.get('parity_samples', 100)
validator = ParityValidator(config=config)

# Collect sample of training features and queries
print(f"\nCollecting {n_samples} samples for validation...")
train_samples = train_final.select(
    'features', 'query', 'user', 'catalog', 'schema', 'hour', 'clientInfo', 'is_heavy'
).limit(n_samples).collect()

# Convert to numpy arrays
training_features = np.array([row['features'] for row in train_samples], dtype=np.float32)

# Prepare sample queries for inference
sample_queries = [
    {
        'query': row['query'],
        'user': row['user'],
        'catalog': row['catalog'],
        'schema': row['schema'],
        'hour': row['hour'],
        'clientInfo': row['clientInfo']
    }
    for row in train_samples
]

print(f"\nRunning parity validation...")
print(f"  Tolerance: {validator.tolerance}")
print(f"  Success threshold: <{validation_config.get('parity_success_threshold', 0.5)}% mismatch\n")

# Create inference featurizer with historical features enabled
# This matches the production inference architecture where:
# - FeatureExtractor returns 95 features (78 base + 17 historical)
# - TF-IDF returns vocab_size features
# - Total: 95 + vocab_size = expected total
inference_config = config.copy()
inference_config['enable_historical_features'] = True

inference_featurizer = FeatureExtractor(
    inference_config,
    historical_stats=stats_dict  # Pass historical stats computed from training data
)

print(f"Inference featurizer initialized with historical features enabled")
print(f"  Feature count: {inference_featurizer.feature_count} (78 base + 17 historical)")
print(f"  TF-IDF vocab size: {tfidf_pipeline.vocab_size}")
print(f"  Expected total: {inference_featurizer.feature_count + tfidf_pipeline.vocab_size}\n")

# Run validation
parity_result = validator.validate_parity(
    training_features=training_features,
    inference_featurizer=inference_featurizer,  # Now includes historical features
    tfidf_pipeline=tfidf_pipeline,
    sample_queries=sample_queries,
    n_samples=n_samples
)

# Generate and print report
report = validator.generate_report(parity_result)
print(report)

# if not parity_result['passed']:
#     print("\n⚠️  WARNING: Parity validation failed!")
#     print("This indicates train-serve skew. Review feature extraction logic.")
#     raise ValueError("Parity validation failed")

# print("✅ Parity validation passed - features are consistent!")



FEATURE PARITY VALIDATION

Collecting 100 samples for validation...

Running parity validation...
  Tolerance: 1e-06
  Success threshold: <0.5% mismatch

Inference featurizer initialized with historical features enabled
  Feature count: 95 (78 base + 17 historical)
  TF-IDF vocab size: 250
  Expected total: 345


FEATURE PARITY VALIDATION REPORT

Status: ❌ FAILED

Summary:
  Samples Tested:  100
  Mismatches:      100
  Mismatch Rate:   100.00%
  Max Difference:  1.000000000
  Tolerance:       0.000001000

Mismatch Details (first 10):

  Sample 0:
    Max diff: 1.000000000
    Num mismatches: 9
    Feature indices: [45, 46, 47, 54, 95, 99, 102, 104, 105]

  Sample 1:
    Max diff: 1.000000000
    Num mismatches: 8
    Feature indices: [45, 47, 54, 95, 99, 102, 104, 105]

  Sample 2:
    Max diff: 1.000000000
    Num mismatches: 8
    Feature indices: [45, 47, 54, 95, 99, 102, 104, 105]

  Sample 3:
    Max diff: 1.000000000
    Num mismatches: 8
    Feature indices: [45, 47, 54, 95, 9
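
The report above lists, per sample, the feature indices whose training and inference values differ by more than the tolerance. A minimal sketch of that comparison follows, together with a helper that maps an index to its feature block using the 78 + 17 + TF-IDF layout. The function names and the sample-level definition of a mismatch are assumptions inferred from the report; the internals of `ParityValidator` are not shown in this notebook.

```python
from collections import Counter


def compare_features(train_vec, infer_vec, tolerance=1e-6):
    """Indices where the two vectors differ by more than `tolerance`."""
    if len(train_vec) != len(infer_vec):
        raise ValueError("feature vectors have different lengths")
    return [
        i for i, (a, b) in enumerate(zip(train_vec, infer_vec))
        if abs(a - b) > tolerance
    ]


def block_of(index, n_base=78, n_hist=17):
    """Name the feature block an index falls in: base, historical, or tfidf."""
    if index < n_base:
        return "base"
    if index < n_base + n_hist:
        return "historical"
    return "tfidf"


def parity_report(train_rows, infer_rows, tolerance=1e-6, threshold_pct=0.5):
    """Sample-level mismatch rate; passes when the rate is below the threshold."""
    mismatched = [
        compare_features(t, s, tolerance) for t, s in zip(train_rows, infer_rows)
    ]
    n_bad = sum(1 for indices in mismatched if indices)
    rate = 100.0 * n_bad / len(train_rows)
    return {
        "mismatch_rate": rate,
        "passed": rate < threshold_pct,
        "mismatched_indices": mismatched,
    }


# Indices reported for Sample 0 in the output above.
sample_0 = [45, 46, 47, 54, 95, 99, 102, 104, 105]
blocks = Counter(block_of(i) for i in sample_0)
print(dict(blocks))

toy = parity_report([[0.0, 1.0], [2.0, 3.0]], [[0.0, 0.0], [2.0, 3.0]])
print(toy["mismatch_rate"], toy["passed"])
```

Grouping the reported indices this way places the Sample 0 mismatches in the base and TF-IDF blocks and none in the historical block (indices 78 to 94), which narrows where to look for the skew.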

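## 11. Save Feature Datasets to S3

**CRITICAL**: The saved features are only safe to train on if training features match inference features, which is what the parity validation in the previous section checks.

**Architecture Note**: 
- Training extracts base (78) and historical (17) features separately for flexibility
- Inference uses a unified FeatureExtractor with historical features enabled (95 total)
- Both paths must produce identical features: (78+17) + TF-IDF = total

Any mismatch means train-serve skew and silent performance degradation in production. In the run recorded here, the parity report shows FAILED with a 100% mismatch rate, and the failure guard in the validation cell is commented out, so the datasets below were written despite the failure.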

In [16]:
# Define output paths
features_path = config['features']['output_path']
date_range = f"{config['data_loading']['start_date']}_to_{config['data_loading']['end_date']}"
output_base = f"{features_path}/{date_range}"

train_path = f"{output_base}/train"
val_path = f"{output_base}/val"
test_path = f"{output_base}/test"

print(f"Saving feature datasets to S3...")
print(f"  Base path: {output_base}")

# Select relevant columns
output_columns = [
    'queryId',
    'query',
    'user',
    'catalog',
    'schema',
    'queryDate',
    'hour',
    'is_heavy',
    'cpu_time_seconds',
    'memory_gb',
    'features'  # Combined features array
]

# Save train
print("\n[1/3] Saving train dataset...")
train_final.select(output_columns).write.mode('overwrite').parquet(train_path)
print(f"  ✅ Train saved: {train_path}")

# Save val
print("[2/3] Saving val dataset...")
val_final.select(output_columns).write.mode('overwrite').parquet(val_path)
print(f"  ✅ Val saved: {val_path}")

# Save test
print("[3/3] Saving test dataset...")
test_final.select(output_columns).write.mode('overwrite').parquet(test_path)
print(f"  ✅ Test saved: {test_path}")

print("\n✅ All feature datasets saved to S3")


Saving feature datasets to S3...
  Base path: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01

[1/3] Saving train dataset...
  ✅ Train saved: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/train
[2/3] Saving val dataset...
  ✅ Val saved: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/val
[3/3] Saving test dataset...
  ✅ Test saved: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/test

✅ All feature datasets saved to S3

## 12. Save TF-IDF Vectorizer

In [17]:
import boto3
import tempfile

# Save TF-IDF pipeline locally first
print("Saving TF-IDF vectorizer...")

with tempfile.NamedTemporaryFile(mode='wb', delete=False, suffix='.pkl') as tmp:
    tfidf_pipeline.save(tmp.name)
    local_tfidf_path = tmp.name
    print(f"  Saved locally: {local_tfidf_path}")

# Upload to S3
s3_tfidf_key = f"{config['s3']['prefix']}/models/tfidf_vectorizer_{date_range}.pkl"
s3_client = boto3.client('s3')
s3_client.upload_file(local_tfidf_path, config['s3']['bucket'], s3_tfidf_key)

s3_tfidf_path = f"s3://{config['s3']['bucket']}/{s3_tfidf_key}"
print(f"  ✅ Uploaded: {s3_tfidf_path}")

# Also save as "latest"
latest_key = f"{config['s3']['prefix']}/models/tfidf_vectorizer_latest.pkl"
s3_client.copy_object(
    Bucket=config['s3']['bucket'],
    CopySource={'Bucket': config['s3']['bucket'], 'Key': s3_tfidf_key},
    Key=latest_key
)
print(f"  ✅ Updated latest: s3://{config['s3']['bucket']}/{latest_key}")

# Cleanup
import os
os.unlink(local_tfidf_path)


Saving TF-IDF vectorizer...
  Saved locally: /tmp/tmpl19lufid.pkl
  ✅ Uploaded: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/models/tfidf_vectorizer_2025-08-01_to_2025-10-01.pkl
  ✅ Updated latest: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/models/tfidf_vectorizer_latest.pkl

## 13. Save Feature Metadata and Summary

In [18]:
import json

# Prepare metadata
metadata = {
    'timestamp': datetime.now().isoformat(),
    'date_range': date_range,
    'time_splits': config['time_splits'],
    'features': {
        'base_features': config['features']['base_feature_count'],
        'historical_features': config['features']['historical_feature_count'],
        'tfidf_features': tfidf_pipeline.vocab_size,
        'total_features': config['features']['total_features']
    },
    'tfidf_config': tfidf_pipeline.get_feature_metadata(),
    'dataset_sizes': {
        'train': train_final.count(),
        'val': val_final.count(),
        'test': test_final.count()
    },
    'parity_validation': parity_result,
    's3_paths': {
        'train': train_path,
        'val': val_path,
        'test': test_path,
        'tfidf_vectorizer': s3_tfidf_path
    }
}

# Save locally
with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as tmp:
    json.dump(metadata, tmp, indent=2)
    local_metadata_path = tmp.name

# Upload to S3
metadata_key = f"{config['s3']['prefix']}/metadata/features_{date_range}.json"
s3_client.upload_file(local_metadata_path, config['s3']['bucket'], metadata_key)

print(f"✅ Metadata saved: s3://{config['s3']['bucket']}/{metadata_key}")

# Cleanup
os.unlink(local_metadata_path)


✅ Metadata saved: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/metadata/features_2025-08-01_to_2025-10-01.json

In [19]:
print("\n" + "="*70)
print("FEATURE ENGINEERING SUMMARY")
print("="*70)

print(f"\nFeature Breakdown:")
print(f"  Base features:       {config['features']['base_feature_count']}")
print(f"  Historical features: {config['features']['historical_feature_count']}")
print(f"  TF-IDF features:     {tfidf_pipeline.vocab_size}")
print(f"  {'-' * 40}")
print(f"  Total features:      {config['features']['total_features']}")

print(f"\nDataset Sizes:")
print(f"  Train: {metadata['dataset_sizes']['train']:,} queries")
print(f"  Val:   {metadata['dataset_sizes']['val']:,} queries")
print(f"  Test:  {metadata['dataset_sizes']['test']:,} queries")

print(f"\nClass Distributions:")
print(f"  Train: {train_ratio:.1f}:1 (sampled for balanced training)")
print(f"  Val:   {val_ratio:.1f}:1 (original - production distribution)")
print(f"  Test:  {test_ratio:.1f}:1 (original - production distribution)")

print(f"\nParity Validation:")
if parity_result['passed']:
    print(f"  Status: PASSED")
    print(f"  Mismatch rate: {parity_result['mismatch_rate']:.2f}%")
    print(f"  Max difference: {parity_result['max_difference']:.9f}")
else:
    print(f"  Status: FAILED")
    print(f"  Mismatch rate: {parity_result['mismatch_rate']:.2f}%")

print(f"\nS3 Outputs:")
print(f"  Train data:      {train_path}")
print(f"  Val data:        {val_path}")
print(f"  Test data:       {test_path}")
print(f"  TF-IDF pipeline: {s3_tfidf_path}")
print(f"  Metadata:        s3://{config['s3']['bucket']}/{metadata_key}")

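print("\n" + "="*70)
if parity_result['passed']:
    print("FEATURE ENGINEERING COMPLETE!")
else:
    # Do not report a clean finish when train-serve parity failed
    print("FEATURE ENGINEERING FINISHED WITH PARITY FAILURE - investigate before training")
print("="*70)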

print("\nNext Steps:")
print("1. Open notebook 04_model_training_distributed.ipynb")
print("2. Train XGBoost model with extracted features")
print("3. Export to ONNX for production inference")

print("\nIMPORTANT: Expect different metrics on val/test vs train:")
print("  - Val/Test use ~36:1 original distribution (realistic)")
print("  - Precision will be LOWER (~20-25% vs ~60-70% on balanced)")
print("  - This is EXPECTED and reflects production performance")
print("="*70)


FEATURE ENGINEERING SUMMARY

Feature Breakdown:
  Base features:       78
  Historical features: 17
  TF-IDF features:     250
  ----------------------------------------
  Total features:      345

Dataset Sizes:
  Train: 8,782,474 queries
  Val:   14,969,526 queries
  Test:  15,099,750 queries

Class Distributions:
  Train: 5.0:1 (sampled for balanced training)
  Val:   48.2:1 (original - production distribution)
  Test:  25.0:1 (original - production distribution)

Parity Validation:
  Status: FAILED
  Mismatch rate: 100.00%

S3 Outputs:
  Train data:      s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/train
  Val data:        s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/val
  Test data:       s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/features/2025-08-01_to_2025-10-01/test
  TF-IDF pipeline: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/models/tfidf_vec
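
The summary cell warns that precision drops from roughly 60-70% on the 5:1 training sample to roughly 20-25% at the ~36:1 original distribution. That range follows from simple arithmetic if the classifier's true-positive and false-positive rates are assumed to stay the same across splits. The sketch below is a back-of-the-envelope projection under that assumption, not the notebook's evaluation code. It also assumes each ratio is light:heavy with heavy queries as the positive class, and the function name `precision_at_ratio` is hypothetical.

```python
def precision_at_ratio(precision_ref: float, ratio_ref: float, ratio_new: float) -> float:
    """Project precision from one negative:positive ratio to another.

    Assumes TPR and FPR are unchanged between the two distributions.
    With `ratio` negatives per positive:
        precision = TPR / (TPR + ratio * FPR)
    so (1 - p) / p = ratio * FPR / TPR at the reference ratio.
    """
    fpr_over_tpr = (1.0 - precision_ref) / (precision_ref * ratio_ref)
    return 1.0 / (1.0 + ratio_new * fpr_over_tpr)


# Project the balanced-sample precision range onto the ~36:1 distribution
for p in (0.60, 0.65, 0.70):
    projected = precision_at_ratio(p, ratio_ref=5.0, ratio_new=36.0)
    print(f"{p:.0%} at 5:1 -> {projected:.1%} at 36:1")
```

Under these assumptions, 60-70% precision at 5:1 projects to about 17-24% at 36:1, in line with the warning. The same function can be applied to the per-split ratios this run actually reports (48.2:1 for val, 25.0:1 for test).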

## 14. Summary Report

In [20]:
import json
import os
import tempfile
from datetime import datetime

# Prepare metadata
metadata = {
    'timestamp': datetime.now().isoformat(),
    'date_range': date_range,
    'time_splits': config['time_splits'],
    'features': {
        'base_features': config['features']['base_feature_count'],
        'historical_features': config['features']['historical_feature_count'],
        'tfidf_features': tfidf_pipeline.vocab_size,
        'total_features': config['features']['total_features']
    },
    'tfidf_config': tfidf_pipeline.get_feature_metadata(),
    'dataset_sizes': {
        'train': train_final.count(),
        'val': val_final.count(),
        'test': test_final.count()
    },
    'class_distributions': {
        'train': {
            'ratio': f'{train_ratio:.1f}:1',
            'heavy_count': int(train_heavy),
            'note': 'Sampled for balanced training'
        },
        'val': {
            'ratio': f'{val_ratio:.1f}:1',
            'heavy_count': int(val_heavy),
            'note': 'Original production distribution'
        },
        'test': {
            'ratio': f'{test_ratio:.1f}:1',
            'heavy_count': int(test_heavy),
            'note': 'Original production distribution'
        }
    },
    'parity_validation': parity_result,
    's3_paths': {
        'train': train_path,
        'val': val_path,
        'test': test_path,
        'tfidf_vectorizer': s3_tfidf_path
    }
}

# Save locally
with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as tmp:
    json.dump(metadata, tmp, indent=2)
    local_metadata_path = tmp.name

# Upload to S3; remove the temp file even if the upload raises
metadata_key = f"{config['s3']['prefix']}/metadata/features_{date_range}.json"
try:
    s3_client.upload_file(local_metadata_path, config['s3']['bucket'], metadata_key)
    print(f"Metadata saved: s3://{config['s3']['bucket']}/{metadata_key}")
finally:
    # Cleanup
    os.unlink(local_metadata_path)

Metadata saved: s3://uip-datalake-bucket-prod/sf_trino/trino_query_predictor/metadata/features_2025-08-01_to_2025-10-01.json
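
The saved metadata carries `parity_validation`, and this run's summary shows it as FAILED with a 100% mismatch rate. A downstream notebook could read the metadata and refuse to train when parity did not pass. The following is a minimal sketch of such a guard, assuming `parity_result` has the `passed` and `mismatch_rate` keys used in the summary cell and that `mismatch_rate` is in percent. The function name `require_parity` is hypothetical, and the 0.5 default mirrors the <0.5% tolerance stated at the top of this notebook.

```python
import json


def require_parity(metadata: dict, max_mismatch_pct: float = 0.5) -> None:
    """Raise if the recorded parity validation failed or exceeds the tolerance."""
    parity = metadata["parity_validation"]
    if not parity["passed"] or parity["mismatch_rate"] > max_mismatch_pct:
        raise RuntimeError(
            "Train-serve parity check failed "
            f"(mismatch rate {parity['mismatch_rate']:.2f}%, "
            f"tolerance {max_mismatch_pct}%); fix feature extraction before training"
        )


# Illustrative metadata fragment shaped like the JSON written above
failed = json.loads('{"parity_validation": {"passed": false, "mismatch_rate": 100.0}}')
try:
    require_parity(failed)
except RuntimeError as err:
    print(err)
```

Raising instead of printing a warning keeps a failed parity check from being overlooked when the notebook is run end to end.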