## 🔒 Data Leakage Prevention

**Critical fix:** The test data contains no cancellation events, because the goal is to predict churn that has not happened yet.

**Solution:**
1. ✅ **Separate train/predict modes** - The pipeline factory includes `CancellationTargetTransformer` only in train mode
2. ✅ **No transformer refitting on test** - Predictions reuse the pipeline fitted on training data
3. ✅ **Combined data for cumulative features** - Accumulated features need each user's full history (train + test), so only those transformers are re-fitted on the combined data
4. ✅ **Proper feature alignment** - Exclude 'Cancel' from the features (it's the target!)

**Impact:** Realistic validation scores and a correctly generated Kaggle submission.
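
The train/predict split in item 1 can be sketched as a small factory. This is a minimal illustration, not the project's `create_feature_pipeline`: the step names `accumulated`, `page_interactions`, and `churn_target` come from this notebook, while `build_step_names`, `select_feature_columns`, and the remaining step names are hypothetical. Treating 'Cancellation Confirmation' as leaky alongside 'Cancel' is also an assumption.

```python
from typing import Iterable, List

# 'Cancel' is named as the target in this notebook; 'Cancellation Confirmation'
# is an assumption added here because it records the same event.
LEAKY_COLUMNS = {"Cancel", "Cancellation Confirmation"}
ID_AND_TARGET_COLUMNS = {"userId", "date", "churn_status"}


def build_step_names(mode: str) -> List[str]:
    """Return ordered step names; the target step exists only in train mode."""
    if mode not in {"train", "predict"}:
        raise ValueError(f"mode must be 'train' or 'predict', got {mode!r}")
    # 'aggregate', 'rolling', and 'preprocess' are hypothetical step names.
    steps = ["aggregate", "rolling", "accumulated", "page_interactions"]
    if mode == "train":
        steps.append("churn_target")  # needs cancellation events, so train only
    steps.append("preprocess")
    return steps


def select_feature_columns(columns: Iterable[str]) -> List[str]:
    """Drop identifiers, the target, and columns that encode the target."""
    blocked = LEAKY_COLUMNS | ID_AND_TARGET_COLUMNS
    return [c for c in columns if c not in blocked]


train_steps = build_step_names("train")
predict_steps = build_step_names("predict")
features = select_feature_columns(["userId", "date", "NextSong", "Cancel", "churn_status"])
print(train_steps)
print(predict_steps)
print(features)
```

Keeping the mode check in one place means the prediction path cannot accidentally request labels that the test data does not contain.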

In [14]:
import pandas as pd
import numpy as np
import sys, os
sys.path.append(os.path.abspath(".."))

import src.preprocessing
from importlib import reload
reload(src.preprocessing)

from src.preprocessing import (
    aggregate_user_day_activity, 
    add_rolling_averages,
    compute_cancellation_batch
)

# Import sklearn components
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import lightgbm as lgb
from sklearn.metrics import classification_report, roc_auc_score

In [2]:
# ============================================================================
# LOAD RAW TRAINING DATA
# ============================================================================
print("=" * 80)
print("LOADING RAW TRAINING DATA")
print("=" * 80)

root = '/Users/mdiaspinto/Documents/School/Python Data Science/Final Project/kaggle-churn'
df_raw = pd.read_parquet(root + '/data/train.parquet')

# Clean up: convert object columns to category, drop unnecessary columns
object_cols = df_raw.select_dtypes(include="object").columns
df_raw[object_cols] = df_raw[object_cols].astype("category")
df_raw = df_raw.drop(columns=['gender', 'firstName', 'lastName', 'location', 'userAgent', 'status', 'auth', 'method'])

print(f"\nRaw training data shape: {df_raw.shape}")
print(f"Date range: {pd.to_datetime(df_raw['time']).min()} to {pd.to_datetime(df_raw['time']).max()}")
print(f"Unique users: {df_raw['userId'].nunique()}")

LOADING RAW TRAINING DATA

Raw training data shape: (17499636, 11)
Date range: 2018-10-01 00:00:01 to 2018-11-20 00:00:00
Unique users: 19140


In [12]:
# Reload the preprocessing module to pick up changes
from importlib import reload
import src.preprocessing
reload(src.preprocessing)
from src.preprocessing import aggregate_user_day_activity

# 🔄 Time-Series Cross-Validation

Run time-series cross-validation to validate the approach before generating the final submission.
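
One way to build such folds is an expanding window: split the timeline into `n_splits + 1` contiguous blocks, train on everything before block k, and validate on block k. The sketch below illustrates the general technique under that assumption. It is not the internals of `run_time_series_cv`, whose fold boundaries in the output below follow user-day counts rather than calendar days, and `expanding_window_folds` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple


def expanding_window_folds(
    dates: Sequence[date], n_splits: int
) -> List[Tuple[List[date], List[date]]]:
    """Split sorted unique dates into (train, validation) pairs with growing train sets."""
    unique = sorted(set(dates))
    if n_splits < 1 or len(unique) < n_splits + 1:
        raise ValueError("need at least n_splits + 1 distinct dates")
    block = len(unique) // (n_splits + 1)
    folds = []
    for k in range(1, n_splits + 1):
        train = unique[: k * block]
        # The last fold absorbs any leftover days so no data is dropped.
        stop = (k + 1) * block if k < n_splits else len(unique)
        folds.append((train, unique[k * block : stop]))
    return folds


days = [date(2018, 10, 1) + timedelta(days=i) for i in range(51)]  # 2018-10-01 .. 2018-11-20
folds = expanding_window_folds(days, n_splits=3)
for train, val in folds:
    assert max(train) < min(val)  # validation is strictly after training
    print(f"train {train[0]} to {train[-1]} | val {val[0]} to {val[-1]}")
```

Every validation block lies strictly after its training block, which is the property that keeps future information out of the fit.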

In [4]:
# ============================================================================
# TIME-SERIES CROSS-VALIDATION
# ============================================================================
from importlib import reload
import src.preprocessing
reload(src.preprocessing)

from src.preprocessing import run_time_series_cv, create_feature_pipeline

# Run 3-fold time-series cross-validation
print("Running 3-fold time-series cross-validation...")
print("This validates our pipeline with proper temporal splits\n")

ts_cv_results = run_time_series_cv(df_raw, n_splits=3, window_days=10)

print("\n" + "=" * 80)
print("CROSS-VALIDATION RESULTS")
print("=" * 80)
print(f"Mean ROC-AUC: {ts_cv_results['mean_roc_auc']:.4f} ± {ts_cv_results['std_roc_auc']:.4f}")

Running 3-fold time-series cross-validation...
This validates our pipeline with proper temporal splits

TIME-SERIES CROSS-VALIDATION (3 FOLDS)





Dataset info:
  Total user-days: 976,140
  Date range: 2018-10-01 to 2018-11-20
  Total events: 17,499,636

FOLD 1/3
Train period: 2018-10-01 to 2018-10-13 (244,035 user-days)
Val period:   2018-10-13 to 2018-10-26 (244,035 user-days)

⏱️  TIMING BREAKDOWN:
  1. Filter raw data: 4.1s (9,726,480 events)
  2. Create pipeline: 0.0s
  3. Fit transformers: 1.8s
  4. Transform (detailed breakdown below)...
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (497640, 27)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Cancel', 'Cancellation Confirmation', 'Downgrade', 'Error', 'Help', 'Home', 'Logout', 'NextSong', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14



  Added: accumulated_errors, accumulated_unique_artists
PageInteractionTransformer: Computing page interaction features...




  Added: []
CancellationTargetTransformerModular: Computing churn targets (window=10d)...
  Churn status - 0: 477467, 1: 20173
FeaturePreprocessor: Final preprocessing...
     TOTAL TRANSFORM TIME: 403.4s ⚠️
Feature matrix - Train: (248820, 62), Val: (248820, 62)
Class distribution - Train: 12504/248820 (5.03% churn)
                     Val:   7669/248820 (3.08% churn)
Training model...
  5. Model training: 3.1s




  6. Evaluation: 0.8s

  ⏱️  FOLD TOTAL: 413.2s (6.9 min)

Fold 1 Results:
  ROC-AUC: 0.6849

FOLD 2/3
Train period: 2018-10-01 to 2018-10-26 (488,070 user-days)
Val period:   2018-10-26 to 2018-11-08 (244,035 user-days)

⏱️  TIMING BREAKDOWN:
  1. Filter raw data: 10.4s (14,112,729 events)
  2. Create pipeline: 0.1s
  3. Fit transformers: 3.9s
  4. Transform (detailed breakdown below)...
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (746460, 27)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Cancel', 'Cancellation Confirmation', 'Downgrade', 'Error', 'Help', 'Home', 'Logout', 'NextSong', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14d rolling



  Added: accumulated_errors, accumulated_unique_artists
PageInteractionTransformer: Computing page interaction features...




  Added: []
CancellationTargetTransformerModular: Computing churn targets (window=10d)...
  Churn status - 0: 716127, 1: 30333
FeaturePreprocessor: Final preprocessing...
     TOTAL TRANSFORM TIME: 497.8s ⚠️
Feature matrix - Train: (497640, 62), Val: (248820, 62)
Class distribution - Train: 23441/497640 (4.71% churn)
                     Val:   6892/248820 (2.77% churn)
Training model...
  5. Model training: 4.1s




  6. Evaluation: 1.0s

  ⏱️  FOLD TOTAL: 517.3s (8.6 min)

Fold 2 Results:
  ROC-AUC: 0.7327

FOLD 3/3
Train period: 2018-10-01 to 2018-11-08 (732,105 user-days)
Val period:   2018-11-08 to 2018-11-20 (244,035 user-days)

⏱️  TIMING BREAKDOWN:
  1. Filter raw data: 9.2s (17,499,636 events)
  2. Create pipeline: 0.1s
  3. Fit transformers: 7.2s
  4. Transform (detailed breakdown below)...
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (976140, 27)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Cancel', 'Cancellation Confirmation', 'Downgrade', 'Error', 'Help', 'Home', 'Logout', 'NextSong', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14d rolling 



  Added: accumulated_errors, accumulated_unique_artists
PageInteractionTransformer: Computing page interaction features...




  Added: []
CancellationTargetTransformerModular: Computing churn targets (window=10d)...
  Churn status - 0: 938477, 1: 37663
FeaturePreprocessor: Final preprocessing...
     TOTAL TRANSFORM TIME: 562.2s ⚠️
Feature matrix - Train: (746460, 62), Val: (229680, 62)
Class distribution - Train: 33299/746460 (4.46% churn)
                     Val:   4364/229680 (1.90% churn)
Training model...
  5. Model training: 5.5s




  6. Evaluation: 0.8s

  ⏱️  FOLD TOTAL: 584.9s (9.7 min)

Fold 3 Results:
  ROC-AUC: 0.7625

CROSS-VALIDATION SUMMARY
 fold  train_size  val_size  roc_auc  churn_rate_train  churn_rate_val
    1      248820    248820 0.684916          0.050253        0.030821
    2      497640    248820 0.732658          0.047104        0.027699
    3      746460    229680 0.762506          0.044609        0.019000

Mean ROC-AUC: 0.7267 ± 0.0391
Best fold:    3 (0.7625)
Worst fold:   1 (0.6849)

CROSS-VALIDATION RESULTS
Mean ROC-AUC: 0.7267 ± 0.0391
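
The summary line can be checked by hand from the three fold scores in the table above. The sketch below uses only the standard library. It suggests that the reported spread is the sample standard deviation (n - 1 denominator), which is an inference from the numbers rather than something the log states.

```python
from statistics import mean, pstdev, stdev

# Per-fold ROC-AUC values copied from the cross-validation summary table.
fold_auc = [0.684916, 0.732658, 0.762506]

mean_auc = mean(fold_auc)
sample_std = stdev(fold_auc)       # divides by n - 1
population_std = pstdev(fold_auc)  # divides by n

print(f"mean={mean_auc:.4f}  sample_std={sample_std:.4f}  population_std={population_std:.4f}")
```

With only three folds the spread is a rough estimate, and the steady rise from fold 1 to fold 3 may reflect the growing training window as much as random variation.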


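# 🚀 Train Final Model & Generate Submission

Fit the feature pipeline and model on training events before 2018-11-01, hold out the remaining training events for validation and threshold selection, then generate leak-free predictions for the Kaggle test set.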
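
The training cell below also picks a decision threshold by sweeping candidates and keeping the one with the highest balanced accuracy on the validation split. Here is a dependency-free sketch of that sweep on made-up labels and scores. The helper names and toy data are hypothetical, and the notebook itself uses `balanced_accuracy_score` from scikit-learn.

```python
from typing import List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


def best_threshold(
    y_true: Sequence[int], proba: Sequence[float], thresholds: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, score) for the first threshold with the top balanced accuracy."""
    scored: List[Tuple[float, float]] = []
    for threshold in thresholds:
        y_pred = [int(p >= threshold) for p in proba]
        scored.append((threshold, balanced_accuracy(y_true, y_pred)))
    return max(scored, key=lambda pair: pair[1])


# Toy validation labels and predicted churn probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.8]
thresholds = [i / 100 for i in range(5, 95)]  # 0.05 .. 0.94 in steps of 0.01

best_t, best_score = best_threshold(y_true, proba, thresholds)
print(f"threshold={best_t:.2f}  balanced_accuracy={best_score:.4f}")
```

Because the threshold is chosen on the same rows used to report balanced accuracy, the reported figure is likely somewhat optimistic.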

In [5]:
# ============================================================================
# TRAIN FINAL MODEL ON ALL TRAINING DATA
# ============================================================================
print("=" * 80)
print("TRAINING FINAL MODEL ON ALL TRAINING DATA")
print("=" * 80)

import numpy as np
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.metrics import balanced_accuracy_score, classification_report, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Create training pipeline (mode='train' includes churn labels)
print("\n1. Creating TRAINING pipeline...")
train_pipeline = create_feature_pipeline(
    cutoff_date=pd.to_datetime('2099-12-31'),  # Use all data for training
    mode='train',
    window_days=10
)

# 2. Create temporal raw train/validation split (leak-free)
print("\n2. Creating temporal raw train/validation split...")
cutoff = pd.Timestamp("2018-11-01")
raw_train = df_raw[pd.to_datetime(df_raw['time']).dt.normalize() < cutoff.normalize()]
raw_val   = df_raw[pd.to_datetime(df_raw['time']).dt.normalize() >= cutoff.normalize()]
print(f"   Train raw rows: {len(raw_train):,}")
print(f"   Val raw rows:   {len(raw_val):,}")

# 3. Transform raw data to features (with raw_df passed)
print("\n3. Transforming raw data to features (with raw_df passed)...")
train_feat = train_pipeline.fit_transform(
    raw_train,
    accumulated__raw_df=raw_train,
    page_interactions__raw_df=raw_train,
    churn_target__raw_df=raw_train,
)
val_feat = train_pipeline.transform(raw_val)  # uses state (including raw_df_) learned from training fit
print(f"   Train features shape: {train_feat.shape}")
print(f"   Val features shape:   {val_feat.shape}")

# 4. Prepare train/validation feature matrices and targets
df_train = train_feat.copy()
df_val = val_feat.copy()

exclude_cols = ['userId', 'date', 'churn_status']
feature_cols = [col for col in df_train.columns if col not in exclude_cols]

X_train = df_train[feature_cols]
y_train = df_train['churn_status']
X_val = df_val[feature_cols]
y_val = df_val['churn_status']

print(f"\n4. Feature matrix prepared:")
print(f"   Features: {len(feature_cols)}")
print(f"   Train churn rate: {y_train.mean():.2%}")
print(f"   Val churn rate: {y_val.mean():.2%}")

# 5. Build and train model
print("\n5. Training LightGBM model...")

# Calculate class weights
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
scale_pos_weight = neg_count / pos_count

# Identify feature types
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

# Build model pipeline
model_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
        ]
    )),
    ('classifier', lgb.LGBMClassifier(
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        verbose=-1,
        n_jobs=-1,
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1
    ))
])

model_pipeline.fit(X_train, y_train)
print(f"   ✓ Model trained (scale_pos_weight={scale_pos_weight:.2f})")

# 6. Evaluate on validation set and optimize threshold
print("\n6. Evaluating on validation set...")
y_val_pred_proba = model_pipeline.predict_proba(X_val)[:, 1]

# Test different thresholds
thresholds = np.arange(0.05, 0.95, 0.01)
balanced_accuracies = []

for threshold in thresholds:
    y_pred_threshold = (y_val_pred_proba >= threshold).astype(int)
    bal_acc = balanced_accuracy_score(y_val, y_pred_threshold)
    balanced_accuracies.append(bal_acc)

# Find optimal threshold
optimal_idx = np.argmax(balanced_accuracies)
optimal_threshold = thresholds[optimal_idx]
optimal_bal_acc = balanced_accuracies[optimal_idx]

print(f"\n   Optimal threshold: {optimal_threshold:.2f}")
print(f"   Validation balanced accuracy: {optimal_bal_acc:.4f}")
print(f"   Validation ROC-AUC: {roc_auc_score(y_val, y_val_pred_proba):.4f}")

# Show classification report with optimal threshold
y_val_pred_optimal = (y_val_pred_proba >= optimal_threshold).astype(int)
print(f"\n   Classification Report (threshold={optimal_threshold:.2f}):")
print(classification_report(y_val, y_val_pred_optimal, target_names=['No Churn', 'Churn']))

TRAINING FINAL MODEL ON ALL TRAINING DATA

1. Creating TRAINING pipeline...

2. Fitting transformers on training data...

3. Transforming raw data to features...
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (976140, 27)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Cancel', 'Cancellation Confirmation', 'Downgrade', 'Error', 'Help', 'Home', 'Logout', 'NextSong', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14d rolling averages...
TrendFeaturesTransformer: Computing trend features...
AccumulatedFeaturesTransformer: Computing cumulative features...




  Added: accumulated_errors, accumulated_unique_artists
PageInteractionTransformer: Computing page interaction features...




  Added: []
CancellationTargetTransformerModular: Computing churn targets (window=10d)...
  Churn status - 0: 938477, 1: 37663
FeaturePreprocessor: Final preprocessing...
   Features shape: (976140, 62)

4. Creating temporal train/validation split...
   Train: 593,340 samples (2018-10-01 to 2018-10-31)
   Val:   382,800 samples (2018-11-01 to 2018-11-20)

5. Feature matrix prepared:
   Features: 59
   Train churn rate: 4.62%
   Val churn rate: 2.68%

6. Training LightGBM model...
   ✓ Model trained (scale_pos_weight=20.66)

7. Evaluating on validation set...





   Optimal threshold: 0.43
   Validation balanced accuracy: 0.6875
   Validation ROC-AUC: 0.7448

   Classification Report (threshold=0.43):
              precision    recall  f1-score   support

    No Churn       0.99      0.66      0.79    372536
       Churn       0.06      0.71      0.10     10264

    accuracy                           0.66    382800
   macro avg       0.52      0.69      0.45    382800
weighted avg       0.96      0.66      0.78    382800
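
The report above can be sanity-checked with a little arithmetic. Using the rounded recalls and the supports it prints, the sketch below recovers the balanced accuracy and shows why precision for the churn class is so low when churners are under 3% of validation rows. The inputs are rounded to two decimals, so the results are approximate.

```python
# Values copied from the classification report (recalls are rounded to 2 decimals).
recall_no_churn, recall_churn = 0.66, 0.71
support_no_churn, support_churn = 372_536, 10_264

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = (recall_no_churn + recall_churn) / 2

# Estimated counts of rows flagged as churn, split by whether the flag is correct.
true_positives = recall_churn * support_churn
false_positives = (1 - recall_no_churn) * support_no_churn

precision_churn = true_positives / (true_positives + false_positives)
flagged_share = (true_positives + false_positives) / (support_no_churn + support_churn)

print(f"balanced accuracy ~ {balanced_accuracy:.3f}")
print(f"churn precision   ~ {precision_churn:.3f}")
print(f"share flagged     ~ {flagged_share:.1%}")
```

With these rounded inputs the balanced accuracy comes out near 0.685 (the log reports 0.6875), churn precision lands between 0.05 and 0.06, and roughly a third of validation rows are flagged as churn.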



In [6]:
# ============================================================================
# GENERATE KAGGLE SUBMISSION (LEAK-FREE - CORRECTED!)
# ============================================================================
print("\n" + "=" * 80)
print("GENERATING KAGGLE SUBMISSION")
print("=" * 80)

# 1. Load Kaggle test data
print("\n1. Loading Kaggle test data...")
df_test_raw = pd.read_parquet(root + '/data/test.parquet')

# Clean up test data (same as training)
object_cols_test = df_test_raw.select_dtypes(include="object").columns
df_test_raw[object_cols_test] = df_test_raw[object_cols_test].astype("category")
df_test_raw = df_test_raw.drop(
    columns=['gender', 'firstName', 'lastName', 'location', 'userAgent', 'status', 'auth', 'method']
)

print(f"   Test data shape: {df_test_raw.shape}")
print(f"   Date range: {pd.to_datetime(df_test_raw['time']).min()} to {pd.to_datetime(df_test_raw['time']).max()}")
print(f"   Unique users: {df_test_raw['userId'].nunique()}")

# 2. Re-fit ONLY cumulative features on combined train+test data
print("\n2. Re-fitting cumulative features on combined train+test...")
df_combined = pd.concat([df_raw, df_test_raw], ignore_index=True)

# Re-fit only the transformers that need full user history
if 'accumulated' in dict(train_pipeline.steps):
    train_pipeline.named_steps['accumulated'].fit(None, raw_df=df_combined)
if 'page_interactions' in dict(train_pipeline.steps):
    train_pipeline.named_steps['page_interactions'].fit(None, raw_df=df_combined)

# 3. Create prediction pipeline by removing churn_target from trained pipeline
print("\n3. Creating prediction pipeline from training-fitted transformers...")
from sklearn.pipeline import Pipeline

# Build pipeline WITHOUT churn_target transformer
predict_steps = [
    (name, transformer) for name, transformer in train_pipeline.steps
    if name != 'churn_target'
]
predict_pipeline = Pipeline(predict_steps)

# 4. Transform test data (using FITTED transformers from training!)
print("\n4. Transforming test data with training-fitted pipeline...")
df_test_features = predict_pipeline.transform(df_test_raw)  # ← TRANSFORM ONLY - NO REFITTING!
print(f"   Test features shape: {df_test_features.shape}")

# 5. Get last observation per user
print("\n5. Extracting last observation per user...")
df_test_features['date'] = pd.to_datetime(df_test_features['date'])
last_user_data = df_test_features.sort_values('date').groupby('userId').tail(1).reset_index(drop=True)
print(f"   Submission samples: {len(last_user_data):,}")

# 6. Align features with training (add missing columns, ensure same order)
print("\n6. Aligning features with training schema...")
for col in feature_cols:
    if col not in last_user_data.columns:
        last_user_data[col] = 0
        print(f"   Added missing feature: {col}")

X_submission = last_user_data[feature_cols]
print(f"   ✓ Features aligned: {len(feature_cols)} features")

# 7. Make predictions with optimal threshold
print("\n7. Making predictions...")
y_pred_proba = model_pipeline.predict_proba(X_submission)[:, 1]
y_pred = (y_pred_proba >= optimal_threshold).astype(int)

print(f"   Using threshold: {optimal_threshold:.2f}")
print(f"   Predicted churn rate: {y_pred.mean():.2%}")

# 8. Create and save submission
submission = pd.DataFrame({
    'id': last_user_data['userId'].astype(int).values,
    'target': y_pred
})

output_path = root + '/data/submission_ndl.csv'
submission.to_csv(output_path, index=False)

print("\n" + "=" * 80)
print("✓ SUBMISSION GENERATED (LEAK-FREE!)")
print("=" * 80)
print(f"\nFile: {output_path}")
print(f"Shape: {submission.shape}")
print(f"Users: {len(submission):,}")
print(f"\nTarget distribution:")
print(submission['target'].value_counts())
print(f"\nPredicted churn rate: {submission['target'].mean():.2%}")
print(f"\n✓ Ready for Kaggle submission!")
print(f"\nModel Performance (validation):")
print(f"  ROC-AUC: {roc_auc_score(y_val, y_val_pred_proba):.4f}")
print(f"  Balanced Accuracy: {optimal_bal_acc:.4f}")
print(f"  Optimal threshold: {optimal_threshold:.2f}")
print(f"\n🔒 NO DATA LEAKAGE:")
print(f"  ✓ Using .transform() only (no refitting on test)")
print(f"  ✓ All transformers fitted on training data")
print(f"  ✓ Only cumulative features use combined data (valid)")


GENERATING KAGGLE SUBMISSION

1. Loading Kaggle test data...
   Test data shape: (4393179, 11)
   Date range: 2018-10-01 00:00:06 to 2018-11-20 00:00:00
   Unique users: 2904

2. Re-fitting cumulative features on combined train+test...

3. Creating prediction pipeline from training-fitted transformers...

4. Transforming test data with training-fitted pipeline...
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (148104, 28)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Downgrade', 'Error', 'Help', 'Home', 'Login', 'Logout', 'NextSong', 'Register', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Registration', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14d rolling averages...
TrendFeaturesTr



   ✓ Features aligned: 59 features

7. Making predictions...
   Using threshold: 0.43
   Predicted churn rate: 42.46%

✓ SUBMISSION GENERATED (LEAK-FREE!)

File: /Users/mdiaspinto/Documents/School/Python Data Science/Final Project/kaggle-churn/data/submission_ndl.csv
Shape: (2904, 2)
Users: 2,904

Target distribution:
target
0    1671
1    1233
Name: count, dtype: int64

Predicted churn rate: 42.46%

✓ Ready for Kaggle submission!

Model Performance (validation):




  ROC-AUC: 0.7448
  Balanced Accuracy: 0.6875
  Optimal threshold: 0.43

🔒 NO DATA LEAKAGE:
  ✓ Using .transform() only (no refitting on test)
  ✓ All transformers fitted on training data
  ✓ Only cumulative features use combined data (valid)


In [7]:
# ============================================================================
# VERIFY SUBMISSION FILE
# ============================================================================
print("=" * 80)
print("SUBMISSION FILE VERIFICATION")
print("=" * 80)

# Read the generated submission
submission_check = pd.read_csv(output_path)

print("\n1. File Structure:")
print(f"   Columns: {list(submission_check.columns)}")
print(f"   Shape: {submission_check.shape}")
print(f"   ✅ Format correct: {list(submission_check.columns) == ['id', 'target']}")

print("\n2. Data Quality:")
print(f"   Unique users: {submission_check['id'].nunique()}")
print(f"   Duplicates: {submission_check['id'].duplicated().sum()}")
print(f"   Missing values: {submission_check.isnull().sum().sum()}")
print(f"   ✅ No duplicates: {submission_check['id'].duplicated().sum() == 0}")
print(f"   ✅ No missing: {submission_check.isnull().sum().sum() == 0}")

print("\n3. Target Values:")
print(f"   Unique values: {sorted(submission_check['target'].unique())}")
print(f"   ✅ Binary (0/1): {set(submission_check['target'].unique()) == {0, 1}}")

print("\n4. Sample Predictions:")
print(submission_check.head(10))

print("\n5. Probability Distribution:")
bins = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
prob_dist = pd.cut(y_pred_proba, bins=bins).value_counts().sort_index()
for interval, count in prob_dist.items():
    pct = count / len(y_pred_proba) * 100
    bar = '█' * int(pct / 2)
    print(f"   {interval}: {count:5d} ({pct:5.1f}%) {bar}")

print("\n" + "=" * 80)
print("✅ SUBMISSION FILE VERIFIED - READY FOR UPLOAD")
print("=" * 80)

SUBMISSION FILE VERIFICATION

1. File Structure:
   Columns: ['id', 'target']
   Shape: (2904, 2)
   ✅ Format correct: True

2. Data Quality:
   Unique users: 2904
   Duplicates: 0
   Missing values: 0
   ✅ No duplicates: True
   ✅ No missing: True

3. Target Values:
   Unique values: [0, 1]
   ✅ Binary (0/1): True

4. Sample Predictions:
        id  target
0  1995115       0
1  1993285       1
2  1979129       1
3  1997769       0
4  1997880       1
5  1985914       0
6  1987068       0
7  1988412       1
8  1994524       1
9  1988592       1

5. Probability Distribution:
   (0.0, 0.1]:     9 (  0.3%) 
   (0.1, 0.2]:   142 (  4.9%) ██
   (0.2, 0.3]:   729 ( 25.1%) ████████████
   (0.3, 0.4]:   630 ( 21.7%) ██████████
   (0.4, 0.5]:   479 ( 16.5%) ████████
   (0.5, 0.6]:   550 ( 18.9%) █████████
   (0.6, 0.7]:   361 ( 12.4%) ██████
   (0.7, 0.8]:     4 (  0.1%) 
   (0.8, 0.9]:     0 ( 

In [19]:
from importlib import reload
import src.preprocessing
reload(src.preprocessing)

from src.preprocessing import create_feature_pipeline

pipe = create_feature_pipeline(cutoff_date='2018-10-15', mode='predict')

out = pipe.fit_transform(
    df_raw,
    accumulated__raw_df=df_raw,
    page_interactions__raw_df=df_raw,
)

assert out['date'].max() <= pd.Timestamp('2018-10-15')

RawDataSplitter: Filtered to 5,130,187 events (<= 2018-10-15)
BasicEventAggregator: Aggregating events to user-day level...
  Output shape: (287100, 27)
  Features: ['About', 'Add Friend', 'Add to Playlist', 'Cancel', 'Cancellation Confirmation', 'Downgrade', 'Error', 'Help', 'Home', 'Logout', 'NextSong', 'Roll Advert', 'Save Settings', 'Settings', 'Submit Downgrade', 'Submit Upgrade', 'Thumbs Down', 'Thumbs Up', 'Upgrade', 'event_count', 'session_count', 'avg_session_length', 'events_per_session', 'level', 'days_since_registration']
RollingAverageTransformerModular: Computing 7d rolling averages...
RollingAverageTransformerModular: Computing 14d rolling averages...
TrendFeaturesTransformer: Computing trend features...
AccumulatedFeaturesTransformer: Computing cumulative features...




  Added: accumulated_errors, accumulated_unique_artists
PageInteractionTransformer: Computing page interaction features...




  Added: []
FeaturePreprocessor: Final preprocessing...
