# Test: Data Processing and EasyRec Config Generation

This notebook demonstrates how to use the scripts from the `src` directory to:
1. Simulate splitting data (as `split_dataset.py` would do from GCS).
2. Use a sample of this data with `auto_feature_config_generator.py` to create `input_fields` and `feature_config` sections.
3. Use `generate_and_run_train.py` to populate the `esmm_pipeline_template.config` with these generated sections and other parameters.

In [None]:
import os
import sys
import pandas as pd
import subprocess

# Add src to path to allow direct imports if package not installed via pip install -e .
module_path = os.path.abspath(os.path.join('..', 'src'))
if module_path not in sys.path:
    sys.path.append(module_path)
    print(f"Added {module_path} to sys.path")

try:
    from data_processing import auto_feature_config_generator
    # from data_processing import split_dataset # Not directly calling its main for this test
    # from training import generate_and_run_train # Will call this script via subprocess
    print("Successfully imported 'auto_feature_config_generator'.")
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Please ensure 'src' is in PYTHONPATH or the package is installed (e.g., pip install -e .)")

# Mock GCS paths and local paths for testing
# These would be replaced with actual GCS paths in a real scenario.
MOCK_GCS_RAW_DATA_INPUT_DAY_PATH = "gs://mock-bucket/raw_data/20231026/" # Input for split_dataset.py (not used directly here)
MOCK_GCS_PROCESSED_PATH_PREFIX = "gs://mock-bucket/processed_data/"      # Output prefix for split_dataset.py
MOCK_GCS_MODEL_OUTPUT_PATH_PREFIX = "gs://mock-bucket/model_output/"    # Output for training model_dir

# For generate_and_run_train.py
# GCS URIs always use forward slashes, so join them with posixpath instead of
# os.path, which would insert backslashes on Windows.
import posixpath

MOCK_GCS_TRAIN_DIR_FOR_SCHEMA_INF = posixpath.join(MOCK_GCS_PROCESSED_PATH_PREFIX, "20231026/train/") # Used for schema inference
MOCK_GCS_TRAIN_DATA_PATH = posixpath.join(MOCK_GCS_PROCESSED_PATH_PREFIX, "20231026/train/")
MOCK_GCS_EVAL_DATA_PATH = posixpath.join(MOCK_GCS_PROCESSED_PATH_PREFIX, "20231026/validation/")
MOCK_GCS_MODEL_DIR = posixpath.join(MOCK_GCS_MODEL_OUTPUT_PATH_PREFIX, "20231026/")

CONFIG_TEMPLATE_PATH = "../configs/esmm_pipeline_template.config"
GENERATED_CONFIG_PATH = "../configs/generated_notebook_test_pipeline.config"

print(f"Config template path: {os.path.abspath(CONFIG_TEMPLATE_PATH)}")
print(f"Generated config will be at: {os.path.abspath(GENERATED_CONFIG_PATH)}")

## 1. Simulate Data Splitting (using logic from `split_dataset.py`)

The `split_dataset.py` script would typically read from a GCS path like `MOCK_GCS_RAW_DATA_INPUT_DAY_PATH`, process multiple Parquet files, perform splits, and save them to paths like `MOCK_GCS_TRAIN_DATA_PATH` and `MOCK_GCS_EVAL_DATA_PATH`.

For this notebook, we'll directly generate a sample DataFrame that represents the kind of data `auto_feature_config_generator.py` would expect for schema inference. This sample would conceptually come from the *training split*.
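As a rough illustration of the splitting step described above, the sketch below assigns rows to `train` or `validation` by hashing a key column. This is an assumption made for illustration only: the actual strategy inside `split_dataset.py` is not shown in this notebook, and `assign_split`, `split_rows`, the `user_id` key, and the 10% validation fraction are all hypothetical.

```python
import hashlib


def assign_split(key: str, validation_fraction: float = 0.1) -> str:
    """Deterministically map a key (e.g. a user ID) to 'train' or 'validation'."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Read the first 8 hex digits as an integer in [0, 2**32) and scale to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "validation" if bucket < validation_fraction else "train"


def split_rows(rows, key_field, validation_fraction=0.1):
    """Partition an iterable of dict rows into train and validation lists."""
    splits = {"train": [], "validation": []}
    for row in rows:
        name = assign_split(str(row[key_field]), validation_fraction)
        splits[name].append(row)
    return splits


# Hypothetical rows standing in for records read from the raw Parquet files.
rows = [{"user_id": f"u{i}", "click": i % 2} for i in range(1000)]
splits = split_rows(rows, "user_id")
print({name: len(part) for name, part in splits.items()})
```

Hashing a key rather than sampling at random keeps every row for a given key in the same split and gives the same assignment on every run, which may or may not match what the real script does.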

In [None]:
# This DataFrame simulates a sample taken from the training data after splitting.
# auto_feature_config_generator.load_sample_from_gcs_parquet would be used by generate_and_run_train.py
# to (mock) load this from a GCS path.
try:
    sample_df_for_schema = auto_feature_config_generator.get_sample_dataframe(num_rows=500)
    print("Sample DataFrame for schema inference created:")
    # A bare expression inside a try block is not displayed by the notebook, so print it.
    print(sample_df_for_schema.head())
    # sample_df_for_schema.info() # Uncomment for more detail
except Exception as e:
    print(f"Error generating sample DataFrame: {e}")
    sample_df_for_schema = pd.DataFrame() # Ensure it exists for next cell

## 2. Generate EasyRec Feature Configurations

This step calls functions from `auto_feature_config_generator.py` on the sample DataFrame to produce protobuf text snippets for the `input_fields` and `feature_config` sections.
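To make the idea concrete, here is a minimal sketch of dtype-based classification and `input_fields` rendering. It is not the implementation in `auto_feature_config_generator.py`: the helper names, the label names `click` and `conversion`, and the dtype-to-`input_type` mapping are assumptions, and the emitted snippet should be checked against the EasyRec version in use.

```python
# Hypothetical mapping from pandas dtype names to EasyRec input_type values.
DTYPE_TO_INPUT_TYPE = {
    "int32": "INT32",
    "int64": "INT64",
    "float32": "FLOAT",
    "float64": "DOUBLE",
    "object": "STRING",
}


def classify_by_dtype(dtypes, label_names=("click", "conversion")):
    """Group column names into labels, categorical and numeric features."""
    groups = {"labels": [], "categorical": [], "numeric": []}
    for name, dtype in dtypes.items():
        if name in label_names:
            groups["labels"].append(name)
        elif dtype.startswith(("int", "float")):
            groups["numeric"].append(name)
        else:
            groups["categorical"].append(name)
    return groups


def render_input_fields(dtypes):
    """Render one input_fields block per column as protobuf text."""
    blocks = []
    for name, dtype in dtypes.items():
        input_type = DTYPE_TO_INPUT_TYPE.get(dtype, "STRING")  # fall back to STRING
        blocks.append(
            f'input_fields {{\n  input_name: "{name}"\n  input_type: {input_type}\n}}'
        )
    return "\n".join(blocks)


# With a DataFrame, the mapping could come from df.dtypes.astype(str).to_dict().
dtypes = {"user_id": "object", "item_price": "float64", "click": "int64"}
print(classify_by_dtype(dtypes))
print(render_input_fields(dtypes))
```

A pure dtype rule treats integer ID columns as numeric, so a real classifier would likely also need cardinality or naming checks.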

In [None]:
try:
    classified_features = auto_feature_config_generator.classify_features(sample_df_for_schema)
    print("Classified Features:", classified_features)
    
    input_fields_proto_str = auto_feature_config_generator.generate_input_fields_proto(classified_features)
    feature_config_proto_str = auto_feature_config_generator.generate_feature_config_proto(classified_features)

    print("\n--- Input Fields Proto Snippet ---")
    print(input_fields_proto_str)
    print("\n--- Feature Config Proto Snippet ---")
    print(feature_config_proto_str)
except Exception as e:
    print(f"Error during feature configuration generation: {e}")
    classified_features = {} # Ensure it exists

## 3. Populate Full EasyRec Pipeline Configuration

This step uses the `generate_and_run_train.py` script to take the template config and populate it with:
1. The auto-generated `input_fields` and `feature_config` (which the script internally generates using `auto_feature_config_generator`).
2. GCS paths for training data, evaluation data, and model output.
3. Automated feature groupings (user, item) based on naming conventions.
4. Other parameters, such as batch size and number of epochs, if overridden.
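The steps above can be sketched roughly as follows. This is a simplified illustration, not the logic of `generate_and_run_train.py`: the `${...}` placeholder syntax, the placeholder names, the `user_`/`item_` prefixes, and both helper functions are assumptions, since the template's actual placeholders and the script's naming conventions are not shown here.

```python
from string import Template


def group_by_prefix(feature_names, user_prefix="user_", item_prefix="item_"):
    """Assign features to user/item groups based on a name prefix."""
    groups = {"user": [], "item": [], "other": []}
    for name in feature_names:
        if name.startswith(user_prefix):
            groups["user"].append(name)
        elif name.startswith(item_prefix):
            groups["item"].append(name)
        else:
            groups["other"].append(name)
    return groups


def fill_template(template_text, values):
    """Substitute ${NAME} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(values)


# Hypothetical template fragment; the real template may use different markers.
template_text = (
    'train_input_path: "${TRAIN_PATH}"\n'
    'eval_input_path: "${EVAL_PATH}"\n'
    'model_dir: "${MODEL_DIR}"\n'
)
values = {
    "TRAIN_PATH": "gs://mock-bucket/processed_data/20231026/train/",
    "EVAL_PATH": "gs://mock-bucket/processed_data/20231026/validation/",
    "MODEL_DIR": "gs://mock-bucket/model_output/20231026/",
}
filled = fill_template(template_text, values)
print(filled)
print(group_by_prefix(["user_age", "item_price", "hour_of_day"]))
```

`safe_substitute` is used so that a template containing placeholders the caller does not know about is passed through rather than raising a `KeyError`.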

In [None]:
# Construct the command to call generate_and_run_train.py
# The script will internally call auto_feature_config_generator functions, 
# including the (mock) GCS sample loading and automated feature grouping.

generate_script_path = os.path.join(module_path, "training", "generate_and_run_train.py")

cmd_args = [
    sys.executable, generate_script_path,  # Use the interpreter running this notebook so the same environment is used
    "--gcs_train_data_for_schema_inference", MOCK_GCS_TRAIN_DIR_FOR_SCHEMA_INF, # Path for schema sample loading
    "--template_config_path", CONFIG_TEMPLATE_PATH,
    "--output_config_path", GENERATED_CONFIG_PATH,
    "--gcs_processed_data_path_train", MOCK_GCS_TRAIN_DATA_PATH,
    "--gcs_processed_data_path_eval", MOCK_GCS_EVAL_DATA_PATH,
    "--gcs_model_dir_path", MOCK_GCS_MODEL_DIR,
    # --user_features and --item_features are no longer used by generate_and_run_train.py
    # Feature grouping is now automated within the script.
    # "--execute_training", # Uncomment to try to execute training (requires EasyRec env)
]

print(f"Ensuring script path exists: {generate_script_path}")
if not os.path.exists(generate_script_path):
    print(f"Error: generate_and_run_train.py not found at {generate_script_path}")
else:
    print(f"Running command: {' '.join(cmd_args)}\n")
    try:
        result = subprocess.run(cmd_args, capture_output=True, text=True, check=False) # check=False to see output even on error
        
        print("--- STDOUT from generate_and_run_train.py ---")
        print(result.stdout)
        print("--- STDERR from generate_and_run_train.py ---")
        print(result.stderr)

        if result.returncode == 0:
            print("\nConfig generation script ran successfully.")
            print("Generated config content (from file):")
            try:
                with open(GENERATED_CONFIG_PATH, 'r') as f:
                    print(f.read())
            except FileNotFoundError:
                print(f"Error: Generated config file not found at {GENERATED_CONFIG_PATH}")
        else:
            print(f"\nError running config generation script (return code: {result.returncode}).")
            
    except FileNotFoundError:
        print("Error: Python interpreter not found. Please ensure Python is installed and on your PATH.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


## 4. Next Steps

- Review `configs/generated_notebook_test_pipeline.config` to check the populated values.
- Replace mock GCS paths in the notebook and in actual script calls with real paths when ready.
- If you need schema inference from your actual dataset, implement real GCS reads in `auto_feature_config_generator.load_sample_from_gcs_parquet`, which currently only mock-loads a sample.
- To run EasyRec training with the generated config from the repository root (ensure EasyRec is installed and its environment is active):
  ```bash
  python -m easy_rec.python.train_eval --pipeline_config_path configs/generated_notebook_test_pipeline.config
  ```
- To evaluate model predictions (after generating them), use `src/evaluation/evaluate_model.py`.
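
Part of the config review can be automated. The sketch below is a hypothetical helper, not part of the `src` scripts: it flags leftover mock bucket paths and mismatched brace counts in the generated text. The brace count is a crude check that ignores braces inside quoted strings, so it complements a manual read rather than replacing one.

```python
def check_generated_config(text, mock_marker="gs://mock-bucket"):
    """Return a list of human-readable problems found in a config's text."""
    problems = []
    if text.count("{") != text.count("}"):
        problems.append("unbalanced braces")
    if mock_marker in text:
        problems.append(f"mock path still present: {mock_marker}")
    return problems


# In the notebook, the text would come from reading GENERATED_CONFIG_PATH.
example = 'model_dir: "gs://mock-bucket/model_output/20231026/"\ndata_config {\n}\n'
for problem in check_generated_config(example):
    print(problem)
```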