# Building Enrichment Pipeline - Simple Job Creator

**Quick Start:**
1. Replace `data/NOS_storey_mapping.csv` with your country-specific file
2. Set `ISO3` in Step 2 (Configuration) to your three-letter country code
3. Run all cells
4. Monitor job progress

**The job will automatically:**
- Use files from `data/` folder (tsi.csv, admin boundaries already included)
- Create {ISO3}/input/, {ISO3}/output/, {ISO3}/logs/ folders
- Copy files to correct locations
- Generate full config.json with ISO3 suffixes
- Run the complete pipeline

**Note:** All required data files (except your NOS file) are already in the `data/` folder!
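
The folder layout listed above can be sketched with `pathlib`. This is only an illustration of the `{ISO3}/input/`, `{ISO3}/output/`, `{ISO3}/logs/` structure; the function name `create_country_folders` is hypothetical and is not taken from the pipeline's helper modules, which may build the folders differently.

```python
from pathlib import Path


def create_country_folders(base_dir, iso3):
    """Create <base_dir>/<iso3>/{input,output,logs} and return the paths.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    country_root = Path(base_dir) / iso3
    folders = {}
    for name in ("input", "output", "logs"):
        folder = country_root / name
        # parents=True also creates the ISO3 folder; exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
        folders[name] = folder
    return folders
```

Calling `create_country_folders(VOLUME_BASE, ISO3)` would, under these assumptions, produce the three folders for the configured country and can be repeated without error.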

## Step 1: Install Required Packages

In [None]:
# Auto-install notebook dependencies
try:
    import databricks.sdk
    import yaml
    print("✅ All dependencies available")
except ImportError:
    print("Installing packages...")
    %pip install databricks-sdk pyyaml --quiet
    dbutils.library.restartPython()
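
The check above stops at the first failed import, so it does not say which package is missing. A minimal sketch of a more explicit check, using the standard library's `importlib.util.find_spec`, is shown below. The helper name `missing_modules` is an assumption made for illustration.

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be located, without importing them fully."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises ModuleNotFoundError when a parent package
            # (e.g. "databricks" for "databricks.sdk") is absent
            found = False
        if not found:
            missing.append(name)
    return missing
```

For this notebook the call would be `missing_modules(["databricks.sdk", "yaml"])`; an empty list means nothing needs to be installed.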

## Step 2: Configuration (EDIT THIS!)

In [None]:
# ============================================================================
# USER CONFIGURATION - Edit these values
# ============================================================================

# Run mode: "test" or "full"
# - test: Process only 1 tile with 10k grid cells for quick validation
# - full: Process all tiles for complete country coverage
RUN_MODE = "test"  # Change to "full" for production run

# Country code (CHANGE THIS for your country)
ISO3 = "IND"

# Databricks settings
CATALOG = "prp_mr_bdap_projects"
SCHEMA = "geospatialsolutions"
VOLUME_BASE = "/Volumes/prp_mr_bdap_projects/geospatialsolutions/external/jrc/data"

# Workspace path (where these scripts are located)
# IMPORTANT: 
# - For Databricks: Use workspace path like "/Workspace/Users/yourname/project/mre/job1"
# - For local development: Use absolute path to the mre/job1 directory
#   Example: "/home/user/code-for-copilot/mre/job1"
WORKSPACE_BASE = "/Workspace/Users/npokkiri@munichre.com/inventory_nos_db/code-for-copilot-main/mre/job1"

# ============================================================================
# Input files from data/ folder
# Just replace NOS_storey_mapping.csv in the data/ folder with your file!
# ============================================================================
PROPORTIONS_CSV = f"{WORKSPACE_BASE}/data/NOS_storey_mapping.csv"
TSI_CSV = f"{WORKSPACE_BASE}/data/tsi.csv"
ADMIN_BOUNDARIES = f"{WORKSPACE_BASE}/data/RMS_Admin0_geozones.json.gz"

# Optional: Email for notifications
EMAIL = "npokkiri@munichre.com"

# Optional: Cluster ID (leave empty to auto-detect)
CLUSTER_ID = ""  # Will auto-detect current cluster if empty

# ============================================================================
# Processing parameters (optional - defaults provided)
# ============================================================================
CELL_SIZE = 2000              # Grid cell size in meters (2km default)
DOWNLOAD_CONCURRENCY = 3      # Parallel tile downloads
MAX_WORKERS = 8               # Raster processing threads
TILE_PARALLELISM = 4          # Concurrent tile processing

# Test mode overrides (automatically set if RUN_MODE="test")
if RUN_MODE.lower() == "test":
    SAMPLE_SIZE = 10000        # Limit to 10k grid cells
    MAX_TILES = 1              # Process only 1 tile
    print("⚠️  TEST MODE: Will process only 1 tile with 10k grid cells")
else:
    SAMPLE_SIZE = None         # No limit - process all
    MAX_TILES = None           # Process all tiles
    print("✅ FULL MODE: Will process all tiles for complete coverage")
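
Note that the mode check above treats any value other than "test" as a full run, so a typo such as "tset" would start full-country processing. A stricter variant is sketched below; the function name `mode_limits` and the returned dictionary shape are hypothetical, chosen only to illustrate the idea.

```python
def mode_limits(run_mode):
    """Map a run mode to its SAMPLE_SIZE / MAX_TILES limits, rejecting unknown modes."""
    mode = run_mode.strip().lower()
    if mode == "test":
        # Quick validation: 1 tile, 10k grid cells
        return {"SAMPLE_SIZE": 10000, "MAX_TILES": 1}
    if mode == "full":
        # None means "no limit"
        return {"SAMPLE_SIZE": None, "MAX_TILES": None}
    raise ValueError(f'RUN_MODE must be "test" or "full", got {run_mode!r}')
```

With this in place, `limits = mode_limits(RUN_MODE)` fails fast on a misspelled mode instead of silently falling through to the full run.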

## Step 3: Initialize & Auto-Detect Cluster

In [None]:
import sys
import os
from pyspark.sql import SparkSession

# Add workspace base to path for helper imports.
# In Databricks, /Workspace/ paths map to the actual filesystem, and local or
# cloned repo paths are used as-is, so no conversion is needed in either case.
actual_path = WORKSPACE_BASE

# Also try to add current working directory (where notebook is running)
current_dir = os.getcwd()
for path in [actual_path, current_dir]:
    if path and path not in sys.path:
        sys.path.insert(0, path)
        print(f"📁 Added to Python path: {path}")

# Initialize Spark
spark = SparkSession.builder.getOrCreate()

# Auto-detect cluster if not specified
if not CLUSTER_ID:
    CLUSTER_ID = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    print(f"🔍 Auto-detected cluster ID: {CLUSTER_ID}")
else:
    print(f"📌 Using specified cluster ID: {CLUSTER_ID}")

print(f"✅ Configuration loaded for {ISO3}")
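
Outside a Databricks cluster (for example in local development) the cluster tag may not be set, in which case the lookup above raises instead of giving a clear message. A hedged sketch of a more defensive lookup follows. `resolve_cluster_id` is a hypothetical helper, and `conf_get` is assumed to be any callable with a `(key, default)` signature, such as `spark.conf.get`.

```python
CLUSTER_ID_KEY = "spark.databricks.clusterUsageTags.clusterId"


def resolve_cluster_id(configured_id, conf_get):
    """Return the configured cluster ID, or auto-detect it via conf_get(key, default)."""
    if configured_id:
        return configured_id
    # An empty-string default avoids an exception when the key is not set
    detected = conf_get(CLUSTER_ID_KEY, "")
    if not detected:
        raise RuntimeError(
            "Could not auto-detect a cluster ID; set CLUSTER_ID explicitly."
        )
    return detected
```

In the notebook this would be used as `CLUSTER_ID = resolve_cluster_id(CLUSTER_ID, spark.conf.get)`.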

## Step 4: Validate Helper Modules

In [None]:
# Validate that helper modules can be imported
print("🔍 Validating helper modules...\n")

try:
    import config_generator
    print("✅ config_generator module found")
except ImportError as e:
    print(f"❌ config_generator import failed: {e}")
    print("\n💡 Troubleshooting:")
    print("   1. Ensure WORKSPACE_BASE points to the directory containing:")
    print("      - config_generator.py")
    print("      - job_creator.py")
    print("      - job_monitor.py")
    print(f"   2. Current WORKSPACE_BASE: {WORKSPACE_BASE}")
    print(f"   3. Current working directory: {os.getcwd()}")
    print(f"   4. Python sys.path: {sys.path[:3]}...")
    raise

try:
    import job_creator
    print("✅ job_creator module found")
except ImportError as e:
    print(f"❌ job_creator import failed: {e}")
    raise

try:
    import job_monitor
    print("✅ job_monitor module found")
except ImportError as e:
    print(f"❌ job_monitor import failed: {e}")
    raise

print("\n✅ All helper modules validated successfully!")
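
When an import fails, the first troubleshooting step is to confirm that the three helper files actually exist under `WORKSPACE_BASE`. A minimal sketch of that check is given below; `missing_helper_files` is a hypothetical name, and the file list is taken from the troubleshooting message above.

```python
from pathlib import Path

# File names as listed in the troubleshooting output
HELPER_FILES = ("config_generator.py", "job_creator.py", "job_monitor.py")


def missing_helper_files(base_dir):
    """Return the helper file names that are not present directly under base_dir."""
    base = Path(base_dir)
    return [name for name in HELPER_FILES if not (base / name).is_file()]
```

An empty result from `missing_helper_files(WORKSPACE_BASE)` suggests the path is correct and the import problem lies elsewhere, for instance in `sys.path` or in an error raised inside a helper module.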

## Step 5: Generate Minimal Config

## Step 6: Create Databricks Job

## Step 7: Run Job & Monitor Progress

## Step 8: Verify Outputs

## Summary

In [None]:
generated_config_path = f"{VOLUME_BASE}/{ISO3}/config.json"

print("="*60)
print("PIPELINE EXECUTION SUMMARY")
print("="*60)
print(f"Country: {ISO3}")
print(f"Job ID: {JOB_ID}")
print(f"Run ID: {RUN_ID}")
print()
print(f"📁 Data Location: {VOLUME_BASE}/{ISO3}")
print(f"📊 Main Output Table: {output_table}")
print(f"📂 Exports: {VOLUME_BASE}/{ISO3}/outputs/exports/{ISO3}/")
print(f"⚙️  Config: {generated_config_path}")
print()
print(f"View job in Databricks UI: Workflows → Jobs → Building_Enrichment_{ISO3}")
print("="*60)