# Backfill Demo: Job Manager

This notebook manages Databricks jobs for the backfill demo.

## Functions:
1. **List Jobs** - Show all existing jobs with a given name
2. **Create/Update Jobs** - Create new jobs or update existing ones
3. **Cleanup** - Delete old/duplicate jobs

In [0]:
# Setup
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings as Job

# Get the notebook context once and reuse it
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()

# Derive the notebook's workspace folder dynamically
notebook_path = ctx.notebookPath().get()
workspace_path = f"/Workspace{notebook_path}"
base_path = workspace_path.rsplit('/', 1)[0]

# Initialize Databricks client
w = WorkspaceClient()

# Get workspace details
workspace_id = ctx.workspaceId().get()

print(f"📁 Base Path: {base_path}")
print(f"🔧 Workspace ID: {workspace_id}")

📁 Base Path: /Workspace/Users/krish.kilaru@lumenalta.com/demos/back_fill
🔧 Workspace ID: 5584109198115548


## Step 1: Setup and Configuration

Initialize the Databricks SDK WorkspaceClient and get workspace context for job management.
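
The path handling in the setup cell can be tried outside Databricks. The following is a minimal sketch, assuming the notebook path is a plain string like the one the notebook context returns; `base_path_for` is a hypothetical helper name used only for illustration, and the example path is made up.

```python
def base_path_for(notebook_path: str) -> str:
    """Return the /Workspace-prefixed folder that contains the notebook."""
    workspace_path = f"/Workspace{notebook_path}"
    # Drop the final path segment (the notebook name) to get its folder.
    return workspace_path.rsplit("/", 1)[0]


# Hypothetical notebook path, for illustration only.
print(base_path_for("/Users/someone@example.com/demos/back_fill/01_job_manager"))
```

Sibling notebooks such as `02_process_data` are then addressed relative to this folder, which is how the job definitions build their `notebook_path` values.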

In [0]:
# Function to list all jobs by name
def list_jobs_by_name(job_name):
    """List all active jobs with the given name"""
    query = f"""
        SELECT DISTINCT job_id, name, creator_id, change_time
        FROM system.lakeflow.jobs 
        WHERE workspace_id = '{workspace_id}' 
          AND name = '{job_name}'
          AND delete_time IS NULL
        ORDER BY change_time DESC
    """
    df = spark.sql(query)
    
    # Verify jobs actually exist by checking with the API
    active_jobs = []
    for row in df.collect():
        try:
            w.jobs.get(job_id=row['job_id'])
            active_jobs.append(row)
        except Exception:
            # The API could not return the job (deleted or inaccessible), skip it
            pass
    
    # Return only verified active jobs
    if active_jobs:
        return spark.createDataFrame(active_jobs)
    else:
        # Return empty dataframe with same schema
        return spark.createDataFrame([], df.schema)
    
def list_jobs_by_name_raw(job_name):
    """List all jobs from metadata (may include deleted/stale jobs)"""
    query = f"""
        SELECT job_id, name, creator_id, change_time, delete_time
        FROM system.lakeflow.jobs 
        WHERE workspace_id = '{workspace_id}' 
          AND name = '{job_name}'
        ORDER BY change_time DESC
    """
    return spark.sql(query)

# Function to delete all jobs with a given name except the latest
def cleanup_duplicate_jobs(job_name, keep_latest=True):
    """Delete duplicate jobs, optionally keeping the latest one"""
    # Sort explicitly: row order is not guaranteed once the DataFrame is rebuilt
    jobs = sorted(list_jobs_by_name(job_name).collect(),
                  key=lambda job: job['change_time'], reverse=True)
    
    if len(jobs) == 0:
        print(f"No jobs found with name: {job_name}")
        return
    
    print(f"Found {len(jobs)} job(s) with name '{job_name}':")
    for job in jobs:
        print(f"  • Job ID: {job['job_id']} | Last changed: {job['change_time']}")
    
    if keep_latest and len(jobs) == 1:
        print("\n✓ Only 1 job found, no cleanup needed")
        return
    
    if keep_latest:
        jobs_to_delete = jobs[1:]  # Skip the first (latest) one
        print(f"\n🗑️ Deleting {len(jobs_to_delete)} old job(s)...")
    else:
        jobs_to_delete = jobs
        print(f"\n🗑️ Deleting all {len(jobs)} job(s)...")
    
    # Delete one job at a time so a single failure does not stop the cleanup
    for job in jobs_to_delete:
        try:
            w.jobs.delete(job_id=job['job_id'])
            print(f"  ✓ Deleted job ID: {job['job_id']}")
        except Exception as e:
            print(f"  ✗ Failed to delete job ID {job['job_id']}: {e}")

# Helper to create a job from a JobSettings object
def _create_job(job_config):
    """Create a job, forwarding only the optional settings that are set"""
    create_kwargs = {
        "name": job_config.name,
        "tasks": job_config.tasks,
    }
    # Settings not listed here (e.g. performance_target) are not passed on create
    for attr in ("max_concurrent_runs", "tags", "queue"):
        value = getattr(job_config, attr, None)
        if value:
            create_kwargs[attr] = value
    
    response = w.jobs.create(**create_kwargs)
    print(f"✓ Job created with ID: {response.job_id}")
    return response.job_id

# Function to create or update job
def create_or_update_job(job_name, job_config, force_create=False):
    """Create a new job or update existing one"""
    # Sort explicitly so jobs[0] is the most recently changed job
    jobs = sorted(list_jobs_by_name(job_name).collect(),
                  key=lambda job: job['change_time'], reverse=True)
    
    if len(jobs) > 0 and not force_create:
        job_id = jobs[0]['job_id']
        print(f"📝 Attempting to update existing job: {job_name} (ID: {job_id})")
        
        try:
            # Verify job still exists by trying to get it
            w.jobs.get(job_id=job_id)
            w.jobs.reset(job_id=job_id, new_settings=job_config)
            print("✓ Job updated successfully")
            return job_id
        except Exception as e:
            print(f"⚠️ Job {job_id} not found or inaccessible: {e}")
            print("➕ Creating new job instead...")
            return _create_job(job_config)
    
    print(f"➕ Creating new job: {job_name}")
    return _create_job(job_config)

print("✓ Helper functions loaded")

✓ Helper functions loaded


## Step 2: Helper Functions

These utility functions help manage Databricks Jobs:

1. **list_jobs_by_name()**: Query and verify active jobs (handles stale metadata)
2. **list_jobs_by_name_raw()**: View all job metadata including deleted jobs
3. **cleanup_duplicate_jobs()**: Remove duplicate job definitions
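4. **create_or_update_job()**: Updates the most recent existing job in place, or creates a new job if none exists, if the update fails, or if `force_create=True`

**Note:** `list_jobs_by_name()` uses dual verification: each row returned from `system.lakeflow.jobs` is checked against the Jobs API (`w.jobs.get`), and jobs the API cannot return are dropped as stale references.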
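
The dual verification idea can be shown without a workspace. This is a minimal sketch, assuming the candidate job IDs have already been read from metadata and that the API lookup raises an exception for a missing job; `verified_job_ids` and `fake_fetch` are hypothetical names, with `fake_fetch` standing in for `w.jobs.get`.

```python
from typing import Callable, Iterable, List


def verified_job_ids(candidate_ids: Iterable[int],
                     fetch_job: Callable[[int], object]) -> List[int]:
    """Keep only the IDs for which fetch_job succeeds."""
    verified = []
    for job_id in candidate_ids:
        try:
            fetch_job(job_id)  # in the notebook this role is played by w.jobs.get
        except Exception:
            # Metadata listed the job, but the API could not return it.
            continue
        verified.append(job_id)
    return verified


# Stand-in for the Jobs API (assumption): only job 2 still exists.
def fake_fetch(job_id: int) -> dict:
    if job_id != 2:
        raise LookupError(f"job {job_id} does not exist")
    return {"job_id": job_id}


print(verified_job_ids([1, 2, 3], fake_fetch))
```

Passing the lookup as a parameter keeps the filtering logic testable without a live client.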

In [0]:
# Cleanup duplicates for process_data job
print("🧹 Cleaning up '02_process_data' jobs...")
cleanup_duplicate_jobs("02_process_data", keep_latest=True)

print("\n" + "="*60 + "\n")

# Cleanup duplicates for orchestrator job
print("🧹 Cleaning up '03_backfill_orchestrator' jobs...")
cleanup_duplicate_jobs("03_backfill_orchestrator", keep_latest=True)

🧹 Cleaning up '02_process_data' jobs...
No jobs found with name: 02_process_data


🧹 Cleaning up '03_backfill_orchestrator' jobs...
No jobs found with name: 03_backfill_orchestrator


## Step 3: Cleanup Duplicate Jobs (Optional)

Run this cell to remove duplicate job definitions and keep only the latest version.

**When to use:**
- After repeatedly testing job creation during development
- To clean up test jobs
- To maintain a single active job version

**Safe to run:** With `keep_latest=True`, this cell keeps the most recently changed job (by `change_time`) and deletes only the older duplicates.
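
The selection rule behind the cleanup can be sketched as a pure function. This is an illustration only, assuming the job IDs are already ordered newest first; `select_jobs_to_delete` is a hypothetical helper and the IDs are made up.

```python
from typing import List, Sequence


def select_jobs_to_delete(job_ids_newest_first: Sequence[int],
                          keep_latest: bool = True) -> List[int]:
    """Return the job IDs that a cleanup pass would delete."""
    if keep_latest:
        # Everything after the first (newest) entry counts as a duplicate.
        return list(job_ids_newest_first[1:])
    return list(job_ids_newest_first)


# Made-up IDs, newest first.
print(select_jobs_to_delete([30, 20, 10]))
print(select_jobs_to_delete([30, 20, 10], keep_latest=False))
```

With a single job and `keep_latest=True` the result is empty, which matches the "no cleanup needed" case.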

In [0]:
# Job 1: Process Data Job
process_data_config = Job.from_dict({
    "name": "02_process_data",
    "max_concurrent_runs": 20,
    "tasks": [{
        "task_key": "process_data",
        "notebook_task": {
            "notebook_path": f"{base_path}/02_process_data",
            "base_parameters": {
                "position_date": "",
            },
            "source": "WORKSPACE",
        },
        "max_retries": 3,
        "min_retry_interval_millis": 60000,
        "retry_on_timeout": True,
        "timeout_seconds": 1800,  # 30 minutes
    }],
    "tags": {
        "app_name": "backfill_demo",
        "environment": "dev"
    },
    "queue": {
        "enabled": True,
    },
    "performance_target": "PERFORMANCE_OPTIMIZED",
})

# Create or update the job
process_job_id = create_or_update_job("02_process_data", process_data_config)
print(f"\n✓ Process Data Job ID: {process_job_id}")

➕ Creating new job: 02_process_data
✓ Job created with ID: 743256224103762

✓ Process Data Job ID: 743256224103762


## Step 4: Define Job Configurations

Configure the two main jobs for the backfill system:

### Job 1: Process Data Job (`02_process_data`)
- **Purpose**: Worker job that processes data for a single position_date
- **Max Concurrent Runs**: 20 (process up to 20 dates in parallel)
- **Retry Settings**: Up to 3 retries with 1-minute intervals
- **Timeout**: 30 minutes per date
- **Parameters**: Accepts `position_date` (YYYY-MM-DD)
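
How the worker notebook validates `position_date` is not shown here. As a hedged sketch, a strict `YYYY-MM-DD` check could look like the following; `parse_position_date` is a hypothetical helper, not part of the notebooks.

```python
import re
from datetime import date

# Exactly four, two, and two ASCII digits separated by hyphens.
_DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_position_date(value: str) -> date:
    """Parse a strict YYYY-MM-DD string or raise ValueError."""
    if not _DATE_PATTERN.fullmatch(value):
        raise ValueError(f"position_date must be YYYY-MM-DD, got {value!r}")
    # fromisoformat also rejects impossible dates such as 2025-02-30.
    return date.fromisoformat(value)


print(parse_position_date("2025-11-21"))
```

The job definition uses an empty string as the default for `position_date`, so a check like this would reject a run that does not supply a date.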

### Job 2: Orchestrator Job (`03_backfill_orchestrator`)
- **Purpose**: Orchestrates backfill operations with business day validation and logging
- **Max Concurrent Runs**: 10 (up to 10 parallel orchestrations)
- **Retry Settings**: Up to 2 retries with 2-minute intervals
- **Timeout**: 1 hour per orchestration
- **Parameters**: `start_date`, `end_date`, and `job_name` are set in the job definition; `retry_metadata` is optional
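
The business day validation lives in the `03_backfill_orchestrator` notebook, which is not shown here. As a rough sketch of the idea, assuming a business day simply means Monday to Friday with no holiday calendar, a date range could be expanded like this; `business_days` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import List


def business_days(start: date, end: date) -> List[date]:
    """Return the Monday-to-Friday dates from start to end, inclusive."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days.append(current)
        current += timedelta(days=1)
    return days


# A Friday-to-Monday range skips the weekend in between.
for day in business_days(date(2025, 11, 21), date(2025, 11, 24)):
    print(day.isoformat())
```

Each resulting date would correspond to one run of the worker job. A real deployment would likely need a holiday calendar as well.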

In [0]:
# Job 2: Backfill Orchestrator Job
orchestrator_config = Job.from_dict({
    "name": "03_backfill_orchestrator",
    "max_concurrent_runs": 10,
    "tasks": [{
        "task_key": "orchestrate_backfill",
        "notebook_task": {
            "notebook_path": f"{base_path}/03_backfill_orchestrator",
            "base_parameters": {
                "start_date": "",
                "end_date": "",
                "job_name": "02_process_data",
            },
            "source": "WORKSPACE",
        },
        "max_retries": 2,
        "min_retry_interval_millis": 120000,
        "retry_on_timeout": False,
        "timeout_seconds": 3600,  # 1 hour
    }],
    "tags": {
        "app_name": "backfill_demo",
        "environment": "dev"
    },
    "queue": {
        "enabled": True,
    },
    "performance_target": "PERFORMANCE_OPTIMIZED",
})

# Create or update the job
orchestrator_job_id = create_or_update_job("03_backfill_orchestrator", orchestrator_config)
print(f"\n✓ Backfill Orchestrator Job ID: {orchestrator_job_id}")

➕ Creating new job: 03_backfill_orchestrator
✓ Job created with ID: 729227138192618

✓ Backfill Orchestrator Job ID: 729227138192618


## Step 5: Create/Update Process Data Job

Create or update the `02_process_data` job with the defined configuration.

## Step 6: Create/Update Orchestrator Job

Create or update the `03_backfill_orchestrator` job with the defined configuration.

In [0]:
# Verify final state
import time

print("📋 Final Job Status:\n")

while True:
    process_jobs_df = list_jobs_by_name("02_process_data")
    orchestrator_jobs_df = list_jobs_by_name("03_backfill_orchestrator")
    
    process_count = process_jobs_df.count()
    orchestrator_count = orchestrator_jobs_df.count()  
    
    if process_count > 0 and orchestrator_count > 0:
        print("1. Process Data Jobs:")
        display(process_jobs_df)
        
        print("\n2. Backfill Orchestrator Jobs:")
        display(orchestrator_jobs_df)

        break

    print("\n⏳ Waiting for jobs to appear... Retrying in 60 seconds.\n")
    time.sleep(60)

📋 Final Job Status:


⏳ Waiting for jobs to appear... Retrying in 60 seconds.


⏳ Waiting for jobs to appear... Retrying in 60 seconds.

1. Process Data Jobs:


job_id,name,creator_id,change_time
743256224103762,02_process_data,75053393473744,2025-11-21T05:39:37.396Z



2. Backfill Orchestrator Jobs:


job_id,name,creator_id,change_time
729227138192618,03_backfill_orchestrator,75053393473744,2025-11-21T05:39:50.099Z


## Step 7: Verify Job Creation

Query and display the created jobs to verify successful setup.

**Note:** Newly created jobs may take a minute or more to appear in `system.lakeflow.jobs`. This cell retries every 60 seconds until both jobs are listed.
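
The verification cell keeps retrying until both jobs are visible. If an upper bound on the wait is preferred, a bounded polling helper is one option. The following is a minimal sketch, not the notebook's implementation; `wait_until` and its parameters are hypothetical, and the `sleep` argument exists only so the helper can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               interval_seconds: float = 60.0,
               max_attempts: int = 10,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition until it returns True or max_attempts checks were made."""
    for attempt in range(1, max_attempts + 1):
        if condition():
            return True
        if attempt < max_attempts:
            # Wait before the next check, but not after the final one.
            sleep(interval_seconds)
    return False
```

In the notebook, the condition could be a function that checks whether both `list_jobs_by_name(...)` results are non-empty, and a `False` return value could be turned into an error message instead of waiting indefinitely.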

## Job Management Complete! ✓

**Available Jobs:**
- ✅ **02_process_data** - Processes data for a single position date
- ✅ **03_backfill_orchestrator** - Orchestrates backfill across multiple dates

**Usage:**
- Update the job definitions (`process_data_config` and `orchestrator_config`) and rerun their cells to apply changes
- Rerun the cleanup cell (`cleanup_duplicate_jobs`) to remove duplicates anytime
- Job IDs are displayed after creation/update