# Backfill Demo: Retry Failed Jobs

This notebook identifies and retries failed backfill jobs from the log table.

## What It Does:
1. Queries the backfill log table for failed jobs
2. Optionally filters by date range or age
3. Re-triggers the orchestrator for each failed position date
4. Passes retry metadata along so that each new run is recorded in the log table

## Usage:
- **Schedule Daily:** Run this notebook every day to automatically retry yesterday's failures
- **Manual Retry:** Adjust parameters to retry specific date ranges
- **Batch Retry:** Can process multiple failures in one execution

In [0]:
# Setup and Configuration
import sys
from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Get current notebook path dynamically
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
workspace_path = f"/Workspace{notebook_path}"
base_path = workspace_path.rsplit('/', 1)[0]

# Add to sys.path for importing config
sys.path.append(base_path)

from config import BACKFILL_LOG_TABLE, print_config

# Display configuration
print_config()
print(f"\n📋 Log Table: {BACKFILL_LOG_TABLE}")

📁 Catalog: demos
📂 Schema: backfill_demo
📊 Tables:
  • Source: demos.backfill_demo.source_data
  • Destination: demos.backfill_demo.destination_data
  • Calendar: demos.backfill_demo.calendar
  • Backfill Log: demos.backfill_demo.backfill_log

📋 Log Table: demos.backfill_demo.backfill_log


## Step 1: Configuration Setup

Load the centralized configuration and display the log table being queried for failed jobs.

In [0]:
# Parameters - Customize based on your retry strategy

# Option 1: Retry failures from the last N days
dbutils.widgets.text("retry_days_back", "1", "Days Back to Check (0=all)")
retry_days_back = int(dbutils.widgets.get("retry_days_back"))

# Option 2: Specific date range (leave empty to use retry_days_back)
dbutils.widgets.text("start_date", "", "Start Date (YYYY-MM-DD, optional)")
dbutils.widgets.text("end_date", "", "End Date (YYYY-MM-DD, optional)")

start_date_str = dbutils.widgets.get("start_date")
end_date_str = dbutils.widgets.get("end_date")

# Option 3: Max number of retries to process in one run
dbutils.widgets.text("max_retries", "10", "Max Jobs to Retry")
max_retries = int(dbutils.widgets.get("max_retries"))

print(f"🔧 Retry Configuration:")
print(f"  • Days Back: {retry_days_back if retry_days_back > 0 else 'All'}")
print(f"  • Date Range: {start_date_str or 'Not specified'} to {end_date_str or 'Not specified'}")
print(f"  • Max Retries: {max_retries}")

🔧 Retry Configuration:
  • Days Back: 1
  • Date Range: Not specified to Not specified
  • Max Retries: 10


## Step 2: Configure Retry Parameters

Customize the retry behavior using widgets:

**retry_days_back**: Number of days to look back for failures (0 = all time)
- Set to `1` for daily scheduled retries (yesterday's failures)
- Set to `7` for weekly cleanup

**start_date / end_date**: Specific date range (optional; when both are set, they override retry_days_back)
- Use for targeted retries of specific periods

**max_retries**: Maximum number of jobs to retry in one execution
- Prevents overwhelming the system with too many simultaneous retries
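
The precedence between these parameters can be sketched as a small pure function. This is a minimal illustration, not part of the notebook: `resolve_retry_window` and its `today` argument are hypothetical names, and it assumes dates arrive as `YYYY-MM-DD` strings as the widget labels suggest.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Tuple


def resolve_retry_window(
    retry_days_back: int,
    start_date: str = "",
    end_date: str = "",
    today: Optional[date] = None,
) -> Tuple[Optional[date], Optional[date]]:
    """Return (start, end) as inclusive dates; None means unbounded on that side."""
    today = today or date.today()
    if start_date and end_date:
        # An explicit range wins over retry_days_back; strptime rejects malformed input
        start = datetime.strptime(start_date, "%Y-%m-%d").date()
        end = datetime.strptime(end_date, "%Y-%m-%d").date()
        if start > end:
            raise ValueError("start_date must not be after end_date")
        return start, end
    if retry_days_back > 0:
        # Look back N days from today, with no upper bound
        return today - timedelta(days=retry_days_back), None
    # retry_days_back == 0 means all time
    return None, None


print(resolve_retry_window(1, today=date(2025, 11, 28)))
print(resolve_retry_window(1, "2025-01-01", "2025-01-07"))
```

Parsing the strings into `date` objects up front also means a mistyped widget value fails early instead of reaching the SQL query.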

In [0]:
# Query Failed Jobs from Log Table

# Check if retry_metadata column exists
table_schema = spark.table(BACKFILL_LOG_TABLE).schema
has_retry_metadata = "retry_metadata" in [field.name for field in table_schema.fields]

print(f"📊 Table Schema Check:")
print(f"  • retry_metadata column exists: {has_retry_metadata}")

# Build the WHERE clause based on parameters
where_clauses = ["status = 'FAILED'"]

# Only filter by retry_metadata if the column exists
if has_retry_metadata:
    where_clauses.append("retry_metadata IS NULL")  # Only get original failed runs, not failed retries
    print(f"  • Filtering: Only original failed runs (retry_metadata IS NULL)")
else:
    print(f"  • Warning: retry_metadata column not found. Run 03_backfill_orchestrator.ipynb cell 7 to add it.")
    print(f"  • Filtering: All failed runs (cannot distinguish originals from retries)")

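if start_date_str and end_date_str:
    # Validate the widget values before interpolating them into SQL
    start_date = datetime.strptime(start_date_str, '%Y-%m-%d').date()
    end_date = datetime.strptime(end_date_str, '%Y-%m-%d').date()
    # Half-open range so runs that started at any time on end_date are included
    # (BETWEEN would stop at midnight at the start of end_date)
    end_exclusive = end_date + timedelta(days=1)
    where_clauses.append(f"start_time >= '{start_date}' AND start_time < '{end_exclusive}'")
elif retry_days_back > 0:
    # Calculate date threshold (midnight, retry_days_back days ago)
    cutoff_date = (datetime.now() - timedelta(days=retry_days_back)).strftime('%Y-%m-%d')
    where_clauses.append(f"start_time >= '{cutoff_date}'")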

where_clause = " AND ".join(where_clauses)

# Query for failed jobs - get original runs with retry count
if has_retry_metadata:
    # Full query with retry tracking
    query = f"""
        WITH failed_originals AS (
            SELECT 
                position_date,
                job_name,
                job_id,
                run_id as original_run_id,
                status,
                end_time,
                error_message
            FROM {BACKFILL_LOG_TABLE}
            WHERE {where_clause}
        ),
        retry_counts AS (
            SELECT 
                position_date,
                COUNT(*) - 1 as current_retry_count
            FROM {BACKFILL_LOG_TABLE}
            WHERE position_date IN (SELECT position_date FROM failed_originals)
            GROUP BY position_date
        )
        SELECT 
            f.position_date,
            f.job_name,
            f.job_id,
            f.original_run_id,
            f.status,
            f.end_time,
            f.error_message,
            COALESCE(r.current_retry_count, 0) as retry_count
        FROM failed_originals f
        LEFT JOIN retry_counts r ON f.position_date = r.position_date
        ORDER BY f.position_date DESC
        LIMIT {max_retries}
    """
else:
    # Simplified query without retry tracking
    query = f"""
        SELECT 
            position_date,
            job_name,
            job_id,
            run_id as original_run_id,
            status,
            end_time,
            error_message,
            0 as retry_count
        FROM {BACKFILL_LOG_TABLE}
        WHERE {where_clause}
        ORDER BY position_date DESC
        LIMIT {max_retries}
    """

print(f"\n🔍 Searching for failed jobs...\n")
print(f"Query: {query}\n")

failed_jobs_df = spark.sql(query)
failed_jobs = failed_jobs_df.collect()

if len(failed_jobs) == 0:
    print("✅ No failed jobs found! Nothing to retry.")
    dbutils.notebook.exit("No failed jobs to retry")
else:
    print(f"📋 Found {len(failed_jobs)} failed job(s) to retry:\n")
    display(failed_jobs_df)

📊 Table Schema Check:
  • retry_metadata column exists: True
  • Filtering: Only original failed runs (retry_metadata IS NULL)

🔍 Searching for failed jobs...

Query: 
        WITH failed_originals AS (
            SELECT 
                position_date,
                job_name,
                job_id,
                run_id as original_run_id,
                status,
                end_time,
                error_message
            FROM demos.backfill_demo.backfill_log
            WHERE status = 'FAILED' AND retry_metadata IS NULL AND start_time >= '2025-11-27'
        ),
        retry_counts AS (
            SELECT 
                position_date,
                COUNT(*) - 1 as current_retry_count
            FROM demos.backfill_demo.backfill_log
            WHERE position_date IN (SELECT position_date FROM failed_originals)
            GROUP BY position_date
        )
        SELECT 
            f.position_date,
            f.job_name,
            f.job_id,
            f.original_

position_date,job_name,job_id,original_run_id,status,end_time,error_message,retry_count
2025-01-07,02_process_data,743256224103762,813573629431429,FAILED,2025-11-28T04:32:38.285Z,,0
2025-01-06,02_process_data,743256224103762,215057038482639,FAILED,2025-11-28T04:32:53.618Z,,0


## Step 3: Query Failed Jobs

Query the backfill log table for failed jobs matching the criteria:

**Logic:**
1. Check if `retry_metadata` column exists (backward compatible)
2. Filter for `status = 'FAILED'` jobs
3. If retry_metadata exists, only get original failures (not failed retries)
4. Calculate retry count for each position_date
5. Apply date filters based on parameters
6. Limit results to max_retries

**Retry Tracking:**
- Original runs: `retry_metadata IS NULL`
- Retry runs: `retry_metadata IS NOT NULL`
- Retry count: Number of logged runs for the position_date, minus one for the original run
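
The selection and counting logic can be traced on a few in-memory rows. The sketch below is plain Python rather than Spark, and the sample rows are hypothetical; it only mirrors the rules listed above (original failures have no retry metadata, retry count is the number of logged runs for the date minus one).

```python
from collections import Counter

# Hypothetical log rows: (position_date, status, retry_metadata)
log_rows = [
    ("2025-01-06", "FAILED", None),                  # original run
    ("2025-01-06", "FAILED", '{"is_retry": true}'),  # failed retry
    ("2025-01-07", "FAILED", None),                  # original run, never retried
    ("2025-01-08", "SUCCESS", None),                 # not a failure
]


def failed_originals_with_retry_count(rows):
    """Return (position_date, retry_count) for each original failed run."""
    # Count every logged run per position_date, originals and retries alike
    runs_per_date = Counter(position_date for position_date, _, _ in rows)
    return [
        (position_date, runs_per_date[position_date] - 1)
        for position_date, status, retry_metadata in rows
        if status == "FAILED" and retry_metadata is None
    ]


candidates = failed_originals_with_retry_count(log_rows)
print(candidates)  # [('2025-01-06', 1), ('2025-01-07', 0)]
```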

In [0]:
# Retry Logic - Trigger Orchestrator for Each Failed Date
from databricks.sdk import WorkspaceClient
import time

w = WorkspaceClient()

# Get the current workspace ID to scope the job lookup
# (the orchestrator job is assumed to be named "03_backfill_orchestrator")
workspace_id = dbutils.entry_point.getDbutils().notebook().getContext().workspaceId().get()

# Find the orchestrator job ID
orchestrator_query = f"""
    SELECT job_id, name
    FROM system.lakeflow.jobs
    WHERE workspace_id = '{workspace_id}'
      AND name = '03_backfill_orchestrator'
      AND delete_time IS NULL
    ORDER BY change_time DESC
    LIMIT 1
"""

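# first() returns None when nothing matches, so one Spark action covers both the check and the fetch
orchestrator_row = spark.sql(orchestrator_query).first()
if orchestrator_row is None:
    print("❌ Orchestrator job '03_backfill_orchestrator' not found!")
    print("Please run 04_job_manager.ipynb first to create the orchestrator job.")
    dbutils.notebook.exit("Orchestrator job not found")

# The system table may return job_id as a string; run_now expects an integer
orchestrator_job_id = int(orchestrator_row['job_id'])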
print(f"🎯 Using Orchestrator Job ID: {orchestrator_job_id}\n")

# Track retry results
retry_results = []

print(f"🔄 Starting retry process for {len(failed_jobs)} failed job(s)...\n")
print("=" * 80)

🎯 Using Orchestrator Job ID: 729227138192618

🔄 Starting retry process for 2 failed job(s)...



## Step 4: Setup Orchestrator Connection

Find the `03_backfill_orchestrator` job ID to trigger retries.

**Note:** This requires that jobs have been created using `04_job_manager.ipynb`.
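
Where the `system.lakeflow.jobs` table is not available, the job ID could instead be looked up through the Databricks SDK. The sketch below assumes `WorkspaceClient.jobs.list` accepts a `name` filter and yields objects with `job_id` and `settings.name`; `find_job_id` is a hypothetical helper, and a stand-in client is used so the example runs outside Databricks.

```python
from types import SimpleNamespace


def find_job_id(client, job_name):
    """Return the ID of the first job whose name matches exactly, or None."""
    for job in client.jobs.list(name=job_name):
        # Re-check the name locally in case the server-side filter is looser
        if job.settings is not None and job.settings.name == job_name:
            return job.job_id
    return None


# Stand-in client for illustration; in a notebook, pass WorkspaceClient() instead
fake_job = SimpleNamespace(
    job_id=123,
    settings=SimpleNamespace(name="03_backfill_orchestrator"),
)
fake_client = SimpleNamespace(
    jobs=SimpleNamespace(list=lambda name=None: iter([fake_job]))
)

print(find_job_id(fake_client, "03_backfill_orchestrator"))  # 123
print(find_job_id(fake_client, "missing_job"))               # None
```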

In [0]:
# Execute Retries
import json

for idx, failed_job in enumerate(failed_jobs, 1):
    position_date = str(failed_job['position_date'])
    original_error = failed_job['error_message']
    original_job_name = failed_job['job_name']
    original_job_id = failed_job['job_id']
    original_run_id = failed_job['original_run_id']
    current_retry_count = failed_job['retry_count']
    
    # Build retry metadata JSON
    retry_metadata = {
        "is_retry": True,
        "original_run_id": original_run_id,
        "retry_count": current_retry_count + 1,
        "retry_triggered_by": "05_retry_notebook",
        "retry_triggered_at": datetime.now().isoformat()
    }
    retry_metadata_json = json.dumps(retry_metadata)
    
    print(f"\n[{idx}/{len(failed_jobs)}] Retrying: {position_date}")
    print(f"  Original Job: {original_job_name} (ID: {original_job_id})")
    print(f"  Original Run ID: {original_run_id}")
    print(f"  Retry Attempt: #{current_retry_count + 1}")
    print(f"  Original Error: {original_error}")
    
    try:
        # Trigger the orchestrator job with retry metadata
        response = w.jobs.run_now(
            job_id=orchestrator_job_id,
            notebook_params={
                "position_date": position_date,
                "job_name": original_job_name,
                "retry_metadata": retry_metadata_json
            }
        )
        
        retry_run_id = response.run_id
        print(f"  ✓ Retry triggered successfully")
        print(f"  ↳ New Run ID: {retry_run_id}")
        
        # Store result
        retry_results.append({
            "position_date": position_date,
            "retry_status": "TRIGGERED",
            "retry_run_id": str(retry_run_id),
            "original_error": original_error
        })
        
        # Small delay to avoid overwhelming the API
        time.sleep(2)
        
    except Exception as e:
        error_msg = str(e)
        print(f"  ✗ Failed to trigger retry: {error_msg}")
        
        retry_results.append({
            "position_date": position_date,
            "retry_status": "TRIGGER_FAILED",
            "retry_run_id": None,
            "original_error": original_error,
            "retry_error": error_msg
        })

print("\n" + "=" * 80)
print(f"\n✅ Retry process complete!\n")


[1/2] Retrying: 2025-01-07
  Original Job: 02_process_data (ID: 743256224103762)
  Original Run ID: 813573629431429
  Retry Attempt: #1
  Original Error: None
  ✓ Retry triggered successfully
  ↳ New Run ID: 1023773776942719

[2/2] Retrying: 2025-01-06
  Original Job: 02_process_data (ID: 743256224103762)
  Original Run ID: 215057038482639
  Retry Attempt: #1
  Original Error: None
  ✓ Retry triggered successfully
  ↳ New Run ID: 687943302127233


✅ Retry process complete!



## Step 5: Execute Retries

For each failed job, this cell:

1. **Builds Retry Metadata JSON:**
   ```json
   {
     "is_retry": true,
     "original_run_id": "<original_run_id>",
     "retry_count": <attempt_number>,
     "retry_triggered_by": "05_retry_notebook",
     "retry_triggered_at": "<timestamp>"
   }
   ```

2. **Triggers Orchestrator Job:**
   - Passes `position_date`, `job_name`, and `retry_metadata` as parameters
   - Orchestrator validates business day and logs the retry

3. **Tracks Results:**
   - Stores triggered run IDs
   - Captures any trigger failures
   - Adds 2-second delay between triggers to avoid API rate limits

**Retry Workflow:**
```
05_retry_failed_jobs.ipynb 
  → triggers → 03_backfill_orchestrator.ipynb (with retry metadata)
    → triggers → 02_process_data.ipynb
      → logs → backfill_log (with retry_metadata populated)
```
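
The metadata round trip can be sketched in isolation. `build_retry_metadata` mirrors the payload shown above, while `parse_retry_metadata` is a hypothetical receiving-side helper (the orchestrator's actual parsing is not shown in this notebook). One deliberate difference from the cell above: the timestamp here is timezone-aware UTC rather than a naive local time.

```python
import json
from datetime import datetime, timezone


def build_retry_metadata(original_run_id, current_retry_count, triggered_at=None):
    """Serialize retry metadata to a JSON string, since notebook parameters are strings."""
    triggered_at = triggered_at or datetime.now(timezone.utc)
    return json.dumps({
        "is_retry": True,
        "original_run_id": original_run_id,
        "retry_count": current_retry_count + 1,
        "retry_triggered_by": "05_retry_notebook",
        "retry_triggered_at": triggered_at.isoformat(),
    })


def parse_retry_metadata(raw):
    """Hypothetical receiver: an empty parameter is treated as an original run."""
    return json.loads(raw) if raw else None


payload = build_retry_metadata(
    813573629431429, 0, datetime(2025, 11, 28, tzinfo=timezone.utc)
)
parsed = parse_retry_metadata(payload)
print(parsed["retry_count"], parsed["retry_triggered_at"])
```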

In [0]:
# Summary Report

# Convert results to DataFrame for easy viewing
from pyspark.sql.types import StructType, StructField, StringType

retry_schema = StructType([
    StructField("position_date", StringType(), False),
    StructField("retry_status", StringType(), False),
    StructField("retry_run_id", StringType(), True),
    StructField("original_error", StringType(), True)
])

# Create summary DataFrame
summary_data = [{
    "position_date": r["position_date"],
    "retry_status": r["retry_status"],
    "retry_run_id": r.get("retry_run_id"),
    "original_error": r.get("original_error")
} for r in retry_results]

summary_df = spark.createDataFrame(summary_data, schema=retry_schema)

# Display summary
print("📊 Retry Summary:\n")
display(summary_df)

# Statistics
triggered_count = len([r for r in retry_results if r["retry_status"] == "TRIGGERED"])
failed_count = len([r for r in retry_results if r["retry_status"] == "TRIGGER_FAILED"])

print(f"\n📈 Statistics:")
print(f"  • Total Failed Jobs Found: {len(failed_jobs)}")
print(f"  • Successfully Triggered: {triggered_count}")
print(f"  • Failed to Trigger: {failed_count}")

if triggered_count > 0:
    # Build the quoted run ID list outside the f-string to avoid nesting quotes inside it
    triggered_run_ids = ', '.join([f"'{r['retry_run_id']}'" for r in retry_results if r['retry_status'] == 'TRIGGERED'])
    print(f"\n💡 Tip: Check the backfill log table to monitor the status of these retry runs.")
    print(f"   Query: SELECT * FROM {BACKFILL_LOG_TABLE} WHERE backfill_job_id IN ({triggered_run_ids})")

📊 Retry Summary:



position_date,retry_status,retry_run_id,original_error
2025-01-07,TRIGGERED,1023773776942719,
2025-01-06,TRIGGERED,687943302127233,



📈 Statistics:
  • Total Failed Jobs Found: 2
  • Successfully Triggered: 2
  • Failed to Trigger: 0

💡 Tip: Check the backfill log table to monitor the status of these retry runs.
   Query: SELECT * FROM demos.backfill_demo.backfill_log WHERE backfill_job_id IN ('1023773776942719', '687943302127233')


## Step 6: Display Retry Summary

Show results of the retry process:
- Position dates retried
- Retry status (TRIGGERED or TRIGGER_FAILED)
- New run IDs for monitoring
- Original error messages for context

**Statistics:**
- Total failed jobs found
- Successfully triggered retries
- Failed to trigger (if any)

**Next Steps:**
- Monitor the backfill log table for retry outcomes
- Check if retries succeeded or need further investigation
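
As an alternative to querying the log table, the triggered runs could be polled directly. This is a sketch under assumptions: it relies on `WorkspaceClient.jobs.get_run` returning an object with `state.life_cycle_state` and `state.result_state` (newer API versions may expose this under a different field), `wait_for_runs` is a hypothetical helper, and a stand-in client keeps the example runnable outside Databricks.

```python
import time
from types import SimpleNamespace

# Life cycle states after which a run will not change any further
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def _name(value):
    # SDK enums expose .value; plain strings and None pass through unchanged
    return getattr(value, "value", value)


def wait_for_runs(client, run_ids, poll_seconds=30, max_polls=20):
    """Poll each run until it finishes; return {run_id: result_state} for finished runs."""
    pending = set(run_ids)
    outcomes = {}
    for _ in range(max_polls):
        for run_id in sorted(pending):  # sorted() copies, so discarding below is safe
            state = client.jobs.get_run(run_id=run_id).state
            if _name(state.life_cycle_state) in TERMINAL_STATES:
                outcomes[run_id] = _name(state.result_state)
                pending.discard(run_id)
        if not pending:
            break
        time.sleep(poll_seconds)
    return outcomes


# Stand-in client for illustration; in a notebook, pass WorkspaceClient() instead
fake_state = SimpleNamespace(life_cycle_state="TERMINATED", result_state="SUCCESS")
fake_client = SimpleNamespace(
    jobs=SimpleNamespace(get_run=lambda run_id: SimpleNamespace(state=fake_state))
)

outcomes = wait_for_runs(fake_client, [1023773776942719, 687943302127233], poll_seconds=0)
print(outcomes)
```

Runs still pending after `max_polls` are simply absent from the result, which leaves the decision about what to do with them to the caller.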

## Retry Process Complete! ✅

### What Happened:
- ✅ Queried log table for failed jobs
- ✅ Triggered orchestrator for each failed position date
- ✅ New runs will be tracked in the log table with updated run IDs

### Scheduling This Notebook:

**Option 1: Daily Scheduled Job**
```python
# Run every day at 9 AM to retry yesterday's failures
# Set parameters: retry_days_back = 1
```

**Option 2: Weekly Cleanup**
```python
# Run weekly to retry all failures from the past 7 days
# Set parameters: retry_days_back = 7
```

**Option 3: Manual Investigation**
```python
# Run manually with specific date range
# Set parameters: start_date, end_date
```
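
For the daily option, the job definition might look like the following Jobs API settings payload. This is a sketch, not the output of `04_job_manager.ipynb`: the job name, task key, and notebook path are hypothetical, compute configuration is omitted, and the schedule uses Quartz cron syntax in UTC.

```python
import json


def daily_retry_job_settings(notebook_path, retry_days_back=1, max_retries=10):
    """Build a Jobs API settings dict that runs the retry notebook daily at 9 AM."""
    return {
        "name": "05_retry_failed_jobs_daily",  # hypothetical job name
        "schedule": {
            # Quartz fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": "0 0 9 * * ?",
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
        "tasks": [{
            "task_key": "retry_failed_jobs",  # hypothetical task key
            "notebook_task": {
                "notebook_path": notebook_path,
                # Widget values are passed as strings
                "base_parameters": {
                    "retry_days_back": str(retry_days_back),
                    "max_retries": str(max_retries),
                },
            },
        }],
    }


# Hypothetical notebook path; substitute the real workspace location
settings = daily_retry_job_settings("/Workspace/path/to/05_retry_failed_jobs")
print(json.dumps(settings, indent=2))
```

For the weekly option, the same function could be called with `retry_days_back=7` and a cron expression restricted to one day of the week.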

### Monitoring:
- Check the backfill log table to see the status of retry runs
- Each retry will have a new `backfill_job_id` in the log
- Original failed runs remain in the log for audit purposes

---
*Tip: You can create a Databricks Job for this notebook using `04_job_manager.ipynb` to automate daily retries.*

## 📊 Retry Tracking Queries

With retry metadata recorded in the log table, the following queries analyze retry behavior:

### **See all retry attempts for a position date:**
```sql
SELECT 
    run_id,
    status,
    start_time,
    end_time,
    retry_metadata:is_retry as is_retry,
    retry_metadata:retry_count as retry_attempt,
    retry_metadata:original_run_id as original_run_id,
    error_message
FROM demos.backfill_demo.backfill_log
WHERE position_date = '2025-01-15'
ORDER BY start_time;
```

### **Find jobs that succeeded after retries:**
```sql
SELECT 
    position_date,
    COUNT(*) as total_attempts,
    MAX(CAST(retry_metadata:retry_count AS INT)) as max_retry_count,
    MIN(CASE WHEN status = 'FAILED' THEN start_time END) as first_failure,
    MAX(CASE WHEN status = 'SUCCESS' THEN start_time END) as final_success
FROM demos.backfill_demo.backfill_log
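GROUP BY position_date
-- Keep only dates where a retry run (retry_metadata populated) succeeded;
-- original runs stay in the aggregate so first_failure reflects the original failure
HAVING MAX(CASE WHEN status = 'SUCCESS' AND retry_metadata IS NOT NULL THEN 1 ELSE 0 END) = 1;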
```

### **Retry success rate:**
```sql
SELECT 
    CAST(retry_metadata:retry_count AS INT) as retry_attempt,
    COUNT(*) as total_retries,
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as successful,
    ROUND(100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate_pct
FROM demos.backfill_demo.backfill_log
WHERE retry_metadata IS NOT NULL
GROUP BY CAST(retry_metadata:retry_count AS INT)
ORDER BY retry_attempt;
```
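
To check the success-rate logic on a small sample without a cluster, the same aggregation can be written in plain Python. The rows below are hypothetical, and `json.loads` stands in for the `retry_metadata:retry_count` extraction used in the SQL.

```python
import json
from collections import defaultdict

# Hypothetical (status, retry_metadata) pairs from the log table
rows = [
    ("FAILED", None),                                    # original run, excluded
    ("FAILED", '{"is_retry": true, "retry_count": 1}'),
    ("SUCCESS", '{"is_retry": true, "retry_count": 1}'),
    ("SUCCESS", '{"is_retry": true, "retry_count": 2}'),
]


def retry_success_rates(rows):
    """Return {retry_attempt: success_rate_pct} over retry runs only."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for status, raw in rows:
        if raw is None:
            continue  # matches WHERE retry_metadata IS NOT NULL
        attempt = int(json.loads(raw)["retry_count"])
        totals[attempt] += 1
        successes[attempt] += status == "SUCCESS"
    return {a: round(100.0 * successes[a] / totals[a], 2) for a in sorted(totals)}


rates = retry_success_rates(rows)
print(rates)  # {1: 50.0, 2: 100.0}
```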