# Orchestration with Databricks Jobs

Automate and schedule data pipelines, notebooks, and workflows with Databricks Jobs.

## What You'll Learn

✅ Create and configure Databricks Jobs  
✅ Schedule workflows with cron expressions  
✅ Implement file arrival triggers  
✅ Build multi-task dependencies  
✅ Monitor and debug job runs  

---

## Why Orchestration?

**Manual execution** doesn't scale.

**Databricks Jobs** provide:
- Scheduled execution
- Event-driven triggers
- Task dependencies
- Error handling and retries
- Centralized monitoring

---

## Table of Contents

1. [Jobs Overview](#overview)
2. [Creating Jobs](#creating)
3. [Scheduling](#scheduling)
4. [File Arrival Triggers](#triggers)
5. [Task Dependencies](#dependencies)
6. [Monitoring and Debugging](#monitoring)

---

**References:**
- [Jobs Documentation](https://docs.databricks.com/aws/en/jobs/)
- [Jobs Quickstart](https://docs.databricks.com/aws/en/jobs/jobs-quickstart)
- [Triggers](https://docs.databricks.com/aws/en/jobs/triggers)
- [File Arrival Triggers](https://docs.databricks.com/aws/en/jobs/file-arrival-triggers)

## 1. Jobs Overview <a id="overview"></a>

### Job Components

**1. Tasks**: Individual units of work
- Notebook execution
- SQL queries
- Python scripts
- JAR files
- Delta Live Tables pipelines

**2. Clusters**: Compute resources
- Job clusters (ephemeral)
- All-purpose clusters (shared)
- Serverless (recommended)

**3. Schedules**: When jobs run
- Cron expressions
- Continuous
- Triggered by events

**4. Parameters**: Dynamic inputs
- Notebook parameters
- SQL parameters
- Environment variables

### Job Types

**Batch Jobs:**
- Run on schedule
- Process historical data
- Generate reports

**Streaming Jobs:**
- Run continuously
- Process real-time data
- Low latency

**Event-Driven:**
- Triggered by file arrival
- API calls from external systems (for example, a webhook handler that calls the Jobs API)
- Manual triggers

---

## 2. Creating Jobs <a id="creating"></a>

### Quick Start: Single Task Job

**Step 1: Create Job**
1. Click **Workflows** → **Jobs**
2. Click **Create Job**
3. Name: "Daily IoT Data Pipeline"

**Step 2: Add Task**
```
Task Name: ingest_sensor_data
Type: Notebook
Notebook Path: /Day 1/6 Data Transformation
Cluster: New job cluster (2 workers, i3.xlarge)
```

**Step 3: Configure Schedule**
```
Trigger: Scheduled
Schedule: 0 0 6 * * ? (6 AM daily, Quartz cron syntax)
Timezone: America/Los_Angeles
```

**Step 4: Save and Run**
- Click **Save**
- Click **Run Now** to test
- View run details
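
The same job can also be described as a request body for the Jobs REST API (`POST /api/2.1/jobs/create`). The sketch below only builds that payload as a Python dictionary and prints it; it does not call the API. The `spark_version` value is an assumption, so substitute a runtime your workspace offers. The API expects Quartz cron syntax, which has a leading seconds field, so 6 AM daily is written `0 0 6 * * ?`.

```python
import json


def build_daily_iot_job() -> dict:
    """Build a Jobs API payload for the single-task job described above."""
    return {
        "name": "Daily IoT Data Pipeline",
        "tasks": [
            {
                "task_key": "ingest_sensor_data",
                "notebook_task": {
                    "notebook_path": "/Day 1/6 Data Transformation",
                },
                "new_cluster": {
                    # Assumed runtime version; pick one available in your workspace.
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 2,
                },
            }
        ],
        "schedule": {
            # Quartz fields: seconds minutes hours day-of-month month day-of-week
            "quartz_cron_expression": "0 0 6 * * ?",
            "timezone_id": "America/Los_Angeles",
            "pause_status": "UNPAUSED",
        },
    }


job_payload = build_daily_iot_job()
print(json.dumps(job_payload, indent=2))
```

Keeping the definition in code like this makes it easy to review and version alongside the notebooks it runs.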

### Multi-Task Workflow

**IoT Processing Pipeline:**
```
Task 1: Ingest Data
  ↓
Task 2: Data Quality Checks
  ↓
Task 3: Transform to Silver
  ↓
Task 4: Aggregate to Gold
  ↓
Task 5: Send Email Summary
```

**Configuration:**
```json
{
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/pipelines/01_ingest"
      }
    },
    {
      "task_key": "quality_check",
      "notebook_task": {
        "notebook_path": "/pipelines/02_quality"
      },
      "depends_on": [{"task_key": "ingest"}]
    },
    {
      "task_key": "silver_transform",
      "notebook_task": {
        "notebook_path": "/pipelines/03_silver"
      },
      "depends_on": [{"task_key": "quality_check"}]
    }
  ]
}
```
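
Before submitting a multi-task configuration, it can help to check locally that every `depends_on` entry points at a real task and that the graph has no cycle. The following is a minimal sketch using Python's standard `graphlib` module; the `execution_order` helper is a hypothetical name, not part of any Databricks library.

```python
from graphlib import TopologicalSorter

# Same three tasks as the JSON configuration above (notebook paths omitted).
tasks = [
    {"task_key": "ingest"},
    {"task_key": "quality_check", "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "silver_transform", "depends_on": [{"task_key": "quality_check"}]},
]


def execution_order(task_list):
    """Return task keys in an order that respects every depends_on entry."""
    known = {task["task_key"] for task in task_list}
    graph = {}
    for task in task_list:
        upstream = {dep["task_key"] for dep in task.get("depends_on", [])}
        missing = upstream - known
        if missing:
            raise ValueError(
                f"{task['task_key']} depends on unknown tasks: {sorted(missing)}"
            )
        graph[task["task_key"]] = upstream
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(tasks)
print(order)
```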

---

## 3. Scheduling <a id="scheduling"></a>

### Cron Expressions

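**Format:** Databricks schedules use Quartz cron syntax, which starts with a seconds field: `seconds minutes hours day_of_month month day_of_week [year]`. Put `?` in whichever of `day_of_month` and `day_of_week` you are not constraining.

**Common Patterns:**
```
Every hour:       0 0 * * * ?
Every day at 6AM: 0 0 6 * * ?
Every Monday:     0 0 0 ? * MON
First of month:   0 0 0 1 * ?
Every 15 mins:    0 0/15 * * * ?
Business hours:   0 0 9-17 ? * MON-FRI
```

**Examples:**
```
# Daily at 2 AM
Schedule: 0 0 2 * * ?
Description: Run overnight batch processing

# Every 4 hours
Schedule: 0 0 0/4 * * ?
Description: Periodic data refresh

# Weekdays at 8 AM
Schedule: 0 0 8 ? * MON-FRI
Description: Business day reports
```

### Timezone Handling

```
Schedule: 0 0 6 * * ?
Timezone: America/Los_Angeles
Note: Follows local time across daylight saving changes; runs scheduled inside
      the shifted hour can be skipped, so use UTC for strict hourly schedules
```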
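
Databricks stores job schedules as Quartz cron expressions, which add a leading seconds field and use `?` in one of the two day fields. The helpers below are a rough sketch for generating and sanity-checking such expressions before they go into a job definition. `daily_at`, `looks_like_quartz`, and `schedule_block` are hypothetical names, and the check is structural only; it does not validate field values.

```python
def daily_at(hour: int, minute: int = 0) -> str:
    """Return a Quartz cron expression that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Fields: seconds minutes hours day-of-month month day-of-week
    return f"0 {minute} {hour} * * ?"


def looks_like_quartz(expression: str) -> bool:
    """Structural check: 6 or 7 fields, and exactly one day field is '?'."""
    fields = expression.split()
    if len(fields) not in (6, 7):
        return False
    day_of_month, day_of_week = fields[3], fields[5]
    return (day_of_month == "?") != (day_of_week == "?")


def schedule_block(expression: str, timezone_id: str) -> dict:
    """Build the schedule section of a job definition."""
    if not looks_like_quartz(expression):
        raise ValueError(f"not a Quartz cron expression: {expression!r}")
    return {
        "quartz_cron_expression": expression,
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }


schedule = schedule_block(daily_at(6), "America/Los_Angeles")
print(schedule)
```

A five-field Unix expression such as `0 6 * * *` fails the check, which catches the most common mistake when porting schedules from crontab.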

---

## 4. File Arrival Triggers <a id="triggers"></a>

### Use Case: Process New Sensor Files

**Scenario**: New CSV files arrive in cloud storage throughout the day

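**Setup:**
```
Trigger Type: File arrival
Storage location: /Volumes/default/db_crash_course/sensor_data/
Wait after last change: 300 seconds (5 minutes)
Minimum time between triggers: 3600 seconds (1 hour)
```

**How it Works:**
1. Databricks monitors the storage location, which must be a Unity Catalog volume or external location.
2. When new files land, the trigger waits until no further files have arrived for the "wait after last change" period, so files that arrive together are handled in one run.
3. A run starts, but no sooner than the "minimum time between triggers" after the previous triggered run.
4. The trigger watches the whole location and has no file pattern setting, so the job filters for `*.csv` itself.
5. The trigger does not pass the new file paths to the job; the job finds them itself (for example with Auto Loader) or reads them from a parameter you supply.

**Job Configuration:**
```python
# In the notebook, read file paths from a job parameter.
# Assumes the job defines a parameter named "file_paths" that holds a JSON
# array of paths; the file arrival trigger does not populate it for you.
import json

file_paths = json.loads(dbutils.widgets.get("file_paths"))

for path in file_paths:
    if not path.endswith(".csv"):
        continue  # Skip anything that is not a CSV file
    print(f"Processing: {path}")
    df = spark.read.csv(path, header=True)
    # Process file
```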
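
In a job definition sent to the Jobs API, a file arrival trigger is expressed as a `trigger` section with a `file_arrival` block. The sketch below builds that section for the sensor data volume. The `file_arrival_trigger` helper is a hypothetical name, and the 60-second lower bound it enforces reflects the documented minimum for both timing fields as far as I am aware; verify it against the current API reference.

```python
def file_arrival_trigger(
    url: str,
    min_time_between_triggers_seconds: int = 3600,
    wait_after_last_change_seconds: int = 300,
) -> dict:
    """Build the trigger section of a job that runs when new files arrive."""
    timings = {
        "min_time_between_triggers_seconds": min_time_between_triggers_seconds,
        "wait_after_last_change_seconds": wait_after_last_change_seconds,
    }
    for name, value in timings.items():
        # Assumed lower bound; check the Jobs API reference for your workspace.
        if value < 60:
            raise ValueError(f"{name} must be at least 60 seconds")
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {"url": url, **timings},
        }
    }


trigger = file_arrival_trigger("/Volumes/default/db_crash_course/sensor_data/")
print(trigger)
```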

---

## 5. Task Dependencies <a id="dependencies"></a>

### Dependency Patterns

**1. Linear Pipeline**
```
A → B → C → D
```

**2. Parallel Processing**
```
    ┌→ B ┐
A ──┼→ C ┼→ E
    └→ D ┘
```

**3. Conditional Execution**
```
A → B → if success: C
         if failure: D
```
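
In a job definition, the conditional pattern is typically expressed with each task's `run_if` setting alongside `depends_on`. The sketch below shows pattern 3 as a task list, plus a simplified local model of how a few `run_if` values behave. `should_run` is a hypothetical helper for reasoning about the graph, it treats every upstream result as either `SUCCESS` or `FAILED`, and it is not how Databricks evaluates the conditions internally. Task types such as `notebook_task` are omitted for brevity.

```python
# Pattern 3: C runs only if B succeeded, D runs only if B failed.
conditional_tasks = [
    {"task_key": "A"},
    {"task_key": "B", "depends_on": [{"task_key": "A"}]},
    {"task_key": "C", "depends_on": [{"task_key": "B"}], "run_if": "ALL_SUCCESS"},
    {"task_key": "D", "depends_on": [{"task_key": "B"}], "run_if": "ALL_FAILED"},
]


def should_run(run_if: str, upstream_results: list) -> bool:
    """Simplified model of a run_if condition over upstream result strings."""
    succeeded = [result == "SUCCESS" for result in upstream_results]
    if run_if == "ALL_SUCCESS":
        return all(succeeded)
    if run_if == "AT_LEAST_ONE_SUCCESS":
        return any(succeeded)
    if run_if == "ALL_FAILED":
        return not any(succeeded)
    if run_if == "AT_LEAST_ONE_FAILED":
        return not all(succeeded)
    if run_if == "ALL_DONE":
        return True
    raise ValueError(f"unsupported run_if value: {run_if}")


# If B fails, the cleanup task D runs and C is skipped.
b_result = ["FAILED"]
print("C runs:", should_run("ALL_SUCCESS", b_result))
print("D runs:", should_run("ALL_FAILED", b_result))
```

For the parallel pattern, the fan-in task simply lists all of its upstream tasks in `depends_on`, for example `[{"task_key": "B"}, {"task_key": "C"}, {"task_key": "D"}]`.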

### Task Values

**Pass data between tasks:**

Task 1 (Python):
```python
# Write output values for downstream tasks
dbutils.jobs.taskValues.set(key="row_count", value=1000)
dbutils.jobs.taskValues.set(key="status", value="success")
```

Task 2 (Consuming values):
```python
# Read values from the upstream task. taskKey must match that task's
# task_key in the job definition (assumed here to be "ingest_task").
row_count = dbutils.jobs.taskValues.get(taskKey="ingest_task", key="row_count", default=0)
status = dbutils.jobs.taskValues.get(taskKey="ingest_task", key="status", default="unknown")

print(f"Previous task processed {row_count} rows with status: {status}")
```
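
Because `dbutils` only exists inside Databricks, logic that depends on task values is easier to test when the task values object is passed in as an argument. The following is a sketch of that idea. `FakeTaskValues` is a hypothetical in-memory stand-in written for local tests, and `enough_rows` is an illustrative helper; only the `get(taskKey=..., key=..., default=..., debugValue=...)` call shape is taken from `dbutils.jobs.taskValues`.

```python
class FakeTaskValues:
    """Hypothetical in-memory stand-in for dbutils.jobs.taskValues (tests only)."""

    def __init__(self):
        self._store = {}

    def put(self, task_key, key, value):
        # Test helper: record a value as if task_key had set it.
        self._store[(task_key, key)] = value

    def get(self, taskKey, key, default=None, debugValue=None):
        return self._store.get((taskKey, key), default)


def enough_rows(task_values, upstream_task_key, minimum_rows=1):
    """Return True if the upstream task reported at least minimum_rows rows."""
    row_count = task_values.get(
        taskKey=upstream_task_key,
        key="row_count",
        default=0,
        debugValue=0,  # used by Databricks when the notebook runs outside a job
    )
    return row_count >= minimum_rows


fake = FakeTaskValues()
fake.put("ingest_task", "row_count", 1000)
print(enough_rows(fake, "ingest_task", minimum_rows=500))
```

Inside a job, the same function would be called as `enough_rows(dbutils.jobs.taskValues, "ingest_task")`.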

---

## 6. Monitoring and Debugging <a id="monitoring"></a>

### Job Monitoring

**Metrics to Track:**
- Success/failure rates
- Run duration trends
- Cost per run
- Cluster utilization
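
As a rough illustration of the first two metrics, the sketch below computes a success rate and average duration from a list of run records. The field names (`state.result_state`, `start_time` and `end_time` in epoch milliseconds) follow the shape of Jobs API 2.1 run objects, which is an assumption about the API version in use, and the sample records are made up for the example.

```python
def summarize_runs(runs):
    """Compute success rate and average duration (minutes) for finished runs."""
    finished = [r for r in runs if r.get("state", {}).get("result_state")]
    if not finished:
        return {"runs": 0, "success_rate": None, "avg_duration_minutes": None}
    successes = sum(1 for r in finished if r["state"]["result_state"] == "SUCCESS")
    # start_time and end_time are epoch milliseconds; 60000 ms per minute.
    durations = [(r["end_time"] - r["start_time"]) / 60000 for r in finished]
    return {
        "runs": len(finished),
        "success_rate": round(successes / len(finished), 3),
        "avg_duration_minutes": round(sum(durations) / len(durations), 1),
    }


# Made-up sample records: 10, 20, and 30 minute runs, one of which failed.
base = 1_700_000_000_000
sample_runs = [
    {"state": {"result_state": "SUCCESS"}, "start_time": base, "end_time": base + 600_000},
    {"state": {"result_state": "FAILED"}, "start_time": base, "end_time": base + 1_200_000},
    {"state": {"result_state": "SUCCESS"}, "start_time": base, "end_time": base + 1_800_000},
    {"state": {"life_cycle_state": "RUNNING"}, "start_time": base},  # still running, ignored
]

summary = summarize_runs(sample_runs)
print(summary)
```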

**Alerts:**
```
Email on Failure: team@company.com
Slack Notification: #data-alerts
PagerDuty: Critical jobs only
```
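
In a job definition, failure emails go under `email_notifications`, while Slack and PagerDuty alerts are typically referenced by notification destination ID under `webhook_notifications`, with the destinations themselves set up by a workspace admin. The sketch below builds both sections. `failure_alerts` is a hypothetical helper and the destination ID is a placeholder, not a real value.

```python
def failure_alerts(failure_emails, webhook_destination_ids=()):
    """Build the notification sections of a job definition for failed runs."""
    settings = {"email_notifications": {"on_failure": list(failure_emails)}}
    if webhook_destination_ids:
        settings["webhook_notifications"] = {
            "on_failure": [{"id": dest_id} for dest_id in webhook_destination_ids]
        }
    return settings


alerts = failure_alerts(
    ["team@company.com"],
    # Placeholder ID; use the ID of a destination configured in your workspace.
    webhook_destination_ids=["<slack-destination-id>"],
)
print(alerts)
```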

### Debugging Failures

**Common Issues:**

**1. Timeout:**
```
Error: Job exceeded maximum runtime (3 hours)
Solution: Increase timeout or optimize query
```

**2. Cluster Start Failure:**
```
Error: Cannot acquire cluster
Solution: Check quotas, use job clusters
```

**3. Data Not Found:**
```
Error: Table not found
Solution: Check dependencies, verify data arrival
```
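
Timeouts and transient failures can often be handled in the task definition itself through the `timeout_seconds`, `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout` task settings. The sketch below adds them to an existing task dictionary. `with_timeout_and_retries` is a hypothetical helper and the default values are illustrative, not recommendations.

```python
def with_timeout_and_retries(
    task,
    timeout_seconds=3 * 3600,
    max_retries=2,
    min_retry_interval_millis=60_000,
):
    """Return a copy of a task definition with timeout and retry settings."""
    hardened = dict(task)  # shallow copy so the original task is not modified
    hardened.update(
        {
            "timeout_seconds": timeout_seconds,
            "max_retries": max_retries,
            "min_retry_interval_millis": min_retry_interval_millis,
            # Do not retry runs that hit the timeout; they would likely time out again.
            "retry_on_timeout": False,
        }
    )
    return hardened


ingest_task = {
    "task_key": "ingest",
    "notebook_task": {"notebook_path": "/pipelines/01_ingest"},
}
hardened_task = with_timeout_and_retries(ingest_task)
print(hardened_task)
```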

**Debug Tools:**
- Job run logs
- Spark UI
- Driver logs
- Task output

---

## Summary

✅ **Jobs** - Automate notebook and pipeline execution  
✅ **Scheduling** - Cron expressions for time-based triggers  
✅ **File triggers** - Event-driven processing  
✅ **Dependencies** - Multi-task workflows  
✅ **Monitoring** - Track success and debug failures  

### Best Practices:

1. **Use job clusters** for cost optimization
2. **Set retries** for transient failures
3. **Monitor costs** - track DBU usage
4. **Alert appropriately** - don't spam
5. **Document workflows** - explain dependencies

---

**References:**
- [Jobs Docs](https://docs.databricks.com/aws/en/jobs/)
- [Triggers](https://docs.databricks.com/aws/en/jobs/triggers)
- [Best Practices](https://docs.databricks.com/aws/en/jobs/jobs-best-practices)