# 📅 Job Creation: Automate F1 Data Pipeline
*Schedule and automate your F1 data workflows*

---

## 🎯 Learning Objectives

By the end of this demo, you'll understand:
- ✅ **Creating Databricks Jobs** for automation
- ✅ **Scheduling workflows** for regular data updates
- ✅ **Monitoring job execution** and handling failures
- ✅ **Best practices** for production job management

---

## 🚀 What We'll Build

**Automated F1 Data Pipeline Job:**
```
📅 Scheduled Job:
├── 🔄 Daily F1 data refresh (6 AM UTC)
├── 📧 Email notifications on success/failure
├── 🔁 Retry policies for resilience
└── 📊 Performance monitoring and alerts
```

### 💡 Job Creation Steps:
1. **Navigate** to Workflows → Jobs in the left sidebar
2. **Click** "Create Job" button
3. **Select** the F1 Notebook Tour as the main task
4. **Configure** scheduling and notifications
5. **Test** the job execution

### 🎯 Use Cases:
- **Race weekend data updates** (automated Sunday evening)
- **Season statistics refresh** (weekly aggregations)
- **Performance monitoring** (daily driver rankings)
- **Data quality checks** (validation and alerts)

**Continue to the next notebook:** `05_Delta_Live_Pipeline.ipynb`

**🏁 Ready to build production ETL pipelines? Let's explore Delta Live Tables! 🏎️**

# ⏰ Job Creation: Automate Your Data Pipelines
*Learn to schedule and monitor data workflows in 3 minutes*

---

## 🎯 Learning Objectives

By the end of this demo, you'll know how to:
- ✅ **Create automated jobs** to refresh data regularly
- ✅ **Configure scheduling** for different business needs
- ✅ **Set up monitoring and alerts** for job failures
- ✅ **Track job execution** with logging and history

---

## 🔄 What We'll Build

**Automated Data Refresh Job:**
```
📅 Daily Schedule (6 AM)
    ↓
🔄 Refresh Driver Standings
    ↓  
📊 Update job_driver_standings_daily
    ↓
📝 Log Execution Status
    ↓
📧 Send Alerts (if needed)
```

**🎯 Goal:** Create a production-ready job that can run automatically to keep our F1 data fresh!

## 📊 Step 1: Create Job-Ready Data Table

First, let's create a table that our job will refresh daily with the latest F1 driver standings.

In [0]:
%sql
-- Create a table for daily driver standings refresh
CREATE OR REPLACE TABLE main.default.job_driver_standings_daily
USING DELTA
COMMENT 'Daily refreshed driver standings - maintained by automated job'
AS
SELECT 
  driverId,
  full_name,
  nationality,
  total_career_points,
  wins,
  podiums,
  total_races,
  points_per_race,
  win_percentage,
  -- Add job execution metadata
  'manual_creation' as refresh_method,
  current_timestamp() as last_updated,
  current_user() as updated_by
FROM main.default.gold_driver_standings
ORDER BY total_career_points DESC

In [0]:
%sql
-- Verify our job table was created
SELECT 
  'job_driver_standings_daily' as table_name,
  COUNT(*) as driver_count,
  MAX(last_updated) as last_refresh,
  MAX(updated_by) as last_updated_by
FROM main.default.job_driver_standings_daily

## 📝 Step 2: Create Job Execution Log Table

Good production jobs always log their execution for monitoring and debugging.

In [0]:
%sql
-- Create the job execution log table once; IF NOT EXISTS keeps earlier run history when this notebook is re-run
CREATE TABLE IF NOT EXISTS main.default.job_run_log
(
  job_run_id STRING,
  job_name STRING,
  start_time TIMESTAMP,
  end_time TIMESTAMP,
  status STRING,
  records_processed BIGINT,
  error_message STRING,
  execution_user STRING,
  execution_details MAP<STRING, STRING>
)
USING DELTA
COMMENT 'Job execution tracking and monitoring log'
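To make the shape of one log entry concrete, here is a minimal Python sketch that mirrors the columns of `job_run_log`. The `JobRunRecord` class and the sample values are hypothetical and exist only for illustration; the notebook itself writes rows with SQL.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class JobRunRecord:
    """Hypothetical in-memory mirror of one row in main.default.job_run_log."""
    job_run_id: str
    job_name: str
    start_time: datetime
    end_time: datetime
    status: str                       # e.g. 'SUCCESS' or 'FAILED'
    records_processed: int
    error_message: Optional[str] = None
    execution_user: Optional[str] = None
    execution_details: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_seconds(self) -> float:
        # Derived value; the table stores it as a string inside execution_details
        return (self.end_time - self.start_time).total_seconds()


# Illustrative values only
record = JobRunRecord(
    job_run_id="example-run-id",
    job_name="daily_driver_standings_refresh",
    start_time=datetime(2024, 1, 1, 6, 0, 0),
    end_time=datetime(2024, 1, 1, 6, 1, 30),
    status="SUCCESS",
    records_processed=100,
)
record.execution_details["duration_seconds"] = f"{record.duration_seconds:.2f}"
print(asdict(record))
```

Keeping free-form details in a `MAP<STRING, STRING>` column means new metrics can be logged later without changing the table schema.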

## 🔄 Step 3: Build Job Logic with Logging

This is the core logic that our scheduled job will execute.

In [0]:
import uuid
from datetime import datetime

# Job execution function with comprehensive logging
def refresh_driver_standings_job():
    """
    Refreshes the driver standings table and logs execution details.
    This function will be called by our scheduled job.
    """
    
    # Generate unique job run ID
    job_run_id = str(uuid.uuid4())
    job_name = "daily_driver_standings_refresh"
    start_time = datetime.now()
    
    print(f"🚀 Starting job: {job_name}")
    print(f"📝 Job Run ID: {job_run_id}")
    print(f"⏰ Start Time: {start_time}")
    
    try:
        # Step 1: Refresh the driver standings data
        print("📊 Refreshing driver standings data...")
        
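        # Get current record count before refresh
        # (index by column name: a Row is a tuple, so `.count` resolves to the tuple method, not the column)
        old_count = spark.sql("SELECT COUNT(*) as count FROM main.default.job_driver_standings_daily").collect()[0]["count"]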
        
        # Refresh with latest data from gold layer
        spark.sql("""
            CREATE OR REPLACE TABLE main.default.job_driver_standings_daily
            USING DELTA
            AS
            SELECT 
              driverId,
              full_name,
              nationality,
              total_career_points,
              wins,
              podiums,
              total_races,
              points_per_race,
              win_percentage,
              'automated_job_refresh' as refresh_method,
              current_timestamp() as last_updated,
              current_user() as updated_by
            FROM main.default.gold_driver_standings
            ORDER BY total_career_points DESC
        """)
        
        # Get new record count (index by name; `.count` on a Row is the tuple method)
        new_count = spark.sql("SELECT COUNT(*) as count FROM main.default.job_driver_standings_daily").collect()[0]["count"]
        
        end_time = datetime.now()
        duration = (end_time - start_time).total_seconds()
        
        print("✅ Job completed successfully!")
        print(f"📊 Records processed: {new_count}")
        print(f"⏱️ Duration: {duration:.2f} seconds")
        
        # Log successful execution
        spark.sql(f"""
            INSERT INTO main.default.job_run_log VALUES (
                '{job_run_id}',
                '{job_name}',
                timestamp('{start_time}'),
                timestamp('{end_time}'),
                'SUCCESS',
                {new_count},
                NULL,
                current_user(),
                map('duration_seconds', '{duration:.2f}', 'old_count', '{old_count}', 'new_count', '{new_count}')
            )
        """)
        
        return {"status": "SUCCESS", "records_processed": new_count, "duration": duration}
        
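    except Exception as e:
        end_time = datetime.now()
        error_message = str(e)
        
        print(f"❌ Job failed: {error_message}")
        
        # Escape backslashes and single quotes so the message cannot break the SQL string literal
        safe_error_message = error_message.replace("\\", "\\\\").replace("'", "\\'")
        
        # Log failed execution
        spark.sql(f"""
            INSERT INTO main.default.job_run_log VALUES (
                '{job_run_id}',
                '{job_name}',
                timestamp('{start_time}'),
                timestamp('{end_time}'),
                'FAILED',
                0,
                '{safe_error_message}',
                current_user(),
                map('error_type', 'execution_error')
            )
        """)
        
        # Re-raise with the original traceback so the job run is marked as failed
        raise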

# Test our job function
print("🧪 Testing job execution...")
result = refresh_driver_standings_job()
print(f"🎯 Job test result: {result}")

In [0]:
%sql
-- Check our job execution log
SELECT 
  job_name,
  start_time,
  end_time,
  status,
  records_processed,
  execution_details
FROM main.default.job_run_log
ORDER BY start_time DESC
LIMIT 5

## 🏗️ Step 4: Complete Job Creation Guide

Now let's learn how to create an automated job in the Databricks workspace.

### 📋 Job Creation Steps:

#### 1. Navigate to Workflows 🔄
- Click **"Workflows"** in the left sidebar
- Click **"Create Job"** button
- You'll see the job configuration interface

#### 2. Configure Basic Job Settings ⚙️
```
Job Name: "F1 Driver Standings Daily Refresh"
Description: "Automated daily refresh of F1 driver standings data"
```

#### 3. Add Job Task 📝
- **Task Name:** `refresh_driver_standings`
- **Type:** `Notebook`
- **Source:** Select this notebook (`04_Job_Creation.ipynb`)
- **Cluster:** Choose `Serverless` compute

#### 4. Set Schedule ⏰
- **Trigger Type:** `Scheduled`
- **Schedule:** `0 0 6 * * ?` (daily at 6 AM; Databricks uses Quartz cron syntax, which starts with a seconds field)
- **Timezone:** Your local timezone

#### 5. Configure Notifications 📧
- **On Success:** Email notification (optional)
- **On Failure:** Email + Slack alert (recommended)
- **Recipients:** Your email or team distribution list

#### 6. Advanced Options 🎛️
- **Max Concurrent Runs:** `1` (prevent overlapping executions)
- **Timeout:** `30 minutes` (reasonable for this job)
- **Retry Policy:** `Retry 2 times with 5 minute intervals`
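The settings from the steps above can also be written down as a job definition. The sketch below builds such a payload as a plain Python dictionary, assuming the field names of the Databricks Jobs API 2.1 (`jobs/create`); check them against the API reference for your workspace. The notebook path and the email address are placeholders, not values from this demo, and compute settings are left out.

```python
import json

job_settings = {
    "name": "F1 Driver Standings Daily Refresh",
    "max_concurrent_runs": 1,              # prevent overlapping executions
    "timeout_seconds": 30 * 60,            # 30 minutes
    "schedule": {
        # Quartz syntax: second minute hour day-of-month month day-of-week
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "email_notifications": {
        "on_failure": ["data-team@example.com"],   # placeholder recipient
    },
    "tasks": [
        {
            "task_key": "refresh_driver_standings",
            "notebook_task": {
                # Placeholder path; point this at your copy of the notebook
                "notebook_path": "/Workspace/Users/you@example.com/04_Job_Creation",
            },
            "max_retries": 2,                          # retry 2 times
            "min_retry_interval_millis": 5 * 60 * 1000,  # 5 minutes between retries
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```

Keeping the definition in code makes it easy to review and version alongside the notebook, even if you create the job through the UI first.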

## ⚙️ Common Job Configuration Examples

Here are some typical scheduling patterns for different business needs:

In [0]:
print("⏰ Common Job Scheduling Patterns")
print("=" * 45)

# Databricks job schedules use Quartz cron syntax: six fields, starting with seconds.
# One of day-of-month / day-of-week must be "?".
schedules = {
    "Every Hour": "0 0 * * * ?",
    "Daily at 6 AM": "0 0 6 * * ?",
    "Daily at Midnight": "0 0 0 * * ?",
    "Weekly on Monday": "0 0 6 ? * MON",
    "Monthly on 1st": "0 0 6 1 * ?",
    "Business Days Only": "0 0 6 ? * MON-FRI",
    "Every 15 minutes": "0 0/15 * * * ?",
    "Twice Daily": "0 0 6,18 * * ?"
}

for description, cron in schedules.items():
    print(f"📅 {description:<20} {cron}")

print("\n💡 Quartz cron format: second minute hour day-of-month month day-of-week")
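Databricks job schedules are entered in Quartz cron syntax, which adds a leading seconds field and expects `?` in either the day-of-month or the day-of-week position. If you are used to five-field Unix cron (`minute hour day-of-month month day-of-week`), the following sketch shows one way to translate an expression. `unix_to_quartz` is a hypothetical helper written for this example and only covers simple values, lists, and ranges.

```python
import re

# Unix cron numbers weekdays 0-7 (0 and 7 are Sunday); Quartz counts 1-7 from Sunday.
# Emitting day names avoids the off-by-one difference.
_DAY_NAMES = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def unix_to_quartz(expr: str) -> str:
    """Translate a five-field Unix cron expression into six-field Quartz syntax."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expr!r}")
    minute, hour, day_of_month, month, day_of_week = fields

    if day_of_week == "*":
        # No weekday restriction: Quartz wants '?' in that position
        day_of_week = "?"
    elif day_of_month == "*":
        # Weekday restriction only: move the '?' to day-of-month
        if "/" in day_of_week:
            raise ValueError("step values in day-of-week are not handled by this sketch")
        day_of_month = "?"
        day_of_week = re.sub(r"[0-7]", lambda m: _DAY_NAMES[int(m.group())], day_of_week)
    else:
        raise ValueError("Quartz cannot restrict day-of-month and day-of-week together")

    # Seconds field is fixed at 0
    return " ".join(["0", minute, hour, day_of_month, month, day_of_week])


for unix_expr in ["0 6 * * *", "0 6 * * 1-5", "0 6 1 * *"]:
    print(f"{unix_expr:<14} -> {unix_to_quartz(unix_expr)}")
```

Always confirm the result in the schedule dialog, which shows a human-readable description of the expression you enter.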

## 📊 Job Monitoring and Troubleshooting

### 🔍 Monitoring Your Jobs:

#### Job Run History 📈
- **View runs:** Workflows → Your Job → "Runs" tab
- **Check status:** SUCCESS, FAILED, RUNNING, CANCELED
- **View logs:** Click on any run to see detailed logs
- **Performance:** Check duration trends over time

#### Common Job Issues & Solutions 🔧

| **Issue** | **Symptoms** | **Solution** |
|-----------|-------------|-------------|
| **Timeout** | Job runs too long | Optimize queries, increase timeout |
| **Cluster startup** | Slow job start | Use Serverless compute |
| **Data skew** | Uneven task performance | Repartition data, optimize joins |
| **Memory errors** | OOM exceptions | Increase cluster size, optimize code |
| **Dependencies** | Missing tables/files | Check data availability, add retries |
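For transient problems such as a source table that is not available yet, a step can also be retried inside the notebook, in addition to the job-level retry policy. This is a minimal sketch: `run_with_retries` and `flaky_step` are hypothetical names, and the defaults simply mirror the "2 retries, 5 minutes apart" policy used in this demo.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], max_retries: int = 2,
                     wait_seconds: float = 300.0) -> T:
    """Run `step`, retrying up to `max_retries` times with a fixed wait between attempts."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception as exc:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait_seconds:.0f}s")
            time.sleep(wait_seconds)


# Usage: a stand-in step that fails twice before succeeding
calls = {"count": 0}


def flaky_step() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("source table not available yet")
    return "ok"


result = run_with_retries(flaky_step, max_retries=2, wait_seconds=0)
print(result, calls["count"])
```

Retrying only makes sense for failures that can resolve on their own; a missing column or a permission error will fail the same way every time.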

In [0]:
# Let's create a job monitoring query
print("📊 Job Performance Monitoring")
print("=" * 35)

In [0]:
%sql
-- Job performance monitoring query
-- Grouped by job only, so the success/failure counts and success rate cover all runs of a job
SELECT 
  job_name,
  COUNT(*) as run_count,
  AVG(CAST(execution_details['duration_seconds'] AS DOUBLE)) as avg_duration_seconds,
  MAX(end_time) as last_run,
  SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as successful_runs,
  SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) as failed_runs,
  ROUND(SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as success_rate_pct
FROM main.default.job_run_log
GROUP BY job_name
ORDER BY last_run DESC
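A success rate is only meaningful when it is computed over all runs of a job, successes and failures together. The plain-Python sketch below shows that calculation on a few made-up sample rows; `success_rates` is a hypothetical helper and the rows are not real log data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(runs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the success percentage per job from (job_name, status) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for job_name, status in runs:
        totals[job_name] += 1
        if status == "SUCCESS":
            successes[job_name] += 1
    return {job: round(successes[job] * 100.0 / totals[job], 2) for job in totals}


# Made-up sample rows for illustration
sample_runs = [
    ("daily_driver_standings_refresh", "SUCCESS"),
    ("daily_driver_standings_refresh", "SUCCESS"),
    ("daily_driver_standings_refresh", "FAILED"),
    ("daily_driver_standings_refresh", "SUCCESS"),
]

print(success_rates(sample_runs))
```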

## 🔄 Advanced Job Patterns

### Multi-Step Workflows 🔗

For complex data pipelines, you can create jobs with multiple tasks:

```
📥 Task 1: Data Ingestion
    ↓
🔄 Task 2: Data Transformation  
    ↓
📊 Task 3: Generate Reports
    ↓
📧 Task 4: Send Notifications
```

### Job Dependencies 🔗
- **Sequential:** Tasks run one after another
- **Parallel:** Multiple tasks run simultaneously  
- **Conditional:** Tasks run based on previous results
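In a job definition, these patterns come from how tasks reference each other. The sketch below models the four-task workflow as a list of task dictionaries, assuming the `depends_on` and `run_if` fields of the Jobs API 2.1, and uses Python's standard `graphlib` module to show the resulting run order. The task keys are hypothetical, and the ordering logic only illustrates the idea; Databricks resolves the order itself.

```python
from graphlib import TopologicalSorter

# Hypothetical task keys matching the four-step workflow above
tasks = [
    {"task_key": "ingest"},
    {"task_key": "transform", "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "report", "depends_on": [{"task_key": "transform"}]},
    # Conditional: run the notification once upstream tasks finish, even if they failed
    {"task_key": "notify", "depends_on": [{"task_key": "report"}], "run_if": "ALL_DONE"},
]

# Map each task to the set of tasks it waits for
graph = {
    task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
    for task in tasks
}

run_order = list(TopologicalSorter(graph).static_order())
print(run_order)
```

Tasks that list the same upstream task in `depends_on` and do not depend on each other are free to run in parallel.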

### Resource Management 💰
- **Serverless:** Recommended for most jobs (auto-scaling)
- **Shared clusters:** Cost-effective for multiple small jobs
- **Dedicated clusters:** High-performance critical workloads

In [0]:
# Example of a more complex job function with multiple steps
def multi_step_etl_job():
    """
    Example of a complex ETL job with multiple steps and error handling.
    """
    job_run_id = str(uuid.uuid4())
    
    try:
        # Step 1: Data validation
        print("🔍 Step 1: Validating source data...")
        # Index the Row by column name; `.count` would return the tuple method, not the value
        validation_result = spark.sql("""
            SELECT COUNT(*) as count 
            FROM main.default.silver_drivers 
            WHERE full_name IS NOT NULL
        """).collect()[0]["count"]
        
        if validation_result == 0:
            raise Exception("No valid driver data found")
            
        # Step 2: Data processing
        print("⚙️ Step 2: Processing data transformations...")
        # (Your transformation logic here)
        
        # Step 3: Data quality checks
        print("✅ Step 3: Running data quality checks...")
        # (Your quality check logic here)
        
        # Step 4: Update production tables
        print("📊 Step 4: Updating production tables...")
        # (Your table update logic here)
        
        print("🎉 Multi-step ETL job completed successfully!")
        return {"status": "SUCCESS", "steps_completed": 4}
        
    except Exception as e:
        print(f"❌ Multi-step ETL job failed: {str(e)}")
        raise

print("🧪 Example multi-step job structure created")

## ✅ Job Creation Complete!

**🎉 Excellent! You've learned how to create production-ready automated jobs!**

### What You've Accomplished:
- ✅ **Created job-ready data table** for daily driver standings
- ✅ **Built execution logging** for monitoring and debugging
- ✅ **Developed job function** with comprehensive error handling
- ✅ **Learned job configuration** (scheduling, notifications, monitoring)
- ✅ **Explored advanced patterns** (multi-step workflows, dependencies)

### 🔄 Your Job Architecture:
```
⏰ Schedule (Daily 6 AM)
    ↓
🔄 refresh_driver_standings_job()
    ↓
📊 job_driver_standings_daily (Updated)
    ↓
📝 job_run_log (Execution tracked)
```

In [0]:
# Final verification of our job-ready components
print("⏰ Job Creation Summary")
print("=" * 30)

# Check our job table
job_table_count = spark.sql("SELECT COUNT(*) as count FROM main.default.job_driver_standings_daily").collect()[0]["count"]
print(f"📊 Driver standings table: {job_table_count:,} records")

# Check our log table  
log_count = spark.sql("SELECT COUNT(*) as count FROM main.default.job_run_log").collect()[0]["count"]
print(f"📝 Job execution logs: {log_count} entries")

print(f"\n✅ Job components ready for scheduling!")
print(f"🎯 Next: Create your job in Workflows → Create Job")

## 🚀 Next Steps

Ready to explore more advanced data engineering features?

### Immediate Actions:
1. **🔄 Create Your Job:** 
   - Go to Workflows → Create Job
   - Follow the configuration guide above
   - Schedule your first automated refresh!

2. **➡️ Next Notebook:** [05_Delta_Live_Pipeline.ipynb](05_Delta_Live_Pipeline.ipynb)
   - Learn about managed ETL pipelines
   - Declarative data transformations
   - Built-in data quality expectations

3. **📊 Monitor Your Jobs:**
   - Check the job_run_log table regularly
   - Set up email notifications for failures
   - Monitor job performance trends

### 💡 Pro Tips:
- **🧪 Test thoroughly** before scheduling in production
- **📧 Set up alerts** for job failures (early detection is key)
- **📊 Monitor performance** to optimize job runtime
- **🔄 Use retries** for transient failures
- **📝 Log everything** for easier debugging

**⏰ Time to automate your data pipelines! 🚀**