# 📅 Job Creation: Automate F1 Data Pipeline
*Schedule and automate your F1 data workflows*

---

## 🎯 Learning Objectives

By the end of this demo, you'll understand:
- ✅ **Creating Databricks Jobs** for automation
- ✅ **Scheduling workflows** for regular data updates
- ✅ **Monitoring job execution** and handling failures
- ✅ **Best practices** for production job management

---

## 🚀 What We'll Build

**Automated F1 Data Pipeline Job:**
```
📅 Scheduled Job:
├── 🔄 Daily F1 data refresh (6 AM UTC)
├── 📧 Email notifications on success/failure
├── 🔁 Retry policies for resilience
└── 📊 Performance monitoring and alerts
```

### 🎯 Use Cases:
- **Race weekend data updates** (automated Sunday evening)
- **Season statistics refresh** (weekly aggregations)
- **Performance monitoring** (daily driver rankings)
- **Data quality checks** (validation and alerts)


# ⏰ Job Creation: Automate Your Data Pipelines
*Learn to schedule and monitor data workflows in 3 minutes*

## 🔄 What We'll Build

**Automated Data Refresh Job:**
```
📅 Daily Schedule (6 AM UTC)
    ↓
🔄 Refresh Driver Standings
    ↓
📊 Update f1_job_driver_standings_daily
    ↓
📝 Log Execution Status
    ↓
📧 Send Alerts (if needed)
```

**🎯 Goal:** Create a production-ready job that can run automatically to keep our F1 data fresh!

## 📊 Step 1: Create Job-Ready Data Table

First, let's create a table that our job will refresh daily with the latest F1 driver standings.

In [0]:
%sql
-- Create a table for daily driver standings refresh using race results
CREATE OR REPLACE TABLE main.default.f1_job_driver_standings_daily
USING DELTA
COMMENT 'Daily refreshed driver standings - maintained by automated job'
AS
WITH driver_points AS (
  SELECT
    driver,
    team,
    SUM(points) AS total_points,
    COUNT(*) AS total_races,
    SUM(CASE WHEN position = '1' THEN 1 ELSE 0 END) AS wins,
    SUM(CASE WHEN position IN ('1','2','3') THEN 1 ELSE 0 END) AS podiums
  FROM main.default.f1_bronze_race_results
  GROUP BY driver, team
),
standings AS (
  SELECT
    driver,
    team,
    total_points,
    total_races,
    wins,
    podiums,
    ROUND(total_points / total_races, 2) AS points_per_race,
    ROUND(wins * 100.0 / total_races, 2) AS win_percentage
  FROM driver_points
)
SELECT
  driver AS full_name,
  team,
  total_points,
  wins,
  podiums,
  total_races,
  points_per_race,
  win_percentage,
  -- Add job execution metadata
  'manual_creation' as refresh_method,
  current_timestamp() as last_updated,
  current_user() as updated_by
FROM standings
ORDER BY total_points DESC, wins DESC, podiums DESC, full_name

In [0]:
%sql
-- Verify our job table was created and check all required columns
SELECT 
  'f1_job_driver_standings_daily' as table_name,
  COUNT(*) as driver_count,
  MAX(last_updated) as last_refresh,
  MAX(updated_by) as last_updated_by,
  COUNT(DISTINCT full_name) as full_name_count,
  COUNT(DISTINCT team) as team_count,
  COUNT(DISTINCT total_points) as total_points_count,
  COUNT(wins) as wins_count,
  COUNT(podiums) as podiums_count,
  COUNT(total_races) as total_races_count,
  COUNT(points_per_race) as points_per_race_count,
  COUNT(win_percentage) as win_percentage_count,
  COUNT(DISTINCT refresh_method) as refresh_method_count
FROM main.default.f1_job_driver_standings_daily

## 🏗️ Step 2: Complete Job Creation Guide

Now let's learn how to create an automated job in the Databricks workspace.

### 📋 Job Creation Steps:

#### 1. Navigate to Jobs & Pipelines 🔄
- Click **"Jobs & Pipelines"** in the left sidebar
- Click **"Create"** and then **"Job"**
- You'll see the job configuration interface

#### 2. Configure Basic Job Settings ⚙️
```
Job Name: "F1 Driver Standings Daily Refresh"
Description: "Automated daily refresh of F1 driver standings data"
```

#### 3. Add Job Task 📝
- **Task Name:** `refresh_driver_standings`
- **Type:** `Notebook`
- **Source:** Select this notebook (`04_Job_Creation.ipynb`)
- **Compute:** Choose `Serverless`

#### 4. Set Schedule ⏰
- **Trigger Type:** `Scheduled`
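- **Schedule:** `0 0 6 * * ?` (daily at 6 AM; Databricks uses Quartz cron syntax, which starts with a seconds field)
- **Timezone:** `UTC` to match the 6 AM UTC refresh, or your local timezone if you prefer local time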
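
Quartz cron expressions are easy to confuse with five-field Unix cron, so here is a minimal sketch that labels each field of a six-field expression. The helper name `describe_quartz` is hypothetical, and the sketch ignores Quartz's optional seventh (year) field.

```python
# Field names of a six-field Quartz cron expression, in order.
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def describe_quartz(expression: str) -> dict:
    """Map each field of a six-field Quartz cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(f"expected {len(QUARTZ_FIELDS)} fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


# Daily at 06:00:00; "?" means "no specific value" for day_of_week.
daily_6am = describe_quartz("0 0 6 * * ?")
print(daily_6am)
```

A five-field expression such as `0 6 * * *` raises `ValueError` here, which is a cheap way to catch Unix-style schedules before pasting them into a job.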

#### 5. Configure Notifications 📧
- **On Success:** Email notification (optional)
- **On Failure:** Email + Slack alert (recommended)
- **Recipients:** Your email or team distribution list

#### 6. Advanced Options 🎛️
- **Max Concurrent Runs:** `1` (prevent overlapping executions)
- **Timeout:** `30 minutes` (reasonable for this job)
- **Retry Policy:** `Retry 2 times with 5-minute intervals`
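
The same settings can also be written down as a job definition instead of clicking through the UI. The sketch below builds a payload in the shape of the Databricks Jobs API 2.1 `jobs/create` request, as far as I understand it; check the field names against the API reference before relying on them. The notebook path and email address are placeholders, and leaving out a cluster specification is assumed to select serverless compute in workspaces where it is enabled.

```python
import json

# Hypothetical values: replace the notebook path and email with your own.
job_settings = {
    "name": "F1 Driver Standings Daily Refresh",
    "description": "Automated daily refresh of F1 driver standings data",
    "max_concurrent_runs": 1,          # prevent overlapping executions
    "timeout_seconds": 30 * 60,        # 30 minutes
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # daily at 6 AM
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "email_notifications": {
        "on_failure": ["data-team@example.com"],
    },
    "tasks": [
        {
            "task_key": "refresh_driver_standings",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/you@example.com/04_Job_Creation"
            },
            "max_retries": 2,                            # retry 2 times
            "min_retry_interval_millis": 5 * 60 * 1000,  # 5 minutes apart
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```

Keeping the definition as data makes it easy to review in version control and to recreate the job in another workspace.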

## 📊 Job Monitoring and Troubleshooting

### 🔍 Monitoring Your Jobs:

#### Job Run History 📈
- **View runs:** Jobs & Pipelines → Your Job → "Runs" tab
- **Check status:** SUCCESS, FAILED, RUNNING, CANCELED
- **View logs:** Click on any run to see detailed logs
- **Performance:** Check duration trends over time

#### Common Job Issues & Solutions 🔧

| **Issue** | **Symptoms** | **Solution** |
|-----------|-------------|-------------|
| **Timeout** | Job runs too long | Optimize queries, increase timeout |
| **Cluster startup** | Slow job start | Use Serverless compute |
| **Data skew** | Uneven task performance | Repartition data, optimize joins |
| **Memory errors** | OOM exceptions | Increase cluster size, optimize code |
| **Dependencies** | Missing tables/files | Check data availability, add retries |
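
To make the "add retries" advice concrete, here is a small sketch of what a fixed-interval retry policy does. The job-level retry setting handles this for you; the helper `run_with_retries` and the `flaky_refresh` stand-in are hypothetical and only illustrate the behavior.

```python
import time


def run_with_retries(task, max_retries=2, interval_seconds=300, sleep=time.sleep):
    """Call task(); on failure wait interval_seconds and retry, up to max_retries times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except Exception as exc:
            if attempts > max_retries:
                raise  # retries exhausted: surface the last error
            print(f"attempt {attempts} failed ({exc}); retrying in {interval_seconds}s")
            sleep(interval_seconds)


# Stand-in for a refresh that fails twice (e.g. upstream table not ready yet).
calls = {"count": 0}


def flaky_refresh():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("table not available yet")
    return "refreshed"


waits = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky_refresh, sleep=waits.append)
print(result, calls["count"], waits)
```

Retries help with transient problems such as a late upstream table. They will not fix a query that always fails, so pair them with failure alerts.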

## 🔄 Advanced Job Patterns

### Multi-Step Workflows 🔗

For complex data pipelines, you can create jobs with multiple tasks:

```
📥 Task 1: Data Ingestion
    ↓
🔄 Task 2: Data Transformation  
    ↓
📊 Task 3: Generate Reports
    ↓
📧 Task 4: Send Notifications
```

### Job Dependencies 🔗
- **Sequential:** Tasks run one after another
- **Parallel:** Multiple tasks run simultaneously  
- **Conditional:** Tasks run based on previous results
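
As a rough illustration of sequential and parallel dependencies, the sketch below describes tasks with a `depends_on` list (the shape used by multi-task job definitions, as far as I know) and works out which tasks could run together. The task keys are hypothetical, and `quality_checks` is an extra task added here only to show a parallel branch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical multi-task job: each task lists the tasks it waits for.
tasks = [
    {"task_key": "ingest", "depends_on": []},
    {"task_key": "transform", "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "report", "depends_on": [{"task_key": "transform"}]},
    {"task_key": "quality_checks", "depends_on": [{"task_key": "transform"}]},
    {
        "task_key": "notify",
        "depends_on": [{"task_key": "report"}, {"task_key": "quality_checks"}],
    },
]

# Build {task: set of upstream tasks} and group tasks into runnable waves.
graph = {t["task_key"]: {d["task_key"] for d in t["depends_on"]} for t in tasks}
sorter = TopologicalSorter(graph)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # everything whose upstream tasks are done
    waves.append(ready)
    sorter.done(*ready)

print(waves)
# [['ingest'], ['transform'], ['quality_checks', 'report'], ['notify']]
```

Each inner list is a set of tasks with no unmet dependencies: single-item waves run sequentially, while `quality_checks` and `report` can run in parallel.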

### Resource Management 💰
- **Serverless:** Recommended for most jobs (auto-scaling)
- **Shared clusters:** Cost-effective for multiple small jobs
- **Dedicated clusters:** High-performance critical workloads

## ✅ Job Creation Complete!

**🎉 Excellent! You've learned how to create production-ready automated jobs!**

### What You've Accomplished:
- ✅ **Created job-ready data table** for daily driver standings
- ✅ **Built execution logging** for monitoring and debugging
- ✅ **Developed job function** with comprehensive error handling
- ✅ **Learned job configuration** (scheduling, notifications, monitoring)
- ✅ **Explored advanced patterns** (multi-step workflows, dependencies)

### 🔄 Your Job Architecture:
```
⏰ Schedule (Daily 6 AM)
    ↓
🔄 refresh_driver_standings_job()
    ↓
📊 f1_job_driver_standings_daily (Updated)
    ↓
📝 job_run_log (Execution tracked)
```
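
The architecture above can be sketched as a small wrapper that runs the refresh statement and records the outcome. This is a hypothetical outline rather than the notebook's actual implementation: `run_sql` is assumed to be something like `spark.sql`, `write_log` is assumed to append a row to a `job_run_log` table, and the record fields are illustrative.

```python
from datetime import datetime, timezone


def refresh_driver_standings_job(run_sql, refresh_sql, write_log):
    """Run the refresh statement, log the outcome, and re-raise on failure."""
    record = {
        "job_name": "refresh_driver_standings",
        "started_at": datetime.now(timezone.utc),
        "status": "SUCCESS",
        "error": None,
    }
    try:
        run_sql(refresh_sql)
    except Exception as exc:
        record["status"] = "FAILED"
        record["error"] = str(exc)
        raise  # let the run fail so the job's retries and alerts fire
    finally:
        # Runs on both success and failure, so every execution is tracked.
        record["finished_at"] = datetime.now(timezone.utc)
        write_log(record)
    return record


# Local dry run with stand-ins; in a notebook you might pass spark.sql instead.
executed, log_rows = [], []
record = refresh_driver_standings_job(executed.append, "SELECT 1", log_rows.append)
print(record["status"], len(log_rows))
```

Re-raising after logging is a deliberate choice: swallowing the exception would mark the run as successful and hide the failure from notifications.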

## 🚀 Next Steps

Ready to explore more advanced data engineering features?

### Immediate Actions:
1. **🔄 Create Your Job:** 
   - Go to Jobs & Pipelines → Create → Job
   - Follow the configuration guide above
   - Schedule your first automated refresh!

2. **➡️ Next Notebook:** [05_Delta_Live_Pipeline.ipynb](05_Delta_Live_Pipeline.ipynb)
   - Learn about managed ETL pipelines
   - Declarative data transformations
   - Built-in data quality expectations

3. **📊 Monitor Your Jobs:**
   - Check the job_run_log table regularly
   - Set up email notifications for failures
   - Monitor job performance trends

### 💡 Pro Tips:
- **🧪 Test thoroughly** before scheduling in production
- **📧 Set up alerts** for job failures (early detection is key)
- **📊 Monitor performance** to optimize job runtime
- **🔄 Use retries** for transient failures
- **📝 Log everything** for easier debugging

**⏰ Time to automate your data pipelines! 🚀**