## Databricks Jobs
## https://docs.databricks.com/aws/en/getting-started/etl-quick-start#process

'''
Run your first ETL workload on Databricks

1. Configure Auto Loader to ingest data to Delta Lake

# Import functions
from pyspark.sql.functions import col, current_timestamp

# Define variables used in code below
file_path = "/databricks-datasets/structured-streaming/events"
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
table_name = f"{username}_etl_quickstart"
checkpoint_path = f"/tmp/{username}/_checkpoint/etl_quickstart"

# Clear out data from previous demo execution
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
dbutils.fs.rm(checkpoint_path, True)

# Configure Auto Loader to ingest JSON data to a Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(file_path)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

# Process and interact with data
df = spark.read.table(table_name)

# Schedule a job

You can run Databricks notebooks as production scripts by adding them as a task in a Databricks job. In this step, you will create a new job that you can trigger manually.

To schedule your notebook as a task:

1. Click Schedule on the right side of the header bar.
2. Enter a unique name for the Job name.
3. Click Manual.
4. In the Cluster drop-down, select the cluster you created in step 1.
5. Click Create.
6. In the window that appears, click Run now.
7. To see the job run results, click the External Link icon next to the Last run timestamp.
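
The same manual job can also be described as a request body for the Jobs REST API. The sketch below only builds and prints that body and sends nothing. The field names (`name`, `tasks`, `task_key`, `notebook_task`, `existing_cluster_id`) follow the Jobs API job-settings format as I understand it; the job name, notebook path, and cluster ID are hypothetical placeholders, not values from the tutorial.

```python
import json


def build_notebook_job(job_name: str, notebook_path: str, cluster_id: str) -> dict:
    """Build a job-creation payload with one notebook task and no schedule."""
    if not job_name.strip():
        raise ValueError("job_name must not be empty")
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "etl_quickstart",  # hypothetical task name
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
        # No "schedule" or "trigger" key: the job runs only when started manually.
    }


payload = build_notebook_job(
    job_name="etl-quickstart-manual",                           # hypothetical
    notebook_path="/Users/someone@example.com/etl-quickstart",  # hypothetical
    cluster_id="0000-000000-example",                           # hypothetical
)
print(json.dumps(payload, indent=2))
```

Leaving out the schedule mirrors the Manual choice in step 3; clicking Run now in the UI corresponds to a separate run request against the created job.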

# Orchestration concepts

There are three main concepts when using orchestration in Databricks: jobs, tasks, and triggers.

Job - A job is the primary resource for coordinating, scheduling, and running your operations. Jobs can vary in complexity from a single task running a Databricks notebook to hundreds of tasks with conditional logic and dependencies. The tasks in a job are visually represented by a directed acyclic graph (DAG). You can specify properties for the job, including:

  Trigger - defines when to run the job.
  Parameters - run-time parameters that are automatically pushed to tasks within the job.
  Notifications - emails or webhooks to be sent when a job fails or takes too long.
  Git - source control settings for the job tasks.

Task - A task is a specific unit of work within a job. Each task can perform a variety of operations, including:

  A notebook task runs a Databricks notebook. You specify the path to the notebook and any parameters that it requires.
  A pipeline task runs a pipeline. You can specify an existing Delta Live Tables pipeline, such as a materialized view or streaming table.
  A Python script task runs a Python file. You provide the path to the file and any necessary parameters.

There are many types of tasks. For a complete list, see Types of tasks. Tasks can depend on other tasks and can conditionally run other tasks, which lets you build complex workflows.

Trigger - A trigger is a mechanism that initiates running a job based on specific conditions or events. A trigger can be time-based, such as running a job at a scheduled time (for example, every day at 2 AM), or event-based, such as running a job when new data arrives in cloud storage.
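
To make the three concepts concrete, here is a minimal sketch of job settings with three dependent tasks and a time-based trigger, plus a small helper that derives a run order from the `depends_on` entries. The settings keys (`schedule`, `quartz_cron_expression`, `timezone_id`, `depends_on`, `notebook_task`, `spark_python_task`) are taken from the Jobs API format as I understand it. The job name, task keys, and paths are hypothetical, compute settings are omitted for brevity, and `run_order` is an illustration of the DAG idea, not how Databricks schedules tasks internally.

```python
def run_order(tasks: list[dict]) -> list[str]:
    """Return task keys in an order that respects every depends_on edge."""
    pending = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    order = []
    while pending:
        # Tasks whose dependencies have all been scheduled are ready to run.
        ready = sorted(key for key, deps in pending.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency in task graph")
        for key in ready:
            order.append(key)
            del pending[key]
        for deps in pending.values():
            deps.difference_update(ready)
    return order


job_settings = {
    "name": "nightly-etl",  # hypothetical
    # Time-based trigger: Quartz cron for every day at 2:00 AM.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {"task_key": "ingest",
         "notebook_task": {"notebook_path": "/Workspace/etl/ingest"}},
        {"task_key": "transform",
         "depends_on": [{"task_key": "ingest"}],
         "spark_python_task": {"python_file": "/Workspace/etl/transform.py"}},
        {"task_key": "report",
         "depends_on": [{"task_key": "transform"}],
         "notebook_task": {"notebook_path": "/Workspace/etl/report"}},
    ],
}

print(run_order(job_settings["tasks"]))
```

An event-based job would replace the `schedule` entry with a trigger definition, for example one that watches a cloud storage location for new files.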


# Monitoring and observability

Jobs provide built-in support for monitoring and observability. The following topics give an overview of this support. For more details about monitoring jobs and orchestration, see Monitoring and observability for Databricks Jobs.

Job monitoring and observability in the UI - In the Databricks UI you can view jobs, including details such as the job owner and the result of the last run, and filter by job properties. You can view a history of job runs, and get detailed information about each task in the job.

Job run status and metrics - Databricks reports whether each job run succeeded and provides logs and metrics for each task within the run, which help you diagnose issues and understand performance.

Notifications and alerts - You can set up notifications for job events through email, Slack, custom webhooks, and other destinations.

Custom queries through system tables - Databricks provides system tables that record job runs and tasks across the account. You can use these tables to query and analyze job performance and costs. You can create dashboards to visualize job metrics and trends, to help monitor the health and performance of your workflows.
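
As a rough illustration of a system-table query, the sketch below builds a SQL string that counts finished job runs per job and result state over a recent window. The table name `system.lakeflow.job_run_timeline` and the columns `job_id`, `run_id`, `result_state`, and `period_start_time` are assumptions based on my reading of the jobs system tables reference; check them against your workspace before relying on the query. The helper name `job_run_summary_query` is hypothetical.

```python
def job_run_summary_query(days: int = 7) -> str:
    """Return SQL that counts finished job runs per job and result state."""
    if not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive integer")
    return f"""
SELECT
  job_id,
  result_state,
  COUNT(DISTINCT run_id) AS run_count
FROM system.lakeflow.job_run_timeline   -- assumed table name; verify in your workspace
WHERE period_start_time > CURRENT_TIMESTAMP() - INTERVAL {days} DAYS
  AND result_state IS NOT NULL          -- keep only rows that record a final state
GROUP BY job_id, result_state
ORDER BY run_count DESC
""".strip()


query = job_run_summary_query(days=7)
print(query)
# In a Databricks notebook, where `spark` is predefined: spark.sql(query).show()
```

The result of such a query could feed a dashboard that tracks failure counts per job over time.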

# Limitations

The following limitations exist:

A workspace is limited to 2000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.
A workspace can contain up to 12000 saved jobs.
A job can contain up to 100 tasks.
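
A client that hits the 429 Too Many Requests response can retry with increasing delays. The following is a minimal sketch of exponential backoff, assuming the caller wraps its own HTTP request in a zero-argument function that returns `(status_code, body)`. The names `call_with_backoff` and `send_request` are hypothetical, and the demo uses canned responses so that nothing is sent to a real workspace.

```python
import time


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request() until it returns a non-429 status or attempts run out.

    send_request is any zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait after each rejected attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    return status, body


# Demo with canned responses: two rejections, then success.
responses = iter([
    (429, "Too Many Requests"),
    (429, "Too Many Requests"),
    (200, "run started"),
])
delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test; production code would leave the default `time.sleep` in place and might also add random jitter to the delay.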


Databricks has tools and APIs that allow you to schedule and orchestrate your workflows programmatically, including the following:

Databricks CLI
Databricks Asset Bundles
Databricks extension for Visual Studio Code
Databricks SDKs
Jobs REST API

# Databricks Pipeline Task
A Pipeline Task is a task within a Databricks Job that runs a Delta Live Tables (DLT) pipeline. Delta Live Tables is a framework in Databricks designed for building, managing, and monitoring data pipelines. It supports batch and streaming data transformations while ensuring data reliability.

How a pipeline task works in Databricks

1. Runs a Delta Live Tables (DLT) pipeline. The pipeline task triggers an existing DLT pipeline from within a Databricks job. The pipeline can process data incrementally in streaming or batch mode.
2. Supports materialized views and streaming tables. The pipeline can include materialized views (precomputed tables for performance optimization) and streaming tables (real-time or near-real-time data processing).
3. Is configured in Databricks Jobs. You can add a pipeline task while setting up a job. The job can have multiple tasks, and other tasks can depend on the pipeline task.

Example use case

Suppose you have a streaming pipeline that processes real-time IoT sensor data and stores it in a Delta table. You create a DLT pipeline that defines:

1. A streaming table for raw sensor data.
2. A materialized view that aggregates sensor readings.

The pipeline task in a Databricks job triggers this pipeline at scheduled intervals or in response to an event.
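
A job for this use case might be described as follows. This is a sketch of the settings only: the `pipeline_task` and `pipeline_id` keys follow the Jobs API format as I understand it, while the job name, task keys, pipeline ID, notebook path, and the hourly schedule are hypothetical. The second task shows another task depending on the pipeline task.

```python
import json


def build_pipeline_job(job_name: str, pipeline_id: str, cron: str) -> dict:
    """Build job settings that run a pipeline task, then a dependent notebook task."""
    return {
        "name": job_name,
        "schedule": {"quartz_cron_expression": cron, "timezone_id": "UTC"},
        "tasks": [
            {
                "task_key": "refresh_sensor_pipeline",  # hypothetical
                "pipeline_task": {"pipeline_id": pipeline_id},
            },
            {
                "task_key": "publish_report",  # hypothetical downstream task
                "depends_on": [{"task_key": "refresh_sensor_pipeline"}],
                "notebook_task": {"notebook_path": "/Workspace/iot/report"},
            },
        ],
    }


pipeline_job = build_pipeline_job(
    job_name="iot-sensor-refresh",                       # hypothetical
    pipeline_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    cron="0 0 * * * ?",                                  # Quartz cron: top of every hour
)
print(json.dumps(pipeline_job, indent=2))
```

The streaming table and materialized view themselves live in the pipeline definition; the job references the pipeline only by its ID.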
'''