# Airflow and ETL/ELT Concepts

## Keywords

### 1. ETL/ELT
- **ETL (Extract, Transform, Load)**: Refers to the process of extracting data from different sources, transforming it into a desired format, and loading it into a destination.
- **ELT (Extract, Load, Transform)**: A variant where raw data is loaded into the destination first and transformed there, which suits modern data warehouses that can run transformations at scale.
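
The difference between the two orderings can be sketched with the standard-library `sqlite3` module standing in for a warehouse. This is only an illustration: the table names, the sample rows, and the cleaning rule are hypothetical and not tied to any particular pipeline.

```python
import sqlite3

# Hypothetical source rows: (customer, amount as text with stray whitespace).
source_rows = [("alice", " 10.50 "), ("bob", "4.25"), ("alice", "1.00")]

warehouse = sqlite3.connect(":memory:")

# ETL: transform in application code first, then load the cleaned rows.
cleaned = [(name, float(amount.strip())) for name, amount in source_rows]
warehouse.execute("CREATE TABLE etl_orders (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO etl_orders VALUES (?, ?)", cleaned)

# ELT: load the raw rows as-is, then transform inside the warehouse with SQL.
warehouse.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", source_rows)
warehouse.execute(
    "CREATE TABLE elt_orders AS "
    "SELECT customer, CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
)

etl_total = warehouse.execute("SELECT SUM(amount) FROM etl_orders").fetchone()[0]
elt_total = warehouse.execute("SELECT SUM(amount) FROM elt_orders").fetchone()[0]
print(etl_total, elt_total)
```

Both paths end with the same cleaned data; what changes is where the transformation runs and whether the raw rows remain available in the destination.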

### 2. DAG
- **DAG (Directed Acyclic Graph)**: A graph structure used in Airflow to define workflows, where tasks are represented as nodes, and the edges represent dependencies between tasks. The key feature is that it does not allow cycles.
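
As a rough illustration of why the "acyclic" part matters, the sketch below uses Python's standard-library `graphlib` (not Airflow itself) to order a small graph and to show that a cycle makes ordering impossible. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid execution order exists because the graph has no cycles.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "extract" depend on "load" closes a loop: no valid order exists.
cyclic = {**dependencies, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```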

### 3. Airflow Task
- An **Airflow Task** is an individual unit of work in an Airflow DAG. Each task performs a specific operation and is represented as a node in the DAG.

## Airflow Concepts

### 1. with DAG
- The `with DAG` context manager is used to define a set of tasks and their dependencies inside the DAG context. It simplifies the creation of the DAG.

### 2. @dag / @task
- `@dag` (available from Airflow 2.0 onwards) is a decorator that turns a Python function into a DAG.
- `@task` is a decorator that turns a Python function into a task. Values returned by one task and passed to another are exchanged through XCom, and the dependency between the two tasks is inferred from the call.

### 3. >>, <<
- These are used to set task dependencies in Airflow. 
    - `>>` sets downstream dependencies (i.e., task A >> task B means task A must run before task B).
    - `<<` sets upstream dependencies (i.e., task A << task B means task B must run before task A).

### 4. set_downstream, set_upstream
- `set_downstream` is the method form of `>>`: `task_a.set_downstream(task_b)` means task A must run before task B.
- `set_upstream` is the method form of `<<`: `task_a.set_upstream(task_b)` means task B must run before task A.
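
The bitshift operators and the explicit methods record the same kind of edge. The toy class below is not Airflow's implementation; it is a small stand-in, with a hypothetical `Task` class, that shows how operator overloading can map `>>` and `<<` onto `set_downstream` and `set_upstream`.

```python
class Task:
    """Toy stand-in for an Airflow task; it only tracks dependency edges."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):  # self >> other
        self.set_downstream(other)
        return other  # returning the right-hand task allows a >> b >> c

    def __lshift__(self, other):  # self << other
        self.set_upstream(other)
        return other


a, b, c, d, e = (Task(name) for name in "abcde")
a >> b               # a runs before b
c << b               # b runs before c
c.set_downstream(d)  # c runs before d
e.set_upstream(d)    # d runs before e

print([(t.task_id, sorted(t.upstream)) for t in (a, b, c, d, e)])
```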

### 5. cross_downstream
- `cross_downstream` is a helper function that takes two lists of tasks and makes every task in the first list upstream of every task in the second list.
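
To make the resulting edges concrete, the sketch below models them in plain Python. `cross_edges` is a hypothetical helper written for illustration, not part of Airflow, and the task names are placeholders.

```python
from itertools import product


def cross_edges(from_tasks, to_tasks):
    """Return every (upstream, downstream) pair between the two groups."""
    return list(product(from_tasks, to_tasks))


edges = cross_edges(["extract_a", "extract_b"], ["load_x", "load_y"])
print(edges)
```

In Airflow 2.x the real helper is imported with `from airflow.models.baseoperator import cross_downstream` and called with two lists of tasks.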

### 6. chain
- `chain` is a helper function that links a sequence of tasks so that each one runs after the previous one; `chain(a, b, c)` is equivalent to `a >> b >> c`.
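
The edges produced for a simple sequence can be modelled as below. `chain_edges` is a hypothetical stand-in that handles only individual tasks; it is not Airflow's function and does not cover the list arguments that the real `chain` also accepts.

```python
def chain_edges(*tasks):
    """Return the (upstream, downstream) pairs linking consecutive tasks."""
    return list(zip(tasks, tasks[1:]))


edges = chain_edges("extract", "transform", "validate", "load")
print(edges)
```

In Airflow 2.x the real helper is imported with `from airflow.models.baseoperator import chain`, which is convenient when the task list is built programmatically.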

---

# Questions and Answers

### 1. Why is Airflow better than a simple scheduler? When would you prefer the cron utility over Airflow?

- **Airflow** adds `task dependencies`, automatic `retries` when tasks fail, `logging`, and monitoring on top of scheduling. It also supports `dynamic task creation` and better `error handling`, which cron doesn't provide.
- **Cron** might be preferred for simple scheduling needs where task dependencies, retries, and logs aren't required. For instance, when you need to run a basic script at a regular interval, cron can be sufficient.

### 2. In what different ways can a task be triggered?

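- `Scheduled execution`: the scheduler creates a DAG run according to the DAG's schedule.
- `Manual execution`: a run is started from the UI or the CLI.
- `Programmatic execution`: external code (not the DAG itself) starts a run using `trigger_dag` or the Airflow REST API.
- `Task dependencies`: within a run, a task becomes eligible once its upstream tasks complete.
- `Code changes` do not trigger a run by themselves: the scheduler re-parses the DAG file, and the new definition applies to later runs.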
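
As a sketch of programmatic triggering, the request below targets the stable REST API of Airflow 2.x (`POST /api/v1/dags/{dag_id}/dagRuns`). It assumes the basic-auth API backend is enabled; the URL, DAG id, and credentials are placeholders. The request is only built, not sent, so the snippet runs without a live server.

```python
import base64
import json
import urllib.request

# Hypothetical values: replace with your webserver URL, DAG id, and credentials.
base_url = "http://localhost:8080"
dag_id = "my_dag"
username, password = "admin", "admin"

token = base64.b64encode(f"{username}:{password}".encode()).decode()
request = urllib.request.Request(
    url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
    data=json.dumps({"conf": {}}).encode(),  # optional run configuration
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    },
    method="POST",
)

# Sending is left commented out so the sketch works without an Airflow server.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["dag_run_id"])
print(request.get_method(), request.full_url)
```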

### 3. Can Airflow run multiple tasks in parallel?

- Yes, Airflow can run multiple tasks in parallel, provided that the necessary resources (e.g., worker capacity) are available. 

Whether tasks actually run in parallel depends on the configured executor:

- `SequentialExecutor`: Runs only one task at a time.
- `LocalExecutor`: Runs tasks in parallel on your `local machine` using multiple `processes`.
- `CeleryExecutor`: Distributes tasks across `multiple worker nodes`, ideal for scaling horizontally.
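
Which tasks can run side by side is determined by the dependency graph; the executor then decides how many of them actually do. The sketch below uses the standard-library `graphlib` (not Airflow) to group a hypothetical DAG into "waves" of tasks whose upstream tasks are all finished.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on.
dependencies = {
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as finished
print(waves)
```

Here `transform_a` and `transform_b` land in the same wave, so an executor with at least two free slots could run them concurrently, while a sequential executor would run them one after the other.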

### 4. How can you handle dependencies between tasks?

- Dependencies in Airflow are handled using:
    - Task dependencies (`>>`, `<<`, `set_upstream`, `set_downstream`, `cross_downstream`).
    - The `chain` function can also be used to define sequences of tasks.

### 5. How can you monitor your workflow?

- You can monitor your workflow in Airflow using:
    - The `Airflow UI` to check DAG and task statuses.
    - `Logs` generated by each task run.
    - `Alerts` and `notifications` for failures and retries.
    - Airflow's integration with `external tools` like `Grafana` or `Prometheus`.

### 6. What are the two ways to create a DAG? Are there any pros and cons?

- **1. Using the `DAG` context manager (`with DAG`):**
    - **Pros**: Simplifies the definition of tasks and their dependencies, avoids code duplication, and provides clear DAG structure.
    - **Cons**: Less flexibility for complex dynamic DAGs.
- **2. Using the `@dag` decorator (Airflow 2.0+):**
    - **Pros**: More Pythonic and concise, better readability, and simpler task management.
    - **Cons**: Can be less intuitive for more complex DAG structures, and it requires Airflow 2.0 or later.
- **Note on the `Airflow UI`:** DAGs cannot be created in the UI. They are defined in Python files, and the UI is used to trigger, pause, and monitor them.

### 7. Can a task have more than one upstream dependency?

- Yes, a task can have multiple upstream dependencies in Airflow. This allows tasks to depend on the completion of several other tasks before they can run.

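For example, these dependencies make `final_task` downstream of all three tasks:

- `task1 >> final_task`
- `task2 >> final_task`
- `task3 >> final_task`

The same dependencies can be written in one line as `[task1, task2, task3] >> final_task`. By default, `final_task` runs only after `task1`, `task2`, and `task3` have all succeeded.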

### 8. How can you test a task individually if it has upstream dependencies?

- You can test a task individually in Airflow by:
    - Running the task with `airflow tasks test <dag_id> <task_id> <date>`, which runs a single task instance without checking its upstream dependencies or recording its state in the metadata database.
    - Triggering the task manually from the UI or CLI after ensuring that upstream tasks have been completed.

### 9. What are the parameters to create a DAG?

- Key parameters to create a DAG:
    - `dag_id`: Unique identifier for the DAG.
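    - `default_args`: A dictionary of default arguments applied to every task in the DAG (e.g., `retries`, `retry_delay`, `start_date`).
    - `schedule_interval`: Defines the schedule (a cron expression, a preset such as `@daily`, or a `timedelta`). Airflow 2.4 introduced `schedule` as its replacement.
    - `catchup`: If `True`, the scheduler creates a run for every schedule interval between `start_date` and now that has not yet run.
    - `dagrun_timeout`: Maximum time a DAG run may stay in the running state before it is marked as failed.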
    - `description`: A short description of the DAG.

### 10. What are the possible states for a task?

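- Common task states in Airflow include:
    - `none`: The task has not been scheduled yet because its dependencies are not met.
    - `scheduled`: The scheduler has determined that the task's dependencies are met and it should run.
    - `queued`: The task has been handed to the executor and is waiting for a worker.
    - `running`: The task is currently being executed.
    - `success`: The task has successfully completed.
    - `failed`: The task has failed during execution.
    - `upstream_failed`: An upstream task failed, so this task was not run.
    - `skipped`: The task was skipped due to conditional logic such as branching.
    - `up_for_retry`: The task has failed but has retry attempts left and will be run again.
    - `up_for_reschedule`: The task is a sensor in `reschedule` mode waiting for its next check.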

### 11. Why is SLA important?

- SLA (Service Level Agreement) in Airflow is important because it sets the expectation for when a task should be complete. It lets teams track performance and gives insight into the reliability and efficiency of workflows.
- `SLA for a task`: You can set an SLA on each individual task to specify the `maximum time allowed for the task` to complete, measured relative to the start of the DAG run.
- `SLA Miss`: If a task has not finished within the defined SLA, an "SLA miss" is recorded. This can trigger notifications (such as emails) to alert the stakeholders; it does not stop or fail the task.
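
The idea of an SLA miss can be modelled as a simple time comparison. The function below is a toy illustration under the assumption that the SLA is measured from the start of the DAG run; it is not how Airflow implements its check, and `missed_sla` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def missed_sla(run_start, finished_at, sla):
    """Return True if the task finished later than run_start + sla."""
    return finished_at > run_start + sla


run_start = datetime(2024, 1, 1, 6, 0)
sla = timedelta(minutes=30)

on_time = missed_sla(run_start, datetime(2024, 1, 1, 6, 20), sla)
late = missed_sla(run_start, datetime(2024, 1, 1, 6, 45), sla)
print(on_time, late)
```

In Airflow 2.x the SLA itself is declared with the `sla` argument on a task, for example `sla=timedelta(minutes=30)`.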
