<div style="line-height:1.2;">

<h1 style="color:darkturquoise; margin-bottom: 0.3em;">Airflow tutorial 1 </h1>

<p style="margin-top: 0.5em; margin-bottom: 1.5em;"><strong> Basic concepts for working with workflows </strong></p>

<div style="line-height:1.4; margin-bottom: 1em;">
    <h3 style="color: lightblue; display: inline; margin-right: 0.5em;">Keywords:</h3>
    <span style="display: inline;">DAG creation + terminal commands to launch Airflow + attributes included in default_args + BashOperator + Jinja templating + depends_on_past + schedule options </span>
</div>

<div style="line-height:1.4; margin-top: 1em;">
    <h3 style="color: red; display: inline; margin-right: 0.5em;">Notes:</h3>
    <span style="display: inline;">
    Jupyter notebooks must be converted to Python scripts before Airflow can run them, since notebooks are not designed for task scheduling and workflow management. <br>
    All DAG ".py" files must be placed in Airflow's dags folder => ~/airflow/dags (or the directory specified in the "airflow.cfg" config file). <br>
    In Apache Airflow, multiple DAGs can be defined in the same Python file; a separate file for each DAG is not mandatory. <br>
    </span>
</div>

</div>

In [1]:
%%script echo skipping, it\'s already installed
!pip install python-dateutil

skipping, it's already installed


In [2]:
import os
import psutil
import calendar
import textwrap

from datetime import datetime, timedelta
from dateutil.rrule import rrule, DAILY
from dateutil.relativedelta import relativedelta

from airflow import DAG
#from airflow.operators.python_operator import PythonOperator           #deprecated!
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

In [3]:
""" Define the dictionary of default parameters. It is passed explicitly to the DAG, which applies it to every task it creates. """
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 14),
    #'email': ['your_email@example.com'],
    #'email_on_failure': False,
    #'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
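
The way default_args combines with arguments given directly to an operator can be pictured as a dictionary merge in which the explicit task arguments win. The sketch below is a simplified, standard-library-only model of that precedence, not Airflow's actual implementation; `resolve_task_args` is a hypothetical helper introduced only for illustration.

```python
from datetime import timedelta

# Defaults shared by every task of the DAG (a subset of the keys used above).
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def resolve_task_args(defaults, **task_kwargs):
    """Hypothetical helper: explicit task arguments override the DAG defaults."""
    return {**defaults, **task_kwargs}


# A task that overrides only `retries` keeps every other default.
resolved = resolve_task_args(default_args, retries=3)
print(resolved["retries"])      # 3
print(resolved["retry_delay"])  # 0:05:00
```

The same idea explains why a single task can later set retries=3 while the rest of the DAG keeps retries=1.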

<h3 style="color: darkturquoise;"> Recap: DAG </h3>
<div style="margin-top: -8px;">
A Directed Acyclic Graph (DAG) is a collection of tasks organized to reflect their relationships and dependencies in a workflow.  <br>
Nodes represent tasks, and edges represent the dependencies between them. <br>
The edges are directed, so each task has a predefined execution order, and the graph is acyclic, so no task can end up depending on itself. <br>
=> Each DAG in Airflow represents a separate workflow, and tasks belong to a specific DAG. <br>
=> When a DAG is scheduled to run periodically, each scheduled run of the DAG is considered a separate instance (a DAG run). 
</div>
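
To make the "directed" and "acyclic" parts concrete, here is a minimal sketch that uses only the standard library (graphlib, Python 3.9+), independent of Airflow. The task names are illustrative assumptions; the point is that a valid execution order follows from the dependencies, and that a cycle makes such an order impossible.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (its upstream tasks).
# The task names are illustrative only.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# static_order() yields the tasks so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']

# A cycle is rejected: no execution order can satisfy "a after b" and "b after a".
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_found = False
except CycleError:
    cycle_found = True
print(cycle_found)  # True
```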

<h2 style="color: darkturquoise;"> Example #1 </h2>

In [5]:
dag_1 = DAG('first_dag_example', 
            default_args=default_args, 
            description='A simple DAG to write current time to a file',
            schedule=timedelta(days=1))     # schedule_interval is deprecated (since Airflow 2.4); use schedule

In [6]:
def function_1_for_first_operator():
    """ Write the current date and time into a text file.
    A simple function to try out Airflow; it doesn't require external APIs.
    """
    now = datetime.now()
    current_time = now.strftime("%Y-%m-%d %H:%M:%S")

    with open("current_time.txt", "a") as file:
        file.write(f"Current Date and Time: {current_time}\n")

    print(f"I have written to file this: Current Date and Time are => {current_time}")

In [7]:
# Create an operator => a task
write_time_task = PythonOperator(
    task_id='first_task_get_date',      # '#' is not allowed in a task_id
    python_callable=function_1_for_first_operator,
    dag=dag_1,
)

In [8]:
write_time_task

<Task(PythonOperator): first_task_get_date>

In [9]:
def function_2_for_second_operator():
    """ Define a list of meetings (each one a dictionary with date and topic) and print them on a monthly calendar.
    
    Notes:
        - Days outside the current month are represented by 0 in the calendar matrix
        - A newline marks the end of each week
        - Days with a meeting are marked with *
    """
    meetings = [
        {"date": datetime(2024, 1, 2), "topic": "Team Introduction"},
        {"date": datetime(2024, 1, 5), "topic": "Project Kickoff and Vision Sharing"},
        {"date": datetime(2024, 1, 7), "topic": "Initial Requirements Gathering"},
        {"date": datetime(2024, 1, 10), "topic": "Technology Stack Discussion"},
        {"date": datetime(2024, 1, 12), "topic": "Development Workflow Setup"},
        {"date": datetime(2024, 1, 14), "topic": "First Sprint Planning"},
        {"date": datetime(2024, 1, 17), "topic": "Coding Standards Review"},
        {"date": datetime(2024, 1, 20), "topic": "Progress Check and Feedback"},
        {"date": datetime(2024, 1, 22), "topic": "Code Review Process"},
        {"date": datetime(2024, 1, 27), "topic": "Preparation for First Commit"},
        {"date": datetime(2024, 1, 28), "topic": "First Commit Celebration"},
    ]

    ##### Create the calendar
    year, month = 2024, 1
    first_day_of_month = datetime(year, month, 1)
    last_day_of_month = first_day_of_month + relativedelta(months=1, days=-1)
    cal = calendar.monthcalendar(year, month)

    ############################## Display the calendar
    print("Calendar for", calendar.month_name[month], year)
    days = "Mo Tu We Th Fr Sa Su"
    print(days)
    for week in cal:
        for day in week:
            if day == 0:
                print("   ", end="") 
            else:
                date = datetime(year, month, day)
                meeting_topics = [m["topic"] for m in meetings if m["date"].date() == date.date()]
                day_str = str(day) if not meeting_topics else f"{day}*"
                print(f"{day_str:>2} ", end="")
        print()  

    print("\nMeetings this month:")
    for meeting in meetings:
        if first_day_of_month <= meeting["date"] <= last_day_of_month:
            print(f"- {meeting['date'].strftime('%Y-%m-%d')}: {meeting['topic']}")

function_2_for_second_operator()

Calendar for January 2024
Mo Tu We Th Fr Sa Su
 1 2*  3  4 5*  6 7* 
 8  9 10* 11 12* 13 14* 
15 16 17* 18 19 20* 21 
22* 23 24 25 26 27* 28* 
29 30 31             

Meetings this month:
- 2024-01-02: Team Introduction
- 2024-01-05: Project Kickoff and Vision Sharing
- 2024-01-07: Initial Requirements Gathering
- 2024-01-10: Technology Stack Discussion
- 2024-01-12: Development Workflow Setup
- 2024-01-14: First Sprint Planning
- 2024-01-17: Coding Standards Review
- 2024-01-20: Progress Check and Feedback
- 2024-01-22: Code Review Process
- 2024-01-27: Preparation for First Commit
- 2024-01-28: First Commit Celebration


In [10]:
#### Create second operator 
create_meetings_task = PythonOperator(
    task_id='create_meetings_table',    # '#' is not allowed in a task_id
    python_callable=function_2_for_second_operator,
    dag=dag_1,
)

In [11]:
# Set a dependency
write_time_task >> create_meetings_task

<Task(PythonOperator): create_meetings_table>
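
The cell output above shows the downstream task because `a >> b` evaluates to `b`. The following toy model, which does not use Airflow, sketches that behaviour under the assumption that `>>` simply calls set_downstream and returns its right operand. `ToyTask` is a hypothetical class written only for this illustration.

```python
class ToyTask:
    """Toy stand-in for an Airflow task; it only records its downstream tasks."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def set_downstream(self, other):
        # `other` runs after `self`.
        self.downstream.append(other)

    def set_upstream(self, other):
        # `other` runs before `self`.
        other.set_downstream(self)

    def __rshift__(self, other):
        # a >> b: register the dependency and return b, which allows a >> b >> c.
        self.set_downstream(other)
        return other

    def __repr__(self):
        return f"<ToyTask: {self.task_id}>"


a, b, c = ToyTask("a"), ToyTask("b"), ToyTask("c")
result = a >> b >> c

print(result)        # <ToyTask: c>
print(a.downstream)  # [<ToyTask: b>]
print(b.downstream)  # [<ToyTask: c>]
```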

<h3 style="color: darkturquoise;"> Recap: start Airflow Web-UI </h3>
<div style="margin-top: -8px;">


First and foremost
- Set up the Airflow database with all necessary tables:
    $ airflow db migrate
- The older command below is deprecated; use the one above instead:
    $ airflow db init

- Create a user: (no need to shut down the connection)
    $ airflow users create --username admin --password mysecurepassword --firstname John --lastname Doe --role Admin --email johndoe@example.com

- Launch in separate terminals:

    - $ airflow webserver --port 8080
    - $ airflow scheduler 
</div>

- Open Airflow in the browser to use the DAGs in the Web UI => http://localhost:8080

<h3 style="color: darkturquoise;"> Notes: Extra SW to install </h3>
<div style="margin-top: -8px;">


Use the command: <br>
    $ pip install connexion[swagger-ui] <br>
=> To avoid the warning about a missing Swagger UI directory when using Connexion. <br>
Swagger UI is a web-based interface for browsing and interacting with the API documentation. <br>
</div>

<h2 style="color: darkturquoise;"> Example #2 </h2>

In [None]:
default_arg_2={
    "depends_on_past": False,
    "email": ["airflow@example.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,             # works also with a list of funcs
    # 'on_success_callback': some_other_function,       # works also with a list of funcs
    # 'on_retry_callback': another_function,            # works also with a list of funcs
    # 'sla_miss_callback': yet_another_function,        # works also with a list of funcs
    # 'trigger_rule': 'all_success'
}

<h3 style="color: darkturquoise;"> Recap: textwrap </h3>
<div style="margin-top: -8px;">
textwrap is a module used for formatting text by adjusting the line breaks in the input string.  <br>
It is useful for displaying text at a specific width in console applications or in a limited-space environment (e.g. a UI element).

The primary use of textwrap is to wrap or fill text: <br>
- Wrapping => breaking a single long line into a list of lines of a specific width <br>
- Filling => doing the same, but returning a single string with newlines separating the lines

textwrap can also be used for: <br>
- Adjusting indentation <br>
- Handling whitespace <br>
- Customizing word splitting <br>
</div>
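
A short sketch of the three textwrap functions mentioned in this recap. The sample sentence and the width of 20 are arbitrary choices for illustration.

```python
import textwrap

text = "Airflow schedules tasks and tracks their dependencies."

# wrap() returns a list of lines, each at most `width` characters long.
lines = textwrap.wrap(text, width=20)
print(lines)

# fill() returns the same lines joined with newlines into a single string.
filled = textwrap.fill(text, width=20)
print(filled)

# dedent() removes the indentation that all lines have in common,
# which keeps indented multi-line strings readable in the source code.
block = """
    echo "first"
    echo "second"
"""
dedented = textwrap.dedent(block)
print(dedented)
```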

In [4]:
""" Create the graph (default_args are passed on to each operator).
Defining all operators inside the DAG's "with" block avoids adding "dag=dag" to each task definition.
With the "context manager" approach (the "with" statement), a temporary context is set up and reliably torn down, even if an error occurs.
It is a convenient way to manage resources such as file streams, database connections, or anything else that needs to be SET UP and then CLEANED UP.

N.B.1
A task_id cannot contain symbols such as #, only alphanumeric characters, dashes, dots, and underscores.
Otherwise Airflow raises an AirflowException.
"""
with DAG(
    "tutorial",
    default_args={
        "depends_on_past": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        'queue': 'bash_queue',
        'execution_timeout': timedelta(seconds=300),
        'trigger_rule': 'all_success'
        # 'pool': 'backfill',
        # 'priority_weight': 10,
        # 'end_date': datetime(2016, 1, 1),
        # 'wait_for_downstream': False,
        # 'sla': timedelta(hours=2),
        # 'on_failure_callback': some_function, # or list of functions
        # 'on_success_callback': some_other_function, # or list of functions
        # 'on_retry_callback': another_function, # or list of functions
        # 'sla_miss_callback': yet_another_function, # or list of functions
    },
    description="Simple DAG with Bash operator",
    schedule=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["simple_example_1"],
) as dag:
    ######################################## Create tasks by instantiating operators
    print_date_0 = BashOperator(
        task_id="print_date_0",                 
        bash_command="date",
    )
    create_file_1 = BashOperator(
        task_id='create_file_1',
        bash_command='touch ~/file_temp_1.txt',
    )
    write_content_2 = BashOperator(
        task_id='write_content_2',
        bash_command='echo "Sample content to insert just to testing. Ok works " > ~/file_temp_1.txt',
    )
    modify_content_3 = BashOperator(
        task_id='modify_content_sed_3',
        bash_command='sed -i "s/content/text/g" ~/file_temp_1.txt',
    )
    flow_operator_3 = BashOperator(
        task_id='flow_operator_3',
        bash_command='echo "Flow step 3"',
    )    
    process_content_4 = BashOperator(
        task_id='process_content_awk_4',
        bash_command='awk \'{print $1}\' ~/file_temp_1.txt > ~/file_temp_2.txt',
    )
    sleep_5 = BashOperator(
        task_id="sleep_5",
        # depends_on_past => does a task depend on its own previous instance,
        # i.e. the same task in the previous DAG run?
        # False => the task does not depend on the success of its previous run:
        # it will run regardless of the success or failure of its previous instance.
        depends_on_past=False,
        bash_command="sleep 5",
        retries=3,
    )
    process_content_6 = BashOperator(
        task_id='read_file_6',
        bash_command='cat ~/file_temp_2.txt',
    )
    remove_file_7 = BashOperator(
        task_id='remove_file_7',
        bash_command='rm  ~/file_temp_1.txt',
    )
    flow_operator_7 = BashOperator(
        task_id='flow_operator_7',
        bash_command='echo "File 1 deleted"',
    )    
    remove_file_8 = BashOperator(
        task_id='remove_file_8',
        bash_command='rm  ~/file_temp_2.txt',
    )
    
    """ Jinja templating engine:
    textwrap.dedent() removes the common leading indentation from the multi-line command string (the templated command).
    Jinja templating can be used to incorporate dynamic elements based on the context of the DAG run.
    For instance, {{ ds }} is a template variable for the execution date as a YYYY-MM-DD string.
    """
    # Documentation can be attached to the DAG anywhere inside the block
    dag.doc_md = """ Documentation Read me"""  
    
    ############ Use Jinja
    """ For each run of the DAG, {{ ds }} is replaced with the execution date and {{ macros.ds_add(ds, 7) }}
    with the execution date plus 7 days. """
    templated_command = textwrap.dedent("""
        {% for i in range(11) %}
            echo "{{ ds }}"
            echo "{{ macros.ds_add(ds, 7)}}"
        {% endfor %}
    """
    )
    final_task_11 = BashOperator(
        task_id="templated_11",
        depends_on_past=False,
        bash_command=templated_command,
    )
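
Why operators created inside the "with" block need no dag=dag argument can be sketched with a toy context manager. This is a simplified model built on plain Python, not Airflow's real mechanism; `ToyDag` and `ToyOperator` are hypothetical classes used only for illustration.

```python
class ToyDag:
    """Toy stand-in for a DAG that collects tasks created inside its `with` block."""

    _current = None  # the DAG whose `with` block is currently active

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.task_ids = []

    def __enter__(self):
        ToyDag._current = self  # set up: make this DAG the active one
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyDag._current = None  # tear down: runs even if the block raises
        return False            # do not swallow exceptions


class ToyOperator:
    """Toy stand-in for an operator that registers itself with the active DAG."""

    def __init__(self, task_id):
        self.task_id = task_id
        if ToyDag._current is not None:
            ToyDag._current.task_ids.append(task_id)


with ToyDag("tutorial") as toy_dag:
    ToyOperator("print_date_0")
    ToyOperator("create_file_1")

ToyOperator("outside")  # created after the block, so it is not attached

print(toy_dag.task_ids)  # ['print_date_0', 'create_file_1']
```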

<h3 style="color: darkturquoise;"> Recap: Jinja templating </h3>
<div style="margin-top: -8px;">
Jinja templating in Apache Airflow offers a variety of built-in variables and macros (such as {{ ds }} and {{ macros.ds_add }}) for creating dynamic task definitions and commands.
</div>
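
To show what the templated command prints for a given run, the sketch below reproduces the date arithmetic behind {{ macros.ds_add(ds, 7) }} with the standard library only. `ds_add_sketch` is a hypothetical re-implementation written for illustration, and the value assigned to `ds` is an assumed example of what Airflow would substitute for one DAG run.

```python
from datetime import datetime, timedelta


def ds_add_sketch(ds, days):
    """Hypothetical re-implementation of the idea behind macros.ds_add:
    shift a YYYY-MM-DD string by a number of days (negative values go back)."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2024-01-14"  # assumed example value of {{ ds }} for one DAG run

print(ds)                      # 2024-01-14
print(ds_add_sketch(ds, 7))    # 2024-01-21
print(ds_add_sketch(ds, -14))  # 2023-12-31
```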

In [5]:
""" Use set_upstream and set_downstream for defining task workflow
task2.set_upstream(task1)   => task1 will run before task2, i.e. task2 depends on task1
task2.set_downstream(task1) => task2 will run before task1, i.e. task1 depends on task2
"""
print_date_0.set_downstream(create_file_1)
create_file_1.set_downstream(write_content_2)
write_content_2.set_downstream(modify_content_3)
modify_content_3.set_downstream(flow_operator_3)
flow_operator_3.set_downstream(process_content_4)
process_content_4.set_downstream(sleep_5)
sleep_5.set_downstream(process_content_6)
process_content_6.set_downstream(remove_file_7)
remove_file_7.set_downstream(flow_operator_7)
flow_operator_7.set_downstream(remove_file_8)

"""
print_date_0 >> \
create_file_1 >> \
write_content_2 >> \
modify_content_3 >> \
flow_operator_3 >> \
process_content_4 >> \
sleep_5 >> \
process_content_6 >> \
remove_file_7  >> \
flow_operator_7 >> \
remove_file_8
""";

<h2 style="color: darkturquoise;"> Example #3 </h2>

In [7]:
%%script echo skipping since the requirement is already satisfied
!pip install psutil

skipping since the requirement is already satisfied


In [8]:
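def third_function_get_information():
    """ Print a short report with the current CPU, memory, and disk usage. """
    # interval=1 samples the CPU for one second; without an interval the first call returns a meaningless 0.0
    cpu_usage = psutil.cpu_percent(interval=1)
    memory_usage = psutil.virtual_memory().percent
    disk_usage = psutil.disk_usage('/').percent

    report = "System Statistics Report:\n"
    report += f"CPU Usage: {cpu_usage}%\n"
    report += f"Memory Usage: {memory_usage}%\n"
    report += f"Disk Usage: {disk_usage}%\n"

    print(report)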

In [9]:
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 14),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_3 = DAG(
    'system_stats_dag',
    default_args=default_args,
    description='A simple DAG to report system stats',
    schedule=timedelta(days=1), # schedule_interval is deprecated!
)

system_stats_task = PythonOperator(
    task_id='report_system_stats',
    python_callable=third_function_get_information,
    dag=dag_3,
)

In [10]:
system_stats_task

<Task(PythonOperator): report_system_stats>

<h2 style="color: darkturquoise;"> Example #4</h2>

In [11]:
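def get_weather_task():
    """ Fetch the current weather for a city from the OpenWeatherMap API and print it. """
    import requests
    
    city = "London"
    api_key = "your_api_key"    # placeholder: replace with a real API key
    url = "https://api.openweathermap.org/data/2.5/weather"

    # A timeout prevents the task from hanging if the API does not answer
    response = requests.get(url, params={"q": city, "appid": api_key}, timeout=10)

    if response.status_code == 200:
        # Parse the body only after a successful response
        data = response.json()
        weather = data['weather'][0]['description']
        # Convert from Kelvin to Celsius
        temperature = data['main']['temp'] - 273.15  
        print(f"Current weather in {city}: {weather}, Temperature: {temperature:.2f}°C")
    else:
        print(f"Failed to retrieve weather data (HTTP {response.status_code})")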


In [12]:
""" Some options for schedule """
dag_4 = DAG('weather_data_dag',
    default_args=default_args,
    description='A simple DAG to fetch weather data',
    schedule="0 0 * * *",       # cron expression: daily at midnight
    #schedule="@daily",         # preset: daily at midnight
    #schedule="@hourly",        # preset: at the start of every hour
    #schedule="@yearly",        # preset: on January 1st at midnight
    #schedule="0 0 * * 0",      # weekly on Sunday at midnight
    #schedule="*/15 * * * *",   # every 15 minutes 
    #schedule="0 0 1 * *",      # monthly on the 1st day at midnight
    #schedule="0 12 */5 * *",   # at noon on every 5th day of the month (1st, 6th, 11th, ...)
    #schedule="0 19 * * 1,3,5", # every Monday, Wednesday, and Friday at 7 PM
    #schedule="0 */6 * * *",    # every 6 hours
)

weather_task = PythonOperator(
    task_id='fetch_weather',
    python_callable=get_weather_task,
    dag=dag_4,
)

weather_task

<Task(PythonOperator): fetch_weather>
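
The presets and cron expressions listed for schedule are two spellings of the same thing. The sketch below maps some common presets to their cron equivalents and labels the five cron fields. It is a reading aid in plain Python, not a cron parser; `CRON_PRESETS`, `FIELDS`, and `describe` are hypothetical names introduced for this illustration.

```python
# Cron equivalents of some common schedule presets.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}

# The five fields of a cron expression, in order.
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe(schedule):
    """Hypothetical helper: expand a preset if needed and label each cron field."""
    expression = CRON_PRESETS.get(schedule, schedule)
    return dict(zip(FIELDS, expression.split()))


print(describe("@daily"))
print(describe("0 19 * * 1,3,5"))  # Monday, Wednesday, and Friday at 7 PM
```

Reading "0 19 * * 1,3,5" field by field gives minute 0, hour 19, any day of the month, any month, and weekdays 1, 3, and 5.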