# **Data Pipelines Using Apache Airflow**

## Scenario

Write a pipeline that analyzes a web server log file, extracts the required lines (those ending with html) and fields (timestamp and size), transforms the data (bytes to MB), and loads the result (appending it to an existing file).

**Objectives**

In this assignment you will author an Apache Airflow DAG that will:

- Extract data from a web server log file
- Transform the data
- Load the transformed data into a tar file
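
Before wiring the steps into Airflow, the three objectives can be prototyped in plain Python. The sketch below is only a stand-in for the DAG, not the assignment solution: the sample log lines, the helper names, the upper-casing transformation, and the file names `transformed_data.txt` and `weblog.tar` are all assumptions made for illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def extract(log_lines):
    """Return the first whitespace-separated field of each non-empty line."""
    return [line.split()[0] for line in log_lines if line.strip()]


def transform(fields):
    """Placeholder transformation: upper-case every extracted field."""
    return [field.upper() for field in fields]


def load(rows, work_dir):
    """Write the rows to a text file and pack that file into a tar archive."""
    work_dir = Path(work_dir)
    data_file = work_dir / "transformed_data.txt"  # hypothetical file name
    data_file.write_text("\n".join(rows) + "\n")
    archive = work_dir / "weblog.tar"  # hypothetical archive name
    with tarfile.open(archive, "w") as tar:
        tar.add(data_file, arcname=data_file.name)
    return archive


# Made-up sample lines; the real accesslog.txt may be laid out differently.
sample_log = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 1024',
]

with tempfile.TemporaryDirectory() as tmp:
    archive_path = load(transform(extract(sample_log)), tmp)
    with tarfile.open(archive_path) as tar:
        archived_names = tar.getnames()
```

Keeping each stage as a separate function mirrors the one-task-per-stage layout the DAG uses.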

## Tools / Software

Apache Airflow

## **Exercise 1**

- Prepare the lab environment

Before you start the assignment:

- Start Apache Airflow.
- Download the dataset from the source to the destination mentioned below.

Source: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/ETL/accesslog.txt

Destination: /home/project/airflow/dags/capstone

In [None]:
theia@theiadocker-orokgospel:/home/project$ cd airflow
theia@theiadocker-orokgospel:/home/project/airflow$ cd dags
theia@theiadocker-orokgospel:/home/project/airflow/dags$ mkdir capstone
theia@theiadocker-orokgospel:/home/project/airflow/dags$ cd capstone
theia@theiadocker-orokgospel:/home/project/airflow/dags/capstone$ wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/ETL/accesslog.txt
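
The same download can be scripted in Python if `wget` is not available. This is a minimal sketch using only the standard library; the helper name `download_to` is hypothetical, and the function is defined here without being called so that nothing is fetched until you invoke it with the source URL and destination directory given above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_to(url, dest_dir):
    """Download url into dest_dir, keeping the file name from the URL path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = dest_dir / Path(urlparse(url).path).name
    urlretrieve(url, str(target))
    return target
```

Calling `download_to(source_url, "/home/project/airflow/dags/capstone")` would place `accesslog.txt` in the destination folder, creating that folder first if needed.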

## **Exercise 2**

- Create a DAG

**Task 1**

- Define the DAG arguments

Create a DAG with these arguments.

- owner
- start_date
- email

You may define any suitable additional arguments.

In [None]:
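# Imports needed by the DAG arguments
from datetime import timedelta
from airflow.utils.dates import days_ago

# Defining DAG arguments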

# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'Gospel Orok',
    'start_date': days_ago(0),
    'email': ['orokgospel@gmeil.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
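
Because `days_ago(0)` resolves to a different date each day the file is parsed, Airflow's guidance generally favours a static `start_date`. The variant below is a sketch of the same arguments with a fixed date, built only from the standard library; the date `2024-01-01` is an arbitrary placeholder, and the last line simply works out how much delay the retry settings can add.

```python
from datetime import datetime, timedelta

# Same arguments as above, but with a fixed (placeholder) start date
default_args = {
    "owner": "Gospel Orok",
    "start_date": datetime(2024, 1, 1),  # hypothetical date, pick your own
    "email": ["orokgospel@gmeil.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# With 1 retry and a 5-minute delay, a failing task waits at most this long
# in total between attempts before it is marked as failed
max_retry_wait = default_args["retries"] * default_args["retry_delay"]
```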

**Task 2**

- Define the DAG

Create a DAG named process_web_log that runs daily.

Use a suitable description.

In [None]:
# Import the DAG class
from airflow import DAG

# Defining the DAG

dag = DAG(
    'process_web_log',
    default_args=default_args,
    description='My ETL Capstone DAG',
    schedule_interval=timedelta(days=1),
)
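
A daily interval does not mean the first run starts at `start_date`. Assuming Airflow 2 scheduling semantics, each run covers one interval and is triggered once that interval has ended. The sketch below illustrates this with plain `datetime` arithmetic; the helper `first_runs` and the start date are hypothetical and not part of Airflow.

```python
from datetime import datetime, timedelta


def first_runs(start_date, interval, count):
    """Return (logical_date, earliest_trigger_time) pairs for the first runs."""
    runs = []
    for i in range(count):
        logical_date = start_date + i * interval
        # The run for an interval fires only after the interval is over
        runs.append((logical_date, logical_date + interval))
    return runs


schedule = first_runs(datetime(2024, 1, 1), timedelta(days=1), 3)
for logical_date, trigger_time in schedule:
    print(f"run for {logical_date:%Y-%m-%d} starts no earlier than {trigger_time:%Y-%m-%d}")
```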

**Task 3**

- Create a task to extract data

Create a task named extract_data.

This task should extract the ipaddress field from the web server log file and save it into a file named extracted_data.txt

In [None]:
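# Import the operator used by the tasks
from airflow.operators.bash import BashOperator

# Define the task 'extract_data'
# Assumes the log is space-delimited with the IP address as the first field
extract = BashOperator(
    task_id='extract_data',
    bash_command='cut -d" " -f1 /home/project/airflow/dags/capstone/accesslog.txt > /home/project/airflow/dags/capstone/extracted_data.txt',
    dag=dag,
)

In [None]:
# Define the task 'transform'
# Reads extracted_data.txt and writes to a separate file (transformed_data.txt
# is an assumed name), because redirecting output onto the input file would
# truncate it before it is read
transform = BashOperator(
    task_id='transform',
    bash_command='tr "[a-z]" "[A-Z]" < /home/project/airflow/dags/capstone/extracted_data.txt > /home/project/airflow/dags/capstone/transformed_data.txt',
    dag=dag,
)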
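
The delimiter passed to `cut -d` decides what ends up in the output file, so it is worth checking against a sample line before running the DAG. The following is a rough Python imitation of `cut` for a single line, written for illustration only; the helper and the sample line are assumptions, and the real `accesslog.txt` may use a different layout.

```python
def cut(line, delimiter, fields):
    """Mimic `cut -d<delimiter> -f<fields>` for one line (1-based fields)."""
    if delimiter not in line:
        # Like cut without -s: lines lacking the delimiter pass through whole
        return line
    parts = line.split(delimiter)
    selected = [parts[i - 1] for i in sorted(fields) if i <= len(parts)]
    return delimiter.join(selected)


# Made-up sample line in a common access-log layout
sample = '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 2048'

ip_only = cut(sample, " ", [1])       # space-delimited: first field is the IP
unchanged = cut(sample, "#", [1, 4])  # no "#" in the line: nothing is cut
upper = "get /index.html".upper()     # what tr "[a-z]" "[A-Z]" does to ASCII text
```

If the delimiter never occurs in the log, the "extracted" file is just a copy of the input, which is easy to miss when the task itself reports success.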