# "Data Engineering - Week 2"
> "Week 2 - Data Engineering Zoomcamp course."

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [data engineering, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

**Note**: The content of this post comes from the course videos, my own understanding and searches, and reference documentation.


> youtube: https://youtu.be/W3Zm6rjOq70


# Data Lake

![](images/data-engineering-w2/1.png)

A data lake is a collection of technologies that enables querying of data contained in files or blob objects. When used effectively, it enables massive-scale, cost-effective analysis of structured and unstructured data assets [[source](https://lakefs.io/data-lakes/)].

Data lakes are comprised of four primary components: storage, format, compute, and metadata layers [[source](https://lakefs.io/data-lakes/)].

![](images/data-engineering-w2/2.png)


A data lake is a centralized repository for large amounts of data from a variety of sources. The data can be structured, semi-structured, or unstructured.
The goal is to ingest data rapidly and make it accessible to other team members such as data scientists, analysts, and engineers.
Data lakes are widely used for machine learning and analytical solutions.
When you store data in a data lake, you generally associate it with some form of metadata to facilitate access. A data lake solution must also be secure and scalable.
Additionally, the hardware should be affordable, because you want to store as much data as possible, as quickly as possible.


# Data Lake vs Data Warehouse

![](images/data-engineering-w2/3.png)


Generally, a data lake stores raw, often unstructured data, and its target users are data scientists or data analysts. It stores huge amounts of data, often terabytes or even petabytes. Typical use cases are stream processing, machine learning, and real-time analytics.
In a data warehouse, by contrast, the data is generally structured. The users are business analysts, the data size is generally smaller, and the use cases are batch processing and BI reporting.

To read more, please check [here](https://lakefs.io/data-lakes/) and [here](https://luminousmen.com/post/data-lake-vs-data-warehouse).


# ETL vs ELT
- ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform
- ETL is mainly used for a small amount of data whereas ELT is used for large amounts of data
- ELT provides data lake support (Schema on read)
- ETL provides data warehouse solutions
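
The difference between the two orderings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample rows, the `transform` function, and the in-memory `warehouse` and `lake` lists are hypothetical stand-ins, not part of the course code.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [
    {"pickup": "2021-01-01 00:30:10", "fare": "8.5"},
    {"pickup": "2021-01-01 00:51:20", "fare": "12.0"},
]


def transform(row):
    """Parse strings into typed values (the 'T' step)."""
    return {
        "pickup": datetime.strptime(row["pickup"], "%Y-%m-%d %H:%M:%S"),
        "fare": float(row["fare"]),
    }


def etl(rows):
    """ETL: transform first, then load only the cleaned rows."""
    return [transform(r) for r in rows]


def elt(rows):
    """ELT: load the raw rows untouched, apply the schema when reading."""
    lake = list(rows)                         # load: raw data stored as-is
    typed_view = [transform(r) for r in lake]  # transform: done on read
    return lake, typed_view


warehouse = etl(RAW_ROWS)
lake, typed_view = elt(RAW_ROWS)
```

Both paths end with the same typed rows; what differs is that the lake still holds the raw strings, so a different schema can be applied later.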

![](images/data-engineering-w2/4.png)
*[source](https://www.guru99.com/etl-vs-elt.html#:~:text=ETL%20stands%20for%20Extract%2C%20Transform,directly%20into%20the%20target%20system.&text=ETL%2C%20ETL%20is%20mainly%20used,for%20large%20amounts%20of%20data.)*

![](images/data-engineering-w2/5.png)
*[source](https://www.guru99.com/etl-vs-elt.html#:~:text=ETL%20stands%20for%20Extract%2C%20Transform,directly%20into%20the%20target%20system.&text=ETL%2C%20ETL%20is%20mainly%20used,for%20large%20amounts%20of%20data.)*

Data lake solutions provided by the main cloud providers are as follows:

- GCP - Cloud Storage
- AWS - S3
- Azure - Azure Blob Storage


# Workflow Orchestration

> youtube: https://youtu.be/0yK7LXwYeD0

We saw a simple data pipeline in week 1. One problem with that pipeline was that it did several important jobs in one place: downloading the data, doing some light processing, and loading it into Postgres. What if an error occurs in the code, or the internet connection drops, after the data has been downloaded? We lose the downloaded data and have to start from scratch. That is why these steps should be separated.

A data pipeline is a series of steps for data processing. If the data has not yet been loaded into the data platform, it is ingested at the pipeline's start. Then there is a series of steps, each of which produces an output that serves as the input for the subsequent step. This procedure is repeated until the pipeline is completed. In some instances, independent steps may be performed concurrently [[source](https://hazelcast.com/glossary/data-pipeline/)].


A data pipeline is composed of three critical components: a source, a processing step or series of processing steps, and a destination. The destination may be referred to as a sink in some data pipelines. Data pipelines, for example, enable the flow of data from an application to a data warehouse, from a data lake to an analytics database, or to a payment processing system. Additionally, data pipelines can share the same source and sink, allowing the pipeline to focus entirely on data modification. When data is processed between points A and B (or B, C, and D), there is a data pipeline between those points [[source](https://hazelcast.com/glossary/data-pipeline/)].

![](images/data-engineering-w2/6.png)
*[source](https://hazelcast.com/glossary/data-pipeline/)*
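
The source, processing steps, and sink described above can be sketched with Python generators. This is a minimal illustration, assuming made-up sample records and a stand-in transformation; a real pipeline would read from and write to external systems.

```python
def source():
    """Source: yield raw records (hypothetical sample data)."""
    yield from ["3", "1", "oops", "2"]


def parse(records):
    """Processing step 1: keep only records that are valid integers."""
    for record in records:
        if record.isdigit():
            yield int(record)


def double(numbers):
    """Processing step 2: a stand-in transformation."""
    for number in numbers:
        yield number * 2


def sink(numbers):
    """Sink: collect the results; a real sink would write to a database."""
    return list(numbers)


# Each step's output is the next step's input.
result = sink(double(parse(source())))
print(result)  # [6, 2, 4]
```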

In our example, the data pipeline we had in the previous week can be as follows:

![](images/data-engineering-w2/7.png)

We separated downloading the dataset with `wget` from ingesting it into Postgres. I think we could even add one more step for processing (converting the strings in the downloaded dataset to datetime).

But this week, we will do something more complex. Let's have a look at the data workflow.

![](images/data-engineering-w2/8.png)

The above figure is called a DAG (Directed Acyclic Graph). We need to make sure that the steps run in the right order, and that a failed step can be retried before moving on to the next one. Tools called workflow engines let us define these DAGs and orchestrate the data workflow:

- Luigi
- Apache Airflow (we will go for this)
- Prefect
- Google Cloud Dataflow
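
To make the idea of ordering and retrying concrete, here is a small sketch that uses only the Python standard library (`graphlib`, available from Python 3.9). It is not how any of these engines is implemented; the task names, the `run_with_retries` helper, and the simulated failure are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks; each one maps to the set of tasks it depends on.
dependencies = {
    "download": set(),
    "parquetize": {"download"},
    "upload_to_gcs": {"parquetize"},
    "create_bq_table": {"upload_to_gcs"},
}

failures_left = {"upload_to_gcs": 1}  # simulate one transient failure
log = []                              # tasks that completed, in order


def action(task):
    """Stand-in for real work: fail while simulated failures remain."""
    if failures_left.get(task, 0) > 0:
        failures_left[task] -= 1
        raise RuntimeError(f"{task} failed")
    log.append(task)


def run_with_retries(task, retries=2):
    """Run one task, retrying on failure; return the attempts used."""
    for attempt in range(1, retries + 2):
        try:
            action(task)
            return attempt
        except RuntimeError:
            if attempt == retries + 1:
                raise  # out of retries: stop the whole workflow


# static_order() yields every task after all of its dependencies.
attempts = {
    task: run_with_retries(task)
    for task in TopologicalSorter(dependencies).static_order()
}
print(attempts)
```

Because only the failed step is retried, the earlier steps (for example the download) do not have to be repeated.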

Let's get more familiar with Airflow and Google Cloud Dataflow:

### Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative [[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/index.html)].


![](images/data-engineering-w2/airflow.gif)
*[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/index.html)*


### Google Cloud Dataflow
Real-time data is generated by websites, mobile applications, IoT devices, and other workloads. All businesses make data collection, processing, and analysis a priority. However, data from these systems is frequently not in a format suitable for analysis or effective use by downstream systems. That is where Dataflow enters the picture! Dataflow is used to process and enrich batch or stream data for analysis, machine learning, and data warehousing applications.

Dataflow is a serverless, high-performance, and cost-effective service for stream and batch processing. It enables portability for processing jobs written in the open source Apache Beam libraries and alleviates operational burden on your data engineering teams by automating infrastructure provisioning and cluster management [[Google cloud docs](https://cloud.google.com/blog/topics/developers-practitioners/dataflow-backbone-data-analytics)]. 


![](images/data-engineering-w2/9.jpeg)
*[Google cloud docs](https://cloud.google.com/blog/topics/developers-practitioners/dataflow-backbone-data-analytics)*


[Here](https://stackshare.io/stackups/airflow-vs-google-cloud-dataflow) is a comparison between Airflow and Google Cloud Dataflow.


# Airflow Architecture 

> youtube: https://youtu.be/lqDMzReAtrw



![](images/data-engineering-w2/10.png)
*[Airflow architecture](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html)*


Let's review the Airflow architecture. An Airflow installation generally consists of the following components:

- **Web server**: GUI to inspect, trigger and debug the behaviour of DAGs and tasks. Available at http://localhost:8080.

- **Scheduler**: Responsible for scheduling jobs. Handles both triggered and scheduled workflows, submits tasks to the executor to run, monitors all tasks and DAGs, and triggers task instances once their dependencies are complete.

- **Worker**: This component executes the tasks given by the scheduler.

- **Metadata database (postgres)**: Backend to the Airflow environment. Used by the scheduler, executor and webserver to store state.

Other components (seen in docker-compose services):

- *redis*: Message broker that forwards messages from scheduler to worker.
- *flower*: The flower app for monitoring the environment. It is available at http://localhost:5555.
- *airflow-init*: initialization service (customized as per this design)
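
How these components interact can be mimicked in a toy sketch. It is a deliberate simplification and not Airflow code: a `queue.Queue` stands in for the redis broker, a dictionary stands in for the metadata database, and the task names are hypothetical.

```python
import queue
import threading

broker = queue.Queue()  # stands in for redis
state = {}              # stands in for the metadata database


def scheduler(tasks):
    """Mark each task as queued and hand it to the broker."""
    for name in tasks:
        state[name] = "queued"
        broker.put(name)
    broker.put(None)  # sentinel: no more work


def worker():
    """Pull tasks from the broker, 'run' them, and record the outcome."""
    while True:
        name = broker.get()
        if name is None:
            break
        state[name] = "success"


worker_thread = threading.Thread(target=worker)
worker_thread.start()
scheduler(["download", "ingest"])
worker_thread.join()
print(state)  # {'download': 'success', 'ingest': 'success'}
```

The webserver's role in this picture would be to read `state` and display it.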

Please read more about Airflow architecture [here](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html#architecture-overview) before continuing the blog post.

Now let's set up the Airflow environment using Docker.




First, there are some prerequisites. For the sake of standardization across this tutorial's config, rename your GCP service account credentials file to `google_credentials.json` and store it in `~/.google/credentials/` under your `$HOME` directory:

In [None]:
cd ~ && mkdir -p ~/.google/credentials/
mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json

You also need Python version 3.7+.

You may need to upgrade your docker-compose version to v2.x+ (as suggested in the course; the Airflow documentation, however, suggests v1.29.1 or newer).

The default amount of memory available to Docker on macOS is often not enough to get Airflow up and running. If not enough memory is allocated, the Airflow webserver might restart continuously. You should allocate at least 4 GB of memory to the Docker Engine (ideally 8 GB). You can check and change the amount of memory under Resources in the Docker Desktop settings.

You can also check if you have enough memory by running this command [[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html)]:



In [None]:
docker run --rm "debian:buster-slim" bash -c 'numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))'

For me, this is 16 GB:

In [None]:
Unable to find image 'debian:buster-slim' locally
buster-slim: Pulling from library/debian
6552179c3509: Pull complete 
Digest: sha256:f6e5cbc7eaaa232ae1db675d83eabfffdabeb9054515c15c2fb510da6bc618a7
Status: Downloaded newer image for debian:buster-slim
16G
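
The shell command above multiplies the number of physical memory pages by the page size. The same calculation can be sketched in Python with `os.sysconf` (available on Unix-like systems only). Note that this measures the machine where Python runs; on macOS and Windows, Docker runs inside a VM, so the number Docker sees may be smaller. The `enough_for_airflow` helper is a hypothetical name that encodes the 4 GB minimum mentioned above.

```python
import os


def total_memory_bytes():
    """Physical memory = number of pages * page size."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def enough_for_airflow(total_bytes, minimum_gib=4):
    """Compare against the 4 GB minimum recommended for Airflow."""
    return total_bytes >= minimum_gib * 1024 ** 3


total = total_memory_bytes()
print(f"{total / 1024 ** 3:.1f} GiB, enough: {enough_for_airflow(total)}")
```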


To upgrade docker-compose, I followed [this](https://stackoverflow.com/questions/49839028/how-to-upgrade-docker-compose-to-latest-version#:~:text=First%2C%20remove%20the%20old%20version%3A) answer. To change the memory available to Docker on macOS and Windows, see [here](https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container); for Linux, check [here](https://phoenixnap.com/kb/docker-memory-and-cpu-limit).

To deploy Airflow on Docker Compose, you should fetch `docker-compose.yaml`.


In [None]:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.3/docker-compose.yaml'

This file contains several service definitions:

- airflow-scheduler - The scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete. Behind the scenes, the scheduler spins up a subprocess, which monitors and stays in sync with all DAGs in the specified DAG directory. Once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered [[ref](https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html)].
- airflow-webserver - The webserver is available at http://localhost:8080.
- airflow-worker - The worker that executes the tasks given by the scheduler.
- airflow-init - The initialization service.
- flower - The [flower](https://flower.readthedocs.io/en/latest/) app is a web based tool for monitoring the environment. It is available at http://localhost:5555.
- postgres - The database.
- redis - The [redis](https://redis.io/) broker that forwards messages from scheduler to worker.

All these services allow you to run Airflow with [CeleryExecutor](https://airflow.apache.org/docs/apache-airflow/stable/executor/celery.html). For more information, see [Architecture Overview](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html).

Some directories in the container are mounted, which means that their contents are synchronized between your computer and the container.

- ./dags - you can put your DAG files here.
- ./logs - contains logs from task execution and scheduler.
- ./plugins - you can put your custom [plugins](https://airflow.apache.org/docs/apache-airflow/stable/plugins.html) here. Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.

Here is the overall structure of the `docker-compose.yaml` file:

In [None]:
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.3}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    ...

  redis:
    ...

  airflow-webserver:
    ...

  airflow-scheduler:
    ...

  airflow-worker:
    ...

  airflow-triggerer:
    ...

  airflow-init:
    ...

  airflow-cli:
    ...
    
  flower:
    ...

volumes:
  postgres-db-volume:

The above file uses the official Airflow image ([apache/airflow](https://hub.docker.com/r/apache/airflow)), pinned to version 2.2.3 unless `AIRFLOW_IMAGE_NAME` is set. If you need to install a new Python library or system library, you can build your own image.

When running Airflow locally, you may wish to use an extended image that includes some additional dependencies - for example, you may wish to add new Python packages or upgrade the airflow providers to a newer version. This is accomplished by including a custom Dockerfile alongside your `docker-compose.yaml` file. Then, using the `docker-compose build` command, you can create your image (you need to do it only once). Additionally, you can add the `--build` flag to your `docker-compose` commands to automatically rebuild the images when other `docker-compose` commands are run. To learn more and see additional examples, visit [here](https://airflow.apache.org/docs/docker-stack/build.html).

In order to use airflow with GCP, we have changed the `docker-compose.yaml` file in this course as follows:

- instead of using the official airflow image as the base image, we use a custom docker file to build and start from.


In [None]:
  build:
    context: .
    dockerfile: ./Dockerfile

- disable loading the DAG examples that ship with Airflow. They are good for getting started, but you probably want to set this to false in a production environment.

In [None]:
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'

- add GCP environment variables

In [None]:
GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'
GCP_PROJECT_ID: 'pivotal-surfer-336713'
GCP_GCS_BUCKET: "dtc_data_lake_pivotal-surfer-336713"


- add the folder we created at the beginning of the post for google credentials.

In [None]:
- ~/.google/credentials/:/.google/credentials:ro

Here is the beginning of the file after our modifications:

In [None]:
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'
    GCP_PROJECT_ID: 'pivotal-surfer-336713'
    GCP_GCS_BUCKET: "dtc_data_lake_pivotal-surfer-336713"

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro
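
Airflow reads connections from environment variables named `AIRFLOW_CONN_<CONN_ID>` in URI form, where the scheme is the connection type and the query parameters are the connection extras. The following sketch only takes such a URI apart with the standard library to show what the variable above encodes; `parse_conn_uri` is a hypothetical helper, not an Airflow function, and the defaults simply repeat the values shown above so the snippet also runs outside the container.

```python
import os
from urllib.parse import parse_qs, urlsplit

DEFAULT_CONN_URI = (
    "google-cloud-platform://?extra__google_cloud_platform__key_path="
    "/.google/credentials/google_credentials.json"
)


def parse_conn_uri(uri):
    """Split a connection URI into its type (scheme) and extras (query)."""
    parts = urlsplit(uri)
    extras = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.scheme, extras


# Inside the container these come from docker-compose.yaml.
conn_uri = os.environ.get("AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT", DEFAULT_CONN_URI)
project_id = os.environ.get("GCP_PROJECT_ID", "pivotal-surfer-336713")
bucket = os.environ.get("GCP_GCS_BUCKET", "dtc_data_lake_pivotal-surfer-336713")

conn_type, extras = parse_conn_uri(conn_uri)
print(conn_type, extras, project_id, bucket)
```

DAG code can read `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` with `os.environ.get` in the same way.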

The following is the custom Dockerfile, which is placed inside the `airflow` folder.
It lists the custom packages to be installed. The one we need most is `gcloud`, to connect to the GCS bucket (our data lake).

In [None]:
# First-time build can take up to 10 mins.

FROM apache/airflow:2.2.3

ENV AIRFLOW_HOME=/opt/airflow

USER root
RUN apt-get update -qq && apt-get install vim -qqq
# git gcc g++ -qqq

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt


# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk

ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
    && TMP_DIR="$(mktemp -d)" \
    && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
    && mkdir -p "${GCLOUD_HOME}" \
    && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
    && "${GCLOUD_HOME}/install.sh" \
       --bash-completion=false \
       --path-update=false \
       --usage-reporting=false \
       --quiet \
    && rm -rf "${TMP_DIR}" \
    && gcloud --version

WORKDIR $AIRFLOW_HOME

USER $AIRFLOW_UID


The `requirements.txt` file referenced in the Dockerfile, which lists the required Python packages, is as follows:

In [None]:
apache-airflow-providers-google
pyarrow

If you prefer fewer services than in the `docker-compose.yaml` file above, you can use the following lighter version, which is in the `week_2_data_ingestion/airflow/extras` folder of the course GitHub repo:

In [None]:
version: '3.7'
services:
    webserver:
        container_name: airflow
        build:
            context: ..
            dockerfile: ../Dockerfile
        environment:
            - PYTHONPATH=/home/airflow
            # airflow connection with SQLAlchemy container
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=sqlite:///$AIRFLOW_HOME/airflow.db
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
            # disable example loading
            - AIRFLOW__CORE__LOAD_EXAMPLES=FALSE

        volumes:
            - ./dags:/home/airflow/dags
        # user: "${AIRFLOW_UID:-50000}:0"
        ports:
            - "8080:8080"
        command: >  # airflow db upgrade;
            bash -c "
                airflow scheduler -D;
                rm /home/airflow/airflow-scheduler.*;
                airflow webserver"
        healthcheck:
            test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ]
            interval: 30s
            timeout: 30s
            retries: 3


To avoid confusion, we will not use this file.

Before starting Airflow for the first time, you need to prepare your environment, i.e. create the necessary files and directories and initialize the database.


On Linux, the quick-start needs to know your host user ID and needs the group ID set to `0`. Otherwise, the files created in `dags`, `logs` and `plugins` will be owned by the `root` user. Make sure to configure them for docker-compose by running the following inside the `airflow` folder, where the `docker-compose.yaml` file is placed:

In [None]:
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env
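
For readers who prefer to script this step, the two shell commands can be mirrored in Python. This is a sketch under assumptions: `prepare_airflow_dir` is a hypothetical helper name, `os.getuid` exists only on Unix-like systems, and the demo writes into a temporary directory rather than your real `airflow` folder.

```python
import os
import tempfile
from pathlib import Path


def prepare_airflow_dir(base):
    """Create dags/logs/plugins and write AIRFLOW_UID to .env under `base`."""
    base = Path(base)
    for name in ("dags", "logs", "plugins"):
        (base / name).mkdir(parents=True, exist_ok=True)
    env_file = base / ".env"
    env_file.write_text(f"AIRFLOW_UID={os.getuid()}\n")
    return env_file


# Demo in a throwaway directory; pass "." to act on the current folder.
with tempfile.TemporaryDirectory() as tmp:
    env_file = prepare_airflow_dir(tmp)
    content = env_file.read_text()
    created = sorted(path.name for path in Path(tmp).iterdir())

print(content.strip(), created)
```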

On other operating systems, you will get a warning that `AIRFLOW_UID` is not set, but you can ignore it. To get rid of the warning, you can also manually create a `.env` file in the same folder as your `docker-compose.yaml` with this content:

In [None]:
AIRFLOW_UID=1000

Then we need to initialize the database. On all operating systems, you need to run database migrations and create the first user account. To do so, run:


In [2]:
docker-compose build
docker-compose up airflow-init
docker-compose up


You may also see some errors, but you can ignore them, as they relate to services in the official docker-compose file that we do not use.

You can check which services are up using:

In [None]:
docker-compose ps

For me, the output is as follows:

In [None]:
airflow-airflow-init-1        "/bin/bash -c 'funct…"   airflow-init        exited (0)          
airflow-airflow-scheduler-1   "/usr/bin/dumb-init …"   airflow-scheduler   running (healthy)   8080/tcp
airflow-airflow-triggerer-1   "/usr/bin/dumb-init …"   airflow-triggerer   running (healthy)   8080/tcp
airflow-airflow-webserver-1   "/usr/bin/dumb-init …"   airflow-webserver   running (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
airflow-airflow-worker-1      "/usr/bin/dumb-init …"   airflow-worker      running (healthy)   8080/tcp
airflow-flower-1              "/usr/bin/dumb-init …"   flower              running (healthy)   0.0.0.0:5555->5555/tcp, :::5555->5555/tcp
airflow-postgres-1            "docker-entrypoint.s…"   postgres            running (healthy)   5432/tcp
airflow-redis-1               "docker-entrypoint.s…"   redis               running (healthy)   6379/tcp
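
Besides `docker-compose ps`, the Airflow webserver exposes a `/health` endpoint that reports the status of the metadata database and the scheduler. The sketch below shows one way to interpret such a response. The helper names are hypothetical, the sample payload is illustrative rather than captured output, and `fetch_health` only works while Airflow is running, so it is defined but not called here.

```python
import json
from urllib.request import urlopen


def unhealthy_components(health):
    """Return the names of components whose status is not 'healthy'."""
    return [
        name for name, info in health.items()
        if info.get("status") != "healthy"
    ]


def fetch_health(base_url="http://localhost:8080"):
    """Query the webserver's /health endpoint (Airflow must be running)."""
    with urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


# Illustrative payload in the shape the endpoint returns; not real output.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
}
print(unhealthy_components(sample))  # ['scheduler']
```

With the stack up, `unhealthy_components(fetch_health())` should return an empty list.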

Then you can go to this address: `http://0.0.0.0:8080/`

The airflow UI will be like this:

![](images/data-engineering-w2/11.png)

The account created has the login `airflow` and the password `airflow`. After logging in, you will see two generated dags from the `./da`