# "Data Engineering - Week 2"
> "Week 2 - Data Engineering Zoomcamp course."

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [data engineering, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

The first video of week 2 covers data lakes, how they compare with data warehouses, and ETL vs ELT:

> youtube: https://youtu.be/W3Zm6rjOq70


# Data Lake

![](images/data-engineering-w2/1.png)

A data lake is a collection of technologies that enables querying of data contained in files or blob objects. When used effectively, it enables massive-scale, cost-effective analysis of structured and unstructured data assets [[source](https://lakefs.io/data-lakes/)].

Data lakes are comprised of four primary components: storage, format, compute, and metadata layers [[source](https://lakefs.io/data-lakes/)].

![](images/data-engineering-w2/2.png)


A data lake is a centralized repository for large amounts of data from a variety of sources. The data can be structured, semi-structured, or unstructured.
The goal is to ingest data rapidly and make it accessible to other team members such as data scientists, analysts, and engineers.
Data lakes are widely used for machine learning and analytical solutions.
When you store data in a data lake, you usually associate it with some form of metadata to facilitate access. A data lake solution must also be secure and scalable.
Additionally, the hardware should be inexpensive, because you want to store as much data as possible, as quickly as possible.
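
To make the "store raw data plus metadata" idea concrete, here is a minimal sketch that uses a local folder as a stand-in for a data lake. The `ingest` function, the `raw/<source>` folder layout, and the `.meta.json` sidecar file are assumptions made for illustration only; a real lake would use object storage and a proper metadata catalog.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(src: Path, lake_root: Path, source_name: str) -> Path:
    """Copy a raw file into the lake unchanged and record metadata next to it."""
    target_dir = lake_root / "raw" / source_name  # hypothetical layout
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)  # store the data as-is, no transformation

    metadata = {
        "source": source_name,
        "original_name": src.name,
        "size_bytes": target.stat().st_size,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # Sidecar file: a very small stand-in for a real metadata layer.
    meta_path = target.with_name(target.name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return target


# Usage: ingest a tiny sample file into a temporary "lake".
workdir = Path(tempfile.mkdtemp())
sample = workdir / "trips.csv"
sample.write_text("id,pickup\n1,2021-01-01 00:30:10\n")

stored = ingest(sample, workdir / "lake", source_name="taxi")
meta = json.loads(stored.with_name(stored.name + ".meta.json").read_text())
print(stored.relative_to(workdir), meta["source"])
```

The file is copied without any transformation, which keeps ingestion fast; the metadata is what later lets other people find out where the data came from.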


# Data Lake vs Data Warehouse

![](images/data-engineering-w2/3.png)


Generally, a data lake holds raw data that is often unstructured, and its target users are data scientists or data analysts. It stores huge amounts of data, sometimes terabytes or even petabytes. The use cases covered by a data lake are mainly stream processing, machine learning, and real-time analytics.
On the data warehouse side, the data is generally structured. The users are business analysts, the data size is generally smaller, and the use cases are mainly batch processing and BI reporting.

To read more, please check [here](https://lakefs.io/data-lakes/) and [here](https://luminousmen.com/post/data-lake-vs-data-warehouse).


# ETL vs ELT
- ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform
- ETL is mainly used for a small amount of data whereas ELT is used for large amounts of data
- ELT supports data lakes (schema on read)
- ETL fits data warehouse solutions (schema on write)
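
The difference in ordering can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain lists stand in for the warehouse and the lake, and the column names (`trip_id`, `pickup`, `fare`) are made up for the example.

```python
import csv
import io
from datetime import datetime

RAW = "trip_id,pickup,fare\n1,2021-01-01 00:30:10,12.5\n2,2021-01-01 01:15:00,8.0\n"


def extract(text: str) -> list[dict]:
    """The 'E' step: read raw rows; every value is still a string."""
    return list(csv.DictReader(io.StringIO(text)))


def parse(row: dict) -> dict:
    """The 'T' step: give each field a proper type."""
    return {
        "trip_id": int(row["trip_id"]),
        "pickup": datetime.strptime(row["pickup"], "%Y-%m-%d %H:%M:%S"),
        "fare": float(row["fare"]),
    }


# ETL: transform first, then load typed rows into the warehouse-like target.
warehouse = [parse(row) for row in extract(RAW)]

# ELT: load the raw rows untouched into the lake-like target ...
lake = extract(RAW)


# ... and apply the schema only when someone reads the data (schema on read).
def read_with_schema(raw_rows: list[dict]) -> list[dict]:
    return [parse(row) for row in raw_rows]


total_fare = sum(row["fare"] for row in read_with_schema(lake))
print(total_fare)
```

Both paths end with the same typed rows; what changes is when the transformation cost is paid and whether the raw data is kept around.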

![](images/data-engineering-w2/4.png)
*[source](https://www.guru99.com/etl-vs-elt.html#:~:text=ETL%20stands%20for%20Extract%2C%20Transform,directly%20into%20the%20target%20system.&text=ETL%2C%20ETL%20is%20mainly%20used,for%20large%20amounts%20of%20data.)*

![](images/data-engineering-w2/5.png)
*[source](https://www.guru99.com/etl-vs-elt.html#:~:text=ETL%20stands%20for%20Extract%2C%20Transform,directly%20into%20the%20target%20system.&text=ETL%2C%20ETL%20is%20mainly%20used,for%20large%20amounts%20of%20data.)*

The main cloud providers offer the following data lake storage solutions:

- GCP - Cloud Storage
- AWS - S3
- Azure - Azure Blob Storage


# Workflow Orchestration

> youtube: https://youtu.be/0yK7LXwYeD0

We saw a simple data pipeline in week 1. One of its problems was that it did several important jobs in one place: downloading the data, doing some light processing, and loading it into Postgres. What if, after the download, an error occurs in the code or the internet connection drops? We would lose the downloaded data and have to start from scratch. That is why these steps should be separated.
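
A minimal sketch of that separation, assuming the downloaded file is kept on disk between the two steps. The function names, the `.part` suffix, and the injectable `fetch` argument are illustrative choices rather than anything from the course code, and the ingest step only counts rows instead of talking to Postgres.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download(url: str, dest: Path, fetch=urlretrieve) -> Path:
    """Step 1: fetch the file once and keep it on disk."""
    if dest.exists():
        return dest  # already downloaded by an earlier run, nothing to redo
    tmp = dest.with_name(dest.name + ".part")
    fetch(url, tmp)    # write under a temporary name first ...
    tmp.replace(dest)  # ... so a partial download is never taken for a full one
    return dest


def ingest(path: Path) -> int:
    """Step 2: placeholder for loading into Postgres; here it only counts rows."""
    with path.open() as f:
        return sum(1 for _ in f) - 1  # minus the header line


# Usage with a fake fetch function so the example runs without network access.
calls = []


def fake_fetch(url, filename):
    calls.append(url)
    Path(filename).write_text("id,value\n1,a\n2,b\n")


dest = Path(tempfile.mkdtemp()) / "data.csv"
download("https://example.com/data.csv", dest, fetch=fake_fetch)
download("https://example.com/data.csv", dest, fetch=fake_fetch)  # skipped
rows = ingest(dest)
print(len(calls), rows)
```

If `ingest` fails, rerunning the script reuses the file already on disk instead of downloading it again.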

A data pipeline is a series of steps for data processing. If the data has not yet been loaded into the data platform, it is ingested at the pipeline's start. Then there is a series of steps, each of which produces an output that serves as the input for the subsequent step. This procedure is repeated until the pipeline is completed. In some instances, independent steps may be performed concurrently [[source](https://hazelcast.com/glossary/data-pipeline/)].


A data pipeline is composed of three critical components: a source, a processing step or series of processing steps, and a destination. The destination may be referred to as a sink in some data pipelines. Data pipelines, for example, enable the flow of data from an application to a data warehouse, from a data lake to an analytics database, or to a payment processing system. Additionally, data pipelines can share the same source and sink, allowing the pipeline to focus entirely on data modification. When data is processed between points A and B (or B, C, and D), there is a data pipeline between those points [[source](https://hazelcast.com/glossary/data-pipeline/)].
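
The source, processing steps, and sink can be sketched with plain Python generators, where each step consumes the output of the previous one. The records, field names, and step names below are hypothetical; a list stands in for the destination table.

```python
from datetime import datetime
from typing import Iterable, Iterator

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def source() -> Iterator[dict]:
    """Source: yields raw records (hard-coded here instead of read from a file)."""
    yield {"id": "1", "pickup": "2021-01-01 00:30:10"}
    yield {"id": "2", "pickup": "not a date"}
    yield {"id": "3", "pickup": "2021-01-02 08:00:00"}


def drop_invalid(records: Iterable[dict]) -> Iterator[dict]:
    """Processing step 1: skip records whose timestamp cannot be parsed."""
    for rec in records:
        try:
            datetime.strptime(rec["pickup"], DATE_FORMAT)
        except ValueError:
            continue
        yield rec


def parse_types(records: Iterable[dict]) -> Iterator[dict]:
    """Processing step 2: convert strings to int and datetime."""
    for rec in records:
        yield {
            "id": int(rec["id"]),
            "pickup": datetime.strptime(rec["pickup"], DATE_FORMAT),
        }


def sink(records: Iterable[dict], destination: list) -> int:
    """Sink: append records to the destination and report how many were loaded."""
    count = 0
    for rec in records:
        destination.append(rec)
        count += 1
    return count


table = []
loaded = sink(parse_types(drop_invalid(source())), table)
print(loaded, [rec["id"] for rec in table])
```

Because each step only depends on the records it receives, steps can be added, removed, or reordered without touching the source or the sink.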

![](images/data-engineering-w2/6.png)
*[source](https://hazelcast.com/glossary/data-pipeline/)*

The data pipeline from the previous week can be drawn as follows:

![](images/data-engineering-w2/7.png)

We separated downloading the dataset with `wget` from ingesting it into Postgres. We could even add one more step for processing (converting strings to datetimes in the downloaded dataset).

But this week, we will do something more complex. Let's have a look at the data workflow.

![](images/data-engineering-w2/8.png)

The figure above is a DAG (Directed Acyclic Graph). We need to be sure that the steps run in the right order, and that we can retry a step if something goes wrong before moving on to the next one. Tools called workflow engines let us define these DAGs and orchestrate the data workflow:

- Luigi
- Apache Airflow (we will go for this)
- Prefect
- Google Cloud Dataflow
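
To show what a workflow engine does at its core, here is a toy sketch using only the Python standard library (`graphlib`, available from Python 3.9). It is not how any of the tools above is implemented; the task names, the `run_with_retries` helper, and the simulated failure are assumptions for illustration.

```python
from graphlib import TopologicalSorter

attempts = {"download": 0}


def download():
    """Fails on the first call to simulate a temporary network problem."""
    attempts["download"] += 1
    if attempts["download"] == 1:
        raise ConnectionError("simulated network glitch")


def process():
    pass  # placeholder for a real processing step


def load():
    pass  # placeholder for a real load step


tasks = {"download": download, "process": process, "load": load}

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "download": set(),
    "process": {"download"},
    "load": {"process"},
}


def run_with_retries(task, retries=2):
    """Run a task, retrying up to `retries` extra times before giving up."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: stop the whole workflow


executed = []
# static_order() yields every task after all of its dependencies.
for name in TopologicalSorter(dependencies).static_order():
    run_with_retries(tasks[name])
    executed.append(name)

print(executed, attempts["download"])
```

Real workflow engines add scheduling, logging, a user interface, and distributed execution on top of this basic idea of dependency ordering plus retries.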

Let's get more familiar with Airflow and Google Cloud Dataflow:

### Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative [[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/index.html)].


![](images/data-engineering-w2/airflow.gif)
*[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/index.html)*


### Google Cloud Dataflow
Real-time data is generated by websites, mobile applications, IoT devices, and other workloads. All businesses make data collection, processing, and analysis a priority. However, data from these systems is frequently not in a format suitable for analysis or effective use by downstream systems. That is where Dataflow enters the picture! Dataflow is used to process and enrich batch or stream data for analysis, machine learning, and data warehousing applications.

Dataflow is a serverless, high-performance, and cost-effective service for stream and batch processing. It enables portability for processing jobs written in the open source Apache Beam libraries and alleviates operational burden on your data engineering teams by automating infrastructure provisioning and cluster management [[Google cloud docs](https://cloud.google.com/blog/topics/developers-practitioners/dataflow-backbone-data-analytics)]. 


![](images/data-engineering-w2/9.jpeg)
*[Google cloud docs](https://cloud.google.com/blog/topics/developers-practitioners/dataflow-backbone-data-analytics)*


[Here](https://stackshare.io/stackups/airflow-vs-google-cloud-dataflow) is a comparison between Airflow and Google Cloud Dataflow.
