ELT Pipeline Project

Python 3.8

About

A personal project using data engineering concepts to collect apartment data for further analysis.

The project is an ELT (Extract, Load, Transform) data pipeline, orchestrated with Apache Airflow running in Docker containers.

An AWS S3 bucket is used as a data lake, and files move through its layers. The data is extracted from the VivaReal API and loaded as JSON into the first layer. It is then processed with Spark and loaded as Parquet into the second layer, and finally transformed with additional variables and partitioned by neighborhood.
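As a rough illustration of the Spark steps, here is a minimal PySpark sketch. The layer prefixes (raw, processed, curated) and the column names (price, usable_area, neighborhood) are assumptions for illustration, not taken from the repository code:

# Illustrative sketch only; bucket name, layer prefixes, and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("apartments-elt").getOrCreate()

# Layer 1 -> Layer 2: read the raw JSON and persist it as Parquet.
raw = spark.read.json("s3a://YOUR-BUCKET/raw/apartments.json")
raw.write.mode("overwrite").parquet("s3a://YOUR-BUCKET/processed/apartments")

# Layer 2 -> Layer 3: add derived variables and partition by neighborhood.
processed = spark.read.parquet("s3a://YOUR-BUCKET/processed/apartments")
curated = processed.withColumn("price_per_m2", F.col("price") / F.col("usable_area"))

(curated.write.mode("overwrite")
    .partitionBy("neighborhood")
    .parquet("s3a://YOUR-BUCKET/curated/apartments"))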

Architecture

(Architecture diagram)

Scenario

Buying an apartment is a big deal, especially because of the price and all the variables that make up an apartment. It is important for the buyer, and for big companies like the banks that finance the property, to know whether the price is worth it. This analysis statistically compares similar apartments to each other to better understand all the features that drive the price. And for a better valuation, it needs data. :grinning:

Prerequisites

Docker and Docker Compose installed, and an AWS account with an S3 bucket.

Setup

Clone the project to your desired location:

$ git clone https://github.com/lucaspfigueiredo/elt-pipeline

Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:

$ echo -e "AIRFLOW_UID=$(id -u)" > .env

In the dags/elt_dag.py file, set your S3 bucket URL:

S3_BUCKET = "s3a://YOUR-BUCKET"
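For reference, a minimal sketch of how the bucket URL could then be turned into per-layer paths inside the DAG; the layer names below are assumptions, not the ones used in the repository:

S3_BUCKET = "s3a://YOUR-BUCKET"

# Hypothetical per-layer prefixes built from the bucket URL.
RAW_PATH = f"{S3_BUCKET}/raw"
PROCESSED_PATH = f"{S3_BUCKET}/processed"
CURATED_PATH = f"{S3_BUCKET}/curated"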

Build the Docker images:

$ docker-compose build 

Initialize Airflow database:

$ docker-compose up airflow-init

Start Containers:

$ docker-compose up -d

Once everything is done, you can check that all the containers are running:

$ docker ps

Airflow Interface

Now you can access the Airflow web interface at http://localhost:8080 with the default user defined in docker-compose.yml. Username and password: airflow

With your AWS S3 user and bucket created, you can store your credentials as an Airflow connection. You can also store the host and port on which Spark is exposed, to be used when submitting jobs:

(Screenshots: Airflow connections for the AWS credentials and the Spark host/port)
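For context, here is a minimal sketch of how a task could submit a Spark job through that connection using the SparkSubmitOperator from the Apache Spark provider. The DAG ID, application path, and task name are assumptions for illustration only:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="elt_dag_example",          # hypothetical DAG ID
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits a Spark job via the "spark_default" connection, which holds
    # the Spark master host and port configured in the Airflow UI.
    load_to_processed = SparkSubmitOperator(
        task_id="load_to_processed",
        application="/opt/airflow/dags/scripts/load_to_processed.py",  # assumed path
        conn_id="spark_default",
    )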

Now, we can trigger our DAG and see all the tasks running.

(Airflow DAG run showing the tasks)

And finally, check the S3 bucket to confirm that the partitioned data is in the right place.

(Partitioned data in the S3 bucket)
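The same check can also be done with a short script; here is a sketch using boto3, where the bucket name and layer prefix are assumptions:

import boto3

s3 = boto3.client("s3")

# List the neighborhood partitions written by the pipeline.
response = s3.list_objects_v2(
    Bucket="YOUR-BUCKET",
    Prefix="curated/apartments/",   # assumed layer prefix
    Delimiter="/",
)
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])  # e.g. curated/apartments/neighborhood=.../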

Shut down or restart Airflow

If you need to make changes or shut down:

$ docker-compose down

License

You can check out the full license here

This project is licensed under the terms of the MIT license.
