ELT Pipeline Project

Python 3.8

About

A personal project using data engineering concepts to collect apartment data for further analysis.

The project is an ELT (Extract, Load, Transform) data pipeline, orchestrated with Apache Airflow running in Docker containers.

An AWS S3 bucket is used as a data lake, and files move through its layers. The data is extracted from the VivaReal API and loaded as JSON into the first layer. It is then processed with Spark and loaded as Parquet into the second layer, and finally transformed with additional variables and partitioned by neighborhood.
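As a rough illustration of the Spark steps, here is a minimal PySpark sketch. The layer prefixes (raw, processed, curated) and the column names (price, usable_area, neighborhood) are assumptions for illustration, not taken from the repository code:

# Illustrative sketch only; bucket name, layer prefixes, and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("apartments-elt").getOrCreate()

# Layer 1 -> Layer 2: read the raw JSON and persist it as Parquet.
raw = spark.read.json("s3a://YOUR-BUCKET/raw/apartments.json")
raw.write.mode("overwrite").parquet("s3a://YOUR-BUCKET/processed/apartments")

# Layer 2 -> Layer 3: add derived variables and partition by neighborhood.
processed = spark.read.parquet("s3a://YOUR-BUCKET/processed/apartments")
curated = processed.withColumn("price_per_m2", F.col("price") / F.col("usable_area"))

(curated.write.mode("overwrite")
    .partitionBy("neighborhood")
    .parquet("s3a://YOUR-BUCKET/curated/apartments"))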

Architecture

(Architecture diagram)

Scenario

Buying an apartment is a big deal, especially because of the price and all the variables that make up an apartment. It is important for the buyer, and for big companies like the banks that finance the property, to know whether the price is worth it. This analysis statistically compares similar apartments to each other to better understand all the features that drive the price. And for a better valuation, it needs data. :grinning:

Prerequisites

Docker and Docker Compose installed, and an AWS account with an S3 bucket.

Setup

Clone the project to your desired location:

$ git clone https://github.com/lucaspfigueiredo/elt-pipeline

Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:

$ echo -e "AIRFLOW_UID=$(id -u)" > .env

In the dags/elt_dag.py file, set your S3 bucket URL:

S3_BUCKET = "s3a://YOUR-BUCKET"
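For reference, a minimal sketch of how the bucket URL could then be turned into per-layer paths inside the DAG; the layer names below are assumptions, not the ones used in the repository:

S3_BUCKET = "s3a://YOUR-BUCKET"

# Hypothetical per-layer prefixes built from the bucket URL.
RAW_PATH = f"{S3_BUCKET}/raw"
PROCESSED_PATH = f"{S3_BUCKET}/processed"
CURATED_PATH = f"{S3_BUCKET}/curated"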

Build the Docker images:

$ docker-compose build 

Initialize Airflow database:

$ docker-compose up airflow-init

Start Containers:

$ docker-compose up -d

Once everything is done, you can check that all the containers are running:

$ docker ps

Airflow Interface

Now you can access the Airflow web interface at http://localhost:8080 with the default user defined in docker-compose.yml. Username and password: airflow

With your AWS S3 user and bucket created, you can store your credentials as an Airflow connection. You can also store the host and port on which Spark is exposed, to be used when submitting jobs:

(Screenshots: Airflow connections for the AWS credentials and the Spark host/port)
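For context, here is a minimal sketch of how a task could submit a Spark job through that connection using the SparkSubmitOperator from the Apache Spark provider. The DAG ID, application path, and task name are assumptions for illustration only:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="elt_dag_example",          # hypothetical DAG ID
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits a Spark job via the "spark_default" connection, which holds
    # the Spark master host and port configured in the Airflow UI.
    load_to_processed = SparkSubmitOperator(
        task_id="load_to_processed",
        application="/opt/airflow/dags/scripts/load_to_processed.py",  # assumed path
        conn_id="spark_default",
    )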

Now, we can trigger our DAG and see all the tasks running.

(Airflow DAG run showing the tasks)

And finally, check the S3 bucket to confirm that the partitioned data is in the right place.

(Partitioned data in the S3 bucket)
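The same check can also be done with a short script; here is a sketch using boto3, where the bucket name and layer prefix are assumptions:

import boto3

s3 = boto3.client("s3")

# List the neighborhood partitions written by the pipeline.
response = s3.list_objects_v2(
    Bucket="YOUR-BUCKET",
    Prefix="curated/apartments/",   # assumed layer prefix
    Delimiter="/",
)
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])  # e.g. curated/apartments/neighborhood=.../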

Shut down or restart Airflow

If you need to make changes or shut down:

$ docker-compose down

License

You can check out the full license here

This project is licensed under the terms of the MIT license.
