A personal data engineering project that collects apartment listing data for further analysis.
The project is an ELT (Extract, Load, Transform) data pipeline, orchestrated with Apache Airflow running in Docker containers.
An AWS S3 bucket is used as a data lake, with the files stored across its layers. The data is extracted from the VivaReal API and loaded as JSON into the first layer. It is then processed with Spark and written as Parquet to the second layer. Finally, it is enriched with additional variables and partitioned by neighborhood.
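As a sketch of how the three layers might be laid out inside the bucket (the layer names `raw`, `processed`, and `curated` are illustrative assumptions, not taken from the project):

```python
from datetime import date

# Illustrative layer prefixes; the actual names used in the project may differ.
RAW, PROCESSED, CURATED = "raw", "processed", "curated"

def raw_key(run_date: date) -> str:
    """JSON from the VivaReal API lands in the first layer, keyed by run date."""
    return f"{RAW}/{run_date.isoformat()}/listings.json"

def processed_key(run_date: date) -> str:
    """Spark rewrites the same data as Parquet in the second layer."""
    return f"{PROCESSED}/{run_date.isoformat()}/listings.parquet"

def curated_key(run_date: date, neighborhood: str) -> str:
    """The final layer adds derived columns and partitions by neighborhood."""
    return f"{CURATED}/{run_date.isoformat()}/neighborhood={neighborhood}/listings.parquet"

print(curated_key(date(2024, 1, 15), "Centro"))
# → curated/2024-01-15/neighborhood=Centro/listings.parquet
```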
Buying an apartment is a big deal, especially because of the price and all the variables that make up an apartment. It is important for the buyer, and for big companies like the banks that finance the property, to know whether the price is worth it. This analysis statistically compares similar apartments to each other to better understand all the features that drive the price. And for a better valuation, it needs data. :grinning:
Clone the project to your desired location:
$ git clone https://github.com/lucaspfigueiredo/elt-pipeline
Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:
$ echo -e "AIRFLOW_UID=$(id -u)" > .env
In the dags/elt_dag.py file, set your S3 bucket URL (Spark reads from S3 through the s3a:// scheme):
S3_BUCKET = "s3a://YOUR-BUCKET"
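If you want to sanity-check the value before running the DAG, here is a small illustrative check using only the standard library (the helper name is hypothetical, not part of the project):

```python
from urllib.parse import urlparse

def is_valid_s3a_url(url: str) -> bool:
    """True if the URL uses the s3a:// scheme Spark expects and names a bucket."""
    parsed = urlparse(url)
    return parsed.scheme == "s3a" and bool(parsed.netloc)

print(is_valid_s3a_url("s3a://my-apartments-bucket"))  # → True
print(is_valid_s3a_url("https://s3a:"))                # → False
```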
Build Docker:
$ docker-compose build
Initialize Airflow database:
$ docker-compose up airflow-init
Start Containers:
$ docker-compose up -d
When everything is done, you can check that all the containers are running:
$ docker ps
Now you can access the Airflow web interface at http://localhost:8080 with the default user defined in docker-compose.yml. Username/Password: airflow
With your AWS S3 user and bucket created, store your credentials in Airflow's connections. We can also store the host and port Spark is exposed on, which is used when we submit our jobs:
Now we can trigger our DAG and watch all the tasks run.
And finally, check the S3 bucket to confirm the partitioned data is in the right place.
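If the final layer uses Hive-style partition directories (e.g. `neighborhood=Centro/...` — an assumption about the exact layout), a small stdlib sketch for recovering which neighborhoods made it into the bucket from a listing of object keys:

```python
import re

# Hypothetical object keys, as an S3 listing of the final layer might return them.
keys = [
    "curated/neighborhood=Centro/part-0000.parquet",
    "curated/neighborhood=Centro/part-0001.parquet",
    "curated/neighborhood=Moema/part-0000.parquet",
]

def neighborhoods(object_keys):
    """Extract the distinct partition values from Hive-style key paths."""
    pattern = re.compile(r"neighborhood=([^/]+)/")
    return sorted({m.group(1) for key in object_keys if (m := pattern.search(key))})

print(neighborhoods(keys))  # → ['Centro', 'Moema']
```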
If you need to make changes or shut down:
$ docker-compose down
You can check out the full license here
This project is licensed under the terms of the MIT license.