Study Case: Imagine that we have an application in charge of collecting customer information. The BI team needs to analyze the data constantly, but we don't want them to work directly against the raw database. We have therefore created two layers above that raw database so that BI can build its tools on top of the data warehouse. The data lake serves as a staging area that keeps the data in its raw format before any transformation is applied, and as a backup in case the data warehouse is down.
We used Docker to host each isolated service and Apache Airflow to execute our pipeline on a daily basis. The goal of this project is to simulate a data ETL process: use a custom REST API to extract three different tables from the local MySQL database, apply the required transformations to that data, load it first into the AWS S3 data lake, and then load it into both AWS RDS and the local Postgres database.
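A minimal sketch of what such a daily DAG could look like is shown below, assuming Airflow 2.x; the `dag_id`, task ids, and callables are illustrative placeholders, not the repository's actual code:

```python
# Illustrative sketch of the daily ETL DAG; dag_id, task ids, and callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_mysql():
    # Pull the three source tables from the local MySQL database via the custom REST API.
    pass


def stage_to_s3():
    # Land the raw extracts in the S3 data lake (staging layer) before any transformation.
    pass


def load_to_warehouses():
    # Apply the required transformations, then write to AWS RDS and the local Postgres warehouse.
    pass


with DAG(
    dag_id="customer_etl",            # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",       # the pipeline runs once a day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_mysql", python_callable=extract_from_mysql)
    stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)
    load = PythonOperator(task_id="load_to_warehouses", python_callable=load_to_warehouses)

    extract >> stage >> load
```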
Some major frameworks/libraries used to bootstrap this project:
- Apache Airflow: data pipeline scheduling and orchestration.
- Docker: isolates each service's environment inside a container.
- mysql-connector-python: used to establish connections to the MySQL database.
- psycopg2-binary: used to establish connections to the Postgres database.
- Terraform: an open-source Infrastructure as Code (IaC) tool that helps us automatically set up or tear down the cloud infrastructure.
- Faker: generates fake customer data to populate the source tables (see the sketch after this list).
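As a quick illustration of the Faker item above, here is a hedged sketch of generating a few fake customer rows; the column names are assumptions and not the project's actual schema:

```python
# Illustrative only: generate a few fake customer rows with Faker.
# The column names are assumptions, not the project's real schema.
from faker import Faker

fake = Faker()

customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "created_at": fake.date_time_this_year().isoformat(),
    }
    for _ in range(5)
]

for row in customers:
    print(row)
```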
Some AWS services I have used in this project:
- AWS S3: Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface.
- AWS Parameter Store: an AWS service for storing configuration values and secrets, both plaintext and encrypted (see the usage sketch after this list).
- AWS RDS: a distributed relational database service by Amazon Web Services.
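The sketch below shows roughly how these services are touched from Python with boto3; the bucket name, object key, and parameter name are placeholders, not the project's real values:

```python
# Hedged sketch: stage a raw extract in S3 and read a database secret from Parameter Store.
# Bucket, key, and parameter names are placeholders; AWS credentials come from the environment.
import boto3

s3 = boto3.client("s3")
ssm = boto3.client("ssm")

# Upload a raw CSV extract to the data lake (staging layer).
s3.upload_file("customers.csv", "my-datalake-bucket", "raw/customers/customers.csv")

# Fetch a secret (e.g. the RDS password) stored as a SecureString parameter.
param = ssm.get_parameter(Name="/etl/rds_password", WithDecryption=True)
rds_password = param["Parameter"]["Value"]
```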
Make sure you have Docker Desktop installed on your computer. If you do not have Docker installed, please use this link to download:
Once Docker is installed, make sure it is up and running in the background.
Verify that the minimum memory requirements for Docker are set. Use the image below as a reference:
Make sure Docker is running with at least the minimum requirements highlighted above.
Ensure that you are in the main directory of the project, and then run the following command in the command line:
make run-app
Check that the tables have been created in both the MySQL and Postgres databases by using the following commands:
docker ps
Use the following command to access MySQL inside the container:
docker exec -it ms_container bash
After entering the MySQL database, you can find the project's tables inside the schema 'henry'. At this point we have successfully initialized the MySQL database and created all the tables we need for this project.
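If you prefer to check from the host instead of inside the container, here is a hedged sketch using mysql-connector-python; the host port, user, and password are assumptions and should match whatever the project's docker-compose file defines:

```python
# Hedged sketch: list the tables in the 'henry' schema from the host machine.
# Host, port, user, and password are placeholders; use the values from docker-compose.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    port=3306,
    user="root",
    password="example",
    database="henry",
)

cur = conn.cursor()
cur.execute("SHOW TABLES;")
for (table_name,) in cur.fetchall():
    print(table_name)

cur.close()
conn.close()
```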
Check that the application is running by going to localhost:8080 in your browser.
(Please note that the application can take anywhere between 1 and 5 minutes to start, depending on your particular system.)
Log in to the Airflow webserver using the following credentials:
- username: airflow
- password: airflow
After a few minutes, you can check whether the task has finished by clicking the task name in the Airflow UI.
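Alternatively, you can poll Airflow's stable REST API instead of clicking through the UI, assuming the basic-auth API backend is enabled in your Airflow image; the `customer_etl` dag_id below is a placeholder:

```python
# Hedged sketch: check the most recent DAG run through Airflow's stable REST API.
# The dag_id is a placeholder; airflow/airflow are the default credentials listed above.
import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags/customer_etl/dagRuns",
    auth=("airflow", "airflow"),
    params={"order_by": "-execution_date", "limit": 1},
)
resp.raise_for_status()

for run in resp.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"])
```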
You can use a database management tool (pgAdmin 4 or DBeaver) to check whether these tables exist.
You can check whether these tables exist in your local data warehouse (Postgres) by using the following steps:
docker exec -it pg_container bash
You should see the project's tables listed inside the local data warehouse.
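If you would rather check from the host without entering the container, here is a hedged sketch using psycopg2; the connection parameters are assumptions and should match the project's docker-compose settings:

```python
# Hedged sketch: list the tables in the local Postgres warehouse from the host machine.
# Host, port, user, password, and dbname are placeholders; use the values from docker-compose.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="example",
    dbname="postgres",
)

cur = conn.cursor()
cur.execute(
    "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public';"
)
for (table_name,) in cur.fetchall():
    print(table_name)

cur.close()
conn.close()
```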
Check the How To section for additional instructions.
Shut down the application by entering the following command in your terminal:
make reset
Check the Data pipeline PPT.
Open up a terminal and type:
docker ps
You will see a list of the services running inside the Docker containers.
Distributed under the MIT License.
- Primary - Henry Zou
- Secondary - Domonique Gordon