henryzzz093/data_pipeline2


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. How To
  5. License
  6. Maintainer

About The Project

Case Study: Imagine that we have an application in charge of collecting customer information. The BI team needs to analyze the data constantly, but we do not want them to analyze the data directly from the raw database. We therefore created two layers above that raw database so that the BI team can build BI tools on top of the data warehouse. The data lake serves as a staging area that keeps data in raw format before any transformation is applied, and as a backup in case the data warehouse is down.

We used Docker to host each isolated service and Apache Airflow to execute our pipeline on a daily basis. The goal of this project is to simulate a data ETL process: use a custom REST API to extract three different tables from the local MySQL database, apply the required transformations to the data, load the data first into the AWS S3 data lake, and then load it into both AWS RDS and the local Postgres database.
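
To make the flow concrete, here is a minimal sketch of such a daily DAG. It is illustrative only: the DAG id, REST API endpoint, and table names are assumptions, not the project's actual code.

    # Minimal, illustrative sketch of the daily ETL DAG (not the project's actual code).
    # The API endpoint and table names are placeholders.
    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    TABLES = ["customers", "orders", "transactions"]  # hypothetical table names

    def extract(table, **_):
        # Pull raw rows for one table from the custom REST API in front of MySQL.
        resp = requests.get(f"http://api:5000/{table}")  # placeholder endpoint
        resp.raise_for_status()
        return resp.json()

    def stage_to_s3(table, **_):
        # Land the raw extract in the S3 data lake before any transformation.
        pass

    def load_to_warehouses(table, **_):
        # Apply transformations, then load into AWS RDS and the local Postgres warehouse.
        pass

    with DAG(
        dag_id="etl_sketch",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for table in TABLES:
            extract_task = PythonOperator(
                task_id=f"extract_{table}", python_callable=extract, op_args=[table]
            )
            stage_task = PythonOperator(
                task_id=f"stage_{table}", python_callable=stage_to_s3, op_args=[table]
            )
            load_task = PythonOperator(
                task_id=f"load_{table}", python_callable=load_to_warehouses, op_args=[table]
            )
            extract_task >> stage_task >> load_task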



Built With

Some major frameworks/libraries used to bootstrap this project:

  • Apache Airflow: data pipeline scheduling and orchestration.
  • Docker: isolates each service's environment inside a container.
  • mysql-connector-python: used to establish a connection with the MySQL database.
  • psycopg2-binary: used to establish a connection with the Postgres database.
  • Terraform: open-source Infrastructure as Code (IaC) tool that helps us automatically set up or tear down the cloud-based infrastructure.
  • Faker: generates fake data to build the source tables.
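
As an illustration of the Faker bullet above, the following sketch generates a handful of fake customer rows; the column names are assumptions rather than the project's actual schema:

    # Illustrative Faker usage: generate fake customer rows (assumed columns).
    from faker import Faker

    fake = Faker()

    def generate_customers(n=5):
        # Build n fake customer records ready to be inserted into a source table.
        return [
            {
                "name": fake.name(),
                "email": fake.email(),
                "address": fake.address(),
                "created_at": fake.date_time_this_year().isoformat(),
            }
            for _ in range(n)
        ]

    if __name__ == "__main__":
        for row in generate_customers():
            print(row)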

Some AWS services used in this project:

  • AWS S3: Amazon Simple Storage Service, which provides object storage through a web service interface; it serves as the data lake in this project.
  • AWS Parameter Store: an AWS service for storing configuration values, both secret and non-secret.
  • AWS RDS: a distributed relational database service from Amazon Web Services.
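
For reference, here is a minimal boto3 sketch of how these services are typically used together; the bucket, object key, and parameter name are placeholders, not the project's actual values:

    # Illustrative boto3 usage: stage a file in S3 and read a secret from Parameter Store.
    # Bucket, key, and parameter names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    ssm = boto3.client("ssm")

    # Upload a raw extract to the data lake bucket.
    s3.upload_file("customers.csv", "my-datalake-bucket", "raw/customers.csv")

    # Fetch a database password stored as a SecureString parameter.
    param = ssm.get_parameter(Name="/data_pipeline/db_password", WithDecryption=True)
    db_password = param["Parameter"]["Value"]
    print("Fetched credential of length", len(db_password))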


Getting Started

Prerequisites



Docker

Make sure you have Docker Desktop installed on your computer. If you do not have Docker installed, please use this link to download it:

Once Docker is installed, make sure it is up and running in the background.

Verify that the minimum memory requirements for Docker are set. Use the image below as a reference:

Running the Application

  1. Make sure Docker is running with the minimum memory requirements highlighted above.

  2. Ensure that you are in the main directory of the project, and then run the following command in the command line:

    make run-app
  3. Check that the tables have been created in both the MySQL and Postgres databases by using the following commands (a Python check is also sketched after this list):

    docker ps

    Use the following command to access MySQL inside the container:

    docker exec -it ms_container bash

    After entering the MySQL database, you can find the tables below inside the 'henry' schema. At this point we have successfully initialized the MySQL database and created all the tables needed for this project.

  4. Check that the application is running by going to localhost:8080 in your browser.

    (Please note that the application can take anywhere between 1 and 5 minutes to start, depending on your particular system.)

  5. Log in to the Airflow webserver using the following credentials:

    • username: airflow
    • password: airflow
  6. Trigger the DAGs to ensure that they are working properly.

  7. After a few minutes, you can check whether the tasks have finished by clicking the task name in the Airflow UI:

  8. You can use database management tools (pgAdmin 4 or DBeaver) to check whether these tables exist there.

  9. You can check whether these tables exist in your local data warehouse (Postgres) by using the following steps (see also the Python sketch after this list):

    docker exec -it pg_container bash

    Tables inside the local data warehouse:

    Values inside the local data warehouse:

  10. Check the How To section for additional instructions.

  11. Shut down the application by entering the following command in your terminal:

    make reset
  12. Check the Data pipeline PPT.
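
For steps 3 and 9, the following Python sketch lists the tables from outside the containers using mysql-connector-python and psycopg2. The hosts, ports, users, passwords, and database names are assumptions; adjust them to match your docker-compose configuration:

    # Illustrative table check for MySQL (source) and Postgres (local warehouse).
    # All connection details are assumptions; adjust to your docker-compose setup.
    import mysql.connector
    import psycopg2

    # Source database (MySQL, 'henry' schema).
    mysql_conn = mysql.connector.connect(
        host="localhost", port=3306, user="root", password="root", database="henry"
    )
    cur = mysql_conn.cursor()
    cur.execute("SHOW TABLES")
    print("MySQL tables:", [row[0] for row in cur.fetchall()])
    cur.close()
    mysql_conn.close()

    # Local data warehouse (Postgres).
    pg_conn = psycopg2.connect(
        host="localhost", port=5432, user="airflow", password="airflow", dbname="airflow"
    )
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
        )
        print("Postgres tables:", [row[0] for row in cur.fetchall()])
    pg_conn.close()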

How To

1. How to verify that Airflow is running

Open up a terminal and type:

docker ps

You will see a list of the services running inside Docker containers:
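
As an additional check, the Airflow webserver exposes a /health endpoint on the same port used above (8080), which reports the status of the metadatabase and scheduler:

    # Quick health check against the Airflow webserver (port 8080, as used above).
    import requests

    resp = requests.get("http://localhost:8080/health", timeout=10)
    resp.raise_for_status()
    print(resp.json())  # expect "healthy" for the metadatabase and scheduler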

License

Distributed under the MIT License.

Maintainer

(back to top)
