RetailFlow is a comprehensive ELT (Extract, Load, Transform) project designed to simulate the flow of retail sales data for an e-commerce platform. The infrastructure is provisioned and managed on AWS, with each service optimized for its specific role in the pipeline.
The data simulation is handled by a Python script executing within an AWS Lambda function. The generated data is then pushed to a PostgreSQL database instance deployed on AWS EC2.
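For illustration, here is a minimal sketch of what such a Lambda handler could look like, assuming the Faker and psycopg2 libraries and an illustrative `orders` table; the table, columns, and environment variable names are assumptions, not taken from the actual generate_fake_data.py:

```python
# Hypothetical sketch of the data-generating Lambda; table, columns, and
# environment variable names are illustrative only.
import os
import psycopg2
from faker import Faker

fake = Faker()

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["POSTGRES_HOST"],
        dbname=os.environ["POSTGRES_DB"],
        user=os.environ["POSTGRES_USER"],
        password=os.environ["POSTGRES_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        # Insert a small batch of fake orders on every invocation.
        for _ in range(100):
            cur.execute(
                "INSERT INTO orders (customer_name, product, amount, ordered_at) "
                "VALUES (%s, %s, %s, %s)",
                (
                    fake.name(),
                    fake.word(),
                    fake.pydecimal(left_digits=3, right_digits=2, positive=True),
                    fake.date_time_this_year(),
                ),
            )
    conn.close()
    return {"status": "ok", "rows_inserted": 100}
```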
Data is ingested from the Postgres database into Snowflake, the data warehouse, using Airbyte. Airbyte operates on its own EC2 instance, ensuring dedicated resources for the critical task of data synchronization.
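Source and destination setup happens in the Airbyte UI (see the setup steps below), but once a connection exists a sync can be kicked off against Airbyte's local API. A rough sketch, assuming the API is reachable at `http://localhost:8000/api/v1` (adjust to how you expose Airbyte, e.g. via `make port-forwarding-airbyte`) and using a placeholder connection ID:

```python
# Rough sketch: trigger a manual sync of an existing Airbyte connection.
# The base URL and connection ID below are assumptions; fill them in to
# match your own Airbyte deployment and configured connection.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"            # assumed base URL
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

resp = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # job metadata for the triggered sync
```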
For the transformation phase, we use a combination of Dagster and dbt, two cutting-edge tools in the data engineering ecosystem: Dagster orchestrates the dbt models that transform the raw data inside Snowflake. Both tools are deployed on an EC2 instance, allowing for a flexible and powerful transformation process.
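As a minimal sketch of this orchestration, the Dagster asset below simply shells out to the dbt CLI; the real project may use the dagster-dbt integration instead, and the dbt project path is an assumption:

```python
# Minimal sketch: a Dagster asset that runs the dbt project via the CLI.
# The project directory path is an assumption, not the repo's actual layout.
import subprocess
from dagster import Definitions, asset

@asset
def dbt_models() -> None:
    """Run all dbt models against the Snowflake warehouse."""
    subprocess.run(
        ["dbt", "run", "--project-dir", "/opt/dbt/retailflow"],
        check=True,
    )

defs = Definitions(assets=[dbt_models])
```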
The final piece of the pipeline is data visualization, which is handled by Metabase. Running on a dedicated EC2 instance, Metabase provides intuitive and insightful data analytics, allowing stakeholders to extract meaningful conclusions from the data.
The entire infrastructure is provisioned and managed with Terraform, an Infrastructure as Code (IaC) tool that simplifies and standardizes infrastructure deployment. At the application level, we use Docker for containerization, ensuring consistency across all stages of development and production.
```mermaid
graph LR
subgraph L["AWS Lambda"]
style L fill:#e8fce8
LA["generate_fake_data.py"]
end
subgraph EC2_1["EC2 Instance"]
subgraph D1["Docker"]
style D1 fill:#d4ebf2
P["Postgres DB"]
end
end
subgraph EC2_2["EC2 Instance"]
A["Airbyte"]
end
subgraph EC2_5["Hosted on AWS"]
S["Snowflake"]
end
subgraph EC2_3["EC2 Instance"]
subgraph D3["Docker"]
style D3 fill:#d4ebf2
D["dbt + Dagster"]
end
end
subgraph EC2_4["EC2 Instance"]
subgraph D4["Docker"]
style D4 fill:#d4ebf2
M["Metabase"]
end
end
L -- "Generates Fake Data" --> P
P -- "Data Ingestion" --> A
A -- "Data Loading" --> S
S -- "Data Transformation" --> D
D -- "Data Transformation" --> S
S -- "Data Visualization" --> M
linkStyle 0 stroke:#2ecd71,stroke-width:2px;
linkStyle 1 stroke:#2ecd71,stroke-width:2px;
linkStyle 2 stroke:#2ecd71,stroke-width:2px;
linkStyle 3 stroke:#2ecd71,stroke-width:2px;
linkStyle 4 stroke:#2ecd71,stroke-width:2px;
linkStyle 5 stroke:#2ecd71,stroke-width:2px;
```
- AWS Account
- AWS CLI (installed and configured)
- Create a Snowflake account and note down the account_id, username and password (a quick connectivity check is sketched below)
- Docker
- docker-compose
- Terraform
You can install these requirements using the following command: `brew install docker docker-compose awscli terraform`
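To confirm the Snowflake credentials noted in the prerequisites, a small connectivity check with the snowflake-connector-python package can help; the environment variable names below are placeholders, not project conventions:

```python
# Quick connectivity check for the Snowflake account_id, username and
# password noted above; the environment variable names are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],   # the account_id noted earlier
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
cur = conn.cursor()
cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_USER()")
print(cur.fetchone())
cur.close()
conn.close()
```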
To see a full list of commands, run `make help`.
- Run `make venv-setup` to create your virtual environment
- Run `make initial-config` to set up everything related to containers, container orchestration, permissions, etc.
- Run `make infra-up` to deploy the pipeline to AWS and wait until you see the 'All Ready' message
- Run `make port-forwarding-airbyte` and configure the PostgresDB source and the Snowflake destination (this cannot be done programmatically)
- Run `make port-forwarding-metabase` and configure the Snowflake source (this cannot be done programmatically)
- Explore the remainder of the project by running `make port-forwarding-dbt`, `make ssh-postgres`, `make port-forwarding-dagster`, `make open-snowflake`, or `make print-lambda` to interact with the EC2 instances (port-forwarding, SSH, information, etc.)
- Once you are finished, run `make infra-down` to tear down the infrastructure