This project automates a data engineering pipeline to analyze retail sales data and generate a comprehensive sales dashboard. This dashboard will empower us to visualize revenue generated by each store and identify the factors influencing those figures.
- Problem Statement
- Solution
- Project Details
- Data Pipeline Architecture
- Running the Project
- Dashboard
- Resources
Gaining a clear understanding of store performance is crucial for effective retail management. However, manually analyzing vast sales data sets makes it difficult to:
- Compare revenue across stores: Quickly identify top-performing and underperforming locations.
- Uncover trends: Analyze factors impacting revenue, such as location, product mix, or marketing efforts.
- Make data-driven decisions: Effectively allocate resources and optimize strategies based on real-time insights.
This pipeline automates data processing and prepares the data for analysis in Google BigQuery. This allows us to:
- Visualize store revenue: Create clear visualizations comparing revenue across all stores over time (see the example query after this list).
- Identify influencing factors: Analyze the impact of location, product categories, promotions, and other relevant factors on store performance.
- Gain actionable insights: Leverage the sales dashboard to pinpoint areas for improvement and make data-driven decisions to optimize store performance and maximize profitability.
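Once the processed data is in BigQuery, a per-store revenue comparison is a single query away. Below is a minimal sketch using the `google-cloud-bigquery` client; the project, dataset, table, and column names (`my-project.retail.sales`, `store_id`, `revenue`) are hypothetical placeholders, not the project's actual schema.

```python
from google.cloud import bigquery

# Hypothetical table and columns, for illustration only.
client = bigquery.Client()
query = """
    SELECT store_id, SUM(revenue) AS total_revenue
    FROM `my-project.retail.sales`
    GROUP BY store_id
    ORDER BY total_revenue DESC
"""
for row in client.query(query).result():
    print(f"Store {row.store_id}: {row.total_revenue:,.2f}")
```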
This project proposes a data engineering pipeline that automates the data processing workflow using the following tech stack:
- Mage AI for data pipeline orchestration
- Google Cloud Platform (GCP) services for storage
- PySpark for big data loading and transformation
- DBT for data transformation
- Terraform for cloud resource provisioning
- Looker Studio for the dashboard
The pipeline will:
- Extract retail sales data from the Kaggle Retail Sales Data set, which consists of sales data collected from a Turkish retail company covering 2017 through the end of 2019.
- Validate the data against pre-defined schemas to ensure consistency and quality (see the sketch after this list).
- Load the processed data into a BigQuery table.
- Transform the data with DBT to prepare it for analysis.
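As a rough illustration of the validate-and-load steps, the sketch below enforces a schema at read time with PySpark and writes the result to BigQuery through the spark-bigquery connector. The column names, file path, table ID, and bucket are hypothetical placeholders, and the real pipeline runs these steps inside Mage AI blocks rather than as a standalone script.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, DoubleType
)

spark = SparkSession.builder.appName("retail-sales").getOrCreate()

# Hypothetical schema; the real pipeline validates against its own
# pre-defined schemas.
schema = StructType([
    StructField("store_id", StringType(), nullable=False),
    StructField("sale_date", DateType(), nullable=False),
    StructField("product_category", StringType(), nullable=True),
    StructField("revenue", DoubleType(), nullable=False),
])

# Enforcing the schema at read time drops rows that do not conform.
df = spark.read.csv(
    "data/retail_sales.csv", header=True, schema=schema, mode="DROPMALFORMED"
)

# Write to BigQuery (assumes the spark-bigquery connector is on the classpath).
(df.write.format("bigquery")
    .option("table", "my-project.retail.sales")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("append")
    .save())
```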
The diagram below shows an overview of the data pipeline architecture used in this project.
More information about building data pipelines with Mage can be found here.
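In Mage, each pipeline stage is a small Python block wired into the orchestrated pipeline. A minimal sketch of a loader block is shown below, following Mage's standard block template; the function name and file path are hypothetical placeholders.

```python
import pandas as pd

# Mage injects the decorator at runtime; this guard follows its block template.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_retail_sales(*args, **kwargs):
    # Hypothetical path; the real block ingests the Kaggle retail sales export.
    return pd.read_csv('data/retail_sales.csv')
```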
- Clone this project to your local drive.
- Ensure Docker Desktop is installed locally.
- Ensure you have a GCP account and a Kaggle account.
- Place the following JSON files in the `secrets` folder.
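These JSON files hold the service-account and API credentials that the pipeline reads at runtime. As a sketch of how the GCP key might be consumed (the filename here is a placeholder, not the project's required name):

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Hypothetical filename; use the key file exported from your GCP service account.
credentials = service_account.Credentials.from_service_account_file(
    "secrets/gcp-service-account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```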
This project utilizes a `Makefile` to automate various tasks. Here's how to use the `Makefile` to run the pipeline:
1. Provisioning cloud resources: `make terraform`
   - Use the `make terraform` command to provision your GCP resources. It triggers a series of Terraform commands: initializing Terraform in the `terraform` directory, formatting the Terraform code to ensure it adheres to a consistent style, previewing the changes Terraform will make to your GCP resources based on your configuration, and finally creating or modifying the GCP resources defined in your Terraform configuration.
Remember:
- Update the `terraform` directory with your GCP project ID and desired resource configurations before running these commands.
2. Destroying Terraform resources: `make terraform_destroy`
   - Use `make terraform_destroy` to remove all resources provisioned by Terraform. This is useful for cleaning up your environment.
3. Launching the Data Pipeline: `make start`
   - Run `make start` to start the data processing container defined in `docker-compose.yml`. This container executes the data pipeline steps using Mage AI.
4. Stopping the Data Pipeline and Cleaning Up Resources: `make clean`
   - Use `make clean` to stop the running data processing container and clean up stopped images, containers, and volumes.
5. Getting Help: `make`
   - Type `make` to display a list of available commands and their brief descriptions:

         Available rules:
         clean                Clean up data pipeline container
         start                Launch data pipeline container
         terraform            Create cloud resources via Terraform
         terraform_destroy    Destroy cloud resources via Terraform
The dashboard is available for viewing here.