Retail Sales Data Engineering Pipeline

This project implements an automated data engineering pipeline that processes retail sales data and feeds a comprehensive sales dashboard. The dashboard visualizes the revenue generated by each store and helps identify the factors influencing those figures.


Problem Statement

Gaining a clear understanding of store performance is crucial for effective retail management. However, manually analyzing large volumes of sales data makes it difficult to:

  • Compare revenue across stores: Quickly identify top-performing and underperforming locations.
  • Uncover trends: Analyze factors impacting revenue, such as location, product mix, or marketing efforts.
  • Make data-driven decisions: Effectively allocate resources and optimize strategies based on real-time insights.

Solution

This pipeline automates data processing and prepares the data for analysis in Google BigQuery. This allows us to:

  • Visualize store revenue: Create clear visualizations comparing revenue across all stores over time (see the aggregation sketch after this list).
  • Identify influencing factors: Analyze the impact of location, product categories, promotions, and other relevant factors on store performance.
  • Gain actionable insights: Leverage the sales dashboard to pinpoint areas for improvement and make data-driven decisions to optimize store performance and maximize profitability.
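
To make the first two points concrete, here is a minimal PySpark sketch that aggregates monthly revenue per store. The column names (store_id, invoice_date, line_total) and the input path are assumptions for illustration only; the actual schema and locations are defined in the project's pipeline code.

    # Minimal sketch: monthly revenue per store with PySpark.
    # Column names and the input path are assumed for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("store_revenue_sketch").getOrCreate()

    sales = spark.read.parquet("gs://<your-bucket>/retail_sales/")  # hypothetical path

    monthly_revenue = (
        sales
        .withColumn("month", F.date_trunc("month", F.col("invoice_date")))
        .groupBy("store_id", "month")
        .agg(F.sum("line_total").alias("revenue"))
        .orderBy("store_id", "month")
    )

    monthly_revenue.show()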

Project Details

This project implements a data engineering pipeline that automates the data processing workflow using the following tech stack:

  • Mage AI for data pipeline orchestration
  • Google Cloud Platform (GCP) services for storage
  • PySpark for big data loading and transformation
  • DBT for data transformation
  • Terraform for cloud resource provisioning
  • Looker Studio for the dashboard

The pipeline will:

  • Extract retail sales data from the Kaggle Retail Sales Data dataset, which contains sales data collected from a Turkish retail company covering 2017 through the end of 2019.
  • Validate the data against pre-defined schemas to ensure consistency and quality.
  • Load the processed data into a BigQuery table (a PySpark sketch of the validate-and-load steps follows this list).
  • Transform the data with DBT to prepare it for analysis.
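
The sketch below illustrates the validate-and-load steps with PySpark and the spark-bigquery connector, assuming the connector is available on the Spark classpath. The schema fields, bucket, and table names are illustrative assumptions, not the project's actual configuration.

    # Sketch of the validate-and-load steps (all names below are assumed).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DateType, DoubleType)

    spark = SparkSession.builder.appName("retail_sales_load_sketch").getOrCreate()

    # Pre-defined schema used to validate the raw data (fields are illustrative).
    schema = StructType([
        StructField("invoice_no",   StringType(), False),
        StructField("store_id",     StringType(), False),
        StructField("invoice_date", DateType(),   False),
        StructField("category",     StringType(), True),
        StructField("quantity",     DoubleType(), True),
        StructField("price",        DoubleType(), True),
    ])

    # FAILFAST makes Spark raise an error on rows that do not match the schema.
    sales = (
        spark.read
        .option("header", True)
        .option("mode", "FAILFAST")
        .schema(schema)
        .csv("gs://<your-bucket>/raw/retail_sales.csv")  # hypothetical path
    )

    # Write into BigQuery via the spark-bigquery connector.
    (
        sales.write
        .format("bigquery")
        .option("temporaryGcsBucket", "<your-temp-bucket>")  # hypothetical bucket
        .mode("overwrite")
        .save("<project>.<dataset>.retail_sales")  # hypothetical table
    )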

Data Pipeline Architecture

The diagram below shows an overview of the data pipeline architecture used in this project.

More information about the data pipeline built with Mage can be found here.

Running the Project

Prerequisites

  1. Clone this project to your local drive.
  2. Ensure Docker Desktop is installed locally.
  3. Ensure you have GCP and Kaggle accounts.
  4. Place the following JSON files in the secrets folder (a quick sanity-check sketch follows this list).
    • GCP Credentials API Key. Refer here for instructions. Ensure the proper IAM role has been granted to the user.
    • Kaggle API Key. Refer here for instructions.
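
Before launching the pipeline, a minimal Python sketch like the one below can confirm the secret files are in place. The file names (gcp-credentials.json, kaggle.json) are assumptions; use whatever names your configuration expects.

    # Sanity check that the expected secret files exist in the secrets folder.
    # The file names are assumptions for illustration; match them to your setup.
    from pathlib import Path

    SECRETS_DIR = Path("secrets")
    EXPECTED = ["gcp-credentials.json", "kaggle.json"]  # hypothetical names

    for name in EXPECTED:
        path = SECRETS_DIR / name
        print(f"{path}: {'found' if path.is_file() else 'MISSING'}")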

Makefile

This project utilizes a Makefile to automate various tasks. Here's how to use the Makefile to run the pipeline:

1. Provisioning cloud resources:

make terraform
  • Use the make terraform command to provision cloud resources. It runs a sequence of Terraform commands in the terraform directory: initializing Terraform, formatting the Terraform code to ensure a consistent style, previewing the changes Terraform will make to your GCP resources based on your configuration, and finally creating or modifying the GCP resources defined in your Terraform configuration.

Remember:

  • Update the terraform directory with your GCP project ID and desired resource configurations before running these commands.

2. Destroying Terraform resources:

make terraform_destroy
  • Use make terraform_destroy to remove all resources provisioned by Terraform. This is useful for cleaning up your environment.

3. Launching the Data Pipeline:

make start
  • Run make start to launch the data processing container defined in docker-compose.yml. This container executes the data pipeline steps using Mage AI (a sketch of a Mage loader block follows).
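
For orientation, a Mage data loader block inside that container looks roughly like the sketch below, which pulls the dataset with the Kaggle API. The dataset slug, download path, and file name are hypothetical placeholders; the import guard follows Mage's standard block template.

    # Sketch of a Mage AI data loader block that downloads the Kaggle dataset.
    # The dataset slug, path, and file name are hypothetical placeholders.
    import pandas as pd
    from kaggle.api.kaggle_api_extended import KaggleApi

    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader


    @data_loader
    def load_retail_sales(*args, **kwargs) -> pd.DataFrame:
        api = KaggleApi()
        api.authenticate()  # reads the Kaggle API key from the configured location
        api.dataset_download_files(
            '<owner>/<retail-sales-dataset>',  # hypothetical dataset slug
            path='data/raw',
            unzip=True,
        )
        return pd.read_csv('data/raw/sales.csv')  # hypothetical file name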

4. Stopping the data pipeline and cleaning up resources:

make clean
  • Use make clean to stop the running data processing container and clean up stopped containers, images, and volumes.

5. Getting Help:

  • Type make to display a list of available commands and their brief descriptions:

Available rules:

clean               Clean up data pipeline container
start               Launch data pipeline container
terraform           Create cloud resources via Terraform
terraform_destroy   Destroy cloud resources via Terraform

Dashboard

The dashboard is available for viewing here.

About

This is a project repo for Data Engineering Zoomcamp 2024
