Retail Sales Data Engineering Pipeline

This project implements an automated data engineering pipeline that processes retail sales data and feeds a comprehensive sales dashboard. The dashboard visualizes the revenue generated by each store and helps identify the factors influencing those figures.


Problem Statement

Gaining a clear understanding of store performance is crucial for effective retail management. However, manually analyzing large volumes of sales data makes it difficult to:

  • Compare revenue across stores: Quickly identify top-performing and underperforming locations.
  • Uncover trends: Analyze factors impacting revenue, such as location, product mix, or marketing efforts.
  • Make data-driven decisions: Effectively allocate resources and optimize strategies based on real-time insights.

Solution

This pipeline automates data processing and prepares the data for analysis in Google BigQuery. This allows us to:

  • Visualize store revenue: Create clear visualizations comparing revenue across all stores over time (see the aggregation sketch after this list).
  • Identify influencing factors: Analyze the impact of location, product categories, promotions, and other relevant factors on store performance.
  • Gain actionable insights: Leverage the sales dashboard to pinpoint areas for improvement and make data-driven decisions to optimize store performance and maximize profitability.
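
To make the first two points concrete, here is a minimal PySpark sketch that aggregates monthly revenue per store. The column names (store_id, invoice_date, line_total) and the input path are assumptions for illustration only; the actual schema and locations are defined in the project's pipeline code.

    # Minimal sketch: monthly revenue per store with PySpark.
    # Column names and the input path are assumed for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("store_revenue_sketch").getOrCreate()

    sales = spark.read.parquet("gs://<your-bucket>/retail_sales/")  # hypothetical path

    monthly_revenue = (
        sales
        .withColumn("month", F.date_trunc("month", F.col("invoice_date")))
        .groupBy("store_id", "month")
        .agg(F.sum("line_total").alias("revenue"))
        .orderBy("store_id", "month")
    )

    monthly_revenue.show()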

Project Details

This project implements a data engineering pipeline that automates the data processing workflow using the following tech stack:

  • Mage AI for data pipeline orchestration
  • Google Cloud Platform (GCP) services for storage
  • PySpark for big data loading and transformation
  • DBT for data transformation
  • Terraform for cloud resource provisioning
  • Looker Studio for the dashboard

The pipeline will:

  • Extract retail sales data from the Kaggle Retail Sales Data dataset, which contains sales data collected from a Turkish retail company covering 2017 through the end of 2019.
  • Validate the data against pre-defined schemas to ensure consistency and quality.
  • Load the processed data into a BigQuery table (a PySpark sketch of the validate-and-load steps follows this list).
  • Transform the data with DBT to prepare it for analysis.
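
The sketch below illustrates the validate-and-load steps with PySpark and the spark-bigquery connector, assuming the connector is available on the Spark classpath. The schema fields, bucket, and table names are illustrative assumptions, not the project's actual configuration.

    # Sketch of the validate-and-load steps (all names below are assumed).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DateType, DoubleType)

    spark = SparkSession.builder.appName("retail_sales_load_sketch").getOrCreate()

    # Pre-defined schema used to validate the raw data (fields are illustrative).
    schema = StructType([
        StructField("invoice_no",   StringType(), False),
        StructField("store_id",     StringType(), False),
        StructField("invoice_date", DateType(),   False),
        StructField("category",     StringType(), True),
        StructField("quantity",     DoubleType(), True),
        StructField("price",        DoubleType(), True),
    ])

    # FAILFAST makes Spark raise an error on rows that do not match the schema.
    sales = (
        spark.read
        .option("header", True)
        .option("mode", "FAILFAST")
        .schema(schema)
        .csv("gs://<your-bucket>/raw/retail_sales.csv")  # hypothetical path
    )

    # Write into BigQuery via the spark-bigquery connector.
    (
        sales.write
        .format("bigquery")
        .option("temporaryGcsBucket", "<your-temp-bucket>")  # hypothetical bucket
        .mode("overwrite")
        .save("<project>.<dataset>.retail_sales")  # hypothetical table
    )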

Data Pipeline Architecture

The diagram below shows an overview of the data pipeline architecture used in this project.

More information about the data pipeline built with Mage can be found here.

Running the Project

Prerequisites

  1. Clone this project to your local drive.
  2. Ensure Docker Desktop is installed locally.
  3. Ensure you have GCP and Kaggle accounts.
  4. Place the following JSON files in the secrets folder (a quick sanity-check sketch follows this list).
    • GCP Credentials API Key. Refer here for instructions. Ensure the proper IAM role has been granted to the user.
    • Kaggle API Key. Refer here for instructions.
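
Before launching the pipeline, a minimal Python sketch like the one below can confirm the secret files are in place. The file names (gcp-credentials.json, kaggle.json) are assumptions; use whatever names your configuration expects.

    # Sanity check that the expected secret files exist in the secrets folder.
    # The file names are assumptions for illustration; match them to your setup.
    from pathlib import Path

    SECRETS_DIR = Path("secrets")
    EXPECTED = ["gcp-credentials.json", "kaggle.json"]  # hypothetical names

    for name in EXPECTED:
        path = SECRETS_DIR / name
        print(f"{path}: {'found' if path.is_file() else 'MISSING'}")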

Makefile

This project utilizes a Makefile to automate various tasks. Here's how to use the Makefile to run the pipeline:

1. Provisioning cloud resources:

make terraform
  • Use the make terraform command to provision cloud resources. It runs a sequence of Terraform commands in the terraform directory: initializing Terraform, formatting the Terraform code to ensure a consistent style, previewing the changes Terraform will make to your GCP resources based on your configuration, and finally creating or modifying the GCP resources defined in your Terraform configuration.

Remember:

  • Update the terraform directory with your GCP project ID and desired resource configurations before running these commands.

2. Destroying Terraform resources:

make terraform_destroy
  • Use make terraform_destroy to remove all resources provisioned by Terraform. This is useful for cleaning up your environment.

3. Launching the Data Pipeline:

make start
  • Run make start to launch the data processing container defined in docker-compose.yml. This container executes the data pipeline steps using Mage AI (a sketch of a Mage loader block follows).
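
For orientation, a Mage data loader block inside that container looks roughly like the sketch below, which pulls the dataset with the Kaggle API. The dataset slug, download path, and file name are hypothetical placeholders; the import guard follows Mage's standard block template.

    # Sketch of a Mage AI data loader block that downloads the Kaggle dataset.
    # The dataset slug, path, and file name are hypothetical placeholders.
    import pandas as pd
    from kaggle.api.kaggle_api_extended import KaggleApi

    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader


    @data_loader
    def load_retail_sales(*args, **kwargs) -> pd.DataFrame:
        api = KaggleApi()
        api.authenticate()  # reads the Kaggle API key from the configured location
        api.dataset_download_files(
            '<owner>/<retail-sales-dataset>',  # hypothetical dataset slug
            path='data/raw',
            unzip=True,
        )
        return pd.read_csv('data/raw/sales.csv')  # hypothetical file name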

4. Stopping the data pipeline and cleaning up resources:

make clean
  • Use make clean to stop the running data processing container and clean up stopped containers, images, and volumes.

5. Getting Help:

  • Type make to display a list of available commands and their brief descriptions:

Available rules:

clean               Clean up data pipeline container
start               Launch data pipeline container
terraform           Create cloud resources via Terraform
terraform_destroy   Destroy cloud resources via Terraform

Dashboard

The dashboard is available for viewing here.

About

This is a project repo for Data Engineering Zoomcamp 2024
