This project aims to design and implement a data mesh architecture using near-real business data from the Yelp Dataset. This kind of data fits the Data Mesh goal of modeling analytics around the business. For details about the data, see Yelp Dataset Documentation.
Note that this is not a production-ready project. It is rather a lab to deepen my knowledge of Data Engineering, DevOps and, mainly, data meshes, so errors and changes will occur as my knowledge evolves. Feel free to contribute to this project by contacting me with suggestions, tips and ways to improve the code (see Contributing for more details).
Note
Please refer to Logical Architecture for details about the diagram. For information about each product (including its canvas and interaction map), refer to its own documentation in the products folder.
Note
This architecture (and the diagram) is heavily based on the tech stacks found here; more precisely, it is a mix of Datamesh Architecture: MinIO and Trino and Datamesh Architecture: dbt and Snowflake. Changes should occur as the project evolves. Please refer to the Infra READMEs for more information about the architecture.
A Kubernetes cluster is required to deploy the resources. How to deploy a local Kubernetes cluster is outside the scope of this project. This code was tested on a MicroK8s-managed cluster. If you choose the same, note that the following addons were enabled:
microk8s enable dns
microk8s enable helm
microk8s enable helm3
microk8s enable hostpath-storage
microk8s enable rbac
microk8s enable registry
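As a convenience, the addon commands above can be wrapped in a small loop. This is only a sketch and is not part of the project: the function name `enable_addons` and the `MICROK8S_BIN` override are hypothetical, and the override exists purely so the loop can be dry-run without a cluster.

```shell
# Sketch: enable the MicroK8s addons listed above in one pass.
# MICROK8S_BIN is a hypothetical override (not defined by this project);
# setting it to "echo" prints the commands instead of running them.
enable_addons() {
    bin="${MICROK8S_BIN:-microk8s}"
    for addon in dns helm helm3 hostpath-storage rbac registry; do
        # Stop at the first addon that fails to enable.
        "$bin" enable "$addon" || return 1
    done
}
```

Running `MICROK8S_BIN=echo enable_addons` first shows what would be executed; calling `enable_addons` alone runs the real commands. It may help to run `microk8s status --wait-ready` beforehand so the cluster is up before addons are enabled.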
Note
There are many solutions out there to deploy a local cluster (e.g. Minikube, Kind). You can see some examples on Kubernetes: Install Tools.
You also need to download the Yelp Dataset (photos are not required) and extract it into the data folder. To download it, please follow the instructions at Yelp Dataset: Download The Data.
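After extracting, a quick check can confirm that the dataset files landed where expected. The sketch below is an assumption-heavy helper, not project code: the function name `check_yelp_data` is hypothetical, and the file names follow the layout of the public Yelp Dataset archive as I understand it, so adjust them if your download differs or if the project expects a different folder structure.

```shell
# Sketch: report which expected Yelp Dataset files are missing from a folder.
# The file names are an assumption based on the public dataset archive,
# and "data" is the default folder mentioned in this README.
check_yelp_data() {
    dir="${1:-data}"
    missing=0
    for name in business checkin review tip user; do
        file="$dir/yelp_academic_dataset_$name.json"
        if [ ! -f "$file" ]; then
            echo "missing: $file"
            missing=1
        fi
    done
    # Exit status 0 when every file is present, 1 otherwise.
    return "$missing"
}
```

Calling `check_yelp_data` with no argument inspects the `data` folder; pass another path as the first argument to check a different location.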
This project embeds a full-featured development container for VSCode users, containing all the tools, extensions and configurations required to develop the code. If you don't know how dev containers work, please read Visual Studio Code: Developing Inside a Container.
For people who do not use VSCode, the Dockerfile contains all the tools used by the project. You can use it as a base for setting up your environment.
Since this is a lab project, I am currently the only person developing the code. However, feel free to propose new features or improvements, ask questions and share tips on the discussion tab. For bug reports, use the issues tab (with the bug template).
Note
Please read the CONTRIBUTING Guide for more details about the style guides, best practices and conventions followed by the project.
Below are the main references used by this project. Feel free to read them for a deeper understanding of the project.
- Data Mesh Architecture
- Data Mesh Architecture (Tech Stack): dbt and Snowflake
- Data Mesh Architecture (Tech Stack): MinIO and Trino
- dbt in a data mesh world
- Data Mesh Principles and Logical Architecture
- MicroK8s
- Yelp Dataset
- Medallion Architecture
- Building Data Lakes on AWS with Kafka Connect, Debezium, Apicurio Registry, and Apache Hudi