Realtime Data Architecture: Hiring Platform

Introduction

This project is a guide to building an end-to-end data engineering pipeline. It covers each stage, from data ingestion through processing to storage in multiple databases, using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, Cassandra, MySQL, and Grafana. Every service is containerized with Docker for ease of deployment and scalability.

System Architecture

The project is designed with the following components:

Data Source: The randomuser.me API provides mock user data.
Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database.
Apache Kafka and Zookeeper: Stream data from PostgreSQL to the processing engine.
Apache Spark: Processes the data using its master and worker nodes.
Cassandra: Stores the raw user and tracking data.
MySQL: Stores the transformed and aggregated tracking data for analysis.
Grafana: Provides real-time dashboards that pull data directly from MySQL.
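
To make the first hop of the architecture concrete, here is a minimal sketch of how one randomuser.me result might be flattened and serialized before it is handed to a Kafka producer. The function names `format_user` and `to_kafka_value`, the choice of output fields, and the sample values are assumptions for illustration; they are not taken from the repository's code.

```python
import json
from typing import Any


def format_user(record: dict[str, Any]) -> dict[str, Any]:
    """Flatten one entry of the API's `results` list into a flat dict.

    The selection of fields below is an assumption, not the project's schema.
    """
    name = record["name"]
    login = record["login"]
    return {
        "id": login["uuid"],
        "first_name": name["first"],
        "last_name": name["last"],
        "gender": record["gender"],
        "email": record["email"],
        "username": login["username"],
        "country": record["location"]["country"],
        "registered_date": record["registered"]["date"],
    }


def to_kafka_value(user: dict[str, Any]) -> bytes:
    """Serialize a flattened user as UTF-8 JSON bytes (a common Kafka value encoding)."""
    return json.dumps(user, sort_keys=True).encode("utf-8")


# Hand-written sample shaped like a randomuser.me result; all values are made up.
sample = {
    "gender": "female",
    "name": {"title": "Ms", "first": "Ada", "last": "Example"},
    "location": {"city": "Sampletown", "country": "Norway"},
    "email": "ada.example@example.com",
    "login": {"uuid": "00000000-0000-0000-0000-000000000001", "username": "ada01"},
    "registered": {"date": "2020-01-01T00:00:00.000Z", "age": 4},
}

user = format_user(sample)
print(to_kafka_value(user).decode("utf-8"))
```

The resulting bytes could be passed as the message value to whichever Kafka client the pipeline uses; the sketch deliberately stops short of producing so that it runs without a broker.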

Technologies

Apache Airflow
Python
Apache Kafka
Apache Zookeeper
Apache Spark
Cassandra
MySQL
Grafana
Docker

Project Files

docker-compose.yml: Configures the containers for each technology
kafka_stream.py: Streams data from the API to Kafka
spark_stream.py: Processes data from Kafka and stores it in Cassandra
faking_log.py: Generates sample interaction log data for testing
spark_cdc.py: Captures changes in Cassandra, transforms them, and pushes them to MySQL
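
As a rough illustration of the kind of data the last two scripts deal with, the sketch below generates fake interaction events and then counts them per job and event type. It is a plain-Python stand-in, not Spark code, and the event fields (`user_id`, `job_id`, `event_type`, `ts`) and event names are hypothetical; the real schema lives in the repository's scripts.

```python
import random
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical event names, chosen only for this example.
EVENT_TYPES = ["view", "click", "apply"]


def fake_events(n, seed=0):
    """Generate n fake interaction events; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    events = []
    for i in range(n):
        events.append({
            "user_id": rng.randint(1, 5),
            "job_id": rng.randint(100, 102),
            "event_type": rng.choice(EVENT_TYPES),
            # One event per second after the start time.
            "ts": (start + timedelta(seconds=i)).isoformat(),
        })
    return events


def aggregate(events):
    """Count events per (job_id, event_type) pair."""
    return Counter((e["job_id"], e["event_type"]) for e in events)


events = fake_events(20)
counts = aggregate(events)
for (job_id, event_type), total in sorted(counts.items()):
    print(job_id, event_type, total)
```

In the actual pipeline this grouping would be expressed in Spark and written to MySQL; the counter here only shows the shape of the aggregated output that a Grafana dashboard could read.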

Getting Started

Clone the repository:

```
git clone https://github.com/MarcusLe02/realtime-pipeline-hiring-platform.git
```

Navigate to the project directory:
```
cd realtime-pipeline-hiring-platform
```
Run Docker Compose to spin up the services:
```
docker-compose up -d
```
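
Once the containers are up, a quick way to see which services are reachable is to probe their TCP ports. The sketch below uses the tools' default ports; whether `docker-compose.yml` publishes these ports on the host is an assumption, so adjust the `SERVICES` mapping to match your configuration. The helper names `is_port_open` and `report` are illustrative only.

```python
import socket

# Default ports of each tool. Assumption: the compose file maps them to the
# host unchanged; check docker-compose.yml and edit this mapping if not.
SERVICES = {
    "Airflow webserver": 8080,
    "Kafka broker": 9092,
    "Cassandra": 9042,
    "MySQL": 3306,
    "Grafana": 3000,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, reachable in report().items():
        print(f"{name}: {'up' if reachable else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the service has finished starting, so treat the output as a first check rather than a health guarantee.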
