📌 Project Overview
🏗️ Architecture
🗂️ Data Modeling
⚡ Streaming Pipeline
❄️ Iceberg Metadata & Versioning
📊 Analytics Layer
📊 Business Impact & Recommendations
🚀 Airflow Orchestration
📁 Repository Structure
🤝 Contributing
📜 License
This project implements a real-time and historical analytics pipeline using a modern Kappa Architecture. It processes Olist marketplace data using:
- Kafka for event streaming
- Spark Structured Streaming for data processing
- Apache Iceberg on AWS S3 as the lakehouse storage layer
- Airflow for orchestration
- Snowflake as the analytics warehouse
- Kibana for real-time visualization
- Power BI for BI reporting
The pipeline delivers both live operational dashboards and deep historical analytics, making it ideal for marketplace, e‑commerce, and data‑intensive platforms.
Below is the high-level architecture describing ingestion, real-time serving, Iceberg storage, and BI analytics.

Dimensions: dim_customer, dim_seller, dim_product, dim_date
Facts: fact_order_line, fact_payment
Modeling Notes:
- Facts are stored at the order line (item) grain
- Dimensions use SCD Type 1 (overwrite on change)
- Snowflake Streams + Tasks handle incremental loading, capturing new Iceberg changes and applying them through scheduled MERGE operations (see the sketch below)
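A minimal sketch of that Streams + Tasks pattern, assuming snowflake-connector-python; the connection details, object names, columns, and schedule below are illustrative, not the project's actual definitions:

```python
# Hedged sketch of incremental loading with Snowflake Streams + Tasks.
# All identifiers here (stream, task, tables, columns) are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="OLIST", schema="DWH",
)
cur = conn.cursor()

# A stream captures inserts/updates landing in the staged customers table.
cur.execute("CREATE STREAM IF NOT EXISTS customers_stream ON TABLE raw_customers")

# A task drains the stream on a schedule and applies SCD Type 1 via MERGE.
cur.execute("""
CREATE TASK IF NOT EXISTS load_dim_customer
  WAREHOUSE = ETL_WH
  SCHEDULE = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('CUSTOMERS_STREAM')
AS
  MERGE INTO dim_customer d
  USING customers_stream s
    ON d.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET
    d.customer_city = s.customer_city,
    d.customer_state = s.customer_state
  WHEN NOT MATCHED THEN INSERT (customer_id, customer_city, customer_state)
    VALUES (s.customer_id, s.customer_city, s.customer_state)
""")

cur.execute("ALTER TASK load_dim_customer RESUME")  # tasks start suspended
```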
Below is a diagram of our DWH schema.

Kafka Producer:
- Reads the source CSV files incrementally
- Sends JSON events to a Kafka topic
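A minimal sketch of the producer loop, assuming the kafka-python client and a local broker; the topic name, file path, and throttle are illustrative:

```python
# Read CSV rows one by one and publish each as a JSON event.
# Topic, path, and sleep interval are placeholders, not project config.
import csv
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("data/olist_orders_dataset.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each CSV row becomes one JSON event on the topic.
        producer.send("olist_orders", value=row)
        time.sleep(0.01)  # throttle to simulate a live event stream

producer.flush()
```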
Spark Structured Streaming:
- Consumes Kafka messages
- Applies transformations
- Writes to two sinks (sketched below):
  - Elasticsearch → real-time dashboards
  - Iceberg → historical analytics
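A hedged sketch of the consumer with both sinks, assuming PySpark with the Kafka, elasticsearch-spark, and Iceberg connectors on the classpath; the topic, schema, index, and table names are placeholders:

```python
# Consume Kafka JSON events and fan out to Elasticsearch and Iceberg.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("olist-consumer").getOrCreate()

# Illustrative event schema; the real jobs parse the full Olist payload.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("order_status", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "olist_orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Sink 1: Elasticsearch for live Kibana dashboards.
es_query = (
    events.writeStream.format("es")
    .option("es.nodes", "localhost:9200")
    .option("checkpointLocation", "/tmp/ckpt/es")
    .start("olist_orders_index")
)

# Sink 2: Iceberg table on S3 (assumes an Iceberg catalog named `lakehouse`).
iceberg_query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/ckpt/iceberg")
    .toTable("lakehouse.db.fact_order_line")
)

spark.streams.awaitAnyTermination()
```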
Iceberg enables:
- Time Travel
- ACID operations
- Schema evolution
- Snapshot-based versioning
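For example, a time-travel read might look like this; the snapshot ID, timestamp, and table name are placeholders, and an Iceberg catalog is assumed to be configured on the session:

```python
# Illustrative Iceberg time-travel queries against the lakehouse table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Read the table as of a specific snapshot ID:
df_v1 = (
    spark.read.option("snapshot-id", 5976376011877337936)
    .table("lakehouse.db.fact_order_line")
)

# Or as of a point in time, via Spark SQL:
df_jan = spark.sql(
    "SELECT * FROM lakehouse.db.fact_order_line TIMESTAMP AS OF '2024-01-01 00:00:00'"
)

# List the snapshots that back versioning and rollback:
spark.sql(
    "SELECT snapshot_id, committed_at FROM lakehouse.db.fact_order_line.snapshots"
).show()
```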
Snowflake Metadata Refresh Procedure:

```sql
CALL REFRESH_ICEBERG_METADATA_SP(
    'table_name',
    'external_volume_path'
);
```

This ensures Snowflake always reads the latest Iceberg metadata JSON.
Real-Time Dashboards (Kibana):
- Most-used payment methods
- High-demand product categories
- High-demand regions
- Live monitoring of spikes or drops in orders on the platform
- Business growth tracking (e.g., total revenue, orders, customers over time)
- Top-selling categories, live
- Instant view of order status distributions
Historical Analysis (Snowflake + Power BI):
- Delivery efficiency
- Highest-demand product categories over time
- Customer purchasing behavior (e.g., churn rate, repeat purchases)
- Delivery performance over time
- Revenue and financial KPI tracking
- Demand forecasting and inventory planning with sellers
Key Power BI Dashboards:
- Executive Summary: high-level KPIs & volumes (e.g., monthly orders, delivery efficiency, top categories, top sellers/customers, maps)
- Sales & Revenue: monthly revenue, freight value, average order value, revenue by payment type, installments
- Customer Behavior: repeat rate, churn, inactive customers, insights by state, high-value customers, growth per year
- Product Analysis: top-selling categories, revenue/price per category and product
- Logistics & Fulfillment: average delivery time, on-time rate, undelivered orders, duration by day, freight by state, category shipping analysis
- Undelivered Orders Drillthrough: end-to-end analysis, seller/carrier verification, customer order analysis
- Faster delivery → higher order volume & satisfaction
- Prioritize top-performing sellers
- Re-engage inactive customers
- Focus inventory & promotions on top categories
- Reduce freight costs in high-volume states (SP, RJ, MG)
Airflow is used to orchestrate:
- Kafka Producer startup and monitoring
- Spark Consumer/Streaming job triggers
- Iceberg metadata updates
- Dimension upserts
- Fact table refreshes
DAGs follow dependency chaining to ensure end‑to‑end flow from producer → consumer → Iceberg → Snowflake.
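A minimal DAG sketch of that chaining, assuming Airflow 2.x with BashOperator; the dag_id, schedule, and the scripts/*.py helper paths are hypothetical placeholders, not the project's real definitions:

```python
# Sketch of the end-to-end dependency chain as an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="olist_pipeline",            # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",                 # illustrative schedule
    catchup=False,
) as dag:
    start_producer = BashOperator(
        task_id="start_kafka_producer",
        bash_command="python producer/producer.py",
    )
    run_consumer = BashOperator(
        task_id="run_spark_consumer",
        bash_command="spark-submit 'Transformation Code/Spark_Consumer_To_S3.py'",
    )
    refresh_metadata = BashOperator(
        task_id="refresh_iceberg_metadata",
        bash_command="python scripts/refresh_iceberg_metadata.py",  # hypothetical helper
    )
    upsert_dims = BashOperator(
        task_id="upsert_dimensions",
        bash_command="python scripts/upsert_dimensions.py",  # hypothetical helper
    )
    refresh_facts = BashOperator(
        task_id="refresh_facts",
        bash_command="python scripts/refresh_facts.py",  # hypothetical helper
    )

    # Dependency chaining: producer → consumer → Iceberg metadata → Snowflake loads.
    start_producer >> run_consumer >> refresh_metadata >> upsert_dims >> refresh_facts
```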
1. Start Docker Services

```bash
docker compose up -d
```

2. Start Kafka Producer

```bash
python producer/producer.py
```

3. Start Spark Streaming Job

To Elasticsearch:

```bash
spark-submit \
  --master local[*] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  "Transformation Code\Spark_Consumer_To_Elastic.py"
```

To S3:

```bash
spark-submit \
  --master local[*] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  "Transformation Code\Spark_Consumer_To_S3.py"
```

Pull requests are welcome.
MIT License