iamandy200/azure-data-engineering-project

📌 Overview

This project demonstrates an end-to-end data engineering pipeline built on Microsoft Azure. The pipeline ingests raw data, cleans and transforms it with PySpark on Azure Databricks, and stores the results in Azure Data Lake Storage Gen2, organized into Bronze, Silver, and Gold layers.


🏗️ Architecture

  • Azure Data Lake Storage Gen2 (ADLS Gen2)
  • Azure Data Factory (ADF)
  • Azure Databricks (PySpark)
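
As a rough illustration of how Databricks code typically addresses ADLS Gen2, the sketch below builds `abfss://` URIs for each layer. It assumes one container per layer, which this README does not confirm; the `layer_path` helper, the account name `examplelake`, and the `Sales` folder are hypothetical placeholders.

```python
# Minimal sketch: build ADLS Gen2 URIs in the abfss scheme.
# Assumption: each medallion layer lives in its own container
# (bronze, silver, gold). The account name is a placeholder.

def layer_path(account: str, layer: str, dataset: str) -> str:
    """Return an abfss URI for a dataset folder inside a layer container."""
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{dataset}"


paths = {
    layer: layer_path("examplelake", layer, "Sales")
    for layer in ("bronze", "silver", "gold")
}

for layer, path in paths.items():
    print(layer, path)
```

If the layers are folders inside a single container instead, the layer name would move into the path portion after the host.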

🔄 Workflow

  1. Data is ingested into the Bronze layer (raw data)
  2. Data is cleaned and transformed using Azure Databricks (PySpark)
  3. Processed data is stored in the Silver layer
  4. Aggregated data is stored in the Gold layer
  5. Azure Data Factory orchestrates the pipeline
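
The Bronze, Silver, and Gold steps above can be sketched in a few lines. This is not the project's notebook code: it uses plain Python in place of PySpark so it runs anywhere, and the sample rows, column names, and the `to_silver` / `to_gold` helpers are hypothetical. In the real pipeline the same steps would presumably be DataFrame operations on Databricks.

```python
from collections import defaultdict

# Hypothetical raw sales rows standing in for Bronze-layer data.
bronze = [
    {"order_id": "1", "product": "Bike", "amount": "100.0"},
    {"order_id": "2", "product": "Helmet", "amount": "25.5"},
    {"order_id": "2", "product": "Helmet", "amount": "25.5"},  # duplicate order
    {"order_id": "3", "product": "Bike", "amount": None},      # missing amount
    {"order_id": "4", "product": "Bike", "amount": "50.0"},
]


def to_silver(rows):
    """Clean Bronze rows: drop missing amounts, deduplicate, cast types."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "product": row["product"],
            "amount": float(row["amount"]),
        })
    return cleaned


def to_gold(rows):
    """Aggregate Silver rows into total sales per product."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Bike': 150.0, 'Helmet': 25.5}
```

Keeping each layer as a separate function mirrors the medallion idea: every stage reads only the output of the previous one, so a stage can be rerun without touching the raw data.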

📂 Dataset

  • AdventureWorks dataset (Sales, Customers, Products, etc.)

⚙️ Technologies Used

  • Python, PySpark
  • Azure Data Factory
  • Azure Databricks
  • ADLS Gen2

📊 Key Features

  • End-to-end ETL pipeline
  • Medallion Architecture (Bronze, Silver, Gold)
  • Scalable data processing using Spark
  • Automated workflows using ADF

🚀 Future Enhancements

  • Add real-time data processing using Event Hub
  • Integrate Power BI for visualization
  • Implement data quality checks
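
One possible shape for the planned data quality checks is sketched below. It is only an assumption about what such checks might cover (required fields and duplicate keys); the `check_quality` function, the sample rows, and the column names are hypothetical and not part of this repository.

```python
def check_quality(rows, required, key):
    """Return a list of readable issues: missing required values and duplicate keys."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Flag required columns that are absent, None, or empty strings.
        for col in required:
            if row.get(col) in (None, ""):
                issues.append(f"row {i}: missing {col}")
        # Flag key values that have already appeared in an earlier row.
        k = row.get(key)
        if k is not None:
            if k in seen:
                issues.append(f"row {i}: duplicate {key}={k}")
            seen.add(k)
    return issues


# Hypothetical customer rows with one empty email and one repeated ID.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 1, "email": "c@example.com"},
]

issues = check_quality(customers, required=["customer_id", "email"], key="customer_id")
for issue in issues:
    print(issue)
```

A check like this could run between the Bronze and Silver steps so that bad rows are reported before they reach aggregated tables.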

👨‍💻 Author

Anand Rathod
