This project demonstrates an end-to-end data engineering pipeline built on Microsoft Azure. The pipeline ingests raw data, transforms it with PySpark on Azure Databricks, and stores it in a layered (medallion) data lake on ADLS Gen2.
## Azure Services Used
- Azure Data Lake Storage Gen2 (ADLS Gen2)
- Azure Data Factory (ADF)
- Azure Databricks (PySpark)
## Pipeline Architecture
- Data is ingested into the Bronze layer (raw data)
- Data is cleaned and transformed using Azure Databricks (PySpark)
- Processed data is stored in the Silver layer
- Aggregated data is stored in the Gold layer
- Azure Data Factory orchestrates the pipeline
## Dataset
- AdventureWorks dataset (Sales, Customers, Products, etc.)
## Tools & Technologies
- Python, PySpark
- Azure Data Factory
- Azure Databricks
- ADLS Gen2
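As a hedged sketch of how these tools connect, a Databricks notebook can read ADLS Gen2 directly over the `abfss://` protocol using a service principal. This is a configuration fragment, not runnable standalone: `spark` and `dbutils` are the objects Databricks provides inside a notebook, and every name below (storage account, container, secret scope, tenant) is a placeholder, not a value from this project.

```python
storage_account = "mystorageaccount"   # placeholder ADLS Gen2 account name
container = "bronze"                   # placeholder container (one per layer)

# Authenticate to ADLS Gen2 with a service principal; secrets come from a
# Databricks secret scope rather than being hard-coded in the notebook.
spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="sp-client-id"))
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"))
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read a raw file from the Bronze container
df = spark.read.csv(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales/",
    header=True)
```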
## Key Features
- End-to-end ETL pipeline
- Medallion Architecture (Bronze, Silver, Gold)
- Scalable data processing using Spark
- Automated workflows using ADF
## Future Enhancements
- Add real-time data processing using Azure Event Hubs
- Integrate Power BI for visualization
- Implement data quality checks
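A data quality check of the kind proposed above could start as a simple null-ratio gate run between layers (for example, before promoting Bronze data to Silver). The function name, rule, and threshold below are illustrative only, not part of the current pipeline.

```python
def check_quality(rows, required_fields, max_null_ratio=0.1):
    """Return (passed, null_ratio) for a batch of dict records.

    Fails the batch when more than max_null_ratio of the required
    field values are missing.
    """
    total = len(rows) * len(required_fields)
    nulls = sum(1 for r in rows for f in required_fields if r.get(f) is None)
    ratio = nulls / total if total else 0.0
    return ratio <= max_null_ratio, ratio

# Example batch: one of four required values is missing (ratio 0.25),
# which exceeds the 0.1 threshold, so the gate fails the batch.
batch = [
    {"sale_id": "S001", "amount": 250.0},
    {"sale_id": "S002", "amount": None},
]
passed, ratio = check_quality(batch, ["sale_id", "amount"])
```

In the pipeline this could run as a Databricks notebook step, with ADF branching on the result to halt promotion of a failing batch.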
## Author
Anand Rathod