This data engineering project uses Azure Synapse Analytics to transform and analyze New York City taxi data published by nyc.gov. It covers the full data processing pipeline, from raw data ingestion through transformation to reporting, using Azure Synapse Analytics, Apache Spark, and Power BI.
- NYC Taxi Data Overview
- Project Resources and Architecture
- Architecture Explanation & Project Working
- Synapse Pipeline Orchestration
- Power BI Reporting
- Budget Analysis For Project
- Conclusion and Future Enhancements
The project analyzes NYC taxi data, which covers several taxi types (Yellow Taxis, Green Taxis, and For-Hire Vehicles) and is organized by borough, the administrative divisions of New York City. Seven main tables feed the project: Trip Data, Taxi Zone, Calendar, Trip Type, Payment Type, Rate Code, and Vendor.
The project is built on Azure Synapse Analytics and uses Azure Data Lake Storage, a serverless SQL pool, Apache Spark, Synapse Pipelines, and Power BI. Running storage access, querying, Spark processing, and orchestration from one workspace keeps the components integrated and simplifies working with big data.
This section explains how raw data is loaded into the Raw Container, transformed from the Bronze Schema to the Silver Schema, and then transformed from the Silver Schema to the Gold Schema. The project uses External Tables, CETAS (CREATE EXTERNAL TABLE AS SELECT), and Stored Procedures for these transformations.
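To make the CETAS step more concrete, here is a minimal sketch in Python that renders a CETAS statement for promoting a table from one schema to the next. It is an illustration only, not the project's actual script: the function name `render_cetas` and the table, location, data source, and file format names (`silver.taxi_zone`, `bronze.taxi_zone`, `silver/taxi_zone`, `nyc_taxi_src`, `parquet_file_format`) are all hypothetical.

```python
def render_cetas(target_table, source_table, location, data_source, file_format):
    """Build a CETAS statement as a string.

    CETAS writes the result of the SELECT to files under `location`
    and registers an external table over those files in one step.
    All argument values used in the example below are hypothetical.
    """
    lines = [
        f"CREATE EXTERNAL TABLE {target_table}",
        "WITH (",
        f"    LOCATION = '{location}',",
        f"    DATA_SOURCE = {data_source},",
        f"    FILE_FORMAT = {file_format}",
        ")",
        "AS",
        f"SELECT * FROM {source_table};",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical Bronze-to-Silver promotion of the Taxi Zone table.
    print(render_cetas(
        target_table="silver.taxi_zone",
        source_table="bronze.taxi_zone",
        location="silver/taxi_zone",
        data_source="nyc_taxi_src",
        file_format="parquet_file_format",
    ))
```

In practice the `SELECT` would usually name and cast columns rather than use `*`, and the data source and file format would need to exist in the database before the statement runs.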
Synapse pipelines orchestrate the stages of the data processing flow: creating the Silver External Tables, partitioning Trip Data in the Silver Schema, and transforming data from the Silver Schema to the Gold Schema. Triggers schedule these pipelines.
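The partitioning stage can be sketched as follows. This is a hedged illustration that assumes Trip Data is partitioned by year and month, which the overview does not state; the folder layout `year=YYYY/month=MM`, the base path `silver/trip_data`, and the helper names are hypothetical. The idea is that a pipeline loop iterates over each partition and writes it to its own folder.

```python
from datetime import date


def month_partitions(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over to January of the next year.
            year, month = year + 1, 1


def partition_location(base, year, month):
    """Return the folder path for one partition (assumed layout)."""
    return f"{base}/year={year}/month={month:02d}"


if __name__ == "__main__":
    # Hypothetical date range; each path would be one loop iteration.
    for y, m in month_partitions(date(2020, 11, 1), date(2021, 2, 1)):
        print(partition_location("silver/trip_data", y, m))
```

Writing each partition to its own folder lets later queries read only the months they need instead of scanning the full Trip Data set.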
Power BI is used to build reports on the payment methods passengers use and on taxi demand in NYC. These reports are intended to support decision-making.
Explore the Power BI reports for detailed insights:
A budget analysis section outlines the costs incurred, which come primarily from the Azure Synapse Analytics workspace, the serverless SQL pool, pipelines, and storage.
The project successfully demonstrates the capabilities of Azure Synapse Analytics. Future enhancements could include cost optimization, real-time data processing, machine learning integration, data governance, security measures, and improved Power BI dashboards.
Feel free to explore the GitHub repository and use the code as a reference or starting point for similar projects.