In this project, I had the opportunity to build an end-to-end Uber data engineering pipeline. The work proceeded in the following stages:
I extracted the TLC Trip Record Data, which consisted of Yellow and Green taxi trip records. The dataset contained essential information such as pick-up and drop-off dates/times, locations, distances, fares, rate types, payment types, and passenger counts. The extracted data was then loaded into Google Cloud Storage (GCS), a reliable and scalable object storage service on Google Cloud Platform (GCP).
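The extract-and-load step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the bucket name, local file path, and the `raw/<taxi_type>/<year>/` object layout are all hypothetical, and the upload itself requires the `google-cloud-storage` package plus GCP credentials.

```python
def gcs_object_path(taxi_type: str, year: int, month: int) -> str:
    """Build a partitioned object path for one monthly TLC trip file,
    e.g. 'raw/yellow/2023/yellow_tripdata_2023-03.parquet' (layout is a
    hypothetical convention, not dictated by the dataset)."""
    return f"raw/{taxi_type}/{year}/{taxi_type}_tripdata_{year}-{month:02d}.parquet"


def upload_trip_file(bucket_name: str, local_path: str,
                     taxi_type: str, year: int, month: int) -> None:
    """Upload one downloaded trip file into GCS under the partitioned path."""
    # Imported lazily so the pure path helper works without GCP libraries.
    from google.cloud import storage
    client = storage.Client()  # picks up default GCP credentials
    blob = client.bucket(bucket_name).blob(gcs_object_path(taxi_type, year, month))
    blob.upload_from_filename(local_path)
```

Keeping the path logic in its own small function makes the bucket layout easy to test and change independently of the upload call.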
Using Jupyter Notebook and Python, I performed extensive data transformation and modeling tasks. This involved designing a fact and dimension (star schema) model for the dataset. I cleaned, organized, and structured the data to create meaningful relationships between the fact table and its dimensions, ensuring its suitability for analysis and further processing.
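The fact/dimension split can be illustrated with a toy pandas example. The column subset and sample rows below are invented for illustration; the `payment_type` code meanings (1 = credit card, 2 = cash) do come from the TLC data dictionary.

```python
import pandas as pd

# Toy sample of trip records (hypothetical subset of the TLC schema).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 08:15", "2023-03-01 09:40"]),
    "payment_type": [1, 2],
    "fare_amount": [12.5, 8.0],
})

# Dimension table: one row per distinct payment type, with a surrogate key.
payment_dim = trips[["payment_type"]].drop_duplicates().reset_index(drop=True)
payment_dim["payment_type_id"] = payment_dim.index
payment_dim["payment_type_name"] = payment_dim["payment_type"].map(
    {1: "Credit card", 2: "Cash"}  # codes per the TLC data dictionary
)

# Fact table: measures plus a foreign key into the dimension.
fact = trips.merge(payment_dim, on="payment_type")[
    ["tpep_pickup_datetime", "payment_type_id", "fare_amount"]
]
```

The same pattern repeats for each dimension (pickup datetime, rate code, location), leaving the fact table with only keys and numeric measures.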
To streamline the data processing workflow, I implemented an Extract, Transform, Load (ETL) process using a data pipeline tool called Mage. Mage is a modern data pipeline tool that facilitates the efficient extraction, transformation, and loading of data from various sources. The transformed data was then loaded into Google BigQuery, a powerful, fully managed data warehouse solution offered by GCP that enables fast and scalable analysis of large datasets. Finally, I developed a dashboard using Looker on top of the BigQuery tables to visualize the processed data.
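The transform and load stages can be sketched as two small functions. This is an assumption-laden sketch rather than the project's pipeline: the cleaning rule is illustrative, the table ID is hypothetical, and in Mage the transform body would live inside an `@transformer`-decorated block while the load would typically be a data exporter block.

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Transform step: deduplicate and drop rows with non-positive fares.
    (In Mage, this logic would sit inside an @transformer block.)"""
    out = df.drop_duplicates()
    out = out[out["fare_amount"] > 0]
    return out.reset_index(drop=True)


def load_to_bigquery(df: pd.DataFrame, table_id: str) -> None:
    """Load step: push the cleaned frame into a BigQuery table.
    Requires google-cloud-bigquery and GCP credentials; table_id like
    'my-project.uber_dataset.fact_trips' is a placeholder."""
    # Imported lazily so clean_trips runs without GCP libraries installed.
    from google.cloud import bigquery
    client = bigquery.Client()
    client.load_table_from_dataframe(df, table_id).result()  # block until the load job finishes
```

Separating the transform from the load keeps the cleaning logic unit-testable without any cloud access.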
This project was an exciting and valuable learning experience for me, as it was my first time working with Google Cloud and Mage. I gained substantial hands-on experience in data engineering, and it has motivated me to pursue more ambitious data engineering projects in the future.