thoughts_and_improvements.md

Lakehouse structure

To follow a Medallion Architecture design more closely, I could have extracted the raw data and stored it unmodified in a Bronze layer. For ease of use, I implemented a step that flattens the data before loading it into GCS. Storing the raw JSON files in the lakehouse as well would have made it easier to reconstruct the data later from the original files.
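As a minimal sketch of that idea, the raw payload could be written untouched under a date-partitioned Bronze path, with flattening kept as a separate, repeatable step. The folder layout, the `load_date=` partition name, and the function names here are assumptions for illustration, not what the pipeline currently does, and a local directory stands in for the GCS bucket.

```python
from datetime import date
from pathlib import Path


def bronze_path(root: Path, source: str, load_date: date, name: str) -> Path:
    """Build a date-partitioned location for an untouched raw file (hypothetical layout)."""
    return root / "bronze" / source / f"load_date={load_date.isoformat()}" / name


def write_raw(root: Path, source: str, load_date: date, name: str, payload: str) -> Path:
    """Store the raw JSON text exactly as received, creating folders as needed."""
    path = bronze_path(root, source, load_date, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload, encoding="utf-8")
    return path


def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Collapse nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat
```

Because the raw file is kept, the flattened output can be regenerated at any time by re-running `flatten` over the Bronze files.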

If I were dealing with larger files, it may have been better to store them in a columnar format such as Parquet.

Modelling

The extracted data contains MusicBrainz identifiers (MBIDs), so the data model could have been enriched with data from the MusicBrainz API to add more attributes to the dimension tables.
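A rough sketch of what such a lookup might look like, using only the standard library. It assumes the MusicBrainz web service's `/ws/2/<entity>/<mbid>` lookup with `fmt=json`; the `USER_AGENT` value is a placeholder and the function names are my own. MusicBrainz asks clients to identify themselves and to limit request rates, so a real implementation would also need throttling.

```python
import json
import urllib.parse
import urllib.request

MUSICBRAINZ_ROOT = "https://musicbrainz.org/ws/2"
# Placeholder identity; replace with a real application name and contact.
USER_AGENT = "example-pipeline/0.1 (contact@example.com)"


def lookup_url(entity: str, mbid: str) -> str:
    """Build the lookup URL for one entity, e.g. 'artist' or 'recording'."""
    return f"{MUSICBRAINZ_ROOT}/{entity}/{urllib.parse.quote(mbid)}?fmt=json"


def fetch_entity(entity: str, mbid: str, timeout: float = 10.0) -> dict:
    """Fetch one entity as a dictionary. Performs a network call."""
    request = urllib.request.Request(
        lookup_url(entity, mbid),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The returned dictionary could then be flattened and loaded alongside the existing dimension data, keyed on the MBID.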

Orchestration

Mage has many features that could have improved the overall pipeline. Notifications when a pipeline fails would be useful in a production environment. Mage also supports dbt natively, so the dbt processing could have been included in the orchestration step.
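To illustrate the failure-notification idea in general terms, here is a small sketch. It is plain Python and deliberately not Mage's API: `run_pipeline`, the `Step` type, and the `notify` callback are all hypothetical names, and in Mage itself this would be handled through its own notification settings.

```python
from typing import Callable, Iterable, Tuple

# A step is a (name, zero-argument callable) pair; hypothetical shape for this sketch.
Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: Iterable[Step], notify: Callable[[str], None]) -> bool:
    """Run steps in order; on the first failure, send one message and stop."""
    for name, step in steps:
        try:
            step()
        except Exception as error:
            notify(f"Pipeline failed at step '{name}': {error}")
            return False
    return True
```

The `notify` callback could post to a chat webhook or send an email; passing it in keeps the pipeline logic independent of the channel.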

Testing

More tests could be added for each block of the pipeline, along with more tests within dbt, to make the entire process more robust and reliable.
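As an example of the kind of check a pipeline block could run before loading, the sketch below mirrors the intent of dbt's generic `not_null` and `unique` tests in plain Python. The function names and the row shape (a list of dictionaries) are assumptions for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: List[Row], column: str) -> List[Any]:
    """Return each value of `column` that appears more than once, ignoring None."""
    seen = set()
    duplicates: List[Any] = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates
```

A block test could then assert that both functions return an empty list for the key column before the data is written.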