This repository contains examples of ETL pipelines and database designs in PostgreSQL, Cassandra, AWS Redshift, and Spark. Each folder contains the code and data necessary to create and populate the database.
- PostgreSQL: Folder containing SQL-oriented ETL pipeline.
- Cassandra: Folder containing NoSQL ETL pipeline.
- AWS Redshift: Folder containing AWS Redshift-based ETL pipeline.
- Spark: Folder containing Spark-based ETL pipeline.
The dataset is a subset of real data from the Million Song Dataset. Each file is in JSON/CSV format and contains metadata about a song and its artist. The files are partitioned by the first three letters of each song's track ID.
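
As a rough illustration of that partitioning, the sketch below maps a track ID to the path where its file might live. It rests on several assumptions that are not stated in this README: the helper name `partition_path`, the root folder `song_data`, the `.json` extension, one nested directory per letter, and a two-character prefix (such as `TR`) that is skipped before the three partition letters are read. The track ID in the usage line is only an example value. Adjust `skip_prefix` and `root` to match the actual layout of the data folders.

```python
from pathlib import PurePosixPath


def partition_path(track_id: str, root: str = "song_data", skip_prefix: int = 2) -> PurePosixPath:
    """Return the assumed location of the file for a given track ID.

    Assumptions (not confirmed by the repository): the first `skip_prefix`
    characters of the ID are a fixed prefix, the next three characters are
    the partition letters, and each letter is its own directory level.
    """
    letters = track_id[skip_prefix:skip_prefix + 3]
    if len(letters) < 3:
        raise ValueError(f"track ID too short to partition: {track_id!r}")
    # Unpacking the string yields one path component per letter.
    return PurePosixPath(root, *letters, f"{track_id}.json")


if __name__ == "__main__":
    # Example track ID used purely for illustration.
    print(partition_path("TRABCEI128F424C983"))
```

Passing `skip_prefix=0` makes the helper use the literal first three characters of the ID instead.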