This is the third project of Udacity's Data Engineering Nanodegree🎓.
The purpose is to build an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.
A startup called 🎵Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
The analytics team is particularly interested in understanding what songs users are listening to.
Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
- create_tables.py: creates a cluster if one does not already exist, then drops and creates the fact and dimension tables for the star schema in Redshift. Run this file to reset your tables each time before you run the ETL script.
- etl.py: loads data from S3 into staging tables on Redshift, then processes that data into analytics tables on Redshift.
- data_quality_check.ipynb: checks the data insertions and the uniqueness of the primary key in each table.
- analysis.ipynb: executes some analytic queries on the tables and measures the improvement from the chosen distribution style.
- iac.py: a module for creating or deleting a Redshift cluster.
- sql_queries.py: contains all SQL queries and is imported into the files above.
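The two ETL stages described for etl.py (bulk-load from S3 into staging, then insert into analytics tables) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `build_copy_query` and `run_queries`, the bucket path, the IAM role ARN, the default region, and the column names are all assumptions, and the cursor is assumed to be any DB-API cursor (for example one from psycopg2).

```python
def build_copy_query(table, s3_path, iam_role_arn, json_option="auto", region="us-west-2"):
    """Build a Redshift COPY statement that bulk-loads JSON files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}'\n"
        f"REGION '{region}';"
    )


def run_queries(cursor, queries):
    """Execute each statement in order on a DB-API cursor."""
    for query in queries:
        cursor.execute(query)


# Hypothetical values: replace with your own bucket, role, and table names.
staging_copy = build_copy_query(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)

# Second stage: fill an analytics table from the staging table.
# Column names are illustrative only.
users_insert = """
INSERT INTO users (user_id, first_name, last_name)
SELECT DISTINCT user_id, first_name, last_name
FROM staging_events
WHERE user_id IS NOT NULL;
"""

print(staging_copy)
```

With a live connection, `run_queries(cursor, [staging_copy, users_insert])` would run the load first and the insert second, which is the order the pipeline depends on.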
tablename | tbl_rows | sortkey1 | diststyle |
---|---|---|---|
artists | 9553 | artist_id | ALL |
songplays | 309 | start_time | KEY(song_id) |
songs | 14896 | None | KEY(song_id) |
time | 6813 | start_time | ALL |
users | 96 | user_id | ALL |
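The distribution styles and sort keys listed above map onto the clause that follows the column list in a Redshift `CREATE TABLE` statement. The sketch below shows one way to generate that clause from the values in the table; the helper names and the column definitions are assumptions made for illustration and are not taken from sql_queries.py.

```python
# Distribution key and sort key per table, copied from the table above.
# A distkey of None means DISTSTYLE ALL; a sortkey of None means no sort key.
LAYOUT = {
    "artists":   {"distkey": None,      "sortkey": "artist_id"},
    "songplays": {"distkey": "song_id", "sortkey": "start_time"},
    "songs":     {"distkey": "song_id", "sortkey": None},
    "time":      {"distkey": None,      "sortkey": "start_time"},
    "users":     {"distkey": None,      "sortkey": "user_id"},
}


def table_attributes(distkey=None, sortkey=None):
    """Return the Redshift clause that follows the column list in CREATE TABLE."""
    parts = [f"DISTSTYLE KEY DISTKEY ({distkey})" if distkey else "DISTSTYLE ALL"]
    if sortkey:
        parts.append(f"SORTKEY ({sortkey})")
    return " ".join(parts)


def create_table_sql(table, columns):
    """Assemble a CREATE TABLE statement from a list of column definitions."""
    column_sql = ",\n    ".join(columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {column_sql}\n) "
        f"{table_attributes(**LAYOUT[table])};"
    )


# Column definitions are assumed for illustration only.
print(create_table_sql("songs", ["song_id VARCHAR", "title VARCHAR", "artist_id VARCHAR"]))
```

Distributing both songplays and songs on song_id keeps rows with the same song on the same slice, so a join on that column needs less data movement, while DISTSTYLE ALL copies the smaller dimension tables to every node.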
- Copy
$ git clone https://github.com/kjh7176/data_warehouse
# change current working directory
$ cd data_warehouse
- Create Database and Tables
$ python create_tables.py
- Execute ETL process
$ python etl.py
- Confirm
Open data_quality_check.ipynb and analysis.ipynb in order to test.
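Redshift treats primary key constraints as informational and does not enforce them, so a uniqueness check has to be run explicitly. A check of the kind data_quality_check.ipynb performs could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; the function name `primary_key_report` and the sample rows are hypothetical.

```python
import sqlite3


def primary_key_report(conn, table, key):
    """Compare the total row count with the number of distinct key values."""
    # Identifiers are interpolated directly, so pass only trusted names.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    return {"rows": total, "distinct_keys": distinct, "unique": total == distinct}


# Stand-in database with one deliberately duplicated key (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "A"), (2, "B"), (2, "B")],
)

report = primary_key_report(conn, "users", "user_id")
print(report)  # {'rows': 3, 'distinct_keys': 2, 'unique': False}
```

The same `COUNT(*)` versus `COUNT(DISTINCT ...)` comparison can be run against each Redshift table. Note that `COUNT(DISTINCT ...)` ignores NULLs, so a NULL key would also show up as a mismatch.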
title | artist | play_date |
---|---|---|
Rianna | Fisher | 2018-11-28 |
I CAN'T GET STARTED | Ron Carter | 2018-11-27 |
Shimmy Shimmy Quarter Turn (Take It Back To Square One) | Hellogoodbye | 2018-11-26 |
Emergency (Album Version) | Paramore | 2018-11-26 |
What It Ain't | Josh Turner | 2018-11-26 |
Eye Of The Beholder | Metallica | 2018-11-26 |
Loneliness | Tomcraft | 2018-11-24 |
Bang! Bang! | The Knux | 2018-11-24 |
You're The One | Dwight Yoakam | 2018-11-24 |
Sun / C79 | Cat Stevens | 2018-11-24 |
Wax on Tha Belt (Baby G Gets Biz) | Mad Flava | 2018-11-24 |
Catch You Baby (Steve Pitron & Max Sanna Radio Edit) | Lonnie Gordon | 2018-11-23 |
Nothin' On You [feat. Bruno Mars] (Album Version) | B.o.B | 2018-11-21 |
Die Kunst der Fuge_ BWV 1080 (2007 Digital Remaster): Contrapunctus XVII - Inversus | Lionel Rogg | 2018-11-21 |
Mr. Jones | Counting Crows | 2018-11-21 |
You're The One | Dwight Yoakam | 2018-11-09 |
year | title | artist | play_count |
---|---|---|---|
2018 | You're The One | Dwight Yoakam | 37 |
rank | artist | play_count |
---|---|---|
1 | Linkin Park | 4 |
1 | Metallica | 4 |
3 | Black Eyed Peas | 3 |
4 | Katy Perry | 1 |
4 | Maroon 5 | 1 |
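A ranking with ties like the one above (1, 1, 3, 4, 4) is what the SQL `RANK()` window function produces. The sketch below reproduces that pattern on an in-memory SQLite database (version 3.25 or later is needed for window functions), using the play counts from the table as pre-aggregated input. The table name `artist_plays` is hypothetical, and the query in analysis.ipynb may well differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist_plays (artist TEXT, play_count INTEGER)")
conn.executemany(
    "INSERT INTO artist_plays VALUES (?, ?)",
    [
        ("Katy Perry", 1),
        ("Metallica", 4),
        ("Black Eyed Peas", 3),
        ("Maroon 5", 1),
        ("Linkin Park", 4),
    ],
)

# RANK() gives tied rows the same rank and skips the following rank(s).
RANK_QUERY = """
SELECT RANK() OVER (ORDER BY play_count DESC) AS artist_rank,
       artist,
       play_count
FROM artist_plays
ORDER BY artist_rank, artist;
"""

rows = conn.execute(RANK_QUERY).fetchall()
for row in rows:
    print(row)
```

`DENSE_RANK()` would instead number the groups 1, 1, 2, 3, 3, so the gap after the tie is a sign that plain `RANK()` semantics are in play.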
hour | play_count |
---|---|
17 | 27 |
15 | 18 |
18 | 15 |
16 | 14 |
14 | 13 |
11 | 13 |
8 | 12 |
20 | 12 |
19 | 10 |
21 | 10 |
Query
SELECT * FROM songplays LIMIT 5;
Result
Query
SELECT * FROM users LIMIT 5;
Result
Query
SELECT * FROM songs LIMIT 5;
Result
Query
SELECT * FROM artists LIMIT 5;
Result
Query
SELECT * FROM time LIMIT 5;
Result