Data Warehouse

This is the third project of Udacity's Data Engineering Nanodegree🎓.
The purpose is to build an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.

Background

A startup called 🎵Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
The analytics team is particularly interested in understanding what songs users are listening to.
Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

File Description

create_tables.py creates a Redshift cluster if one does not already exist, then drops and recreates the fact and dimension tables of the star schema. Run this file to reset the tables before each run of the ETL script.
etl.py loads data from S3 into staging tables on Redshift and then processes that data into analytics tables on Redshift.
data_quality_check.ipynb checks that data was inserted and that the primary key of each table is unique.
analysis.ipynb runs some analytic queries on the tables and measures the improvement from the chosen distribution styles.
iac.py is a module for creating or deleting a Redshift cluster.
sql_queries.py contains all SQL queries and is imported into the files above.
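
The staging step performed by etl.py can be pictured as one Redshift COPY command per staging table. The following is a minimal sketch and not the repository's actual code: the helper name `build_copy_statement`, the table name `staging_events`, the S3 path, and the IAM role ARN are all hypothetical placeholders.

```python
def build_copy_statement(table, s3_path, iam_role, json_option="auto"):
    """Return a Redshift COPY statement that loads JSON files from S3 into `table`.

    `json_option` is either 'auto' or the S3 path of a JSONPaths file.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


if __name__ == "__main__":
    # All three arguments below are made-up placeholders.
    print(
        build_copy_statement(
            table="staging_events",
            s3_path="s3://example-bucket/log_data",
            iam_role="arn:aws:iam::123456789012:role/example-redshift-role",
        )
    )
```

Keeping the statement in a helper like this mirrors the idea of collecting SQL text in one module and importing it where it is executed.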

Database Schema

tablename	tbl_rows	sortkey1	diststyle
artists	9553	artist_id	ALL
songplays	309	start_time	KEY(song_id)
songs	14896	None	KEY(song_id)
time	6813	start_time	ALL
users	96	user_id	ALL
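
As a rough illustration, the sortkey1 and diststyle values in the table above map onto Redshift CREATE TABLE attributes as sketched below. The helper `table_attributes` is hypothetical and is not taken from sql_queries.py; only the table names, sort keys, and distribution styles come from the table.

```python
import re


def table_attributes(diststyle, sortkey=None):
    """Translate a diststyle/sortkey1 pair from the table above into DDL clauses."""
    match = re.fullmatch(r"KEY\((\w+)\)", diststyle)
    if match:
        # KEY(column) means rows are distributed by the value of that column.
        clauses = ["DISTSTYLE KEY", f"DISTKEY ({match.group(1)})"]
    else:
        clauses = [f"DISTSTYLE {diststyle}"]
    if sortkey:
        clauses.append(f"SORTKEY ({sortkey})")
    return " ".join(clauses)


# Values copied from the schema table; None means no sort key.
SCHEMA = {
    "artists": ("ALL", "artist_id"),
    "songplays": ("KEY(song_id)", "start_time"),
    "songs": ("KEY(song_id)", None),
    "time": ("ALL", "start_time"),
    "users": ("ALL", "user_id"),
}

for name, (diststyle, sortkey) in SCHEMA.items():
    print(f"{name}: {table_attributes(diststyle, sortkey)}")
```

Distributing both songplays and songs on song_id places rows that join on that column on the same slice, while DISTSTYLE ALL keeps a full copy of each small dimension table on every node.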

Usage

Copy

$ git clone https://github.com/kjh7176/data_warehouse

# change current working directory
$ cd data_warehouse

Create Database and Tables

$ python create_tables.py

Execute ETL process

$ python etl.py

Confirm
Open data_quality_check.ipynb and analysis.ipynb in order to test.
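
Redshift treats primary key constraints as informational and does not enforce them, so a uniqueness check is worth running after the load. The sketch below shows one way such a check could look. It is an assumption, not the notebook's actual code: it uses an in-memory SQLite database as a stand-in so that it runs without a cluster, and the helper name `duplicate_key_count` and the sample rows are made up.

```python
import sqlite3


def duplicate_key_count(conn, table, key):
    """Return the number of surplus rows in `table` that repeat a `key` value.

    Note: COUNT(DISTINCT ...) ignores NULLs, so NULL keys are counted as surplus.
    """
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    return total - distinct


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, first_name TEXT)")
# Hypothetical rows; user_id 2 is inserted twice on purpose.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "A"), (2, "B"), (2, "B")],
)
print(duplicate_key_count(conn, "users", "user_id"))  # prints 1
```

A result of 0 for every table means each primary key value appears exactly once.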

Analytics

1. Display the playlist of a specific user, most recently played first.

title	artist	play_date
Rianna	Fisher	2018-11-28
I CAN'T GET STARTED	Ron Carter	2018-11-27
Shimmy Shimmy Quarter Turn (Take It Back To Square One)	Hellogoodbye	2018-11-26
Emergency (Album Version)	Paramore	2018-11-26
What It Ain't	Josh Turner	2018-11-26
Eye Of The Beholder	Metallica	2018-11-26
Loneliness	Tomcraft	2018-11-24
Bang! Bang!	The Knux	2018-11-24
You're The One	Dwight Yoakam	2018-11-24
Sun / C79	Cat Stevens	2018-11-24
Wax on Tha Belt (Baby G Gets Biz)	Mad Flava	2018-11-24
Catch You Baby (Steve Pitron & Max Sanna Radio Edit)	Lonnie Gordon	2018-11-23
Nothin' On You [feat. Bruno Mars] (Album Version)	B.o.B	2018-11-21
Die Kunst der Fuge_ BWV 1080 (2007 Digital Remaster): Contrapunctus XVII - Inversus	Lionel Rogg	2018-11-21
Mr. Jones	Counting Crows	2018-11-21
You're The One	Dwight Yoakam	2018-11-09

2. What is the most played song in each year?

year	title	artist	play_count
2018	You're The One	Dwight Yoakam	37

3. Display the 5 most played artists from LA.

rank	artist	play_count
1	Linkin Park	4
1	Metallica	4
3	Black Eyed Peas	3
4	Katy Perry	1
4	Maroon 5	1
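
The ranks above jump from 1 to 3 and repeat 4, which is the behaviour of the RANK() window function when play counts tie. The sketch below reproduces that ranking from the play counts shown. It is only an illustration under stated assumptions: it runs against an in-memory SQLite database (version 3.25 or later is needed for window functions), the one-column `plays` table is hypothetical, and the join and location filter of the real query are left out.

```python
import sqlite3

# Play counts copied from the result table above.
PLAY_COUNTS = {
    "Linkin Park": 4,
    "Metallica": 4,
    "Black Eyed Peas": 3,
    "Katy Perry": 1,
    "Maroon 5": 1,
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (artist TEXT)")
# One row per play, so the query has something to aggregate.
conn.executemany(
    "INSERT INTO plays VALUES (?)",
    [(artist,) for artist, count in PLAY_COUNTS.items() for _ in range(count)],
)

RANK_QUERY = """
SELECT RANK() OVER (ORDER BY play_count DESC) AS play_rank,
       artist,
       play_count
FROM (SELECT artist, COUNT(*) AS play_count FROM plays GROUP BY artist)
ORDER BY play_rank, artist
"""

rows = conn.execute(RANK_QUERY).fetchall()
for play_rank, artist, play_count in rows:
    print(play_rank, artist, play_count)
```

DENSE_RANK() would instead number the groups 1, 1, 2, 3, 3, so the choice between the two functions decides how ties affect a top-5 cut-off.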

4. At what hours are women most likely to listen to music?

hour	play_count
17	27
15	18
18	15
16	14
14	13
11	13
8	12
20	12
19	10
21	10

Examples of Data in Tables

Query

SELECT * FROM songplays LIMIT 5;

Result

Query

SELECT * FROM users LIMIT 5;

Result

Query

SELECT * FROM songs LIMIT 5;

Result

Query

SELECT * FROM artists LIMIT 5;

Result

Query

SELECT * FROM time LIMIT 5;

Result
