Data Warehouse

This is the third project of Udacity's Data Engineering Nanodegree🎓.
The purpose is to build an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.

Background

A startup called 🎵Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
The analytics team is particularly interested in understanding what songs users are listening to.
Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

File Description

  • create_tables.py creates a Redshift cluster if one does not already exist, then drops and recreates the fact and dimension tables for the star schema. Run this file to reset your tables before each run of the ETL script.
  • etl.py loads data from S3 into staging tables on Redshift and then processes that data into analytics tables on Redshift (see the COPY sketch after this list).
  • data_quality_check.ipynb checks the data insertions and the uniqueness of each table's primary key.
  • analysis.ipynb executes some analytic queries on the tables and measures the improvement gained from the chosen distribution styles.
  • iac.py is a module for creating or deleting a Redshift cluster.
  • sql_queries.py contains all SQL queries and is imported by the files above.
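The staging load in etl.py is typically done with Redshift's COPY command. Below is a minimal sketch of what one of the staging queries in sql_queries.py might look like; the S3 paths and the IAM role ARN are placeholders, not this repo's actual values:

-- Bulk-load raw JSON event logs from S3 into a staging table.
-- The S3 paths and IAM role ARN below are placeholders.
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
REGION 'us-west-2';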

Database Schema

[ERD image]

tablename | tbl_rows | sortkey1 | diststyle
artists | 9553 | artist_id | ALL
songplays | 309 | start_time | KEY(song_id)
songs | 14896 | None | KEY(song_id)
time | 6813 | start_time | ALL
users | 96 | user_id | ALL
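As a sketch of how these sort and distribution keys would be declared in the DDL (the actual definitions live in sql_queries.py; the column lists here are illustrative):

-- Fact table: distributed on song_id and sorted on start_time,
-- matching the diststyle/sortkey summary above.
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
)
DISTSTYLE KEY
DISTKEY (song_id)
SORTKEY (start_time);

-- Small dimension table: DISTSTYLE ALL replicates it to every node,
-- so joins against it avoid data shuffling.
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
)
DISTSTYLE ALL
SORTKEY (user_id);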

Usage

  1. Clone the repository
$ git clone https://github.com/kjh7176/data_warehouse

# change current working directory
$ cd data_warehouse
  2. Create the database and tables
$ python create_tables.py
  3. Execute the ETL process
$ python etl.py
  4. Confirm
    Open data_quality_check.ipynb and analysis.ipynb to verify the results.
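Besides the notebooks, a quick sanity check is to compare row counts against the tbl_rows figures in the schema summary above, e.g.:

SELECT COUNT(*) FROM songplays;  -- expect 309
SELECT COUNT(*) FROM users;      -- expect 96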

Analytics

1. Display a specific user's play history, most recently played first.

title | artist | play_date
Rianna | Fisher | 2018-11-28
I CAN'T GET STARTED | Ron Carter | 2018-11-27
Shimmy Shimmy Quarter Turn (Take It Back To Square One) | Hellogoodbye | 2018-11-26
Emergency (Album Version) | Paramore | 2018-11-26
What It Ain't | Josh Turner | 2018-11-26
Eye Of The Beholder | Metallica | 2018-11-26
Loneliness | Tomcraft | 2018-11-24
Bang! Bang! | The Knux | 2018-11-24
You're The One | Dwight Yoakam | 2018-11-24
Sun / C79 | Cat Stevens | 2018-11-24
Wax on Tha Belt (Baby G Gets Biz) | Mad Flava | 2018-11-24
Catch You Baby (Steve Pitron & Max Sanna Radio Edit) | Lonnie Gordon | 2018-11-23
Nothin' On You [feat. Bruno Mars] (Album Version) | B.o.B | 2018-11-21
Die Kunst der Fuge_ BWV 1080 (2007 Digital Remaster): Contrapunctus XVII - Inversus | Lionel Rogg | 2018-11-21
Mr. Jones | Counting Crows | 2018-11-21
You're The One | Dwight Yoakam | 2018-11-09
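The exact query presumably lives in analysis.ipynb; SQL of roughly this shape could produce such a list (the user_id value is a placeholder):

-- One user's play history, most recent first.
SELECT s.title,
       a.name AS artist,
       sp.start_time::DATE AS play_date
FROM songplays sp
JOIN songs   s ON sp.song_id   = s.song_id
JOIN artists a ON sp.artist_id = a.artist_id
WHERE sp.user_id = 49          -- placeholder user
ORDER BY sp.start_time DESC;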

2. What is the most played song of each year?

year | title | artist | play_count
2018 | You're The One | Dwight Yoakam | 37
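A hedged sketch of such a query, using a window function to rank songs within each year (the real query may differ):

-- Most played song per year.
SELECT year, title, artist, play_count
FROM (
    SELECT t.year,
           s.title,
           a.name AS artist,
           COUNT(*) AS play_count,
           RANK() OVER (PARTITION BY t.year
                        ORDER BY COUNT(*) DESC) AS rnk
    FROM songplays sp
    JOIN time    t ON sp.start_time = t.start_time
    JOIN songs   s ON sp.song_id    = s.song_id
    JOIN artists a ON sp.artist_id  = a.artist_id
    GROUP BY t.year, s.title, a.name
) ranked
WHERE rnk = 1;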

3. Display the 5 most played artists from LA.

rank | artist | play_count
1 | Linkin Park | 4
1 | Metallica | 4
3 | Black Eyed Peas | 3
4 | Katy Perry | 1
4 | Maroon 5 | 1
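The tied ranks (two artists at 1, two at 4) suggest a RANK() window function. A sketch under the assumption that artists.location stores a city string containing "Los Angeles":

-- Five most played artists located in LA.
SELECT RANK() OVER (ORDER BY COUNT(*) DESC) AS rank,
       a.name AS artist,
       COUNT(*) AS play_count
FROM songplays sp
JOIN artists a ON sp.artist_id = a.artist_id
WHERE a.location LIKE '%Los Angeles%'   -- assumed location format
GROUP BY a.name
ORDER BY play_count DESC
LIMIT 5;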

4. At what hours are women most likely to listen to music?

hour | play_count
17 | 27
15 | 18
18 | 15
16 | 14
14 | 13
11 | 13
8 | 12
20 | 12
19 | 10
21 | 10
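A sketch assuming users.gender stores 'F' for female users and the time table exposes an hour column:

-- Plays by hour for female listeners, busiest hours first.
SELECT t.hour,
       COUNT(*) AS play_count
FROM songplays sp
JOIN users u ON sp.user_id    = u.user_id
JOIN time  t ON sp.start_time = t.start_time
WHERE u.gender = 'F'
GROUP BY t.hour
ORDER BY play_count DESC
LIMIT 10;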

Examples of Data in Tables

Query

SELECT * FROM songplays LIMIT 5;

Result

[songplays table screenshot]

Query

SELECT * FROM users LIMIT 5;

Result

[users table screenshot]

Query

SELECT * FROM songs LIMIT 5;

Result

[songs table screenshot]

Query

SELECT * FROM artists LIMIT 5;

Result

[artists table screenshot]

Query

SELECT * FROM time LIMIT 5;

Result

[time table screenshot]
