Skip to content

Project Data Warehouse as part of Udacity's Data Engineering Nanodegree

Notifications You must be signed in to change notification settings

manuel-lang/Cloud-Data-Warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Warehouse

Project Data Warehouse as part of the Udactiy Data Engineering Nanodegree.

Project Summary

An implementation of a Data Warehouse leveraging AWS RedShift. This projects contains the ETL pipeline that extracts data from S3, stages them in Redshift, and transforms data into a set of dimensional tables for their analytics team.

The data on S3 contains song and log information from a music store. This solution enables music stores to easily process loads of information efficiently.

Purpose of this project

This projects processes data from different sources (in this case multiple S3 buckets) in a way that it can be analyzed easily and efficiently. The startup Sparkify thus enjoys the eased analysis of the run-time data of their application.

Project instructions

  1. Setup a redshift cluster on AWS and insert the connection details in dwh.cfg.
  2. Create the needed the database structure by executing create_tables.py.
  3. Process the data from the configured S3 data sources by executing etl.py.

Database schema

Table Description
staging_events stating table for event data
staging_songs staging table for song data
songplays information how songs were played, e.g. when by which user in which session
users user-related information such as name, gender and level
songs song-related information containing name, artist, year and duration
artists artist name and location (geo-coords and textual location)
time time-related info for timestamps

ETL pipeline

  1. Load song and log data both from S3 buckets.
  2. Stage the loaded data.
  3. Transform the data into the above described data schema.

Example queries

  • Find all users at a certain location:
SELECT DISTINCT users.user_id FROM users JOIN songplays ON songplays.user_id = users.user_id WHERE songplays.location = <LOCATION>
  • Find all songs by a given artist:
SELECT songs.song_id FROM songs JOIN artists ON songs.artist_id = artists.artist_id WHERE artist.name = <ARTIST>

About

Project Data Warehouse as part of Udacity's Data Engineering Nanodegree

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages