
ETL data into a Postgres database to enable logging and analysis for a music streaming service.


ponugoti/sparkify


Data Modeling with Postgres

Introduction

Sparkify is a makeshift music streaming service. A data infrastructure needs to be set up to enable analytics and an understanding of user behavior. Initially, the data collected by the service resides in two directories:

  • log_data contains JSON logs on user activity
  • song_data contains JSON metadata on the songs

Goal

Create a Postgres database with tables designed to optimize queries on song play analysis. The steps I've taken to achieve this are:

  1. Create database schema (in Postgres)
  2. Build an ETL pipeline for analysis using Python
  3. Test the implementation and improve data quality
  4. Compare query results against expected output

Datasets

Song Data

The Million Song Dataset offers a freely available collection of audio features and metadata for music tracks. Each file is in JSON format and contains metadata about a song and its artist. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two of the files.

data/song_data/A/B/C/TRABCEI128F424C983.json
data/song_data/A/A/B/TRAABJL12903CDCF1A.json
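Because the files are nested several directories deep, an ETL script first has to walk the partition tree and gather every JSON filepath. A minimal sketch of that step, using a recursive glob (the helper name `get_song_files` is illustrative, not from the project):

```python
import glob
import json
import os
import tempfile

def get_song_files(root):
    """Return all .json filepaths under root, regardless of nesting depth."""
    return sorted(glob.glob(os.path.join(root, "**", "*.json"), recursive=True))

# Demo against a throwaway directory mimicking the A/B/C partition layout.
root = tempfile.mkdtemp()
for rel in ["A/B/C/TRABCEI128F424C983.json", "A/A/B/TRAABJL12903CDCF1A.json"]:
    full = os.path.join(root, "song_data", rel)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "w") as f:
        json.dump({}, f)

files = get_song_files(os.path.join(root, "song_data"))
print(len(files))
```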

Here is an example of a file from song_data: TRABCFL128F149BB0D.json

{
    "num_songs": 1,
    "artist_id": "ARLTWXK1187FB5A3F8",
    "artist_latitude": 32.74863,
    "artist_longitude": -97.32925,
    "artist_location": "Fort Worth, TX",
    "artist_name": "King Curtis",
    "song_id": "SODREIN12A58A7F2E5",
    "title": "A Whiter Shade Of Pale (Live @ Fillmore West)",
    "duration": 326.00771,
    "year": 0
}
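A single song_data file like the one above carries fields for two different tables: the song itself and its artist. A sketch of splitting one record into those two rows (the column orderings here are my own choice for illustration):

```python
import json

# The example record shown above, parsed from its JSON form.
raw = """{
    "num_songs": 1,
    "artist_id": "ARLTWXK1187FB5A3F8",
    "artist_latitude": 32.74863,
    "artist_longitude": -97.32925,
    "artist_location": "Fort Worth, TX",
    "artist_name": "King Curtis",
    "song_id": "SODREIN12A58A7F2E5",
    "title": "A Whiter Shade Of Pale (Live @ Fillmore West)",
    "duration": 326.00771,
    "year": 0
}"""
record = json.loads(raw)

# Pull out the song-level and artist-level fields as two separate rows.
song_row = [record[k] for k in ("song_id", "title", "artist_id", "year", "duration")]
artist_row = [record[k] for k in ("artist_id", "artist_name", "artist_location",
                                  "artist_latitude", "artist_longitude")]
print(song_row[0])  # SODREIN12A58A7F2E5
```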

Log Data

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the first dataset. The log files are partitioned by year and month. For example, here are filepaths to two of the files.

data/log_data/2018/11/2018-11-12-events.json
data/log_data/2018/11/2018-11-13-events.json

Here is an example of a file from log_data: 2018-11-12-events.json

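A sketch of reading one such log file, assuming the simulator writes one JSON event per line (newline-delimited JSON) and that only events with page "NextSong" represent actual song plays. The sample events and their field names are illustrative:

```python
import io
import json

# Stand-in for an open log file: one JSON event per line.
sample_log = io.StringIO(
    '{"page": "NextSong", "userId": "8", "ts": 1541903636796}\n'
    '{"page": "Home", "userId": "8", "ts": 1541903640000}\n'
)

# Parse each line, then keep only the song-play events.
events = [json.loads(line) for line in sample_log]
song_plays = [e for e in events if e["page"] == "NextSong"]
print(len(song_plays))  # 1
```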

Schema for Song Play Analysis

Using the song and log datasets, we create a star schema optimized for queries on song play analysis. We build the following tables.

Fact Table

  • songplays - records in log data associated with song plays i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  • users - users of the service
    • user_id, first_name, last_name, gender, level
  • songs - songs in music database
    • song_id, title, artist_id, year, duration
  • artists - artists in music database
    • artist_id, name, location, latitude, longitude
  • time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
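The time table's columns can all be derived from a single log timestamp. A minimal sketch, assuming the log "ts" field is a Unix timestamp in milliseconds (the helper name `breakdown` is hypothetical):

```python
from datetime import datetime, timezone

def breakdown(ts_ms):
    """Expand a millisecond Unix timestamp into the time-table columns."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": dt,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),      # Monday == 0
    }

row = breakdown(1541903636796)
print(row["year"], row["month"])  # 2018 11
```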

File Breakdown

  1. sql_queries.py contains all the SQL queries that are used across the project.
  2. create_tables.py creates (and drops) tables outlined above.
  3. etl.ipynb processes a single file from song_data and log_data and loads the data into the corresponding tables.
  4. etl.py processes all files from song_data and log_data and loads them into tables.
  5. test.ipynb displays the first few rows of each table in the project.
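To illustrate the split between files: a module like sql_queries.py typically keeps each statement as a plain string constant that create_tables.py and etl.py import. The DDL below is only a sketch for the songplays table described above; the column types are my assumptions, not the project's actual definitions:

```python
# Hypothetical excerpt in the style of sql_queries.py; types are assumed.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

songplay_table_drop = "DROP TABLE IF EXISTS songplays;"

print("songplays" in songplay_table_create)  # True
```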

Dependencies

  • python==3.9
  • requirements.txt
