This project is one of several I completed during my Udacity Data Engineering training. In it, I model user activity data for a music streaming app called Sparkify: I create a relational database and an ETL pipeline designed to optimize queries for understanding which songs users are listening to. In PostgreSQL, I define fact and dimension tables and insert data into them.
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
Each line of a log file such as 2018-11-12-events.json is a JSON record describing a single user event; the fields used in this project include the user's ID, first name, last name, gender, and level, the song title, artist name, and song duration, as well as the session ID, location, user agent, and event timestamp. From the song and log datasets, I create the following fact and dimension tables.
songs: dimension table about songs

Column name | Type | Example
---|---|---
song_id | varchar PRIMARY KEY | "SONHOTT12A8C13493C"
title | varchar | "Something Girls"
artist_id | varchar | "AR7G5I41187FB4CE6C"
year | integer | 1982
duration | float | 233.40363
artists: dimension table about artists

Column name | Type | Example
---|---|---
artist_id | varchar PRIMARY KEY | "ARGSJW91187B9B1D6B"
name | varchar | "JennyAnyKind"
location | varchar | "North Carolina"
latitude | float | 35.21962
longitude | float | -80.01955
time: dimension table about the date-time of song plays

Column name | Type | Example
---|---|---
start_time | timestamp PRIMARY KEY | "2018-11-11 02:33:56.796"
hour | integer | 2
day | integer | 11
week | integer | 45
month | integer | 11
year | integer | 2018
dayofweek | integer | 6
users: dimension table about users

Column name | Type | Example
---|---|---
userId | integer PRIMARY KEY | 69
firstName | varchar | "Anabelle"
lastName | varchar | "Simpson"
gender | varchar | "F"
level | varchar | "free"
songplays: fact table about song plays

Column name | Type | Example
---|---|---
songplay_id | serial PRIMARY KEY | 1
start_time | timestamp FOREIGN KEY references time(start_time) | "2018-11-11 02:33:56.796"
userId | integer FOREIGN KEY references users(userId) | 69
song_id | varchar FOREIGN KEY references songs(song_id) | null
artist_id | varchar FOREIGN KEY references artists(artist_id) | null
level | varchar | "free"
session_id | integer | 455
location | varchar | "Philadelphia-Camden-Wilmington, PA-NJ-DE-MD"
user_agent | varchar | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
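As an illustration of this schema, below is a minimal sketch of the kind of DDL that create_tables.py could run for one dimension table and the fact table, written as Python query strings; the exact CREATE TABLE statements in the project may differ in naming and constraints.

```python
# Illustrative DDL sketch for the songs dimension table and the songplays fact
# table; these are assumptions, not the project's exact CREATE TABLE statements.
songs_table_create = """
    CREATE TABLE IF NOT EXISTS songs (
        song_id   varchar PRIMARY KEY,
        title     varchar,
        artist_id varchar,
        year      integer,
        duration  float
    );
"""

songplays_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id serial PRIMARY KEY,
        start_time  timestamp REFERENCES time (start_time),
        userId      integer   REFERENCES users (userId),
        song_id     varchar   REFERENCES songs (song_id),
        artist_id   varchar   REFERENCES artists (artist_id),
        level       varchar,
        session_id  integer,
        location    varchar,
        user_agent  varchar
    );
"""
```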
To fill the database created above from these local JSON files, we need an ETL pipeline that uses pandas to parse the files and extract the fields needed for each table. The steps to extract one file and fill one table are listed below (a code sketch of this flow follows the list):
- Create a get_files function that returns the paths of the JSON song files
- Use pandas to read one JSON file
- Select the columns for song ID, title, artist ID, year, and duration
- Select the first record in the dataframe and convert it to a list
- Execute the insert statement with that one record to add the song to the songs table
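Below is a minimal sketch of this single-file flow, assuming a local database named sparkifydb, the connection credentials shown, and a data/song_data directory; the paths, credentials, and insert statement are illustrative rather than the project's exact code.

```python
# Minimal sketch of the single-file ETL step described above; paths, credentials,
# and the insert statement are assumptions, not the project's exact code.
import glob
import os

import pandas as pd
import psycopg2


def get_files(filepath):
    """Return the paths of all JSON files found under `filepath`."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

# Connection settings are assumptions; use the project's own credentials.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Read one song file and keep only the fields needed for the songs table.
song_files = get_files("data/song_data")
df = pd.read_json(song_files[0], lines=True)
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Insert that single record into the songs table.
cur.execute(song_table_insert, song_data)
conn.commit()
conn.close()
```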
The same process is applied to the other tables with other fields extracted.
Filling the songplays table is a little more complicated, since it needs information from the songs table, the artists table, and the original log file. Because the log file does not contain an ID for either the song or the artist, you'll need to look up the song ID and artist ID by querying the songs and artists tables for matches on song title, artist name, and song duration. More details are in the notebook (etl-proto.ipynb).
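A minimal sketch of that lookup is shown below; the song_select query text and the helper function name are assumptions based on the description above, not the exact code in etl-proto.ipynb.

```python
# Hypothetical helper for the songplays lookup described above; the query text
# and function name are assumptions, not the project's exact code.
from typing import Optional, Tuple

song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""


def lookup_song_and_artist(cur, title, artist_name, duration) -> Tuple[Optional[str], Optional[str]]:
    """Return (song_id, artist_id) for one log event, or (None, None) if no match."""
    cur.execute(song_select, (title, artist_name, duration))
    result = cur.fetchone()
    return result if result else (None, None)
```

In the ETL loop this lookup runs once per log event before the insert into songplays; when no match is found, the row keeps null song_id and artist_id, as in the example values above.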
- Run create_tables.py to drop the database and tables if they exist and then create them.
- Run etl.py to execute the ETL process on the full datasets and load the data into the tables.
- Run test.ipynb to confirm that the tables were created with the correct columns. Make sure to click "Restart kernel" to close the connection to the database after running this notebook, and after etl-proto.ipynb as well.
- Run EXAMPLE-QUERIES.ipynb to execute some queries for analysis purposes.
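For instance, a query in the spirit of the following (an illustrative example, not necessarily one taken from EXAMPLE-QUERIES.ipynb; the connection settings are assumptions) shows the kind of analysis the songplays fact table supports:

```python
# Illustrative analysis query, not necessarily one from EXAMPLE-QUERIES.ipynb.
# Counts song plays per subscription level straight from the fact table.
import psycopg2

# Connection settings are assumptions; use the project's own credentials.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    SELECT level, COUNT(*) AS plays
    FROM songplays
    GROUP BY level
    ORDER BY plays DESC;
""")
for level, plays in cur.fetchall():
    print(level, plays)

conn.close()
```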