Skip to content

michalmiki/postgresql-etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sparkify ETL

Project overview

THe purpose of this project was to extract data given in json files, transform to desired structure and then load into PosgreSQL Database.

Data

Data comprises two sets:

  • Song Dataset The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "arti

  • Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations.

The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json

PostgreSQL Database

Created database reflects star schema optimized for queries on song play analysis. Below is some brief description on particular tables:

Fact table

  1. songplays Records in log data associated with song plays i.e. records with page NextSong
  • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension tables

  1. users Contains data about users using app
  • user_id, first_name, last_name, gender, level
  1. songs Contains data about music played in app
  • song_id, title, artist_id, year, duration
  1. artists Contains data about music's artists
  • artist_id, name, location, lattitude, longitude
  1. time Represent timestamps of records in songplays broken down into specific units
  • start_time, hour, day, week, month, year, weekday

ETL Description

The ETL process was coded in Python language. Latest version is saved in etl.py file.
The process could be presented in 2 following steps:

  1. Extract data

First, data needs to be loaded. Each json file is loaded one by one and processed respectively. At first songs data is processed. After it completes, ETL process loads logs data.

  1. Transform & Load At first song data is transformed. With each json file, one record is appended to following tables into DB respectively:
  • songs
  • artists

Once process for music data is finished, ETL starts to transform data for logs. Data is extracted and transformed and loaded into following tables respectively:

  • time
  • user
  • songplay

Build

In order to make process run, please follow below steps:

  1. Navigate to project directory
  2. Run

python3 create_tables.py

After that completes, run:

python3 etl.py

Testing

SELECT * FROM time LIMIT 5;

Should give the result:

start_time hour day week month year weekday 2018-11-28 00:00:15.796000 0 28 48 11 2018 2 2018-11-28 00:03:40.796000 0 28 48 11 2018 2 2018-11-28 00:09:38.796000 0 28 48 11 2018 2 2018-11-28 00:19:09.796000 0 28 48 11 2018 2 2018-11-28 00:25:29.796000 0 28 48 11 2018 2

About

Building Python ETL pipeline for PostgreSQL DB

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published