Sparkify ETL

Project overview

THe purpose of this project was to extract data given in json files, transform to desired structure and then load into PosgreSQL Database.

Data

Data comprises two sets:

Song Dataset The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "arti

Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations.

The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json

PostgreSQL Database

Created database reflects star schema optimized for queries on song play analysis. Below is some brief description on particular tables:

Fact table

songplays Records in log data associated with song plays i.e. records with page NextSong

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension tables

users Contains data about users using app

user_id, first_name, last_name, gender, level

songs Contains data about music played in app

song_id, title, artist_id, year, duration

artists Contains data about music's artists

artist_id, name, location, lattitude, longitude

time Represent timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday

ETL Description

The ETL process was coded in Python language. Latest version is saved in etl.py file.
The process could be presented in 2 following steps:

Extract data

First, data needs to be loaded. Each json file is loaded one by one and processed respectively. At first songs data is processed. After it completes, ETL process loads logs data.

Transform & Load At first song data is transformed. With each json file, one record is appended to following tables into DB respectively:

songs
artists

Once process for music data is finished, ETL starts to transform data for logs. Data is extracted and transformed and loaded into following tables respectively:

time
user
songplay

Build

In order to make process run, please follow below steps:

Navigate to project directory
Run

python3 create_tables.py

After that completes, run:

python3 etl.py

Testing

SELECT * FROM time LIMIT 5;

Should give the result:

start_time hour day week month year weekday 2018-11-28 00:00:15.796000 0 28 48 11 2018 2 2018-11-28 00:03:40.796000 0 28 48 11 2018 2 2018-11-28 00:09:38.796000 0 28 48 11 2018 2 2018-11-28 00:19:09.796000 0 28 48 11 2018 2 2018-11-28 00:25:29.796000 0 28 48 11 2018 2

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.idea		.idea
data		data
LICENSE		LICENSE
README.md		README.md
create_tables.py		create_tables.py
etl.ipynb		etl.ipynb
etl.py		etl.py
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Sparkify ETL

Project overview

Data

PostgreSQL Database

Fact table

Dimension tables

ETL Description

Build

Testing

About

Uh oh!

Releases

Packages

Languages

License

michalmiki/postgresql-etl

Folders and files

Latest commit

History

Repository files navigation

Sparkify ETL

Project overview

Data

PostgreSQL Database

Fact table

Dimension tables

ETL Description

Build

Testing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages