This is a data modeling project using PostgreSQL. It builds an ETL pipeline in Python over generated song and log data. The data is in JSON, and the project analyzes the songs to discern what users are listening to.
This project uses song data generated from the Million Song Dataset.
The log dataset is generated by an Event Simulator.
songplays : records of song plays in the log data, i.e., records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users : users in the app
user_id, first_name, last_name, gender, level
songs : songs in music database
song_id, title, artist_id, year, duration
artists : artists in music database
artist_id, name, location, latitude, longitude
time : timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
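The star schema above is created by DDL statements kept as Python strings. A minimal sketch of what those statements might look like follows; the table and column names come from the schema above, while the specific types and constraints are illustrative assumptions, not necessarily what the project's sql_queries.py uses:

```python
# Illustrative CREATE TABLE statements for the fact table and one dimension
# table. The actual sql_queries.py may differ in types and constraints.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""

# Collecting the statements in a list lets create_tables.py loop over them.
create_table_queries = [songplay_table_create, user_table_create]
```

Keeping the queries in one module means create_tables.py and etl.py share a single source of truth for the schema.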
sql_queries.py : contains SQL queries for dropping and creating the fact and dimension tables, plus insert query templates.
create_tables.py : contains code for setting up the database. Running this file creates the sparkifydb database along with the fact and dimension tables.
etl.ipynb : a Jupyter notebook for analyzing the dataset before loading.
etl.py : reads and processes song_data and log_data.
test.ipynb : a notebook that connects to the Postgres database and validates the loaded data.
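The song-file side of etl.py can be sketched as follows. The record fields match the songs and artists tables above; the function name and the sample values are invented for illustration, and the real script's pandas/psycopg2 plumbing is omitted so the sketch stays self-contained:

```python
import json

def extract_song_and_artist(record: dict):
    """Split one song JSON record into (song_data, artist_data) tuples
    matching the columns of the songs and artists tables."""
    song_data = (record["song_id"], record["title"],
                 record["artist_id"], record["year"], record["duration"])
    artist_data = (record["artist_id"], record["artist_name"],
                   record["artist_location"], record["artist_latitude"],
                   record["artist_longitude"])
    return song_data, artist_data

# A record shaped like the song dataset (values are made up):
sample = json.loads("""{
  "song_id": "SOEXAMPLE1", "title": "Example Song", "year": 2003,
  "duration": 215.5, "artist_id": "AREXAMPLE1",
  "artist_name": "Example Artist", "artist_location": "CA",
  "artist_latitude": 34.05, "artist_longitude": -118.24
}""")
song_row, artist_row = extract_song_and_artist(sample)
```

Each tuple then maps one-to-one onto the parameters of the corresponding INSERT template in sql_queries.py.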
Python 3.6 or above
PostgreSQL 9.5 or above
psycopg2 - PostgreSQL database adapter for Python
Run the driver program main.py as below.
python main.py
The create_tables.py and etl.py files can also be run independently as below:
python create_tables.py
python etl.py
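When etl.py processes log_data, it must break each NextSong timestamp (milliseconds since the epoch) into the units of the time table listed above. A standard-library sketch of that conversion (the function name is hypothetical; the sample timestamp is just a plausible log value):

```python
from datetime import datetime, timezone

def build_time_row(ts_ms: int):
    """Break an epoch-millisecond timestamp into the time-table columns:
    (start_time, hour, day, week, month, year, weekday)."""
    t = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)
    return (t, t.hour, t.day, t.isocalendar()[1],
            t.month, t.year, t.weekday())

row = build_time_row(1541121934796)  # -> units for 2018-11-02 01:25:34 UTC
```

The resulting tuple lines up with the time-table INSERT, so one function call per log record fills the whole row.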