This is a data modeling project using PostgreSQL. It builds an ETL pipeline in Python over generated song and log data. The data is in JSON, and the project analyzes the songs to discern what users are listening to.
This project uses song data generated from the Million Song Dataset.
The log dataset is generated by an Event Simulator.
songplays : records of song plays in the log data, i.e., records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users : users in the app
user_id, first_name, last_name, gender, level
songs : songs in music database
song_id, title, artist_id, year, duration
artists : artists in music database
artist_id, name, location, latitude, longitude
time : timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
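The star schema above is created by DDL statements kept as Python strings. A minimal sketch of what those statements might look like follows; the table and column names come from the schema above, while the specific types and constraints are illustrative assumptions, not necessarily what the project's sql_queries.py uses:

```python
# Illustrative CREATE TABLE statements for the fact table and one dimension
# table. The actual sql_queries.py may differ in types and constraints.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""

# Collecting the statements in a list lets create_tables.py loop over them.
create_table_queries = [songplay_table_create, user_table_create]
```

Keeping the queries in one module means create_tables.py and etl.py share a single source of truth for the schema.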
sql_queries.py : contains SQL queries for dropping and creating the fact and dimension tables, plus insert query templates.
create_tables.py : contains code for setting up the database. Running this file creates the sparkifydb database along with the fact and dimension tables.
etl.ipynb : a Jupyter notebook for analyzing the dataset before loading.
etl.py : reads and processes song_data and log_data.
test.ipynb : a notebook that connects to the Postgres database and validates the loaded data.
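The song-file side of etl.py can be sketched as follows. The record fields match the songs and artists tables above; the function name and the sample values are invented for illustration, and the real script's pandas/psycopg2 plumbing is omitted so the sketch stays self-contained:

```python
import json

def extract_song_and_artist(record: dict):
    """Split one song JSON record into (song_data, artist_data) tuples
    matching the columns of the songs and artists tables."""
    song_data = (record["song_id"], record["title"],
                 record["artist_id"], record["year"], record["duration"])
    artist_data = (record["artist_id"], record["artist_name"],
                   record["artist_location"], record["artist_latitude"],
                   record["artist_longitude"])
    return song_data, artist_data

# A record shaped like the song dataset (values are made up):
sample = json.loads("""{
  "song_id": "SOEXAMPLE1", "title": "Example Song", "year": 2003,
  "duration": 215.5, "artist_id": "AREXAMPLE1",
  "artist_name": "Example Artist", "artist_location": "CA",
  "artist_latitude": 34.05, "artist_longitude": -118.24
}""")
song_row, artist_row = extract_song_and_artist(sample)
```

Each tuple then maps one-to-one onto the parameters of the corresponding INSERT template in sql_queries.py.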
Python 3.6 or above
PostgreSQL 9.5 or above
psycopg2 - PostgreSQL database adapter for Python
Run the driver program main.py as below.
python main.py
The create_tables.py and etl.py files can also be run independently as below:
python create_tables.py
python etl.py
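When etl.py processes log_data, it must break each NextSong timestamp (milliseconds since the epoch) into the units of the time table listed above. A standard-library sketch of that conversion (the function name is hypothetical; the sample timestamp is just a plausible log value):

```python
from datetime import datetime, timezone

def build_time_row(ts_ms: int):
    """Break an epoch-millisecond timestamp into the time-table columns:
    (start_time, hour, day, week, month, year, weekday)."""
    t = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)
    return (t, t.hour, t.day, t.isocalendar()[1],
            t.month, t.year, t.weekday())

row = build_time_row(1541121934796)  # -> units for 2018-11-02 01:25:34 UTC
```

The resulting tuple lines up with the time-table INSERT, so one function call per log record fills the whole row.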