A Redshift data warehouse for understanding the listing habits of Sparkify users
This database serves analytical data repository for Sparkify, a startup which building a new music streaming application. They intend to perform analytics on understand the listening habits of their users. Ideally, the want to know which users are playing certain songs and further gather all relevant metadata about the track played. Currently all their data is stored in Amazon S3, and will copy their data over to Redshift for analysis.
You'll need an AWS account with the resources to provision a Redshift cluster (recommended: 4 x dc2.large EC2 instances)
You'll also need this software installed on your system
In addition you'll need the PostgreSQL python driver which can be obtained via pip
pip install psycopg2
Provision a Redshift cluster within AWS utilizing either the Quick Launch wizard, AWS CLI, or the various AWS SDKs (e.g. boto3 for python).
Update the credentials and configuration details in dwh.cfg
. Change the S3 source bucket locations if necessary.
Create the database tables by running the create_tables.py
script, followed by etl.py
to perform the data loading.
$ python create_tables.py
$ python etl.py
Optionally a PostgreSQL client (or psycopg2
) can be used to connect to the Sparkify db to perform analytical queries afterwards.
sparkify-redshift-dwh/
├── README.md Documentation of the project
├── create_tables.py Python script to create all tables in Redshift
├── dwh.cfg Configuration file for setting up S3 sources & Redshift credentials
├── etl.py Python script to load data from S3 into Redshift and again into the datawarehouse
└── sql_queries.py Python file containing all SQL statements for ELT process