Sparkify Redshift Data Warehouse

A Redshift data warehouse for understanding the listening habits of Sparkify users

Purpose

This database serves as the analytical data repository for Sparkify, a startup that is building a new music streaming application. They intend to run analytics to understand the listening habits of their users. Ideally, they want to know which users are playing which songs and to gather all relevant metadata about the tracks played. Currently all their data is stored in Amazon S3; it will be copied over to Redshift for analysis.

Setup

Requirements

You'll need an AWS account with the resources to provision a Redshift cluster (recommended: 4 x dc2.large nodes).

You'll also need this software installed on your system

In addition, you'll need the PostgreSQL Python driver, which can be installed via pip:

pip install psycopg2

Quick Start

Provision a Redshift cluster within AWS using the Quick Launch wizard, the AWS CLI, or one of the AWS SDKs (e.g. boto3 for Python).
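For the SDK route, a minimal sketch with boto3 might look like the following. The cluster identifier, database name, credentials, IAM role ARN, and region are all placeholder assumptions, not values taken from this project. The node type and count follow the recommendation in Requirements.

```python
def build_cluster_params(identifier, db_name, user, password, iam_role_arn):
    """Collect keyword arguments for the Redshift create_cluster call."""
    return {
        "ClusterIdentifier": identifier,
        "ClusterType": "multi-node",
        "NodeType": "dc2.large",   # recommended node type from Requirements
        "NumberOfNodes": 4,        # recommended node count from Requirements
        "DBName": db_name,
        "MasterUsername": user,
        "MasterUserPassword": password,
        "IamRoles": [iam_role_arn],  # role the cluster uses to read from S3
    }


def provision_cluster(params, region="us-west-2"):
    """Send the create request to AWS. The region is an assumption."""
    import boto3  # imported lazily so building the params needs no AWS SDK

    redshift = boto3.client("redshift", region_name=region)
    return redshift.create_cluster(**params)


# Placeholder values for illustration only.
params = build_cluster_params(
    identifier="sparkify-dwh",
    db_name="sparkify",
    user="dwh_user",
    password="Change-me-1",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-s3-read",
)
# provision_cluster(params)  # uncomment to create the cluster (incurs AWS cost)
```

Keeping the parameters in a plain dictionary makes them easy to review before anything is created in AWS.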

Update the credentials and configuration details in dwh.cfg. Change the S3 source bucket locations if necessary.
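The layout of dwh.cfg is not reproduced in this README, so the sketch below assumes a hypothetical INI layout with CLUSTER and S3 sections, and shows how such a file could be read with Python's standard configparser module. The section names, keys, and values are illustrative assumptions. Check the actual file for the names it uses.

```python
import configparser

# Hypothetical contents; the real dwh.cfg may use different sections and keys.
SAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = sparkify
DB_USER = dwh_user
DB_PASSWORD = change-me
DB_PORT = 5439

[S3]
LOG_DATA = s3://example-bucket/log_data
SONG_DATA = s3://example-bucket/song_data
"""


def load_config(text):
    """Parse INI-formatted text into a ConfigParser object."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config


config = load_config(SAMPLE_CFG)
port = config.getint("CLUSTER", "DB_PORT")   # parsed as an integer
log_data = config.get("S3", "LOG_DATA")      # S3 source location
print(port, log_data)
```

To read the file from disk instead of a string, `config.read("dwh.cfg")` can be used in place of `read_string`.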

Create the database tables by running the create_tables.py script, followed by etl.py to perform the data loading.

$ python create_tables.py 
$ python etl.py
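Loading from S3 into Redshift is commonly done with the Redshift COPY command. The actual statements live in sql_queries.py and are not shown here, so the following is only a sketch of what one such statement could look like. The table name, bucket path, IAM role ARN, region, and the JSON source format are all assumptions.

```python
def build_copy_statement(table, s3_path, iam_role_arn,
                         region="us-west-2", json_option="auto"):
    """Build a Redshift COPY statement that loads JSON files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Hypothetical table and bucket names, for illustration only.
statement = build_copy_statement(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-s3-read",
)
print(statement)
```

The resulting string can be passed to a database cursor's `execute` method once a connection to the cluster is open.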

Optionally, a PostgreSQL client (or psycopg2) can be used afterwards to connect to the Sparkify database and run analytical queries.
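As a rough sketch of the psycopg2 route, the snippet below builds a connection string and runs a query. The host, credentials, and the songplays table with its song_id column are hypothetical, since the schema is not described in this README. Port 5439 is the Redshift default.

```python
def build_dsn(host, db_name, user, password, port=5439):
    """Build a libpq-style connection string for psycopg2.connect."""
    return (
        f"host={host} dbname={db_name} user={user} "
        f"password={password} port={port}"
    )


def run_query(dsn, sql):
    """Open a connection, run one query, and return all rows."""
    import psycopg2  # imported lazily; install with `pip install psycopg2`

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


# Hypothetical query; table and column names are assumptions.
TOP_SONGS_SQL = """
SELECT song_id, COUNT(*) AS plays
FROM songplays
GROUP BY song_id
ORDER BY plays DESC
LIMIT 10;
"""

dsn = build_dsn("example-cluster.abc123.us-west-2.redshift.amazonaws.com",
                "sparkify", "dwh_user", "change-me")
# rows = run_query(dsn, TOP_SONGS_SQL)  # requires a running cluster
```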

Project Structure

sparkify-redshift-dwh/
 ├── README.md              Documentation of the project
 ├── create_tables.py       Python script to create all tables in Redshift
 ├── dwh.cfg                Configuration file for setting up S3 sources & Redshift credentials
 ├── etl.py                 Python script to load data from S3 into Redshift and then into the data warehouse tables
 └── sql_queries.py         Python file containing all SQL statements for ELT process
