In this project I built an ETL pipeline that extracts data from Amazon S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs users are listening to.
- Song data: 's3://udacity-dend/song_data'
- Log data: 's3://udacity-dend/log_data'
Each dataset is copied into a staging table in the Redshift cluster, and these staging tables are then used to insert values into the star schema tables, so that the data is ready for analysis later.
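As a rough illustration of the staging step, here is a hedged sketch of a Redshift COPY query written as a Python string, in the style of sql_queries.py. The staging table name, the region, the `'auto'` JSON option, and the placeholder ARN are assumptions for illustration, not the exact values used in the project.

```python
# Hedged sketch: load the log dataset from S3 into a Redshift staging table.
LOG_DATA = 's3://udacity-dend/log_data'
IAM_ROLE_ARN = 'arn:aws:iam::123456789012:role/myRedshiftRole'  # placeholder ARN

staging_events_copy = """
    COPY staging_events
    FROM '{}'
    CREDENTIALS 'aws_iam_role={}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
""".format(LOG_DATA, IAM_ROLE_ARN)
```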
The database schema consists of the following tables:
Time Table
Dimension table that contains the timestamps from the log files, together with the time units converted from them.
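For example, the time dimension might be defined like this; the exact column list is an assumption based on the description above, written as a Python query string in the style of sql_queries.py.

```python
# Hedged sketch of the time dimension table definition.
time_table_create = """
    CREATE TABLE IF NOT EXISTS time (
        start_time TIMESTAMP PRIMARY KEY,
        hour       INT,
        day        INT,
        week       INT,
        month      INT,
        year       INT,
        weekday    INT
    );
"""
```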
This project consists of the following files:
sql_queries.py
- Contains the Postgres-style SQL queries as Python strings.

create_tables.py
- Uses the sql_queries.py file to drop old tables and create new tables in the database.

etl.py
- Builds the ETL process: it reads every file in the S3 bucket, copies its data into staging tables in the Redshift cluster, then inserts the values into the star schema using the query variables in sql_queries.py (see the sketch after this list).

dwh.cfg
- Contains the IAM role ARN, the paths to the S3 datasets, and the Redshift cluster configuration.

test.ipynb
- Used for testing after the pipeline finishes, to run queries against the tables (you can also run queries in the AWS Redshift query editor).
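To show how these pieces fit together, here is a hedged sketch of what etl.py might look like. The imported names (copy_table_queries, insert_table_queries) and the [CLUSTER] key order are assumptions for illustration, not the project's exact code.

```python
import configparser
import psycopg2

# Assumed names: lists of query strings defined in sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Connect to the Redshift cluster using the settings in dwh.cfg;
    # assumes the [CLUSTER] section lists host, dbname, user, password,
    # port in that order.
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()))
    cur = conn.cursor()

    # Copy the S3 datasets into the staging tables...
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

    # ...then insert from the staging tables into the star schema.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```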
Firstly, you need to create an IAM role that has read access to the S3 bucket, then create the Redshift cluster and associate the IAM role with it. After that, fill in the IAM role ARN and the Redshift cluster configuration in the dwh.cfg file (a sketch is shown below). Finally, run create_tables.py to drop and create the tables, and then run etl.py to insert the data into them.
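A minimal sketch of what dwh.cfg might look like; the section names, keys, and all values here are assumptions for illustration and should match whatever your scripts actually read.

```ini
; Hedged sketch of dwh.cfg; replace every value with your own.
[CLUSTER]
HOST=your-cluster.abc123xyz.us-west-2.redshift.amazonaws.com
DB_NAME=dwh
DB_USER=dwhuser
DB_PASSWORD=Passw0rd
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::123456789012:role/myRedshiftRole

[S3]
LOG_DATA='s3://udacity-dend/log_data'
SONG_DATA='s3://udacity-dend/song_data'
```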