The purpose of this project is to create an ETL pipeline that extracts song and user activity log data residing in an S3 bucket in JSON format. The data is then staged in an AWS Redshift cluster, and from the staging tables it is transformed into a set of fact and dimension tables. Having the dataset in star-schema form allows the Sparkify analytics team to analyze their users' activity on the app and to produce reports relevant to their business initiatives.
The schema of every table is defined in the create-table queries in the sql_queries.py file, which also holds the drop-table queries needed to reset our tables. Every time we run create_table.py, the script runs the drop-table queries and then the create-table queries, resetting the tables.
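The drop-then-create flow can be sketched as below. This is an assumed structure, not the verbatim script: the function and list names (`drop_tables`, `create_tables`, `drop_table_queries`, `create_table_queries`) and the config section name `CLUSTER` are illustrative.

```python
# Sketch of the reset flow in create_table.py (assumed structure; the real
# script imports its query lists from sql_queries.py).

def drop_tables(cur, conn, drop_table_queries):
    """Run every DROP TABLE query so the schema starts from a clean slate."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn, create_table_queries):
    """Run every CREATE TABLE query to rebuild staging, fact and dimension tables."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()

def main():
    # Requires psycopg2 and the credentials stored in dwh.cfg; call main()
    # explicitly to reset the tables on the cluster.
    import configparser
    import psycopg2
    from sql_queries import drop_table_queries, create_table_queries

    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()
        )
    )
    cur = conn.cursor()
    drop_tables(cur, conn, drop_table_queries)
    create_tables(cur, conn, create_table_queries)
    conn.close()
```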
The ETL process is defined in the etl.py file. The process involves the following steps:
- etl.py first runs the load_staging_tables function, which executes the copy_table_queries imported from sql_queries.py.
- The first query in copy_table_queries extracts data from the S3 log_data path and copies it into staging_events_table.
- The second query in copy_table_queries extracts the S3 song_data and inserts it into staging_songs_table.
- The second step of etl.py runs the insert_tables function.
- This function runs the insert_table queries, which move data from the two staging tables above into the fact and dimension tables.
- The data is first inserted into songplay_table, the fact table of our star-schema database, by joining the two staging tables.
- Data is then inserted into user_table from staging_events_table.
- Next, data is inserted into song_table from staging_songs_table.
- artist_table takes its data from staging_songs_table.
- Finally, data is inserted into time_table from staging_events_table.
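The fact-table load described above can be sketched as a single INSERT ... SELECT that joins the two staging tables. The exact column names are assumptions (the authoritative version lives in sql_queries.py); the join matches each log event to a song on title, artist name, and track length.

```python
# Hypothetical sketch of the songplays fact-table insert; real column names
# are defined in sql_queries.py. Redshift has no to_timestamp-style epoch
# helper, so the millisecond ts column is converted via an epoch offset.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title
 AND e.artist = s.artist_name
 AND e.length = s.duration
WHERE e.page = 'NextSong';
"""
```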
This file contains all the SQL queries needed to define our staging, fact, and dimension tables and to insert data records into those tables.
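A plausible layout for sql_queries.py is shown below: each query is a plain SQL string, and the strings are grouped into lists that create_table.py and etl.py iterate over. The specific table columns here are illustrative, not copied from the project.

```python
# Assumed layout of sql_queries.py: individual SQL strings grouped into lists.
user_table_drop = "DROP TABLE IF EXISTS users;"
user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

# The scripts import these lists and execute each query in order.
drop_table_queries = [user_table_drop]      # plus the other drop queries
create_table_queries = [user_table_create]  # plus the other create queries
```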
This file resets and creates our tables in the database; it imports the drop_table and create_table queries from the sql_queries.py file.
This file is run to extract data from the S3 buckets, load it into the staging tables, and then insert it into the fact and dimension tables.
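The staging load relies on Redshift's COPY command. A hedged sketch is below: `log_data`, `song_data`, `log_jsonpath`, and `arn` are placeholders for the values read from dwh.cfg, and the `us-west-2` region is an assumption about where the source bucket lives.

```python
# Illustrative builders for the two staging COPY statements. The real queries
# in sql_queries.py interpolate the paths and IAM role ARN from dwh.cfg.
def build_copy_queries(log_data, song_data, log_jsonpath, arn):
    staging_events_copy = f"""
    COPY staging_events FROM '{log_data}'
    IAM_ROLE '{arn}'
    FORMAT AS JSON '{log_jsonpath}'
    REGION 'us-west-2';
    """
    staging_songs_copy = f"""
    COPY staging_songs FROM '{song_data}'
    IAM_ROLE '{arn}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
    """
    return [staging_events_copy, staging_songs_copy]
```

The log events need a JSONPaths file because their field names do not match the table columns one-to-one, while the song files can use `'auto'` mapping.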
This is the configuration file that holds the credentials needed to connect to the Redshift cluster, along with the S3 song and log data paths.
- Redshift host endpoint
- Database name
- Database user name
- Database password
- Arn of IAM role
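The scripts can read these values with the standard-library configparser. The section and key names below (`[CLUSTER]`, `[IAM_ROLE]`, `HOST`, `DB_NAME`, ...) are assumptions based on the fields listed above, not the project's exact key names.

```python
# Minimal sketch of reading dwh.cfg with configparser; section/key names are
# assumed from the credential list above.
import configparser

def load_config(path="dwh.cfg"):
    config = configparser.ConfigParser()
    config.read(path)
    return {
        "host": config.get("CLUSTER", "HOST"),
        "dbname": config.get("CLUSTER", "DB_NAME"),
        "user": config.get("CLUSTER", "DB_USER"),
        "password": config.get("CLUSTER", "DB_PASSWORD"),
        "arn": config.get("IAM_ROLE", "ARN"),
    }
```

Keep dwh.cfg out of version control, since it holds the cluster password and the IAM role ARN.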
| date_part | total_sessions |
| --- | --- |
| 17 | 40 |
| first_name | last_name | start_time |
| --- | --- | --- |
| Kevin | Arellano | 2018-11-06 07:36:46.796 |