To run this solution, you will need to install Airflow and configure an AWS Redshift cluster.
-
Set up the environment variable for Airflow:
export AIRFLOW_HOME=~/airflow
-
Install the Airflow Python packages (ideally in a virtual environment):
pip3 install apache-airflow
pip3 install typing_extensions
pip3 install 'apache-airflow[postgres]'
pip3 install apache-airflow-providers-amazon
-
[OPTIONAL] Edit airflow.cfg (line 111) so Airflow doesn't load the example DAGs:
load_examples = False
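For reference, the relevant fragment of airflow.cfg would look something like this (a sketch; the exact line number and surrounding keys vary between Airflow versions):

```ini
[core]
# Keep Airflow from loading the bundled example DAGs on startup
load_examples = False
```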
-
Initialize the Airflow metadata database
airflow db init
-
Create the initial user (substitute USR, PWD, and the other fields with your chosen admin username, password, and details)
airflow users create --role Admin --username USR --email EMAIL --firstname FIRSTNAME --lastname LASTNAME --password PWD
-
Start the Airflow webserver
airflow webserver -p 8080
-
Start Airflow scheduler (in another terminal window)
airflow scheduler
-
Open Airflow in your browser at the following address: http://localhost:8080/
-
Add the following Connections:
Conn Id: aws_credentials
Conn Type: Amazon Web Services
Login: [AWS ACCESS KEY]
Password: [AWS SECRET KEY]

Conn Id: redshift
Conn Type: Postgres
Host: [REDSHIFT ENDPOINT]
Schema: [DB NAME]
Login: [DB USER]
Password: [DB PASSWORD]
Port: 5439
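Under the hood, the redshift connection is just the pieces of a standard Postgres connection URI. A minimal sketch of how those fields combine (the host, user, password, and database values here are hypothetical placeholders, not real credentials):

```python
# Hypothetical placeholder values for the connection fields listed above
host = "redshift-cluster-1.abc123xyz.us-west-2.redshift.amazonaws.com"
user = "awsuser"
password = "MySecretPwd1"
database = "dev"
port = 5439

# Airflow's Postgres connection is equivalent to this SQLAlchemy-style URI
uri = f"postgresql://{user}:{password}@{host}:{port}/{database}"
print(uri)
```

The Redshift endpoint shown in the AWS console usually already ends with `:5439/dev`; strip the port and database off before pasting it into the Host field.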
-
Add the following Variable:
Key: s3_bucket Val: udac-data-pipelines
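Inside the DAG, the s3_bucket variable is typically combined with a key prefix to build full S3 paths. A small sketch (the log_data prefix is a hypothetical example, not a value defined above):

```python
s3_bucket = "udac-data-pipelines"  # value of the Airflow Variable above
s3_key = "log_data"                # hypothetical key prefix for illustration

# Full S3 path as a staging operator would consume it
s3_path = f"s3://{s3_bucket}/{s3_key}"
print(s3_path)
```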
- Launch a new Redshift cluster
- Make the cluster publicly accessible by clicking Actions >> Modify publicly accessible setting
- Create inbound rules in the security group attached to the cluster's VPC to enable remote access
Using the Redshift Query Editor, run the SQL queries in the Create Tables.sql file
Sparkify is a streaming startup that is growing its user base and database and wishes to move its database to the cloud. It used to store its data as JSON files on its on-premises servers; the data was made available in S3 buckets in order to be transitioned into a Parquet database.
This project consists of an Airflow-managed ETL pipeline that extracts data from S3, stages it in AWS Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
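The pipeline's task ordering can be sketched as a small dependency graph: staging tasks first, then the fact table, then the dimension tables, then a quality check. The task names below are hypothetical illustrations of that shape, not the actual DAG's task ids:

```python
# Hypothetical task names; each task maps to the tasks it depends on
deps = {
    "stage_events": [],
    "stage_songs": [],
    "load_songplays_fact": ["stage_events", "stage_songs"],
    "load_user_dim": ["load_songplays_fact"],
    "load_song_dim": ["load_songplays_fact"],
    "run_quality_checks": ["load_user_dim", "load_song_dim"],
}

def topo_order(deps):
    """Return one valid execution order (Kahn's algorithm)."""
    remaining = {task: set(d) for task, d in deps.items()}
    order = []
    while remaining:
        # Tasks whose dependencies are all satisfied can run now
        ready = sorted(t for t, d in remaining.items() if not d)
        for t in ready:
            order.append(t)
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return order

print(topo_order(deps))
```

Airflow's scheduler performs this kind of ordering automatically from the `>>` dependencies declared in the DAG file; the sketch just makes the intended execution order explicit.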
- Open the Airflow user interface at http://localhost:8080/
- Click on the DAG
- Turn on the on/off toggle
- Monitor the DAG run
- When finished, turn the DAG off