Airflow Pipeline Processing

Prerequisites:

To run this solution you will need to install Airflow and configure an AWS Redshift cluster.

Airflow installation (instructions for macOS):

  1. Set the environment variable for Airflow's home directory

    export AIRFLOW_HOME=~/airflow

  2. Install the Airflow Python packages (ideally in a virtual environment)

    pip3 install apache-airflow

    pip3 install typing_extensions

    pip3 install 'apache-airflow[postgres]'

    pip3 install apache-airflow-providers-amazon

  3. [OPTIONAL] Edit airflow.cfg (line 111) so Airflow doesn't load the example DAGs:

    load_examples = False

  4. Initialize the Airflow metadata database

    airflow db init

  5. Create the initial admin user (substitute USR, PWD, EMAIL, FIRSTNAME, and LASTNAME with your own values)

    airflow users create --role Admin --username USR --email EMAIL --firstname FIRSTNAME --lastname LASTNAME --password PWD

  6. Start the Airflow webserver

    airflow webserver -p 8080

  7. Start the Airflow scheduler (in another terminal window)

    airflow scheduler

  8. Open the Airflow UI in your browser at the following address: http://localhost:8080/

Airflow configuration:

  1. Add the following Connections (Admin >> Connections in the Airflow UI):

    Conn Id: aws credentials

    Conn Type: Amazon Web Services

    Login: [AWS ACCESS KEY]

    Password: [AWS SECRET]

    Conn Id: redshift

    Conn Type: Postgres

    Host: [REDSHIFT ENDPOINT]

    Login: [DB USER]

    Password: [DB PASSWORD]

    Port: 5439

  2. Add the following Variable (see the verification sketch after this list):

    Key: s3_bucket

    Value: udac-data-pipelines
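With the connections and variable in place, an Airflow task can read them through the standard hook and Variable APIs. The snippet below is a minimal verification sketch, assuming an Airflow 2.x installation and the exact ids entered above (aws credentials, redshift, s3_bucket); run it on the machine where Airflow was initialized so it can reach the metadata database.

    from airflow.hooks.base import BaseHook
    from airflow.models import Variable
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    # Resolve the AWS credentials connection added above (login holds the access key).
    aws_conn = BaseHook.get_connection("aws credentials")
    print("AWS connection found for access key:", aws_conn.login)

    # Resolve the redshift connection and run a trivial query through it.
    redshift = PostgresHook(postgres_conn_id="redshift")
    print("Redshift reachable:", redshift.get_first("SELECT 1"))

    # Resolve the s3_bucket variable.
    print("s3_bucket =", Variable.get("s3_bucket"))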

Redshift setup:

  1. Launch a new Redshift cluster
  2. Make the cluster publicly accessible by clicking Actions >> Modify publicly accessible setting
  3. Create inbound rules in the security group attached to the cluster's VPC to enable remote access (see the sketch after this list)
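Steps 2 and 3 are done in the AWS console, but step 3 can also be scripted. The boto3 sketch below is only an illustration, assuming the security group id attached to the cluster's VPC is known (the sg-... value and region are placeholders) and that AWS credentials are available to boto3. Opening port 5439 to 0.0.0.0/0 is convenient for this exercise but should not be left in place.

    import boto3

    # Placeholder region and security group id: use the values from your own cluster.
    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Inbound rule so the Redshift port (5439) is reachable remotely (step 3 above).
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Redshift access for this exercise"}],
        }],
    )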

Create tables in Redshift

Using the Redshift Query Editor, run the SQL queries in the Create Tables.sql file
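If you prefer not to use the Query Editor, the same DDL can be run from Python. This is a minimal sketch, assuming psycopg2 is installed, that Create Tables.sql contains semicolon-separated statements, and that the endpoint, database name, user, and password below are replaced with your cluster's values.

    import psycopg2

    # Placeholder connection details: replace with your cluster's endpoint and credentials.
    conn = psycopg2.connect(
        host="your-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="YOUR_PASSWORD",
    )
    conn.autocommit = True

    # Split the file into individual statements and run each one.
    with open("Create Tables.sql") as f:
        statements = [s.strip() for s in f.read().split(";") if s.strip()]

    with conn.cursor() as cur:
        for statement in statements:
            cur.execute(statement)

    conn.close()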

Purpose of this database

Sparkify is a streaming startup that is growing its user base and wishes to move its database to the cloud. They used to store their data as JSON files on their on-prem servers. The data was made available in S3 buckets so that it could be transitioned into a Parquet database.

This project is an Airflow-managed ETL pipeline that extracts data from S3, stages it in AWS Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs their users are listening to.
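Conceptually, the pipeline stages the S3 data into Redshift and then loads the dimensional model. The sketch below only illustrates that dependency shape with placeholder tasks and ids; it is not the repository's actual DAG, which uses its own operators, and it assumes Airflow 2.3 or later for EmptyOperator.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    # Illustrative shape only: placeholder dag id, schedule, and tasks.
    with DAG(
        dag_id="sparkify_pipeline_sketch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        begin = EmptyOperator(task_id="begin_execution")

        # Stage the JSON data from S3 into Redshift staging tables.
        stage_events = EmptyOperator(task_id="stage_events")
        stage_songs = EmptyOperator(task_id="stage_songs")

        # Transform the staged data into the dimensional tables.
        load_dimensional_model = EmptyOperator(task_id="load_dimensional_model")

        end = EmptyOperator(task_id="stop_execution")

        begin >> [stage_events, stage_songs]
        [stage_events, stage_songs] >> load_dimensional_model >> end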

How to run this solution

  1. Open the Airflow user interface at http://localhost:8080/
  2. Click on the DAG
  3. Turn the on/off toggle to on to unpause the DAG
  4. Monitor the DAG run
  5. When finished, turn the DAG off (a scripted alternative to these steps is sketched below)
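The same steps can be driven without the UI through Airflow's stable REST API. The sketch below is only an alternative under specific assumptions: Airflow 2.x with the basic-auth API backend enabled in airflow.cfg (auth_backends = airflow.api.auth.backend.basic_auth), the admin user created during installation, and a placeholder dag id taken from the UI.

    import requests

    BASE = "http://localhost:8080/api/v1"
    AUTH = ("USR", "PWD")          # the admin user created during installation
    DAG_ID = "your_dag_id"         # placeholder: use the dag id shown in the UI

    # Steps 2-3: unpause the DAG (equivalent to flipping the on/off toggle).
    requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": False}, auth=AUTH).raise_for_status()

    # Step 4: trigger a run and print its initial state; watch progress in the UI or poll the API.
    run = requests.post(f"{BASE}/dags/{DAG_ID}/dagRuns", json={}, auth=AUTH).json()
    print("Triggered run:", run.get("dag_run_id"), "state:", run.get("state"))

    # Step 5: pause the DAG again when finished.
    requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": True}, auth=AUTH).raise_for_status()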
