Airflow Pipeline Processing

Prerequisites:

To run this solution you will need to install Airflow and configure an AWS Redshift cluster.

Airflow installation (instructions for macOS):

  1. Set the environment variable for Airflow's home directory

    export AIRFLOW_HOME=~/airflow

  2. Install the Airflow Python packages (ideally in a virtual environment)

    pip3 install apache-airflow

    pip3 install typing_extensions

    pip3 install 'apache-airflow[postgres]'

    pip3 install apache-airflow-providers-amazon

  3. [OPTIONAL] Edit airflow.cfg (line 111) so Airflow doesn't load the example DAGs:

    load_examples = False

  4. Initialize the Airflow metadata database

    airflow db init

  5. Create the initial admin user (substitute USR, PWD, EMAIL, FIRSTNAME, and LASTNAME with your own values)

    airflow users create --role Admin --username USR --email EMAIL --firstname FIRSTNAME --lastname LASTNAME --password PWD

  6. Start the Airflow webserver

    airflow webserver -p 8080

  7. Start the Airflow scheduler (in another terminal window)

    airflow scheduler

  8. Open the Airflow UI in your browser at the following address: http://localhost:8080/

Airflow configuration:

  1. Add the following Connections (Admin >> Connections in the Airflow UI):

    Conn Id: aws credentials

    Conn Type: Amazon Web Services

    Login: [AWS ACCESS KEY]

    Password: [AWS SECRET]

    Conn Id: redshift

    Conn Type: Postgres

    Host: [REDSHIFT ENDPOINT]

    Login: [DB USER]

    Password: [DB PASSWORD]

    Port: 5439

  2. Add the following Variable (see the verification sketch after this list):

    Key: s3_bucket

    Value: udac-data-pipelines
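With the connections and variable in place, an Airflow task can read them through the standard hook and Variable APIs. The snippet below is a minimal verification sketch, assuming an Airflow 2.x installation and the exact ids entered above (aws credentials, redshift, s3_bucket); run it on the machine where Airflow was initialized so it can reach the metadata database.

    from airflow.hooks.base import BaseHook
    from airflow.models import Variable
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    # Resolve the AWS credentials connection added above (login holds the access key).
    aws_conn = BaseHook.get_connection("aws credentials")
    print("AWS connection found for access key:", aws_conn.login)

    # Resolve the redshift connection and run a trivial query through it.
    redshift = PostgresHook(postgres_conn_id="redshift")
    print("Redshift reachable:", redshift.get_first("SELECT 1"))

    # Resolve the s3_bucket variable.
    print("s3_bucket =", Variable.get("s3_bucket"))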

Redshift setup:

  1. Launch a new Redshift cluster
  2. Make the cluster publicly accessible by clicking Actions >> Modify publicly accessible setting
  3. Create inbound rules in the security group attached to the cluster's VPC to enable remote access (see the sketch after this list)
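Steps 2 and 3 are done in the AWS console, but step 3 can also be scripted. The boto3 sketch below is only an illustration, assuming the security group id attached to the cluster's VPC is known (the sg-... value and region are placeholders) and that AWS credentials are available to boto3. Opening port 5439 to 0.0.0.0/0 is convenient for this exercise but should not be left in place.

    import boto3

    # Placeholder region and security group id: use the values from your own cluster.
    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Inbound rule so the Redshift port (5439) is reachable remotely (step 3 above).
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Redshift access for this exercise"}],
        }],
    )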

Create tables in Redshift

Using the Redshift Query Editor, run the SQL queries in the Create Tables.sql file
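If you prefer not to use the Query Editor, the same DDL can be run from Python. This is a minimal sketch, assuming psycopg2 is installed, that Create Tables.sql contains semicolon-separated statements, and that the endpoint, database name, user, and password below are replaced with your cluster's values.

    import psycopg2

    # Placeholder connection details: replace with your cluster's endpoint and credentials.
    conn = psycopg2.connect(
        host="your-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="YOUR_PASSWORD",
    )
    conn.autocommit = True

    # Split the file into individual statements and run each one.
    with open("Create Tables.sql") as f:
        statements = [s.strip() for s in f.read().split(";") if s.strip()]

    with conn.cursor() as cur:
        for statement in statements:
            cur.execute(statement)

    conn.close()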

Purpose of this database

Sparkify is a streaming startup that is growing its user base and wishes to move its database to the cloud. They used to store their data as JSON files on their on-prem servers. The data was made available in S3 buckets so that it could be transitioned into a Parquet database.

This project is an Airflow-managed ETL pipeline that extracts data from S3, stages it in AWS Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs their users are listening to.
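Conceptually, the pipeline stages the S3 data into Redshift and then loads the dimensional model. The sketch below only illustrates that dependency shape with placeholder tasks and ids; it is not the repository's actual DAG, which uses its own operators, and it assumes Airflow 2.3 or later for EmptyOperator.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    # Illustrative shape only: placeholder dag id, schedule, and tasks.
    with DAG(
        dag_id="sparkify_pipeline_sketch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        begin = EmptyOperator(task_id="begin_execution")

        # Stage the JSON data from S3 into Redshift staging tables.
        stage_events = EmptyOperator(task_id="stage_events")
        stage_songs = EmptyOperator(task_id="stage_songs")

        # Transform the staged data into the dimensional tables.
        load_dimensional_model = EmptyOperator(task_id="load_dimensional_model")

        end = EmptyOperator(task_id="stop_execution")

        begin >> [stage_events, stage_songs]
        [stage_events, stage_songs] >> load_dimensional_model >> end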

How to run this solution

  1. Open the Airflow user interface at http://localhost:8080/
  2. Click on the DAG
  3. Turn the on/off toggle to on to unpause the DAG
  4. Monitor the DAG run
  5. When finished, turn the DAG off (a scripted alternative to these steps is sketched below)
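The same steps can be driven without the UI through Airflow's stable REST API. The sketch below is only an alternative under specific assumptions: Airflow 2.x with the basic-auth API backend enabled in airflow.cfg (auth_backends = airflow.api.auth.backend.basic_auth), the admin user created during installation, and a placeholder dag id taken from the UI.

    import requests

    BASE = "http://localhost:8080/api/v1"
    AUTH = ("USR", "PWD")          # the admin user created during installation
    DAG_ID = "your_dag_id"         # placeholder: use the dag id shown in the UI

    # Steps 2-3: unpause the DAG (equivalent to flipping the on/off toggle).
    requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": False}, auth=AUTH).raise_for_status()

    # Step 4: trigger a run and print its initial state; watch progress in the UI or poll the API.
    run = requests.post(f"{BASE}/dags/{DAG_ID}/dagRuns", json={}, auth=AUTH).json()
    print("Triggered run:", run.get("dag_run_id"), "state:", run.get("state"))

    # Step 5: pause the DAG again when finished.
    requests.patch(f"{BASE}/dags/{DAG_ID}", json={"is_paused": True}, auth=AUTH).raise_for_status()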
