
The data pipeline management framework Airflow, and how it can help us solve the problems of traditional ETL


ortizfram/Apache-Airflow


Apache-Airflow



📗 What is it?

Airflow is a platform that lets you build, run, and monitor workflows (pipelines).

📗 Workflow

A workflow is a sequence of scheduled tasks, triggered by an event.

Workflows are used to handle big data pipelines.


📗 Traditional ETL

  • Write a script that pulls data from a database and sends it to HDFS for processing

  • Schedule the script as a cron job
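
As a rough illustration of the traditional approach above, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for the source DB and a local CSV file as a stand-in for HDFS. The table name `events`, the functions `extract_to_csv` and `demo`, and the crontab line in the comment are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract_to_csv(conn: sqlite3.Connection, out_path: Path) -> int:
    """Pull every row from the (hypothetical) `events` table and write it to CSV.

    Returns the number of data rows written. In a real pipeline the file
    would be uploaded to HDFS; a local file stands in for it here.
    """
    rows = conn.execute("SELECT id, payload FROM events ORDER BY id").fetchall()
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "payload"])  # header row
        writer.writerows(rows)
    return len(rows)


def demo() -> tuple[int, list[str]]:
    """Build a tiny source table, run the extract, and return what was written."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)]
    )
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "events.csv"
        n = extract_to_csv(conn, out)
        lines = out.read_text().splitlines()
    conn.close()
    return n, lines


# A crontab entry such as `0 2 * * * python /path/to/etl.py` (hypothetical path)
# would then run the script every day at 02:00.
if __name__ == "__main__":
    print(demo())
```

Note that nothing in this script handles failures, reports status, or knows about other jobs. Those gaps are what the next section is about.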


📗 Problems

  • Failures: retry when a failure happens (how many times? how often?)
  • Monitoring: did the job pass or fail? How long does it take?
  • Dependencies:
    • Data dependencies: upstream data is missing
    • Execution dependencies: job 2 runs after job 1 has finished
  • Scalability: no centralized scheduler between different cron machines
  • Deployment: new changes are deployed constantly
  • Processing historic data: backfilling or rerunning historical data
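
To make the "Failures" point concrete, here is a minimal sketch of the retry logic a cron-scheduled script would have to implement by hand. The names `run_with_retries` and `make_flaky_job` are hypothetical helpers, not part of any library. Airflow exposes comparable settings per task (`retries` and `retry_delay`), so this code would not need to be written for each job.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T], max_attempts: int = 3, delay_seconds: float = 0.0
) -> T:
    """Call `task` until it succeeds or `max_attempts` calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)  # wait before the next attempt
    raise AssertionError("unreachable")


def make_flaky_job(failures_before_success: int) -> Callable[[], str]:
    """Build a hypothetical job that fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def job() -> str:
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise RuntimeError(f"attempt {calls['count']} failed")
        return f"succeeded on attempt {calls['count']}"

    return job
```

Calling `run_with_retries(make_flaky_job(2), max_attempts=3)` succeeds on the third attempt, while a job that keeps failing re-raises its last error once the attempts run out.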

📗 Airflow DAG
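
A DAG (directed acyclic graph) models tasks as nodes and execution dependencies as edges, so a task only runs after everything upstream of it has finished. The sketch below shows that idea with Python's standard-library `graphlib` (Python 3.9+). It is not Airflow's API. The task names and the helpers `execution_order` and `is_valid_dag` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every task follows its dependencies.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps: dict[str, set[str]]) -> bool:
    """Report whether the dependency graph is acyclic."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    print(execution_order(dependencies))
```

Here `extract` always comes first and `load` always comes last, while `validate` and `transform` may appear in either order because neither depends on the other.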


🔶 Set up the Airflow environment with Docker

Docker & Docker Compose


Docker allows you to run many containers simultaneously:

  • Containers are lightweight: no hypervisor is needed
  • You can run more containers than you could if you were using virtual machines

Benefits:

  • No more manual managing/maintaining of dependencies and deployment
  • Easy to share and deploy different versions and environments
  • Keep track of versions through GitHub tags and releases
  • Ease of deployment from testing to production environments
