# Quick start

A quick example showcasing ploomber's essential features.

ploomber's expressive syntax makes pipeline declarations read like blueprints that provide a full picture: they state not only which tasks to perform and in which order, but also where output will be stored and in what form. Furthermore, upstream products are available to downstream tasks, so each product is declared only once.

## Expressive syntax

We start by defining a few functions:

In [1]:
from pathlib import Path
import tempfile

import pandas as pd
import numpy as np
from IPython.display import display, HTML

from ploomber import DAG
from ploomber.tasks import PythonCallable
from ploomber.products import File

# we declare two functions: one to get data
# and another one to clean it

def get(product):
    """Get data
    """
    df = pd.DataFrame({'column': np.random.rand(100)})
    df.to_csv(str(product))

def clean(upstream, product):
    """Clean data
    """
    data = pd.read_csv(str(upstream['get']))
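    # keep only rows at or above the threshold; the local name avoids
    # shadowing the enclosing function's name
    cleaned = data[data.column >= 0.5]
    cleaned.to_csv(str(product))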

Create a DAG that will hold all our tasks, then instantiate Task objects from our functions:

In [2]:
# tmp directory to save our data
tmp_dir = Path(tempfile.mkdtemp())

# create a DAG object to organize all the tasks
dag = DAG()

# create tasks from our functions
task_get = PythonCallable(get,
                          # where to save output
                          product=File(tmp_dir / 'data.csv'),
                          dag=dag)

task_clean = PythonCallable(clean,
                            product=File(tmp_dir / 'clean.csv'),
                            dag=dag)

Declare how our tasks relate to each other:

In [3]:
task_get >> task_clean

PythonCallable: clean -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpjvj0xb2m/clean.csv)
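As a rough illustration of how an operator like `>>` can be built in Python, the sketch below uses a hypothetical `MiniTask` class (not ploomber's actual implementation). Its `__rshift__` method records the left-hand task as an upstream dependency of the right-hand one and returns the right-hand task, which is consistent with the output above, where the expression evaluates to the `clean` task.

```python
class MiniTask:
    """Hypothetical stand-in for a pipeline task (not part of ploomber)."""

    def __init__(self, name):
        self.name = name
        # maps upstream task name -> upstream task
        self.upstream = {}

    def __rshift__(self, other):
        # `self >> other` registers self as an upstream dependency of other
        other.upstream[self.name] = self
        # returning the right-hand task allows chaining: a >> b >> c
        return other

    def __repr__(self):
        return f'MiniTask({self.name!r})'


get_task = MiniTask('get')
clean_task = MiniTask('clean')
report_task = MiniTask('report')

# evaluated left to right: (get_task >> clean_task) >> report_task
result = get_task >> clean_task >> report_task
print(result)
print(clean_task.upstream)
```

Returning the right-hand operand is a design choice that keeps chained declarations readable, since each `>>` hands the next task to the following one.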

## Standalone execution

ploomber pipelines are ready to run right after being created; there is no need to set up a separate system. Since all products are part of the declaration, one can switch them entirely to isolate executions depending on the environment (e.g. save all output in `/data/{{user}}`, where `user` is the currently logged-in user).
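One way to get per-user isolation is to compute the output location before declaring the tasks. The sketch below uses only the standard library; the helper names `output_root` and `product_paths` are hypothetical and not part of ploomber.

```python
import getpass
from pathlib import Path


def output_root(base='/data', user=None):
    """Return a per-user output directory such as /data/<user>."""
    # fall back to the currently logged-in user when none is given
    user = user or getpass.getuser()
    return Path(base) / user


def product_paths(root):
    """Map each task name to the file where its output should go."""
    root = Path(root)
    return {'get': root / 'data.csv', 'clean': root / 'clean.csv'}


# 'alice' is an example user name; omit it to use the logged-in user
paths = product_paths(output_root(user='alice'))
print(paths['get'])
print(paths['clean'])
```

Each path could then be wrapped in `File(...)` when creating the tasks, in the same way the temporary directory is used in this example.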

Before building the pipeline, let's get a summary:

In [16]:
dag.status()

| name | Last updated | Outdated dependencies | Outdated code | Product | Doc (short) | Location |
| --- | --- | --- | --- | --- | --- | --- |
| get | 4 seconds ago (Mar 15, 20 at 18:09) | False | True | /var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpjvj0xb2m/data.csv | Get data | :15 |
| clean | 4 seconds ago (Mar 15, 20 at 18:09) | True | True | /var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpjvj0xb2m/clean.csv | Clean data | :21 |


In [15]:
dag.build()

| name | Ran? | Elapsed (s) | Percentage |
| --- | --- | --- | --- |
| get | True | 0.112725 | 49.7236 |
| clean | True | 0.113978 | 50.2764 |
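The `Percentage` column appears to be each task's share of the total elapsed time. This is an assumption based on the numbers shown, not on ploomber's documentation. A quick check using the values from the report above:

```python
# elapsed seconds copied from the build report above
elapsed = {'get': 0.112725, 'clean': 0.113978}

total = sum(elapsed.values())

# share of total elapsed time, expressed as a percentage
percentage = {name: 100 * seconds / total for name, seconds in elapsed.items()}

for name, value in percentage.items():
    print(f'{name}: {value:.4f}')
```

Rounded to four decimals, the computed shares agree with the reported 49.7236 and 50.2764.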


## Incremental

Pipelines often take hours or even days to run. During the development phase, it is wasteful to re-execute the entire pipeline on each change. ploomber keeps track of code changes and only executes a task if its source code has changed since its last execution.
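To make the idea concrete, here is a simplified sketch of change detection based on hashing a task's source code. It is an illustration only, not ploomber's actual mechanism: the `fingerprint` and `should_run` helpers are hypothetical, and the source is passed as a string (in practice it could be obtained with `inspect.getsource`).

```python
import hashlib


def fingerprint(source):
    """Hash a task's source code so changes can be detected cheaply."""
    return hashlib.sha256(source.encode('utf-8')).hexdigest()


def should_run(source, stored_fingerprint, upstream_ran):
    """Decide whether a task needs to run again."""
    # the code is outdated if it no longer matches the saved fingerprint
    outdated_code = stored_fingerprint != fingerprint(source)
    # the data is outdated if any upstream task produced new output
    outdated_data = upstream_ran
    return outdated_code or outdated_data


# hypothetical metadata saved after the previous build
source_v1 = "def get(product):\n    return 1\n"
stored = fingerprint(source_v1)

# an edited version of the same task
source_v2 = "def get(product):\n    return 2\n"

unchanged = should_run(source_v1, stored, upstream_ran=False)
code_changed = should_run(source_v2, stored, upstream_ran=False)
data_changed = should_run(source_v1, stored, upstream_ran=True)
print(unchanged, code_changed, data_changed)
```

The two conditions mirror the "Outdated code" and "Outdated dependencies" columns in the status summary: a task runs again if either one is true.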


In [6]:
import logging
logging.basicConfig(level='DEBUG')

In [7]:
dag.build()

INFO:ploomber.dag:Rendering DAG DAG("No name")


DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task get
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.WaitingExecution
DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task clean
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingUpstream
DEBUG:ploomber.dag:Setting DAG("No name") status to DAGStatus.WaitingExecution
INFO:ploomber.dag:Building DAG DAG("No name")





INFO:ploomber.tasks.Task.PythonCallable:Checking status for task "get"
DEBUG:ploomber.products.Product.File:Returning cached data dependencies status. Outdated? False
DEBUG:ploomber.products.Product.File:Returning cached code dependencies status. Outdated? True
INFO:ploomber.tasks.Task.PythonCallable:Up-to-date data deps...
INFO:ploomber.tasks.Task.PythonCallable:Outdated code dep...
INFO:ploomber.tasks.Task.PythonCallable:Should run? True
INFO:ploomber.tasks.Task.PythonCallable:Starting execution: PythonCallable: get -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/data.csv)
INFO:ploomber.tasks.Task.PythonCallable:Done. Operation took 0.0 seconds
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.Executed
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingExecution
INFO:ploomber.tasks.Task.PythonCallable:Checking status for task "clean"
DEBUG:ploomber.products.Product.File:Returning cached data dependencies




| name | Ran? | Elapsed (s) | Percentage |
| --- | --- | --- | --- |
| get | True | 0.121344 | 49.6853 |
| clean | True | 0.122881 | 50.3147 |


## Testable

ploomber also supports a hook to execute code when a task finishes executing. This lets you write acceptance tests that explicitly state input assumptions (e.g. check a data frame's input schema).

In [8]:
def test_no_nas(task):
    print('Testing there are no NAs...')
    path = str(task.product)
    df = pd.read_csv(path)
    assert not df.column.isna().sum()

task_get.on_finish = test_no_nas
task_clean.on_finish = test_no_nas
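A schema check can follow the same pattern. The sketch below is a hypothetical hook that reads only the CSV header with the standard library and verifies that the expected columns are present. `EXPECTED_COLUMNS` and `check_schema` are illustrative names, and a `SimpleNamespace` stands in for a real task so the snippet runs on its own.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace

# assumption: the set of columns every product must contain
EXPECTED_COLUMNS = {'column'}


def check_schema(task):
    """Fail if the task's CSV product lacks any expected column."""
    with open(str(task.product), newline='') as f:
        # the first row of the file holds the column names
        header = next(csv.reader(f))
    missing = EXPECTED_COLUMNS - set(header)
    assert not missing, f'Missing columns: {sorted(missing)}'


# stand-in for a real task: any object with a `product` attribute works here
path = Path(tempfile.mkdtemp()) / 'data.csv'
path.write_text(',column\n0,0.7\n1,0.9\n')
check_schema(SimpleNamespace(product=path))
```

With a real task, the hook could be attached the same way as above, for example `task_get.on_finish = check_schema`.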

In [9]:
# Ignore status and force execution on all tasks
# so we also run on_finish
dag.build(force=True)

INFO:ploomber.dag:Rendering DAG DAG("No name")


DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task get
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.WaitingExecution
DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task clean
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingUpstream
DEBUG:ploomber.dag:Setting DAG("No name") status to DAGStatus.WaitingExecution
INFO:ploomber.dag:Building DAG DAG("No name")





INFO:ploomber.tasks.Task.PythonCallable:Forcing run "get", skipping checks...
INFO:ploomber.tasks.Task.PythonCallable:Starting execution: PythonCallable: get -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/data.csv)
INFO:ploomber.tasks.Task.PythonCallable:Done. Operation took 0.0 seconds
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.Executed
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingExecution


Testing there are no NAs...


INFO:ploomber.tasks.Task.PythonCallable:Forcing run "clean", skipping checks...
INFO:ploomber.tasks.Task.PythonCallable:Starting execution: PythonCallable: clean -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/clean.csv)
INFO:ploomber.tasks.Task.PythonCallable:Done. Operation took 0.0 seconds
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.Executed
INFO:ploomber.executors.Serial: DAG report:
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
get     True         0.136777       50.8157
clean   True         0.132386       49.1843
DEBUG:ploomber.dag:Setting DAG("No name") status to DAGStatus.Executed


Testing there are no NAs...



| name | Ran? | Elapsed (s) | Percentage |
| --- | --- | --- | --- |
| get | True | 0.136777 | 50.8157 |
| clean | True | 0.132386 | 49.1843 |


## Interactive

In [10]:
# get task named "clean"
task = dag['clean']

# what are its upstream dependencies?
print(task.upstream)

{'get': PythonCallable: get -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/data.csv)}


In [11]:
# only execute this task instead of the entire dag
task.build(force=True)

INFO:ploomber.tasks.Task.PythonCallable:Forcing run "clean", skipping checks...
INFO:ploomber.tasks.Task.PythonCallable:Starting execution: PythonCallable: clean -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/clean.csv)
INFO:ploomber.tasks.Task.PythonCallable:Done. Operation took 0.0 seconds
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.Executed


Testing there are no NAs...


<TaskStatus.Executed: 'executed'>

In [12]:
# start a debugging session (only works if task is a PythonCallable)
task.debug()

> <ipython-input-1-97bc00c66958>(24)clean()
-> data = pd.read_csv(str(upstream['get']))
(Pdb) c


## Communicable

In [13]:
html = dag.to_markup()
HTML(html)

INFO:ploomber.dag:Rendering DAG DAG("No name")


DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task get
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.WaitingExecution
DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task clean
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingUpstream
DEBUG:ploomber.dag:Setting DAG("No name") status to DAGStatus.WaitingExecution
DEBUG:ploomber.products.Product.File:Returning cached data dependencies status. Outdated? False
DEBUG:ploomber.products.Product.File:Returning cached code dependencies status. Outdated? True
DEBUG:ploomber.products.Product.File:Returning cached data dependencies status. Outdated? True
DEBUG:ploomber.products.Product.File:Returning cached code dependencies status. Outdated? True
INFO:ploomber.dag:Rendering DAG DAG("No name")





DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task get
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "get" status to TaskStatus.WaitingExecution
DEBUG:ploomber.tasks.Task.PythonCallable:Calling render on task clean
DEBUG:ploomber.tasks.Task.PythonCallable:Setting "clean" status to TaskStatus.WaitingUpstream
DEBUG:ploomber.dag:Setting DAG("No name") status to DAGStatus.WaitingExecution
DEBUG:ploomber.products.Product.File:Returning cached data dependencies status. Outdated? False
DEBUG:ploomber.products.Product.File:Returning cached code dependencies status. Outdated? True
DEBUG:ploomber.products.Product.File:Returning cached data dependencies status. Outdated? True





| name | Last updated | Outdated dependencies | Outdated code | Product | Doc (short) | Location |
| --- | --- | --- | --- | --- | --- | --- |
| get | 6 seconds ago (Mar 15, 20 at 17:53) | False | True | /var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/data.csv | Get data | :15 |
| clean | 6 seconds ago (Mar 15, 20 at 17:53) | True | True | /var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmpzotvc7_w/clean.csv | Clean data | :21 |


In [16]:
# import shutil
# shutil.rmtree(tmp_dir)