# Quick start

A quick example showcasing ploomber's essential features.

ploomber expressive syntax makes pipeline declarations read like blueprints that provide a full picture: not only they include which tasks to perform and in which order, but where output will be stored and in which form. Furthermore, upstream products are available to downstream tasks, this way, each product is only declared once.

## Expressive

We start defining a few functions:

In [41]:
from pathlib import Path
import tempfile

import pandas as pd
import numpy as np
from IPython.display import display, HTML

from ploomber import DAG
from ploomber.tasks import PythonCallable
from ploomber.products import File

# we declare two functions one to get data
# and another one to clean it

def get(product):
    """Get data
    """
    df = pd.DataFrame({'column': np.random.rand(100)})
    df.to_csv(str(product))

def clean(upstream, product):
    """Clean data
    """
    data = pd.read_csv(str(upstream['get']))
    clean = data[data.column >= 0.5]
    clean.to_csv(str(product))

Create a DAG which will hold all our tasks and instantiate Task objects using our functions:

In [25]:
# tmp directory to save our data
tmp_dir = Path(tempfile.mkdtemp())

# create a DAG object to organize all the tasks
dag = DAG()

# create tasks from our functions
task_get = PythonCallable(get,
                          # where to save output
                          product=File(tmp_dir / 'data.csv'),
                          dag=dag)

task_clean = PythonCallable(clean,
                            product=File(tmp_dir / 'clean.csv'),
                            dag=dag)

Declare how our tasks relate to each other:

In [26]:
task_get >> task_clean

PythonCallable: clean -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/clean.csv)

## Standalone

ploomber's pipelines are ready to run right after being created, no need to setup a separate system. Since all products are part of the declaration, one can switch them entirely to isolate executions.

Let's get a summary of our pipeline:

In [27]:
dag.status()

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




name,Last updated,Outdated dependencies,Outdated code,Product,Doc (short),Location
get,Has not been run,False,True,/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/data.csv,Get data,:14
clean,Has not been run,True,True,/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/clean.csv,Clean data,:20


We can see that it hasn't been executed, where the output will be stored upon execution and other useful information such as the docstrings. Let's build our dag now:

In [28]:
dag.build()

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




name,Ran?,Elapsed (s),Percentage
get,True,0.117956,51.0889
clean,True,0.112928,48.9111


## Incremental

Pipelines usually take hours or even days to run, during the development phase, it is wasteful to re-execute the pipeline on each change. ploomber keeps track of code changes and only executes a task if the source code has changed since its last execution.


In [29]:
dag.build()

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




name,Ran?,Elapsed (s),Percentage
get,True,0.116942,49.8716
clean,True,0.117544,50.1284


## Testable

ploomber also supports a hook to execute code upon task execution. This allows to write acceptance tests that explicitely state input assumptions (e.g. check a data frame's input schema).

In [30]:
def test_no_nas(task):
    print('Testing there are no NAs')
    path = str(task.product)
    df = pd.read_csv(path)
    assert not df.column.isna().sum()

get.on_finish = test_no_nas
clean.on_finish = test_no_nas

In [33]:
dag.build()

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




name,Ran?,Elapsed (s),Percentage
get,True,0.120416,50.9855
clean,True,0.115761,49.0145


## Interactive

In [36]:
# get task named "get"
task = dag['clean']

# which are their upstream dependencies?
print(task.upstream)

{'get': PythonCallable: get -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/data.csv)}


In [37]:
# only execute this task instead of the entire dag
task.build()

PythonCallable: clean -> File(/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/clean.csv)

In [38]:
# start a debugging session (only works if task is a PythonCallable)
task.debug()

> <ipython-input-24-25d1d4621d26>(23)clean()
-> data = pd.read_csv(str(upstream['get']))
(Pdb) c


## Communicable

In [42]:
html = dag.to_markup()
HTML(html)

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




name,Last updated,Outdated dependencies,Outdated code,Product,Doc (short),Location
get,"a minute ago (Mar 14, 20 at 22:40)",False,True,/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/data.csv,Get data,:14
clean,"a minute ago (Mar 14, 20 at 22:40)",True,True,/var/folders/3h/_lvh_w_x5g30rrjzb_xnn2j80000gq/T/tmp205owbvh/clean.csv,Clean data,:20
