# Core concepts


## Ploomber's core: Tasks, Products, DAG and Clients

To get started with ploomber you only have to learn four concepts:

1. Task. A unit of work that takes some input and produces a persistent change
2. Product. A persistent change *produced* by a Task (e.g. a file in the local filesystem, a table in a remote database)
3. DAG. A collection of Tasks used to specify dependencies among them (use output from Task A as input for Task B)
4. Client. An object that keeps communication with an external system (e.g. a database)

The [Task API](../api.rst#ploomber.tasks.Task) is defined by an abstract class, and the same is true for [Products](../api.rst#ploomber.products.Product) and [Clients](../api.rst#ploomber.clients.Client). This means you only have to learn each concept once: all concrete classes behave in the same way.


## The DAG lifecycle: Declare, render, build

A DAG goes through three steps:

1. Declaration. A DAG is created and Tasks are added to it
2. Rendering. Placeholders are resolved and validation is performed on Task inputs
3. Building. All *outdated* Tasks are executed in the appropriate order (run upstream task dependencies first)
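
The "upstream first" ordering used when building is a topological sort of the dependency graph. The snippet below is a minimal sketch of that idea using only the standard library (`graphlib`, Python 3.9+); it is not ploomber's scheduler, and the task names are illustrative placeholders.

```python
from graphlib import TopologicalSorter

# Map each task name to the set of tasks it depends on (its upstream tasks).
# The names are illustrative; any hashable identifiers would work.
upstream = {
    'one': set(),
    'another': {'one'},
    'my_table': {'another'},
    'second_table': {'my_table'},
}

# static_order() yields a task only after all of its dependencies
execution_order = list(TopologicalSorter(upstream).static_order())
print(execution_order)
```

Because this graph is a simple chain, there is only one valid order: each task appears right after the task it depends on.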

### Declaration

In [1]:
from pathlib import Path
import pandas as pd
from ploomber import DAG
from ploomber.tasks import PythonCallable, SQLUpload, SQLScript
from ploomber.clients import SQLAlchemyClient
from ploomber.products import File, SQLiteRelation

The simplest Task is `PythonCallable`, which takes a callable (e.g. a function) as its first argument. The only requirement is that the function has a `product` argument; if the task has dependencies, it must have an `upstream` argument as well.

In [2]:
def _one_task(product):
    pd.DataFrame({'one_column': [1, 2, 3]}).to_csv(str(product))

def _another_task(upstream, product):
    df = pd.read_csv(str(upstream['one']))
    df['another_column'] = df['one_column'] + 1
    df.to_csv(str(product))

In [3]:
dag = DAG()

# instantiate two tasks and add them to the DAG
one_task = PythonCallable(_one_task, File('one_file.csv'), dag, name='one')
another_task = PythonCallable(_another_task, File('another_file.csv'), dag, name='another')
# declare dependencies: another_task depends on one_task
one_task >> another_task

PythonCallable: another -> File('another_file.csv')

Note that the previous function definitions use `str(product)`. Products are custom objects, so they do not work directly as parameters to external functions; calling `str` returns a string representation that can be used instead. For `File`, this is the path to the file. Other products implement different logic; for example, a `SQLRelation` returns a `"schema"."name"` string.
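
To make the role of `str` concrete, here is a minimal sketch with two toy classes. `ToyFile` and `ToyRelation` are hypothetical stand-ins written for this illustration, not ploomber's `File` or `SQLRelation`, and the `public` schema is an assumed example value.

```python
class ToyFile:
    """Hypothetical stand-in for a file product (not ploomber's File)."""

    def __init__(self, path):
        self.path = path

    def __str__(self):
        # the string form is the path, so it can be handed to external functions
        return self.path


class ToyRelation:
    """Hypothetical stand-in for a SQL relation product."""

    def __init__(self, schema, name):
        self.schema = schema
        self.name = name

    def __str__(self):
        # the string form is a quoted, schema-qualified identifier
        return f'"{self.schema}"."{self.name}"'


file_product = ToyFile('one_file.csv')
relation_product = ToyRelation('public', 'my_table')

print(str(file_product))
print(str(relation_product))
```

The object itself can carry extra behavior, while `str` gives external libraries the plain value they expect.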

### Rendering

To generate a Product, Tasks use a combination of inputs and a `source`. The kind of source depends on the kind of Task: `PythonCallable` uses a Python function, `SQLScript` uses a string with SQL code, and `SQLUpload` uses a string with the path to a file. Rendering is the process where any necessary preparation and validation of the source takes place.

One use case for this is avoiding redundant code. If a Task is declared to have an upstream dependency, it takes the upstream Product as input. Instead of declaring the Product twice, we can refer to it in the downstream task using a placeholder. Let's see an example using `SQLUpload`:

In [4]:
client = SQLAlchemyClient('sqlite:///my_db.db')

# Tasks that use clients have a client argument, but you can also define
# DAG-level clients
dag.clients[SQLUpload] = client
dag.clients[SQLiteRelation] = client
dag.clients[SQLScript] = client

# Take the product from the upstream task named "another" and use it as source
my_table = SQLUpload(source='{{upstream["another"]}}',
                     product=SQLiteRelation((None, 'my_table', 'table')),
                     dag=dag,
                     name='my_table')

another_task >> my_table

SQLUpload: my_table -> SQLiteRelation('None.my_table')

In [5]:
dag.render()

# let's see the rendered value:
str(my_table.source)

'another_file.csv'

Another important use case for placeholders is parametrized SQL queries. `SQLScript` runs SQL code in a database that creates a table or a view. Since ploomber requires sources (SQL code) and products (a table/view) to be declared separately, we use placeholders to declare the product only once:

In [6]:
source = """
DROP TABLE IF EXISTS {{product}};

CREATE TABLE {{product}}
AS SELECT * FROM {{upstream["my_table"]}}
WHERE one_column = 1
"""

# instead of declaring "second_table" twice, we declare it in product and refer to it in source
second_table = SQLScript(source=source,
                         product=SQLiteRelation((None, 'second_table', 'table')),
                         dag=dag,
                         name='second_table')

my_table >> second_table

SQLScript: second_table -> SQLiteRelation('None.second_table')

In [7]:
dag.render()

DAG("No name")

In [8]:
print(str(dag['second_table'].source))


DROP TABLE IF EXISTS second_table;

CREATE TABLE second_table
AS SELECT * FROM my_table
WHERE one_column = 1


ploomber uses [jinja2](https://jinja.palletsprojects.com/en/2.11.x/api/) for rendering, which opens a wide range of possibilities for rendering SQL source code. Note that this time we didn't call `str` explicitly as we did for `PythonCallable`; this is because jinja automatically casts objects to strings.
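
The implicit string conversion can be checked with jinja2 on its own. The sketch below assumes jinja2 is installed; `ToyRelation` is a hypothetical product-like class defined only for this illustration.

```python
from jinja2 import Template


class ToyRelation:
    """Hypothetical product-like object; only __str__ matters here."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


template = Template(
    'CREATE TABLE {{product}} AS SELECT * FROM {{upstream["my_table"]}}'
)

# jinja converts each object to a string when writing it into the output
rendered = template.render(
    product=ToyRelation('second_table'),
    upstream={'my_table': ToyRelation('my_table')},
)
print(rendered)
```

No explicit `str` call is needed inside the template: whatever `__str__` returns is what ends up in the rendered SQL.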

In [9]:

# Rendering is a pre-processing step where placeholders are resolved and
# a few validation checks are run. Placeholders exist to declare DAGs
# succinctly: once a parameter is declared, it can be used in several
# contexts. They are extremely useful for propagating values in a DAG.

# Once you declare a dependency, you make the product(s) of the upstream Task
# available to the downstream Task at render time. This allows you to
# define a product once and propagate it to downstream tasks.

# Within the downstream task, you can access upstream dependencies using
# jinja syntax via the `upstream` key. Any parameter passed to the Task via
# the `params` parameter will also be available at rendering time.

# A task Product does not have to be fully defined at declaration time.
# Rendering is the process where a DAG object resolves any dependencies.


# dag = DAG()

# one_task = PythonCallable(_one_task, File('one_file'), dag, 'one')
# another_task = PythonCallable(_another_task, File('{{upstream["one"]}}_and_another_file'), dag, 'another')
# one_task >> another_task

# dag.render()

# another_task.product





If anything goes wrong in the rendering process, you will see a detailed traceback to help you debug.




### Build

Once rendering is done, we can build our DAG. 

In [10]:
dag.build()

name,Ran?,Elapsed (s),Percentage
one,False,0,0
another,False,0,0
my_table,False,0,0
second_table,False,0,0


The first time we run our pipeline, all Tasks are executed, but the real power of ploomber is running builds over and over again. Ploomber keeps track of each Task's status and only executes outdated ones. Since we just built our pipeline, nothing will run:

In [11]:
dag.build()

name,Ran?,Elapsed (s),Percentage
one,False,0,0
another,False,0,0
my_table,False,0,0
second_table,False,0,0


### Task status

Upon successful execution, a Task saves metadata along with the Product to keep track of its status in subsequent builds. After a DAG has been executed (even if some tasks failed), another call to `dag.build()` only triggers execution of outdated tasks. A task is run if any of the following conditions is true:

1. No product (when a Task is run for the first time)
2. No metadata (when a Task crashes, no metadata is saved)
3. Any upstream source changed (e.g. an upstream SQL script changed)
4. Source changed (the Task source changed)
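
The four rules above can be sketched as a single check. This is a simplified illustration, not ploomber's implementation: the `source_hash` helper, the metadata layout (a dict with a `source_hash` key) and the `upstream_changed` flag are all assumptions made for the example.

```python
import hashlib


def source_hash(source):
    """Fingerprint of a task's source code (hypothetical helper)."""
    return hashlib.sha256(source.encode('utf-8')).hexdigest()


def is_outdated(product_exists, metadata, source, upstream_changed):
    """Return True if the task must run, following the four rules above.

    metadata is None or a dict with a 'source_hash' key; this layout is an
    assumption for the sketch, not ploomber's metadata format.
    """
    if not product_exists:      # rule 1: no product
        return True
    if metadata is None:        # rule 2: no metadata
        return True
    if upstream_changed:        # rule 3: an upstream source changed
        return True
    # rule 4: the task's own source changed
    return metadata['source_hash'] != source_hash(source)


sql = 'SELECT * FROM my_table'
saved = {'source_hash': source_hash(sql)}

unchanged = is_outdated(True, saved, sql, upstream_changed=False)
edited = is_outdated(True, saved, sql + ' WHERE one_column = 1',
                     upstream_changed=False)
crashed = is_outdated(True, None, sql, upstream_changed=False)
print(unchanged, edited, crashed)
```

A task whose product, metadata and source are all intact is skipped; editing the source or losing the metadata marks it as outdated.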

These rules enable the following use cases:

1. Fast incremental builds (Modify any Task source, next build will only run affected Tasks)
2. Crash recovery (If a DAG crashes, the next run will start where it was interrupted)


### Task parameters

There is one remaining Task argument: `params`. These are optional parameters whose effect varies depending on the kind of Task. `PythonCallable` passes them when calling the underlying function. Tasks that take SQL code as source pass them directly to the source (they are available as placeholders). `NotebookRunner` (which runs Jupyter notebooks) passes them as parameters using [papermill](https://github.com/nteract/papermill).
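
For the Python function case, passing `params` amounts to forwarding them as keyword arguments. The sketch below uses a toy runner to show the idea; `run_callable`, `_sample_rows` and `n_rows` are hypothetical names, and this is not how ploomber is implemented internally.

```python
def _sample_rows(product, n_rows):
    """Hypothetical task function: 'n_rows' arrives through params."""
    return f'writing {n_rows} rows to {product}'


def run_callable(fn, product, params=None):
    """Toy runner: forwards params as keyword arguments to the function."""
    return fn(product=product, **(params or {}))


message = run_callable(_sample_rows, 'sample.csv', params={'n_rows': 100})
print(message)
```

Each key in `params` must match an argument name in the function signature, which is one reason to keep `params` small.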

As general advice, it is best to keep `params` short. Their main use case is creating dynamic DAGs (whose number of Tasks is determined using control structures). Dynamic DAGs are covered in a more advanced tutorial.