# Your first Python pipeline

<!-- start description -->
Introductory tutorial to learn the basics of Ploomber.
<!-- end description -->

## Installing dependencies
We'll run a few commands to bootstrap the Colab instance. This takes about 30 seconds.

In [1]:
# Basic installs, run this and read the bottom instructions
!echo Installing packages with fixed versions for colab
!pip -q install -r https://github.com/ploomber/ploomber/raw/master/requirements-colab.lock.txt
!echo Installing Ploomber
!pip -q install ploomber black

Installing packages with fixed versions for colab

In [2]:
!ploomber examples -n guides/first-pipeline -o first-pipeline
%cd first-pipeline

# You should see all of the pipeline content
!ls -l

Loading examples...
Local copy does not exist...
Cloning into '/root/.ploomber/projects'...
remote: Enumerating objects: 555, done.
remote: Counting objects: 100% (555/555), done.
remote: Compressing objects: 100% (434/434), done.
remote: Total 555 (delta 100), reused 322 (delta 61), pack-reused 0
Receiving objects: 100% (555/555), 2.03 MiB | 14.64 MiB/s, done.
Resolving deltas: 100% (100/100), done.
Next steps:

$ cd first-pipeline/
$ ploomber install

Open first-pipeline/README.md for details.
/content/first-pipeline
total 220
-rw-r--r-- 1 root root    338 Jun 13 15:40 1-get.py
-rw-r--r-- 1 root root    354 Jun 13 15:40 2-profile-raw.py
-rw-r--r-- 1 root root    381 Jun 13 15:40 3-clean.py
-rw-r--r-- 1 root root    364 Jun 13 15:40 4-profile-clean.py
-rw-r--r-- 1 root root    325 Jun 13 15:40 5-plot.py
-rw-r--r-- 1 root root 116130 Jun 13 15:40 colab.ipynb
-rw-r--r-- 1 root root    141 Jun 13 15:40 environment.yml
-rw-r--r-- 1 root root   7786 Jun 13 15:40 pipeli

## Introduction

Ploomber helps you build modular pipelines. A pipeline (or **DAG**) is a group of tasks with a particular execution order, where subsequent (or **downstream**) tasks use the outputs of previous (or **upstream**) tasks as inputs.

## Pipeline declaration

This example pipeline contains five tasks: `1-get.py`, `2-profile-raw.py`,
`3-clean.py`, `4-profile-clean.py`, and `5-plot.py`. We declare them in a `pipeline.yaml` file:

```yaml
# Content of pipeline.yaml
tasks:
  # source is the code you want to execute (.ipynb also supported)
  - source: 1-get.py
    # products are the task's outputs
    product:
      # scripts generate executed notebooks as outputs
      nb: output/1-get.html
      # you can define as many outputs as you want
      data: output/raw_data.csv

  - source: 2-profile-raw.py
    product: output/2-profile-raw.html

  - source: 3-clean.py
    product:
      nb: output/3-clean.html
      data: output/clean_data.parquet

  - source: 4-profile-clean.py
    product: output/4-profile-clean.html

  - source: 5-plot.py
    product: output/5-plot.html

```

**Note:** YAML is a human-readable text format similar to JSON.
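
To make the comparison with JSON concrete, here is a minimal sketch of the first task from `pipeline.yaml` written as a Python dictionary and serialized with the standard `json` module. The variable names are illustrative, and this only mirrors the structure of the file; it is not how Ploomber loads it.

```python
import json

# The first task entry from pipeline.yaml, written as plain Python data
# (this mirrors the YAML structure; it is not how Ploomber loads the file)
spec = {
    "tasks": [
        {
            "source": "1-get.py",
            "product": {
                "nb": "output/1-get.html",
                "data": "output/raw_data.csv",
            },
        }
    ]
}

# JSON needs braces, brackets, and quotes where YAML relies on indentation
as_json = json.dumps(spec, indent=2)
print(as_json)

# Round-tripping through JSON recovers the same structure
restored = json.loads(as_json)
```

Both formats describe the same nested mappings and lists; YAML simply drops most of the punctuation and allows comments.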

**Note:** Ploomber supports Python scripts, Python functions, Jupyter notebooks, R scripts, and SQL scripts.

## Opening `.py` files as notebooks

Ploomber integrates with Jupyter. Among other things, it **allows you to open `.py` files as notebooks** (via `jupytext`).

![lab-open-with-nb](https://ploomber.io/images/doc/lab-open-with-notebook.png)
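
For a rough idea of how a plain `.py` file can map to notebook cells, the sketch below uses jupytext's percent format, where comment lines starting with `# %%` mark cell boundaries. The script contents are made up for illustration, and the regular-expression split is a deliberately naive stand-in for jupytext's real parser, which handles many more cases.

```python
import re

# Hypothetical script in jupytext's percent format (contents are made up)
script = '''\
# %% [markdown]
# # Clean the data

# %%
values = [3, None, 5]

# %%
clean = [v for v in values if v is not None]
print(clean)
'''

# Naive split on lines starting with "# %%"; each non-empty chunk is one cell
chunks = re.split(r"^# %%.*$", script, flags=re.MULTILINE)
cells = [chunk.strip() for chunk in chunks if chunk.strip()]

for number, cell in enumerate(cells, start=1):
    print(f"--- cell {number} ---")
    print(cell)
```

Because the cell markers are ordinary comments, the file stays a valid Python script that you can diff and review like any other source file.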

### What sets the execution order?

Ploomber infers the pipeline structure from your code. For example, to
clean the data, we must get it first; hence, we declare the following in `3-clean.py`:

~~~python
# 3-clean.py

# this tells Ploomber to execute the '1-get' task before '3-clean'
upstream = ['1-get']
~~~
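
One way to picture how an execution order follows from `upstream` declarations is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) rather than Ploomber's internals. Apart from `3-clean` depending on `1-get`, the dependencies listed are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Task -> list of upstream tasks it depends on.
# Only the '3-clean' entry is stated in this tutorial; the rest are assumed.
upstream_of = {
    "1-get": [],
    "2-profile-raw": ["1-get"],
    "3-clean": ["1-get"],
    "4-profile-clean": ["3-clean"],
    "5-plot": ["3-clean"],
}

# static_order() yields every task after all of its upstream tasks
order = list(TopologicalSorter(upstream_of).static_order())
print(order)
```

Any order produced this way runs `1-get` before `3-clean`, which is the guarantee the `upstream` declaration asks for.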

## Plotting the pipeline

In [3]:
%%bash
ploomber plot

Loading pipeline...
Plot saved at: pipeline.html




In [5]:
from IPython.display import HTML, IFrame
display(HTML(filename='pipeline.html'))  # either HTML or IFrame should work

You can see that our pipeline has a defined execution order.

**Note:** This is a sample predefined five-task pipeline; Ploomber can manage arbitrarily complex pipelines and dependencies among tasks.

## Running the pipeline

In [6]:
%%bash
# takes a few seconds to finish
ploomber build

Loading pipeline...
name             Ran?      Elapsed (s)    Percentage
---------------  ------  -------------  ------------
1-get            True          2.81864       11.531
2-profile-raw    True          4.76771       19.5045
3-clean          True          2.83165       11.5842
4-profile-clean  True          4.68803       19.1785
5-plot           True          9.3381        38.2018



This pipeline saves all the output in the `output/` directory; we have the output notebooks and data files:

In [7]:
%%bash
ls output

1-get.html
2-profile-raw.html
3-clean.html
4-profile-clean.html
5-plot.html
clean_data.parquet
raw_data.csv


## Updating the pipeline

Ploomber automatically caches your pipeline's previous results and only runs tasks that changed since your last execution, along with the tasks downstream of them.

Execute the following to modify the `3-clean.py` script:

In [8]:
from pathlib import Path

path = Path('3-clean.py')
clean = path.read_text()

# add a print statement at the end of 3-clean.py
path.write_text(clean + """
print("hello")
""")

397

Execute the pipeline again:

In [9]:
%%bash
# takes a few seconds to finish
ploomber build

Loading pipeline...
name             Ran?      Elapsed (s)    Percentage
---------------  ------  -------------  ------------
3-clean          True          2.78456       16.7928
4-profile-clean  True          4.495         27.1079
5-plot           True          9.30233       56.0993
1-get            False         0              0
2-profile-raw    False         0              0



In [10]:
# restore contents
path.write_text(clean)

381

You'll see that `1-get.py` and `2-profile-raw.py` didn't run because they were not affected by the change!
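
The rebuild behavior can be pictured like this: a changed task is outdated, and so is everything downstream of it. The following is a simplified sketch, not Ploomber's actual implementation. The `downstream` mapping is an assumption consistent with the build report above, and `outdated_tasks` is a hypothetical helper.

```python
from collections import deque

# Assumed task -> direct downstream tasks (illustrative, not read from the project)
downstream = {
    "1-get": ["2-profile-raw", "3-clean"],
    "2-profile-raw": [],
    "3-clean": ["4-profile-clean", "5-plot"],
    "4-profile-clean": [],
    "5-plot": [],
}


def outdated_tasks(changed, downstream):
    """Return the changed tasks plus every task reachable downstream of them."""
    outdated = set(changed)
    queue = deque(changed)
    while queue:
        task = queue.popleft()
        for child in downstream[task]:
            if child not in outdated:
                outdated.add(child)
                queue.append(child)
    return outdated


to_run = outdated_tasks(["3-clean"], downstream)
skipped = sorted(set(downstream) - to_run)

print(sorted(to_run))  # ['3-clean', '4-profile-clean', '5-plot']
print(skipped)         # ['1-get', '2-profile-raw']
```

Under these assumptions, editing `3-clean.py` marks three tasks for execution and leaves the other two untouched, matching the `Ran?` column in the second build.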

## Where to go from here

**Bring your own code!** Check out the tutorial to [migrate your code to Ploomber](https://docs.ploomber.io/en/latest/user-guide/refactoring.html).

Have questions? [Ask us anything on Slack](https://ploomber.io/community/).

Want to dig deeper into Ploomber's core concepts? Check out [the basic concepts tutorial](https://docs.ploomber.io/en/latest/get-started/basic-concepts.html).

Want to start a new project quickly? Check out [how to get examples](https://docs.ploomber.io/en/latest/user-guide/templates.html).

