# Intermediate ML project

This example shows how to build an ML pipeline with integration testing (using
the `on_finish` key). When the pipeline takes a long time to run end-to-end,
it is a good idea to test it with a sample first. Take a look at
`pipeline.yaml` and `env.yaml` to see how this parametrization happens and how
it affects the `get` function defined in `tasks.py`.
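
To make the parametrization concrete, here is a minimal sketch of how a `get` task could react to a `sample` flag. It is not the code in `tasks.py`: the `load_rows` helper, the synthetic data, and the CSV output are assumptions made for illustration, and only the `sample` parameter name (taken from the `--env--sample` flag) and the two printed messages come from this example's output.

```python
import csv
import random
from pathlib import Path


def load_rows():
    """Stand-in for the real data source (hypothetical): 100 synthetic rows."""
    return [{"id": i, "value": i * 2} for i in range(100)]


def get(product, sample):
    """Write the dataset to `product`; keep about 10% of the rows if `sample` is true."""
    rows = load_rows()

    if sample:
        print("Sampling 10%")
        rng = random.Random(0)  # fixed seed so the sample is reproducible
        rows = rng.sample(rows, k=max(1, len(rows) // 10))
    else:
        print("Using the full dataset")

    # Assumption: the product is a local file path; the real task may use another format
    with Path(str(product)).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(rows)
```

Because the flag only changes how many rows the first task writes, downstream tasks can stay unchanged and simply receive a smaller input.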

## Setup

## Execute the pipeline

In [1]:
%%sh
ploomber build

Using the full dataset
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.639868      2.51247
features  True         0.216619      0.850563
join      True         0.338534      1.32927
fit.py    True        24.2727       95.3077



## Integration testing with a sample

To see available parameters:

In [2]:
%%sh
ploomber build --help

usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--partially PARTIALLY] [--env--sample ENV__SAMPLE]

Build pipeline

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point(DAG), defaults to pipeline.yaml. Replaced
                        if there is an ENTRY_POINT env variable defined
  --force, -f           Force execution by ignoring status
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --env--sample ENV__SAMPLE
                        Default: False


Run with a sample:

In [3]:
%%sh
ploomber build --env--sample true 

Sampling 10%
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.755183      11.9634
features  True         0.195999       3.10496
join      True         0.23536        3.7285
fit.py    True         5.12591       81.2032
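
The integration tests themselves are functions referenced from the `on_finish` key. The sketch below shows one possible check, assuming the hook receives the task's output as a `product` argument and that the product is a CSV file; the function name `check_no_missing_values` and the file format are hypothetical and not taken from this project.

```python
import csv


def check_no_missing_values(product):
    """Hypothetical `on_finish` hook: fail the task if any cell is empty."""
    with open(str(product), newline="") as f:
        rows = list(csv.DictReader(f))

    # An empty product usually points to an upstream bug, so treat it as a failure
    assert rows, f"{product} has no rows"

    # Collect (row index, column name) pairs for empty or absent cells
    missing = [
        (i, column)
        for i, row in enumerate(rows)
        for column, value in row.items()
        if value is None or value == ""
    ]
    assert not missing, f"{product} has missing values at (row, column): {missing[:5]}"
```

A check like this runs the same way on the sample and on the full dataset, so a quick sampled build can surface data quality problems before the long run.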

