
*Note:* You can run this from your computer (Jupyter or terminal), or use one of the
hosted options:
[![binder-logo](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ploomber/binder-env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fploomber%252Fprojects%26urlpath%3Dlab%252Ftree%252Fprojects%252Fml-intermediate%252FREADME.ipynb%26branch%3Dmaster)
[![deepnote-logo](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?template=deepnote&url=https://github.com/ploomber/projects/blob/master/ml-intermediate/README.ipynb)


# Intermediate ML project

This example shows how to build an ML pipeline with integration testing (using
the `on_finish` key). When a pipeline takes a long time to run end-to-end, it
is a good idea to test it with a sample first. Take a look at `pipeline.yaml`
and `env.yaml` to see how this parametrization happens and how it affects the
`get` function defined in `tasks.py`.
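
As a rough illustration of the idea, the sketch below shows how a `get` task could react to a boolean `sample` parameter, together with a check of the kind that could be attached through `on_finish`. It is not the project's actual `tasks.py`: the toy data, the CSV format, and the `no_empty_values` name are assumptions made to keep the example self-contained, and only the two printed messages are taken from the pipeline output shown in this README.

```python
import csv
import random
from pathlib import Path


def get(product, sample):
    """Write a toy dataset to `product`; keep 10% of the rows when sampling."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    rows = [(i, rng.random()) for i in range(1000)]  # stand-in for real data

    if sample:
        print("Sampling 10%")
        rows = rng.sample(rows, k=len(rows) // 10)
    else:
        print("Using the full dataset")

    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows(rows)


def no_empty_values(product):
    """Hypothetical integration check: fail if any cell in the output is empty."""
    with Path(str(product)).open(newline="") as f:
        for line_number, row in enumerate(csv.reader(f), start=1):
            assert all(cell != "" for cell in row), (
                f"empty value on line {line_number}"
            )


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "get.csv"
        get(out, sample=True)
        no_empty_values(out)  # runs after the task, like an on_finish hook would
```

Because the check runs on the task's product, the same test applies whether the pipeline ran on the sample or on the full dataset.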

## Setup

~~~bash
# clone the examples repository and move into this project
git clone https://github.com/ploomber/projects
cd projects/ml-intermediate

conda env create --file environment.yml
conda activate ml-intermediate
~~~

## Execute the pipeline

~~~bash
ploomber build
~~~

~~~
Using the full dataset
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.732092      2.6742
features  True         0.090157      0.329328
join      True         0.239166      0.873631
fit       True        26.3147       96.1228
~~~




## Integration testing with a sample

To see the available parameters (those parsed from `env.yaml` appear as flags prefixed with `--env--`):

~~~bash
ploomber build --help
~~~

~~~
usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--partially PARTIALLY] [--debug]
                [--env--path--products ENV__PATH__PRODUCTS]
                [--env--sample ENV__SAMPLE]

Build pipeline

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point(DAG), defaults to pipeline.yaml. Replaced
                        if there is an ENTRY_POINT env variable defined
  --force, -f           Force execution by ignoring status
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --debug, -d           Drop a debugger session if an exception happens
  --env--path--products ENV__PATH__PRODUCTS
                        Default: /Users/Edu/dev/projects-ploomber/ml-
                        intermediate/output
  --env--sample EN
~~~
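
The help output suggests that nested keys in `env.yaml` become flags by joining the key path with `--` under an `--env` prefix (for example, `path.products` becomes `--env--path--products`). The sketch below illustrates that naming rule only. It is not Ploomber's implementation, and the `env` dictionary and the `env_to_flags` helper are hypothetical, with values assumed for illustration.

```python
def env_to_flags(env, prefix="--env"):
    """Flatten a nested dict into {flag_name: default_value} pairs."""
    flags = {}
    for key, value in env.items():
        name = f"{prefix}--{key}"
        if isinstance(value, dict):
            # Nested sections extend the flag name with another "--" segment
            flags.update(env_to_flags(value, prefix=name))
        else:
            flags[name] = value
    return flags


# Assumed contents, inferred from the two --env-- flags in the help output
env = {"sample": False, "path": {"products": "output"}}

for flag, default in env_to_flags(env).items():
    print(f"{flag} (default: {default})")
```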

Run with a sample:

~~~bash
ploomber build --env--sample true
~~~

~~~
Sampling 10%
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.678332      12.2154
features  True         0.060655       1.09228
join      True         0.086937       1.56557
fit       True         4.72715       85.1267
~~~


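
The `Percentage` column appears to be each task's share of the total elapsed time. As a quick check using the numbers reported by the two runs above, the following sketch recomputes those shares and compares the total runtimes; the dictionary and function names are made up for illustration.

```python
# Elapsed seconds per task, copied from the two build summaries above
full = {"get": 0.732092, "features": 0.090157, "join": 0.239166, "fit": 26.3147}
sample = {"get": 0.678332, "features": 0.060655, "join": 0.086937, "fit": 4.72715}


def percentages(elapsed):
    """Return each task's share of the total elapsed time, in percent."""
    total = sum(elapsed.values())
    return {name: 100 * seconds / total for name, seconds in elapsed.items()}


full_total = sum(full.values())
sample_total = sum(sample.values())
speedup = full_total / sample_total

print(f"full run:   {full_total:.2f}s, fit share {percentages(full)['fit']:.1f}%")
print(f"sample run: {sample_total:.2f}s, fit share {percentages(sample)['fit']:.1f}%")
print(f"speedup from sampling: {speedup:.1f}x")
```

In both runs `fit` dominates the runtime, which is why shrinking the input to that task shortens the whole build.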


## Where to go from here

Using a `pipeline.yaml` is a convenient way to describe your workflows, but it
has some limitations. [`ml-advanced/`](../ml-advanced/README.ipynb) shows a
pipeline written using the Python API, which gives you full flexibility and
allows you to do things such as creating tasks dynamically.

It also shows how to create a Python package to easily share your pipeline and how to test it using `pytest`.