
*Note:* You can run this from your computer (Jupyter or terminal), or use one of the
hosted options:
[![binder-logo](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ploomber/binder-env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fploomber%252Fprojects%26urlpath%3Dlab%252Ftree%252Fprojects%252Fml-intermediate%252FREADME.ipynb%26branch%3Dmaster)
[![deepnote-logo](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?template=deepnote&url=https://github.com/ploomber/projects/blob/master/ml-intermediate/README.ipynb)


# Intermediate ML project

This example shows how to build an ML pipeline with integration testing (using
the `on_finish` key). When the pipeline takes a long time to run end-to-end,
it is a good idea to test it with a sample. Take a look at `pipeline.yaml` and
`env.yaml` to see how this parametrization works and how it affects the
`get` function defined in `tasks.py`.
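
As a rough sketch of how a `sample` flag can change what `get` does, the function below writes either the full dataset or a 10% sample. It is an illustration, not the project's `tasks.py`: the synthetic rows, the CSV format, and the `_make_rows` helper are assumptions made to keep the example self-contained. Only the `get` name, the `sample` parameter, and the two printed messages are taken from this example.

```python
import csv
import random
from pathlib import Path


def _make_rows(n=1000, seed=0):
    """Hypothetical stand-in for loading the real raw data."""
    rng = random.Random(seed)
    return [{"id": i, "x": rng.random(), "y": rng.random()} for i in range(n)]


def get(product, sample):
    """Write the raw data to `product`; keep 10% of the rows if `sample` is true."""
    rows = _make_rows()

    if sample:
        print("Sampling 10%")
        # Fixed seed so repeated test runs see the same subset
        rows = random.Random(0).sample(rows, k=len(rows) // 10)
    else:
        print("Using the full dataset")

    path = Path(product)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "x", "y"])
        writer.writeheader()
        writer.writerows(rows)
```

Downstream tasks read the same product path in both cases, so they do not need to know whether sampling happened.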

## Setup

~~~bash
# clone the examples repository and move into this project
git clone https://github.com/ploomber/projects
cd projects/ml-intermediate

conda env create --file environment.yml
conda activate ml-intermediate
~~~

## Execute the pipeline

In [1]:
%%sh
ploomber build

Using the full dataset
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.523268      2.32697
features  True         0.070489      0.313465
join      True         0.20202       0.898383
fit       True        21.6913       96.4612




## Integration testing with a sample

To list the available parameters, pass `--help` (parameters parsed from `env.yaml` start with `--env--`):

In [2]:
%%sh
ploomber build --help

usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--partially PARTIALLY] [--debug]
                [--env--path--products ENV__PATH__PRODUCTS]
                [--env--sample ENV__SAMPLE]

Build pipeline

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point(DAG), defaults to pipeline.yaml. Replaced
                        if there is an ENTRY_POINT env variable defined
  --force, -f           Force execution by ignoring status
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --debug, -d           Drop a debugger session if an exception happens
  --env--path--products ENV__PATH__PRODUCTS
                        Default: /Users/Edu/dev/projects-ploomber/ml-
                        intermediate/output
  --env--sample EN
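
The `--env--` flags in the help output mirror the keys in `env.yaml`, with nested keys joined by `--`. The helper below is a hypothetical illustration of that naming scheme, not Ploomber's implementation; the `env` dictionary and its values are assumed for the example.

```python
def env_to_flags(env, prefix="--env"):
    """Flatten a nested mapping into CLI-style flag names (hypothetical helper)."""
    flags = []
    for key, value in env.items():
        name = f"{prefix}--{key}"
        if isinstance(value, dict):
            # Nested sections contribute one more `--key` segment per level
            flags.extend(env_to_flags(value, prefix=name))
        else:
            flags.append(name)
    return flags


# Assumed env.yaml contents, loaded as a dictionary
env = {"path": {"products": "output"}, "sample": False}
print(env_to_flags(env))
# ['--env--path--products', '--env--sample']
```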

Run with a sample:

In [3]:
%%sh
ploomber build --env--sample true 

Sampling 10%
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.501051       7.41058
features  True         0.036964       0.5467
join      True         0.069101       1.02201
fit       True         6.15417       91.0207



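An integration test hooked to `on_finish` can be as small as a function that loads a task's product and asserts basic data-quality properties. The sketch below assumes the product is a CSV file and that the hook receives the product path as `product`. The function name `no_missing_values` and the checks themselves are hypothetical and not taken from this project.

```python
import csv


def no_missing_values(product):
    """Fail if the file at `product` is empty or has blank cells (hypothetical check)."""
    with open(str(product), newline="") as f:
        rows = list(csv.DictReader(f))

    assert rows, f"{product} has no rows"

    for i, row in enumerate(rows):
        # DictReader fills short rows with None, so treat None as missing too
        empty = [col for col, value in row.items() if value in ("", None)]
        assert not empty, f"row {i} of {product} has missing values in {empty}"
```

A check like this behaves the same on the full dataset and on a sample, which is what makes the sampled build a quick way to exercise the tests.
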

## Where to go from here

Using a `pipeline.yaml` is a convenient way to describe your workflows, but it
has some limitations. [`ml-advanced/`](../ml-advanced/README.ipynb) shows a
pipeline written using the Python API, which gives you full flexibility and
allows you to do things such as creating tasks dynamically.

It also shows how to create a Python package to easily share your pipeline and how to test it using `pytest`.

