# Intermediate ML project

This example shows how to build an ML pipeline with integration testing (using
the `on_finish` key). When a pipeline takes a long time to run end-to-end, it
is a good idea to test it with a sample of the data first. Take a look at
`pipeline.yaml` and `env.yaml` to see how this parametrization works and how it
affects the `get` function defined in `tasks.py`.
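
As a rough illustration of the kind of check an `on_finish` integration test can run, the sketch below validates the rows a task produced and raises an error if a required value is missing. The function name `check_no_missing`, the row format (plain dicts), and the column names are hypothetical and are not taken from this project's `tasks.py`; a real hook would first load the task's product and then apply a check like this one.

```python
def check_no_missing(rows, required_columns):
    """Raise ValueError if any row lacks a value for a required column.

    Hypothetical data-quality check; rows are plain dicts for illustration.
    """
    if not rows:
        raise ValueError("No rows produced")

    for i, row in enumerate(rows):
        for column in required_columns:
            if row.get(column) is None:
                raise ValueError(f"Row {i} has no value for {column!r}")


# Clean data passes silently
clean = [{"age": 31, "target": 1}, {"age": 45, "target": 0}]
check_no_missing(clean, required_columns=["age", "target"])

# A missing value raises, which is how a failed check would surface
dirty = clean + [{"age": None, "target": 1}]
try:
    check_no_missing(dirty, required_columns=["age", "target"])
except ValueError as error:
    print(f"Check failed: {error}")
```

Raising an exception, rather than returning a flag, means a bad output stops the run instead of silently flowing into downstream tasks.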

## Setup

~~~bash
# clone the examples repository and move into this project's folder
git clone https://github.com/ploomber/projects
cd projects/ml-intermediate

conda env create --file environment.yml
conda activate ml-intermediate
~~~

## Execute the pipeline

~~~bash
ploomber build
~~~

~~~text
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       False               0             0
features  False               0             0
join      False               0             0
fit       False               0             0


100%|██████████| 4/4 [00:00<00:00, 7219.11it/s]
0it [00:00, ?it/s]
~~~


## Integration testing with a sample

To see available parameters:

~~~bash
ploomber build --help
~~~

~~~text
usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--partially PARTIALLY] [--debug]
                [--env--path--products ENV__PATH__PRODUCTS]
                [--env--sample ENV__SAMPLE]

Build pipeline

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point(DAG), defaults to pipeline.yaml. Replaced
                        if there is an ENTRY_POINT env variable defined
  --force, -f           Force execution by ignoring status
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --debug, -d           Drop a debugger session if an exception happens
  --env--path--products ENV__PATH__PRODUCTS
                        Default: /Users/Edu/dev/projects-ploomber/ml-
                        intermediate/output
  --env--sample EN
~~~
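
The `--env--...` flags in the help output follow the nesting of the environment keys, with each level joined by a double dash. The sketch below reproduces that naming pattern for illustration only. It is not Ploomber's implementation, the helper `env_to_flags` is hypothetical, and the dictionary is an assumed shape for `env.yaml` inferred from the two flags shown above.

```python
def env_to_flags(env, prefix="--env"):
    """Flatten a nested dict into flag names joined by double dashes."""
    flags = []
    for key, value in env.items():
        name = f"{prefix}--{key}"
        if isinstance(value, dict):
            # Nested section: recurse, carrying the parent keys in the prefix
            flags.extend(env_to_flags(value, prefix=name))
        else:
            flags.append(name)
    return flags


# Assumed env.yaml contents (the values are placeholders)
env = {"path": {"products": "output"}, "sample": False}
print(env_to_flags(env))
```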

Run with a sample:

~~~bash
ploomber build --env--sample true
~~~

~~~text
Sampling 10%
name      Ran?      Elapsed (s)    Percentage
--------  ------  -------------  ------------
get       True         0.622989      11.6216
features  True         0.064425       1.20182
join      True         0.086574       1.615
fit       True         4.58664       85.5616


100%|██████████| 4/4 [00:00<00:00, 9024.86it/s]
Building task "fit":  75%|███████▌  | 3/4 [00:01<00:00,  2.18it/s]
Executing:   0%|          | 0/11 [00:00<?, ?cell/s]
Executing:   9%|▉         | 1/11 [00:00<00:08,  1.22cell/s]
Executing:  18%|█▊        | 2/11 [00:02<00:08,  1.06cell/s]
Executing:  45%|████▌     | 5/11 [00:02<00:04,  1.48cell/s]
Executing:  64%|██████▎   | 7/11 [00:03<00:02,  1.54cell/s]
Executing:  73%|███████▎  | 8/11 [00:03<00:01,  1.98cell/s]
Executing: 100%|██████████| 11/11 [00:04<00:00,  2.51cell/s]
Building task "fit": 100%|██████████| 4/4 [00:05<00:00,  1.42s/it]
~~~
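
The `Sampling 10%` message comes from the first task reacting to the `sample` parameter. A minimal sketch of that idea follows, assuming a boolean `sample` flag and a fixed 10% fraction. The helper `maybe_sample`, its arguments, and the list-based data are hypothetical; the actual `get` function in `tasks.py` may be written differently.

```python
import random


def maybe_sample(rows, sample, fraction=0.1, seed=0):
    """Return all rows, or a reproducible random subset when sample is true."""
    if not sample or not rows:
        return rows

    print(f"Sampling {fraction:.0%}")
    # Keep at least one row so downstream tasks always receive data
    n = max(1, int(len(rows) * fraction))
    # A seeded generator makes the sample identical across runs
    return random.Random(seed).sample(rows, n)


rows = list(range(1000))
full = maybe_sample(rows, sample=False)
subset = maybe_sample(rows, sample=True)
print(len(full), len(subset))
```

Fixing the seed keeps test runs comparable: the same subset is drawn every time, so a failure can be reproduced.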


## Where to go from here

A `pipeline.yaml` file is a convenient way to describe your workflows, but it has some limitations. `ml-advanced` shows a pipeline written with the Python API, which gives you full flexibility and allows you to do things such as creating tasks dynamically.

It also shows how to create a Python package to easily share your pipeline and how to test it using `pytest`.