In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

<font color=gray>Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***

# <font color=red>Working with Pipelines</font>
<p style="margin-left:10%; margin-right:10%;">by <font color="teal">The Oracle Data Science Team</font></p>

***

## Overview:

The machine learning lifecycle is composed of several steps: data acquisition and extraction, data preparation, featurization, model training (including algorithm selection and hyper-parameter tuning), model evaluation, deployment, monitoring, and possibly retraining the deployed model. Oracle Cloud Infrastructure (OCI) Data Science Machine Learning (ML) Pipeline enables you to define and run an end-to-end machine learning orchestration covering the entire machine learning lifecycle. With it, you can run a repeatable, continuous ML pipeline with a few simple commands.

This notebook uses the Accelerated Data Science (ADS) SDK to construct, control, and leverage pipelines within the Oracle Data Science service.

Compatible conda pack: [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for CPU on Python 3.8 (version 1.0)

---

## Contents:

- <a href="#concepts">Introduction</a>
  - <a href='#preliminary'>Setup</a>
    - <a href='#policy'>Policy</a>
    - <a href='#var'>Variables</a>
- <a href="#construct">Construct a Pipeline</a>
  - <a href="#option1">With Pipeline Steps</a>
  - <a href="#option2">With YAML</a>
- <a href='#create'>Create Pipeline</a>
- <a href='#run'>Create a Pipeline Run</a>
  - <a href="#watch_status">Watch Status</a>
  - <a href="#monitor_logs">Monitor Logs</a>
  - <a href="#cancel_run">Cancel Pipeline Run</a>
  - <a href="#delete_run">Delete Pipeline Run</a>
- <a href='#load'>Load an Existing Pipeline</a>
- <a href='#delete'>Delete Pipeline</a>
- <a href='#clean-up'>Clean Up</a>
- <a href='#magic'>Magic Commands</a>
  - <a href="#magic_install">Install</a>
  - <a href="#magic_create">Create</a>
  - <a href="#magic_visualize">Visualize</a>
  - <a href="#magic_watch">Watch</a>
  - <a href="#magic_monitor">Monitor</a>
  - <a href="#magic_cancel">Cancel</a>
  - <a href="#magic_delete">Delete</a>
- <a href='#ref'>References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` becomes `database_name = "production"`.

---

Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.

In [None]:
import ads
import oci
import os

from ads.jobs import DataScienceJob
from ads.pipeline import (
    Pipeline,
    PipelineStep,
    PipelineRun,
    ScriptRuntime,
    NotebookRuntime,
    CustomScriptStep,
)
from tempfile import mkdtemp

ads.set_auth("resource_principal")

<a id="concepts"></a>
# Introduction

A pipeline is a workflow of tasks, called steps. Steps can be run in sequence or in parallel, creating a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (DAG) of the steps.

In a machine learning context, pipelines usually provide a workflow for data import → data transformation → model training → model evaluation. In addition, the model can also be registered to a model catalog and deployed to serve predictions.

The following are some key terms that will help you understand OCI Data Science ML Pipelines:

* **Directed acyclic graph (DAG)**: A graph of the steps in a workflow. It defines the dependencies of each step on the other steps in the pipeline. The dependencies create a logical workflow in the form of an acyclic graph, that is, a graph with no loops. The pipeline runs steps in parallel where it can to shorten completion time, and runs them sequentially only where the dependencies require it. For example, the training steps must complete before the model evaluation steps run. However, multiple models can be trained in parallel.

* **Pipeline lifecycle state**: The lifecycle state of a pipeline. A pipeline can be in various states, such as created, constructed, or deleted. After a pipeline is created, it remains in the CREATING state and can't be run until all of its steps have an artifact or job to run, at which point the pipeline changes to the ACTIVE state.

* **Pipeline run**: The execution instance of a pipeline. Each pipeline run will include its step runs. A pipeline run can be configured to override some of the pipeline's defaults before starting the execution.

* **Pipeline step**: A task in a pipeline. A pipeline step can be either a Data Science Job step or a Custom Script step.

    - Data Science Job: The OCID of an existing Data Science Job must be provided.
    - Custom Script: The artifact of the Python script and the execution configuration must be specified.
    
* **Step artifact**: The Python code to be used for the step. This code will be executed when the pipeline step is run.
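
The way dependencies decide what can run in parallel can be sketched in plain Python. This is an illustration only, not the service's scheduler: `execution_waves` is a hypothetical helper, and the step names are invented for the example. It takes a mapping from each step to the steps it depends on and groups the steps into waves whose members could run at the same time.

```python
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; each step's dependencies all sit in earlier waves."""
    known = set(dependencies)
    for step, deps in dependencies.items():
        unknown = deps - known
        if unknown:
            raise ValueError(f"{step} depends on undefined steps: {sorted(unknown)}")

    # Work on a copy so the caller's mapping is left untouched.
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    waves: List[List[str]] = []
    while remaining:
        # A step is ready once none of its dependencies are still pending.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("The graph contains a cycle, so it is not a DAG.")
        waves.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical workflow: two models train in parallel, evaluation waits for both.
example = {
    "prepare_data": set(),
    "train_model_a": {"prepare_data"},
    "train_model_b": {"prepare_data"},
    "evaluate": {"train_model_a", "train_model_b"},
}
waves = execution_waves(example)
print(waves)
```

Each inner list is one wave. A graph with a loop never produces a ready step, which is why the helper reports it as an error.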

<a id="preliminary"></a>
## Setup

<a id='policy'></a>
### Policy

Before using this notebook, your tenancy must be configured to use the ML Pipeline service.

* Create or use an existing VCN Private Subnet with a Service Gateway attached to your Private Subnet Routing Table.
* Set required policies.
* Add users to the group's policies.

In [None]:
# Provide the OCID of an existing Data Science Job
job_id = "<job_id>"

# The log group OCID
log_group_id = "<log_group_id>"

# The log OCID
log_id = "<log_id>"

In [None]:
compartment_id = os.environ["NB_SESSION_COMPARTMENT_OCID"]
project_id = os.environ["PROJECT_OCID"]
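
Before going further, it can help to confirm that no `<placeholder>` text was left in the variables above. The following is a minimal sketch; `unfilled_placeholders` is a hypothetical helper, not part of ADS, and the OCID in the example is fabricated for illustration.

```python
import re

# Matches values that still have the form "<something>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unfilled_placeholders(**values: str) -> list:
    """Return the names of the values that still look like placeholder text."""
    return sorted(name for name, value in values.items() if PLACEHOLDER.match(value))


# Example values only; in the notebook you would pass job_id, log_group_id, and log_id.
missing = unfilled_placeholders(
    job_id="<job_id>",
    log_group_id="ocid1.loggroup.oc1..exampleuniqueid",  # fabricated OCID
    log_id="<log_id>",
)
print(missing)
```

An empty list means every value was replaced.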

<a id="construct"></a>
# Construct a Pipeline

In the ADS pipeline module, you can use either the Python API or YAML to define a pipeline. In addition to the configuration of the pipeline, you provide the details of the pipeline steps and the DAG. The DAG defines the dependencies between the pipeline steps.

The following symbols are used in the DAG to define the dependencies between the steps. If a DAG is not provided, all steps will run in parallel.

- ``>>`` denotes the tasks running in sequence, ``A >> B`` means that A is followed by B.
- ``()`` denotes the tasks running in parallel.

In the following example, `step_2` starts after `step_1` completes. `step_3` starts after both `step_1` and `step_2` are complete.

```YAML
dag:
- step_1 >> step_2
- (step_1, step_2) >> step_3
```
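
To make the notation concrete, here is a small sketch that turns DAG expressions like the ones above into a map of each step to the steps it waits for. It is not the parser ADS uses; `parse_group` and `parse_dag` are hypothetical helpers, and they assume the flat syntax shown here, with no nested parentheses.

```python
from typing import Dict, List, Set


def parse_group(token: str) -> List[str]:
    """Turn "step_1" or "(step_1, step_2)" into a list of step names."""
    token = token.strip()
    if token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
    return [name.strip() for name in token.split(",") if name.strip()]


def parse_dag(expressions: List[str]) -> Dict[str, Set[str]]:
    """Map each step to the set of steps that must finish before it starts."""
    dependencies: Dict[str, Set[str]] = {}
    for expression in expressions:
        groups = [parse_group(part) for part in expression.split(">>")]
        for group in groups:
            for step in group:
                dependencies.setdefault(step, set())
        # Every step in a group waits for all steps in the group just before it.
        for upstream, downstream in zip(groups, groups[1:]):
            for step in downstream:
                dependencies[step].update(upstream)
    return dependencies


dag = ["step_1 >> step_2", "(step_1, step_2) >> step_3"]
dependencies = parse_dag(dag)
for step, upstream in sorted(dependencies.items()):
    print(step, "waits for", sorted(upstream))
```

A step that waits for nothing can start as soon as the run begins.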

Both the log OCID and corresponding log group OCID can be specified in the ``Pipeline`` instance. If you specify only the log group OCID and no log OCID, a new Log resource is automatically created within the log group to store the logs.

There are two types of logs for pipeline runs, service log and custom log. When defining a pipeline:

- To enable custom log, specify ``log_id`` and ``log_group_id``.
- To enable service log, specify ``log_group_id`` and set ``enable_service_log`` to ``True``.
- To enable both types of logs, specify ``log_id`` and ``log_group_id``, and set ``enable_service_log`` to ``True``.
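
The three combinations above can be written down as keyword arguments. This is a sketch only: `logging_arguments` is a hypothetical helper, and the OCIDs are the notebook's placeholders rather than real values.

```python
from typing import Dict, Optional, Union


def logging_arguments(
    log_group_id: str,
    log_id: Optional[str] = None,
    enable_service_log: bool = False,
) -> Dict[str, Union[str, bool]]:
    """Collect the logging keyword arguments for the chosen log types."""
    arguments: Dict[str, Union[str, bool]] = {"log_group_id": log_group_id}
    if log_id is not None:
        arguments["log_id"] = log_id  # custom log
    if enable_service_log:
        arguments["enable_service_log"] = True  # service log
    return arguments


# Placeholder OCIDs; replace them with real values.
group, log = "<log_group_id>", "<log_id>"

custom_only = logging_arguments(group, log_id=log)
service_only = logging_arguments(group, enable_service_log=True)
both = logging_arguments(group, log_id=log, enable_service_log=True)

for label, arguments in [("custom", custom_only), ("service", service_only), ("both", both)]:
    print(label, arguments)
```

Any of these dictionaries could be unpacked into the ``Pipeline`` constructor alongside its other arguments, for example ``**both``.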

With the specified DAG and pre-created pipeline steps, you can define a pipeline and give it a name.

<a id="option1"></a>
## With Pipeline Steps

To create a Pipeline, first, define a series of ``Pipeline Steps`` and then construct the pipeline object by providing the list of steps and DAG details. A pipeline step can be either a Data Science Job or a custom script. A Custom Script step can have different types of ``runtime`` depending on the source code you run:

* ``GitPythonRuntime``: Allows you to run source code from a Git repository.
* ``NotebookRuntime``: Allows you to run a JupyterLab Python notebook.
* ``PythonRuntime``: Allows you to run Python code with additional options, including setting a working directory, adding Python paths, and copying output files.
* ``ScriptRuntime``: Allows you to run Python, Bash, and Java scripts from a single source file (``.zip`` or ``.tar.gz``) or code directory.

The following example creates and runs a pipeline with multiple steps. The steps ``step_1`` and ``step_2`` are custom scripts, and ``step_3`` is a Data Science Job. ``step_1`` and ``step_2`` run in parallel, and ``step_3`` runs after both are complete.

In [None]:
# Prepare a simple script to be run within a Pipeline step
script_dir = mkdtemp()
pipeline_step_script = os.path.join(script_dir, "pipeline_step_script.py")
with open(pipeline_step_script, "w") as f:
    f.write("print('Hello World!')")

In [None]:
infrastructure = CustomScriptStep(
    block_storage_size=200,
    shape_name="VM.Standard3.Flex",
    shape_config_details={"ocpus": 4, "memory_in_gbs": 32},
)

script_runtime = ScriptRuntime(
    script_path_uri=pipeline_step_script,
    conda={"type": "service", "slug": "tensorflow26_p37_cpu_v2"},
)

notebook_runtime = NotebookRuntime(
    notebook_path_uri="https://raw.githubusercontent.com/tensorflow/docs/master/site/en/tutorials/customization/basics.ipynb",
    conda={"type": "service", "slug": "tensorflow26_p37_cpu_v2"},
)

pipeline_step_1 = PipelineStep(
    name="step_1",
    description="A step running a python script",
    infrastructure=infrastructure,
    runtime=script_runtime,
)

pipeline_step_2 = PipelineStep(
    name="step_2",
    description="A step running a notebook",
    infrastructure=infrastructure,
    runtime=notebook_runtime,
)

pipeline_step_3 = PipelineStep(
    name="step_3", description="A step running a Data Science Job", job_id=job_id
)

pipeline = Pipeline(
    name="An example pipeline",
    compartment_id=compartment_id,
    project_id=project_id,
    step_details=[pipeline_step_1, pipeline_step_2, pipeline_step_3],
    dag=["(step_1, step_2) >> step_3"],
    log_group_id=log_group_id,
    log_id=log_id,
    enable_service_log=True,
)

<a id="option2"></a>
## With YAML
A pipeline can also be constructed from a YAML string or a YAML file.

* `Pipeline.from_yaml(<YAML string>)`
* `Pipeline.from_yaml(uri="/path/to/file.yaml")`
* `Pipeline.from_yaml(uri="oci://<bucket_name>@<namespace>/<prefix>/file.yaml")`

In the previous section, a pipeline was created using Pipeline Steps. The `.to_yaml()` method can be used to convert the pipeline into a YAML format.

In [None]:
pipeline = Pipeline.from_yaml(pipeline.to_yaml())

Once the Pipeline object has been created, it can be printed in a YAML format.

In [None]:
print(pipeline)

Use the ``.show()`` method on the Pipeline instance to visualize the pipeline in a graph.

In [None]:
pipeline.show()

<a id="create"></a>
# Create Pipeline

Call the ``.create()`` method of the Pipeline instance to create a pipeline.

In [None]:
pipeline.create()

If you print the ``pipeline`` object now, you will notice that the pipeline has an OCID value.

In [None]:
print(pipeline)

<a id="run"></a>
# Create a Pipeline Run
A Pipeline Run is the execution instance of a Pipeline. Each Pipeline Run includes its step runs. A Pipeline Run can be configured to override some of the pipeline's defaults before starting the execution.

You can call the ``.run()`` method of the ``Pipeline`` instance to launch a new Pipeline Run.
It returns a ``PipelineRun`` instance. With a ``PipelineRun`` instance, you can watch the status of the run and stream logs for the pipeline run and the step runs.

The ``.run()`` method gives you the option to override the configurations in a pipeline run. It takes the following optional parameters:

- ``compartment_id: str, optional``. Defaults to ``None``. The compartment id overrides the one defined previously.
- ``configuration_override_details: dict, optional``. Defaults to ``None``.
The configuration details dictionary that overrides the one defined previously. The ``configuration_override_details`` dictionary contains the following keys:

    - ``command_line_arguments``: str, the command line arguments.
    - ``environment_variables``: dict, the environment variables.
    - ``maximum_runtime_in_minutes``: int, the maximum runtime allowed in minutes.
    - ``type``: str, only ``DEFAULT`` is allowed.

- ``defined_tags: dict(str, dict(str, object)), optional``. Defaults to ``None``. The defined tags dictionary that overrides the one defined previously.
- ``display_name: str, optional``. Defaults to ``None``. The display name of the run.
- ``free_form_tags: dict(str, str), optional``. Defaults to ``None``. The free-form tags dictionary that overrides the one defined previously.
- ``log_configuration_override_details: dict, optional``. Defaults to ``None``. The log configuration details dictionary that overrides the one defined previously.
- ``project_id: str, optional``. Defaults to ``None``. The project id that overrides the one defined previously.
- ``step_override_details: list[PipelineStepOverrideDetails], optional``. Defaults to ``None``. The step details list that overrides the one defined previously.
- ``system_tags: dict(str, dict(str, object)), optional``. Defaults to ``None``. The system tags dictionary that overrides the one defined previously.
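
As a sketch of how the ``configuration_override_details`` keys fit together, the snippet below builds such a dictionary in plain Python. `build_configuration_override` is a hypothetical helper, and the environment variable name and values are invented for the example.

```python
from typing import Dict, Optional


def build_configuration_override(
    command_line_arguments: Optional[str] = None,
    environment_variables: Optional[Dict[str, str]] = None,
    maximum_runtime_in_minutes: Optional[int] = None,
) -> dict:
    """Build a configuration_override_details dictionary, leaving out unset keys."""
    details: dict = {"type": "DEFAULT"}  # only DEFAULT is allowed
    if command_line_arguments is not None:
        details["command_line_arguments"] = command_line_arguments
    if environment_variables is not None:
        details["environment_variables"] = dict(environment_variables)
    if maximum_runtime_in_minutes is not None:
        if maximum_runtime_in_minutes <= 0:
            raise ValueError("maximum_runtime_in_minutes must be positive.")
        details["maximum_runtime_in_minutes"] = maximum_runtime_in_minutes
    return details


overrides = build_configuration_override(
    environment_variables={"EPOCHS": "5"},  # hypothetical variable
    maximum_runtime_in_minutes=30,
)
print(overrides)

# With the `pipeline` object from this notebook, the overrides would be passed as:
# pipeline_run = pipeline.run(
#     display_name="Run with overrides",
#     configuration_override_details=overrides,
# )
```

Only the keys you set are sent, so the pipeline's own defaults apply to everything else.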



In [None]:
# Run a pipeline, a pipeline run will be created and started
pipeline_run = pipeline.run()

<a id="watch_status"></a>
## Watch Status

Use the ``.show()`` method of the ``PipelineRun`` instance to retrieve the current status of the pipeline run as well as each of the step runs.

The ``.show()`` method takes the following optional parameters:
- ``mode: (str, optional)``. Defaults to ``graph``. The allowed values are ``text`` or ``graph``. This parameter renders the current status of pipeline run as either text or a graph.

- ``wait: (bool, optional)``. Defaults to ``False`` and it only renders the current status of each step run in graph. If set to ``True``, it renders the current status of each step run until the entire pipeline is complete.

- ``rankdir: (str, optional)``. Defaults to ``TB``. The allowed values are ``TB`` or ``LR``. This parameter is applicable only for graph mode and it renders the direction of the graph as either top to bottom (TB) or left to right (LR).

Render the pipeline run status as text, updating it until the entire pipeline is complete:

In [None]:
pipeline_run.show(mode="text", wait=True)

```
  Step                Status
  ------------------  ---------
  step_1:             Succeeded
  step_2:             Succeeded
  step_3:             In Progress
```

Render the pipeline run status as a graph, updating it until the entire pipeline is complete:

In [None]:
pipeline_run.show(wait=True)

<a id="monitor_logs"></a>
## Monitor Logs

Use the ``.watch()`` method on the ``PipelineRun`` instance to stream the service log or custom log of the pipeline run.
The ``.watch()`` method takes the following optional parameters:

- ``steps: (list, optional)``. Defaults to ``None`` and streams the log of the pipeline run. If a list of the step names is provided, the method streams the log of the specified pipeline step runs.
- ``log_type: (str, optional)``. Defaults to ``None``. The allowed values are ``custom_log``, ``service_log``, or ``None``. If ``None`` is provided, the method streams both the custom and service log of the pipeline run.
- ``interval: (float, optional)``. The default value is ``3``. Time interval in seconds between each request to update the logs.

Stream the custom and service log of the pipeline run.

In [None]:
pipeline_run.watch()

Stream the custom log of the specified steps.

In [None]:
pipeline_run.watch(steps=["step_1", "step_2"], log_type="custom_log")

<a id="cancel_run"></a>
## Cancel Pipeline Run
Use the ``.cancel()`` method on the ``PipelineRun`` instance to cancel a pipeline run.

Pipeline Runs can only be canceled when they are in the ACCEPTED or IN_PROGRESS state.

In [None]:
pipeline_run.cancel()

<a id="delete_run"></a>
## Delete Pipeline Run
Use the ``.delete()`` method on the ``PipelineRun`` instance to delete a pipeline run. It takes the following optional parameters:

- ``delete_related_job_runs: (bool, optional)``. Specify whether to delete related JobRuns or not. Defaults to ``True``.
- ``max_wait_seconds: (int, optional)``. The maximum time to wait in seconds. Defaults to ``1800``.

Pipeline runs can only be deleted when they are in the SUCCEEDED, FAILED, or CANCELED state.

In [None]:
pipeline_run.delete()
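
The state rules for canceling and deleting a run can be summarized in a few lines. This is a sketch based only on the states named in this notebook; `allowed_actions` is a hypothetical helper, and reading the state from a ``PipelineRun`` is left out.

```python
# States taken from the rules stated in this notebook.
CANCELLABLE_STATES = {"ACCEPTED", "IN_PROGRESS"}
DELETABLE_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def allowed_actions(lifecycle_state: str) -> list:
    """Return which of "cancel" and "delete" apply to a run in the given state."""
    state = lifecycle_state.upper()
    actions = []
    if state in CANCELLABLE_STATES:
        actions.append("cancel")
    if state in DELETABLE_STATES:
        actions.append("delete")
    return actions


for state in ["ACCEPTED", "IN_PROGRESS", "SUCCEEDED", "FAILED", "CANCELED"]:
    print(state, "->", allowed_actions(state))
```

A run that is still in progress has to be canceled, and reach the CANCELED state, before it can be deleted.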

<a id="load"></a>
# Load an Existing Pipeline

Pipelines can be loaded by specifying their ``OCID``.

In [None]:
pipeline = Pipeline.from_ocid(pipeline.id)
pipeline

<a id="delete"></a>
# Delete Pipeline

Use the ``.delete()`` method on the ``Pipeline`` instance to delete a pipeline. It takes the following optional parameters:

- ``delete_related_job_runs: (bool, optional)``. Specify whether to delete related JobRuns or not. Defaults to ``True``.
- ``delete_related_pipeline_runs: (bool, optional)``. Specify whether to delete related PipelineRuns or not. Defaults to ``True``.
- ``max_wait_seconds: (int, optional)``. The maximum time to wait, in seconds. Defaults to ``1800``.

A pipeline can only be deleted after all of its associated pipeline runs are deleted.
Alternatively, set the parameter ``delete_related_pipeline_runs`` to ``True`` to delete all associated runs in the same operation.
The delete fails if a pipeline run is in progress.

In [None]:
pipeline.delete()

<a id='clean-up'></a>
# Clean Up
The following code removes all the artifacts, Pipeline, and Pipeline Run objects created in this notebook.

In [None]:
if os.path.exists(pipeline_step_script):
    os.remove(pipeline_step_script)

if pipeline:
    pipeline.delete()

if os.path.exists("test_pipeline.yaml"):
    os.remove("test_pipeline.yaml")

<a id='magic'></a>
# Magic Commands
Use the magic commands of the ``ads.pipeline`` module to construct, control, and leverage pipelines within the Oracle Data Science service.

<a id="magic_install"></a>
## Install
Install the pipeline extension by running the following command.

In [None]:
%load_ext ads.pipeline.extension

Run ``-h`` to see supported subcommands.

In [None]:
%pipeline -h

```
Usage: pipeline [SUBCOMMAND]
Subcommand:
    run, run a pipeline from YAML or an existing ocid.
    log, stream the logs from pipeline run.
    cancel, cancel a pipeline run.
    delete, delete pipeline or pipeline run.
    show, show the pipeline orchestration.
    status, show the real-time status of a pipeline run.

Run pipeline [SUBCOMMAND] -h to see more details.
```

<a id="magic_create"></a>
## Create
Use the ``run`` subcommand to create and run a pipeline. Run ``-h`` to see allowed options.

In [None]:
%pipeline run -h

```
Usage: pipeline run [OPTIONS]
Options:
    -f, --file, optional, uri to the YAML.
    -o, --ocid, optional, ocid of existing pipeline.
    -w, --watch, optional, a flag indicating that pipeline run will be watched after submission.
    -l, --log-type, optional, should be custom_log, service_log or None. default is None.
    -h, show this help message.
```

To create a new Data Science Pipeline and run it, provide the path to the pipeline YAML file for the ``--file`` option:

In [None]:
%pipeline run --file <path_to_pipeline_yaml>

To run an existing pipeline, provide the pipeline OCID for the ``--ocid`` option

In [None]:
%pipeline run --ocid <pipeline_ocid>

<a id="magic_visualize"></a>
## Visualize
To visualize a pipeline in a graph, run the ``show`` subcommand and provide the pipeline OCID

In [None]:
%pipeline show <pipeline_ocid>

<a id="magic_watch"></a>
## Watch
Use the ``status`` subcommand to watch the current status of the pipeline run as well as each of the step runs. Run ``-h`` to see allowed options.

In [None]:
%pipeline status -h

```
Usage: pipeline status [OPTIONS] [RUN_ID]
Options:
    -x, --text, optional, a flag to show the status in text format.
    -w, --watch, optional, a flag to wait until the completion of the pipeline run.
    If set, the rendered graph will be updated until the completion of the pipeline run,
    otherwise will render one graph with the current status.
    -h, show this help message.
```

To watch the status of a pipeline run in graph mode until it finishes:

In [None]:
%pipeline status <pipeline_run_ocid> -w

To show the status of a pipeline run in text mode:

In [None]:
%pipeline status <pipeline_run_ocid> -x

<a id="magic_monitor"></a>
## Monitor
Use the ``log`` subcommand to monitor the pipeline run. Run ``-h`` to see allowed options

In [None]:
%pipeline log -h

```
Usage: pipeline log [OPTIONS] [RUN_ID]
Options:
    -l, --log-type, optional, should be either custom_log, service_log or None. default is None.
    -t, --tail, a flag to show the most recent log records.
    -d, --head, a flag to show the preceding log records.
    -n, --number, number of lines of logs to be printed. Defaults to 100.
    -h, show this help message.
```

To stream the ``custom_log``

In [None]:
%pipeline log <pipeline_run_ocid> -l custom_log

To tail the last 10 consolidated logs

In [None]:
%pipeline log <pipeline_run_ocid> -t -n 10

<a id="magic_cancel"></a>
## Cancel
To cancel a pipeline run, use the ``cancel`` subcommand and provide the pipeline run OCID

In [None]:
%pipeline cancel <pipeline_run_ocid>

<a id="magic_delete"></a>
## Delete
Use the ``delete`` subcommand to delete a pipeline or pipeline run. Run ``-h`` to see allowed options

In [None]:
%pipeline delete -h

```
Usage: pipeline delete [OCID]
Options:
    -j, --no-delete-related-job-runs, a flag to not delete the related job runs.
    -p, --no-delete-related-pipeline-runs, a flag to not delete related pipeline runs.
    -m, --max-wait-seconds, integer, maximum wait time in second for delete to complete. Defaults to 1800.
    -s, --succeeded-on-not-found, to flag to return successfully if the data we're waiting on is not found.
    -h, show this help message.
```

To delete a pipeline run, provide the pipeline run OCID

In [None]:
%pipeline delete <pipeline_run_ocid>

To delete a pipeline, provide the pipeline OCID

In [None]:
%pipeline delete <pipeline_ocid>

<a id='ref'></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)