Oracle Data Science service sample notebook.

Copyright (c) 2021, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">ONNX Integration with the Accelerated Data Science (ADS) SDK</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

This notebook showcases the integration between the Open Neural Network Exchange ([ONNX](https://onnx.ai/)) format, `ADS`, and `sklearn`. ONNX is an open standard for machine learning interoperability that enables easy deployment of models. ONNX is an extensible computational graph model with built-in operators and machine-independent data types. The operators are portable across hardware and frameworks. The computational flow is an acyclic graph that contains information about the flow of the data and also metadata. Each node in the data flow graph contains an operator that can accept multiple inputs and produce multiple outputs.

Compatible conda pack: [ONNX 1.13 for GPU on Python 3.9 (version 1.0)](https://docs.oracle.com/en-us/iaas/data-science/using/conda-onnx-fam.htm)

---

## Contents:

- <a href="#sklearn-ads">Build a Model</a>
- <a href="#onnx-serial">Model Serialization with ONNX</a>
  - <a href="#model-artifacts">Model Artifacts</a>
  - <a href="#model-workflow">Model Workflow</a>
- <a href="#model-prediction">Model Prediction</a>
  - <a href="#model-prediction-adsmodel">Prediction using `ADSModel`</a>
  - <a href="#model-prediction-onnx">Prediction using OnnxRuntime</a>
    - <a href="#model-prediction-missing">Prediction with Missing Values</a>
- <a href="#ref">References</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color="gray">
Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).  
</font>

---

## Optional Installation of Pydot

Before executing this notebook, you may optionally install a library called `pydot`. This library is needed only to visualize a graph representation of the ONNX model. Set the flag `use_pydot` to `True` in the next cell to trigger the installation of `pydot` and enable the code cells that create `pydot` visualizations.

In [None]:
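import subprocess
import sys

# Set to False to skip the optional pydot installation and the graph plots.
use_pydot = True

if use_pydot:
    # Install with the interpreter that runs this notebook so that the
    # package lands in the active conda environment.
    result = subprocess.run([sys.executable, '-m', 'pip', 'install', 'pydot'],
                            capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)
    if result.returncode == 0:
        from onnx.tools.net_drawer import GetPydotGraph, GetOpNodeProducer
    else:
        # Disable the plots instead of stopping the notebook.
        use_pydot = False
        print("pydot installation failed. All pydot graphs are disabled in this notebook.")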
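
Before installing anything, it can help to check what is already present. The following is a minimal sketch, not part of the original notebook, that uses only the standard library to test whether `pydot` is importable and whether the Graphviz `dot` executable (which a later cell calls to render the plot) is on the `PATH`. The helper name `graph_tools_available` is hypothetical.

```python
import importlib.util
import shutil


def graph_tools_available(module_name="pydot", executable="dot"):
    """Return (module_found, executable_found) for the plotting prerequisites."""
    # find_spec() returns None when a top-level module cannot be located.
    module_found = importlib.util.find_spec(module_name) is not None
    # which() returns None when the executable is not on the PATH.
    executable_found = shutil.which(executable) is not None
    return module_found, executable_found


has_pydot, has_dot = graph_tools_available()
print(f"pydot importable: {has_pydot}; Graphviz 'dot' on PATH: {has_dot}")
```

If either value is `False`, the graph cell later in the notebook is unlikely to produce an image.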

In [None]:
import logging
import matplotlib.pyplot as plt
import onnx
import onnxruntime
import os
import random
import shutil
import tempfile
import warnings

from ads.common.model import ADSModel
from ads.dataset.dataset_browser import DatasetBrowser
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

In [None]:
# Check GPU

import onnxruntime as ort
assert ort.get_device() == 'GPU', "onnxruntime does not report a GPU device; use a GPU shape and the GPU conda pack."

print(onnx.__version__, onnxruntime.__version__)
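
The GPU check above stops the notebook on a CPU-only machine. As a gentler alternative, the following sketch filters a preferred list of execution providers down to those that are available. It is an illustration under stated assumptions: `select_providers` is a hypothetical helper, and the `available` list is hard-coded here to mimic a CPU-only machine.

```python
def select_providers(preferred, available):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("None of the preferred execution providers is available.")
    return chosen


preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
available = ["CPUExecutionProvider"]  # hypothetical: a CPU-only machine
print(select_providers(preferred, available))  # prints ['CPUExecutionProvider']
```

In this notebook, the `available` list could instead come from `onnxruntime.get_available_providers()`.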

<a id='sklearn-ads'></a>
# Build a Model

In the next cell, the `iris` dataset is loaded, and then split into a training and a test set. A pipeline is created to scale the data and perform a logistic regression. This `sklearn` pipeline is then converted into an `ADSModel`.

In [None]:
ds = DatasetBrowser.sklearn().open("iris")
train, test = ds.train_test_split()
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('classifier', LogisticRegression())])
pipe.fit(train.X, train.y)
adsmodel = ADSModel.from_estimator(pipe)
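
To make the scaling step of the pipeline concrete, here is a plain-Python sketch of column standardization, the transformation that `StandardScaler` applies: each value has its column mean subtracted and is divided by the column's population standard deviation. The function name and the sample rows are hypothetical, and this is an illustration, not the `sklearn` implementation.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # A constant column has a standard deviation of 0; divide by 1 instead.
    stds = [pstdev(column) or 1.0 for column in columns]
    return [[(value - mean) / std for value, mean, std in zip(row, means, stds)]
            for row in rows]


sample_rows = [[1.0, 10.0], [3.0, 10.0]]  # hypothetical feature values
print(standardize_columns(sample_rows))
```

The logistic regression in the pipeline is then fitted on the standardized values rather than on the raw measurements.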

<a id="onnx-serial"> </a>
# Model Serialization with ONNX

This example uses the `ADSModel` class. The class supports a number of popular model libraries, including AutoML, scikit-learn, XGBoost, LightGBM, and PyTorch. With `ADSModel` objects, the `prepare()` method is used to create the model artifacts. If you want to use an unsupported model type, the model must be manually serialized into ONNX and put in the folder that was created by a call to the `prepare_generic_model()` method.

`ADSModel.prepare()` does the following:
- Serializes the model to the ONNX format in a file named `model.onnx`.
- Creates a file to save metadata about the data samples.
- Calls `prepare_generic_model`.

Thus, a call to `ADSModel.prepare()` is similar to calling `ADSModel.prepare_generic_model()` except that `ADSModel.prepare()` also serializes the model.

The next cell creates a temporary directory, serializes the model into an ONNX format, stores sample data, and then loads the ONNX model into memory.

In [None]:
model_path = tempfile.mkdtemp()
model_artifact = adsmodel.prepare(model_path, X_sample=test.X[:5], 
                                  y_sample=test.y[:5], force_overwrite=True, data_science_env=True)
onnx_model = onnx.load_model(os.path.join(model_path, "model.onnx"))

<a id="model-artifacts"></a>
## Model Artifacts

The prediction pipeline is written to the `score.py` file in the `model_path` directory, so the prediction script used by the `ADSModel` class can be customized to meet your application's specific requirements. This file is validated to confirm that it imports all required libraries so that the model works correctly when it is deployed. More details about using the `score.py` file are found in the `model_catalog.ipynb` notebook.

The next cell outputs the contents of the `score.py` file.

In [None]:
with open(os.path.join(model_path, "score.py"), "r") as f:
    print(f.read())
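
A quick way to confirm that the artifact directory contains the files discussed so far is to compare its listing against the expected names. The following sketch assumes only the two file names mentioned in this notebook (`model.onnx` and `score.py`). The helper `missing_artifacts` is hypothetical, and a temporary stand-in directory is used so that the example runs on its own.

```python
import os
import tempfile


def missing_artifacts(directory, expected=("model.onnx", "score.py")):
    """Return the expected file names that are not present in the directory."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]


# Stand-in directory that contains only an empty model.onnx file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "model.onnx"), "wb").close()
    print(missing_artifacts(demo_dir))  # prints ['score.py']
```

In this notebook, calling `missing_artifacts(model_path)` would be expected to return an empty list after `prepare()` has run.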

<a id="model-workflow"></a>
## Model Workflow

ONNX is an extensible computational graph model with built-in operators and machine-independent data types. The computational flow is an acyclic graph that contains information about the flow of the data and also metadata. Each node in the data flow graph contains an operator that can accept multiple inputs and produce multiple outputs. The next cell generates a plot of the ONNX model's acyclic graph.

In [None]:
if use_pydot:
    graph_path = tempfile.mkdtemp()
    graph_dot = os.path.join(graph_path, 'model.dot')
    graph_png = os.path.join(graph_path, 'model.dot.png')
    graph = GetPydotGraph(onnx_model.graph, name=onnx_model.graph.name,
                          rankdir="TB",
                          node_producer=GetOpNodeProducer("docstring", color="yellow",
                                                          fillcolor="yellow", style="filled"))
    graph.write_dot(graph_dot)
    os.system(f"dot -O -Gdpi=300 -Tpng {graph_dot}")
    image = plt.imread(graph_png)
    shutil.rmtree(graph_path)
    fig, ax = plt.subplots(figsize=(40, 20))
    ax.imshow(image)
    ax.axis('off')
    plt.show()
else: 
    print("Skipping ONNX graph")
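
The idea of an acyclic data flow graph described in this section can be illustrated with a toy example that does not use the ONNX API at all. In the sketch below, each node holds an operator and the names of its inputs, and the nodes are evaluated in topological order so that every input is computed before it is needed. The node names, operators, and values are all hypothetical, and each node produces a single output for simplicity.

```python
from graphlib import TopologicalSorter

# Each node maps an output name to (operator, input names). Hypothetical toy graph.
nodes = {
    "scaled": (lambda x, mean, scale: (x - mean) / scale, ["x", "mean", "scale"]),
    "logit": (lambda s, weight, bias: s * weight + bias, ["scaled", "weight", "bias"]),
    "label": (lambda z: int(z > 0), ["logit"]),
}


def run_graph(nodes, feeds):
    """Evaluate every node once, in an order that respects its dependencies."""
    values = dict(feeds)
    dependencies = {name: set(inputs) for name, (_, inputs) in nodes.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        if name in nodes:  # names that are not nodes are graph inputs (feeds)
            operator, inputs = nodes[name]
            values[name] = operator(*(values[i] for i in inputs))
    return values


result = run_graph(nodes, {"x": 5.0, "mean": 3.0, "scale": 2.0, "weight": 2.0, "bias": -1.0})
print(result["scaled"], result["logit"], result["label"])
```

An ONNX runtime applies the same principle to the serialized graph, with the operators drawn from the ONNX operator set.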

<a id="model-prediction"></a>
# Model Prediction

Because an `ADSModel` object was created, it can be used to make predictions. However, ONNX can also make predictions directly, and it can deal with missing data in the predictors.

<a id="model-prediction-adsmodel"></a>
## Prediction using ADSModel

The `ADSModel` has the method `predict()` that accepts predictors, in the form of a `DataFrame` object, and returns predicted values. The next cell demonstrates how to make predictions using the test data.

In [None]:
adsmodel.predict(test.X)

<a id="model-prediction-onnx"></a>
## Prediction using OnnxRuntime

An `InferenceSession` object is needed to create a session connection to the ONNX model. This session is then used to pass the model parameters to the `run()` method. While `ADSModel.predict()` accepts these parameters as a `DataFrame`, ONNX accepts them as a dictionary. The parameters are stored in a key labeled `input` and the values are in a list of lists.
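
The dictionary layout described above can be sketched in plain Python. The example below builds a feed dictionary with a single `input` key whose value is a list of rows, each row being a list of feature values in a fixed order. The helper `build_feed`, the column names, and the sample records are hypothetical.

```python
def build_feed(records, feature_order, input_name="input"):
    """Convert a list of record dictionaries into an {input_name: rows} feed."""
    rows = [[float(record[name]) for name in feature_order] for record in records]
    return {input_name: rows}


records = [  # hypothetical measurements
    {"sepal_length": 5.1, "sepal_width": 3.5},
    {"sepal_length": 6.2, "sepal_width": 2.9},
]
print(build_feed(records, ["sepal_length", "sepal_width"]))
```

Fixing the feature order explicitly matters because the model identifies features by position, not by name.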

The next cell creates the `InferenceSession` object, requests a set of predictions, and prints the predicted values.

In [None]:
session = onnxruntime.InferenceSession(os.path.join(model_path, "model.onnx"), 
                                        providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
pred_class, pred_probability = session.run(None,  
    {'input': [[value for value in row] for index, row in test.X.iterrows()]})
pred_class

The `run()` method returns two outputs. The first is the class predictions, as shown in the output of the preceding cell. Each prediction is the class with the highest probability. The second is a list of the probabilities of every class for each prediction. This information can be used to assess the confidence that the model has in a prediction. For example, the first predicted class was `setosa`. By examining the probabilities, it can be seen that the evidence is strong that this is a correct prediction because the probabilities for the other classes are extremely low.

In [None]:
pred_probability[0]
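
One simple way to quantify that confidence is the margin between the two most probable classes. The sketch below assumes that a row of probabilities is a mapping from class label to probability, which is an assumption about the output layout. The helper `top_prediction` and the probability values are hypothetical.

```python
def top_prediction(probabilities):
    """Return the most probable label and its margin over the runner-up."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, best_p - second_p


example_row = {"setosa": 0.97, "versicolor": 0.02, "virginica": 0.01}  # hypothetical values
label, margin = top_prediction(example_row)
print(label, round(margin, 2))
```

A margin close to 1 indicates a clear-cut prediction, while a margin near 0 means the model could barely separate the top two classes.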

<a id="model-prediction-missing"></a>
### Prediction with Missing Values

ONNX can often handle missing data even when the underlying structural model cannot. In this example, a logistic regression is used, and this class of model generally can't handle missing data. However, the ONNX inference engine can generally deal with this by imputing the data.

In the next cell, the test data has a small proportion of values masked (removed from the dataset). The ONNX `run()` method is called to make predictions.

In [None]:
random.seed(42)
pred_class, pred_probability = session.run(None,  
    {'input': [[None if random.random() < 0.1 else value for value in row] 
               for index, row in test.X.iterrows()]})
pred_class
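
To show what imputing missing values means in practice, the following sketch replaces each `None` with the mean of the observed values in the same column. Mean imputation is used here only as a common, easy-to-follow strategy. It is not a description of what the ONNX inference engine does internally, and the helper name and sample rows are hypothetical.

```python
from statistics import fmean


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    # fmean() raises StatisticsError if a column has no observed values at all.
    means = [fmean([value for value in column if value is not None])
             for column in columns]
    return [[mean if value is None else value for value, mean in zip(row, means)]
            for row in rows]


masked_rows = [[1.0, None], [3.0, 4.0], [None, 6.0]]  # hypothetical masked data
print(impute_column_means(masked_rows))
```

Comparing predictions on the imputed rows with predictions on the unmasked data is one way to gauge how sensitive a model is to missing values.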

<a id="ref"></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [Managing Models](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/manage-models.htm)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [ONNX Homepage](https://onnx.ai/)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Using Notebook Sessions to Build and Train Models](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm)