Copyright (c) 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Model Tuning with Data Flow and ADS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---


# Overview:

Tune your models using ``ads``, ``OCI Data Flow``, and ``hyperopt``. This notebook walks through tuning a model locally with ``hyperopt``, then submitting a tuning script to ``OCI Data Flow`` using the ``ads`` Jobs API.

---

## Contents:


- <a href='#intro'>Introduction</a>
    - <a href="#setup">Setup</a>
- <a href='#local'>Establishing a Model Tuning Script Locally</a>
    - <a href="#data">Source Data</a>
    - <a href="#tune">Build and Tune Model on Spark</a>
    - <a href="#save">Save Model to the Catalog</a>
- <a href="#remote">Running on Data Flow</a>
- <a href="#cleanup">Clean Up</a>
- <a href="#reference">References</a>

---

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials under your agreement with Oracle.    

You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING). 

---

The notebook is compatible with the following [Data Science conda environments](https://docs.oracle.com/en-us/iaas/data-science/using/conda_environ_list.htm):


* [PySpark 3.0 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 5.0)
* [PySpark 3.2 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 2.0)

---

As of February 2023, the latest PySpark env is `pyspark32_p38_cpu_v2`, which can be installed by running:
```
odsc conda install -s pyspark32_p38_cpu_v2
```
in the terminal. To check for the most up-to-date version, visit the Environment Explorer from the Launcher tab.

In [None]:
# Note: You may need to install hyperopt by running: `conda install -c conda-forge hyperopt`
import os
import pyspark
import tempfile

from hyperopt import fmin, hp, tpe
from hyperopt import SparkTrials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from ads.model.framework.sklearn_model import SklearnModel
from ads.common.model_metadata import UseCaseType

<a id='intro'></a>
# Introduction

Model tuning is an embarrassingly parallel process: each trial trains and scores a model independently of the others. Once you have written your tuning script with ``hyperopt``, ``ads`` can push that script onto a Spark cluster of the size you choose. We will walk through a simple example to show the configuration options and the general workflow.
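
To make the idea of independent trials concrete, here is a minimal sketch that uses only the Python standard library. The `objective` function and the candidate values are hypothetical stand-ins for a real training run; a thread pool plays the role that Spark executors play later in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def objective(reg_param):
    """Hypothetical stand-in for a training run: the loss is lowest at 0.1."""
    return (reg_param - 0.1) ** 2


# The candidates do not depend on one another, so they can be scored concurrently.
candidates = [0.001, 0.01, 0.1, 1.0]

with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(objective, candidates))

# Pick the candidate with the smallest loss.
best_loss, best_candidate = min(zip(losses, candidates))
print(best_candidate)
```

Because no trial needs the result of another, adding workers shortens the wall-clock time without changing the outcome.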

<a id='setup'></a>
## Setup

First, we need to set up the details of the environment we want to use. These are listed in the following cell as placeholders with angle brackets on either side, "<>".

In [None]:
compartment_id = "<compartment_id>"  # os.environ["NB_SESSION_COMPARTMENT_OCID"] # For the OCI Notebook Session 
logs_bucket_uri = "<logs_bucket_uri>"
conda_pack_uri = "oci://<conda_packs_bucket>@<conda_packs_namespace>/conda_environments/cpu/PySpark 3.2 and Data Flow/2.0/pyspark32_p38_cpu_v2"  # If you publish your pack using the script below
script_bucket_uri = "oci://<script_bucket>@<script_namespace>/<optional_script_prefix>"
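
A forgotten placeholder usually surfaces only once the job is submitted, so it may be worth checking the settings first. The helper below is a hypothetical sketch (`unfilled_placeholders` is not part of ``ads``), and the example values are made up for illustration.

```python
import re

# Matches anything that still looks like <some_placeholder>.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def unfilled_placeholders(settings):
    """Return the names of settings whose values still contain a <placeholder>."""
    return sorted(name for name, value in settings.items() if PLACEHOLDER.search(value))


# Hypothetical values, for illustration only.
example_settings = {
    "compartment_id": "ocid1.compartment.oc1..exampleuniqueid",
    "logs_bucket_uri": "<logs_bucket_uri>",
    "script_bucket_uri": "oci://my-bucket@my-namespace/scripts",
}

print(unfilled_placeholders(example_settings))
```

With the example values, only `logs_bucket_uri` is reported as unfilled.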

Second, we need to publish a conda environment that contains the necessary libraries. To do so from an OCI Notebook Session (recommended), run the following shell commands in a terminal.

```
odsc conda install -s pyspark32_p38_cpu_v2
conda activate /home/datascience/conda/pyspark32_p38_cpu_v2
conda install -c conda-forge hyperopt
python3 -m pip install -U oracle_ads
odsc conda init -b <your-bucket-name> -n <your-tenancy-namespace> -a <api_key or resource_principal>
odsc conda publish -s /home/datascience/conda/pyspark32_p38_cpu_v2
```

<a id='local'></a>
# Establishing a Model Tuning Script Locally

We will load an ``sklearn`` dataset, train a simple model on it, and tune that model on a Spark cluster. Then we'll combine these steps into one script, push it to OCI Object Storage (along with our conda pack), and run our Data Flow job. ``ads`` can run and monitor this job entirely from the notebook.

<a id="data"></a>
## Source Data

We will use the iris data provided by the sklearn datasets module, split it into training and test sets, and then standardize the features.

In [None]:
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
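
The cell above fits the scaler on the training split only and reuses those statistics for the test split. The sketch below illustrates that pattern for a single feature using the standard library; the values are hypothetical, and the population standard deviation is used on the assumption that it matches the scaler's default behavior.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(value - mu) / sigma for value in column]


train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical training values
test_feature = [3.0, 6.0]                  # hypothetical held-out values

# Statistics come from the training data only.
mu, sigma = fit_scaler(train_feature)
train_scaled = transform(train_feature, mu, sigma)
test_scaled = transform(test_feature, mu, sigma)
```

Keeping the test data out of the fitting step means the evaluation score is not influenced by information from the held-out samples.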

<a id="tune"></a>
## Build and Tune Model on Spark

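The following script builds and tunes our model on a Spark cluster over the given search space. The search space and the number of evaluations can be reduced for local runs and increased for remote training. Because ``fmin`` minimizes its objective, the training function returns the negative accuracy as the loss.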

In [None]:
def train(params):
    regParam = float(params['regParam'])
    penalty = params['penalty']
    clf = LogisticRegression(C=1.0 / regParam,
                            multi_class='multinomial',
                            penalty=penalty, solver='saga', tol=0.1)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    return {'loss': -score, 'status': STATUS_OK}

search_space = {
    'penalty': hp.choice('penalty', ['l1', 'l2']),
    'regParam': hp.loguniform('regParam', -10.0, 0),
}

spark_trials = SparkTrials()
best_hyperparameters = fmin(
    fn=train,
    space=search_space,
    algo=tpe.suggest,
    trials=spark_trials,
    max_evals=32)
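
The bounds passed to `hp.loguniform` are on the log scale: the sampled value is `exp(u)` with `u` drawn uniformly between the two bounds. The following sketch reproduces that sampling with the standard library (it does not call ``hyperopt``) to show which range of `regParam`, and therefore of `C`, the search covers.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

LOW, HIGH = -10.0, 0.0  # same bounds as the 'regParam' entry in the search space


def sample_loguniform(low, high):
    """Draw exp(u) with u uniform on [low, high]."""
    return math.exp(rng.uniform(low, high))


samples = [sample_loguniform(LOW, HIGH) for _ in range(1000)]
c_values = [1.0 / s for s in samples]  # the model receives C = 1 / regParam

print(f"regParam lies between {math.exp(LOW):.2e} and {math.exp(HIGH):.0f}")
print(f"largest C in this sample: {max(c_values):.1f}")
```

Sampling on the log scale spreads trials evenly across orders of magnitude, which suits regularization strengths better than a plain uniform draw.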

<a id="save"></a>
## Save Model to the Catalog

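Now that we've found our best parameters, let's create our model and save it to the Model Catalog. Note that for `hp.choice` parameters, `fmin` returns the index of the selected option rather than the option itself, which is why `penalty` is looked up in a tuple below.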

In [None]:
# If you are running in an OCI Notebook Session and have Resource Principal set up, run the following.

from ads import set_auth
set_auth("resource_principal")

In [None]:
artifact_dir = tempfile.TemporaryDirectory()

best_model = LogisticRegression(C=1.0 / best_hyperparameters['regParam'],
                        multi_class='multinomial',
                        penalty=('l1', 'l2')[best_hyperparameters['penalty']], solver='saga', tol=0.1).fit(X_train, y_train)

model = SklearnModel(estimator=best_model, artifact_dir=artifact_dir.name)

model.prepare(inference_conda_env=conda_pack_uri, force_overwrite=True, 
              use_case_type=UseCaseType.MULTINOMIAL_CLASSIFICATION)
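
The index lookup for `penalty` is easy to get wrong if the option order changes. Below is a minimal sketch of a helper that performs the translation in one place; `to_estimator_kwargs` and `PENALTY_CHOICES` are hypothetical names, and the example dictionary only imitates the shape of an `fmin` result. ``hyperopt`` also offers `space_eval` for this purpose, which may be preferable in real code.

```python
PENALTY_CHOICES = ("l1", "l2")  # must match the order passed to hp.choice


def to_estimator_kwargs(best):
    """Translate an fmin-style result into LogisticRegression keyword arguments."""
    return {
        "C": 1.0 / float(best["regParam"]),
        "penalty": PENALTY_CHOICES[int(best["penalty"])],
    }


# Hypothetical result, shaped like the dictionary returned by fmin.
example_best = {"penalty": 1, "regParam": 0.5}
print(to_estimator_kwargs(example_best))
```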

Before saving to the catalog, let's confirm our model is working using the `.verify()` method. Calling `verify` invokes the `score.py` file of the prepared model artifact. We calculate accuracy as the fraction of predictions that match the test labels:

In [None]:
sum(model.verify(X_test)['prediction'] == y_test)/(len(y_test))
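
The one-line accuracy expression can be written more explicitly. The sketch below is a hypothetical helper that works on plain Python sequences; the sample predictions and labels are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(1 for p, t in zip(predictions, labels) if p == t)
    return matches / len(labels)


# Hypothetical predictions and labels, for illustration only.
example_accuracy = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(example_accuracy)
```

Three of the four example predictions match, so the result is 0.75.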

If we're happy with our model, we can push it up to the Model Catalog using the `.save()` method.

In [None]:
model_id = model.save()

<a id="remote"></a>
# Running on Data Flow

Now we can put it all together and run our script on Data Flow. The following cells write the workflow we've gone through above into a single file, script.py, swapping the iris data for the larger MNIST dataset. Then ``ads`` runs and monitors this script as a Data Flow job.

In [None]:
from ads.jobs import DataFlow, Job, DataFlowRuntime
from uuid import uuid4
import os
import tempfile

td = tempfile.TemporaryDirectory()
local_script = os.path.join(td.name, "script.py")

In [None]:
%%writefile {local_script}
import pyspark
import tempfile
from ads import set_auth
from ads.common.model_metadata import UseCaseType
from ads.model.framework.sklearn_model import SklearnModel
from hyperopt import fmin, hp, tpe
from hyperopt import SparkTrials, STATUS_OK
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def main():
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

    X = X.values.reshape((X.shape[0], -1))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=5000, test_size=10000)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    def train(params):
        regParam = float(params['regParam'])
        penalty = params['penalty']
        clf = LogisticRegression(C=1.0 / regParam,
                                multi_class='multinomial',
                                penalty=penalty, solver='saga', tol=0.1)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        return {'loss': -score, 'status': STATUS_OK}

    search_space = {
        'penalty': hp.choice('penalty', ['l1', 'l2']),
        'regParam': hp.loguniform('regParam', -10.0, 0),
    }

    spark_trials = SparkTrials()
    best_hyperparameters = fmin(
        fn=train,
        space=search_space,
        algo=tpe.suggest,
        trials=spark_trials,
        max_evals=32)
    print(best_hyperparameters)

    artifact_dir = tempfile.TemporaryDirectory()
    best_model = LogisticRegression(C=1.0 / best_hyperparameters['regParam'],
                            multi_class='multinomial',
                            penalty=('l1', 'l2')[best_hyperparameters['penalty']], solver='saga', tol=0.1).fit(X_train, y_train)
    
    set_auth("resource_principal")

    model = SklearnModel(estimator=best_model, artifact_dir=artifact_dir.name)
    model.prepare(inference_conda_env="pyspark32_p38_cpu_v2", force_overwrite=True, 
                use_case_type=UseCaseType.MULTINOMIAL_CLASSIFICATION)
    print(model.verify(X_test))
    
    model_id = model.save()
    print(f"Model ID: {model_id}")
    
    

if __name__ == "__main__":
    main()
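
Since a syntax error in the generated file would only show up after the job starts, a quick local check can save a round trip. The sketch below parses a script with the standard library's `ast` module; the short `example_source` is a hypothetical stand-in for the real script, and the check covers syntax only, not imports or runtime behavior.

```python
import ast
import os
import tempfile

# Hypothetical stand-in for the generated script; the real one is much longer.
example_source = (
    'def main():\n'
    '    print("tuning")\n'
    '\n'
    'if __name__ == "__main__":\n'
    '    main()\n'
)

with tempfile.TemporaryDirectory() as workdir:
    example_script = os.path.join(workdir, "script.py")
    with open(example_script, "w") as handle:
        handle.write(example_source)

    # ast.parse raises SyntaxError if the file is not valid Python.
    with open(example_script) as handle:
        tree = ast.parse(handle.read(), filename=example_script)

defined = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
print(defined)
```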

In [None]:
name = f"dataflow-app-{str(uuid4())}"
dataflow_configs = DataFlow()\
    .with_compartment_id(compartment_id)\
    .with_logs_bucket_uri(logs_bucket_uri)\
    .with_driver_shape("VM.Standard2.1") \
    .with_executor_shape("VM.Standard2.1") \
    .with_num_executors(2) \
    .with_spark_version("3.2.1")
runtime_config = DataFlowRuntime()\
    .with_script_uri(local_script)\
    .with_script_bucket(script_bucket_uri) \
    .with_custom_conda(conda_pack_uri)
df = Job(name=name, infrastructure=dataflow_configs, runtime=runtime_config)
df_run = df.create().run()

Monitor the logs and status by running the following:

In [None]:
df_run.watch()

## Pull Model To Local Environment

Finally, we can pull that model locally for further testing:

In [None]:
download_dir = tempfile.TemporaryDirectory()

model_reload = SklearnModel.from_model_catalog(
    model_id,
    artifact_dir=download_dir.name,  # pass the directory path, not the TemporaryDirectory object
    force_overwrite=True,
)

<a id="cleanup"></a>
# Clean Up

To delete the resources from object storage, we need to create a file system instance. The following uses resource_principal, but other auth mechanisms are accepted, such as config. Then call `rm` on each file put into OCI Object Storage.

In [None]:
from ocifs import OCIFileSystem

td.cleanup()
artifact_dir.cleanup()
download_dir.cleanup()
fs = OCIFileSystem()
fs.rm(conda_pack_uri)
fs.rm(os.path.join(script_bucket_uri, "script.py"))
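
The clean-up cell builds the script's object URI with `os.path.join`, which uses the local platform's path separator. As an alternative, here is a minimal sketch that always joins with forward slashes; `object_uri` is a hypothetical helper, and the bucket and namespace are made up for illustration.

```python
import posixpath


def object_uri(prefix_uri, name):
    """Join an object name onto an oci:// prefix using forward slashes only."""
    scheme, sep, rest = prefix_uri.partition("://")
    return f"{scheme}{sep}{posixpath.join(rest, name)}"


# Hypothetical bucket and namespace, for illustration only.
example_uri = object_uri("oci://my-bucket@my-namespace/scripts", "script.py")
print(example_uri)
```

The helper leaves the `oci://` scheme untouched and avoids a doubled slash when the prefix already ends with one.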

<a id="reference"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Understanding Conda Environments](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#conda_understand_environments)
- [Use Resource Manager to Configure Your Tenancy for Data Science](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/orm-configure-tenancy.htm)