Copyright (c) 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Model Training with Data Flow and ADS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---


# Overview:

Train your models using ``ads``, ``OCI Data Flow``, and ``pyspark``'s MLlib. This notebook runs through an example that trains a model locally, then submits the same script to ``OCI Data Flow`` using the ``ads`` Jobs API.

---

## Contents:


- <a href='#intro'>Introduction</a>
    - <a href="#setup">Setup</a>
- <a href='#local'>Establishing a Model Tuning Script Locally</a>
    - <a href="#data">Source Data</a>
    - <a href="#build">Build a Model on Spark</a>
- <a href="#remote">Running on Data Flow</a>
- <a href="#cleanup">Clean Up</a>
- <a href="#reference">References</a>

---

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials under your agreement with Oracle.    

---

The notebook is compatible with the following [Data Science conda environments](https://docs.oracle.com/en-us/iaas/data-science/using/conda_environ_list.htm):

* [PySpark 3.2 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 2.0)

---

As of September 2022, the latest PySpark environment is `pyspark32_p38_cpu_v2`. Install it by running the following command in a terminal:
```
odsc conda install -s pyspark32_p38_cpu_v2
```
To check for the most up-to-date version, open the Environment Explorer from the Launcher tab.

In [None]:
# Note: You may need to install additional dependencies: `conda install -c conda-forge requests aiohttp`

import fsspec
import os
import tempfile
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from ads.model.framework.spark_model import SparkPipelineModel

<a id='intro'></a>
# Introduction

Model training with MLlib can be scaled out to very large clusters, which shortens training time for large workloads. ``ads`` can manage the creation of the Spark cluster (using Data Flow) and exposes its full configuration to the notebook user.

<a id='setup'></a>
## Setup

First, we need to set up the details of the environment we want to use. These are listed in the following cell as placeholders with angle brackets on either side ("<>"). Replace each placeholder with your own value.

In [None]:
compartment_id = "<compartment_id>"  # os.environ["NB_SESSION_COMPARTMENT_OCID"] # For the OCI Notebook Session 
logs_bucket_uri = "<logs_bucket_uri>"
conda_packs_bucket = "<conda_packs_bucket>"
conda_packs_namespace = "<conda_packs_namespace>"
script_bucket_uri = "oci://<script_bucket>@<script_namespace>/<optional_script_prefix>"
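# Path of the conda pack published in the next step (PySpark 3.2 and Data Flow, version 2.0); adjust it if your published path differs.
conda_pack_uri = f"oci://{conda_packs_bucket}@{conda_packs_namespace}/conda_environments/cpu/PySpark 3.2 and Data Flow/2.0/pyspark32_p38_cpu_v2"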
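
Before going further, it can help to check that no placeholder was left unfilled and that the bucket URIs follow the `oci://<bucket>@<namespace>/<path>` shape used above. The following is a minimal sketch using only the standard library; the helper names `unfilled_placeholders` and `parse_oci_uri` and the example values are assumptions made for illustration and are not part of ``ads``.

```python
import re

# Pattern for URIs of the form oci://<bucket>@<namespace>/<optional/path>
OCI_URI = re.compile(r"^oci://(?P<bucket>[^@/]+)@(?P<namespace>[^/]+)(?:/(?P<path>.*))?$")


def unfilled_placeholders(settings):
    """Return the sorted names of settings that still contain a <placeholder>."""
    return sorted(name for name, value in settings.items() if re.search(r"<[^<>]+>", value))


def parse_oci_uri(uri):
    """Split an oci:// URI into (bucket, namespace, path); raise ValueError if malformed."""
    match = OCI_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid oci:// URI: {uri!r}")
    return match.group("bucket"), match.group("namespace"), match.group("path") or ""


# Hypothetical values, for illustration only.
example_settings = {
    "compartment_id": "<compartment_id>",
    "logs_bucket_uri": "oci://my-logs@my-namespace/dataflow",
}
print(unfilled_placeholders(example_settings))
print(parse_oci_uri(example_settings["logs_bucket_uri"]))
```

In the notebook, the same checks could be run against a dictionary built from the variables defined in the cell above.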

Second, we need to publish a conda environment with the necessary libraries. To do so from an OCI Notebook Session (recommended), run the following commands in a terminal.

```
odsc conda install -s pyspark32_p38_cpu_v2
conda activate /home/datascience/conda/pyspark32_p38_cpu_v2
conda install -c conda-forge requests aiohttp
python3 -m pip install -U oracle_ads
odsc conda init -b <conda_packs_bucket> -n <conda_packs_namespace> -a <api_key or resource_principal>
odsc conda publish -s /home/datascience/conda/pyspark32_p38_cpu_v2
```

<a id='local'></a>
# Establishing a Model Tuning Script Locally

We will load an example dataset from the Spark GitHub repository, then train a simple model on it using MLlib and Spark. Ultimately, we'll combine this into one script, push it to OCI Object Storage (along with our conda pack), and run it as a Data Flow job. ``ads`` can run and monitor this job entirely from the notebook.

<a id="data"></a>
## Source Data

We will use the sample libsvm dataset provided in the Spark GitHub repository.
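
Each line of a libsvm file holds a label followed by sparse `index:value` pairs, with one-based indices; Spark's loader converts these to zero-based indices when it builds the `features` vector. The following standard-library sketch shows how one such line breaks down. The function name `parse_libsvm_line` and the sample line are made up for illustration, and this is not how Spark parses the file internally.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-formatted line into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        # libsvm indices are one-based; shift them to zero-based
        features[int(index) - 1] = float(value)
    return label, features


# Made-up line for illustration, not taken from the sample file.
label, features = parse_libsvm_line("1 3:0.5 10:2 42:255")
print(label, features)
```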

In [None]:
spark = SparkSession.builder.getOrCreate()

with fsspec.open("https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt") as src:
    with open("./sample_libsvm_data.txt", 'wb') as dest:
        dest.write(src.read())
data = spark.read.format("libsvm").load("./sample_libsvm_data.txt")

<a id="build"></a>
## Build a Model on Spark

The following script will build our MLlib model on a Spark cluster locally:

In [None]:
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

(trainingData, testData) = data.randomSplit([0.7, 0.3])

dt = DecisionTreeRegressor(featuresCol="indexedFeatures")
pipeline = Pipeline(stages=[featureIndexer, dt])
pipeline_model = pipeline.fit(trainingData)

predictions = pipeline_model.transform(testData)
predictions.select("prediction", "label", "features").show(5)

evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = pipeline_model.stages[1]
print(treeModel)
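
The metric requested from the evaluator, root mean squared error, is the square root of the average squared difference between label and prediction. As a worked illustration in plain Python (the function name `rmse` and the sample values are assumptions for this sketch, not the Spark implementation):

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences of numbers."""
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / n)


# Made-up values: the errors are 0, 0, 1 and -1, so the mean squared error is 0.5.
print(rmse([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]))
```

Lower values mean the predictions sit closer to the labels; a value of 0 means every prediction matched exactly.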

## Save to the Model Catalog

In [None]:
# If you are running in an OCI Notebook Session and have Resource Principal set up, run the following.

from ads import set_auth
set_auth("resource_principal")

In [None]:
model = SparkPipelineModel(estimator=pipeline_model, artifact_dir=tempfile.mkdtemp())
model.prepare(
    inference_conda_env=conda_pack_uri,
    force_overwrite=True,
    X_sample=testData.drop("label"),
    y_sample=testData.drop("features"),
)

#### Before we save the model to our catalog, let's confirm that our score.py runs without errors and that the result is reasonable.

In [None]:
import numpy as np

y_true = np.asarray([x.label for x in testData.drop("features").collect()])
y_pred = model.verify(testData.drop("label"))['prediction'] 

y_true == y_pred
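
The comparison above is element by element and requires exact equality. Assuming the predictions come back as a sequence of the same length as the labels, one hedged alternative is to summarize how many predictions fall within a small tolerance of their label. The helper name `fraction_close` and the sample values are made up for this sketch.

```python
import math


def fraction_close(y_true, y_pred, abs_tol=1e-6):
    """Return the share of predictions within abs_tol of the corresponding label."""
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("inputs must have the same length")
    if n == 0:
        return 0.0
    matches = sum(
        math.isclose(t, p, rel_tol=0.0, abs_tol=abs_tol) for t, p in zip(y_true, y_pred)
    )
    return matches / n


# Made-up labels and predictions for illustration.
print(fraction_close([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.9999999, 1.0]))
```

A single number like this is easier to judge at a glance than a long array of booleans, though the right tolerance depends on the model and the data.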

#### Now we are ready to save our model to the catalog.

In [None]:
model_id = model.save()

#### Before deploying our model, let's confirm we've done the necessary steps:

In [None]:
model.summary_status()

And deploy our model:

In [None]:
model.deploy()

<a id="remote"></a>
# Running on Data Flow

Now we can put it all together and run our script on Data Flow. The following cells write the code we've gone through above into a single file, `script.py`, and then use ``ads`` to run and monitor this script as a Data Flow Job.

In [None]:
from ads.jobs import DataFlow, Job, DataFlowRuntime
from uuid import uuid4
import os

td = tempfile.TemporaryDirectory()
local_script = os.path.join(td.name, "script.py")

In [None]:
%%writefile {local_script}
import pyspark
import fsspec
import tempfile
from ads import set_auth
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
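from ads.model.framework.spark_model import SparkPipelineModel

# Notebook variables are not visible inside this script, so conda_pack_uri is defined here.
# Replace the placeholder with the conda pack URI from the Setup section.
conda_pack_uri = "<conda_pack_uri>"

def main():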
    print("Hello World")
    print("Spark version is ", pyspark.__version__)
    
    spark = SparkSession.builder.getOrCreate()
    
    with fsspec.open("https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt") as src:
        with open("./sample_libsvm_data.txt", 'wb') as dest:
            dest.write(src.read())
    data = spark.read.format("libsvm").load("./sample_libsvm_data.txt")

    featureIndexer =\
        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    dt = DecisionTreeRegressor(featuresCol="indexedFeatures")
    pipeline = Pipeline(stages=[featureIndexer, dt])
    pipeline_model = pipeline.fit(trainingData)
    
    predictions = pipeline_model.transform(testData)
    predictions.select("prediction", "label", "features").show(5)
    
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    treeModel = pipeline_model.stages[1]
    print(treeModel)

    set_auth("resource_principal")

    model = SparkPipelineModel(estimator=pipeline_model, artifact_dir=tempfile.mkdtemp())
    model.prepare(
        inference_conda_env=conda_pack_uri,
        force_overwrite=True,
        X_sample=testData.drop("label"),
        y_sample=testData.drop("features"),
    )

    model_id = model.save()
    print(f"Model ID: {model_id}")
    
if __name__ == "__main__":
    main()
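
Values such as the conda pack URI live in the notebook session, not in the submitted script, so the script needs its own way of receiving them. One hedged option, sketched here with the standard library only, is to read them from the command line. The option name `--conda-pack-uri` is a hypothetical choice for this example, and how arguments are attached to the Data Flow run is left to the job configuration and not shown.

```python
import argparse


def parse_args(argv=None):
    """Parse the settings the training script needs from the command line."""
    parser = argparse.ArgumentParser(description="Train and save a Spark pipeline model.")
    # --conda-pack-uri is a hypothetical option name chosen for this sketch.
    parser.add_argument(
        "--conda-pack-uri",
        required=True,
        help="oci:// URI of the published conda pack used for inference",
    )
    return parser.parse_args(argv)


# Simulated command line; a real script would call parse_args() with no arguments.
args = parse_args(["--conda-pack-uri", "oci://my-bucket@my-namespace/path/to/pack"])
print(args.conda_pack_uri)
```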

In [None]:
dataflow_configs = DataFlow()\
    .with_compartment_id(compartment_id)\
    .with_logs_bucket_uri(logs_bucket_uri)\
    .with_driver_shape("VM.Standard2.1") \
    .with_executor_shape("VM.Standard2.1") \
    .with_num_executors(2) \
    .with_spark_version("3.2.1")
runtime_config = DataFlowRuntime()\
    .with_script_uri(local_script)\
    .with_script_bucket(script_bucket_uri) \
    .with_custom_conda(conda_pack_uri)
df = Job(infrastructure=dataflow_configs, runtime=runtime_config)
df_run = df.create().run()

Monitor the logs and status by running the following:

In [None]:
df_run.watch()
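
The training script prints a line of the form `Model ID: ...` after saving the model. Once the run's log output is available as text, by whatever means it is retrieved, that ID can be pulled out with a regular expression. The sketch below assumes the log is already held in a string; the helper name `extract_model_id` and the sample log, including its identifier, are made up for illustration.

```python
import re

# Matches the line printed by the training script: "Model ID: <value>"
MODEL_ID_LINE = re.compile(r"^Model ID:\s*(\S+)\s*$", re.MULTILINE)


def extract_model_id(log_text):
    """Return the last model ID printed in log_text, or None if there is none."""
    matches = MODEL_ID_LINE.findall(log_text)
    return matches[-1] if matches else None


# Made-up log excerpt; the identifier below is a placeholder, not a real OCID.
sample_log = (
    "Hello World\n"
    "Root Mean Squared Error (RMSE) on test data = 0.1\n"
    "Model ID: ocid1.datasciencemodel.oc1..example\n"
)
print(extract_model_id(sample_log))
```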

## Pull Model To Local

Finally, we can pull that model locally for further testing:

In [None]:
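download_dir = tempfile.TemporaryDirectory()

# model_id is the value returned by model.save() earlier in this notebook.
# To load the model saved by the Data Flow run, use the Model ID printed in its logs.
model_reload = SparkPipelineModel.from_model_catalog(
    model_id,
    artifact_dir=download_dir.name,  # pass the directory path, not the TemporaryDirectory object
    force_overwrite=True,
)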
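
A note on `tempfile.TemporaryDirectory`: it returns a wrapper object rather than a path string, and the path itself is held in its `.name` attribute. The short standard-library sketch below illustrates the difference and what `cleanup()` does; the file name `marker.txt` is arbitrary.

```python
import os
import tempfile

# TemporaryDirectory returns a wrapper object, not a path string.
scratch = tempfile.TemporaryDirectory()
print(type(scratch).__name__)
print(isinstance(scratch.name, str))  # .name holds the directory path

# Use .name wherever a path is expected.
marker = os.path.join(scratch.name, "marker.txt")
with open(marker, "w") as handle:
    handle.write("ok")
print(os.path.exists(marker))

# cleanup() removes the directory and everything inside it.
scratch.cleanup()
print(os.path.exists(scratch.name))
```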

<a id="cleanup"></a>
# Clean Up

To delete the resources from Object Storage, create a file system instance and call `rm` on each file that was put into OCI Object Storage. The example relies on resource principal authentication, but other mechanisms, such as a config file, are also accepted.

In [None]:
from ocifs import OCIFileSystem

fs = OCIFileSystem()
fs.rm(conda_pack_uri)
fs.rm(os.path.join(script_bucket_uri, "script.py"))
td.cleanup()
download_dir.cleanup()

<a id="reference"></a>
# References

- [ADS PySpark Documentation](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/frameworks/sparkpipelinemodel.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Understanding Conda Environments](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm#conda_understand_environments)
- [Use Resource Manager to Configure Your Tenancy for Data Science](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/orm-configure-tenancy.htm)