Oracle Data Science service sample notebook.

Copyright (c) 2021-2022 Oracle, Inc.<br>
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***

# <font> Using Data Science Jobs to Automate Model Building and Training</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

***

## Overview

Notebook sessions are not ideal for long-running operations. Notebook sessions generally use relatively small compute shapes, you typically run only one at a time, and they are designed for interactive work, which is not always practical. The Data Science Jobs service is designed to execute arbitrary scripts in a headless manner, that is, without a display. A common use case for data scientists is training a model in a job. When a job is executed, the underlying resources are provisioned, and the compute instance is prepared with the script and the conda environment that it needs. The script is then run, and the resources are shut down when the script ends, so you only pay for the compute that you use. Jobs also let you select the compute instance size based on the performance that is needed.

This notebook demonstrates how to create a script, configure logs so that the output can be monitored, and create a job and an associated job run.

***

**<font color='red'>NOTE: This notebook was run in the PySpark 3.0 and Data Flow (slug: `pyspark30_p37_cpu_v5`) conda environment.</font>**

***

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.

You can access the `orcl_attrition` dataset license [here](https://oss.oracle.com/licenses/upl).

In [None]:
import ads
import os
import random
import string

from ads.common.oci_logging import OCILogGroup, OCILog
from ads.jobs import Job, DataScienceJob, ScriptRuntime

# Use resource principal to authenticate with the Data Science Jobs API: 
ads.set_auth(auth="resource_principal")

# Create a Script

This notebook demonstrates how to create a Job and a Job Run, using model training as the example. A Job is normally used to train a model when training takes a significant amount of time. In this notebook, the model takes only a few seconds to train, because the goal is to demonstrate the steps, not to train a production-grade model.

The first step is to create the script that is executed as part of the job. The training script is stored in a job artifact folder (`./job-artifact`) and performs the following actions:

* Pulls the data from Object storage. You must be in the Ashburn region.
* Uses ADS to perform automatic data transformation.
* Creates an sklearn pipeline object.
* Trains a random forest classifier.
* Saves the sklearn pipeline object (joblib) to disk in the model artifact folder.
* Creates a model artifact object by reading the files in the model artifact folder.
* Saves the model to the Model Catalog.
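
The folder handling behind the serialization step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's script: it assumes the `decompressed_artifact/job-artifact/model-artifact` layout that the script builds under the home directory, and it substitutes `pickle` for `joblib` so that it runs without extra packages. The function names `prepare_model_artifact_dir` and `save_model` and the placeholder model are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def prepare_model_artifact_dir(base_dir: Path) -> Path:
    """Create (if needed) and return the model artifact folder under base_dir."""
    model_dir = base_dir / "decompressed_artifact" / "job-artifact" / "model-artifact"
    # parents=True builds missing intermediate folders; exist_ok=True makes reruns safe.
    model_dir.mkdir(parents=True, exist_ok=True)
    return model_dir


def save_model(model: object, model_dir: Path, filename: str = "model.joblib") -> Path:
    """Serialize a model object into the artifact folder and return the file path."""
    model_path = model_dir / filename
    with model_path.open("wb") as fh:
        # Stand-in for joblib.dump(model, model_path) in the training script.
        pickle.dump(model, fh)
    return model_path


# A temporary directory stands in for the job's home directory.
with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = prepare_model_artifact_dir(Path(tmp))
    saved = save_model({"placeholder": "model"}, artifact_dir)
    artifact_files = sorted(p.name for p in artifact_dir.iterdir())
    print(f"Artifact folder contents: {artifact_files}")
```

Creating the folder with `exist_ok=True` is one way to keep the step idempotent, so the same code works whether or not an earlier run already created the folder.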

In [None]:
# Path to the job artifact directory that holds the training script:
job_artifact_location = os.path.expanduser('./job-artifact/')
os.makedirs(job_artifact_location, exist_ok=True)
attrition_path = os.path.join(job_artifact_location, "attrition-job.py")

In [None]:
%%writefile {attrition_path}

import ads
import io
import joblib
import logging
import os
import pandas as pd
import pip
import warnings

from ads.common.model import ADSModel
from ads.common.model_artifact import ModelArtifact
from ads.common.model_export_util import prepare_generic_model
from ads.dataset.factory import DatasetFactory
from ads.dataset.label_encoder import DataFrameLabelEncoder
from ads.evaluations.evaluator import ADSEvaluator
from collections import defaultdict
from os import path
from os.path import expanduser, join
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

ads.set_auth("resource_principal")

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

# downloading the data from object storage: 
bucket_name = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"
data_path = "oci://{}@{}/synthetic/orcl_attrition.csv".format(bucket_name, namespace)
print(f"Loading data from {data_path}.")
ds = DatasetFactory.open(data_path, target="Attrition",  
                         storage_options={'config':{}, 'region': 'us-ashburn-1', 'tenancy':os.environ['TENANCY_OCID']}).\
                    set_positive_class('Yes')
print("Data loaded.")

# Transforming the data: 
print("Starting data auto-transformation.")
transformed_ds = ds.auto_transform(fix_imbalance=False)
print("Done data auto-transformation.")

print("Starting model training.")
train, test = transformed_ds.train_test_split()
le = DataFrameLabelEncoder()
X = le.fit_transform(train.X.copy())

# Training the Random Forest Classifier: 
sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, train.y.copy())
sk_model = make_pipeline(le, sk_clf)
print("Completed model training.")

# Path to artifact directory for my sklearn model: 
decompressed_artifact_path = os.path.join(os.path.expanduser("~"), 'decompressed_artifact')
job_artifact_path = os.path.join(decompressed_artifact_path, 'job-artifact')
model_artifact_path = os.path.join(job_artifact_path, 'model-artifact')

print(f"Current path:  {os.path.abspath('.')}")
print(f"cwd contents: {os.listdir('.')}")

print(f"decompressed_artifact path exists: {os.path.exists(decompressed_artifact_path)}")
if os.path.exists(decompressed_artifact_path):
    print(f"decompressed_artifact contents: {os.listdir(decompressed_artifact_path)}")
    
print(f"job_artifact path exists: {os.path.exists(job_artifact_path)}")
if os.path.exists(job_artifact_path):
    print(f"job_artifact contents: {os.listdir(job_artifact_path)}")
    
print(f"model_artifact path exists: {os.path.exists(model_artifact_path)}")
if os.path.exists(model_artifact_path):
    print(f"model_artifact contents: {os.listdir(model_artifact_path)}")
else:
    print(f"Creating model_artifact directory: {model_artifact_path}")
    os.makedirs(model_artifact_path)

# Creating a joblib pickle object of the random forest model: 
model_path = os.path.join(model_artifact_path, "model.joblib")
print(f"Serializing sklearn object to {model_path}.")
joblib.dump(sk_model, model_path)
print(f"model_artifact contents: {os.listdir(model_artifact_path)}")

print(f"Preparing model artifact from {model_artifact_path}.")
sk_artifact = ModelArtifact(model_artifact_path)
sk_artifact.populate_schema(X_sample=train.X, y_sample=train.y)
print("Done populating the input and output schema.")
print("Model artifact created.")

print("Saving model artifact to the Model Catalog.")
# Save the model to the catalog: 
mc_model = sk_artifact.save(project_id=os.environ['PROJECT_OCID'],
                            compartment_id=os.environ['JOB_RUN_COMPARTMENT_OCID'],
                            training_id=os.environ['JOB_RUN_OCID'],
                            display_name="Employee-attrition-from-job",
                            description="Sklearn model to predict employee attrition", 
                            ignore_pending_changes=True)

print("Model artifact has been saved to the Model Catalog.")
print(f"Model OCID: {mc_model.id}")
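
The script above reads `TENANCY_OCID`, `PROJECT_OCID`, `JOB_RUN_COMPARTMENT_OCID`, and `JOB_RUN_OCID` from `os.environ`, so a missing variable would surface as a `KeyError` only after the data is loaded and the model is trained. The following sketch shows one possible way to check for all of them up front. The helper names and the placeholder OCID value are hypothetical and are not part of the notebook's script.

```python
import os

# Variables the training script reads via os.environ.
REQUIRED_ENV_VARS = (
    "TENANCY_OCID",
    "PROJECT_OCID",
    "JOB_RUN_COMPARTMENT_OCID",
    "JOB_RUN_OCID",
)


def missing_env_vars(environ=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are absent or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def require_env_vars(environ=None, required=REQUIRED_ENV_VARS):
    """Raise one readable error that lists every missing variable."""
    missing = missing_env_vars(environ, required)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))


# Example with a stand-in mapping instead of the real environment.
# The OCID below is a made-up placeholder.
example_env = {"TENANCY_OCID": "ocid1.tenancy.oc1..example", "PROJECT_OCID": ""}
missing = missing_env_vars(example_env)
print(missing)
```

Calling `require_env_vars()` near the top of a job script would fail fast with a single message instead of partway through the run.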

# Create a Job

This section creates a [Data Science Job and a Job Run](https://docs.oracle.com/en-us/iaas/data-science/using/jobs-about.htm) using the ADS library.

Using jobs, you can:

* Run machine learning (ML) or data science tasks outside of your notebook sessions in JupyterLab.
* Operationalize discrete data science and machine learning tasks as reusable runnable operations.
* Automate your typical MLOps or CI/CD pipeline.
* Execute batches or workloads triggered by events or actions.
* Batch, mini-batch, or distributed batch job inference.

Jobs are run in compute instances in the OCI Data Science service tenancy. The compute instance will run for the duration of your job and will automatically shut itself down at the completion of the job script.

Output from the job can be captured using the OCI Logging service. While logging is optional, it is highly recommended. Without logging enabled, it is very difficult to troubleshoot job runs. The following cell will create a Log Group and Custom Log for you. If you run this cell more than once you will have to change the value of `job_name`, as it is used as the name of the Log Group and Log and they must have unique names.
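
Because the Log Group and Log names must be unique, one way to avoid editing `job_name` by hand on every rerun is to append a short random suffix. The sketch below uses only the `random` and `string` modules that this notebook already imports. The helper `unique_name` and the suffix length are assumptions made for illustration.

```python
import random
import string


def unique_name(base: str, suffix_length: int = 6) -> str:
    """Append a random lowercase/digit suffix so repeated runs are unlikely to collide."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_length))
    return f"{base}-{suffix}"


job_name_example = unique_name("Attrition-model-training-job")
print(job_name_example)
```

A random suffix makes collisions unlikely rather than impossible, so a name clash can still occur on rare occasions.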

In [None]:
job_name = 'Attrition-model-training-job'
log_group = OCILogGroup(display_name=job_name).create()
log = log_group.create_log(job_name)

Use the `Job` class to create a job. The `.with_infrastructure()` method defines the default infrastructure that is used; many of these options can be changed when a Job Run is created. The Job Run needs to know which conda environment to install so that the script can execute. Generally, this is the same conda environment that was used to develop and test the script. The Job Run also needs to know the path of the script to execute, which the next cell provides through `.with_source()` and its `entrypoint` argument.

In [None]:
job = Job(job_name).with_infrastructure(
    DataScienceJob().\
    with_shape_name("VM.Standard2.1").\
    with_log_id(log.id).\
    with_log_group_id(log_group.id)).\
    with_runtime(ScriptRuntime().\
        with_source("job-artifact", entrypoint="job-artifact/attrition-job.py").\
        with_service_conda("pyspark30_p37_cpu_v5"))
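
The job definition above points at the `job-artifact` folder and names `job-artifact/attrition-job.py` as the entrypoint. A wrong path would only show up once a Job Run starts, so it can be worth checking locally first. The following is a minimal sketch with the standard library. It assumes the convention used in this notebook, where the entrypoint is written relative to the parent of the source folder, and the helper `check_entrypoint` is hypothetical rather than part of ADS.

```python
import tempfile
from pathlib import Path


def check_entrypoint(source_dir: str, entrypoint: str) -> Path:
    """Return the resolved entrypoint path, or raise if it is not a file inside source_dir."""
    source = Path(source_dir).resolve()
    # Assumption: the entrypoint starts with the source folder's own name.
    candidate = (source.parent / entrypoint).resolve()
    if source not in candidate.parents:
        raise ValueError(f"{entrypoint!r} is not inside {source_dir!r}")
    if not candidate.is_file():
        raise FileNotFoundError(f"Entrypoint script not found: {candidate}")
    return candidate


# Demonstration in a temporary folder that mimics ./job-artifact/attrition-job.py.
with tempfile.TemporaryDirectory() as tmp:
    demo_source = Path(tmp) / "job-artifact"
    demo_source.mkdir()
    (demo_source / "attrition-job.py").write_text("print('training')\n")
    found = check_entrypoint(str(demo_source), "job-artifact/attrition-job.py")
    entrypoint_name = found.name
    print(f"Entrypoint OK: {entrypoint_name}")
```

In the notebook itself, the equivalent call would be `check_entrypoint("job-artifact", "job-artifact/attrition-job.py")`, run from the directory that contains the `job-artifact` folder.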

Printing the job object provides details about the job such as what conda environment it will use, logging information, what script will be run, and much more.

In [None]:
job

Use the `.create()` method to create the job. This does not trigger the execution of the job script. A job is a resource that contains the configuration and definition of the task to be executed, while job runs are the actual executions of a job.

In [None]:
dsjob = job.create()

# Create a Job Run

A Job defines a template for Job Runs. A Job Run is the actual instance of the job being run, and a Job can have many Job Runs. Further, the Job can be parameterized so that environment variables and command line arguments are passed to the Job Run at run time. This allows a single Job to define a family of Job Runs, where each Job Run performs a slightly different action based on its environment variables and command line arguments. The Job Run used in this notebook is not parameterized, as the goal is to demonstrate the basics of setting up a Job and creating a Job Run.
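
To make the idea of parameterization concrete, the sketch below shows how a job script could read its settings from command line arguments and environment variables. It covers only the script side and uses the standard library. The `--n-estimators` argument, the `DISPLAY_NAME` variable, and the function `read_run_settings` are made up for illustration and do not appear in this notebook's script.

```python
import argparse


def read_run_settings(argv, environ):
    """Combine command line arguments and environment variables into one settings dict."""
    parser = argparse.ArgumentParser(description="Hypothetical training settings")
    # --n-estimators is a made-up argument for illustration.
    parser.add_argument("--n-estimators", type=int, default=100)
    args = parser.parse_args(argv)
    return {
        "n_estimators": args.n_estimators,
        # DISPLAY_NAME is a made-up variable; the default matches the notebook's model name.
        "display_name": environ.get("DISPLAY_NAME", "Employee-attrition-from-job"),
    }


# In a job script these inputs would be sys.argv[1:] and os.environ.
default_settings = read_run_settings([], {})
custom_settings = read_run_settings(
    ["--n-estimators", "250"], {"DISPLAY_NAME": "attrition-rf-250"}
)
print(default_settings)
print(custom_settings)
```

Keeping defaults in the script means an unparameterized Job Run, like the one in this notebook, still works, while other runs can override individual values.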

The `.run()` method can be used to create a Job Run and execute the script. The `.watch()` method is used to watch the progress of the job. It displays information about the job run and the output of the job script. There is a slight difference between what is displayed in the `.watch()` method and what is in the logs. The `.watch()` method displays information about the setup and teardown of the Job Run. It also displays the output from the script itself. The log only captures the information from the execution of the script.

In [None]:
dsjob.run().watch()