
</font>

***

# <font> Using Data Science Jobs to Automate Model Building and Training</font>

***

## Overview

Notebook sessions are not ideal for long-running operations. They generally use relatively small compute shapes, you typically run one at a time, and they are designed to be interactive, which is not always practical. The Data Science Jobs service is designed to execute arbitrary scripts in a headless manner, that is, without a display. A common use case for data scientists is training a model in a job. When a job is executed, the underlying resources are provisioned, and the compute instance is prepared with the script and the conda environment that it needs. The script is then run, and the resources are shut down when the script ends, so you pay only for the compute that you use. Jobs also let you select the compute instance size based on the performance that is needed.

This notebook demonstrates how to create a script, configure logs so that the output can be monitored, and create a job and an associated job run.

Please select the published conda environment `data-science-gmlv1_0_v1` before proceeding.

In [1]:
import ads
import os
import random
import string

from ads.common.oci_logging import OCILogGroup, OCILog
from ads.jobs import Job, DataScienceJob, ScriptRuntime

# Use resource principal to authenticate with the Data Science Jobs API: 
ads.set_auth(auth="resource_principal")

In [2]:
ads.hello()



  O  o-o   o-o
 / \ |  \ |
o---o|   O o-o
|   ||  /     |
o   oo-o  o--o

ads v2.8.11
oci v2.114.0
ocifs v1.1.3




# Create a Script

This notebook demonstrates how to create a Job and a Job Run, using model training as the example. Jobs are normally used for training when the model takes a significant amount of time to train. In this notebook, the model takes only a few seconds to train, but the goal is to demonstrate the steps, not to train a production-grade model.

The first step is to create the script that is executed as part of the job. The training script is stored in a job artifact folder (`./job-artifact`) and performs the following actions:

* Generates a synthetic binary classification dataset with scikit-learn's `make_classification`.
* Splits the data into training and test sets.
* Trains a random forest classifier.
* Uses the `SklearnModel` class to prepare the model artifact and save the model to the Model Catalog.

In [3]:
# Path to the job artifact directory and to the training script inside it:
job_artifact_location = os.path.expanduser('./job-artifact/')
os.makedirs(job_artifact_location, exist_ok=True)
attrition_path = os.path.join(job_artifact_location, "attrition-job1.py")

The published conda environment is used as both the inference and the training conda environment. In this case, it resides in the bucket `LAB_Conda`, and its path is `oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0`.

In [4]:
%%writefile {attrition_path}

import tempfile

import ads
from ads.common.model_metadata import UseCaseType
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

ads.set_auth("resource_principal")

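# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_features=15, n_classes=2, flip_y=0.05)
# An integer test_size is an absolute count: 30 samples are held out for testing
trainx, testx, trainy, testy = train_test_split(X, y, test_size=30, random_state=42)

# Train a random forest classifier on the training split
sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(trainx, trainy)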

# Instantiate ads.model.framework.sklearn_model.SklearnModel with a temporary artifact directory
artifact_dir = tempfile.mkdtemp()
sklearn_model = SklearnModel(estimator=sk_clf, artifact_dir=artifact_dir)

# Autogenerate score.py, serialized model, runtime.yaml, input_schema.json and output_schema.json
sklearn_model.prepare(
    inference_conda_env="oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0",
    training_conda_env="oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0",
    use_case_type=UseCaseType.BINARY_CLASSIFICATION,
    X_sample=trainx,
    y_sample=trainy,
    force_overwrite=True,
    ignore_introspection=True
)

# Register the scikit-learn model in the Model Catalog
sklearn_model.save(
    display_name="Job-Sklearn-Model-Sept-25-2",
    inference_conda_env="oci://conda-env-bucket@intoraclerohit/conda_environments/cpu/General Machine Learning for CPUs on Python 3.8/1.0/generalml_p38_cpu_v1",
    ignore_introspection=True,
)



Overwriting ./job-artifact/attrition-job1.py
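Because a job runs headless, progress messages written by the script are the main way to see what it is doing. The following is a minimal sketch, not part of the original script, of how a job script could emit timestamped messages with Python's standard `logging` module. It assumes that whatever the script writes to standard output is what reaches the job log; `get_job_logger` and the logger name `training-job` are hypothetical names chosen for illustration.

```python
import logging
import sys


def get_job_logger(name="training-job"):
    """Return a logger that writes timestamped messages to standard output."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages from being printed twice
    return logger


logger = get_job_logger()
logger.info("Training started")
# ... model training would happen here ...
logger.info("Training finished")
```

Calls such as `logger.info(...)` could be placed around the `fit`, `prepare`, and `save` steps of the training script to mark how far a run has progressed.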


# Create a Job

This section creates a [Data Science Job and a Job Run](https://docs.oracle.com/en-us/iaas/data-science/using/jobs-about.htm) using the ADS library.

Using jobs, you can:

* Run machine learning (ML) or data science tasks outside of your notebook sessions in JupyterLab.
* Operationalize discrete data science and machine learning tasks as reusable runnable operations.
* Automate your typical MLOps or CI/CD pipeline.
* Execute batches or workloads triggered by events or actions.
* Batch, mini-batch, or distributed batch job inference.

Jobs are run in compute instances in the OCI Data Science service tenancy. The compute instance will run for the duration of your job and will automatically shut itself down at the completion of the job script.

Output from the job can be captured using the OCI Logging service. While logging is optional, it is highly recommended. Without logging enabled, it is very difficult to troubleshoot job runs. The following cell will create a Log Group and Custom Log for you. 

If you run this cell more than once, you must change the value of `job_name`, because it is used as the name of both the Log Group and the Log, and these names must be unique.

In [5]:
job_name = 'Training-job-NOv09-23'
log_group = OCILogGroup(display_name=job_name).create()
log = log_group.create_log(job_name)
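One way to avoid editing `job_name` by hand on every rerun is to append a random suffix. The first cell of this notebook already imports `random` and `string`, which suggests this approach, but the helper below is only a sketch: `unique_job_name` and `candidate_name` are hypothetical names, and a random suffix makes a name collision unlikely rather than impossible.

```python
import random
import string


def unique_job_name(prefix="Training-job", suffix_length=6):
    """Return the prefix followed by a random lowercase alphanumeric suffix."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_length))
    return f"{prefix}-{suffix}"


# Hypothetical usage: this value would replace the hard-coded job_name above.
candidate_name = unique_job_name()
print(candidate_name)
```

To use it, the generated value would be assigned to `job_name` before the Log Group and Log are created, so that the job, the Log Group, and the Log all share the same name.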

Use the `Job` class to create a job. The `.with_infrastructure()` method defines the default infrastructure that will be used; many of these options can be changed when a Job Run is created. The `.with_runtime()` method tells the Job Run which conda environment must be installed so that the script can execute, which is generally the same conda environment that was used to develop and test the script. It also gives the location of the job artifact and the path of the script that serves as the entrypoint.

In [6]:
job = (
    Job(job_name)
    .with_infrastructure(
        DataScienceJob()
        .with_shape_name("VM.Standard.E4.Flex")
        .with_log_id(log.id)
        .with_log_group_id(log_group.id)
    )
    .with_runtime(
        ScriptRuntime()
        .with_source("job-artifact", entrypoint=attrition_path)
        .with_custom_conda("oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0")
    )
)

Printing the job object provides details about the job such as what conda environment it will use, logging information, what script will be run, and much more.

In [7]:
job


kind: job
spec:
  infrastructure:
    kind: infrastructure
    spec:
      jobType: DEFAULT
      logGroupId: ocid1.loggroup.oc1.phx.amaaaaaas5adu2ia4ahx3m3ptxyi2zui2pmbygpnaukbt4x5tmhr76wu545a
      logId: ocid1.log.oc1.phx.amaaaaaas5adu2iap6exjxeu6dakqfx6kyxdvi6apsaqp5vtlgvk3r2ymk6a
      shapeName: VM.Standard.E4.Flex
    type: dataScienceJob
  name: Training-job-NOv09-23
  runtime:
    kind: runtime
    spec:
      conda:
        type: published
        uri: oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0
      entrypoint: ./job-artifact/attrition-job1.py
      scriptPathURI: job-artifact
    type: script

Use the `.create()` method to create the job. This will not trigger the execution of the job script. A job is a resource that contains the configuration and definition of the task to be executed while job runs are actual executions of a job.

In [8]:
dsjob = job.create()

# Create a Job Run

A Job allows for the definition of a template of a Job Run. A Job Run is the actual instance of the job being run. A Job can have many Job Runs. Further, the Job can be parameterized such that environment variables and command line arguments can be passed to the Job Run at run time. This allows for a single Job to define a family of Job Runs where each Job Run performs a slightly different action based on the environment variables and command line arguments. The Job Run used in this notebook is not parameterized as the goal is to demonstrate the basics of setting up a Job and creating a Job Run.
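To make the idea of a parameterized Job concrete, here is a sketch of how a job script could read its settings from command line arguments and environment variables using only the standard library. Everything specific in it is an assumption made for illustration: the `--n-estimators` argument, the `MODEL_DISPLAY_NAME` environment variable, the default values, and the `read_job_parameters` helper do not appear in the training script used in this notebook.

```python
import argparse
import os


def read_job_parameters(argv=None, environ=None):
    """Collect run-time settings from command line arguments and environment variables."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Hypothetical training job")
    parser.add_argument("--n-estimators", type=int, default=100)
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv[1:]
    return {
        "n_estimators": args.n_estimators,
        "model_display_name": environ.get("MODEL_DISPLAY_NAME", "Job-Sklearn-Model"),
    }


# Simulate one Job Run's inputs instead of reading the real process environment.
params = read_job_parameters(
    argv=["--n-estimators", "200"],
    environ={"MODEL_DISPLAY_NAME": "demo-model"},
)
print(params)  # {'n_estimators': 200, 'model_display_name': 'demo-model'}
```

With a script written this way, two Job Runs of the same Job could train differently sized forests or register models under different names without any change to the script itself.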

The `.run()` method can be used to create a Job Run and execute the script. The `.watch()` method is used to watch the progress of the job. It displays information about the job run and the output of the job script. There is a slight difference between what is displayed in the `.watch()` method and what is in the logs. The `.watch()` method displays information about the setup and teardown of the Job Run. It also displays the output from the script itself. The log only captures the information from the execution of the script.

In [9]:
dsjob.run()

kind: jobRun
spec:
  id: ocid1.datasciencejobrun.oc1.phx.amaaaaaas5adu2iagtqn6r6hdcku2cg262be5qb6pyewzolz4hdoig6ohdca
  infrastructure:
    kind: infrastructure
    spec:
      blockStorageSize: 50
      compartmentId: ocid1.compartment.oc1..aaaaaaaabu6pgqbwe4are4ke7uzkq44rbvbnxwhybhmplialatq54kdvq4jq
      displayName: Training-job-NOv09-23-run-2023-11-09-11:07.17
      jobInfrastructureType: ME_STANDALONE
      jobType: DEFAULT
      logGroupId: ocid1.loggroup.oc1.phx.amaaaaaas5adu2ia4ahx3m3ptxyi2zui2pmbygpnaukbt4x5tmhr76wu545a
      logId: ocid1.log.oc1.phx.amaaaaaas5adu2iap6exjxeu6dakqfx6kyxdvi6apsaqp5vtlgvk3r2ymk6a
      projectId: ocid1.datascienceproject.oc1.phx.amaaaaaas5adu2iaa45m3uo4rgrpu4fzgpyltcsf4bn5grgpzlc675dmiprq
      shapeConfigDetails:
        memoryInGBs: 32.0
        ocpus: 2.0
      shapeName: VM.Standard.E4.Flex
    type: dataScienceJob
  name: Training-job-NOv09-23-run-2023-11-09-11:07.17
  runtime:
    kind: runtime
    spec:
      conda:
        region: us-pho

<b>The status of the job can be checked from the console</b>
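As an alternative to the console, the `.watch()` method described earlier can follow a run from the notebook. The helper below is a minimal sketch that assumes `.watch()` is called on the run object returned by `.run()`; `run_and_watch` is a hypothetical name, and it accepts any object that offers those two methods.

```python
def run_and_watch(job):
    """Start a new Job Run and stream its progress until the run ends."""
    job_run = job.run()  # creates the Job Run and starts the script
    job_run.watch()      # prints setup, script output, and teardown messages
    return job_run


# Hypothetical usage with the job created above (this starts a new run):
# job_run = run_and_watch(dsjob)
```

Note that each call creates a new Job Run, so using this helper after the `dsjob.run()` cell above would start a second run rather than attach to the first.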