<font color=gray>Oracle Cloud Infrastructure Data Science Demo Notebook

Copyright (c) 2021 Oracle, Inc.<br>
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font> Using Data Science Jobs to Automate Model Building and Training</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> OCI Data Science PM Team </font></p>

***

**<font color='red'>NOTE: This notebook was run in the TensorFlow 2.7 for CPU (slug: `tensorflow27_p37_cpu_v1`) conda environment with ADS version 2.5.10. Upgrade your version of ADS (see cell below) and restart your kernel.</font>**

In [None]:
#!pip install oracle-ads==2.5.10

In [None]:
import ads
print(ads.__version__)

# Create a Simple Model Training Script

The first step is to create a simple model training script that reproduces the training steps we wrote in notebook `1-model-training.ipynb`. We store the training script in a job artifact folder (`./job-artifact`).

This is the script that will be executed as a Data Science Job. 

The script:
* pulls the data from Object Storage;
* transforms the data;
* creates an sklearn pipeline object;
* trains a random forest classifier;
* saves the sklearn pipeline object (joblib) to disk in the model artifact folder;
* creates a model artifact object from the files in the local `model-artifact` folder (`ModelArtifact(path)`);
* saves the model to the model catalog.
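
The training script reads several OCIDs from environment variables (`TENANCY_OCID`, `PROJECT_OCID`, `JOB_RUN_COMPARTMENT_OCID`, and `JOB_RUN_OCID`). As a minimal sketch, assuming you would rather have the job fail early with a readable message than hit a `KeyError` after training has finished, a check like the following could run at the top of the script. The helper name `require_env` is hypothetical and is not part of ADS.

```python
import os

# Variable names taken from the training script; the helper is a hypothetical addition.
REQUIRED_VARS = (
    "TENANCY_OCID",
    "PROJECT_OCID",
    "JOB_RUN_COMPARTMENT_OCID",
    "JOB_RUN_OCID",
)


def require_env(names, environ=None):
    """Return {name: value} for the given variables; raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` before the data download would surface a misconfigured job run before any training time is spent.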

In [None]:
%%writefile job-artifact/attrition-job.py 

import logging
import os
import warnings

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

import ads
from ads.common.model_artifact import ModelArtifact
from ads.dataset.factory import DatasetFactory

from dataframelabelencoder import DataFrameLabelEncoder

# Authenticate with the resource principal of the job run:
ads.set_auth("resource_principal")

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

# Downloading the data from object storage:
bucket_name = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"
ds = DatasetFactory.open(
    f"oci://{bucket_name}@{namespace}/synthetic/orcl_attrition.csv",
    target="Attrition",
    storage_options={
        "config": {},
        "region": "us-ashburn-1",
        "tenancy": os.environ["TENANCY_OCID"],
    },
).set_positive_class("Yes")

print("done downloading data")

# Transforming the data: 
transformed_ds = ds.auto_transform(fix_imbalance=False)
train, test = transformed_ds.train_test_split()

print("done auto-transform data")

X = train.X.copy()
y = train.y.copy()

le = DataFrameLabelEncoder()
X = le.fit_transform(X)

# Training the Random Forest Classifier: 
sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, y)

sk_model = make_pipeline(le, sk_clf)

print("completed model training")


# Path to artifact directory for my sklearn model: 
path = "model-artifact/"


print("serializing sklearn object")
print(f"current path:  {os.path.abspath('.')}")
print(f"list content of cwd: {os.listdir('.')}")
print(f"model-artifact exists:  {os.path.exists('./model-artifact/')}")
print(f"full path exists: {os.path.exists(os.path.join(os.path.abspath('.'), path))}")

# Serializing the fitted sklearn pipeline (encoder + random forest) with joblib.
# The absolute path below points at the model-artifact folder inside the decompressed job artifact:
#joblib.dump(sk_model, os.path.join(os.path.abspath("."), path, "model.joblib"))
joblib.dump(sk_model, "/home/datascience/decompressed_artifact/job-artifact/model-artifact/model.joblib")

print("preparing model artifact")

#sk_artifact = ModelArtifact(os.path.join(os.path.abspath("."), path))
sk_artifact = ModelArtifact("/home/datascience/decompressed_artifact/job-artifact/model-artifact/")
print("done creating sk_artifact")


sk_artifact.populate_schema(X_sample=train.X, y_sample=train.y)

print("done populating schema")

print("done preparing model artifact")

print("saving model artifact to catalog")

# Save the model to the catalog: 
mc_model = sk_artifact.save(project_id=os.environ['PROJECT_OCID'],
                            compartment_id=os.environ['JOB_RUN_COMPARTMENT_OCID'],
                            training_id=os.environ['JOB_RUN_OCID'],
                            display_name="sklearn-employee-attrition-from-job",
                            description="simple sklearn model to predict employee attrition", 
                            ignore_pending_changes=True)

print("done saving model artifact to catalog")

print(mc_model.id)

# Creating a Data Science Job and Job Run

Here we create a [Data Science Job and a Job Run](https://docs.oracle.com/en-us/iaas/data-science/using/jobs-about.htm) using the ADS library.

Data Science jobs let you define and run custom tasks for any use case you have, such as data preparation, model training, hyperparameter tuning, and batch inference.

Using jobs, you can:

* Run machine learning (ML) or data science tasks outside of your notebook sessions in JupyterLab.
* Operationalize discrete data science and machine learning tasks as reusable runnable operations.
* Automate your typical MLOps or CI/CD pipeline.
* Execute batches or workloads triggered by events or actions.
* Run batch, mini-batch, or distributed batch inference.

Jobs run on virtual machines (VMs) in the OCI Data Science service tenancy. The VM runs for the duration of your job and shuts itself down when your job script completes.

In [None]:
from ads.common.oci_logging import OCILogGroup, OCILog
import ads
import os
import random
import string

# here we are using resource principal to authenticate with the Data Science Jobs API: 
ads.set_auth(auth="resource_principal")

Specify the [log group OCID and log OCID](https://docs.oracle.com/en-us/iaas/Content/Logging/Task/managinglogs.htm) that you want to attach to your job. You must first create the log group and the log through the OCI Logging service and grant job runs access to the Logging service on your behalf. Access is granted through [resource principals for Job Runs and an IAM policy](https://docs.oracle.com/en-us/iaas/data-science/using/policies.htm#policy-examples).

Specifying logs is optional but highly recommended. Without logging enabled, it is very difficult to troubleshoot job runs. 
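
Because `log_group_id` and `log_id` start out as empty strings, a quick sanity check before building the job can catch a forgotten value. This is only a sketch: the helpers `looks_like_ocid` and `check_log_settings` are hypothetical, and the assumed resource-type prefixes (`ocid1.loggroup.` and `ocid1.log.`) should be confirmed against the OCIDs shown in your own console.

```python
def looks_like_ocid(value, resource_type):
    """Loose check: a string that starts with 'ocid1.<resource_type>.'."""
    return isinstance(value, str) and value.startswith(f"ocid1.{resource_type}.")


def check_log_settings(log_group_id, log_id):
    """Return a list of problems; an empty list means both values look plausible."""
    problems = []
    # Assumed prefix for log group OCIDs: 'ocid1.loggroup.'
    if not looks_like_ocid(log_group_id, "loggroup"):
        problems.append("log_group_id does not look like a log group OCID")
    # Assumed prefix for log OCIDs: 'ocid1.log.'
    if not looks_like_ocid(log_id, "log"):
        problems.append("log_id does not look like a log OCID")
    return problems
```

The check only inspects the string shape; it does not confirm that the log exists or that the IAM policy is in place.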


In [None]:
log_group_id = ""
log_id = ""

In [None]:
from ads.jobs import Job
from ads.jobs import DataScienceJob, ScriptRuntime

job_name = 'attrition-model-training-job'

In [None]:
job = (
    Job(job_name)
    .with_infrastructure(
        DataScienceJob()
        .with_shape_name("VM.Standard2.1")
        .with_log_id(log_id)
        .with_log_group_id(log_group_id)
    )
    .with_runtime(
        ScriptRuntime()
        .with_source("job-artifact", entrypoint="job-artifact/attrition-job.py")
        .with_service_conda("tensorflow27_p37_cpu_v1")
    )
)

In [None]:
job

Let's now create the job. Creating a job does not trigger the execution of the job script. Think of the job as the resource that contains the configuration and definition of the task to be executed.

In [None]:
dsjob = job.create()

Lastly, we execute the job by creating a job run. The `watch()` method is only available if you have enabled logs for your job or job run.

In [None]:
dsjob.run().watch()
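
If logs are not enabled and `watch()` is therefore unavailable, one alternative is to poll the run's lifecycle state until it reaches a terminal value. The sketch below is deliberately generic and does not call ADS: `get_state` is any zero-argument callable you supply that returns the current state as a string, and both the helper name `wait_for_terminal_state` and the set of terminal state names are assumptions to verify against the Jobs documentation.

```python
import time

# Assumed terminal lifecycle states for a job run; verify against the service docs.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "CANCELED", "DELETED"})


def wait_for_terminal_state(get_state, interval_seconds=30.0, timeout_seconds=3600.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state, then return that state.

    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"job run still in state {state!r} after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to exercise without waiting; in a notebook you would leave them at their defaults.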