Oracle Data Science service sample notebook.

Copyright (c) 2021-2022 Oracle, Inc.<br>
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font> Predicting Employee Attrition with ADS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

***

## Overview:

This notebook uses a synthetic employee attrition dataset that contains information about employees and whether they have left the company. To understand the data, you start with an exploratory data analysis (EDA). You then create a model using `scikit-learn`, use it to make predictions, and evaluate its performance on new data. Finally, the model is prepared and saved to the Model Catalog using Oracle's Accelerated Data Science (`ADS`) library.

***

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services.  You can access the `orcl_attrition` dataset license [here](oracle_data/UPL.txt). Dataset `orcl_attrition` is distributed under UPL license. 
</font>

Select the published conda environment `data-science-gmlv1_0_v1` before proceeding. The version of ADS installed with this conda environment is 2.8.11.

In [1]:
import ads
import io
import joblib 
import logging
import numpy as np  
import os
import pandas as pd
import sys 
import warnings
import tempfile

from ads.common.model import ADSModel
from ads.common.model_artifact import ModelArtifact
from ads.common.model_export_util import prepare_generic_model
from ads.dataset.factory import DatasetFactory
from ads.dataset.label_encoder import DataFrameLabelEncoder
from ads.evaluations.evaluator import ADSEvaluator
from collections import defaultdict
from os import path 
from os.path import expanduser
from os.path import join
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import get_scorer
from ads.config import NB_SESSION_OCID
from ads.config import PROJECT_OCID
from ads.config import NB_SESSION_COMPARTMENT_OCID

from ads.model.framework.sklearn_model import SklearnModel

ads.set_auth(auth='resource_principal') 

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

In [2]:
ads.hello()



  O  o-o   o-o
 / \ |  \ |
o---o|   O o-o
|   ||  /     |
o   oo-o  o--o

ads v2.8.11
oci v2.114.0
ocifs v1.1.3




<a id='binaryclassifition'></a>
# Binary Classification

Binary classification is a technique of classifying observations into one of two categories. In this notebook, the two groups are those employees that will leave the organization and those that will not.

Given the features in the data, the model determines the optimal criteria for classifying an observation as leaving or not leaving. This optimization is based on the training data. Models can overfit the training data, that is, learn the noise in a dataset. Models can also underfit the data, meaning that they do not learn the important characteristics of the relationships between the predictors and the target variable. Because the model learns from the training data, its predictive power on that data is not a good measure of its performance. Therefore, a test set is withheld from the full dataset so that the model's performance on unseen data can be evaluated.

The evaluation uses classic measures of fit for binary classification, such as specificity, sensitivity, accuracy, area under the ROC curve, lift, and gain.
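
To illustrate how some of these measures derive from the four cells of a confusion matrix, here is a minimal sketch in plain Python. The function name `binary_metrics` and the example labels are hypothetical and are not part of `ADS`; the notebook itself relies on the `ADS` evaluator for these calculations.

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Compute basic binary classification metrics from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # Share of actual positives that were predicted positive (recall)
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of actual negatives that were predicted negative
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Share of all predictions that were correct
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Made-up labels, for illustration only
y_true = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels there are 2 true positives, 4 true negatives, 1 false positive, and 1 false negative, so the accuracy is 6/8 and the specificity is 4/5.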

# Open and Visualize the Dataset using `ADS`

<a id='data'></a>
## Dataset

This is a synthetic data set which contains 1,470 observations. There are 36 features, where 22 are ordinal, 11 are categorical, and 3 are constant values. The data contains demographic information, compensation level, job characteristics, job satisfaction, and employee performance. The data is imbalanced as fewer employees leave than stay.

The first step is to load the dataset. To do this, the `DatasetFactory` singleton object is used. It is part of the `ADS` SDK. It is a powerful class for working with datasets from different sources because it lets you store metadata such as which column is the target and what type of modeling problem is being solved. In this case, it is a binary classification problem.

In [3]:
employees = DatasetFactory.open("/opt/notebooks/ads-examples/oracle_data/orcl_attrition.csv", 
                                target="Attrition").set_positive_class('Yes')


<a id='trans'></a>
# Transformation Recommendations

`ADS` can help with feature engineering by automatically transforming datasets. For example, it can fix class imbalance by up or downsampling. This is just one example of the many transforms that `ADS` can apply. You can have `ADS` perform an analysis of the data and automatically apply the appropriate transformations to improve a model's performance. This is done with the `.auto_transform()` method. The `.suggest_recommendations()` method allows you to explore the suggested transforms using the notebook's UI and select the transformations that you would like it to make.

All ADS datasets are immutable; any transforms that are applied result in a new dataset. In this example, the notebook will perform automatic transformations on the data, and it will also fix the class imbalance.
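
As a rough illustration of what fixing class imbalance by upsampling means, the following sketch duplicates randomly chosen minority-class rows until every class is the same size. This is a generic random-oversampling sketch, not the way `ADS` implements `fix_imbalance`; the function `upsample_minority` and the rows are hypothetical. Like an `ADS` transform, it returns a new collection and leaves the input unchanged.

```python
import random
from collections import Counter


def upsample_minority(rows, target_key, seed=42):
    """Return a new list in which every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows with replacement until this class matches the largest one
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


# Made-up rows: 8 employees stay and 2 leave
rows = [{"Attrition": "No"}] * 8 + [{"Attrition": "Yes"}] * 2
balanced = upsample_minority(rows, "Attrition")
counts = Counter(row["Attrition"] for row in balanced)
print(counts)
```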

In [4]:
transformed_ds = employees.auto_transform(fix_imbalance=True)


The data should be split into training and test sets. This can be done using the `.train_test_split()` method. The following cell sets the `test_size` parameter to 0.2, which allocates 20% of the data to the test set and the remaining 80% to the training set.

In [5]:
train, test = transformed_ds.train_test_split(test_size=0.2)
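
The effect of `test_size=0.2` can be sketched in plain Python as shuffling the row indices and cutting them at the 20% mark. The helper `split_indices` is hypothetical and only illustrates the idea; it is not the `ADS` implementation, and the row count of 1000 is made up.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000, test_size=0.2)
print(len(train_idx), len(test_idx))  # 800 200
```

No index appears in both lists, which is what keeps the test rows unseen during training.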

# Training a Random Forest Model 

The next cell trains a Random Forest model. It uses the sklearn `make_pipeline()` helper to assemble the data transformation and the model estimator into a single `Pipeline` object.

In [6]:
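# Copy the features and target so the ADS datasets are not modified
X = train.X.copy()
y = train.y.copy()

Xtest = test.X.copy()
ytest = test.y.copy()

# Fit the label encoder on the training data only, then reuse it on the test data
le = DataFrameLabelEncoder()
X = le.fit_transform(X)
Xtest = le.transform(Xtest)

# Train the classifier on the encoded training data
sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, y)

# Combine the fitted encoder and classifier into a single pipeline
sk_model = make_pipeline(le, sk_clf)

my_model = ADSModel.from_estimator(sk_model, name=sk_clf.__class__.__name__)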
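
A general point about label encoding is that the category-to-integer mapping is normally learned once from the training data and then reused on the test data. The sketch below shows why, using a hypothetical `SimpleLabelEncoder` class written for this illustration; it is not the `ADS` `DataFrameLabelEncoder`, and the department values are made up.

```python
class SimpleLabelEncoder:
    """Map string categories to integers; unseen values map to a reserved code."""

    UNKNOWN = -1

    def fit(self, values):
        # Learn the mapping from the values passed in (normally the training data)
        self.mapping_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Reuse the learned mapping without refitting
        return [self.mapping_.get(v, self.UNKNOWN) for v in values]


train_dept = ["Sales", "R&D", "HR", "Sales"]
test_dept = ["Sales", "R&D", "Temp"]  # "Temp" never appears in training

encoder = SimpleLabelEncoder().fit(train_dept)
encoded_train = encoder.transform(train_dept)
encoded_test = encoder.transform(test_dept)

# Refitting on the test data gives the same categories different codes
refit_test = SimpleLabelEncoder().fit(test_dept).transform(test_dept)
print(encoded_test, refit_test)  # [2, 1, -1] [1, 0, 2]
```

With the training mapping, "Sales" is 2 in both sets. After refitting on the test data, "Sales" becomes 1, so a classifier trained on the first mapping would misread the feature.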

<a id='eval'></a>
# Evaluate the Model

One of the key advantages of `ADS` is the ability to quickly evaluate any regression or classification model. While `ADS` has many built-in evaluation techniques, it supports the ability to provide your own evaluation function. You would provide a series of the true dependent variable value and another series of the predicted value. Then any esoteric calculation can be performed. This notebook uses the built-in performance metrics as these are sufficient for binary classification model evaluation.

The next cell creates the plots that are commonly used to evaluate model performance. These include the precision, recall, ROC, lift, and gain plots. Each model under study is plotted together, allowing for easy comparison. In addition, the normalized confusion matrices are provided.

In [7]:
evaluator = ADSEvaluator(test, models=[my_model], 
                         training_data=train)




In the run recorded in this notebook, the model scores well on the test data, but there is still work that could be done to improve its overall performance.

There are a number of common metrics that are used to assess the quality of a model. `ADS` provides a convenient way to compare models and highlights the model with the highest score in each metric. The following cell computes the metrics on the test and training datasets. In the output recorded below, the second table is close to perfect (accuracy 0.999, ROC AUC 1.0), as expected for the training data, while the first table is lower (accuracy 0.9636), as expected for the test data. A gap between training and test metrics is an indication of overfitting. Because the class imbalance was fixed before the data was split, resampled rows may appear in both sets, so the test metrics may be optimistic.

The goal of this exercise is to create a model, not an ideal model. Therefore, the next step is to prepare the model for productionalization.

In [8]:
evaluator.metrics

| Metric | RandomForestClassifier |
|---|---|
| Accuracy | 0.9636 |
| Hamming distance | 0.0364 |
| Precision | 0.9669 |
| Recall | 0.959 |
| F1 | 0.963 |
| ROC AUC | 0.9912 |


| Metric | RandomForestClassifier |
|---|---|
| Accuracy | 0.999 |
| Hamming distance | 0.001 |
| Precision | 1.0 |
| Recall | 0.998 |
| F1 | 0.999 |
| ROC AUC | 1.0 |
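
Several of the reported metrics are related to each other, which gives a quick sanity check on the output above. The sketch below recomputes F1 as the harmonic mean of precision and recall, and the Hamming distance as one minus accuracy (which holds for single-label classification), using the values copied from the two tables. The dictionary names `first` and `second` are only labels for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the two metric tables above
first = {"Accuracy": 0.9636, "Precision": 0.9669, "Recall": 0.959}
second = {"Accuracy": 0.999, "Precision": 1.0, "Recall": 0.998}

for name, table in (("first", first), ("second", second)):
    f1 = round(f1_score(table["Precision"], table["Recall"]), 3)
    hamming = round(1 - table["Accuracy"], 4)
    print(name, "F1:", f1, "Hamming distance:", hamming)

# Difference in accuracy between the two tables
accuracy_gap = round(second["Accuracy"] - first["Accuracy"], 4)
print("Accuracy gap:", accuracy_gap)
```

The recomputed values agree with the reported F1 scores (0.963 and 0.999) and Hamming distances (0.0364 and 0.001).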




A binary classification model can have one of four outcomes for each prediction. A true negative is an outcome where the model correctly predicts the negative case, and a false negative is an outcome where the model incorrectly predicts the negative case. A false positive is when the model incorrectly predicts the positive case, and a true positive is when the model correctly predicts the positive case. However, not all false positives and false negatives have the same importance. For example, a cancer test has a higher cost when it incorrectly says that a patient does not have cancer when they do. The `.calculate_cost()` method allows the cost to be computed for each model based on the cost of each class of prediction.

In [9]:
evaluator.calculate_cost(tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)

| | model | cost |
|---|---|---|
| 0 | RandomForestClassifier | 754 |
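
One plausible reading of such a cost, assumed here rather than taken from the `ADS` source, is a weighted sum of the four confusion-matrix counts. The sketch below uses the same weights as the cell above with made-up counts. The notebook does not show the confusion matrix, so the cost of 754 cannot be reproduced here. The function `weighted_cost` is hypothetical.

```python
def weighted_cost(tn, fp, fn, tp, tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2):
    """Sum the confusion-matrix counts, each multiplied by its weight."""
    return tn * tn_weight + fp * fp_weight + fn * fn_weight + tp * tp_weight


# Made-up counts for illustration; not the counts behind the cost shown above
cost = weighted_cost(tn=50, fp=5, fn=10, tp=35)
print(cost)  # 50*1 + 5*3 + 10*2 + 35*2 = 155
```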


You can register your model with the OCI Data Science service through ADS. Alternatively, you can use the Oracle Cloud Infrastructure (OCI) Console: go to the Data Science projects page, select a project, and then click Models. The Models page shows the model artifacts that are in the model catalog for a given project.

After a model and its artifacts are registered, they become available for other data scientists if they have the correct permissions.
The ADS SDK automatically captures some of the metadata for you. It captures provenance, taxonomy, and some custom metadata.
ADS has a set of framework-specific classes that take your model and push it to production in a few quick steps.

In [10]:
# Instantiate ads.model.framework.sklearn_model.SklearnModel
sklearn_model = SklearnModel(
    estimator=sk_clf, artifact_dir=tempfile.mkdtemp()
)




The first step is to create a model serialization object. This object wraps your model and has a number of methods to assist in deploying it. There are different model serialization classes for different model frameworks; this notebook uses `SklearnModel` for a `scikit-learn` estimator.
After creating the model serialization object, the next step is to use the `.prepare()` method to create the model artifacts. The `score.py` file is created and customized to your model class. You may still need to modify it for your specific use case, but this is generally not required. The `.prepare()` method can also be used to store metadata about the model, the code used to create the model, the input and output schema, and much more.

This notebook uses the published conda environment as both the inference and the training conda environment. In this case, the published conda environment resides in the bucket `LAB_Conda`, and its path is `oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0`.

In [11]:
# Autogenerate score.py, serialized model, runtime.yaml, input_schema.json and output_schema.json
sklearn_model.prepare(
    training_conda_env="oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0",
    inference_conda_env="oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0",
    X_sample=X,
    y_sample=y,
    force_overwrite=True,
    ignore_introspection=True,
)



algorithm: RandomForestClassifier
artifact_dir:
  /tmp/tmpi59m3lcc:
  - - model.joblib
    - .model-ignore
    - score.py
    - runtime.yaml
    - output_schema.json
    - input_schema.json
framework: scikit-learn
model_deployment_id: null
model_id: null

If you make changes to the score.py file, call the .verify() method to confirm that the load_model() and predict() functions in this file are working. This speeds up your debugging as you do not need to deploy a model to test it.


In [12]:
# The verify method invokes the ``predict`` function defined inside ``score.py`` in the artifact_dir

sklearn_model.verify(Xtest[1:10])["prediction"]

Start loading model.joblib from model directory /tmp/tmpi59m3lcc ...
Model is successfully loaded.


[0, 0, 1, 1, 1, 1, 1, 0, 0]
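
The `load_model()` and `predict()` pair that `.verify()` exercises can be pictured with the toy sketch below. It only mirrors the two function names and the `"prediction"` key seen in the output above. The generated `score.py` is different and more complete, and the JSON "model", the file name `model.json`, and the threshold rule are all made up for this illustration.

```python
import json
import os
import tempfile

# Write a toy "model" (a single threshold) into a temporary artifact directory
artifact_dir = tempfile.mkdtemp()
model_path = os.path.join(artifact_dir, "model.json")  # hypothetical file name
with open(model_path, "w") as f:
    json.dump({"threshold": 0.5}, f)


def load_model(path=model_path):
    """Deserialize the toy model from the artifact directory."""
    with open(path) as f:
        return json.load(f)


def predict(data, model=None):
    """Score each value and return the results under a "prediction" key."""
    if model is None:
        model = load_model()
    return {"prediction": [1 if x > model["threshold"] else 0 for x in data]}


result = predict([0.1, 0.7, 0.5, 0.9])
print(result)  # {'prediction': [0, 1, 0, 1]}
```

Testing these two functions locally, before deployment, is the same idea that `.verify()` applies to the real artifact.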

The .save() method is then used to store the model in the model catalog. Use a unique display name to identify the saved model. 

In [13]:
# Register scikit-learn model
model1 = sklearn_model.save(
    display_name="Attrition-Nov09-23",
    inference_conda_env="oci://LAB_Conda@ocuocictrng22/data-science-gmlv1_0",
    ignore_introspection=True,
)

Start loading model.joblib from model directory /tmp/tmpi59m3lcc ...
Model is successfully loaded.



A call to the `.deploy()` method creates a load balancer and the instances needed to provide an HTTPS access point for inference on the model. Using the `.predict()` method, you can send data to the model deployment endpoint, and it returns the predictions. Make sure to provide a valid display name.

In [14]:
# Deploy and create an endpoint for the RandomForest model
sklearn_model.deploy(display_name="Attrition-Nov09-23")



kind: deployment
spec:
  createdBy: ocid1.datasciencenotebooksession.oc1.phx.amaaaaaas5adu2iacdwunpxq54h2xr2yjjfcyynvbzdxwmjzpv2qtbbm5vca
  definedTags:
    Oracle-Tags:
      CreatedOn: '2023-11-09T09:23:43.232Z'
  displayName: Attrition-Nov09-23
  id: ocid1.datasciencemodeldeployment.oc1.phx.amaaaaaas5adu2iauzp5w6fjusvwmh6ylyp6qb43dm5oudrqztfjqhjhlnzq
  infrastructure:
    kind: infrastructure
    spec:
      bandwidthMbps: 10
      compartmentId: ocid1.compartment.oc1..aaaaaaaabu6pgqbwe4are4ke7uzkq44rbvbnxwhybhmplialatq54kdvq4jq
      deploymentType: SINGLE_MODEL
      policyType: FIXED_SIZE
      projectId: ocid1.datascienceproject.oc1.phx.amaaaaaas5adu2iaa45m3uo4rgrpu4fzgpyltcsf4bn5grgpzlc675dmiprq
      replica: 1
      shapeConfigDetails:
        memoryInGBs: 16.0
        ocpus: 2.0
      shapeName: VM.Standard.E4.Flex
      webConcurrency: '10'
    type: datascienceModelDeployment
  lifecycleDetails: Creating compute resources.
  lifecycleState: CREATING
  modelDeploymentUrl: 

In [None]:
print(f"Endpoint: {sklearn_model.model_deployment.url}")

sklearn_model.predict(Xtest)["prediction"]