<font color=gray>Oracle Cloud Infrastructure Data Science Demo Notebook

Copyright (c) 2021 Oracle, Inc.<br>
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font> Predicting Employee Attrition with ADS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> OCI Data Science PM Team </font></p>

***

## Overview:

This notebook uses an employee attrition dataset. We start with an exploratory data analysis (EDA) to understand the data, then train a model with `scikit-learn`. We use the model to make predictions and evaluate how well it generalizes to new data. Finally, we prepare the model and save it to the model catalog using Oracle's Accelerated Data Science (`ADS`) library.

Let's start with all of the imports this notebook needs.

**<font color='red'>NOTE: This notebook was run in the TensorFlow 2.7 for CPU (slug: `tensorflow27_p37_cpu_v1`) conda environment with ADS version 2.5.10. Upgrade your version of ADS (see cell below) and restart your kernel.</font>**

In [None]:
!pip install oracle-ads==2.5.10
!pip install onnxconverter-common --upgrade

In [None]:
import io
import warnings
import logging
import os
from os import path 
from os.path import expanduser
from os.path import join

from category_encoders.ordinal import OrdinalEncoder
from collections import defaultdict

from ads.common.model import ADSModel
from ads.dataset.factory import DatasetFactory
from ads.evaluations.evaluator import ADSEvaluator

import pandas as pd

from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import get_scorer
import numpy as np 

import ads 
ads.set_auth(auth='resource_principal') 

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

## Open and Visualize the Attrition Dataset using `ADS`

<a id='binaryclassifition'></a>
### Binary Classification

Binary classification is a technique of classifying observations into one of two groups. In this notebook, the two groups are those employees that will leave the organization and those that will not. 

Given the features in the data, the model will determine the optimal criteria for classifying an observation as leaving or not leaving. This optimization is based on the training data. However, we will hold out some of the data to test the model's performance. Models can over-fit the training data: they learn the noise in a dataset and then do a poor job of predicting the results on new data (test data). Since we already know the truth for the data in the training dataset, we are really interested in how well the model performs on the test data.
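
The gap between training and test performance can be seen on a small synthetic example. The sketch below does not use the attrition data: the dataset comes from `make_classification` with deliberately noisy labels (`flip_y=0.2`), and all sizes and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary data (NOT the attrition dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Hold out 10% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)  # accuracy on data the model has seen
test_accuracy = clf.score(X_test, y_test)     # accuracy on held-out data

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A training accuracy well above the test accuracy suggests the model has memorized part of the label noise, which is why the held-out score is the one to trust.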

<a id='data'></a>
### The Dataset

This is a fictional data set which contains 1,470 rows. There are 36 features. 22 features are ordinal, 11 are categorical, and 3 are constant values. The features include basic demographic information, compensation level, job characteristics, job satisfaction and employee performance metrics. The data is not balanced as fewer employees leave than stay.

The first step is to load in the dataset. To do this the `DatasetFactory` singleton object will be used. It is part of the `ADS` library. It is a powerful class to work with datasets from different sources.

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services.  You can access the `orcl_attrition` dataset license [here](oracle_data/UPL.txt). Dataset `orcl_attrition` is distributed under UPL license. 
</font>

In [None]:
# ADS version used in this notebook: 
print(ads.__version__)

The code cell below works only if your notebook session is running in the **Ashburn** region. If it is not, follow these instructions to download the file to your local computer, upload it to the notebook session, and use it to create a dataset.

1. Download the file from this public url and save it on your local computer: 
https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/synthetic%2Forcl_attrition.csv

2. Use the Upload Files button (or drag and drop) to upload the csv file to the same folder as 1-model-training.ipynb

3. Run the following instruction instead of the code cell below.
```
employees = DatasetFactory.open("synthetic_orcl_attrition.csv", format='csv', delimiter=",", target="Attrition").set_positive_class('Yes')
```

In [None]:
bucket_name = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"
employees = DatasetFactory.open(
    "oci://{}@{}/synthetic/orcl_attrition.csv".format(bucket_name, namespace),
    target="Attrition",
    storage_options={'config': {}, 'region': 'us-ashburn-1'},
).set_positive_class('Yes')
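
The path passed to `DatasetFactory.open()` follows the pattern `oci://<bucket>@<namespace>/<object path>`. As a rough illustration of how such a path breaks down, here is a small standalone helper. `split_oci_uri` is a hypothetical name used only for this sketch; it is not part of `ADS`.

```python
def split_oci_uri(uri):
    """Split 'oci://<bucket>@<namespace>/<object path>' into its three parts.

    Hypothetical helper for illustration only; not an ADS function.
    """
    prefix = "oci://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an oci:// URI: {uri!r}")
    # Everything up to the first "/" is "<bucket>@<namespace>".
    location, _, object_path = uri[len(prefix):].partition("/")
    bucket, separator, namespace = location.partition("@")
    if not separator or not bucket or not namespace:
        raise ValueError(f"expected '<bucket>@<namespace>' in {uri!r}")
    return bucket, namespace, object_path


uri = "oci://{}@{}/synthetic/orcl_attrition.csv".format(
    "hosted-ds-datasets", "bigdatadatasciencelarge")
parts = split_oci_uri(uri)
print(parts)
```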

<a id='viz'></a>
### Visualize the Dataset Object

The `show_in_notebook` method can be applied to the dataset itself. When this is done the following is produced:

  - Summary, this shows a brief description of the dataset, shape, and a breakdown by feature type
  - Feature summary, a visualization created on a dataset sample to give an idea of distribution for each feature.
  - Correlations, a map which shows how every pair of features (numeric and categorical) is correlated
  - Data preview, the first five rows of the data


In [None]:
#employees.show_in_notebook()

In [None]:
#employees.show_corr()

<a id='trans'></a>
### Get and Apply Transformation Recommendations

`ADS` can help with feature engineering by transforming datasets. For example, it can fix class imbalance by up- or downsampling. This is just one example of the many transforms that `ADS` can apply. You can have `ADS` analyze the data and automatically perform the transformations that it thinks would improve the model. This is done with the `auto_transform()` method. The `suggest_recommendations()` method lets you explore the suggested transforms in the notebook's UI and select the ones you would like to apply.

All ADS datasets are immutable; any transform that is applied results in a new dataset. In this example, the notebook performs the automatic transformations on the data but leaves the class imbalance as is (`fix_imbalance=False`).
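
To make the idea of upsampling concrete, here is a minimal pure-Python sketch that duplicates randomly chosen minority-class rows until every class is as large as the majority class. It is an assumption about what upsampling looks like in general, not a description of how `ADS` implements it; the helper name `upsample_minority` and the ten sample rows are hypothetical.

```python
import random
from collections import Counter


def upsample_minority(rows, label_key, seed=0):
    """Return a new list in which every class has as many rows as the largest one.

    Hypothetical helper; rows are re-drawn with replacement from each smaller class.
    """
    rng = random.Random(seed)
    counts = Counter(row[label_key] for row in rows)
    majority_size = max(counts.values())
    balanced = list(rows)  # the input list is left untouched
    for label, count in counts.items():
        pool = [row for row in rows if row[label_key] == label]
        balanced.extend(rng.choices(pool, k=majority_size - count))
    return balanced


# Hypothetical toy data: 8 employees who stay, 2 who leave.
rows = [{"Attrition": "No"}] * 8 + [{"Attrition": "Yes"}] * 2
balanced = upsample_minority(rows, "Attrition")
balanced_counts = Counter(row["Attrition"] for row in balanced)
print(balanced_counts)
```

Downsampling works the other way around: rows are dropped from the larger class until the classes match.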

In [None]:
transformed_ds = employees.auto_transform(fix_imbalance=False)

Let's split the dataset into training and test sets. By default, `train_test_split()` produces a 90/10 train/test split. Set the `test_size` parameter to change the size of the test dataset.

In [None]:
train, test = transformed_ds.train_test_split()
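
For intuition about what a 90/10 split means for the 1,470 rows of this dataset, the sketch below shuffles row indices and slices them. It is a simplified stand-in, not the `ADS` implementation; `split_indices` and its `seed` argument are hypothetical.

```python
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices and return (train_indices, test_indices).

    Hypothetical helper used only to illustrate the split proportions.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1470)                 # default 90/10
train_idx_80, test_idx_80 = split_indices(1470, test_size=0.2)

print(len(train_idx), len(test_idx))
print(len(train_idx_80), len(test_idx_80))
```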

### Training a `scikit-learn` Random Forest Model 

Below we create our own label encoder for the categorical features found in our dataset. We use `category_encoders` to do this, and we apply the encoder to all columns of type `object` or `category`. This is a preprocessing step that runs before the model is trained.

The class will be written locally as a Python module (`dataframelabelencoder.py`). We will capture that file as part of the model artifact.

In [None]:
%%writefile dataframelabelencoder.py 

from category_encoders.ordinal import OrdinalEncoder
from collections import defaultdict

from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

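class DataFrameLabelEncoder(TransformerMixin):
    """Ordinal-encode every column of dtype `object` or `category`."""

    def __init__(self):
        self.label_encoders = defaultdict(LabelEncoder)

    def fit(self, X, y=None):
        # y is accepted (and ignored) so the encoder can also be fitted inside a Pipeline.
        for column in X.columns:
            if X[column].dtype.name in ["object", "category"]:
                self.label_encoders[column] = OrdinalEncoder()
                self.label_encoders[column].fit(X[column])
        return self

    def transform(self, X):
        # Work on a copy so the caller's dataframe is not modified in place.
        X = X.copy()
        for column, label_encoder in self.label_encoders.items():
            X[column] = label_encoder.transform(X[column])
        return X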
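
Conceptually, ordinal encoding replaces each distinct category with an integer. The pure-Python sketch below shows the idea on a hypothetical column. Numbering from 1 and using `-1` for unseen categories are choices made for this sketch; the exact conventions of `category_encoders` may differ, so check its documentation before relying on them.

```python
def fit_ordinal_mapping(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def apply_ordinal_mapping(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# Hypothetical categorical column.
departments = ["Sales", "Research", "Sales", "HR"]
mapping = fit_ordinal_mapping(departments)

# "Legal" was not seen during fitting, so it falls back to the unknown code.
encoded = apply_ordinal_mapping(["HR", "Sales", "Legal"], mapping)
print(mapping)
print(encoded)
```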

Here we train the model. We use the `scikit-learn` `make_pipeline()` function to assemble the data transformation and the model estimator into a single `Pipeline` object.

In [None]:
from dataframelabelencoder import DataFrameLabelEncoder

X = train.X.copy()
y = train.y.copy()

le = DataFrameLabelEncoder()
X = le.fit_transform(X)

sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, y)

sk_model = make_pipeline(le, sk_clf)

# Build an ADS model from the pipeline (label encoder + random forest classifier)
my_model = ADSModel.from_estimator(sk_model, 
                                   name=sk_clf.__class__.__name__)
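
The value of a pipeline is that one `predict()` call runs every preprocessing step before the estimator. The sketch below checks this on synthetic data, using a `StandardScaler` in place of the notebook's label encoder (an assumption made to keep the example self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (NOT the attrition dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit the transformer and the classifier together, in one call.
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
pipeline.fit(X, y)
pipeline_preds = pipeline.predict(X[:5])

# The same prediction, done by hand: transform first, then predict.
scaler = pipeline.named_steps["standardscaler"]
clf = pipeline.named_steps["randomforestclassifier"]
manual_preds = clf.predict(scaler.transform(X[:5]))

same_predictions = bool((pipeline_preds == manual_preds).all())
print(same_predictions)
```

Serializing the pipeline rather than the bare classifier means the scoring code receives raw features and does not need to repeat the preprocessing itself.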

<a id='eval'></a>
## Evaluate The Model using `ADSEvaluator`

One of the key advantages of `ADS` is the ability to quickly evaluate models. `ADS` supports evaluating:

- regression
- binary classification
- multiclass classification

`ADS` also lets you provide your own evaluation function (given `y_true` and `y_pred` series) for any specialized calculation that you would like to run.
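
As an example of what such a function might look like, the sketch below computes balanced accuracy (the mean of the recall on each class) from `y_true` and `y_pred`. Only the function itself is shown; how it is registered with the evaluator is not covered here, and the labels are hypothetical.

```python
def balanced_accuracy(y_true, y_pred, positive="Yes"):
    """Mean of the recall on the positive class and the recall on the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against classes that are absent from y_true.
    recall_positive = tp / (tp + fn) if (tp + fn) else 0.0
    recall_negative = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall_positive + recall_negative) / 2


# Hypothetical labels, for illustration only.
y_true = ["Yes", "No", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "Yes", "No", "No", "No"]
score = balanced_accuracy(y_true, y_pred)
print(score)
```

A metric of this kind is useful for unbalanced data such as this dataset, where plain accuracy can look good for a model that rarely predicts the minority class.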

Below, we examine the plots that are commonly used to evaluate model performance. These include the precision-recall, ROC, lift, and gain plots. Each model under study is plotted together, allowing for easy comparison. In addition, the normalized confusion matrices are provided.

In [None]:
evaluator = ADSEvaluator(test, models=[my_model], 
                         training_data=train)
evaluator.show_in_notebook()

There are a number of common metrics that are used to assess the quality of a model. `ADS` provides a convenient method to compare the models and highlights the model with the highest score in each metric.

In [None]:
evaluator.metrics

A binary classification model can have one of four outcomes for each prediction. A true-negative is an outcome where the model correctly predicts the negative case, and a false-negative is an outcome where the model incorrectly predicts the negative case. A false-positive is when the model incorrectly predicts the positive case, and a true-positive is when the model correctly predicts the positive case. However, not all false-positives and false-negatives have the same importance. For example, a cancer test has a higher cost when it incorrectly says that a patient does not have cancer when they do. The `calculate_cost` method allows the cost to be computed for each model based on the cost of each class of prediction.

In [None]:
evaluator.calculate_cost(tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)
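
The four outcomes and a weighted cost can be worked through by hand. The sketch below assumes the cost is simply each outcome count multiplied by its weight and summed; that is one plausible reading of the weights above, not a confirmed description of what `calculate_cost` computes. The labels are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter(tn=0, fp=0, fn=0, tp=0)
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def weighted_cost(counts, tn_weight, fp_weight, fn_weight, tp_weight):
    """Assumed cost formula: sum of (outcome count x outcome weight)."""
    return (counts["tn"] * tn_weight + counts["fp"] * fp_weight
            + counts["fn"] * fn_weight + counts["tp"] * tp_weight)


# Hypothetical labels, for illustration only.
y_true = ["Yes", "No", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "Yes", "No", "No", "No"]

counts = confusion_counts(y_true, y_pred)
cost = weighted_cost(counts, tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)
print(dict(counts), cost)
```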

<a id='save'></a>
## Preparing the Model Artifact

Now we can save the random forest model to the model catalog. We use the flexible `prepare_generic_model()` function, which creates an editable template artifact. `prepare_generic_model()` can support **any** model and **should always be the preferred way to save models from open source libraries**.

`prepare_generic_model()` gives you complete control over the structure of the artifact and the definition of the functions in `score.py`.

Note in the cell below that we specify an `inference_conda_env` value. This parameter corresponds to the conda environment we want to use for model deployment. A reference to that environment is written to `runtime.yaml` when you run `prepare_generic_model()`. The path represents where the conda environment is stored in object storage. You can find that information in the Environment Explorer.

In [None]:
from ads.common.model_artifact import ModelArtifact
from ads.common.model_export_util import prepare_generic_model
import joblib 
import os

# Path to artifact directory for my sklearn model: 
model_artifact_location = os.path.expanduser('./model-artifact/')
os.makedirs(model_artifact_location, exist_ok=True)

# Creating a joblib pickle object of my random forest model: 
joblib.dump(sk_model, os.path.join(model_artifact_location, "model.joblib"))

# Creating the artifact template files in the directory: 
sklearn_artifact = prepare_generic_model(model_artifact_location, 
                                         inference_conda_env="oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/TensorFlow 2.7 for CPU on Python 3.7/1.0/tensorflow27_p37_cpu_v1",
                                         force_overwrite=True,
                                         model='model.joblib',
                                         use_case_type='BINARY_CLASSIFICATION',
                                         X_sample=train.X,
                                         y_sample=train.y)
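
Before relying on a serialized model, it can be worth confirming that it survives a round trip through `joblib`. The sketch below does this in a temporary directory with a small random forest trained on synthetic data; neither the directory nor the data has anything to do with the artifact created above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model, for illustration only.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as artifact_dir:
    model_path = os.path.join(artifact_dir, "model.joblib")
    joblib.dump(model, model_path)      # serialize to disk
    restored = joblib.load(model_path)  # read it back while the file still exists

# The restored model should reproduce the original predictions exactly.
round_trip_ok = bool((restored.predict(X) == model.predict(X)).all())
print(round_trip_ok)
```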

Next, we copy the `dataframelabelencoder.py` module into the model artifact directory. The serialized pipeline requires the module to be defined and available when you de-serialize the pipeline and load it into memory.
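
The reason the module must travel with the artifact is that pickle-based formats (which `joblib` builds on) store a reference to a class, that is, its module and class name, rather than the class's code. The sketch below illustrates this with a standard-library class, `fractions.Fraction`, chosen only because it is importable everywhere.

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(1, 3))

# The payload names the module and the class; it does not contain their source code.
stores_module_name = b"fractions" in payload
stores_class_name = b"Fraction" in payload

# Loading works here because the `fractions` module can be imported.
restored = pickle.loads(payload)
print(stores_module_name, stores_class_name, restored)
```

In the same way, loading `model.joblib` needs to be able to import `dataframelabelencoder`, so the file has to sit next to `score.py` in the artifact directory.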

We also tweak the `score.py` template that `prepare_generic_model()` created, ensuring that `load_model()` reads in the `model.joblib` file.

In [None]:
#setting paths for artifact files that need to be modified: 

encoder_path = os.path.join(model_artifact_location, "dataframelabelencoder.py")
score_path = os.path.join(model_artifact_location, "score.py")
!cp dataframelabelencoder.py {encoder_path}

In [None]:
%%writefile {score_path}

"""
   Inference script. This script is used for prediction by scoring server when schema is known.
"""

import json
import os
from joblib import load
import io 
import pandas as pd
import logging 

# logging configuration - OPTIONAL 
logging.basicConfig(format='%(name)s - %(levelname)s - %(message)s', level=logging.INFO)
logger_pred = logging.getLogger('model-prediction')
logger_pred.setLevel(logging.INFO)
logger_feat = logging.getLogger('input-features')
logger_feat.setLevel(logging.INFO)

from dataframelabelencoder import DataFrameLabelEncoder

def load_model():
    """
    Loads model from the serialized format

    Returns
    -------
    model:  a model instance on which predict API can be invoked
    """
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    model_file_name = "model.joblib"
    # Load the pipeline that was serialized with joblib.
    if model_file_name in contents:
        with open(os.path.join(model_dir, model_file_name), "rb") as file:
            model = load(file)
    else:
        raise Exception('{0} is not found in model directory {1}'.format(model_file_name, model_dir))
    
    return model


def predict(data, model=load_model()) -> dict:
    """
    Returns prediction given the model and data to predict

    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format as expected by the predict API of the core estimator. For example, for scikit-learn models it could be a numpy array, a list of lists, or a pandas DataFrame

    Returns
    -------
    predictions: Output from scoring server
        Format: { 'prediction': output from `model.predict` method }

    """
    assert model is not None, "Model is not loaded"
    X = pd.read_json(io.StringIO(data)) if isinstance(data, str) else pd.DataFrame.from_dict(data)
    preds = model.predict(X).tolist()
#    logger_pred.info(preds)
#    logger_feat.info(X)    
    return { 'prediction': preds }

### Testing the artifact before saving to the catalog

It is always a good idea to test your model artifact before saving it to the catalog. Here we import `load_model` and `predict` from the `score.py` module. We call `predict()` on the first five rows of the training dataframe, then call the `predict()` method of the sklearn pipeline on the same rows. The two prediction lists should be identical.

In [None]:
input_data = train.X[:5]

In [None]:
import sys 

# add the path of score.py: 
sys.path.insert(0, model_artifact_location)

from score import load_model, predict

# Load the model to memory 
_ = load_model()
# make predictions on the first five rows of the training dataset: 
predictions = predict(input_data.to_json()) 

# The two lists should match:
print(f"* * * score.predict() and the pipeline predict() return the same predictions \
on the same data: {sk_model.predict(input_data).tolist() == predictions['prediction']}")
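
The `predict()` function in `score.py` accepts either a JSON string or a dictionary. The sketch below shows that both branches rebuild the same values from a payload produced by `DataFrame.to_json()`. The two-row frame and its column names are hypothetical stand-ins for the real features.

```python
import io
import json

import pandas as pd

# Hypothetical rows; the column names are placeholders.
frame = pd.DataFrame({"Age": [41, 49], "Department": ["Sales", "Research"]})
payload = frame.to_json()

# Branch 1: the payload arrives as a JSON string.
from_string = pd.read_json(io.StringIO(payload))

# Branch 2: the payload has already been parsed into a dictionary.
from_dict = pd.DataFrame.from_dict(json.loads(payload))

print(from_string.to_dict("list"))
print(from_dict.to_dict("list"))
```

Note that the row labels can differ between the two branches (integers versus strings), which does not matter for prediction because the model only looks at the column values.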

## Saving the Model to the Model Catalog

In [None]:
mc_model = sklearn_artifact.save(project_id=os.environ['PROJECT_OCID'], 
                               compartment_id=os.environ['NB_SESSION_COMPARTMENT_OCID'], 
                               training_id=os.environ['NB_SESSION_OCID'],
                               display_name="attrition-model",
                               ignore_introspection=False,
                               description="simple sklearn model to predict employee attrition", 
                               training_script_path="1-model-training.ipynb", 
                               ignore_pending_changes=True)
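
The `save()` call above reads three OCIDs from environment variables and raises a bare `KeyError` if one of them is missing. A small guard can give a clearer message. The helper below, `require_env`, is a hypothetical name, and the OCID value is a placeholder, not a real identifier.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# A stand-in environment so the sketch runs anywhere; the value is a placeholder.
fake_environ = {"PROJECT_OCID": "ocid1.datascienceproject.oc1..example"}

project_id = require_env("PROJECT_OCID", fake_environ)

try:
    require_env("NB_SESSION_OCID", fake_environ)
    missing_detected = False
except RuntimeError as error:
    missing_detected = True
    print(error)
```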

This is the model OCID of the newly created model:

In [None]:
print(mc_model.id)