<font color=gray>Oracle Cloud Infrastructure Data Science Demo Notebook

Copyright (c) 2021 Oracle, Inc.<br>
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font> Predicting Employee Attrition with ADS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> OCI Data Science PM Team </font></p>

***

## Overview:

In this notebook, we will be using an employee attrition dataset. We will start by doing an exploratory data analysis (EDA) to understand the data. Then a model will be trained using `AutoML`. The model will be used to make predictions, and it will be evaluated to determine how well it generalizes to new data. Then we will use machine learning explainability (`MLX`) to understand the global and local model behavior. This will all be done using Oracle's Accelerated Data Science (`ADS`) library.

## Other Public Technical Resources 

- [Deploying a Machine Learning Model with Oracle Functions](https://blogs.oracle.com/datascience/deploying-a-machine-learning-model-with-oracle-functions) End to end demo with instructions on how to build, train, and deploy a machine learning model on OCI using Data Science, Cloudshell and Oracle Functions. 
- [A simple Guide to Leveraging Parallelization for Machine Learning Tasks](https://blogs.oracle.com/datascience/parallelization-machine-learning) This blog post covers a few options that are available to a data scientist who wants to parallelize a workload done on a data frame. It covers approaches that offer multi-threading and multiprocessing execution. Each method provides benchmarks in terms of speed of execution that you can run in your notebook session.
- [Running Python Processes/jobs in the Notebook Session Environment](https://blogs.oracle.com/datascience/execute-a-python-process-in-the-oracle-cloud-infrastructure-data-science-notebook-session-environment)
- [Using Resource Principals in the Data Science service](https://blogs.oracle.com/datascience/resource-principals-data-science-service) 
- [Build and Deploy a Model in 9 Minutes using ONNX on OCI](https://blogs.oracle.com/datascience/deploy-machine-learning-models-with-onnx) 


## Business Use:

Organizations can face significant costs resulting from employee turnover. Some costs are tangible such as training expenses and the time it takes from when an employee starts to when they become a productive team member. Generally, the most important costs are intangible. Consider what is lost when a productive employee quits: corporate knowledge, new product ideas, great project management, and customer relationships. With advances in machine learning and data science, it's possible to not only predict employee attrition but to understand the key variables that influence turnover.

---

## Objectives:
By the end of this tutorial, you will know how to:
 - <a href='#setup'>0. Setup</a> the required packages:
 - <a href='#data'>1. Open and Visualize Datasets using `ADS`</a>
      - <a href='#binaryclassifition'>1.1. Binary Classification</a>
      - <a href='#data'>1.2. The Dataset</a>
      - <a href='#eda'>1.3. Exploratory Data Analysis</a> 
      - <a href='#viz'>1.4. Visualize the Dataset Object</a>
      - <a href='#sss'>1.5. **(Optional)** Using Oracle Cloud Infrastructure Data Flow (serverless Spark service) to perform data transformations at scale</a>
      - <a href='#trans'>1.6. Get and Apply Transformation Recommendations</a> 
 - <a href='#model'>2. Building and Visualizing Models</a>
      - <a href='#automl'>2.1. Oracle AutoML</a>
      - <a href='#other_sources'>2.2. `ADS` Supports Models from other Sources</a> 
 - <a href='#eval'>3. Evaluate Models using `ADSEvaluator`</a>
 - <a href='#explain'>4. Explain How the Models Work using `ADSExplainer`</a>
      - <a href='#featurepermutation'>4.1. Feature Permutation Importance</a>
           - <a href='#fpalgo'>4.1.1. Description of the algorithm</a>
           - <a href='#fpinterpret'>4.1.2. Interpreting the output</a>
      - <a href='#pdpice'>4.2. Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) Explanations</a>
           - <a href='#pdpicealgo'>4.2.1. Description of the algorithm</a>
           - <a href='#pdpiceinterpret'>4.2.2. Interpreting the output</a>
      - <a href='#localexplanations'>4.3. Local Explanations</a>
           - <a href='#localexplanationsalgo'>4.3.1. Description of the algorithm</a>
           - <a href='#localexplanationsinterpret'>4.3.2. Interpreting the output</a>
 - <a href='#save'>5. Saving the model to the model catalog</a>
 - <a href='#appendix'>**(Optional)** Appendix: Deploy your Model to Oracle Functions</a>
      - <a href ='#invoke'>Invoke your Deployed Model</a>
 - <a href='#conclusion'>6. Conclusion</a>
 
***

Let's do all of the imports necessary to get this notebook working up here.

In [None]:
import io
import warnings
import logging
import os
from os import path 
from os.path import expanduser
from os.path import join
warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)
from category_encoders.ordinal import OrdinalEncoder
from collections import defaultdict

from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider
from ads.catalog.model import ModelCatalog
from ads.catalog.project import ProjectCatalog
from ads.common.model_artifact import ModelArtifact
from ads.common.data import MLData
from ads.common.model import ADSModel
from ads.dataflow.dataflow import DataFlow
from ads.dataset.factory import DatasetFactory
from ads.evaluations.evaluator import ADSEvaluator
from ads.explanations.explainer import ADSExplainer
from ads.common.model_export_util import prepare_generic_model
from ads.explanations.mlx_whatif_explainer import MLXWhatIfExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer

import pandas as pd

from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import get_scorer

import ads 
ads.set_auth(auth='resource_principal') 


## Open and Visualize the Attrition Dataset using `ADS`

<a id='binaryclassifition'></a>
### Binary Classification

Binary classification is a technique of classifying observations into one of two groups. In this notebook, the two groups are those employees that will leave the organization and those that will not. 

Given the features in the data, the model will determine the optimal criteria for classifying an observation as leaving or not leaving. This optimization is based on the training data. However, we will hold out some of the data to test the model's performance. Models can over-fit the training data, that is, learn the noise in a dataset, and then do a poor job of predicting the results on new data (test data). Since we already know the truth for the data in the training dataset, we are really interested in how well the model performs on the test data.
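
The effect of holding out data can be sketched with a deliberately over-fitting model. The example below is an illustration only: it uses synthetic data and a hand-written 1-nearest-neighbor classifier (neither comes from this notebook), and the function names are hypothetical. Because the classifier memorizes the training rows, noise included, it scores perfectly on the training data, while its accuracy on the held-out rows is typically lower.

```python
import random

def nearest_neighbor_predict(train_rows, train_labels, row):
    """Predict the label of the closest training row (1-nearest neighbor)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in train_rows]
    return train_labels[distances.index(min(distances))]

def accuracy(train_rows, train_labels, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(nearest_neighbor_predict(train_rows, train_labels, r) == y
               for r, y in zip(rows, labels))
    return hits / len(rows)

rng = random.Random(0)
# Synthetic data: the label depends on the first feature, but about 20% of labels are flipped (noise).
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [(r[0] > 0.5) != (rng.random() < 0.2) for r in rows]

# Hold out the last 25% of the rows as test data.
split = int(len(rows) * 0.75)
train_rows, test_rows = rows[:split], rows[split:]
train_labels, test_labels = labels[:split], labels[split:]

train_accuracy = accuracy(train_rows, train_labels, train_rows, train_labels)
test_accuracy = accuracy(train_rows, train_labels, test_rows, test_labels)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

The training accuracy says nothing about generalization here, which is why the held-out score is the one to watch.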

<a id='data'></a>
### The Dataset

This is a fictional data set which contains 1,470 rows. There are 36 features. 22 features are ordinal, 11 are categorical, and 3 are constant values. The features include basic demographic information, compensation level, job characteristics, job satisfaction and employee performance metrics. The data is not balanced as fewer employees leave than stay.

The first step is to load in the dataset. To do this the `DatasetFactory` singleton object will be used. It is part of the `ADS` library. It is a powerful class to work with datasets from different sources.

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services.  You can access the `orcl_attrition` dataset license [here](oracle_data/UPL.txt). Dataset `orcl_attrition` is distributed under UPL license. 
</font>

In [None]:
attrition_path = "https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/synthetic%2Forcl_attrition.csv"

employees = DatasetFactory.open(attrition_path,
      target="Attrition").set_positive_class('Yes')

<a id='viz'></a>
### Visualize the Dataset Object

The `show_in_notebook` method can be applied to the dataset itself. When this is done the following is produced:

  - Summary, this shows a brief description of the dataset, shape, and a breakdown by feature type
  - Feature summary, a visualization created on a dataset sample to give an idea of distribution for each feature.
  - Correlations, a map which shows how each pair of features (numeric and categorical) is correlated
  - Data preview, the first five rows of the data


In [None]:
employees.show_in_notebook()

In [None]:
employees.show_corr()

In [None]:
employees.plot("MonthlyIncome", y="JobRole", plot_type='infer')

In [None]:
employees.get_recommendations()

In [None]:
ds = employees

<a id='eda'></a>
### Exploratory Data Analysis

The `show_in_notebook` method is used in many classes within `ADS`. It makes a best effort to display information that is meaningful to a data scientist about the object. Below, it is applied to the target variable, and it will show the relative frequency of the classes in the data. Since the target is `Attrition` here, `False` denotes employees who did not leave, and `True` denotes employees who did.

It shows that the data is imbalanced.

<a id='sss'></a>
### (Optional) Using Oracle Cloud Infrastructure Data Flow to perform data transformations at scale

You can skip this section and go directly to **Get and Apply Transformation Recommendations**. 

In this particular example, we have access to additional employee-level datasets that are stored on Oracle Object Storage. These datasets contain timestamp records of every time an employee either entered or exited the office using their keycard. We think that the number of keycard events could be predictive of employee attrition. Since the dataset is quite large, we are going to process it using Data Flow, the serverless Spark service on Oracle Cloud Infrastructure. `ADS` is deeply integrated with Data Flow. Creating Data Flow applications and runs is very easy with `ADS`.

Beyond the notebook environment, you can use the Data Flow service to perform the data transformation using PySpark. We have a notebook example (dataflow.ipynb) which walks you through the details of creating Data Flow applications and runs with `ADS`.

Let's just create a simple SparkSQL script to perform the aggregation. All the data files are stored on object storage. **Change the name of the file before running the cell:**

In [None]:
pyspark_file_path = path.join(path.expanduser("~"), "dataflow", "employee-keycard-v25.py")

In [None]:
%%writefile $pyspark_file_path

from pyspark.sql import SparkSession

def main():
    
    # create a spark session
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()
    
    # load csv file from dataflow public storage
    df = spark \
        .read \
        .format("csv") \
        .option("header", "true") \
        .option("multiLine", "true") \
        .load("oci://data-science-data-flow-data@bigdatadatasciencelarge/keycard_log.csv")
    
    # create a temp view and do some sql operations
    df.createOrReplaceTempView("keycard")
    query_result_df = spark.sql("""
        SELECT 
            EmployeeNumber, 
            COUNT(*)
        FROM keycard
        GROUP BY EmployeeNumber
    """)
    
    # Convert the filtered Spark DataFrame into json format
    # Note: we are writing to the spark stdout log so that we can retrieve the log later.
    print('\n'.join(query_result_df.toJSON().collect()))
    
if __name__ == '__main__':
    main()

In [None]:
# Creating a DataFlow() instance: 

use_dataflow = False 

if use_dataflow: 

    data_flow = DataFlow()

    # User would need to update the value assigned to display_name
    display_name = "<insert-your-application-name>"
    # User would need to update the value assigned to script_bucket
    script_bucket = "<insert-the-object-storage-bucket-name-for-your-script>"
    # User would need to update the value assigned to logs_bucket
    logs_bucket = "<insert-the-object-storage-bucket-name-for-the-logs>"
    
    # The next step in the process is to prepare and create the application: 
    
    app_config = data_flow.prepare_app(display_name,
                                   script_bucket,
                                   pyspark_file_path,
                                   logs_bucket=logs_bucket,
                                   driver_shape='VM.Standard2.4',
                                   executor_shape='VM.Standard2.4',
                                   num_executors=2)

    app = data_flow.create_app(app_config)
    
    run_display_name = "keycard_count"               # User would need to update the value assigned to run_display_name

    run_config = app.prepare_run(run_display_name, logs_bucket=logs_bucket)
   
    # run the application: 
    
    run = app.run(run_config, save_log_to_local=False)
    
    # Fetch the logs: 
    
    run.fetch_log("stdout").save()
    run.fetch_log("stderr").save()
    
    # Let's now explore the data and merge the output of this sparksql script with the original `ADSDataset`: 
    
    # the PySpark script wrote to the log as jsonL, and we read the log back as `ADS` dataset

    keycards = DatasetFactory.open(pd.read_json((str(run.log_stdout)), lines=True))
    keycards = keycards.rename(columns={"count(1)":"keycard_counts"})
    keycards.show_in_notebook()
    
    # Let's merge the keycard data with the employee profile data
    
    ds = employees.merge(keycards, how='left',on="EmployeeNumber")

<a id='trans'></a>
### Get and Apply Transformation Recommendations

`ADS` can help with feature engineering by transforming datasets. For example, it can fix class imbalance by up or downsampling. This is just one example of the many transforms that `ADS` can apply. You can have `ADS` perform an analysis of the data and automatically perform the transformations that it thinks would improve the model. This is done with the `auto_transform()` method. The `suggest_recommendations()` method allows you to explore the suggested transforms using the notebook's UI and select the transformations that you would like it to make.
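
To make the idea of upsampling concrete, here is a minimal sketch of random oversampling in plain Python. It is not the method `ADS` applies internally (that is not described here); the function name `upsample_minority` and the toy data are hypothetical. The sketch duplicates randomly chosen rows of the smaller class until every class is as large as the biggest one.

```python
import random
from collections import Counter

def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of the smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        candidates = [r for r, y in zip(rows, labels) if y == label]
        for _ in range(target_size - count):
            new_rows.append(rng.choice(candidates))
            new_labels.append(label)
    return new_rows, new_labels

# Toy data: 8 employees stay ("No") and 2 leave ("Yes").
toy_rows = [[i] for i in range(10)]
toy_labels = ["No"] * 8 + ["Yes"] * 2
balanced_rows, balanced_labels = upsample_minority(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training data only, so that the test data keeps the class frequencies the model will meet in practice.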

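All `ADS` datasets are immutable; any transforms that are applied result in a new dataset. In this example, the notebook will perform automatic transformations on the data. Because the cell below passes `fix_imbalance=False`, the class imbalance is left as it is; remove that argument or set it to `True` if you also want `ADS` to rebalance the classes.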

In [None]:
transformed_ds = ds.auto_transform(fix_imbalance=False)
#ds.visualize_transforms()

Alternatively, you can use a GUI-based approach with `suggest_recommendations()` as shown below: 

In [None]:
ds.suggest_recommendations()

<a id='model'></a>
## 2. Building and Visualizing Models
<a id='automl'></a>
### Oracle AutoML

The Oracle `AutoML` package automatically tunes a model class to produce the best models. It works with any supervised prediction task (e.g., classification or regression). It supports binary and multi-class classifications as well as regression problems. `AutoML` automates three major stages of the ML pipeline: feature selection, algorithm selection, and hyperparameter tuning. These pieces are combined into a pipeline which automatically optimizes the whole process with minimal user interaction.

The Oracle Labs `AutoML` uses the `OracleAutoMLProvider` object to delegate the model training to `AutoML`.

`AutoML` has a pipeline-level Python API that quickly jumpstarts the data science process with a quality tuned model. It selects the appropriate features and model class for a given prediction task.

Below, the `ADS` `AutoML` driver is used to invoke the Oracle `AutoML` pipeline to tackle the ML problem. It will build and tune models with the `driver.AutoML`. In this example, it will create five models. The four trained models will be Logistic Regression, Light Gradient Boosting Machine, XGBoost, and a Random Forest. In addition, it will create a baseline model for comparison. The baseline model is a naive model that is used to confirm that the trained model learned something meaningful from the data. The baseline, for both classification and regression problems, is called the Zero Rule algorithm, which is also known as ZeroR or the null model. 
- For a regression modeling problem, the Zero Rule algorithm predicts the mean of the training dataset.
- For a classification modeling problem, the Zero Rule algorithm predicts the class with the most observations in the training dataset.
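
The two Zero Rule cases above can be sketched in a few lines of plain Python. This is an illustration of the rule itself, not the baseline implementation that `ADS` ships; the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean

def zero_rule_classifier(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda rows: [majority_label for _ in rows]

def zero_rule_regressor(train_targets):
    """Return a predictor that always outputs the mean of the training targets."""
    average = mean(train_targets)
    return lambda rows: [average for _ in rows]

predict_attrition = zero_rule_classifier(["No", "No", "Yes", "No"])
predict_income = zero_rule_regressor([2000, 4000, 6000])
print(predict_attrition([{"Age": 30}, {"Age": 45}]))  # the features are ignored
print(predict_income([{"Age": 30}]))
```

On an imbalanced dataset such a classifier can reach a high accuracy without learning anything, which is why it is a useful floor for comparison.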

A machine learning model must demonstrate that it is statistically significantly better at prediction than the baseline model. If not, the model has not learned anything meaningful from the data. The value delivered here by `ADS` is:

- A baseline model is always computed with no extra effort by the data scientist.
- A baseline model is always available for comparison with the `AutoML` models.
- Multiple models can be trained and tuned with limited input from the data scientist. It will select:
  - ideal feature set
  - minimal sampling size
  - best model class to use
  - the best set of model-specific hyperparameters

`ADS` also provides the ability to split a dataset into training and testing datasets using the `train_test_split` method, and `train` will train a set of models.

In [None]:
train, test = transformed_ds.train_test_split()
automl = AutoML(train, provider=OracleAutoMLProvider())
model, baseline = automl.train(
    model_list=[
        'LogisticRegression',
        'LGBMClassifier',
        'XGBClassifier',
        'RandomForestClassifier'],
    min_features=['OverTime', 'JobLevel'],
    score_metric="roc_auc",
    time_budget=160)

Let's look at the model's accuracy on the test data.

In [None]:
accuracy_scorer = get_scorer("accuracy") # works with any sklearn scoring function

print("Oracle AutoML accuracy on test data:", model.score(test.X, test.y, score_fn = accuracy_scorer))

The `ADSModel` object wraps the model produced by the `AutoML` provider. It will delegate any attributes to the actual model, so we are able to look at any of the following: 

(✓ indicates it can be visualized)

  - ranked_models_
  - num_fs_evals_
  - selected_features_names_
  - selected_model_params_
  - tuning_trials_ ✓
  - adaptive_sampling_trials_ ✓
  - feature_selection_trials_ ✓
  - model_selection_trials_ ✓

In [None]:
#model.selected_model_params_

There were four learned models (Logistic Regression, Light Gradient Boosting Machine, XGBoost, and a Random Forest). The `ranked_models_` attribute returns the one that performed the best based on the selected model scoring criteria.

In [None]:
#model.ranked_models_

Visualize the mean model score for each of the algorithms.

In [None]:
#automl.visualize_algorithm_selection_trials()

Visualize the relationship between the model's score and the number of features in the model.

In [None]:
#automl.visualize_feature_selection_trials()

Visualize the hyperparameter tuning by plotting the model score for the selected model versus the number of iterations. Generally, the model score will tend to increase as the number of iterations increases.

In [None]:
#automl.visualize_tuning_trials()

<a id='other_sources'></a>
### ADS Supports Models from other Sources

Above, `AutoML` built a number of models including a Random Forest model. However, `ADS` is agnostic to the source of the model as it takes advantage of duck typing: something that looks like a model and walks like a model is a model to `ADS`. Below is an example of how to build a `sklearn` model and then use that with `ADS`.
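
Duck typing itself can be illustrated without any library. In the sketch below, `MajorityVoteModel` and `looks_like_a_model` are hypothetical names; the check is only an assumption about what "looks like a model" could mean (callable `fit` and `predict` methods) and is not the test that `ADS` performs.

```python
class MajorityVoteModel:
    """A hand-written 'model': no base class, only fit() and predict() methods."""

    def fit(self, X, y):
        labels = list(y)
        # Remember the most frequent label seen during training.
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

def looks_like_a_model(candidate):
    """Duck-typing check: accept anything exposing callable fit() and predict()."""
    return all(callable(getattr(candidate, name, None)) for name in ("fit", "predict"))

toy_model = MajorityVoteModel().fit([[1], [2], [3]], ["No", "No", "Yes"])
print(looks_like_a_model(toy_model), looks_like_a_model(object()))
print(toy_model.predict([[4], [5]]))
```

The point is that the object's behavior, not its class hierarchy, decides whether it can be used as a model.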

In [None]:
%%writefile dataframelabelencoder.py 

from category_encoders.ordinal import OrdinalEncoder
from collections import defaultdict

from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

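class DataFrameLabelEncoder(TransformerMixin):
    """Ordinal-encode every object or category column of a DataFrame."""
    def __init__(self):
        self.label_encoders = defaultdict(LabelEncoder)
        
    def fit(self, X, y=None):
        # y is accepted (and ignored) so the transformer also works when a pipeline is fitted
        for column in X.columns:
            if X[column].dtype.name in ["object", "category"]:
                self.label_encoders[column] = OrdinalEncoder()
                self.label_encoders[column].fit(X[column])
        return self
    
    def transform(self, X):
        # work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        for column, label_encoder in self.label_encoders.items():
            X[column] = label_encoder.transform(X[column])
        return X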

In [None]:
from dataframelabelencoder import DataFrameLabelEncoder

X = train.X.copy()
y = train.y.copy()

le = DataFrameLabelEncoder()
X = le.fit_transform(X)

sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, y)

sk_model = make_pipeline(le, sk_clf)

# Build an ADS model from the sklearn pipeline (label encoder + random forest classifier)
my_model = ADSModel.from_estimator(sk_model, 
                                   name=sk_clf.__class__.__name__)

In [None]:
print("Sklearn accuracy on test data:", my_model.score(test.X, test.y, score_fn = accuracy_scorer))

<a id='eval'></a>
## Evaluate Models using `ADSEvaluator`

One of the key advantages of `ADS` is the ability to quickly evaluate any models. ADS supports evaluating:

- regression
- binary classification
- multiclass classification

`ADS` supports the ability for you to provide your own evaluation function (given `y_true` and `y_pred` series) for any esoteric calculation that you would like to run.

Below, we examine the plots that are commonly used to evaluate model performance. These include the precision-recall, ROC, lift, and gain plots. Each model under study is plotted together, allowing for easy comparison. In addition, the normalized confusion matrices are provided.

In [None]:
evaluator = ADSEvaluator(test, models=[model, my_model, baseline], 
                         training_data=train)
evaluator.show_in_notebook()

There are a number of common metrics that are used to assess the quality of a model. `ADS` provides a convenient method to compare the models and highlights the model with the highest score in each metric.

Note: `AutoML` does its optimization on the validation data, meaning that the `sklearn` random forest model will perform better on the training data. Performance on training data doesn't tell us how the model performs on unseen data. You should look to performance on the `test` dataset to get an idea of which model is better.

In [None]:
evaluator.metrics

A binary classification model can have one of four outcomes for each prediction. A true-negative is an outcome where the model correctly predicts the negative case, and a false-negative is an outcome where the model incorrectly predicts the negative case. A false-positive is when the model incorrectly predicts the positive case, and a true-positive is when the model correctly predicts the positive case. However, not all false-positives and false-negatives have the same importance. For example, a cancer test has a higher cost when it incorrectly says that a patient does not have cancer when they do. The `calculate_cost` method allows the cost to be computed for each model based on the cost of each class of prediction.
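
As a rough sketch of the idea, the four outcomes can be counted by hand and combined with per-outcome weights. The formula used here (the sum of each count times its weight) is an assumption made for illustration; how `calculate_cost` combines the weights internally is not specified in this notebook. The helper names and the toy predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count true negatives, false positives, false negatives, and true positives."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def weighted_cost(counts, tn_weight, fp_weight, fn_weight, tp_weight):
    """Total cost as the sum of each outcome count times its weight (an assumed formula)."""
    return (counts["tn"] * tn_weight + counts["fp"] * fp_weight
            + counts["fn"] * fn_weight + counts["tp"] * tp_weight)

y_true = ["No", "No", "No", "Yes", "Yes", "No"]
y_pred = ["No", "Yes", "No", "Yes", "No", "No"]
counts = confusion_counts(y_true, y_pred)
total_cost = weighted_cost(counts, tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)
print(counts, total_cost)
```

Changing the weights changes which model looks cheapest, even when the raw accuracy is identical.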

In [None]:
evaluator.calculate_cost(tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)

<a id='explanations'></a>
# Model Explanations

The remaining part of this tutorial demonstrates how we can use the `ADS` explanation module to help better understand the behavior of our trained model. We will first create the required `ADS` explainer objects and then begin generating global and local explanations. 

Some useful terms for machine learning explainability (MLX):
  - **Explainability**: The ability to explain the reasons behind a machine learning model’s prediction.
  - **Interpretability**: The level at which a human can understand the explanation.
  - **Global Explanations**: Understand the general behavior of a machine learning model as a whole.
  - **What-If**: Understand how the change in feature values for either one sample or an entire dataset affects model outcome. 
  - **Local Explanations**: Understand why the machine learning model made a specific prediction.
  - **Model-Agnostic Explanations**: Explanations treat the machine learning model (and feature pre-processing) as a black-box, rather than using properties from the model to guide the explanation.

The `ADS` explanation module provides interpretable, model-agnostic, local/global explanations.

---

<a id='adsexplainer'></a>
## ADSExplainer 

`ADS` provides a general explainer object, `ADSExplainer`, which is used to generate both global and local explanations for machine learning models. `ADSExplainer` takes as input the datasets used to train and evaluate the model (e.g., train and test) and the model itself. Any type of model containing a `predict_proba()` or `predict()` function can be used. 

In [None]:
# our model explainer class
explainer = ADSExplainer(test, model)

# let's create a global explainer
global_explainer = explainer.global_explanation(provider=MLXGlobalExplainer())

## What-if Scenarios 

Using the `ADSExplainer` object, we can create a "WhatIf" explanation object to generate model explanations. Oracle Labs WhatIf `MLX` is selected as the provider using the `MLXWhatIfExplainer` object. The WhatIf explanation supports exploring both individual samples and model predictions.

In [None]:
#whatif_explainer = explainer.whatif_explanation(
#                     provider=MLXWhatIfExplainer())

### Sample Explorer 

The sample explorer API allows you to explore how a change applied to the feature values of a selected sample impacts the model prediction. You can modify a subset of the available features or all the features. The optional argument `features` lets you specify a list of features to explore while the optional parameter `max_features` lets you select the maximum number of features you want to evaluate. By default, all features are selected. 
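
The underlying what-if idea can be sketched independently of `ADS`: take one sample, override some feature values, and compare the predictions. In the example below the model `attrition_probability`, its coefficients, and the helper `what_if` are all hypothetical stand-ins, not part of the `MLX` API.

```python
import math

def attrition_probability(sample):
    """Hypothetical stand-in model: a fixed logistic function of two features."""
    score = -1.0 + 1.5 * (sample["OverTime"] == "Yes") - 0.4 * sample["JobLevel"]
    return 1.0 / (1.0 + math.exp(-score))

def what_if(model_fn, sample, **changes):
    """Return the prediction before and after overriding the given feature values."""
    modified = {**sample, **changes}  # the original sample is left untouched
    return model_fn(sample), model_fn(modified)

employee = {"OverTime": "No", "JobLevel": 2}
before, after = what_if(attrition_probability, employee, OverTime="Yes")
print(f"before: {before:.3f}, after: {after:.3f}")
```

For this toy model, switching `OverTime` to `"Yes"` raises the predicted probability, which is the kind of effect the sample explorer lets you probe interactively.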

In [None]:
#whatif_explainer.explore_sample(row_idx=0)

### Predictions Explorer

The predictions explorer tool allows you to explore model predictions across either the marginal distribution (1-feature) or the joint distribution (2-feature) of the features in your train/validation/test dataset. The method `explore_predictions()` has several optional parameters including: 

* `feature`: the name of the feature to visualize 
* `label`: either the target label index or name to visualize 
* `plot_type`: `scatter`, `bar`, `box`.
* `discretization`: method to discretize continuous features (Options are no discretization, quartile, decile, or percentile)
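
To show what discretizing a continuous feature by quartile amounts to, the sketch below bins a feature into four quartile groups and averages the predictions in each group. It is a plain-Python illustration with made-up values, and the function name is hypothetical; it does not reproduce how `explore_predictions()` computes or draws its plots.

```python
from statistics import mean, quantiles

def mean_prediction_by_quartile(feature_values, predictions):
    """Group predictions into quartile bins of one feature and average each bin."""
    cut_points = quantiles(feature_values, n=4)  # three cut points give four bins
    bins = {index: [] for index in range(4)}
    for value, prediction in zip(feature_values, predictions):
        # The bin index is the number of cut points the value exceeds.
        bin_index = sum(value > cut for cut in cut_points)
        bins[bin_index].append(prediction)
    return {index: mean(values) for index, values in bins.items() if values}

# Made-up ages and predicted attrition probabilities for eight employees.
ages = [22, 25, 31, 38, 44, 47, 52, 58]
predicted = [0.6, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]
by_quartile = mean_prediction_by_quartile(ages, predicted)
print(by_quartile)
```

Deciles or percentiles follow the same pattern with `n=10` or `n=100`.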

In [None]:
#whatif_explainer.explore_predictions('Age')

In [None]:
#whatif_explainer.explore_predictions('WorkLifeBalance', plot_type='box', discretization='decile')

<a id='global'></a>
## Global Explanations

We will start with generating global explanations for our model. 

Using the `ADSExplainer` object, we can create a global explanation object to generate global model explanations. Oracle Labs global `MLX` is selected as the provider using the `MLXGlobalExplainer` object. Global explanation supports both feature importance explanations and feature dependence explanations, such as Partial Dependence Plots (PDP) and Individual Conditional Expectations (ICE). 
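
The outline of this notebook names feature permutation importance as the feature importance technique. A minimal, library-free sketch of that idea follows; it assumes accuracy as the score and uses a hypothetical toy model and synthetic data, so it should be read as an illustration rather than as the `MLXGlobalExplainer` implementation. A feature's importance is estimated as the average drop in score after its column is shuffled.

```python
import random

def accuracy(model_fn, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model_fn(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(model_fn, rows, labels, n_features, n_repeats=5, seed=0):
    """Importance of a feature = mean drop in accuracy after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(model_fn, rows, labels)
    importances = []
    for feature in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [row[feature] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this feature's values permuted.
            shuffled = [row[:feature] + [value] + row[feature + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model_fn, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model: predicts from feature 0 only and ignores feature 1.
def toy_model(row):
    return row[0] > 0.5

rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [row[0] > 0.5 for row in rows]
importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is zero, while shuffling the feature the model relies on degrades the score.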

Generate the global feature importance explanation and visualize the top 10 features as a bar chart.

In [None]:
importances = global_explainer.compute_feature_importance()

In [None]:
importances.show_in_notebook(n_features=10)

Visualize a detailed scatter plot. This shows the distribution of the importance measure and provides a sense of the variation in the data. The top 10 features will be plotted.

In [None]:
#importances.show_in_notebook(n_features=10, mode='detailed')

The detailed information used to generate the above plot is available with the `get_global_explanation` method. It returns an array of JSON structures. We will display the results for only one of the features.

In [None]:
#importances.get_global_explanation()[0]

## Show what the model has learned

It is great to have an expert with knowledge of what a model should do. However, this is often not available. Plus, models sometimes learn different things than what an expert would speculate should be learned. Generally, models learn important relationships between the data. For many machine learning models, it is difficult or almost impossible to understand what the model has learned. The `ADSExplainer` provides a powerful set of tools that give the data scientist insight into what a complex model is doing. It does this by building other models and performing simulations on the model's predictions. From this, the `ADSExplainer` learns what has been learned.

When an explanation does not make sense, it does not mean the explanation is wrong. It is possible that the model has learned new relationships in the data. This allows `MLX` to be used to understand and debug the modeling process.

### Feature Dependence Explanations (PDP & ICE)

Next, we will generate global explanations to visualize how different values for the important features interact with the target variable. This is done through Partial Dependence Plots (PDP) and Individual Conditional Expectations (ICE) explanations. 
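
The relationship between the two techniques can be sketched in plain Python: an ICE curve tracks one sample's prediction while a single feature is swept over a grid of values, and the PDP is the point-wise average of those curves. The model `toy_risk`, its coefficients, and the sample rows below are hypothetical; this is an illustration of the general idea, not the `MLXGlobalExplainer` code.

```python
def ice_curves(model_fn, rows, feature, grid):
    """One curve per row: the prediction as `feature` is swept over `grid`, other features fixed."""
    return [[model_fn({**row, feature: value}) for value in grid] for row in rows]

def partial_dependence(curves):
    """PDP = point-wise average of the ICE curves."""
    return [sum(points) / len(points) for points in zip(*curves)]

# Hypothetical model: attrition risk rises with overtime and falls with job level.
def toy_risk(row):
    return 0.1 + 0.3 * (row["OverTime"] == "Yes") + 0.1 * (3 - row["JobLevel"])

employees_sample = [
    {"OverTime": "No", "JobLevel": 1},
    {"OverTime": "Yes", "JobLevel": 2},
    {"OverTime": "No", "JobLevel": 3},
]
grid = ["No", "Yes"]
curves = ice_curves(toy_risk, employees_sample, "OverTime", grid)
pdp = partial_dependence(curves)
print(curves)
print(pdp)
```

Each inner list in `curves` is one employee's ICE curve; `pdp` summarizes them into a single average trend per grid value.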

The following cell shows how to learn more about the PDP and ICE techniques used in the `MLXGlobalExplainer`. This provides a description of the algorithm and how to interpret the output. 

In [None]:
#global_explainer.partial_dependence_summary()

Create the PDP plot for the `OverTime` feature. From this, it can be seen that if an employee works overtime (`OverTime` is `Yes`), they are more likely to leave the company.

In [None]:
#explanation = explainer.global_explanation().compute_partial_dependence("OverTime")
#explanation.show_in_notebook(mode="pdp", labels=[True])

Using ICE, the results for each sample can be seen for both the True and False cases (employee leaves or stays). This approach allows the data scientist to see the distribution of importance values. In this example, the output has been centered/pinned on its first prediction. Also, a median line is plotted to show the general trend.

In [None]:
#explanation.show_in_notebook(mode="ice", centered=True, 
#                               show_distribution=True, 
#                               show_correlation_warning=True, 
#                               show_median=True)

As mentioned above, PDP is able to consider the interaction between multiple variables. In theory, any level of interaction can be used, but in practice only two-way interactions can be plotted without a significant amount of compute; therefore, it is generally limited to two variables. In this example, the partial dependence will be computed for the interaction between the `OverTime` and `JobLevel` variables.

In [None]:
#otjl_explanation = explainer.global_explanation().compute_partial_dependence(
#    ['OverTime', 'JobLevel'])


#otjl_explanation.show_in_notebook(
#    show_distribution=True, show_correlation_warning=False, line_gap=1)

Access to the raw data can be obtained by converting it to a data frame.

In [None]:
#otjl_explanation.as_dataframe()

<a id='localexplanations'></a>
## Local Explanations

Global explanations inform the data scientist about the general trends in a model. They do not describe what is happening with a specific prediction. That is the role of local explanations. They are model-agnostic and provide insights into why a model made a specific prediction.  
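
One simple way to build intuition for a local explanation is to reset each feature of a single sample to a reference value and record how much the prediction moves. The sketch below does exactly that; it is a simplified stand-in and not the algorithm used by `MLXLocalExplainer`, and `toy_risk`, `local_attribution`, and the reference values are hypothetical.

```python
def local_attribution(model_fn, sample, reference):
    """Change in prediction when each feature is individually reset to a reference value."""
    prediction = model_fn(sample)
    return {name: prediction - model_fn({**sample, name: reference[name]})
            for name in sample}

# Hypothetical model: attrition risk rises with overtime and falls with job level.
def toy_risk(row):
    return 0.1 + 0.3 * (row["OverTime"] == "Yes") + 0.1 * (3 - row["JobLevel"])

sample = {"OverTime": "Yes", "JobLevel": 1}
reference = {"OverTime": "No", "JobLevel": 3}
attributions = local_attribution(toy_risk, sample, reference)
print(attributions)
```

A larger value means that feature pushed this particular prediction further away from what the model would say for the reference employee.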



In [None]:
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
local_explainer = explainer.local_explanation(provider=MLXLocalExplainer())

Select and display a sample on which to perform a local explanation.

In [None]:
#sample = 4
#(X, y) = test.X.iloc[sample:sample+1], test.y.iloc[sample:sample+1]

#local_explainer.explain(X, y).show_in_notebook(labels=True)
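
This notebook does not describe the algorithm used inside `MLXLocalExplainer`, so the sketch below should not be read as its implementation. It shows one generic, model-agnostic way to explain a single prediction: replace one feature at a time with a baseline value and record how much the prediction changes. The toy model, the `sample`, and the `baseline` are all hypothetical.

```python
def toy_predict_proba(row):
    """Hypothetical stand-in for a trained model: returns P(attrition) for one row."""
    risk = 0.10
    if row["OverTime"] == "Yes":
        risk += 0.30
    risk += 0.05 * (3 - row["JobLevel"])
    return risk


# Made-up sample to explain and a made-up reference employee.
sample = {"OverTime": "Yes", "JobLevel": 1}
baseline = {"OverTime": "No", "JobLevel": 3}


def local_attributions(predict, sample, baseline):
    """Change in the prediction when each feature is individually reset to its baseline."""
    full_prediction = predict(sample)
    return {
        feature: full_prediction - predict({**sample, feature: baseline[feature]})
        for feature in sample
    }


attributions = local_attributions(toy_predict_proba, sample, baseline)
print(attributions)  # OverTime contributes about 0.30, JobLevel about 0.10
```

The result applies only to this one sample; a different employee could yield a different ranking of features, which is what distinguishes a local explanation from a global one.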

This concludes our exploration of `ADS`, `AutoML`, and `MLX` features in the context of a binary classification problem.

<a id='save'></a>
# Saving the model to the model catalog 

Now we can save the simple random forest model to the model catalog using the flexible `prepare_generic_model()` function, which creates an editable template artifact. The function `prepare_generic_model()` can support **any** model and **should always be the preferred way to save models from open source libraries**. 

`prepare_generic_model()` gives you complete control over the structure of the artifact and the definition of the functions in `score.py`. 

In [None]:
import os

import joblib
from ads.common.model_artifact import ModelArtifact
from ads.common.model_export_util import prepare_generic_model

# Path to artifact directory for my sklearn model: 
sklearn_path = "./model-artifact/"

# Creating the artifact template files in the directory: 
sklearn_artifact = prepare_generic_model(sklearn_path, 
                                         function_artifacts=False, 
                                         data_science_env=True,
                                         force_overwrite=True)

# Creating a joblib pickle object of my random forest model: 
joblib.dump(sk_model, os.path.join(sklearn_path, "model.joblib"))

Now, we can make some changes to `score.py`, ensuring that `load_model()` reads in the `model.joblib` file.

In [None]:
#setting paths for artifact files that need to be modified: 

encoder_path = os.path.join(sklearn_path, "dataframelabelencoder.py")
score_path = os.path.join(sklearn_path, "score.py")
!cp dataframelabelencoder.py {encoder_path}

In [None]:
%%writefile {score_path}

"""
   Inference script. This script is used for prediction by scoring server when schema is known.
"""

import json
import os
from joblib import load
import io 
import pandas as pd
import logging 

# logging configuration - OPTIONAL 
logging.basicConfig(format='%(name)s - %(levelname)s - %(message)s', level=logging.INFO)
logger_pred = logging.getLogger('model-prediction')
logger_pred.setLevel(logging.INFO)
logger_feat = logging.getLogger('input-features')
logger_feat.setLevel(logging.INFO)

from dataframelabelencoder import DataFrameLabelEncoder

def load_model():
    """
    Loads model from the serialized format

    Returns
    -------
    model:  a model instance on which predict API can be invoked
    """
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    model_file_name = "model.joblib"
    # The model was serialized with joblib, so load it with joblib's loader.
    if model_file_name in contents:
        with open(os.path.join(model_dir, model_file_name), "rb") as file:
            model = load(file)
    else:
        raise Exception('{0} is not found in model directory {1}'.format(model_file_name, model_dir))
    
    return model


def predict(data, model=load_model()) -> dict:
    """
    Returns prediction given the model and data to predict

    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format as expected by the predict API of the core estimator. For example, for scikit-learn models it could be a NumPy array, a list of lists, or a pandas DataFrame

    Returns
    -------
    predictions: Output from scoring server
        Format: { 'prediction': output from `model.predict` method }

    """
    assert model is not None, "Model is not loaded"
    X = pd.read_json(io.StringIO(data)) if isinstance(data, str) else pd.DataFrame.from_dict(data)
    preds = model.predict(X).tolist()
    logger_pred.info(preds)
    logger_feat.info(X)    
    return { 'prediction': preds }

### Testing the artifact before saving to the catalog

It is always a good idea to test your model artifact before saving it to the catalog. Here we import `load_model` and `predict` from the `score.py` module. We then call `predict()` on the first five rows of the training data twice: once passing the DataFrame directly and once passing its JSON serialization. The two prediction arrays should be identical.

In [None]:
input_data = train.X[:5].to_json()

In [None]:
import sys 

# add the path of score.py: 
sys.path.insert(0, sklearn_path)

from score import load_model, predict

# Load the model into memory
loaded_model = load_model()
# Predict on the first five training rows, passed as a DataFrame:
predictions_test = np.asarray(predict(train.X[:5], loaded_model)['prediction'])
# Predict on the same rows, passed as a JSON string:
predictions = predict(input_data)

# Compare the two sets of predictions; both arrays should be equal.
#print("The two arrays are equal: {}".format(np.array_equal(predictions_test, predictions['prediction'])))
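
When `score.py` is edited and re-tested in the same notebook session, a plain `from score import ...` can return a previously imported copy, because Python caches imported modules by name. One possible alternative, sketched here, is to load the file by path with `importlib`. The temporary directory and the tiny stand-in `score.py` are hypothetical and exist only to keep the example self-contained; in practice the path would point at the real artifact directory.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Execute the file at `file_path` as a fresh module object."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in artifact directory holding a minimal, made-up score.py.
with tempfile.TemporaryDirectory() as artifact_dir:
    score_file = os.path.join(artifact_dir, "score.py")
    with open(score_file, "w") as f:
        f.write(
            "def load_model():\n"
            "    return lambda rows: [len(r) for r in rows]\n"
            "\n"
            "def predict(data, model=load_model()):\n"
            "    return {'prediction': model(data)}\n"
        )
    score = load_module_from_path("artifact_score", score_file)
    result = score.predict([[1, 2], [3]])

print(result)  # {'prediction': [2, 1]}
```

Each call to `load_module_from_path` re-executes the file, so changes to `score.py` are picked up without restarting the kernel.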

### Calling save()

In [None]:
mc_model = sklearn_artifact.save(project_id=os.environ['PROJECT_OCID'], 
                               compartment_id=os.environ['NB_SESSION_COMPARTMENT_OCID'], 
                               display_name="sklearn-employee-attrition",
                               description="simple sklearn model to predict employee attrition", 
                               training_script_path="employee-attrition.ipynb", 
                               ignore_pending_changes=True)

# Deploy your model through Model Deployment 

Let's do that in the console! 

---

# Invoke the Model HTTP Endpoint 

In [None]:
import requests
import oci
from oci.signer import Signer

In [None]:
# Using resource principals. You can alternatively use the config+key flow. 
using_rps = True
# Replace with the uri of your model deployment: 
uri = ''

# payload: 
input_data = train.X[:5].to_json()
body = input_data

if using_rps: # using resource principal:     
    auth = oci.auth.signers.get_resource_principals_signer()
else: # using config + key: 
    config = oci.config.from_file("~/.oci/config") # replace with the location of your oci config file
    auth = Signer(
        tenancy=config['tenancy'],
        user=config['user'],
        fingerprint=config['fingerprint'],
        private_key_file_location=config['key_file'],
        pass_phrase=config['pass_phrase'])

In [None]:
%%time 

# submit request to model endpoint: 
requests.post(uri, json=input_data, auth=auth).json()
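
Note that `input_data` is already a JSON string, and the `json=` argument of `requests.post` JSON-encodes whatever it receives. The sketch below uses only the standard `json` module to show what that implies for the payload. The sample columns are made up, and the assumption that the scoring server decodes the request body exactly once before calling `predict()` is ours, not something stated in this notebook.

```python
import json

# Stand-in for `train.X[:5].to_json()`: a JSON document held in a Python str.
input_data = json.dumps(
    {"Age": {"0": 41, "1": 49}, "OverTime": {"0": "Yes", "1": "No"}}
)

# Passing a str through `json=` encodes it again, so the body on the wire
# is a JSON string literal that wraps the original document.
wire_body = json.dumps(input_data)

# Assumption: the server decodes the body once before calling predict().
received = json.loads(wire_body)
print(type(received).__name__)  # str

# A second decode recovers the column-oriented dictionary.
columns = json.loads(received)
print(sorted(columns))  # ['Age', 'OverTime']
```

Under that assumption, `predict()` receives a string and takes its `isinstance(data, str)` branch. Sending `json=json.loads(input_data)` instead would deliver a dictionary, which `predict()` handles through `pd.DataFrame.from_dict`.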