Oracle Data Science service sample notebook.

Copyright (c) 2020, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Binary Classification for Predicting Employee Attrition with the Accelerated Data Science (ADS) SDK</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

In this notebook, you work with an employee attrition dataset. You start with an exploratory data analysis (EDA) to understand the data. Then you train a model using `sklearn`, use it to make predictions, and evaluate how well it generalizes to new data. Finally, you use machine learning explainability (MLX) to understand the global and local model behavior. You do all of this using the Oracle Accelerated Data Science (`ADS`) library.

---

## Contents:

- <a href='#setup'>Setting Up</a>
- <a href='#data'>Opening and Visualizing Datasets using `ADS`</a>
   - <a href='#binaryclassifition'>Binary Classification</a>
   - <a href='#data'>The Dataset</a>
   - <a href='#viz'>Visualizing the Dataset</a>
   - <a href='#eda'>Exploratory Data Analysis</a> 
   - <a href='#trans'>Getting and Applying Transformation Recommendations</a> 
- <a href='#model'>Building and Visualizing Models</a>
- <a href='#eval'>Evaluating Models Using `ADSEvaluator`</a>
- <a href='#explanations'>Explaining How Models Work Using `ADSExplainer`</a>
   - <a href='#adsexplainer'>Using the `ADSExplainer` Class</a>
   - <a href='#global'>Generating Global Explanations</a>
   - <a href='#show'>Showing What the Model Has Learned</a>
        - <a href='#custom'>Using `ADSExplainer` on the Decision Tree Model</a>
        - <a href='#pdp'>Feature Dependence Explanations</a>   
   - <a href='#localexplanations'>Generating Local Explanations</a>
- <a href='#ref'>References</a>          

---

**Important:**

Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.

You can access the `orcl_attrition` dataset license [here](https://oss.oracle.com/licenses/upl).

---
The notebook is compatible with the following [Data Science conda environments](https://docs.oracle.com/en-us/iaas/data-science/using/conda_environ_list.htm):

* [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for CPU on Python 3.7 (version 1.0)
* [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for GPU on Python 3.7 (version 1.0)


<a id='setup'></a>
# Setting Up

Import everything necessary to this notebook:

In [10]:
import logging
import os
import pandas as pd
import warnings

from ads.catalog.model import ModelCatalog
from ads.catalog.project import ProjectCatalog
from ads.common.model_artifact import ModelArtifact
from ads.common.data import ADSData
from ads.common.model import ADSModel
from ads.dataset.factory import DatasetFactory
from ads.evaluations.evaluator import ADSEvaluator
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from category_encoders.ordinal import OrdinalEncoder
from collections import defaultdict
from os.path import expanduser
from os.path import join
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import get_scorer

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)



# Opening and Visualizing Datasets using `ADS`

<a id='binaryclassifition'></a>
## Binary Classification

Binary classification is a technique of classifying observations into one of two groups. In this notebook, the two groups are those employees that will leave the organization and those that will not. 

Given the features in the data, the model determines the optimal criteria for classifying an observation as leaving or not leaving. This optimization is based on the training data. However, some of the data is reserved to test the model's performance. Models can overfit the training data, that is, learn the noise in a dataset, and then they do a poor job of predicting the results on new data (test data). Since you already know the truth for the data in the training dataset, what really matters is how well the model performs on the test data.
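
The gap between training and test performance described above can be seen in a small, self-contained sketch. It assumes synthetic data from `make_classification` as a stand-in for the attrition dataset, and all names ending in `_demo` are hypothetical. It uses plain scikit-learn rather than `ADS`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data (an assumption, not the attrition dataset).
# flip_y adds label noise that a model can memorize but not generalize from.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.84, 0.16], flip_y=0.05, random_state=0,
)

# Reserve 25% of the rows; the model never sees them during training.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# An unconstrained decision tree can memorize the training rows, noise included.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr_demo, y_tr_demo)

train_acc = tree_demo.score(X_tr_demo, y_tr_demo)
test_acc = tree_demo.score(X_te_demo, y_te_demo)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The training accuracy is an optimistic estimate because the tree has already seen those labels. The test accuracy is the number to compare across models.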

<a id='data'></a>
## The Dataset

This is a fictional dataset and contains 1,470 rows. There are 36 features with 22 ordinal features, 11 categorical features, and 3 constant values. The features include basic demographic information, compensation level, job characteristics, job satisfaction, and employee performance metrics. The data is not balanced as fewer employees leave than stay.

The first step is to load in the dataset. To do this the `DatasetFactory` singleton object is used. It is part of the `ADS` library and is a powerful class to work with datasets from different sources.


In [None]:
attrition_path = join("/", "opt", "notebooks", "ads-examples", "oracle_data", "orcl_attrition.csv")

ds = DatasetFactory.open(
      attrition_path,
      target="Attrition").set_positive_class('Yes')

<a id='viz'></a>
## Visualizing the Dataset

The `.feature_plot()` method creates a visualization from a sample of the dataset to give an understanding of the nature of the data in a feature.

In [None]:
ds['Attrition'].ads.feature_plot()

In [None]:
ds['Age'].ads.feature_plot()

In [None]:
ds['EducationalLevel'].ads.feature_plot()

Use the `show_corr()` method to visualize correlations between features, even when the features are different types.

In [None]:
ds.show_corr()

<a id='eda'></a>
## Exploratory Data Analysis

The `DatasetFactory` does more than just open data from various sources. It profiles the data, determines their data types, and uses sampling to prepare visualizations. The `type_of_target` method provides information about the target, which is the value that is going to be predicted.

In [None]:
ds.type_of_target()

The `show_in_notebook()` method is used in many classes within `ADS`. It makes a best effort to display information that is meaningful to a data scientist about the object. In the next cell, it is applied to the target variable and it shows the relative frequency of the classes in the data. Since the target is `Attrition`, `False` means employees who did not leave and `True` means employees who did. It shows that the data is imbalanced.

In [None]:
ds.target.show_in_notebook()

<a id='trans'></a>
## Getting and Applying Transformation Recommendations

`ADS` can help with feature engineering by transforming datasets. For example, it can fix class imbalance by upsampling or downsampling. There are many other transforms that `ADS` can apply. You can have `ADS` analyze the data and automatically perform the transformations that it thinks would improve the model. This is done using the `auto_transform()` method. The `suggest_recommendations()` method allows you to explore the suggested transforms using the notebook's UI and select the transformations that you want it to make.

All `ADS` datasets are immutable; any transforms that are applied result in a new dataset. In this example, the notebook performs automatic transformations on the data but, because `fix_imbalance=False` is passed, it does not correct the class imbalance.

In [None]:
transformed_ds = ds.auto_transform(fix_imbalance=False)
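
To make the idea of upsampling and downsampling concrete, here is a minimal sketch that uses `pandas` and `sklearn.utils.resample` on a toy frame. The values are hypothetical, and this is not a description of how `ADS` rebalances data internally. Like an `ADS` transform, each step returns a new object and leaves the input untouched.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with hypothetical values: 8 employees stay, 2 leave.
toy = pd.DataFrame({
    "Age": [25, 31, 38, 42, 29, 51, 36, 45, 27, 33],
    "Attrition": [False] * 8 + [True] * 2,
})

majority = toy[~toy["Attrition"]]
minority = toy[toy["Attrition"]]

# Downsample: keep only as many majority rows as there are minority rows.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# Upsample: draw minority rows with replacement until the classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

print(balanced_down["Attrition"].value_counts().to_dict())
print(balanced_up["Attrition"].value_counts().to_dict())
```

Downsampling discards information from the majority class, while upsampling repeats minority rows, so either choice should be made on the training data only.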

<a id='model'></a>
# Building and Visualizing Models


`ADS` can also split a dataset into training and test datasets with the `train_test_split` method, and train a set of models with `train`.

In [None]:
train, test = transformed_ds.train_test_split()

`ADS` is agnostic to the source of the model because it takes advantage of duck typing: something that looks like a model and walks like a model is a model to `ADS`.
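
As an illustration of duck typing in general, the sketch below defines a hypothetical `MajorityClassModel` that inherits from no machine learning library, together with a helper that only relies on the presence of a `predict()` method. Both names are made up for this example, and the exact set of methods `ADS` looks for is not shown here.

```python
import numpy as np


class MajorityClassModel:
    """Hypothetical estimator: no base class, but it has fit() and predict()."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        # Always predict the majority label, one prediction per row.
        return np.full(len(X), self.majority_)


def accuracy(model, X, y):
    # Works for any object that exposes predict(), whatever its class.
    return float(np.mean(model.predict(X) == np.asarray(y)))


X_duck = np.zeros((5, 2))
y_duck = [0, 0, 0, 1, 1]
duck_score = accuracy(MajorityClassModel().fit(X_duck, y_duck), X_duck, y_duck)
print(duck_score)
```

The same `accuracy` helper would accept a scikit-learn estimator or pipeline unchanged, which is the practical benefit of relying on behavior rather than type.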

Next, you build a `sklearn` random forest model, and then use it with `ADS`.

In [None]:
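class DataFrameLabelEncoder(TransformerMixin):
    """Ordinal-encode every object or category column of a DataFrame."""

    def __init__(self):
        # One fitted encoder per encoded column, keyed by column name.
        self.label_encoders = defaultdict(LabelEncoder)

    def fit(self, X, y=None):
        # Accept an optional y so that fit() also works when called by a pipeline.
        for column in X.columns:
            if X[column].dtype.name in ["object", "category"]:
                self.label_encoders[column] = OrdinalEncoder()
                self.label_encoders[column].fit(X[column])
        return self

    def transform(self, X):
        # Note: the encoded columns are written back into X in place.
        for column, label_encoder in self.label_encoders.items():
            X[column] = label_encoder.transform(X[column])
        return X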

le = DataFrameLabelEncoder()
X = le.fit_transform(train.X.copy())
y = train.y.copy()

sk_clf = RandomForestClassifier(random_state=42)
sk_clf.fit(X, y)

# Build an ADS model.
my_model = ADSModel.from_estimator(make_pipeline(le, sk_clf), 
                                   name=sk_clf.__class__.__name__)

In [None]:
print("Random Forest accuracy on test data:", my_model.score(test.X, test.y))

Next, generate another model to compare the random forest to. In this case, use the simpler `DecisionTreeClassifier` as a baseline.

In [None]:
dt_clf = DecisionTreeClassifier(random_state = 42)
dt_clf.fit(X, y)

my_other_model = ADSModel.from_estimator(make_pipeline(le, dt_clf), 
                                   name=dt_clf.__class__.__name__)

In [None]:
print("Decision tree accuracy on test data:", my_other_model.score(test.X, test.y))

<a id='eval'></a>
# Evaluating Models Using `ADSEvaluator`

One of the key advantages of `ADS` is the ability to quickly evaluate any model. `ADS` supports evaluating:

- Regression
- Binary classification
- Multiclass classification

`ADS` allows you to provide your own evaluation function (given `y_true` and `y_pred` series) for any esoteric calculation that you want to run.
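
A custom evaluation function of this shape can be as simple as the sketch below. The function name `missed_leavers_rate` and the sample series are hypothetical, and the step that registers the function with `ADSEvaluator` is deliberately left out because it is not shown in this notebook.

```python
import pandas as pd


def missed_leavers_rate(y_true, y_pred):
    """Fraction of employees who actually left that the model failed to flag."""
    y_true = pd.Series(y_true).astype(bool).reset_index(drop=True)
    y_pred = pd.Series(y_pred).astype(bool).reset_index(drop=True)
    leavers = y_true.sum()
    if leavers == 0:
        # No positive cases: nothing could be missed.
        return 0.0
    # True label is "left" but the prediction says "stayed".
    return float((y_true & ~y_pred).sum() / leavers)


y_true_demo = pd.Series([True, True, False, False, True])
y_pred_demo = pd.Series([True, False, False, True, False])
rate_demo = missed_leavers_rate(y_true_demo, y_pred_demo)
print(rate_demo)
```

Here two of the three employees who left were predicted to stay, so the function returns two thirds.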

Next, you examine the plots that are commonly used to evaluate model performance. These include the precision-recall, ROC, lift, and gain plots. Each model under study is plotted together. This allows for easy comparison. In addition, the normalized confusion matrices are provided.

In [None]:
evaluator = ADSEvaluator(test, models=[my_model, my_other_model], 
                         training_data=train,
                         show_full_name=True,
                         positive_class=True)
evaluator.show_in_notebook()

There are a number of common metrics that are used to assess the quality of a model. `ADS` provides a convenient method to compare the models and highlights the model with the highest score in each metric.

Performance on training data doesn't tell you how the model performs on unseen data. You should look at performance on the `test` dataset to get an idea of which model is better.

In [None]:
evaluator.metrics

A binary classification model can have one of four outcomes for each prediction: true negative, false positive, false negative, or true positive. A true negative is an outcome where the model correctly predicts the negative case. For this example, that would be the case when the employee is correctly predicted to stay. A false positive is when the model incorrectly predicts the positive case, that is, it predicts that an employee would leave and they stay. However, not all predictions may have the same importance. For example, a cancer test has a higher cost when it incorrectly says that a patient does not have cancer when they do. The `calculate_cost` method allows the cost to be computed for each model based on the cost of each class of prediction.

In [None]:
evaluator.calculate_cost(tn_weight=1, fp_weight=3, fn_weight=2, tp_weight=2)
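
To show how such weights turn a confusion matrix into a single number, here is a small worked sketch with scikit-learn. The labels are made up, the weights mirror the call above, and the simple weighted sum is an assumption for illustration; `ADS` may aggregate or scale the result differently.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 means the employee leaves, 0 means the employee stays.
y_true_cost = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_cost = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(
    y_true_cost, y_pred_cost, labels=[0, 1]
).ravel()

# Same weights as in the calculate_cost call.
weights = {"tn": 1, "fp": 3, "fn": 2, "tp": 2}

# Assumed aggregation: count of each outcome times its weight, summed.
total_cost = (tn * weights["tn"] + fp * weights["fp"]
              + fn * weights["fn"] + tp * weights["tp"])
print(tn, fp, fn, tp, total_cost)
```

With 4 true negatives, 2 false positives, 1 false negative, and 3 true positives, the weighted sum is 4·1 + 2·3 + 1·2 + 3·2 = 18.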

<a id='explanations'></a>
# Explaining How Models Work Using `ADSExplainer`

The remainder of this notebook demonstrates how you can use the `ADS` explanation module to help better understand the behavior of your trained model. First you create the required `ADS` explainer objects to then begin generating global and local explanations. 

Some useful terms for machine learning explainability (MLX):
  - **Explainability**: The ability to explain the reasons behind a machine learning model’s prediction.
  - **Interpretability**: The level at which a human can understand the explanation.
  - **Global Explanations**: Understand the general behavior of a machine learning model as a whole.
  - **Local Explanations**: Understand why the machine learning model made a specific prediction.
  - **Model-Agnostic Explanations**: Explanations treat the machine learning model and feature pre-processing as a black-box, instead of using properties from the model to guide the explanation.

The `ADS` explanation module provides interpretable, model-agnostic, and local and global explanations.

---

<a id='adsexplainer'></a>
## Using the `ADSExplainer` Class

`ADS` provides a general explainer class, `ADSExplainer`, which is used to generate both global and local explanations for machine learning models. `ADSExplainer` takes as input the datasets used to train and evaluate the model (such as `train` and `test`) and the model itself. Any type of model containing a `.predict_proba()` or `.predict()` method can be used. 

In [None]:
# our model explainer class
explainer = ADSExplainer(test, my_model)

# let's create a global explainer
global_explainer = explainer.global_explanation(provider=MLXGlobalExplainer())

<a id='global'></a>
## Generating Global Explanations

Start with generating global explanations for the model. Using the `ADSExplainer` object, you can create a global explanation object to generate global model explanations. Oracle Labs global MLX is selected as the provider using the `MLXGlobalExplainer` object. Global explanation supports both feature importance explanations and feature dependence explanations, such as Partial Dependence Plots (PDP) and Individual Conditional Expectations (ICE). 

In [None]:
global_explainer.feature_importance_summary()

Generate the global feature importance explanation and visualize the top six features as a bar chart:

In [None]:
importances = global_explainer.compute_feature_importance()
importances.show_in_notebook(n_features=6)
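
For intuition about model-agnostic feature importance, the sketch below uses permutation importance from scikit-learn on synthetic data. This is one common technique and is shown only as an illustration; the algorithm is not attributed here to `MLXGlobalExplainer`, and the data and the `_pi` names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 matters most, column 1 a little, column 2 is noise.
rng_pi = np.random.default_rng(0)
X_pi = rng_pi.normal(size=(600, 3))
y_pi = (X_pi[:, 0] + 0.3 * X_pi[:, 1] > 0).astype(int)

X_pi_train, X_pi_test, y_pi_train, y_pi_test = train_test_split(
    X_pi, y_pi, test_size=0.3, random_state=0
)
forest_pi = RandomForestClassifier(n_estimators=100, random_state=0)
forest_pi.fit(X_pi_train, y_pi_train)

# Shuffle one column at a time on held-out data and measure the score drop.
result_pi = permutation_importance(
    forest_pi, X_pi_test, y_pi_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking_pi = np.argsort(result_pi.importances_mean)[::-1]
print(ranking_pi, result_pi.importances_mean)
```

Because the technique only calls the model's scoring interface, it treats the model and its preprocessing as a black box, which matches the model-agnostic definition given earlier.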

Visualize a detailed scatter plot that shows the distribution of the importance measure and provides a sense of the variation in the data. Five features are plotted.

In [None]:
importances.show_in_notebook(n_features=5, mode='detailed')

The detailed information used to generate the above plot is available with the `get_global_explanation()` method. It returns an array of JSON structures. The next cell displays the result for only one of the features.

In [None]:
importances.get_global_explanation()[0]

<a id='show'></a>
## Showing What the Model Has Learned

It is great to have an expert with knowledge of what a model should do. However, this is often not available. Models sometimes learn different things than what an expert would speculate should be learned. Generally, models learn important relationships between the data. For many machine learning models, it is difficult or almost impossible to understand what the model has learned. The `ADSExplainer` provides a powerful set of tools that provides the data scientist insight into what a complex model is doing. It does this by building other models and performing simulations on the model's predictions. From this, the `ADSExplainer` learns what has been learned.

When an explanation does not make sense, it does not mean the explanation is wrong. It is possible that the model has learned new relationships in the data. This allows MLX to be used to understand and debug the modeling process.

<a id='custom'></a>
### Using `ADSExplainer` on the Decision Tree Model

In [None]:
explainer_custom_model = ADSExplainer(test, my_other_model)
fi = explainer_custom_model.global_explanation(
    provider=MLXGlobalExplainer()).compute_feature_importance()
fi.show_in_notebook(n_features=6)

<a id='pdp'></a>
### Feature Dependence Explanations

Next, you generate global explanations to visualize how different values for the important features interact with the target variable. This is done using PDP and ICE explanations. 

The next cell shows you how to learn more about the PDP and ICE techniques used in the `MLXGlobalExplainer`. This provides a description of the algorithm and how to interpret the output. 

In [None]:
global_explainer.partial_dependence_summary()

Create the PDP plot for the `OverTime` feature. From this you can see that employees who work overtime are more likely to leave the company.

In [None]:
explanation = explainer.global_explanation().compute_partial_dependence("OverTime")
explanation.show_in_notebook(mode="pdp", labels=[True])
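
The idea behind a one-feature partial dependence plot can be sketched by hand: force the feature to each value of interest for every row, predict, and average. The example below assumes synthetic data with a binary overtime-like flag, and the helper `partial_dependence_1d` is a hypothetical name, not part of `ADS` or scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng_pdp = np.random.default_rng(0)

# Hypothetical data: column 0 is a binary overtime flag, column 1 is noise.
overtime = rng_pdp.integers(0, 2, size=500)
noise = rng_pdp.normal(size=500)
X_pdp = np.column_stack([overtime, noise]).astype(float)

# Leaving (label 1) is made more likely for rows with overtime == 1.
p_leave = np.where(overtime == 1, 0.6, 0.1)
y_pdp = (rng_pdp.random(500) < p_leave).astype(int)

model_pdp = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=0
).fit(X_pdp, y_pdp)


def partial_dependence_1d(model, X, feature, grid):
    """Average predicted P(class 1) when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


pdp_values = partial_dependence_1d(model_pdp, X_pdp, feature=0, grid=[0, 1])
print(pdp_values)
```

The second value exceeds the first, which is the same kind of reading as "overtime raises the predicted chance of leaving" in the plot above.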

Using ICE, the results for each sample can be seen for both the True and False cases (employee leaves or stays). This approach allows the data scientist to see the distribution of importance values. In this example, the output has been centered and pinned on its first prediction. Also, a median line is plotted to show the general trend.

In [None]:
explanation.show_in_notebook(mode="ice", centered=True, 
                               show_distribution=True, 
                               show_correlation_warning=True, 
                               show_median=True)
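
Centering and the median line can also be written out explicitly. The sketch below computes one ICE curve per row for a synthetic dataset, pins each curve to zero at the first grid point, and takes the median across rows. The data, the logistic regression model, and the `_ice` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng_ice = np.random.default_rng(1)
X_ice = rng_ice.normal(size=(200, 2))
y_ice = (
    X_ice[:, 0] + 0.5 * X_ice[:, 1] + rng_ice.normal(scale=0.5, size=200) > 0
).astype(int)
model_ice = LogisticRegression().fit(X_ice, y_ice)

grid_ice = np.linspace(-2, 2, 9)

# One ICE curve per row: vary feature 0 over the grid, keep other columns fixed.
ice = np.empty((len(X_ice), len(grid_ice)))
for j, value in enumerate(grid_ice):
    X_mod = X_ice.copy()
    X_mod[:, 0] = value
    ice[:, j] = model_ice.predict_proba(X_mod)[:, 1]

# Centering pins every curve to zero at the first grid point.
centered_ice = ice - ice[:, [0]]

# The median across rows summarizes the general trend of the centered curves.
median_curve = np.median(centered_ice, axis=0)

# Averaging the uncentered curves recovers the partial dependence curve.
pdp_curve = ice.mean(axis=0)
print(median_curve.round(3))
```

Centering removes differences in each row's starting level, so the spread that remains shows how differently individual rows respond to the feature.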

PDP can also consider the interaction between multiple variables. In theory, any level of interaction can be used. In practice, it is generally limited to two variables, because only two-way interactions can be plotted without a significant amount of compute. In this example, the partial dependence is computed for the `Age` and `JobRole` variables together.

In [None]:
adjr_explanation = explainer.global_explanation().compute_partial_dependence(
    ['Age', 'JobRole'])


adjr_explanation.show_in_notebook(
    show_distribution=True, show_correlation_warning=False, line_gap=1)

Access to the raw data can be obtained by converting it to a data frame.

In [None]:
adjr_explanation.as_dataframe()

<a id='localexplanations'></a>
## Generating Local Explanations

Global explanations inform the data scientist about the general trends in a model. They do not describe what is happening with a specific prediction. That is the role of local explanations. They are model-agnostic and provide insights into why a model made a specific prediction.  



In [None]:
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
local_explainer = explainer.local_explanation(provider=MLXLocalExplainer())

Print a detailed summary about how local explanations work:

In [None]:
local_explainer.summary()

Select and display a sample to perform a local explanation on:

In [None]:
(X, y) = test.X.iloc[0:1], test.y.iloc[0:1]
X

In [None]:
# Reuse the local explainer created earlier.
local_explainer.explain(X, y).show_in_notebook(labels=[True])
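
One common way to build a model-agnostic local explanation is a proximity-weighted linear surrogate, in the spirit of LIME. The sketch below is a generic illustration under that assumption and is not a description of how `MLXLocalExplainer` works. The data, the helper `explain_locally`, and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng_loc = np.random.default_rng(0)
X_loc = rng_loc.normal(size=(500, 3))
# Only the first two columns drive the label; column 2 is noise.
y_loc = (2.0 * X_loc[:, 0] - 1.0 * X_loc[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=100, random_state=0)
black_box.fit(X_loc, y_loc)


def explain_locally(model, x, n_samples=2000, scale=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around the single row `x`."""
    local_rng = np.random.default_rng(seed)
    # Perturb the row to probe the model in its neighbourhood.
    neighbours = x + local_rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    preds = model.predict_proba(neighbours)[:, 1]
    # Closer neighbours get larger weights (Gaussian kernel on distance).
    distances = np.linalg.norm(neighbours - x, axis=1)
    weights = np.exp(-(distances ** 2) / (2 * scale ** 2))
    surrogate = Ridge(alpha=1.0).fit(neighbours, preds, sample_weight=weights)
    # The coefficients approximate each feature's local effect on P(class 1).
    return surrogate.coef_


x0 = np.array([0.1, 0.0, 0.0])  # a row close to the decision boundary
local_coefs = explain_locally(black_box, x0)
print(local_coefs.round(3))
```

The signs of the coefficients indicate which features push this particular prediction toward or away from the positive class, which is the question a local explanation answers.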

<a id='ref'></a>
# References

- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [Interpretable Machine Learning ICE](https://christophm.github.io/interpretable-ml-book/ice.html)
- [Interpretable Machine Learning PDP](https://christophm.github.io/interpretable-ml-book/pdp.html)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Scikit-learn](https://scikit-learn.org/stable/)