In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Text Classification and Model Explanations using LIME</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

This notebook demonstrates how to explain the predictions of an NLP classifier using a surrogate-model technique called Local Interpretable Model-agnostic Explanations (LIME).

Specifically, it trains a Decision Tree multiclass classification model on the 20 Newsgroups dataset and illustrates how to explain that model using two different implementations of LIME within the NLP conda pack.

The original 20 Newsgroups dataset has 20 categories, with 11,314 news documents in the training dataset and 7,532 in the testing dataset.

Compatible conda pack: [Natural Language Processing](https://docs.oracle.com/en-us/iaas/data-science/using/conda-nlp-fam.htm) for CPU on Python 3.7 (version 2.0)

---

## Contents:

- <a href='#loaddataset'>Load the Dataset</a>
- <a href='#Decision Tree'>Train the Decision Tree Model</a>
- <a href='#eli5'>Model Explanation Using the Eli5 Package</a>
    - <a href='#eli5-global'>Global Explanation</a>
    - <a href='#eli5-local'>Local Explanation</a>
    - <a href='#eli5-Lime'>Inspect Black-Box Models Using the LIME Algorithm</a>
- <a href='#lime'>Model Explanation Using the LIME Package</a>
- <a href='#ref'>References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `20 Newsgroups` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).

---


In [None]:
import eli5
import nltk
import numpy as np
import os
import pandas as pd
import string


from nltk import word_tokenize
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pd.set_option("display.max_colwidth", None)

Download the NLTK `punkt` tokenizer models:

In [None]:
nltk.download("punkt")

<a id='loaddataset'></a>
## Load the Dataset

First, create a `TwentyNewsDataset` class to load news data and do simple preprocessing to prepare data for model training.

In [None]:
class TwentyNewsDataset:
    def __init__(self, data="train", categories=None):
        self.data = fetch_20newsgroups(
            subset=data,
            categories=categories,
            shuffle=True,
            random_state=42,
            remove=("headers", "footers"),
        )

    def load_data(self):
        labels = [self.data["target_names"][x] for x in self.data["target"]]

        # preprocessing
        processed_data = []
        for s in self.data["data"]:
            new_s = s.translate(str.maketrans("", "", string.punctuation))
            processed_data.append(new_s)
        return processed_data, labels
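
To make the preprocessing step concrete, the following minimal sketch applies the same `str.translate` call used in `load_data` to a made-up sentence (the sample text is an assumption for illustration only). It shows that deleting punctuation also fuses contractions and hyphenated words.

```python
import string

# Translation table that deletes every ASCII punctuation character,
# built the same way as in load_data.
strip_punctuation = str.maketrans("", "", string.punctuation)

sample = "Hello, world! It's a well-known car."  # made-up example text
cleaned = sample.translate(strip_punctuation)
print(cleaned)
```

Because tokens such as `well-known` become `wellknown`, a different cleaning rule (for example, replacing punctuation with spaces) may be preferable for some corpora.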

Initialize the class and load the data for the train and test dataset. Only three of the twenty categories are included to simplify this example.

In [None]:
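# Listed alphabetically to match the classifier's sorted `classes_`, so that the
# names passed to the explainers later line up with the `predict_proba` columns.
categories = ["misc.forsale", "rec.autos", "sci.med"]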
train_data, train_label = TwentyNewsDataset(categories=categories).load_data()
test_data, test_label = TwentyNewsDataset(
    data="test", categories=categories
).load_data()

<a id='Decision Tree'></a>
## Train the Decision Tree Model

Next, use scikit-learn's `TfidfVectorizer` to transform the text data into feature vectors, and then train a Decision Tree model using scikit-learn's `DecisionTreeClassifier`. Bigrams and trigrams are included as features alongside unigrams (single words).

In [None]:
tf_vectorizer = TfidfVectorizer(
    stop_words="english", analyzer="word", ngram_range=(1, 3)
)

trainx = tf_vectorizer.fit_transform(train_data)
testx = tf_vectorizer.transform(test_data)


dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(trainx, train_label)
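
As a rough illustration of what `ngram_range=(1, 3)` adds to the feature space, the sketch below builds word n-grams by hand. The helper `word_ngrams` and the tiny stop-word set are hypothetical and exist only for this example; scikit-learn's own tokenization and its `"english"` stop-word list are more involved.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Hypothetical, tiny stop-word list used only for this illustration.
stop_words = {"the", "is", "for"}

# Stop words are dropped before the n-grams are formed.
tokens = [t for t in "the used car is for sale".split() if t not in stop_words]
features = word_ngrams(tokens)
print(features)
```

Because stop words are removed first, words that were not adjacent in the original text (here `car` and `sale`) can end up in the same bigram. The number of features also grows quickly with the n-gram range, which is worth keeping in mind when reading the explanations later.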

Next, create a `Pipeline` object to make your model code more modular, which makes it easier to try different kinds of preprocessing or vectorization:

In [None]:
predict_pipeline = Pipeline([("vectorizer", tf_vectorizer), ("model", dt_classifier)])

Take a look at the performance of your trained model. For a classifier, `score` returns the mean accuracy on the given test data:

In [None]:
predict_pipeline.score(test_data, test_label)

You can use the `predict` and `predict_proba` methods of the `Pipeline` object to vectorize any text and predict its class or the probability of each class.

In [None]:
print(train_data[0])

print(predict_pipeline.predict_proba([train_data[0]]))
print(predict_pipeline.predict([train_data[0]]))
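
The columns returned by `predict_proba` follow the order of the fitted model's `classes_` attribute, which scikit-learn stores in sorted order rather than in the order the labels were listed. The minimal sketch below uses an invented toy corpus (an assumption for illustration, not the 20 Newsgroups data) to show how to pair each probability with its class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus and labels, invented for illustration only.
docs = ["engine and brakes", "doctor and medicine", "price or best offer",
        "new tires", "flu symptoms", "selling my desk"]
labels = ["rec.autos", "sci.med", "misc.forsale",
          "rec.autos", "sci.med", "misc.forsale"]

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
toy_pipeline.fit(docs, labels)

# classes_ is sorted, regardless of the order in which labels first appear.
class_order = list(toy_pipeline.classes_)
probabilities = toy_pipeline.predict_proba(["engine and brakes"])[0]
for name, p in zip(class_order, probabilities):
    print(f"{name}: {p:.2f}")
```

Any list of class names handed to an explainer should follow this same order, otherwise the explanation is labeled with the wrong classes.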

Finally, try this model on an unseen example as a sanity check:

In [None]:
predict_pipeline.predict(["Cars are very exciting!"])

<a id='eli5'></a>

## Model Explanation Using the Eli5 Package

When you evaluate a machine learning model, you want to know why the model makes incorrect predictions. You can leverage model explanation libraries to understand why a model makes the predictions that it does. Model explanations (even surrogate ones) aid in interpreting the model. Without them, the model may remain a black box, which makes it unusable in many domains.

Model explanations fall into two categories: global and local. Global explainability techniques seek to explain the overall behavior of the model by inspecting its features, while local explanations examine an individual prediction and show why the model made that particular decision. Local explanation methods tend to be easier to apply to any kind of model than global explanations are.

The [Eli5 package](https://eli5.readthedocs.io/en/latest/overview.html#features) is helpful to demonstrate model explanations. It supports common machine learning libraries and frameworks and implements algorithms for the inspection of black-box models.

<a id='eli5-global'></a>
### Global Explanation

Feature importance is a good way to demonstrate global explanation. Eli5 implements a permutation importance method that calculates feature importance for any black-box estimator. That method is less useful in this example because Decision Tree models support calculating feature importance directly, but it can be applied to models that don't support these direct calculations.
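
The idea behind permutation importance can be sketched in a few lines of NumPy: shuffle one feature column at a time and measure how much the model's accuracy drops. The code below is a simplified illustration, not Eli5's implementation; `black_box_predict` is a hypothetical stand-in for any fitted model, and the data is randomly generated.

```python
import numpy as np


def black_box_predict(X):
    # Hypothetical stand-in for any fitted model: it only looks at feature 0.
    return (X[:, 0] > 0.5).astype(int)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling a column breaks its relationship with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances


rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
importances = permutation_importance(black_box_predict, X, y)
print(importances)
```

Only the feature the model actually uses shows a drop in accuracy; shuffling the unused columns changes nothing, so their importance is zero.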

To show a global explanation, call the `show_weights` method from the eli5 package, pass in the trained classifier, and optionally configure a few other parameters. The output shows the features (words, bigrams, and trigrams) ranked by their importance in the trained model.

In [None]:
eli5.show_weights(
    estimator=dt_classifier, vec=tf_vectorizer, top=10, target_names=categories
)

<a id='eli5-local'></a>
### Local Explanation

To inspect a single prediction of a white-box model such as a Decision Tree, use the `show_prediction()` method and pass in the document you want the model to predict. The results show the prediction score for each class and the contribution of each feature to the predicted class, ranked by weight.

In [None]:
eli5.show_prediction(
    estimator=dt_classifier,
    doc=train_data[0],
    top=10,
    vec=tf_vectorizer,
    target_names=categories,
)
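
One common way to attribute a single tree prediction to features is to walk the decision path and credit each change in the class distribution to the feature that was split on. The sketch below illustrates that general idea on invented toy data; it is an assumption-laden illustration and not necessarily how Eli5 computes its weights internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: the label depends on both features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 0, 1, 2] * 10)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_


def node_distribution(node_id):
    """Class distribution of the training samples that reach a node."""
    values = tree.value[node_id][0]
    return values / values.sum()


sample = np.array([[1.0, 1.0]])
# Node ids visited by the sample, from the root down to its leaf.
path = clf.decision_path(sample).indices

bias = node_distribution(path[0])  # class distribution at the root
contributions = np.zeros((X.shape[1], clf.n_classes_))
for parent, child in zip(path[:-1], path[1:]):
    split_feature = tree.feature[parent]
    # Credit the change in class distribution to the split feature.
    contributions[split_feature] += node_distribution(child) - node_distribution(parent)

reconstructed = bias + contributions.sum(axis=0)
print(reconstructed, clf.predict_proba(sample)[0])
```

The bias plus the per-feature contributions add up to the predicted class probabilities, which is what makes this decomposition a faithful local explanation for a tree.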

What if your model is not a white-box? 

<a id='eli5-Lime'></a>
### Inspect Black-Box Models Using the LIME Algorithm

Eli5 implements the LIME algorithm to interpret a [black-box classifier](https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html) using a locally fit white-box classifier for text data:

- Create a fake text dataset using perturbed versions of the text.
- Use the black-box classifier to predict on the generated fake dataset.
- Train a white-box classifier on the fake data with the black-box classifier's predicted labels.
- Interpret the original model through weights of this white-box estimator instead.
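
The four steps above can be sketched with plain NumPy. This is a simplified illustration under stated assumptions: `black_box_proba` is a hypothetical keyword-based model, the sentence is made up, and an ordinary least-squares fit on the predicted probabilities stands in for the white-box classifier that Eli5 trains.

```python
import numpy as np


def black_box_proba(texts):
    """Hypothetical black box: the score is high when 'engine' is present."""
    return np.array([0.9 if "engine" in t.split() else 0.1 for t in texts])


text = "the engine makes a strange noise"  # made-up example document
words = text.split()
rng = np.random.default_rng(0)

# 1. Create perturbed copies of the text by randomly dropping words.
masks = rng.integers(0, 2, size=(500, len(words)))
perturbed = [" ".join(w for w, keep in zip(words, row) if keep) for row in masks]

# 2. Ask the black box to score every perturbed text.
probs = black_box_proba(perturbed)

# 3. Fit a white-box linear model (least squares with an intercept) that
#    predicts the black-box output from which words were kept.
design = np.hstack([masks, np.ones((len(masks), 1))])
coef, *_ = np.linalg.lstsq(design, probs, rcond=None)

# 4. Read the explanation off the white-box weights.
word_weights = dict(zip(words, coef[:-1]))
top_word = max(word_weights, key=lambda w: abs(word_weights[w]))
print(top_word, round(word_weights[top_word], 3))
```

The surrogate recovers the one word the toy black box depends on. With a real model the fit is only approximate, which is why checking the surrogate's quality matters.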

The next cell is an example of using the LIME algorithm and a surrogate linear model to interpret our Decision Tree model. The results show the prediction score for each class and the highlighted text based on the feature contribution of each word to the predicted class.

In [None]:
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)

te.fit(train_data[3], predict_pipeline.predict_proba)
te.show_prediction(target_names=categories)

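You can check the quality of the explanation on a held-out portion of the generated data by reading the `metrics_` attribute. It contains a `score` (the accuracy of the white-box classifier, weighted by similarity to the original document; higher is better) and a `mean_KL_divergence` (how far its predicted probabilities are from those of the black-box model; lower is better). Together they indicate whether the LIME explanation can be trusted: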

In [None]:
te.metrics_

You can also check the generated fake dataset by calling the `samples_` attribute:

In [None]:
print(te.samples_[0])

Besides the implementation of LIME in the `Eli5` package, there is also an implementation of LIME from the authors of the original paper.

<a id='lime'></a>
## Model Explanation Using the LIME Package

The LIME package focuses only on local, model-agnostic explanations.

Similar to the LIME implementation in the Eli5 package, the [algorithm](https://github.com/marcotcr/lime) also perturbs the original text, but it uses a sparse linear model to explain the black-box model.
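
A sparse, locally weighted linear surrogate can be sketched as follows. This is a minimal illustration, not the LIME package's code: the exponential kernel, the `kernel_width` value, the top-k feature selection, and the helper name `sparse_local_surrogate` are all assumptions chosen for this example, and the "black-box" output is synthetic.

```python
import numpy as np


def sparse_local_surrogate(masks, probs, num_features=2, kernel_width=0.75):
    """Weighted least squares on binary word masks, keeping the top features."""
    n_words = masks.shape[1]
    # Cosine distance between each perturbed sample and the original
    # text, which corresponds to a mask of all ones.
    norms = np.linalg.norm(masks, axis=1) * np.sqrt(n_words)
    similarity = masks.sum(axis=1) / np.where(norms == 0, 1.0, norms)
    distance = 1.0 - similarity
    # Exponential kernel: samples close to the original get more weight.
    sample_weight = np.exp(-(distance ** 2) / kernel_width ** 2)
    root_w = np.sqrt(sample_weight)

    def weighted_fit(columns):
        design = np.hstack([masks[:, columns], np.ones((len(masks), 1))])
        coef, *_ = np.linalg.lstsq(
            design * root_w[:, None], probs * root_w, rcond=None
        )
        return coef[:-1]  # drop the intercept

    full_coef = weighted_fit(np.arange(n_words))
    # Sparsify: keep the largest-magnitude features and refit on them only.
    kept = np.sort(np.argsort(-np.abs(full_coef))[:num_features])
    return dict(zip(kept.tolist(), weighted_fit(kept)))


rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, 5)).astype(float)
# Synthetic black-box output that depends on words 0 and 2 only.
probs = 0.1 + 0.6 * masks[:, 0] + 0.2 * masks[:, 2]
explanation = sparse_local_surrogate(masks, probs)
print(explanation)
```

Limiting the surrogate to a handful of features keeps the explanation readable, which plays the same role as the `num_features` argument in the example that follows.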

This is a simple example showing how to use the LIME API for a local explanation:

In [None]:
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=categories)

exp = explainer.explain_instance(
    train_data[0], predict_pipeline.predict_proba, num_features=10, top_labels=len(categories)
)

exp.show_in_notebook()

<a id="ref"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)