In [2]:
pip install transformers-interpret


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers-interpret
  Downloading transformers-interpret-0.6.0.tar.gz (35 kB)
Collecting transformers>=3.0.0
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting captum>=0.3.1
  Downloading captum-0.5.0-py3-none-any.whl (1.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.w

### Sequence Classification Explainer
Let's start by initializing a transformers model and tokenizer and passing them to the SequenceClassificationExplainer.

For this example we use distilbert-base-uncased-finetuned-sst-2-english, a DistilBERT model fine-tuned on a sentiment analysis task.

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# With both the model and tokenizer initialized we are now able to get explanations on an example text.

from transformers_interpret import SequenceClassificationExplainer
cls_explainer = SequenceClassificationExplainer(
    model,
    tokenizer)
word_attributions = cls_explainer("I love you, I like you")


In [4]:
word_attributions

[('[CLS]', 0.0),
 ('i', 0.2778542106073973),
 ('love', 0.7792373079344496),
 ('you', 0.38560053391419574),
 (',', -0.017697477692672603),
 ('i', 0.12071900163288994),
 ('like', 0.1909112905335153),
 ('you', 0.3399486503735562),
 ('[SEP]', 0.0)]

Positive attribution numbers indicate that a word contributes positively towards the predicted class, while negative numbers indicate that a word contributes negatively towards it. Here "love" receives the largest attribution (about 0.78), followed by the two occurrences of "you".
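
As a minimal sketch of how these numbers can be read programmatically, the block below re-uses the attribution values printed above, ranks the tokens, and splits them by sign. The variable names `ranked`, `supporting`, `opposing`, and `total` are illustrative only and are not part of the library.

```python
# Attributions copied from the word_attributions output above.
word_attributions = [
    ("[CLS]", 0.0),
    ("i", 0.2778542106073973),
    ("love", 0.7792373079344496),
    ("you", 0.38560053391419574),
    (",", -0.017697477692672603),
    ("i", 0.12071900163288994),
    ("like", 0.1909112905335153),
    ("you", 0.3399486503735562),
    ("[SEP]", 0.0),
]

# Sort tokens from most supportive to most opposing for the predicted class.
ranked = sorted(word_attributions, key=lambda pair: pair[1], reverse=True)

# Split by sign: positive scores push towards the predicted class,
# negative scores push away from it.
supporting = [(tok, score) for tok, score in ranked if score > 0]
opposing = [(tok, score) for tok, score in ranked if score < 0]

# Sum of all per-token attributions.
total = sum(score for _, score in word_attributions)

print("top token:", ranked[0])
print("opposing tokens:", opposing)
print("total attribution:", round(total, 2))
```

The total rounds to 2.08, which is the Attribution Score that visualize() reports for this sentence later in the notebook. That suggests the reported score is the sum of the per-token values.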

To find out what the predicted class actually is, use predicted_class_index for its index and predicted_class_name for its label:

In [5]:
cls_explainer.predicted_class_index

array(1)

In [6]:
cls_explainer.predicted_class_name

'POSITIVE'

In [7]:
cls_explainer.visualize("distilbert_viz.html")


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 1.0 | POSITIVE (1.00) | POSITIVE | 2.08 | [CLS] i love you , i like you [SEP] |




### Explaining Attributions for a Non-Predicted Class
Attribution explanations are not limited to the predicted class. Let's test a more complex sentence that contains mixed sentiments.

In the example below we pass class_name="NEGATIVE" as an argument, indicating that we would like the attributions to be explained for the NEGATIVE class regardless of what the actual prediction is. Because this is a binary classifier, the result is effectively the inverse of the attributions for the POSITIVE class.
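
One way to see why the two classes give inverse attributions is a toy two-class softmax. This sketch assumes the explainer attributes with respect to softmax class probabilities, which is an assumption about the library's internals that this notebook does not show. `toy_logits` is a hypothetical stand-in for a model, not DistilBERT. Since the two probabilities always sum to 1, any change that raises one lowers the other by the same amount, so their gradients are exact negatives of each other.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def toy_logits(x):
    # Hypothetical two-class "model": [NEGATIVE logit, POSITIVE logit],
    # each linear in a single input feature x.
    return [-1.5 * x + 0.2, 2.0 * x - 0.1]


def prob_gradient(class_index, x, eps=1e-6):
    """Central finite-difference gradient of one class probability w.r.t. x."""
    upper = softmax(toy_logits(x + eps))[class_index]
    lower = softmax(toy_logits(x - eps))[class_index]
    return (upper - lower) / (2 * eps)


x = 0.3
grad_negative = prob_gradient(0, x)
grad_positive = prob_gradient(1, x)

print("d p(NEGATIVE) / dx:", grad_negative)
print("d p(POSITIVE) / dx:", grad_positive)
print("sum of the two:", grad_negative + grad_positive)
```

Gradient-based attribution methods build on these gradients, so under the stated assumption the sign flip carries over to the attributions. With more than two classes this symmetry no longer holds.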

In [9]:
cls_explainer = SequenceClassificationExplainer(model, tokenizer)
attributions = cls_explainer("I love you, I like you, I also kinda dislike you", class_name="NEGATIVE")

In [10]:
cls_explainer.predicted_class_name


'POSITIVE'

In [11]:
cls_explainer.visualize("distilbert_negative_attr.html")


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 0.0 | POSITIVE (0.00) | NEGATIVE | -1.63 | [CLS] i love you , i like you , i also kinda dislike you [SEP] |


### MultiLabel Classification Explainer
This explainer is an extension of the SequenceClassificationExplainer and is thus compatible with all sequence classification models from the Transformers package. The key change in this explainer is that it calculates attributions for each label in the model's config and returns a dictionary of word attributions w.r.t. each label. The visualize() method also displays a table with attributions calculated per label.

In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import MultiLabelClassificationExplainer

model_name = "j-hartmann/emotion-english-distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


cls_explainer = MultiLabelClassificationExplainer(model, tokenizer)


word_attributions = cls_explainer("There were many aspects of the film I liked, but it was frightening and gross in parts. My parents hated it.")


In [13]:
word_attributions

{'anger': [('<s>', 0.0),
  ('There', 0.09002251057477229),
  ('were', -0.025129775900540254),
  ('many', -0.028852744645040344),
  ('aspects', -0.06341975738850242),
  ('of', -0.03587629494821015),
  ('the', -0.014813134114657442),
  ('film', -0.14087604614914045),
  ('I', 0.00736794405344098),
  ('liked', -0.09816598361060064),
  (',', -0.014259522976316006),
  ('but', -0.08087146615875056),
  ('it', -0.10185211768031022),
  ('was', -0.07132252076276328),
  ('frightening', -0.4125354406380191),
  ('and', -0.02176165366961444),
  ('gross', -0.10423769861677833),
  ('in', -0.02383647745157393),
  ('parts', -0.027137700776358572),
  ('.', -0.029604221314589833),
  ('My', 0.056427806220330175),
  ('parents', 0.11146651099428335),
  ('hated', 0.8497977561249986),
  ('it', 0.053581210946560986),
  ('.', -0.013566299008078124),
  ('', 0.09293260940936088),
  ('</s>', 0.0)],
 'disgust': [('<s>', 0.0),
  ('There', -0.035296362919335564),
  ('were', -0.01022489750743456),
  ('many', -0.03747574
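
Because the result is a dictionary keyed by label, it can be summarised label by label. The following is a minimal sketch under a few assumptions: `summarize` is a hypothetical helper, not part of transformers_interpret, and only the 'anger' entry is reproduced because the remaining labels are truncated in the output above.

```python
# Only the 'anger' entry is reproduced here; the other labels are truncated
# in the notebook output above.
word_attributions = {
    "anger": [
        ("<s>", 0.0),
        ("There", 0.09002251057477229),
        ("were", -0.025129775900540254),
        ("many", -0.028852744645040344),
        ("aspects", -0.06341975738850242),
        ("of", -0.03587629494821015),
        ("the", -0.014813134114657442),
        ("film", -0.14087604614914045),
        ("I", 0.00736794405344098),
        ("liked", -0.09816598361060064),
        (",", -0.014259522976316006),
        ("but", -0.08087146615875056),
        ("it", -0.10185211768031022),
        ("was", -0.07132252076276328),
        ("frightening", -0.4125354406380191),
        ("and", -0.02176165366961444),
        ("gross", -0.10423769861677833),
        ("in", -0.02383647745157393),
        ("parts", -0.027137700776358572),
        (".", -0.029604221314589833),
        ("My", 0.056427806220330175),
        ("parents", 0.11146651099428335),
        ("hated", 0.8497977561249986),
        ("it", 0.053581210946560986),
        (".", -0.013566299008078124),
        ("", 0.09293260940936088),
        ("</s>", 0.0),
    ],
}


def summarize(attributions):
    """Hypothetical helper: strongest tokens and total attribution for one label."""
    strongest_positive = max(attributions, key=lambda pair: pair[1])
    strongest_negative = min(attributions, key=lambda pair: pair[1])
    total = sum(score for _, score in attributions)
    return strongest_positive, strongest_negative, total


summary = {label: summarize(attrs) for label, attrs in word_attributions.items()}

for label, (positive, negative, total) in summary.items():
    print(f"{label}: most supportive={positive[0]!r}, "
          f"most opposing={negative[0]!r}, total={total:.2f}")
```

For 'anger', the most supportive token is "hated" and the most opposing is "frightening", which fits a sentence the model ultimately scores as fear rather than anger.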

In [14]:
cls_explainer.visualize("multilabel_viz.html")


| Prediction Score | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|
| (0.01) | anger | -0.05 | #s There were many aspects of the film I liked , but it was frightening and gross in parts . My parents hated it . #/s |
| (0.02) | disgust | 0.01 | #s There were many aspects of the film I liked , but it was frightening and gross in parts . My parents hated it . #/s |
| (0.97) | fear | 0.75 | #s There were many aspects of the film I liked , but it was frightening and gross in parts . My parents hated it . #/s |
| (0.00) | joy | -0.87 | #s There were many aspects of the film I liked , but it was frightening and gross in parts . My parents hated it . #/s |
| (0.00) | neutral | -1.26 | #s There were many aspects of the film I liked , but it was frightening and gross in parts . My parents hated it . #/s |


### Zero Shot Classification Explainer
Models using this explainer must be previously trained on NLI classification downstream tasks and have a label in the model's config called either "entailment" or "ENTAILMENT".

This explainer allows attributions to be calculated for zero-shot-classification-like models. To achieve this we use the same methodology employed by Hugging Face. For those not familiar with it, the method exploits the "entailment" label of NLI models: each candidate label is turned into a hypothesis, and the model's entailment score for that hypothesis is used as the score for the label. Here is a link to a paper explaining more about it. A list of NLI models guaranteed to be compatible with this explainer can be found on the model hub.
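
The entailment trick can be sketched without loading a model. This block makes three assumptions: the wording of `hypothesis_template` is a guess and the explainer's actual default may differ, the `entailment_logits` are made-up numbers standing in for real model outputs, and normalising across labels with a softmax is only one plausible way to turn entailment scores into label scores.

```python
import math

text = (
    "Today apple released the new Macbook showing off a range of new "
    "features found in the proprietary silicon chip computer. "
)
labels = ["finance", "technology", "sports"]

# Assumed template for illustration; the explainer's actual default may differ.
hypothesis_template = "this text is about {} ."

# Each candidate label becomes one (premise, hypothesis) pair for the NLI model.
pairs = [(text, hypothesis_template.format(label)) for label in labels]

# Hypothetical entailment logits, one per pair. A real run would read these
# from the model's "entailment" output for each pair.
entailment_logits = [0.4, 2.8, 0.4]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]


# Normalise the entailment scores across labels and pick the best one.
scores = dict(zip(labels, softmax(entailment_logits)))
predicted_label = max(scores, key=scores.get)

for premise, hypothesis in pairs:
    print(repr(hypothesis))
print(scores)
print("predicted:", predicted_label)
```

The contradiction and neutral outputs play no role in this sketch, which mirrors the idea that only the entailment label is inspected for zero-shot classification.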

Let's start by initializing a transformers sequence classification model and tokenizer trained specifically on an NLI task, and passing them to the ZeroShotClassificationExplainer.

For this example we use facebook/bart-large-mnli, a checkpoint for a bart-large model trained on the MNLI dataset. This model typically predicts whether a sentence pair is an entailment, neutral, or a contradiction; for zero-shot, however, we only look at the entailment label.

Notice that we pass our own custom labels ["finance", "technology", "sports"] to the class instance. Any number of labels can be passed, including as few as one. Whichever label scores highest for entailment can be accessed via predicted_label; the attributions themselves, however, are calculated for every label. If you want to see the attributions for a particular label, it is recommended to pass in just that one label, so that the attributions are guaranteed to be calculated w.r.t. that label.

In [15]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import ZeroShotClassificationExplainer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")


zero_shot_explainer = ZeroShotClassificationExplainer(model, tokenizer)


word_attributions = zero_shot_explainer(
    "Today apple released the new Macbook showing off a range of new features found in the proprietary silicon chip computer. ",
    labels = ["finance", "technology", "sports"],
)



In [16]:
word_attributions

{'finance': [('<s>', 0.0),
  ('Today', 0.0),
  ('apple', -0.01613677276882101),
  ('released', 0.334810929012864),
  ('the', -0.8933469144551915),
  ('new', 0.1420993977604435),
  ('Mac', 0.01624774896439183),
  ('book', -0.06960714547229564),
  ('showing', -0.12651672581052884),
  ('off', -0.114708075566471),
  ('a', -0.03300858485124436),
  ('range', -0.0025667033279626683),
  ('of', -0.02253172157313357),
  ('new', -0.018566417425873494),
  ('features', -0.02073655965351906),
  ('found', -0.007759053868642555),
  ('in', 0.005041683967560297),
  ('the', 0.04696521503073988),
  ('proprietary', 0.04621103557939502),
  ('silicon', -0.003347806824963864),
  ('chip', -0.010360479592646282),
  ('computer', -0.11502367823950617),
  ('.', 0.12227989573607549)],
 'sports': [('<s>', 0.0),
  ('Today', 0.0),
  ('apple', 0.17779969705352994),
  ('released', 0.10039436387671084),
  ('the', 0.48205759741358567),
  ('new', -0.01857899109482789),
  ('Mac', 0.01619916563128396),
  ('book', 0.393230103

In [17]:
zero_shot_explainer.predicted_label

'technology'

**Visualize Zero Shot Classification attributions** <br/>
For the ZeroShotClassificationExplainer the visualize() method returns a table similar to the SequenceClassificationExplainer but with attributions for every label.

In [18]:
zero_shot_explainer.visualize("zero_shot.html")


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| finance | finance (0.08) | finance | -0.74 | #s Today apple released the new Mac book showing off a range of new features found in the proprietary silicon chip computer . |
| technology | technology (0.84) | technology | 1.32 | #s Today apple released the new Mac book showing off a range of new features found in the proprietary silicon chip computer . |
| sports | sports (0.08) | sports | 1.61 | #s Today apple released the new Mac book showing off a range of new features found in the proprietary silicon chip computer . |
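
The table shows that the label with the largest total attribution is not necessarily the predicted label. The short sketch below makes this explicit using only the numbers from the visualization output above. The dictionary names are illustrative and not part of the library.

```python
# Values copied from the visualization output above.
prediction_scores = {"finance": 0.08, "technology": 0.84, "sports": 0.08}
attribution_totals = {"finance": -0.74, "technology": 1.32, "sports": 1.61}

# The predicted label follows the entailment-based prediction score...
by_prediction = max(prediction_scores, key=prediction_scores.get)

# ...while the largest attribution total belongs to a different label.
by_attribution = max(attribution_totals, key=attribution_totals.get)

print("highest prediction score:", by_prediction)
print("highest attribution total:", by_attribution)
```

Here the prediction score selects "technology", matching predicted_label, while the attribution total is highest for "sports", so the two columns should not be read as interchangeable.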
