[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/lcl23-xnlm-lab/blob/main/notebooks/2_Attribution_Contrastive.ipynb)

In [None]:
%%capture
# Run in Colab to install local packages
%pip install ferret-xai
%pip install git+https://github.com/inseq-team/inseq.git
%pip install numpy==1.23.1

# Feature Attribution for NLP

*See [Madsen et al. 2022](https://dl.acm.org/doi/10.1145/3546577) for a survey on feature attribution methods for NLP applications.*

**Feature attribution methods** leverage the internal information (e.g. gradients, attention) or predictions of a model to quantify the relationship between its inputs and its predictions.

Attribution methods produce **importance scores** (or *saliency scores*) for every element of the input, reflecting how much each element drives the model prediction. These scores are often presented using <span style="background:#A85E9E">highlights</span> (or *attribution maps*) to facilitate comprehension, although recent research has shown that textual highlights risk being misinterpreted because of human cognitive biases ([Jacovi et al. 2022](https://dl.acm.org/doi/abs/10.1145/3531146.3533127), [Jacovi et al. 2023](https://arxiv.org/abs/2305.02679)).

![Example of attribution map for sentiment analysis](https://dl.acm.org/cms/asset/b4fe400e-9eb7-4f10-ab49-ac6a31a8a5cf/csur-2021-0674-f03.jpg)

*Hypothetical example of an attribution map for sentiment analysis from [Madsen et al. 2022](https://dl.acm.org/doi/10.1145/3546577). Here c is the explained (predicted) class, while y is the correct label for the example.*

We can categorize feature attribution approaches in three major families:

- **Gradient-based methods** such as Integrated Gradients use gradients as a natural source of information to explain model predictions. During training, gradients of a loss function with respect to the model parameters are used to update those parameters, since they indicate *the direction and magnitude of the parameter change that brings the prediction closer to the target label*. For feature attribution, gradients of a prediction logit or probability are instead computed with respect to the input features, and are read as *how much each feature contributes to the prediction*.

<img alt="How gradient attribution is computed" src="https://jalammar.github.io/images/explaining/111.PNG" height=300/>

*Overview of feature attribution using gradient information. From Jay Alammar's blog post "[Interfaces for Explaining Transformer Language Models](https://jalammar.github.io/explaining-transformers/)"*

- **Perturbation-based methods** such as [Occlusion](https://captum.ai/api/occlusion.html) estimate the importance of inputs by perturbing the prediction process, usually by masking or removing either input features or network components, and measuring the downstream effect on model predictions. The same approach can therefore also be used to estimate the importance of internal components such as layers.

- **Internals-based methods** use quantities that the network computes naturally during prediction to explain its behavior. For Transformer-based models, [attention weights](https://aclanthology.org/D19-1002/) are commonly used, by themselves or multiplied with other quantities, as indicators of feature importance.

<img alt="Aligning words across source and target sentences in translation with attention" src="https://jalammar.github.io/images/attention_sentence.png" height=350/>

*Example of attention weights aligning source and target words in a translation task. From Bahdanau et al. 2015 "[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)"*
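To make the first two families concrete, here is a minimal sketch on a toy linear classifier. The embeddings, weights, and function names are all hypothetical and invented for illustration; a real model would need automatic differentiation rather than the analytic gradient used here.

```python
# Hypothetical 2-dimensional token embeddings and classifier weights (made-up values).
EMBEDDINGS = {
    "sad": [-1.0, 0.5],
    "haircut": [0.1, -0.2],
    "stunning": [1.5, 0.5],
}
WEIGHTS = [2.0, -1.0]
BIAS = 0.1


def logit(tokens):
    """Score of the toy 'positive' class: w . sum(embeddings) + b."""
    pooled = [sum(EMBEDDINGS[t][d] for t in tokens) for d in range(len(WEIGHTS))]
    return sum(w * p for w, p in zip(WEIGHTS, pooled)) + BIAS


def gradient_x_input(tokens):
    """Gradient-based: gradient of the logit w.r.t. each token embedding, times the embedding.

    For this linear model the gradient w.r.t. any token embedding is WEIGHTS,
    so it is written analytically. The per-dimension products are summed to
    obtain a single score per token.
    """
    return [sum(w * e for w, e in zip(WEIGHTS, EMBEDDINGS[t])) for t in tokens]


def occlusion(tokens):
    """Perturbation-based: drop in the logit when each token is removed."""
    full = logit(tokens)
    return [full - logit(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


tokens = ["sad", "haircut", "stunning"]
grad_scores = gradient_x_input(tokens)
occl_scores = occlusion(tokens)
for tok, g, o in zip(tokens, grad_scores, occl_scores):
    print(f"{tok:>10}  grad x input = {g:+.2f}  occlusion = {o:+.2f}")
```

The two methods agree here only because the toy model is linear. For deep networks they generally produce different scores, which is one reason why the choice of attribution method matters.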


<details>
    <summary><b>Additional details about feature attribution methods</b></summary>
    <ol>
        <li> We distinguish gradient-based and internals-based methods since gradient-based methods require a <i>prediction target</i> (e.g. the predicted class label, the next generated word) to compute importance scores, while internals-based ones generally do not rely on the predicted output to determine importance.</li>
        </br>
        <li> Feature importance is generally computed for input tokens, but most methods can also be used to compute importance of intermediate representations (or <i>activations</i>) computed by the model (e.g. stopping at an intermediate layer when backpropagating gradients).</li>
        </br>
        <li> Methods using layer-specific quantities like attention weights usually aggregate those across layers to obtain a proxy importance for input features, either by naive averaging or with more principled methods such as <a href="https://aclanthology.org/2020.acl-main.385/">rollout and flow</a>.</li>
        </br>
        <li>The granularity of importance scores depends on the attribution method being used. For example, the attention mechanism operates at the token level, so attention weights used as importance scores will be one per token in the input sequence. Gradients, on the contrary, are computed for all the dimensions of each token embedding, so an aggregation strategy should be used to obtain a single score per token.</li>
    </ol>
</details>
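As an illustration of aggregating attention across layers, the following is a minimal sketch of attention rollout. It assumes that attention heads have already been averaged into one matrix per layer, and that residual connections are modeled by mixing each matrix with the identity in equal parts. The matrices and function names are made up for the example.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def add_residual(attn):
    """Mix attention with the identity (0.5 each) to account for residual connections."""
    n = len(attn)
    return [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)] for i in range(n)]


def attention_rollout(layers):
    """Multiply residual-adjusted attention matrices from the first layer upwards.

    layers: list of head-averaged [n x n] attention matrices, first layer first.
    Row i of the result estimates how much each input token contributes to
    position i at the last layer.
    """
    rollout = add_residual(layers[0])
    for attn in layers[1:]:
        rollout = matmul(add_residual(attn), rollout)
    return rollout


# Hypothetical attention weights for 3 tokens and 2 layers (each row sums to 1).
layer_1 = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
layer_2 = [[0.5, 0.25, 0.25], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]

rollout = attention_rollout([layer_1, layer_2])
for row in rollout:
    print([round(v, 3) for v in row])
```

Since every adjusted matrix is row-stochastic, each row of the rollout still sums to 1 and can be read as a distribution over input tokens.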

In this lab we will learn how to use modern 🤗-based tools to compute feature attribution for NLP models and evaluate their results.

## Attributing Classification Models with ferret

*More info: [ferret Docs](https://ferret.readthedocs.io/en/latest/index.html)*

The [ferret](https://github.com/g8a9/ferret) library [(Attanasio et al., 2023)](https://aclanthology.org/2023.eacl-demo.29/) can be used to conveniently attribute the prediction of classification models from the 🤗 Transformers framework. In the following example, a multilingual model finetuned for the sentiment analysis task is loaded, and ferret is used to attribute its predictions using various methods:

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
print(model.config.id2label)
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
nlp("Despite your sad haircut, you look truly stunning!")



{0: 'negative', 1: 'neutral', 2: 'positive'}


[{'label': 'positive', 'score': 0.8361954092979431}]

In [19]:
from ferret import SHAPExplainer, Benchmark

# Attributing the predicted class (positive)
bench = Benchmark(model, tokenizer)
shap_exp = SHAPExplainer(model, tokenizer)
explanations = shap_exp("Despite your sad haircut, you look truly stunning!", target=2)
bench.show_table([explanations])

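| Token | ▁De | spite | ▁your | ▁sad | ▁hair | cut | , | ▁you | ▁look | ▁truly | ▁stunning | ! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Partition SHAP | -0.0 | -0.01 | 0.07 | -0.16 | 0.02 | 0.03 | 0.04 | 0.07 | 0.09 | 0.07 | 0.28 | 0.05 |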


In [22]:
# Now attributing the negative class -> positive scores should correspond to negative words
explanations = shap_exp("Despite your sad haircut, you look truly stunning!", target=0)
bench.show_table([explanations])

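| Token | ▁De | spite | ▁your | ▁sad | ▁hair | cut | , | ▁you | ▁look | ▁truly | ▁stunning | ! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Partition SHAP | -0.0 | -0.01 | -0.06 | 0.19 | -0.01 | -0.01 | -0.04 | -0.03 | -0.06 | -0.01 | -0.17 | -0.03 |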


In [30]:
# Using bench directly attributes the input with all available methods
all_explanations = bench.explain("Despite your sad haircut, you look truly stunning!", target=0)
bench.show_table(all_explanations)

                                                        

| Token | ▁De | spite | ▁your | ▁sad | ▁hair | cut | , | ▁you | ▁look | ▁truly | ▁stunning | ! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Partition SHAP | -0.01 | -0.01 | -0.1 | 0.31 | -0.02 | -0.01 | -0.07 | -0.05 | -0.09 | -0.01 | -0.27 | -0.05 |
| LIME | -0.09 | -0.09 | -0.05 | 0.22 | -0.03 | 0.03 | -0.01 | -0.02 | -0.08 | -0.03 | -0.28 | -0.08 |
| Gradient | 0.03 | 0.09 | 0.05 | 0.21 | 0.08 | 0.09 | 0.02 | 0.03 | 0.04 | 0.08 | 0.09 | 0.03 |
| Gradient (x Input) | 0.02 | -0.07 | 0.06 | -0.2 | -0.04 | 0.0 | -0.03 | -0.0 | 0.05 | -0.13 | -0.14 | -0.1 |
| Integrated Gradient | 0.02 | 0.14 | 0.03 | 0.19 | -0.03 | -0.0 | 0.08 | 0.04 | 0.02 | -0.01 | 0.12 | -0.04 |
| Integrated Gradient (x Input) | 0.01 | 0.12 | -0.01 | 0.26 | -0.1 | 0.02 | -0.06 | -0.02 | 0.01 | -0.13 | -0.22 | -0.04 |


### Evaluating Feature Attribution Methods

From the results in the tables above we can see that some methods seem to produce more intuitive results than others. But how can we evaluate the quality of these attributions? Generally, the evaluation of feature attribution methods centers on two aspects:

- **Faithfulness** to model processing, i.e. whether the method correctly identifies the features that have a causal influence on model predictions.

- **Plausibility** of the attribution, i.e. whether the method produces results that match human intuition.
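Faithfulness is commonly estimated by perturbing the input according to the attribution scores. The sketch below computes two such quantities, comprehensiveness and sufficiency, on a toy classifier. `TOKEN_WEIGHTS`, the classifier, and the function names are hypothetical and made up for illustration; they are not part of any library.

```python
import math

# Hypothetical per-token contributions to the logit of the explained class (made-up values).
TOKEN_WEIGHTS = {"despite": -0.2, "sad": -1.5, "haircut": -0.1, "truly": 0.4, "stunning": 2.0}


def prob(tokens):
    """Toy classifier: sigmoid of the summed token contributions."""
    return 1.0 / (1.0 + math.exp(-sum(TOKEN_WEIGHTS[t] for t in tokens)))


def top_k_indices(scores, k):
    """Positions of the k tokens with the highest attribution scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def comprehensiveness(tokens, scores, k):
    """p(full input) - p(input without the top-k attributed tokens). Higher is better."""
    top = set(top_k_indices(scores, k))
    reduced = [t for i, t in enumerate(tokens) if i not in top]
    return prob(tokens) - prob(reduced)


def sufficiency(tokens, scores, k):
    """p(full input) - p(only the top-k attributed tokens). Lower is better."""
    top = set(top_k_indices(scores, k))
    kept = [t for i, t in enumerate(tokens) if i in top]
    return prob(tokens) - prob(kept)


tokens = list(TOKEN_WEIGHTS)
faithful_scores = [TOKEN_WEIGHTS[t] for t in tokens]  # mirrors the toy model exactly
unfaithful_scores = [-s for s in faithful_scores]     # ranks the wrong tokens first

for name, scores in [("faithful", faithful_scores), ("unfaithful", unfaithful_scores)]:
    compr = comprehensiveness(tokens, scores, k=1)
    suff = sufficiency(tokens, scores, k=1)
    print(f"{name:>10}: comprehensiveness = {compr:+.3f}, sufficiency = {suff:+.3f}")
```

With these made-up weights, the faithful ranking obtains higher comprehensiveness and lower sufficiency than the reversed one. Sufficiency can be negative when the kept tokens alone yield a higher probability than the full input.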

There is no guarantee that a method will be faithful and plausible at the same time, and the choice of the evaluation metric will depend on the application. We can estimate some faithfulness and plausibility metrics using the `ferret` library:

In [31]:
explanation_evaluations = bench.evaluate_explanations(
    all_explanations,
    target=0,
    # Human rationale: 1 marks tokens annotated as negative (here only "▁sad")
    human_rationale=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    top_k_rationale=1,
)
bench.show_evaluation_table(explanation_evaluations)

                                                               

| Method | aopc_compr | aopc_suff | taucorr_loo | auprc_plau | token_f1_plau | token_iou_plau |
|---|---|---|---|---|---|---|
| Partition SHAP | 0.05 | -0.47 | 0.12 | 1.0 | 1.0 | 1.0 |
| LIME | 0.05 | -0.42 | 0.55 | 1.0 | 1.0 | 1.0 |
| Gradient | -0.05 | -0.26 | 0.24 | 1.0 | 1.0 | 1.0 |
| Gradient (x Input) | -0.01 | -0.18 | 0.27 | 0.04 | 0.0 | 0.0 |
| Integrated Gradient | 0.01 | -0.35 | 0.0 | 1.0 | 1.0 | 1.0 |
| Integrated Gradient (x Input) | 0.05 | -0.54 | 0.27 | 1.0 | 1.0 | 1.0 |


[➡️ Faithfulness and plausibility metrics explained](https://ferret.readthedocs.io/en/latest/user_guide/notions.benchmarking.html)
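To give an intuition for the token-level plausibility columns, here is a minimal sketch of token F1 and IOU between the top-k attributed tokens and the human rationale. It uses the Partition SHAP and Gradient (x Input) scores reported above for the negative class. The function names are hypothetical, and keeping only positively scored tokens is an assumption of this sketch rather than a documented description of ferret's implementation.

```python
def top_k_positive(scores, k):
    """Indices of the k highest-scoring tokens, keeping only positive scores (assumption)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {i for i in ranked[:k] if scores[i] > 0}


def gold_indices(rationale):
    """Indices of the tokens marked with 1 in the human rationale."""
    return {i for i, r in enumerate(rationale) if r == 1}


def token_f1(scores, rationale, k):
    """Harmonic mean of precision and recall over selected vs. annotated tokens."""
    predicted, gold = top_k_positive(scores, k), gold_indices(rationale)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(predicted), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def token_iou(scores, rationale, k):
    """Intersection over union of selected vs. annotated tokens."""
    predicted, gold = top_k_positive(scores, k), gold_indices(rationale)
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 0.0


# Scores copied from the attribution table for target=0 (negative class).
shap = [-0.01, -0.01, -0.1, 0.31, -0.02, -0.01, -0.07, -0.05, -0.09, -0.01, -0.27, -0.05]
grad_x_input = [0.02, -0.07, 0.06, -0.2, -0.04, 0.0, -0.03, -0.0, 0.05, -0.13, -0.14, -0.1]
rationale = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # only "▁sad" is annotated

for name, scores in [("Partition SHAP", shap), ("Gradient (x Input)", grad_x_input)]:
    print(name, token_f1(scores, rationale, k=1), token_iou(scores, rationale, k=1))
```

Partition SHAP ranks `▁sad` first and obtains 1.0 on both metrics, while Gradient (x Input) ranks `▁your` first and obtains 0.0, in line with the `token_f1_plau` and `token_iou_plau` columns above.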

## Attributing Generative Language Models with Inseq

**Plan**:

- Intro
- Basic example + step scores
- Subword + head/layer aggregation example
- Minimal pair difference example
- Contrastive feature attribution example
- Localizing factual knowledge with layer-wise gradient attribution 