# Visualization with BertViz
## Interpreting Attention Heads
As with most Deep Learning (DL) architectures, neither the success of Transformer models nor the way they learn is fully understood. We do know, however, that Transformers learn many linguistic features of a language remarkably well. A significant amount of this learned linguistic knowledge is distributed across both the hidden states and the self-attention heads of the pre-trained model. Many recent studies and tools aim to understand and better explain these phenomena.

Thanks to tools from the Natural Language Processing (NLP) community, we can interpret the information learned by the self-attention heads in a Transformer model. The heads lend themselves to interpretation because each one assigns explicit weights between pairs of tokens. As the experiments later in this section show, certain heads correspond to particular aspects of syntax or semantics. We can also observe surface-level patterns and many other linguistic features.

In this section, we will conduct some experiments using community tools to observe these patterns and features in the attention heads. Recent studies have already revealed many of the features of self-attention. Let's highlight some of them before we get into the experiments. For example, most heads attend to delimiter tokens such as the separator ([SEP]) and classification ([CLS]) tokens, since these tokens are never masked out and carry segment-level information. Another observation is that most heads pay little attention to the current token, while some heads specialize in attending only to the next or previous token, especially in the earlier layers. Here is a list of other patterns reported in recent studies that we can easily observe in our experiments:

- Attention heads in the same layer show similar behavior.
- Particular heads correspond to specific aspects of syntax or semantic relations.
- In some heads, direct objects tend to attend to their verbs, such as \<lesson, take> or \<car, drive>.
- In some heads, noun modifiers attend to their noun (for example, *the hot water*, *the next layer*), or a possessive pronoun attends to its head noun (for example, *her car*).
- In some heads, passive auxiliary verbs attend to the related verb, such as *been damaged* or *was taken*.
- In some heads, coreferent mentions attend to each other, such as *talks-negotiation*, *she-her*, and *President-Biden*.
- The lower layers usually have information about word positions.
- Syntactic features are observed earlier in the transformer, while high-level semantic information appears in the upper layers.
- The final layers are the most task-specific and are therefore very effective for downstream tasks.
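
The surface-level patterns above (attention to delimiters, attention to the previous or next token) can be quantified with a few lines of code. The following is a minimal sketch, not part of exBERT or BertViz: the helper names `offset_attention` and `delimiter_attention` are hypothetical, and the 4x4 matrix is hand-written toy data rather than model output. It assumes a single head's attention is given as a square matrix whose rows are the attending tokens and whose columns are the attended tokens.

```python
from typing import Sequence

# One head for one sentence: rows = attending token, columns = attended token.
Matrix = Sequence[Sequence[float]]


def offset_attention(head: Matrix, offset: int) -> float:
    """Average weight that token i places on token i + offset (-1 = previous token)."""
    weights = [
        float(head[i][i + offset])
        for i in range(len(head))
        if 0 <= i + offset < len(head)
    ]
    return sum(weights) / len(weights) if weights else 0.0


def delimiter_attention(head: Matrix, tokens: Sequence[str],
                        delimiters: Sequence[str] = ("[CLS]", "[SEP]")) -> float:
    """Average share of each row's attention that lands on delimiter tokens."""
    cols = [j for j, tok in enumerate(tokens) if tok in delimiters]
    return sum(sum(float(row[j]) for j in cols) for row in head) / len(head)


# Toy example with hand-written weights (NOT real model output).
tokens = ["[CLS]", "the", "cat", "[SEP]"]
head = [
    [0.7, 0.1, 0.1, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]

print("previous-token share:", offset_attention(head, -1))
print("current-token share: ", offset_attention(head, 0))
print("delimiter share:     ", delimiter_attention(head, tokens))
```

A head with a high previous-token share and a low current-token share matches the positional pattern described above. The helpers only use plain indexing, so a matrix sliced from a real model's attention output should work as well.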


To observe these patterns, we can use two important tools: exBERT and BertViz. They offer almost the same functionality. We will start with exBERT.

### exBERT
- [Link to the tool](https://huggingface.co/exbert)
- Read the paper: *exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models*, Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann, 2019
- See [this video](https://youtu.be/e31oyfo_thY) for more explanation



### BertViz
We will now write some code to visualize heads with BertViz, which, like exBERT, is a tool for visualizing attention in Transformer models. It was developed by Jesse Vig in 2019 (*A Multiscale Visualization of Attention in the Transformer Model*, Jesse Vig, 2019) and extends the Tensor2Tensor visualization tool (Jones, 2017).

We can monitor the inner parts of a model with multiscale qualitative analysis. The advantage of BertViz is that we can work with most Hugging Face-hosted models (such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), and Cross-lingual Language Model (XLM)) through the Python Application Programming Interface (API). Therefore, we can work with non-English models as well, or with any other pre-trained model. We will examine such examples together shortly. You can access BertViz resources and other information from the following [GitHub link](https://github.com/jessevig/bertviz).


As with exBERT, BertViz visualizes attention heads in a single interface. Additionally, it supports a bird's-eye view and a low-level neuron view, where we observe how individual neurons interact to build attention weights. A useful demonstration video can be found at the following [link](https://vimeo.com/340841955).

In [None]:
# !pip install bertviz
# !pip install transformers
# !pip install ipywidgets

In [1]:
from bertviz import head_view
from transformers import BertTokenizer, BertModel

BertViz supports three views: a head view, a model view, and a neuron view. Let's examine them one by one. First, though, note a difference in indexing: in exBERT, layers and heads are indexed from 1, whereas BertViz indexes them from 0, as in Python. So the head we call <9,9> in exBERT is <8,8> in BertViz.
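
To avoid off-by-one mistakes when moving between the two tools, the conversion can be written down explicitly. This is a small illustrative sketch; `exbert_to_bertviz` and `bertviz_to_exbert` are hypothetical helper names, not functions provided by either tool.

```python
from typing import Tuple


def exbert_to_bertviz(layer: int, head: int) -> Tuple[int, int]:
    """Convert a 1-based <layer, head> pair (exBERT) to a 0-based pair (BertViz)."""
    if layer < 1 or head < 1:
        raise ValueError("exBERT indices start at 1")
    return layer - 1, head - 1


def bertviz_to_exbert(layer: int, head: int) -> Tuple[int, int]:
    """Convert a 0-based <layer, head> pair (BertViz) to a 1-based pair (exBERT)."""
    if layer < 0 or head < 0:
        raise ValueError("BertViz indices start at 0")
    return layer + 1, head + 1


print(exbert_to_bertviz(9, 9))  # (8, 8)
print(bertviz_to_exbert(8, 8))  # (9, 9)
```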

Let's start with the head view.

In [2]:
# Define get_bert_attentions() to retrieve the attention weights and tokens for a sentence pair

def get_bert_attentions(model_path, sentence_a, sentence_b):
    # output_attentions=True makes the model return the per-layer attention weights
    model = BertModel.from_pretrained(model_path, output_attentions=True)
    tokenizer = BertTokenizer.from_pretrained(model_path)
    inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
    token_type_ids = inputs['token_type_ids']
    input_ids = inputs['input_ids']
    # The attention weights are the last element of the model output
    attention = model(input_ids, token_type_ids=token_type_ids)[-1]
    input_id_list = input_ids[0].tolist()  # Batch index 0
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)
    return attention, tokens
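
It helps to know the shape of what this function returns. In Hugging Face Transformers, the attention output is a tuple with one tensor per layer, each shaped `(batch_size, num_heads, seq_len, seq_len)`. The sketch below shows how to read one row of one head from that structure. The helper name `top_attended` is hypothetical, and the nested lists are hand-written toy data that mimic the layout; because the helper relies only on plain indexing, it should also accept the real tensors.

```python
def top_attended(attention, tokens, layer, head, query_index, k=3):
    """Return the k tokens that tokens[query_index] attends to most strongly.

    `attention` is indexed as attention[layer][batch][head][from_token][to_token].
    """
    row = attention[layer][0][head][query_index]  # batch index 0
    scored = [(tokens[j], float(row[j])) for j in range(len(tokens))]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: 1 layer, batch of 1, 2 heads, 3 tokens (NOT real model output).
toy_tokens = ["[CLS]", "cat", "[SEP]"]
toy_attention = [
    [  # batch
        [  # heads
            [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],  # head 0
            [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],  # head 1
        ]
    ]
]

# Which tokens does "cat" attend to in layer 0, head 1?
print(top_attended(toy_attention, toy_tokens, layer=0, head=1, query_index=1, k=2))
```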


## Head View
The head view visualizes attention in one or more heads for the selected layer.
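
For sentence pairs, `head_view` also accepts the index at which the second sentence starts as an optional third argument, `sentence_b_start`, which lets the view filter attention between the two sentences. A minimal sketch of how that index could be derived from the token list follows; `find_sentence_b_start` is a hypothetical helper, and it assumes BERT-style inputs of the form `[CLS] A [SEP] B [SEP]`.

```python
def find_sentence_b_start(tokens, sep_token="[SEP]"):
    """Index of the first token of sentence B, or None for a single sentence.

    Assumes BERT-style inputs: [CLS] sentence A [SEP] sentence B [SEP].
    """
    first_sep = tokens.index(sep_token)  # raises ValueError if there is no separator
    start = first_sep + 1
    return start if start < len(tokens) else None


pair = ["[CLS]", "The", "cat", "[SEP]", "Because", "it", "[SEP]"]
single = ["[CLS]", "The", "cat", "[SEP]"]

print(find_sentence_b_start(pair))    # 4
print(find_sentence_b_start(single))  # None

# In a notebook with bertviz installed, the index would then be passed on:
# head_view(attention, tokens, find_sentence_b_start(tokens))
```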
 

In [3]:
model_path = 'bert-base-cased'
sentence_a = "The cat is very sad."
sentence_b = "Because it could not find food to eat."
attention, tokens=get_bert_attentions(model_path, sentence_a, sentence_b)
head_view(attention, tokens)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Working with Language Models Other than English

### A Turkish Model
Let's look for a coreference pattern in a Turkish language model. The following code loads a Turkish bert-base model and takes a sentence pair. We observe that the <8,8> head captures the same semantic feature in Turkish as it does in the English model.

In [4]:
model_path = 'dbmdz/bert-base-turkish-cased'

sentence_a = "Kedi çok üzgün."  # "The cat is very sad."
sentence_b = "Çünkü o her zamanki gibi çok fazla yemek yedi."  # "Because it ate too much food, as usual."

attention, tokens = get_bert_attentions(model_path, sentence_a, sentence_b)
head_view(attention, tokens)
# Inspect <Layer-8, Head-8> in the rendered view


In [5]:
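# Repeat the experiment with a German BERT model
model_path = 'bert-base-german-cased'
sentence_a = "Die Katze ist sehr traurig."  # "The cat is very sad."
sentence_b = "Weil sie zu viel gegessen hat"  # "Because it ate too much"
attention, tokens = get_bert_attentions(model_path, sentence_a, sentence_b)
head_view(attention, tokens)

# Inspect <Layer-8, Head-11> in the rendered view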


## Model View
The model view gives us a bird's-eye view of attention across all heads and layers. Self-attention heads are shown in tabular form, with rows and columns corresponding to layers and heads, respectively. Each head is visualized as a clickable thumbnail that conveys the broad shape of its attention pattern.
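
The same layers-by-heads table can also be summarized numerically. One common summary is the entropy of each head's attention rows: a value near zero means the head focuses on a single token, while a high value means the attention is spread out. The following is a minimal sketch under stated assumptions; the helper names `attention_entropy` and `entropy_grid` are hypothetical, and the input is toy data laid out like the attention output (`attention[layer][batch][head][from_token][to_token]`).

```python
import math


def attention_entropy(head):
    """Mean Shannon entropy (in nats) over the rows of one head's attention matrix."""
    total = 0.0
    for row in head:
        # Zero weights contribute nothing to the entropy and are skipped.
        total += -sum(float(w) * math.log(float(w)) for w in row if float(w) > 0)
    return total / len(head)


def entropy_grid(attention):
    """Table of entropies: one row per layer, one column per head (batch index 0)."""
    return [[attention_entropy(head) for head in layer[0]] for layer in attention]


# Toy data: 2 layers x 2 heads x 2 tokens (NOT real model output).
focused = [[1.0, 0.0], [0.0, 1.0]]  # each token attends to a single token
uniform = [[0.5, 0.5], [0.5, 0.5]]  # attention spread evenly
toy_attention = [
    [[focused, uniform]],  # layer 0
    [[uniform, focused]],  # layer 1
]

for layer_idx, row in enumerate(entropy_grid(toy_attention)):
    print(layer_idx, [round(value, 3) for value in row])
```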


This view helps us see how BERT works and makes the model easier to interpret. Many recent studies, such as *A Primer in BERTology: What We Know About How BERT Works* (Anna Rogers, Olga Kovaleva, Anna Rumshisky, 2021), found clues about the behavior of the layers and reached a consensus on several of them. We already listed some of these findings in the Interpreting attention heads section. You can test them yourself using BertViz's model view.

We will use the `show_model_view()` wrapper function developed by Jesse Vig.

* https://github.com/jessevig/bertviz/blob/master/notebooks/model_view_bert.ipynb

In [None]:
from bertviz import model_view
from transformers import BertTokenizer, BertModel

In [None]:
def show_model_view(model, tokenizer, sentence_a, sentence_b=None, hide_delimiter_attn=False, display_mode="light"):
    # Tokenize a single sentence or a sentence pair
    inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
    input_ids = inputs['input_ids']
    if sentence_b:
        token_type_ids = inputs['token_type_ids']
        attention = model(input_ids, token_type_ids=token_type_ids)[-1]
        # Sentence B starts at the first token whose segment id is 1
        sentence_b_start = token_type_ids[0].tolist().index(1)
    else:
        attention = model(input_ids)[-1]
        sentence_b_start = None
    input_id_list = input_ids[0].tolist()  # Batch index 0
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)
    if hide_delimiter_attn:
        # Zero out attention from and to [SEP] and [CLS] so that other patterns stand out
        for i, t in enumerate(tokens):
            if t in ("[SEP]", "[CLS]"):
                for layer_attn in attention:
                    layer_attn[0, :, i, :] = 0  # attention from the delimiter
                    layer_attn[0, :, :, i] = 0  # attention to the delimiter
    model_view(attention, tokens, sentence_b_start, display_mode=display_mode)

In [None]:
model_path='bert-base-german-cased'
model = BertModel.from_pretrained(model_path, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_path)

In [None]:
show_model_view(model, tokenizer, sentence_a, sentence_b, hide_delimiter_attn=False, display_mode="light")

Pronoun-antecedent (coreference) patterns are mostly encoded in the heads <8,1>, <8,11>, <10,1>, and <10,7>, written as <layer, head>.

<8,11> is the head that encodes the coreference relation most strongly in the German model.
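
A visual impression like this can be cross-checked by ranking every head by the weight it assigns from the pronoun to its antecedent. The following is a minimal sketch, not the procedure used to obtain the heads listed above: `rank_heads` is a hypothetical helper, the weights are hand-written toy values, and the sketch assumes that both words appear as single tokens after tokenization (a word split into subword pieces would need extra handling).

```python
def rank_heads(attention, tokens, source, target, top_k=3):
    """Rank <layer, head> pairs by the attention weight from `source` to `target`.

    `attention` is indexed as attention[layer][batch][head][from_token][to_token].
    Uses the first occurrence of each token in `tokens`.
    """
    src = tokens.index(source)
    tgt = tokens.index(target)
    scores = []
    for layer_idx, layer in enumerate(attention):
        for head_idx, head in enumerate(layer[0]):  # batch index 0
            scores.append((float(head[src][tgt]), layer_idx, head_idx))
    scores.sort(reverse=True)
    return [(layer, head, weight) for weight, layer, head in scores[:top_k]]


# Toy data: 2 layers x 2 heads over two tokens (NOT real model output).
toy_tokens = ["Katze", "sie"]
toy_attention = [
    [[[[1.0, 0.0], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]],  # layer 0: heads 0 and 1
    [[[[0.5, 0.5], [0.9, 0.1]], [[0.5, 0.5], [0.3, 0.7]]]],  # layer 1: heads 0 and 1
]

# Which heads send the most attention from the pronoun "sie" to "Katze"?
print(rank_heads(toy_attention, toy_tokens, source="sie", target="Katze", top_k=2))
```

The returned indices are zero-based, so they can be compared directly with the layer and head numbers shown in BertViz.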

## Neuron View
The neuron view visualizes attention, as well as the query and key values, in a particular attention head.

The official Usage Notes:
* Hover over any of the tokens on the left side of the visualization to filter attention from that token.
* Then click on the plus icon that is revealed when hovering. This shows the query vectors, key vectors, and intermediate computations for the attention weights (blue=positive, orange=negative).
* Once in the expanded view, hover over any other token on the left to see the associated attention computations.
* Click on the Layer or Head drop-downs to change the model layer or head (zero-indexed).
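
The intermediate computations shown in the expanded view follow the standard scaled dot-product attention: the query and key vectors are multiplied element by element, the products are summed into a dot product, the result is scaled by the square root of the vector size, and a softmax over all keys yields the attention weights. The sketch below reproduces these steps on small hand-written vectors. It is an illustration under these assumptions, not BertViz code, and `neuron_view_row` is a hypothetical helper name.

```python
import math


def neuron_view_row(query, keys):
    """Attention computation for one query token against every key token.

    Returns the element-wise products, the dot products, and the softmax weights.
    """
    dim = len(query)
    # Element-wise products q * k; positive and negative entries correspond to
    # the two colors in the expanded view.
    elementwise = [[q * k for q, k in zip(query, key)] for key in keys]
    dot_products = [sum(products) for products in elementwise]
    scaled = [dp / math.sqrt(dim) for dp in dot_products]
    # Numerically stable softmax over all key positions.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    return elementwise, dot_products, weights


# Toy vectors (NOT real model values): one query and three keys of size 4.
query = [1, -1, 2, 0]
keys = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [2, -2, 0, 0],
]

elementwise, dot_products, weights = neuron_view_row(query, keys)
print(elementwise)   # [[1, 0, 2, 0], [0, -1, 0, 0], [2, 2, 0, 0]]
print(dot_products)  # [3, -1, 4]
print([round(w, 3) for w in weights])
```

The key with the largest dot product receives the largest attention weight, which is what the line thickness in the view reflects.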

In [None]:
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show
model_path='bert-base-german-cased'
sentence_a = "Die Katze ist sehr traurig."
sentence_b = "Weil sie zu viel gegessen hat"
model = BertModel.from_pretrained(model_path, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_path)
model_type = 'bert'
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=8, head=11, display_mode="light")

# Let us check <8,11>, which captures the pronoun-antecedent relation; <2,6> shows the next-token pattern