# BERT Attention Visualization

This notebook demonstrates how to visualize attention mechanisms in BERT using the `transformers` and `bertviz` libraries.


## Understanding the Visualization

The interactive visualization at the end of this notebook shows:
- **Layers**: the 12 encoder layers of `bert-base-uncased`, each operating on the output of the previous one
- **Heads**: 12 attention heads per layer, each of which can focus on different relationships
- **Attention patterns**: lines connecting tokens, with thickness indicating attention strength
- **Interactive exploration**: select different layers and heads to compare their attention patterns

### Key Concepts:
- **Attention**: how strongly each token weights every other token (including itself) when its representation is updated
- **Multi-head attention**: each layer runs several heads in parallel, and different heads can pick up different types of relationships (syntax, semantics, etc.)
- **Layer depth**: deeper layers tend to capture more complex, abstract relationships
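
The "pays attention" idea can be made concrete with a small sketch of scaled dot-product attention for a single head. This is an illustration in plain Python with invented 2-dimensional vectors, not BERT's actual implementation; the names `softmax` and `attention_weights` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of weights per query: softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        rows.append(softmax(scores))
    return rows

# Toy vectors for three tokens (values invented for illustration)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row corresponds to one token and sums to 1. A real BERT head first projects the token vectors through learned query and key matrices, which this sketch leaves out.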

### Requirements:
To run this notebook, you'll need to install the required packages:
```bash
pip install transformers torch bertviz
```


In [2]:
from transformers import BertModel, BertTokenizer
from bertviz import head_view  # used for the attention visualization at the end


In [3]:
# Load pre-trained BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



In [6]:
# Define and tokenize a sample sentence
sentence = "Transformers are amazing!"
inputs = tokenizer(sentence, return_tensors="pt", return_token_type_ids=True)
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

print(f"Original sentence: {sentence}")
print(f"Tokens: {tokens}")


Original sentence: Transformers are amazing!
Tokens: ['[CLS]', 'transformers', 'are', 'amazing', '!', '[SEP]']


In [7]:
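# Run the model to get outputs and attention weights
import torch

with torch.no_grad():  # inference only, so skip gradient tracking
    # hidden_states is only returned when requested explicitly
    outputs = model(**inputs, output_hidden_states=True)

attentions = outputs.attentions  # tuple with one tensor per layer
last_hidden_state = outputs.last_hidden_state
hidden_states = outputs.hidden_states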

In [None]:
# Print shapes of various outputs
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Last hidden state shape: {last_hidden_state.shape}")
print(f"Number of hidden states (embeddings + layers): {len(hidden_states)}")
print(f"Hidden state[0] shape: {hidden_states[0].shape}")
print(f"Attention[0] shape: {attentions[0].shape}")
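
The shapes printed by this cell follow from the model configuration. The sketch below works them out, assuming the standard `bert-base-uncased` settings (12 layers, 12 heads, hidden size 768), the 6-token input shown earlier, and a forward pass that returns both attentions and hidden states. `expected_shapes` is a hypothetical helper, not part of `transformers`.

```python
def expected_shapes(batch_size, seq_len, num_layers=12, num_heads=12, hidden_size=768):
    """Shapes of BertModel outputs for a given input size (assumed config values)."""
    return {
        "input_ids": (batch_size, seq_len),
        "last_hidden_state": (batch_size, seq_len, hidden_size),
        # hidden_states holds the embedding output plus one entry per layer
        "num_hidden_states": num_layers + 1,
        "hidden_state": (batch_size, seq_len, hidden_size),
        # one attention tensor per layer: (batch, heads, query tokens, key tokens)
        "num_attentions": num_layers,
        "attention": (batch_size, num_heads, seq_len, seq_len),
    }

shapes = expected_shapes(batch_size=1, seq_len=6)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Note that the hidden-state count is one more than the layer count, because the embedding output is included as the first entry.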


In [None]:
# Visualize attention patterns
head_view(attentions, tokens)
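
`head_view` draws the attention weights as lines; the same information can also be read off numerically. The sketch below takes one head's attention matrix as nested lists and reports, for each token, which token it attends to most. The matrix values are invented for illustration, and `strongest_links` is a hypothetical helper.

```python
def strongest_links(tokens, matrix):
    """For each query token, return (query, most-attended token, weight)."""
    links = []
    for query, row in zip(tokens, matrix):
        best = max(range(len(row)), key=lambda i: row[i])
        links.append((query, tokens[best], row[best]))
    return links

example_tokens = ["[CLS]", "transformers", "are", "[SEP]"]
# Invented weights; each row sums to 1 like a real attention row
example_matrix = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.10, 0.40],
]
for query, target, weight in strongest_links(example_tokens, example_matrix):
    print(f"{query:>12} -> {target} ({weight:.2f})")
```

With the real model outputs, one head's matrix could be obtained with `attentions[layer][0, head].tolist()` and passed in together with `tokens`, where `layer` and `head` are indices of your choosing.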
