# Transformers

In this notebook, you'll learn how to use the Hugging Face Transformers library for the translation task.

### ⚙️ Setup Workspace

We start by setting up the workspace: mounting Google Drive, installing the `transformers` library together with the packages the model requires, and suppressing warnings.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers
# required by the Helsinki-NLP/opus-mt-en-it model
!pip install sentencepiece
!pip install sacremoses

In [None]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

## 🔄  Loading the Model

Let's first load the [`Helsinki-NLP/opus-mt-en-it` translation model](https://huggingface.co/Helsinki-NLP/opus-mt-en-it)  and its tokenizer.

We use the pretrained version available on Hugging Face.

The Hugging Face library provides the `AutoModel` class for loading models.
> The AutoModel class is a convenient way to load an architecture without needing to know the exact model class name because there are many models available. It automatically selects the correct model class based on the configuration file. You only need to know the task and checkpoint you want to use. *Source:* [[1]](#r1).

> ⚠️ However, the `AutoModel` class doesn't include the specific heads for the tasks that `Helsinki-NLP/opus-mt-en-it` was trained on (translation) [[2]](#r2).

<p style="background-color:#fff1d7; padding:15px;">
  📖 The <code>AutoModel</code> class is a flexible base class for loading pretrained models.
  The model <code>Helsinki-NLP/opus-mt-en-it</code> is trained for a <strong>sequence-to-sequence task</strong> (translation),
  so using <code>AutoModel</code> would load only the core architecture,
  <strong>without the generation head</strong>, and it wouldn't support <code>.generate()</code>.
  Instead, you should use the appropriate <strong>task-specific auto class</strong>.<br><br>
  👉 For our case study, we will use the <code>AutoModelForSeq2SeqLM</code> class.
</p>


This class is part of a broader family of task-specific `AutoModel` classes:

- `AutoModelForCausalLM` – for causal language modeling (e.g., GPT-2). They are decoder-only models.
- `AutoModelForMaskedLM` – for masked language modeling (e.g., BERT). They are encoder-only models.
- `AutoModelForTokenClassification` – for classification tasks (e.g., NER), where we classify each token in the input text.
- `AutoModelForSeq2SeqLM` – for sequence-to-sequence tasks (e.g., Marian, BART, mBART)
- `...`

You can read the entire list in the [documentation](https://huggingface.co/docs/transformers/en/model_doc/auto).

In [None]:
# import libraries
from transformers import AutoModelForSeq2SeqLM

# load the model
model_name = 'Helsinki-NLP/opus-mt-en-it'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# print the type of the model
print(type(model))

<class 'transformers.models.marian.modeling_marian.MarianMTModel'>


<p style="background-color:#fff1d7; padding:15px;">
 🔍 By examining the model type, we can see that <code>'Helsinki-NLP/opus-mt-en-it'</code> is based on the <strong>MarianMTModel</strong> architecture. While <code>AutoModelForSeq2SeqLM</code> is a <b> generic class </b> that automatically selects the appropriate model architecture, we can use the more specific class <code>MarianMTModel</code>.
</p>


In [None]:
from transformers import MarianMTModel
model_name = 'Helsinki-NLP/opus-mt-en-it'
model = MarianMTModel.from_pretrained(model_name)

## 👀 Inspect the Model's Configuration

Let’s explore the model’s **configuration** to better understand its structure and hyperparameters.  
You can access it directly via the `config` attribute:
```python
model.config
```
If you prefer, you can also view the full configuration file on the official Hugging Face model page:
https://huggingface.co/Helsinki-NLP/opus-mt-en-it/blob/main/config.json

In [None]:
model.config

MarianConfig {
  "_attn_implementation_autoset": true,
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      80034
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 80034,
  "decoder_vocab_size": 80035,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 512,
  "ma

**General Structure**

```python
"model_type": "marian"
"architectures": ["MarianMTModel"]
"is_encoder_decoder": true
```
**Model Structure**

```python
"encoder_layers": 6,
"decoder_layers": 6,
"encoder_attention_heads": 8,
"decoder_attention_heads": 8,
"d_model": 512,
"encoder_ffn_dim": 2048,
"decoder_ffn_dim": 2048
```
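
As a quick sanity check on these numbers, the sketch below derives two quantities that the configuration implies but does not state directly: the dimension handled by each attention head and the expansion factor of the feed-forward block. The variable names are illustrative; the values are copied from the configuration above.

```python
# Values copied from the configuration shown above.
d_model = 512
encoder_attention_heads = 8
encoder_ffn_dim = 2048

# Each attention head works on an equal slice of the model dimension.
head_dim = d_model // encoder_attention_heads

# The feed-forward block widens the representation, then projects it back.
ffn_expansion = encoder_ffn_dim // d_model

print(f"Dimension per attention head: {head_dim}")
print(f"Feed-forward expansion factor: {ffn_expansion}")
```

The decoder uses the same values, so the same arithmetic applies to it.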


## 🔍 Exploring the Model's Architecture

You can print the model to take a look at its architecture.

In [None]:
model

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(80035, 512, padding_idx=80034)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(80035, 512, padding_idx=80034)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLU()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05

From the architecture, you can see that the model has:
- a `shared` Embedding layer;
- an `encoder`;
- a `decoder`;
- a final `lm_head`.

### Embedding Layer

If you want to look at the embedding layer in the model, you can access it as follows:

In [None]:
model.model.shared

Embedding(80035, 512, padding_idx=80034)

| Parameter           | Meaning                                                                                          |
|---------------------|--------------------------------------------------------------------------------------------------|
| `80035`             | The **vocabulary size**: The number of unique token IDs the model knows. It can embed 80,035 different tokens. |
| `512`               | The **embedding dimension**: Each token is mapped to a vector of 512 floating-point values.      |
| `padding_idx=80034` | The token ID used for **padding** (`<pad>`). Its embedding won't be updated during training and is treated specially. |


The model uses a shared embedding layer for both encoder and decoder.

To check that the `shared` embedding layer is really used by both the encoder and the decoder, we can compare the ID of `model.model.shared` with the IDs of the embedding layers used in the encoder (`model.model.encoder.embed_tokens`) and the decoder (`model.model.decoder.embed_tokens`). If the IDs are the same, they all point to the same embedding layer, which confirms that it is shared.


In [None]:
id(model.model.shared) == id(model.model.encoder.embed_tokens) and id(model.model.shared) == id(model.model.decoder.embed_tokens)

True
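
The identity check above can be illustrated without loading the model. The sketch below uses a hypothetical `ToyEmbedding` class (not part of Transformers) to show what "same ID" means in Python: several names refer to one object, so a change made through one name is visible through the others.

```python
class ToyEmbedding:
    """Hypothetical stand-in for an embedding table (a list of rows)."""

    def __init__(self, vocab_size, dim):
        self.weight = [[0.0] * dim for _ in range(vocab_size)]


shared = ToyEmbedding(vocab_size=4, dim=2)

# Both "sub-modules" hold a reference to the same object, not a copy.
encoder_embed_tokens = shared
decoder_embed_tokens = shared

# `is` compares object identity, which is what comparing id() values does.
same_object = encoder_embed_tokens is shared and decoder_embed_tokens is shared

# A change made through one name is visible through the other.
encoder_embed_tokens.weight[1][0] = 0.5
visible_in_decoder = decoder_embed_tokens.weight[1][0]

print(same_object, visible_in_decoder)
```

In the notebook, the same check can also be written with `is`, for example `model.model.shared is model.model.encoder.embed_tokens`.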

### Encoder Layers

The MarianEncoder includes six encoder layers.

In [None]:
model.model.encoder

MarianEncoder(
  (embed_tokens): Embedding(80035, 512, padding_idx=80034)
  (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
  (layers): ModuleList(
    (0-5): 6 x MarianEncoderLayer(
      (self_attn): MarianAttention(
        (k_proj): Linear(in_features=512, out_features=512, bias=True)
        (v_proj): Linear(in_features=512, out_features=512, bias=True)
        (q_proj): Linear(in_features=512, out_features=512, bias=True)
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (activation_fn): SiLU()
      (fc1): Linear(in_features=512, out_features=2048, bias=True)
      (fc2): Linear(in_features=2048, out_features=512, bias=True)
      (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
  )
)

The code on the left matches the diagram on the right.

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/marian-encoder.png" alt="Marian Encoder" width="1000">
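
To get a feel for the size of one encoder layer, the following sketch counts the learnable parameters listed in the printout above. It is a back-of-the-envelope calculation under the assumption that the printout lists every learnable module of the layer; `linear_params` is a helper introduced here for illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


d_model, ffn_dim = 512, 2048

# Four projections in self-attention: k_proj, v_proj, q_proj, out_proj.
attention = 4 * linear_params(d_model, d_model)

# fc1 expands to ffn_dim, fc2 projects back to d_model.
feed_forward = linear_params(d_model, ffn_dim) + linear_params(ffn_dim, d_model)

# Two LayerNorms, each with a learnable scale and shift per feature.
layer_norms = 2 * (2 * d_model)

per_layer = attention + feed_forward + layer_norms
print(f"Parameters per encoder layer: {per_layer:,}")
print(f"Parameters in the six encoder layers: {6 * per_layer:,}")
```

In this count, the feed-forward block accounts for roughly two thirds of the parameters of a layer.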


### Decoder

The MarianDecoder includes six decoder layers.

In [None]:
model.model.decoder

MarianDecoder(
  (embed_tokens): Embedding(80035, 512, padding_idx=80034)
  (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
  (layers): ModuleList(
    (0-5): 6 x MarianDecoderLayer(
      (self_attn): MarianAttention(
        (k_proj): Linear(in_features=512, out_features=512, bias=True)
        (v_proj): Linear(in_features=512, out_features=512, bias=True)
        (q_proj): Linear(in_features=512, out_features=512, bias=True)
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (activation_fn): SiLU()
      (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (encoder_attn): MarianAttention(
        (k_proj): Linear(in_features=512, out_features=512, bias=True)
        (v_proj): Linear(in_features=512, out_features=512, bias=True)
        (q_proj): Linear(in_features=512, out_features=512, bias=True)
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (encoder_attn_lay

The code on the left matches the diagram on the right.

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/marian-decoder.png" alt="Marian Decoder" width="1000">

### Language Modeling Head

The decoder’s output vector is passed to the final `lm_head`:

In [None]:
model.lm_head

Linear(in_features=512, out_features=80035, bias=False)

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/lm-head.png" alt="LM Head" width="400">

- The decoder’s output vector is passed to this final `lm_head`.
- It transforms the model’s internal output (a vector of size 512) into a vector of 80,035 scores (logits).
- That number (80,035) is the vocabulary size: every token (word or subword) the model can generate.
- A softmax function then turns these scores into probabilities.
- The model selects the most likely token (e.g., “Gatto”) based on these probabilities.
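
The steps in this list can be sketched on a toy scale. The example below is an illustration only: the hidden size (3), the four-token vocabulary, and all weights are made-up values, not taken from the model.

```python
import math

# Toy sizes for readability: hidden size 3, vocabulary of 4 tokens.
toy_vocab = ["<pad>", "Gatto", "Cane", "Topo"]
hidden = [0.2, -1.0, 0.5]  # decoder output for one position

# One weight row per vocabulary entry, no bias (as in the printed lm_head).
lm_head_weight = [
    [0.1, 0.0, -0.3],
    [1.2, -0.5, 0.8],
    [0.3, 0.4, 0.1],
    [-0.6, 0.2, 0.9],
]

# Linear projection: one raw score (logit) per vocabulary entry.
logits = [sum(w * h for w, h in zip(row, hidden)) for row in lm_head_weight]

# Softmax turns the scores into probabilities that sum to 1.
exps = [math.exp(x - max(logits)) for x in logits]
probs = [e / sum(exps) for e in exps]

# Pick the entry with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(toy_vocab[best], round(probs[best], 3))
```

In the real model the same projection maps 512 values to 80,035 logits.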


## 🔠  Machine Translation English-to-Italian

This section shows how you can translate a sentence using the APIs provided by Hugging Face.

If you want to translate an English sentence into Italian using the <code>Helsinki-NLP/opus-mt-en-it</code>  model, you can follow these six steps:

In [None]:
from transformers import MarianTokenizer, MarianMTModel

# 1. Specify the name of the pre-trained English-to-Italian translation model
model_name = 'Helsinki-NLP/opus-mt-en-it'

# 2. Load the tokenizer and model associated with the specified translation model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)


# 3. Define the input sentence(s) to be translated
src_text = ["I love natural language processing."]

# 4. Tokenize the input text and return it as input tensors for the model
encoded = tokenizer(src_text, return_tensors="pt", padding=True)

# 5. Generate the translated output tokens using the model
# This gives us token IDs (numbers), not words yet
translated = model.generate(**encoded)


# 6. Decode the generated tokens into human-readable text, skipping any special tokens
translated_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

print("Traduzione:", translated_text[0])


Traduzione: Adoro l'elaborazione del linguaggio naturale.


👉 This code showed how a sentence can be translated using the API provided by Hugging Face.

Now, let’s take a closer look at how tokens are generated by the `generate()` method.

## 🧩 Deep dive into Token Generation `model.generate()`

In this section, we simulate how the model generates one token at a time through the `generate()` method.

The input for the encoder is the tokenized input text.

The decoder input will be the sequence generated up to that point.

For the first iteration, nothing has been generated yet, so a special token must mark the beginning of the sequence. For `Helsinki-NLP/opus-mt-en-it`, this decoder start token is the padding token `<pad>` (ID 80034, the `decoder_start_token_id` in the configuration).

In [None]:
import torch

### --- We define the encoder input --- ###

# Define the English sentence we want to translate.
input_sentence = "I love natural language processing."

# We use the tokenizer to convert the sentence into tokens (numbers).
# 'return_tensors="pt"' means we want the output as PyTorch tensors.
tokens = tokenizer(input_sentence, return_tensors="pt")

### --- We manually create the first decoder input. --- ###

# It starts with the padding token, which this model uses as its decoder start token
# (see decoder_start_token_id in the configuration).
# The model will use this to begin generating the first token.
decoder_input_ids = torch.tensor([[ tokenizer.pad_token_id ]])

In [None]:
print("Encoder's input")
print(tokens["input_ids"])

print()
print("Decoder's input")
print(decoder_input_ids)

Encoder's input
tensor([[  22,  722, 1552, 2413, 3795,    2,    0]])

Decoder's input
tensor([[80034]])


To generate the first output token (after `<pad>`), we call the `forward()` method with the encoder input and the decoder input.

<p style="background-color:#fff1d7; padding:15px;">
  <strong>📖</strong> The <code>forward</code> method is the core function that takes the input sentence and decoder input, runs them through the model, and returns the raw predictions for the next token.
</p>

In [None]:
output = model.forward(
    input_ids=tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    decoder_input_ids=decoder_input_ids
)
print(output.keys())

odict_keys(['logits', 'past_key_values', 'encoder_last_hidden_state'])


The `forward` method returns a dictionary-like output object that contains:

- `logits`: The raw (non-normalized) scores produced by the model, before any softmax is applied. They are the output of the linear layer (`lm_head`) that maps the decoder's output to the vocabulary size, and they are used to compute the probability of each token.

- `past_key_values`: The cached attention keys and values for the tokens processed so far, used to speed up the following generation steps. Since the decoder is autoregressive (it generates one token at a time and only looks at previous tokens), the keys and values computed for earlier tokens do not change when a new token is added. Instead of recalculating them, we can save (cache) them as `past_key_values` and reuse them the next time we call the model.

- `encoder_last_hidden_state`: The hidden states of the last encoder layer. The decoder reads them through cross-attention; they are returned so that you can inspect them (e.g., for debugging) or reuse them without running the encoder again.
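
The saving that caching brings can be illustrated with a toy counter. This is a conceptual sketch, not the library's implementation: `project_kv` is a hypothetical stand-in for the key/value projection of a single token, and the example simply counts how often it is called with and without a cache.

```python
def project_kv(token_id):
    """Hypothetical key/value projection for one token; counts its calls."""
    project_kv.calls += 1
    return (token_id * 2, token_id * 3)  # stand-in for (key, value)


def generate_without_cache(tokens):
    # Recompute keys/values for the whole prefix at every step.
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]
    return kv


def generate_with_cache(tokens):
    # Compute keys/values only for the newest token and append to the cache.
    cache = []
    for token in tokens:
        cache.append(project_kv(token))
    return cache


tokens = [80034, 1235, 3351]

project_kv.calls = 0
kv_full = generate_without_cache(tokens)
calls_without = project_kv.calls

project_kv.calls = 0
kv_cached = generate_with_cache(tokens)
calls_with = project_kv.calls

print(calls_without, calls_with, kv_full == kv_cached)
```

Both variants end with the same keys and values, but without a cache the number of projections grows quadratically with the sequence length, while with a cache it grows linearly.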

In [None]:
print(output.logits.shape)

torch.Size([1, 1, 80035])


Shape: (batch_size, sequence_length, vocab_size)

- `batch_size` = 1 (one sentence),

- `sequence_length` = 1 (we're predicting the first token only),

- `vocab_size` = number of possible tokens the model knows.

Now that we have the logits, how do we find the most likely next word?
To do this, we need to turn the logits into probabilities for each token. One way to select the next word is by using the **greedy decoding technique**.

<p style="background-color:#fff1d7; padding:15px;">
  📖 <b>Greedy decoding</b> is a simple text generation method where, at each step, the model chooses the token with the
  highest probability — the one it thinks is most likely — without considering any other possibilities.
</p>
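
One practical detail: softmax preserves the ordering of the scores, so the token with the highest logit is also the token with the highest probability. The sketch below checks this on made-up logits (the values are illustrative, not model output), which is why greedy decoding can take the `argmax` of the logits directly.

```python
import math

logits = [-1.26, 8.18, -0.37, 2.05]  # toy scores, one per token ID


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    shifted = [math.exp(s - top) for s in scores]
    total = sum(shifted)
    return [x / total for x in shifted]


probs = softmax(logits)

# Index of the largest value, computed on the logits and on the probabilities.
greedy_from_logits = max(range(len(logits)), key=logits.__getitem__)
greedy_from_probs = max(range(len(probs)), key=probs.__getitem__)

print(greedy_from_logits, greedy_from_probs)
```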

In [None]:
# The first 0 refers to the first sentence in the batch.
# The second 0 refers to the first token in the output sequence.
# The output returns a vector of size vocab_size: one score for every possible token in the vocabulary
output.logits[0,0]

tensor([-1.2617, -4.9812, -0.3660,  ..., -4.9893, -5.0309,  0.0000],
       grad_fn=<SelectBackward0>)

In [None]:

# Find the index of the highest logit: the highest-scoring token for the first sentence (greedy decoding at step 1)
max_proba_token = output.logits[0,0].argmax()

# Extract the logit value corresponding to the selected token
logit_value = output.logits[0, 0, max_proba_token]
print(f"Max probability token ID: {max_proba_token.item()} | Logit value: {logit_value.item()}")
print("Corresponding token (Word):", tokenizer.decode(max_proba_token))

Max probability token ID: 1235 | Logit value: 8.184107780456543
Corresponding token (Word): Ad


We add the predicted token to our `decoder_input_ids` so that we can generate the next token in the sequence.

In [None]:
# .view() acts like .reshape()
decoder_input_ids = torch.hstack([decoder_input_ids, max_proba_token.view(1, 1)])
print(decoder_input_ids)

tensor([[80034,  1235]])


We call the `forward` method to predict the next token.

In [None]:
output = model.forward(**tokens, decoder_input_ids=decoder_input_ids)

print("Shape: (batch_size, sequence_length, vocab_size)")
print(output.logits.shape)

Shape: (batch_size, sequence_length, vocab_size)
torch.Size([1, 2, 80035])


Now the `sequence_length` is 2, because the decoder input contains two tokens (`<pad>` and `Ad`) and the model returns one next-token prediction for each position.

We can confirm this by decoding both token predictions and seeing the output. Let’s extract the most probable token at each position, map the token IDs back to their string representations, and decode the full sequence.

In [None]:

# Get the full vocabulary from the tokenizer: a dictionary mapping token strings to their IDs
vocabulary = tokenizer.get_vocab()
# Create a reverse mapping from token IDs to token strings (for easier lookup and display)
reverse_vocab = { v: k for k, v in vocabulary.items() }

# Index 0 selects the first (and only) sentence in the batch; argmax runs over the vocabulary axis
max_proba_tokens = output.logits[0].argmax(axis=1)
print("Token ids: ", max_proba_tokens)
print("Mapped tokens: ", list(map(reverse_vocab.get, max_proba_tokens.tolist())))
print("Decoded string (word): ", tokenizer.decode(max_proba_tokens))

Token ids:  tensor([1235, 3351])
Mapped tokens:  ['▁Ad', 'oro']
Decoded string (word):  Adoro


Instead of manually generating one token at a time, we can let the model generate the whole output in one go with `generate()`.
Here we pass `do_sample=True`, which turns on sampling: tokens are drawn at random according to the predicted probabilities instead of always taking the most likely one, so the output can vary between runs.

In [None]:
# sampling (do_sample=True), not greedy decoding: the output may vary between runs
tokenizer.batch_decode(model.generate(**tokens, do_sample=True))

["<pad> Adoro l'elaborazione del linguaggio naturale.</s>"]
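
To make the difference between greedy selection and sampling concrete, here is a small sketch on a made-up next-token distribution. The vocabulary and probabilities are illustrative, and `random.choices` stands in for the sampling step; it is not how `generate()` is implemented internally.

```python
import random

toy_vocab = ["Adoro", "Amo", "Mi"]
probs = [0.7, 0.2, 0.1]  # toy next-token distribution

# Greedy decoding: always take the most probable token.
greedy = toy_vocab[max(range(len(probs)), key=probs.__getitem__)]

# Sampling: draw a token at random, weighted by its probability.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.choices(toy_vocab, weights=probs, k=1)[0] for _ in range(1000)]

# How often sampling agrees with the greedy choice.
share_of_top = samples.count(greedy) / len(samples)
print(greedy, round(share_of_top, 2))
```

Sampling picks the most likely token most of the time, but not always, which is where the variation between runs comes from.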

Finally, let's manually simulate what `generate()` does behind the scenes, using greedy decoding. We generate the sequence one token at a time by repeatedly feeding the model’s output back into the decoder: at each step we pick the most likely next token and append it, until we reach the end-of-sequence token or a maximum length.

In [None]:

decoder_input_ids = torch.tensor([[ tokenizer.pad_token_id ]])

max_length = 15
i = 0

while i < max_length and decoder_input_ids[0,-1] != tokenizer.eos_token_id:
    output = model(**tokens, decoder_input_ids=decoder_input_ids)
    max_proba_tokens = output.logits[0].argmax(axis=1)
    print(f"Step {i+1}: {tokenizer.decode(max_proba_tokens)}")
    decoder_input_ids = torch.hstack([decoder_input_ids, max_proba_tokens[-1].view(1, 1)])
    i += 1

Step 1: Ad
Step 2: Adoro
Step 3: Adoro l
Step 4: Adoro l'
Step 5: Adoro l'elaborazione
Step 6: Adoro l'elaborazione del
Step 7: Adoro l'elaborazione del linguaggio
Step 8: Adoro l'elaborazione del linguaggio naturale
Step 9: Adoro l'elaborazione del linguaggio naturale.
Step 10: Adoro l'elaborazione del linguaggio naturale.</s>


In [None]:
out_tokens = model.generate(**tokens, max_length=max_length)
tokenizer.batch_decode(out_tokens)

["<pad> Adoro l'elaborazione del linguaggio naturale.</s>"]
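
The control flow of the loop above (pick the best token, append it, stop at end-of-sequence or at a length limit) can be isolated in a model-free sketch. `TOY_SCORES` is a hypothetical lookup table that plays the role of the model; unlike a real decoder, it only looks at the last token.

```python
# Hypothetical "model": maps the last generated token to scores for the next one.
TOY_SCORES = {
    "<pad>": {"Adoro": 2.0, "Amo": 1.5, "</s>": 0.1},
    "Adoro": {"il": 0.3, "l'": 1.9, "</s>": 0.2},
    "l'": {"elaborazione": 2.2, "</s>": 0.4},
    "elaborazione": {"</s>": 1.0, "del": 0.9},
}


def greedy_decode(start="<pad>", eos="</s>", max_length=15):
    sequence = [start]
    # Stop at the end-of-sequence token or once max_length tokens are generated.
    while len(sequence) - 1 < max_length and sequence[-1] != eos:
        scores = TOY_SCORES[sequence[-1]]
        next_token = max(scores, key=scores.get)  # greedy choice
        sequence.append(next_token)
    return sequence


full = greedy_decode()
truncated = greedy_decode(max_length=2)
print(full)
print(truncated)
```

With a small `max_length` the loop stops before the end-of-sequence token is produced, which is the second stopping condition of the notebook's loop.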

## 🛠️ Customizing Model Architecture with Configurations

The Hugging Face Transformers library provides powerful ways to customize model behavior through its configuration classes.

You can modify the model's configuration class to change how a model is built.  
The configuration defines the architecture of the model, including attributes like the number of hidden layers, attention heads, hidden size, dropout probabilities, and more.

There are two main ways to use configurations:

1. **Start from scratch** by creating a custom configuration.  
   In this case, the model is initialized with random weights and must be trained before you can use it.

2. **Modify a pre-trained model's configuration**, which allows you to tweak how the model behaves (e.g., enable attention outputs or change generation settings) without losing the pretrained weights.


#### 1. Starting From Scratch with a Custom Configuration

If you want full control over the model's architecture, for example to experiment with different layer sizes or numbers of heads, you can create a configuration from scratch.

This approach initializes the model with **random weights**, meaning you must **train it from scratch** before it's usable.


In [None]:
from transformers import MarianConfig, MarianMTModel

# Define a new configuration (settings not listed here fall back to the MarianConfig defaults)
custom_config = MarianConfig(
    vocab_size=59590,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8
)

# Create the model with random weights
# It has no pretrained weights, so it cannot translate yet!
model_from_scratch = MarianMTModel(config=custom_config)

print("Model initialized from scratch with custom config:")
print(model_from_scratch.config)


Model initialized from scratch with custom config:
MarianConfig {
  "_attn_implementation_autoset": true,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "attention_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 58100,
  "decoder_vocab_size": 59590,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_position_embeddings": 1024,
  "model_type": "marian",
  "num_hidden_layers": 6,
  "pad_token_id": 58100,
  "scale_embedding": false,
  "share_encoder_decoder_embeddings": true,
  "transformers_version": "4.51.1",
  "use_cache": true,
  "vocab_size": 59590
}
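
Comparing the two printouts in this notebook suggests that settings not passed to `MarianConfig` fall back to library defaults, which need not match the pretrained checkpoint. The sketch below makes that comparison explicit. It only uses values copied from the printouts above; the dictionaries are hand-written for illustration and are not produced by the library.

```python
# Values copied from the two configuration printouts in this notebook.
pretrained = {
    "activation_function": "swish",
    "encoder_ffn_dim": 2048,
    "decoder_ffn_dim": 2048,
    "decoder_start_token_id": 80034,
    "decoder_vocab_size": 80035,
    "d_model": 512,
    "encoder_layers": 6,
}
from_scratch = {
    "activation_function": "gelu",
    "encoder_ffn_dim": 4096,
    "decoder_ffn_dim": 4096,
    "decoder_start_token_id": 58100,
    "decoder_vocab_size": 59590,
    "d_model": 512,
    "encoder_layers": 6,
}

# Keep only the settings whose values differ between the two configurations.
differences = {
    key: (pretrained[key], from_scratch[key])
    for key in pretrained
    if pretrained[key] != from_scratch[key]
}
for key, (old, new) in differences.items():
    print(f"{key}: pretrained={old}, from_scratch={new}")
```

If the goal is to reproduce the pretrained architecture with random weights, these settings would also need to be passed explicitly.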



#### 2. Modifying a Pre-trained Model's Configuration

In this example, we modify the model's configuration to:

- Get the **attention weights**, which show what parts of the input the model focuses on at each step.
- Get the **hidden states**, which are the internal representations of each token at every layer of the model.

This is useful when you want to **analyze or better understand** how the model works internally.

We use the pretrained MarianMT model (`Helsinki-NLP/opus-mt-en-it`) and translate a simple sentence from English to Italian.  
The model weights stay the same — we’re only changing the **output behavior**, not the training.

Then, we use:
- `model.generate()` to get the translated sentence.
- `model()` (the forward pass) to explore the attention maps and hidden states.

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# 1. Define the name of the pretrained model
model_name = "Helsinki-NLP/opus-mt-en-it"

# 2. Load the tokenizer and the pretrained model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# 3. Modify the configuration to enable extra outputs
# These changes do not affect the pretrained weights
model.config.output_attentions = True
model.config.output_hidden_states = True
model.config.return_dict = True  # Needed to access outputs as a dictionary

# 4. Define the input sentence to translate
sentence = "The cat eats a mouse"
print("Original sentence:", sentence)
inputs = tokenizer(sentence, return_tensors="pt")  # Convert text to tensors

# 5. Run a forward pass to get the full outputs
# (the source IDs are reused as decoder input here only to have something to inspect)
output = model(**inputs, decoder_input_ids=inputs["input_ids"])

# 6. Use generate() to get the translated sentence
translated_ids = model.generate(**inputs)
translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)

# 7. Print the final translated sentence
print("Translated sentence:", translated_text)

# 8. Optional: print extra info from the model outputs
print("Keys in output:", output.keys())  # Show what’s inside the output
print("Number of encoder hidden states:", len(output.encoder_hidden_states))  # Embedding output + one per encoder layer
print("Shape of attention from encoder layer 0:", output.encoder_attentions[0].shape)  # Attention matrix size

Original sentence: The cat eats a mouse
Translated sentence: Il gatto mangia un topo
Keys in output: odict_keys(['logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])
Number of encoder hidden states: 7
Shape of attention from encoder layer 0: torch.Size([1, 8, 7, 7])
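
The numbers in this output can be related back to the configuration. The sketch below is a reading aid under the usual layout of attention tensors, (batch, heads, query positions, key positions); the values are copied from the output and configuration above, and the variable names are illustrative.

```python
# Values copied from the configuration and the printed output above.
encoder_layers = 6
encoder_attention_heads = 8
num_hidden_states = 7
attention_shape = (1, 8, 7, 7)

# Attention weights are laid out as (batch, heads, query positions, key positions).
batch_size, num_heads, query_len, key_len = attention_shape

# Hidden states: the embedding output plus the output of every encoder layer.
expected_hidden_states = 1 + encoder_layers

print("Hidden states match:", num_hidden_states == expected_hidden_states)
print("Heads match config:", num_heads == encoder_attention_heads)
print("Source positions attending to each other:", query_len, "x", key_len)
```

In encoder self-attention every source position attends to every source position, which is why the last two dimensions are equal.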


# References

<a name="r1">[1]</a> https://huggingface.co/docs/transformers/en/models

<a name="r2">[2]</a> https://huggingface.co/docs/transformers/en/model_doc/marian

<a name="r3">[3]</a> https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb#scrollTo=4D89wY_Z8Cg7