# <h1><center><b>Getting to Know Transformer-Based Models</b></center></h1>




## **What is the Transformer architecture and what is it good for?**

* Encoder-decoder architecture
* Sublayers: Attention and a feed-forward sublayer
* The sublayers are residual blocks
* Designed for efficient sequence processing without convolutional or recurrent layers

<center><img src="https://drive.google.com/uc?export=view&id=1aysNBj04B4hvTw0ft8VZAAsmJB1poUK3"/><p>Source: Vaswani et al. (2017)</p></center>


### Before the Transformer layers: token embeddings

Absolute **positional encoding**:

$$
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{2i / d_{\text{model}}}} \right)
$$

$$
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{2i / d_{\text{model}}}} \right)
$$

$pos$: position, $i$: index of the dimension pair $(2i, 2i+1)$ of the embedding, $d_{\text{model}}$: embedding size
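The two formulas can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation referenced later in this notebook: the function name `positional_encoding` and the toy sizes are assumptions, and `d_model` is assumed to be even.

```python
import math
from typing import List


def positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a `max_len x d_model` table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("`d_model` is assumed to be even in this sketch")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model // 2):
            # Angle shared by the sine/cosine pair (2i, 2i + 1)
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=6)
print([round(v, 4) for v in pe[0]])  # position 0: every sine is 0, every cosine is 1
print([round(v, 4) for v in pe[1]])
```

Each row of the table would be added to the token embedding at the same position before the first Transformer layer.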

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=1TLhPgksmaKDZLK-fBjRBSCAbCDJ_imAs"/><p>Source: Kernes (2021)</p></center>

<br/>

* The maximum number of encodable positions has to be fixed in advance
* It can also represent sequence lengths that are missing from the training data (Vaswani et al. hypothesize that this may help the model extrapolate to longer sequences)
* Relative positions: for any nonnegative offset $k$ (not larger than the maximum sequence length), $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$.

The claim about the relative representation of positions means that for any frequency $\omega_i = 1 / 10000^{2i / d_{\text{model}}}$ and any offset $k$ there exists a matrix $M_{i,k}$, which does not depend on $pos$, such that

$$
M_{i,k} \left(
\begin{matrix}
\sin(\omega_i \, pos) \\
\cos(\omega_i \, pos)
\end{matrix}
\right) = \left(
\begin{matrix}
\sin(\omega_i (pos + k)) \\
\cos(\omega_i (pos + k))
\end{matrix}
\right)
$$

$M_{i,k}$ can also be computed explicitly:

$$
M_{i,k} = \left(
\begin{matrix}
\cos(\omega_i k) & \sin(\omega_i k) \\
-\sin(\omega_i k) & \cos(\omega_i k)
\end{matrix}
\right)
$$
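The identity can be checked numerically. The sketch below builds $M_{i,k}$ for one frequency and compares both sides of the equation; the helper names `shift_matrix` and `matvec` and the values of `d_model`, `i`, `pos` and `k` are arbitrary choices made for illustration.

```python
import math


def shift_matrix(omega: float, k: int):
    """Return the 2x2 matrix M_{i,k} for frequency `omega` and offset `k`."""
    c, s = math.cos(omega * k), math.sin(omega * k)
    return [[c, s], [-s, c]]


def matvec(matrix, vec):
    """Multiply a 2x2 matrix with a length-2 vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


d_model, i = 8, 1  # illustrative values
omega = 1.0 / (10000 ** (2 * i / d_model))
pos, k = 5, 3

encoded = [math.sin(omega * pos), math.cos(omega * pos)]
shifted = matvec(shift_matrix(omega, k), encoded)
expected = [math.sin(omega * (pos + k)), math.cos(omega * (pos + k))]

print(shifted)
print(expected)
```

The two printed vectors should agree up to floating point error, because the rows of $M_{i,k}$ are exactly the angle addition formulas for sine and cosine.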

<br/>

For more details: Kazemnejad (2019)

Implementation: Kernes (2021)

## The Attention mechanism

**Input**: the output of the embedding layer or of the previous Transformer layer, a tensor of shape `(batch_size, sequence_length, embedding_size)`

**Output**: a tensor of the same shape as the input, containing the updated embeddings




In [1]:
!pip install transformers datasets sentencepiece



In [2]:
from typing import List

import torch
from torch import nn
from torch.utils.data import DataLoader
from datasets import Dataset
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    pipeline,
    set_seed
)

The input is multiplied by three weight matrices: the _query_ ($W^Q$), the _key_ ($W^K$) and the _value_ ($W^V$) matrix. We denote the resulting products by $Q$, $K$ and $V$, respectively.

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=1jKfz_TCnLvAwGXLC0t5MSKlHFPXSVQUK"/><p>Source: Alammar (2018)</p></center>

<br/>

Let $d_k$ be the number of columns of the matrix $K$ (the hidden size of the keys), and let $Z$ be the output of the Attention. The Attention is then computed as:

$$
Z = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=15CxXpzRRzs_wSNhaPt3Dhtn22W0CmEjE"/><p>Source: Alammar (2018)</p></center>

<br/>

The number of rows of $Q$, $K$ and $V$ (the sequence length) may vary.

Attention thus tries to capture the relationships between tokens through the **dot product of their vector representations**.
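To make the formula concrete, here is a minimal sketch of scaled dot product attention for a single sequence, written with plain Python lists so that every step can be followed by hand. The `Q`, `K` and `V` rows are made-up toy values; in a real model they come from the learned projections.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax of a single row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        # Scaled dot product of one query with every key
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value rows
        output.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                       for j in range(len(v[0]))])
    return output


# Three tokens with made-up 2-dimensional Q, K and V rows
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
z = attention(q, k, v)
for row in z:
    print([round(x, 4) for x in row])
```

Each output row is a weighted average of the value rows, where the weights grow with the dot product between the query of that token and the keys of all tokens.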

Dividing by $\sqrt{d_k}$ matters especially when $d_k$ is large: the dot products can then become large in magnitude, which pushes the softmax into regions where its gradients are very small.
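Why the dot products grow with $d_k$ can be illustrated with a small simulation. The sketch assumes that the components of the query and key vectors are independent standard normal values, which is an idealization rather than a property of a trained model; under that assumption the variance of the dot product is about $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import math
import random

random.seed(0)


def dot_variance(d_k: int, num_samples: int = 2000) -> float:
    """Empirical variance of q . k for standard normal q and k of size d_k."""
    dots = []
    for _ in range(num_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / num_samples
    return sum((d - mean) ** 2 for d in dots) / num_samples


for d_k in (4, 64):
    var = dot_variance(d_k)
    # Dividing the dot products by sqrt(d_k) divides their variance by d_k
    print(f"d_k={d_k}: variance {var:.2f}, after scaling {var / d_k:.2f}")
```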





In [3]:
# define a vector
x = torch.tensor([5.67, 3.97, 7.79, 4.52], requires_grad=False)
# softmax without dividing by `sqrt(d_k)`
s1 = nn.functional.softmax(x, dim=-1)
# softmax after dividing by `sqrt(d_k)` (here `d_k = 4`, so we divide by 2)
s2 = nn.functional.softmax(x / 2.0, dim=-1)

print(s1, s2, sep='\n')

tensor([0.1017, 0.0186, 0.8475, 0.0322])
tensor([0.2051, 0.0876, 0.5919, 0.1154])


We obtain the output ($Z$) after multiplying by $V$. Denoting the input of the Attention by $X$, the rows of both $X$ and $Z$ are token representations. The representations in $Z$, however, already carry **contextual information**.

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=1-o200diL3-DtVyt0rCzr8k7lRvinHxwA"/><p>Source: Alammar (2018)</p></center>

<br/>

**Multi-Head Attention** performs the Attention computation in several Attention heads, each with a smaller hidden size. This allows the model to learn different features in the different heads.
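The "smaller hidden size" can be pictured as slicing each projected vector into equal parts, one per head. The sketch below shows only this split and the final concatenation for a single made-up vector; the helper names `split_heads` and `merge_heads` are hypothetical, and the per-head attention itself is omitted.

```python
from typing import List


def split_heads(vector: List[float], num_heads: int) -> List[List[float]]:
    """Split one token vector into `num_heads` equally sized slices."""
    if len(vector) % num_heads != 0:
        raise ValueError("The vector size should be a multiple of `num_heads`")
    head_size = len(vector) // num_heads
    return [vector[h * head_size:(h + 1) * head_size]
            for h in range(num_heads)]


def merge_heads(heads: List[List[float]]) -> List[float]:
    """Concatenate the per-head slices back into one vector."""
    return [value for head in heads for value in head]


query = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # made-up projected query
heads = split_heads(query, num_heads=2)
print(heads)
print(merge_heads(heads) == query)
```

With a hidden size of 8 and 2 heads, each head works on 4-dimensional slices, and the head outputs are concatenated back to the full size afterwards.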

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=1PlCXkTrhECnSmUxRHHA1Yx8AFySoFuaP"/><p>Source: Vaswani et al. (2017)</p></center>

<br/>

<center><img src="https://drive.google.com/uc?export=view&id=1V2FmIfzmeOuYEH4hLL0HDSb3FEzVWyTO"/><p>Source: Alammar (2018)</p></center>

In [4]:
class DotProductAttention(nn.Module):
  """A dot product self-attention implementation"""

  def __init__(self, d_x: int, d_k: int, d_v: int) -> None:
    """Layer initialization

    Args:
      d_x: Input embedding size
      d_k: Hidden size of the query and key matrices
      d_v: Hidden size of the value matrix
    """
    super().__init__()
    self.w_query = nn.Linear(d_x, d_k)
    self.w_key = nn.Linear(d_x, d_k)
    self.w_value = nn.Linear(d_x, d_v)
    self.softmax = nn.Softmax(dim=-1)

  def _calculate_attention(
      self,
      q_tensor: torch.Tensor,
      k_tensor: torch.Tensor, 
      v_tensor: torch.Tensor,
      mask: torch.Tensor
  )-> torch.Tensor:
    """Calculate the attention output if the Q, K, V tensors are given

    Args:
      q_tensor: The query tensor of shape `(batch_size, sequence_length,
        key_hidden_size)`
      k_tensor: The key tensor of shape `(batch_size, sequence_length,
        key_hidden_size)`
      v_tensor: The value tensor of shape `(batch_size, sequence_length,
        value_hidden_size)`
      mask: A binary tensor of shape `(batch_size, sequence_length)`, where
        zeros indicate padding tokens
    
    Returns:
      The attention output tensor of shape `(batch_size, sequence_length,
        value_hidden_size)`
    """
    if mask.dim() == 2:
      mask = mask[:, None, :]
      mask = (1.0 - mask) * -10000.0
    else:
      raise ValueError(f"Invalid mask shape: {mask.shape}")
    dots = torch.matmul(q_tensor, torch.transpose(k_tensor, -2, -1))
    # Scale by sqrt(d_k), as in the attention formula
    dots = dots / (k_tensor.size(-1) ** 0.5)
    dots = dots + mask
    attn_strengths = self.softmax(dots)
    outputs = torch.matmul(attn_strengths, v_tensor)
    return outputs

  def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Forward pass implementation
    
    Args:
      x: An input tensor of shape `(batch_size, sequence_length,
        embedding_size)` or `(batch_size, sequence_length, num_attn_heads,
        attn_head_size)`
      mask: A binary tensor of shape `(batch_size, sequence_length)`.
        Zeros indicate padding tokens
      
    Returns: 
      A tensor of shape `(batch_size, sequence_length,
        value_hidden_size)`
    """
    query = self.w_query(x)
    key = self.w_key(x)
    value = self.w_value(x)
    return self._calculate_attention(query, key, value, mask)


In [5]:
class MHDotProductAttention(DotProductAttention):
  """Multi-head dot product attention implementation"""

  def __init__(self, d_x: int, d_k: int, d_v: int, num_attn_heads: int) -> None:
    """Layer initialization

    Args:
      d_x: Input embedding size
      d_k: Hidden size of the query and key matrices
      d_v: Hidden size of the value matrix
      num_attn_heads: The number of attention heads
    """
    if d_k % num_attn_heads != 0 or d_v % num_attn_heads != 0:
      raise ValueError("The hidden sizes `d_k` and `d_v` should be multiples "
                       "of `num_attn_heads`")
    super().__init__(d_x, d_k, d_v)
    self._num_attn_heads = num_attn_heads
    self._per_head_d_k = d_k // num_attn_heads
    self._per_head_d_v = d_v // num_attn_heads
  
  def _calculate_attention(
      self,
      q_tensor: torch.Tensor,
      k_tensor: torch.Tensor, 
      v_tensor: torch.Tensor,
      mask: torch.Tensor
  )-> torch.Tensor:
    """Calculate multi-head self-attention with the query, key, value
    tensors given
    """
    if mask.dim() == 2:
      mask = mask[:, None, None, :]
      mask = (1.0 - mask) * -10000.0
    else:
      raise ValueError(f"Invalid mask shape: {mask.shape}")

    batch_seq_heads = k_tensor.shape[:-1] + (self._num_attn_heads,)
    q_tensor = torch.reshape(
        q_tensor, batch_seq_heads + (self._per_head_d_k,)).transpose(1, 2)
    k_tensor = torch.reshape(
        k_tensor, batch_seq_heads + (self._per_head_d_k,)).transpose(1, 2)
    v_tensor = torch.reshape(
        v_tensor, batch_seq_heads + (self._per_head_d_v,)).transpose(1, 2)
    
    dots = torch.matmul(q_tensor, torch.transpose(k_tensor, -2, -1))
    # Scale by the square root of the per-head key size
    dots = dots / (self._per_head_d_k ** 0.5)
    dots = dots + mask
    attn_strengths = self.softmax(dots)
    head_outputs = torch.matmul(attn_strengths, v_tensor).transpose(1, 2)
    return torch.reshape(head_outputs, batch_seq_heads[:2] + (-1,))


In [6]:
x = torch.randn(2, 6, 8, dtype=torch.float32)
print(f"The original tensor:\n{x}", end="\n\n")
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1]*6], dtype=torch.float32)
attn_model = MHDotProductAttention(d_x=8, d_k=8, d_v=8, num_attn_heads=2)
print(f"The modified tensor after Attention:\n{attn_model(x, mask)}")

The original tensor:
tensor([[[ 5.0283e-02, -2.8718e-01,  5.9291e-01, -2.1073e+00,  8.8631e-01,
           2.5183e+00,  1.1344e+00,  3.2931e-01],
         [-2.8812e-01,  5.9181e-01,  2.8666e+00,  1.0814e+00, -4.6995e-01,
           1.6620e+00,  8.6864e-01, -1.0366e+00],
         [ 2.8497e-03,  5.1901e-01, -9.9473e-02, -3.5825e-01, -5.2426e-01,
          -8.6292e-03, -8.0892e-01,  1.2498e-01],
         [ 7.6107e-01,  1.4409e-01, -5.9433e-01,  1.5065e-01,  3.2622e-02,
          -9.8909e-01, -5.5912e-02, -4.9544e-01],
         [-5.1751e-01, -1.3121e+00,  6.5042e-01,  5.0176e-01, -1.3434e+00,
          -8.7413e-01,  1.2172e+00,  2.8059e-01],
         [-1.1149e+00,  2.4540e-02, -8.8860e-01,  1.3537e+00,  9.7080e-02,
          -5.7155e-01,  1.2084e+00,  1.8728e+00]],

        [[ 4.4819e-01,  8.4787e-01,  5.8240e-01, -1.1143e+00, -1.0977e+00,
          -1.5650e+00,  4.0007e-01, -2.0731e+00],
         [ 1.1057e+00, -8.0421e-01,  2.3043e+00, -7.4986e-01, -5.1368e-01,
           7.0227e-01, -1.3

## **Extracting Attention weights from a pre-trained model**

In [None]:
seq2seq_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")

for name, weight in seq2seq_model.named_parameters():
  print(name)

del seq2seq_model

In [8]:
for name, _ in attn_model.named_parameters():
  print(name)

w_query.weight
w_query.bias
w_key.weight
w_key.bias
w_value.weight
w_value.bias


## **Using a trained model for machine translation**

The example below demonstrates the following:

* Preparing the input for machine translation
* Loading the fine-tuned mBART translation model
* Calling the model to generate the translation

We will use the Hugging Face `transformers` library for this.

In [9]:
def get_input_sents() -> Dataset:
  """Create a dataset with the input sentences"""
  sents = {
      "id": list(range(5)),
      "sent": [
           "Nearly ten years had passed since the Dursleys had woken up to "
             "find their nephew on the front step, but Privet Drive had hardly "
             "changed at all.",
           "The sun rose on the same tidy front gardens and lit up the brass "
             "number four on the Dursleys' front door; it crept into their "
             "living room, which was almost exactly the same as it had been on "
             "the night when Mr. Dursley had seen that fateful news report "
             "about the owls.",
           "Only the photographs on the mantelpiece really showed how much "
             "time had passed.",
           "Ten years ago, there had been lots of pictures of what looked like "
             "a large pink beach ball wearing different-colored bonnets - but "
             "Dudley Dursley was no longer a baby, and now the photographs "
             "showed a large blond boy riding his first bicycle, on a carousel "
             "at the fair, playing a computer game with his father, being "
             "hugged and kissed by his mother.",
           "The room held no sign at all that another boy lived in the house, "
             "too."     
      ]
  }
  # Text from Harry Potter and the Philosopher's Stone by J. K. Rowling
  dataset = Dataset.from_dict(sents)
  print(f"An example from the dataset:\n{dataset[0]['sent']}")
  return dataset


def tokenize_dataset(
    sent_dataset: Dataset,
    tokenizer: MBart50TokenizerFast,
    batch_size: int
) -> DataLoader:
  """Tokenize a dataset
  
  Args:
    sent_dataset: A dataset that contains a field named `sent`
    tokenizer: A pre-trained tokenizer model
    batch_size: The batch size to use
  
  Returns:
    The tokenized dataset as a PyTorch `DataLoader` 
  """
  sent_field = "sent"
  old_fields = sent_dataset.features.keys()
  if sent_field not in old_fields:
    raise KeyError(f"The dataset should contain a field named `{sent_field}`")
  if not isinstance(batch_size, int) or batch_size <= 0:
    raise ValueError("`batch_size` should be a positive integer")

  def tokenize_example(example):
     return tokenizer(example[sent_field], padding=True)
  
  sent_dataset = sent_dataset.map(
      tokenize_example,
      batched=True,
      batch_size=batch_size,
      remove_columns=list(old_fields)
  )
  sent_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
  dataloader = DataLoader(sent_dataset, batch_size=batch_size)
  return dataloader


def translate_sents(
    model: MBartForConditionalGeneration,
    tokenizer: MBart50TokenizerFast,
    dataloader: DataLoader,
    target_lang: str
) -> List[List[str]]:
  """Translate sentences
  
  Args:
    model: The MT model
    tokenizer: A pre-trained tokenizer model
    dataloader: A PyTorch `DataLoader` with the fields
      (`input_ids`, `attention_mask`)
    target_lang: Target language identifier string
  
  Returns:
    A list of batches, where each batch is a list of translated strings
  """
  outputs = []
  for input_dict in dataloader:
    gen_tokens = model.generate(
      **input_dict,
      forced_bos_token_id=tokenizer.lang_code_to_id[target_lang]
    )
    res = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
    outputs.append(res)
  return outputs

In [10]:
def do_translation(
    dataset: Dataset,
    model_name: str,
    batch_size: int,
    target_lang: str
) -> List[List[str]]:
  """Full data flow: translate sentences"""
  tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
  tokenizer.src_lang = "en_XX"
  dataloader = tokenize_dataset(dataset, tokenizer, batch_size)

  model = MBartForConditionalGeneration.from_pretrained(model_name)
  translations = translate_sents(model, tokenizer, dataloader, target_lang)
  return translations

In [11]:
dataset = get_input_sents()
res = do_translation(
    dataset=dataset,
    model_name="facebook/mbart-large-50-many-to-many-mmt",
    batch_size=2,
    target_lang="de_DE"
)

# Print the results
for src_sent, trg_sent in zip(
    (example["sent"] for example in dataset),
    (decoded_sent for batch in res for decoded_sent in batch)):
  print(f"{src_sent}\n{trg_sent}", end="\n\n")

An example from the dataset:
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.



Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.
Fast zehn Jahre waren vergangen, seit die Dursleys aufgewacht hatten, um ihren Neffen auf dem Vordersteg zu finden, aber Privet Drive hatte sich kaum verändert.

The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls.
Die Sonne stieg auf den gleichen ordentlichen Vorgarten und beleuchtete die Messing Nummer vier an der Dursleys Vordertür; sie kletterte in ihr Wohnzimmer, das war fast genau dasselbe, wie es war in der Nacht gewesen, als Herr Dursley den Schicksalsbericht über die Eulen gesehen hatte.

Only the photographs on the mantelpiece really showed how much time had passed.
Nur die Fotos auf demmantelstück zeigten wirklich, wie vie

In [12]:
del dataset
del res

## **The Hugging Face** `transformers` **library**

Let us take a look at the [documentation](https://huggingface.co/docs/transformers/model_doc/gpt2) of `GPT-2` as implemented in `PyTorch`.

In [13]:
# A `pipeline` example
# See also the original example and tutorial:
# https://huggingface.co/tasks/text-generation

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Studying language models is great, because",
          max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Studying language models is great, because you can actually understand it. In fact, this is one of the reasons I use it. All these different'},
 {'generated_text': "Studying language models is great, because there's a lot of flexibility and you can take it from there. I have read books on language, but"},
 {'generated_text': 'Studying language models is great, because it has to be a bit of extra work.\n\nIn a way, I prefer to work on a'},
 {'generated_text': "Studying language models is great, because you will see these approaches come in handy. It's also where you can make a conscious effort to learn how"},
 {'generated_text': "Studying language models is great, because it helps you find the right language! Even if you're not fluent in either one, you should keep that"}]

## Papers and further reading

Transformer:
* Vaswani et al. (2017): _Attention Is All You Need_ [Link](https://arxiv.org/abs/1706.03762)
* Jay Alammar (2018): _The Illustrated Transformer_ (blog) [Link](https://jalammar.github.io/illustrated-transformer/)

The Transformers library:
* Wolf et al. (2020): _Transformers: State-of-the-Art Natural Language Processing_ [Link](https://aclanthology.org/2020.emnlp-demos.6/)

The origins of Attention:
* Bahdanau et al. (2014): _Neural Machine Translation by Jointly Learning to Align and Translate_ [Link](https://arxiv.org/abs/1409.0473)
* Sutskever et al. (2014): _Sequence to Sequence Learning with Neural Networks_ [Link](https://arxiv.org/abs/1409.3215)
* Luong et al. (2015): _Effective Approaches to Attention-based Neural Machine Translation_ [Link](https://arxiv.org/abs/1508.04025)

Positional encoding:
* Shaw et al. (2018): _Self-Attention with Relative Position Representations_ [Link](https://arxiv.org/abs/1803.02155)
* Huang et al. (2018): _Music Transformer: Generating Music with Long-Term Structure_ [Link](https://arxiv.org/abs/1809.04281)
* Huang et al. (2020): _Improve Transformer Models with Better Relative Position Embeddings_ [Link](https://arxiv.org/abs/2009.13658)
* Wang et al. (2020): _What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding_ [Link](https://arxiv.org/abs/2010.04903)
* Amirhossein Kazemnejad (2019): _Transformer Architecture: The Positional Encoding_ (blog) [Link](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
* Jonathan Kernes (2021): _Master Positional Encoding: Part I_ (Towards Data Science) [Link](https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3)

Illustrating dot product similarity:
* Tivadar Danka (2021): _How the Dot Product Measures Similarity_ (Towards Data Science) [Link](https://towardsdatascience.com/how-the-dot-product-measures-similarity-b3e16e22beda)