<table>
<tr>    
<td style="text-align: center">
<h1>Transformer Networks and Attention Mechanisms</h1>
<h2><a href="http://home.agh.edu.pl/~horzyk/index.php">Adrian Horzyk</a></h2>
</td> 
<td>
<img src="http://home.agh.edu.pl/~horzyk/im/AdrianHorzyk49BT140h.png" alt="Adrian Horzyk, Professor" title="Adrian Horzyk, Professor" />        
</td> 
</tr>
</table>
<h3><i>Welcome to this interactive notebook, where you can learn how neural networks work, try them out and test them on selected datasets, and run your own experiments!</i></h3>

# Transformer Networks

A transformer is a highly efficient network built around the attention mechanism.
The attention mechanism used in transformers:
- uses an effectively unbounded reference window, so context can be drawn from the whole text rather than only from the short reference window that an RNN allows or the longer one that a GRU or LSTM allows;
- lets the transformer model attend to all previously generated tokens, so it does not suffer from short-term memory.

![Transformer](http://home.agh.edu.pl/~horzyk/lectures/figures/TransformerNN.png)

### A transformer implemented for sequential data processing, e.g. the machine translation problem.

Watch how transformers work, run thorough tests, and check how well this architecture handles the machine translation task.

The original transformer architecture is implemented here, with 6 layers in both the encoder and the decoder and 8 heads in each multi-head attention block.

The final test result of the presented transformer was a <a href="https://en.wikipedia.org/wiki/BLEU">BLEU (bilingual evaluation understudy)</a> score of 0.6, which is similar to the results of other state-of-the-art transformer models that can be found on the Internet.

DEF: BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another.

BLEU was one of the first metrics to show a high correlation with human judgments of quality, and it remains one of the most popular automated and inexpensive metrics.

Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" [<i>Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing (2001). "BLEU". Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02. Morristown, NJ, USA: Association for Computational Linguistics: 311. doi:10.3115/1073083.1073135</i>].

Scores are calculated for individual translated segments, generally sentences, by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to estimate the overall quality of the translation. Intelligibility and grammatical correctness are not taken into account.

The BLEU score is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since that would mean the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations generally increases the BLEU score.
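
The description above can be made concrete with a small, dependency-free sketch. It is a simplified sentence-level variant (clipped n-gram precision up to 4-grams, a brevity penalty, no smoothing) and not the exact scoring script used for the result reported in this notebook; the function names `ngrams` and `sentence_bleu` and the example sentences are illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Simplified BLEU for one tokenized candidate and a list of tokenized references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


candidate = "a woman walks down a sidewalk".split()
ref_a = "a woman in a yellow shirt walks down a sidewalk".split()
ref_b = "a woman walks down a sidewalk".split()
print(sentence_bleu(candidate, [ref_a]))         # below 1: only partial n-gram overlap
print(sentence_bleu(candidate, [ref_a, ref_b]))  # 1.0: identical to one reference
```

Adding the second reference raises the score here because every candidate n-gram now finds a match, which is the effect described in the last sentence above.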

In [1]:
!nvidia-smi

Wed Jan 24 14:07:27 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.94       Driver Version: 516.94       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   43C    P8    12W /  N/A |   1068MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [6]:
import os
import torch
import torch.nn as nn
import math
import copy
import time
import pandas as pd
import altair as alt
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
import torch.distributed as dist
from os.path import exists
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset
from torchtext.vocab import build_vocab_from_iterator
from torch.optim.lr_scheduler import LambdaLR
from torch.nn.functional import log_softmax, pad
from torch.nn.parallel import DistributedDataParallel as DDP

warnings.filterwarnings("ignore")

# Model Architecture

In [7]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

In [8]:
class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

## Encoder and Decoder Stacks

In [9]:
def clones(module, N):

    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [10]:
class Encoder(nn.Module):

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

In [11]:
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
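
As a quick numeric check of the normalization above, the following plain-Python sketch applies the same formula to one feature vector. It assumes scalar `gain` and `bias` in place of the learned `a_2` and `b_2` vectors, and the helper name `layer_norm` is illustrative. Note that `x.std` in PyTorch defaults to the sample standard deviation (dividing by n - 1), which `statistics.stdev` also computes.

```python
import statistics


def layer_norm(x, gain=1.0, bias=0.0, eps=1e-6):
    """Normalize one feature vector to zero mean and (almost) unit sample std."""
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # sample standard deviation, divides by n - 1
    return [gain * (v - mean) / (std + eps) + bias for v in x]


y = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 4) for v in y])
print(round(statistics.fmean(y), 6), round(statistics.stdev(y), 6))
```

The output vector has mean 0 and a standard deviation just below 1, because `eps` is added to the standard deviation rather than to the variance.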

In [12]:
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

In [13]:
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size
    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

### Decoder

In [14]:
class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In [15]:
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In [16]:
def subsequent_mask(size):
    "Mask out subsequent positions: entry (i, j) is True when j <= i."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return subsequent_mask == 0
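
To make the shape of this mask easier to picture, here is a plain-Python sketch that builds the same lower-triangular pattern without PyTorch. The helper name `causal_mask` is an illustrative assumption; it returns nested lists rather than a tensor with a leading batch dimension.

```python
def causal_mask(size):
    """Row i is True at columns 0..i: position i may attend to itself and earlier positions."""
    return [[col <= row for col in range(size)] for row in range(size)]


for row in causal_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
# prints:
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

Each row corresponds to one target position, so the first token sees only itself and the last token sees the whole prefix.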

In [17]:
def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention; returns (output, attention weights)."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a large negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
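
The same computation can be followed by hand on a tiny example. The sketch below is a plain-Python rendering for a single sequence and a single head, without dropout; the names `softmax` and `toy_attention` and the input vectors are illustrative assumptions, not part of the notebook's model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_attention(queries, keys, values, mask=None):
    """Scaled dot-product attention on lists of vectors (one sequence, one head)."""
    d_k = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            # As in attention above: masked positions get a very negative score.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        p = softmax(scores)
        weights.append(p)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(p[j] * v[d] for j, v in enumerate(values)) for d in range(len(values[0]))])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[c <= r for c in range(3)] for r in range(3)]  # causal mask
out, w = toy_attention(x, x, x, mask)
for row in w:
    print([round(p, 3) for p in row])
```

With the causal mask, the first row puts all of its weight on the first position, and every row of weights sums to 1.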

In [18]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))]
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        x = ( x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k))
        del query
        del key
        del value
        return self.linears[-1](x)

In [19]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

## Embeddings and Softmax

In [20]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model
    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

In [21]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)
    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
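
For readers who want to inspect the values stored in the `pe` buffer, this is a plain-Python sketch of the same sinusoidal table. The function name `positional_encoding` and the small sizes are illustrative assumptions; the frequencies follow the `div_term` expression above.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency as div_term above: 10000^(-i / d_model).
            angle = pos * math.exp(-i * math.log(10000.0) / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=4, d_model=6)
print([round(v, 3) for v in pe[0]])  # position 0: sines are 0, cosines are 1
print([round(v, 3) for v in pe[1]])
```

Even columns hold sines and odd columns hold cosines, with the wavelength growing along the feature dimension.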

## Full Model

In [22]:
def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),)
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Model Training

In [23]:
class Batch:
    def __init__(self, src, tgt=None, pad=2):  # 2 = <blank>
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()
    @staticmethod
    def make_std_mask(tgt, pad):
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data)
        return tgt_mask
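
The slicing in `Batch` implements teacher forcing: the decoder input is the target without its last token, and the expected output is the target without its first token. A minimal plain-Python sketch on one padded sentence, assuming made-up token ids and the hypothetical helper name `shift_for_teacher_forcing`:

```python
def shift_for_teacher_forcing(tgt, pad):
    """Split one padded target sequence into decoder input and expected output."""
    decoder_input = tgt[:-1]    # what the decoder is fed
    expected_output = tgt[1:]   # what it should predict at each position
    ntokens = sum(1 for t in expected_output if t != pad)  # non-padding targets
    return decoder_input, expected_output, ntokens


# 0 = <s>, 1 = </s>, 2 = <blank>; 11..13 stand for ordinary words
tgt = [0, 11, 12, 13, 1, 2, 2]
dec_in, expected, ntokens = shift_for_teacher_forcing(tgt, pad=2)
print(dec_in)    # [0, 11, 12, 13, 1, 2]
print(expected)  # [11, 12, 13, 1, 2, 2]
print(ntokens)   # 4
```

The count of non-padding targets is what `run_epoch` later uses to normalize the loss.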

In [24]:
class TrainState:
    step: int = 0  # Steps in the current epoch
    accum_step: int = 0  # Number of gradient accumulation steps
    samples: int = 0  # total # of examples used
    tokens: int = 0  # total # of tokens processed

In [25]:
def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode="train",
    accum_iter=1,
    train_state=TrainState(),
):
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.tgt, batch.src_mask, batch.tgt_mask)
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        if mode == "train" or mode == "train+log":
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()

        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            start = time.time()
            print(start, i)
            tokens = 0
        del loss
        del loss_node
    return total_loss / total_tokens, train_state

In [26]:
def rate(step, model_size, factor, warmup):
    if step == 0:
        step = 1
    return factor * (model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5)))
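
To see the shape of this schedule, the sketch below repeats the `rate` function so that it runs on its own and evaluates it at a few steps. It assumes `model_size=512`, `factor=1`, and `warmup=3000`, which are the values used in the training configuration of this notebook; the chosen steps are arbitrary.

```python
def rate(step, model_size, factor, warmup):
    """Same schedule as above, repeated so this cell runs on its own."""
    if step == 0:
        step = 1
    return factor * (model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5)))


warmup = 3000  # value used in the training config of this notebook
for step in (1, 1000, 3000, 10000, 30000):
    print(step, f"{rate(step, model_size=512, factor=1, warmup=warmup):.6f}")

# The rate grows linearly during warmup and then decays as 1 / sqrt(step).
peak_step = max(range(1, 20001), key=lambda s: rate(s, 512, 1, warmup))
print("peak at step", peak_step)
```

The learning rate peaks at the end of warmup, which is why a longer warmup gives a lower peak rate.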

## Model Regularization

In [27]:
class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None
    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())
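
The target distribution that `LabelSmoothing` builds for each token can be written out for a single position. The following plain-Python sketch assumes a tiny vocabulary of 5 classes and an exaggerated smoothing value for readability; the helper name `smoothed_target` is hypothetical.

```python
def smoothed_target(target, size, padding_idx, smoothing):
    """Target distribution over `size` classes for one token, as built in LabelSmoothing."""
    if target == padding_idx:
        return [0.0] * size  # padded positions contribute nothing to the loss
    # Spread the smoothing mass over all classes except the target and the padding class.
    dist = [smoothing / (size - 2)] * size
    dist[target] = 1.0 - smoothing
    dist[padding_idx] = 0.0
    return dist


dist = smoothed_target(target=2, size=5, padding_idx=0, smoothing=0.4)
print([round(p, 4) for p in dist])
print(smoothed_target(target=0, size=5, padding_idx=0, smoothing=0.4))
```

The correct class keeps most of the probability mass, the padding class gets none, and the remaining classes share the smoothing mass equally.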

# A First Example

## Synthetic Data

In [28]:
def data_gen(V, batch_size, nbatches):
    for i in range(nbatches):
        data = torch.randint(1, V, size=(batch_size, 10))
        data[:, 0] = 1
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        yield Batch(src, tgt, 0)

## Loss Computation

In [29]:
class SimpleLossCompute:
    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion
    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = self.criterion(x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1))/ norm
        return sloss.data * norm, sloss

## Greedy Decoding

In [30]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys
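
The control flow of greedy decoding can be tried without a trained model. In the sketch below, `next_token_scores` is a hypothetical stand-in for the decoder plus generator, and `copy_then_stop` is a toy scorer invented for this illustration. Unlike `greedy_decode` above, this version also stops early once the end symbol is produced.

```python
def greedy_decode_toy(next_token_scores, start_symbol, end_symbol, max_len):
    """Repeatedly append the highest-scoring next token until `end_symbol` or `max_len`."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(ys)  # hypothetical stand-in for decoder + generator
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
        if next_word == end_symbol:
            break  # unlike greedy_decode above, stop early at the end symbol
    return ys


def copy_then_stop(prefix, source=(0, 5, 7, 3, 1)):
    """Toy scorer over a 10-token vocabulary that prefers the next token of `source`."""
    scores = [0.0] * 10
    target = source[len(prefix)] if len(prefix) < len(source) else 1
    scores[target] = 1.0
    return scores


print(greedy_decode_toy(copy_then_stop, start_symbol=0, end_symbol=1, max_len=10))
# prints [0, 5, 7, 3, 1]
```

Greedy decoding commits to the single best token at each step, so it is fast but cannot recover from an early mistake the way a beam search can.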

# A Real-World Example

## Data Loading

In [31]:
def load_tokenizers():
    try:
        spacy_de = spacy.load("de_core_news_sm")
    except IOError:
        os.system("python -m spacy download de_core_news_sm")
        spacy_de = spacy.load("de_core_news_sm")
    try:
        spacy_en = spacy.load("en_core_web_sm")
    except IOError:
        os.system("python -m spacy download en_core_web_sm")
        spacy_en = spacy.load("en_core_web_sm")
    return spacy_de, spacy_en

In [32]:
def tokenize(text, tokenizer):
    return [tok.text for tok in tokenizer.tokenizer(text)]

def yield_tokens(data_iter, tokenizer, index):
    for from_to_tuple in data_iter:
        yield tokenizer(from_to_tuple[index])

In [33]:
def build_vocabulary(spacy_de, spacy_en):
    def tokenize_de(text):
        return tokenize(text, spacy_de)
    def tokenize_en(text):
        return tokenize(text, spacy_en)

    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],)
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_tgt = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_en, index=1),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],)
    vocab_src.set_default_index(vocab_src["<unk>"])
    vocab_tgt.set_default_index(vocab_tgt["<unk>"])
    return vocab_src, vocab_tgt

def load_vocab(spacy_de, spacy_en):
    if not exists("vocab.pt"):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), "vocab.pt")
    else:
        vocab_src, vocab_tgt = torch.load("vocab.pt")
    print("Finished.\nVocabulary sizes:")
    print(len(vocab_src))
    print(len(vocab_tgt))
    return vocab_src, vocab_tgt

spacy_de, spacy_en = load_tokenizers()
vocab_src, vocab_tgt = load_vocab(spacy_de, spacy_en)

Finished.
Vocabulary sizes:
8315
6384


## Iterators

In [34]:
def collate_batch(
    batch,
    src_pipeline,
    tgt_pipeline,
    src_vocab,
    tgt_vocab,
    device,
    max_padding=128,
    pad_id=2):
    
    bs_id = torch.tensor([0], device=device)
    eos_id = torch.tensor([1], device=device)
    src_list, tgt_list = [], []
    for (_src, _tgt) in batch:
        processed_src = torch.cat([bs_id, torch.tensor(
            src_vocab(src_pipeline(_src)),dtype=torch.int64,device=device,),eos_id,],0,)
        processed_tgt = torch.cat([bs_id, torch.tensor(
            tgt_vocab(tgt_pipeline(_tgt)),dtype=torch.int64,device=device,),eos_id,],0,)
        src_list.append(pad(processed_src,(0,max_padding - len(processed_src),),value=pad_id,))
        tgt_list.append(pad(processed_tgt,(0, max_padding - len(processed_tgt)),value=pad_id,))

    src = torch.stack(src_list)
    tgt = torch.stack(tgt_list)
    return (src, tgt)

In [35]:
def create_dataloaders(
    device,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    batch_size=8,
    max_padding=128,
    is_distributed=True,):
    
    def tokenize_de(text):
        return tokenize(text, spacy_de)
    def tokenize_en(text):
        return tokenize(text, spacy_en)
    def collate_fn(batch):
        return collate_batch(
            batch,
            tokenize_de,
            tokenize_en,
            vocab_src,
            vocab_tgt,
            device,
            max_padding=max_padding,
            pad_id=vocab_src.get_stoi()["<blank>"],
        )

    train_iter, valid_iter, test_iter = datasets.Multi30k(language_pair=("de", "en") )

    train_iter_map = to_map_style_dataset(train_iter)
    train_sampler = (DistributedSampler(train_iter_map) if is_distributed else None)
    valid_iter_map = to_map_style_dataset(valid_iter)
    valid_sampler = (DistributedSampler(valid_iter_map) if is_distributed else None)

    train_dataloader = DataLoader(
        train_iter_map,
        batch_size=batch_size,
        shuffle=(train_sampler is None),
        sampler=train_sampler,
        collate_fn=collate_fn,
    )
    valid_dataloader = DataLoader(
        valid_iter_map,
        batch_size=batch_size,
        shuffle=(valid_sampler is None),
        sampler=valid_sampler,
        collate_fn=collate_fn,
    )
    return train_dataloader, valid_dataloader

## Training the System

In [36]:
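class DummyOptimizer(torch.optim.Optimizer):
    # No-op optimizer so that run_epoch can be reused for evaluation.
    def __init__(self):
        self.param_groups = [{"lr": 0}]

    def step(self):
        pass

    def zero_grad(self, set_to_none=False):
        pass


class DummyScheduler:
    # No-op learning-rate scheduler used together with DummyOptimizer.
    def step(self):
        pass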

In [37]:
def train_worker(
    gpu,
    ngpus_per_node,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    config,
    is_distributed=False,
):
    print(f"Train worker process using GPU: {gpu} for training", flush=True)
    torch.cuda.set_device(gpu)

    pad_idx = vocab_tgt["<blank>"]
    d_model = 512
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.cuda(gpu)
    module = model
    is_main_process = True

    criterion = LabelSmoothing(size=len(vocab_tgt), padding_idx=pad_idx, smoothing=0.1)
    criterion.cuda(gpu)
    
    train_dataloader, valid_dataloader = create_dataloaders(
        gpu,
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config["batch_size"] // ngpus_per_node,
        max_padding=config["max_padding"],
        is_distributed=is_distributed,
    )

    optimizer = torch.optim.Adam(
        model.parameters(), lr=config["base_lr"], betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, d_model, factor=1, warmup=config["warmup"]
        ),
    )
    train_state = TrainState()

    for epoch in range(config["num_epochs"]):
        if is_distributed:
            train_dataloader.sampler.set_epoch(epoch)
            valid_dataloader.sampler.set_epoch(epoch)

        model.train()
        print(f"[GPU{gpu}] Epoch {epoch} Training ====", flush=True)
        _, train_state = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train+log",
            accum_iter=config["accum_iter"],
            train_state=train_state,
        )

        GPUtil.showUtilization()
        if is_main_process:
            file_path = "%s%.2d.pt" % (config["file_prefix"], epoch)
            torch.save(module.state_dict(), file_path)
        torch.cuda.empty_cache()

        print(f"[GPU{gpu}] Epoch {epoch} Validation ====", flush=True)
        model.eval()
        sloss = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in valid_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode="eval",
        )
        print(sloss)
        torch.cuda.empty_cache()

    if is_main_process:
        file_path = "%sfinal.pt" % config["file_prefix"]
        torch.save(module.state_dict(), file_path)

In [38]:
def train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    train_worker(0, 1, vocab_src, vocab_tgt, spacy_de, spacy_en, config, False)

def load_trained_model():
    config = {
        "batch_size": 16,
        "num_epochs": 8,
        "accum_iter": 10,
        "base_lr": 1.0,
        "max_padding": 72,
        "warmup": 3000,
        "file_prefix": "multi30k_model_",
    }
    torch.cuda.empty_cache()
    model_path = "multi30k_model_final.pt"
    if not exists(model_path):
        train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config)

    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(torch.load("multi30k_model_final.pt"))
    return model

model = load_trained_model()

Train worker process using GPU: 0 for training
[GPU0] Epoch 0 Training ====
1656348359.6432955 1
1656348371.8963242 41
1656348384.0974813 81
1656348395.941855 121
1656348407.7968757 161
1656348419.551931 201
1656348431.346184 241
1656348443.079429 281
1656348454.8391542 321
1656348466.639822 361
1656348478.5179312 401
1656348490.7576487 441
1656348503.0799427 481
1656348515.119138 521
1656348527.3255155 561
1656348539.5226562 601
1656348551.758334 641
1656348563.9339225 681
1656348576.1895196 721
1656348588.4087205 761
1656348600.535571 801
1656348612.424817 841
1656348624.308712 881
1656348636.5302527 921
1656348648.5630412 961
1656348660.5376902 1001
1656348672.5523202 1041
1656348684.5202622 1081
1656348696.5631359 1121
1656348708.6089852 1161
1656348720.5839841 1201
1656348732.2233639 1241
1656348743.825259 1281
1656348755.2890234 1321
1656348766.8800044 1361
1656348778.5548358 1401
1656348790.3119807 1441
1656348802.0239236 1481
1656348813.782589 1521
1656348825.505587 1561
165634

# Additional Components: BPE, Search, Averaging

In [39]:
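if False:
    # Optional weight tying (disabled): share one embedding matrix between the
    # source embedding, the target embedding, and the generator's projection.
    # This is only valid when source and target use a shared vocabulary.
    model.src_embed[0].lut.weight = model.tgt_embed[0].lut.weight
    model.generator.proj.weight = model.tgt_embed[0].lut.weight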

In [40]:
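def average(model, models):
    "Average the parameters of `models` into `model`."
    with torch.no_grad():
        for ps in zip(*[m.parameters() for m in [model] + models]):
            # ps[0] belongs to `model`; ps[1:] are the matching parameters to average.
            ps[0].copy_(torch.stack(ps[1:]).mean(dim=0))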

# Results

![](images/results.png)

In [41]:
def check_outputs(
    valid_dataloader,
    model,
    vocab_src,
    vocab_tgt,
    n_examples=15,
    pad_idx=2,
    eos_string="</s>",
):
    results = [()] * n_examples
    for idx in range(n_examples):
        b = next(iter(valid_dataloader))
        rb = Batch(b[0], b[1], pad_idx)

        src_tokens = [vocab_src.get_itos()[x] for x in rb.src[0] if x != pad_idx]
        tgt_tokens = [vocab_tgt.get_itos()[x] for x in rb.tgt[0] if x != pad_idx]

        print("Source Text  : "+ " ".join(src_tokens).replace("\n", ""))
        print("Target Text  : "+ " ".join(tgt_tokens).replace("\n", ""))
        model_out = greedy_decode(model, rb.src, rb.src_mask, 72, 0)[0]
        model_txt = (" ".join([vocab_tgt.get_itos()[x] for x in model_out if x != pad_idx])
                     .split(eos_string, 1)[0]+ eos_string)
        
        print("Model Output               : " + model_txt.replace("\n", ""))
        results[idx] = (rb, src_tokens, tgt_tokens, model_out, model_txt)
    return results


def run_model_example(n_examples=5):
    global vocab_src, vocab_tgt, spacy_de, spacy_en
    print("Preparing Data ...")
    _, valid_dataloader = create_dataloaders(
        torch.device("cpu"),
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=1,
        is_distributed=False,
    )
    print("Loading Trained Model ...")

    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(torch.load("multi30k_model_final.pt", map_location=torch.device("cpu")))
    model.eval()  # disable dropout so decoding is deterministic

    print("Checking Model Outputs:")
    example_data = check_outputs(valid_dataloader, model, vocab_src, vocab_tgt, n_examples=n_examples)
    return model, example_data

In [42]:
run_model_example()

Preparing Data ...
Loading Trained Model ...
Checking Model Outputs:
Source Text  : <s> Drei Menschen knien oder stehen in der Nähe des Wassers am Strand . </s>
Target Text  : <s> Three people are on the beach kneeling or standing near the water . </s>
Model Output               : <s> Three people are kneeling on or standing near the water on the beach . </s>
Source Text  : <s> Eine Frau mit einem gelben T-Shirt und Sonnenbrille geht einen Gehweg entlang . </s>
Target Text  : <s> A woman in a yellow t - shirt and sunglasses walks down a sidewalk . </s>
Model Output               : <s> A woman with a yellow shirt and sunglasses is walking down a sidewalk . </s>
Source Text  : <s> Ein schwarzer Mann sieht einem anderen schwarzen Mann beim Baden im Wasserfall zu . </s>
Target Text  : <s> A black man is watching another black man playing in the waterfall . </s>
Model Output               : <s> A black man watches another black man in a waterfall in a waterfall . </s>
Source Text  : <s> Zwe

(EncoderDecoder(
   (encoder): Encoder(
     (layers): ModuleList(
       (0): EncoderLayer(
         (self_attn): MultiHeadedAttention(
           (linears): ModuleList(
             (0): Linear(in_features=512, out_features=512, bias=True)
             (1): Linear(in_features=512, out_features=512, bias=True)
             (2): Linear(in_features=512, out_features=512, bias=True)
             (3): Linear(in_features=512, out_features=512, bias=True)
           )
           (dropout): Dropout(p=0.1, inplace=False)
         )
         (feed_forward): PositionwiseFeedForward(
           (w_1): Linear(in_features=512, out_features=2048, bias=True)
           (w_2): Linear(in_features=2048, out_features=512, bias=True)
           (dropout): Dropout(p=0.1, inplace=False)
         )
         (sublayer): ModuleList(
           (0): SublayerConnection(
             (norm): LayerNorm()
             (dropout): Dropout(p=0.1, inplace=False)
           )
           (1): SublayerConnection(
       

In [43]:
def mtx2df(m, max_row, max_col, row_tokens, col_tokens):
    "convert a dense matrix to a data frame with row and column indices"
    return pd.DataFrame(
        [
            (
                r,
                c,
                float(m[r, c]),
                "%.3d %s"
                % (r, row_tokens[r] if len(row_tokens) > r else "<blank>"),
                "%.3d %s"
                % (c, col_tokens[c] if len(col_tokens) > c else "<blank>"),
            )
            for r in range(m.shape[0])
            for c in range(m.shape[1])
            if r < max_row and c < max_col
        ],
        # if float(m[r,c]) != 0 and r < max_row and c < max_col],
        columns=["row", "column", "value", "row_token", "col_token"],
    )


def attn_map(attn, layer, head, row_tokens, col_tokens, max_dim=30):
    df = mtx2df(
        attn[0, head].data,
        max_dim,
        max_dim,
        row_tokens,
        col_tokens,
    )
    return (
        alt.Chart(data=df)
        .mark_rect()
        .encode(
            x=alt.X("col_token", axis=alt.Axis(title="")),
            y=alt.Y("row_token", axis=alt.Axis(title="")),
            color="value",
            tooltip=["row", "column", "value", "row_token", "col_token"],
        )
        .properties(height=400, width=400)
        .interactive()
    )

In [44]:
def get_encoder(model, layer):
    return model.encoder.layers[layer].self_attn.attn


def get_decoder_self(model, layer):
    return model.decoder.layers[layer].self_attn.attn


def get_decoder_src(model, layer):
    return model.decoder.layers[layer].src_attn.attn


def visualize_layer(model, layer, getter_fn, ntokens, row_tokens, col_tokens):
    # ntokens = last_example[0].ntokens
    attn = getter_fn(model, layer)
    n_heads = attn.shape[1]
    charts = [
        attn_map(
            attn,
            layer,
            h,
            row_tokens=row_tokens,
            col_tokens=col_tokens,
            max_dim=ntokens,
        )
        for h in range(n_heads)
    ]
    # The grid below is hard-coded for eight attention heads.
    assert n_heads == 8
    return alt.vconcat(
        charts[0]
        | charts[1]
        | charts[2]
        | charts[3]
        | charts[4]
        | charts[5]
        | charts[6]
        | charts[7]
    ).properties(title="Layer %d" % (layer + 1))
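
The charts built by `visualize_layer` plot the weight matrices stored on each attention module. As a rough illustration of where such a matrix comes from, the sketch below computes the weights of a single head with scaled dot-product attention, softmax(QK^T / sqrt(d_k)), in plain Python. The function name `attention_weights` and the toy query and key vectors are assumptions for illustration; the notebook's model computes its weights inside its own attention layers, and no masking is applied here.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d_k)) as a list of rows (one head, no mask)."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example (made-up vectors): 2 queries attending over 3 keys, d_k = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(q, k)
for row in w:
    print(["%.3f" % x for x in row])
```

Every row of the result sums to 1, which is why each row of a heat map can be read as a distribution over the column tokens.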

In [45]:
def viz_encoder_self():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model, layer, get_encoder, len(example[1]), example[1], example[1]
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        & layer_viz[1]
        & layer_viz[2]
        & layer_viz[3]
        & layer_viz[4]
        & layer_viz[5]
    )


viz_encoder_self()

Preparing Data ...
Loading Trained Model ...
Checking Model Outputs:
Source Text  : <s> Ein Baby in einer Wippe und ein stehender Junge , die von Spielzeug umgeben sind . </s>
Target Text  : <s> A baby in a bouncy seat and a standing boy surrounded by toys . </s>
Model Output               : <s> A baby in a seesaw and boy are standing around toys . </s>


In [46]:
def viz_decoder_self():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_self,
            len(example[1]),
            example[1],
            example[1],
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        & layer_viz[1]
        & layer_viz[2]
        & layer_viz[3]
        & layer_viz[4]
        & layer_viz[5]
    )


viz_decoder_self()

Preparing Data ...
Loading Trained Model ...
Checking Model Outputs:
Source Text  : <s> Der <unk> hört dem Arbeiter mit dem grünen Schutzhelm zu , während die zu <unk> Ausrüstung <unk> wird . </s>
Target Text  : <s> The FedEx driver listens to the workman in the green hard hat while the equipment to ship is being loaded . </s>
Model Output               : <s> The <unk> is listening to the worker with the green hard hat while on his <unk> . </s>


In [47]:
def viz_decoder_src():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_src,
            max(len(example[1]), len(example[2])),
            example[1],
            example[2],
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        & layer_viz[1]
        & layer_viz[2]
        & layer_viz[3]
        & layer_viz[4]
        & layer_viz[5]
    )


viz_decoder_src()

Preparing Data ...
Loading Trained Model ...
Checking Model Outputs:
Source Text  : <s> Ein Mann mit <unk> und einem Helm reitet auf einem weißen Pferd und das Pferd springt über ein Hindernis . </s>
Target Text  : <s> A man wearing riding boots and a helmet is riding a white horse , and the horse is jumping a hurdle . </s>
Model Output               : <s> A man with a <unk> and a helmet is riding a white horse and is jumping over an obstacle . </s>


Because training even a basic GAN is computationally expensive, all images were rescaled to 64x64.
In this dataset the authors provide bounding boxes for the birds in the images, so cropping to them lets us capture only the most important features of the images.
The variable **LATENT_DIM** is the size of the noise vector passed as input to the generator network (without a random input, the generator would produce the same output every time). It is usually supplied as a batch of shape [BATCH_SIZE, LATENT_DIM].
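
As a small illustration of the batch shape mentioned here, the sketch below draws a batch of noise vectors using only the Python standard library. The values `BATCH_SIZE = 4` and `LATENT_DIM = 8`, the helper name `sample_noise`, and the choice of a standard normal distribution are all assumptions for illustration; the text does not specify them.

```python
import random

BATCH_SIZE = 4  # assumed value, for illustration only
LATENT_DIM = 8  # assumed value, for illustration only


def sample_noise(batch_size, latent_dim, seed=None):
    """Return a batch_size x latent_dim nested list of standard-normal samples."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for _ in range(batch_size)
    ]


noise = sample_noise(BATCH_SIZE, LATENT_DIM, seed=0)
print(len(noise), len(noise[0]))  # number of vectors, size of each vector
```

Drawing a fresh batch on every call is what gives the generator a different input, and therefore a different output, each time.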

In [None]:
class Transformer(tf.keras.Model):
    def __init__(
            self, *, num_layers, d_model, num_heads,
            dff, input_vocab_size, target_vocab_size, dropout_rate=0.1
    ):
        super().__init__()

        self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                               num_heads=num_heads, dff=dff,
                               vocab_size=input_vocab_size,
                               dropout_rate=dropout_rate)

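    def call(self, inputs):
        # Only the encoder is defined in this class, so the returned tensor
        # is the encoder output (named `logits` here).
        logits = self.encoder(inputs)

        try:
            # Drop the Keras mask so that it does not scale the losses/metrics.
            del logits._keras_mask
        except AttributeError:
            pass

        return logits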

Here, changing the same value for different inputs leads to similar results: a gradual shift in colour. At the same time it affects other parts of the image, so individual values of the noise vector are not always easy for us to interpret.

## Tasks (4 points)

* Apply what you learned in the lectures and use a transformer network on sequential data of your choice.
* Compare the results with recurrent neural networks such as GRU or LSTM.

Send me your solution as a separate notebook that presents your experiments and results.