# Assignment 5

**Submission deadlines**:

* last lab before 27.06.2022 

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1uufpGn46Mwv4oBwajIeOj4rvAK96iaS-?usp=sharing> (or will be soon :) )

## Task 2 (6 points)

This task is about text generation. You have to:


**C**. Write a text generation procedure. The procedure should fulfill the following requirements:

1. it should use the RNN language model (trained on sub-word tokens)
2. generated tokens should be presented as text made of words (without extra spaces or other extra characters, such as the begin-of-word marker introduced during tokenization)
3. all words in a generated text should belong to the corpus (note that the LSTM does not guarantee this)
4. Top-P sampling should be used for generation (see NN-NLP.6, slide X)
5. every token 3-gram in a generated text should be unique
6. *(optional, +1 point)* all token bigrams in generated texts should occur in the corpus
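Requirements 4 and 5 can be handled in a single sampling step. The sketch below is only an illustration under stated assumptions, not a reference solution: it works on plain Python lists of probabilities (in the notebook these would come from a softmax over the model's logits), and the names `top_p_filter` and `sample_next` are hypothetical.

```python
import random


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, and renormalize it. Returns {token_id: prob}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    nucleus, total = [], 0.0
    for token_id, prob in ranked:
        nucleus.append((token_id, prob))
        total += prob
        if total >= p:
            break
    return {token_id: prob / total for token_id, prob in nucleus}


def sample_next(probs, history, p=0.9, rng=random):
    """Sample the next token id with Top-P, excluding every token that would
    repeat a 3-gram already present in history (a list of token ids)."""
    seen = {tuple(history[i:i + 3]) for i in range(len(history) - 2)}
    allowed = list(probs)
    if len(history) >= 2:
        prefix = tuple(history[-2:])
        allowed = [
            0.0 if prefix + (token_id,) in seen else prob
            for token_id, prob in enumerate(probs)
        ]
    mass = sum(allowed)
    if mass == 0.0:
        # Every continuation repeats a 3-gram; the caller has to stop or back off.
        return None
    allowed = [prob / mass for prob in allowed]
    nucleus = top_p_filter(allowed, p)
    token_ids, weights = zip(*nucleus.items())
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Token 2 is the most probable, but (0, 1, 2) already occurred, so it is excluded.
print(sample_next([0.1, 0.1, 0.8], history=[0, 1, 2, 0, 1]))
```

Banned tokens are removed before the nucleus is computed, so the nucleus is built only from continuations that are still allowed. The same masking step could enforce the bigram condition from requirement 6 by also zeroing tokens whose bigram with the previous token never occurs in the corpus.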

In [1]:
import pickle
from collections import Counter
from pathlib import Path

import torch
from nltk.tokenize import word_tokenize
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

**A**. Create a text corpus containing texts with similar vocabulary (for instance, books from the same genre or by the same author). The corpus should have approximately 1M words. You can consider the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), or parts of BookCorpus (https://github.com/soskek/bookcorpus), but feel free to use others. Texts can be in English, Polish, or any other language you know.

In [3]:
CORPORA_FILEPATH = Path("data/prus.txt")

if not CORPORA_FILEPATH.exists():
    # Concatenate all source texts (in a stable order) into a single corpus file.
    with CORPORA_FILEPATH.open("wt") as f:
        for filepath in sorted(Path("data/prus").glob("*.txt")):
            f.write(filepath.read_text())
            f.write("\n")

In [4]:
corpora = CORPORA_FILEPATH.read_text()
len(corpora.split())

1061380

**B**. Choose the tokenization procedure. It should have two stages:

1. word tokenization (you can use nltk.tokenize.word_tokenize, the tokenizer from spaCy, pytorch, keras, ...). Test your tokenizer on your corpus and look at the set of tokens containing both letters and special characters. If you think some of them should be treated as a sequence of tokens, modify the tokenization procedure
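One way to inspect the tokens that mix letters with special characters is to count them and review the most frequent ones by hand. The sketch below is a minimal illustration; the helpers `is_mixed` and `mixed_token_counts` and the sample token list are hypothetical, and "special" is assumed to mean any character that is neither alphanumeric nor whitespace.

```python
from collections import Counter


def is_mixed(token):
    """True if the token has at least one letter and at least one character
    that is neither alphanumeric nor whitespace."""
    has_letter = any(ch.isalpha() for ch in token)
    has_special = any(not ch.isalnum() and not ch.isspace() for ch in token)
    return has_letter and has_special


def mixed_token_counts(tokens):
    """Count how often each mixed token occurs."""
    return Counter(token for token in tokens if is_mixed(token))


# Made-up sample; in the notebook this would be the output of the word tokenizer.
sample = ["pan", "tak…", "-Nie", "biało-czerwony", "…", "1887", "tak…"]
for token, count in mixed_token_counts(sample).most_common(20):
    print(count, token)
```

Tokens such as a word with an attached ellipsis are candidates for further splitting, while hyphenated compounds may be better left whole.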

In [5]:
BASE_TOKENS_FILEPATH = Path("data/tokens.pickle")

if BASE_TOKENS_FILEPATH.exists():
    with BASE_TOKENS_FILEPATH.open("rb") as f:
        tokens = pickle.load(f)
else:
    tokens = word_tokenize(corpora, "polish")
    with BASE_TOKENS_FILEPATH.open("wb") as f:
        pickle.dump(tokens, f)

In [6]:
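import re

# Split the ellipsis character off into its own token, e.g. "tak…" -> "tak", "…".
# The capturing group keeps the delimiter; empty fragments are dropped.
new_tokens = []
for token in tokens:
    new_tokens.extend(part for part in re.split("(…)", token) if part)
tokens = new_tokens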

2. sub-word tokenization (you can either use an existing procedure, like WordPiece or SentencePiece, or create something yourself). Here is a simple idea: take the 8K most popular words (W), the 1K most popular suffixes (S), and the 1K most popular prefixes (P). Words in W are their own tokens. A word x outside W should be tokenized as 'p_ _s', where p is the longest prefix of x in P and s is the longest suffix of x in S
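The simple scheme can be illustrated on a toy vocabulary. This is a sketch under assumptions: s is read as the longest suffix of x in S, only proper prefixes and suffixes are considered, and the function `subword_tokenize` and the tiny sets W, P and S are made up for the example.

```python
def subword_tokenize(word, W, P, S):
    """Return [word] if the word is in W; otherwise return ['p_', '_s'], where
    p is the longest proper prefix of the word found in P and s is the longest
    proper suffix found in S (an empty string if nothing matches)."""
    if word in W:
        return [word]
    prefixes = (word[:cut] for cut in range(1, len(word)))
    suffixes = (word[cut:] for cut in range(1, len(word)))
    p = max((x for x in prefixes if x in P), key=len, default="")
    s = max((x for x in suffixes if x in S), key=len, default="")
    return [f"{p}_", f"_{s}"]


# Toy vocabulary, not taken from the corpus.
W = {"dom"}
P = {"d", "do"}
S = {"k", "ek"}

print(subword_tokenize("dom", W, P, S))
print(subword_tokenize("domek", W, P, S))
```

The split is lossy: the middle of the word ("m" in "domek") is not stored, so turning tokens back into text requires a separate step that maps each pair to a word from the corpus.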

In [7]:
word_counts = Counter(tokens)

In [8]:
most_common_words = {word for word, _ in word_counts.most_common(8000)}

In [9]:
FIXES_FILEPATH = Path("data/fixes.pickle")

if FIXES_FILEPATH.exists():
    with FIXES_FILEPATH.open("rb") as f:
        suffix_counts, prefix_counts = pickle.load(f)
else:
    suffix_counts = Counter()
    prefix_counts = Counter()
    for token in tokens:
        for idx in range(len(token) + 1):
            prefix_counts.update([token[:idx]])
            suffix_counts.update([token[idx:]])
    with FIXES_FILEPATH.open("wb") as f:
        pickle.dump((suffix_counts, prefix_counts), f)

In [10]:
# Sort longest first, so that the first match found later is the longest one.
most_common_prefixes = sorted((p for p, _ in prefix_counts.most_common(1000)), key=len, reverse=True)
most_common_suffixes = sorted((s for s, _ in suffix_counts.most_common(1000)), key=len, reverse=True)

In [11]:
def take(iterable, n):
    for _, el in zip(range(n), iterable):
        yield el

In [12]:
FINAL_TOKENS_FILEPATH = Path("data/tokens_final.pickle")

if FINAL_TOKENS_FILEPATH.exists():
    with FINAL_TOKENS_FILEPATH.open("rb") as f:
        final_tokens = pickle.load(f)
else:
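    final_tokens = []
    for token in tokens:
        if token in most_common_words:
            # Frequent words are kept as their own tokens.
            final_tokens.append(token)
        else:
            # Both lists are sorted longest first, so the first match is the
            # longest one. The empty string is counted for every token, so it
            # is always in both lists and a match is always found.
            longest_prefix = ""
            longest_suffix = ""
            for prefix in most_common_prefixes:
                if token.startswith(prefix):
                    longest_prefix = prefix
                    break
            for suffix in most_common_suffixes:
                if token.endswith(suffix):
                    longest_suffix = suffix
                    break
            # Note: the prefix/suffix pair is stored as a single "p__s" token.
            final_tokens.append(f"{longest_prefix}__{longest_suffix}")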
    with FINAL_TOKENS_FILEPATH.open("wb") as f:
        pickle.dump(final_tokens, f)
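Since every word yields exactly one sub-word token above, generated tokens can be mapped back to corpus words by remembering which words produced each token. The sketch below shows one possible way to do this; `build_expansions` and `detokenize` are hypothetical helpers, and the toy lists stand in for `tokens` and `final_tokens`.

```python
import random
from collections import Counter, defaultdict


def build_expansions(words, subword_tokens):
    """Map each sub-word token to a Counter of the corpus words it came from.
    Assumes the two lists are aligned one to one."""
    expansions = defaultdict(Counter)
    for word, token in zip(words, subword_tokens):
        expansions[token][word] += 1
    return expansions


def detokenize(generated, expansions, rng=random):
    """Replace every generated token with a corpus word, drawn in proportion
    to how often that word produced the token."""
    words = []
    for token in generated:
        candidates, weights = zip(*expansions[token].items())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Toy data standing in for `tokens` and `final_tokens`.
words = ["pan", "domek", "domek", "dworek"]
subword_tokens = ["pan", "d__ek", "d__ek", "d__ek"]
expansions = build_expansions(words, subword_tokens)
print(detokenize(["pan", "d__ek"], expansions))
```

Every emitted word comes from the corpus by construction, which addresses requirement 3. Joining with single spaces still leaves a space before punctuation, so a final clean-up pass would be needed for requirement 2.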

In [13]:
SEQUENCE_LENGTH = 10

class PrusDataset(torch.utils.data.Dataset):
    def __init__(self, tokens, sequence_length=SEQUENCE_LENGTH, device=device):
        self.tokens = tokens
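        # Sort the vocabulary so that token ids are stable across runs
        # (set iteration order changes with string hash randomization,
        # which would break reloading a saved model).
        self.vocab = sorted(set(tokens))
        self.final_idx_to_word = dict(enumerate(self.vocab))
        self.final_word_to_idx = {w: idx for idx, w in self.final_idx_to_word.items()}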
        self.sequence = torch.tensor([self.final_word_to_idx[token] for token in self.tokens], device=device)
        self.sequence_length = sequence_length
        self.device = device

    def __len__(self):
        return len(self.sequence) - self.sequence_length

    def __getitem__(self, index):
        return (
            self.sequence[index:(index + self.sequence_length)],
            self.sequence[(index + 1):(index + self.sequence_length + 1)]
        )

In [14]:
dataset = PrusDataset(final_tokens)

In [15]:
class PrusModel(nn.Module):
    def __init__(self, dataset, device):
        super().__init__()
        self.n_vocab = len(dataset.vocab)
        self.embedding_dim = 100
        self.lstm_size = 512
        self.num_layers = 3
        self.device = device

        self.embedding = nn.Embedding(
            num_embeddings=self.n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
            # DataLoader yields (batch, sequence) tensors; without this flag
            # the LSTM would treat the batch dimension as time.
            batch_first=True,
        )
        self.fc = nn.Linear(self.lstm_size, self.n_vocab)
        self.to(device)

    def forward(self, x, state):
        embed = self.embedding(x)
        out, new_state = self.lstm(embed, state)
        logits = self.fc(out)
        return logits, new_state

    def get_init_state(self, batch_size):
        # Hidden and cell state have shape (num_layers, batch, hidden_size).
        return (
            torch.zeros(self.num_layers, batch_size, self.lstm_size, device=self.device),
            torch.zeros(self.num_layers, batch_size, self.lstm_size, device=self.device),
        )

In [16]:
model = PrusModel(dataset, device)

In [17]:
batch_size = 256
max_epochs = 30

def train(dataset, model):
    model.train()

    dataloader = DataLoader(dataset, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(max_epochs):
        for x, y in tqdm(dataloader):
            optimizer.zero_grad()

            # Windows in consecutive batches do not continue each other, so
            # every batch starts from a zero state sized to the actual batch
            # (the last batch may be smaller than batch_size).
            state = model.get_init_state(x.size(0))
            y_pred, _ = model(x, state)
            # CrossEntropyLoss expects (batch, classes, sequence).
            loss = criterion(y_pred.transpose(1, 2), y)

            loss.backward()
            optimizer.step()

        # Reports the loss of the last batch in the epoch.
        print({'epoch': epoch, 'loss': loss.item()})
        torch.save(model.state_dict(), f"prus_model_3x512_{epoch:02d}ep.model")

In [18]:
train(dataset, model)

21it [00:08,  2.48it/s]


KeyboardInterrupt: 