# Assignment 5

**Submission deadlines**:

* last lab before 27.06.2022 

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1uufpGn46Mwv4oBwajIeOj4rvAK96iaS-?usp=sharing> (or will be soon :) )

## Task 2 (6 points)

In [3]:
import pickle
from collections import Counter
from pathlib import Path

import torch
from nltk.tokenize import word_tokenize
from torch import nn, optim
from torch.utils.data import DataLoader
from torchtext.vocab import vocab
from tqdm import tqdm
import pytorch_lightning as pl

from utils import PrusModel, PrusDataModule

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

**A**. Create a text corpus containing texts with similar vocabulary (for instance, books from the same genre or written by the same author). The corpus should contain approximately 1M words. You can consider the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), or parts of BookCorpus (https://github.com/soskek/bookcorpus), but feel free to use others. Texts can be in English, Polish, or any other language you know.

In [3]:
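CORPORA_FILEPATH = Path("data/prus.txt")

# Concatenate all source texts into one corpus file (built once, then reused).
if not CORPORA_FILEPATH.exists():
    with CORPORA_FILEPATH.open("wt") as f:
        # sorted() makes the concatenation order reproducible across runs.
        for filepath in sorted(Path("data/prus").glob("*.txt")):
            # read_text() closes the source file after reading.
            f.write(filepath.read_text())
            f.write("\n")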

In [4]:
corpora = CORPORA_FILEPATH.read_text()
len(corpora.split())

1061380

**B**. Choose the tokenization procedure. It should have two stages:

1. Word tokenization (you can use nltk.tokenize.word_tokenize, a tokenizer from spaCy, pytorch, keras, ...). Test your tokenizer on your corpus and look at the set of tokens containing both letters and special characters. If, in your opinion, some of them should be treated as a sequence of tokens, modify the tokenization procedure accordingly.

In [5]:
BASE_TOKENS_FILEPATH = Path("data/tokens.pickle")

if BASE_TOKENS_FILEPATH.exists():
    with BASE_TOKENS_FILEPATH.open("rb") as f:
        tokens = pickle.load(f)
else:
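    tokens = word_tokenize(corpora.lower(), "polish")
    # Split the ellipsis character off into its own token, e.g. "tak…" -> "tak", "…".
    new_tokens = []
    for token in tokens:
        parts = token.replace("…", "$$…$$").split("$$")
        # Drop the empty strings that split() yields when "…" starts or ends a token.
        new_tokens.extend(part for part in parts if part)
    tokens = new_tokens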
    with BASE_TOKENS_FILEPATH.open("wb") as f:
        pickle.dump(tokens, f)
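
The inspection step from item 1 (looking at tokens that mix letters with special characters) is not shown in the cell above. A minimal sketch of one way to do it follows; the helper name `mixed_tokens` and the sample list are illustrative assumptions, not part of the original notebook.

```python
def mixed_tokens(tokens):
    """Return the set of tokens that contain at least one letter
    and at least one character that is neither a letter nor a digit."""
    return {
        token
        for token in tokens
        if any(ch.isalpha() for ch in token)
        and any(not ch.isalnum() for ch in token)
    }


# Hypothetical sample; in the notebook this would be the `tokens` list.
sample = ["pan", "d'artagnan", "wokulski…", "1878", "po-polsku", "."]
for token in sorted(mixed_tokens(sample)):
    print(token)
```

Running it on the real token list and skimming the result is one way to decide which characters (here only "…") deserve to be split off into separate tokens.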

2. Sub-word tokenization (you can either use an existing procedure, like WordPiece or SentencePiece, or create something yourself). Here is a simple idea: take the 8K most popular words (W), the 1K most popular suffixes (S), and the 1K most popular prefixes (P). Words in W are their own tokens. A word x outside W should be tokenized as 'p_ _s', where p is the longest prefix of x in P and s is the longest suffix of x in S.
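
To make the scheme in item 2 concrete, here is a toy sketch on a hand-made vocabulary. The function `split_word` and the tiny sets `W`, `P`, `S` are hypothetical and exist only for illustration; joins are marked with "$" (the marker the cells below use) instead of "_", and any leftover middle part becomes a token of its own.

```python
def split_word(word, words, prefixes, suffixes):
    """Split `word` into at most three sub-tokens: prefix, middle, suffix."""
    if word in words:
        return [word]
    # Longest known prefix of the word (empty string if none matches).
    prefix = max((p for p in prefixes if word.startswith(p)), key=len, default="")
    rest = word[len(prefix):]
    # Longest known suffix of what is left after removing the prefix.
    suffix = max((s for s in suffixes if rest.endswith(s)), key=len, default="")
    middle = rest[:len(rest) - len(suffix)]
    parts = [part for part in (prefix, middle, suffix) if part]
    # Put "$" on every side where a piece was cut from its neighbour.
    last = len(parts) - 1
    return [
        ("$" if i > 0 else "") + part + ("$" if i < last else "")
        for i, part in enumerate(parts)
    ]


# Toy sets; the assignment suggests 8K words and 1K prefixes/suffixes.
W = {"dom", "kot"}
P = {"prze", "nie", "po"}
S = {"ami", "ów", "ego"}

print(split_word("dom", W, P, S))         # ['dom']
print(split_word("niedobrego", W, P, S))  # ['nie$', '$dobr$', '$ego']
print(split_word("kotami", W, P, S))      # ['kot$', '$ami']
```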

In [6]:
word_counts = Counter(tokens)

In [7]:
most_common_words = {word for word, _ in word_counts.most_common(8000)}

In [8]:
FIXES_FILEPATH = Path("data/fixes.pickle")

if FIXES_FILEPATH.exists():
    with FIXES_FILEPATH.open("rb") as f:
        suffix_counts, prefix_counts = pickle.load(f)
else:
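    suffix_counts = Counter()
    prefix_counts = Counter()
    for token in tokens:
        # Count every non-empty prefix and suffix of the token (the whole token included);
        # empty strings are skipped so they do not take up a slot among the most common ones.
        for length in range(1, len(token) + 1):
            prefix_counts[token[:length]] += 1
            suffix_counts[token[-length:]] += 1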
    with FIXES_FILEPATH.open("wb") as f:
        pickle.dump((suffix_counts, prefix_counts), f)

In [9]:
# Sort longest first, so that the first match found later is also the longest match.
most_common_prefixes = sorted((p for p, _ in prefix_counts.most_common(1000)), key=len, reverse=True)
most_common_suffixes = sorted((s for s, _ in suffix_counts.most_common(1000)), key=len, reverse=True)

In [10]:
def take(iterable, n):
    for _, el in zip(range(n), iterable):
        yield el

In [11]:
FINAL_TOKENS_FILEPATH = Path("data/tokens_final.pickle")

if FINAL_TOKENS_FILEPATH.exists():
    with FINAL_TOKENS_FILEPATH.open("rb") as f:
        final_tokens = pickle.load(f)
    v = torch.load("vocab.pth")
else:
    final_tokens = []
    for token in tokens:
        if token in most_common_words:
            final_tokens.append(token)
        else:
            rest = token
            subtokens = []
            longest_prefix = ""
            longest_suffix = ""
            for prefix in most_common_prefixes:
                if token.startswith(prefix):
                    longest_prefix = prefix
                    break
            if longest_prefix:
                rest = rest[len(longest_prefix):]
                subtokens.append(longest_prefix)

            for suffix in most_common_suffixes:
                if rest.endswith(suffix):
                    longest_suffix = suffix
                    break

            if longest_suffix:
                rest = rest[:-len(longest_suffix)]
            if rest:
                subtokens.append(rest)
            if longest_suffix:
                subtokens.append(longest_suffix)

            if len(subtokens) == 3:
                subtokens[0] = subtokens[0] + "$"
                subtokens[1] = "$" + subtokens[1] + "$"
                subtokens[2] = "$" + subtokens[2]

            elif len(subtokens) == 2:
                subtokens[0] = subtokens[0] + "$"
                subtokens[1] = "$" + subtokens[1]
            final_tokens.extend(subtokens)
    with FINAL_TOKENS_FILEPATH.open("wb") as f:
        pickle.dump(final_tokens, f)
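    v = vocab(Counter(final_tokens))
    v.append_token("<unknown>")
    # Map out-of-vocabulary tokens to the "<unknown>" entry rather than to -1,
    # which is not a valid index into the vocabulary.
    v.set_default_index(v["<unknown>"])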
    torch.save(v, "vocab.pth")
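
A quick sanity check for the sub-word step is to glue the pieces back together and compare the result with the word-level tokens. The sketch below assumes the "$" marking used in the cell above and that no corpus token itself begins or ends with "$"; `merge_subtokens` is a hypothetical helper, not part of the notebook or of `utils`.

```python
MARKER = "$"


def merge_subtokens(subtokens):
    """Rebuild words from sub-tokens whose cut sides are marked with "$"."""
    words = []
    buffer = ""
    for token in subtokens:
        # A lone "$" is treated as an ordinary token, not as a marker.
        joins_left = len(token) > 1 and token.startswith(MARKER)
        joins_right = len(token) > 1 and token.endswith(MARKER)
        core = token[1:] if joins_left else token
        core = core[:-1] if joins_right else core
        buffer += core
        if not joins_right:
            # Nothing more to attach on the right: the word is complete.
            words.append(buffer)
            buffer = ""
    if buffer:
        words.append(buffer)
    return words


pieces = ["pan", "nie$", "$dobr$", "$ego", "kot$", "$ami", "."]
print(merge_subtokens(pieces))  # ['pan', 'niedobrego', 'kotami', '.']
```

In the notebook, comparing `merge_subtokens(final_tokens)` with `tokens` would show whether the split is lossless under the stated assumption.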