## BPE Exploration
This notebook explores the byte-pair encoding (BPE) tools in the Hugging Face `tokenizers` package, using the WMT 2014 English-to-German dataset.

In [1]:
import torch
import numpy as np
from tokenizers import CharBPETokenizer
import os

In [2]:
data_path = "/data2/pdplab/grantsrb/wmt_data/en-de_2014/"
eng_file = os.path.join(data_path, "train.en")
ger_file = os.path.join(data_path, "train.de")

In [3]:
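with open(eng_file, 'r', encoding='utf-8') as f:
    eng_lines = []
    # Iterate lazily; f.readlines() would load the whole training file.
    for i, l in enumerate(f):
        print(l)
        eng_lines.append(l)
        if i > 5: break  # keep the first 7 lines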

iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould .

iron cement protects the ingot against the hot , abrasive steel casting process .

a fire restant repair cement for fire places , ovens , open fireplaces etc .

Construction and repair of highways and ...

An announcement must be commercial character .

Goods and services advancement through the P.O.Box system is NOT ALLOWED .

Deliveries ( spam ) and other improper information deleted .



In [4]:
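with open(ger_file, 'r', encoding='utf-8') as f:
    # Iterate lazily; f.readlines() would load the whole training file.
    for i, l in enumerate(f):
        print(l)
        if i > 5: break  # show the first 7 lines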

iron cement ist eine gebrauchs ##AT##-##AT## fertige Paste , die mit einem Spachtel oder den Fingern als Hohlkehle in die Formecken ( Winkel ) der Stahlguss -Kokille aufgetragen wird .

Nach der Aushärtung schützt iron cement die Kokille gegen den heissen , abrasiven Stahlguss .

feuerfester Reparaturkitt für Feuerungsanlagen , Öfen , offene Feuerstellen etc.

Der Bau und die Reparatur der Autostraßen ...

die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen .

der Vertrieb Ihrer Waren und Dienstleistungen durch das Postfach ##AT##-##AT## System WIRD NICHT ZUGELASSEN .

die Werbeversande ( Spam ) und andere unkorrekte Informationen werden gelöscht .



In [5]:
tok = CharBPETokenizer()

In [6]:
tok

Tokenizer(vocabulary_size=0, model=BPE, unk_token=<unk>, suffix=</w>, dropout=None, lowercase=False, unicode_normalizer=None, bert_normalizer=True, split_on_whitespace_only=False)

In [7]:
tok.train([eng_file], vocab_size=50000)

1

In [18]:
tok.add_special_tokens(["<MASK>"])
tok.add_tokens(["<START>"])

1

In [19]:
print(tok.token_to_id("<MASK>"))
print(tok.token_to_id("<START>"))

50000
50001


In [11]:
tok

Tokenizer(vocabulary_size=50001, model=BPE, unk_token=<unk>, suffix=</w>, dropout=None, lowercase=False, unicode_normalizer=None, bert_normalizer=True, split_on_whitespace_only=False)

In [20]:
tok.save_model("./")

['./vocab.json', './merges.txt']

In [21]:
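# Rebuild the tokenizer from the saved vocab and merges files.
# Tokens added after training are not stored in these files.
tok = CharBPETokenizer.from_file("vocab.json", "merges.txt")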

In [22]:
print(tok.token_to_id("<MASK>"))
print(tok.token_to_id("<START>"))

None
None


In [24]:
tok.add_special_tokens(["<MASK>"])
tok.add_special_tokens(["<START>"])

1

In [8]:
print("encoding:", eng_lines[0])
output = tok.encode(eng_lines[0])

encoding: iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould .



In [9]:
print("Ids:")
print(output.ids)
print("Tokens:")
print(output.tokens)

Ids:
[10474, 6633, 1930, 1004, 2665, 1964, 2138, 19231, 2049, 1930, 6761, 1951, 1004, 3024, 3616, 2026, 2204, 8657, 22671, 1941, 15404, 1931, 1912, 22741, 16685, 1902, 16550, 1278, 1922, 1912, 7173, 2135, 2090, 22741, 1558]
Tokens:
['iron</w>', 'cement</w>', 'is</w>', 'a</w>', 'ready</w>', 'for</w>', 'use</w>', 'paste</w>', 'which</w>', 'is</w>', 'laid</w>', 'as</w>', 'a</w>', 'fil', 'let</w>', 'by</w>', 'pu', 'tty</w>', 'knife</w>', 'or</w>', 'finger</w>', 'in</w>', 'the</w>', 'mould</w>', 'edges</w>', '(</w>', 'corners</w>', ')</w>', 'of</w>', 'the</w>', 'steel</w>', 'ing', 'ot</w>', 'mould</w>', '.</w>']


In [57]:
tok.decode(output.ids)

'iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould .'
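
The `</w>` suffix in the tokens above marks the end of a word, and rarer words such as `fillet` stay split into subword pieces (`fil`, `let</w>`) because no learned merge joins them. As a rough illustration of how such merges could be learned, here is a minimal pure-Python sketch of BPE training on a toy word list. It is not the `tokenizers` implementation: the helper names `learn_bpe`, `merge_pair`, and `apply_bpe`, the toy corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def to_symbols(word):
    """Split a non-empty word into characters, tagging the last one with </w>."""
    return tuple(word[:-1]) + (word[-1] + "</w>",)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent pair."""
    vocab = Counter(to_symbols(w) for w in words if w)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumed tie-break: highest count, then the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["the"] * 5 + ["then"]  # toy corpus, not WMT data
merges = learn_bpe(toy_words, num_merges=10)
print(apply_bpe("the", merges))
print(apply_bpe("they", merges))
```

With this toy corpus, the frequent word `the` collapses into a single token while the unseen word `they` is split into pieces, which mirrors how `fillet` is split in the real output.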

In [69]:
print(dir(tok))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_parameters', '_tokenizer', 'add_special_tokens', 'add_tokens', 'decode', 'decode_batch', 'enable_padding', 'enable_truncation', 'encode', 'encode_batch', 'from_file', 'get_vocab', 'get_vocab_size', 'id_to_token', 'no_padding', 'no_truncation', 'normalize', 'num_special_tokens_to_add', 'padding', 'post_process', 'save', 'save_model', 'to_str', 'token_to_id', 'train', 'truncation']


In [25]:
tok.get_vocab_size()

50002

In [61]:
tok.add_special_tokens(["<MASK>"])

1

In [62]:
tok.token_to_id("<MASK>")

50000

In [65]:
eng_lines[0] += "<MASK> <MASK> <MASK>"

In [67]:
output = tok.encode(eng_lines[0])

In [70]:
print(output.ids)

[10471, 6630, 1927, 1001, 2662, 1961, 2135, 19228, 2046, 1927, 6758, 1948, 1001, 3021, 3613, 2023, 2201, 8654, 22668, 1938, 15401, 1928, 1909, 22738, 16682, 1694, 16547, 1833, 1919, 1909, 7170, 2132, 2087, 22738, 1428, 50000, 50000, 50000]


In [74]:
ids = [50000, 50000, 50000]
tok.decode(ids)

''

## Vocab Exploration

In [33]:
eng_vocab = os.path.join(data_path, "vocab.50K.en")

In [35]:
with open(eng_vocab, 'r', encoding='utf-8') as f:
    # Print the first 12 entries of the vocabulary file.
    for i, l in enumerate(f):
        print(l)
        if i > 10: break

<unk>

<s>

</s>

the

,

.

of

and

to

in

a

is
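
The listing suggests the vocabulary file holds one token per line, with `<unk>`, `<s>`, and `</s>` first. A minimal sketch of how such a file might be turned into a word-level lookup follows, assuming (this is not confirmed by the source) that a token's id is its 0-based position among the non-empty lines and that unknown words map to `<unk>`. The names `load_vocab` and `encode_words` are hypothetical, and an in-memory sample stands in for the real file.

```python
import io


def load_vocab(lines):
    """Map each token to its position among the non-empty lines (assumed id)."""
    token_to_id = {}
    for line in lines:
        token = line.rstrip("\n")
        if token and token not in token_to_id:
            token_to_id[token] = len(token_to_id)
    return token_to_id


def encode_words(text, token_to_id, unk="<unk>"):
    """Look up whitespace-separated words, falling back to the <unk> id."""
    unk_id = token_to_id[unk]
    return [token_to_id.get(word, unk_id) for word in text.split()]


# In-memory stand-in for the first lines of the vocabulary file shown above.
sample = io.StringIO("<unk>\n<s>\n</s>\nthe\n,\n.\nof\nand\nto\nin\na\nis\n")
vocab = load_vocab(sample)
ids = encode_words("iron cement is a paste", vocab)
print(ids)
```

Unlike the BPE tokenizer, a whole-word lookup like this sends every out-of-vocabulary word to the same `<unk>` id, which is the limitation that subword merges are meant to reduce.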

