First, we check which datasets are available via HuggingFace's `datasets` library like so:

In [None]:
!pip install datasets

In [1]:
import datasets

all_ds = datasets.list_datasets()
print(len(all_ds))

1306


In [2]:
all_ds[:10]

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews']

Let's go ahead and download the OSCAR Italian dataset (total of 95GB - download can take some time).

In [3]:
dataset = datasets.load_dataset(
    'oscar',
    'unshuffled_deduplicated_it',
    split='train[:2000000]')

Reusing dataset oscar (/Users/jamesbriggs/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_it/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2)


In [4]:
dataset

Dataset({
    features: ['id', 'text'],
    num_rows: 2000000
})

Now we reformat the data into plaintext files. We will store them in a local `oscar_it` directory.

In [5]:
import os

os.mkdir('./oscar_it')

In [6]:
from tqdm.auto import tqdm  # for our loading bar

text_data = []
file_count = 0

for sample in tqdm(dataset):
    # remove newline characters from each sample as we need to use exclusively as seperators
    sample = sample['text'].replace('\n', '\s')
    text_data.append(sample)
    if len(text_data) == 5_000:
        # once we hit the 5K mark, save to file
        with open(f'./oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 5K chunks, we may have leftover samples, we save those now too
with open(f'./oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

100%|██████████| 2000000/2000000 [01:33<00:00, 21496.16it/s]


Next, we make a big list of all of the plaintext files we just saved - using `pathlib`.

In [7]:
from pathlib import Path
paths = [str(x) for x in Path('./oscar_it').glob('**/*.txt')]
paths[:5]

['oscar_it/text_38.txt',
 'oscar_it/text_10.txt',
 'oscar_it/text_264.txt',
 'oscar_it/text_270.txt',
 'oscar_it/text_258.txt']

In [8]:
len(paths)

401

And now we're ready to begin training our tokenizer! We initialize the WordPiece tokenizer, assign BERT style special tokens, and provide our list of files.

In [9]:
# !pip install tokenizers
from tokenizers import BertWordPieceTokenizer

# initialize
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)
# and train
tokenizer.train(files=paths, vocab_size=30_000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                show_progress=True, special_tokens=[
                    '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])






Then we save the tokenizer inside the `./bert-it` directory.

In [10]:
os.mkdir('./bert-it')

tokenizer.save_model('./bert-it')

['./bert-it/vocab.txt']

---

# Using the Tokenizer

We can now load and use the tokenizer as we usually would in transformers.

In [11]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./bert-it')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [12]:
tokenizer('ciao! come va?')  # hi! how are you?

{'input_ids': [2, 13884, 5, 2095, 2281, 35, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

We can see our `[CLS]` token in `input_ids` represented by `2`, and `[SEP]` token represented by `3`. If we read in the file created by the `save_model` method we will find that the input IDs align to the token rows in that file.

In [13]:
with open('./bert-it/vocab.txt', 'r') as fp:
    vocab = fp.read().split('\n')

In [14]:
vocab[2], vocab[13884], vocab[5], \
    vocab[2095], vocab[2281], vocab[35], \
        vocab[3]

('[CLS]', 'ciao', '!', 'come', 'va', '?', '[SEP]')

Let's try another - this one is good to remember if you ever need to speak Italian:

In [15]:
tokenizer('ho capito niente')  # I understood nothing

{'input_ids': [2, 2318, 5945, 4576, 3], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [16]:
vocab[2], vocab[2318], vocab[5945], \
    vocab[4576], vocab[3]

('[CLS]', 'ho', 'capito', 'niente', '[SEP]')

And with that, we've build our WordPiece tokenizer! Let's have a look at some word *pieces* too:

In [17]:
tokenizer('responsbilità')  # responsibility

{'input_ids': [2, 24140, 1016, 16948, 3], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [18]:
vocab[2], vocab[24140], vocab[1016], \
    vocab[16948], vocab[3]

('[CLS]', 'respon', '##s', '##bilita', '[SEP]')