First install HuggingFace `datasets`:

In [None]:
!pip install datasets

Or use *conda*:

In [None]:
!conda install -c huggingface -c conda-forge datasets

If this doesn't work, install from source:

In [None]:
!git clone https://github.com/huggingface/datasets.git
%cd datasets
!pip install -e .

---

In [1]:
from datasets import load_dataset

We then load the **Italian** part of the [**OSCAR**](https://huggingface.co/datasets/oscar) dataset. This is a *huge* dataset (about 28.5 million rows), so the download can take a long time:

In [2]:
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')

Reusing dataset oscar (C:\Users\James\.cache\huggingface\datasets\oscar\unshuffled_deduplicated_it\1.0.0\e4f06cecc7ae02f7adf85640b4019bf476d44453f251a1d84aebae28b0f8d51d)


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 28522082
    })
})

In [4]:
len(dataset['train'])

28522082

In [5]:
dataset['train']

Dataset({
    features: ['id', 'text'],
    num_rows: 28522082
})

In [6]:
dataset['train'].features

{'id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}

In [7]:
dataset['train'][0]

{'id': 0,
 'text': "La estrazione numero 48 del 10 e LOTTO ogni 5 minuti e' avvenuta sabato 15 settembre 2018 alle ore 04:00 a Roma, nel Centro Elaborazione Dati della Lottomatica Italia (ora GTech SpA), con la supervisione della Amministrazione Autonoma dei Monopoli di Stato (AAMS), incaricata di vigilare sulla regolarità delle operazioni di sorteggio.\nIl Montepremi della 48ª estrazione viene ripartito tra i vincitori delle singole categorie di premio.\nRicorda di controllare il Numero ORO 53. E, se lo hai giocato, anche il DOPPIO ORO 53 e 66. Se indovini puoi vincere premi più ricchi.\nIl nostro sito web impiega cookies per migliorare la navigazione del visitatore. L’utente è consapevole che, continuando a visitare il nostro sito web, accetta l’utilizzo dei cookies Accetto Informazioni\n(C) Copyright 2013-2017 10elotto.biz | Il presente sito è da considerarsi un sito indipendente, NON collegato alla rete ufficiale Gtech SpA."}

Now we save this data to disk as several *plaintext* files, each holding 10,000 samples with one sample per line.

In [8]:
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, we will have 2,082 leftover samples; we save those now too
with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

100%|██████████| 28522082/28522082 [33:32<00:00, 14173.48it/s]
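The chunking logic in the loop above can be isolated into a small helper, which makes it easy to check on toy data before running it over the full dataset. This is a minimal sketch and not part of the original notebook: `chunked` is a hypothetical helper name, and the chunk size of 3 is chosen only for illustration.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            # nothing left, so stop without yielding an empty chunk
            return
        yield chunk


# toy stand-in for the dataset samples (assumption: 8 short strings)
samples = [f'sample {i}' for i in range(8)]
chunks = list(chunked(samples, 3))
print([len(c) for c in chunks])
```

Unlike the loop above, this helper never produces an empty final chunk, which would otherwise happen if the number of samples were an exact multiple of the chunk size.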


Now we get a list of paths to each file in our *oscar_it* directory.

In [1]:
from pathlib import Path

paths = [str(x) for x in Path('../../data/text/oscar_it').glob('**/*.txt')]

paths[-5:]

['..\\..\\data\\text\\oscar_it\\text_995.txt',
 '..\\..\\data\\text\\oscar_it\\text_996.txt',
 '..\\..\\data\\text\\oscar_it\\text_997.txt',
 '..\\..\\data\\text\\oscar_it\\text_998.txt',
 '..\\..\\data\\text\\oscar_it\\text_999.txt']
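Note that `glob` yields paths in no guaranteed order, so the last five paths need not be the last five chunks written. If the order matters, something like the following sketch could sort the paths by their numeric suffix. `chunk_index` is a hypothetical helper, and the file names are toy examples that follow the `text_{n}.txt` pattern used above.

```python
import re
from pathlib import PurePath


def chunk_index(path):
    """Return the integer N from a file name of the form text_N.txt."""
    match = re.search(r'text_(\d+)\.txt$', PurePath(path).name)
    if match is None:
        raise ValueError(f'unexpected file name: {path}')
    return int(match.group(1))


# toy file names, deliberately out of order
example_paths = ['text_10.txt', 'text_2.txt', 'text_0.txt',
                 'text_999.txt', 'text_1000.txt']
ordered = sorted(example_paths, key=chunk_index)
print(ordered)
```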

Now we move on to training the tokenizer. We use a byte-level byte-pair encoding (BPE) tokenizer. This builds the vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens from the vocabulary.
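To see why a byte alphabet covers any input, the sketch below encodes a string containing accented and emoji characters to UTF-8 and checks that every resulting value fits in a single byte. It uses only the Python standard library and illustrates the idea only; it is not how the `tokenizers` library represents bytes internally, and the example string is arbitrary.

```python
# arbitrary example text with non-ASCII characters
text = 'più ricchi 😀'

# every character becomes one or more bytes, each in the range 0..255
byte_values = list(text.encode('utf-8'))
assert all(0 <= b < 256 for b in byte_values)

# the byte sequence decodes back to the original text without loss
restored = bytes(byte_values).decode('utf-8')
print(restored == text)
```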

In [2]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

In [3]:
tokenizer.train(files=paths[:5], vocab_size=30_522, min_frequency=2,
                special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]'])

We can now save our tokenizer to file. We'll give our model a traditional Italian name, filiBERTo:

In [4]:
import os

# create the output directory; exist_ok avoids an error if the cell is re-run
os.makedirs('./filiberto', exist_ok=True)

tokenizer.save_model('./filiberto', 'filiberto')

['./filiberto\\filiberto-vocab.json', './filiberto\\filiberto-merges.txt']

Now we have two files that outline our new filiBERTo tokenizer:

* the *vocab.json* - a mapping from tokens to token IDs

* and *merges.txt* - the ordered list of merge rules, each naming a pair of tokens that is joined into a larger token
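The way these two files work together can be sketched with a toy example. The code below is a simplified illustration, not the `tokenizers` implementation: `apply_merges`, `toy_merges`, and `toy_vocab` are hypothetical names with made-up contents, and real byte-level BPE operates on byte symbols rather than plain characters.

```python
def apply_merges(symbols, merges):
    """Apply each merge rule in order, joining every adjacent matching pair."""
    symbols = list(symbols)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # join the pair into one token
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# toy stand-ins for merges.txt (ordered pairs) and vocab.json (token -> ID)
toy_merges = [('c', 'i'), ('a', 'o'), ('ci', 'ao')]
toy_vocab = {'c': 0, 'i': 1, 'a': 2, 'o': 3, 'ci': 4, 'ao': 5, 'ciao': 6}

tokens = apply_merges('ciao', toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

The merge rules turn characters into progressively larger tokens, and the vocabulary then maps each final token to its ID.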

To load the saved tokenizer and use it much like a `from_pretrained` tokenizer, we do the following:

In [5]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# initialize the tokenizer from the vocab and merges files we trained and saved
tokenizer = ByteLevelBPETokenizer(
    './filiberto/filiberto-vocab.json',
    './filiberto/filiberto-merges.txt'
)

# set [CLS] and [SEP] to be added to start-end of sequences
tokenizer._tokenizer.post_processor = BertProcessing(
    ('[SEP]', tokenizer.token_to_id('[SEP]')),
    ('[CLS]', tokenizer.token_to_id('[CLS]'))
)

# truncate anything more than 512 tokens in length
tokenizer.enable_truncation(max_length=512)
# and enable padding to 512 too
tokenizer.enable_padding(length=512, pad_token='[PAD]')

# test our tokenizer on a simple sentence
tokens = tokenizer.encode('ciao, come va?')

In [6]:
print(tokens)

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [7]:
tokens.tokens[:10]

['[CLS]', 'ciao', ',', 'Ġcome', 'Ġva', '?', '[SEP]', '[PAD]', '[PAD]', '[PAD]']

In [8]:
tokens.ids[:10]

[1, 16834, 16, 488, 611, 35, 2, 0, 0, 0]

We can see here that our **CLS** token is now placed at the beginning of our sequences with token ID *1*. At the end of the sequence we see the **SEP** token, represented by *2*. Following this we have our **PAD** tokens (ID *0*), which pad each sequence up to a length of *512*.
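The padding positions are the ones a model should ignore. The `Encoding` object printed above already exposes an `attention_mask` attribute for this. The sketch below only shows how such a mask relates to the IDs, using the ten IDs from the output above and assuming that `[PAD]` has ID 0, as that output suggests.

```python
# the first ten token IDs shown in the output above
ids = [1, 16834, 16, 488, 611, 35, 2, 0, 0, 0]
PAD_ID = 0  # assumption: [PAD] maps to 0, as seen in the output above

# 1 marks a real token (including [CLS] and [SEP]), 0 marks padding
attention_mask = [0 if token_id == PAD_ID else 1 for token_id in ids]

# number of non-padding tokens in this slice
real_length = sum(attention_mask)
print(attention_mask, real_length)
```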