In [None]:
!pip install datasets

In [1]:
import datasets

In [4]:
all_ds = datasets.list_datasets()
print(len(all_ds))

964


There are too many datasets to print them all here, so let's view only the first *20*.

In [5]:
all_ds[:20]

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews',
 'allocine',
 'alt',
 'amazon_polarity',
 'amazon_reviews_multi',
 'amazon_us_reviews',
 'ambig_qa',
 'amttl',
 'anli',
 'app_reviews',
 'aqua_rat']

The dataset we want to use is called `'oscar'`. It does not appear among the first 20 names above, but every dataset has its own page on the HF [datasets viewer](https://huggingface.co/datasets/viewer/), where we can find more information about it.

We'll download the Latin subset of OSCAR like so:

In [8]:
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')

Downloading and preparing dataset oscar/unshuffled_deduplicated_la (download: 3.26 MiB, generated: 8.46 MiB, post-processed: Unknown size, total: 11.72 MiB) to C:\Users\James\.cache\huggingface\datasets\oscar\unshuffled_deduplicated_la\1.0.0\e4f06cecc7ae02f7adf85640b4019bf476d44453f251a1d84aebae28b0f8d51d...
Downloading: 100%|██████████| 81.0/81.0 [00:00<00:00, 81.6kB/s]
Downloading: 100%|██████████| 3.42M/3.42M [00:00<00:00, 4.19MB/s]
                                           Dataset oscar downloaded and prepared to C:\Users\James\.cache\huggingface\datasets\oscar\unshuffled_deduplicated_la\1.0.0\e4f06cecc7ae02f7adf85640b4019bf476d44453f251a1d84aebae28b0f8d51d. Subsequent calls will reuse this data.


In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 18808
    })
})

In [10]:
dataset['train'][0]

{'id': 0,
 'text': 'Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\nEcce ego adducam aquas diluvii super terram, ut interficiam omnem carnem, in qua spiritus vitæ est subter cælum: universa quæ in terra sunt, consumentur.\nTolles igitur tecum ex omnibus escis, quæ mandi possunt, et comportabis apud te: et erunt tam tibi, quam illis in cibum.'}

Now we save this data to disk as several *txt* files, with 5,000 samples per file.

In [11]:
import os

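# create the output directory (and any missing parents); no error if it already exists
os.makedirs('../../data/text/oscar_la', exist_ok=True)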

In [12]:
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # remove newlines so that each sample occupies exactly one line in the output file
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5_000:
        # once we hit the 5K mark, save to file
        with open(f'../../data/text/oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 5K chunks, 3,808 samples are left over (18,808 - 3 * 5,000); save those too
with open(f'../../data/text/oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

100%|██████████| 18808/18808 [00:00<00:00, 27943.48it/s]
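
The chunking loop above can also be packaged as a small reusable helper. What follows is a minimal sketch and not part of the original notebook: `write_chunks` and its parameters are hypothetical names, and the demo writes dummy strings to a temporary directory so that it runs without the OSCAR download. Unlike the loop above, it skips the final write when nothing is left over, which avoids an empty file when the sample count is an exact multiple of the chunk size.

```python
import os
import tempfile


def write_chunks(texts, out_dir, chunk_size=5_000):
    """Write texts to out_dir/text_{i}.txt with chunk_size samples per file.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    chunk = []

    def flush():
        # file index = number of files written so far
        path = os.path.join(out_dir, f'text_{len(paths)}.txt')
        with open(path, 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(chunk))
        paths.append(path)
        chunk.clear()

    for text in texts:
        # one sample per line, so strip embedded newlines first
        chunk.append(text.replace('\n', ''))
        if len(chunk) == chunk_size:
            flush()
    if chunk:
        # write the leftover samples, if there are any
        flush()
    return paths


# demo with 12 dummy samples and a chunk size of 5
demo_dir = tempfile.mkdtemp()
demo_paths = write_chunks((f'sample {i}' for i in range(12)), demo_dir, chunk_size=5)
print([os.path.basename(p) for p in demo_paths])
```

With the real data this would be called as `write_chunks((s['text'] for s in dataset['train']), '../../data/text/oscar_la')`.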


---

## Training Tokenizer

Now we can move on to training our tokenizer. We'll use a byte-level byte-pair encoding (BPE) tokenizer, which builds the vocabulary from an alphabet of single bytes. This means every word can be decomposed into tokens, so we will not need the unknown token.
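
To make the idea concrete, here is a toy sketch of a single byte-level BPE merge step in plain Python. It is an illustration only and not how the `tokenizers` library is implemented (the library also pre-tokenizes the text and maps bytes to printable characters); the function names are hypothetical. It shows that any string reduces to single-byte symbols, and that a merge joins the most frequent adjacent pair into one new symbol.

```python
from collections import Counter


def to_byte_symbols(text):
    """Split text into single-byte symbols taken from its UTF-8 encoding."""
    return [bytes([b]) for b in text.encode('utf-8')]


def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    return Counter(zip(symbols, symbols[1:])).most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of pair with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


text = 'Noë: Noë'
symbols = to_byte_symbols(text)
print(len(text), len(symbols))  # non-ASCII characters take more than one byte

# one merge step: the sequence gets shorter, the underlying bytes stay the same
symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)
```

Repeating the merge step until the vocabulary reaches the requested size is the core of BPE training. Because the starting alphabet is the 256 possible byte values, no input can fall outside the vocabulary.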

In [13]:
# first get list of paths to each txt file we just made
from pathlib import Path
paths = [str(x) for x in Path('../../data/text/oscar_la').glob('**/*.txt')]
paths

['..\\..\\data\\text\\oscar_la\\text_0.txt',
 '..\\..\\data\\text\\oscar_la\\text_1.txt',
 '..\\..\\data\\text\\oscar_la\\text_2.txt',
 '..\\..\\data\\text\\oscar_la\\text_3.txt']

Now we initialize and train the tokenizer, using the RoBERTa special tokens.

In [14]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# and train
tokenizer.train(files=paths, vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
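
The order of the `special_tokens` list matters. As a minimal sketch, assuming the trainer assigns IDs to special tokens in list order before any learned tokens (an assumption, though it is consistent with the IDs `0`, `1` and `2` that appear in the encoding output further down), the resulting mapping would look like this:

```python
# assumption: special tokens receive IDs 0..4 in the order they are listed
special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>']
special_ids = {token: idx for idx, token in enumerate(special_tokens)}
print(special_ids)
```

Keeping this order identical to the one RoBERTa uses means the saved vocabulary lines up with what `RobertaTokenizerFast` expects.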

We can now save the tokenizer to file, giving BERT a Latin-spin.

In [15]:
# create the model directory; no error if it already exists
os.makedirs('bertius', exist_ok=True)

tokenizer.save_model('bertius')

['bertius\\vocab.json', 'bertius\\merges.txt']

---

## Loading the Tokenizer

We've now built and saved our tokenizer. To load it as we usually would (e.g. with `from_pretrained`), all we must do is:

In [16]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('bertius')

Let's try encoding everyone's favorite Latin.

In [22]:
lorem_ipsum = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor "
    "incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud "
    "exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute "
    "irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia "
    "deserunt mollit anim id est laborum."
)

In [23]:
# we'll include the typical padding/truncation
tokenizer(lorem_ipsum, max_length=512, padding='max_length', truncation=True)

{'input_ids': [0, 3587, 653, 1601, 461, 1788, 16, 2618, 3714, 3088, 16, 398, 3702, 13754, 3727, 16099, 330, 2219, 290, 1914, 1547, 1650, 18, 1376, 412, 320, 10178, 1931, 16, 632, 13322, 23666, 2438, 6332, 9089, 691, 330, 20864, 350, 507, 7542, 10803, 18, 15644, 380, 73, 2920, 2650, 1601, 285, 2068, 285, 1604, 1256, 361, 17171, 1914, 2514, 2074, 1089, 2524, 18, 15442, 28801, 909, 24536, 30305, 312, 20856, 16, 338, 285, 2527, 366, 2573, 3045, 17797, 581, 462, 297, 3562, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

The output contains two lists, `input_ids` and `attention_mask`. In `input_ids` we see the start-of-sequence token **<s\>** represented by `0`, the end-of-sequence token **</s\>** represented by `2`, and padding tokens **<pad\>** represented by `1`.
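
The relationship between padding and `attention_mask` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, the pad ID of `1` is taken from the output above, and the short ID sequence is made up for the demo. The real tokenizer also keeps the special tokens in place when truncating, which this sketch does not attempt.

```python
PAD_ID = 1  # assumption: ID of <pad>, as seen in the output above


def pad_and_mask(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, pad on the right, and build the attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }


# illustrative sequence: <s>, three word-piece IDs, </s>
encoded = pad_and_mask([0, 3587, 653, 1601, 2], max_length=8)
print(encoded['input_ids'])       # [0, 3587, 653, 1601, 2, 1, 1, 1]
print(encoded['attention_mask'])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask is what lets the model tell the padding positions apart from real input, so padding never contributes to attention.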