### Tokenizer

Now let's re-create our tokenizer with `transformers`.

In [2]:
from pathlib import Path

# From Hugging Face Hub
tokenizer_dir = "phunc20/esperoberta-cased"

In [3]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(
    tokenizer_dir,
)
tokenizer.model_max_length

512

## Let's Build Our Dataset
We shall follow this tutorial: <https://huggingface.co/course/chapter7/3?fw=pt#fine-tuning-a-masked-language-model>

### Dataset from texts

In [4]:
dataset_dir = Path.cwd()/"../../../data/"

text_paths = [
    dataset_dir/"oscar.eo.txt",
]
assert all(p.exists() for p in text_paths), "some path doesn't exist! Check again, please!"

In [5]:
from datasets import load_dataset

In [6]:
data_files = [str(p) for p in text_paths]
dataset = load_dataset(
    "text",
    data_files=data_files
)
dataset

Downloading and preparing dataset text/default to /home/phunc20/.cache/huggingface/datasets/text/default-80c45ae5768c770c/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /home/phunc20/.cache/huggingface/datasets/text/default-80c45ae5768c770c/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 974372
    })
})

In [7]:
random_samples = dataset["train"].shuffle(seed=42).select(range(3))
for sample in random_samples:
    print(f'>>> {sample["text"][:50]}')

>>> La teksto disponeblas laŭ la permesilo Krea Komuna
>>> 625.1 Unuformaj kutimoj kaj praktiko de ICC por do
>>> Iuj predikatoj montras agon, kiu celas influi la a


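Since we will concatenate the tokenized texts and regroup them into fixed-size chunks, there is in principle **no need to**

- restrict maximum length
- truncate
- return overflowing tokens

Note that the `tokenize` function below still passes `truncation=True` with `max_length=tokenizer.model_max_length`, so any sample longer than 512 tokens is cut at that length.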

### Dataset in token ids

**(?)** Do we need `word_ids`?  
**(R)** It depends on whether you'd like to do **whole-word masking** or not.
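
To make the role of `word_ids` more concrete, here is a minimal sketch of how they could be used to find which token positions belong to the same word, which is the information whole-word masking needs. The helper name `group_tokens_by_word` and the toy `word_ids` list are hypothetical; the only assumption taken from the tokenizer is that special tokens get `None` as their word id.

```python
def group_tokens_by_word(word_ids):
    """Group consecutive token positions that share the same word id.

    `word_ids` has one entry per token, with `None` for special tokens.
    Runs are detected by change of word id, so the function also works
    when several samples (each restarting at word 0) sit back to back.
    """
    groups = []
    previous = None
    for position, word_id in enumerate(word_ids):
        if word_id is None:
            # Special token: it belongs to no word and ends the current run.
            previous = None
            continue
        if word_id != previous:
            groups.append([position])    # a new word starts here
        else:
            groups[-1].append(position)  # same word as the previous token
        previous = word_id
    return groups


# Hypothetical word ids: two samples back to back, each wrapped in special tokens.
toy_word_ids = [None, 0, 1, 1, 1, 2, None, None, 0, 0, None]
word_groups = group_tokens_by_word(toy_word_ids)
print(word_groups)  # [[1], [2, 3, 4], [5], [8, 9]]
```

Masking a whole word then amounts to picking one of these groups and masking every position in it, instead of sampling positions independently.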

In [14]:
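def tokenize(examples):
    # Tokenize a batch of texts; samples longer than model_max_length are truncated.
    tokenized_examples = tokenizer(
        examples["text"],
        max_length=tokenizer.model_max_length,
        truncation=True,
        #return_overflowing_tokens=True,
    )
    if tokenizer.is_fast:
        # word_ids are only available with fast tokenizers; keep them for whole-word masking.
        n_samples = len(tokenized_examples["input_ids"])
        tokenized_examples["word_ids"] = [
            tokenized_examples.word_ids(i) for i in range(n_samples)
        ]
    return tokenized_examples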

In [15]:
tokenized_random_samples = tokenize(random_samples)
type(tokenized_random_samples)

transformers.tokenization_utils_base.BatchEncoding

In [16]:
tokenized_random_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [17]:
for k in tokenized_random_samples:
    print(f'{k = :>15s}: {len(tokenized_random_samples[k]) = }')

k =       input_ids: len(tokenized_random_samples[k]) = 3
k =  attention_mask: len(tokenized_random_samples[k]) = 3
k =        word_ids: len(tokenized_random_samples[k]) = 3


In [18]:
for i in range(3):
    input_ids = tokenized_random_samples["input_ids"][i]
    print(f'{len(input_ids) = }')
    print(tokenizer.decode(
        input_ids,
        skip_special_tokens=True,
    ))
    #print(tokenized_random_samples["overflow_to_sample_mapping"][i])
    print()

len(input_ids) = 29
La teksto disponeblas laŭ la permesilo Krea Komunaĵo Atribuite-Samkondiĉe 3.0 Neadaptita; eble aldonaj kondiĉoj aplikeblas. Vidu la uzkondiĉojn por detaloj.

len(input_ids) = 37
625.1 Unuformaj kutimoj kaj praktiko de ICC por dokumentaj kreditoj : Mondvaste agnoskita aro da reguloj reganta la uzon de la dokumenta kredito en internacia komerco

len(input_ids) = 128
Iuj predikatoj montras agon, kiu celas influi la agadon de alia persono (aŭ afero). Tiaj verboj estas ekz. (mal)permesi, ordoni, doni, destini, peti, instrui, instrukcii, devigi, lasi, inviti, voki, sendi, (mal)konsili, komandi, konvinki, persvadi, memorigi kaj (mal)rekomendi. La influata persono (aŭ afero) aperas ĉe tiaj verboj kiel al-adjekto aŭ N-adjekto. Kiam infinitivo aperas kune kun tia verbo, la senca subjekto de la infinitivo estas la influata persono aŭ afero:



In [19]:
tokenized_dataset = dataset.map(
    tokenize,
    batched=True,
    #remove_columns=["text"],
)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 974372
    })
})

In [20]:
concatenated_samples = {
    k: sum(v, start=[])
    for k, v in tokenized_random_samples.items() if k not in (
        "overflow_to_sample_mapping",
        "text",
    )}
print(f'{type(concatenated_samples)}')

<class 'dict'>


In [21]:
concatenated_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [22]:
for k in concatenated_samples.keys():
    print(f'{k = }')
    print(f'{len(concatenated_samples[k])}')
    print()

k = 'input_ids'
194

k = 'attention_mask'
194

k = 'word_ids'
194



In [23]:
type(concatenated_samples["attention_mask"])

list

Let's check whether the `attention_mask` values are all `1`s.

In [24]:
all(m == 1 for m in concatenated_samples["attention_mask"])

True

In [25]:
concatenated_samples["attention_mask"] == [1]*len(concatenated_samples["attention_mask"])

True

Let's verify that we concatenated them correctly:

In [26]:
concatenated_samples["input_ids"] == \
    sum(tokenized_random_samples["input_ids"], [])

True

In [27]:
concatenated_text = tokenizer.decode(
    concatenated_samples["input_ids"],
    skip_special_tokens=True,
)
print(f'{len(concatenated_text) = }')
print(f'{concatenated_text[:100] = }')

len(concatenated_text) = 796
concatenated_text[:100] = 'La teksto disponeblas laŭ la permesilo Krea Komunaĵo Atribuite-Samkondiĉe 3.0 Neadaptita; eble aldon'


In [28]:
"".join(random_samples["text"]) == concatenated_text

True

In [29]:
" ".join(random_samples["text"]) == concatenated_text

False

In [30]:
"</s><s>".join(random_samples["text"]) == concatenated_text

False

In [31]:
for special_token in tokenizer.special_tokens_map.values():
    if special_token in concatenated_text:
        print(f'{special_token} in it')

In [32]:
len("".join(random_samples["text"])), len(concatenated_text)

(796, 796)

In [33]:
for input_ids in tokenized_random_samples["input_ids"]:
    print(len(input_ids))

29
37
128


In [34]:
tokenized_random_samples["input_ids"][0][-1]

2

In [35]:
sum(len(input_ids) for input_ids in tokenized_random_samples["input_ids"])

194

In [36]:
len(concatenated_samples["input_ids"])

194
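
The length checks above can be reproduced on toy data. The sketch below assumes the usual RoBERTa convention that `<s>` has id 0 and `</s>` has id 2 (only the `2` is confirmed by the output above); the token ids in `toy_samples` are made up for illustration.

```python
BOS_ID, EOS_ID = 0, 2  # assumed ids for <s> and </s> (RoBERTa convention)

# Hypothetical token ids for three short samples, without special tokens.
toy_samples = [[11, 12], [21, 22, 23], [31]]

# Wrap each sample in <s> ... </s>, as the tokenizer does, then concatenate.
wrapped = [[BOS_ID, *ids, EOS_ID] for ids in toy_samples]
concatenated = [token_id for ids in wrapped for token_id in ids]

# The concatenated length equals the sum of the per-sample lengths.
assert len(concatenated) == sum(len(ids) for ids in wrapped)

# Dropping the special ids leaves the samples back to back, with no separator.
without_special = [t for t in concatenated if t not in (BOS_ID, EOS_ID)]
print(concatenated)     # [0, 11, 12, 2, 0, 21, 22, 23, 2, 0, 31, 2]
print(without_special)  # [11, 12, 21, 22, 23, 31]
```

This mirrors why decoding with `skip_special_tokens=True` matches a join with the empty string rather than with `" "` or `"</s><s>"`.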

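`"".join(random_samples["text"])` and `concatenated_text` turn out to be identical here:
- the special tokens between samples are dropped by `skip_special_tokens=True`, which is why joiners such as `" "` or `"</s><s>"` do not match
- a difference could still arise if the tokenizer's normalizer stripped diacritics, so the next cells apply the normalizer explicitly and compare again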

In [37]:
"".join(random_samples["text"])[:50]

'La teksto disponeblas laŭ la permesilo Krea Komuna'

In [38]:
concatenated_text[:50]

'La teksto disponeblas laŭ la permesilo Krea Komuna'

In [39]:
concatenated_with_accent = "".join(random_samples["text"])
concatenated_without_accent = tokenizer.backend_tokenizer.normalizer.normalize_str(
    concatenated_with_accent
)
len(concatenated_without_accent)

796

In [40]:
concatenated_without_accent == concatenated_text

True

Great!
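
For comparison, this is roughly what an accent-stripping normalizer would do to the same kind of text. It is a standard-library sketch only, not the configuration of this tokenizer's normalizer; the helper name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the breve and circumflex used in Esperanto.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sample = "laŭ la permesilo Krea Komunaĵo Atribuite-Samkondiĉe"
stripped = strip_accents(sample)
print(stripped)  # lau la permesilo Krea Komunajo Atribuite-Samkondice
```

With such a normalizer, the decoded text would no longer equal the original text, so comparing against the normalized string is the safer check.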

### Let's Chunk and Roll!

In [41]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 974372
    })
})

In [42]:
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 974372
    })
})

In [43]:
def group(
    examples,
    *,
    chunk_size: int = 128,
):
    """
    Args:
        examples
            think of it as a batch (of, by default, 1000 instances)
    """
    concatenated_examples = {
        k: sum(v, start=[])
        for k, v in examples.items() if k not in (
            "overflow_to_sample_mapping",
            "text",
        )
    }
    length = len(concatenated_examples["input_ids"])
    # Drop the last chunk if its size < chunk_size
    length = length - (length % chunk_size)
    chunked_examples = {
        k: [v[i: i+chunk_size] for i in range(0, length, chunk_size)]
        for k, v in concatenated_examples.items()
    }
    chunked_examples["labels"] = chunked_examples["input_ids"].copy()
    return chunked_examples
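
To see what the grouping step does, here is a small self-contained sketch of the same idea on toy data. `chunk_batch` is a hypothetical stand-in for `group` (it uses `itertools.chain` instead of `sum(v, start=[])`, which avoids repeated list copying), and the toy ids are made up.

```python
from itertools import chain


def chunk_batch(batch, chunk_size):
    """Concatenate every column of a batch, then cut it into fixed-size chunks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in batch.items()}
    length = len(concatenated["input_ids"])
    usable = length - (length % chunk_size)  # tokens past this point are dropped
    chunks = {
        k: [v[i:i + chunk_size] for i in range(0, usable, chunk_size)]
        for k, v in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later, at collation time.
    chunks["labels"] = [list(c) for c in chunks["input_ids"]]
    return chunks


toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_batch(toy_batch, chunk_size=5)
print(chunks["input_ids"])  # [[0, 11, 12, 2, 0], [21, 2, 0, 31, 32]]
```

The 12 toy tokens yield two chunks of 5, and the last 2 tokens are dropped. With `batched=True`, this remainder is dropped once per batch rather than once for the whole dataset, so the final number of rows depends on the batch size.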

In [44]:
lm_dataset = tokenized_dataset.map(
    group,
    batched=True,
    #remove_columns=["text"],
)
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 590752
    })
})

In [46]:
for s in lm_dataset["train"].shuffle(seed=42).select(range(10)):
    print(len(s["input_ids"]))

Loading cached shuffled indices for dataset at /home/phunc20/.cache/huggingface/datasets/text/default-80c45ae5768c770c/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-2b6a379481d2331f.arrow


128
128
128
128
128
128
128
128
128
128


## Split into train/val

Cf. <https://huggingface.co/docs/datasets/process>

**(?)** Why `1/3`?  
- <https://discuss.huggingface.co/t/working-with-large-datasets/1876/4>
- <https://www.digitalocean.com/community/tutorials/understanding-database-sharding>

## Upload to Hugging Face Hub

In [47]:
hub_repo = "phunc20/oscar_esperoberta-cased_dataset"

In [48]:
lm_dataset["train"].push_to_hub(hub_repo)


KeyboardInterrupt



## Reload to Double-Check

In [49]:
from datasets import load_dataset

In [50]:
reloaded_dataset = load_dataset(hub_repo)
reloaded_dataset

Downloading and preparing dataset parquet/phunc20--oscar_esperoberta-cased_dataset to /home/phunc20/.cache/huggingface/datasets/phunc20___parquet/phunc20--oscar_esperoberta-cased_dataset-19abbf324e90b9e8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to /home/phunc20/.cache/huggingface/datasets/phunc20___parquet/phunc20--oscar_esperoberta-cased_dataset-19abbf324e90b9e8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 295376
    })
})

In [52]:
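# Assumption: the cell that defined `train_samples` is not shown; slicing the
# first three rows gives a dict of lists that matches the output below.
train_samples = reloaded_dataset["train"][:3]
for k in train_samples:
    print(f'{k}')
    if isinstance(train_samples[k], list):
        for L in train_samples[k]:
            if isinstance(L, list):
                print(f'{len(L) = }')
            else:
                print(f'{L = }')
    print()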

input_ids
len(L) = 128
len(L) = 128
len(L) = 128

attention_mask
len(L) = 128
len(L) = 128
len(L) = 128

word_ids
len(L) = 128
len(L) = 128
len(L) = 128

labels
len(L) = 128
len(L) = 128
len(L) = 128
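
The visual check above could also be automated. The following is a minimal sketch, assuming the batch is a plain dict of lists of lists (the shape you get when slicing a dataset); `check_chunked_batch` is a hypothetical helper, not part of `datasets`.

```python
def check_chunked_batch(batch, chunk_size=128):
    """Return a list of problems found in a dict of columns (lists of lists)."""
    problems = []
    for name, rows in batch.items():
        for i, row in enumerate(rows):
            if len(row) != chunk_size:
                problems.append(f"{name}[{i}] has length {len(row)}")
    # Before masking, labels are expected to be a plain copy of input_ids.
    if "labels" in batch and batch["labels"] != batch.get("input_ids"):
        problems.append("labels differ from input_ids")
    return problems


# Toy batches with chunk_size=3, standing in for a slice of the reloaded dataset.
good = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "labels": [[1, 2, 3], [4, 5, 6]],
}
bad = {
    "input_ids": [[1, 2, 3], [4, 5]],
    "labels": [[1, 2, 3], [4, 5, 6]],
}
print(check_chunked_batch(good, chunk_size=3))  # []
print(check_chunked_batch(bad, chunk_size=3))
```

An empty list means every column has rows of the expected length and the labels match the inputs.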

