In [24]:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32000

# size of the input data
input_sentence_size = None

cache_dir="/workdir/.cache/huggingface/datasets"

# using a tiny bit of the dataset to train a tokenizer
ds = datasets.load_dataset("oscar",  
    name="unshuffled_deduplicated_no", 
    cache_dir=cache_dir, 
    split="train[:100]"
    )

len(ds)

Reusing dataset oscar (/workdir/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_no/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2)


100

In [47]:
import os
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

def batch_iterator(ds, input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(ds)
    batch_size = 100
    for i in range(0, input_sentence_size, batch_size):
        yield ds[i: i+batch_size]["text"]

tokenizer.train_from_iterator(
    iterator=batch_iterator(ds),
    vocab_size=vocab_size,
    show_progress=True
)

t5_config_dir = "/workdir/norwegian-t5-base/"
os.makedirs(t5_config_dir, exist_ok=True)

tokenizer.save(os.path.join(t5_config_dir, "tokenizer.json"))





In [48]:
tokenizer.get_vocab_size()

6152

In [33]:

from transformers import T5Config

config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size())
config.save_pretrained(t5_config_dir)

In [5]:
from transformers import AutoTokenizer
from transformers import T5Config
# load the Rust-based (fast) tokenizer saved in the model directory
cache_dir="/workdir/norwegian-t5-base"
tokenizer = AutoTokenizer.from_pretrained(
    cache_dir, 
    cache_dir=cache_dir,
    use_fast=True,
    use_auth_token=None
)




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
from transformers import T5Config
config = T5Config.from_pretrained(
    "/workdir/norwegian-t5-base/",
    cache_dir=cache_dir,
    vocab_size = len(tokenizer),
)

Replicating T5 experiments
I should answer the following questions when looking at the implementation:

- how span masking is implemented
- exactly how the loss is computed
    - implementation details of the decoder
    - [do the simplest example](Dummy Examples)
- inputs and labels of the span-masked pre-training data
- modeling details / optimizer / learning-rate scheduling
- how metrics logging is implemented


# Span-masked language modeling

Spans of the input sequence are masked by so-called sentinel tokens (unique mask tokens), and the output sequence
is formed by concatenating the same sentinel tokens with the real tokens that were masked out. For example, if the input sequence is

`The bad dog ruined my sleep`

We can mask out `bad dog` and ask the model to predict it. The sequence we will feed to the encoder is 

`The <extra_id_0> ruined my sleep`

The label we use to compute the loss is 

`<extra_id_0> bad dog <extra_id_1>`

T5-like span-masked language models fuse consecutively masked tokens into a single sentinel token.
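
To make the example concrete, here is a minimal word-level sketch of the same transformation. It is only an illustration: the helper `span_mask_words` and its `spans` argument are hypothetical names, and the real collator works on token ids rather than words.

```python
from typing import List, Tuple


def span_mask_words(words: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) word span with one sentinel and build the matching label.

    Hypothetical helper for illustration only; `spans` must be sorted and non-overlapping.
    """
    encoder_input: List[str] = []
    label: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        encoder_input.extend(words[cursor:start])  # keep the unmasked words
        encoder_input.append(sentinel)             # one sentinel per masked span
        label.append(sentinel)                     # the label repeats the sentinel ...
        label.extend(words[start:end])             # ... followed by the masked words
        cursor = end
    encoder_input.extend(words[cursor:])           # unmasked tail of the sentence
    label.append(f"<extra_id_{len(spans)}>")       # closing sentinel, as in the example above
    return " ".join(encoder_input), " ".join(label)


encoder_input, label = span_mask_words("The bad dog ruined my sleep".split(), [(1, 3)])
print(encoder_input)  # The <extra_id_0> ruined my sleep
print(label)          # <extra_id_0> bad dog <extra_id_1>
```

Note that the two masked words `bad dog` share one sentinel, which is the fusing behaviour described above.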


In [11]:
tokenizer("<extra_id_0>", add_special_tokens=False)

{'input_ids': [6152], 'attention_mask': [1]}

In [14]:
print("pad token id: ", config.pad_token_id)
print("decoder start token id: ", config.decoder_start_token_id)


pad token id:  0
decoder start token id:  0


The following parameters are needed to create span-masked inputs and labels:

- tokenizer: PreTrainedTokenizerBase
A pretrained tokenizer whose vocabulary includes the `<extra_id_*>` sentinel tokens.

- noise_density: float = 0.15 (`data_args.mlm_probability`)
The probability of masking out a token.

- mean_noise_span_length: float (`data_args.mean_noise_span_length`)
Average length of a masked span.

- input_length: int
Maximum input sequence length. Defined as:
```
max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
```

- target_length: int
Target sequence length. This quantity depends on `max_seq_length` and is computed
by `compute_input_and_target_lengths`.

- pad_token_id: int (`model.config.pad_token_id`)
Id of the pad token. 0 for T5.

- decoder_start_token_id: int (`model.config.decoder_start_token_id`)
Start token of the sequence fed into the decoder. 0 for T5.
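
The dependence of `target_length` on `max_seq_length` can be sketched as follows. This is a pure-Python reconstruction of the logic behind `compute_input_and_target_lengths` as I understand it, not a copy of the script. It omits edge-case handling, and the inner helper name `lengths_for` and the example values (512, 0.15, 3.0) are assumptions.

```python
from typing import Tuple


def compute_input_and_target_lengths(
    inputs_length: int, noise_density: float, mean_noise_span_length: float
) -> Tuple[int, int]:
    """Find the raw token count whose span-masked form fits into `inputs_length`.

    Returns (raw_tokens_length, targets_length).
    """

    def lengths_for(tokens_length: int) -> Tuple[int, int]:
        num_noise_tokens = int(round(tokens_length * noise_density))
        num_nonnoise_tokens = tokens_length - num_noise_tokens
        num_noise_spans = int(round(num_noise_tokens / mean_noise_span_length))
        # encoder input: unmasked tokens + one sentinel per span + EOS
        encoder_length = num_nonnoise_tokens + num_noise_spans + 1
        # target: masked tokens + one sentinel per span + EOS
        target_length = num_noise_tokens + num_noise_spans + 1
        return encoder_length, target_length

    tokens_length = inputs_length
    # grow the raw length for as long as the masked input still fits
    while lengths_for(tokens_length + 1)[0] <= inputs_length:
        tokens_length += 1
    _, targets_length = lengths_for(tokens_length)
    return tokens_length, targets_length


expanded_inputs_length, target_length = compute_input_and_target_lengths(512, 0.15, 3.0)
print(expanded_inputs_length, target_length)  # 568 114
```

With these values, 568 raw tokens shrink to 512 encoder tokens after masking, and the decoder target is 114 tokens long.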


### Input type to `FlaxDataCollatorForT5MLM`
Span masking is implemented in `FlaxDataCollatorForT5MLM`. The input to its `__call__` method is
a `List[Dict[str, np.ndarray]]`, i.e. a batch of examples, each with the signature:
```
{
    "input_ids": [..., token ids,...],
    "masks": np.array,
}
```
The instance of `FlaxDataCollatorForT5MLM` is referred to as
`data_collator` in the code. I can check what kind of input is fed into `data_collator` at line 895.
The `tokenized_datasets` object from which we generate the batches is a standard interface in Hugging Face.

### Hugging Face Dataset
The `tokenized_datasets` object is defined as follows. This pattern is common to many language-model training
scripts in Hugging Face.
```python
# line 557
datasets = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    cache_dir=model_args.cache_dir,
    use_auth_token=True if model_args.use_auth_token else None,
)

# line 667
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

# line 706
tokenized_datasets = tokenized_datasets.map(
    group_texts, # concatenate all texts from our dataset and generate chunks of expanded_inputs_length.
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
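
The `group_texts` step is what makes every example handed to the collator the same length. Below is a minimal sketch of such a function, based on my reading of the inline comment above rather than on the script's exact code. It assumes `expanded_inputs_length` is the raw chunk length and uses a toy value of 4.

```python
from itertools import chain
from typing import Dict, List

expanded_inputs_length = 4  # toy value for illustration; the real one is much larger


def group_texts(examples: Dict[str, List[List[int]]]) -> Dict[str, List[List[int]]]:
    """Concatenate all tokenized texts, then cut them into fixed-size chunks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # drop the tail that does not fill a whole chunk
    total_length = (total_length // expanded_inputs_length) * expanded_inputs_length
    return {
        k: [t[i : i + expanded_inputs_length] for i in range(0, total_length, expanded_inputs_length)]
        for k, t in concatenated.items()
    }


toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
chunks = group_texts(toy_batch)
print(chunks)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

Document boundaries are ignored: the three toy texts become two chunks of four ids, and the leftover id `9` is discarded.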
Now that we know what the input to the `__call__` method of `FlaxDataCollatorForT5MLM` looks like, let's dive into
its implementation details:

```python
# Convert a list of dicts into a dict of batched arrays, one array per key.
# BatchEncoding is a wrapper around the input data:
# https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/tokenizer#transformers.BatchEncoding
batch = BatchEncoding(
    {k: np.array([examples[i][k] for i in range(len(examples))]) for k, v in examples[0].items()}
)

input_ids = batch["input_ids"]
batch_size, expandend_input_length = input_ids.shape
```
Some interesting stuff happens below: given the length of an input sequence, we decide the indices
of the tokens to be masked. Note that for span-masked language modeling, the masked tokens need to be
locally contiguous.

```python
mask_indices = np.asarray([self.random_spans_noise_mask(expandend_input_length) for i in range(batch_size)])
labels_mask = ~mask_indices
```
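
To see what "locally contiguous" means in practice, here is a simplified pure-Python sketch that draws such a mask with the standard `random` module. The function names, the `seed` argument, and the default values are assumptions for illustration; the collator itself works with NumPy arrays.

```python
import random
from typing import List


def random_segmentation(num_items: int, num_segments: int, rng: random.Random) -> List[int]:
    """Split `num_items` into `num_segments` positive lengths, chosen at random."""
    cuts = sorted(rng.sample(range(1, num_items), num_segments - 1))
    bounds = [0] + cuts + [num_items]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def contiguous_span_mask(
    length: int, noise_density: float = 0.15, mean_noise_span_length: float = 3.0, seed: int = 0
) -> List[bool]:
    """Return a boolean mask in which True marks tokens inside a masked span."""
    rng = random.Random(seed)
    num_noise_tokens = min(max(int(round(length * noise_density)), 1), length - 1)
    num_noise_spans = max(int(round(num_noise_tokens / mean_noise_span_length)), 1)
    num_nonnoise_tokens = length - num_noise_tokens
    noise_lengths = random_segmentation(num_noise_tokens, num_noise_spans, rng)
    nonnoise_lengths = random_segmentation(num_nonnoise_tokens, num_noise_spans, rng)
    mask: List[bool] = []
    # alternate unmasked and masked runs, starting with an unmasked one
    for nonnoise, noise in zip(nonnoise_lengths, noise_lengths):
        mask.extend([False] * nonnoise)
        mask.extend([True] * noise)
    return mask


mask = contiguous_span_mask(40)
labels_mask = [not m for m in mask]  # complement, like `~mask_indices` above
print("".join("X" if m else "." for m in mask))  # X = masked token, . = kept token
```

For a length of 40 this masks 6 tokens in 2 runs, so the masked positions always come in contiguous blocks rather than being scattered individually.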

Let's look at how `random_spans_noise_mask` is implemented. Note that the implementation is a clone
of [Google's original implementation](https://github.com/google-research/text-to-text-transfer-transformer/blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2682).




In [None]:
class FlaxDataCollatorForT5MLM:
    def __init__