Reserved special tokens #77

Open
mgerstgrasser opened this issue Apr 19, 2024 · 7 comments

Comments

@mgerstgrasser

Apologies in case this is documented somewhere and I missed it:

I notice that there are 250 "reserved special tokens" defined in the tokenizer. Is there any information available on what these are meant for, and what users are supposed to (not) do with them? For instance, could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary), or would that be problematic?

Thanks so much!

@ruanslv
Contributor

ruanslv commented Apr 19, 2024

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes, you can -- this is why they were added: to support more use cases without requiring a vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
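If you want to try this with the Hugging Face version of the Llama 3 tokenizer, a minimal sketch could look like the following (the model id and the choice of <|reserved_special_token_10|> are just examples, not an official recommendation):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # example id

# Reserved special tokens are already single entries in the vocab, so no resize is needed.
tag = "<|reserved_special_token_10|>"
ids = tokenizer.encode(tag, add_special_tokens=False)
assert len(ids) == 1, ids

# Use the reserved token directly in fine-tuning text.
sample = f"{tag} some fine-tuning example ..."
print(tokenizer.encode(sample, add_special_tokens=False))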

@AlienKevin

@ruanslv Can you please say more about which reserved special tokens are already used? Based on the tokenizer code you linked, it seems that <|reserved_special_token_0|> to <|reserved_special_token_4|> are separated from the rest of the special tokens. However, I can't find any mention of their current usage or significance in the doc.

@AlienKevin

I used some reserved special tokens with indices higher than 10 in my fine-tuning corpus as language tags. The training was done with QLoRA, and the embedding layer was also fine-tuned. However, the model never converged and the validation loss stayed constant. Interestingly, after I switched to adding new special tokens, the loss immediately started to decrease. Does this have anything to do with the initial value of the reserved token embeddings?
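(A quick way to check whether a reserved token's embedding is effectively untrained is to inspect its row in the embedding matrix directly; a rough sketch, with the model id only as an example:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

token_id = tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>")
row = model.get_input_embeddings().weight[token_id]
print(row.norm().item())  # a norm near 0 suggests the embedding is untrained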

@AlienKevin

https://twitter.com/danielhanchen/status/1781395882925343058

It seems that some of Llama 3's token embeddings are untrained (set to 0 or very close to 0):
<|reserved_special_token_{0->250}|>
<|eot_id|>
<|start_header_id|>
<|end_header_id|>

Unsloth added a fix_untrained_tokens helper that resets the untrained token embeddings to the mean of the trained ones:

import torch


def fix_untrained_tokens(model, eps=1e-16):
    """
    Llama 3's base model, for example, has untrained vectors for some tokens,
    including <|eot_id|>, <|start_header_id|>, <|end_header_id|> and the
    reserved special tokens. Reset them to the mean of the trained tokens.
    """
    embedding_matrix = model.get_input_embeddings().weight.data
    lm_head_matrix = model.get_output_embeddings().weight.data

    # A row whose largest entry is <= eps is treated as untrained.
    indicator_untrained = torch.amax(embedding_matrix, dim=1) <= eps
    where_untrained = torch.where(indicator_untrained)[0]
    n_untrained = where_untrained.shape[0]
    n_trained = embedding_matrix.shape[0] - n_untrained
    if n_untrained != 0:
        print(
            f"Unsloth: Not an error, but your model has {n_untrained} untrained tokens.\n"
            "We shall set them to the mean of the other trained tokens."
        )

    # Zero the untrained rows first -- sometimes they are tiny non-zero values
    # (e.g. ~1e-23 in bfloat16) rather than exact zeros.
    embedding_matrix[where_untrained] = 0
    lm_head_matrix[where_untrained] = 0

    # Sum over all rows (the untrained ones are now zero) ...
    sum_embedding = torch.sum(embedding_matrix, dtype=torch.float32, dim=0)
    sum_lm_head = torch.sum(lm_head_matrix, dtype=torch.float32, dim=0)

    # ... and divide by the number of trained rows to get the correct mean.
    mean_embedding = (sum_embedding / n_trained).to(embedding_matrix.dtype)
    mean_lm_head = (sum_lm_head / n_trained).to(lm_head_matrix.dtype)

    # Overwrite the untrained rows with the mean of the trained ones.
    embedding_matrix[where_untrained] = mean_embedding
    lm_head_matrix[where_untrained] = mean_lm_head

    return mean_embedding, mean_lm_head
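
A possible way to apply this helper before fine-tuning (just a sketch, assuming a standard Hugging Face causal LM; the model id is only an example):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16  # example id
)
fix_untrained_tokens(model)  # resets near-zero embedding rows to the trained mean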

@disperaller

If this is how Llama 3 was pretrained, then during SFT should we include these special tokens (<|eot_id|>, <|start_header_id|>, etc.), i.e. leave them unmasked in the attention_mask?

@NivinaNull

Could you please explain which special token works as a sep token, or which special character can be used as a separator?

@Ben-Pfirsich

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes, you can -- this is why they were added: to support more use cases without requiring a vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74

Is this the preferred solution over just adding new tokens and extending the vocabulary? I would also like to have some kind of separator token. Is there any reason to use an existing special token over a new one?
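
(For context, a rough sketch of the two options with the Hugging Face API; the model id and token names are just examples, not a recommendation:)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Option A: reuse an existing reserved token as a separator -- no resize needed.
sep_token = "<|reserved_special_token_20|>"  # example choice

# Option B: add a brand-new token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})
model.resize_token_embeddings(len(tokenizer))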
