Reserved special tokens #77

Open
mgerstgrasser opened this issue Apr 19, 2024 · 7 comments

Comments

@mgerstgrasser

Apologies in case this is documented somewhere and I missed it:

I notice that there are 250 "reserved special tokens" defined in the tokenizer. Is there any information available on what these are meant for, and what users are supposed to (not) do with them? For instance, could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary), or would that be problematic?

Thanks so much!

@ruanslv
Contributor

ruanslv commented Apr 19, 2024

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes, you can -- this is why they were added: to support more use cases without requiring a vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
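If you want to try this with the Hugging Face version of the Llama 3 tokenizer, a minimal sketch could look like the following (the model id and the choice of <|reserved_special_token_10|> are just examples, not an official recommendation):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # example id

# Reserved special tokens are already single entries in the vocab, so no resize is needed.
tag = "<|reserved_special_token_10|>"
ids = tokenizer.encode(tag, add_special_tokens=False)
assert len(ids) == 1, ids

# Use the reserved token directly in fine-tuning text.
sample = f"{tag} some fine-tuning example ..."
print(tokenizer.encode(sample, add_special_tokens=False))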

@AlienKevin

@ruanslv Can you please say more about which reserved special tokens are already used? Based on the tokenizer code you linked, it seems that <|reserved_special_token_0|> to <|reserved_special_token_4|> are separated from the rest of the special tokens. However, I can't find any mention of their current usage or significance in the doc.

@AlienKevin

I used some reserved special tokens with indices higher than 10 in my fine-tuning corpus as language tags. The training was done with QLoRA, and the embedding layer was also fine-tuned. However, the model never converged and the validation loss stayed constant. Interestingly, after I switched to adding new special tokens, the loss immediately started to decrease. Does this have anything to do with the initial value of the reserved token embeddings?
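(A quick way to check whether a reserved token's embedding is effectively untrained is to inspect its row in the embedding matrix directly; a rough sketch, with the model id only as an example:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

token_id = tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>")
row = model.get_input_embeddings().weight[token_id]
print(row.norm().item())  # a norm near 0 suggests the embedding is untrained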

@AlienKevin

https://twitter.com/danielhanchen/status/1781395882925343058

It seems that some of Llama 3's token embeddings are untrained (set to 0 or very close to 0):
<|reserved_special_token_{0->250}|>
<|eot_id|>
<|start_header_id|>
<|end_header_id|>

Unsloth added a fix_untrained_tokens helper that resets the untrained token embeddings to the mean of the trained ones:

import torch


def fix_untrained_tokens(model, eps=1e-16):
    """
    Llama 3's base model, for example, has untrained vectors for some tokens,
    including <|eot_id|>, <|start_header_id|>, <|end_header_id|> and the
    reserved special tokens. Reset them to the mean of the trained tokens.
    """
    embedding_matrix = model.get_input_embeddings().weight.data
    lm_head_matrix = model.get_output_embeddings().weight.data

    # A row whose largest entry is <= eps is treated as untrained.
    indicator_untrained = torch.amax(embedding_matrix, dim=1) <= eps
    where_untrained = torch.where(indicator_untrained)[0]
    n_untrained = where_untrained.shape[0]
    n_trained = embedding_matrix.shape[0] - n_untrained
    if n_untrained != 0:
        print(
            f"Unsloth: Not an error, but your model has {n_untrained} untrained tokens.\n"
            "We shall set them to the mean of the other trained tokens."
        )

    # Zero the untrained rows first -- sometimes they are tiny non-zero values
    # (e.g. ~1e-23 in bfloat16) rather than exact zeros.
    embedding_matrix[where_untrained] = 0
    lm_head_matrix[where_untrained] = 0

    # Sum over all rows (the untrained ones are now zero) ...
    sum_embedding = torch.sum(embedding_matrix, dtype=torch.float32, dim=0)
    sum_lm_head = torch.sum(lm_head_matrix, dtype=torch.float32, dim=0)

    # ... and divide by the number of trained rows to get the correct mean.
    mean_embedding = (sum_embedding / n_trained).to(embedding_matrix.dtype)
    mean_lm_head = (sum_lm_head / n_trained).to(lm_head_matrix.dtype)

    # Overwrite the untrained rows with the mean of the trained ones.
    embedding_matrix[where_untrained] = mean_embedding
    lm_head_matrix[where_untrained] = mean_lm_head

    return mean_embedding, mean_lm_head
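
A possible way to apply this helper before fine-tuning (just a sketch, assuming a standard Hugging Face causal LM; the model id is only an example):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16  # example id
)
fix_untrained_tokens(model)  # resets near-zero embedding rows to the trained mean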

@disperaller

If this is how Llama 3 was pretrained, then during SFT should we include these special tokens (<|eot_id|>, <|start_header_id|>, etc.), i.e. leave them unmasked in the attention_mask?

@NivinaNull

Could you please explain which special token works as a sep token, or which special character can be used as a separator?

@Ben-Pfirsich

could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary)

Yes, you can -- this is why they were added: to support more use cases without requiring a vocab resize.

As long as you don't conflict with the ones currently being used, you can pick any of them for your use-case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74

Is this the preferred solution over just adding new tokens and extending the vocabulary? I would also like to have some kind of separator token. Is there any reason to use an existing special token over a new one?
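
(For context, a rough sketch of the two options with the Hugging Face API; the model id and token names are just examples, not a recommendation:)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Option A: reuse an existing reserved token as a separator -- no resize needed.
sep_token = "<|reserved_special_token_20|>"  # example choice

# Option B: add a brand-new token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})
model.resize_token_embeddings(len(tokenizer))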
