Reserved special tokens #77
Yes, you can; this is why they were added: to support more use cases without requiring a vocab resize. As long as you don't conflict with the ones currently in use, you can pick any of them for your use case: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L61-L74
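For context, the linked tokenizer code builds the special-token list roughly as follows. This is a paraphrased sketch, not the verbatim source; exact counts and ordering should be checked against tokenizer.py itself:

```python
# Sketch of how Llama 3's tokenizer defines its special tokens
# (paraphrased from meta-llama/llama3 tokenizer.py).
num_reserved_special_tokens = 256

special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",
] + [
    f"<|reserved_special_token_{i}|>"
    for i in range(5, num_reserved_special_tokens - 5)
]

# Everything named reserved_special_token_* is free for downstream use;
# the other special tokens carry meaning in the chat template.
free_tokens = [t for t in special_tokens if t.startswith("<|reserved_special_token_")]
print(len(special_tokens), len(free_tokens))
```

In other words, the reserved tokens already occupy vocab slots, so repurposing one avoids resizing the embedding matrix.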
@ruanslv Can you please say more about which reserved special tokens are already used? Based on the tokenizer code you linked, it seems that…
I used some reserved special tokens with index higher than 10 as language tags in my fine-tuning corpus. The training was done with QLoRA, and the embedding layer was also fine-tuned. However, the model never converged and the validation loss stayed constant. Interestingly, after I switched to adding new special tokens instead, the loss immediately started to decrease. Does this have anything to do with the initial value of the reserved token embeddings?
https://twitter.com/danielhanchen/status/1781395882925343058 It seems some of Llama 3's weights are untrained (set to 0 or very close to 0). Unsloth added a fix:

```python
import torch

def fix_untrained_tokens(model, eps=1e-16):
    """
    Llama-3, for example, has untrained vectors in the base model.
    These include <|eot_id|>, <|start_header_id|>, <|end_header_id|>.
    We reset them to the mean of the rest of the tokens.
    """
    embedding_matrix = model.get_input_embeddings().weight.data
    lm_head_matrix = model.get_output_embeddings().weight.data

    # Get untrained tokens: rows whose maximum entry is ~0
    indicator_untrained = torch.amax(embedding_matrix, axis=1) <= eps
    where_untrained = torch.where(indicator_untrained)[0]
    n_untrained = where_untrained.shape[0]
    n_trained = embedding_matrix.shape[0] - n_untrained
    if n_untrained != 0:
        print(
            f"Unsloth: Not an error, but your model has {n_untrained} untrained tokens.\n"
            "We shall set them to the mean of the other trained tokens."
        )

    # First set untrained rows to exactly 0 -- sometimes they are not
    # (e.g. 1e-23 for bfloat16)
    embedding_matrix[where_untrained] = 0
    lm_head_matrix[where_untrained] = 0

    # Sum over the vocab dimension in float32 for numerical stability
    sum_embedding = torch.sum(embedding_matrix, dtype=torch.float32, axis=0)
    sum_lm_head = torch.sum(lm_head_matrix, dtype=torch.float32, axis=0)

    # Find the correct average by dividing by the number of trained tokens only
    mean_embedding = (sum_embedding / n_trained).to(embedding_matrix.dtype)
    mean_lm_head = (sum_lm_head / n_trained).to(lm_head_matrix.dtype)

    # Set the untrained rows to the mean
    embedding_matrix[where_untrained] = mean_embedding
    lm_head_matrix[where_untrained] = mean_lm_head
    return mean_embedding, mean_lm_head
```
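To see what the mean-reset does without loading a model, here is a minimal standalone NumPy sketch of the same idea (hypothetical helper name; the real code above operates in-place on the torch embedding and lm_head weights):

```python
import numpy as np

def reset_untrained_rows(matrix, eps=1e-16):
    """Replace all-zero ('untrained') rows with the mean of the trained rows."""
    # A row counts as untrained if its largest entry is ~0,
    # mirroring torch.amax(embedding_matrix, axis=1) <= eps above.
    untrained = np.amax(matrix, axis=1) <= eps
    n_trained = (~untrained).sum()
    # Average only over trained rows, accumulating in float64
    mean_row = matrix[~untrained].sum(axis=0, dtype=np.float64) / n_trained
    out = matrix.copy()
    out[untrained] = mean_row.astype(matrix.dtype)
    return out, untrained

# Toy 'embedding matrix': row 1 is untrained (all zeros)
emb = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])
fixed, mask = reset_untrained_rows(emb)
print(fixed[1])  # row 1 becomes the mean of rows 0 and 2: [2. 3.]
```

This also shows why the reported loss plateau is plausible: a zero embedding row produces zero logit contribution and (with a zero lm_head row) near-zero gradients, so the token never learns.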
If this is how Llama 3 was pretrained, then during SFT should we include these special tokens (<|eot_id|>, <|start_header_id|>, etc.), i.e. unmask them in the attention_mask?
Could you please explain which special token works as a sep token, or which special character can be used as a separator?
Is this the preferred solution over just adding new tokens and extending the vocabulary? I would also like to have some kind of separator token. Is there any reason to use an existing special token over a new one?
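One pattern that follows from the earlier answers is to pick an unused reserved token as the separator string. A sketch, assuming you also train the chosen token's embedding as discussed above, and assuming your tokenizer is configured to encode special tokens:

```python
# Hypothetical choice: repurpose an unused reserved token as a separator.
# Any <|reserved_special_token_N|> not claimed by the chat template works.
SEP = "<|reserved_special_token_100|>"

def join_segments(segments, sep=SEP):
    # Because the token already exists in the vocab, no resize is needed;
    # a tokenizer that encodes special tokens maps `sep` to a single id.
    return sep.join(segments)

print(join_segments(["question", "answer"]))
```

The advantage over adding a brand-new token is that the vocab size, embedding matrix, and lm_head all stay unchanged; the caveat, per the comment above, is that the reserved token's embedding may start untrained and need the mean-reset treatment before fine-tuning.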
Apologies in case this is documented somewhere and I missed it:
I notice that there are 250 "reserved special tokens" defined in the tokenizer. Is there any information available on what these are meant for, and what users are supposed to (not) do with them? For instance, could one use some of these tokens in finetunes (instead of adding additional tokens and resizing the vocabulary), or would that be problematic?
Thanks so much!