PCC: questions regarding additional token initialization (PCC Decoder) #33
Hi there!
Thanks for sharing this interesting work. I have a few questions regarding the patch model. I implemented the creation script as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

NEW_TOKENS = [
    '<mem>',   # <mem> is the start of a memory block.
    '<ae>',    # <ae> is the hint for the reconstruction task.
    '</mem>',  # </mem> is the end of a memory block.
]

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)

num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": NEW_TOKENS
})
assert num_added == len(NEW_TOKENS), "Error adding tokens (some tokens already exist!)."
print(f"Added {num_added} tokens.")

model.resize_token_embeddings(len(tokenizer))
token_ids = tokenizer.convert_tokens_to_ids(NEW_TOKENS)
print("Token IDs:", token_ids)

with torch.no_grad():
    input_emb_matrix = model.get_input_embeddings().weight.detach()
    lm_head_matrix = model.get_output_embeddings().weight.detach()
    embedding_vectors = input_emb_matrix[token_ids].clone()
    lm_head_vectors = lm_head_matrix[token_ids].clone()

patch = {
    "tokens": NEW_TOKENS,
    "embedding": embedding_vectors,
    "lm_head": lm_head_vectors,
}
```

Based on this snippet, I have three questions:
- In `injection` & `model.py`, why do you explicitly set `eos_token_id` to 128001 and `pad_token_id` to 128002? By default, these values should be 128009 and empty, respectively:

  ```python
  self.tokenizer.eos_token_id = 128001
  self.tokenizer.pad_token_id = 128002  # <|reserved_special_token_0|>
  patch = torch.load("model/patch/llama3_8b_special_token_patch.pt")
  ```

- For pretraining and finetuning, the decoder loads the same patch `.pt` file. According to your previous response, this patch file should be created as in the code above (which initializes the new embeddings randomly, though aligned with the model's distribution). Could you confirm that this snippet is correct?
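For reference, here is the round trip I assumed on the loading side: save the `patch` dict built above with `torch.save`, then copy its rows into the resized embedding matrices. This is my own minimal sketch, not code from this repo; a tiny `nn.Embedding`/`nn.Linear` pair stands in for the real Llama input embeddings and `lm_head`, and the token IDs and file name are illustrative.

```python
import torch
import torch.nn as nn

VOCAB, DIM = 16, 8
NEW_TOKENS = ['<mem>', '<ae>', '</mem>']
token_ids = [13, 14, 15]  # pretend IDs of the newly added tokens

# Same dict layout as the snippet above, with small random stand-in vectors.
patch = {
    "tokens": NEW_TOKENS,
    "embedding": torch.randn(len(NEW_TOKENS), DIM),
    "lm_head": torch.randn(len(NEW_TOKENS), DIM),
}
torch.save(patch, "special_token_patch_demo.pt")

# Later (e.g. in the decoder's setup), load the patch and inject the rows
# into the input embedding and output head at the new tokens' positions.
loaded = torch.load("special_token_patch_demo.pt")
input_emb = nn.Embedding(VOCAB, DIM)
lm_head = nn.Linear(DIM, VOCAB, bias=False)
with torch.no_grad():
    input_emb.weight[token_ids] = loaded["embedding"]
    lm_head.weight[token_ids] = loaded["lm_head"]

# The injected rows match what was saved.
assert torch.equal(input_emb.weight[token_ids], patch["embedding"])
assert torch.equal(lm_head.weight[token_ids], patch["lm_head"])
```

Is this the intended way the patch is consumed, or does the decoder post-process the vectors further?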
- I noticed that the decoder sets `_set_grad_mode` to `False` all the time. Does this mean that, even during the warmup stage, the added token embeddings will not be updated? According to your README, the embeddings are supposed to be optimized during warmup for both pretraining and finetuning. Am I misunderstanding something here?
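To make the question concrete, this is the warmup behavior I expected: all weights frozen except the rows of the newly added tokens. The gradient-masking hook below is my own illustration (a tiny `nn.Embedding` instead of the real model, made-up token IDs), not code from this repo.

```python
import torch
import torch.nn as nn

VOCAB, DIM = 16, 8
new_ids = torch.tensor([13, 14, 15])  # pretend IDs of the added tokens

emb = nn.Embedding(VOCAB, DIM)
emb.weight.requires_grad_(True)

# Zero the gradient for every row except the new tokens, so only the
# added embeddings receive optimizer updates during warmup.
mask = torch.zeros(VOCAB, 1)
mask[new_ids] = 1.0
emb.weight.register_hook(lambda g: g * mask)

opt = torch.optim.SGD(emb.parameters(), lr=0.1)
before = emb.weight.detach().clone()

# One toy step touching an old token (0) and a new token (13).
loss = emb(torch.tensor([0, 13])).sum()
loss.backward()
opt.step()

assert torch.equal(emb.weight[0], before[0])        # old row frozen
assert not torch.equal(emb.weight[13], before[13])  # new row updated
```

If `_set_grad_mode=False` disables even this selective update, where do the warmup gradients for the new embeddings come from?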
Thanks in advance for your clarification!