PCC: questions regarding additional token initialization (PCC Decoder) #33

@BierOne

Description

Hi there!

Thanks for sharing this interesting work. I have a few questions regarding the patch model. I implemented the creation script as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

NEW_TOKENS = [
    '<mem>',  # <mem> marks the start of a memory block.
    '<ae>',   # <ae> is the hint token for the reconstruction task.
    '</mem>', # </mem> marks the end of a memory block.
]
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": NEW_TOKENS
})
assert num_added == len(NEW_TOKENS), "Error adding tokens (some tokens already exist!!)."

print(f"Added {num_added} tokens.")
model.resize_token_embeddings(len(tokenizer))

token_ids = tokenizer.convert_tokens_to_ids(NEW_TOKENS)
print("Token IDs:", token_ids)
with torch.no_grad():
    input_emb_matrix = model.get_input_embeddings().weight.detach()
    lm_head_matrix = model.get_output_embeddings().weight.detach()
    embedding_vectors = input_emb_matrix[token_ids].clone()
    lm_head_vectors = lm_head_matrix[token_ids].clone()

patch = {
    "tokens": NEW_TOKENS,
    "embedding": embedding_vectors,
    "lm_head": lm_head_vectors
}
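For completeness, here is a sketch of how I would save this patch dict and re-apply it to a freshly resized model. The re-application logic is my assumption about how the patch is meant to be consumed, not code from the repository, and the toy matrices stand in for the real 8B checkpoint:

```python
import torch

def apply_patch(patch, input_emb, lm_head):
    # Copy the saved vectors back into the last rows of the resized matrices.
    # Assumes the new tokens occupy the final len(patch["tokens"]) rows,
    # as produced by resize_token_embeddings after add_special_tokens.
    n = len(patch["tokens"])
    with torch.no_grad():
        input_emb[-n:] = patch["embedding"]
        lm_head[-n:] = patch["lm_head"]

# Tiny stand-in matrices instead of a real checkpoint:
vocab, dim, n_new = 8, 4, 3
emb = torch.zeros(vocab, dim)
head = torch.zeros(vocab, dim)
patch = {
    "tokens": ["<mem>", "<ae>", "</mem>"],
    "embedding": torch.ones(n_new, dim),
    "lm_head": torch.full((n_new, dim), 2.0),
}
# torch.save(patch, ...) would persist it; here we just apply it in memory.
apply_patch(patch, emb, head)
```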

Based on this snippet, I have three questions:

  1. In the injection code in model.py, why do you explicitly set eos_token_id to 128001 and pad_token_id to 128002? By default, these values are 128009 and None, respectively:
self.tokenizer.eos_token_id = 128001
self.tokenizer.pad_token_id = 128002  # <|reserved_special_token_0|>
patch = torch.load("model/patch/llama3_8b_special_token_patch.pt")
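For context on why a dedicated pad id can matter (this is my own illustration, not the authors' stated reasoning): if padding reused the EOS id, the standard label masking that ignores pad positions would also hide the genuine end-of-sequence target from the loss. A minimal sketch:

```python
EOS = 128001  # eos_token_id as set in model.py
PAD = 128002  # <|reserved_special_token_0|>, used as pad

seq = [10, 11, EOS, PAD, PAD]

# Standard label masking: padding positions are ignored by the loss (-100).
labels = [t if t != PAD else -100 for t in seq]
assert labels == [10, 11, EOS, -100, -100]  # EOS is still a training target

# If padding reused the EOS id instead, masking pads would also mask the
# real EOS:
seq_bad = [10, 11, EOS, EOS, EOS]  # padded with the EOS id
labels_bad = [t if t != EOS else -100 for t in seq_bad]
assert labels_bad == [10, 11, -100, -100, -100]  # EOS target lost
```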
  2. For pretraining and finetuning, the decoder loads the same patch .pt file. According to your previous response, this patch file should be created as in the code above (which initializes the embeddings randomly, though in line with the model's existing distribution). Could you confirm that this snippet is correct?

  3. I noticed that the decoder always sets _set_grad_mode to False. Does this mean that the added token embeddings are not updated even during the warmup stage? According to your README, the embeddings should be optimized during warmup for both pretraining and finetuning. Am I misunderstanding something here?
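On question 3, one common way to optimize only the newly added embedding rows while keeping all pre-existing rows frozen is a gradient hook that zeroes gradients everywhere except the new-token rows. Whether the repository implements warmup this way is an assumption on my part; this is just the pattern I would expect:

```python
import torch

emb = torch.nn.Embedding(10, 4)   # toy vocab of 10; rows 8-9 are the "new tokens"
new_ids = torch.tensor([8, 9])

# Zero the gradient for every row except the new-token rows.
mask = torch.zeros(emb.num_embeddings, 1)
mask[new_ids] = 1.0
emb.weight.register_hook(lambda grad: grad * mask)

loss = emb(torch.tensor([1, 8])).sum()
loss.backward()

assert emb.weight.grad[1].abs().sum() == 0   # old row: stays frozen
assert emb.weight.grad[8].abs().sum() > 0    # new row: receives gradient
```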

Thanks in advance for your clarification!
