phi-2 issue #681

Closed
ilmarinen opened this issue Mar 7, 2024 · 9 comments
@ilmarinen
The bug
List index out of range error when attempting to get a generation from phi-2

Traceback:

File "evaluate_success.py", line 238, in evaluate_phi2_constrained_gsm8k
response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+')
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 302, in add
out = lm._run_stateless(value)
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 465, in _run_stateless
for new_bytes, is_generated, new_bytes_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 819, in call
sampled_token = self.tokens[sampled_token_ind]
IndexError: list index out of range

To Reproduce

from guidance import gen, models


model = models.Transformers(
    "microsoft/phi-2",
    echo=False,
    device="cuda:0",
    trust_remote_code=True)

prompt = "\n1. Please answer the following question:\nQuestion: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\n\n2. Break up your answer to the question in (1) into a series of simple steps.\n3. Write each of the steps down on a separate line.\n4. Write out the final answer to the question, which should be a single number.\n\nWrite out your response below"

response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+')

System info (please complete the following information):

  • OS: Ubuntu
  • Guidance Version: 0.1.10
@riedgar-ms
Collaborator

Just trying this now with the infrastructure in #674. I do note that "Janet’s" uses a non-ASCII apostrophe, which looks rather like the one in test_unicode2() that was giving me trouble in that PR.

@riedgar-ms
Collaborator

riedgar-ms commented Mar 7, 2024

Can confirm: it's the apostrophe in "Janet’s". Changing it to a regular ASCII apostrophe allows the code to run (although putting a max_tokens on the second gen() also helped).
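
For reference, a sketch of that workaround applied to the failing line from the repro above (the max_tokens value on the second gen() is illustrative, not from this thread):

# Swap the curly apostrophe (U+2019) for a plain ASCII one, and also
# bound the second gen() call so the answer generation terminates.
prompt = prompt.replace("\u2019", "'")
response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+', max_tokens=10)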

@Harsha-Nori
Collaborator

Interesting! Can we check how the phi-2 tokenizer handles the special apostrophe character? I'm mid-flight at the moment with a trickle of internet, or I'd try it myself, but my HF model download is failing.
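
Something along these lines should show how the tokenizer splits that character (a sketch; the tokenizer is loaded exactly as in the repro above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

ids = tokenizer("Janet\u2019s")["input_ids"]  # fragment containing the curly apostrophe
print(ids)                                    # the token ids the model sees
print(tokenizer.convert_ids_to_tokens(ids))   # the byte-level BPE pieces
print(tokenizer.decode(ids))                  # should round-trip to "Janet’s"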

@riedgar-ms
Collaborator

Note, this is likely the same underlying issue as #682

@riedgar-ms
Collaborator

Experimenting with calling Phi-2 directly, the following works (until the final assert False):

def test_unicode_temp():
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

    # Prompt containing the curly apostrophe (U+2019) that trips up guidance
    prompt = "Janet’s ducks lay 16 eggs per day"

    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=200)
    text = tokenizer.batch_decode(outputs)[0]
    print(text)
    # Fail deliberately so pytest shows the generated text
    assert False, text

@slundberg
Contributor

slundberg commented Mar 9, 2024

Part of the problem is that the transformers model returns 51200 logits when you call it, but len(tokenizer.get_vocab()) is only 50295.

Why that is, I don't know; it seems like a problem with Phi-2 in transformers?

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = "bla"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
outs = model(**inputs)
print(len(tokenizer.get_vocab())) # 50295
print(outs.logits.shape) # torch.Size([1, 138, 51200])
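
A plausible explanation (an assumption on my part, not confirmed in this thread) is that the model's output layer is padded past the tokenizer's vocabulary, e.g. for hardware efficiency. If so, a defensive workaround on the sampling side would be to mask the padded logit positions before picking a token, continuing from the snippet above:

vocab_size = len(tokenizer.get_vocab())  # 50295
logits = outs.logits[0, -1].clone()      # last-position logits, length 51200
logits[vocab_size:] = float("-inf")      # padded indices can never be sampled
sampled_token_ind = torch.argmax(logits).item()
assert sampled_token_ind < vocab_size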

@slundberg
Contributor

slundberg commented Mar 9, 2024

Okay, after a lot of digging, the bigger problem here is that the Phi-2 tokenizer does not have the byte_decoder attribute. This means tokens that are not valid strings never make it into the vocab correctly (they all come through as the � string).

For example:
tokenizer.convert_tokens_to_string([tokenizer.convert_ids_to_tokens(447)]) just gives '�', but 447 is part of what the apostrophe is encoded as, so it must be a prefix of the apostrophe's unicode bytes.

I am not sure how to fix this; there must be some way to extract the actual byte-string tokens from the tokenizer, but I don't know how. @riedgar-ms if you have any ideas let me know :)
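
One possible direction (a sketch under the assumption that phi-2's tokenizer follows the GPT-2 byte-level BPE convention, not necessarily the fix that landed in guidance): such tokenizers store each token as a string under GPT-2's bytes-to-unicode mapping, so inverting that mapping should recover the raw bytes of any token:

from transformers import AutoTokenizer

def gpt2_bytes_to_unicode():
    # Re-implementation of the byte -> unicode table from the GPT-2 BPE code
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
byte_decoder = {v: k for k, v in gpt2_bytes_to_unicode().items()}

tok = tokenizer.convert_ids_to_tokens(447)
raw = bytes(byte_decoder[ch] for ch in tok)
print(tok, raw)  # per the comment above, a prefix of the apostrophe's UTF-8 bytes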

@riedgar-ms
Collaborator

@slundberg I have opened a discussion on Hugging Face about this:

https://huggingface.co/microsoft/phi-2/discussions/116

@riedgar-ms
Collaborator

Side note: I have just spotted that we were getting a warning:
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."

MikoAL added a commit to MikoAL/guidance that referenced this issue Mar 31, 2024