phi-2 issue #681

Closed
ilmarinen opened this issue Mar 7, 2024 · 9 comments
@ilmarinen
The bug
List index out of range error when attempting to get a generation from phi-2

Traceback:

File "evaluate_success.py", line 238, in evaluate_phi2_constrained_gsm8k
response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+')
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 302, in add
out = lm._run_stateless(value)
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 465, in _run_stateless
for new_bytes, is_generated, new_bytes_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
File "/home/xfernandes/code/envs/venv/lib/python3.8/site-packages/guidance/models/_model.py", line 819, in call
sampled_token = self.tokens[sampled_token_ind]
IndexError: list index out of range

To Reproduce

from guidance import gen, models


model = models.Transformers(
    "microsoft/phi-2",
    echo=False,
    device="cuda:0",
    trust_remote_code=True)

prompt = "\n1. Please answer the following question:\nQuestion: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\n\n2. Break up your answer to the question in (1) into a series of simple steps.\n3. Write each of the steps down on a separate line.\n4. Write out the final answer to the question, which should be a single number.\n\nWrite out your response below"

response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+')

System info (please complete the following information):

  • OS: Ubuntu
  • Guidance Version: 0.1.10
@riedgar-ms
Collaborator

Just trying this now with the infrastructure in #674. I do note that "Janet’s" uses a non-ASCII apostrophe, which looks rather like the one in test_unicode2() that was giving me trouble in that PR.

@riedgar-ms
Collaborator

riedgar-ms commented Mar 7, 2024

Can confirm: it's the apostrophe in "Janet’s". Changing it to a regular ASCII apostrophe allows the code to run (although putting a max_tokens on the second gen() also helped).
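
For reference, a sketch of that workaround applied to the failing line from the repro above (the max_tokens value on the second gen() is illustrative, not from this thread):

# Swap the curly apostrophe (U+2019) for a plain ASCII one, and also
# bound the second gen() call so the answer generation terminates.
prompt = prompt.replace("\u2019", "'")
response = model + prompt + gen("reason", max_tokens=220) + "\n" + "Answer: " + gen("answer", regex='\d+', max_tokens=10)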

@Harsha-Nori
Collaborator

Interesting! Can we check how the phi-2 tokenizer handles the special apostrophe character? I'm mid-flight at the moment with a trickle of internet, or I'd try it myself, but my HF model download is failing.
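
Something along these lines should show how the tokenizer splits that character (a sketch; the tokenizer is loaded exactly as in the repro above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

ids = tokenizer("Janet\u2019s")["input_ids"]  # fragment containing the curly apostrophe
print(ids)                                    # the token ids the model sees
print(tokenizer.convert_ids_to_tokens(ids))   # the byte-level BPE pieces
print(tokenizer.decode(ids))                  # should round-trip to "Janet’s"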

@riedgar-ms
Collaborator

Note, this is likely the same underlying issue as #682

@riedgar-ms
Collaborator

Experimenting with calling Phi-2 directly, the following works (until the final assert False):

def test_unicode_temp():
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

    # Prompt containing the curly apostrophe (U+2019) that trips up guidance
    prompt = "Janet’s ducks lay 16 eggs per day"

    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=200)
    text = tokenizer.batch_decode(outputs)[0]
    print(text)
    # Fail deliberately so pytest shows the generated text
    assert False, text

@slundberg
Contributor

slundberg commented Mar 9, 2024

Part of the problem is that the transformers model returns 51200 logits when you call it, but len(tokenizer.get_vocab()) is only 50295.

Why that is, I don't know; it seems like a problem with Phi-2 in transformers?

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = "bla"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
outs = model(**inputs)
print(len(tokenizer.get_vocab())) # 50295
print(outs.logits.shape) # torch.Size([1, 138, 51200])
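
A plausible explanation (an assumption on my part, not confirmed in this thread) is that the model's output layer is padded past the tokenizer's vocabulary, e.g. for hardware efficiency. If so, a defensive workaround on the sampling side would be to mask the padded logit positions before picking a token, continuing from the snippet above:

vocab_size = len(tokenizer.get_vocab())  # 50295
logits = outs.logits[0, -1].clone()      # last-position logits, length 51200
logits[vocab_size:] = float("-inf")      # padded indices can never be sampled
sampled_token_ind = torch.argmax(logits).item()
assert sampled_token_ind < vocab_size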

@slundberg
Contributor

slundberg commented Mar 9, 2024

Okay, after a lot of digging, the bigger problem here is that the Phi-2 tokenizer does not have the byte_decoder attribute. This means tokens that are not valid strings never make it into the vocab correctly (they all come through as the � string).

For example:
tokenizer.convert_tokens_to_string([tokenizer.convert_ids_to_tokens(447)]) just gives '�', but 447 is part of what the apostrophe is encoded as, so it must be a prefix of the apostrophe's unicode bytes.

I am not sure how to fix this; there must be some way to extract the actual byte-string tokens from the tokenizer, but I don't know how. @riedgar-ms if you have any ideas let me know :)
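
One possible direction (a sketch under the assumption that phi-2's tokenizer follows the GPT-2 byte-level BPE convention, not necessarily the fix that landed in guidance): such tokenizers store each token as a string under GPT-2's bytes-to-unicode mapping, so inverting that mapping should recover the raw bytes of any token:

from transformers import AutoTokenizer

def gpt2_bytes_to_unicode():
    # Re-implementation of the byte -> unicode table from the GPT-2 BPE code
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
byte_decoder = {v: k for k, v in gpt2_bytes_to_unicode().items()}

tok = tokenizer.convert_ids_to_tokens(447)
raw = bytes(byte_decoder[ch] for ch in tok)
print(tok, raw)  # per the comment above, a prefix of the apostrophe's UTF-8 bytes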

@riedgar-ms
Collaborator

@slundberg I have opened a discussion on Hugging Face about this:

https://huggingface.co/microsoft/phi-2/discussions/116

@riedgar-ms
Collaborator

Side note: I have just spotted that we were getting a warning:
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."

MikoAL added a commit to MikoAL/guidance that referenced this issue Mar 31, 2024