Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix models like phi-3 which have a mismatched tokenizer definition and model output tensor size #792

Merged
merged 3 commits into from
May 4, 2024

Conversation

Harsha-Nori
Copy link
Collaborator

@Harsha-Nori Harsha-Nori commented May 3, 2024

The Phi-3 Mini models (e.g. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) have a total tokenizer vocab size of 32011 (base 32000 plus 11 special tokens), but the model outputs logit vectors of size 32064 (for CUDA performance reasons maybe?).

I think models that use this trick will always have the "padded" model output at the end, which makes this PR a very clean and simple change to guarantee tokenizer <> model output consistency. This fix will NOT work if the padded outputs are either in the beginning or interspersed throughout the token vector, but I haven't seen a model do that yet. If models do start doing that, we'll need to make significantly heavier changes across the guidance codebase, as we currently heavily rely on the len(tokenizer) to be usable for iteration.

@Harsha-Nori
Copy link
Collaborator Author

Build failure seems to be unrelated to this change @riedgar-ms

@riedgar-ms
Copy link
Collaborator

Doesn't this need to be combined with the Phi3-enabling PR?

@Harsha-Nori
Copy link
Collaborator Author

Harsha-Nori commented May 3, 2024

Doesn't this need to be combined with the Phi3-enabling PR?

Yes, oops, meant to merge that in (assuming you mean the one that adds phi-3 to the tests). Just did.

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gates looking good so far....

@Harsha-Nori
Copy link
Collaborator Author

Build failures on Mac seem to be due to build machine size issues, not this change. I can run all tests on my Mac.

@Harsha-Nori Harsha-Nori merged commit 7229656 into main May 4, 2024
111 of 116 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants