Update create_states_mapping function to include tokenizer parameter #873

Closed · wants to merge 6 commits

Conversation

@br3no (Contributor) commented on May 7, 2024:

Fixes #872

@brandonwillard (Member) left a comment:

I don't know how well the tokenizer instance will hash/serialize. We need to at least make sure that there's a test confirming that we get cache hits when appropriate.

@br3no (Contributor, Author) commented on May 7, 2024:

The class inherits Hashable
https://github.com/outlines-dev/outlines/blob/4f8433d8d6633b0780c3a6c27981f9adffbe49f5/outlines/models/tokenizer.py#L7

But yeah, a test would be good. Is there documentation for contributors somewhere that could help me set up the dev env?

@brandonwillard (Member) replied:

> The class inherits Hashable
>
> https://github.com/outlines-dev/outlines/blob/4f8433d8d6633b0780c3a6c27981f9adffbe49f5/outlines/models/tokenizer.py#L7
>
> But yeah, a test would be good.

Yeah, we have some tests for in-session hash consistency (e.g. here), but I recall some issues with cross-session/serialization consistency. That might not be the case anymore, though.

> Is there documentation for contributors somewhere that could help me set up the dev env?

https://outlines-dev.github.io/outlines/community/contribute/

@ekagra-ranjan commented on May 7, 2024:

Passing the tokenizer directly as the cache key will not work, since each instance of the same tokenizer object will hash to a different value, so there will be cache misses. I didn't see this PR earlier, so I raised a PR with my local fix in #876, which caches based on the tokenizer name or path string.
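A minimal illustration of the problem, assuming Python's default identity-based object hashing (the class below is a stand-in, not the actual outlines tokenizer wrapper):

```python
# Stand-in class, not the real outlines tokenizer wrapper.
class WrappedTokenizer:
    def __init__(self, name_or_path: str):
        self.name_or_path = name_or_path


# Two instances wrapping the same underlying tokenizer...
a = WrappedTokenizer("gpt2")
b = WrappedTokenizer("gpt2")

# ...get different identity-based hashes, so a cache keyed on the instance
# itself effectively never hits across objects or sessions.
print(hash(a) == hash(b))  # almost always False
```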

@br3no (Contributor, Author) commented on May 10, 2024:

I've changed the PR to explicitly set the cache key computation using the key_function parameter of the @cache() decorator. This is in line with the discussion in #876.
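Roughly, the idea looks like this (a sketch only; the exact names and signatures in the PR may differ, and the tokenizer fields used for the key here are just an example):

```python
from outlines.caching import cache


def states_mapping_key(regex_string, tokenizer):
    # Build the cache key from the regex and from fields that identify the
    # tokenizer, rather than hashing the tokenizer instance itself.
    return (
        regex_string,
        tokenizer.eos_token_id,
        tuple(sorted(tokenizer.vocabulary.items())),
    )


@cache(key_function=states_mapping_key)
def create_states_mapping(regex_string, tokenizer):
    ...  # compute the regex-to-token-states mapping as before
```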

@brandonwillard what do you think?

@brandonwillard (Member) replied:

> I've changed the PR to explicitly set the cache key computation using the key_function parameter of the @cache() decorator. This is in line with the discussion in #876.
>
> @brandonwillard what do you think?

We need tests that confirm that the hashing function works as intended.

@br3no (Contributor, Author) commented on May 10, 2024:

@brandonwillard I have added unit tests for the cache decorator when used in conjunction with the key_function. Let me know if this is sufficient for you.
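Schematically, the tests check something like this (illustrative only, assuming the key_function parameter introduced in this PR; this is not the actual test code):

```python
from outlines.caching import cache


def test_cache_respects_key_function():
    calls = []

    # Key only on the "name" field; everything else in the argument is ignored.
    @cache(key_function=lambda arg: arg["name"])
    def expensive(arg):
        calls.append(arg)
        return arg["name"]

    expensive({"name": "gpt2", "instance": 1})
    expensive({"name": "gpt2", "instance": 2})  # different object, same key

    # The second call must be served from the cache.
    assert len(calls) == 1
```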

@brandonwillard (Member) left a comment:

> @brandonwillard I have added unit tests for the cache decorator when used in conjunction with the key_function. Let me know if this is sufficient for you.

I like where you're going with those tests! I was originally referring to tests for the choice of key/hash function for the tokenizers, though.

We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the transformer types). Since you're using all the members/fields of the generic Tokenizer interface, it's probably fine; if not, we need to update the interface and/or force subclasses to implement this key function themselves.

Speaking of which, we should first consider/try updating the __hash__ and __eq__ methods of Tokenizer to follow this key. That would make things a lot more consistent; however, for the few times we need to compute a hash, it will be a little expensive. Since the interface already implies some form of immutability via Hashable, and the tokenizers should conceptually be immutable in this scenario, we could reasonably cache the hash value and mostly remove that concern.
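A rough sketch of that idea (the fields used for the key are illustrative, and this is not the actual implementation):

```python
import hashlib
from functools import cached_property


class Tokenizer:
    # ... existing interface: eos_token_id, vocabulary, encode, decode, ...

    @cached_property
    def _key_hash(self) -> int:
        # Deterministic digest over the fields that define the tokenizer,
        # computed once and reused because the instance is treated as immutable.
        payload = repr((self.eos_token_id, sorted(self.vocabulary.items())))
        return int(hashlib.sha256(payload.encode()).hexdigest(), 16)

    def __hash__(self) -> int:
        return self._key_hash

    def __eq__(self, other) -> bool:
        return isinstance(other, Tokenizer) and self._key_hash == other._key_hash
```

Using a deterministic digest instead of Python's built-in string hashing would also sidestep the cross-session inconsistency caused by hash randomization.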

Anyway, if we fix Tokenizer, then we only need to add that extra argument to create_states_mapping and a direct test guaranteeing that the disk cache is hit for—say—an example using GPT2's tokenizer (at the very least).
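Such a test could look roughly like this (load_gpt2_tokenizer and states_mapping_key are assumed helpers, not actual outlines functions; the shape of the assertion is the point):

```python
from outlines.caching import cache


def test_gpt2_states_mapping_hits_cache():
    computations = []

    # The body stands in for the real state-machine construction; the cache
    # decorator and key function follow the approach discussed in this thread.
    @cache(key_function=states_mapping_key)
    def build_states_mapping(regex_string, tokenizer):
        computations.append(regex_string)
        return {}

    tok_a = load_gpt2_tokenizer()  # assumed helper, e.g. via the transformers integration
    tok_b = load_gpt2_tokenizer()  # a fresh but equivalent instance

    build_states_mapping(r"[0-9]+", tok_a)
    build_states_mapping(r"[0-9]+", tok_b)

    # Equivalent tokenizers must map to the same cache entry, so the
    # underlying computation runs only once.
    assert len(computations) == 1
```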

@br3no (Contributor, Author) commented on May 11, 2024:

> We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the transformer types). Since you're using all the members/fields of the generic Tokenizer interface, it's probably fine;

The key_function returns the fields that, together, "define" a particular tokenizer (these fields are all used in later code, so we can safely assume it's okay to use them for this purpose). Assuming the hash is computed correctly, I believe we can also safely assume that the caching will work, since we verified in a unit test that the key_function is used the way it should be.

> if not, we need to update the interface and/or force subclasses to implement this key function themselves.

I agree this would be the best solution, but it would require a large refactoring. The different integrations work differently with the tokenizers. The only implementation of the Tokenizer protocol class is
https://github.com/outlines-dev/outlines/blob/97ec37d9038750101152582e5df3d7315b2759b5/outlines/models/transformers.py#L57

For vLLM, the tokenizer is patched with the right methods and members; in that case we would also need to patch the hashing function into the tokenizer, which I don't think is something we should do.

> Speaking of which, we should first consider/try updating the __hash__ and __eq__ methods of Tokenizer to follow this key.

The situation described above is also the reason why this is, unfortunately, not an option at the moment.

I haven't looked into the matter in depth, but I believe there are some refactoring opportunities to consolidate the usage of tokenizers, aligned with what is done in the transformers integration. If all integrations provided implementations of the Tokenizer protocol, we could make the hash computation much simpler and also easier to test.


I opened the original issue because I realized that the cache uses only the regex as the key, which leads to errors because different tokenizers end up sharing the same state machine. Since the cache is persistent, I believe many users will run into this issue. The symptoms are hard to interpret; the LLMs will just generate gibberish.

I think the right thing to do now would be to merge this PR to make sure people don't run into this problem; this is not a new feature, it's a bug fix. We could then open a new issue to address the refactoring needs described above and maybe clean up the change introduced in this PR.

@brandonwillard (Member) left a comment:

I've created a PR (i.e. #911) that synthesizes the things we've talked about and outlines the kind of test we need before merging anything that closes #872.

The outlines.caching.cache tests here are useful, so we can either update this PR to introduce only those changes (in a separate commit), or merge the changes from #911 into this PR if you want to finish that work here.

@br3no (Contributor, Author) commented on May 23, 2024:

@brandonwillard great work on #911. If I read it correctly, there is no longer a key_function argument in the cache decorator, right? So the tests here don't make sense if #911 is merged.

I don't mind closing this PR; I'd just like to have the behavior fixed. #911 seems like a better increment than this PR here.

Successfully merging this pull request may close: "State mapping cache ignores the tokenizer used to build the state machine".