Update create_states_mapping function to include tokenizer parameter #873
Conversation
I don't know how well the `tokenizer` instance will hash/serialize. We need to at least make sure that there's a test confirming that we get cache hits when appropriate.
The class inherits `Hashable`. But yeah, a test would be good. Is there documentation for contributors that could help me set up the dev env?
Yeah, we have some tests for in-session hash consistency (e.g. here), but I recall some issues with cross-session/serialization consistency. That might not be the case anymore, though.
https://outlines-dev.github.io/outlines/community/contribute/
Passing the tokenizer directly as the cache key will not work, since each instance of the same tokenizer object hashes to a different value, so there will always be a cache miss. I didn't see this PR earlier, so I raised #876 with my local fix, which caches based on the tokenizer's name or path string.
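To illustrate the point, a minimal sketch (the `tokenizer_cache_key` helper is hypothetical, not part of outlines; it mirrors what #876 proposes):

```python
from transformers import AutoTokenizer

t1 = AutoTokenizer.from_pretrained("gpt2")
t2 = AutoTokenizer.from_pretrained("gpt2")

# Two instances of the "same" tokenizer are distinct objects, so the
# default identity-based hash gives different values and the cache misses.
assert hash(t1) != hash(t2)

# A hypothetical key function based on the name-or-path string is stable
# across instances.
def tokenizer_cache_key(tokenizer) -> str:
    return tokenizer.name_or_path

assert tokenizer_cache_key(t1) == tokenizer_cache_key(t2)
```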
I've changed the PR to explicitly set the cache key computation using the `key_function` argument. @brandonwillard, what do you think?
We need tests that confirm that the hashing function works as intended.
@brandonwillard I have added unit tests for the `cache` decorator when used in conjunction with the `key_function`. Let me know if this is sufficient for you.
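A condensed version of what such a test can check, using a simplified in-memory stand-in for the cache decorator (the decorator below is illustrative, not outlines' actual implementation):

```python
def cache(key_function=None):
    """Toy in-memory cache decorator with an optional custom key function."""
    def decorator(fn):
        store = {}
        def wrapper(*args):
            key = key_function(*args) if key_function else args
            if key not in store:
                store[key] = fn(*args)
            return store[key]
        return wrapper
    return decorator

class FakeTokenizer:
    name_or_path = "gpt2"

calls = {"n": 0}

@cache(key_function=lambda tokenizer, regex: (tokenizer.name_or_path, regex))
def create_states_mapping(tokenizer, regex):
    calls["n"] += 1
    return f"fsm({regex})"

create_states_mapping(FakeTokenizer(), "[0-9]+")
create_states_mapping(FakeTokenizer(), "[0-9]+")  # new instance, same key
assert calls["n"] == 1  # the second call was a cache hit
```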
> @brandonwillard I have added unit tests for the `cache` decorator when used in conjunction with the `key_function`. Let me know if this is sufficient for you.
I like where you're going with those tests! I was originally referring to tests for the choice of key/hash function for the tokenizers, though.
We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the `transformers` types). Since you're using all the members/fields of the generic `Tokenizer` interface, it's probably fine; if not, we need to update the interface and/or force subclasses to implement this key function themselves.
Speaking of which, we should first consider/try updating the `__hash__` and `__eq__` methods of `Tokenizer` to follow this key. That would make things a lot more consistent; however, for the few times we need to compute a hash, it will be a little expensive. Since the interface already implies some form of immutability via `Hashable`, and the tokenizers should be conceptually immutable in this scenario, we could reasonably cache the hash value and mostly remove that concern.
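A sketch of that idea (the fields below are assumptions about the `Tokenizer` interface; the point is the lazily memoized hash):

```python
class Tokenizer:
    """Hash/eq derived from the interface's fields; the hash is computed
    once and memoized, which is safe if the tokenizer is immutable."""

    def __init__(self, eos_token_id: int, vocabulary: dict[str, int]):
        self.eos_token_id = eos_token_id
        self.vocabulary = vocabulary
        self._hash = None  # memoized hash value

    def _key(self) -> tuple:
        # Everything the cache key should depend on.
        return (self.eos_token_id, tuple(sorted(self.vocabulary.items())))

    def __eq__(self, other) -> bool:
        return isinstance(other, Tokenizer) and self._key() == other._key()

    def __hash__(self) -> int:
        if self._hash is None:  # the expensive part runs at most once
            self._hash = hash(self._key())
        return self._hash
```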
Anyway, if we fix `Tokenizer`, then we only need to add that extra argument to `create_states_mapping` and a direct test guaranteeing that the disk cache is hit for, say, an example using GPT2's tokenizer (at the very least).
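A sketch of such a test (the key derivation and the FSM stand-in are assumptions, not the eventual API; only `diskcache` and the GPT2 tokenizer are real):

```python
from diskcache import Cache
from transformers import AutoTokenizer

def tokenizer_key(tokenizer):
    # Assumed stable key: derived from fields, not from the instance.
    return (tokenizer.name_or_path, len(tokenizer))

def test_disk_cache_hit_for_gpt2(tmp_path):
    calls = {"n": 0}

    def create_states_mapping(regex, tokenizer, cache):
        key = (regex, tokenizer_key(tokenizer))
        if key not in cache:
            calls["n"] += 1
            cache[key] = f"fsm({regex})"  # stand-in for the real mapping
        return cache[key]

    t1 = AutoTokenizer.from_pretrained("gpt2")
    t2 = AutoTokenizer.from_pretrained("gpt2")  # a fresh instance

    with Cache(str(tmp_path)) as cache:
        create_states_mapping("[0-9]+", t1, cache)
        create_states_mapping("[0-9]+", t2, cache)

    assert calls["n"] == 1  # the second call hit the on-disk cache
```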
I agree this would be the best solution, but it would require a large refactoring. The different integrations work with the tokenizers in different ways, and the only implementation of the `Tokenizer` interface is the transformers one. The situation described above is also the reason why this is, unfortunately, not a solution at the moment. I haven't looked into the matter in depth, but I believe there are some refactoring opportunities to consolidate the usage of tokenizers.

I opened the original issue because I realized that the cache uses only the regex as its key, leading to errors because different tokenizers end up sharing the same state machine. Since the cache is persistent, I believe many users will run into this issue, and its symptoms are hard to diagnose: the LLM simply generates gibberish.

I think the right thing to do now is to merge this PR so that people don't hit this problem; this is a bug fix, not a new feature. We can then open a new issue to address the refactoring needs described above and, if needed, clean up the change introduced in this PR.
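For the record, the failure mode looks roughly like this (stub data; `build_fsm` stands in for the real state-machine construction):

```python
cache = {}

def build_fsm(regex, vocabulary):
    return {"regex": regex, "vocab_size": len(vocabulary)}

def create_states_mapping(regex, tokenizer):
    if regex not in cache:  # bug: the tokenizer is not part of the key
        cache[regex] = build_fsm(regex, tokenizer["vocab"])
    return cache[regex]

gpt2 = {"name": "gpt2", "vocab": ["tok"] * 50257}
llama = {"name": "llama", "vocab": ["tok"] * 32000}

fsm_a = create_states_mapping("[0-9]+", gpt2)
fsm_b = create_states_mapping("[0-9]+", llama)
assert fsm_a is fsm_b  # llama silently reuses gpt2's mapping -> gibberish
```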
I've created a PR (i.e. #911) that synthesizes the things we've talked about and outlines the kind of test we need before merging anything that closes #872.
The `outlines.caching.cache` tests here are useful, so we can update this PR so that it only introduces those changes (in one separate commit), or we can merge the changes from #911 into this PR if you want to finish that work here.
@brandonwillard, great work on #911. If I read it correctly, there is no longer a …

I don't mind closing this PR; I'd just like to have the behavior fixed. #911 seems like a better increment than this PR.
Fixes #872