Move get_logits and EngineCallResponse out of the Engine.__call__ function so that the remaining parts can be lowered to Rust or C++ in the future and for simplification of LLM servers that operate in batches #647
Conversation
…tion of separating the grammar processing that could be lowered to Rust or C++ into its own separate function
…at the contents of the next(...) function can be lowered to Rust or C++ in the future and for simplification of LLM servers that operate in batches
…essing loop of the Engine class so that the sample ordering can be done in python before lowering into C++ or Rust
Thanks @paulbkoch! I am just starting to dig through this, but one high-level question first: since many Model objects share the same engine, does this change prevent those model objects from being async-friendly or thread-safe (not sure they are thread-safe anyway)? Just noting that the engine now has more state than just a cache. (Perhaps this means we need to create a cheap sub-object that gets created at each call.)
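A minimal sketch of that "cheap sub-object" idea, assuming a hypothetical EngineCallState container; the field names here are placeholders for illustration, not the state the engine actually tracks:

```python
from dataclasses import dataclass, field

@dataclass
class EngineCallState:
    # Hypothetical per-call container: everything next() mutates would live
    # here instead of on the shared Engine, so concurrent calls don't collide.
    token_ids: list = field(default_factory=list)
    new_bytes: bytes = b""
    done: bool = False

# engine.start(...) could return a fresh EngineCallState, and each
# engine.next(state, logits) call would read and write only that object.
```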
```python
# self._cache_state["new_token_ids"].append(sampled_token_ind)
logits = None
while True:
    is_done, logits_state, response_state = self.next(logits)
```
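For context, a rough sketch of how the rest of this loop could consume those values, with get_logits and response construction kept outside of next(...); the way logits_state and response_state are unpacked below is an assumption for illustration, not the exact API in this PR:

```python
logits = None
while True:
    is_done, logits_state, response_state = self.next(logits)

    if response_state is not None:
        # One response chunk per next() call, mirroring the previous behavior.
        yield EngineCallResponse(*response_state)  # assumed constructor shape

    if is_done:
        break

    logits = None
    if logits_state is not None:
        # Only run a forward pass when next() actually needs fresh logits;
        # forced tokens can be resolved without touching the model.
        token_ids, forced_bytes, current_temp = logits_state  # assumed fields
        logits = self.get_logits(token_ids, forced_bytes, current_temp)
```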
I think currently we return each forced token as a separate chunk, which in theory allows us to report the probability of each token to the client. It looks like this might turn each forced region of bytes into a single chunk, is that right? (Not a huge issue, but one to note.)
Unless I introduced a bug, it should return the same number of token responses as the original.
Do you have an example function showing how batching might work? I was trying to imagine that but figured it would be faster to see what you had in mind :)
One other thought here: we should consider how this integrates with a speculative decoder while we are refactoring...
It’s definitely not thread-safe as written, but the existing trie isn’t either. The way to do this would probably be to have a separate engine object per Model object and deepcopy it when copies are made of the Model object. It does work currently as a shared object held by multiple Model objects, since the state is only valid between the call to start and the last call to next.
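In code, that per-model approach might look roughly like this, a sketch assuming a copy() hook on Model rather than the exact guidance Model API:

```python
import copy

class Model:
    def __init__(self, engine):
        # Each Model owns its own engine rather than sharing a singleton.
        self.engine = engine

    def copy(self):
        # Deep-copy the engine along with the model so two Model objects
        # never mutate the same in-flight call state.
        new_model = copy.copy(self)
        new_model.engine = copy.deepcopy(self.engine)
        return new_model
```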
Here's an example of how it would work:
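The sketch below assumes one engine object per request (as discussed above) and hypothetical batch_get_logits / send_chunk_to_client helpers, plus assumed start() arguments; none of these names come from the PR itself:

```python
def run_batch(engines, prompts, grammars):
    # Kick off every request; start() stashes the per-call state on its engine.
    for engine, prompt, grammar in zip(engines, prompts, grammars):
        engine.start(prompt, grammar)

    pending = {i: None for i in range(len(engines))}  # request id -> logits to feed next()
    while pending:
        needs_logits = []  # (request id, logits_state) pairs for this step
        for i in list(pending):
            is_done, logits_state, response_state = engines[i].next(pending[i])
            pending[i] = None
            if response_state is not None:
                send_chunk_to_client(i, response_state)  # hypothetical transport
            if is_done:
                del pending[i]
            elif logits_state is not None:
                needs_logits.append((i, logits_state))

        if needs_logits:
            # A single batched forward pass covers every request that needs logits.
            rows = batch_get_logits([state for _, state in needs_logits])
            for (i, _), row in zip(needs_logits, rows):
                pending[i] = row
```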
I'm not clear on what you have in mind here, but happy to discuss it further if you think it should impact the design.
Thanks again Paul! I looked through everything again and it all looks good for now. Merging :)
This PR has two purposes: