Move get_logits and EngineCallResponse out of the Engine.__call__ function so that the remaining parts can be lowered to Rust or C++ in the future and for simplification of LLM servers that operate in batches #647

Merged
merged 6 commits into guidance-ai:main on Mar 8, 2024

Conversation

paulbkoch
Collaborator

@paulbkoch paulbkoch commented Feb 21, 2024

This PR has two purposes:

  1. We'd like to lower the contents of the Engine.__call__() function to either Rust or C++; however, the calls to get_logits and the yields of EngineCallResponse require Python. This change moves them outside of the Engine.__call__(...) function, leaving the rest of that function lowerable.
  2. The current Engine class works well for servers that respond to a single request at a time; however, batched servers need to maintain state for multiple connections at once and benefit from synchronizing the calls to get_logits into batched GPU calls. The current architecture blocks on the call to get_logits inside the stack of the Engine.__call__ function. This change moves the call to get_logits outside of that function, to a location where a batched server can batch the calls together. (A sketch of the resulting caller-side loop follows this list.)
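
A minimal sketch of the caller-side loop this enables. It is illustrative only: start, next, and get_logits appear in this PR, but the exact signatures and the compute_logits/make_response parameters below are assumptions, not code from the PR.

def run_single(engine, parser, grammar, compute_logits, make_response):
    # illustrative: compute_logits and make_response stand in for the model
    # forward pass and the EngineCallResponse construction that used to live
    # inside Engine.__call__
    engine.start(parser, grammar)
    logits = None
    while True:
        is_done, logits_state, response_state = engine.next(logits)
        if response_state is not None:
            yield make_response(response_state)   # streamed back to the caller
        if is_done:
            break
        logits = compute_logits(logits_state)     # GPU call now happens outside the engine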

…tion of separating the grammar processing that could be lowered to Rust or C++ into its own separate function
…at the contents of the next(...) function can be lowered to Rust or C++ in the future and for simplification of LLM servers that operate in batches
…essing loop of the Engine class so that the sample ordering can be done in python before lowering into C++ or Rust
@slundberg
Contributor

slundberg commented Feb 22, 2024

Thanks @paulbkoch! I am just starting to dig through this, but one high-level question first. Since many Model objects share the same engine, does this change prevent those model objects from being async-friendly or thread-safe (not sure if they are thread-safe anyway)? Just noting that the engine now has more state than just a cache. (Perhaps this means we need to create a cheap sub-object that gets created at each call.)

# self._cache_state["new_token_ids"].append(sampled_token_ind)
logits = None
while True:
    is_done, logits_state, response_state = self.next(logits)
Contributor

I think currently we return each token that is forced as a separate chunk; this in theory allows us to report to the client the probability of each token. It looks like this might turn each forced region of bytes into a single chunk, is that right? (Not a huge issue, but one to note.)

Collaborator Author

@paulbkoch paulbkoch Feb 22, 2024

Unless I introduced a bug, it should return the same number of token responses as the original.

@slundberg
Contributor

Do you have an example function showing how batching might work? I was trying to imagine that but figured it would be faster to see what you had in mind :)

@slundberg
Contributor

One other thought here: we should consider how this integrates with a speculative decoder while we are refactoring...

@paulbkoch
Collaborator Author

paulbkoch commented Feb 22, 2024

> Thanks @paulbkoch! I am just starting to dig through this, but one high-level question first. Since many Model objects share the same engine, does this change prevent those model objects from being async-friendly or thread-safe (not sure if they are thread-safe anyway)? Just noting that the engine now has more state than just a cache. (Perhaps this means we need to create a cheap sub-object that gets created at each call.)

It’s definitely not thread-safe as written, but the existing trie isn’t either. The way to do this would probably be to have a separate engine object per model object and deepcopy them when copies are made of the model object. It does work currently as a shared object held by multiple Model objects since the state is only valid between the call to start and the last call to next.
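
A minimal sketch of the per-Model-engine idea described above; the simplified Model class shown here is hypothetical, not the one in this repository:

import copy

class Model:
    def __init__(self, engine):
        # hypothetical: each Model owns its own Engine instead of sharing one
        self.engine = engine

    def copy(self):
        # deep-copying the engine gives the copy independent parser/grammar state,
        # so concurrent use of the two Model objects cannot interfere
        return Model(copy.deepcopy(self.engine))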

@paulbkoch
Collaborator Author

paulbkoch commented Feb 22, 2024

> Do you have an example function showing how batching might work? I was trying to imagine that but figured it would be faster to see what you had in mind :)

Here's an example of how it would work:

engines = […]  # imagine this contains 10 engine objects and each has its own prompt
for i in range(len(engines)):
    # each engine has its own parser and grammar
    engines[i].start(parsers[i], grammars[i])

# no logits exist yet, so every engine receives None on the first pass
batched_logits = [None] * len(engines)
done = [False] * len(engines)
logits_state = [None] * len(engines)
response_state = [None] * len(engines)

while True:
    for i in range(len(engines)):
        # For better performance use joblib on the next function
        done[i], logits_state[i], response_state[i] = engines[i].next(batched_logits[i])

    # GPU computes all 10 arrays of logits in a single batch
    batched_logits = GPU.get_batched_logits(engines)

    # Do some complicated state management to handle completed grammars
    # by swapping in new grammars waiting on queues
    # and issuing streaming responses through queues.
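
A hypothetical sketch of that state-management step, assuming a queue.Queue of pending (parser, grammar) work items and one response queue per batch slot (none of these names come from the PR). The queues would be created before the loop, and the for-loop below would sit at the end of the while-loop body above:

import queue

pending_work = queue.Queue()                         # (parser, grammar) pairs waiting to run
response_queues = [queue.Queue() for _ in engines]   # one stream of response chunks per slot

for i in range(len(engines)):
    response_queues[i].put(response_state[i])        # stream the latest chunk to the client
    if done[i] and not pending_work.empty():
        parser, grammar = pending_work.get_nowait()  # pull the next waiting request
        engines[i].start(parser, grammar)            # reuse this batch slot for the new work
        batched_logits[i] = None                     # a freshly started grammar has no logits yet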

@paulbkoch
Collaborator Author

> One other thought here: we should consider how this integrates with a speculative decoder while we are refactoring...

I'm not clear on what you have in mind here, but I'm happy to discuss it further if you think it should impact the design.

@slundberg
Contributor

Thanks again Paul! I looked through everything again and it all looks good for now. Merging :)

@slundberg slundberg merged commit 9fcd78b into guidance-ai:main Mar 8, 2024
5 checks passed