This PR is discussed in #2108 and changes mask creation for the Llama model so that a user-supplied prompt can be processed in token batches instead of all at once. The key change is to `Cache::mask()`: it now takes a second `usize` and uses it to build an appropriately sized vector that is turned into a `Tensor` there (a rough sketch of the idea follows below). The code in candle-examples/examples/llama/main.rs in this PR may need smoothing, but other than that, I've tested the example with and without the new `--prompt-batch-size` CLI parameter and at a variety of sizes.
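
To make the shape of the change concrete, here is a minimal standalone sketch of a two-parameter causal mask. This is not the PR's actual code: the helper name `batched_mask` and the parameter names `seq_len`/`total_len` are illustrative only, and in the PR the logic lives inside `Cache::mask()` with its existing caching.

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative helper only: builds a causal mask with `seq_len` query rows
// (the current token batch) against `total_len` key columns (previously
// cached tokens plus the current batch). Entry (i, j) is 1 when key j lies
// in the future relative to query i's absolute position, 0 otherwise.
fn batched_mask(seq_len: usize, total_len: usize, device: &Device) -> Result<Tensor> {
    let offset = total_len - seq_len;
    let mask: Vec<u8> = (0..seq_len)
        .flat_map(|i| (0..total_len).map(move |j| u8::from(j > i + offset)))
        .collect();
    Tensor::from_slice(&mask, (seq_len, total_len), device)
}
```

With `seq_len == total_len` this reduces to the usual square causal mask, which is why a second size argument is enough to cover both the batched-prompt path and the original behaviour.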