This PR is discussed in #2108 and changes mask creation for the Llama model so that a user-supplied prompt can be processed in token batches instead of all at once. The key change is to `Cache::mask()`: it now takes a second `usize` and uses it to build an appropriately sized vector that is turned into a `Tensor` there (a rough sketch of the idea follows below). The code in candle-examples/examples/llama/main.rs in this PR may need smoothing, but other than that, I've tested the example with and without the new `--prompt-batch-size` CLI parameter and at a variety of sizes.
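
To make the shape of the change concrete, here is a minimal standalone sketch of a two-parameter causal mask. This is not the PR's actual code: the helper name `batched_mask` and the parameter names `seq_len`/`total_len` are illustrative only, and in the PR the logic lives inside `Cache::mask()` with its existing caching.

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative helper only: builds a causal mask with `seq_len` query rows
// (the current token batch) against `total_len` key columns (previously
// cached tokens plus the current batch). Entry (i, j) is 1 when key j lies
// in the future relative to query i's absolute position, 0 otherwise.
fn batched_mask(seq_len: usize, total_len: usize, device: &Device) -> Result<Tensor> {
    let offset = total_len - seq_len;
    let mask: Vec<u8> = (0..seq_len)
        .flat_map(|i| (0..total_len).map(move |j| u8::from(j > i + offset)))
        .collect();
    Tensor::from_slice(&mask, (seq_len, total_len), device)
}
```

With `seq_len == total_len` this reduces to the usual square causal mask, which is why a second size argument is enough to cover both the batched-prompt path and the original behaviour.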