
broadcast_as error when processing multiple tokens at once in quantized example #2153

Open
EricLBuehler opened this issue May 2, 2024 · 9 comments


@EricLBuehler

EricLBuehler commented May 2, 2024

Hello all,

Thanks for your great work here. We are implementing speculative decoding at mistral.rs, and were in the final stages of testing when we discovered some very strange behavior. Specifically, the following error results when sending multiple tokens at once during the completion steps:

Error: cannot broadcast [3, 3] to [1, 32, 3, 5]

Reproducing this error is simple:

In quantized/main.rs at line 578:

-           let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
+           let input = Tensor::new(&[next_token, next_token, next_token], &device)?.unsqueeze(0)?;
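For reference, the shape clash behind the error can be reproduced in isolation. This is a minimal standalone sketch (not the model code itself), just mirroring the shapes from the error above, assuming the candle_core crate is available:

```rust
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Square causal mask built from the 3 new tokens only: shape (seq_len, seq_len) = (3, 3).
    let mask = Tensor::zeros((3, 3), DType::F32, &device)?;
    // The attention scores have shape (batch, heads, seq_len, kv_len) = (1, 32, 3, 5)
    // once the kv cache holds earlier tokens. Broadcasting the square mask over them
    // fails with: cannot broadcast [3, 3] to [1, 32, 3, 5]
    let _ = mask.broadcast_as((1, 32, 3, 5))?;
    Ok(())
}
```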

Is this a bug?

@EricLBuehler EricLBuehler changed the title [Possible Bug] Error: cannot broadcast [3, 3] to [1, 32, 3, 5] in quantized example [Possible Bug] broadcast_as error when processing multiple tokens at once in quantized example May 2, 2024
@EricLBuehler

@LaurentMazare, is this a mistake on my part?

@LaurentMazare

Not sure I understand. This model was designed to be fed the prompt and then one token at a time, so it is somewhat expected that it fails if, after the prompt, you pass it multiple tokens at once. Do you mean that the error message should be more explicit about why this is failing?

@EricLBuehler

For speculative decoding, we need to run the target model on multiple tokens at once, once per step. Re-running the target model on the full prompt each time would be a big performance hit, which is why I tried this. Is there some workaround, like disabling the attention mask?

@LaurentMazare

I think disabling the attention mask would be incorrect: you want the tokens in the batch you're processing to be causal between themselves and still able to attend to all tokens in the kv cache. So you want a mask that is rectangular rather than square, based on how many tokens are in the kv cache at the moment. For a batch of 4 tokens and a kv cache that already has 5 tokens processed, it should look like the following (1 = masked out):

000000111
000000011
000000001
000000000
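A minimal sketch of such a mask builder (an illustrative helper, not candle's actual API; same 1 = masked convention as above), where seqlen_offset is the number of tokens already in the kv cache:

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative helper (hypothetical name/signature): builds a (t, seqlen_offset + t)
// mask for `t` new tokens on top of a kv cache of `seqlen_offset` tokens. The first
// `seqlen_offset` columns are always visible (0); the trailing square block is the
// usual causal mask (1 above the diagonal).
fn rectangular_causal_mask(t: usize, seqlen_offset: usize, device: &Device) -> Result<Tensor> {
    let mask: Vec<u8> = (0..t)
        .flat_map(|i| (0..seqlen_offset + t).map(move |j| u8::from(j > seqlen_offset + i)))
        .collect();
    Tensor::from_vec(mask, (t, seqlen_offset + t), device)
}
```

With t = 4 and seqlen_offset = 5 this reproduces the 4 x 9 mask above; it is then broadcast over the (batch, heads, t, seqlen_offset + t) attention scores just like the square mask in the single-token case.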

@EricLBuehler

Ok. Would this be similar to #2111?

@EricLBuehler EricLBuehler changed the title [Possible Bug] broadcast_as error when processing multiple tokens at once in quantized example broadcast_as error when processing multiple tokens at once in quantized example May 3, 2024
@LaurentMazare

Indeed, it looks like the mask part at the bottom. It would be great if you could make a fresh PR with that change for the model you care about.

@EricLBuehler

Ok, so just to confirm: is it this part?

https://github.com/huggingface/candle/pull/2111/files#diff-ed262e4bc9a4a093e64842a2f61a85e1713c4efde0618ac7b31ad58dc5d171e3R137-R149

I can add a PR for this to some of the models if you think it is a good idea.

@LaurentMazare

Yep, exactly that part. It would probably be good to support it for at least llama and quantized-llama (and others too, but they might need a bit more work as their mask generation is different).

@EricLBuehler

I was able to make a general causal masker implementation here:

https://github.com/EricLBuehler/mistral.rs/blob/cc2f60a0bc4acfde636464ac408722335e0be732/mistralrs-core/src/layers.rs#L253

It works for all models with a causal or causal + sliding-window mask. Should I submit this as a PR?
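This is not the actual code in layers.rs, but a sketch of the general idea under the same convention (1 = masked), where a None window reduces to the plain rectangular causal mask:

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative sketch (hypothetical helper, not the mistral.rs implementation):
// rectangular causal mask with an optional sliding window, so the new token at
// absolute position `seqlen_offset + i` can only attend to the last `w` positions
// ending at itself.
fn causal_sliding_window_mask(
    t: usize,
    seqlen_offset: usize,
    window: Option<usize>,
    device: &Device,
) -> Result<Tensor> {
    let total = seqlen_offset + t;
    let mask: Vec<u8> = (0..t)
        .flat_map(|i| {
            let pos = seqlen_offset + i;
            (0..total).map(move |j| {
                let future = j > pos;                                  // causal constraint
                let outside = window.map_or(false, |w| j + w <= pos);  // sliding window
                u8::from(future || outside)
            })
        })
        .collect();
    Tensor::from_vec(mask, (t, total), device)
}
```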
