
Fix Qwen tensor parallelism #120

Merged (7 commits) on Dec 11, 2023
Conversation

@tgaddair (Contributor) commented Dec 10, 2023

Unlike Llama, Qwen stores qkv as a single fused tensor on disk. Logically, however, the attention block still treats it as three distinct tensors, so sharding the fused tensor as one contiguous block breaks tensor parallelism: each rank ends up with the wrong mix of q, k, and v rows. Instead, when loading and splitting the tensor, we need to split each individual chunk (q, k, v) across ranks and re-fuse the per-rank slices, rather than splitting the entire tensor.

Related: #115.
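
For illustration, here is a minimal sketch of the per-chunk split in plain PyTorch. This is not the PR's actual code: `shard_fused_qkv`, `rank`, and `world_size` are hypothetical names, and the fused weight is assumed to be laid out as [q; k; v] along dim 0 with a dimension evenly divisible by `world_size`.

```python
import torch

def shard_fused_qkv(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Shard a fused qkv weight (shape [3 * hidden_size, hidden_size]) for one rank.

    Naively slicing `weight` into world_size contiguous blocks would mix
    q, k, and v rows on each rank. Instead, split the fused tensor into its
    three logical chunks first, shard each chunk along dim 0, then re-fuse
    this rank's slices.
    """
    q, k, v = weight.chunk(3, dim=0)           # the three logical tensors
    shards = [
        t.chunk(world_size, dim=0)[rank]       # this rank's slice of each chunk
        for t in (q, k, v)
    ]
    return torch.cat(shards, dim=0)            # fused qkv shard for this rank

# Hypothetical usage: with hidden_size=8 and world_size=2, rank 0 gets
# rows 0-3 (q), 8-11 (k), and 16-19 (v) of the fused tensor, whereas a
# naive contiguous split would hand it all of q plus half of k.
hidden_size, world_size = 8, 2
w = torch.randn(3 * hidden_size, hidden_size)
shard0 = shard_fused_qkv(w, rank=0, world_size=world_size)
```

The key point the sketch demonstrates: the split boundary must respect the q/k/v chunk boundaries, otherwise each rank's attention computation operates on rows belonging to a different logical tensor.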
