Implement tensor parallel SGMV #79

tgaddair · 2023-11-28T21:00:43Z

This PR combines tensor parallelism with the SGMV vectorized CUDA kernel code path. In effect, this also addresses most of the pain points identified in #6, as we now require only one collective op per rank instead of one per adapter. The one caveat is that the SGMV kernel doesn't work with ranks < 8, and since tensor parallelism reduces the effective rank of the tensor, we need to fall back to the loop implementation any time rank / world_size < 8. We'll try to address this in future iterations by extending the SGMV kernel to support ranks < 8 (#78).
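
To make the fallback rule concrete, here is a minimal sketch of the check described above. It is not the code from this PR: the names `MIN_SGMV_RANK`, `effective_rank`, and `use_sgmv` are hypothetical, and it assumes the LoRA rank dimension is split evenly across `world_size` shards.

```python
# Hypothetical illustration of the fallback rule; names are not from the PR.
MIN_SGMV_RANK = 8  # smallest rank the SGMV kernel handles, per the description


def effective_rank(rank: int, world_size: int) -> int:
    """Rank seen by each shard, assuming the rank dimension is split evenly."""
    if rank < 1 or world_size < 1:
        raise ValueError("rank and world_size must be positive")
    return rank // world_size


def use_sgmv(rank: int, world_size: int) -> bool:
    """True if the vectorized SGMV path applies, False if the loop path is needed."""
    return effective_rank(rank, world_size) >= MIN_SGMV_RANK


if __name__ == "__main__":
    for rank, world_size in [(8, 1), (8, 2), (16, 2), (16, 4), (64, 8)]:
        path = "sgmv" if use_sgmv(rank, world_size) else "loop"
        print(f"rank={rank} world_size={world_size} -> {path}")
```

Under these assumptions, a rank-16 adapter stays on the SGMV path with 2 shards but falls back to the loop with 4, which is why larger world sizes make the fallback more likely until #78 is addressed.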

tgaddair added 6 commits November 28, 2023 09:44

WIP: tensor parallel sgmv

a993785

Added test sgmv

0c8a6b0

Merge

2b62fa8

Fixed lora_a

fcb9caf

Refactor

ff7007a

Don't vectorize if rank < 8

bf45447

tgaddair requested review from geoffreyangus and magdyksaleh November 28, 2023 21:00

geoffreyangus approved these changes Nov 28, 2023

tgaddair merged commit cb96f12 into main Nov 28, 2023
1 check passed

tgaddair deleted the tp-sgmv branch November 28, 2023 21:44

tgaddair mentioned this pull request Nov 29, 2023

Fuse allgather requests across adapters and q, k, v to reduce small network requests #6

Open
