Implement tensor parallel SGMV #79

Merged · 6 commits merged into main on Nov 28, 2023
Conversation

tgaddair (Contributor)
This PR combines tensor parallelism with the SGMV vectorized CUDA kernel code path. In effect, it also addresses most of the pain points identified in #6, as we now require only one collective op per rank instead of one per adapter. The one caveat is that the SGMV kernel doesn't work with adapter ranks < 8, and since tensor parallelism shards the adapter rank across workers (reducing the effective per-shard rank), we need to fall back to the loop implementation whenever rank / world_size < 8. We'll try to address this in future iterations by extending the SGMV kernel to support ranks < 8 (#78).
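To make the fallback condition concrete, here is a minimal, hypothetical sketch of the check described above; `MIN_SGMV_RANK` and `use_sgmv` are illustrative names, not identifiers from the actual codebase.

```python
# Hypothetical sketch (not the actual repository code): the SGMV kernel needs
# the per-shard adapter rank to be at least 8. With tensor parallelism the
# effective rank per shard is rank / world_size, so below that threshold we
# fall back to the per-adapter loop implementation.

MIN_SGMV_RANK = 8


def use_sgmv(adapter_rank: int, world_size: int) -> bool:
    """Return True when the sharded adapter rank is large enough for the SGMV kernel."""
    return adapter_rank // world_size >= MIN_SGMV_RANK


# Example: a rank-32 adapter on 2-way tensor parallelism keeps 16 per shard
# and can use the SGMV kernel; a rank-16 adapter on 4-way tensor parallelism
# drops to 4 per shard and must use the loop fallback.
assert use_sgmv(32, 2)
assert not use_sgmv(16, 4)
```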

tgaddair merged commit cb96f12 into main on Nov 28, 2023 (1 check passed).
tgaddair deleted the tp-sgmv branch on November 28, 2023 at 21:44.
2 participants