One of the major limitations of the original SGMV kernel is that it can only be applied to a batch consisting entirely of adapters of the same rank. This meant that when ranks differed within a batch, we needed to fall back to the loop-and-mask approach. This was particularly problematic for batches that mix base model requests with adapter requests, since in a production setting a significant portion of requests will frequently target the base model.
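For context, here is a minimal PyTorch sketch of that loop-and-mask fallback. All names (`lora_loop`, `adapter_ids`, and so on) are illustrative, not the actual implementation:

```python
import torch

def lora_loop(y, x, lora_a, lora_b, adapter_ids):
    # Naive fallback: one pair of matmuls per adapter, with a boolean
    # mask selecting the rows of the batch that belong to that adapter.
    # Base model rows carry an id (e.g. -1) that matches no adapter.
    # Cost grows with the number of distinct adapters: O(A).
    for adapter_id, (a, b) in enumerate(zip(lora_a, lora_b)):
        mask = adapter_ids == adapter_id
        if mask.any():
            y[mask] += x[mask] @ a @ b
    return y
```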
One workaround would be to apply a zero-weight matrix as a stand-in for the LoRA weights for rows corresponding to the base model or to adapters of a different rank, but this requires allocating additional matrices for every rank, which adds up.
To work around this, we extend the existing SGMV kernels to support a sparse list of segments (meaning that not every segment in the batch has the SGMV operation applied to it). This allows us to avoid applying anything to the base model rows, and to handle batches that contain mixed-rank adapters.
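Roughly, the sparse-segment semantics look like the following reference sketch. This is pure PyTorch standing in for the CUDA kernel, and representing `segments` as `(start, end)` row ranges is an assumption about the layout:

```python
import torch

def sgmv_sparse_reference(y, x, weights_a, weights_b, segments):
    # `segments` lists only the contiguous row ranges that have a LoRA
    # adapter attached. Rows belonging to the base model appear in no
    # segment, so y is simply left untouched for them: no zero-weight
    # stand-in matrices are needed.
    for (start, end), a, b in zip(segments, weights_a, weights_b):
        y[start:end] += x[start:end] @ a @ b
    return y
```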
In cases where adapters in a batch have mixed ranks, we process each rank in turn, so the number of SGMV operations becomes O(R), where R is the number of distinct ranks in the batch. This is a significant improvement over the O(A) cost of the loop implementation, where A is the number of adapters.
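A sketch of how that per-rank dispatch might look, assuming segments can be bucketed by the inner dimension of their `lora_a` weights. `apply_by_rank` and `sgmv_kernel` are hypothetical names, not the real API; `sgmv_kernel` could be the reference function from the sketch above:

```python
from collections import defaultdict

def apply_by_rank(y, x, weights_a, weights_b, segments, sgmv_kernel):
    # Bucket adapter segments by rank so each SGMV launch sees a
    # homogeneous-rank batch: O(R) launches for R distinct ranks,
    # instead of O(A) matmul pairs for A adapters in the loop fallback.
    buckets = defaultdict(list)
    for i, a in enumerate(weights_a):
        buckets[a.shape[1]].append(i)  # rank = inner dim of lora_a
    for rank, idxs in buckets.items():
        sgmv_kernel(
            y, x,
            [weights_a[i] for i in idxs],
            [weights_b[i] for i in idxs],
            [segments[i] for i in idxs],
        )
    return y
```

Since production batches tend to contain only a handful of distinct ranks, this keeps the launch count small even when the number of adapters is large.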
With this change, the only remaining case where SGMV is not used is tensor parallelism, which we'll address in a follow-up PR shortly.