FP [6,8,12] quantizer op #5336
Conversation
Regarding the failing CI tests: @mrwyattii @loadams, any ideas?
@microsoft-github-policy-service agree company="Snowflake"
@jeffra - No immediate guesses, but I'll take a look
Thanks @loadams! I believe this is resolved now; I figured out the issue with ninja enabled. I had to disable pre-compilation of this op for the test, since the pre-compile build includes the V100 compute target, which is not compatible with this op (it only works with bf16). I was able to get all the CI tests to pass (before updating with the latest master just now). I think it just needs a review from folks now? :)
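For reference, a minimal sketch of the kind of capability guard described above, assuming a pytest-based test and PyTorch's device-capability query; the function and test names here are hypothetical, not the actual DeepSpeed test code:

```python
# Hypothetical sketch: skip the FP[6,8,12] quantizer test on pre-Ampere GPUs
# (e.g. V100, compute capability 7.0), since the op only supports bf16 inputs.
import pytest
import torch

def ampere_or_newer() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # bf16 tensor-core support starts with Ampere (sm_80)

@pytest.mark.skipif(not ampere_or_newer(),
                    reason="FP[6,8,12] quantizer requires bf16 (Ampere or newer)")
def test_fp_quantizer_roundtrip():
    ...  # JIT-build and exercise the op here instead of relying on pre-compilation
```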
That makes sense; I hadn't reviewed yet, which is probably why I missed that you had already identified the issue :)
@arashashari and/or @JamesTheZ please add your review as well, thank you
@mrwyattii, probably you meant @arashb :) btw, @arashashari is also welcome to review this ;)
LGTM
Optimized version of `nn.Linear` that adds features such as:

* LoRA w. base weight sharding
* FP [6,8,12] quantization

Depends on #5336 being merged first.

Co-authored-by: @rajhans
Co-authored-by: @aurickq

---------

Co-authored-by: Rajhans Samdani <rajhans.samdani@snowflake.com>
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
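A rough usage sketch of what this enables; the import path and the `OptimizedLinear`, `LoRAConfig`, and `QuantizationConfig` names and arguments below are assumptions for illustration, not verified against the merged DeepSpeed API:

```python
# Hypothetical sketch; class names, import path, and arguments are assumed,
# not the verified API from the follow-up PR.
import torch
from deepspeed.linear import OptimizedLinear, LoRAConfig, QuantizationConfig

layer = OptimizedLinear(
    input_dim=4096,
    output_dim=4096,
    lora_config=LoRAConfig(lora_r=64, lora_alpha=16),  # LoRA w. base weight sharding
    quantization_config=QuantizationConfig(q_bits=8),  # FP [6,8,12] quantization
    dtype=torch.bfloat16,
).cuda()

x = torch.randn(2, 4096, dtype=torch.bfloat16, device="cuda")
y = layer(x)  # forward pass through the quantized, sharded base weight plus LoRA adapters
```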
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support.

Requires Ampere+ architecture; this is due to the initial focus of this op on `bfloat16` input types only.

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
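A minimal round-trip sketch of how such a quantizer might be used; the `FP_Quantize` name, import path, and signatures are assumptions for illustration, not the verified API introduced by this PR:

```python
# Hypothetical usage sketch; FP_Quantize and its methods are assumed names,
# not verified against the op merged in this PR.
import torch
from deepspeed.ops.fp_quantizer import FP_Quantize  # assumed import path

quantizer = FP_Quantize(group_size=512)             # assumed constructor argument
w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

q = quantizer.quantize(w, q_bits=8)                 # fp8; q_bits=6 or 12 for fp6/fp12
w_hat = quantizer.dequantize(q, q_bits=8)           # back to bf16

print((w - w_hat).abs().mean())                     # rough measure of quantization error
```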