
feat: Add Bitsandbytes quantization for transformer backend #1775

Closed
fakezeta opened this issue Feb 29, 2024 · 3 comments · Fixed by #1823
Labels
enhancement (New feature or request)

Comments

@fakezeta
Collaborator

Is your feature request related to a problem? Please describe.

Quantization is not available for the transformer backend.
Describe the solution you'd like

Add bitsandbytes 4-bit quantization, triggered by the user with the low_vram flag in the model definition.
Additionally, I propose using the f16 flag to change the compute_dtype to bfloat16 for better performance on Nvidia cards.
Describe alternatives you've considered

Additional context

I've implemented this while fixing #1774; this issue is opened for tracking.
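
For illustration, a minimal sketch of the proposed mapping, assuming the backend translates the low_vram and f16 model flags into a Hugging Face BitsAndBytesConfig (the load_model helper and its signature are hypothetical, not the backend's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str, low_vram: bool, f16: bool):
    """Load a causal LM, optionally 4-bit quantized via bitsandbytes."""
    quant_config = None
    if low_vram:
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            # Per the proposal, f16 switches the compute dtype to bfloat16,
            # which is faster on recent (Ampere+) Nvidia cards.
            bnb_4bit_compute_dtype=torch.bfloat16 if f16 else torch.float32,
        )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,  # None means no quantization
        device_map="auto",
    )
```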

fakezeta added the enhancement label Feb 29, 2024
@mudler
Owner

mudler commented Feb 29, 2024


Good point @fakezeta! Do you already have the changes and want to come up with a PR? Maybe we can take it from there.

@fakezeta
Collaborator Author

You know I'm old, give me some time 😄
I think I'll do it tonight (CET).

@mudler
Owner

mudler commented Feb 29, 2024

That's fine, actually, thanks for the efforts!

mudler pushed a commit that referenced this issue Mar 14, 2024
#1775 and fix: Transformer backend error on CUDA #1774 (#1823)

* fixes #1775 and #1774

Add BitsAndBytes quantization and fix embeddings on CUDA devices

* Manage 4-bit and 8-bit quantization

Manage the different BitsAndBytes options with the quantization: parameter in the model YAML

* fix compilation errors in non-CUDA environments
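
As a hedged sketch of how that commit's quantization: key could be used in a model definition (the bnb_4bit / bnb_8bit values and the surrounding fields are assumptions inferred from the commit text, not verified against the shipped backend):

```yaml
# Hypothetical model definition using the option added in #1823.
name: mistral-quantized
backend: transformers
parameters:
  model: mistralai/Mistral-7B-Instruct-v0.2  # placeholder model id
# Assumed values for the BitsAndBytes modes the commit mentions:
quantization: bnb_4bit   # or bnb_8bit
f16: true                # per the proposal, switches compute_dtype to bfloat16
```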
mudler added a commit that referenced this issue Mar 26, 2024
…for Openvino and CUDA (#1892)

* OpenVINO draft

First draft of the OpenVINO integration in the transformer backend

* first working implementation

* Streaming working

* Small fix for regression on CUDA and XPU

* use pip version of optimum[openvino]

* Update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
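
For flavor, a minimal sketch of the OpenVINO path these commits describe, assuming the backend loads models through optimum[openvino] and streams tokens with transformers' TextIteratorStreamer (the model id and generation settings are placeholders, not the backend's actual code):

```python
# Sketch only: load a causal LM through optimum[openvino] and stream
# generated tokens, as the #1892 commits describe for the transformer backend.
from threading import Thread

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, TextIteratorStreamer

model_id = "gpt2"  # placeholder; any causal LM id works
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR at load time.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("OpenVINO makes inference", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                skip_special_tokens=True)

# generate() blocks, so it runs in a thread while we consume the streamer.
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=32))
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```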