
feat: Add Bitsandbytes quantization for transformer backend #1775

Closed
fakezeta opened this issue Feb 29, 2024 · 3 comments · Fixed by #1823
Labels
enhancement (New feature or request)

Comments

@fakezeta
Collaborator

Is your feature request related to a problem? Please describe.

Quantization is not available for the transformer backend.
Describe the solution you'd like

Add bitsandbytes 4-bit quantization, triggered by the user with the low_vram flag in the model definition.
Additionally, I propose using the f16 flag to change the compute_dtype to bfloat16 for better performance on Nvidia cards.
Describe alternatives you've considered

Additional context

I've implemented this while fixing #1774; this issue is opened for tracking.
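
For illustration, a minimal sketch of the proposed mapping, assuming the backend translates the low_vram and f16 model flags into a Hugging Face BitsAndBytesConfig (the load_model helper and its signature are hypothetical, not the backend's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str, low_vram: bool, f16: bool):
    """Load a causal LM, optionally 4-bit quantized via bitsandbytes."""
    quant_config = None
    if low_vram:
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            # Per the proposal, f16 switches the compute dtype to bfloat16,
            # which is faster on recent (Ampere+) Nvidia cards.
            bnb_4bit_compute_dtype=torch.bfloat16 if f16 else torch.float32,
        )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,  # None means no quantization
        device_map="auto",
    )
```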

fakezeta added the enhancement label Feb 29, 2024
@mudler
Owner

mudler commented Feb 29, 2024


Good point @fakezeta! Do you already have the changes and want to come up with a PR? Maybe we can take it from there.

@fakezeta
Collaborator Author

You know I'm old, give me some time 😄
I think I'll do it tonight (CET).

@mudler
Owner

mudler commented Feb 29, 2024

That's fine, actually, thanks for the efforts!

mudler pushed a commit that referenced this issue Mar 14, 2024
#1775 and fix: Transformer backend error on CUDA #1774 (#1823)

* fixes #1775 and #1774

Add BitsAndBytes quantization and fix embeddings on CUDA devices

* Manage 4-bit and 8-bit quantization

Manage the different BitsAndBytes options with the quantization: parameter in the model YAML

* fix compilation errors in non-CUDA environments
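
As a hedged sketch of how that commit's quantization: key could be used in a model definition (the bnb_4bit / bnb_8bit values and the surrounding fields are assumptions inferred from the commit text, not verified against the shipped backend):

```yaml
# Hypothetical model definition using the option added in #1823.
name: mistral-quantized
backend: transformers
parameters:
  model: mistralai/Mistral-7B-Instruct-v0.2  # placeholder model id
# Assumed values for the BitsAndBytes modes the commit mentions:
quantization: bnb_4bit   # or bnb_8bit
f16: true                # per the proposal, switches compute_dtype to bfloat16
```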
mudler added a commit that referenced this issue Mar 26, 2024
…for Openvino and CUDA (#1892)

* OpenVINO draft

First draft of the OpenVINO integration in the transformer backend

* first working implementation

* Streaming working

* Small fix for regression on CUDA and XPU

* use pip version of optimum[openvino]

* Update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
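
For flavor, a minimal sketch of the OpenVINO path these commits describe, assuming the backend loads models through optimum[openvino] and streams tokens with transformers' TextIteratorStreamer (the model id and generation settings are placeholders, not the backend's actual code):

```python
# Sketch only: load a causal LM through optimum[openvino] and stream
# generated tokens, as the #1892 commits describe for the transformer backend.
from threading import Thread

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, TextIteratorStreamer

model_id = "gpt2"  # placeholder; any causal LM id works
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR at load time.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("OpenVINO makes inference", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                skip_special_tokens=True)

# generate() blocks, so it runs in a thread while we consume the streamer.
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=32))
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```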