
Transformer backend error on CUDA #1774

Closed
fakezeta opened this issue Feb 29, 2024 · 0 comments
Labels
bug (Something isn't working), unconfirmed

Comments

@fakezeta (Collaborator)

LocalAI version:

quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Environment, CPU architecture, OS, and Version:

Windows 11, Docker 25.03 with WSL2 backend
Kernel Version: 5.15.133.1-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.62GiB
GPU: NVIDIA 3060 Ti, 8 GB VRAM

Describe the bug

Running intfloat/multilingual-e5-base with the transformers backend and `cuda: true` fails with `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)` in the logs.
To Reproduce

Request an embedding from AnythingLLM with the following model configuration:

```yaml
name: text-embedding-ada-002
backend: transformers
cuda: true
embeddings: true
low_vram: true
f16: true
device: cuda:0
parameters:
  model: intfloat/multilingual-e5-base
```

Expected behavior

An embedding is generated.
Logs

8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Server started. Listening on: 127.0.0.1:46411
8:26AM DBG GRPC Service Ready
8:26AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:intfloat/multilingual-e5-base ContextSize:0 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:true NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:12 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/intfloat/multilingual-e5-base Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Loading model intfloat/multilingual-e5-base to CUDA.
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Traceback (most recent call last):
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     response_or_iterator = behavior(argument, context)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/build/backend/python/transformers/transformers_server.py", line 112, in Embedding
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     model_output = self.model(**encoded_input)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return self._call_impl(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 830, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     embedding_output = self.embeddings(
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                        ^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return self._call_impl(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 126, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     inputs_embeds = self.word_embeddings(input_ids)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return self._call_impl(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return F.embedding(
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Additional context

I've implemented a fix locally and opened this issue to track it.
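For context on what such a fix typically looks like: the error means the model weights were moved to CUDA while the tokenized input tensors stayed on the CPU. Below is a minimal sketch of the general device-alignment pattern, not the actual patch to transformers_server.py; `forward_on_model_device` is a hypothetical helper name, and the stand-in model keeps the example runnable on CPU.

```python
import torch

def forward_on_model_device(model, encoded_input):
    """Move every input tensor to the device the model's weights live on,
    then run the forward pass (hypothetical helper, for illustration)."""
    device = next(model.parameters()).device  # e.g. cuda:0 when loaded with CUDA
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    with torch.no_grad():
        return model(encoded_input["input_ids"])

# Tiny stand-in model; the same pattern applies when the real model sits on cuda:0.
model = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)
output = forward_on_model_device(model, {"input_ids": torch.tensor([[1, 2, 3]])})
```

The key point is calling `.to(device)` on the inputs (the output of the tokenizer) rather than only on the model, so `torch.embedding` sees index and weight tensors on the same device.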

@fakezeta added the bug (Something isn't working) and unconfirmed labels on Feb 29, 2024
mudler pushed a commit that referenced this issue Mar 14, 2024
#1775 and fix: Transformer backend error on CUDA #1774 (#1823)

* fixes #1775 and #1774

Add BitsAndBytes Quantization and fixes embedding on CUDA devices

* Manage 4bit and 8 bit quantization

Manage different BitsAndBytes options with the quantization: parameter in yaml

* fix compilation errors on non CUDA environment
mudler added a commit that referenced this issue Mar 26, 2024
…for Openvino and CUDA (#1892)

* fixes #1775 and #1774

Add BitsAndBytes Quantization and fixes embedding on CUDA devices

* Manage 4bit and 8 bit quantization

Manage different BitsAndBytes options with the quantization: parameter in yaml

* fix compilation errors on non CUDA environment

* OpenVINO draft

First draft of OpenVINO integration in transformer backend

* first working implementation

* Streaming working

* Small fix for regression on CUDA and XPU

* use pip version of optimum[openvino]

* Update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>