
After installing backend cuda12-bark, granite-embedding-107m-multilingual does not work anymore. #7222

@greygoo

Description

LocalAI version:

LocalAI used within LocalAGI

  • localai image:
    localai/localai:master-gpu-nvidia-cuda-12
  • localagi commit:
    commit c7d1b5834072208d6e6c72660d36d0fb50c7ee92

Environment, CPU architecture, OS, and Version:

Linux ai 6.15.5-1-default #1 SMP PREEMPT_DYNAMIC Sun Jul 6 18:09:53 UTC 2025 (478c062) x86_64 x86_64 x86_64 GNU/Linux
Docker execution
1x NVIDIA RTX 4060, 16 GB VRAM

Describe the bug

This problem occurred within a LocalAGI setup, but since it appears to be purely LocalAI-related, the bug is filed here.
After installing the cuda12-bark backend, granite-embedding-107m-multilingual apparently gets loaded with it. For some reason this does not fail outright; instead, loading gets stuck at "Preparing models, please wait, see logs for details".

To Reproduce

  • Install the cuda12-bark backend
  • Restart LocalAI
  • Use granite-embedding-107m-multilingual
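The last step can be triggered with a plain embeddings request; this sketch assumes LocalAI's OpenAI-compatible endpoint on the default port 8080 (adjust host/port to your setup):

```
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-embedding-107m-multilingual", "input": "test"}'
```

With the cuda12-bark backend installed, this request never returns and the logs stop at "Preparing models, please wait".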

Expected behavior

granite-embedding-107m-multilingual should not be tried against the bark backends, or should at least fail fast when it is, since the hang blocks LocalAGI from starting.
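As a possible workaround (untested here), pinning the backend in the model's YAML config should stop LocalAI from guessing through the installed backends. This is a sketch assuming LocalAI's standard model configuration fields; the filename and path may differ in your install:

```yaml
# /models/granite-embedding-107m-multilingual.yaml
# Hypothetical example: pin the backend explicitly so the
# loader never falls through to kitten-tts or cuda12-bark.
name: granite-embedding-107m-multilingual
backend: llama-cpp
embeddings: true
parameters:
  model: granite-embedding-107m-multilingual-f16.gguf
```

Even if this works around the hang, the underlying bug remains: the backend auto-selection should not treat a TTS backend as a candidate for an embedding model.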

Logs

localai-1                  | 3:40PM INF Trying to load the model 'granite-embedding-107m-multilingual' with the backend '[kitten-tts cuda12-bark chatterbox cuda12-llama-cpp cuda12-chatterbox cuda12-stablediffusion-ggml llama-cpp piper bark transformers cuda12-whisper whisper stablediffusion-ggml cuda12-transformers bark-cpp]'
localai-1                  | 3:40PM INF [kitten-tts] Attempting to load
localai-1                  | 3:40PM INF BackendLoader starting backend=kitten-tts modelID=granite-embedding-107m-multilingual o.model=granite-embedding-107m-multilingual-f16.gguf

...
<loading with kitten-tts fails>
...

localai-1                  | 3:40PM ERR Failed to load model granite-embedding-107m-multilingual with backend kitten-tts error="failed to load model with internal loader: could not load model (no success): Unexpected err=RepositoryNotFoundError('401 Client Error. (Request ID: Root=1-69120765-0feeb37c7743c58b37723adb;71825d50-f775-42b6-8c28-d9b1e469a3f2)\\n\\nRepository Not Found for url: https://huggingface.co/KittenML/granite-embedding-107m-multilingual-f16.gguf/resolve/main/config.json.\\nPlease make sure you specified the correct `repo_id` and `repo_type`.\\nIf you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\\nInvalid username or password.'), type(err)=<class 'huggingface_hub.errors.RepositoryNotFoundError'>" modelID=granite-embedding-107m-multilingual
localai-1                  | 3:40PM INF [kitten-tts] Fails: failed to load model with internal loader: could not load model (no success): Unexpected err=RepositoryNotFoundError('401 Client Error. (Request ID: Root=1-69120765-0feeb37c7743c58b37723adb;71825d50-f775-42b6-8c28-d9b1e469a3f2)\n\nRepository Not Found for url: https://huggingface.co/KittenML/granite-embedding-107m-multilingual-f16.gguf/resolve/main/config.json.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\nInvalid username or password.'), type(err)=<class 'huggingface_hub.errors.RepositoryNotFoundError'>
localai-1                  | 3:40PM INF [cuda12-bark] Attempting to load
localai-1                  | 3:40PM INF BackendLoader starting backend=cuda12-bark modelID=granite-embedding-107m-multilingual o.model=granite-embedding-107m-multilingual-f16.gguf
localai-1                  | 3:40PM DBG Loading model in memory from file: /models/granite-embedding-107m-multilingual-f16.gguf
localai-1                  | 3:40PM DBG Loading Model granite-embedding-107m-multilingual with gRPC (file: /models/granite-embedding-107m-multilingual-f16.gguf) (backend: cuda12-bark): {backendString:cuda12-bark model:granite-embedding-107m-multilingual-f16.gguf modelID:granite-embedding-107m-multilingual context:{emptyCtx:{}} gRPCOptions:0xc0005c3508 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
localai-1                  | 3:40PM DBG Loading external backend: /backends/cuda12-bark/run.sh
localai-1                  | 3:40PM DBG external backend is file: &{name:run.sh size:191 mode:493 modTime:{wall:0 ext:63897547571 loc:0x4b5a2c0} sys:{Dev:66309 Ino:244072010 Nlink:1 Mode:33261 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:191 Blksize:4096 Blocks:8 Atim:{Sec:1762167001 Nsec:791695205} Mtim:{Sec:1761950771 Nsec:0} Ctim:{Sec:1762141620 Nsec:653929949} X__unused:[0 0 0]}}
localai-1                  | 3:40PM DBG Loading GRPC Process: /backends/cuda12-bark/run.sh
localai-1                  | 3:40PM DBG GRPC Service for granite-embedding-107m-multilingual will be running at: '127.0.0.1:46417'
localai-1                  | 3:40PM DBG GRPC Service state dir: /tmp/go-processmanager336956458
localai-1                  | 3:40PM DBG GRPC Service Started
localai-1                  | 3:40PM DBG Wait for the service to start up
localai-1                  | 3:40PM DBG Options: ContextSize:512  Seed:3483905  NBatch:512  MMap:true  Embeddings:true  NGPULayers:99999999  Threads:8  FlashAttention:"auto"  Options:"gpu"  Options:"use_jinja:true"
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stdout Initializing libbackend for cuda12-bark
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stdout Using portable Python
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr /backends/cuda12-bark/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr   warnings.warn(
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr Server started. Listening on: 127.0.0.1:46417
localai-1                  | 3:40PM DBG GRPC Service Ready
localai-1                  | 3:40PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc000839958} sizeCache:0 unknownFields:[] Model:granite-embedding-107m-multilingual-f16.gguf ContextSize:512 Seed:3483905 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:true NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/granite-embedding-107m-multilingual-f16.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath://models LoraAdapters:[] LoraScales:[] Options:[gpu use_jinja:true] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
localai-1                  | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr Preparing models, please wait

Five minutes later, nothing but healthcheck logs could be seen.

Additional context
