Description
LocalAI version:
LocalAI used within LocalAGI
- localai image: localai/localai:master-gpu-nvidia-cuda-12
- localagi commit: c7d1b5834072208d6e6c72660d36d0fb50c7ee92
Environment, CPU architecture, OS, and Version:
Linux ai 6.15.5-1-default #1 SMP PREEMPT_DYNAMIC Sun Jul 6 18:09:53 UTC 2025 (478c062) x86_64 x86_64 x86_64 GNU/Linux
Docker execution
1x NVIDIA RTX 4060, 16 GB
Describe the bug
This problem occurred within a LocalAGI setup, but since it appears to be purely LocalAI-related, the bug is filed here.
After installing the cuda12-bark backend, granite-embedding-107m-multilingual apparently gets run with it. For some reason this does not fail outright; instead it gets stuck at "Preparing models, please wait, see logs for details".
To Reproduce
- Install cuda12-bark
- Restart LocalAI
- use granite-embedding-107m-multilingual
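For reference, the embedding request that triggers the hang can be issued directly against the OpenAI-compatible endpoint (the port and request shape below are assumptions based on a default LocalAI deployment; adjust to your setup):

```shell
# Request an embedding from the affected model; with cuda12-bark installed,
# this hangs at "Preparing models, please wait" instead of returning a vector.
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-embedding-107m-multilingual",
        "input": "test sentence"
      }'
```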
Expected behavior
granite-embedding-107m-multilingual should not be loaded with the bark backends, or should at least fail fast when attempting to do so; hanging here blocks LocalAGI from starting.
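As a possible workaround until the backend guessing is fixed, pinning the backend in the model's YAML config should keep the loader from trying the TTS backends at all. This is a sketch assuming a standard LocalAI model config; the filename is taken from the logs below, and the exact backend name to pin may differ on the GPU image:

```yaml
# /models/granite-embedding-107m-multilingual.yaml
# Pin the backend explicitly so LocalAI does not attempt TTS backends
# (kitten-tts, cuda12-bark, ...) for this embedding model.
name: granite-embedding-107m-multilingual
backend: llama-cpp          # assumption: cuda12-llama-cpp may be needed on the GPU image
embeddings: true
parameters:
  model: granite-embedding-107m-multilingual-f16.gguf
```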
Logs
localai-1 | 3:40PM INF Trying to load the model 'granite-embedding-107m-multilingual' with the backend '[kitten-tts cuda12-bark chatterbox cuda12-llama-cpp cuda12-chatterbox cuda12-stablediffusion-ggml llama-cpp piper bark transformers cud
a12-whisper whisper stablediffusion-ggml cuda12-transformers bark-cpp]'
localai-1 | 3:40PM INF [kitten-tts] Attempting to load
localai-1 | 3:40PM INF BackendLoader starting backend=kitten-tts modelID=granite-embedding-107m-multilingual o.model=granite-embedding-107m-multilingual-f16.gguf
...
<loading with kitten-tts fails>
...
localai-1 | 3:40PM ERR Failed to load model granite-embedding-107m-multilingual with backend kitten-tts error="failed to load model with internal loader: could not load model (no success): Unexpected err=RepositoryNotFoundError('401 Client
Error. (Request ID: Root=1-69120765-0feeb37c7743c58b37723adb;71825d50-f775-42b6-8c28-d9b1e469a3f2)\\n\\nRepository Not Found for url: https://huggingface.co/KittenML/granite-embedding-107m-multilingual-f16.gguf/resolve/main/config.json.\\nPlease make sure
you specified the correct `repo_id` and `repo_type`.\\nIf you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\\nInvalid username or password.')
, type(err)=<class 'huggingface_hub.errors.RepositoryNotFoundError'>" modelID=granite-embedding-107m-multilingual
localai-1 | 3:40PM INF [kitten-tts] Fails: failed to load model with internal loader: could not load model (no success): Unexpected err=RepositoryNotFoundError('401 Client Error. (Request ID: Root=1-69120765-0feeb37c7743c58b37723adb;71825d
50-f775-42b6-8c28-d9b1e469a3f2)\n\nRepository Not Found for url: https://huggingface.co/KittenML/granite-embedding-107m-multilingual-f16.gguf/resolve/main/config.json.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are tryin
g to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication\nInvalid username or password.'), type(err)=<class 'huggingface_hub.errors.RepositoryNotFoundError'>
localai-1 | 3:40PM INF [cuda12-bark] Attempting to load
localai-1 | 3:40PM INF BackendLoader starting backend=cuda12-bark modelID=granite-embedding-107m-multilingual o.model=granite-embedding-107m-multilingual-f16.gguf
localai-1 | 3:40PM DBG Loading model in memory from file: /models/granite-embedding-107m-multilingual-f16.gguf
localai-1 | 3:40PM DBG Loading Model granite-embedding-107m-multilingual with gRPC (file: /models/granite-embedding-107m-multilingual-f16.gguf) (backend: cuda12-bark): {backendString:cuda12-bark model:granite-embedding-107m-multilingual-f1
6.gguf modelID:granite-embedding-107m-multilingual context:{emptyCtx:{}} gRPCOptions:0xc0005c3508 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
localai-1 | 3:40PM DBG Loading external backend: /backends/cuda12-bark/run.sh
localai-1 | 3:40PM DBG external backend is file: &{name:run.sh size:191 mode:493 modTime:{wall:0 ext:63897547571 loc:0x4b5a2c0} sys:{Dev:66309 Ino:244072010 Nlink:1 Mode:33261 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:191 Blksize:4096 Blocks:8 Ati
m:{Sec:1762167001 Nsec:791695205} Mtim:{Sec:1761950771 Nsec:0} Ctim:{Sec:1762141620 Nsec:653929949} X__unused:[0 0 0]}}
localai-1 | 3:40PM DBG Loading GRPC Process: /backends/cuda12-bark/run.sh
localai-1 | 3:40PM DBG GRPC Service for granite-embedding-107m-multilingual will be running at: '127.0.0.1:46417'
localai-1 | 3:40PM DBG GRPC Service state dir: /tmp/go-processmanager336956458
localai-1 | 3:40PM DBG GRPC Service Started
localai-1 | 3:40PM DBG Wait for the service to start up
localai-1 | 3:40PM DBG Options: ContextSize:512 Seed:3483905 NBatch:512 MMap:true Embeddings:true NGPULayers:99999999 Threads:8 FlashAttention:"auto" Options:"gpu" Options:"use_jinja:true"
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stdout Initializing libbackend for cuda12-bark
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stdout Using portable Python
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr /backends/cuda12-bark/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and wil
l be removed in v5 of Transformers. Use `HF_HOME` instead.
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr warnings.warn(
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr Server started. Listening on: 127.0.0.1:46417
localai-1 | 3:40PM DBG GRPC Service Ready
localai-1 | 3:40PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc000839958} sizeCache:0 unknownFields:[] Model:granite-embedding-107m-multilingual-f16.gguf ContextSize:
512 Seed:3483905 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:true NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/granite-embe
dding-107m-multilingual-f16.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryU
tilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFa
ctor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath://models LoraAdapters:[] LoraScales:[] Options:[gpu use_jinja:true] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
localai-1 | 3:40PM DBG GRPC(granite-embedding-107m-multilingual-127.0.0.1:46417): stderr Preparing models, please wait
Five minutes later, nothing but healthcheck log entries could be seen.
Additional context