could not load model - all backends returned error #1037

Open
aaron13100 opened this issue Sep 11, 2023 · 3 comments
Assignees
Labels
bug Something isn't working need-more-information

Comments

@aaron13100

LocalAI version:

According to git, the last commit is from Sun Sep 3 02:38:52 2023 -0700 and says "added Linux Mint".

Environment, CPU architecture, OS, and Version:

Linux instance-7 6.2.0-1013-gcp #13-Ubuntu SMP Tue Aug 29 23:07:20 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
I gave the VM 8 cores and 64 GB of RAM. Ubuntu 23.04.

Describe the bug

To Reproduce

I tried to specify the model from https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main. The model does appear via curl http://localhost:8080/models/available and starts downloading that way, but the download didn't complete, so I downloaded the file separately and placed it in the /models directory.

I then used

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-70b-chat.ggmlv3.q5_K_M.bin",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

but get an error instead of a response. I also tried

  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q5_K_M.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

and

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "huggingface@TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q5_K_M.bin"
   }'  
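
(Since the gallery download never finished, a quick sanity check on the separately downloaded file may be worth including; the host path and the local-ai container name below are only placeholders:)

# Check that the file is complete and visible where LocalAI expects it.
ls -lh /models/llama-2-70b-chat.ggmlv3.q5_K_M.bin
sha256sum /models/llama-2-70b-chat.ggmlv3.q5_K_M.bin    # compare against the checksum shown on the Hugging Face file page
docker exec local-ai ls -lh /build/models               # confirm the container sees the same file and size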

Expected behavior

Some kind of answer from the model and a non-error message.

Logs

Client side:

{"error":{"code":500,"message":"could not load model - all backends returned error: 24 errors occurred:
	* could not load model: rpc error: code = Unknown desc = failed loading model
	* could not load model: rpc error: code = Unknown desc = failed loading model
	(...repeats 14 times...)
	* could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = unsupported model type /build/models/llama-2-70b-chat

The file does exist. I added symbolic links at build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin and /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin, and the errors at the end changed a bit:

	* could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = unsupported model type /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin (should end with .onnx)
	* backend unsupported: /build/extra/grpc/huggingface/huggingface.py
	* backend unsupported: /build/extra/grpc/autogptq/autogptq.py
	* backend unsupported: /build/extra/grpc/bark/ttsbark.py
	* backend unsupported: /build/extra/grpc/diffusers/backend_diffusers.py
	* backend unsupported: /build/extra/grpc/exllama/exllama.py
	* backend unsupported: /build/extra/grpc/vall-e-x/ttsvalle.py
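
(Note: if LocalAI is running in Docker, symlinks created on the host can point outside the mounted volume and will not resolve inside the container. One way to check the path exactly as the backends see it; the local-ai container name is a placeholder:)

docker exec local-ai ls -lL /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin   # -L follows symlinks; an error here means the backends cannot see the file either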

Server side:
The log is quite long and I'm not sure what to include, but it looks like the server tries each backend in turn to load the model and they all fail.

Etc
Maybe there's a different file/format I'm supposed to use?
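
(For reference, LocalAI also supports a per-model YAML config that pins a specific backend instead of letting it try all of them. A minimal sketch, assuming the ggml llama backend and illustrative values:)

cat > /models/llama-2-70b-chat.yaml <<'EOF'
name: llama-2-70b-chat            # the name to use in the "model" field of API requests
backend: llama                    # skip the backend round-robin
context_size: 4096
parameters:
  model: llama-2-70b-chat.ggmlv3.q5_K_M.bin
  temperature: 0.9
EOF

With that in place, requests can reference "model": "llama-2-70b-chat" instead of the raw filename.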

It does load and run the example from the docs, wizardlm-13b-v1.0-superhot-8k.ggmlv3.q4_K_M.bin.
Thanks.

@aaron13100 aaron13100 added the bug Something isn't working label Sep 11, 2023
@Mafyuh

Mafyuh commented Sep 13, 2023

I'm having the same problem on Ubuntu 23.04. Same exact issue where the gallery download didn't fetch the model, and I'm getting all the same rpc errors as you. I disabled ufw and reloaded the container; the model loaded and is receiving requests, but it doesn't respond to anything. This happens even with the ggml-gpt4all-j model from the getting-started docs. I have tried multiple llama-2-7b-chat.ggmlv3 models as well, all with the same result.

Here are the logs from when I managed to get gpt4all-j loaded but it didn't respond to any requests, along with some of the rpc errors:

[gptneox] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
1:26AM DBG [bert-embeddings] Attempting to load
1:26AM DBG Loading model bert-embeddings from ggml-gpt4all-j
1:26AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
1:26AM DBG Loading GRPC Model bert-embeddings: {backendString:bert-embeddings model:ggml-gpt4all-j threads:2 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000020180 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
1:26AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/bert-embeddings
1:26AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:40785'
1:26AM DBG GRPC Service state dir: /tmp/go-processmanager2855979786
1:26AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40785: connect: connection refused"
1:26AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:40785): stderr 2023/09/13 01:26:42 gRPC Server listening at 127.0.0.1:40785
1:26AM DBG GRPC Service Ready
1:26AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:2 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false AudioPath:}
[127.0.0.1]:33128  200  -  GET      /readyz
[127.0.0.1]:55076  200  -  GET      /readyz
[127.0.0.1]:39038  200  -  GET      /readyz
[127.0.0.1]:41112  200  -  GET      /readyz
1:30AM DBG Request received: 
1:30AM DBG Configuration read: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:2 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
1:30AM DBG Parameters: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:2 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
1:30AM DBG Prompt (before templating): How are you?
[127.0.0.1]:57584  200  -  GET      /readyz

EDIT: It seems to be something with Ubuntu/Linux. The exact same setup was followed on Windows 11 and it runs fine with the same model (llama-2-7b-chat.ggmlv3.q4_K_M.bin), the same GPU, and the same install steps.
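
(Since disabling ufw changed the behavior, it may be worth checking that nothing blocks the loopback gRPC ports the backends listen on; a rough sketch:)

sudo ufw status verbose            # is the firewall active, and what does it filter?
ss -ltnp | grep 127.0.0.1          # the backend gRPC servers bind to random loopback ports (e.g. 127.0.0.1:40785 in the log above)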

@Aisuko
Collaborator

Aisuko commented Oct 2, 2023

Hi, guys. Thanks for your feedback. @aaron13100, the issue may be that the model file is incomplete. I saw that the service cannot load the model (llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory); maybe you downloaded it to the correct path, but it was not loaded into memory correctly. The other log lines also complain about the expected model format.

I suggest you use an easy example from the gallery for a first test. Make sure everything works, and then try custom models.
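
For example, something along these lines against a model that is already known to load, such as ggml-gpt4all-j from the getting-started docs (parameters are only illustrative):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'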

@Mafyuh, from your log everything looks fine. Are you using a GPU on Ubuntu? If CPU only, how long did you wait for the request?
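
(On CPU only, the first response from a 7B/13B ggml model can take a while, so timing the request helps tell a hang from a slow generation; a sketch using the same request as in the log above:)

time curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'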

@Aisuko Aisuko assigned Aisuko and unassigned mudler Oct 2, 2023
@Aisuko
Collaborator

Aisuko commented Oct 2, 2023

As for the content of the log: I know it may be a little confusing. Here is a related issue: #1076.
