only 4 threads are used #498
Comments
good catch, thanks for filing an issue! That looks like a regression; somehow the patch got lost when upstreaming the binding. I've opened a PR upstream too: nomic-ai/gpt4all#836
Is your branch itself supposed to be runnable? I built a new version with it, but unfortunately I get an error that it cannot load the model.
thanks for the heads up, I've included a fix for this issue in #507
thank you, the changes work for me, but unfortunately the output is still extremely slow. If I run the same model on the same device with the software from gpt4all, I get a near-instant response (about 3 seconds); with LocalAI it takes 1 minute. That is a huge difference.
You shouldn't overbook threads, but rather match the number of physical cores. Here I get 120 ms per token, but my hardware isn't very capable either.
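For reference, a quick way to check the physical core count on Linux (as opposed to the logical CPU count reported by `nproc`) is the one-liner below; this is a generic sketch, not something from the thread:

```sh
# lscpu -p emits one line per logical CPU; deduplicating on the
# Core,Socket pair leaves one line per physical core
lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
```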
I have also tried it with 12 cores (the actual number of physical cores), but the result is the same: it takes about 1 minute until even the first word arrives.
I run it via Docker, and something seems to go wrong: it doesn't look like anything is loaded into memory. When I make a request, nothing really happens for 1 minute, and then the first tokens arrive.
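A minimal way to check whether the model is actually resident in memory is to watch the container while issuing a request (the container name `local-ai` is an assumption here):

```sh
# Resident memory should jump by roughly the model file size
# once the model has been loaded
docker stats --no-stream local-ai
```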
I get the same tensor error.
LocalAI version:
latest docker image
Environment, CPU architecture, OS, and Version:
Ryzen 9 3900X -> 12 cores / 24 threads
Windows 10 -> WSL2 (5.15.90.1-microsoft-standard-WSL2), Docker
Describe the bug
I have the model ggml-gpt4all-l13b-snoozy.bin, but only a maximum of 4 threads are used.
docker-compose.yml
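(The attached file itself is not preserved here. As an illustrative sketch only, a minimal LocalAI docker-compose setup that pins the thread count might look like this; the image tag and `THREADS` value are assumptions, not the reporter's actual file:)

```yaml
version: '3.6'
services:
  api:
    image: quay.io/go-skynet/local-ai:latest
    ports:
      - 8080:8080
    environment:
      # assumption: match the 12 physical cores of the 3900X
      - THREADS=12
      - MODELS_PATH=/models
    volumes:
      - ./models:/models
```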
model file
models/gpt-3.5-turbo.yaml
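(This attachment is also not preserved. A hedged sketch of a model definition that overrides the thread count, with all values illustrative:)

```yaml
name: gpt-3.5-turbo
parameters:
  model: ggml-gpt4all-l13b-snoozy.bin
  temperature: 0.2
# assumption: per-model override of the thread count
threads: 12
context_size: 1024
```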
template
e.g. command
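(For example, a request against LocalAI's OpenAI-compatible endpoint, assuming the default port 8080 and the model name above:)

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "How are you?"}]}'
```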
thanks