only 4 threads are used #498

Closed
badsmoke opened this issue Jun 4, 2023 · 9 comments · Fixed by #503
Labels
bug Something isn't working

Comments

@badsmoke

badsmoke commented Jun 4, 2023

LocalAI version:
latest Docker image

Environment, CPU architecture, OS, and Version:
Ryzen 9 3900X -> 12 cores / 24 threads
Windows 10 -> WSL2 (5.15.90.1-microsoft-standard-WSL2), Docker

Describe the bug
I have the model ggml-gpt4all-l13b-snoozy.bin, but only a maximum of 4 threads are used.

docker-compose.yml

version: "3.9"
services:
  localai:
    image: quay.io/go-skynet/local-ai:latest
    volumes:
      - ./models:/models
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "8080:8080"
    command:
       --models-path /models
       --context-size 1024
       --threads 23
       --debug

model file
models/gpt-3.5-turbo.yaml

name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: ggml-gpt4all-l13b-snoozy.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..

# Default context size
context_size: 512
threads: 23
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
#backend: gptj # available: llama, stablelm, gpt2, gptj rwkv
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: ggml-gpt4all-l13b-snoozy

template

The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
{{.Input}}
### Response:

Example command

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-l13b-snoozy.bin",
     "messages": [{"role": "user", "content": "What is Kubernetes?"}],
     "temperature": 0.7
   }'
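
For reference, one way to confirm how many threads are actually busy while a request is running (a rough sketch; the container name localai_1 and the assumption that the server runs as PID 1 inside it are taken from the log prefixes further down and may need adjusting):

# live CPU usage of the container; ~400% CPU would correspond to roughly 4 busy threads,
# while 23 busy threads should show well over 2000%
docker stats localai_1

# total thread count of the server process, assuming it runs as PID 1 in the container
# and the image ships a shell
docker exec localai_1 sh -c 'grep Threads /proc/1/status'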

thanks

@badsmoke badsmoke added the bug Something isn't working label Jun 4, 2023
@mudler
Owner

mudler commented Jun 4, 2023

good catch, thanks for filing an issue!

that looks like a regression; somehow the patch got lost when upstreaming the binding. I've opened a PR upstream too: nomic-ai/gpt4all#836

@mudler mudler linked a pull request Jun 4, 2023 that will close this issue
@mudler mudler mentioned this issue Jun 4, 2023
@badsmoke
Author

badsmoke commented Jun 5, 2023

should your branch already be usable by itself?

I built a new version with it, but unfortunately I get an error that it cannot load the model

localai_1  | 9:00AM DBG Request received: {"model":"gpt-3.5-turbo","file":"","language":"","response_format":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"system","content":"You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown."},{"role":"user","content":"hey"}],"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0.5,"max_tokens":1000,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
localai_1  | 9:00AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-gpt4all-l13b-snoozy.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0 TopK:0 Temperature:0.5 Maxtokens:1000 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name:gpt-3.5-turbo StopWords:[HUMAN: ### Response:] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:false Threads:20 Debug:true Roles:map[system:GPT: user:HUMAN:] Embeddings:false Backend: TemplateConfig:{Completion:completion Chat:ggml-gpt4all-l13b-snoozy Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptStrings:[] InputStrings:[] InputToken:[]}
localai_1  | 9:00AM DBG Stream request received
localai_1  | 9:00AM DBG Template found, input modified to: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
localai_1  | ### Prompt:
localai_1  | GPT: You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown.
localai_1  | HUMAN: hey
localai_1  | ### Response:
localai_1  | 
localai_1  | [172.31.48.1]:62236  200  -  POST     /v1/chat/completions
localai_1  | 9:00AM DBG Loading model 'ggml-gpt4all-l13b-snoozy.bin' greedly
localai_1  | 9:00AM DBG [llama] Attempting to load
localai_1  | 9:00AM DBG Loading model llama from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Sending chunk: {"object":"chat.completion.chunk","model":"gpt-3.5-turbo","choices":[{"delta":{"role":"assistant"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
localai_1  | 
localai_1  | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                           <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                             <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1  | llama_init_from_file: failed to load model
localai_1  | 9:00AM DBG [llama] Fails: failed loading model
localai_1  | 9:00AM DBG [gpt4all] Attempting to load
localai_1  | 9:00AM DBG Loading model gpt4all from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                           <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                             <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1  | llama_init_from_file: failed to load model
localai_1  | LLAMA ERROR: failed to load model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG [gpt4all] Fails: failed loading model
localai_1  | 9:00AM DBG [gptneox] Attempting to load
localai_1  | 9:00AM DBG Loading model gptneox from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | gpt_neox_model_load: invalid model file '/models/ggml-gpt4all-l13b-snoozy.bin' (bad magic)
localai_1  | gpt_neox_bootstrap: failed to load model from '/models/ggml-gpt4all-l13b-snoozy.bin

@mudler
Owner

mudler commented Jun 5, 2023

thanks for the heads up, I've included a fix for this issue in #507

@badsmoke
Author

badsmoke commented Jun 5, 2023

thank you, the changes work for me.

but unfortunately the output is still extremely slow.

if I run the same model on the same device with the GPT4All software, I get a response almost instantly (about 3 seconds); with LocalAI it takes about 1 minute. That is a huge difference.
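
One way to make the comparison reproducible is to time the exact same prompt against the API (a sketch, reusing the endpoint and the gpt-3.5-turbo config from above):

time curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "gpt-3.5-turbo",
          "messages": [{"role": "user", "content": "What is Kubernetes?"}],
          "temperature": 0.7
        }' > /dev/null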

@mudler
Owner

mudler commented Jun 5, 2023

You shouldn't overbook threads, but rather match the number of physical cores. Here I get about 120 ms per token, but my hardware isn't very capable either.
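
For reference, one way to count the physical cores on Linux/WSL and mirror that value into the config (a sketch; lscpu ships with util-linux and should be available in the WSL distribution):

# unique CORE,SOCKET pairs = physical cores (SMT siblings share the same pair)
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
# -> 12 on a Ryzen 9 3900X; use that value for --threads in docker-compose.yml
#    and for "threads:" in models/gpt-3.5-turbo.yaml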

@badsmoke
Author

badsmoke commented Jun 5, 2023

I have also tried it with 12 threads (the actual number of physical cores), but the result is the same: about 1 minute until even the first word arrives

@mudler
Owner

mudler commented Jun 5, 2023

here with 8 threads, ggml-gpt4all-l13b-snoozy.bin takes about 15 s to reply to "How are you?" (note that my CPU was busy doing other things in the meantime):

[screen recording: Peek 2023-06-05 13-45]

also note that the model will be loaded into memory at the first usage, so it will be slightly slower on the first call. Where are you running LocalAI?
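
A simple way to separate that one-time load from the actual inference speed is to fire the same request twice and only look at the second timing (a sketch, same endpoint and model name as above):

for i in 1 2; do
  time curl -s http://localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "How are you?"}]}' \
       > /dev/null
done
# the first run includes reading the multi-GB model file from disk;
# the second run reflects pure prompt processing and generation speed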

@badsmoke
Author

badsmoke commented Jun 5, 2023

I run it via Docker; something seems to go wrong, it doesn't look like anything is loaded into memory.

when I make a request, nothing really happens for about 1 minute, and then the first tokens arrive (a quick way to check the container's memory is sketched after the log below)

localai_1    |  ┌───────────────────────────────────────────────────┐ 
localai_1    |  │                   Fiber v2.46.0                   │ 
localai_1    |  │               http://127.0.0.1:8080               │ 
localai_1    |  │       (bound on host 0.0.0.0 and port 8080)       │ 
localai_1    |  │                                                   │ 
localai_1    |  │ Handlers ............ 23  Processes ........... 1 │ 
localai_1    |  │ Prefork ....... Disabled  PID .............. 3890 │ 
localai_1    |  └───────────────────────────────────────────────────┘ 
localai_1    | 
localai_1    | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1    | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                             <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                               <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1    | llama_init_from_file: failed to load model
localai_1    | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1    | gptjllama_model_load_internal: format     = ggjt v1 (latest)
localai_1    | gptjllama_model_load_internal: n_vocab    = 32000
localai_1    | gptjllama_model_load_internal: n_ctx      = 2048
localai_1    | gptjllama_model_load_internal: n_embd     = 5120
localai_1    | gptjllama_model_load_internal: n_mult     = 256
localai_1    | gptjllama_model_load_internal: n_head     = 40
localai_1    | gptjllama_model_load_internal: n_layer    = 40
localai_1    | gptjllama_model_load_internal: n_rot      = 128
localai_1    | gptjllama_model_load_internal: ftype      = 2 (mostly Q4_0)
localai_1    | gptjllama_model_load_internal: n_ff       = 13824
localai_1    | gptjllama_model_load_internal: n_parts    = 1
localai_1    | gptjllama_model_load_internal: model size = 13B
localai_1    | gptjllama_model_load_internal: ggml ctx size =  73.73 KB
localai_1    | gptjllama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
localai_1    | gptjllama_init_from_file: kv self size  = 1600.00 MB
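
One quick, generic way to check whether the model actually ends up resident in memory (the container name is assumed from the log prefix; with docker-compose it may carry the project name as a prefix):

# snapshot of the container's memory use; after the first completed request it
# should sit close to the ~9.8 GB "mem required" reported in the log above
docker stats --no-stream localai_1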

@boixu

boixu commented Jun 5, 2023

I get the same tensor error
