Support for max_loaded_maps and num_parallel variables/parameter #11225

Open · jars101 opened this issue Jun 5, 2024 · 4 comments

jars101 commented Jun 5, 2024

It does not seem that Ollama running on ipex-llm supports the recent max_loaded_maps and num_parallel variables/parameters. Are they supported in the current Ollama version under llama-cpp? How does one enable them? Thanks.


sgwhat commented Jun 6, 2024

Hi @jars101,

  1. Ollama does not support max_loaded_maps.
  2. You may run the commands below to enable the num_parallel setting:
    export OLLAMA_NUM_PARALLEL=2
    ./ollama serve
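
A Windows Command Prompt equivalent (a sketch, assuming cmd's session-scoped set syntax in place of export) would be:

    set OLLAMA_NUM_PARALLEL=2
    ollama serve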

@sgwhat sgwhat self-assigned this Jun 6, 2024

jars101 commented Jun 6, 2024

Thank you @sgwhat. The most recent versions of Ollama do support both OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL on Linux and Windows. Running Ollama through ipex-llm does not seem to keep these settings, since every request against the same model reloads the model into memory. This behaviour does not occur with Ollama for Windows (standalone).
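
For reference, the configuration shown in the server log below could be set from the same Command Prompt before starting the server (a sketch; the values mirror the OLLAMA_MAX_LOADED_MODELS:4 and OLLAMA_NUM_PARALLEL:4 entries in the log):

    set OLLAMA_MAX_LOADED_MODELS=4
    set OLLAMA_NUM_PARALLEL=4
    ollama serve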

llm-cpp log snippet:

(base) C:\Users\Admin>conda activate llm-cpp

(llm-cpp) C:\Users\Admin>ollama serve
2024/06/05 04:57:07 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\Users\Admin\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-05T04:57:07.321-07:00 level=INFO source=images.go:704 msg="total blobs: 78"
time=2024-06-05T04:57:07.351-07:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-06-05T04:57:07.361-07:00 level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"


jars101 commented Jun 6, 2024

My bad, OLLAMA_NUM_PARALLEL does work, but OLLAMA_MAX_LOADED_MODELS does not. I went ahead and deployed a new installation of llm-cpp + Ollama, and I can now make use of both variables. ollama ps is available as well. The only problem I see, when setting OLLAMA_KEEP_ALIVE to 600 seconds for instance, is the following error:

INFO [print_timings] total time = 38167.02 ms | slot_id=0 t_prompt_processing=2029.651 t_token_generation=36137.372 t_total=38167.023 task_id=3 tid="3404" timestamp=1717656974
[GIN] 2024/06/05 - 23:56:14 | 200 | 47.9829961s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:16685, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:16685
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[GIN] 2024/06/05 - 23:58:18 | 200 | 4.1697536s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17384, func:operator()
SYCL error: CHECK_TRY_ERROR(g_syclStreams[sycl_ctx->device][0]->memcpy( (char *)tensor->data + offset, data, size).wait()): Meet error in this line code!
  in function ggml_backend_sycl_set_tensor_async at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17384
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

After removing OLLAMA_KEEP_ALIVE and letting it fall back to the default of 5 minutes, I am observing the same issue.
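
For context, the setup that triggers the SYCL error above is roughly the following (a sketch; 600 is the keep-alive value in seconds mentioned above, and ollama ps only confirms which models stay loaded):

    set OLLAMA_KEEP_ALIVE=600
    ollama serve
    REM in a second prompt, check which models remain loaded
    ollama ps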


sgwhat commented Jun 6, 2024

Could you please share the output of pip list from your environment and also your GPU model? Additionally, it would help us resolve the issue if you could provide more information from the Ollama server side.
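
For reference, one way to collect that information from the llm-cpp environment (a sketch; the wmic query is just one option for reading the GPU model on Windows):

    pip list
    REM print the GPU model on Windows
    wmic path win32_VideoController get name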
