Support for max_loaded_maps and num_parallel variables/parameter #11225

Open · jars101 opened this issue Jun 5, 2024 · 4 comments

jars101 commented Jun 5, 2024

It does not seem that Ollama running on ipex-llm supports the recent max_loaded_maps and num_parallel variables/parameters. Are they supported in the current Ollama version under llama-cpp? How does one enable them? Thanks.


sgwhat commented Jun 6, 2024

Hi @jars101,

  1. Ollama does not support max_loaded_maps.
  2. You may run the commands below to enable the num_parallel setting:
    export OLLAMA_NUM_PARALLEL=2
    ./ollama serve
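
A Windows Command Prompt equivalent (a sketch, assuming cmd's session-scoped set syntax in place of export) would be:

    set OLLAMA_NUM_PARALLEL=2
    ollama serve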

@sgwhat sgwhat self-assigned this Jun 6, 2024

jars101 commented Jun 6, 2024

Thank you @sgwhat. The most recent versions of Ollama do support both OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL on Linux and Windows. Running Ollama through ipex-llm does not seem to keep these settings, since every request against the same model reloads the model into memory. This behaviour does not occur with Ollama for Windows (standalone).
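
For reference, the configuration shown in the server log below could be set from the same Command Prompt before starting the server (a sketch; the values mirror the OLLAMA_MAX_LOADED_MODELS:4 and OLLAMA_NUM_PARALLEL:4 entries in the log):

    set OLLAMA_MAX_LOADED_MODELS=4
    set OLLAMA_NUM_PARALLEL=4
    ollama serve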

llm-cpp log snippet:

(base) C:\Users\Admin>conda activate llm-cpp

(llm-cpp) C:\Users\Admin>ollama serve
2024/06/05 04:57:07 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\Users\Admin\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-05T04:57:07.321-07:00 level=INFO source=images.go:704 msg="total blobs: 78"
time=2024-06-05T04:57:07.351-07:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-06-05T04:57:07.361-07:00 level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"


jars101 commented Jun 6, 2024

My bad, OLLAMA_NUM_PARALLEL does work, but OLLAMA_MAX_LOADED_MODELS does not. I went ahead and deployed a new installation of llm-cpp + Ollama, and I can now make use of both variables. ollama ps is available as well. The only problem I see, when setting OLLAMA_KEEP_ALIVE to 600 seconds for instance, is the following error:

INFO [print_timings] total time = 38167.02 ms | slot_id=0 t_prompt_processing=2029.651 t_token_generation=36137.372 t_total=38167.023 task_id=3 tid="3404" timestamp=1717656974
[GIN] 2024/06/05 - 23:56:14 | 200 | 47.9829961s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:16685, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:16685
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"
[GIN] 2024/06/05 - 23:58:18 | 200 | 4.1697536s | 10.240.0.1 | POST "/api/chat"
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp, line:17384, func:operator()
SYCL error: CHECK_TRY_ERROR(g_syclStreams[sycl_ctx->device][0]->memcpy( (char *)tensor->data + offset, data, size).wait()): Meet error in this line code!
  in function ggml_backend_sycl_set_tensor_async at C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:17384
GGML_ASSERT: C:/Users/Administrator/actions-runner/cpp-release/_work/llm.cpp/llm.cpp/ollama-internal/llm/llama.cpp/ggml-sycl.cpp:3021: !"SYCL error"

After removing OLLAMA_KEEP_ALIVE and letting it fall back to the default of 5 minutes, I am observing the same issue.
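
For context, the setup that triggers the SYCL error above is roughly the following (a sketch; 600 is the keep-alive value in seconds mentioned above, and ollama ps only confirms which models stay loaded):

    set OLLAMA_KEEP_ALIVE=600
    ollama serve
    REM in a second prompt, check which models remain loaded
    ollama ps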


sgwhat commented Jun 6, 2024

Could you please share the output of pip list from your environment and also your GPU model? Additionally, it would help us resolve the issue if you could provide more information from the Ollama server side.
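
For reference, one way to collect that information from the llm-cpp environment (a sketch; the wmic query is just one option for reading the GPU model on Windows):

    pip list
    REM print the GPU model on Windows
    wmic path win32_VideoController get name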
