Tighten up memory prediction logging #5106

Merged · 1 commit merged into ollama:main from clean_logs on Jun 18, 2024

Conversation

dhiltgen (Collaborator)

Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. This change logs the prediction once, just before starting the server on the final configuration. It also reports which library is in use instead of always saying "offloading to gpu", even when running on the CPU.
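A minimal Go sketch of the pattern described above, assuming a hypothetical scheduler loop and estimate type (names such as memoryEstimate, pickConfiguration, and the fields shown are illustrative, not the actual ollama code):

```go
package main

import "log/slog"

// memoryEstimate is a hypothetical stand-in for the scheduler's
// per-configuration memory prediction (not the actual ollama types).
type memoryEstimate struct {
	library       string // "metal", "cuda", "cpu", ...
	layersOffload int
	requiredFull  string
}

// pickConfiguration iterates over candidate configurations silently;
// nothing is logged while the scheduler is still searching.
func pickConfiguration(candidates []memoryEstimate) memoryEstimate {
	best := candidates[0]
	for _, c := range candidates {
		if c.layersOffload > best.layersOffload {
			best = c
		}
	}
	return best
}

func main() {
	final := pickConfiguration([]memoryEstimate{
		{library: "cpu", layersOffload: 0, requiredFull: "2.7 GiB"},
		{library: "cuda", layersOffload: 27, requiredFull: "3.1 GiB"},
	})

	// Log exactly once, just before starting the server, and name the
	// library ("offload to cuda", "offload to cpu", ...) instead of
	// always saying "offloading to gpu".
	slog.Info("offload to "+final.library,
		"layers.offload", final.layersOffload,
		"memory.required.full", final.requiredFull,
	)

	// ... start the llama server with the final configuration here ...
}
```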

A few examples (at the regular, non-debug logging level):

time=2024-06-17T19:07:15.507-07:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="96.0 GiB" available="96.0 GiB"
[GIN] 2024/06/17 - 19:07:33 | 200 |     313.875µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/17 - 19:07:33 | 200 |    2.473333ms |       127.0.0.1 | POST     "/api/show"
time=2024-06-17T19:07:33.438-07:00 level=INFO source=memory.go:303 msg="offload to metal" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[96.0 GiB]" memory.required.full="3.2 GiB" memory.required.partial="3.2 GiB" memory.required.kv="650.0 MiB" memory.required.allocations="[3.2 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="157.0 MiB"
time=2024-06-17T19:07:33.439-07:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/var/folders/hs/0tcx8spd1vv390h0j6jq5vq80000gn/T/ollama3083603568/runners/metal/ollama_llama_server --model /Users/daniel/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --parallel 1 --port 63538"

time=2024-06-18T02:09:26.404Z level=WARN source=gpu.go:225 msg="CPU does not have minimum vector extensions, GPU inference disabled" required=avx detected="no vector extensions"
time=2024-06-18T02:09:26.405Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="31.3 GiB" available="30.4 GiB"
[GIN] 2024/06/18 - 02:09:36 | 200 |     662.875µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 02:09:36 | 200 |    3.819958ms |       127.0.0.1 | POST     "/api/show"
time=2024-06-18T02:09:36.898Z level=INFO source=memory.go:303 msg="offload to cpu" layers.requested=-1 layers.model=27 layers.offload=0 layers.split="" memory.available="[30.4 GiB]" memory.required.full="2.7 GiB" memory.required.partial="0 B" memory.required.kv="650.0 MiB" memory.required.allocations="[2.7 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="177.2 MiB"
time=2024-06-18T02:09:36.905Z level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama284292296/runners/cpu/ollama_llama_server --model /root/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 43251"

time=2024-06-17T19:02:51.398-07:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-1c750365-54dc-7082-7c6b-9dd953a68ab6 library=cuda compute=6.1 driver=12.3 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.7 GiB"
[GIN] 2024/06/17 - 19:02:57 | 200 |      28.835µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/17 - 19:02:57 | 200 |     453.399µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-17T19:02:58.233-07:00 level=INFO source=memory.go:303 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[5.7 GiB]" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="650.0 MiB" memory.required.allocations="[3.1 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="177.2 MiB"
time=2024-06-17T19:02:58.233-07:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama3201791839/runners/cuda_v11/ollama_llama_server --model /home/daniel/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --parallel 1 --port 43155"
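
The dotted attribute keys in the log lines above (layers.requested, memory.required.full, and so on) are the shape that Go's log/slog text handler gives to grouped attributes. A small sketch of how such a line could be composed, assuming slog groups are used (the exact attribute construction in memory.go may differ, and the values here are copied from the CUDA example above purely for illustration):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))

	// With slog's text handler, grouped attributes render with dotted
	// keys, e.g. layers.requested=-1 and memory.required.full="3.1 GiB",
	// matching the shape of the log lines above.
	logger.Info("offload to cuda",
		slog.Group("layers",
			slog.Int("requested", -1),
			slog.Int("model", 27),
			slog.Int("offload", 27),
		),
		slog.Group("memory",
			slog.String("available", "[5.7 GiB]"),
			slog.Group("required",
				slog.String("full", "3.1 GiB"),
				slog.String("kv", "650.0 MiB"),
			),
		),
	)
}
```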

@dhiltgen dhiltgen merged commit b55958a into ollama:main Jun 18, 2024
12 checks passed
@dhiltgen dhiltgen deleted the clean_logs branch June 18, 2024 16:24