Tighten up memory prediction logging #5106

Merged · 1 commit merged into ollama:main from clean_logs on Jun 18, 2024

Conversation

dhiltgen (Collaborator)

Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. This change logs the prediction once, just before starting the server on the final configuration. It also reports which library is in use instead of always saying "offloading to gpu", even when running on the CPU.
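A minimal Go sketch of the pattern described above, assuming a hypothetical scheduler loop and estimate type (names such as memoryEstimate, pickConfiguration, and the fields shown are illustrative, not the actual ollama code):

```go
package main

import "log/slog"

// memoryEstimate is a hypothetical stand-in for the scheduler's
// per-configuration memory prediction (not the actual ollama types).
type memoryEstimate struct {
	library       string // "metal", "cuda", "cpu", ...
	layersOffload int
	requiredFull  string
}

// pickConfiguration iterates over candidate configurations silently;
// nothing is logged while the scheduler is still searching.
func pickConfiguration(candidates []memoryEstimate) memoryEstimate {
	best := candidates[0]
	for _, c := range candidates {
		if c.layersOffload > best.layersOffload {
			best = c
		}
	}
	return best
}

func main() {
	final := pickConfiguration([]memoryEstimate{
		{library: "cpu", layersOffload: 0, requiredFull: "2.7 GiB"},
		{library: "cuda", layersOffload: 27, requiredFull: "3.1 GiB"},
	})

	// Log exactly once, just before starting the server, and name the
	// library ("offload to cuda", "offload to cpu", ...) instead of
	// always saying "offloading to gpu".
	slog.Info("offload to "+final.library,
		"layers.offload", final.layersOffload,
		"memory.required.full", final.requiredFull,
	)

	// ... start the llama server with the final configuration here ...
}
```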

A few examples (at the regular, non-debug logging level):

time=2024-06-17T19:07:15.507-07:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="96.0 GiB" available="96.0 GiB"
[GIN] 2024/06/17 - 19:07:33 | 200 |     313.875µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/17 - 19:07:33 | 200 |    2.473333ms |       127.0.0.1 | POST     "/api/show"
time=2024-06-17T19:07:33.438-07:00 level=INFO source=memory.go:303 msg="offload to metal" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[96.0 GiB]" memory.required.full="3.2 GiB" memory.required.partial="3.2 GiB" memory.required.kv="650.0 MiB" memory.required.allocations="[3.2 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="157.0 MiB"
time=2024-06-17T19:07:33.439-07:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/var/folders/hs/0tcx8spd1vv390h0j6jq5vq80000gn/T/ollama3083603568/runners/metal/ollama_llama_server --model /Users/daniel/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --parallel 1 --port 63538"

time=2024-06-18T02:09:26.404Z level=WARN source=gpu.go:225 msg="CPU does not have minimum vector extensions, GPU inference disabled" required=avx detected="no vector extensions"
time=2024-06-18T02:09:26.405Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="31.3 GiB" available="30.4 GiB"
[GIN] 2024/06/18 - 02:09:36 | 200 |     662.875µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 02:09:36 | 200 |    3.819958ms |       127.0.0.1 | POST     "/api/show"
time=2024-06-18T02:09:36.898Z level=INFO source=memory.go:303 msg="offload to cpu" layers.requested=-1 layers.model=27 layers.offload=0 layers.split="" memory.available="[30.4 GiB]" memory.required.full="2.7 GiB" memory.required.partial="0 B" memory.required.kv="650.0 MiB" memory.required.allocations="[2.7 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="177.2 MiB"
time=2024-06-18T02:09:36.905Z level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama284292296/runners/cpu/ollama_llama_server --model /root/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 43251"

time=2024-06-17T19:02:51.398-07:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-1c750365-54dc-7082-7c6b-9dd953a68ab6 library=cuda compute=6.1 driver=12.3 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.7 GiB"
[GIN] 2024/06/17 - 19:02:57 | 200 |      28.835µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/17 - 19:02:57 | 200 |     453.399µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-17T19:02:58.233-07:00 level=INFO source=memory.go:303 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[5.7 GiB]" memory.required.full="3.1 GiB" memory.required.partial="3.1 GiB" memory.required.kv="650.0 MiB" memory.required.allocations="[3.1 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="103.8 MiB" memory.graph.full="157.0 MiB" memory.graph.partial="177.2 MiB"
time=2024-06-17T19:02:58.233-07:00 level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama3201791839/runners/cuda_v11/ollama_llama_server --model /home/daniel/.ollama/models/blobs/sha256-66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --parallel 1 --port 43155"
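
The dotted attribute keys in the log lines above (layers.requested, memory.required.full, and so on) are the shape that Go's log/slog text handler gives to grouped attributes. A small sketch of how such a line could be composed, assuming slog groups are used (the exact attribute construction in memory.go may differ, and the values here are copied from the CUDA example above purely for illustration):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))

	// With slog's text handler, grouped attributes render with dotted
	// keys, e.g. layers.requested=-1 and memory.required.full="3.1 GiB",
	// matching the shape of the log lines above.
	logger.Info("offload to cuda",
		slog.Group("layers",
			slog.Int("requested", -1),
			slog.Int("model", 27),
			slog.Int("offload", 27),
		),
		slog.Group("memory",
			slog.String("available", "[5.7 GiB]"),
			slog.Group("required",
				slog.String("full", "3.1 GiB"),
				slog.String("kv", "650.0 MiB"),
			),
		),
	)
}
```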

@dhiltgen dhiltgen merged commit b55958a into ollama:main Jun 18, 2024
12 checks passed
@dhiltgen dhiltgen deleted the clean_logs branch June 18, 2024 16:24