feat: support setting the KV cache quant type #5098

Closed · wants to merge 3 commits

Conversation

sammcj (Contributor) commented Jun 17, 2024

WIP

Testing adding configuration to allow setting the KV cache type (re: #5091):


  • Allow setting the KV cache type in the env and params.
  • Allow setting flash attention in params (as well as the existing env); see the usage sketch below.

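A rough usage sketch of the proposed knobs, assuming this branch is built locally: the `OLLAMA_FLASH_ATTENTION` environment variable already exists upstream, while the `cache_type_k`/`cache_type_v` parameters (and any matching env var or API option names) exist only with this change and may end up spelled differently in the final diff; `<model>` is a placeholder.

```shell
# Flash attention via the existing environment variable
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quant type via request parameters (only with this branch)
ollama run <model>
>>> /set parameter cache_type_k q4_0
>>> /set parameter cache_type_v q4_0

# Or the equivalent options on the HTTP API
# (option names assumed to mirror the parameters above)
curl http://localhost:11434/api/chat -d '{
  "model": "<model>",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"cache_type_k": "q4_0", "cache_type_v": "q4_0"}
}'
```
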
sammcj (Contributor, Author) commented Jun 17, 2024

Hmm, it seems to crash out when doing this through Ollama; perhaps there's something I've missed?

>>> /set parameter cache_type_k q4_0
Set parameter 'cache_type_k' to 'q4_0'
>>> /set parameter cache_type_v q4_0
Set parameter 'cache_type_v' to 'q4_0'
>>> tell me a joke
Error: llama runner process has terminated: signal: aborted (core dumped)
...

llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   182.57 MiB
llm_load_tensors:      CUDA0 buffer size =   549.39 MiB
llm_load_tensors:      CUDA1 buffer size =   658.72 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 4096
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   135.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   117.00 MiB
llama_new_context_with_model: KV self size  =  252.00 MiB, K (q4_0):  126.00 MiB, V (q4_0):  126.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.59 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   344.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   439.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   259.02 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 3
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/fattn-common.cuh:97: K->type == GGML_TYPE_F16
time=2024-06-17T13:15:35.247Z level=INFO source=server.go:590 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-17T13:15:36.063Z level=INFO source=server.go:590 msg="waiting for server to become available" status="llm server error"
time=2024-06-17T13:15:36.313Z level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/06/17 - 13:15:36 | 500 |  9.220745669s |       127.0.0.1 | POST     "/api/chat"
time=2024-06-17T13:15:41.443Z level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.12908724 model=/home/llm/.ollama/models/blobs/sha256-58148c0e3025b575e546f9b58d1bd0e451c5ae9533c1ff15a94145061cd02538
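
The assert above is from llama.cpp's CUDA flash-attention code, which at this point requires an F16 K cache. For comparison, here is a sketch of hitting the same constraint directly against a standalone llama.cpp build of similar vintage; it assumes the stock CLI example (`llama-cli`) and its `-ctk`/`-ctv`/`-fa` flags, with a placeholder model path.

```shell
# Quantizing only the K cache, with flash attention left off, was already supported upstream
./llama-cli -m /path/to/model.gguf -ctk q4_0 -p "hello"

# Combining a quantized K cache with flash attention on CUDA aborts with the same assert:
#   GGML_ASSERT: ggml-cuda/fattn-common.cuh: K->type == GGML_TYPE_F16
./llama-cli -m /path/to/model.gguf -ctk q4_0 -ctv q4_0 -fa -p "hello"
```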

sammcj closed this by deleting the head repository on Jun 28, 2024
jmorganca (Member) commented
@sammcj thanks for the PR. It did indeed crash for me as well; I'm not sure whether all runtimes (CUDA, Metal, etc.) support the quantized KV cache.

sammcj (Contributor, Author) commented Jun 29, 2024

Yeah I gave it a red hot go but didn't get anywhere :( oh well, maybe in the future.
