
Add support for full CUDA GPU offloading #105

Merged
merged 7 commits into master from cuda on Jun 16, 2023

Conversation

@mudler (Member) commented Jun 15, 2023

Special thanks to @chnyda, who gave me access to a CUDA GPU to test this out. And of course, thanks to llama.cpp for providing CUDA support!

@mudler (Member Author) commented Jun 15, 2023

@deadprogram I tested locally here, and it now seems to work!

mudler and others added 4 commits June 15, 2023 23:20
Signed-off-by: mudler <mudler@mocaccino.org>
Signed-off-by: mudler <mudler@mocaccino.org>
Signed-off-by: mudler <mudler@mocaccino.org>
Bumps [llama.cpp](https://github.com/ggerganov/llama.cpp) from `2347e45` to `bed9275`.
- [Release notes](https://github.com/ggerganov/llama.cpp/releases)
- [Commits](ggerganov/llama.cpp@2347e45...bed9275)

---
updated-dependencies:
- dependency-name: llama.cpp
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

@mudler (Member Author) commented Jun 15, 2023

ok, bringing in #104 in place of #103 makes it fail with:

GGML_ASSERT: /home/ubuntu/go-llama.cpp/llama.cpp/ggml-cuda.cu:2079: src0->type == GGML_TYPE_F16

@mudler (Member Author) commented Jun 15, 2023

Looks like f16 is not respected and the model still loads with f32. Needs a closer look.

@mudler (Member Author) commented Jun 16, 2023

Something is off:

load_model: f16 true
load_model: lparams.f16 true
load_model: lparams.low_vram false
load_model: lparams.mlock false
load_model: lparams.use_mmap true
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama_init_from_file: f16 false
llama_init_from_file: mmap false
llama_init_from_file: mlock true
llama_init_from_file: low_vram true

@m4xw commented Jun 16, 2023

ok, bringing in #104 in place of #103 makes it fail with:

GGML_ASSERT: /home/ubuntu/go-llama.cpp/llama.cpp/ggml-cuda.cu:2079: src0->type == GGML_TYPE_F16

Interestingly enough, this only happens for me when, say, I spin up a clean LocalAI instance and then run AutoGPT first thing. I don't get it when I initialize the model by starting a chat via the chatbot-ui first and then using it with AutoGPT. It's almost like a race around model load completion.
I've been getting some sporadic crashes on init (always right around the end of model loading), but it seems to work reasonably well once it's running. This is on Windows with the GPU passed through via WSL/Docker.

GGML_ASSERT: /build/go-llama/llama.cpp/ggml-cuda.cu:2079: src0->type == GGML_TYPE_F16
SIGABRT: abort
PC=0x7f639e734ccc m=4 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 51 [syscall]:
runtime.cgocall(0x9b7820, 0xc00052c898)
        /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc00052c870 sp=0xc00052c838 pc=0x47ad7c
github.com/go-skynet/go-llama%2ecpp._Cfunc_llama_predict(0x7f62a13896d0, 0x7f62a0ed1a30, 0xc0000e8a80, 0x0)
        _cgo_gotypes.go:217 +0x4c fp=0xc00052c898 sp=0xc00052c870 pc=0x9014ac     
github.com/go-skynet/go-llama%2ecpp.(*LLama).Predict.func2(0xc000300000?, 0xc00052ca88?, {0xc0000e8a80, 0x0?, 0x130ade0?}, 0xc0000b2b01?)
        /build/go-llama/llama.go:211 +0x94 fp=0xc00052c8e8 sp=0xc00052c898 pc=0x904074
github.com/go-skynet/go-llama%2ecpp.(*LLama).Predict(0xc000012ae0, {0xc000300000, 
0x1109}, {0xc0000b2be0, 0xd, 0x0?})
        /build/go-llama/llama.go:211 +0x2c8 fp=0xc00052cba0 sp=0xc00052c8e8 pc=0x903d08
github.com/go-skynet/LocalAI/api.ModelInference.func12()
        /build/api/prediction.go:535 +0xde fp=0xc00052cec8 sp=0xc00052cba0 pc=0x98529e
github.com/go-skynet/LocalAI/api.ModelInference.func14()
        /build/api/prediction.go:577 +0x1aa fp=0xc00052cf80 sp=0xc00052cec8 pc=0x984dea
github.com/go-skynet/LocalAI/api.ComputeChoices({0xc000300000, 0x1109}, 0xc0000263c0, 0xc0000c2b00, 0xc000025800?, 0x0?, 0x1913e10, 0x800?)
        /build/api/prediction.go:601 +0x246 fp=0xc00052d840 sp=0xc00052cf80 pc=0x988146
github.com/go-skynet/LocalAI/api.chatEndpoint.func2(0xc0000c2580)
        /build/api/openai.go:444 +0x8c5 fp=0xc00052d9f0 sp=0xc00052d840 pc=0x97e105
github.com/gofiber/fiber/v2.(*App).next(0xc0000c9200, 0xc0000c2580)
        /go/pkg/mod/github.com/gofiber/fiber/v2@v2.46.0/router.go:144 +0x1bf fp=0xc00052da98 sp=0xc00052d9f0 pc=0x8c7a9f
github.com/gofiber/fiber/v2.(*Ctx).Next(0x14?)
        /go/pkg/mod/github.com/gofiber/fiber/v2@v2.46.0/ctx.go:913 +0x53 fp=0xc00052dab8 sp=0xc00052da98 pc=0x8b30b3

Here one of the random crashes right after init:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf0f54cba pc=0x483f75]     

runtime stack:
runtime.throw({0x1394951?, 0x7f1fad04929a?})
        /usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7f1f1204c8b0 sp=0x7f1f1204c880 pc=0x4abedd
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:825 +0x3e9 fp=0x7f1f1204c910 sp=0x7f1f1204c8b0 pc=0x4c2389
runtime.acquirem(...)
        /usr/local/go/src/runtime/runtime1.go:482
runtime.mallocgc(0x2c0, 0x1367b00, 0x1)
        /usr/local/go/src/runtime/malloc.go:935 +0xd5 fp=0x7f1f1204c978 sp=0x7f1f1204c910 pc=0x483f75
runtime.newobject(0x7f1f00000090?)
        /usr/local/go/src/runtime/malloc.go:1254 +0x27 fp=0x7f1f1204c9a0 sp=0x7f1f1204c978 pc=0x484887
runtime: g 0: unexpected return pc for github.com/go-skynet/LocalAI/api.ModelInference called from 0xa75846
stack: frame={sp:0x7f1f1204c9a0, fp:0x7f1f1204cca0} stack=[0x7f1f1184ec20,0x7f1f1204e820)
0x00007f1f1204c8a0:  0x00007f1f1204c900  0x00000000004c2389 <runtime.sigpanic+0x00000000000003e9>
0x00007f1f1204c8b0:  0x0000000001394951  0x00007f1fad04929a
0x00007f1f1204c8c0:  0x0000000000000002  0x0000000000000000
0x00007f1f1204c8d0:  0x0000000000000002  0x0002ffff00001fbb
0x00007f1f1204c8e0:  0x000000c0001021a0  0x0000000000000007
0x00007f1f1204c8f0:  0x00007f1eb155d5a0  0x00007f1fad04929a
0x00007f1f1204c900:  0x00007f1f1204c968  0x0000000000483f75 <runtime.mallocgc+0x00000000000000d5>
0x00007f1f1204c910:  0x0000000000000016  0x0002ffff00001fbb
0x00007f1f1204c920:  0x00007f1f1204cda0  0x0000000000000007
0x00007f1f1204c930:  0x0000000000000001  0x4720726f66204144
0x00007f1f1204c940:  0xffffffffffffffff  0xffff0000ffff0000
0x00007f1f1204c950:  0x0000000000000000  0x0002ffff00001fbb
0x00007f1f1204c960:  0x00007f1e00000005  0x00007f1f1204c990
0x00007f1f1204c970:  0x0000000000484887 <runtime.newobject+0x0000000000000027>  0x00000000000002c0
0x00007f1f1204c980:  0x0000000001367b00  0x665f646565662e01
0x00007f1f1204c990:  0x00007f1f1204cc90  0x0000000000983f5d <github.com/go-skynet/LocalAI/api.ModelInference+0x000000000000005d>
0x00007f1f1204c9a0: <0x00007f1f00000090  0x0000000000000000
0x00007f1f1204c9b0:  0x0000100000001000  0x0000000000000000
0x00007f1f1204c9c0:  0x0000010000001000  0x0000000000000000
0x00007f1f1204c9d0:  0x00002b0000001000  0x0000000000000000
0x00007f1f1204c9e0:  0x001f8000ffe007ff  0xe30000000007000f
0x00007f1f1204c9f0:  0x4008000000000000  0x0000000000000000
0x00007f1f1204ca00:  0x3ff3ae147ae147ae  0x0000000000000000
0x00007f1f1204ca10:  0x0000000000000000  0x0000000000000000
0x00007f1f1204ca20:  0x0000000000000000  0x0000000000000000
0x00007f1f1204ca30:  0x656b203a6e6f6974  0x6576696c612d7065
0x00007f1f1204ca40:  0x66736e6172540a0d  0x646f636e452d7265
0x00007f1f1204ca50:  0x0000000000000000  0x0000000000000000
0x00007f1f1204ca60:  0x0000000000000000  0x0000000000000000
0x00007f1f1204ca70:  0x656b203a6e6f6974  0x6576696c612d7065
0x00007f1f1204ca80:  0x66736e6172540a0d  0x646f636e452d7265
0x00007f1f1204ca90:  0x0000000000000000  0x0000000000000000
0x00007f1f1204caa0:  0x0000000000000000  0x0000000000000000
0x00007f1f1204cab0:  0x656b203a6e6f6974  0x6576696c612d7065
0x00007f1f1204cac0:  0x0000000000000002  0x8000000000000006
0x00007f1f1204cad0:  0x0000000000000000  0x0000000000000000
0x00007f1f1204cae0:  0x0000000000000000  0x0000000000000000
0x00007f1f1204caf0:  0x0000000000000000  0x0000000000000000
0x00007f1f1204cb00:  0x0000000000000002  0x00007f1fa5b40cca
0x00007f1f1204cb10:  0x0000000000000000  0x00007f1eb0000000
0x00007f1f1204cb20:  0x000000000157c000  0x0000000001577000
0x00007f1f1204cb30:  0x0000000000000000  0x507732262c33d400
0x00007f1f1204cb40:  0x0000000000000002  0x0000000000005010
0x00007f1f1204cb50:  0x00007f1f00000030  0x0000000000005000
0x00007f1f1204cb60:  0x0000000000000073  0xfffffffffffffec8
0x00007f1f1204cb70:  0x00007f1f00000090  0x00007f1fa5b4010f
0x00007f1f1204cb80:  0x000000012d497b00  0x00007f1eb157b3d0
0x00007f1f1204cb90:  0x0000000000000001  0x0000000000000000
0x00007f1f1204cba0:  0x0000000002040000  0x507732262c33d400
0x00007f1f1204cbb0:  0x00007f1f00000030  0xfffffffffffffe90
0x00007f1f1204cbc0:  0x0000000000000000  0x00007f1fa5b41e7d
0x00007f1f1204cbd0:  0x00007f1eb1573bb0  0x00007f1eb15741c0
0x00007f1f1204cbe0:  0x0000000000000001  0x00007f1f00000003
0x00007f1f1204cbf0:  0x000000000000001e  0x000000000000001a
0x00007f1f1204cc00:  0x00007f1f00000030  0xfffffffffffffec8
0x00007f1f1204cc10:  0x00007f1eb0efb8a8  0x00007f1eb0efa3f0
0x00007f1f1204cc20:  0x00007f1eb11b0d90  0x00007f1fa5b42799
0x00007f1f1204cc30:  0x0000000000005000  0x00007f1eb1578be8
0x00007f1f1204cc40:  0x000000000000001a  0x00007f1eb1579110
0x00007f1f1204cc50:  0x00007f1eb155af40  0x00007f1faceaf58c
0x00007f1f1204cc60:  0x0000000000000019  0x0000000000a69c37
0x00007f1f1204cc70:  0x00007f1f1204cdf0  0x0000000000000019
0x00007f1f1204cc80:  0x00007f1eb156bc78  0x00007f1f1204cdf0
0x00007f1f1204cc90:  0x00007f1f1204cdf0 !0x0000000000a75846
0x00007f1f1204cca0: >0x000100044e564441  0x0000000000000200
0x00007f1f1204ccb0:  0x00007f1f1204e270  0x0000100000002b00
0x00007f1f1204ccc0:  0x0000000000000000  0x00000001276c0c00
0x00007f1f1204ccd0:  0x0000000000000001  0x00007f1f1204cd90
0x00007f1f1204cce0:  0x00007f1f1204cd90  0x00000020fffffff8
0x00007f1f1204ccf0:  0x0000010000001000  0x0000100000002b00
0x00007f1f1204cd00:  0x000000000014ca4b  0x0000000a00000028
0x00007f1f1204cd10:  0x000000012d497b00  0x00007f1f1204cd90
0x00007f1f1204cd20:  0x0000000000000000  0x0000000000984c10 <github.com/go-skynet/LocalAI/api.ModelInference+0x0000000000000d10>
0x00007f1f1204cd30:  0x0000000000000000  0x0000000000000000
0x00007f1f1204cd40:  0x00007f1f1204cda0  0x0000000000000000
0x00007f1f1204cd50:  0x00007f1eb11b0d90  0x0000000000000009
0x00007f1f1204cd60:  0x00007f1eb155d5a0  0x00007f1eb155d5a8
0x00007f1f1204cd70:  0x00007f1eb155d5a8  0x0000000000000180
0x00007f1f1204cd80:  0x00007f1f1204cd90  0x0000000000000009
0x00007f1f1204cd90:  0x332e73726579616c  0x00007f1fa5b40031
github.com/go-skynet/LocalAI/api.ModelInference({_, _}, _, {{{0x100044e564441, 0x200}, {0x7f1f1204e270, 0x100000002b00}, {0x0, 0x1276c0c00}, {0x1, ...}, ...}, ...}, ...)
        /build/api/prediction.go:246 +0x5d fp=0x7f1f1204cca0 sp=0x7f1f1204c9a0 pc=0x983f5d

goroutine 42 [syscall]:
runtime.cgocall(0x9b7860, 0xc0002eafd8)
        /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc0002eafb0 sp=0xc0002eaf78 pc=0x47ad7c
github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x7f1f00000dc0, 0x400, 0x0, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x28, ...)

GOLLAMA_VERSION?=a52ae7a66ae7fa42fd29f0bca9480c5c198feff9
Using WizardLM-7B-uncensored.ggmlv3.q5_1.bin

@m4xw commented Jun 16, 2023

Some other models that I tried behaved in a similarly weird way, always around init, but I never got the FP16 issue with them.

@mudler (Member Author) commented Jun 16, 2023

I've been debugging this with @lu-zero (thanks!) and it seems to be around the pass-by-value call to llama_init_from_file. Somehow, when compiled with the binding, the struct copy gets mangled and the booleans end up shuffled. That results in f16, mlock, or mmap not being respected. Tested with GCC 11 and GCC 12.
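
For illustration, here is a minimal cgo sketch (not the actual patch) of the two calling conventions involved: the by-value call the binding makes today, where the copy was observed to arrive mangled, and the pointer-based variant that the workaround patch below switches to. The toy_params struct is only a stand-in for llama_context_params, and this toy program does not by itself reproduce the miscompilation.

package main

/*
#include <stdbool.h>
#include <stdio.h>

// Toy stand-in for llama_context_params: only the boolean fields that
// were observed to arrive shuffled on the C++ side.
typedef struct {
	bool f16_kv;
	bool use_mmap;
	bool use_mlock;
	bool low_vram;
} toy_params;

// Shape of the current API: the struct crosses the cgo boundary by
// value, so the C ABI materializes a copy of it.
static void init_by_value(toy_params p) {
	printf("by value  : f16=%d mmap=%d mlock=%d low_vram=%d\n",
	       p.f16_kv, p.use_mmap, p.use_mlock, p.low_vram);
}

// Shape of the workaround: pass a pointer, so no struct copy is made
// when crossing the boundary.
static void init_by_pointer(const toy_params *p) {
	printf("by pointer: f16=%d mmap=%d mlock=%d low_vram=%d\n",
	       p->f16_kv, p->use_mmap, p->use_mlock, p->low_vram);
}
*/
import "C"

func main() {
	p := C.toy_params{f16_kv: true, use_mmap: true}
	C.init_by_value(p)    // a copy of p is made by the C ABI
	C.init_by_pointer(&p) // only the address crosses the boundary
}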

mudler added a commit to mudler/llama.cpp that referenced this pull request Jun 16, 2023
Especially with golang bindings, calling by value has the side-effect of
values not being copied correctly. This has been observed with the
bindings in go-skynet/go-llama.cpp#105.
mudler added a commit to mudler/llama.cpp that referenced this pull request Jun 16, 2023
Especially with golang bindings, calling by value has the side-effect of
values not being copied correctly. This has been observed with the
bindings in go-skynet/go-llama.cpp#105.
This is needed until ggerganov/llama.cpp#1902 is
addressed/merged.

Signed-off-by: mudler <mudler@mocaccino.org>
Signed-off-by: mudler <mudler@mocaccino.org>

@mudler (Member Author) commented Jun 16, 2023

With the patch and this PR:

root@gpu-friend:/home/ubuntu/go-llama.cpp# CGO_LDFLAGS="-lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/home/ubuntu/WizardLM-7B-uncensored.ggmlv3.q4_0.bin" -t 1 -ngl 40
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /home/ubuntu/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1862.39 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 5084 MB
...................................................................................................
llama_init_from_file: kv self size  =   64.00 MB
Model loaded successfully.
>>> What's up?

Sending What's up?

I hope this message finds you well. It's been a while since we last talked, and I just wanted to check in on you. How have you been? What have you been up to?
llama_print_timings:        load time =  6033.40 ms
llama_print_timings:      sample time =    28.27 ms /    42 runs   (    0.67 ms per token)
llama_print_timings: prompt eval time =   410.65 ms /     7 tokens (   58.66 ms per token)
llama_print_timings:        eval time =  1892.24 ms /    41 runs   (   46.15 ms per token)
llama_print_timings:       total time =  2347.59 ms
>>> 
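
For reference, the same run expressed through the binding's Go API would look roughly like the sketch below. This is a rough sketch, not the final API of this PR: the option names (SetGPULayers, SetContext, SetThreads, SetTokens) are assumptions about the functional options exposed by go-llama.cpp and may differ, and the program still needs the CGO_LDFLAGS/LIBRARY_PATH/C_INCLUDE_PATH setup shown in the command above.

package main

import (
	"fmt"
	"log"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Load the model and offload layers to the GPU, mirroring the
	// "-ngl 40" flag passed to the examples CLI above.
	l, err := llama.New(
		"/home/ubuntu/WizardLM-7B-uncensored.ggmlv3.q4_0.bin",
		llama.SetContext(128),  // matches n_ctx = 128 in the log above
		llama.SetGPULayers(40), // assumed option name for -ngl
	)
	if err != nil {
		log.Fatal(err)
	}
	defer l.Free()

	out, err := l.Predict("What's up?",
		llama.SetThreads(1),  // assumed option name for -t
		llama.SetTokens(128), // assumed option name for the token limit
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}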

@mudler merged commit 35a3c99 into master on Jun 16, 2023
3 checks passed
@mudler deleted the cuda branch on June 16, 2023 at 22:25

@mudler (Member Author) commented Jun 16, 2023

The patch has been submitted to llama.cpp; meanwhile, I'm applying it manually here to unblock updates.

@mudler (Member Author) commented Jun 20, 2023

Fix upstreamed in: ggerganov/llama.cpp#1936

@deadprogram (Contributor) commented:

Great work @mudler, thanks for all the effort!
