Cuda inference doesn't work anymore! #812
Comments
Here it works - can you post your model config file, and logs with debug enabled?
So this is currently my docker-compose file:
How can I enable debug? I added the --debug thing at the end but that doesn't seem to do anything... I'm essentially using the guanaco.yaml from the model gallery and overriding it so it has additional options set.
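For reference, an override of that shape typically looks roughly like this — a minimal sketch, not the exact file, with illustrative values for the GPU-related options:

```yaml
# models/guanaco.yaml — illustrative sketch of a model config with GPU options
name: guanaco
backend: llama
parameters:
  model: guanaco-33B.ggmlv3.q4_0.bin
  temperature: 0.7
context_size: 2048
threads: 10
f16: true        # use 16-bit floats on the GPU
gpu_layers: 60   # number of layers to offload to the GPU
```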
I added those options recently, thinking they might somehow enable the usage of the GPU. I also had `batch` set before, but in config.go I couldn't find anything for batch, so I changed it to context_size, thinking maybe that could fix my problem... This is the config.go section I'm talking about:
Weird, in the image currently active in my docker-compose file, when I try to run this command:
I now get another error:
However, the API call via Postman still gives me an answer. But it's slow, since it's running on the CPU only... Also, when I make the API call I still get the rpc error and nothing else in the LocalAI logs:
To enable debug:
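A minimal sketch of what that looks like in a compose file, assuming the `DEBUG` environment variable (which LocalAI reads to turn on verbose logging):

```yaml
# sketch: enable verbose/debug logging for the LocalAI container
services:
  api:
    environment:
      - DEBUG=true
```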
Okay I did that, this comes when I make the API call:
And this is my cURL:
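Roughly of this shape — the model name and prompt below are placeholders, not the original request:

```bash
# sketch: a chat completion request against LocalAI's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "guanaco",
        "messages": [{"role": "user", "content": "How are you?"}],
        "temperature": 0.7
      }'
```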
@emakkus can you try with images from master and see if you can reproduce there? e.g. quay.io/go-skynet/local-ai:master-cublas-cuda11
@mudler I tried the image you recommended. That one actually seems to be able to utilize the GPU; however, it then fails with a segmentation fault right when inference starts. Here are the logs:
Are there other images I could try? Or do you see the problem with the logs I provided?
I think I am able to replicate the issue with a fresh VM in GCP, a G2-standard-4 instance with 1x NVIDIA L4. The OS is common-gpu-debian-11-py310.
Output of trying to execute a model using GPU acceleration:
I did try it with REBUILD=true now, but the results are exactly the same as before:
I also tried to run the go-llama example, and there I also get the Segfault:
So I suspected that maybe llama.cpp itself was somehow broken, but it seems to be fine:
All these commands are from within the same container with REBUILD=true. I also tried to run the Dockerfile build on the master branch, but I get the exact same results. At least it's trying to use the GPU now... but I don't get why the segfault would happen. Am I missing a param that it expects or something? This is my PRELOAD_MODELS value:
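PRELOAD_MODELS takes a JSON array of gallery entries, roughly of this shape (the URL and name below are placeholders, not the exact value):

```yaml
# docker-compose excerpt — sketch of the PRELOAD_MODELS format (placeholder entry)
environment:
  - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/guanaco.yaml", "name": "guanaco"}]'
```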
And this is my current full docker-compose.yaml:
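It follows the usual pattern for LocalAI with NVIDIA GPU passthrough — the sketch below is illustrative (image tag, ports, and paths are assumptions), not the file verbatim:

```yaml
# sketch: LocalAI compose service with NVIDIA GPU access (values are illustrative)
version: "3.9"
services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda11
    ports:
      - "8080:8080"
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - REBUILD=true
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```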
I changed gpu_layers back to 50, thinking the VRAM might have been getting full or something along those lines, but now that definitely shouldn't be the case; it's only taking about 15GB of 24GB.
I now tried to use an older release image, and there everything works:
So something must have happened since then that somehow leads to the segfaults... my model configuration is the same as before. I will try out 1.22.0, but my hopes are kinda low on that one...
Weird. I could finally reproduce in another box - I'll try to have a look at it later today.
That was a local build with just cublas enabled. I had pulled the latest from git. I will try later with a different version tag/label to see if I can get it working there.
I tried out 1.22.0. It sees the GPU, but doesn't use it. Even if I run go-llama directly inside the container, the exact same thing happens. Even setting -ngl doesn't change the behaviour.
However what works (as it does in every version I tried so far) is to directly run the naked llama.cpp:
So version 1.21.0 worked flawlessly with CUDA, version 1.22.0 sees the GPU but refuses to use it. It doesn't segfault or error out, but it only uses the CPU. And the master branch version tries to use the GPU, but segfaults... Normally I wouldn't mind using 1.21.0... but because of Llama2 and so on... I kinda want to stay up-to-date.
Fixes: #812 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler I love you man, now it works! I have built your update_rope branch and it works! <3
I'm actually having the same issue that emakkus had at the beginning. I've tried v1.23.1, v1.23.0, v1.22, and v1.21; none of them have worked. My nvidia-smi works within the repo, I just don't see any processes running or found. Specifically, I've been using this with the Obsidian plugin for LocalAI, but even when I run prompts directly through the terminal it doesn't work. Here is my nvidia-smi:
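For completeness, it was run from inside the container, along these lines (the container name is a placeholder):

```bash
# sketch: verify the GPU is visible from inside the LocalAI container
docker exec -it localai nvidia-smi
```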
Here is that same weird error that emakkus got last week in the logs:
The API call works, but for some reason it's just never using the GPU. I thought I was going crazy until I found this thread. Here's the docker-compose.yaml file:
LocalAI version:
quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg@sha256:f868a3348ca3747843542eeb1391003def43c92e3fafa8d073af9098a41a7edd
I also tried to build the image myself; exact same behaviour.
Environment, CPU architecture, OS, and Version:
Linux lxdocker 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) x86_64 GNU/Linux
It's a Proxmox LXC with Docker running inside it. CUDA inference did work in an earlier version of this project, and llama.cpp still does work.
Describe the bug
No matter how I configure the model, I can't get inference to run on the GPU. The GPU is being recognized; however, its VRAM usage stays at 274MiB / 24576MiB.
nvidia-smi does work inside the container.
When starting the container, the following message appears:
and when I make the completion call, only the CPU seems to take the load and it responds slowly (instead of using the GPU and being fast).
Also I somehow ALWAYS get the following message in the logs when I make the API call:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41227: connect: connection refused"
However, the API call still works. I just can't see what the backend is doing.
If I attach to the container and go into the go-llama directory and make the test call from there:
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda-12.2/lib64" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10
I get the following output:
As you can see, it is able to find the GPU, but it won't use it. When I write anything to it, only the CPU is used.
In ./examples/main.go I could find the ngl parameter for GPU layers; I used it with 60 and 70 and it didn't help. Same behaviour!
Finally I ran this:
go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10 -ngl 70 -n 1024
I removed all the prefix stuff, and I get the exact same behaviour with the exact same output as above. It is as if go-llama somehow doesn't make use of the GPU anymore.
However the interesting part is this:
If I make this call:
root@lxdocker:/build/go-llama/build/bin# ./main -m /models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 1024 -ngl 70
(I copied the whole bash line, so the path can be seen too.) It works! The GPU is being used and it's super fast, as expected!
Here the model output from llama.cpp:
As you can see, llama.cpp is able to use the GPU... but LocalAI somehow isn't. I've been trying to figure the problem out for several days now, but I just can't... Sadly I can't code in Go, so I don't really understand what's going on either... And the gRPC stuff also seems to throw errors but somehow still works...
I hope someone with more knowledge of how the whole backend is set up can maybe help out... I tried to gather as much information as I can.
I would really love to be able to use this project!
To Reproduce
Simply try to make an inference via CUDA...
Expected behavior
The GPU should be used, just as naked llama.cpp inside the image itself is able to do.