LLama2 failing to load in Docker with cuBLAS #1109

Closed
RussellPacheco opened this issue Sep 26, 2023 · 17 comments
Labels: bug (Something isn't working)

Comments

@RussellPacheco

LocalAI version:

local-ai:master-cublas-cuda12

Environment, CPU architecture, OS, and Version:

Docker Container Info:
Linux 60bfc24c5413 4.18.0-477.21.1.el8_8.x86_64 #1 SMP Thu Aug 10 13:51:50 EDT 2023 x86_64 GNU/Linux

Host Device Info:
Linux rig 4.18.0-477.21.1.el8_8.x86_64 #1 SMP Thu Aug 10 13:51:50 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
NAME="AlmaLinux"
VERSION="8.8 (Sapphire Caracal)"
GPU: NVIDIA GeForce RTX 2080
CUDA Version: 12.2
Architecture: x86_64
Model name: AMD Ryzen 5 5600X 6-Core Processor

Describe the bug

Trying to run llama2-13b-chat.gguf in a Docker container built with cuBLAS support ends with errors, and the model is never loaded into VRAM. See below for related files and debug log output.

To Reproduce

docker-compose.yaml

version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
    command: ["/usr/bin/local-ai"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

.env file contents

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=20

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## Models to install will be visible in `/models/available`
# GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
#
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (Diffusers)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=false

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### Those are not really used by LocalAI, but from components in the stack ###
##
### Preload libraries
# LD_PRELOAD=

### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

llama2-13b-chat.yaml file contents:

name: llama2-13b-chat
backend: llama
gpu_layers: 1
low_vram: true
parameters:
  model: llama2-13b-chat.gguf
  top_k: 80
  temperature: 0.2
  top_p: 0.7
context_size: 4096
template:
  chat: llama2-13b-chat
system_prompt: |
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
roles: 
  assistant: 'Assistant:'
  system: 'System:'
  user: 'User:'

llama2-13b-chat.tmpl file contents:

[INST]
{{if .SystemPrompt}}<<SYS>>{{.SystemPrompt}}<</SYS>>{{end}}
{{if .Input}}{{.Input}}{{end}}
[/INST] 

Assistant: 

Curl Command:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2-13b-chat", "messages": [{"role": "user", "content": "How old is the earth?"}]}'

Expected behavior

The llama2-13b-chat.gguf model should load successfully into VRAM and provide a response to the prompt.
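
For reference, a successful call to /v1/chat/completions would return an OpenAI-style JSON body roughly of this shape (placeholder content only, not actual output from this setup):

{
  "model": "llama2-13b-chat",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "The Earth is roughly 4.5 billion years old."
      }
    }
  ]
}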

Logs

Debug Log Output:

localai-api-1  | 7:15AM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1  | 
localai-api-1  |  ┌───────────────────────────────────────────────────┐ 
localai-api-1  |  │                   Fiber v2.49.0                   │ 
localai-api-1  |  │               http://127.0.0.1:8080               │ 
localai-api-1  |  │       (bound on host 0.0.0.0 and port 8080)       │ 
localai-api-1  |  │                                                   │ 
localai-api-1  |  │ Handlers ............ 63  Processes ........... 1 │ 
localai-api-1  |  │ Prefork ....... Disabled  PID ................ 14 │ 
localai-api-1  |  └───────────────────────────────────────────────────┘ 
localai-api-1  | 
localai-api-1  | 
localai-api-1  | 7:16AM DBG Request received: 
localai-api-1  | 7:16AM DBG Configuration read: &{PredictionOptions:{Model:llama2-13b-chat.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-13b-chat F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-13b-chat ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
localai-api-1  | 7:16AM DBG Parameters: &{PredictionOptions:{Model:llama2-13b-chat.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-13b-chat F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-13b-chat ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
localai-api-1  | 7:16AM DBG Prompt (before templating): User: How old is the earth?
localai-api-1  | 7:16AM DBG Template found, input modified to: [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 7:16AM DBG Prompt (after templating): [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 
localai-api-1  | 7:16AM DBG Loading model llama from llama2-13b-chat.gguf
localai-api-1  | 7:16AM DBG Stopping all backends except 'llama2-13b-chat.gguf'
localai-api-1  | 7:16AM DBG Loading model in memory from file: /models/llama2-13b-chat.gguf
localai-api-1  | 7:16AM DBG Loading GRPC Model llama: {backendString:llama model:llama2-13b-chat.gguf threads:20 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000350000 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true}
localai-api-1  | 7:16AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1  | 7:16AM DBG GRPC Service for llama2-13b-chat.gguf will be running at: '127.0.0.1:46809'
localai-api-1  | 7:16AM DBG GRPC Service state dir: /tmp/go-processmanager3095530620
localai-api-1  | 7:16AM DBG GRPC Service Started
localai-api-1  | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46809: connect: connection refused"
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 2023/09/26 07:16:16 gRPC Server listening at 127.0.0.1:46809
localai-api-1  | 7:16AM DBG GRPC Service Ready
localai-api-1  | 7:16AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llama2-13b-chat.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:false NUMA:false NGPULayers:1 MainGPU: TensorSplit: Threads:20 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llama2-13b-chat.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr create_gpt_params: loading model /models/llama2-13b-chat.gguf
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr SIGSEGV: segmentation violation
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr PC=0x7f9de1797789 m=3 sigcode=1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr signal arrived during cgo execution
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 3 [syscall]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.cgocall(0x81e290, 0xc0000f9590)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0000f9568 sp=0xc0000f9530 pc=0x4176cb
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x7f9d60000e60, 0x1000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x200, ...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	_cgo_gotypes.go:266 +0x4c fp=0xc0000f9590 sp=0xc0000f9568 pc=0x81474c
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/go-llama%2ecpp.New({0xc000338000, 0x1c}, {0xc000319380, 0x9, 0x92aa60?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/go-llama/llama.go:39 +0x3aa fp=0xc0000f9790 sp=0xc0000f9590 pc=0x81500a
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/LocalAI/pkg/backend/llm/llama.(*LLM).Load(0xc0000a0c80, 0xc000308000)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/pkg/backend/llm/llama/llama.go:81 +0xbcb fp=0xc0000f9958 sp=0xc0000f9790 pc=0x8198eb
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/LocalAI/pkg/grpc.(*server).LoadModel(0xc0000a0d50, {0xc000308000?, 0x50d446?}, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/pkg/grpc/server.go:50 +0xe6 fp=0xc0000f9a08 sp=0xc0000f9958 pc=0x81c326
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/LocalAI/pkg/grpc/proto._Backend_LoadModel_Handler({0x99bec0?, 0xc0000a0d50}, {0xa82310, 0xc000304030}, 0xc000204070, 0x0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/pkg/grpc/proto/backend_grpc.pb.go:264 +0x169 fp=0xc0000f9a60 sp=0xc0000f9a08 pc=0x809f89
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).processUnaryRPC(0xc0001661e0, {0xa85458, 0xc00017e340}, 0xc00021a000, 0xc000174c00, 0x1168650, 0x0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1360 +0xe15 fp=0xc0000f9e40 sp=0xc0000f9a60 pc=0x7f34b5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).handleStream(0xc0001661e0, {0xa85458, 0xc00017e340}, 0xc00021a000, 0x0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1737 +0x9e7 fp=0xc0000f9f68 sp=0xc0000f9e40 pc=0x7f81e7
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).serveStreams.func1.1()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:982 +0x8d fp=0xc0000f9fe0 sp=0xc0000f9f68 pc=0x7f122d
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000f9fe8 sp=0xc0000f9fe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 29
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:980 +0x165
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 1 [IO wait]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0x4c6b10?, 0xc00015db28?, 0xa0?, 0xdb?, 0xc00015db78?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00015db08 sp=0xc00015dae8 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.netpollblock(0x478a52?, 0x416e66?, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/netpoll.go:564 +0xf7 fp=0xc00015db40 sp=0xc00015db08 pc=0x4448f7
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.runtime_pollWait(0x7f9d6efb0e58, 0x72)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/netpoll.go:343 +0x85 fp=0xc00015db60 sp=0xc00015db40 pc=0x475905
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*pollDesc).wait(0xc00012c600?, 0x4?, 0x0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00015db88 sp=0xc00015db60 pc=0x4dfb67
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*pollDesc).waitRead(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*FD).Accept(0xc00012c600)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac fp=0xc00015dc30 sp=0xc00015db88 pc=0x4e504c
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*netFD).accept(0xc00012c600)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/net/fd_unix.go:172 +0x29 fp=0xc00015dce8 sp=0xc00015dc30 pc=0x6434a9
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*TCPListener).accept(0xc0000f0440)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/net/tcpsock_posix.go:152 +0x1e fp=0xc00015dd10 sp=0xc00015dce8 pc=0x65a43e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*TCPListener).Accept(0xc0000f0440)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/net/tcpsock.go:315 +0x30 fp=0xc00015dd40 sp=0xc00015dd10 pc=0x6595f0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).Serve(0xc0001661e0, {0xa818c8?, 0xc0000f0440})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:844 +0x462 fp=0xc00015de80 sp=0xc00015dd40 pc=0x7efee2
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr github.com/go-skynet/LocalAI/pkg/grpc.StartServer({0x7ffd12945c50?, 0xc0000b4130?}, {0xa85f40?, 0xc0000a0c80})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/pkg/grpc/server.go:178 +0x17d fp=0xc00015df10 sp=0xc00015de80 pc=0x81dd1d
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr main.main()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/build/cmd/grpc/llama/main.go:22 +0x85 fp=0xc00015df40 sp=0xc00015df10 pc=0x81dec5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.main()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc00015dfe0 sp=0xc00015df40 pc=0x44ba1b
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00015dfe8 sp=0xc00015dfe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 2 [force gc (idle)]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000058fa8 sp=0xc000058f88 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goparkunlock(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:404
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.forcegchelper()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc000058fe0 sp=0xc000058fa8 pc=0x44bcf3
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000058fe8 sp=0xc000058fe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by runtime.init.6 in goroutine 1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:310 +0x1a
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 18 [GC sweep wait]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000054778 sp=0xc000054758 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goparkunlock(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:404
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.bgsweep(0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000547c8 sp=0xc000054778 pc=0x437d54
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gcenable.func1()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000547e0 sp=0xc0000547c8 pc=0x42ce85
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000547e8 sp=0xc0000547e0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by runtime.gcenable in goroutine 1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgc.go:200 +0x66
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 19 [GC scavenge wait]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0xc00009e000?, 0xa7ab20?, 0x1?, 0x0?, 0xc0000924e0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000054f70 sp=0xc000054f50 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goparkunlock(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:404
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.(*scavengerState).park(0x11b1800)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000054fa0 sp=0xc000054f70 pc=0x435589
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.bgscavenge(0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc000054fc8 sp=0xc000054fa0 pc=0x435b1c
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gcenable.func2()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc000054fe0 sp=0xc000054fc8 pc=0x42ce25
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000054fe8 sp=0xc000054fe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by runtime.gcenable in goroutine 1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mgc.go:201 +0xa5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 20 [finalizer wait]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0x9c6220?, 0x10044d001?, 0x0?, 0x0?, 0x454025?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000058628 sp=0xc000058608 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.runfinq()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000587e0 sp=0xc000058628 pc=0x42bf07
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000587e8 sp=0xc0000587e0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by runtime.createfing in goroutine 1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/mfinal.go:163 +0x3d
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 27 [select]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0xc00020ff00?, 0x2?, 0x0?, 0xe0?, 0xc00020fecc?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00020fd78 sp=0xc00020fd58 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.selectgo(0xc00020ff00, 0xc00020fec8, 0x7070e0910100402?, 0x0, 0x664605?, 0x1)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc00020fe98 sp=0xc00020fd78 pc=0x45b8c5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0000a6aa0, 0x1)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/controlbuf.go:418 +0x113 fp=0xc00020ff30 sp=0xc00020fe98 pc=0x769f53
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc00007e070)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/controlbuf.go:552 +0x86 fp=0xc00020ff90 sp=0xc00020ff30 pc=0x76a686
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.NewServerTransport.func2()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:341 +0xd5 fp=0xc00020ffe0 sp=0xc00020ff90 pc=0x781695
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00020ffe8 sp=0xc00020ffe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by google.golang.org/grpc/internal/transport.NewServerTransport in goroutine 26
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:338 +0x1b0c
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 28 [select]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0xc000056f70?, 0x4?, 0xa0?, 0x6?, 0xc000056ec0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000056d28 sp=0xc000056d08 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.selectgo(0xc000056f70, 0xc000056eb8, 0x0?, 0x0, 0x0?, 0x1)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc000056e48 sp=0xc000056d28 pc=0x45b8c5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc00017e340)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:1155 +0x225 fp=0xc000056fc8 sp=0xc000056e48 pc=0x788ae5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.NewServerTransport.func4()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:344 +0x25 fp=0xc000056fe0 sp=0xc000056fc8 pc=0x781585
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000056fe8 sp=0xc000056fe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by google.golang.org/grpc/internal/transport.NewServerTransport in goroutine 26
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:344 +0x1b4e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr goroutine 29 [IO wait]:
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.gopark(0x7f9d6d726b98?, 0xb?, 0x0?, 0x0?, 0x6?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000068aa8 sp=0xc000068a88 pc=0x44be6e
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.netpollblock(0x4c4d98?, 0x416e66?, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/netpoll.go:564 +0xf7 fp=0xc000068ae0 sp=0xc000068aa8 pc=0x4448f7
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.runtime_pollWait(0x7f9d6efb0d60, 0x72)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/netpoll.go:343 +0x85 fp=0xc000068b00 sp=0xc000068ae0 pc=0x475905
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*pollDesc).wait(0xc00012ca00?, 0xc0002a6000?, 0x0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000068b28 sp=0xc000068b00 pc=0x4dfb67
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*pollDesc).waitRead(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr internal/poll.(*FD).Read(0xc00012ca00, {0xc0002a6000, 0x8000, 0x8000})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000068bc0 sp=0xc000068b28 pc=0x4e0e5a
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*netFD).Read(0xc00012ca00, {0xc0002a6000?, 0x1060100000000?, 0x8?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000068c08 sp=0xc000068bc0 pc=0x641485
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*conn).Read(0xc0000b2328, {0xc0002a6000?, 0xc000068c98?, 0x3?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/net/net.go:179 +0x45 fp=0xc000068c50 sp=0xc000068c08 pc=0x651b65
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr net.(*TCPConn).Read(0x0?, {0xc0002a6000?, 0xc000068ca8?, 0x469d4d?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	<autogenerated>:1 +0x25 fp=0xc000068c80 sp=0xc000068c50 pc=0x664305
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr bufio.(*Reader).Read(0xc0000a50e0, {0xc00017c120, 0x9, 0xc13cbf308f52b15f?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/bufio/bufio.go:244 +0x197 fp=0xc000068cb8 sp=0xc000068c80 pc=0x5bca37
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr io.ReadAtLeast({0xa7f3a0, 0xc0000a50e0}, {0xc00017c120, 0x9, 0x9}, 0x9)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/io/io.go:335 +0x90 fp=0xc000068d00 sp=0xc000068cb8 pc=0x4bef90
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr io.ReadFull(...)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/io/io.go:354
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr golang.org/x/net/http2.readFrameHeader({0xc00017c120, 0x9, 0xc000028090?}, {0xa7f3a0?, 0xc0000a50e0?})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/golang.org/x/net@v0.14.0/http2/frame.go:237 +0x65 fp=0xc000068d50 sp=0xc000068d00 pc=0x756ce5
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr golang.org/x/net/http2.(*Framer).ReadFrame(0xc00017c0e0)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/golang.org/x/net@v0.14.0/http2/frame.go:498 +0x85 fp=0xc000068df8 sp=0xc000068d50 pc=0x757425
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc00017e340, 0x0?, 0x0?)
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/internal/transport/http2_server.go:642 +0x165 fp=0xc000068f10 sp=0xc000068df8 pc=0x784905
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).serveStreams(0xc0001661e0, {0xa85458?, 0xc00017e340})
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:969 +0x149 fp=0xc000068f80 sp=0xc000068f10 pc=0x7f0fa9
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr google.golang.org/grpc.(*Server).handleRawConn.func1()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:912 +0x45 fp=0xc000068fe0 sp=0xc000068f80 pc=0x7f0885
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr runtime.goexit()
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000068fe8 sp=0xc000068fe0 pc=0x47a9e1
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 26
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 	/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:911 +0x185
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr 
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rax    0x7f9d60001440
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rbx    0x7f9d60001180
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rcx    0x7f9d60001460
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rdx    0x0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rdi    0x7f9d60001180
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rsi    0x7f9d67ff5520
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rbp    0x7f9d67ff5520
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rsp    0x7f9d67ff52c0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r8     0x7f9d60001440
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r9     0x7f9d60000080
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r10    0xfffffffffffffcfd
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r11    0x30
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r12    0x0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r13    0x0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r14    0x7f9d60001190
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr r15    0x7f9d67ff5480
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rip    0x7f9de1797789
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr rflags 0x10246
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr cs     0x33
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr fs     0x0
localai-api-1  | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr gs     0x0
localai-api-1  | [172.18.0.1]:45508  500  -  POST     /v1/chat/completions

Additional context

RussellPacheco added the bug label on Sep 26, 2023
@yhyu13

yhyu13 commented Sep 26, 2023

localai-api-1 | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr create_gpt_params: loading model /models/llama2-13b-chat.gguf
localai-api-1 | 7:16AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:46809): stderr SIGSEGV: segmentation violation

Segfault in the go-llama.cpp code, I believe.

Correct me if I am wrong, but the go-llama.cpp version this project uses at the moment is quite old, several months behind the latest llama.cpp. I don't think GGUF is supported, only ggml (the deprecated format).
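
As a quick sanity check on the file format (my own suggestion, not from the report above): GGUF files start with the ASCII magic "GGUF", so the first four bytes of the model file tell you which format you actually have:

head -c 4 /models/llama2-13b-chat.gguf && echo
# prints "GGUF" for a GGUF file; anything else means an older ggml-family format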

@RussellPacheco
Author

@yhyu13

According to #943, GGUF support was just added a couple of days ago.

@RussellPacheco
Author

RussellPacheco commented Sep 26, 2023

After deleting the existing container and image, re-downloading, and rebuilding, the model is now properly loaded into VRAM. However, inference is extremely slow; I killed the process after several minutes of waiting for a response. I will do more testing later to see exactly how long it takes and report back here.
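
For anyone else hitting this, the refresh amounts to roughly the following sequence (Docker Compose v2 syntax assumed; adjust the image and service names to your compose file):

docker compose down                                            # stop and remove the old container
docker rmi quay.io/go-skynet/local-ai:master-cublas-cuda12     # drop the stale image
docker compose pull api                                        # re-download the latest master-cublas-cuda12 image
docker compose up -d api                                       # recreate the container from the fresh image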

lunamidori5 self-assigned this on Sep 26, 2023
@lunamidori5
Collaborator

You have gpu_layers set to 1; if you set it higher, the model will use more VRAM and more of the GPU @RussellPacheco
The llama backend supports GGUF, llama-stable does not @yhyu13
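
In the model YAML above that means raising gpu_layers; a sketch of the change (the value 20 is only an illustration and has to fit the card's VRAM):

name: llama2-13b-chat
backend: llama
gpu_layers: 20   # offload more layers to the GPU; tune this to the available VRAM
low_vram: true
parameters:
  model: llama2-13b-chat.gguf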

@RussellPacheco
Author

@lunamidori5 Is there any documentation you can recommend to better understand the relationship between a GPU model and the number of GPU layers it can handle?

@RussellPacheco
Author

RussellPacheco commented Sep 27, 2023

After some testing, I was able to conclude that the highest value I can set for gpu_layers with my GPU is 9. Note that when it is set to 10, no inference happens and LocalAI returns the error response shown below; this is due to running out of memory.

However, even when the amount of VRAM used is safely maximized, inference takes over 1 hour and 30 minutes. Running the same prompt directly with llama.cpp, the same inference takes only 3 minutes.
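
A rough back-of-the-envelope estimate (mine, not from the logs): the curated logs below report about 6 GB of VRAM for 10 offloaded f16 layers, i.e. roughly 600 MB per layer, so an 8 GB card like the RTX 2080 can only hold a handful of layers alongside the KV cache. Watching usage while the model loads makes the ceiling obvious:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1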

Please see below for more information:

Curated logs with gpu_layers: 10:

localai-api-1  | 4:27AM DBG Request received: 
localai-api-1  | 4:27AM DBG Configuration read: &{PredictionOptions:{Model:llama2-13b-chat.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-13b-chat F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-13b-chat ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:10 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 4:27AM DBG Parameters: &{PredictionOptions:{Model:llama2-13b-chat.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-13b-chat F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-13b-chat ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:10 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 4:27AM DBG Prompt (before templating): User: How old is the earth?
localai-api-1  | 4:27AM DBG Template found, input modified to: [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 4:27AM DBG Prompt (after templating): [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 
localai-api-1  | 4:27AM DBG Loading model llama from llama2-13b-chat.gguf
localai-api-1  | 4:27AM DBG Stopping all backends except 'llama2-13b-chat.gguf'
localai-api-1  | 4:27AM DBG Loading model in memory from file: /models/llama2-13b-chat.gguf
localai-api-1  | 4:27AM DBG Loading GRPC Model llama: {backendString:llama model:llama2-13b-chat.gguf threads:20 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0003831e0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true}
localai-api-1  | 4:27AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1  | 4:27AM DBG GRPC Service for llama2-13b-chat.gguf will be running at: '127.0.0.1:37993'
localai-api-1  | 4:27AM DBG GRPC Service state dir: /tmp/go-processmanager601508576
localai-api-1  | 4:27AM DBG GRPC Service Started
localai-api-1  | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37993: connect: connection refused"
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr 2023/09/27 04:27:22 gRPC Server listening at 127.0.0.1:37993
localai-api-1  | 4:27AM DBG GRPC Service Ready
localai-api-1  | 4:27AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llama2-13b-chat.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:false NUMA:false NGPULayers:10 MainGPU: TensorSplit: Threads:20 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llama2-13b-chat.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr create_gpt_params_cuda: loading model /models/llama2-13b-chat.gguf
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr ggml_init_cublas: found 1 CUDA devices:
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr   Device 0: NVIDIA GeForce RTX 2080 SUPER, compute capability 7.5
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_model_loader: loaded meta data with 15 key-value pairs and 363 tensors from /models/llama2-13b-chat.gguf (version GGUF V2 (latest))

~portion of logs removed for brevity~

localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_model_loader: - type  f32:   81 tensors
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_model_loader: - type  f16:  282 tensors
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: format         = GGUF V2 (latest)
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: arch           = llama
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: vocab type     = SPM
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_vocab        = 32000
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_merges       = 0
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_ctx_train    = 4096
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_ctx          = 4096
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_embd         = 5120
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_head         = 40
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_head_kv      = 40
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_layer        = 40
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_rot          = 128
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_gqa          = 1
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: f_norm_eps     = 1.0e-05
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: n_ff           = 13824
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: freq_base      = 10000.0
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: freq_scale     = 1
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: model type     = 13B
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: model ftype    = mostly F16
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: model size     = 13.02 B
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: general.name   = LLaMA v2
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: BOS token = 1 '<s>'
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: EOS token = 2 '</s>'
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: UNK token = 0 '<unk>'
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_print_meta: LF token  = 13 '<0x0A>'
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: ggml ctx size = 24826.70 MB
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: using CUDA for GPU acceleration
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: mem required  = 18776.31 MB (+ 6400.00 MB per state)
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: offloading 10 repeating layers to GPU
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: offloaded 10/43 layers to GPU
localai-api-1  | 4:27AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llm_load_tensors: VRAM used: 6051 MB
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr ....................................................................................................
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_new_context_with_model: kv self size  = 6400.00 MB
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_new_context_with_model: compute buffer total size =  351.47 MB
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr llama_new_context_with_model: not allocating a VRAM scratch buffer due to low VRAM option
localai-api-1  | [127.0.0.1]:53740 200 - GET /readyz
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr 
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr CUDA error 2 at /build/go-llama/llama.cpp/ggml-cuda.cu:5474: out of memory
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stderr current device: 0
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout  [INST]
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout <</SYS>>
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout User: How old is the earth?
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout [/INST] 
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout 
localai-api-1  | 4:28AM DBG GRPC(llama2-13b-chat.gguf-127.0.0.1:37993): stdout Assistant: 
localai-api-1  | [172.18.0.1]:34686 500 - POST /v1/chat/completions
localai-api-1  | [127.0.0.1]:39304 200 - GET /readyz

Error response with gpu_layers: 10 (the gRPC backend died after the CUDA out-of-memory error above, so the API returned a 500):

[2023-09-27 12:59:07][root@rig ~]# curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2-13b-chat", "messages": [{"role": "user", "content": "How old is the earth?"}]}'
{"error":{"code":500,"message":"rpc error: code = Unavailable desc = error reading from server: EOF","type":""}}

With gpu_layers: 9, the model loads (nvidia-smi output):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off | 00000000:06:00.0 Off |                  N/A |
| 46%   43C    P8              10W / 250W |   7437MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2771      C   ...kend_data/backend-assets/grpc/llama     7372MiB |
+---------------------------------------------------------------------------------------+

However, as seen below, inference for a simple prompt took over an hour and a half:

[2023-09-27 11:22:57][root@rig ~]# curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2-13b-chat", "messages": [{"role": "user", "content": "How old is the earth?"}]}'

{"object":"chat.completion","model":"llama2-13b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Thank you for your question! The age of the Earth is a topic that has been studied and debated by scientists for centuries. According to the most widely accepted scientific theory, the Earth is estimated to be around 4.54 billion years old. This estimate is based on a variety of evidence, including the study of the Earth's rocks, the Moon's orbit, and the radioactive decay of elements in the Earth's crust. However, it's important to note that there is some variation in the estimated age of the Earth among different scientific sources, and some alternative theories and interpretations of the evidence have been proposed. Ultimately, the age of the Earth is a complex and multifaceted question that continues to be the subject of ongoing scientific research and debate. I hope this information is helpful! Is there anything else you would like to know?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[2023-09-27 12:58:12][root@rig ~]# 

For comparison, running the same prompt directly with llama.cpp takes about 3 minutes:

[2023-09-27 13:05:35][root@rig /opt/llama.cpp]# ./main -m ./models/llama-2-13b-chat/llama2-13b-chat.gguf -n 128 -p "How old is the earth?"
Log start
main: build = 1135 (bce1fef)
main: seed  = 1695787573
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 SUPER, compute capability 7.5
llama_model_loader: loaded meta data with 15 key-value pairs and 363 tensors from ./models/llama-2-13b-chat/llama2-13b-chat.gguf (version GGUF V2 (latest))

~portion of the logs were removed for brevity~

llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly F16
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 24826.70 MB (+  400.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.47 MB
llama_new_context_with_model: VRAM scratch buffer: 74.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 How old is the earth?

The age of the Earth is approximately 4.54 billion years. This estimate is based on a variety of scientific methods, including radiometric dating of rocks and meteorites, and the study of the Earth's magnetic field and the rate of cooling of the Earth's core. The most widely accepted age for the Earth is based on the radioactive decay of uranium-238 to lead-206, which has a half-life of about 4.5 billion years. This method has been used to date rocks and meteorites that contain uranium, and the ages obtained
llama_print_timings:        load time = 46431.11 ms
llama_print_timings:      sample time =    47.95 ms /   128 runs   (    0.37 ms per token,  2669.67 tokens per second)
llama_print_timings: prompt eval time =   825.12 ms /     7 tokens (  117.87 ms per token,     8.48 tokens per second)
llama_print_timings:        eval time = 90594.38 ms /   127 runs   (  713.34 ms per token,     1.40 tokens per second)
llama_print_timings:       total time = 91553.25 ms
Log end
[2023-09-27 13:08:33][root@rig /opt/llama.cpp]# 

@lunamidori5
Collaborator

@lunamidori5 Is there any documentation you can recommend to better understand the relationship between the possible number of GPU layers and a GPU model?

Sadly no, I do not, as every model is its own, but I am working on updating the how-tos to make that even clearer! @RussellPacheco
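
As a rough back-of-the-envelope guide drawn from the numbers already logged in this thread (an estimate, not official guidance): the loader reported about 6051 MB of VRAM for 10 offloaded layers of this F16 13B model, so each repeating layer costs roughly 600 MB, with the CUDA context and per-layer buffers on top:

6051 MB / 10 layers ≈ 605 MB per offloaded layer (13B, F16)
9 layers + runtime overhead ≈ 7.4 GiB observed in nvidia-smi
10 layers pushes past the card's 8 GiB, matching the CUDA "out of memory" above

A quantized GGUF has a much smaller per-layer cost, so more layers would fit on the same card.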

@lunamidori5
Collaborator

@RussellPacheco in your YAML file, add:

low_vram: true
mmap: false
mmlock: false
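
For reference, a minimal sketch of how those flags might sit in the model definition YAML, assuming a llama2-13b-chat.yaml alongside the .gguf file (the complete 7B example posted further down shows the same fields in context):

name: llama2-13b-chat
backend: llama
gpu_layers: 9
low_vram: true
mmap: false
mmlock: false
parameters:
  model: llama2-13b-chat.gguf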

@RussellPacheco
Author

RussellPacheco commented Sep 28, 2023

@lunamidori5

I made the above changes and inference did speed up slightly: it now takes 50 minutes instead of the hour and a half it was taking before. However, this is still significantly slower than running llama.cpp directly, which takes about 3 minutes.

[2023-09-28 08:11:23][root@rig ~]# curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2-13b-chat", "messages": [{"role": "user", "content": "How old is the earth?"}]}'

{"object":"chat.completion","model":"llama2-13b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Thank you for your question! The age of the Earth is a topic that has been studied and debated by scientists for centuries. According to the most widely accepted scientific theory, the Earth is estimated to be around 4.54 billion years old. This estimate is based on a variety of evidence, including the study of the Earth's rocks, the Moon's orbit, and the radioactive decay of elements in the Earth's crust. However, it's important to note that there is some variation in the estimated age of the Earth among different scientific sources, and some alternative theories and interpretations of the evidence have been proposed. Ultimately, the age of the Earth is a complex and multifaceted question that continues to be the subject of ongoing scientific research and debate. I hope this information is helpful! Is there anything else you would like to know?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[2023-09-28 09:01:10][root@rig ~]# 

@lunamidori5
Collaborator

@RussellPacheco what are your gpu_layers set to again? (I'll keep looking into what may be making this happen)

@RussellPacheco
Author

@lunamidori5 They are set to 9. Thank you for looking into this for me.

@lunamidori5
Collaborator

@lunamidori5 They are set to 9. Thank you for looking into this for me.

@RussellPacheco do you have a smaller model you are okay with trying, like a 7B or a 3B?

@RussellPacheco
Author

@lunamidori5 I am preparing this now and will let you know the outcome.

@RussellPacheco
Author

RussellPacheco commented Sep 28, 2023

@lunamidori5

Here are my results from attempting to run a 7B model. Even though the model loads and its offloaded layers sit in VRAM, there was no inference output even after waiting more than 2 hours. The same model runs directly under llama.cpp in about a minute.

llama2-7b.yaml file contents:

name: llama2-7b
backend: llama
gpu_layers: 9
low_vram: true
mmap: false
mmlock: false
parameters:
  model: llama2-7b.gguf
  top_k: 80
  temperature: 0.2
  top_p: 0.7
context_size: 4096
template:
  chat: llama2-7b
system_prompt: |
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
roles: 
  assistant: 'Assistant:'
  system: 'System:'
  user: 'User:'

LocalAI curated logs:

localai-api-1  | 1:05AM DBG Request received: 
localai-api-1  | 1:05AM DBG Configuration read: &{PredictionOptions:{Model:llama2-7b.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-7b ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:9 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 1:05AM DBG Parameters: &{PredictionOptions:{Model:llama2-7b.gguf Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b F16:false Threads:20 Debug:true Roles:map[assistant:Assistant: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-7b ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  |  TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:9 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 1:05AM DBG Prompt (before templating): User: How old is the earth?
localai-api-1  | 1:05AM DBG Template found, input modified to: [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 
localai-api-1  | 1:05AM DBG Prompt (after templating): [INST]
localai-api-1  | <<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
localai-api-1  | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
localai-api-1  | <</SYS>>
localai-api-1  | User: How old is the earth?
localai-api-1  | [/INST] 
localai-api-1  | 
localai-api-1  | Assistant: 
localai-api-1  | 
localai-api-1  | 1:05AM DBG Loading model llama from llama2-7b.gguf
localai-api-1  | 1:05AM DBG Stopping all backends except 'llama2-7b.gguf'
localai-api-1  | 1:05AM DBG Loading model in memory from file: /models/llama2-7b.gguf
localai-api-1  | 
localai-api-1  | 1:05AM DBG Loading GRPC Model llama: {backendString:llama model:llama2-7b.gguf threads:20 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0000a2ea0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true}
localai-api-1  | 1:05AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1  | 1:05AM DBG GRPC Service for llama2-7b.gguf will be running at: '127.0.0.1:43567'
localai-api-1  | 1:05AM DBG GRPC Service state dir: /tmp/go-processmanager3128491290
localai-api-1  | 1:05AM DBG GRPC Service Started
localai-api-1  | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43567: connect: connection refused"
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr 2023/09/28 01:05:21 gRPC Server listening at 127.0.0.1:43567
localai-api-1  | 1:05AM DBG GRPC Service Ready
localai-api-1  | 1:05AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llama2-7b.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:false NUMA:false NGPULayers:9 MainGPU: TensorSplit: Threads:20 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llama2-7b.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr create_gpt_params_cuda: loading model /models/llama2-7b.gguf
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr ggml_init_cublas: found 1 CUDA devices:
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr   Device 0: NVIDIA GeForce RTX 2080 SUPER, compute capability 7.5
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: loaded meta data with 15 key-value pairs and 291 tensors from /models/llama2-7b.gguf (version GGUF V2 (latest))

~ portion of the logs removed for brevity ~

localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   0:                       general.architecture str     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   1:                               general.name str     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   2:                       llama.context_length u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   4:                          llama.block_count u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv  10:                          general.file_type u32     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - type  f32:   65 tensors
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_model_loader: - type  f16:  226 tensors
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: format         = GGUF V2 (latest)
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: arch           = llama
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: vocab type     = SPM
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_vocab        = 32000
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_merges       = 0
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_ctx_train    = 4096
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_ctx          = 4096
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_embd         = 4096
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_head         = 32
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_head_kv      = 32
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_layer        = 32
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_rot          = 128
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_gqa          = 1
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: f_norm_eps     = 1.0e-05
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: n_ff           = 11008
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: freq_base      = 10000.0
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: freq_scale     = 1
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: model type     = 7B
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: model ftype    = mostly F16
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: model size     = 6.74 B
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: general.name   = LLaMA v2
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: BOS token = 1 '<s>'
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: EOS token = 2 '</s>'
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: UNK token = 0 '<unk>'
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_print_meta: LF token  = 13 '<0x0A>'
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: ggml ctx size = 12853.11 MB
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: using CUDA for GPU acceleration
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: mem required  = 9378.83 MB (+ 4096.00 MB per state)
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: offloading 9 repeating layers to GPU
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: offloaded 9/35 layers to GPU
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llm_load_tensors: VRAM used: 3475 MB
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr ...................................................................................................
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_new_context_with_model: kv self size  = 4096.00 MB
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_new_context_with_model: compute buffer total size =  281.47 MB
localai-api-1  | 1:05AM DBG GRPC(llama2-7b.gguf-127.0.0.1:43567): stderr llama_new_context_with_model: not allocating a VRAM scratch buffer due to low VRAM option

nvidia-smi output (shows the backend loaded, with its offloaded layers in VRAM):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off | 00000000:06:00.0 Off |                  N/A |
| 46%   46C    P2              42W / 250W |   4951MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     31314      C   ...kend_data/backend-assets/grpc/llama     4914MiB |
+---------------------------------------------------------------------------------------+

Request and timestamps (curl returned an empty reply after more than two hours):

[2023-09-28 09:35:11][root@rig /opt/backup/models/thebloke_llama2-13b]# curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "How old is the earth?"}]}'
curl: (52) Empty reply from server
[2023-09-28 11:55:06][root@rig /opt/backup/models/thebloke_llama2-13b]# 

llama.cpp output and timestamp:

[2023-09-28 11:58:45][root@rig /opt/llama.cpp]# ./main -m /opt/backup/models/llama2-7b/llama2-7b.gguf -n 128 -p "How old is the earth?"
Log start
main: build = 1135 (bce1fef)
main: seed  = 1695869932
ggml_init_cublas: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 2080 SUPER, compute capability 7.5
llama_model_loader: loaded meta data with 15 key-value pairs and 291 tensors from /opt/backup/models/llama2-7b/llama2-7b.gguf (version GGUF V2 (latest))

~ portion of the log was removed for brevity ~ 

llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly F16
llm_load_print_meta: model size     = 6.74 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 12853.11 MB (+  256.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0 MB
...................................................................................................
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.97 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


How old is the earth? It depends on who you ask.
The age of the earth and its materials has been a topic for debate since long before Charles Lyell published his first geological textbook, Principles of Geology (1830), which challenged ideas about the formation of sedimentary rocks such as coal in terms of catastrophic events rather than by processes over time. A central theme running through the book is that rocks are not created from a single moment, but formed through “uniformitarian” processes—such as erosion and deposition—over geological timescales.
This idea has been debated
llama_print_timings:        load time = 13456.41 ms
llama_print_timings:      sample time =    55.51 ms /   128 runs   (    0.43 ms per token,  2305.97 tokens per second)
llama_print_timings: prompt eval time =   496.99 ms /     7 tokens (   71.00 ms per token,    14.08 tokens per second)
llama_print_timings:        eval time = 48221.20 ms /   127 runs   (  379.69 ms per token,     2.63 tokens per second)
llama_print_timings:       total time = 48819.18 ms
Log end
[2023-09-28 11:59:55][root@rig /opt/llama.cpp]# 

@lunamidori5
Collaborator

@RussellPacheco Alright, I am deeply sorry about that. Let me do some more looking; the only thing I can think of is something to do with Docker or the image.

@Dbone29

Dbone29 commented Oct 9, 2023

@RussellPacheco you are using 20 threads but your CPU only has 6 cores. This can slow LocalAI down. You should set threads to 5 or 6, maybe 7 at most.
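
A minimal sketch of that change, assuming the per-model threads override accepted by the model YAML (it surfaces as Threads in the debug config dump above); lowering the global THREADS value in the .env file to the physical core count works the same way:

# model YAML (sketch; only the relevant field shown)
name: llama2-13b-chat
threads: 6   # match physical cores, not logical threads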

@RussellPacheco
Author

@Dbone29 Yup, that was the problem. An oversight on my part. I set it to an appropriate number based on the number of cores and threads per core my CPU has, and everything is working as expected. Inference now takes only 2 minutes.

I will close this issue as everything is working.
