LocalAI version:
Docker Image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
Environment, CPU architecture, OS, and Version:
Running in TrueNAS SCALE Kubernetes (k3s) with an NVIDIA Tesla P40 passed through to the container.
# uname -a
Linux localai-ix-chart-f8bbbb7c7-x6xx9 6.1.42-production+truenas #2 SMP PREEMPT_DYNAMIC Mon Aug 14 23:21:26 UTC 2023 x86_64 GNU/Linux
# nvidia-smi
Tue Aug 22 16:36:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 Off | 00000000:23:00.0 Off | 0 |
| N/A 28C P8 10W / 250W | 2MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
# cat /proc/cpuinfo |grep "model name" | nl
1 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
2 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
3 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
4 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
5 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
6 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
7 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
8 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
9 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
10 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
11 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
12 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
13 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
14 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
15 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
16 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
17 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
18 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
19 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
20 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
# cat /proc/meminfo | grep Mem
MemTotal: 32701568 kB
MemFree: 18305148 kB
MemAvailable: 18767368 kB
Describe the bug
The AutoGPTQ backend added in #871 doesn't work in the upstream container. I also tried exllama, which fails with a linker error for cudaSetDevice.
To Reproduce
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/orca_mini_v2_13b-GPTQ",
"messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
"backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=FileNotFoundError('Could not find model in TheBloke/orca_mini_v2_13b-GPTQ'), type(err)=\u003cclass 'FileNotFoundError'\u003e","type":""}}
Also tried with a local model:
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deadbeef",
"messages": [{"role": "user", "content": "Give me a HTTP REST server made in rust that uses sqlite."}],
"temperature": 0.9
}' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 655 100 489 100 166 115 39 0:00:04 0:00:04 --:--:-- 154
{
"error": {
"code": 500,
"message": "could not load model (no success): Unexpected err=OSError(\"wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\"), type(err)=<class 'OSError'>",
"type": ""
}
}
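For reference, the second message is the standard error transformers emits when the configured model string is neither a local path nor an existing Hub repo id, i.e. the backend is treating the safetensors file name as a repo id. A rough illustration (the file name is just the one from the error above):

from transformers import AutoConfig

# Passing the bare safetensors file name where a repo id ("owner/repo") or a
# local directory is expected yields an error of the same shape as above.
try:
    AutoConfig.from_pretrained(
        "wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors"
    )
except Exception as err:
    print(type(err).__name__, err)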
Expected behavior
The GPTQ model is downloaded (or loaded from the local models directory) and the request returns a completion, as already happens for GGML models via the llama.cpp backend.
Logs
2023-08-22 09:32:17.702437-07:00 4:32PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:TheBloke/orca_mini_v2_13b-GPTQ ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/TheBloke/orca_mini_v2_13b-GPTQ Device: UseTriton:false ModelBaseName:orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0}
2023-08-22 09:32:18.184608-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)okenizer_config.json: 0%| | 0.00/727 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 126kB/s]
2023-08-22 09:32:20.337018-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
2023-08-22 09:32:20.789331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)cial_tokens_map.json: 0%| | 0.00/435 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 202kB/s]
2023-08-22 09:32:20.792331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
2023-08-22 09:32:21.126540-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)lve/main/config.json: 0%| | 0.00/812 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 812/812 [00:00<00:00, 151kB/s]
2023-08-22 09:32:21.646001-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)quantize_config.json: 0%| | 0.00/158 [00:00<?, ?B/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 74.0kB/s]
2023-08-22 09:32:21.818950-07:00 [10.10.5.174]:36522 500 - POST /v1/chat/completions
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/orca_mini_v2_13b-GPTQ",
"messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
"backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
Ah, just saw you tried that, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try another model?
For exllama, the error looks like an incompatible CUDA version in the container:
I also tried Vicuna and other TheBloke models as direct downloads, and they give the same not-found errors. Standard llama.cpp downloads just fine, though (I tested GGML versions of the same models over llama.cpp and they work fine).
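To narrow down whether the exllama failure really is a CUDA runtime/driver mismatch inside the container, one quick check (generic, not LocalAI-specific) is to compare the CUDA version PyTorch was built against with what the driver reports:

import torch

# nvidia-smi above reports driver 535.54.03 / CUDA 12.2; if the torch wheel in the
# image was built for a different major CUDA release, native extensions such as
# exllama's kernels can fail to load with unresolved symbols like cudaSetDevice.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))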