LocalAI version:
Docker Image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
Environment, CPU architecture, OS, and Version:
Running in TrueNAS SCALE Kubernetes (k3s) with an NVIDIA Tesla P40 passed through to the container.
# uname -a
Linux localai-ix-chart-f8bbbb7c7-x6xx9 6.1.42-production+truenas #2 SMP PREEMPT_DYNAMIC Mon Aug 14 23:21:26 UTC 2023 x86_64 GNU/Linux
# nvidia-smi
Tue Aug 22 16:36:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 Off | 00000000:23:00.0 Off | 0 |
| N/A 28C P8 10W / 250W | 2MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
# cat /proc/cpuinfo |grep "model name" | nl
1 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
2 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
3 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
4 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
5 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
6 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
7 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
8 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
9 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
10 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
11 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
12 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
13 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
14 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
15 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
16 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
17 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
18 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
19 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
20 model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
# cat /proc/meminfo | grep Mem
MemTotal: 32701568 kB
MemFree: 18305148 kB
MemAvailable: 18767368 kB
Describe the bug
The AutoGPTQ backend added in #871 doesn't work in the upstream container. I also tried exllama, which fails with a linker error for cudaSetDevice.
To Reproduce
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/orca_mini_v2_13b-GPTQ",
"messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
"backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=FileNotFoundError('Could not find model in TheBloke/orca_mini_v2_13b-GPTQ'), type(err)=\u003cclass 'FileNotFoundError'\u003e","type":""}}
Also tried with a local model:
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deadbeef",
"messages": [{"role": "user", "content": "Give me a HTTP REST server made in rust that uses sqlite."}],
"temperature": 0.9
}' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 655 100 489 100 166 115 39 0:00:04 0:00:04 --:--:-- 154
{
"error": {
"code": 500,
"message": "could not load model (no success): Unexpected err=OSError(\"wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\"), type(err)=<class 'OSError'>",
"type": ""
}
}
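For reference, the second message is the standard error transformers emits when the configured model string is neither a local path nor an existing Hub repo id, i.e. the backend is treating the safetensors file name as a repo id. A rough illustration (the file name is just the one from the error above):

from transformers import AutoConfig

# Passing the bare safetensors file name where a repo id ("owner/repo") or a
# local directory is expected yields an error of the same shape as above.
try:
    AutoConfig.from_pretrained(
        "wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors"
    )
except Exception as err:
    print(type(err).__name__, err)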
Expected behavior
The GPTQ model is downloaded (or loaded from the local models directory) and the request returns a completion, as already happens for GGML models via the llama.cpp backend.
Logs
2023-08-22 09:32:17.702437-07:00 4:32PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:TheBloke/orca_mini_v2_13b-GPTQ ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/TheBloke/orca_mini_v2_13b-GPTQ Device: UseTriton:false ModelBaseName:orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0}
2023-08-22 09:32:18.184608-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)okenizer_config.json: 0%| | 0.00/727 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 126kB/s]
2023-08-22 09:32:20.337018-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
2023-08-22 09:32:20.789331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)cial_tokens_map.json: 0%| | 0.00/435 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 202kB/s]
2023-08-22 09:32:20.792331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
2023-08-22 09:32:21.126540-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)lve/main/config.json: 0%| | 0.00/812 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 812/812 [00:00<00:00, 151kB/s]
2023-08-22 09:32:21.646001-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)quantize_config.json: 0%| | 0.00/158 [00:00<?, ?B/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 74.0kB/s]
2023-08-22 09:32:21.818950-07:00 [10.10.5.174]:36522 500 - POST /v1/chat/completions
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/orca_mini_v2_13b-GPTQ",
"messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
"backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
Ah, just saw you tried that, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try another model?
For exllama, the error looks like an incompatible CUDA version in the container:
I also tried Vicuna and other TheBloke models as direct downloads, and they give the same not-found errors. Standard llama.cpp downloads just fine, though (I tested GGML versions of the same models over llama.cpp and they work fine).
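To narrow down whether the exllama failure really is a CUDA runtime/driver mismatch inside the container, one quick check (generic, not LocalAI-specific) is to compare the CUDA version PyTorch was built against with what the driver reports:

import torch

# nvidia-smi above reports driver 535.54.03 / CUDA 12.2; if the torch wheel in the
# image was built for a different major CUDA release, native extensions such as
# exllama's kernels can fail to load with unresolved symbols like cudaSetDevice.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))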