
bug: AutoGPTQ doesn't work (can't download model) #941

Open
racerxdl opened this issue Aug 22, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@racerxdl

LocalAI version:

Docker Image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Environment, CPU architecture, OS, and Version:
Running on TrueNAS SCALE Kubernetes (k3s) with an NVIDIA Tesla P40 in the container.

# uname -a
Linux localai-ix-chart-f8bbbb7c7-x6xx9 6.1.42-production+truenas #2 SMP PREEMPT_DYNAMIC Mon Aug 14 23:21:26 UTC 2023 x86_64 GNU/Linux
# nvidia-smi
Tue Aug 22 16:36:27 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:23:00.0 Off |                    0 |
| N/A   28C    P8              10W / 250W |      2MiB / 23040MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
# cat /proc/cpuinfo |grep "model name" | nl
     1  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     2  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     3  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     4  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     5  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     6  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     7  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     8  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     9  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    10  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    11  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    12  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    13  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    14  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    15  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    16  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    17  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    18  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    19  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    20  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
# cat /proc/meminfo  | grep Mem
MemTotal:       32701568 kB
MemFree:        18305148 kB
MemAvailable:   18767368 kB

Describe the bug
AutoGPTQ, added by #871, doesn't work in the upstream container. I also tried exllama, which fails with a linker error for CudaSetDevice (see below).

To Reproduce

curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/orca_mini_v2_13b-GPTQ",
     "messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
     "backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=FileNotFoundError('Could not find model in TheBloke/orca_mini_v2_13b-GPTQ'), type(err)=\u003cclass 'FileNotFoundError'\u003e","type":""}}

Also tried with a local model:

name: deadbeef
backend: autogptq
parameters:
  model: wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors

curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "deadbeef",
     "messages": [{"role": "user", "content": "Give me a HTTP REST server made in rust that uses sqlite."}],
     "temperature": 0.9
   }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   655  100   489  100   166    115     39  0:00:04  0:00:04 --:--:--   154
{
  "error": {
    "code": 500,
    "message": "could not load model (no success): Unexpected err=OSError(\"wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\"), type(err)=<class 'OSError'>",
    "type": ""
  }
}
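
The OSError above comes from the Hugging Face loading path: from_pretrained-style loaders accept a local directory or a repo id, not a bare .safetensors filename. A minimal sketch of that resolution (an assumption about how the backend passes the model field, not LocalAI's actual code):

# Sketch: why a bare .safetensors filename is rejected (hypothetical check, not LocalAI code)
import os
from transformers import AutoTokenizer

model_ref = "wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors"

if os.path.isdir(model_ref):
    # A directory containing config.json, tokenizer files and the quantized weights would load.
    tokenizer = AutoTokenizer.from_pretrained(model_ref)
else:
    # Neither a local folder nor a "user/repo" id -> transformers raises the OSError shown above.
    print(f"{model_ref} is not a local folder and will not resolve as a repo id")

If the autogptq backend expects a directory, pointing the model field at the folder containing the weights (rather than the file itself) might behave differently, but that is an assumption.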

Expected behavior

The GPTQ model is downloaded (or loaded from the local models directory) and the chat completion request returns a response.

Logs

2023-08-22 09:32:17.702437-07:00 4:32PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:TheBloke/orca_mini_v2_13b-GPTQ ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/TheBloke/orca_mini_v2_13b-GPTQ Device: UseTriton:false ModelBaseName:orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0}
2023-08-22 09:32:18.184608-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 126kB/s]
2023-08-22 09:32:20.337018-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
2023-08-22 09:32:20.789331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 202kB/s]
2023-08-22 09:32:20.792331-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
2023-08-22 09:32:21.126540-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)lve/main/config.json:   0%|          | 0.00/812 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 812/812 [00:00<00:00, 151kB/s]
2023-08-22 09:32:21.646001-07:00 4:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr
Downloading (…)quantize_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 74.0kB/s]
2023-08-22 09:32:21.818950-07:00 [10.10.5.174]:36522 500 - POST /v1/chat/completions
racerxdl added the bug (Something isn't working) label on Aug 22, 2023
@mudler
Owner

mudler commented Aug 22, 2023

Did you try with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/orca_mini_v2_13b-GPTQ",
     "messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
     "backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'

?

@mudler
Owner

mudler commented Aug 22, 2023

Ah, just saw that you already tried that, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try with another model?

@racerxdl
Author

Ah, just saw that you already tried that, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try with another model?

For exllama, the error looks like an incompatible CUDA/torch version in the container:

ImportError: /usr/local/lib/python3.9/dist-packages/exllama_ext.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I also tried downloading Vicuna and other TheBloke models directly; they give the same not-found errors. Standard llama-cpp downloads just fine, though (I tested GGML versions of the same models over llama-cpp and they work fine).
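
For what it's worth, that missing symbol demangles to c10::cuda::SetDevice(int), which lives in libtorch, so it usually points at a prebuilt exllama extension compiled against a different torch/CUDA build than the one installed in the container. A quick diagnostic sketch (assuming python3 and binutils are available inside the container):

# Diagnostic sketch: compare the installed torch/CUDA build with the symbol the extension expects
import subprocess
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)

# c++filt (from binutils) demangles the missing symbol reported by the ImportError
out = subprocess.run(["c++filt", "_ZN3c104cuda9SetDeviceEi"], capture_output=True, text=True)
print("missing symbol:", out.stdout.strip())  # -> c10::cuda::SetDevice(int)

If the reported torch/CUDA versions don't match what the exllama wheel was built against, rebuilding the extension in the container would be the usual fix, but that is an assumption on my side.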
