What's the proper way to run a downloaded model with GPU? #990
Unanswered · dany-nonstop asked this question in Q&A
I used the

```
GALLERIES=[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"},{"url":"github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
```

environment parameter to enable model galleries, then sent `{"id": "thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin"}` to the API's `/models/apply` endpoint to download a model locally. The download was successful; I got these files:

- `thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin.yaml`
- `vicuna-chat.tmpl`
- `vicuna-completion.tmpl`
- `wizard-vicuna-13B.ggmlv3.q4_0.bin`
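For context, the steps look roughly like this (a sketch of my setup: the image tag, volume path, and port 8080 are specific to my machine and may differ in yours):

```bash
# Start LocalAI with the galleries enabled (image name and paths are assumptions)
docker run -p 8080:8080 -v $PWD/models:/models \
  -e GALLERIES='[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"},{"url":"github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]' \
  quay.io/go-skynet/local-ai:latest

# Ask the API to download the model from the gallery
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin"}'
```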
When calling `/v1/completion`, I can only call the model by the `wizard-vicuna-13B.ggmlv3.q4_0.bin` file name, not by the model name in the gallery, `thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin`.
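Concretely (assuming the OpenAI-compatible `/v1/completions` route on port 8080; the prompt is just an example):

```bash
# Works: model referenced by the downloaded file name
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "wizard-vicuna-13B.ggmlv3.q4_0.bin", "prompt": "Hello"}'

# Does not work for me: model referenced by the gallery name
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin", "prompt": "Hello"}'
```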
I checked CPU load with `top` and GPU load with `nvidia-smi` inside the docker container, and inference appears to run on the CPU only. I edited `thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin.yaml` and added a line `gpu_layers: 1000`, but it seems the file is ignored if I'm using the actual model name `wizard-vicuna-13B.ggmlv3.q4_0.bin`.
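For reference, the edited YAML looks roughly like this (sketched from memory; apart from the `gpu_layers` line I added, the fields are whatever the gallery generated, so treat the exact values as illustrative):

```yaml
# thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin.yaml
name: thebloke__wizard-vicuna-13b-ggml__wizard-vicuna-13b.ggmlv3.q4_0.bin
parameters:
  model: wizard-vicuna-13B.ggmlv3.q4_0.bin
template:
  chat: vicuna-chat
  completion: vicuna-completion
gpu_layers: 1000   # the line I added, hoping to offload layers to the GPU
```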
My questions: what is the proper way to use a model's definition YAML file? And how can I enable GPU for inference?

Many thanks in advance!