Replies: 2 comments
-
Support for ngqa is being added; see #860.
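Once that lands, the model YAML should presumably accept it alongside the other llama.cpp options. A hypothetical sketch only — the ngqa field name and its placement are guesses based on the wording above, so check #860 for the final syntax:

```yaml
# llama-2-70b.yaml — hypothetical LocalAI model config, pending #860
name: llama-2-70b
context_size: 4096
parameters:
  model: llama-2-70b.ggmlv3.q6_K.bin
# Llama 2 70B needs a grouped-query-attention factor of 8.
ngqa: 8
```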
-
Waiting for docs.
-
When I run any Llama 2 70B model, I get this error:
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr ggml_init_cublas: found 1 CUDA devices:
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama.cpp: loading model from /root/Local/cublas/models/llama-2-70b.ggmlv3.q6_K.bin
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: format = ggjt v3 (latest)
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_vocab = 32000
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_ctx = 4096
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_embd = 8192
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_mult = 4096
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_head = 64
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_head_kv = 64
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_layer = 80
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_rot = 128
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_gqa = 1
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_ff = 24576
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: freq_base = 10000.0
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: freq_scale = 1
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: ftype = 18 (mostly Q6_K)
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: model size = 65B
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: ggml ctx size = 53965.41 MB
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: using CUDA for GPU acceleration
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_load_model_from_file: failed to load model
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_init_from_gpt_params: error: failed to load model '/root/Local/cublas/models/llama-2-70b.ggmlv3.q6_K.bin'
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr load_binding_model: error: unable to load model
I see "n_gqa" parameter in the llama-cpp-python 0.1.77.
How can I set this parameter in LocalAI? Thanks!