GPU inference is about half as fast as llama.cpp

My machine: AMD Ryzen 5950X, NVIDIA RTX A6000, CUDA 11.7.
I have two test configurations, both built from the latest code as of July 6 and run with the same parameters: t=6, l/n=128, prompt="how to build a house in 10 steps"
C1: chatglm2-6B, using chatglm.cpp
C2: vicuna_7b_v1.3, using llama.cpp
**On CPU:**
**FP16**

- C1: 2.5 t/s
- C2: 2.3 t/s

**Q4_0** 

- C1: 8.5 t/s
- C2: 7.5 t/s

So on CPU, C1 is consistently about 10% faster than C2, which is roughly what you would expect for a 6B model versus a 7B one.

**On GPU:**
**FP16**

- C1: 34.6 t/s
- C2: 44.3 t/s

**Q4_0** 

- C1: 49.8 t/s
- C2: 108.1 t/s

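So on GPU, C1 is instead much slower than C2 (about 22% slower at FP16 and about 54% slower at Q4_0), and the gap is more pronounced at low bit widths. For the GPU runs I set --threads 1, the same as with llama.cpp.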
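
For anyone who wants to check the percentages, here is a minimal sketch that recomputes the relative speed of C1 versus C2 from the numbers above. The `results` table and the `relative_speed` helper are names made up for this illustration; they are not part of chatglm.cpp or llama.cpp.

```python
# Throughput in tokens/s, copied from the measurements in this issue.
results = {
    ("CPU", "FP16"): {"C1": 2.5, "C2": 2.3},
    ("CPU", "Q4_0"): {"C1": 8.5, "C2": 7.5},
    ("GPU", "FP16"): {"C1": 34.6, "C2": 44.3},
    ("GPU", "Q4_0"): {"C1": 49.8, "C2": 108.1},
}


def relative_speed(c1: float, c2: float) -> float:
    """Percent difference of C1 throughput relative to C2 (positive = C1 faster)."""
    return (c1 / c2 - 1.0) * 100.0


for (device, dtype), r in results.items():
    diff = relative_speed(r["C1"], r["C2"])
    print(f"{device} {dtype}: C1 is {diff:+.1f}% vs C2")
```

On CPU both differences are positive and close to 10%. On GPU both are negative, and the Q4_0 gap is more than twice the FP16 gap.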
    

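chatglm.cpp does not provide an -ngl option, so I can't tell whether the cause is that some layers are not fully offloaded to the GPU. That makes this hard to debug.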

UPDATE:
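I'm attaching a recent benchmark I ran against fastllm. With fastllm, chatglm2 6B on GPU is 40% faster than vicuna 7B, which shows there is nothing wrong with the architecture itself. With chatglm.cpp, however, it is currently 50% slower than vicuna 7B.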
![image](https://github.com/li-plus/chatglm.cpp/assets/15835199/322c98de-1e47-404a-a867-80ee83373c68)



