
Commit a129213

[GPTQ Enhance] Support GPTQ & AWQ inference for Qwen v1, v1.5, and Mixtral. (#134)

1 parent aa4a8ab, commit a129213

12 files changed: 778 additions, 108 deletions

docs/gptq_and_awq.md (5 additions, 4 deletions)

@@ -6,11 +6,12 @@ Neural Speed supports multiple weight-only quantization algorithms, such as GPTQ
 More algorithm details please check [GPTQ](https://arxiv.org/abs/2210.17323) and [AWQ](https://arxiv.org/abs/2306.00978).
 
 Validated GPTQ & AWQ models directly from the HuggingFace:
-* [Llama-2-7B-Chat-GPT](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ) & [Llama-2-13B-Chat-GPT](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
-* [CodeLlama-7B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) & [CodeLlama-13B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ)
+* [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ) & [Llama-2-13B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-Chat-GPTQ) & [Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ) & [Llama-2-13B-chat-AWQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-AWQ)
+* [CodeLlama-7B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) & [CodeLlama-13B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ) & [CodeLlama-7B-AWQ](https://huggingface.co/TheBloke/CodeLlama-7B-AWQ) & [CodeLlama-13B-AWQ](https://huggingface.co/TheBloke/CodeLlama-13B-AWQ)
+* [Mistral-7B-Instruct-v0.1-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GPTQ) & [Mistral-7B-Instruct-v0.1-AWQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-AWQ)
+* [Mixtral-8x7B-Instruct-v0.1-GPTQ](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) & [Mixtral-8x7B-Instruct-v0.1-AWQ](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ)
+* [Qwen-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Qwen-7B-Chat-GPTQ) & [Qwen-7B-Chat-AWQ](https://huggingface.co/TheBloke/Qwen-7B-Chat-AWQ) & [Qwen1.5-7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GPTQ-Int4)
 * [SOLAR-10.7B-v1.0-GPTQ](https://huggingface.co/TheBloke/SOLAR-10.7B-v1.0-GPTQ)
-* [Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ) & [Llama-2-13B-chat-AWQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-AWQ)
-* [CodeLlama-7B-AWQ](https://huggingface.co/TheBloke/CodeLlama-7B-AWQ) & [CodeLlama-13B-AWQ](https://huggingface.co/TheBloke/CodeLlama-13B-AWQ)
 
 Please check more validated GPTQ & AWQ models in the list of [supported_models](./docs/supported_models.md).
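For orientation, the validated checkpoints listed above are loaded through the Transformers-style frontend that wraps Neural Speed. Below is a minimal sketch, assuming the `intel_extension_for_transformers` integration described in the project README; the model name, prompt, and generation settings are illustrative, not taken from this commit:

```python
# Minimal sketch, assuming the intel_extension_for_transformers frontend
# that wraps Neural Speed; not code from this commit.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"  # one of the validated checkpoints above
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# The frontend reads the checkpoint's GPTQ/AWQ quantization config and
# dispatches to Neural Speed's weight-only kernels (assumed behavior).
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```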

docs/supported_models.md (47 additions, 47 deletions)

@@ -72,17 +72,58 @@ Neural Speed supports the following models:
 <td>✅</td>
 <td>✅</td>
 <td>Latest</td>
+</tr>
+<tr>
+<td><a href="https://huggingface.co/Intel/neural-chat-7b-v3-1" target="_blank" rel="noopener noreferrer">Neural-Chat-7B-v3-1</a>,
+<a href="https://huggingface.co/Intel/neural-chat-7b-v3-2" target="_blank" rel="noopener noreferrer">Neural-Chat-7B-v3-2</a></td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>Latest</td>
+</tr>
+<tr>
+<td><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1" target="_blank" rel="noopener noreferrer">Mistral-7B</a>,
+<a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1" target="_blank" rel="noopener noreferrer">Mixtral-8x7B</a></td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>4.36.0 or newer</td>
+</tr>
+<tr>
+<td><a href="https://huggingface.co/Qwen/Qwen-7B-Chat" target="_blank" rel="noopener noreferrer">Qwen-7B</a>,
+<a href="https://huggingface.co/Qwen/Qwen-14B-Chat" target="_blank" rel="noopener noreferrer">Qwen-14B</a>,
+<a href="https://huggingface.co/Qwen/Qwen1.5-7B-Chat" target="_blank" rel="noopener noreferrer">Qwen1.5-7B</a>,
+<a href="https://huggingface.co/Qwen/Qwen1.5-0.5B" target="_blank" rel="noopener noreferrer">Qwen1.5-0.5B</a></td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>Latest</td>
 </tr>
 <tr>
 <td><a href="https://huggingface.co/EleutherAI/gpt-j-6b" target="_blank" rel="noopener noreferrer">GPT-J-6B</a></td>
 <td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
 <td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅</td>
 <td>Latest</td>
 </tr>
 <tr>
@@ -160,19 +201,6 @@ Neural Speed supports the following models:
 <td> </td>
 <td> </td>
 <td>Latest</td>
-</tr>
-<tr>
-<td><a href="https://huggingface.co/Intel/neural-chat-7b-v3-1" target="_blank" rel="noopener noreferrer">Neural-Chat-7B-v3-1</a>,
-<a href="https://huggingface.co/Intel/neural-chat-7b-v3-2" target="_blank" rel="noopener noreferrer">Neural-Chat-7B-v3-2</a></td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>✅</td>
-<td>Latest</td>
 </tr>
 <tr>
 <td><a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank" rel="noopener noreferrer">ChatGLM-6B</a>,
@@ -200,34 +228,6 @@ Neural Speed supports the following models:
 <td> </td>
 <td>4.33.1</td>
 </tr>
-<tr>
-<td><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1" target="_blank" rel="noopener noreferrer">Mistral-7B</a>,
-<a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1" target="_blank" rel="noopener noreferrer">Mixtral-8x7B</a></td>
-<td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
-<td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
-<td>4.36.0 or newer</td>
-</tr>
-<tr>
-<td><a href="https://huggingface.co/Qwen/Qwen-7B-Chat" target="_blank" rel="noopener noreferrer">Qwen-7B</a>,
-<a href="https://huggingface.co/Qwen/Qwen-14B-Chat" target="_blank" rel="noopener noreferrer">Qwen-14B</a>,
-<a href="https://huggingface.co/Qwen/Qwen1.5-7B-Chat" target="_blank" rel="noopener noreferrer">Qwen1.5-7B</a>,
-<a href="https://huggingface.co/Qwen/Qwen1.5-0.5B" target="_blank" rel="noopener noreferrer">Qwen1.5-0.5B</a></td>
-<td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
-<td>✅</td>
-<td> </td>
-<td> </td>
-<td> </td>
-<td>Latest</td>
-</tr>
 <tr>
 <td><a href="https://huggingface.co/microsoft/phi-2" target="_blank" rel="noopener noreferrer">phi-2</a>,
 <a href="https://huggingface.co/microsoft/phi-1_5" target="_blank" rel="noopener noreferrer">phi-1_5</a>

neural_speed/__init__.py (1 addition, 1 deletion)

@@ -66,7 +66,7 @@ def __import_package(self, model_type):
             import neural_speed.qwen_cpp as cpp_model
         elif model_type == "mistral":
             import neural_speed.mistral_cpp as cpp_model
-        elif model_type == "qwen":
+        elif model_type == "qwen2":
             import neural_speed.qwen_cpp as cpp_model
         elif model_type == "phi":
             import neural_speed.phi_cpp as cpp_model
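The one-line change above matters because Qwen1.5 checkpoints report `model_type: "qwen2"` in their config.json while reusing the same `qwen_cpp` backend as Qwen v1. A sketch of that dispatch idea follows; `import_backend` and its dictionary are hypothetical stand-ins for the real `__import_package` method, which uses an if/elif chain:

```python
import importlib

def import_backend(model_type: str):
    """Hypothetical mirror of __import_package: map an HF model_type to a backend module."""
    backends = {
        "qwen": "neural_speed.qwen_cpp",
        "qwen2": "neural_speed.qwen_cpp",   # Qwen1.5 ("qwen2") shares the qwen backend
        "mistral": "neural_speed.mistral_cpp",
        "phi": "neural_speed.phi_cpp",
    }
    return importlib.import_module(backends[model_type])
```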

neural_speed/convert/__init__.py (1 addition, 1 deletion)

@@ -19,7 +19,7 @@
 from transformers import AutoConfig
 import subprocess
 
-model_maps = {"gpt_neox": "gptneox", "gpt_bigcode": "starcoder", "whisper": "whisper"}
+model_maps = {"gpt_neox": "gptneox", "gpt_bigcode": "starcoder", "whisper": "whisper", "qwen2": "qwen"}
 
 
 def convert_model(model, outfile, outtype="f32", whisper_repo_path=None, use_quantized_model=False):
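Taken together with the `qwen2` import branch above, this remap lets a Qwen1.5 GPTQ checkpoint flow through the existing qwen converter. A hedged usage sketch: `convert_model`'s signature and `model_maps` are taken from the diff, but the output filename and the `model_maps.get(...)` lookup shown here are assumptions about how the converter is driven:

```python
from transformers import AutoConfig
from neural_speed.convert import convert_model, model_maps

# Qwen1.5 reports model_type "qwen2"; model_maps remaps it onto the qwen converter.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-7B-Chat-GPTQ-Int4", trust_remote_code=True)
print(model_maps.get(config.model_type, config.model_type))  # -> "qwen"

# use_quantized_model=True is assumed to select the GPTQ/AWQ loading path.
convert_model("Qwen/Qwen1.5-7B-Chat-GPTQ-Int4", "qwen1.5-7b-chat-gptq.bin",
              outtype="f32", use_quantized_model=True)
```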
