
Qwen-7B-Chat fails with inputs larger than 6.7k on the second or third run #11106

Closed
juan-OY opened this issue May 23, 2024 · 2 comments

juan-OY commented May 23, 2024

Running one task on MTL with -i 6707 -o 160 shows OOM, while a similar command passed in previous testing.

Traceback (most recent call last):
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 241, in <module>
    infer_test(model, tokenizer, input_token_num, output_token_num, total_speed_file)
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 108, in infer_test
    prefill_output = model(**model_inputs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel/.cache\huggingface\modules\transformers_modules\Qwen-7B-Chat-sym_int4\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 703, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: XPU out of memory. Tried to allocate 2.37 GiB (GPU 0; 14.48 GiB total capacity; 6.94 GiB already allocated; 8.04 GiB reserved in total by PyTorch)

qiuxin2012 self-assigned this May 23, 2024
qiuxin2012 (Contributor) commented

To minimize MTL's memory usage, you can put the embedding in CPU memory by setting cpu_embedding=True when calling from_pretrained or load_low_bit. Qwen's embedding is about 1 GB.
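
For reference, a minimal sketch of what this looks like when loading the saved low-bit checkpoint; the model path is a placeholder, and apart from cpu_embedding=True it follows the usual ipex-llm loading pattern:

```python
# Sketch: load a saved sym_int4 Qwen checkpoint with ipex-llm, keeping the
# ~1 GB embedding table in CPU memory instead of XPU memory.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen-7B-Chat-sym_int4"  # placeholder: path to the saved low-bit model

model = AutoModelForCausalLM.load_low_bit(
    model_path,
    trust_remote_code=True,
    cpu_embedding=True,  # embedding runs on the CPU, freeing XPU memory
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```

The same cpu_embedding=True keyword can be passed to from_pretrained when quantizing the model on the fly.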

juan-OY (Author) commented May 27, 2024

We can close it; the issue could not be reproduced again.

juan-OY closed this as completed May 27, 2024