BigDL-A750-Qwen7b-Allocation is out of device memory on current platform. #10575
Hi, thanks for raising this issue. To confirm a few points with you:
We convert the loaded model to fp16 for lower memory usage, since the Arc A750 only has 8 GB of memory (follow-up in this issue: intel-analytics/text-generation-webui#25).
Regarding your question, the GPU memory initially occupied by the system is approximately 1.6 GB.
We also have some suggestions from our side for possibly running Qwen-7B on the Arc A750; a sketch of the loading step follows below:
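To make the int4 loading and fp16 conversion concrete, here is a minimal sketch, assuming the bigdl-llm transformers-style API and a placeholder model path; exact parameter names may differ between versions:

```python
# Minimal sketch (assumptions: bigdl-llm installed with XPU support, the
# model path is a placeholder). Loads Qwen-7B with int4 weight quantization,
# then casts the remaining fp32 parameters to fp16 before moving to the GPU.
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen-7B"  # placeholder: local checkpoint or HF repo id

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,        # int4 weight-only quantization at load time
    trust_remote_code=True,   # Qwen ships custom modeling code
)
model = model.half().to("xpu")  # fp16 for the non-quantized parts

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```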
OK, I see the latest link; I'll give it a try. Thanks a lot.
I'd like to ask: have you run Qwen on an A750 before, and how much GPU memory does it take? Thanks.
I haven't tried text-generation-webui, but for a simple generate, Qwen-7B can run on the Arc A750; for a 256-token input the memory I observe is 5290.11 MB. This value comes from xpu-smi and may not be the actual peak memory; I suppose the peak would be close to or somewhat larger than 6 GB. Some suggestions you may try on your side (a memory-measurement sketch follows below):
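Since xpu-smi readings may miss the true peak, here is a minimal sketch of measuring peak memory from inside the process, assuming torch.xpu mirrors the torch.cuda memory-stats API (our assumption about intel-extension-for-pytorch, not something stated in the thread), and reusing `model` and `inputs` from the loading sketch above:

```python
# Hedged sketch: measure peak XPU memory around a generate() call.
# Assumes `model` and `inputs` already live on the "xpu" device and that
# torch.xpu exposes the same memory counters as torch.cuda (true for
# recent intel-extension-for-pytorch releases, but verify on your version).
import torch
import intel_extension_for_pytorch as ipex  # provides the torch.xpu backend

torch.xpu.reset_peak_memory_stats()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)
torch.xpu.synchronize()

peak_mb = torch.xpu.max_memory_allocated() / (1024 ** 2)
print(f"peak XPU memory allocated: {peak_mb:.2f} MB")
```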
Thank you very much, I'll give it a try.
I'm using https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one with the API.
Can you clarify where `export IPEX_LLM_LOW_MEM=1` needs to be put? When I type it in conda before starting server.py I get:
My Arc A750 outputs the following error after a few interactions with the chatbot, which I assume is memory related.
Running on Windows, please change it to:
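The command itself is elided above; presumably it is the cmd equivalent of the export line, since Windows shells have no `export`. A hedged guess:

```
:: Hedged guess at the Windows (cmd) equivalent of `export IPEX_LLM_LOW_MEM=1`
set IPEX_LLM_LOW_MEM=1
python server.py
```

(PowerShell users would use `$env:IPEX_LLM_LOW_MEM = "1"` instead.)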
Thanks, but that doesn't seem to have been the issue. As soon as about 2,000+ tokens of context are reached I get:
Should I open a new ticket?
Sure, you can open a new ticket and give more details about your settings (system, version, how you run, etc.). We will try to reproduce this. Thanks!
When I use the A750 with BigDL to load the Qwen-7B int4 model, it reports that device memory is exceeded. I don't know what's going on; is there a problem with my operation?
The following is the error message:
Traceback (most recent call last):
File "D:\workspace\text-generation-webui-bigdl-llm\modules\text_generation.py", line 408, in generate_reply_HF
shared.model.generate(**generate_params)
File "C:\Users\ZhangChen.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1259, in generate
return super().generate(
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 1525, in generate
return self.sample(
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 2622, in sample
outputs = self(
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZhangChen.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1060, in forward
lm_logits = self.lm_head(hidden_states)
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZhangChen.conda\envs\llm\lib\site-packages\bigdl\llm\transformers\low_bit_linear.py", line 622, in forward
result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Allocation is out of device memory on current platform.
Output generated in 12.06 seconds (0.00 tokens/s, 0 tokens, context 730, seed 290229866)
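For reference, the allocation fails inside bigdl-llm's low_bit_linear during the lm_head projection, which is where the `IPEX_LLM_LOW_MEM` suggestion above comes in. A hedged sketch of applying it from inside server.py rather than the shell (whether the flag is read at import time or call time is our assumption, not confirmed in the thread):

```python
# Hedged sketch: enable the low-memory flag from inside server.py.
# Setting it before any bigdl/ipex imports ensures it is visible no
# matter when the runtime reads it (read time is an assumption here).
import os
os.environ["IPEX_LLM_LOW_MEM"] = "1"

# ...the rest of server.py's imports and startup would follow here...
```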