
BigDL-A750-Qwen7b-Allocation is out of device memory on current platform. #10575

Open · ChenVkl opened this issue Mar 28, 2024 · 12 comments

ChenVkl commented Mar 28, 2024

When I use an A750 to run BigDL and load the Qwen-7B int4 model, it reports that device memory is exceeded. I don't know what's going on; is there a problem with my setup?
The following is the error message:
Traceback (most recent call last):
File "D:\workspace\text-generation-webui-bigdl-llm\modules\text_generation.py", line 408, in generate_reply_HF
shared.model.generate(**generate_params)
File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1259, in generate
return super().generate(
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 1525, in generate
return self.sample(
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 2622, in sample
outputs = self(
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1060, in forward
lm_logits = self.lm_head(hidden_states)
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\bigdl\llm\transformers\low_bit_linear.py", line 622, in forward
result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Allocation is out of device memory on current platform.
Output generated in 12.06 seconds (0.00 tokens/s, 0 tokens, context 730, seed 290229866)
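
Note: the failing call is the lm_head projection, which materializes logits over Qwen's full vocabulary for every prompt position. A back-of-envelope calculation (a sketch, assuming Qwen-7B's published vocab size of 151,936) shows why this single allocation is large at context 730:

```python
# Rough size of the lm_logits tensor allocated by self.lm_head(hidden_states)
# for the 730-token prompt in the log. Vocab size assumed from Qwen-7B's
# config.json (151,936); the actual dtype depends on how the model was loaded.
vocab_size = 151936
seq_len = 730

bytes_fp32 = seq_len * vocab_size * 4   # 4 bytes per fp32 element
bytes_fp16 = seq_len * vocab_size * 2   # 2 bytes per fp16 element
print(f"lm_logits fp32: {bytes_fp32 / 1024**2:.0f} MiB")  # ~423 MiB
print(f"lm_logits fp16: {bytes_fp16 / 1024**2:.0f} MiB")  # ~212 MiB
```

On an 8 GB card already holding the quantized weights, the KV cache, and any system usage, a transient spike of this size can plausibly tip it over.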

(screenshot attached)
hkvision (Contributor) commented:

Hi,

Thanks for raising this issue; we'd like to confirm a few things:

  • What's the initial GPU memory occupied by the system before you run the model inference?
  • What input length do you chat with the model? Is it the "context 730" shown in your log?

We are converting the loaded model into fp16 for lower memory usage, as the Arc A750 only has 8 GB of memory. (Follow-up in this issue: intel-analytics/text-generation-webui#25)
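
For readers following along, here is a minimal sketch of loading Qwen-7B with 4-bit weights via bigdl-llm on an Intel GPU (not the webui's actual code; the checkpoint path is a placeholder and the .half() step mirrors the fp16 change described above):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModelForCausalLM

# Load with int4-quantized linear weights; trust_remote_code is needed
# because Qwen ships its own modeling_qwen.py.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen-7B",        # placeholder: your local checkpoint
    load_in_4bit=True,
    trust_remote_code=True,
)
model = model.half().to("xpu")  # fp16 for non-quantized parts, move to the Arc GPU
```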

ChenVkl closed this as completed Apr 2, 2024
ChenVkl (Author) commented Apr 2, 2024

> What's the initial GPU memory occupied by the system before you run the model inference? What input length do you chat with the model?

Regarding your questions: the initial GPU memory occupied by the system is approximately 1.6 GB.
When I chat with the model, even a simple question results in an error saying the memory is exceeded.
I have now switched to the Qwen-7B-int4 model to see whether it can run on BigDL; for that specific issue, please refer to #10616.

hkvision (Contributor) commented Apr 3, 2024

Some suggestions from our side for you to possibly run Qwen-7B on the Arc A750:

  • Use the latest ipex-llm (we have renamed bigdl-llm to ipex-llm) and export IPEX_LLM_LOW_MEM=1 before you launch the WebUI.
  • Could you close some applications that occupy GPU memory? If 1.6 GB is already taken before running our workload, the remaining 6.4 GB may be challenging for Qwen-7B, I suppose. (A quick way to check the headroom is sketched below.)
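
A quick check of device memory before loading, as a sketch: torch.xpu comes from intel_extension_for_pytorch, and the exact property names are an assumption that may vary across versions.

```python
import torch
import intel_extension_for_pytorch as ipex  # enables torch.xpu

props = torch.xpu.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB total")
# Memory already allocated by this PyTorch process (not the whole system;
# use xpu-smi for a system-wide view):
print(f"allocated: {torch.xpu.memory_allocated(0) / 1024**2:.1f} MiB")
```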

ChenVkl (Author) commented Apr 7, 2024

> Use the latest ipex-llm (we have renamed from bigdl-llm to ipex-llm) and export IPEX_LLM_LOW_MEM=1 before you launch the WebUI.

OK, I see the latest link; I'll give it a try. Thanks a lot.

ChenVkl (Author) commented Apr 7, 2024

> If before running our workload 1.6G is already occupied, then the remaining 6.4G may be challenging to run Qwen, I suppose.

I'd like to ask whether you have run Qwen on an A750 before, and how much GPU memory it takes. Thanks.

hkvision (Contributor) commented Apr 7, 2024

> I'd like to ask whether you have run Qwen on an A750 before, and how much GPU memory it takes.

I haven't tried text-generation-webui, but for simple generation Qwen-7B can run on the Arc A750; for a 256-token input, the memory I observe is 5290.11 MB. This value is from xpu-smi and may not be the actual peak memory; I suppose the peak would be close to, or larger than, 6 GB.

Some suggestions you may try on your side:

ChenVkl (Author) commented Apr 9, 2024

> qwen-7b can run on Arc750 ... for 256 input the memory I observe is 5290.11M.

Thank you very much; I'll give it a try.
In addition, I'd like to ask: you said you can run Qwen-7B on the A750. Which link do you use? Could you please send it to me if it's convenient?

hkvision (Contributor) commented Apr 9, 2024

I'm using https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one with the api transformer_int4_fp16_gpu in config.yaml, with export IPEX_LLM_LOW_MEM=1 and bash run-arc.sh.
Is that the link you want?
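
For reference, the relevant part of that benchmark's config.yaml looks roughly like the following (a sketch based on my reading of the repo's sample config; field names and values may differ between versions, and the paths are placeholders):

```yaml
repo_id:
  - 'Qwen/Qwen-7B-Chat'          # placeholder; point at your checkpoint
local_model_hub: 'path/to/models'
low_bit: 'sym_int4'              # int4 weights
in_out_pairs:
  - '256-64'                     # input/output token lengths to test
test_api:
  - 'transformer_int4_fp16_gpu'  # int4 weights + fp16 activations on XPU
```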

Daroude commented Apr 10, 2024

Can you clarify where export IPEX_LLM_LOW_MEM=1 needs to be put? When I type it in conda before starting server.py, I get:

'export' is not recognized as an internal or external command, operable program or batch file.

My Arc A750 outputs the following error after a few interactions with the chatbot, which I assume is memory related:

RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)

hkvision (Contributor) commented:

If you are running on Windows, please change it to set IPEX_LLM_LOW_MEM=1.
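
If editing the launch command is awkward, an alternative is to set the variable near the top of the startup script (a sketch; it assumes the variable is read by the library at import time, which I haven't verified against the webui's startup order):

```python
import os

# Set before bigdl-llm / ipex-llm is imported so the library can see it.
# IPEX_LLM_LOW_MEM is the low-memory switch mentioned earlier in this thread.
os.environ["IPEX_LLM_LOW_MEM"] = "1"
```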

Daroude commented Apr 12, 2024

Thanks. That doesn't seem to have been the issue, though. As soon as about 2,000+ context is reached, I get:

RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) Output generated in 1.29 seconds (0.00 tokens/s, 0 tokens, context 2308, seed 1309198421)

should I open a new ticket?

hkvision (Contributor) commented:

> As soon as about 2,000+ context is reached I get RuntimeError: Native API failed ... should I open a new ticket?

Sure, you can open a new ticket and give more details about your settings (system, version, how you run, etc.). We will try to reproduce this. Thanks!
