
Memory access fault #5890

Closed
1 task done
Orion-zhen opened this issue Apr 20, 2024 · 2 comments

Labels
bug Something isn't working

Comments


Orion-zhen commented Apr 20, 2024

Describe the bug

When I tried to chat with my LLM through the OpenAI API, the server crashed with a core dump: Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.

Update: the chat also crashes in the text-generation-webui interface itself.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

  1. Start the web UI on a custom API port: python server.py --api --api-port 11451, and load a model.
  2. Go to another GPT web UI such as Sider or NextChat, set the API server address, and chat (see the example request after this list).
  3. The error occurs.
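
For reference, a minimal request to the OpenAI-compatible endpoint that reproduces this setup might look like the following (the /v1/chat/completions path is the openai extension's default; the message body is only an illustration):

# Hypothetical request against the custom port from step 1
curl http://127.0.0.1:11451/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'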

Screenshot

The API server crashed with the terminal message below:

[screenshot: terminal output]

The web UI crashed:

[screenshot: web UI error]

with similar terminal output:

[screenshot: terminal output]

Logs

The first crash, using the API server:

❯ python server.py --api --api-port 11451
12:55:07-606488 INFO     Starting Text generation web UI                                                                                                                                                                           
12:55:07-608075 INFO     Loading the extension "openai"                                                                                                                                                                            
12:55:07-649839 INFO     OpenAI-compatible API URL:                                                                                                                                                                                
                                                                                                                                                                                                                                   
                         http://127.0.0.1:11451                                                                                                                                                                                    
                                                                                                                                                                                                                                   

Running on local URL:  http://127.0.0.1:7860

12:55:16-379990 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
12:55:16-707760 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
12:55:49-579044 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
12:55:49-580220 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
12:55:49-580571 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
12:55:49-580940 INFO     Loaded the model in 33.20 seconds.                                                                                                                                                                        
Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.
[1]    616392 IOT instruction (core dumped)  python server.py --api --api-port 11451

The second crash, using the web UI itself:

❯ python server.py --api --listen --model Qwen-32B
13:14:08-576771 INFO     Starting Text generation web UI                                                                                                                                                                           
13:14:08-578185 WARNING                                                                                                                                                                                                            
                         You are potentially exposing the web UI to the entire internet without any access password.                                                                                                               
                         You can create one with the "--gradio-auth" flag like this:                                                                                                                                               
                                                                                                                                                                                                                                   
                         --gradio-auth username:password                                                                                                                                                                           
                                                                                                                                                                                                                                   
                         Make sure to replace username:password with your own.                                                                                                                                                     
13:14:08-581204 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
13:14:08-932613 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
13:14:33-197401 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
13:14:33-198636 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
13:14:33-198966 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
13:14:33-199306 INFO     Loaded the model in 24.62 seconds.                                                                                                                                                                        
13:14:33-199649 INFO     Loading the extension "openai"                                                                                                                                                                            
13:14:33-265414 INFO     OpenAI-compatible API URL:                                                                                                                                                                                
                                                                                                                                                                                                                                   
                         http://0.0.0.0:5000                                                                                                                                                                                       
                                                                                                                                                                                                                                   

Running on local URL:  http://0.0.0.0:7860

Memory access fault by GPU node-1 (Agent handle: 0x6452407b78b0) on address 0x75980082f000. Reason: Page not present or supervisor privilege.
[1]    630061 IOT instruction (core dumped)  python server.py --api --listen --model Qwen-32B

System Info

@Beinsezii

This is likely an old ROCm issue. Even if your system ROCm is 6.0.3, PyTorch bundles its own compiler, which gets used to build exllama's C extensions, and the current webui still specifies prebuilt torch wheels for ROCm 5.6.
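
A quick way to confirm which ROCm (HIP) runtime your installed torch wheel was built against, since torch.version.hip is set on ROCm builds of PyTorch:

# Should print something like "2.1.2+rocm5.6 5.6.x" for a ROCm 5.6 wheel
python -c "import torch; print(torch.__version__, torch.version.hip)"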

I found the XTX is way more stable if you use the PyTorch 2.3.0 release candidate with ROCm 6.0, then recompile exllamav2 to match.

# Currently 2.3.0 RC-final
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/rocm6.0
# v0.0.19 commit hash
pip install -U git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f

You can also save the compiled exllama to a wheel with pip wheel instead of pip install, so you can reuse it.
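
A minimal sketch of that, assuming the same commit as above and an arbitrary output directory:

# Build a reusable wheel instead of installing (written to ./wheels)
pip wheel git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f -w ./wheels
# Reinstall later from the saved wheel (exact filename varies by version and Python tag)
pip install ./wheels/exllamav2-*.whl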

If you're using the native Arch Linux Python/ROCm, my own wheel might work for you.
exllamav2-0.0.19-cp311-cp311-linux_x86_64.zip

ROCm 5.6, and especially 5.7, are extremely unstable on my XTX: frequent unrecoverable page faults, including the exact error you see. If the ROCm 6 exllamav2 method works, I'd also recompile llama-cpp and the other compiled extensions to match for a rock-solid experience.
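
For llama-cpp that means rebuilding llama-cpp-python from source against the ROCm toolchain; a rough sketch, assuming the hipBLAS build flag llama.cpp used at the time:

# Rebuild llama-cpp-python against ROCm via hipBLAS (flag name may have changed in later releases)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python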

Orion-zhen (Author) commented Apr 20, 2024

Many thanks!

Upgrading PyTorch and recompiling exllamav2 solved the problem.

(However, it will take time to see whether it stays stable.)
