
Memory access fault #5890

Closed
1 task done
Orion-zhen opened this issue Apr 20, 2024 · 2 comments

Labels
bug Something isn't working

Comments


Orion-zhen commented Apr 20, 2024

Describe the bug

When I tried to chat with my LLM through the OpenAI API, the server crashed with a core dump: Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.

Update: the chat also crashes in the text-generation-webui interface itself.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

  1. Start the web UI on a custom API port: python server.py --api --api-port 11451, and load a model.
  2. Go to another GPT web UI such as Sider or NextChat, set the API server address, and chat (see the example request after this list).
  3. The error occurs.
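
For reference, a minimal request to the OpenAI-compatible endpoint that reproduces this setup might look like the following (the /v1/chat/completions path is the openai extension's default; the message body is only an illustration):

# Hypothetical request against the custom port from step 1
curl http://127.0.0.1:11451/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'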

Screenshot

The API server crashed with the terminal message below:

[screenshot: terminal output]

The web UI crashed:

[screenshot: web UI error]

with similar terminal output:

[screenshot: terminal output]

Logs

The first crash, using the API server:

❯ python server.py --api --api-port 11451
12:55:07-606488 INFO     Starting Text generation web UI                                                                                                                                                                           
12:55:07-608075 INFO     Loading the extension "openai"                                                                                                                                                                            
12:55:07-649839 INFO     OpenAI-compatible API URL:                                                                                                                                                                                
                                                                                                                                                                                                                                   
                         http://127.0.0.1:11451                                                                                                                                                                                    
                                                                                                                                                                                                                                   

Running on local URL:  http://127.0.0.1:7860

12:55:16-379990 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
12:55:16-707760 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
12:55:49-579044 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
12:55:49-580220 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
12:55:49-580571 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
12:55:49-580940 INFO     Loaded the model in 33.20 seconds.                                                                                                                                                                        
Memory access fault by GPU node-1 (Agent handle: 0x64e3ca9e2010) on address 0x75e681099000. Reason: Page not present or supervisor privilege.
[1]    616392 IOT instruction (core dumped)  python server.py --api --api-port 11451

The second crash, using the web UI itself:

❯ python server.py --api --listen --model Qwen-32B
13:14:08-576771 INFO     Starting Text generation web UI                                                                                                                                                                           
13:14:08-578185 WARNING                                                                                                                                                                                                            
                         You are potentially exposing the web UI to the entire internet without any access password.                                                                                                               
                         You can create one with the "--gradio-auth" flag like this:                                                                                                                                               
                                                                                                                                                                                                                                   
                         --gradio-auth username:password                                                                                                                                                                           
                                                                                                                                                                                                                                   
                         Make sure to replace username:password with your own.                                                                                                                                                     
13:14:08-581204 INFO     Loading "Qwen-32B"                                                                                                                                                                                        
13:14:08-932613 WARNING  You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage to be a lot higher than it could be.                                                                                    
                         Try installing flash-attention following the instructions here: https://github.com/Dao-AILab/flash-attention#installation-and-features                                                                    
13:14:33-197401 INFO     LOADER: "ExLlamav2"                                                                                                                                                                                       
13:14:33-198636 INFO     TRUNCATION LENGTH: 4096                                                                                                                                                                                   
13:14:33-198966 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                                                                                             
13:14:33-199306 INFO     Loaded the model in 24.62 seconds.                                                                                                                                                                        
13:14:33-199649 INFO     Loading the extension "openai"                                                                                                                                                                            
13:14:33-265414 INFO     OpenAI-compatible API URL:                                                                                                                                                                                
                                                                                                                                                                                                                                   
                         http://0.0.0.0:5000                                                                                                                                                                                       
                                                                                                                                                                                                                                   

Running on local URL:  http://0.0.0.0:7860

Memory access fault by GPU node-1 (Agent handle: 0x6452407b78b0) on address 0x75980082f000. Reason: Page not present or supervisor privilege.
[1]    630061 IOT instruction (core dumped)  python server.py --api --listen --model Qwen-32B

System Info

@Beinsezii

This is likely an old ROCm issue. Even if your system ROCm is 6.0.3, PyTorch bundles its own compiler, which gets used to build exllama's C extensions, and the current webui still specifies prebuilt torch wheels for ROCm 5.6.
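
A quick way to confirm which ROCm (HIP) runtime your installed torch wheel was built against, since torch.version.hip is set on ROCm builds of PyTorch:

# Should print something like "2.1.2+rocm5.6 5.6.x" for a ROCm 5.6 wheel
python -c "import torch; print(torch.__version__, torch.version.hip)"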

I found the XTX is way more stable if you use the PyTorch 2.3.0 release candidate with ROCm 6.0, then recompile exllamav2 to match.

# Currently 2.3.0 RC-final
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/rocm6.0
# v0.0.19 commit hash
pip install -U git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f

You can also save the compiled exllama to a wheel with pip wheel instead of pip install, so you can reuse it.
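
A minimal sketch of that, assuming the same commit as above and an arbitrary output directory:

# Build a reusable wheel instead of installing (written to ./wheels)
pip wheel git+https://github.com/turboderp/exllamav2.git@ad8691c6d1aab2d1ddbdcbe9341c7c7a96e59f2f -w ./wheels
# Reinstall later from the saved wheel (exact filename varies by version and Python tag)
pip install ./wheels/exllamav2-*.whl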

If you're using the native Arch Linux Python/ROCm, my own wheel might work for you.
exllamav2-0.0.19-cp311-cp311-linux_x86_64.zip

ROCm 5.6, and especially 5.7, are extremely unstable on my XTX: frequent unrecoverable page faults, including the exact error you see. If the ROCm 6 exllamav2 method works, I'd also recompile llama-cpp and the other compiled extensions to match for a rock-solid experience.
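
For llama-cpp that means rebuilding llama-cpp-python from source against the ROCm toolchain; a rough sketch, assuming the hipBLAS build flag llama.cpp used at the time:

# Rebuild llama-cpp-python against ROCm via hipBLAS (flag name may have changed in later releases)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python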

Orion-zhen (Author) commented Apr 20, 2024

Many thanks!

Upgrading PyTorch and recompiling exllamav2 solved the problem.

(However, it will take time to see whether it stays stable.)
