Model Server hangs when inference on Intel GPU #2336

Open
geekboood opened this issue Feb 23, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@geekboood

Describe the bug
Inference hangs when running on an Intel Arc A770 GPU.

Logs
server logs

[2024-02-23 15:54:44.147][2184239][serving][error][modelinstance.cpp:1193] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
clFlush

[2024-02-23 15:55:08.725][2184301][serving][error][modelinstance.cpp:1193] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
clFlush

[2024-02-23 15:55:33.291][2184634][serving][error][modelinstance.cpp:1193] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
clFlush

kernel logs

[1331510.701350] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in ovms [809080]
[1331510.701372] i915 0000:03:00.0: [drm] ovms[809080] context reset due to GPU hang
[1331516.943270] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f0e!
[1331517.368428] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f16!
[1331517.543874] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f18!
[1331531.202434] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f1a!
[1331531.204035] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f1c!
[1331531.204263] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f1e!
[1331531.204844] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f20!
[1331531.210043] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f22!
[1331531.210182] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f24!
[1331531.212604] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f26!
[1331531.212840] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f28!
[1331531.214194] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f2a!
[1331531.214293] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f2e!
[1331531.214379] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f2c!
[1331531.218911] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f30!
[1331531.224320] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f32!
[1331531.224845] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f36!
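
For triage, the kernel messages above can be condensed into a per-process summary. The following is a minimal sketch and not part of the original report: the function name `summarize_i915_log` and both regular expressions are assumptions inferred only from the line format quoted here, not an official i915 log grammar.

```python
import re
from collections import Counter

# Patterns inferred from the kernel log lines quoted in this issue (assumption).
HANG_RE = re.compile(r"GPU HANG: ecode (\S+), in (\S+) \[(\d+)\]")
FENCE_RE = re.compile(r"Fence expiration time out \S+?:(\w+)\[(\d+)\]:")


def summarize_i915_log(lines):
    """Return (list of GPU hang records, Counter of fence timeouts per (process, pid))."""
    hangs = []
    fence_timeouts = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs.append(
                {
                    "ecode": match.group(1),
                    "process": match.group(2),
                    "pid": int(match.group(3)),
                }
            )
            continue
        match = FENCE_RE.search(line)
        if match:
            fence_timeouts[(match.group(1), int(match.group(2)))] += 1
    return hangs, fence_timeouts


# Three lines copied from the kernel log above.
SAMPLE = [
    "[1331510.701350] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in ovms [809080]",
    "[1331516.943270] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f0e!",
    "[1331517.368428] Fence expiration time out i915-0000:03:00.0:ovms[809080]:24f16!",
]

if __name__ == "__main__":
    hangs, fence_timeouts = summarize_i915_log(SAMPLE)
    print(hangs)
    print(dict(fence_timeouts))
```

In practice the input would be the lines of `dmesg` output rather than the hard-coded sample.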

Configuration
OpenVINO Model Server 2023.3.4e91aac76
OpenVINO backend 2023.3.0.13775.ceeafaf64f3
Bazel build flags: --strip=always --define MEDIAPIPE_DISABLE=0 --cxxopt=-DMEDIAPIPE_DISABLE=0 --define PYTHON_DISABLE=1 --cxxopt=-DPYTHON_DISABLE=1

geekboood added the bug label Feb 23, 2024
@mzegla
Collaborator

mzegla commented Feb 26, 2024

Could you check your model with the OpenVINO benchmark app (https://docs.openvino.ai/2023.3/openvino_sample_benchmark_tool.html)?
Run it with the -d GPU option so that it targets the GPU.
Please also share the command you use to start OVMS.
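
A minimal sketch of that check, assuming `benchmark_app` is on the PATH: the helper names `build_benchmark_cmd` and `run_benchmark` and the model path `model.xml` are hypothetical placeholders. Only the `-m`, `-d`, and `-t` options of the benchmark tool are used.

```python
import shlex
import subprocess


def build_benchmark_cmd(model_path, device="GPU", seconds=None):
    """Build the benchmark_app command line for a given model and device."""
    cmd = ["benchmark_app", "-m", model_path, "-d", device]
    if seconds is not None:
        # -t limits the run to a fixed duration in seconds.
        cmd += ["-t", str(seconds)]
    return cmd


def run_benchmark(model_path, device="GPU", seconds=None):
    """Run benchmark_app and return the completed process (requires OpenVINO tools)."""
    return subprocess.run(
        build_benchmark_cmd(model_path, device, seconds),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # "model.xml" is a placeholder; substitute the path of the model served by OVMS.
    print(shlex.join(build_benchmark_cmd("model.xml", seconds=60)))
```

If the benchmark also hangs on the GPU, the problem is likely below the model server (OpenVINO GPU plugin, compute runtime, or driver) rather than in OVMS itself.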

@p-durandin

@geekboood please provide the Linux kernel and GPU driver versions.
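
One way to gather these details is sketched below. It is an illustration under stated assumptions: the function name `collect_versions` is hypothetical, and the package names `intel-opencl-icd` and `intel-level-zero-gpu` are assumed to be the user-space GPU driver packages on a Debian or Ubuntu system; adjust them to whatever is actually installed.

```python
import platform
import shutil
import subprocess


def collect_versions(packages=("intel-opencl-icd", "intel-level-zero-gpu")):
    """Return the kernel release and the versions of the given Debian packages."""
    info = {"kernel": platform.release()}
    # dpkg-query exists only on Debian-based systems; skip the lookup elsewhere.
    if shutil.which("dpkg-query"):
        for pkg in packages:
            result = subprocess.run(
                ["dpkg-query", "-W", "-f=${Version}", pkg],
                capture_output=True,
                text=True,
                check=False,
            )
            info[pkg] = result.stdout.strip() or "not installed"
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Inside a container the reported kernel release is that of the host, while the package versions are those of the container, so it helps to run this in both places.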

@geekboood
Author

geekboood commented Mar 22, 2024

My environment is fairly complicated.
The host server runs Debian with the i915 kernel driver. I pass the GPU through to an LXC container that has the Ubuntu 22.04 Intel GPU dependencies installed.
I run multiple models on a single GPU (I tweaked the compute runtime parameter to use Multi-CCS mode, which should help), and each model is part of an inference pipeline. When the pipeline is under high load, the model server sometimes hangs.
