
onnxruntime.InferenceSession.run sometimes gets stuck, sometimes not #21418

Open · quarrying opened this issue Jul 19, 2024 · 5 comments

Labels: ep:CUDA (issues related to the CUDA execution provider), stale (issues that have not been addressed in a while; categorized by a bot)

Comments

quarrying commented Jul 19, 2024

Describe the issue

I have built onnxruntime-gpu 1.4.0 following https://github.com/microsoft/onnxruntime/blob/v1.4.0/dockerfiles/Dockerfile.cuda. The outputs of import onnxruntime and onnxruntime.get_device() are both normal, and onnxruntime.InferenceSession() seems fine too. However, sess.run() sometimes runs smoothly and sometimes gets stuck (GPU memory is not full: only ~2 GB of 11 GB is used). I have tried various SessionOptions, but the issue persists. Note: the code runs inside a Docker container.

To reproduce

import time
from datetime import datetime

import numpy as np
import onnxruntime

if __name__ == '__main__':

    sess_options = onnxruntime.SessionOptions()
    sess_options.log_severity_level = 1
    # sess_options.intra_op_num_threads = 1
    # sess_options.inter_op_num_threads = 1
    # sess_options.enable_profiling = True
    # sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
    # sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    # sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
    # sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] 
    model_path = 'model.onnx'
    sess = onnxruntime.InferenceSession(model_path, sess_options, providers)

    input_names = [item.name for item in sess.get_inputs()]
    output_names = [item.name for item in sess.get_outputs()]
    
    while True:
        image = np.random.uniform(-1, 1, size=(1, 3, 1280, 1280)).astype(np.float32)
        start_time = time.time()
        print(f'{datetime.now()} starts')
        sess.run(output_names, {input_names[0]: image})
        print(f'{datetime.now()} elapsed {time.time() - start_time}')

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

v1.4.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 10.1, cuDNN 7.6.5, driver 430.50, NVIDIA RTX 2080 Ti

github-actions bot added the ep:CUDA label on Jul 19, 2024
tianleiwu (Contributor) commented Jul 19, 2024

What do you mean by "gets stuck"? (Or could you share the output of the above script that looks abnormal?)
The first inference run is likely to take longer because of cuDNN convolution algorithm tuning and resource allocation; the remaining runs should be faster.
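For reference, newer onnxruntime-gpu releases expose CUDA execution provider options that control this tuning step. The following is a minimal sketch assuming onnxruntime-gpu 1.8 or later (these options do not exist in 1.4), using the heuristic convolution algorithm search plus an explicit warm-up run:

import numpy as np
import onnxruntime

# CUDA EP option (onnxruntime-gpu >= ~1.8): use cuDNN's heuristic convolution
# algorithm search instead of the default exhaustive search, which shortens
# the one-time tuning cost of the first run.
cuda_options = {'cudnn_conv_algo_search': 'HEURISTIC'}

sess = onnxruntime.InferenceSession(
    'model.onnx',
    providers=[('CUDAExecutionProvider', cuda_options), 'CPUExecutionProvider'],
)

# Warm-up run so later timings exclude one-time allocation/tuning costs.
dummy = np.zeros((1, 3, 1280, 1280), dtype=np.float32)
sess.run(None, {sess.get_inputs()[0].name: dummy})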

quarrying (Author) commented:

The output is as follows:

2024-07-19 17:58:51.796326681 [I:onnxruntime:, inference_session.cc:174 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2024-07-19 17:58:53.146773527 [I:onnxruntime:, inference_session.cc:840 Initialize] Initializing session.
2024-07-19 17:58:53.151257663 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2024-07-19 17:58:53.154013361 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2024-07-19 17:58:53.162242333 [V:onnxruntime:, inference_session.cc:679 TransformGraph] Node placements
2024-07-19 17:58:53.162261331 [V:onnxruntime:, inference_session.cc:681 TransformGraph] All nodes have been placed on [CUDAExecutionProvider].
2024-07-19 17:58:53.166021272 [V:onnxruntime:, session_state.cc:71 CreateGraphInfo] SaveMLValueNameIndexMapping
2024-07-19 17:58:53.166334752 [V:onnxruntime:, session_state.cc:116 CreateGraphInfo] Done saving OrtValue mappings.
2024-07-19 17:58:55.055747308 [I:onnxruntime:, finalize_session_state.cc:173 SaveInitializedTensors] Saving initialized tensors.
2024-07-19 17:58:55.269780199 [I:onnxruntime:, finalize_session_state.cc:225 SaveInitializedTensors] Done saving initialized tensors
2024-07-19 17:58:55.289089454 [I:onnxruntime:, inference_session.cc:954 Initialize] Session successfully initialized.
2024-07-19 17:58:55.344849 starts
2024-07-19 17:58:55.350650700 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:55.783378 elapsed 0.43862199783325195
2024-07-19 17:58:55.860368 starts
2024-07-19 17:58:55.869268259 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:55.922098 elapsed 0.06176280975341797
2024-07-19 17:58:55.984943 starts
2024-07-19 17:58:55.989988341 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.042832 elapsed 0.05792117118835449
2024-07-19 17:58:56.097794 starts
2024-07-19 17:58:56.102932070 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.154957 elapsed 0.057192087173461914
2024-07-19 17:58:56.209661 starts
2024-07-19 17:58:56.214824273 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.267699 elapsed 0.058066606521606445
2024-07-19 17:58:56.322444 starts
2024-07-19 17:58:56.327602975 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.379816 elapsed 0.05740189552307129
2024-07-19 17:58:56.434758 starts
2024-07-19 17:58:56.439920970 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.492616 elapsed 0.05788826942443848
2024-07-19 17:58:56.548453 starts
2024-07-19 17:58:56.553534271 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.605954 elapsed 0.05753040313720703
2024-07-19 17:58:56.662649 starts
2024-07-19 17:58:56.667794682 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.719645 elapsed 0.05702567100524902
2024-07-19 17:58:56.775130 starts
2024-07-19 17:58:56.780257027 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.833278 elapsed 0.05822920799255371
2024-07-19 17:58:56.891268 starts
2024-07-19 17:58:56.896471060 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution
2024-07-19 17:58:56.948649 elapsed 0.057410240173339844
2024-07-19 17:58:57.003492 starts
2024-07-19 17:58:57.008576873 [I:onnxruntime:, sequential_executor.cc:150 Execute] Begin execution

The program may hang on any inference iteration: the first, the second, or any later one.
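One way to see where a run is hanging is to arm a watchdog around each sess.run call. The sketch below uses only the Python standard library (the 60-second timeout and model path are placeholder assumptions) and prints the Python-level stack traces of all threads if a run does not return in time; native (CUDA/cuDNN) stacks would still need gdb or py-spy on the stuck process.

import faulthandler
import time

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession(
    'model.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
input_name = sess.get_inputs()[0].name

while True:
    image = np.random.uniform(-1, 1, size=(1, 3, 1280, 1280)).astype(np.float32)
    # If this run has not returned within 60 seconds, dump the Python stack
    # traces of all threads to stderr so the blocking call becomes visible.
    faulthandler.dump_traceback_later(60, exit=False)
    start = time.time()
    sess.run(None, {input_name: image})
    faulthandler.cancel_dump_traceback_later()
    print(f'elapsed {time.time() - start:.3f}s')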

tianleiwu (Contributor) commented:
1.4 is too old.
Could you upgrade to onnxruntime-gpu 1.18.1 with CUDA 11.8 and cuDNN 8.9?
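For reference, a quick way to confirm that an upgraded build can actually see the GPU is to check the available providers (a sketch; the pip line assumes a CUDA 11.x-compatible environment):

# pip install onnxruntime-gpu==1.18.1
import onnxruntime

print(onnxruntime.__version__)
# Should include 'CUDAExecutionProvider' if the CUDA/cuDNN libraries are found.
print(onnxruntime.get_available_providers())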

github-actions bot commented:

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label on Aug 20, 2024
andreaslenz3 commented:

I have a similar issue with CPU execution: execution times increase roughly 10x after about one hour.
