TRT EP race condition fix during ep compile time #13356

Merged
chilo-ms merged 8 commits into main from chi/trt_ep_concurrent_fix
Oct 19, 2022
chilo-ms commented Oct 18, 2022

Description

The TensorRT (TRT) EP can hit a race condition when multiple threads serialize or deserialize an engine during EP compile time.
For example, suppose one thread is serializing the engine and has not yet finished writing all the data to the file. If another thread sees at that moment that the engine file exists and begins to deserialize it, that thread ends up deserializing a corrupt file.
The fix is to put a lock around engine serialization/deserialization, engine build, and context build.
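
The locking idea can be illustrated with a minimal standalone sketch. It is not the code from this PR and makes no TensorRT calls: `BuildEngineBlob`, `EngineCacheMutex`, `LoadOrBuildEngine`, and `LoadFromThreads` are hypothetical names, and the "engine" is just a byte string. The point it shows is that the existence check, the read, the build, and the write all happen under one mutex, so no thread can read a file that another thread is still writing.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an expensive engine build (not a TensorRT API).
std::string BuildEngineBlob() { return std::string(1 << 20, 'e'); }

// Hypothetical process-wide mutex guarding the engine cache file.
std::mutex& EngineCacheMutex() {
  static std::mutex m;
  return m;
}

// Returns the cached blob if the file exists; otherwise builds it and writes
// it out. The whole check-then-act sequence runs under a single lock.
std::string LoadOrBuildEngine(const std::string& cache_path) {
  std::lock_guard<std::mutex> lock(EngineCacheMutex());

  std::ifstream in(cache_path, std::ios::binary);
  if (in) {
    // "Deserialize": the file is complete because writers hold the same lock.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
  }

  // "Build" and "serialize" while still holding the lock.
  std::string blob = BuildEngineBlob();
  std::ofstream out(cache_path, std::ios::binary);
  out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
  out.close();  // flush to disk before the lock is released
  return blob;
}

// Calls LoadOrBuildEngine from several threads and records each blob size.
std::vector<std::size_t> LoadFromThreads(const std::string& cache_path,
                                         int num_threads) {
  assert(num_threads > 0);
  std::vector<std::size_t> sizes(static_cast<std::size_t>(num_threads), 0);
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&sizes, &cache_path, i] {
      sizes[static_cast<std::size_t>(i)] = LoadOrBuildEngine(cache_path).size();
    });
  }
  for (auto& t : threads) t.join();
  return sizes;
}
```

Calling `LoadFromThreads("engine.cache", 8)` should give every thread a blob of the full size. Without the `lock_guard`, a reader could open the file between the writer's `ofstream` construction and `close()` and see a truncated blob, which is the failure mode described above.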

Motivation and Context

The TensorRT EP Windows CI sometimes fails because the TensorrtExecutionProviderTest.MultiThreadsTestWithOneSessionSingleThreadInference unit test fails (this PR renames it to SessionCreationWithMultiThreadsAndInferenceWithMultiThreads). The failure is very likely due to this race condition.
The TensorRT CI failure has also been reported here.

chilo-ms changed the title from "Chi/trt ep concurrent fix" to "TRT EP race condition fix during ep compile time" on Oct 18, 2022
}
trt_context = tensorrt_ptr::unique_pointer<nvinfer1::IExecutionContext>(trt_engine->createExecutionContextWithoutDeviceMemory());
} else {
size_t mem_size = trt_engine->getDeviceMemorySize();
Member

The indentation seems off here?

Contributor Author

Fixed.

chilo-ms merged commit 86c5c07 into main on Oct 19, 2022
chilo-ms deleted the chi/trt_ep_concurrent_fix branch on October 19, 2022 18:19