New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[runtime][hip] Add tracing in graph execution mode #16894
Conversation
Here is the change that collects events only on device destruction https://github.com/sogartar/SHARK-Runtime/tree/hip-graph-tracing-collect-only-on-device-destruction |
I saw internally that you're testing the latest branch to make sure the concurrency problem isn't with the HIP API. Please update your findings and we can go forward with this patch. |
@nithinsubbiah here is an issue #16903. |
Thanks Boian, I have seen some failures on the cts earlier. I'm looking into them. |
@@ -54,6 +54,9 @@ IREE_HAL_HIP_REQUIRED_PFN_DECL(hipGraphExecDestroy, hipGraphExec_t) | |||
IREE_HAL_HIP_REQUIRED_PFN_DECL(hipGraphInstantiate, hipGraphExec_t *, | |||
hipGraph_t, hipGraphNode_t *, char *, size_t) | |||
IREE_HAL_HIP_REQUIRED_PFN_DECL(hipGraphLaunch, hipGraphExec_t, hipStream_t) | |||
IREE_HAL_HIP_REQUIRED_PFN_DECL(hipGraphAddEventRecordNode, hipGraphNode_t *, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort these APIs alphabetically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -77,12 +76,22 @@ void iree_hal_hip_tracing_zone_begin_external_impl( | |||
const char* file_name, size_t file_name_length, uint32_t line, | |||
const char* function_name, size_t function_name_length, const char* name, | |||
size_t name_length); | |||
void iree_hal_hip_graph_tracing_zone_begin_external_impl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can up update the existing tracing APIs to include "stream" inside to better differentiate?
@@ -249,7 +275,10 @@ static uint16_t iree_hal_hip_tracing_context_insert_query( | |||
void iree_hal_hip_tracing_zone_begin_impl( | |||
iree_hal_hip_tracing_context_t* context, hipStream_t stream, | |||
const iree_tracing_location_t* src_loc) { | |||
if (!context) return; | |||
if (!context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use IREE_ASSERT_ARGUMENT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -260,14 +289,31 @@ void iree_hal_hip_tracing_zone_begin_external_impl( | |||
const char* file_name, size_t file_name_length, uint32_t line, | |||
const char* function_name, size_t function_name_length, const char* name, | |||
size_t name_length) { | |||
if (!context) return; | |||
if (!context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
void iree_hal_hip_tracing_zone_end_impl(iree_hal_hip_tracing_context_t* context, | ||
hipStream_t stream); | ||
void iree_hal_hip_graph_tracing_zone_end_impl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -11,11 +11,13 @@ | |||
#include <stdint.h> | |||
|
|||
#include "iree/base/api.h" | |||
#include "iree/base/attributes.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is included as part of base/api.h
so no need to pull in specifically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -593,6 +593,10 @@ static iree_status_t iree_hal_hip_pending_queue_actions_issue_execution( | |||
} | |||
|
|||
// Then launch all command buffers to the dispatch stream. | |||
IREE_TRACE_ZONE_BEGIN(zone_launch_command_buffers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to prefix the zone name with zone_
. Maybe just call it "dispatch_command_buffers".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
iree_hal_hip_graph_command_buffer_cast(base_command_buffer); | ||
IREE_TRACE_ZONE_BEGIN(z0); | ||
|
||
IREE_RETURN_AND_END_ZONE_IF_ERROR( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just do a normal iree_status_t status =
and return status
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
&command_buffer->hip_graph_nodes[command_buffer->graph_node_count++]; | ||
size_t dependency_count = command_buffer->hip_barrier_node ? 1 : 0; | ||
IREE_ASSERT(dependency_count > 0 && | ||
"Ending a zone should at least depend on the beginning."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use IREE_ASSERT_GT
and for the error message avoid captialize first char and avoid the end .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one final nit. Feel free to land after fixing it.
IREE_ASSERT(dependency_count > 0 && | ||
"Ending a zone should at least depend on the beginning."); | ||
IREE_ASSERT_GT(dependency_count, 0, | ||
"Ending a zone should at least depend on the beginning."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: no captialization and no ending period for assert messages: "end .. beginning".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@antiagainst would you like to also mirror this in the CUDA driver in this PR? |
There are some outstanding concurrency issues where some events don't get recorded during collection. If we collect only when the device gets destroyed with a 1s delay, the concurrency issues seem to not manifest.
8beb8c7
to
e99cbd2
Compare
e99cbd2
to
c419323
Compare
I rebased and mirrored the change into the CUDA driver. |
This is great work, Boian. Thank you. |
Nice, thanks! LGTM. Please feel free to land. Make sure you update the commit message title and body accordingly given now this also covers cuda. |
There are some outstanding concurrency issues where some events don't get recorded during collection.
If we collect only when the device gets destroyed with a 1s delay, the concurrency issues seem to not manifest.