[Performance Analysis] Adding intra-kernel timing runs #829

Draft
SergioMartin86 wants to merge 12 commits into hw-native-sys:main from huawei-csl:intraKernelTiming

@SergioMartin86 commented May 20, 2026

We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.

We add:

  • Warmup runs: untimed runs that absorb cache initialization, dlopen, and kernel-launch noise so it does not skew the measurement.
  • Timed runs: each run is timed, and the mean and standard deviation are reported.

Averaging over many timed runs dampens the OS/device noise that causes random variation in running time. This noise is significant for these extremely low-latency kernels, so measuring scheduling/orchestration performance precisely requires statistical analysis over many samples taken inside the same kernel launch.
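The warmup/timed-run scheme described above can be sketched roughly as follows. This is a minimal host-side illustration, not the code in this PR: `TimingStats`, `summarize`, and `time_task` are hypothetical names, `std::chrono::steady_clock` stands in for whatever timer the AICPU side actually uses, and the sample (n - 1) standard deviation is an assumption, since the description does not say which estimator is reported.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result type: mean and standard deviation in microseconds.
struct TimingStats {
    double mean_us = 0.0;
    double stddev_us = 0.0;
};

// Computes the mean and the sample (n - 1) standard deviation.
// Returns zeros for an empty input and a zero stddev for a single sample.
inline TimingStats summarize(const std::vector<double>& samples_us) {
    TimingStats stats;
    const std::size_t n = samples_us.size();
    if (n == 0) return stats;

    double sum = 0.0;
    for (double s : samples_us) sum += s;
    stats.mean_us = sum / static_cast<double>(n);

    if (n < 2) return stats;
    double squared_deviation = 0.0;
    for (double s : samples_us) {
        const double d = s - stats.mean_us;
        squared_deviation += d * d;
    }
    stats.stddev_us = std::sqrt(squared_deviation / static_cast<double>(n - 1));
    return stats;
}

// Runs `task` `warmup_runs` times without timing it, then `timed_runs`
// times with per-run timing, and summarizes the timed samples.
inline TimingStats time_task(const std::function<void()>& task,
                             std::size_t warmup_runs,
                             std::size_t timed_runs) {
    for (std::size_t i = 0; i < warmup_runs; ++i) task();

    std::vector<double> samples_us;
    samples_us.reserve(timed_runs);
    for (std::size_t i = 0; i < timed_runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        samples_us.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return summarize(samples_us);
}
```

Keeping the per-run samples (rather than only a running sum) makes it possible to apply other estimators later, such as a trimmed mean.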

Current blocker:

We are trying (and failing) to reset the SchedulerContext back to its initial state so that the task can be re-run within the same kernel launch. We tried:

deinit(runtime); init(runtime);

However, this makes the test fail and increases running time by 10x.

Relevant Change:

See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for the main timing function we added.

We would appreciate help figuring out how to reset the scheduler context so that the pypto task can be re-run cleanly.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment on lines +232 to +241
std::string env_timing_iterations_string = std::string(env_timing_iterations);
bool isValidValue = false;
if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; }
if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; }
if (isValidValue == false)
{
    LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations);
    runtime->is_timing_enabled = false;
}
}

medium

The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
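One way the more tolerant parsing suggested here might look is sketched below. `try_parse_bool` is a hypothetical helper, not something in this PR or the reviewed file, and the accepted spellings are just the ones listed in the comment above. It does not trim surrounding whitespace.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: parses common boolean spellings case-insensitively.
// On success, writes the parsed value to `out` and returns true.
// On failure, leaves `out` untouched and returns false.
inline bool try_parse_bool(std::string text, bool& out) {
    // Lower-case a local copy so the comparison ignores case.
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    if (text == "1" || text == "true" || text == "on") {
        out = true;
        return true;
    }
    if (text == "0" || text == "false" || text == "off") {
        out = false;
        return true;
    }
    return false;
}
```

A caller could pass the string read from PTO2_KERNEL_TIMING_ENABLED and, when the function returns false, keep the existing behavior of logging a warning and falling back to the default.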

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated

@ChaoWao (Collaborator) commented May 21, 2026

  • Please provide comparison data for N separate kernel launches versus a single kernel launch with N inner runs.

Run 100 times and trim the 10 highest and 10 lowest samples.
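The trimming requested here amounts to a trimmed mean. A minimal sketch, assuming the per-run timings are already collected into a vector; `trimmed_mean` is a hypothetical helper name and not part of this PR:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts the samples, drops the `trim_each_side`
// smallest and largest values, and averages what remains.
// Returns 0.0 if trimming would leave no samples.
inline double trimmed_mean(std::vector<double> samples,
                           std::size_t trim_each_side) {
    if (samples.size() <= 2 * trim_each_side) return 0.0;

    std::sort(samples.begin(), samples.end());
    const auto offset = static_cast<std::ptrdiff_t>(trim_each_side);
    const auto first = samples.begin() + offset;
    const auto last = samples.end() - offset;

    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

For the request above, each configuration would collect 100 samples and call `trimmed_mean(samples, 10)`, so the result averages the middle 80.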

@ChaoWao closed this May 21, 2026
@ChaoWao reopened this May 21, 2026