[Graph][Benchmark] Update benchmark function #363
Hi Allan, the PR looks good to me. Before merging, though, it would be good to have some demos showing how clearing the L2 cache improves the accuracy of parallel-k selection.
After some further investigation, it appears that clearing the L2 cache is not the greatest contributor, but the usage of
orig-latency is the original benchmark function.
I just removed the torch dependencies.
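The old benchmarking function did not clear the L2 cache between runs, so repeated runs are biased: after the first run the inputs are already cached, and later runs measure warm-cache latency.
This is especially pronounced when tuning the number of parallel-k parts. The tuner always selects k_parts=1 because of L2 cache hits, even when that is not the fastest implementation.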
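To illustrate the idea, here is a minimal sketch of a benchmark loop that flushes the cache before every timed run. It is not the code from this PR: the names `benchmark`, `run`, and `flush_cache` are hypothetical, and it uses host-side timing so it runs anywhere. On a GPU, `flush_cache` would typically overwrite a device buffer larger than the L2 cache, and the timed region would need device synchronization or device-side events.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(run: Callable[[], None],
              flush_cache: Optional[Callable[[], None]] = None,
              warmup: int = 3,
              repeat: int = 10) -> float:
    """Return the median latency of `run` in milliseconds.

    `flush_cache` is a hypothetical hook called before each timed run,
    outside the timed region, so every run starts from a cold cache.
    """
    # Warm-up runs are not timed (they absorb one-off setup costs).
    for _ in range(warmup):
        run()

    latencies = []
    for _ in range(repeat):
        if flush_cache is not None:
            flush_cache()  # not included in the measurement
        start = time.perf_counter()
        run()
        latencies.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to outliers than the mean.
    return statistics.median(latencies)
```

Passing `flush_cache=None` reproduces the warm-cache behaviour described above, which makes it easy to compare both measurements for the same candidate.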