Fixing MORE mlas unittest failures in POWER #8673
zhanghuanrong
left a comment
It seems a previous change should have changed the semantics of the MlasGemm test to MlasGemmBatch,
not added a new test for it. Hold this PR if you want.
We need to fix unittest for the above mentioned issue.
|
This PR does not add a new test. It deletes an accidental cut/paste problem where someone converted MlasGemm to MlasGemmBatch, but they left in the call to MlasGemm. All I'm doing in this PR is removing that offending old call. However, if you are saying "yes, we need to fix more things (a superset of this fix)", and you want us to just delete this and wait until you've made your fixes available to us to test, that will be fine. Let us know what we should do with this. |
|
@zhanghuanrong : you said above "We need to fix unittest for the above mentioned issue." |
|
I will follow up on it with chenfucn and zhanghuanrong |
|
Is there more I can do here? If this is going to be open for a while, but there are no more action items for me, that would be good to know... |
zhanghuanrong
left a comment
Please check in first.
|
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline |
|
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed |
|
Azure Pipelines successfully started running 9 pipeline(s). |
|
Azure Pipelines successfully started running 9 pipeline(s). |
|
Is there more I can do here? If this is going to be open for a while, but there are no more action items for me, that would be good to know... |
* Fixing MORE mlas unittest failures in POWER (#8673)
* Ensure ms-experimental domain Audio Ops build in mac pipeline (#8857)
* Globally enable ms-experimental ops
* change meaning of ms_experimental to mean *all* ms_experimental ops. Some experimental ops will still be enabled globally without this flag like audio ops.
* add cmath
* add cmath to signal_defs.cc
* move audio back into experimental, verify on mac
* remove experimental from mac builds
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* Remove cpuinfo from WCOS builds (#9076)
* Fix a bug for Openvino Python binding (#9130)
* Fix default initialization value in C API header (#9126)
* fix default initialization value in C API header
* Fix conflicts
* Nits
* Do not generate nuget symbol packages on Linux
* fix name conflict in 1.9 for Fix default initialization value in C API header
* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1
* fix rocm 4.3.1 docker image reference
* fix CUDA_VERSION to ROCM_VERSION
* fix ReduceConsts conflict def
* add ifdef to miopen_common.h as well
* trailing ws
* remove OrtCUDAProviderOptions() and simply set value
* revert to use custom ctor and fix tests
Co-authored-by: austinpagan <fossum@us.ibm.com>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Tiago Koji Castro Shibata <ticastro@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
We noted that "onnx_runtime_mlas_test --long" was reporting MILLIONS of errors on our Power servers.
After some analysis, it looked like the incorrect answers we received were what would be expected if the
GEMM call were made twice in a row, with the C values from the first call plugged in as input to the second call.
We discovered what appears to be a cut-and-paste error in onnxruntime/test/mlas/unittest/test_fgemm.h,
and when we removed the offending second line, the errors went away!
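
To illustrate why a leftover second call would produce this kind of wrong answer, here is a minimal sketch. It uses a hypothetical naive `NaiveGemm` helper (not the MLAS API, and not the code in test_fgemm.h) that computes C = alpha * A * B + beta * C. The matrices, the `RunGemm` helper, and the beta values are made up for illustration; the only point is that with a nonzero beta, a second call reads the C written by the first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive reference GEMM, for illustration only (NOT the MLAS API).
// Computes C = alpha * A * B + beta * C for row-major, non-transposed inputs.
void NaiveGemm(std::size_t M, std::size_t N, std::size_t K, float alpha,
               const std::vector<float>& A, const std::vector<float>& B,
               float beta, std::vector<float>& C) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        sum += A[m * K + k] * B[k * N + n];
      }
      // The old value of C is an input whenever beta is nonzero.
      C[m * N + n] = alpha * sum + beta * C[m * N + n];
    }
  }
}

// Hypothetical helper: runs the same GEMM `calls` times on one C buffer,
// mimicking a test that accidentally invokes the GEMM more than once.
std::vector<float> RunGemm(int calls, float beta) {
  const std::vector<float> A = {1, 2, 3, 4};  // 2x2
  const std::vector<float> B = {5, 6, 7, 8};  // 2x2
  std::vector<float> C = {1, 1, 1, 1};        // 2x2 initial C
  for (int i = 0; i < calls; ++i) {
    NaiveGemm(2, 2, 2, 1.0f, A, B, beta, C);
  }
  return C;
}
```

In this sketch, with beta = 1 a single call yields {20, 23, 44, 51}, while two calls yield {39, 45, 87, 101}, because the second call adds A * B to the first call's output. With beta = 0 the duplicate call is harmless, since C is simply overwritten.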