Skip to content

Fixing MORE mlas unittest failures in POWER#8673

Merged
snnn merged 1 commit intomicrosoft:masterfrom
austinpagan:Fix.MORE.Unittests.Power
Sep 15, 2021
Merged

Fixing MORE mlas unittest failures in POWER#8673
snnn merged 1 commit intomicrosoft:masterfrom
austinpagan:Fix.MORE.Unittests.Power

Conversation

@austinpagan
Copy link
Contributor

We noted that "onnx_runtime_mlas_test --long" was reporting MILLIONS of errors on our Power servers.

After some analysis it looked like the incorrect answers being received are what would be expected if the
GEMM call were made twice in a row, with the C values from the first call plugged in as input to the second call.

We discovered what appears to be a cut-and-paste error in onnxruntime/test/mlas/unittest/test_fgemm.h,
and when we removed the offending second line, the errors went away!

@austinpagan austinpagan requested a review from a team as a code owner August 10, 2021 17:19
@snnn
Copy link
Contributor

snnn commented Aug 10, 2021

#7323

Copy link
Contributor

@zhanghuanrong zhanghuanrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems some previous change should change semantic for MlasGemm test to MlasGemmBatch,
not add new test for it. Hold this PR if you want.
We need to fix unittest for above mentioned issue.

@austinpagan
Copy link
Contributor Author

austinpagan commented Aug 10, 2021

This PR does not add a new test. It deletes an accidental cut/paste problem where someone converted MlasGemm to MlasGemmBatch, but they left in the call to MlasGemm. All I'm doing in this PR is removing that offending old call, which
was accidentally left in place. With this one-line fix we find that onnx_runtime_mlas_test --long runs past the place where the zillions of errors started spewing, and our current run is not producing errors. If it does spit out any errors before it completes, we will update this PR.

However, if you are saying "yes we need to fix more things (a superset of this fix)", and you want to just have us delete this and wait until you've made your fixes available to us to test, that will be fine.

Let us know what we should do with this.

@austinpagan
Copy link
Contributor Author

@zhanghuanrong : you said above "We need to fix unittest for the above mentioned issue."
I do not know whether I am part of the "we" you refer to?
Do I have any further action items here?

@snnn
Copy link
Contributor

snnn commented Aug 23, 2021

I will follow up it with chenfucn and zhanghuanrong

@austinpagan
Copy link
Contributor Author

Is there more I can do here? If this is going to be open for a while, but there's no more action items for me, that would be good to know...

Copy link
Contributor

@zhanghuanrong zhanghuanrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check in first.

@snnn
Copy link
Contributor

snnn commented Sep 1, 2021

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline

@snnn
Copy link
Contributor

snnn commented Sep 1, 2021

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@austinpagan
Copy link
Contributor Author

Is there more I can do here? If this is going to be open for a while, but there's no more action items for me, that would be good to know...

@snnn snnn merged commit a05e328 into microsoft:master Sep 15, 2021
wangyems added a commit that referenced this pull request Sep 21, 2021
* Fixing MORE mlas unittest failures in POWER (#8673)

* Ensure ms-experimental domain Audio Ops build in mac pipeline  (#8857)

* Globally enable ms-experimental ops

* change meaning of ms_experimental to mean *all* ms_experimental ops. Some experimental ops will still be enabled globally without this flag like audio ops.

* add cmath

* add cmath to signal_defs.cc

* move audio back into experimental, verify on mac

* remove experimental from mac builds

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>

* Remove cpuinfo from WCOS builds (#9076)

* Fix a bug for Openvino Python binding (#9130)

* Fix default initialization value in C API header (#9126)

* fix default initialization value in C API header

* Fix conflicts

* Nits

* Do not generate nuget symbol packages on Linux

* fix name conflict in 1.9 for Fix default initialization value in C API header

* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)

* make work for both rocm 4.2 and rocm 4.3.1

* fix rocm 4.3.1 docker image reference

* fix CUDA_VERSION to ROCM_VERSION

* fix ReduceConsts conflict def

* add ifdef to miopen_common.h as well

* trailing ws

* remove OrtCUDAProviderOptions() and simply set value

* revert to use custom ctor and fix tests

Co-authored-by: austinpagan <fossum@us.ibm.com>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Tiago Koji Castro Shibata <ticastro@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants