Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always warn on cpu fallback #318

Merged
merged 3 commits into from
Jun 5, 2024
Merged

Always warn on cpu fallback #318

merged 3 commits into from
Jun 5, 2024

Conversation

dvrogozh
Copy link
Contributor

Fixes: #262, pytorch/pytorch#126488

5bf9e0c ("Register operator's implementation lazily") disabled warnings printout on explicit CPU fallback. I believe that users and customers will benefit from these warnings in all the cases. Note that "explicit fallback" seems to be some internal intel classification term for supported/unsupported operations unlikely known to others. Thus, non-intel users will likely care for cpu fallback in general regardless of its type. This PR adds warning back for all CPU fallback cases. We did discuss in pytorch/pytorch#126488 that maybe printout in Release build might not be needed - I am thinking otherwise and added printout for all the build modes. Let's discuss this in the PR review.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Copy link
Contributor

@EikanWang EikanWang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@EikanWang
Copy link
Contributor

@dvrogozh , do you have the privilege to add a reviewer?

@dvrogozh
Copy link
Contributor Author

@dvrogozh , do you have the privilege to add a reviewer?

No, unfortunately I do not.

@dvrogozh
Copy link
Contributor Author

@EikanWang @fengyuan14. Some of pull / preci-ut tests fail. Are these some known failures since I can hardly attribute them to the change in this PR?

@fengyuan14
Copy link
Contributor

@EikanWang @fengyuan14. Some of pull / preci-ut tests fail. Are these some known failures since I can hardly attribute them to the change in this PR?

Retrigger the preci. Skipped cases failed due to PyTorch daily update.

Fixes: intel#262
Fixes: 5bf9e0c ("Register operator's implementation lazily")
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
@dvrogozh
Copy link
Contributor Author

dvrogozh commented Jun 4, 2024

@fengyuan14 : tests are still failing. Should this be retriggered again?

@fengyuan14
Copy link
Contributor

@fengyuan14 : tests are still failing. Should this be retriggered again?
Two new errors never encounter. Let me try retrigger

FAILED test_autograd_xpu.py::TestAutogradDeviceTypeXPU::test_copy_r_to_c_xpu - AssertionError: False is not true : 
FAILED test_autograd_xpu.py::TestAutogradDeviceTypeXPU::test_to_r_to_c_xpu - AssertionError: False is not true : 

@fengyuan14
Copy link
Contributor

I checked the failure again. The failures should be caused by the change in the PR.
image
The case says there should be no warning, but we warn in CPU fallback code path.

@dvrogozh
Copy link
Contributor Author

dvrogozh commented Jun 4, 2024

@fengyuan14 : Ok, so there are some tests in pytorch which don't allow warnings. However, the tests which fail actually run few basic ops - in a way someone might consider that these ops should not fail and should run on appropriate device. I.e. actually someone might consider that these tests are actually broken and incorrectly report status.

There are few ways to proceed with this PR:

  1. Actually fix the tests. This means identify and implement missing xpu ops.
  2. Silence warning conditional to something like release build or environment variable disabled by default.

I need your opinions on this. From my side, I will check tomorrow which ops are not implemented. As of now I checked only 1 test:

  • TestAutogradDeviceTypeXPU.test_copy_r_to_c_xpu - need aten::all.all_out

@fengyuan14
Copy link
Contributor

@fengyuan14 : Ok, so there are some tests in pytorch which don't allow warnings. However, the tests which fail actually run few basic ops - in a way someone might consider that these ops should not fail and should run on appropriate device. I.e. actually someone might consider that these tests are actually broken and incorrectly report status.

There are few ways to proceed with this PR:

  1. Actually fix the tests. This means identify and implement missing xpu ops.
  2. Silence warning conditional to something like release build or environment variable disabled by default.

I need your opinions on this. From my side, I will check tomorrow which ops are not implemented. As of now I checked only 1 test:

  • TestAutogradDeviceTypeXPU.test_copy_r_to_c_xpu - need aten::all.all_out

I prefer the first option. Looks there won't be too many efforts to take. And aten::all.all_out is in our development plan. We won't take extra efforts. We can prioritize the op.

@fengyuan14
Copy link
Contributor

#368

@dvrogozh
Copy link
Contributor Author

dvrogozh commented Jun 4, 2024

From my side, I will check tomorrow which ops are not implemented.

As far as I can tell from the log, we have 2 tests failing, both requiring same op:

  • TestAutogradDeviceTypeXPU.test_copy_r_to_c_xpu - need aten::all.all_out
  • TestAutogradDeviceTypeXPU.test_to_r_to_c_xpu - need aten::all.all_out

@fengyuan14 fengyuan14 added this pull request to the merge queue Jun 5, 2024
Merged via the queue into intel:main with commit 49ab162 Jun 5, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide a way to debug explicit CPU fallback
3 participants