Fix segment fault for alltoall #12701

Merged
pengwa merged 3 commits into main from
pengwa/segfault
Aug 30, 2022
@pengwa pengwa commented Aug 24, 2022

Description: Fix segmentation fault for AllToAll.

MoE model tests intermittently fail with a segmentation fault.

The problem is that the AllToAll autograd function consumes a process group object in its forward function. When the segmentation fault happens, the process group parameter received by the forward function is a 'str' such as "ONNX:Slice114" or some weird object pointer.

The root cause: during PythonOp export, we save only the address of the process group object (a non-tensor pointer type) and store it in a PythonOp attribute, but Python does not know there is a reference to that object. In the new environment the MoE test is being onboarded to, the process group appears more likely to be destroyed before the tests complete. If it is torn down before PythonOp/PythonOpGrad completes, an unexpected object (e.g. the str or weird object) is picked up and used as the process group, and a segmentation fault occurs.

The fix: process groups passed as inputs to any PythonOp should exist for the whole training lifetime. So we can track references to those non-tensor objects when we export PythonOp. Keeping those objects in a global store prevents them from being released. We clean up the store before the Python program exits. For multiple UT cases, the cleanup is not triggered, but that should be fine.
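A minimal, hedged illustration of the root cause in plain Python (not the actual ONNX Runtime exporter code): saving only an object's address does not keep the object alive. `FakeProcessGroup` and `export_address_only` are hypothetical names invented for this sketch, and the immediate freeing after `del` assumes CPython's reference counting.

```python
import gc
import weakref


class FakeProcessGroup:
    """Hypothetical stand-in for a real process group object."""

    def __init__(self, name):
        self.name = name


def export_address_only(obj):
    # Mimics saving only the raw address as an attribute. The result is a
    # plain int, which Python does not count as a reference to obj.
    return id(obj)


group = FakeProcessGroup("world")
alive = weakref.ref(group)  # lets us observe when the object is freed
saved_address = export_address_only(group)

del group  # the last real reference goes away
gc.collect()

# The object is gone even though its address is still stored.
group_was_freed = alive() is None
print("process group freed:", group_was_freed)
```

At this point `saved_address` is dangling: the memory may be reused by an unrelated object, which matches the symptom of the forward function receiving a `str` instead of a process group. The sketch deliberately never dereferences the stale address.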

@pengwa pengwa requested a review from wschin August 24, 2022 13:48
@pengwa pengwa added the "training" label (issues related to ONNX Runtime training; typically submitted using template) Aug 24, 2022
@pengwa pengwa requested a review from baijumeswani August 24, 2022 14:05
pengwa commented Aug 30, 2022

Thanks @wschin @baijumeswani.

@pengwa pengwa merged commit a0c25e5 into main Aug 30, 2022
@pengwa pengwa deleted the pengwa/segfault branch August 30, 2022 03:27