Release backward inputs per static graph ref count#20804

Merged
pengwa merged 9 commits into main from pengwa/release_external_outputs
Jun 14, 2024

@pengwa pengwa commented May 24, 2024

Release backward inputs per static graph ref count

For output buffers marked as external outputs:

  1. Remove the additional ref count that we used to prevent the buffer from being reused. Instead, when we look for an input/output buffer to reuse, we make sure the reused buffer was not generated by a node that has external outputs.
  2. Remove the ref count held by the pybind feed inputs, which previously persisted until run_backward completed. Instead, we pass mutable feeds and clear the feeds vector once it has been copied into the session state and is no longer needed, before running the graph sequentially.
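
The two points above can be illustrated with a minimal Python sketch. This is not ONNX Runtime code: `Buffer`, `load_feeds`, and `can_reuse` are hypothetical names chosen for illustration, and a plain dict stands in for the session state. It only shows the idea that emptying the caller's feeds list after the copy leaves the session state as the sole owner, and that a buffer is treated as reusable only if its producer node has no external outputs.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a large tensor buffer."""

    def __init__(self, name):
        self.name = name


def load_feeds(session_state, feed_names, feeds):
    """Copy feeds into session_state, then empty the caller's (mutable) list."""
    if len(feed_names) != len(feeds):
        raise ValueError("feed_names and feeds must have the same length")
    for name, buf in zip(feed_names, feeds):
        session_state[name] = buf
    # After this, the caller's list no longer keeps any buffer alive.
    feeds.clear()


def can_reuse(producer_of, nodes_with_external_outputs, tensor):
    """Allow reuse only if the tensor's producer has no external outputs."""
    return producer_of[tensor] not in nodes_with_external_outputs


# Demo: the session state becomes the only owner of the feed.
feeds = [Buffer("backward_input")]
probe = weakref.ref(feeds[0])  # observes the buffer without owning it
session_state = {}
load_feeds(session_state, ["backward_input"], feeds)

# Assumed: the last consumer has finished, so the state drops its reference.
del session_state["backward_input"]
gc.collect()
buffer_was_freed = probe() is None
```

In this sketch the buffer can be freed as soon as the session state drops it, because the feeds list was emptied right after the copy rather than held until the end of the run.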

Before the change:

One of the backward inputs is 3.9GB, and it lives until the backward pass ends.
image

With the change:

The 3.9GB tensor is released as soon as the last node that depends on it completes.
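
As a rough illustration of releasing a tensor once its last consumer completes, the sketch below counts consumers per tensor from the static graph and frees a tensor when its count reaches zero. It is an assumption-laden toy, not the ONNX Runtime executor: `run_with_static_ref_counts`, the `(name, inputs, outputs)` node tuples, and the `kernels` mapping are all hypothetical.

```python
from collections import Counter


def run_with_static_ref_counts(nodes, feeds, kernels):
    """Run nodes in order; free each tensor after its last consumer.

    nodes:   list of (node_name, input_names, output_names) in execution order
    feeds:   dict of initial tensors (for example, backward inputs)
    kernels: dict node_name -> callable returning a tuple of outputs
    """
    # Static ref count: how many times each tensor is consumed in the graph.
    remaining = Counter(t for _, inputs, _ in nodes for t in inputs)
    live = dict(feeds)
    release_log = []  # (tensor, node after which it was released)

    for name, inputs, outputs in nodes:
        results = kernels[name](*[live[t] for t in inputs])
        live.update(zip(outputs, results))
        for t in inputs:
            remaining[t] -= 1
            if remaining[t] == 0:
                del live[t]  # last consumer done: drop the tensor now
                release_log.append((t, name))
    return live, release_log


# Toy graph: "big" is only needed by the first node.
nodes = [
    ("n1", ["big", "w"], ["a"]),
    ("n2", ["a"], ["b"]),
    ("n3", ["b", "w"], ["out"]),
]
kernels = {
    "n1": lambda big, w: (big * w,),
    "n2": lambda a: (a + 1,),
    "n3": lambda b, w: (b * w,),
}
live, release_log = run_with_static_ref_counts(
    nodes, {"big": 3.0, "w": 2.0}, kernels
)
```

Here `big` is dropped right after `n1`, while `w` stays alive until `n3`, its last consumer. Tensors with no consumers, such as the graph output, remain in `live`.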

image

Note: the peak memory did not change, though; more work is needed to reduce the peak.

Others

A few tests were found to have been updated with incorrect expected values in a previous code refactoring: a81faee#diff-9e8fbae7d3dff24106cd17564949f320e943cb3048eae07813c7de144f140419L382.

This PR restores the correct values, and I think all test cases are now back to normal.

Motivation and Context

@pengwa added the training label May 24, 2024
@pengwa requested a review from wschin May 24, 2024 08:25
@pengwa requested a review from souptc June 11, 2024 13:54

pengwa commented Jun 14, 2024

Thanks @wschin !!

@pengwa merged commit 87b14ac into main Jun 14, 2024
@pengwa deleted the pengwa/release_external_outputs branch June 14, 2024 06:33