
Fix tensorflow GPU memory management for pytest runs. #154

Closed · monorimet opened this issue Jun 23, 2022 · 2 comments

@monorimet (Collaborator) commented Jun 23, 2022

Currently, two GPU memory management issues appear when running the pytest suite for the TensorFlow masked_lm models.

When running GPU tests for albert_base_v2, the static_gpu case (currently included in this issue) passes if the tolerance values for compare_tensors_tf are increased to rtol=1e-02 and atol=1e-01. All of the tests mentioned in that issue pass with the increased tolerances. This isn't really acceptable accuracy, but we are waiting on the IREE team, so we can work around it for now to get memory management squared away.
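
For context, the relaxed tolerances amount to a numpy.allclose-style elementwise check. The snippet below (compare_relaxed is a hypothetical stand-in, not the actual compare_tensors_tf implementation) only illustrates what those values allow:

```python
import numpy as np

# Hypothetical stand-in for the comparison done by compare_tensors_tf
# (not its actual implementation): numpy.allclose semantics, i.e. the
# elementwise check |actual - expected| <= atol + rtol * |expected|.
def compare_relaxed(actual, expected, rtol=1e-02, atol=1e-01):
    return np.allclose(actual, expected, rtol=rtol, atol=atol)

expected = np.array([1.00, 2.00, 3.00])
actual = np.array([1.04, 2.05, 3.10])   # up to ~5% relative error

print(compare_relaxed(actual, expected))  # True at the relaxed tolerances
print(np.allclose(actual, expected))      # False at numpy's defaults
```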

TF albert on GPU passes for the dynamic and static cases only if the tests are run individually: the CUDA memory TensorFlow allocates is not freed before the second GPU test runs, whether or not the first test passes.
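
Since TensorFlow's GPU allocator never returns memory to the driver within a process, one workaround (a sketch on my part, not necessarily what we end up doing) is to give each test file its own interpreter so the CUDA allocation dies with the process; the pytest-forked plugin (`pytest --forked`) provides the same isolation without a wrapper script:

```python
import subprocess
import sys

# Hypothetical test paths, for illustration only. Each test file runs in a
# fresh Python process, so TF's CUDA memory is released when it exits.
TESTS = [
    "tank/tf/hf_masked_lm/albert_base_v2_test.py",
    "tank/tf/hf_masked_lm/bert_base_uncased_test.py",
]

for test in TESTS:
    result = subprocess.run([sys.executable, "-m", "pytest", test])
    print(f"{test}: exit code {result.returncode}")
```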

If we try bert_static_gpu, however, CUDA runs out of memory even when the test is run by itself: TF allocates ~39GB of GPU memory for the model at the beginning of the test, and we hit a CUDA OOM when shark_module.compile() is called (the HAL allocation in IREE).
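
One possible mitigation for the up-front ~39GB allocation (a sketch, assuming we can configure TF before the model touches the GPU; not necessarily what the repo does) is to disable TensorFlow's default grab-the-whole-device behavior:

```python
import tensorflow as tf

# Must run before any TF op initializes the GPU. Keeps TF from reserving
# (almost) the entire device, so IREE's HAL allocation in
# shark_module.compile() still has room.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of grabbing it all at startup.
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternative (instead of memory growth): hard-cap TF's share of the device,
# here a hypothetical 8 GiB.
# if gpus:
#     tf.config.set_logical_device_configuration(
#         gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]
#     )
```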

All of the TF model tests in tank/tf/hf_masked_lm/ share this issue.

@powderluv (Contributor) commented
Let's just wait until we can run the tests separately with the SHARK downloader.

@monorimet (Collaborator, Author) commented

#169 solves this issue in the meantime.
