Currently, two GPU memory management issues appear when running pytest for the TensorFlow masked_lm models.
When running GPU tests for albert_base_v2, the static_gpu case (currently included in this issue) passes if the tolerance values for compare_tensors_tf are loosened to rtol=1e-02 and atol=1e-01. All of the tests mentioned in that issue pass with the increased tolerances. This isn't really acceptable accuracy, but we are waiting on the IREE team, so we can work around it for now to get memory management squared away.
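For reference, a minimal sketch of what the relaxed comparison amounts to, assuming compare_tensors_tf is essentially a wrapper around np.allclose (the signature and defaults here are illustrative, not the actual SHARK helper):

```python
import numpy as np

# Hypothetical stand-in for compare_tensors_tf with the loosened tolerances.
# Element-wise check: |shark - tf| <= atol + rtol * |tf|
def compare_tensors_tf(shark_out, tf_out, rtol=1e-2, atol=1e-1):
    return np.allclose(shark_out, tf_out, rtol=rtol, atol=atol)
```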
TF albert on GPU passes for the dynamic and static cases only if the tests are run individually: TensorFlow's allocated CUDA memory does not free up before the second GPU test, whether the first one passes or not.
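One possible workaround (an assumption on my part, not something we've landed): run each GPU test in its own subprocess so the CUDA allocations are released when the process exits. With the pytest-forked plugin installed, a conftest.py hook along these lines would do it; the "gpu" substring match is a guess at how these tests are named:

```python
# conftest.py -- sketch only, assumes the pytest-forked plugin is installed.
# Forking each GPU test into a child process means TF's CUDA allocations
# die with that process, so the next GPU test starts with a clean device.
import pytest

def pytest_collection_modifyitems(items):
    for item in items:
        if "gpu" in item.name:  # hypothetical naming convention
            item.add_marker(pytest.mark.forked)
```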
If we try bert_static_gpu, however, CUDA runs out of memory even when the test is run by itself -- TF allocates ~39GB of GPU memory for the model at the beginning of the test, and we hit a CUDA OOM when shark_module.compile() is called (the HAL allocation in IREE).
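The ~39GB grab is consistent with TensorFlow's default behavior of reserving nearly all visible GPU memory up front. A sketch of how we might leave headroom for IREE's HAL allocator, using standard tf.config calls (whether this is enough for these models is untested, and the 4096 MB cap is an arbitrary example):

```python
import tensorflow as tf

# Must run before any op touches the GPU.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate lazily instead of reserving the whole device up front...
    tf.config.experimental.set_memory_growth(gpu, True)

# ...or, alternatively, hard-cap TF's share of the device (in MB):
# tf.config.set_logical_device_configuration(
#     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]
# )
```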
All of the TF model tests in tank/tf/hf_masked_lm/ share this issue.