Question: is it possible to control memory usage? #489
Looks like variable input size is the key here, e.g. if we bin the varying input dimension to multiples of 320 (padding it), memory growth is much less severe.
@lopuhin, all memory management for Intel MKL-DNN happens on the application side. The observed behavior is likely the result of PyTorch caching Intel MKL-DNN primitives for performance reasons. I don't know whether PyTorch provides any control over that.
@Jianhui-Li, can you please help with the question from the PyTorch side?
Thank you for a quick answer @vpirogov 👍
Thanks. I do not think an example is necessary, as the reason for the observed behavior is pretty clear. PyTorch caches Intel MKL-DNN primitives to amortize the cost of primitive creation. When you change the input dimensions, the application has to create new Intel MKL-DNN primitives. If this happens a lot, you will end up with many primitives in the cache and observe growth in memory consumption. Let's see what @Jianhui-Li has to say about the controls PyTorch provides over the cache behavior.
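To make the mechanism concrete, here is a toy sketch (not PyTorch's actual implementation) of a primitive cache keyed by input shape: every previously unseen shape adds an entry, so highly variable input shapes make the cache, and therefore memory usage, grow.

```python
# Toy illustration only -- not PyTorch's real primitive cache.
# One "primitive" is created per (operation, input shape) key; new shapes add entries.
primitive_cache = {}

def get_primitive(op_name, input_shape):
    key = (op_name, input_shape)
    if key not in primitive_cache:
        # Stands in for an expensive MKL-DNN primitive creation.
        primitive_cache[key] = object()
    return primitive_cache[key]

# 1000 requests with different widths -> up to 1000 cached entries per op.
for width in range(200, 1200):
    get_primitive("conv1", (1, 3, 320, width))

print(len(primitive_cache))  # grows with the number of distinct shapes
```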
@lopuhin Thanks for the input. From your description, I agree with Vadim that this is most likely due to caching the MKL-DNN primitives created for varying input sizes. Right now we don't have control over the size of the primitive cache, since having the framework implicitly exercise memory size control might not be the best overall solution. I think that binning the input size as you did is the solution most users would like, since it gives you control over both memory use and the performance gain. The framework may need to detect the varying-input-size scenario, warn about the extra memory usage it causes, and give the user guidance on binning the input size. Your feedback is welcome.
Thank you @Jianhui-Li, this makes sense. You are right that binning is a workaround, although it is not ideal because it hurts performance a bit and still results in higher memory usage than without the cache. Maybe giving the user an ability to clear the cache, similar to torch.cuda.empty_cache(), would help?
@lopuhin Your input is important to us. A few more questions: With torch.cuda.empty_cache(), I guess you would monitor the memory usage and clear the cache if it is higher than a certain threshold? After binning the input size, what is the memory usage you observed? Batch size is a determining factor in how significant the caching overhead looks, since a high batch size can amortize the primitive creation cost. In your usage, what is the batch size?
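For reference, the threshold-based clearing pattern being asked about might look like the following sketch; the threshold value is illustrative and `clear_mkldnn_cache` is a hypothetical hook (no such control existed for the MKL-DNN primitive cache at the time of this thread), with `psutil` used to read the process RSS.

```python
import psutil

RSS_THRESHOLD_BYTES = 8 * 1024**3  # example budget, not a value from the thread

def clear_mkldnn_cache():
    # Placeholder: a hypothetical cache-clearing hook, analogous to
    # torch.cuda.empty_cache(); no such API exists for the MKL-DNN cache here.
    pass

def maybe_clear_cache():
    rss = psutil.Process().memory_info().rss
    if rss > RSS_THRESHOLD_BYTES:
        clear_mkldnn_cache()
```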
Thank you for the encouragement @Jianhui-Li.
My first thought was to check what would happen if we clear the cache after each request; if the performance penalty is small, this looks easiest. But what you propose would be even better.
Please see benchmark results below
I see, thanks for the insight. In my case the batch size is 1, and raising it would be challenging, as we'd like to keep the latency low, and also due to the highly variable input size. Here are the benchmark results. Regarding memory growth, note that there are other parts of the system besides the image part, so part of the memory growth is due to them. These benchmarks are run over 1000 items, which is close to the steady state in terms of memory usage.
Flexible binning is this:
Max image size is 7680, width is always 320.
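(The snippet that originally followed this comment is not preserved in the thread; below is a minimal sketch of what such flexible binning might look like, assuming the varying last dimension is zero-padded up to the next multiple of 320, capped at 7680.)

```python
import math
import torch
import torch.nn.functional as F

def bin_input(image, bin_size=320, max_size=7680):
    """Pad the varying last dimension up to the next multiple of bin_size.

    image: tensor of shape (3, 320, N) as in this workload, with N varying
    between roughly 200 and 7680 (numbers quoted in this thread).
    """
    n = image.shape[-1]
    target = min(math.ceil(n / bin_size) * bin_size, max_size)
    # Zero-pad on the right of the last dimension only.
    return F.pad(image, (0, target - n))

x = torch.randn(3, 320, 450)
print(bin_input(x).shape)  # torch.Size([3, 320, 640])
```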
@lopuhin You mentioned ResNet. Is it ResNet-50 that your numbers are based on? Thanks.
@jgong5 the numbers are based on ResNet-34 without the last block, like this:
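(The code that followed is not preserved here; a plausible sketch, assuming torchvision's `resnet34` with the final residual stage `layer4` and the classifier head dropped.)

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

def make_backbone():
    # Assumption: "without last block" means dropping layer4 and the classifier head.
    m = resnet34()
    return nn.Sequential(
        m.conv1, m.bn1, m.relu, m.maxpool,
        m.layer1, m.layer2, m.layer3,
    )

model = make_backbone().eval()
with torch.no_grad():
    features = model(torch.randn(1, 3, 320, 640))
print(features.shape)  # torch.Size([1, 256, 20, 40])
```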
@lopuhin May I know the benchmark URL you used? We would like to reproduce the problem and do further analysis. Thanks.
@mingxiaoh I re-created a workload similar to the above benchmark in this gist: https://gist.github.com/lopuhin/255992a255810407e2c42a5513e20a13 Here are the results I got, including some environment info (the OS is Ubuntu 18.04):
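(The gist above is the authoritative benchmark; for orientation only, here is a minimal sketch of a workload in the same spirit: repeated CPU inference over inputs with a varying last dimension while tracking process RSS with `psutil`. The model, sizes, and iteration count are stand-ins, not the gist's exact values.)

```python
import random
import psutil
import torch
import torch.nn as nn

# Small stand-in model; the real benchmark (see gist) uses a ResNet-34 backbone.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
).eval()

proc = psutil.Process()
random.seed(0)

with torch.no_grad():
    for i in range(1000):  # the thread mentions ~1000 calls before memory plateaus
        width = random.randint(200, 7680)  # varying last dimension, as in the issue
        x = torch.randn(1, 3, 320, width)
        model(x)
        if i % 100 == 0:
            print(i, "RSS MB:", proc.memory_info().rss // 2**20)
```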
@lopuhin we can successfully reproduce your problem and are investigating it now; we will come back to you when the root cause is clear, thanks.
Great, thank you @mingxiaoh 👍
@lopuhin This is due to caching the MKL-DNN primitives created for varying input sizes. If you would like to reduce the memory usage, you can change the capacity value to a smaller value (see below):

```
(py3-intel-chainer) [sys_dltest2@mlt-skx139 ideep]$ git diff include/ideep/lru_cache.hpp
-template <class value_t, size_t capacity = 1024, class key_t = std::string> // TODO: mutex it
```

Test result: case 2: capacity = 128
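For intuition, the effect of that capacity parameter can be sketched in Python with an OrderedDict-based LRU cache (the real implementation is the C++ template in ideep's `lru_cache.hpp`): once capacity is reached, the least-recently-used primitive is evicted, so memory is capped at roughly capacity times the per-primitive footprint.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch mirroring the capacity idea in ideep's lru_cache.hpp."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._items = OrderedDict()

    def __len__(self):
        return len(self._items)

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=128)  # smaller capacity -> lower steady-state memory
for width in range(200, 7680, 10):
    cache.put(("conv", width), object())
print(len(cache))  # never exceeds 128
```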
Wow, these are great results @mingxiaoh, thank you!
@lopuhin We will provide an environment variable to control this threshold so that users don't have to change the source code for it.
@jgong5 this would be perfect, thank you!
@jgong5, I am having a similar issue with TensorFlow Keras. Will this same solution work for that as well?
@shahsahilj, the same approach will definitely work for TensorFlow. Note that the cache is implemented on the TensorFlow side, not in Intel MKL-DNN, so the controls for the cache behavior should be implemented there as well. @agramesh1, could you please comment on the cache controls in TensorFlow?
@shahsahilj In PyTorch, we are using an LRU cache to control the capacity. You may try a similar approach in TF as well.
@shahsahilj The primitives cache in TF is an LRU cache that will limit memory growth. This should be enabled in TF 1.14, which was just released.
@jgong5, @agramesh1, does TF provide a way to control the cache size or turn it off completely?
@agramesh1 I tried using the cache in TF 1.14. I was able to recompile it with a lower cache capacity, and that helped slightly, but I could not get it to work as well as the TF 1.13 with MKL that Anaconda has in its main channel. Would it be possible to know what config options are used to make the most optimal build of TF?
When using MKL-DNN from PyTorch for ResNet inference with varying input shapes, we observe memory usage growing quickly by around 7 GB; the growth then mostly stops, or slows down significantly, after around 1000 calls to inference, with input shapes ranging from (3, 320, 200) to (3, 320, 7680). MKL-DNN provides a nice speedup, but this extra memory usage is a little concerning.
The questions are:
Thank you!