
Question: is it possible to control memory usage? #489

Closed
lopuhin opened this issue Jun 7, 2019 · 25 comments
Assignees: vpirogov
Labels: integration (Issues with integrating the library into applications)

lopuhin commented Jun 7, 2019

When using mkldnn from PyTorch for ResNet inference with varying input shapes (from (3, 320, 200) to (3, 320, 7680)), we observe memory usage quickly growing by around 7 GB; after around 1000 inference calls the growth mostly stops or slows down significantly. MKL-DNN provides a nice speedup, but this extra memory usage is a little concerning.

The questions are:

  • is such memory usage normal in this case?
  • are there any settings or environment variables that can keep memory usage lower, even at the cost of some performance? (I imagine it must be caching some allocations?)

Thank you!

vpirogov self-assigned this Jun 7, 2019

lopuhin (Author) commented Jun 7, 2019

Looks like variable input size is the key here; e.g., if we bin the varying input dimension to multiples of 320 (padding it), memory growth is much less severe.

vpirogov (Member) commented Jun 7, 2019

@lopuhin, all the memory management for Intel MKL-DNN happens on the application side. The observed behavior is likely the result of PyTorch caching Intel MKL-DNN primitives for performance reasons. I don't know whether PyTorch provides any control over that.

vpirogov (Member) commented Jun 7, 2019

@Jianhui-Li, can you please help with this question from the PyTorch side?

vpirogov added the integration label (Issues with integrating the library into applications) and removed the question label Jun 7, 2019

lopuhin (Author) commented Jun 7, 2019

Thank you for the quick answer @vpirogov 👍
For reference, my PyTorch version is 1.1 (from wheels), and I can create a small self-contained example if needed.

vpirogov (Member) commented Jun 7, 2019

Thanks. I do not think an example is necessary, as the reason for the observed behavior is pretty clear. PyTorch caches Intel MKL-DNN primitives to amortize the cost of primitive creation. When you change the input dimensions, the application has to create new Intel MKL-DNN primitives. If this happens a lot, you will end up with many primitives in the cache and observe growth in memory consumption.

Let's see what @Jianhui-Li has to say about the controls PyTorch provides over the cache behavior.
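
As a rough illustration of that mechanism, here is a minimal Python sketch, not PyTorch's actual implementation; create_primitive is a hypothetical stand-in for the expensive MKL-DNN primitive construction:

    # Illustrative sketch only (not PyTorch's real code): a primitive cache
    # keyed by input shape. Every previously unseen shape adds an entry, so
    # workloads with highly variable shapes keep growing the cache and memory.
    primitive_cache = {}

    def create_primitive(input_shape, weight_shape):
        # stand-in for the expensive MKL-DNN primitive creation step
        return ("conv_primitive", tuple(input_shape), tuple(weight_shape))

    def get_conv_primitive(input_shape, weight_shape):
        key = (tuple(input_shape), tuple(weight_shape))
        if key not in primitive_cache:
            primitive_cache[key] = create_primitive(input_shape, weight_shape)
        return primitive_cache[key]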

@Jianhui-Li

@lopuhin Thanks for the input. From your description, I agree with Vadim that this is most likely due to caching the MKL-DNN primitives created for the varying input sizes. Right now we don't provide control over the size of the primitive cache, since having the framework implicitly enforce a memory limit might not be the best overall solution. I think that binning the input size as you did is the solution most users would prefer, since it gives you control over the trade-off between memory use and performance gain. The framework may need to detect the varying-input-size scenario, warn about the extra memory usage it causes, and guide the user toward binning the input size. Your feedback is welcome.

lopuhin (Author) commented Jun 7, 2019

Thank you @Jianhui-Li, this makes sense. You are right that binning is a workaround, although it is not ideal because it hurts performance a bit and still results in higher memory usage than without the cache. Maybe giving the user the ability to clear the cache, similar to torch.cuda.empty_cache(), would be an even nicer option for workloads where caching does not provide much benefit and lower memory usage is more important.
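
As a rough sketch of how such an API might be used if it existed (torch.mkldnn.empty_cache() is hypothetical here, named by analogy with torch.cuda.empty_cache(); the psutil dependency and the 8 GB threshold are assumptions):

    import os

    import psutil  # assumption: psutil is available to read the process RSS

    RSS_LIMIT_BYTES = 8 * 1024 ** 3  # example threshold, not a recommendation

    def maybe_empty_mkldnn_cache(torch_module):
        """Clear the primitive cache once resident memory crosses a threshold.

        torch_module.mkldnn.empty_cache() is a hypothetical API, named by
        analogy with torch.cuda.empty_cache(); it does not exist in PyTorch 1.1.
        """
        rss = psutil.Process(os.getpid()).memory_info().rss
        if rss > RSS_LIMIT_BYTES:
            torch_module.mkldnn.empty_cache()  # hypothetical call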

@Jianhui-Li

@lopuhin Your input is important to us. A few more questions: with torch.cuda.empty_cache(), I guess you would monitor the memory usage and clear the cache if it goes above a certain threshold? After binning the input size, what memory usage did you observe? Batch size is a determining factor in how large the caching overhead looks, since a high batch size can amortize the primitive creation cost. In your usage, what is the batch size?

lopuhin (Author) commented Jun 7, 2019

Thank you for the encouragement, @Jianhui-Li.

> With torch.cuda.empty_cache(), I guess you would monitor the memory usage and clear the cache if it goes above a certain threshold?

My first thought was to check what would happen if we cleared the cache after each request; if the performance penalty is small, this looks easiest. But what you propose would be even better.

> After binning the input size, what memory usage did you observe?

Please see benchmark results below

> Batch size is a determining factor in how large the caching overhead looks, since a high batch size can amortize the primitive creation cost. In your usage, what is the batch size?

I see, thanks for the insight. In my case the batch size is 1, and raising it would be challenging, as we'd like to keep latency reasonably low, and also because of the highly variable input size.

Here are the benchmark results. Regarding memory growth, note that there are other parts of the system besides the image part, so some of the memory growth is due to them.

These benchmarks are run over 1000 items, which is close to steady state in terms of memory usage.

| variant | mean image time, ms | memory growth, kb |
| --- | --- | --- |
| flexible binning | 349.0 | 3,919,380 |
| no binning, no mkldnn | 395.5 | 1,225,264 |
| no binning | 336.4 | 6,756,952 |
| binning to multiple of 320 | 381.1 | 3,388,052 |

Flexible binning is this:

# img.size[1] is the variable image height; choose the padding granularity,
# then the height is rounded up to a multiple of pad_to before inference
if img.size[1] < 640:
    pad_to = 40
elif img.size[1] < 1280:
    pad_to = 80
elif img.size[1] < 2560:
    pad_to = 160
else:
    pad_to = 320

Max image size is 7680, width is always 320.
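
Putting the rule above together with the padding step, here is a minimal sketch of flexible binning; the PIL-based implementation and the zero padding at the bottom are assumptions, not necessarily the exact production code:

    import math

    from PIL import Image

    def bin_height(img):
        """Round the variable height up to a multiple of pad_to so that fewer
        distinct input shapes reach the network and fewer MKL-DNN primitives
        get created and cached."""
        w, h = img.size  # width is fixed at 320 in this workload
        if h < 640:
            pad_to = 40
        elif h < 1280:
            pad_to = 80
        elif h < 2560:
            pad_to = 160
        else:
            pad_to = 320
        new_h = math.ceil(h / pad_to) * pad_to
        padded = Image.new(img.mode, (w, new_h))  # zero-filled canvas
        padded.paste(img, (0, 0))                 # original image at the top
        return padded

The coarser pad_to is, the fewer distinct shapes (and cached primitives) there are, at the cost of more wasted computation on the padded region.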

jgong5 commented Jun 10, 2019

@lopuhin You mentioned ResNet. Is it ResNet-50 that your numbers are based on? Thanks.

lopuhin (Author) commented Jun 10, 2019

@jgong5 the numbers are based on ResNet34 without the last block, like this:

    def forward(self, x):
        base = self.base   # this is resnet34
        x = base.conv1(x)
        x = base.bn1(x)
        x = base.relu(x)
        x = base.maxpool(x)

        x = base.layer1(x)
        x = base.layer2(x)
        x = base.layer3(x)
        # x = base.layer4(x)  # this block is omitted
        return x
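
For reference, a sketch of how such a truncated backbone could be assembled from torchvision's resnet34; the pretrained flag and the example input size are assumptions:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    base = resnet34(pretrained=False)  # whether to load pretrained weights is left open
    backbone = nn.Sequential(
        base.conv1, base.bn1, base.relu, base.maxpool,
        base.layer1, base.layer2, base.layer3,  # layer4 omitted, as above
    )

    with torch.no_grad():
        features = backbone(torch.randn(1, 3, 320, 640))  # example 320x640 input
    print(features.shape)  # torch.Size([1, 256, 20, 40])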

@mingxiaoh

@lopuhin May I know the benchmark URL you used? We would like to reproduce the problem and do further analysis. Thanks.

lopuhin (Author) commented Jun 12, 2019

@mingxiaoh I re-created workload similar to above benchmark in this gist: https://gist.github.com/lopuhin/255992a255810407e2c42a5513e20a13

Here are the results I got, including some environment info (OS is Ubuntu 18.04):

$ python --version
Python 3.6.7
$ pip freeze | grep torch
torch==1.1.0
torchvision==0.3.0
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name	: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
$
$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 2,336,028
n=200 memory growth (kb): 4,057,764
n=300 memory growth (kb): 5,268,144
n=400 memory growth (kb): 5,382,780
n=500 memory growth (kb): 6,110,500
n=600 memory growth (kb): 6,338,308
n=700 memory growth (kb): 6,701,956
n=800 memory growth (kb): 6,924,616
n=900 memory growth (kb): 7,267,328
time: mean=0.456 s, p50=0.132 s, p95=2.411 s
memory (kb): 240,096 initial, 7,411,216 growth
$
$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 550,940
n=200 memory growth (kb): 622,592
n=300 memory growth (kb): 669,072
n=400 memory growth (kb): 688,380
n=500 memory growth (kb): 688,380
n=600 memory growth (kb): 688,380
n=700 memory growth (kb): 688,380
n=800 memory growth (kb): 688,380
n=900 memory growth (kb): 703,072
time: mean=0.537 s, p50=0.151 s, p95=2.926 s
memory (kb): 225,784 initial, 703,072 growth
$
$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 1,753,212
n=200 memory growth (kb): 2,347,940
n=300 memory growth (kb): 2,846,872
n=400 memory growth (kb): 3,077,376
n=500 memory growth (kb): 3,467,264
n=600 memory growth (kb): 3,683,448
n=700 memory growth (kb): 3,838,668
n=800 memory growth (kb): 4,089,364
n=900 memory growth (kb): 4,376,568
time: mean=0.482 s, p50=0.138 s, p95=2.420 s
memory (kb): 240,036 initial, 4,388,468 growth
$
$ python mkldnn_489.py --n 1000 --bin
Running with mkldnn=False bin=True
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 549,820
n=200 memory growth (kb): 697,836
n=300 memory growth (kb): 721,840
n=400 memory growth (kb): 721,844
n=500 memory growth (kb): 725,972
n=600 memory growth (kb): 817,852
n=700 memory growth (kb): 817,852
n=800 memory growth (kb): 817,852
n=900 memory growth (kb): 817,852
time: mean=0.569 s, p50=0.161 s, p95=2.925 s
memory (kb): 225,824 initial, 817,852 growth

@mingxiaoh

@lopuhin We can successfully reproduce your problem and are investigating it now; we will come back to you when the root cause is clear. Thanks.

lopuhin (Author) commented Jun 14, 2019

Great, thank you @mingxiaoh 👍

@mingxiaoh

@lopuhin This is due to caching of the MKL-DNN primitives created for the varying input sizes. If you would like to reduce the memory usage, you can change the cache capacity to a smaller value (see the diff below).

(py3-intel-chainer) [sys_dltest2@mlt-skx139 ideep]$ git diff include/ideep/lru_cache.hpp
diff --git a/include/ideep/lru_cache.hpp b/include/ideep/lru_cache.hpp
index e900e5b..68f6516 100644
--- a/include/ideep/lru_cache.hpp
+++ b/include/ideep/lru_cache.hpp
@@ -294,7 +294,7 @@ private:
size_type capacity_;
};

-template <class value_t, size_t capacity = 1024, class key_t = std::string>
+template <class value_t, size_t capacity = 128, class key_t = std::string>
class computation_cache {
public:
using iterator = typename lru_cache<key_t, value_t>::iterator;
@@ -346,7 +346,7 @@ public:
};

// TODO: mutex it
-template <class value_t, size_t capacity = 1024, class key_t = std::string>
+template <class value_t, size_t capacity = 128, class key_t = std::string>
class computation_gcache {
public:
using iterator = typename lru_cache<key_t, value_t>::iterator;

Test result:
case1: capacity = 1024
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 2378200
n=200 memory growth (kb): 4153520
n=300 memory growth (kb): 5453900
n=400 memory growth (kb): 5684464
n=500 memory growth (kb): 6531688
n=600 memory growth (kb): 6787628
n=700 memory growth (kb): 6990048
n=800 memory growth (kb): 7342368
n=900 memory growth (kb): 7627728
time: mean=0.3482012832928449 s, p50=0.11159232584759593 s, p95=1.8258836238179357 s
memory (kb): 242532 initial, 7743068 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 449452
n=200 memory growth (kb): 490796
n=300 memory growth (kb): 490796
n=400 memory growth (kb): 490796
n=500 memory growth (kb): 490796
n=600 memory growth (kb): 495716
n=700 memory growth (kb): 495716
n=800 memory growth (kb): 495716
n=900 memory growth (kb): 495716
time: mean=0.455472462256439 s, p50=0.1348842727020383 s, p95=2.4116809139493838 s
memory (kb): 202656 initial, 495716 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1816980
n=200 memory growth (kb): 2298620
n=300 memory growth (kb): 2823092
n=400 memory growth (kb): 2969092
n=500 memory growth (kb): 3506688
n=600 memory growth (kb): 3674492
n=700 memory growth (kb): 3746492
n=800 memory growth (kb): 3964016
n=900 memory growth (kb): 4184328
time: mean=0.35401891875546426 s, p50=0.11079582339152694 s, p95=1.8085481527261436 s
memory (kb): 244468 initial, 4184328 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 2396348
n=200 memory growth (kb): 4185060
n=300 memory growth (kb): 5479424
n=400 memory growth (kb): 5547464
n=500 memory growth (kb): 6505552
n=600 memory growth (kb): 6875732
n=700 memory growth (kb): 7207476
n=800 memory growth (kb): 7496696
n=900 memory growth (kb): 7774860
time: mean=0.3428835664289072 s, p50=0.10954237962141633 s, p95=1.8142171191982914 s
memory (kb): 245060 initial, 7805132 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$

case2: capacity = 128
[sys_dltest2@mlt-skx139 ~]$ source ~/pythonenv/py3-intel-chainer/bin/activate
(py3-intel-chainer) [sys_dltest2@mlt-skx139 ~]$
(py3-intel-chainer) [sys_dltest2@mlt-skx139 ~]$ cd Documents/mingxiao_test/
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1238836
n=200 memory growth (kb): 1501184
n=300 memory growth (kb): 1721964
n=400 memory growth (kb): 1753388
n=500 memory growth (kb): 2130836
n=600 memory growth (kb): 2139412
n=700 memory growth (kb): 2241712
n=800 memory growth (kb): 2301144
n=900 memory growth (kb): 2519052
time: mean=0.342514000098221 s, p50=0.11045438100700267 s, p95=1.7644889014918574 s
memory (kb): 201740 initial, 2519052 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 469264
n=200 memory growth (kb): 602688
n=300 memory growth (kb): 602688
n=400 memory growth (kb): 611324
n=500 memory growth (kb): 611324
n=600 memory growth (kb): 611324
n=700 memory growth (kb): 611324
n=800 memory growth (kb): 611324
n=900 memory growth (kb): 633296
time: mean=0.44288936752011066 s, p50=0.13412621649331413 s, p95=2.358341607873444 s
memory (kb): 158304 initial, 647356 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1259300
n=200 memory growth (kb): 1579460
n=300 memory growth (kb): 1693576
n=400 memory growth (kb): 1747780
n=500 memory growth (kb): 2023436
n=600 memory growth (kb): 2087344
n=700 memory growth (kb): 2229264
n=800 memory growth (kb): 2245432
n=900 memory growth (kb): 2491216
time: mean=0.35734817950008435 s, p50=0.11491244149510749 s, p95=1.7963337316439671 s
memory (kb): 201160 initial, 2491216 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1296192
n=200 memory growth (kb): 1537512
n=300 memory growth (kb): 1897888
n=400 memory growth (kb): 1904016
n=500 memory growth (kb): 2241532
n=600 memory growth (kb): 2252640
n=700 memory growth (kb): 2397764
n=800 memory growth (kb): 2425976
n=900 memory growth (kb): 2555936
time: mean=0.34390410209848776 s, p50=0.11097242051619105 s, p95=1.7738525233697133 s
memory (kb): 200516 initial, 2555936 growth

lopuhin (Author) commented Jun 14, 2019

Wow, these are great results @mingxiaoh , thank you!

jgong5 commented Jun 15, 2019

@lopuhin We will provide an environment variable to control this threshold so that users don't have to change the source code for it.
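
As a hedged sketch of how such an environment variable might be used once it exists (the variable name LRU_CACHE_CAPACITY and the value are assumptions here, since no name was announced in this thread):

    import os

    # Hypothetical usage: the capacity has to be set before the library reads it,
    # i.e. before torch is imported in this process (or in the shell environment).
    os.environ.setdefault("LRU_CACHE_CAPACITY", "128")  # assumed variable name

    import torch  # imported after setting the environment variable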

lopuhin (Author) commented Jun 15, 2019

@jgong5 this would be perfect, thank you!

shahsahilj commented Jun 18, 2019

@jgong5, I am having a similar issue with TensorFlow Keras. Will this same solution work for that as well?
The detailed issue is listed at: https://stackoverflow.com/questions/56639787/how-to-fix-memory-leak-with-variable-tensors-in-tensorflow-mkl-library

@vpirogov (Member)

@shahsahilj, the same approach will definitely work for TensorFlow. Note that the cache is implemented on the TensorFlow side, not in Intel MKL-DNN, so the controls for the cache behavior should be implemented there as well.

@agramesh1, could you please comment on the cache controls in Tensorflow?

jgong5 commented Jun 18, 2019

@shahsahilj In PyTorch we are using an LRU cache to control the capacity. You may try a similar approach in TF as well.
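
As a minimal, self-contained illustration of the LRU idea (a sketch only, not the actual ideep or TF implementation):

    from collections import OrderedDict

    class LRUCache:
        """Once the cache holds `capacity` entries, inserting a new one evicts
        the least recently used entry, which bounds memory growth when input
        shapes (and hence cache keys) vary a lot."""

        def __init__(self, capacity=128):
            self.capacity = capacity
            self._store = OrderedDict()

        def get(self, key, default=None):
            if key not in self._store:
                return default
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]

        def put(self, key, value):
            if key in self._store:
                self._store.move_to_end(key)
            self._store[key] = value
            if len(self._store) > self.capacity:
                self._store.popitem(last=False)  # evict the least recently used entry

With the capacity bounded, memory stops growing once the set of observed shapes exceeds the capacity, at the cost of re-creating primitives for evicted shapes.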

agramesh1 commented Jun 19, 2019

@shahsahilj The primitives cache in TF is an LRU cache that limits memory growth. This should be enabled in TF 1.14, which was just released.

@vpirogov (Member)

@jgong5, @agramesh1, does TF provide a way to control the cache size or turn it off completely?

@shahsahilj

@agramesh1 I tried using the cache in TF 1.14. I was able to recompile it with a lower cache capacity and that helped slightly, but I could not get it to work as well as TF 1.13 with MKL that Anaconda has in its main channel. Would it be possible to know what config options are used to make the most optimal build of TF?

vpirogov closed this as completed Aug 6, 2019