
Question: is it possible to control memory usage? #489

Closed
lopuhin opened this issue Jun 7, 2019 · 25 comments
Assignees: vpirogov
Labels: integration (Issues with integrating the library into applications)

lopuhin commented Jun 7, 2019

When using mkldnn from PyTorch for ResNet inference with varying input shapes (from (3, 320, 200) to (3, 320, 7680)), we observe memory usage quickly growing by around 7 GB; after around 1000 inference calls the growth mostly stops or slows down significantly. MKL-DNN provides a nice speedup, but this extra memory usage is a little concerning.

The questions are:

  • is such memory usage normal in this case?
  • are there any settings or environment variables that can keep memory usage lower, even at the cost of some performance? (I imagine it must be caching some allocations?)

Thank you!

vpirogov self-assigned this Jun 7, 2019

lopuhin (Author) commented Jun 7, 2019

Looks like variable input size is the key here; e.g., if we bin the varying input dimension to multiples of 320 (padding it), memory growth is much less severe.

vpirogov (Member) commented Jun 7, 2019

@lopuhin, all the memory management for Intel MKL-DNN happens on the application side. The observed behavior is likely the result of PyTorch caching Intel MKL-DNN primitives for performance reasons. I don't know whether PyTorch provides any control over that.

vpirogov (Member) commented Jun 7, 2019

@Jianhui-Li, can you please help with this question from the PyTorch side?

vpirogov added the integration label (Issues with integrating the library into applications) and removed the question label Jun 7, 2019

lopuhin (Author) commented Jun 7, 2019

Thank you for the quick answer @vpirogov 👍
For reference, my PyTorch version is 1.1 (from wheels), and I can create a small self-contained example if needed.

vpirogov (Member) commented Jun 7, 2019

Thanks. I do not think an example is necessary, as the reason for the observed behavior is pretty clear. PyTorch caches Intel MKL-DNN primitives to amortize the cost of primitive creation. When you change the input dimensions, the application has to create new Intel MKL-DNN primitives. If this happens a lot, you will end up with many primitives in the cache and observe growth in memory consumption.

Let's see what @Jianhui-Li has to say about the controls PyTorch provides over the cache behavior.
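
As a rough illustration of that mechanism, here is a minimal Python sketch, not PyTorch's actual implementation; create_primitive is a hypothetical stand-in for the expensive MKL-DNN primitive construction:

    # Illustrative sketch only (not PyTorch's real code): a primitive cache
    # keyed by input shape. Every previously unseen shape adds an entry, so
    # workloads with highly variable shapes keep growing the cache and memory.
    primitive_cache = {}

    def create_primitive(input_shape, weight_shape):
        # stand-in for the expensive MKL-DNN primitive creation step
        return ("conv_primitive", tuple(input_shape), tuple(weight_shape))

    def get_conv_primitive(input_shape, weight_shape):
        key = (tuple(input_shape), tuple(weight_shape))
        if key not in primitive_cache:
            primitive_cache[key] = create_primitive(input_shape, weight_shape)
        return primitive_cache[key]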

@Jianhui-Li

@lopuhin Thanks for the input. From your description, I agree with Vadim that this is most likely due to caching the MKL-DNN primitives created for the varying input sizes. Right now we don't provide control over the size of the primitive cache, since having the framework implicitly enforce a memory limit might not be the best overall solution. I think that binning the input size as you did is the solution most users would prefer, since it gives you control over the trade-off between memory use and performance gain. The framework may need to detect the varying-input-size scenario, warn about the extra memory usage it causes, and guide the user toward binning the input size. Your feedback is welcome.

lopuhin (Author) commented Jun 7, 2019

Thank you @Jianhui-Li, this makes sense. You are right that binning is a workaround, although it is not ideal because it hurts performance a bit and still results in higher memory usage than without the cache. Maybe giving the user the ability to clear the cache, similar to torch.cuda.empty_cache(), would be an even nicer option for workloads where caching does not provide much benefit and lower memory usage is more important.
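
As a rough sketch of how such an API might be used if it existed (torch.mkldnn.empty_cache() is hypothetical here, named by analogy with torch.cuda.empty_cache(); the psutil dependency and the 8 GB threshold are assumptions):

    import os

    import psutil  # assumption: psutil is available to read the process RSS

    RSS_LIMIT_BYTES = 8 * 1024 ** 3  # example threshold, not a recommendation

    def maybe_empty_mkldnn_cache(torch_module):
        """Clear the primitive cache once resident memory crosses a threshold.

        torch_module.mkldnn.empty_cache() is a hypothetical API, named by
        analogy with torch.cuda.empty_cache(); it does not exist in PyTorch 1.1.
        """
        rss = psutil.Process(os.getpid()).memory_info().rss
        if rss > RSS_LIMIT_BYTES:
            torch_module.mkldnn.empty_cache()  # hypothetical call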

@Jianhui-Li

@lopuhin Your input is important to us. A few more questions: with torch.cuda.empty_cache(), I guess you would monitor the memory usage and clear the cache if it goes above a certain threshold? After binning the input size, what memory usage did you observe? Batch size is a determining factor in how large the caching overhead looks, since a high batch size can amortize the primitive creation cost. In your usage, what is the batch size?

lopuhin (Author) commented Jun 7, 2019

Thank you for the encouragement, @Jianhui-Li.

> With torch.cuda.empty_cache(), I guess you would monitor the memory usage and clear the cache if it goes above a certain threshold?

My first thought was to check what would happen if we cleared the cache after each request; if the performance penalty is small, this looks easiest. But what you propose would be even better.

> After binning the input size, what memory usage did you observe?

Please see benchmark results below

> Batch size is a determining factor in how large the caching overhead looks, since a high batch size can amortize the primitive creation cost. In your usage, what is the batch size?

I see, thanks for the insight. In my case the batch size is 1, and raising it would be challenging, as we'd like to keep latency reasonably low, and also because of the highly variable input size.

Here are the benchmark results. Regarding memory growth, note that there are other parts of the system besides the image part, so some of the memory growth is due to them.

These benchmarks are run over 1000 items, which is close to steady state in terms of memory usage.

| variant | mean image time, ms | memory growth, kb |
| --- | --- | --- |
| flexible binning | 349.0 | 3,919,380 |
| no binning, no mkldnn | 395.5 | 1,225,264 |
| no binning | 336.4 | 6,756,952 |
| binning to multiple of 320 | 381.1 | 3,388,052 |

Flexible binning is this:

# img.size[1] is the variable image height; choose the padding granularity,
# then the height is rounded up to a multiple of pad_to before inference
if img.size[1] < 640:
    pad_to = 40
elif img.size[1] < 1280:
    pad_to = 80
elif img.size[1] < 2560:
    pad_to = 160
else:
    pad_to = 320

Max image size is 7680, width is always 320.
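
Putting the rule above together with the padding step, here is a minimal sketch of flexible binning; the PIL-based implementation and the zero padding at the bottom are assumptions, not necessarily the exact production code:

    import math

    from PIL import Image

    def bin_height(img):
        """Round the variable height up to a multiple of pad_to so that fewer
        distinct input shapes reach the network and fewer MKL-DNN primitives
        get created and cached."""
        w, h = img.size  # width is fixed at 320 in this workload
        if h < 640:
            pad_to = 40
        elif h < 1280:
            pad_to = 80
        elif h < 2560:
            pad_to = 160
        else:
            pad_to = 320
        new_h = math.ceil(h / pad_to) * pad_to
        padded = Image.new(img.mode, (w, new_h))  # zero-filled canvas
        padded.paste(img, (0, 0))                 # original image at the top
        return padded

The coarser pad_to is, the fewer distinct shapes (and cached primitives) there are, at the cost of more wasted computation on the padded region.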

jgong5 commented Jun 10, 2019

@lopuhin You mentioned ResNet. Is it ResNet-50 that your numbers are based on? Thanks.

lopuhin (Author) commented Jun 10, 2019

@jgong5 the numbers are based on ResNet34 without the last block, like this:

    def forward(self, x):
        base = self.base   # this is resnet34
        x = base.conv1(x)
        x = base.bn1(x)
        x = base.relu(x)
        x = base.maxpool(x)

        x = base.layer1(x)
        x = base.layer2(x)
        x = base.layer3(x)
        # x = base.layer4(x)  # this block is omitted
        return x
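
For reference, a sketch of how such a truncated backbone could be assembled from torchvision's resnet34; the pretrained flag and the example input size are assumptions:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    base = resnet34(pretrained=False)  # whether to load pretrained weights is left open
    backbone = nn.Sequential(
        base.conv1, base.bn1, base.relu, base.maxpool,
        base.layer1, base.layer2, base.layer3,  # layer4 omitted, as above
    )

    with torch.no_grad():
        features = backbone(torch.randn(1, 3, 320, 640))  # example 320x640 input
    print(features.shape)  # torch.Size([1, 256, 20, 40])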

@mingxiaoh

@lopuhin May I know the benchmark URL you used? We would like to reproduce the problem and do further analysis. Thanks.

lopuhin (Author) commented Jun 12, 2019

@mingxiaoh I re-created workload similar to above benchmark in this gist: https://gist.github.com/lopuhin/255992a255810407e2c42a5513e20a13

Here are the results I got, including some environment info (OS is Ubuntu 18.04):

$ python --version
Python 3.6.7
$ pip freeze | grep torch
torch==1.1.0
torchvision==0.3.0
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name	: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
$
$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 2,336,028
n=200 memory growth (kb): 4,057,764
n=300 memory growth (kb): 5,268,144
n=400 memory growth (kb): 5,382,780
n=500 memory growth (kb): 6,110,500
n=600 memory growth (kb): 6,338,308
n=700 memory growth (kb): 6,701,956
n=800 memory growth (kb): 6,924,616
n=900 memory growth (kb): 7,267,328
time: mean=0.456 s, p50=0.132 s, p95=2.411 s
memory (kb): 240,096 initial, 7,411,216 growth
$
$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 550,940
n=200 memory growth (kb): 622,592
n=300 memory growth (kb): 669,072
n=400 memory growth (kb): 688,380
n=500 memory growth (kb): 688,380
n=600 memory growth (kb): 688,380
n=700 memory growth (kb): 688,380
n=800 memory growth (kb): 688,380
n=900 memory growth (kb): 703,072
time: mean=0.537 s, p50=0.151 s, p95=2.926 s
memory (kb): 225,784 initial, 703,072 growth
$
$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 1,753,212
n=200 memory growth (kb): 2,347,940
n=300 memory growth (kb): 2,846,872
n=400 memory growth (kb): 3,077,376
n=500 memory growth (kb): 3,467,264
n=600 memory growth (kb): 3,683,448
n=700 memory growth (kb): 3,838,668
n=800 memory growth (kb): 4,089,364
n=900 memory growth (kb): 4,376,568
time: mean=0.482 s, p50=0.138 s, p95=2.420 s
memory (kb): 240,036 initial, 4,388,468 growth
$
$ python mkldnn_489.py --n 1000 --bin
Running with mkldnn=False bin=True
heights: mean=1031, p50=284 p95=5561 max=7680
n=100 memory growth (kb): 549,820
n=200 memory growth (kb): 697,836
n=300 memory growth (kb): 721,840
n=400 memory growth (kb): 721,844
n=500 memory growth (kb): 725,972
n=600 memory growth (kb): 817,852
n=700 memory growth (kb): 817,852
n=800 memory growth (kb): 817,852
n=900 memory growth (kb): 817,852
time: mean=0.569 s, p50=0.161 s, p95=2.925 s
memory (kb): 225,824 initial, 817,852 growth

@mingxiaoh

@lopuhin We can successfully reproduce your problem and are investigating it now; we will come back to you when the root cause is clear. Thanks.

lopuhin (Author) commented Jun 14, 2019

Great, thank you @mingxiaoh 👍

@mingxiaoh

@lopuhin This is due to caching of the MKL-DNN primitives created for the varying input sizes. If you would like to reduce the memory usage, you can change the cache capacity to a smaller value (see the diff below).

(py3-intel-chainer) [sys_dltest2@mlt-skx139 ideep]$ git diff include/ideep/lru_cache.hpp
diff --git a/include/ideep/lru_cache.hpp b/include/ideep/lru_cache.hpp
index e900e5b..68f6516 100644
--- a/include/ideep/lru_cache.hpp
+++ b/include/ideep/lru_cache.hpp
@@ -294,7 +294,7 @@ private:
size_type capacity_;
};

-template <class value_t, size_t capacity = 1024, class key_t = std::string>
+template <class value_t, size_t capacity = 128, class key_t = std::string>
class computation_cache {
public:
using iterator = typename lru_cache<key_t, value_t>::iterator;
@@ -346,7 +346,7 @@ public:
};

// TODO: mutex it
-template <class value_t, size_t capacity = 1024, class key_t = std::string>
+template <class value_t, size_t capacity = 128, class key_t = std::string>
class computation_gcache {
public:
using iterator = typename lru_cache<key_t, value_t>::iterator;

Test result:
case1: capacity = 1024
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 2378200
n=200 memory growth (kb): 4153520
n=300 memory growth (kb): 5453900
n=400 memory growth (kb): 5684464
n=500 memory growth (kb): 6531688
n=600 memory growth (kb): 6787628
n=700 memory growth (kb): 6990048
n=800 memory growth (kb): 7342368
n=900 memory growth (kb): 7627728
time: mean=0.3482012832928449 s, p50=0.11159232584759593 s, p95=1.8258836238179357 s
memory (kb): 242532 initial, 7743068 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 449452
n=200 memory growth (kb): 490796
n=300 memory growth (kb): 490796
n=400 memory growth (kb): 490796
n=500 memory growth (kb): 490796
n=600 memory growth (kb): 495716
n=700 memory growth (kb): 495716
n=800 memory growth (kb): 495716
n=900 memory growth (kb): 495716
time: mean=0.455472462256439 s, p50=0.1348842727020383 s, p95=2.4116809139493838 s
memory (kb): 202656 initial, 495716 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1816980
n=200 memory growth (kb): 2298620
n=300 memory growth (kb): 2823092
n=400 memory growth (kb): 2969092
n=500 memory growth (kb): 3506688
n=600 memory growth (kb): 3674492
n=700 memory growth (kb): 3746492
n=800 memory growth (kb): 3964016
n=900 memory growth (kb): 4184328
time: mean=0.35401891875546426 s, p50=0.11079582339152694 s, p95=1.8085481527261436 s
memory (kb): 244468 initial, 4184328 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 2396348
n=200 memory growth (kb): 4185060
n=300 memory growth (kb): 5479424
n=400 memory growth (kb): 5547464
n=500 memory growth (kb): 6505552
n=600 memory growth (kb): 6875732
n=700 memory growth (kb): 7207476
n=800 memory growth (kb): 7496696
n=900 memory growth (kb): 7774860
time: mean=0.3428835664289072 s, p50=0.10954237962141633 s, p95=1.8142171191982914 s
memory (kb): 245060 initial, 7805132 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx054 mingxiao_test]$

case2: capacity = 128
[sys_dltest2@mlt-skx139 ~]$ source ~/pythonenv/py3-intel-chainer/bin/activate
(py3-intel-chainer) [sys_dltest2@mlt-skx139 ~]$
(py3-intel-chainer) [sys_dltest2@mlt-skx139 ~]$ cd Documents/mingxiao_test/
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1238836
n=200 memory growth (kb): 1501184
n=300 memory growth (kb): 1721964
n=400 memory growth (kb): 1753388
n=500 memory growth (kb): 2130836
n=600 memory growth (kb): 2139412
n=700 memory growth (kb): 2241712
n=800 memory growth (kb): 2301144
n=900 memory growth (kb): 2519052
time: mean=0.342514000098221 s, p50=0.11045438100700267 s, p95=1.7644889014918574 s
memory (kb): 201740 initial, 2519052 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000
Running with mkldnn=False bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 469264
n=200 memory growth (kb): 602688
n=300 memory growth (kb): 602688
n=400 memory growth (kb): 611324
n=500 memory growth (kb): 611324
n=600 memory growth (kb): 611324
n=700 memory growth (kb): 611324
n=800 memory growth (kb): 611324
n=900 memory growth (kb): 633296
time: mean=0.44288936752011066 s, p50=0.13412621649331413 s, p95=2.358341607873444 s
memory (kb): 158304 initial, 647356 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn --bin
Running with mkldnn=True bin=True
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1259300
n=200 memory growth (kb): 1579460
n=300 memory growth (kb): 1693576
n=400 memory growth (kb): 1747780
n=500 memory growth (kb): 2023436
n=600 memory growth (kb): 2087344
n=700 memory growth (kb): 2229264
n=800 memory growth (kb): 2245432
n=900 memory growth (kb): 2491216
time: mean=0.35734817950008435 s, p50=0.11491244149510749 s, p95=1.7963337316439671 s
memory (kb): 201160 initial, 2491216 growth
(py3-intel-chainer) [sys_dltest2@mlt-skx139 mingxiao_test]$ python mkldnn_489.py --n 1000 --mkldnn
Running with mkldnn=True bin=False
heights: mean=1031.131, p50=284.0, p95=5561.399999999995, max=7680
n=100 memory growth (kb): 1296192
n=200 memory growth (kb): 1537512
n=300 memory growth (kb): 1897888
n=400 memory growth (kb): 1904016
n=500 memory growth (kb): 2241532
n=600 memory growth (kb): 2252640
n=700 memory growth (kb): 2397764
n=800 memory growth (kb): 2425976
n=900 memory growth (kb): 2555936
time: mean=0.34390410209848776 s, p50=0.11097242051619105 s, p95=1.7738525233697133 s
memory (kb): 200516 initial, 2555936 growth

lopuhin (Author) commented Jun 14, 2019

Wow, these are great results @mingxiaoh , thank you!

jgong5 commented Jun 15, 2019

@lopuhin We will provide an environment variable to control this threshold so that users don't have to change the source code for it.
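
As a hedged sketch of how such an environment variable might be used once it exists (the variable name LRU_CACHE_CAPACITY and the value are assumptions here, since no name was announced in this thread):

    import os

    # Hypothetical usage: the capacity has to be set before the library reads it,
    # i.e. before torch is imported in this process (or in the shell environment).
    os.environ.setdefault("LRU_CACHE_CAPACITY", "128")  # assumed variable name

    import torch  # imported after setting the environment variable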

lopuhin (Author) commented Jun 15, 2019

@jgong5 this would be perfect, thank you!

shahsahilj commented Jun 18, 2019

@jgong5, I am having a similar issue with TensorFlow Keras. Will this same solution work for that as well?
The detailed issue is listed at: https://stackoverflow.com/questions/56639787/how-to-fix-memory-leak-with-variable-tensors-in-tensorflow-mkl-library

@vpirogov (Member)

@shahsahilj, the same approach will definitely work for TensorFlow. Note that the cache is implemented on the TensorFlow side, not in Intel MKL-DNN, so the controls for the cache behavior should be implemented there as well.

@agramesh1, could you please comment on the cache controls in Tensorflow?

jgong5 commented Jun 18, 2019

@shahsahilj In PyTorch we are using an LRU cache to control the capacity. You may try a similar approach in TF as well.
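
As a minimal, self-contained illustration of the LRU idea (a sketch only, not the actual ideep or TF implementation):

    from collections import OrderedDict

    class LRUCache:
        """Once the cache holds `capacity` entries, inserting a new one evicts
        the least recently used entry, which bounds memory growth when input
        shapes (and hence cache keys) vary a lot."""

        def __init__(self, capacity=128):
            self.capacity = capacity
            self._store = OrderedDict()

        def get(self, key, default=None):
            if key not in self._store:
                return default
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]

        def put(self, key, value):
            if key in self._store:
                self._store.move_to_end(key)
            self._store[key] = value
            if len(self._store) > self.capacity:
                self._store.popitem(last=False)  # evict the least recently used entry

With the capacity bounded, memory stops growing once the set of observed shapes exceeds the capacity, at the cost of re-creating primitives for evicted shapes.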

agramesh1 commented Jun 19, 2019

@shahsahilj The primitives cache in TF is an LRU cache that limits memory growth. This should be enabled in TF 1.14, which was just released.

@vpirogov (Member)

@jgong5, @agramesh1, does TF provide a way to control the cache size or turn it off completely?

@shahsahilj

@agramesh1 I tried using the cache in TF 1.14. I was able to recompile it with a lower cache capacity and that helped slightly, but I could not get it to work as well as TF 1.13 with MKL that Anaconda has in its main channel. Would it be possible to know what config options are used to make the most optimal build of TF?

vpirogov closed this as completed Aug 6, 2019