Advanced memory management for DNN blobs #9389

Closed

Conversation

@dkurt (Member) commented Aug 17, 2017

This pull request changes how memory for DNN blobs is allocated.

There are two allocation methods: an optimal one for host memory that will be processed on the CPU, and a less optimal one for device memory (OpenCL). The latter follows the same reuse-or-create strategy as before (sketched after the notes below), but packs blobs more densely. Achieved results:

| Model           | With PR, host memory | With PR, device memory | Before PR, host / dev | No reusing |
|-----------------|----------------------|------------------------|-----------------------|------------|
| AlexNet         | 2.32MB               | 2.6MB                  | 3.72MB                | 6.14MB     |
| Inception-5h    | 5.2MB                | 8.65MB                 | 15.62MB               | 33.13MB    |
| GoogLeNet       | 5.21MB               | 8.76MB                 | 15.74MB               | 33.43MB    |
| ENet, 512x256   | 23.59MB              | 26.24MB                | 72.22MB               | 137.9MB    |
| SqueezeNet v1.1 | 4.87MB               | 5.07MB                 | 11.71MB               | 20.71MB    |
| ResNet-50       | 9.63MB               | 10.43MB                | 31.51MB               | 66.64MB    |
  • Although only one memory block (host or device) is actually in use at a time, so some host allocations could be skipped, we cannot make that decision at the allocation stage. So, for the OpenCL backend, we allocate both CPU and GPU memory (as before this PR).
  • Outputs with no references (such as the indices of MaxPooling) are allocated too; otherwise the code would be more complicated to maintain.
  • Required changes: Torch's Concat and ConcatTable doesn't use Split layer #9384.
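
For reference, the reuse-or-create idea can be sketched as a greedy best-fit pass over blobs in production order. This is only an illustration under assumed data structures (the names Blob and scheduleBuffers and the best-fit policy are assumptions, not code from this PR):

#include <cstddef>
#include <map>
#include <vector>

struct Blob {
    size_t bytes;    // required allocation size
    int produced;    // index of the producing layer
    int lastRead;    // index of the last layer that reads this blob
    int buffer = -1; // physical buffer assigned by the scheduler
};

// Walk blobs in production order; reuse the smallest freed buffer that
// still fits, otherwise create a new one. Returns each buffer's capacity.
std::vector<size_t> scheduleBuffers(std::vector<Blob>& blobs) {
    std::vector<size_t> capacity;         // size of every physical buffer
    std::multimap<size_t, int> freeList;  // capacity -> free buffer id
    std::vector<bool> released(blobs.size(), false);

    for (size_t i = 0; i < blobs.size(); ++i) {
        // Return to the free list every buffer whose blob is already dead.
        for (size_t j = 0; j < i; ++j) {
            if (!released[j] && blobs[j].lastRead < blobs[i].produced) {
                freeList.emplace(capacity[blobs[j].buffer], blobs[j].buffer);
                released[j] = true;
            }
        }
        // Best fit: the smallest free buffer that is large enough.
        auto it = freeList.lower_bound(blobs[i].bytes);
        if (it != freeList.end()) {
            blobs[i].buffer = it->second;
            freeList.erase(it);
        } else {
            blobs[i].buffer = static_cast<int>(capacity.size());
            capacity.push_back(blobs[i].bytes);
        }
    }
    return capacity; // total footprint is the sum of all capacities
}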

@dkurt (Member, Author) commented Aug 29, 2017

Maximum resident set size according to /usr/bin/time --verbose.
Test: load a network and make a single forward pass.

| Model           | Origin framework   | DNN           |
|-----------------|--------------------|---------------|
| AlexNet         | 974MB (Caffe)      | 744MB (x1.3)  |
| Inception-5h    | 483MB (TensorFlow) | 155MB (x3.11) |
| GoogLeNet       | 435MB (Caffe)      | 187MB (x2.32) |
| ENet, 512x256   | 456MB (Torch)      | 62.4MB (x7.3) |
| SqueezeNet v1.1 | 233MB (Caffe)      | 47.4MB (x4.9) |
| ResNet-50       | 373MB (Caffe)      | 238MB (x1.56) |

TensorFlow script:

import numpy as np
import tensorflow as tf

# Read the frozen graph in binary mode so ParseFromString gets bytes.
with tf.gfile.FastGFile('opencv_extra/testdata/dnn/tensorflow_inception_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Session() as sess:
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')

    # Generate input
    np.random.seed(2701)
    inp = np.random.standard_normal([1, 224, 224, 3]).astype(np.float32)

    # Receive output
    outTensor = sess.graph.get_tensor_by_name('softmax2:0')
    out = sess.run(outTensor, feed_dict={'input:0': inp})

DNN:

#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>

int main(int argc, char** argv) {
  cv::dnn::Net net = cv::dnn::readNetFromTensorflow("opencv_extra/testdata/dnn/tensorflow_inception_graph.pb");
  // Random NCHW input matching the TensorFlow test above.
  cv::Mat input({1, 3, 224, 224}, CV_32FC1);
  cv::randu(input, 0.0f, 1.0f);
  net.setInput(input);
  cv::Mat output = net.forward();
  return 0;
}

Caffe:

#include <string>

#include <caffe/caffe.hpp>

int main(int argc, char** argv) {
  std::string proto = "opencv_extra/testdata/dnn/ResNet-50-deploy.prototxt";
  std::string weights = "opencv_extra/testdata/dnn/ResNet-50-model.caffemodel";

  caffe::Caffe::set_mode(caffe::Caffe::CPU);
  caffe::Net<float>* net = new caffe::Net<float>(proto, caffe::TEST);
  net->CopyTrainedLayersFrom(weights);
  net->Forward();
  return 0;
}

Torch (model is in CPU mode):

require 'nn'

torch.setdefaulttensortype('torch.FloatTensor')

net = torch.load('ENet-model.t7'):float()

input = torch.FloatTensor(torch.LongStorage({1, 3, 256, 512}))  -- contents are uninitialized; only the memory footprint matters here
output = net:forward(input)
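
As an optional cross-check of the numbers above, peak RSS can also be read from inside the process on Linux. This helper is an illustration only (peakResidentKB is not part of any of the tested frameworks); it reads the VmHWM figure, which corresponds to what /usr/bin/time --verbose reports:

#include <fstream>
#include <iostream>
#include <string>

// Peak resident set size in kilobytes, as reported by the Linux kernel.
long peakResidentKB() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line))
        if (line.compare(0, 6, "VmHWM:") == 0)
            return std::stol(line.substr(6)); // "  123456 kB" -> 123456
    return -1;
}

int main() {
    // ... load a network and run a forward pass here ...
    std::cout << "Peak RSS: " << peakResidentKB() << " kB" << std::endl;
    return 0;
}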

@dkurt (Member, Author) commented Oct 24, 2018

Some of the ideas from this PR have been merged in different PRs.
Perhaps we can reuse the proposed memory scheduling approach later, for example as part of a student project.
