
Broadcasting is consuming a lot of memory in Gorgonnx/Gorgonia #68

Closed
owulveryck opened this issue May 28, 2019 · 7 comments
Labels
Gorgonia / Gorgonnx This issue is related to the Gorgonia backend

Comments

@owulveryck
Owner

owulveryck commented May 28, 2019

Bench

I've created this simple benchmark with the MNIST model to analyze the behavior of the code:

package onnx_test

import (
        "testing"

        "github.com/owulveryck/onnx-go"
        "github.com/owulveryck/onnx-go/backend/x/gorgonnx"
        "github.com/owulveryck/onnx-go/internal/examples/mnist"
        "gorgonia.org/tensor"
)

func BenchmarkUnmarshalBinary(b *testing.B) {
        input := tensor.New(tensor.WithShape(1, 1, 28, 28), tensor.Of(tensor.Float32))
        for n := 0; n < b.N; n++ {
                // Create a backend receiver
                backend := gorgonnx.NewGraph()
                // Create a model and set the execution backend
                model := onnx.NewModel(backend)

                // Decode it into the model
                err := model.UnmarshalBinary(mnist.GetMnist())
                if err != nil {
                        b.Fatal(err)
                }
                // Set the first input; the number depends on the model
                model.SetInput(0, input)
                err = backend.Run()
                if err != nil {
                        b.Fatal(err)
                }
        }
}

Running this with go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s generates two profile files to analyze with the Go profiler.
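
For reference, both profiles can be opened in pprof's interactive web UI (example commands, assuming a Go toolchain recent enough for pprof's -http flag, i.e. Go 1.10+):

go tool pprof -http=:8080 profile.out
go tool pprof -http=:8080 memprofile.out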

CPU

The result for the CPU is displayed here:
[MNIST CPU flamegraph]

There are possible enhancements, but nothing obvious.

Memory

The result for the memory usage is more interesting. It shows that the repeatOp of Gorgonia is using a lot of memory. The repeatOp is the foundation of broadcasting.

[memory profile screenshot]

This op seems to copy a lot of data:

[profile screenshot]
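
For context, broadcasting in Gorgonia is implemented by materializing repeated copies of the smaller operand. The following is a minimal sketch (illustrative shapes, assuming the package-level tensor.Repeat helper from gorgonia.org/tensor) of what the repeatOp does when a (1, 3) tensor is broadcast against a (4, 3) one:

package main

import (
        "fmt"

        "gorgonia.org/tensor"
)

func main() {
        // A (1, 3) operand that has to be broadcast against a (4, 3) one.
        bias := tensor.New(
                tensor.WithShape(1, 3),
                tensor.WithBacking([]float32{1, 2, 3}),
        )

        // Broadcasting by repetition: four physical copies along axis 0.
        // The repeated data is copied rather than viewed, which is why the
        // repeatOp shows up so prominently in the memory profile.
        repeated, err := tensor.Repeat(bias, 0, 4)
        if err != nil {
                panic(err)
        }
        fmt.Println(repeated.Shape()) // (4, 3)
}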

gorgonia.Tensor

The analysis points to this function from the tensor package as being involved in the extra memory consumption:

https://github.com/gorgonia/tensor/blob/8eeece33868236224d51e7362e36a68642870bd2/array.go#L34-L51

Especially this call to val.Interface():

	return array{
		Header: hdr,
		t:      t,
		v:      val.Interface(),
	}

According to the comment, this field is not even required by the array:

// array is the underlying generic array.
type array struct {
	storage.Header             // the header - the Go representation (a slice)
	t              Dtype       // the element type
	v              interface{} // an additional reference to the underlying slice. This is not strictly necessary, but does improve upon anything that calls .Data()
}

On top of that, the reflect package from the standard library carries a TODO pointing at a possible enhancement in the packEface function (packEface converts v to the empty interface):

		if v.flag&flagAddr != 0 {
			// TODO: pass safe boolean from valueInterface so
			// we don't need to copy if safe==true?
			c := unsafe_New(t)
			typedmemmove(t, c, ptr)
			ptr = c
		}

The safe flag is true when calling the Interface() function:

// Interface returns v's current value as an interface{}.
// It is equivalent to:
//	var i interface{} = (v's underlying value)
// It panics if the Value was obtained by accessing
// unexported struct fields.
func (v Value) Interface() (i interface{}) {
	return valueInterface(v, true)
}

This suggests that avoiding the copy would significantly improve performance.
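
As an illustration, the hypothetical micro-benchmark below (not part of the repository; names are made up) exercises the same copy path: calling Interface() on an addressable reflect.Value goes through the flagAddr branch quoted above, so each call copies the value before boxing it.

package reflectcopy

import (
        "reflect"
        "testing"
)

// BenchmarkAddressableInterface measures Interface() on an addressable Value,
// which is the situation the tensor package is in when wrapping its backing
// slice: the flagAddr branch copies the value before boxing it.
func BenchmarkAddressableInterface(b *testing.B) {
        backing := make([]float32, 28*28)
        v := reflect.ValueOf(&backing).Elem() // Elem() of a pointer is addressable

        b.ReportAllocs()
        for n := 0; n < b.N; n++ {
                _ = v.Interface()
        }
}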

cc @chewxy

@owulveryck owulveryck added the Gorgonia / Gorgonnx This issue is related to the Gorgonia backend label May 28, 2019
@owulveryck owulveryck mentioned this issue May 29, 2019
@owulveryck
Owner Author

I just made a quick test by "hacking" the tensor package.
I made a "lazy initialization" of the array: the value is only populated on a call to Data(), which does not happen often.
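
A minimal, self-contained sketch of that idea (field and method names are illustrative, not the actual tensor internals):

package lazydemo

import "reflect"

// lazyArray defers the expensive val.Interface() boxing until Data() is
// actually called, instead of paying for it on every array construction.
type lazyArray struct {
        val reflect.Value // the backing slice, kept as a reflect.Value
        v   interface{}   // cached interface{} value; nil until the first Data() call
}

// Data lazily materializes and caches the interface{} representation.
func (a *lazyArray) Data() interface{} {
        if a.v == nil {
                a.v = a.val.Interface()
        }
        return a.v
}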

The results look promising:

Normal bench:

➜  onnx-go git:(benchmarks) ✗  go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          2000          10688594 ns/op         3906741 B/op      67107 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   22.620s

bench with the hack:

➜  onnx-go git:(benchmarks) ✗  go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          3000           6136003 ns/op         2642474 B/op      27664 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   19.169s

@owulveryck
Owner Author

This commit to the tensor package drastically improves performance and memory consumption.

newTensor

However, I will keep this issue open for now to investigate the broadcasting mechanism further.

@owulveryck
Owner Author

owulveryck commented May 30, 2019

In Gorgonia, the broadcast mechanism is based on the repeatOp, which in turn triggers a call to Repeat(...) in the Dense implementation of the tensor package.
This mechanism calls copyDenseSliced many times inside two nested for loops. A single call to copyDenseSliced creates two new objects here:

d := dst.arr().slice(dstart, dend)
s := src.arr().slice(sstart, send)

Within these loops we create i × j × 2 temporary slice objects.
We could reduce the number of allocations and the garbage-collection pressure by extracting the creation of the s and d slices out of the copy function.
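
An illustrative sketch of that direction (not the actual tensor code; names and shapes are made up): obtain the backing data once, and let the inner loop only compute offsets and copy, so no per-iteration slice objects are built.

package copydemo

// copyBlocks copies outer*inner blocks of `block` contiguous elements from src
// to dst. The inner loop only does offset arithmetic on the two backing slices
// obtained by the caller, so no per-iteration array objects (the d and s
// values above) are constructed.
func copyBlocks(dst, src []float32, block, outer, inner int) {
        for i := 0; i < outer; i++ {
                for j := 0; j < inner; j++ {
                        start := (i*inner + j) * block
                        copy(dst[start:start+block], src[start:start+block])
                }
        }
}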

@owulveryck
Owner Author

With PR 43 from the tensor package, the results are now:

➜  onnx-go git:(benchmarks) ✗ go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          3000           4208506 ns/op         2042788 B/op      18273 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   13.320s

Compared with the initial investigation of this issue, the performance numbers are:

benchmark                      old ns/op     new ns/op     delta
BenchmarkUnmarshalBinary-4     9457554       4528740       -52.12%

benchmark                      old allocs     new allocs     delta
BenchmarkUnmarshalBinary-4     67120          18272          -72.78%

benchmark                      old bytes     new bytes     delta
BenchmarkUnmarshalBinary-4     3910542       2042637       -47.77%
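
For reference, a comparison table in this format can be produced with the benchcmp tool (golang.org/x/tools/cmd/benchcmp), assuming the old and new go test -bench outputs were saved to files:

benchcmp old.txt new.txt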

Once the PR is merged, that will be enough to close this issue

@owulveryck
Owner Author

Closed thanks to PR #43 of the tensor package

@owulveryck
Owner Author

I am reopening this issue: on neural networks involving small tensors broadcasting is OK, but on bigger tensors it is still too slow.

@owulveryck owulveryck reopened this Jun 11, 2019
@owulveryck
Owner Author

PR 299 from Gorgonia should improve things
