
Broadcasting is consuming a lot of memory in Gorgonnx/Gorgonia #68

Closed
owulveryck opened this issue May 28, 2019 · 7 comments
Labels
Gorgonia / Gorgonnx This issue is related to the Gorgonia backend

Comments

@owulveryck
Owner

owulveryck commented May 28, 2019

Bench

I've created this simple benchmark with the MNIST model to analyze the behavior of the code:

package onnx_test

import (
        "testing"

        "github.com/owulveryck/onnx-go"
        "github.com/owulveryck/onnx-go/backend/x/gorgonnx"
        "github.com/owulveryck/onnx-go/internal/examples/mnist"
        "gorgonia.org/tensor"
)

func BenchmarkUnmarshalBinary(b *testing.B) {
        input := tensor.New(tensor.WithShape(1, 1, 28, 28), tensor.Of(tensor.Float32))
        for n := 0; n < b.N; n++ {
                // Create a backend receiver
                backend := gorgonnx.NewGraph()
                // Create a model and set the execution backend
                model := onnx.NewModel(backend)

                // Decode it into the model
                err := model.UnmarshalBinary(mnist.GetMnist())
                if err != nil {
                        b.Fatal(err)
                }
                // Set the first input; the number depends on the model
                model.SetInput(0, input)
                err = backend.Run()
                if err != nil {
                        b.Fatal(err)
                }
        }
}

Running this with go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s generates two profile files to analyze with the Go profiler.
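
For reference, both profiles can be opened in pprof's interactive web UI (example commands, assuming a Go toolchain recent enough for pprof's -http flag, i.e. Go 1.10+):

go tool pprof -http=:8080 profile.out
go tool pprof -http=:8080 memprofile.out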

CPU

The result for the CPU is displayed here:
[MNIST CPU flamegraph]

There are possible enhancements, but nothing obvious.

Memory

The result for the memory usage is more interesting. It shows that the repeatOp of Gorgonia is using a lot of memory. The repeatOp is the foundation of broadcasting.

[memory profile screenshot]

This op seems to copy a lot of data:

[profile screenshot]
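
For context, broadcasting in Gorgonia is implemented by materializing repeated copies of the smaller operand. The following is a minimal sketch (illustrative shapes, assuming the package-level tensor.Repeat helper from gorgonia.org/tensor) of what the repeatOp does when a (1, 3) tensor is broadcast against a (4, 3) one:

package main

import (
        "fmt"

        "gorgonia.org/tensor"
)

func main() {
        // A (1, 3) operand that has to be broadcast against a (4, 3) one.
        bias := tensor.New(
                tensor.WithShape(1, 3),
                tensor.WithBacking([]float32{1, 2, 3}),
        )

        // Broadcasting by repetition: four physical copies along axis 0.
        // The repeated data is copied rather than viewed, which is why the
        // repeatOp shows up so prominently in the memory profile.
        repeated, err := tensor.Repeat(bias, 0, 4)
        if err != nil {
                panic(err)
        }
        fmt.Println(repeated.Shape()) // (4, 3)
}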

gorgonia.Tensor

The analysis points to this function from the tensor package as being involved in the extra memory consumption:

https://github.com/gorgonia/tensor/blob/8eeece33868236224d51e7362e36a68642870bd2/array.go#L34-L51

Especially this call to val.Interface():

	return array{
		Header: hdr,
		t:      t,
		v:      val.Interface(),
	}

According to the comment, this field is not even required by the array:

// array is the underlying generic array.
type array struct {
	storage.Header             // the header - the Go representation (a slice)
	t              Dtype       // the element type
	v              interface{} // an additional reference to the underlying slice. This is not strictly necessary, but does improve upon anything that calls .Data()
}

On top of that, the reflect package from the standard library carries a TODO pointing at a possible enhancement in the packEface function (packEface converts v to the empty interface):

		if v.flag&flagAddr != 0 {
			// TODO: pass safe boolean from valueInterface so
			// we don't need to copy if safe==true?
			c := unsafe_New(t)
			typedmemmove(t, c, ptr)
			ptr = c
		}

The safe flag is true when calling the Interface() function:

// Interface returns v's current value as an interface{}.
// It is equivalent to:
//	var i interface{} = (v's underlying value)
// It panics if the Value was obtained by accessing
// unexported struct fields.
func (v Value) Interface() (i interface{}) {
	return valueInterface(v, true)
}

This suggests that avoiding the copy would significantly improve performance.
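
As an illustration, the hypothetical micro-benchmark below (not part of the repository; names are made up) exercises the same copy path: calling Interface() on an addressable reflect.Value goes through the flagAddr branch quoted above, so each call copies the value before boxing it.

package reflectcopy

import (
        "reflect"
        "testing"
)

// BenchmarkAddressableInterface measures Interface() on an addressable Value,
// which is the situation the tensor package is in when wrapping its backing
// slice: the flagAddr branch copies the value before boxing it.
func BenchmarkAddressableInterface(b *testing.B) {
        backing := make([]float32, 28*28)
        v := reflect.ValueOf(&backing).Elem() // Elem() of a pointer is addressable

        b.ReportAllocs()
        for n := 0; n < b.N; n++ {
                _ = v.Interface()
        }
}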

cc @chewxy

@owulveryck owulveryck added the Gorgonia / Gorgonnx This issue is related to the Gorgonia backend label May 28, 2019
@owulveryck owulveryck mentioned this issue May 29, 2019
@owulveryck
Owner Author

I just made a quick test by "hacking" the tensor package.
I made a "lazy initialization" of the array: the value is only populated on a call to Data(), which does not happen often.
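
A minimal, self-contained sketch of that idea (field and method names are illustrative, not the actual tensor internals):

package lazydemo

import "reflect"

// lazyArray defers the expensive val.Interface() boxing until Data() is
// actually called, instead of paying for it on every array construction.
type lazyArray struct {
        val reflect.Value // the backing slice, kept as a reflect.Value
        v   interface{}   // cached interface{} value; nil until the first Data() call
}

// Data lazily materializes and caches the interface{} representation.
func (a *lazyArray) Data() interface{} {
        if a.v == nil {
                a.v = a.val.Interface()
        }
        return a.v
}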

The results look promising:

Normal bench:

➜  onnx-go git:(benchmarks) ✗  go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          2000          10688594 ns/op         3906741 B/op      67107 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   22.620s

bench with the hack:

➜  onnx-go git:(benchmarks) ✗  go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          3000           6136003 ns/op         2642474 B/op      27664 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   19.169s

@owulveryck
Owner Author

This commit to the tensor package drastically improves performance and memory consumption.

newTensor

However, I will keep this issue open for now to investigate the broadcasting mechanism further.

@owulveryck
Owner Author

owulveryck commented May 30, 2019

In Gorgonia, the broadcast mechanism is based on the repeatOp, which in turn triggers a call to Repeat(...) in the Dense implementation of the tensor package.
This mechanism calls copyDenseSliced many times inside two nested for loops. A single call to copyDenseSliced creates two new objects here:

d := dst.arr().slice(dstart, dend)
s := src.arr().slice(sstart, send)

Within these loops we create i × j × 2 temporary slice objects.
We could reduce the number of allocations and the garbage-collection pressure by extracting the creation of the s and d slices out of the copy function.
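
An illustrative sketch of that direction (not the actual tensor code; names and shapes are made up): obtain the backing data once, and let the inner loop only compute offsets and copy, so no per-iteration slice objects are built.

package copydemo

// copyBlocks copies outer*inner blocks of `block` contiguous elements from src
// to dst. The inner loop only does offset arithmetic on the two backing slices
// obtained by the caller, so no per-iteration array objects (the d and s
// values above) are constructed.
func copyBlocks(dst, src []float32, block, outer, inner int) {
        for i := 0; i < outer; i++ {
                for j := 0; j < inner; j++ {
                        start := (i*inner + j) * block
                        copy(dst[start:start+block], src[start:start+block])
                }
        }
}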

@owulveryck
Owner Author

With PR 43 from the tensor package, the results are now:

➜  onnx-go git:(benchmarks) ✗ go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -benchtime=10s
goos: darwin
goarch: amd64
pkg: github.com/owulveryck/onnx-go
BenchmarkUnmarshalBinary-4          3000           4208506 ns/op         2042788 B/op      18273 allocs/op
PASS
ok      github.com/owulveryck/onnx-go   13.320s

Compared with the initial investigation of this issue, the performance numbers are:

benchmark                      old ns/op     new ns/op     delta
BenchmarkUnmarshalBinary-4     9457554       4528740       -52.12%

benchmark                      old allocs     new allocs     delta
BenchmarkUnmarshalBinary-4     67120          18272          -72.78%

benchmark                      old bytes     new bytes     delta
BenchmarkUnmarshalBinary-4     3910542       2042637       -47.77%
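
For reference, a comparison table in this format can be produced with the benchcmp tool (golang.org/x/tools/cmd/benchcmp), assuming the old and new go test -bench outputs were saved to files:

benchcmp old.txt new.txt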

Once the PR is merged, that will be enough to close this issue

@owulveryck
Owner Author

Closed thanks to PR #43 of the tensor package

@owulveryck
Owner Author

I am reopening this issue: on neural networks involving small tensors broadcasting is OK, but on bigger tensors it is still too slow.

@owulveryck owulveryck reopened this Jun 11, 2019
@owulveryck
Owner Author

PR 299 from Gorgonia should improve things
