Optimisation in NewUniqueNode may be buggy #384

Closed
owulveryck opened this issue Mar 21, 2020 · 9 comments

@owulveryck
Member

I think this is related to #373 and #375

I made a simple test, as described in #375:

package gorgonia_test

import (
	"testing"

	"gorgonia.org/gorgonia"
)

func TestMean_issue375(t *testing.T) {
	g := gorgonia.NewGraph()

	w0 := gorgonia.NewTensor(g, gorgonia.Float32, 4, gorgonia.WithShape(1, 64, 1, 64), gorgonia.WithName("w0"), gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	result, _ := gorgonia.Mean(w0, 3)
	t.Log(result)
}

After some investigation, the code panics because of a call to WithShape here:

gorgonia/node.go

Lines 210 to 212 in 2d4605d

if !acceptVec && !sameDims && !acceptScalar {
panic(fmt.Sprintf("Node %v, has %d dimensions(Shape: %v). Input shape is %v, which has %d dimensions", n, n.Dims(), n.shape, s, s.Dims()))
}

which is called here:

gorgonia/op.go

Line 202 in 2d4605d

retVal = NewUniqueNode(WithType(retType), WithOp(op), WithChildren(children), In(g), WithShape(s...))

The problem is in the function NewUniqueNode, which calls newNode..., which in turn calls borrowNode as an optimization:

n := borrowNode()

It applies the WithShape function here ...

opt(n)

Therefore, all the checks are performed on a borrowed node that is full of rubbish!

In the example there is only one node to borrow, so the panic always produces the same result; but in a larger codebase the problem could appear sporadically.
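
To illustrate the failure mode, here is a minimal, self-contained sketch of the pooling pattern (not Gorgonia's actual code; borrowNode, returnNode and withShape below are stand-ins):

package main

import "fmt"

// node is a stand-in for gorgonia's *Node; only the field needed for the
// illustration is included.
type node struct {
	shape []int
}

// freeList mimics a borrowNode/returnNode pool: nodes are recycled rather
// than allocated fresh, so any field that isn't reset leaks into the next use.
var freeList []*node

func borrowNode() *node {
	if n := len(freeList); n > 0 {
		nd := freeList[n-1]
		freeList = freeList[:n-1]
		return nd
	}
	return &node{}
}

// returnNode puts the node back WITHOUT clearing its shape.
func returnNode(n *node) { freeList = append(freeList, n) }

// withShape mimics a construction option that validates the new shape
// against the node's current dimensions.
func withShape(shp ...int) func(*node) {
	return func(n *node) {
		if len(n.shape) != 0 && len(n.shape) != len(shp) {
			panic(fmt.Sprintf("node has %d dims, input shape has %d", len(n.shape), len(shp)))
		}
		n.shape = shp
	}
}

func main() {
	// First construction: the node is fresh, so the option applies cleanly.
	a := borrowNode()
	withShape(1, 64, 1, 64)(a)
	returnNode(a)

	// Second construction recycles the same node: the option now validates
	// the new shape against the previous owner's stale 4-dim shape and panics.
	b := borrowNode()
	withShape(64)(b)
}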

@chewxy
Member

chewxy commented Mar 21, 2020

Hmm, I thought I added the garbage cleanup on return to the pool. Let me look into this further.

@owulveryck
Member Author

You're right, the cleanup happens here:

func returnNode(n *Node) {

This is really weird.

@owulveryck
Member Author

owulveryck commented Mar 22, 2020

[Screenshot: debugger view, 2020-03-22 16:58]

As you can see in the debugging screenshot:

  • n's fields have been nilled (n.shape is nil)
  • nd is 4

How is this possible?

Note: the panic message is consistent with this; it reports 4 dimensions and a shape equal to ():

Node Σ[3](%0) :: Tensor-4 float32, has 4 dimensions(Shape: ()). Input shape is (64), which has 1 dimensions [recovered]

@chewxy
Member

chewxy commented Mar 22, 2020

That message means that the Tensor-4 node was defined with a type of dim 4, but the given shape is only of dim (). Let me recreate that.
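
Re-running the check in WithShape with the values from that message makes the outcome concrete (a standalone sketch; the IsColVec/IsRowVec and scalarEquiv stand-ins are simplified, not the library's code):

package main

import "fmt"

func main() {
	nd := 4        // n.Dims(): the node's type is Tensor-4
	s := []int{64} // the inferred shape handed to WithShape

	isVec := len(s) == 2 && (s[0] == 1 || s[1] == 1) // rough stand-in for s.IsColVec() || s.IsRowVec()
	acceptVec := isVec && nd == 1                    // false: s is 1-d and nd is 4
	sameDims := nd == len(s)                         // false: 4 != 1

	scalarEquiv := true // rough stand-in for scalarEquiv(s): true only if every dim is 1
	for _, d := range s {
		if d != 1 {
			scalarEquiv = false
		}
	}
	acceptScalar := nd == 0 && scalarEquiv // false: nd is 4 and s contains 64

	if !acceptVec && !sameDims && !acceptScalar {
		fmt.Printf("panic: node has %d dimensions (Shape: ()), input shape %v has %d dimensions\n", nd, s, len(s))
	}
}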

@chewxy
Member

chewxy commented Mar 22, 2020

The issue is likely in Mean more than anything.

@chewxy
Member

chewxy commented Mar 22, 2020

The issue appears to be in reductionInferShape which is a bit overeager when it comes to squeezing dimensions.

When (1, 64, 1, 64) gets Sum'd over axis 3, there can be multiple interpretations of what that means. For example, you can get (1, 64, 1, 1), which is what ApplyOp expected; you can also get (1, 64, 1), which is what numpy's API generates. reductionInferShape sees (1, 64, 1) and shrinks it to (64), which is a "machine view" of things.
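
To make those interpretations concrete, here is a small library-free sketch of the three shape rules in play (keep the reduced axis as 1, drop it numpy-style, or squeeze every remaining 1); the function names are illustrative only:

package main

import "fmt"

// keepDims keeps the reduced axis with size 1: (1, 64, 1, 64) along 3 -> (1, 64, 1, 1).
func keepDims(shape []int, axis int) []int {
	out := append([]int(nil), shape...)
	out[axis] = 1
	return out
}

// dropAxis removes the reduced axis, numpy-style: (1, 64, 1, 64) along 3 -> (1, 64, 1).
func dropAxis(shape []int, axis int) []int {
	out := append([]int(nil), shape[:axis]...)
	return append(out, shape[axis+1:]...)
}

// squeezeAll additionally removes every remaining size-1 axis: (1, 64, 1) -> (64).
func squeezeAll(shape []int) []int {
	var out []int
	for _, d := range shape {
		if d != 1 {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	in := []int{1, 64, 1, 64}
	fmt.Println(keepDims(in, 3))             // [1 64 1 1]: what ApplyOp expected
	fmt.Println(dropAxis(in, 3))             // [1 64 1]: numpy's behaviour
	fmt.Println(squeezeAll(dropAxis(in, 3))) // [64]: the "machine view" reductionInferShape produces
}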

Some opinions.

The "intuitive" notion should be that a Sum returns (1, 64, 1, 1). The "correct" notion should be (1, 64, 1).

It's intuitive that if you reduce along an axis, you get a scalar (i.e. 1 element) back. However, this is solely due to one interpretation of the semantics of shapes. Examples follow.

Consider the 2D case. A "sum" is essentially \mathbf{a} \times \mathbf{1}', where \mathbf{1} is a vector of 1s with the same shape as \mathbf{a}. Its result is a scalar (i.e. of shape ()).
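
Spelled out for a length-3 row vector (a worked instance added for illustration, not part of the original comment):

\mathbf{a}\,\mathbf{1}' = \begin{pmatrix} a_1 & a_2 & a_3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = a_1 + a_2 + a_3

i.e. a (1 × 3) by (3 × 1) product whose result has shape (1 × 1), a scalar.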

Now consider a 3D case. The 3D case would be a BatchedMatMul in Gorgonia's terms. Now imagine a (2, 3, 4) 3-tensor, and we're summing over the 2nd axis. We're left with two batches of (3-rows of scalars). A row of scalars can be written (3) or (3,1). It depends on your views on linear algebra - is this a subspace we're talking about?

Hence the most generic solution would be to leave it as (2, 3, 1).

A good argument for numpy-style shape inference would be "but you said the sum results in a scalar". Ah, but we haven't yet determined the calculus of shapes, have we? What happens if we append a () to a (2, 3)? What are the rules around this? Do we say that it's (2, 3, ()) and that's equal to (2, 3, 1), or (2, 3)?

I like to think of () as "unit", and _ as void. The question comes down to how we model "unit". Do we treat it as an identity element (à la 0 for addition and 1 for multiplication)?

More Opinions!

I have analysis paralysis. I do not know what choice to make.

(2,3,1) is more generic - it doesn't lock people down to one definition of a "vector space". This, for example, has been quite useful, as I have been doing work in gyrovector spaces for a while.

(2,3) complies with numpy's API. It makes newcomers to Gorgonia feel more at ease.

What We Can Do Now

Fix reductionInferShape, but be prepared to break it for v0.10.0

@owulveryck
Member Author

What do we do with this issue? I'd like to close it, but I do not want to lose your (very valuable) comments.

@chewxy
Member

chewxy commented Mar 23, 2020

The comments are mostly opinion and should not be taken as gospel. We can write a better version on the docs site once the ideas have stabilized. In the meantime, this has gone into my personal org files. So... there.

@wzzhu
Contributor

wzzhu commented Jun 10, 2020

The issue appears to be in reductionInferShape which is a bit overeager when it comes to squeezing dimensions.

What We Can Do Now

Fix reductionInferShape, but be prepared to break it for v0.10.0

Besides the problem in reductionInferShape, WithShape seems to fail whenever a node is created using NewUniqueNode, since newNode borrows a node whose node.shape is initialized to nil, which always makes nd == 0.
Whenever the WithShape option is applied, the shape check therefore always mismatches, unless the inferred shape passed in is () too.

See the flow:

// Max performs a max() on the input and the provided axes.
func Max(a *Node, along ...int) (retVal *Node, err error) {
	if a.IsScalar() {
		// can't max a scalar. Should return error
		return a, nil
	}

	dims := a.Dims()
	if len(along) == 0 {
		along = intRange(0, dims)
	}

	op := newMaxOp(along, dims)

	return ApplyOp(op, a) // <--
}
// ApplyOp is the generic function application - for when no specialization is required
func ApplyOp(op Op, children ...*Node) (retVal *Node, err error) {
	...
	ds := Nodes(children).dimSizers()
	var s tensor.Shape
	if s, err = op.InferShape(ds...); err == nil {
		shapeLogf("inferred shape %v", s)
                // <-- Call NewUniqueNode here
		retVal = NewUniqueNode(WithType(retType), WithOp(op), WithChildren(children), In(g), WithShape(s...))
	} else {
		err = errors.Wrapf(err, "Failed to infer shape. Op: %v", op)
		// retVal = newUniqueNode(withType(retType), withOp(op), withChildren(children), withGraph(g))
	}
	returnDimSizers(ds)
	return
}
// NewUniqueNode creates a new unique node in a graph. If no graph was specified in the construction options then it will just return a graphless node.
func NewUniqueNode(opts ...NodeConsOpt) *Node {
	n := newNode(opts...) // <--
	if n.g == nil {
		return n
	}
	n.fixChildren() // ensure that all the kids are in the graph first

	m := n.g.AddNode(n)
	if n != m {
		returnNode(n)
	}
	m.fixEdges()
	return m
}
func newNode(opts ...NodeConsOpt) *Node {
	n := borrowNode()
	n.dataOn = CPU
	n.id = -1
// <-- n.shape == nil here
	for _, opt := range opts {
		opt(n)
	}
	n.fix()

	incrNN()
	return n
}
// WithShape is a node construction option to initialize a *Node with a particular shape.
// This function panics if the shape's dimensions do not match the specified dimensions of the *Node.
func WithShape(shp ...int) NodeConsOpt {
	s := tensor.Shape(tensor.BorrowInts(len(shp)))
	copy(s, shp)
	f := func(n *Node) {
		nd := n.Dims() // <-- nd == 0 always
		// if nd == 1 && s.IsVector() {
		// 	goto safe
		// }
		isVec := s.IsColVec() || s.IsRowVec()
		acceptVec := (isVec && (nd == 1))
		sameDims := nd == s.Dims()
		acceptScalar := nd == 0 && scalarEquiv(s)

		if !acceptVec && !sameDims && !acceptScalar {
                        // <--panic here!!
			panic(fmt.Sprintf("Node %v, has %d dimensions(Shape: %v). Input shape is %v, which has %d dimensions", n, n.Dims(), n.shape, s, s.Dims()))
		}
		// safe:
		n.shape = s
	}
	return f
}

chewxy pushed a commit that referenced this issue Jun 15, 2020
* Make reductionInferShape conservative to fix #384

reductionInferShape currently doesn't respect along; it aggressively squeezes dimensions. Not only does this affect normal tensor operations, it also sometimes breaks the backprop autodiff algorithm when the network contains BroadcastAdd, resulting in a crash when calling Grad().

The change tries to strictly respect the parameter along, e.g.:
(100, 1) along 0 will reduce to shape (1) instead of ()
(1, 64, 1, 64) along 3 will reduce to (1, 64, 1)
(64, 1, 3, 2) along (2, 3) will reduce to (64, 1).

Fixed unit tests.

* Remove inconsistent dimension for Sum op

After changing reductionType to subtract len(along) for the reduction op, SumOp's dimension needs to be adjusted accordingly. Otherwise SymDiff will crash for Sum in calcBroadcastShape.
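
For reference, a minimal sketch of the conservative rule the commit describes (an illustration, not the actual patch): drop exactly the axes listed in along and leave every other dimension alone.

package main

import "fmt"

// reduceShape drops exactly the axes listed in along, and nothing else.
func reduceShape(shape []int, along ...int) []int {
	drop := make(map[int]bool, len(along))
	for _, ax := range along {
		drop[ax] = true
	}
	out := make([]int, 0, len(shape))
	for i, d := range shape {
		if !drop[i] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	fmt.Println(reduceShape([]int{100, 1}, 0))         // [1] instead of ()
	fmt.Println(reduceShape([]int{1, 64, 1, 64}, 3))   // [1 64 1]
	fmt.Println(reduceShape([]int{64, 1, 3, 2}, 2, 3)) // [64 1]
}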