Optimisation in NewUniqueNode may be buggy #384

Closed
owulveryck opened this issue Mar 21, 2020 · 9 comments

@owulveryck
Member

I think this is related to #373 and #375

I made a simple test, as described in #375:

package gorgonia_test

import (
	"testing"

	"gorgonia.org/gorgonia"
)

func TestMean_issue375(t *testing.T) {
	g := gorgonia.NewGraph()

	w0 := gorgonia.NewTensor(g, gorgonia.Float32, 4, gorgonia.WithShape(1, 64, 1, 64), gorgonia.WithName("w0"), gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	result, _ := gorgonia.Mean(w0, 3)
	t.Log(result)
}

After some investigation, the code panics because of a call to WithShape here:

gorgonia/node.go

Lines 210 to 212 in 2d4605d

if !acceptVec && !sameDims && !acceptScalar {
panic(fmt.Sprintf("Node %v, has %d dimensions(Shape: %v). Input shape is %v, which has %d dimensions", n, n.Dims(), n.shape, s, s.Dims()))
}

which is called here:

gorgonia/op.go

Line 202 in 2d4605d

retVal = NewUniqueNode(WithType(retType), WithOp(op), WithChildren(children), In(g), WithShape(s...))

The problem is in the function NewUniqueNode, which calls newNode..., which in turn calls borrowNode as an optimization:

n := borrowNode()

It applies the WithShape function here ...

opt(n)

Therefore, all the checks are performed on a borrowed node that is full of rubbish!

In the example there is only one node to borrow, so the panic always produces the same result; but in a larger codebase the problem could appear sporadically.
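
To illustrate the failure mode, here is a minimal, self-contained sketch of the pooling pattern (not Gorgonia's actual code; borrowNode, returnNode and withShape below are stand-ins):

package main

import "fmt"

// node is a stand-in for gorgonia's *Node; only the field needed for the
// illustration is included.
type node struct {
	shape []int
}

// freeList mimics a borrowNode/returnNode pool: nodes are recycled rather
// than allocated fresh, so any field that isn't reset leaks into the next use.
var freeList []*node

func borrowNode() *node {
	if n := len(freeList); n > 0 {
		nd := freeList[n-1]
		freeList = freeList[:n-1]
		return nd
	}
	return &node{}
}

// returnNode puts the node back WITHOUT clearing its shape.
func returnNode(n *node) { freeList = append(freeList, n) }

// withShape mimics a construction option that validates the new shape
// against the node's current dimensions.
func withShape(shp ...int) func(*node) {
	return func(n *node) {
		if len(n.shape) != 0 && len(n.shape) != len(shp) {
			panic(fmt.Sprintf("node has %d dims, input shape has %d", len(n.shape), len(shp)))
		}
		n.shape = shp
	}
}

func main() {
	// First construction: the node is fresh, so the option applies cleanly.
	a := borrowNode()
	withShape(1, 64, 1, 64)(a)
	returnNode(a)

	// Second construction recycles the same node: the option now validates
	// the new shape against the previous owner's stale 4-dim shape and panics.
	b := borrowNode()
	withShape(64)(b)
}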

@chewxy
Member

chewxy commented Mar 21, 2020

Hmm, I thought I added the garbage cleanup on return to the pool. Let me look into this further.

@owulveryck
Member Author

You're right, the cleanup happens here:

func returnNode(n *Node) {

This is really weird.

@owulveryck
Member Author

owulveryck commented Mar 22, 2020

[Screenshot: debugger view, 2020-03-22 16:58]

As you can see in the debugging screenshot:

  • n's fields have been nilled (n.shape is nil)
  • nd is 4

How is this possible?

Note: the panic message is consistent with this; it reports 4 dimensions and a shape equal to ():

Node Σ[3](%0) :: Tensor-4 float32, has 4 dimensions(Shape: ()). Input shape is (64), which has 1 dimensions [recovered]

@chewxy
Member

chewxy commented Mar 22, 2020

That message means that the Tensor-4 node was defined with a type of dim 4, but the given shape is only of dim (). Let me recreate that.
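
Re-running the check in WithShape with the values from that message makes the outcome concrete (a standalone sketch; the IsColVec/IsRowVec and scalarEquiv stand-ins are simplified, not the library's code):

package main

import "fmt"

func main() {
	nd := 4        // n.Dims(): the node's type is Tensor-4
	s := []int{64} // the inferred shape handed to WithShape

	isVec := len(s) == 2 && (s[0] == 1 || s[1] == 1) // rough stand-in for s.IsColVec() || s.IsRowVec()
	acceptVec := isVec && nd == 1                    // false: s is 1-d and nd is 4
	sameDims := nd == len(s)                         // false: 4 != 1

	scalarEquiv := true // rough stand-in for scalarEquiv(s): true only if every dim is 1
	for _, d := range s {
		if d != 1 {
			scalarEquiv = false
		}
	}
	acceptScalar := nd == 0 && scalarEquiv // false: nd is 4 and s contains 64

	if !acceptVec && !sameDims && !acceptScalar {
		fmt.Printf("panic: node has %d dimensions (Shape: ()), input shape %v has %d dimensions\n", nd, s, len(s))
	}
}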

@chewxy
Member

chewxy commented Mar 22, 2020

The issue is likely in Mean more than anything.

@chewxy
Member

chewxy commented Mar 22, 2020

The issue appears to be in reductionInferShape which is a bit overeager when it comes to squeezing dimensions.

When (1, 64, 1, 64) gets Sum'd over axis 3, there can be multiple interpretations of what that means. For example, you can get (1, 64, 1, 1), which is what ApplyOp expected; you can also get (1, 64, 1), which is what numpy's API generates. reductionInferShape sees (1, 64, 1) and shrinks it to (64), which is a "machine view" of things.
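
To make those interpretations concrete, here is a small library-free sketch of the three shape rules in play (keep the reduced axis as 1, drop it numpy-style, or squeeze every remaining 1); the function names are illustrative only:

package main

import "fmt"

// keepDims keeps the reduced axis with size 1: (1, 64, 1, 64) along 3 -> (1, 64, 1, 1).
func keepDims(shape []int, axis int) []int {
	out := append([]int(nil), shape...)
	out[axis] = 1
	return out
}

// dropAxis removes the reduced axis, numpy-style: (1, 64, 1, 64) along 3 -> (1, 64, 1).
func dropAxis(shape []int, axis int) []int {
	out := append([]int(nil), shape[:axis]...)
	return append(out, shape[axis+1:]...)
}

// squeezeAll additionally removes every remaining size-1 axis: (1, 64, 1) -> (64).
func squeezeAll(shape []int) []int {
	var out []int
	for _, d := range shape {
		if d != 1 {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	in := []int{1, 64, 1, 64}
	fmt.Println(keepDims(in, 3))             // [1 64 1 1]: what ApplyOp expected
	fmt.Println(dropAxis(in, 3))             // [1 64 1]: numpy's behaviour
	fmt.Println(squeezeAll(dropAxis(in, 3))) // [64]: the "machine view" reductionInferShape produces
}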

Some opinions.

The "intuitive" notion should be that a Sum returns (1, 64, 1, 1). The "correct" notion should be (1, 64, 1).

It's intuitive that if you reduce along an axis, you get a scalar (i.e. 1 element) back. However, this is solely due to one interpretation of the semantics of shapes. Examples follow.

Consider the 2D case. A "sum" is essentially \mathbf{a} \times \mathbf{1}', where \mathbf{1} is a vector of 1s with the same shape as \mathbf{a}. Its result is a scalar (i.e. of shape ()).
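
Spelled out for a length-3 row vector (a worked instance added for illustration, not part of the original comment):

\mathbf{a}\,\mathbf{1}' = \begin{pmatrix} a_1 & a_2 & a_3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = a_1 + a_2 + a_3

i.e. a (1 × 3) by (3 × 1) product whose result has shape (1 × 1), a scalar.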

Now consider a 3D case. The 3D case would be a BatchedMatMul in Gorgonia's terms. Now imagine a (2, 3, 4) 3-tensor, and we're summing over the 2nd axis. We're left with two batches of (3-rows of scalars). A row of scalars can be written (3) or (3,1). It depends on your views on linear algebra - is this a subspace we're talking about?

Hence the most generic solution would be to leave it as (2, 3, 1).

A good argument for numpy-style shape inference would be "but you said the sum results in a scalar". Ah, but we haven't yet determined the calculus of shapes, have we? What happens if we append a () to a (2, 3)? What are the rules around this? Do we say that it's (2, 3, ()) and that's equal to (2, 3, 1), or (2, 3)?

I like to think of () as "unit", and _ as void. The question comes down to how we model "unit". Do we treat it as an identity element (à la 0 for addition and 1 for multiplication)?

More Opinions!

I have analysis paralysis. I do not know what choice to make.

(2,3,1) is more generic - it doesn't lock people down to one definition of a "vector space". This, for example, has been quite useful, as I have been doing work in gyrovector spaces for a while.

(2,3) complies with numpy's API. It makes newcomers to Gorgonia feel more at ease.

What We Can Do Now

Fix reductionInferShape, but be prepared to break it for v0.10.0

@owulveryck
Member Author

What do we do with this issue? I'd like to close it, but I do not want to lose your (very valuable) comments.

@chewxy
Member

chewxy commented Mar 23, 2020

The comments are mostly opinion and should not be taken as gospel. We can write a better version on the docs site once the ideas have stabilized. In the meantime, this has gone into my personal org files. So... there.

@wzzhu
Contributor

wzzhu commented Jun 10, 2020

The issue appears to be in reductionInferShape which is a bit overeager when it comes to squeezing dimensions.

What We Can Do Now

Fix reductionInferShape, but be prepared to break it for v0.10.0

Besides the problem in reductionInferShape, WithShape seems to fail whenever a node is created using NewUniqueNode, since newNode borrows a node whose node.shape is initialized to nil, which always makes nd == 0.
Whenever the WithShape option is applied, the shape check therefore always mismatches, unless the inferred shape passed in is () too.

See the flow:

// Max performs a max() on the input and the provided axes.
func Max(a *Node, along ...int) (retVal *Node, err error) {
	if a.IsScalar() {
		// can't max a scalar. Should return error
		return a, nil
	}

	dims := a.Dims()
	if len(along) == 0 {
		along = intRange(0, dims)
	}

	op := newMaxOp(along, dims)

	return ApplyOp(op, a) // <--
}
// ApplyOp is the generic function application - for when no specialization is required
func ApplyOp(op Op, children ...*Node) (retVal *Node, err error) {
	...
	ds := Nodes(children).dimSizers()
	var s tensor.Shape
	if s, err = op.InferShape(ds...); err == nil {
		shapeLogf("inferred shape %v", s)
                // <-- Call NewUniqueNode here
		retVal = NewUniqueNode(WithType(retType), WithOp(op), WithChildren(children), In(g), WithShape(s...))
	} else {
		err = errors.Wrapf(err, "Failed to infer shape. Op: %v", op)
		// retVal = newUniqueNode(withType(retType), withOp(op), withChildren(children), withGraph(g))
	}
	returnDimSizers(ds)
	return
}
// NewUniqueNode creates a new unique node in a graph. If no graph was specified in the construction options then it will just return a graphless node.
func NewUniqueNode(opts ...NodeConsOpt) *Node {
	n := newNode(opts...) // <--
	if n.g == nil {
		return n
	}
	n.fixChildren() // ensure that all the kids are in the graph first

	m := n.g.AddNode(n)
	if n != m {
		returnNode(n)
	}
	m.fixEdges()
	return m
}
func newNode(opts ...NodeConsOpt) *Node {
	n := borrowNode()
	n.dataOn = CPU
	n.id = -1
// <-- n.shape == nil here
	for _, opt := range opts {
		opt(n)
	}
	n.fix()

	incrNN()
	return n
}
// WithShape is a node construction option to initialize a *Node with a particular shape.
// This function panics if the shape's dimensions do not match the specified dimensions of the *Node.
func WithShape(shp ...int) NodeConsOpt {
	s := tensor.Shape(tensor.BorrowInts(len(shp)))
	copy(s, shp)
	f := func(n *Node) {
		nd := n.Dims() // <-- nd == 0 always
		// if nd == 1 && s.IsVector() {
		// 	goto safe
		// }
		isVec := s.IsColVec() || s.IsRowVec()
		acceptVec := (isVec && (nd == 1))
		sameDims := nd == s.Dims()
		acceptScalar := nd == 0 && scalarEquiv(s)

		if !acceptVec && !sameDims && !acceptScalar {
                        // <--panic here!!
			panic(fmt.Sprintf("Node %v, has %d dimensions(Shape: %v). Input shape is %v, which has %d dimensions", n, n.Dims(), n.shape, s, s.Dims()))
		}
		// safe:
		n.shape = s
	}
	return f
}

chewxy pushed a commit that referenced this issue Jun 15, 2020
* Make reductionInferShape conservative to fix #384

reductionInferShape currently doesn't respect along; it aggressively squeezes dimensions. Not only does this affect normal tensor operations, it also sometimes breaks the backprop autodiff algorithm when the network contains BroadcastAdd, resulting in a crash when calling Grad().

The change tries to strictly respect the parameter along, e.g.:
(100, 1) along 0 will reduce to shape (1) instead of ()
(1, 64, 1, 64) along 3 will reduce to (1, 64, 1)
(64, 1, 3, 2) along (2, 3) will reduce to (64, 1).

Fixed unit tests.

* Remove inconsistent dimension for Sum op

After changing reductionType to subtract len(along) for the reduction op, SumOp's dimension needs to be adjusted accordingly. Otherwise SymDiff will crash for Sum in calcBroadcastShape.
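
For reference, a minimal sketch of the conservative rule the commit describes (an illustration, not the actual patch): drop exactly the axes listed in along and leave every other dimension alone.

package main

import "fmt"

// reduceShape drops exactly the axes listed in along, and nothing else.
func reduceShape(shape []int, along ...int) []int {
	drop := make(map[int]bool, len(along))
	for _, ax := range along {
		drop[ax] = true
	}
	out := make([]int, 0, len(shape))
	for i, d := range shape {
		if !drop[i] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	fmt.Println(reduceShape([]int{100, 1}, 0))         // [1] instead of ()
	fmt.Println(reduceShape([]int{1, 64, 1, 64}, 3))   // [1 64 1]
	fmt.Println(reduceShape([]int{64, 1, 3, 2}, 2, 3)) // [64 1]
}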