This repository has been archived by the owner on Mar 1, 2019. It is now read-only.

MNIST computation does not give the correct result. #2

Open
owulveryck opened this issue Sep 10, 2018 · 4 comments

Comments

@owulveryck (Owner)

go test -run=mnist
--- FAIL: Example_mnist (0.02s)
got:
[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]
want:
[5041.8887 -3568.878 -187.82423 -1685.797 -1183.3232 -614.42926 892.6643 -373.65845 -290.2623 -111.176216]
FAIL
exit status 1
FAIL    github.com/owulveryck/gorgonnx  0.034s

The convolution operator now seems to run as expected (according to its unit test).

I have two options for debugging:

  • import the model into caffe2 and run it step by step to see if the results match
  • write more unit tests for the operators to locate the problem (see the test sketch after this list)
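
As a starting point for the second option, here is a minimal sketch of such an operator unit test, written directly against Gorgonia (hypothetical code, not part of the repository; the test name, shapes and values are chosen for illustration): it builds a graph with a single Add node on two small float32 tensors and checks the result element-wise.

package gorgonnx

import (
	"testing"

	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// TestAdd checks that a single Add node actually produces the sum of its
// two inputs (hypothetical test; not part of the repository).
func TestAdd(t *testing.T) {
	g := gorgonia.NewGraph()

	aT := tensor.New(tensor.WithShape(2, 2), tensor.WithBacking([]float32{1, 2, 3, 4}))
	bT := tensor.New(tensor.WithShape(2, 2), tensor.WithBacking([]float32{10, 20, 30, 40}))

	a := gorgonia.NodeFromAny(g, aT, gorgonia.WithName("a"))
	b := gorgonia.NodeFromAny(g, bT, gorgonia.WithName("b"))

	sum, err := gorgonia.Add(a, b)
	if err != nil {
		t.Fatal(err)
	}

	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()
	if err := machine.RunAll(); err != nil {
		t.Fatal(err)
	}

	got := sum.Value().Data().([]float32)
	want := []float32{11, 22, 33, 44}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}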

@owulveryck (Owner, Author)

I have tried the MNIST example with caffe2 from the Docker image:

docker run -it --rm onnx/onnx-docker:cpu

and then ran this script:

import numpy as np
import onnx
import os
import glob
import caffe2.python.onnx.backend as backend

from onnx import numpy_helper

model = onnx.load('model.onnx')
test_data_dir = 'test_data_set_0'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))

# Run the model on the backend
outputs = list(backend.run_model(model, inputs))

# Compare the results with reference outputs.
for ref_o, o in zip(ref_outputs, outputs):
    np.testing.assert_almost_equal(ref_o, o)

which gives this result:

Traceback (most recent call last):
  File "onnx_caffe2.py", line 33, in <module>
    outputs = list(backend.run_model(model, inputs))
  File "/root/programs/onnx/onnx/backend/base.py", line 86, in run_model
    return backend.run(inputs)
  File "/root/programs/pytorch/caffe2/python/onnx/backend_rep.py", line 57, in run
    self.workspace.RunNet(self.predict_net.name)
  File "/root/programs/pytorch/caffe2/python/onnx/workspace.py", line 63, in f
    return getattr(workspace, attr)(*args, **kwargs)
  File "/root/programs/pytorch/caffe2/python/workspace.py", line 219, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/root/programs/pytorch/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at reshape_op.h:110] total_size == size. 64 vs 256. Argument `shape` does not agree with the input data. (64 != 256)Error from operator:
input: "Pooling160_Output_0" input: "Pooling160_Output_0_reshape0_shape" output: "Pooling160_Output_0_reshape0" output: "OC2_DUMMY_1" name: "Times212_reshape0" type: "Reshape" device_option { device_type: 0 cuda_gpu_id: 0 }

So the model also fails under caffe2: the Reshape node Times212_reshape0 fails because the target shape and the number of elements in the input data do not agree (64 vs 256). I think I will raise an issue on the ONNX model zoo.

@owulveryck (Owner, Author)

I can now analyze the graph a bit better, and it looks like a simple Add operation is not working as expected. Here is a way to reproduce the behavior:

First, get gorgonnx and check out the version that holds the tracer:

go get github.com/owulveryck/gorgonnx
git checkout 6027f688f41fb9ee7be8d66a218c396832704545

Then download the MNIST example:

cd $TMPDIR && curl -s -o - https://onnxzoo.blob.core.windows.net/models/opset_8/mnist/mnist.tar.gz | tar xvzf -

And then launch the graph:

go run $GOPATH/src/github.com/owulveryck/gorgonnx/cmd/main.go -model $TMPDIR/mnist/model.onnx -input $TMPDIR/mnist/test_data_set_0/input_0.pb

and point your browser to http://localhost:8080.

On the graph, click on the nodes Plus30_Output0 and Convolution28_0; they should be different, but they are the same. This means that the broadcasting is OK, but the Add operation does not do anything.
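
To double-check this outside of the ONNX import path, a small sketch like the following could reproduce the broadcast-plus-Add pattern in isolation (this assumes a Gorgonia version that exposes BroadcastAdd; the shapes and values are illustrative, not taken from the MNIST graph): it broadcasts a (1, 3) bias over a (2, 3) tensor and checks that the output differs from the first input.

package gorgonnx

import (
	"testing"

	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// TestBroadcastAdd exercises "broadcast then Add" in isolation
// (hypothetical test; assumes gorgonia.BroadcastAdd is available).
func TestBroadcastAdd(t *testing.T) {
	g := gorgonia.NewGraph()

	xT := tensor.New(tensor.WithShape(2, 3), tensor.WithBacking([]float32{1, 2, 3, 4, 5, 6}))
	biasT := tensor.New(tensor.WithShape(1, 3), tensor.WithBacking([]float32{10, 20, 30}))

	x := gorgonia.NodeFromAny(g, xT, gorgonia.WithName("x"))
	bias := gorgonia.NodeFromAny(g, biasT, gorgonia.WithName("bias"))

	// Broadcast the bias along axis 0 so that its shape matches x before the addition.
	sum, err := gorgonia.BroadcastAdd(x, bias, nil, []byte{0})
	if err != nil {
		t.Fatal(err)
	}

	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()
	if err := machine.RunAll(); err != nil {
		t.Fatal(err)
	}

	got := sum.Value().Data().([]float32)
	want := []float32{11, 22, 33, 14, 25, 36}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}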

[screenshot of the graph viewer, 2018-11-26 at 23:55:10]

@owulveryck
Copy link
Owner Author

Bad news: I have tried to rewrite the parser and the Gorgonia graph construction (see the branch "directed-graph"), but I still get the same result:
[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

So the problem does not come from the graph definition. It may be related to the computation phase of Gorgonia.
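
For the next round of debugging it may help to compare Gorgonia's output against the reference tensors of the test data set directly from the Go tests, with a helper roughly equivalent to the np.testing.assert_almost_equal call in the Python script above (hypothetical helper; the name and tolerance are illustrative):

package gorgonnx

import (
	"math"
	"testing"
)

// assertAlmostEqual fails the test if the two slices differ by more than eps
// at any position (hypothetical helper, mirroring numpy's assert_almost_equal).
func assertAlmostEqual(t *testing.T, want, got []float32, eps float64) {
	t.Helper()
	if len(want) != len(got) {
		t.Fatalf("length mismatch: want %d values, got %d", len(want), len(got))
	}
	for i := range want {
		if math.Abs(float64(want[i])-float64(got[i])) > eps {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}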
