This repository has been archived by the owner on Mar 1, 2019. It is now read-only.

MNIST computation does not give the correct result. #2

Open
owulveryck opened this issue Sep 10, 2018 · 4 comments

Comments

@owulveryck (Owner)

go test -run=mnist
--- FAIL: Example_mnist (0.02s)
got:
[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]
want:
[5041.8887 -3568.878 -187.82423 -1685.797 -1183.3232 -614.42926 892.6643 -373.65845 -290.2623 -111.176216]
FAIL
exit status 1
FAIL    github.com/owulveryck/gorgonnx  0.034s

The convolution operator now seems to run as expected (according to its unit test).

I have two options for debugging:

  • import the model into caffe2 and run it step by step to see if the results match
  • write more unit tests for the operators to locate the problem (see the test sketch after this list)
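
As a starting point for the second option, here is a minimal sketch of such an operator unit test, written directly against Gorgonia (hypothetical code, not part of the repository; the test name, shapes and values are chosen for illustration): it builds a graph with a single Add node on two small float32 tensors and checks the result element-wise.

package gorgonnx

import (
	"testing"

	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// TestAdd checks that a single Add node actually produces the sum of its
// two inputs (hypothetical test; not part of the repository).
func TestAdd(t *testing.T) {
	g := gorgonia.NewGraph()

	aT := tensor.New(tensor.WithShape(2, 2), tensor.WithBacking([]float32{1, 2, 3, 4}))
	bT := tensor.New(tensor.WithShape(2, 2), tensor.WithBacking([]float32{10, 20, 30, 40}))

	a := gorgonia.NodeFromAny(g, aT, gorgonia.WithName("a"))
	b := gorgonia.NodeFromAny(g, bT, gorgonia.WithName("b"))

	sum, err := gorgonia.Add(a, b)
	if err != nil {
		t.Fatal(err)
	}

	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()
	if err := machine.RunAll(); err != nil {
		t.Fatal(err)
	}

	got := sum.Value().Data().([]float32)
	want := []float32{11, 22, 33, 44}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}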

@owulveryck (Owner, Author)

I have tried the MNIST example with caffe2 from the Docker image:

docker run -it --rm onnx/onnx-docker:cpu

and then ran this script:

import numpy as np
import onnx
import os
import glob
import caffe2.python.onnx.backend as backend

from onnx import numpy_helper

model = onnx.load('model.onnx')
test_data_dir = 'test_data_set_0'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))

# Run the model on the backend
outputs = list(backend.run_model(model, inputs))

# Compare the results with reference outputs.
for ref_o, o in zip(ref_outputs, outputs):
    np.testing.assert_almost_equal(ref_o, o)

which gives this result:

Traceback (most recent call last):
  File "onnx_caffe2.py", line 33, in <module>
    outputs = list(backend.run_model(model, inputs))
  File "/root/programs/onnx/onnx/backend/base.py", line 86, in run_model
    return backend.run(inputs)
  File "/root/programs/pytorch/caffe2/python/onnx/backend_rep.py", line 57, in run
    self.workspace.RunNet(self.predict_net.name)
  File "/root/programs/pytorch/caffe2/python/onnx/workspace.py", line 63, in f
    return getattr(workspace, attr)(*args, **kwargs)
  File "/root/programs/pytorch/caffe2/python/workspace.py", line 219, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/root/programs/pytorch/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at reshape_op.h:110] total_size == size. 64 vs 256. Argument `shape` does not agree with the input data. (64 != 256)Error from operator:
input: "Pooling160_Output_0" input: "Pooling160_Output_0_reshape0_shape" output: "Pooling160_Output_0_reshape0" output: "OC2_DUMMY_1" name: "Times212_reshape0" type: "Reshape" device_option { device_type: 0 cuda_gpu_id: 0 }

So the model also fails under caffe2: the Reshape node Times212_reshape0 fails because the target shape and the number of elements in the input data do not agree (64 vs 256). I think I will raise an issue on the ONNX model zoo.

@owulveryck (Owner, Author)

I can now analyze the graph a bit better, and it looks like a simple Add operation is not working as expected. Here is a way to reproduce the behavior:

First, get gorgonnx and check out the version that holds the tracer:

go get github.com/owulveryck/gorgonnx
git checkout 6027f688f41fb9ee7be8d66a218c396832704545

Then download the MNIST example:

cd $TMPDIR && curl -s -o - https://onnxzoo.blob.core.windows.net/models/opset_8/mnist/mnist.tar.gz | tar xvzf -

And then launch the graph:

go run $GOPATH/src/github.com/owulveryck/gorgonnx/cmd/main.go -model $TMPDIR/mnist/model.onnx -input $TMPDIR/mnist/test_data_set_0/input_0.pb

and point your browser to http://localhost:8080.

On the graph, click on the nodes Plus30_Output0 and Convolution28_0; they should be different, but they are the same. This means that the broadcasting is OK, but the Add operation does not do anything.
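
To double-check this outside of the ONNX import path, a small sketch like the following could reproduce the broadcast-plus-Add pattern in isolation (this assumes a Gorgonia version that exposes BroadcastAdd; the shapes and values are illustrative, not taken from the MNIST graph): it broadcasts a (1, 3) bias over a (2, 3) tensor and checks that the output differs from the first input.

package gorgonnx

import (
	"testing"

	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// TestBroadcastAdd exercises "broadcast then Add" in isolation
// (hypothetical test; assumes gorgonia.BroadcastAdd is available).
func TestBroadcastAdd(t *testing.T) {
	g := gorgonia.NewGraph()

	xT := tensor.New(tensor.WithShape(2, 3), tensor.WithBacking([]float32{1, 2, 3, 4, 5, 6}))
	biasT := tensor.New(tensor.WithShape(1, 3), tensor.WithBacking([]float32{10, 20, 30}))

	x := gorgonia.NodeFromAny(g, xT, gorgonia.WithName("x"))
	bias := gorgonia.NodeFromAny(g, biasT, gorgonia.WithName("bias"))

	// Broadcast the bias along axis 0 so that its shape matches x before the addition.
	sum, err := gorgonia.BroadcastAdd(x, bias, nil, []byte{0})
	if err != nil {
		t.Fatal(err)
	}

	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()
	if err := machine.RunAll(); err != nil {
		t.Fatal(err)
	}

	got := sum.Value().Data().([]float32)
	want := []float32{11, 22, 33, 14, 25, 36}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}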

[screenshot of the graph viewer, 2018-11-26 at 23:55:10]

@owulveryck
Copy link
Owner Author

Bad news: I have tried to rewrite the parser and the Gorgonia graph construction (see the branch "directed-graph"), but I still get the same result:
[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

So the problem does not come from the graph definition. It may be related to the computation phase of Gorgonia.
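
For the next round of debugging it may help to compare Gorgonia's output against the reference tensors of the test data set directly from the Go tests, with a helper roughly equivalent to the np.testing.assert_almost_equal call in the Python script above (hypothetical helper; the name and tolerance are illustrative):

package gorgonnx

import (
	"math"
	"testing"
)

// assertAlmostEqual fails the test if the two slices differ by more than eps
// at any position (hypothetical helper, mirroring numpy's assert_almost_equal).
func assertAlmostEqual(t *testing.T, want, got []float32, eps float64) {
	t.Helper()
	if len(want) != len(got) {
		t.Fatalf("length mismatch: want %d values, got %d", len(want), len(got))
	}
	for i := range want {
		if math.Abs(float64(want[i])-float64(got[i])) > eps {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}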
