
[U May Need It]nvdla_runtime options #9

Closed
JunningWu opened this issue Dec 27, 2017 · 61 comments

@JunningWu

./nvdla_runtime -loadable output.protobuf

Usage: ./nvdla_runtime [-options] --loadable <loadable_file>
where options include:
-h print this help message
-s launch test in server mode
--loadable <loadable_file>
--image <image_file>
--imgshift <shift_value>
--imgscale <scale_value>
--imgpower <power_value>
--softmax
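
For reference, a typical invocation that feeds an image through a compiled network would look roughly like this (the loadable and image file names below are only placeholders):

./nvdla_runtime --loadable default.nvdla --image test.pgm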

@JunningWu
Author

unhandled level 1 translation fault

1. I compiled the AlexNet Caffe model on my virtual machine (Ubuntu 14.04). The compile process went fine, and I got the loadable file output.protobuf and the output dir wisdom.dir, which contains layers/networks/tensors.

2. I want to run nvdla_runtime with the NVDLA VP, so I copied the output file and the output dir to the VP.
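
For reference, one way to make the host output visible inside the VP guest is the 9p share from the prebuilt VP setup, roughly:

mount -t 9p -o trans=virtio r /mnt
cd /mnt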

Here is the error:

/# ./nvdla_runtime --loadable output.protobuf
creating new runtime context...
[ 1181.040928] nvdla_runtime[1287]: unhandled level 1 translation fault (11) at 0x41ffb65a, esr 0x92000005, in libnvdla_runtime.so[ffffb54a2000+21000]
[ 1181.047469] CPU: 0 PID: 1287 Comm: nvdla_runtime Not tainted 4.13.3 #1
[ 1181.054489] Hardware name: linux,dummy-virt (DT)
[ 1181.055224] task: ffff80003db40e00 task.stack: ffff80003d16c000
[ 1181.055792] PC is at 0xffffb54b7d98
[ 1181.056133] LR is at 0xffffb54b3984
[ 1181.056400] pc : [<0000ffffb54b7d98>] lr : [<0000ffffb54b3984>] pstate: 80000000
[ 1181.056903] sp : 0000ffffd6afd030
[ 1181.057213] x29: 0000ffffd6afd030 x28: 0000000000000000
[ 1181.067488] x27: 0000000000000000 x26: 0000000000000000
[ 1181.067886] x25: 0000000000000000 x24: 0000000000000000
[ 1181.068077] x23: 0000000000000000 x22: 0000000000000000
[ 1181.071035] x21: 0000000039db2190 x20: 0000000000000000
[ 1181.071292] x19: 0000000041ffb65a x18: 0000000000000000
[ 1181.071481] x17: 0000ffffb54d3fb0 x16: 0000ffffb54b7d98
[ 1181.071732] x15: 0000000000000111 x14: 00000000000003f3
[ 1181.084578] x13: 0000000000000000 x12: 0000ffffb549f968
[ 1181.088542] x11: 0000000000000022 x10: 0000000000000007
[ 1181.093569] x9 : 0000000000001500 x8 : 0000000000000003
[ 1181.097488] x7 : 0000000000000001 x6 : 0000000039db2330
[ 1181.100436] x5 : 0000000000000041 x4 : 0000000000000001
[ 1181.111960] x3 : 0000ffffb54d4928 x2 : 0000ffffb54b3950
[ 1181.112223] x1 : 0000000000000006 x0 : 0000000041ffb65a
Segmentation fault

Can anyone help?

  1. GDB not available on prebuilt image?

/# gdb
-sh: gdb: not found

@xmchen1987

@wujunning2011 I think your input loadable file is wrong. You can try any file and you will get the same error.

I used the loadable file from the Docker image, but I get an out-of-bounds error. Any suggestions?

./nvdla_runtime --loadable BDMA_L0_0_fbuf
creating new runtime context...
libnvdla<1> failed to open dla device
libnvdla<1> Out of bounds DLA instance 0 requested.
(DLA_TEST) Error 0x00000004: runtime->load failed (in RuntimeTest.cpp, function loadLoadable(), line 253)
(DLA_TEST) Error 0x00000004: (propagating from RuntimeTest.cpp, function run(), line 377)
(DLA_TEST) Error 0x00000004: (propagating from main.cpp, function launchTest(), line 92)

@jarodw0723

@xmchen1987 Have you installed the drivers (drm.ko, opendla.ko) first?
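
The rough sequence inside the guest, before running any loadable, is something like this (module and library names as in the prebuilt image):

insmod drm.ko
insmod opendla.ko
export LD_LIBRARY_PATH=$PWD    # so nvdla_runtime can find libnvdla_runtime.so
./nvdla_runtime --loadable <loadable_file>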

@JunningWu
Author

@xmchen1987 can you share your loadable file?

@jarodw0723 when I install drm.ko/opendla.ko, I get a level 2 translation fault:
nvdla_runtime[1306]: unhandled level 2 translation fault (11) at 0x2bf2565a, esr 0x92000006

@jarodw0723

@wujunning2011 You can find the loadable file in https://github.com/nvdla/sw/tree/master/regression/flatbufs/kmd
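
Once the kernel modules are loaded, those prebuilt flatbuffers can be passed to the runtime directly, for example:

./nvdla_runtime --loadable CONV_D_L0_0_fbuf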

@xmchen1987

@jarodw0723 After installing the driver, it succeeds. Thanks a lot.
@wujunning2011 As @jarodw0723 mentioned, I used the prebuilt file from
https://github.com/nvdla/sw/tree/master/regression/flatbufs/kmd

@jarodw0723 Is the VP able to dump performance data currently, or is it just for software development?

@jarodw0723

@xmchen1987 It is just for software development.

@JunningWu
Author

@jarodw0723 @xmchen1987 When I run in Docker mode, it is OK.
I wonder how to generate my own loadable file with nvdla_compiler, e.g. from AlexNet.caffemodel.

@xmchen1987

@jarodw0723 I see that the cmod currently has an interface like:
NV_NVDLA_cmac::NV_NVDLA_cmac( sc_module_name module_name ):
NV_NVDLA_cmac_base(module_name),
// Delay setup
dma_delay_(SC_ZERO_TIME),
csb_delay_(SC_ZERO_TIME),
b_transport_delay_(SC_ZERO_TIME)

Do you have plans to develop the cmod into a performance model?

@geyijun

geyijun commented Dec 28, 2017

I have the same problem as wujunning2011.
If the loadable file is wrong, how can I get a loadable file using nvdla_compiler, e.g. from AlexNet.caffemodel?
I have read the UMD code and understand that nvdla_runtime needs a "flatbuf" loadable file, but the output format nvdla_compiler creates is "protobuf". Should I convert the format, or is there an option for nvdla_compiler to make it create a "flatbuf" file?

@blueardour

+1, same request to compile and run a custom model.

Besides, @jarodw0723 is there any schedule for when the performance profiling function will be ready?

@xmchen1987

@wujunning2011 @geyijun @blueardour output.protobuf is not the target loadable file. I used default.nvdla and succeeded in running the test.

@JunningWu
Author

@xmchen1987 Thank you very much. Maybe default.nvdla is the loadable file.

@HKLee2040

@xmchen1987 May I know more about your "default.nvdla" test?
What network and what command do you use to test "default.nvdla"?

@geyijun

geyijun commented Jan 2, 2018

@xmchen1987 Thanks.

@blueardour

blueardour commented Jan 3, 2018

Hi,
It gets stuck when running nvdla_runtime --loadable default.nvdla. Any clue?

................................
Welcome to Buildroot
nvdla login: root
Password:
# mount -t 9p -o trans=virtio r /mnt
# cd /mnt/
# ls
CMakeCache.txt README.md install_manifest.txt
CMakeFiles aarch64_toplevel libs
CMakeLists.txt cmake models
CPackConfig.cmake cmake_install.cmake scripts
CPackSourceConfig.cmake conf src
LICENSE docker tests
Makefile images
# cd images/
# cd linux-4.13.3/
# ls
??%@@???@8 drm.ko nvdla_runtime
CONV_D_L0_0_fbuf efi-virtio.rom opendla.ko
Image libnvdla_compiler.so rootfs.ext4
aarch64_nvdla.lua libnvdla_runtime.so
alexnet nvdla_compiler
# insmod drm.ko
# insmod opendla.ko
[ 35.852221] opendla: loading out-of-tree module taints kernel.
[ 35.863261] reset engine done
[ 35.872695] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# export LD_LIBRARY_PATH=$PWD
# ./nvdla_runtime --loadable alexnet/default.nvdla
creating new runtime context...
[ 55.045834] random: crng init done
^C^C^X^C^C^C # stuck here
................................

Also tried again:
................................
Welcome to Buildroot
nvdla login: root
Password:
# mount -t 9p -o trans=virtio r /mnt
# cd /mnt/images/linux-4.13.3/
# export LD_LIBRARY_PATH=$PWD
# cd alexnet/
# ./../nvdla_runtime --loadable default.nvdla
creating new runtime context...
[ 48.162132] random: crng init done
^C
# cd ..
# insmod drm.ko
# insmod opendla.ko
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# dmesg| tail
[ 1.893330] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
[ 1.912059] devtmpfs: mounted
[ 2.054710] Freeing unused kernel memory: 1088K
[ 2.232956] EXT4-fs (vda): re-mounted. Opts: data=ordered
[ 3.830183] NET: Registered protocol family 10
[ 3.848857] Segment Routing with IPv6
[ 48.162132] random: crng init done
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# cd alexnet/
# ./../nvdla_runtime --loadable default.nvdla
creating new runtime context...
^C # stuck here
# dmesg| tail
[ 1.893330] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
[ 1.912059] devtmpfs: mounted
[ 2.054710] Freeing unused kernel memory: 1088K
[ 2.232956] EXT4-fs (vda): re-mounted. Opts: data=ordered
[ 3.830183] NET: Registered protocol family 10
[ 3.848857] Segment Routing with IPv6
[ 48.162132] random: crng init done
[ 70.167764] opendla: loading out-of-tree module taints kernel.
[ 70.179404] reset engine done
[ 70.188440] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
................................

@JunningWu
Author

@blueardour I suppose it's because AlexNet is too big; it may take about 20 minutes to create the context. You can try LeNet first.

@blueardour

blueardour commented Jan 4, 2018

@wujunning2011 Hi, thanks for your tips.
May I ask whether you ever successfully ran AlexNet?
Based on your comment, I left the program running. After 14 hours, it seems the simulator still has not finished.
..................................
# insmod opendla.ko
[ 43.227254] opendla: loading out-of-tree module taints kernel.
[ 43.239206] reset engine done
[ 43.248391] [drm] Initialized nvdla 0.0.0 20171017 for 10200000.nvdla on minor 0
# export LD_LIBRARY_PATH=$PWD
# ./nvdla_runtime --loadable alexnet/default.nvdla
creating new runtime context...
[ 72.144474] random: crng init done
Unknown image type: submitting tasks...
[ 7082.524154] Enter:dla_read_network_config
[ 7082.528186] Exit:dla_read_network_config status=0
[ 7082.528669] Enter: dla_initiate_processors
[ 7082.531573] Enter: dla_submit_operation
[ 7082.532029] Prepare Convolution operation index 0 ROI 0 dep_count 1
[ 7082.532483] Enter: dla_prepare_operation
[ 7082.535457] processor:Convolution group:0, rdma_group:0 available
[ 7082.536056] Enter: dla_read_config
[ 7082.543696] Exit: dla_read_config
[ 7082.544123] Exit: dla_prepare_operation status=0
[ 7082.544593] Enter: dla_program_operation
[ 7082.546769] Program Convolution operation index 0 ROI 0 Group[0]
[ 7082.555487] no desc get due to index==-1
[ 7082.556460] no desc get due to index==-1
[ 7082.558436] no desc get due to index==-1
[ 7082.558787] no desc get due to index==-1
........................
7083.737498] Exit: dla_op_programmed
[ 7083.737643] Exit: dla_program_operation status=0
[ 7083.737814] Exit: dla_submit_operation
[ 7083.737961] Enter: dla_dequeue_operation
[ 7083.738115] Dequeue op from CDP processor, index=18 ROI=0
[ 7083.738301] Enter: dla_submit_operation
[ 7083.738456] Prepare CDP operation index 18 ROI 0 dep_count 1
[ 7083.738651] Enter: dla_prepare_operation
[ 7083.738871] processor:CDP group:1, rdma_group:1 available
[ 7083.739062] Enter: dla_read_config
[ 7083.741600] Exit: dla_read_config
[ 7083.741748] Exit: dla_prepare_operation status=0
[ 7083.741936] Enter: dla_program_operation
[ 7083.742096] Program CDP operation index 18 ROI 0 Group[1]
[ 7083.742494] Enter: dla_cdp_program
[ 7083.742563] Enter: processor_cdp_program
[ 7083.753187] Exit: processor_cdp_program
[ 7083.753201] Exit: dla_cdp_program
[ 7083.753356] no desc get due to index==-1
[ 7083.753615] no desc get due to index==-1
[ 7083.753760] no desc get due to index==-1
[ 7083.753910] no desc get due to index==-1
[ 7083.754058] no desc get due to index==-1
[ 7083.754210] no desc get due to index==-1
[ 7083.754362] Enter: dla_op_programmed
[ 7083.754505] Exit: dla_op_programmed
[ 7083.754649] Exit: dla_program_operation status=0
[ 7083.754817] Exit: dla_submit_operation
[ 7083.754966] Exit: dla_dequeue_operation
[ 7083.755133] Exit: dla_initiate_processors status=0
[ 7083.755376] Enter:dla_handle_events, processor:BDMA
[ 7083.755620] Exit:dla_handle_events, ret:0
[ 7083.755800] Enter:dla_handle_events, processor:Convolution
[ 7083.756012] Handle cdma weight done event, processor Convolution group 0
[ 7083.756260] Exit:dla_handle_events, ret:0
[ 7083.756416] Enter:dla_handle_events, processor:SDP
[ 7083.756592] Exit:dla_handle_events, ret:0
[ 7083.756758] Enter:dla_handle_events, processor:PDP
[ 7083.756937] Exit:dla_handle_events, ret:0
[ 7083.757092] Enter:dla_handle_events, processor:CDP
[ 7083.757269] Exit:dla_handle_events, ret:0
[ 7083.757422] Enter:dla_handle_events, processor:RUBIK
[ 7083.757602] Exit:dla_handle_events, ret:0

This is the last output after 14 hours of execution.

As you mentioned, I could give LeNet a try. The computational complexity of AlexNet is about 1 GMAC according to the Netscope CNN analyzer tool. However, most of the networks I care about are bigger than AlexNet, so if the simulator is this slow it may be impractical for me to run my own networks.

@JunningWu
Author

@blueardour my AlexNet run is not successful either; it also gets stuck somewhere.
According to the NVDLA VP configuration file, the system memory is 1MB, so this may have some influence on the AlexNet run.

With such a huge network, I suggest you consider Cadence's Protium or Synopsys's ZeBu.

BTW, when I run the tiny LeNet, there are still some errors; I hope you can try LeNet too and give me some help.

@blueardour

Hi @wujunning2011, sorry for the late reply. After giving LeNet a try, I also failed to run it successfully.

@ned-varnica

Hi, has anyone found a solution to this? I am having the same issue, and it does not seem to be a system virtual memory issue. Any help is appreciated. Thanks!

@prasshantg
Collaborator

@JunningWu the NVDLA VP configuration should be using 1GB of system memory; which config file are you checking?

Are you able to run LeNet?

@ned-varnica

ned-varnica commented Feb 6, 2018

Hi, still having issues with AlexNet, running it with 1GB of system memory. I tried running with the latest NVDLA updates (with this one, we can load a .jpg image format). Please see the attached log file for more info: some error messages are ignored, and it hangs at the last point shown in the log file:

20180205_pascalvoc_BoatRes227x227.jpg.log

Regarding LeNet: I was able to run it all the way through without any issues (here, the input file format used is .pgm).

20180202_lenet_BoatRes28x28.pgm.log

@prasshantg
Collaborator

I am able to reproduce it; created #21 for debugging the AlexNet failure.

@jwise jwise reopened this Feb 17, 2018
@jwise

jwise commented Feb 17, 2018

FYI -- some of the team is out for CNY this week. I'll follow up to see who's around, but expect a little more latency on this one. Thanks!

@prasshantg
Collaborator

We have resolved this issue and will push the fix to KMD. Waiting for some verification results.

@ned-varnica

Great, thanks. Much appreciated!

@qdchau

qdchau commented Mar 1, 2018

Hi @prasshantg, @jwise. Would it be possible to get a ballpark estimate of when the AlexNet fix will be available, so we can update our team's schedule?

@prasshantg
Collaborator

@qdchau 5th Mar 2018

@qdchau

qdchau commented Mar 2, 2018

Awesome. Thank you!

@prasshantg
Collaborator

@qdchau @ned-varnica @JunningWu Fix for AlexNet pushed. Please test it.

@qdchau

qdchau commented Mar 8, 2018

Hi @prasshantg. The fix works for us. Thanks for your help!

@ned-varnica

Thanks so much @prasshantg. Should we be expecting the correct output at this point? We tried this AlexNet with some images and got outputs that look like noise (negative values close to 0). On the other hand, when we run the same network on our local CPU we get very good predictions with the same input images (1 out of 20 output values is a large positive number, and it matches the correct label). Do you have any recommendations on how to proceed with debugging? Thanks!

@JunningWu
Author

@ned-varnica I think the rawdump file will contain 1000 predictions, like this: http://ddl.escience.cn/f/Qdtr.
By the way, I am using the BVLC trained model, and the input image is http://ddl.escience.cn/f/Qdts.

@prasshantg
Collaborator

@JunningWu do you get the expected results?

@JunningWu
Author

@prasshantg I am trying to figure out whether the result indicates "CAT".
The simulation process is OK, no more errors.

@ned-varnica

ned-varnica commented Mar 12, 2018

@JunningWu In the example we are running, it has 20 outputs. The network was taken from Caffe Model Zoo http://heatmapping.org/files/bvlc_model_zoo/pascal_voc_2012_multilabel/deploy_x30.prototxt

It was trained on the following 20 categories:

  1. Aeroplane
  2. Bicycle
  3. Bird
  4. Boat
  5. Bottle
  6. Bus
  7. Car
  8. Cat
  9. Chair
  10. Cow
  11. Dining table
  12. Dog
  13. Horse
  14. Motorbike
  15. Person
  16. Potted plant
  17. Sheep
  18. Sofa
  19. Train
  20. TV monitor

In your example, looking at the rawdump file, it seems you are seeing the same issue as we do. All the entries (in your case 1000 of them, in our case 20 of them) show very small values and nothing stands out. At least, this is our experience so far.

@prasshantg
Attached is one JPG image we used and the corresponding rawdump file.

boatres227x227
20180306_pascalvoc_BoatRes227x227.jpg.dimg.txt

@prasshantg
Collaborator

This could be due to the missing mean subtraction feature in the compiler. Let me confirm.

@ned-varnica

Thanks @prasshantg. I agree this is part of it, but there is probably more to it. FYI, I tried removing mean subtraction in our local simulator (just to test this hypothesis) and the result still looks OK: it can still produce outputs showing that "Boat" is much more likely than the other 19 outputs. The confidence is worse (compared to the confidence when the appropriate means are used), but it looks fine. On the other hand, the outputs we get in the file 20180306_pascalvoc_BoatRes227x227.jpg.dimg.txt (please see the previous message) do not show this behavior.

@ferin08

ferin08 commented Jun 21, 2018

I'm still not clear.
Is default.nvdla the loadable file? If yes, where can I find this file?

Also, how do I create a loadable file from output.protobuf, which is the compiler output?

@prasshantg
Collaborator

@ferin08 The output from the compiler is default.nvdla.

@ferin08

ferin08 commented Jun 21, 2018

After feeding in the prototxt and caffemodel files that I got from the DIGITS model output, using this command:

./nvdla_compiler [-options] --prototxt <prototxt_file> --caffemodel <caffemodel_file> -o <outputpath>

I get an output file from the compiler called output.protobuf, and nothing called default.nvdla.
Where do we get default.nvdla?

@smsharif1991
Contributor

Hi @ferin08 ,
./nvdla_compiler --prototxt alexnet.prototxt --model alexnet.caffemodel

After running the above command I get a "basic.nvdla" file in the folder from where it is run.
This basic.nvdla is the loadable file, which is then fed to the runtime for inference.

Check for *.nvdla in your current folder.
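
That loadable can then be passed to the runtime inside the VP, for example (the image name below is only a placeholder):

./nvdla_runtime --loadable basic.nvdla --image test.pgm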

Thanks.

@ferin08

ferin08 commented Jun 21, 2018

Hi @smsharif1991
Thank you! I was able to see the basic.nvdla loadable file.

Regarding the compiler and runtime tests, is there any more documentation other than http://nvdla.org/sw/test_application.html?

@smsharif1991
Contributor

Hi @ferin08
Currently this is all the documentation we have. What additional info is needed? Let us know so that we can add it to the documentation.

Thanks.

@ferin08

ferin08 commented Jun 22, 2018

How do we use '-s' to "launch test in server mode"?

Also, I tested a few images using --rawdump,
but the output in output.dimg isn't changing (it's the same output as the first time).
I am removing the file before testing each time. Is there something I am missing?
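
For reference, what I am doing each time is roughly the following (image names are placeholders):

rm -f output.dimg
./nvdla_runtime --loadable basic.nvdla --image image0.pgm --rawdump
cat output.dimg    # the values here stay the same even when the image changes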

@prasshantg
Collaborator

Which model are you using to test?

@ferin08

ferin08 commented Jun 22, 2018

lenet

@ferin08

ferin08 commented Jun 25, 2018

Still have not been able to resolve this issue.

"Also, I tested a few images using --rawdump.
But the output in output.dimg isn't changing (It the same output as the first time).
I am removing the file before testing each time. Is there something that I am missing?"

Is there any way we can solve this problem?

@Adithyak1998

@prasshantg @smsharif1991
I also have the issue where my output.dimg isn't getting updated; it's stuck with the first output values I got. How do I fix this?
