
cannot run "neon --gpu cudanet examples/convnet/i1k-alexnet-fp32.yaml" on tegra k1 #51

Closed
yuehusile opened this issue Jun 18, 2015 · 5 comments

Comments

@yuehusile

root@tegra-ubuntu:/home/hsl/neon# neon --gpu cudanet examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.util.persist:deserializing object from: examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.datasets.imageset:Imageset initialized with dtype <type 'numpy.float32'>
2015-06-19 12:47:19,332 WARNING:neon - setting log level to: 20
Traceback (most recent call last):
File "/usr/local/bin/neon", line 240, in
experiment, result, status = main()
File "/usr/local/bin/neon", line 202, in main
device_id=args.device_id)
File "/usr/local/lib/python2.7/dist-packages/neon/backends/init.py", line 157, in gen_backend
raise RuntimeError("Can't find CUDA capable GPU")
RuntimeError: Can't find CUDA capable GPU

I am using a Tegra K1 GPU.

@scttl
Contributor

scttl commented Jun 20, 2015

Since you appear to be running as root on Ubuntu, can you first make sure that nvidia-smi is in that user's PATH and produces sensible output when run from the command line? It doesn't look like this command is being found.

I'd also suggest having a look at the items in our installation FAQ: http://neon.nervanasys.com/docs/latest/faq.html
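
(For reference, a minimal sketch of that kind of check from Python, using only the standard library; it mirrors the shell-out described further down in this thread and is not any specific neon API:)

# Sketch: check whether nvidia-smi is on PATH and exits cleanly, which is a
# rough proxy for "a CUDA-capable GPU driver is installed and visible".
import os

gpuflag = os.system("nvidia-smi > /dev/null 2>&1") == 0
print("nvidia-smi found and usable: %s" % gpuflag)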

@yuehusile
Author

Hi scttl, thanks for your reply. You are right that the nvidia-smi command is not found.
I've checked my installation and configuration carefully, and it seems to be a Tegra K1 specific problem.
I am trying to run neon on the NVIDIA Jetson TK1 devkit. NVML is not supported on the Jetson TK1, so the nvidia-smi command is not available even when the CUDA installation is fine.
Is NVML required to run the neon demos, or is there any way to solve my problem without NVML?

P.S.: I can run Caffe with cuDNN with no problem on the Jetson TK1, so I guess the CUDA installation is all right.

@scttl
Contributor

scttl commented Jun 29, 2015

nvidia-smi is not required to run any of the examples, we just use it as a proxy to validate that the user has the CUDA SDK installed. Provided you were able to install the cudanet python library ok, for the moment you can work around your issue by editing neon/backends/__init__.py to replace the line:
gpuflag = (os.system("nvidia-smi > /dev/null 2>&1") == 0)
with
gpuflag = (os.system("nvcc --version > /dev/null 2>&1") == 0)

We made a similar change in the Makefile a while back, but needed to update the check here as well. I've created a fix, and will get this merged into master for the next release of neon.
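
(For illustration only, here is a sketch of what a more permissive detection could look like, falling back to nvcc when nvidia-smi is unavailable; this shows the idea rather than the exact patch that was merged:)

# Sketch: treat either a working nvidia-smi or a working nvcc as evidence of
# a usable CUDA setup. On a Jetson TK1 (no NVML) only the nvcc probe succeeds.
import os

def cuda_capable_gpu_present():
    for probe in ("nvidia-smi", "nvcc --version"):
        if os.system("%s > /dev/null 2>&1" % probe) == 0:
            return True
    return False

gpuflag = cuda_capable_gpu_present()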

@yuehusile
Author

Thanks scttl! Editing neon/backends/__init__.py works, but I still can't run this example because another error appears:
root@tegra-ubuntu:/home/hsl/neon# neon --gpu cudanet examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.util.persist:deserializing object from: examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.datasets.imageset:Imageset initialized with dtype <type 'numpy.float32'>
2015-07-01 04:52:29,170 WARNING:neon - setting log level to: 20
2015-07-01 04:52:31,733 INFO:init - Cudanet backend, RNG seed: None, numerr: None
2015-07-01 04:52:31,735 INFO:mlp - Layers:
ImageDataLayer d0: 3 x (224 x 224) nodes
ConvLayer conv1: 3 x (224 x 224) inputs, 64 x (55 x 55) nodes, RectLin act_fn
PoolingLayer pool1: 64 x (55 x 55) inputs, 64 x (27 x 27) nodes, Linear act_fn
ConvLayer conv2: 64 x (27 x 27) inputs, 192 x (27 x 27) nodes, RectLin act_fn
PoolingLayer pool2: 192 x (27 x 27) inputs, 192 x (13 x 13) nodes, Linear act_fn
ConvLayer conv3: 192 x (13 x 13) inputs, 384 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv4: 384 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv5: 256 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
PoolingLayer pool3: 256 x (13 x 13) inputs, 256 x (6 x 6) nodes, Linear act_fn
FCLayer fc4096a: 9216 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout1: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc4096b: 4096 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout2: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc1000: 4096 inputs, 1000 nodes, Softmax act_fn
CostLayer cost: 1000 nodes, CrossEntropy cost_fn

2015-07-01 04:52:31,738 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,228 INFO:val_init - Generating AutoUniformValGen values of shape (363, 64)
2015-07-01 04:52:32,254 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,340 INFO:val_init - Generating AutoUniformValGen values of shape (1600, 192)
2015-07-01 04:52:32,370 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,432 INFO:val_init - Generating AutoUniformValGen values of shape (1728, 384)
2015-07-01 04:52:32,506 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,552 INFO:val_init - Generating AutoUniformValGen values of shape (3456, 256)
2015-07-01 04:52:32,602 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,639 INFO:val_init - Generating AutoUniformValGen values of shape (2304, 256)
2015-07-01 04:52:32,691 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,702 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 9216)
2015-07-01 04:52:34,805 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:34,813 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 4096)
2015-07-01 04:52:35,728 INFO:val_init - Generating AutoUniformValGen values of shape (1000, 4096)
Traceback (most recent call last):
File "/usr/local/bin/neon", line 240, in
experiment, result, status = main()
File "/usr/local/bin/neon", line 207, in main
experiment.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit_predict_err.py", line 62, in initialize
super(FitPredictErrorExperiment, self).initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit.py", line 62, in initialize
self.model.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/models/mlp.py", line 68, in initialize
dtype=self.layers[1].deltas_dtype)
File "/usr/local/lib/python2.7/dist-packages/neon/backends/cc2.py", line 536, in zeros
dtype=dtype)),
MemoryError

Is memory size the problem? The Tegra K1 has 2 GB of memory. Or is something else causing this? Any advice on how to find out what happened?

@apark263
Contributor

Try reducing your batch size down to 32 and see if the problem still exists. If it runs, then you probably don't have enough memory to train at mb=128.

Is there any particular reason you are using this system to train? You would get much better performance by using a more standard graphics card.
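
(To get a feel for why batch size matters here, below is a rough back-of-the-envelope estimate of the fp32 activation memory for the layer output shapes printed in the log above. It ignores weights, gradients and cuda-convnet2 workspace buffers, so treat the numbers as illustrative only:)

# Rough fp32 activation-memory estimate for the layer output shapes shown in
# the log above (4 bytes per value). Weights, gradients and workspace buffers
# are ignored, so the real footprint is considerably larger.
output_sizes = {
    'd0':        3 * 224 * 224,
    'conv1':    64 * 55 * 55,
    'pool1':    64 * 27 * 27,
    'conv2':   192 * 27 * 27,
    'pool2':   192 * 13 * 13,
    'conv3':   384 * 13 * 13,
    'conv4':   256 * 13 * 13,
    'conv5':   256 * 13 * 13,
    'pool3':   256 * 6 * 6,
    'fc4096a': 4096, 'dropout1': 4096,
    'fc4096b': 4096, 'dropout2': 4096,
    'fc1000':  1000,
}

for batch in (128, 32):
    activation_bytes = sum(output_sizes.values()) * batch * 4
    print("batch %3d: ~%d MB of fp32 activations" % (batch, activation_bytes // 2**20))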

@scttl closed this as completed in 5e77260 on Jul 7, 2015