Potential issue for ffdl-databroker_s3/caffe-model job #8

Closed
Tomcli opened this issue Feb 12, 2018 · 4 comments
Labels
bug Something isn't working

Comments

Tomcli commented Feb 12, 2018

When running the caffe-model job with FfDL, databroker_s3 consistently fails to pull one of the files from S3, s3://mnist_lmdb_data/train/data.mdb:

Using Object Storage account test at http://s3.default.svc.cluster.local
Download start: Mon Feb 12 19:38:10 UTC 2018
Downloading from bucket mnist_lmdb_data to /job/mnist_lmdb_data
Completed 256.0 KiB/68.8 MiB (213.3 KiB/s) with 4 file(s) remaining
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 4 file(s) remaining
download: s3://mnist_lmdb_data/test/lock.mdb to job/mnist_lmdb_data/test/lock.mdb
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 3 file(s) remaining
Completed 520.0 KiB/68.8 MiB (259.9 KiB/s) with 3 file(s) remaining
Completed 776.0 KiB/68.8 MiB (369.5 KiB/s) with 3 file(s) remaining
Completed 1.0 MiB/68.8 MiB (469.2 KiB/s) with 3 file(s) remaining  
Completed 1.3 MiB/68.8 MiB (585.1 KiB/s) with 3 file(s) remaining  
Completed 1.5 MiB/68.8 MiB (701.2 KiB/s) with 3 file(s) remaining  
Completed 1.8 MiB/68.8 MiB (817.2 KiB/s) with 3 file(s) remaining  
Completed 2.0 MiB/68.8 MiB (933.2 KiB/s) with 3 file(s) remaining  
Completed 2.3 MiB/68.8 MiB (825.7 KiB/s) with 3 file(s) remaining  
Completed 2.5 MiB/68.8 MiB (916.3 KiB/s) with 3 file(s) remaining  
Completed 2.8 MiB/68.8 MiB (973.5 KiB/s) with 3 file(s) remaining  
Completed 3.0 MiB/68.8 MiB (1.0 MiB/s) with 3 file(s) remaining    
Completed 3.3 MiB/68.8 MiB (1.1 MiB/s) with 3 file(s) remaining    
Completed 3.5 MiB/68.8 MiB (1.2 MiB/s) with 3 file(s) remaining    
Completed 3.8 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining    
Completed 4.0 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining    
Completed 4.3 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining    
Completed 4.5 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining    
Completed 4.8 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining    
Completed 5.0 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining    
Completed 5.3 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining    
Completed 5.5 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 5.6 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 5.9 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining    
Completed 6.1 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 6.4 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining    
Completed 6.6 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
Completed 6.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 7.1 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
Completed 7.4 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 7.6 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining    
Completed 7.9 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining    
Completed 8.1 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining    
Completed 8.4 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining    
Completed 8.6 MiB/68.8 MiB (2.2 MiB/s) with 3 file(s) remaining    
Completed 8.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining    
Completed 8.9 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining    
download: s3://mnist_lmdb_data/train/lock.mdb to job/mnist_lmdb_data/train/lock.mdb
Killed
Killed
download failed: s3://mnist_lmdb_data/train/data.mdb to job/mnist_lmdb_data/train/data.mdb [Errno 12] Cannot allocate memory

I also tried increasing the job memory and switching to IBM Cloud Object Storage, and still hit the same issue. So I believe the cause could be either:

  1. The https://github.com/albarji/caffe-demos/blob/master/mnist/mnist_train_lmdb/data.mdb file is corrupted and we should use a different dataset.
    or
  2. ffdl-databroker_s3 may have a bug when pulling certain files.
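
One way to check the first hypothesis is to compare a checksum of the data.mdb in the bucket's source against a copy fetched by some other means. A minimal Python sketch, assuming both copies are available as local files; the helper names `sha256_of_file` and `same_content` and the 1 MiB chunk size are illustrative, not part of FfDL. Reading in fixed-size chunks keeps memory use flat regardless of file size:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files hash to the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

If the two digests match, the file itself is probably intact and the second hypothesis (or a resource limit in the download step) becomes more likely.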

whummer commented Feb 12, 2018

We may have to increase the loadTrainingDataMemInMB configuration:

loadTrainingDataMemInMB=100
It's definitely not ideal that this value is currently hardcoded :/ , but can you give it a try with, say, a value of 300?
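
As a rough sketch of what un-hardcoding this value could look like, the limit might be read from an environment variable with a fallback default. This is Python for illustration only and is not FfDL's actual code; the variable name `LOAD_TRAINING_DATA_MEM_MB` and the helper function are hypothetical:

```python
import os

# Fallback matching the hardcoded value quoted above.
DEFAULT_LOAD_TRAINING_DATA_MEM_MB = 100


def load_training_data_mem_mb(env=None):
    """Return the data-loader memory limit in MB.

    Reads the hypothetical LOAD_TRAINING_DATA_MEM_MB variable and falls
    back to the default when it is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("LOAD_TRAINING_DATA_MEM_MB")
    if raw is None or raw.strip() == "":
        return DEFAULT_LOAD_TRAINING_DATA_MEM_MB
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            "LOAD_TRAINING_DATA_MEM_MB must be an integer, got %r" % raw
        ) from None
    if value <= 0:
        raise ValueError("LOAD_TRAINING_DATA_MEM_MB must be positive")
    return value
```

With something like this, trying a value of 300 would only need the variable set before the job starts, rather than a code change.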

Tomcli commented Feb 12, 2018

@whummer increasing loadTrainingDataMemInMB did solve the problem, thanks.

@animeshsingh

Let's make 300 the default?

Tomcli commented Feb 13, 2018

Sure, 300 seems to work for all of our examples.

Tomcli added the bug label Feb 21, 2018
sboagibm pushed a commit to sboagibm/FfDL that referenced this issue May 20, 2018
hard-coded disable push metrics