Abandon (core dumped) while sampling #30

Correlation4 · 2020-05-03T21:49:08Z

I have been running this on Ubuntu 18.04.4 with an NVIDIA GT740M, which is far from optimal. Regardless of the model used, it always stops with the same error.

Input:
python jukebox/sample.py --model=1b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=2 --hop_fraction=0.5,0.5,0.125

Output:
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 2, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /home/XXXX/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/XXXX/XXXX/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/XXXX/XXXX/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Traceback (most recent call last):
  File "jukebox/sample.py", line 237, in <module>
    fire.Fire(run)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 234, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/sample.py", line 157, in save_samples
    vqvae, priors = make_model(model, device, hps)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 185, in make_model
    priors = [make_prior(setup_hparams(priors[level], dict()), vqvae, 'cpu') for level in levels]
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 185, in <listcomp>
    priors = [make_prior(setup_hparams(priors[level], dict()), vqvae, 'cpu') for level in levels]
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 169, in make_prior
    restore(hps, prior, hps.restore_prior)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 54, in restore
    checkpoint = load_checkpoint(checkpoint_path)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 37, in load_checkpoint
    checkpoint = t.load(restore, map_location=t.device('cpu'))
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 43488 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ff60d6aa627 in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14879df (0x7ff61085c9df in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7ff611024fe7 in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x5639bd (0x7ff63e9f29bd in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #27: __libc_start_main + 0xe7 (0x7ff64dd90b97 in /lib/x86_64-linux-gnu/libc.so.6)

Abandon (core dumped)

I can't tell whether this comes from my GPU being incompatible, and being completely new to this, I don't know enough to interpret every error in the output above. I might be wrong, but I suspect that if the GPU were the problem, I would be getting a different error.


ZVK · 2020-05-05T18:13:52Z

RuntimeError: unexpected EOF, expected 43488 more bytes. The file might be corrupted.

This happened when loading the checkpoint. Make sure the model was fully downloaded from GCE to ~/.cache/jukebox-assets/ (the directory shown in the "Restored from" line of your log). This error can occur when the download is interrupted or when there is not enough space on the disk.
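
One way to check both possible causes is to list the cached checkpoint files with their sizes and look at how much free space is left on that disk. The snippet below is only a rough sketch using the Python standard library: the cache location is taken from the "Restored from" line in the log above, the `*.pth.tar` pattern is assumed from the vqvae file name, and `report_checkpoints`, `looks_truncated` and `expected_bytes` are hypothetical names introduced for illustration, not part of jukebox. The expected size of a checkpoint is not known here, so it has to be supplied by hand.

```python
import shutil
from pathlib import Path

# Assumption: cache location as printed in the "Restored from ..." log line.
CACHE_DIR = Path.home() / ".cache" / "jukebox-assets"


def report_checkpoints(cache_dir=CACHE_DIR):
    """Return (files, free_bytes): cached checkpoint sizes and free disk space."""
    cache_dir = Path(cache_dir)
    files = {}
    if cache_dir.is_dir():
        # Assumption: checkpoints end in .pth.tar, like vqvae.pth.tar in the log.
        for path in sorted(cache_dir.rglob("*.pth.tar")):
            files[str(path.relative_to(cache_dir))] = path.stat().st_size
    # Measure free space on the nearest existing ancestor of the cache directory.
    probe = cache_dir
    while not probe.exists():
        probe = probe.parent
    free_bytes = shutil.disk_usage(probe).free
    return files, free_bytes


def looks_truncated(path, expected_bytes):
    """True if the file on disk is smaller than expected_bytes.

    expected_bytes is a hypothetical input: the full size of the remote file,
    which you have to look up yourself.
    """
    return Path(path).stat().st_size < expected_bytes


if __name__ == "__main__":
    files, free_bytes = report_checkpoints()
    for name, size in files.items():
        print(f"{name}: {size / 1e9:.2f} GB")
    print(f"free space: {free_bytes / 1e9:.2f} GB")
```

If one of the listed files is clearly smaller than the others of its kind, or the free space is close to zero, the download was most likely cut short. Deleting the truncated file and rerunning the sampling command should then fetch it again, assuming the loader only downloads files that are missing from the cache.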

Correlation4 · 2020-05-06T16:21:31Z

Thanks for your quick answer.
So if I understand correctly, this has nothing to do with my GPU being a really low-end model? According to the documentation, "The hps are for a V100 GPU with 16 GB GPU memory." I'm quite confused about what this means. Moreover, I still have 12.7 GB unused on my root partition. This seems to be enough, as the largest model weighs about 11 GB, but I guess it requires far more space overall, is that right?

ZVK mentioned this issue May 5, 2020

hey help when runnig the first sample test #38

Closed

prafullasd closed this as completed May 5, 2020
