Abandon (core dumped) while sampling #30

Closed
Correlation4 opened this issue May 3, 2020 · 2 comments

@Correlation4

I have been running this on Ubuntu 18.04.4 with an NVIDIA GT740M, which is not optimal. Regardless of the model used, it always stops with the same error.

Input:
python jukebox/sample.py --model=1b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=2 --hop_fraction=0.5,0.5,0.125

Output:
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 2, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /home/XXXX/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/XXXX/XXXX/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/XXXX/XXXX/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Traceback (most recent call last):
  File "jukebox/sample.py", line 237, in <module>
    fire.Fire(run)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 234, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/sample.py", line 157, in save_samples
    vqvae, priors = make_model(model, device, hps)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 185, in make_model
    priors = [make_prior(setup_hparams(priors[level], dict()), vqvae, 'cpu') for level in levels]
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 185, in <listcomp>
    priors = [make_prior(setup_hparams(priors[level], dict()), vqvae, 'cpu') for level in levels]
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 169, in make_prior
    restore(hps, prior, hps.restore_prior)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 54, in restore
    checkpoint = load_checkpoint(checkpoint_path)
  File "/home/XXXX/XXXX/jukebox/jukebox/make_models.py", line 37, in load_checkpoint
    checkpoint = t.load(restore, map_location=t.device('cpu'))
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 43488 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1579040055865/work/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ff60d6aa627 in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14879df (0x7ff61085c9df in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7ff611024fe7 in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x5639bd (0x7ff63e9f29bd in /home/XXXX/.conda/envs/jukebox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #27: __libc_start_main + 0xe7 (0x7ff64dd90b97 in /lib/x86_64-linux-gnu/libc.so.6)

Abandon (core dumped)

I can't tell whether this comes from my GPU being incompatible, and since I'm completely new to this, I don't know enough to interpret every error in the output above. I might be wrong, but if the GPU were the issue, I would not expect to get this particular error.

@ZVK

ZVK commented May 5, 2020

RuntimeError: unexpected EOF, expected 43488 more bytes. The file might be corrupted.

This happened when loading the checkpoint. Make sure the model was able to fully download from GCE to ~/.cache/jukebox-assets/ (the cache path shown in your log). This error can happen when the download is interrupted or when there is not enough space on the disk.
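
One way to act on this is to look at what actually landed in the cache and how much room is left on that disk. The following is only a minimal sketch using the Python standard library; `cache_report` is a hypothetical helper name, not part of Jukebox, and the cache directory is taken from the log in the first post. It cannot say what size a file should be, only what is there.

```python
import os
import shutil


def cache_report(cache_dir):
    """Return (files, free_bytes) for a download cache directory.

    files is a sorted list of (relative_path, size_in_bytes) tuples;
    free_bytes is the free space on the disk that holds cache_dir.
    Hypothetical helper for illustration, not part of Jukebox.
    """
    cache_dir = os.path.expanduser(cache_dir)
    files = []
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            path = os.path.join(root, name)
            files.append((os.path.relpath(path, cache_dir), os.path.getsize(path)))
    free_bytes = shutil.disk_usage(cache_dir).free
    return sorted(files), free_bytes
```

Calling `cache_report("~/.cache/jukebox-assets")` and printing the result shows each cached checkpoint with its size. A file that is much smaller than expected, or a free-space figure close to zero, would point to a truncated download; deleting that file so it is fetched again on the next run is the usual remedy.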

@Correlation4
Author

Thanks for your quick answer.
So if I understand correctly, this has nothing to do with my GPU being a really low-end model? According to the documentation, "The hps are for a V100 GPU with 16 GB GPU memory." I'm quite confused about what this means. Moreover, I still have 12.7 GB free on my root partition. That seems like enough, since the largest model weighs about 11 GB, but I guess it requires considerably more space overall, is that right?
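
The disk-space question can be framed as simple arithmetic. The sketch below uses only the two figures quoted in this comment (12.7 GB free, about 11 GB for the largest weights); `headroom_gb` is a hypothetical helper, and the sizes of any other files that get downloaded are not known from this thread, so they are left for the reader to append.

```python
def headroom_gb(free_gb, download_sizes_gb):
    """Free space left (in GB) after storing every download in the list."""
    return free_gb - sum(download_sizes_gb)


# Figures quoted in the comment above; nothing else is assumed here.
free_gb = 12.7
largest_weights_gb = 11.0

# Append the sizes of any additional checkpoints to this list.
remaining = headroom_gb(free_gb, [largest_weights_gb])
print(round(remaining, 1))  # 1.7
```

With only the largest file counted, about 1.7 GB remains, so any further downloads totalling more than that would not fit on the partition.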
