Encountered errors while executing training process #2 #39

Closed
Ma5onic opened this issue Aug 22, 2022 · 8 comments

Ma5onic commented Aug 22, 2022

(Using Leaderboard_B)
First, conda got stuck at the "Solving environment" step: I let it sit for 30 minutes, but it never finished creating the env from the yml.
Because I was using a cloud instance, I didn't have time to wait, so I did this instead:

conda create -n mdx-net
conda update conda
conda config --add channels conda-forge
conda activate mdx-net
sudo apt-get install soundstretch
python -m pip install -r requirements.txt
python src/utils/data_augmentation.py --data_dir /real/path/to/musdbhq/ --train True --test True

It seems that the model doesn't allow me to train it with songs that don't contain vocals.

python src/utils/data_augmentation.py --data_dir /home/ubuntu/mdx-files/musdb/ --train True --test True
 10%|███████████████▉                                                                                                                                                     | 11/114 [01:13<11:25,  6.65s/it]
Traceback (most recent call last):
  File "src/utils/data_augmentation.py", line 111, in <module>
    main(parser.parse_args())
  File "src/utils/data_augmentation.py", line 30, in main
    save_shifted_dataset(p, t, data_dir, 'train')
  File "src/utils/data_augmentation.py", line 92, in save_shifted_dataset
    source = load_wav(in_path.joinpath(s_name+'.wav'))
  File "src/utils/data_augmentation.py", line 102, in load_wav
    return sf.read(path, samplerate=sr, dtype='float32')[0].T
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 256, in read
    with SoundFile(file, 'r', samplerate, channels,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/home/ubuntu/mdx-files/musdb/train/Artificial Intelligence - Native Instruments/vocals.wav': System error.
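
A "System error" from soundfile when opening a path often just means that the file is missing or unreadable, so it may help to check every track folder before running the augmentation script. The sketch below is a hypothetical helper, not part of mdx-net. It assumes the usual MUSDB18-HQ layout (`train/<track>/` and `test/<track>/`, each holding `mixture.wav`, `bass.wav`, `drums.wav`, `other.wav` and `vocals.wav`) and reports stems that are missing or empty.

```python
from pathlib import Path

# Assumed MUSDB18-HQ stem names; adjust if your copy of the dataset differs.
STEMS = ("mixture", "bass", "drums", "other", "vocals")


def find_bad_stems(data_dir, subsets=("train", "test"), stems=STEMS):
    """Return (track_dir, file_name) pairs for stems that are missing or empty."""
    bad = []
    for subset in subsets:
        subset_dir = Path(data_dir) / subset
        if not subset_dir.is_dir():
            continue  # subset folder not present, nothing to check
        # Visit track folders in a stable order so the report is reproducible.
        for track_dir in sorted(p for p in subset_dir.iterdir() if p.is_dir()):
            for stem in stems:
                wav = track_dir / f"{stem}.wav"
                if not wav.is_file() or wav.stat().st_size == 0:
                    bad.append((str(track_dir), wav.name))
    return bad
```

Calling `find_bad_stems("/home/ubuntu/mdx-files/musdb/")` would list the offending tracks up front instead of failing partway through the run. It only checks presence and size, so a corrupt but non-empty file would still slip through.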

I deleted the songs that didn't contain vocals, and then the data augmentation succeeded, but every attempt to train failed and I didn't have time to debug on the cloud GPU instance.

Here is the output from: python run.py experiment=multigpu_other model=ConvTDFNet_other

/usr/lib/python3/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /usr/lib/python3/dist-packages/torchvision/image.so: undefined symbol: _ZNK3c106IValue23reportToTensorTypeErrorEv
  warn(f"Failed to load image Python extension: {e}")
Traceback (most recent call last):
  File "run.py", line 7, in <module>
    from pytorch_lightning.utilities import rank_zero_info
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 16, in <module>
    from torchmetrics import Accuracy as _Accuracy
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit, pit_permutate
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/audio/__init__.py", line 26, in <module>
    from torchmetrics.functional.audio.pesq import perceptual_evaluation_speech_quality  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/audio/pesq.py", line 20, in <module>
    import pesq as pesq_backend
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pesq/__init__.py", line 5, in <module>
    from ._pesq import pesq, pesq_batch
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pesq/_pesq.py", line 8, in <module>
    from .cypesq import cypesq, cypesq_retvals, cypesq_error_message as pesq_error_message
  File "__init__.pxd", line 238, in init cypesq
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
@Ma5onic Ma5onic changed the title Encountered an error while executing training process Encountered errors while executing training process 2 Aug 22, 2022
@Ma5onic Ma5onic changed the title Encountered errors while executing training process 2 Encountered errors while executing training process #2 Aug 22, 2022
@KimberleyJensen

@Ma5onic try pip install --upgrade numpy
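
For context, a `numpy.ndarray size changed` error usually means that a compiled extension (here `pesq`'s `cypesq` module) was built against a different NumPy version than the one being imported at runtime. That is why changing the NumPy version is a common first thing to try. Before changing anything, it can help to record what is actually installed. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The package list is taken from the traceback in the first post and is only an assumption about what is relevant.

```python
from importlib import metadata

# Distributions that show up in the traceback; extend the tuple as needed.
PACKAGES = ("numpy", "pesq", "torchmetrics", "pytorch-lightning", "torch", "torchvision")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:20s} {version or 'not installed'}")
```

Comparing this output before and after an upgrade shows whether NumPy really changed and which other packages moved with it.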

@Satisfy256

Had the same issue. Fixed it by installing old dependencies from around 2021 (attached file: requirements.txt).

Ma5onic commented Sep 24, 2022

> @Ma5onic try pip install --upgrade numpy

@KimberleyJensen Thanks, but the newest version of numpy is incompatible.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.2 requires setuptools<60, but you have setuptools 63.4.1 which is incompatible.
hydra-optuna-sweeper 1.1.0.dev2 requires numpy<1.20.0, but you have numpy 1.23.3 which is incompatible.

Maybe I should have used conda to update instead. Thanks anyways.
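
The pip output above describes two upper-bound violations: upgrading NumPy to 1.23.3 conflicts with the `numpy<1.20.0` pin, and setuptools 63.4.1 conflicts with `setuptools<60`. The sketch below replays those two checks with a deliberately simplified, hypothetical version parser. It only handles plain dotted numbers such as `1.23.3`; for real requirement strings the `packaging` library is the more robust choice.

```python
def parse_version(text):
    """Turn a plain dotted release string like '1.23.3' into (1, 23, 3)."""
    return tuple(int(part) for part in text.split("."))


def violates_upper_bound(installed, bound):
    """True if `installed` is not strictly below the exclusive bound."""
    return parse_version(installed) >= parse_version(bound)


# (package, installed version, exclusive upper bound, who sets the bound),
# copied from the pip output quoted above.
REPORTED = [
    ("setuptools", "63.4.1", "60", "numba 0.56.2"),
    ("numpy", "1.23.3", "1.20.0", "hydra-optuna-sweeper 1.1.0.dev2"),
]

for package, installed, bound, required_by in REPORTED:
    if violates_upper_bound(installed, bound):
        print(f"{package} {installed} breaks '{package}<{bound}' required by {required_by}")
```

Both entries are flagged, which matches pip's report and suggests that a single consistent set of older pins is easier to satisfy than upgrading one package at a time.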

@Satisfy256 Ouhhh! interesting, okay I'll nuke my current install and start over lol.

> Had the same issue. Fixed by installing old dependencies from around 2021. requirements.txt

Ma5onic commented Sep 25, 2022

@KimberleyJensen, you're onto something though: the current requirements.txt also seems to contain an issue related to the one you mentioned here. The requirements.txt that @Satisfy256 mentioned lists demucs<=2.0.3 as a dependency... that file might be a little hidden gem, because I could not find it in the committed file history:
https://github.com/kuielab/mdx-net/commits/main/requirements.txt
Same with the leaderboard B branch/tree https://github.com/kuielab/mdx-net/commits/Leaderboard_B/requirements.txt

Still waiting for conda to solve the environment 😢

@Satisfy256

@Ma5onic I modified the requirements.txt to use old versions. I tested it out and it works for me in Ubuntu 20.04

Ma5onic commented Sep 29, 2022

@Satisfy256 okay, sick. That gives me hope, I'll start from scratch and try again.

> @Ma5onic I modified the requirements.txt to use old versions. I tested it out and it works for me in Ubuntu 20.04

Ma5onic commented Sep 30, 2022

yay! it works!!!
Thank you very much

@Ma5onic Ma5onic closed this as completed Sep 30, 2022

Ma5onic commented Oct 1, 2022

Linux users with RTX cards, or anyone using a cloud instance, will encounter dependency issues unrelated to the solution above. The PyTorch landing page shows how the install commands differ based on your OS/env.
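
One way to see which PyTorch build actually ended up in an environment is to ask PyTorch itself. This is a minimal sketch, not something from the mdx-net repo. It assumes nothing beyond PyTorch's public attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`) and falls back gracefully when PyTorch is not installed.

```python
def describe_torch():
    """Report the installed PyTorch build, or note that it is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


print(describe_torch())
```

If `built_for_cuda` is `None`, or `cuda_available` is `False` on a machine where `nvidia-smi` lists GPUs, the installed wheel probably does not match the driver/CUDA setup, and reinstalling with the command from the PyTorch landing page for that OS/env is likely the next step.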
