
Matplotlib API change & NaNs for short clips & new hop_length #8

Closed · thorstenMueller opened this issue Oct 11, 2020 · 27 comments
Labels: question (Further information is requested), update

@thorstenMueller

I'm trying to run training on an NVIDIA Xavier AGX device, using an NVIDIA Docker container set up according to these https://ngc.nvidia.com/catalog/containers/nvidia:l4t-pytorch instructions.

But I receive the following error:

Initializing logger...
Initializing model...
Number of parameters: 15810401
Initializing optimizer, scheduler and losses...
Initializing data loaders...

Traceback (most recent call last):
  File "train.py", line 185, in <module>
    run(config, args)
  File "train.py", line 72, in run
    logger.log_specs(0, specs)
  File "/media/908f901d-e80b-4a8e-8a16-9e0f1b896732/TTS/thorsten-de/models/model-v02/WaveGrad/logger.py", line 53, in log_specs
    self.add_image(key, plot_tensor_to_numpy(image), iteration, dataformats='HWC')
  File "/media/908f901d-e80b-4a8e-8a16-9e0f1b896732/TTS/thorsten-de/models/model-v02/WaveGrad/utils.py", line 66, in plot_tensor_to_numpy
    im = ax.imshow(tensor, aspect="auto", origin="bottom", interpolation='none', cmap='hot')
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/__init__.py", line 1438, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_axes.py", line 5521, in imshow
    resample=resample, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/image.py", line 905, in __init__
    **kwargs
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/image.py", line 246, in __init__
    cbook._check_in_list(["upper", "lower"], origin=origin)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/cbook/__init__.py", line 2257, in _check_in_list
    .format(v, k, ', '.join(map(repr, values))))
ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'

python3 -V: Python 3.6.9
pip3 -V: 20.2.3

Running pip3 list shows the following installed packages:

absl-py (0.10.0)
appdirs (1.4.4)
cachetools (4.1.1)
certifi (2020.6.20)
chardet (3.0.4)
cycler (0.10.0)
Cython (0.29.20)
decorator (4.4.2)
future (0.18.2)
google-auth (1.22.1)
google-auth-oauthlib (0.4.1)
grpcio (1.32.0)
idna (2.10)
importlib-metadata (2.0.0)
kiwisolver (1.2.0)
Mako (1.1.3)
Markdown (3.3)
MarkupSafe (1.1.1)
matplotlib (3.3.1)
numpy (1.19.0)
oauthlib (3.1.0)
Pillow (7.2.0)
pip (9.0.1)
protobuf (3.13.0)
pyasn1 (0.4.8)
pyasn1-modules (0.2.8)
pycuda (2019.1.2)
pyparsing (2.4.7)
python-dateutil (2.8.1)
pytools (2020.3.1)
requests (2.24.0)
requests-oauthlib (1.3.0)
rsa (4.6)
setuptools (50.3.0)
six (1.15.0)
tensorboard (2.3.0)
tensorboard-plugin-wit (1.7.0)
torch (1.6.0)
torchaudio (0.6.0a0+d6f81d1)
torchvision (0.7.0a0+6631b74)
tqdm (4.50.2)
urllib3 (1.25.10)
Werkzeug (1.0.1)
wheel (0.35.1)
zipp (3.3.0)

I tried matplotlib 3.3.1 and 3.3.2, both with the same result.

Any ideas what I'm missing?
Thank you.

@ivanvovk (Owner)

Hello. That's strange. Maybe they changed the API in the latest versions. The matplotlib version I am using is 3.2.1 and it's OK. Try two things:

  1. Change the value of the keyword argument origin in this line from "bottom" to "lower" (see the sketch below). I guess it should do the same thing.
  2. If the first step doesn't help, then try downgrading to version 3.2.1.
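
For reference, a minimal sketch of what the corrected helper in utils.py could look like. Only the origin="lower" change is taken from this thread; the rest of the function body is illustrative:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, e.g. inside a Docker container
import matplotlib.pyplot as plt
import numpy as np


def plot_tensor_to_numpy(tensor):
    """Render a 2D array (e.g. a mel-spectrogram) into an HxWxC uint8 image."""
    fig, ax = plt.subplots(figsize=(12, 4))
    # matplotlib 3.3+ validates origin and only accepts 'upper' or 'lower';
    # 3.2.x still tolerated 'bottom'.
    im = ax.imshow(tensor, aspect="auto", origin="lower", interpolation="none", cmap="hot")
    fig.colorbar(im, ax=ax)
    fig.canvas.draw()
    data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
    data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    plt.close(fig)
    return data
```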

@thorstenMueller (Author)

Thanks for your quick support.
Changing "bottom" to "lower" worked for me. I've made a mini PR - hopefully it's helpful :-).

@thorstenMueller (Author)

Training went well (reproducibly) until iteration 55. Then it runs into problems calculating the loss stats.

Iteration: 52 | Losses: [15.780592918395996, 821.0237426757812]
Iteration: 53 | Losses: [4.594686985015869, 205.12646484375]
Iteration: 54 | Losses: [2.3868210315704346, 97.16974639892578]
Iteration: 55 | Losses: [1.1524507999420166, 78.44384002685547]
Iteration: 56 | Losses: [nan, nan]
Iteration: 57 | Losses: [nan, nan]
Iteration: 58 | Losses: [nan, nan]

Any idea on that?
Maybe I'll try to compile matplotlib 3.2.1 and run with the original "bottom" code.

@ivanvovk (Owner)

No, it's not connected to matplotlib. It's the loss explosion problem, which occurs sometimes. Try setting lr to 5e-4 and scheduler_gamma to 0.9 in the config, as mentioned in issue #3.
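
For clarity, these are the two training_config keys to change (key names as in the configs quoted later in this thread; the values are the ones suggested above):

```json
"lr": 5e-4,
"scheduler_gamma": 0.9
```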

ivanvovk added the question (Further information is requested) and update labels on Oct 12, 2020
@thorstenMueller (Author)

Thanks for your reply, and sorry that I hadn't seen this existing helpful issue before.
Sadly, setting lr to 5e-4 and scheduler_gamma to 0.9 didn't change anything.

After reducing the batch size, the NaN problem occurs later:

  • Default batch size 48: NaN at step 56
  • Batch size 32: NaN at step 86
  • Batch size 16: NaN at step 184

Is there a better place for a best-practice config discussion than within this issue?

@ivanvovk (Owner)

@thorstenMueller, I am planning to push a new version of WaveGrad soon, which should be more robust to the loss explosion problem. Please check it in a few days.

@thorstenMueller (Author)

Thanks, sounds good.
I'll test again as soon as you've pushed a new version.

@ivanvovk (Owner)

@thorstenMueller Hello, sorry for being a bit late. I have updated the repo. I believe it should be more robust to the loss explosion issue now.

@thorstenMueller (Author)

thorstenMueller commented Oct 25, 2020

Hey @ivanvovk.

Thanks for the huge update 👍. I've set up a training run with my available German dataset (https://github.com/thorstenMueller/deep-learning-german-tts/) and training has been running for 1 day without stopping because of errors.
[Screenshot: running WaveGrad training]

But I could use some help in understanding its progress. Do you have an account on Mozilla Discourse so we could discuss my questions there (https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/) and not blow up this "issue"?

The following things are on my mind right now:

  1. Should the tensorboard dependency be added to requirements.txt?

  2. When is the best time to run the notebook (12 .pt checkpoint files have been written at the moment)?
    Does the current training process need to be finished at some point before running the notebook?

  3. Audio samples are pure random noise and the predicted graphs don't change.

  4. I see lots of "NaN" points (triangles) in the TensorBoard graphs (the grad norm graph in TB).

[Screenshots: TensorBoard images, scalars and grad norm]

This is the WaveGrad config I used:

[Screenshots: WaveGrad config, parts 1 and 2]

Taco2 training is based on "hop_length": 256, so I'll need to adjust "factors" in the config. The current WaveGrad training uses a hop_length value of 300.
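
(Note: judging from the default config quoted later in this thread, the upsampling factors must multiply to the hop length, 5·5·3·2·2 = 300, so a hop_length of 256 needs a factor set whose product is 256, for example [4, 4, 4, 2, 2]; that set is only an illustration, not a maintainer recommendation. A quick sanity check:)

```python
from functools import reduce
from operator import mul


def check_factors(factors, hop_length):
    """Verify that the WaveGrad upsampling factors reproduce the mel hop length."""
    product = reduce(mul, factors, 1)
    assert product == hop_length, f"factors multiply to {product}, expected {hop_length}"


check_factors([5, 5, 3, 2, 2], 300)  # default config: hop_length = 300
check_factors([4, 4, 4, 2, 2], 256)  # hypothetical factors for hop_length = 256
```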

It would be great if you could support me on this :-).
Thanks so far.

@ivanvovk (Owner)

Sorry, I have no account there. Write to me at iyuvovk@yandex.ru and we'll decide where to continue the discussion.

@thorstenMueller (Author)

Okay, I've sent you an email.
If it's okay for you, we can communicate publicly within this issue.

@thorstenMueller (Author)

thorstenMueller commented Oct 25, 2020

Hey @ivanvovk.
I've got an error during training in epoch 19.

100%|#####################################################################################################################################| 97/97 [00:27<00:00,  3.48it/s]
Device: GPU. average_rtf=4.038705106806669
Epoch: 18 | Losses: [0.49297845363616943, 0.07294661551713943, 2.0598750710487366]
100%|#####################################################################################################################################| 97/97 [00:27<00:00,  3.48it/s]
Device: GPU. average_rtf=4.033360881088442
Epoch: 19 | Losses: [nan, nan, nan]
/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py:422: RuntimeWarning: invalid value encountered in greater
  if abs(tensor).max() > 1:
Traceback (most recent call last):
  File "train.py", line 262, in <module>
    run_training(0, config, args)
  File "train.py", line 198, in run_training
    logger.log_audios(epoch, audios)
  File "/wavegrad/logger.py", line 55, in log_audios
    self.summary_writer.add_audio(key, audio, iteration, sample_rate=self.sample_rate)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/writer.py", line 676, in add_audio
    audio(tag, snd_tensor, sample_rate=sample_rate), global_step, walltime)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py", line 427, in audio
    tensor_list = [int(32767.0 * x) for x in tensor]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py", line 427, in <listcomp>
    tensor_list = [int(32767.0 * x) for x in tensor]
ValueError: cannot convert float NaN to integer
Segmentation fault (core dumped)

TensorBoard was running while this error occurred. Is a running TB a problem?
Do you have any idea what the reason might be, or do you need more info from me?
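
(For what it's worth, the crash itself comes from TensorBoard's add_audio converting NaN samples to int16. A defensive guard in logger.py could skip logging non-finite audio; this is a sketch of the idea, not the repo's actual fix, and the underlying NaN losses still need to be addressed:)

```python
import torch


def log_audio_safely(summary_writer, key, audio, iteration, sample_rate):
    # TensorBoard's add_audio scales samples by 32767 and casts them to int,
    # which raises "cannot convert float NaN to integer" on non-finite values.
    if not torch.isfinite(audio).all():
        print(f"Skipping audio log for '{key}' at iteration {iteration}: non-finite samples")
        return
    summary_writer.add_audio(key, audio, iteration, sample_rate=sample_rate)
```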

@cuongnm5

I have the same error when training on a Vietnamese dataset.

Device: GPU. average_rtf=0.39042793247007823
Epoch: 17 | Losses: [nan, 0.49476008117198944, 3.333707571029663]
Device: GPU. average_rtf=0.43076214979387495
Epoch: 18 | Losses: [nan, nan, nan]
Traceback (most recent call last):
  File "train.py", line 264, in <module>
    run_training(0, config, args)
  File "train.py", line 198, in run_training
    logger.log_audios(epoch, audios)
  File "/data/cuongnm5/WaveGrad/logger.py", line 55, in log_audios
    self.summary_writer.add_audio(key, audio, iteration, sample_rate=self.sample_rate)
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 676, in add_audio
    audio(tag, snd_tensor, sample_rate=sample_rate), global_step, walltime)
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 427, in audio
    tensor_list = [int(32767.0 * x) for x in tensor]
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 427, in <listcomp>
    tensor_list = [int(32767.0 * x) for x in tensor]
ValueError: cannot convert float NaN to integer

My config:

"batch_size": 96,
"segment_length": 7200,
"lr": 5e-4,
"grad_clip_threshold": 1,
"scheduler_step_size": 1,
"scheduler_gamma": 0.9,
"n_epoch": 10000,
"n_samples_to_test": 4,
"test_interval": 1

@ivanvovk (Owner)

@dodoproptit99 Hello. Okay, that seems like a problem with PyTorch mixed-precision training. I've just pushed a small update to the repo, where I added support for turning it off. Please pull the new version, disable fp16 training here and decrease the batch size (I suggest 48). I suppose it should help. Please report whether it helps or not.
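
In config terms (key names as in the configs posted below), that corresponds roughly to:

```json
"batch_size": 48,
"use_fp16": false
```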

@cuongnm5

cuongnm5 commented Nov 1, 2020

@ivanvovk Thanks for your reply! I tried decreasing the batch size to 48 and 24 and disabling fp16 training, but I still get this error :(
I use an RTX 2080 Ti with CUDA 10.2.

[Screenshot from 2020-11-01 14-43-38]

@ivanvovk (Owner)

ivanvovk commented Nov 1, 2020

@dodoproptit99 What TensorBoard output do you have?

@cuongnm5

cuongnm5 commented Nov 1, 2020

@ivanvovk

  • logs/default_2:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default_2", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 48, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": false, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

  • logs/default_3:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default_3", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 24, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": false, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

  • logs/default:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 96, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": true, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

[Screenshots from 2020-11-01 17-15-32 and 17-15-19]

@ivanvovk (Owner)

ivanvovk commented Nov 1, 2020

@dodoproptit99 This is really strange. Can you please run the following script? Put it in the root folder of WaveGrad and run python check_data.py -c configs/YOUR_CONFIG -f filelists/YOUR_FILELIST. It will check whether the mel transformation results in bad values (infs or NaNs). Of course, don't forget to specify your GPU device through CUDA_VISIBLE_DEVICES.
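
(The script itself is not preserved in this thread. As a rough, generic stand-in, not the author's actual check_data.py, a self-contained check with torchaudio could look like this, reading the data_config values from the configs posted below:)

```python
import argparse
import json

import torch
import torchaudio


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', required=True)
    parser.add_argument('-f', '--filelist', required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        data_config = json.load(f)['data_config']

    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=data_config['sample_rate'],
        n_fft=data_config['n_fft'],
        win_length=data_config['win_length'],
        hop_length=data_config['hop_length'],
        f_min=data_config['f_min'],
        f_max=data_config['f_max'],
        n_mels=data_config['n_mels'],
    )

    has_nans, has_infs = False, False
    with open(args.filelist) as f:
        for path in map(str.strip, f):  # assumes one audio path per line
            audio, _ = torchaudio.load(path)
            # log10 without clamping, so zero-valued bins surface as -inf
            mel = torch.log10(mel_fn(audio))
            has_nans |= bool(torch.isnan(mel).any())
            has_infs |= bool(torch.isinf(mel).any())

    print(f'Dataset has nans: {has_nans}')
    print(f'Dataset has infs: {has_infs}')


if __name__ == '__main__':
    main()
```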

@cuongnm5

cuongnm5 commented Nov 1, 2020

> @dodoproptit99 This is really strange. Can you please run the following script? Put it in the root folder of WaveGrad and run python check_data.py -c configs/YOUR_CONFIG -f filelists/YOUR_FILELIST. It will check whether the mel transformation results in bad values (infs or NaNs). Of course, don't forget to specify your GPU device through CUDA_VISIBLE_DEVICES.

This is my output:
Dataset has nans: False
Dataset has infs: True

Can you tell me more about that?
Thanks in advance ^^

@ivanvovk (Owner)

ivanvovk commented Nov 1, 2020

@dodoproptit99 Okay, I found the origin of the problem. It seems like your data contains audios shorter than segment_length (=7200 in the default configuration). For such cases, proper batching is resolved by padding with zeros. When transforming to a mel-spectrogram, I take log10, which results in infinity values. I have just pushed an update which should solve this problem. Check it out by pulling the latest repo changes. Please report.
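
(In other words, log10 of the zero-padded regions produces -inf. A common way to keep the log-mel finite, shown here as a sketch of the idea rather than the exact change pushed to the repo, is to clamp the spectrogram before taking the log:)

```python
import torch


def safe_log_mel(mel_spec: torch.Tensor, clip_val: float = 1e-5) -> torch.Tensor:
    # Zero-padded audio yields zero-valued spectrogram bins, and log10(0) = -inf,
    # which then propagates non-finite values through the loss. Clamping to a
    # small positive floor keeps the log-mel finite.
    return torch.log10(torch.clamp(mel_spec, min=clip_val))
```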

@cuongnm5

cuongnm5 commented Nov 2, 2020

@ivanvovk It works ^^

[Screenshot from 2020-11-02 17-02-04]

@ivanvovk (Owner)

ivanvovk commented Nov 2, 2020

@dodoproptit99 Glad to hear that! @thorstenMueller, also check it out; it will probably solve your problem with NaNs too (if it's still relevant).

@thorstenMueller (Author)

Thanks @ivanvovk for pinging me and updating the code.
I have no problems with NaNs currently. I'm having trouble with this:

Initializing logger...
Initializing model...
Number of WaveGrad parameters: 15810401
Initializing optimizer, scheduler and losses...
Initializing data loaders...
Start training...
Traceback (most recent call last):                                                                                                                                        
  File "train.py", line 262, in <module>
    run_training(0, config, args)
  File "train.py", line 117, in run_training
    loss = (model if args.n_gpus == 1 else model.module).compute_loss(mels, batch)
  File "/wavegrad/model/diffusion_process.py", line 176, in compute_loss
    eps_recon = self.nn(mels, y_noisy, continuous_sqrt_alpha_cumprod)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/nn.py", line 119, in forward
    ublock_outputs = ublock(x=ublock_outputs, scale=scale, shift=shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/upsampling.py", line 82, in forward
    outputs = self.first_block_main_branch['modulation'](outputs, scale, shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/upsampling.py", line 30, in forward
    outputs = self.featurewise_affine(x, scale, shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/linear_modulation.py", line 68, in forward
    outputs = scale * x + shift
RuntimeError: The size of tensor a (450) must match the size of tensor b (448) at non-singleton dimension 2
Segmentation fault (core dumped)

I'd like to try your tip, but I'm not sure how to do this:

> New hop length, new struggles. Check whether the mel spectrogram shape you obtain corresponds to the audio length or not. Take an audio clip and convert it using this class. The mel length multiplied by 256 should be exactly equal to the audio length.

@ivanvovk (Owner)

ivanvovk commented Nov 2, 2020

@thorstenMueller Oh, sorry, I see what's wrong. Besides the upsampling factors, you also need to update the segment length, which must be divisible by the hop length. Change segment_length in your config to 7168, for example, and it will work.
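
(The arithmetic behind the suggested value: 7168 = 28 × 256, i.e. the default 7200 rounded down to the nearest multiple of the new hop length. A tiny helper, illustrative only, for picking such a value:)

```python
def nearest_valid_segment_length(target: int, hop_length: int) -> int:
    """Round a desired segment length down to the nearest multiple of hop_length."""
    return (target // hop_length) * hop_length


print(nearest_valid_segment_length(7200, 256))  # 7168
print(nearest_valid_segment_length(7200, 300))  # 7200 (the default hop_length already divides it)
```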

ivanvovk changed the title from "Problems on running training with matplotlib on nvidia xavier agx" to "Matplotlib API change & NaNs for short clips & new hop_length" on Nov 2, 2020
@thorstenMueller (Author)

Thanks @ivanvovk.
I'll give it a try soon and report back to you.

@thorstenMueller (Author)

thorstenMueller commented Nov 3, 2020

Hey @ivanvovk.
Training has been running for 12 hours without any problems (epoch 10).
Graphs and audio samples look/sound good.

The next step will be checking whether the generated mel spectrograms are compatible with the Mozilla TTS project.

So thanks for your support and updates on this 👍.
[Screenshots: TensorBoard scalars and mel spectrograms]

@ivanvovk (Owner)

ivanvovk commented Nov 3, 2020

@thorstenMueller Glad that it works, and you're welcome! Closing this issue.
