Trouble training model without Crepe f0 estimation #79

Closed
voodoohop opened this issue Apr 18, 2020 · 3 comments

Comments

@voodoohop

voodoohop commented Apr 18, 2020

Description

I am having trouble training models that don't rely on an f0 estimate from the Crepe pitch estimator. In my tests, whenever the fundamental frequency estimation is part of the differentiable graph, the additive synthesizer does not converge at all.

To reproduce it, I create a batch consisting of one sample generated with the additive synth as in the synths and effects tutorial notebook. I then try overfitting an autoencoder on that one sample, with code adapted from the training on one sample notebook.

The decoder uses an additive synthesizer too, so in theory it should easily reconstruct the sample. Here is a Colab notebook that demonstrates the behavior. To make the model converge, replace f0_encoder=f0_encoder with f0_encoder=None.

Results

Original Audio


Reconstruction with an f0 encoder (3000 training steps)

After the first few training steps the loss stops improving (it plateaus around 18-19).

Reconstruction with f0 from Crepe (100 training steps)

The model converges immediately, with the loss dropping to about 3 in a short time.

Things I have tried

  • Passing the precalculated f0 estimate from Crepe through a dense layer with one output. Even then the model does not converge, although a simple scaling of the input should be enough for reconstruction. I tried several combinations of activation functions and rescaling.
  • Adding a coarse spectrogram loss with fewer FFT bins (see the sketch after this list), to help avoid local minima that could occur if the model settles on a fundamental frequency that is a multiple of the real one. I didn't see any improvement.
  • Initializing a fake f0 estimator so that it starts almost at the right frequency before training, also without success.
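
For reference, here is a minimal sketch of what the coarse-loss attempt looks like, using the objects defined in the script under "To Reproduce" below. The particular fft_sizes values and the coarse_loss name are just illustrative, not something from the DDSP codebase and not my exact settings:

# Extra spectral loss restricted to a few small FFT sizes, so only coarse
# spectral structure is compared; appended to the model's loss list.
coarse_loss = ddsp.losses.SpectralLoss(loss_type='L1',
                                        fft_sizes=(256, 128, 64),
                                        mag_weight=1.0,
                                        logmag_weight=1.0)

# Constructed inside the same strategy.scope() as in the script below.
model = models.Autoencoder(preprocessor=preprocessor,
                           encoder=encoder,
                           decoder=decoder,
                           processor_group=processor_group,
                           losses=[spectral_loss, coarse_loss])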

All of this happens when fitting just one sample; I tried fitting multiple samples too, without success.

To Reproduce

Colab notebook

import time
import ddsp
from ddsp.training import (data, decoders, encoders, models, preprocessing, 
                           train_util)
import gin
import numpy as np
import tensorflow.compat.v2 as tf
import itertools

sample_rate = 16000

### Generate an audio sample using the additive synth

n_frames = 1000
hop_size = 64
n_samples = n_frames * hop_size

# Amplitude [batch, n_frames, 1].
# Make amplitude linearly decay over time.
amps = np.linspace(1.0, -3.0, n_frames, dtype=np.float32)
amps = amps[np.newaxis, :, np.newaxis]

# Harmonic Distribution [batch, n_frames, n_harmonics].
# Make harmonics decrease linearly with frequency.
n_harmonics = 20
harmonic_distribution = (np.ones([n_frames, 1], dtype=np.float32) *
                         np.linspace(1.0, -1.0, n_harmonics, dtype=np.float32)[np.newaxis, :])
harmonic_distribution = harmonic_distribution[np.newaxis, :, :]

# Fundamental frequency in Hz [batch, n_frames, 1].
f0_hz = 440.0 * np.ones([1, n_frames, 1], dtype=np.float32)

# Create synthesizer object.
additive_synth = ddsp.synths.Additive(n_samples=n_samples,
                                      scale_fn=ddsp.core.exp_sigmoid,
                                      sample_rate=sample_rate)

# Generate some audio.
audio = additive_synth(amps, harmonic_distribution, f0_hz)

# Create a batch of data (1 example) to train on

batch = {"audio": audio, "f0_hz": f0_hz, "amplitudes": amps, "loudness_db": np.ones_like(amps)}

dataset_iter = itertools.repeat(batch)
batch = next(dataset_iter)
audio = batch['audio']
n_samples = audio.shape[1]


### Create an autoencoder

# Create Neural Networks.
preprocessor = preprocessing.DefaultPreprocessor(time_steps=n_samples)

# f0 encoder
f0_encoder = encoders.ResnetF0Encoder(size="small")


encoder = encoders.MfccTimeDistributedRnnEncoder(rnn_channels=256,
                                                 rnn_type='gru',
                                                 z_dims=16,
                                                 z_time_steps=125,
                                                 f0_encoder=f0_encoder)
# Set f0_encoder=None here to use the precomputed f0 from Crepe instead.

decoder = decoders.RnnFcDecoder(rnn_channels=256,
                                rnn_type='gru',
                                ch=256,
                                layers_per_stack=1,
                                output_splits=(('amps', 1),
                                               ('harmonic_distribution', 45)))

# Create Processors.
additive = ddsp.synths.Additive(n_samples=n_samples, 
                                sample_rate=sample_rate,
                                name='additive')

# Create ProcessorGroup.
dag = [(additive, ['amps', 'harmonic_distribution', 'f0_hz'])]

processor_group = ddsp.processors.ProcessorGroup(dag=dag,
                                                 name='processor_group')


# Loss_functions
spectral_loss = ddsp.losses.SpectralLoss(loss_type='L1',
                                         mag_weight=1.0,
                                         logmag_weight=1.0)

strategy = train_util.get_strategy()

with strategy.scope():
  # Put it together in a model.
  model = models.Autoencoder(preprocessor=preprocessor,
                             encoder=encoder,
                             decoder=decoder,
                             processor_group=processor_group,
                             losses=[spectral_loss])
  trainer = train_util.Trainer(model, strategy, learning_rate=1e-3)


### Try overfitting to the synthetic sample

# Build model, easiest to just run forward pass.

trainer.build(batch)

for i in range(3000):
  losses = trainer.train_step(dataset_iter)
  res_str = 'step: {}\t'.format(i)
  for k, v in losses.items():
    res_str += '{}: {:.2f}\t'.format(k, v)
  print(res_str)

@jesseengel
Contributor

Hi, thanks for the in-depth study and posting all the resources.

This is actually expected behavior at the moment. As we said in the paper, when training on a full dataset like NSynth the f0-encoder model can reach a small loss and learn to generate audio that a CREPE model classifies as having the right f0, but it does not currently estimate the correct f0 internally. It often falls into the local minimum of predicting an integer multiple of f0 and then doing its best to match the data by manipulating the harmonic distribution. This problem is likely even more pronounced when fitting a single datapoint, since you lose the stochasticity of SGD over a full dataset that normally helps optimization.
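
To make the local-minimum intuition concrete: the partials of a harmonic stack at an integer multiple (or simple ratio) of the true f0 land on partials of the target, so a spectral loss tends to penalize such an estimate far less than a nearby unrelated frequency, and gradient descent can settle there. A rough illustrative sketch in plain numpy (not from the paper or the repo; a crude single-resolution log-magnitude L1 stands in for ddsp.losses.SpectralLoss):

import numpy as np

sr = 16000
t = np.arange(4 * 2048) / sr

def tone(f0, max_hz=3600.0):
  # Equal-amplitude stack of sinusoids at integer multiples of f0, up to max_hz.
  n_harmonics = int(max_hz // f0)
  return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))

def log_mag_l1(a, b):
  # Crude stand-in for a single-scale log-magnitude spectral loss.
  A, B = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
  return np.mean(np.abs(np.log(A + 1e-6) - np.log(B + 1e-6)))

target = tone(440.0)
for f0_est in [220.0, 440.0, 500.0, 660.0, 880.0]:
  print('f0 = {:5.0f} Hz   loss = {:.3f}'.format(f0_est, log_mag_l1(target, tone(f0_est))))

Estimates at simple ratios of 440 Hz (like 220 or 880) share many or all of their partials with the target, so a loss like this tends to rate them much better than an unrelated frequency like 500 Hz, even though none of them is the right answer; that basin is what the f0 encoder falls into.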

We have some follow-up work that overcomes these challenges, and we are working on getting it prepared for a conference submission next month, at which point I'll clean it up and add it to the repo. Sorry for the delay, or if the original paper was misleading, but I think there are actually several ways to tackle this challenge, and we should hopefully have them robust and added soon.

@voodoohop
Author

Understood. That's good to know and thanks for all the amazing work. I'm really excited about the developments. Should I close this issue for now?

@jesseengel
Contributor

Yah, and I look forward to posting more when I have it :).
