fix in implementation of S-DTW backward #15

Open
taras-sereda wants to merge 2 commits into main

Conversation

taras-sereda

Hey, I've found that in your implementation of the S-DTW backward pass, the E matrices are not used; instead you are using the G matrices, and their entries ignore the scaling factors `a, b, c`.
What's the reason for this?
My guess is that you are doing this in order to preserve and propagate gradients, because they vanish due to the small values of `a, b, c`. But I might be wrong, so I'd be glad to hear your motivation for doing this.
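For reference (using the notation of Algorithm 2 in the original Soft-DTW paper, not this repo's variable names), the scaling factors and the exact E recursion are:

$$
a = e^{(R_{i+1,j} - R_{i,j} - \delta_{i+1,j})/\gamma}, \qquad
b = e^{(R_{i,j+1} - R_{i,j} - \delta_{i,j+1})/\gamma}, \qquad
c = e^{(R_{i+1,j+1} - R_{i,j} - \delta_{i+1,j+1})/\gamma}
$$

$$
E_{i,j} = a\,E_{i+1,j} + b\,E_{i,j+1} + c\,E_{i+1,j+1}
$$

Since each exponent is non-positive, $a, b, c \in (0, 1]$, so long products of them can indeed shrink the propagated gradients.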

Playing with your code, I also found that gradients are vanishing, especially when `bandwidth=None`.
I'm solving this problem by normalizing the distance matrix by `n_mel_channel`. With this normalization and the exact implementation of the S-DTW backward pass, I'm able to converge on overfit experiments quicker than with the non-exact computation.
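Concretely, the normalization I mean looks roughly like this (a sketch using a plain squared-L2 `torch.cdist` distance; the actual distance function used in the repo may differ):

```
import torch

def normalized_distance_matrix(pred, target):
    # pred, target: (batch, time, n_mel_channels)
    n_mel_channels = pred.size(-1)
    # pairwise squared-L2 distances, shape (batch, time_pred, time_target)
    dist = torch.cdist(pred, target, p=2) ** 2
    # divide by the number of mel channels so the entries (and the gradients
    # flowing through them) do not scale with the feature dimension
    return dist / n_mel_channels
```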
I'm using these S-DTW hparams:
```
gamma = 0.05
warp = 256
bandwidth = 50
```


Here is a small test I'm using for checks:
```
import numpy as np
import torch
from torch.optim import Adam

# fit a randomly initialized "prediction" directly to one target mel-spectrogram
target_spectro = np.load('')  # path to a saved mel-spectrogram (omitted)
target_spectro = torch.from_numpy(target_spectro)
target_spectro = target_spectro.unsqueeze(0).cuda()
pred_spectro = torch.randn_like(target_spectro, requires_grad=True)

optimizer = Adam([pred_spectro])

# the prediction fits the target in ~3k iterations
n_iter = 4_000
for i in range(n_iter):
    # self.numba_soft_dtw is the repo's Soft-DTW loss
    loss = self.numba_soft_dtw(pred_spectro, target_spectro)
    # normalize the loss by the size of dim 1 of the spectrogram
    loss = loss / pred_spectro.size(1)
    loss.backward()

    if i % 1_000 == 0:
        print(f'iter: {i}, loss: {loss.item():.6f}')
        print(f'd_loss_pred {pred_spectro.grad.mean()}')

    optimizer.step()
    optimizer.zero_grad()
```

Curious to hear how your training is going!
Best,
Taras
@keonlee9420 (Owner)

Hi @taras-sereda, thank you very much for your effort! I think what you claimed seems worth considering, and I'm training the model with your update, but unfortunately it shows no evidence of convergence so far (it has been training for about 9 hours).
The reason for G comes from the derivation of the backward pass following the original Soft-DTW paper (please refer to Algorithm 2), applied to the version introduced in the Parallel Tacotron 2 paper (please refer to Section 4.2). G is already expected to utilize the calculated E, where each coefficient a, b, c is involved. But it was a while ago, so let me double-check.
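For reference, here is a minimal NumPy sketch of the exact backward recursion from Algorithm 2 of the Soft-DTW paper being discussed. Variable names follow the paper; the shapes of D and R and the padding conventions are assumptions about the forward pass, not this repo's code:

```
import numpy as np

def soft_dtw_backward(D, R, gamma):
    """Exact Soft-DTW backward (Algorithm 2, Cuturi & Blondel 2017).

    D: (m, n) pairwise distance matrix.
    R: (m + 2, n + 2) padded DP table from the forward pass.
    Returns E, the gradient of the soft-DTW value w.r.t. D.
    """
    m, n = D.shape
    D_pad = np.zeros((m + 2, n + 2))
    D_pad[1:m + 1, 1:n + 1] = D

    R = R.copy()
    R[:, n + 1] = -np.inf
    R[m + 1, :] = -np.inf
    R[m + 1, n + 1] = R[m, n]

    E = np.zeros((m + 2, n + 2))
    E[m + 1, n + 1] = 1.0

    # reverse sweep: each step rescales the incoming E entries by the
    # coefficients a, b, c, each of which lies in (0, 1]
    for j in range(n, 0, -1):
        for i in range(m, 0, -1):
            a = np.exp((R[i + 1, j] - R[i, j] - D_pad[i + 1, j]) / gamma)
            b = np.exp((R[i, j + 1] - R[i, j] - D_pad[i, j + 1]) / gamma)
            c = np.exp((R[i + 1, j + 1] - R[i, j] - D_pad[i + 1, j + 1]) / gamma)
            E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]

    return E[1:m + 1, 1:n + 1]
```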
