AlignerNet instead of MAS #81

Open
codeghees opened this issue Mar 7, 2024 · 18 comments

Comments

@codeghees

Is it possible to use AlignerNet (aligner.py in pflow-tts repo) instead of MAS in VITS2?

What should be changed in the code? I am a bit confused on what the inputs should be.

@p0p4k
Owner

p0p4k commented Mar 7, 2024

Check out pflow repo for guidance.

@codeghees
Author

Hey, thanks for replying!

I did take most of the code from that repo. I am trying to debug why my alignment curve looks like this:
image

align_loss is being added to duration loss.
Inputs:

aln_hard, aln_soft, aln_log, aln_mask = self.aligner(
    m_p.transpose(1, 2), x_mask, y, y_mask
)
attn = aln_mask.transpose(1, 2).unsqueeze(1)
align_loss = self.aligner_loss(aln_log, x_lengths, y_lengths)
m_p is returned by TextEncoder.

Appreciate any insights!

@p0p4k
Owner

p0p4k commented Mar 7, 2024

It might be that it is still learning the durations. I think MAS is good enough. What is important is the duration_predictor module.

@codeghees
Author

The graph above is from after several days of training and thousands of steps. It looks like a bug, maybe in the tensor shapes or something similar.

In the output, the first word is legible but the rest is gibberish.

@p0p4k
Owner

p0p4k commented Mar 7, 2024

I see. Maybe it needs some fixing. If you can start a PR, we can debug this together. I am busy with other stuff.

@Tera2Space

Probably the same problem as we have in pflow.

@codeghees
Author

@Tera2Space What was the problem?

@Tera2Space

> @Tera2Space What was the problem?

Code Geass, nice. The problem was that it generates a wrong alignment. I just thought of a possible reason: p0p4k/pflowtts_pytorch#24 (comment)

In your model, what shape is the input to AlignerNet?

@codeghees
Author

@Tera2Space @p0p4k I added a basic PR of the changes I have so far: #82

@codeghees
Author

For a single batch - shapes look something like this:

m_p torch.Size([32, 192, 74])
x_mask torch.Size([32, 1, 74])
y_mask torch.Size([32, 1, 293])
m_p torch.Size([32, 192, 74])
x torch.Size([32, 192, 74])
y torch.Size([32, 80, 293])
m_p.transpose(1,2) torch.Size([32, 74, 192])
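
A quick consistency check on shapes like these can rule out a transposition mistake before the aligner is called. The sketch below is only an assumption about the expected layout (text features as [batch, text_len, channels], mel as [batch, n_mels, frames], masks as [batch, 1, length]); `check_aligner_inputs` is a hypothetical helper and is not part of either repo.

```python
def check_aligner_inputs(text_shape, x_mask_shape, mel_shape, y_mask_shape):
    """Return a list of shape problems (empty if the shapes are consistent).

    Assumed layout (an assumption, not confirmed by either repo):
      text:   [batch, text_len, channels]
      x_mask: [batch, 1, text_len]
      mel:    [batch, n_mels, frames]
      y_mask: [batch, 1, frames]
    """
    problems = []
    batch, text_len, _ = text_shape
    mel_batch, _, frames = mel_shape

    if mel_batch != batch:
        problems.append(f"batch mismatch: text {text_shape} vs mel {mel_shape}")
    if tuple(x_mask_shape) != (batch, 1, text_len):
        problems.append(f"x_mask {tuple(x_mask_shape)} does not match text {tuple(text_shape)}")
    if tuple(y_mask_shape) != (batch, 1, frames):
        problems.append(f"y_mask {tuple(y_mask_shape)} does not match mel {tuple(mel_shape)}")
    return problems


# Shapes reported above, with m_p already transposed to [32, 74, 192].
ok = check_aligner_inputs((32, 74, 192), (32, 1, 74), (32, 80, 293), (32, 1, 293))

# Same call, but forgetting the transpose on m_p.
bad = check_aligner_inputs((32, 192, 74), (32, 1, 74), (32, 80, 293), (32, 1, 293))

print(ok)
print(bad)
```

With real tensors, the same helper can be called with `tuple(t.shape)` for each argument. Under the assumed layout the shapes listed above are consistent, which suggests the problem lies elsewhere (for example in the mask values rather than the mask shapes).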

@Tera2Space

> For a single batch - shapes look something like this:
>
> m_p torch.Size([32, 192, 74])
> x_mask torch.Size([32, 1, 74])
> y_mask torch.Size([32, 1, 293])
> m_p torch.Size([32, 192, 74])
> x torch.Size([32, 192, 74])
> y torch.Size([32, 80, 293])
> m_p.transpose(1,2) torch.Size([32, 74, 192])

Hm, then my guess was wrong, or there are more problems.
I tested my idea on pflow and it didn't work there either, so we probably need to check for other problems. I think we should use something like https://github.com/lucidrains/naturalspeech2-pytorch as a reference.

@p0p4k
Owner

p0p4k commented Mar 8, 2024

IIRC, I had pulled the AlignerNet from there.

@codeghees
Author

Yeah, I'm currently trying out various inputs to the aligner. It is possibly an input issue, and it might be an issue with the masks.

@p0p4k
Owner

p0p4k commented Mar 9, 2024

I'm still not convinced about putting effort into AlignerNet; we should focus on a better TextEncoder instead.

@codeghees
Author

Do you think that's a bottleneck right now?

@p0p4k
Owner

p0p4k commented Mar 10, 2024

For vits2 it should be the duration predictor, and for pflow it should be both the TextEncoder and the duration predictor. MAS gives good alignments during training; it is during inference that these models perform worse.
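
For readers unfamiliar with MAS (Monotonic Alignment Search), the following is a minimal pure-Python sketch of the dynamic program behind it, written for illustration only. It is not the optimized implementation used in the repo, and the function name and input layout (`log_p[token][frame]`) are assumptions made here.

```python
def monotonic_alignment_search(log_p):
    """Find the best monotonic alignment of frames to tokens.

    log_p[i][j] is the log-likelihood of frame j under token i.
    Returns a list giving, for each frame, the index of its token.
    The path starts at token 0, ends at the last token, and moves
    forward by zero or one token per frame.
    """
    n_tokens, n_frames = len(log_p), len(log_p[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = float("-inf")
    # q[i][j]: best cumulative score of any valid path ending at (token i, frame j).
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = log_p[i][j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return path


log_p = [
    [-1.0, -1.0, -5.0, -5.0],  # token 0 fits the first two frames
    [-5.0, -5.0, -1.0, -1.0],  # token 1 fits the last two frames
]
path = monotonic_alignment_search(log_p)
durations = [path.count(i) for i in range(len(log_p))]
print(path, durations)
```

The per-token frame counts in `durations` are the kind of target a duration predictor is trained on, which is why alignment quality at training time and duration prediction at inference time are separate concerns.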

@Tera2Space

> pflow it should be both textencoder

How do you think we can improve pflow's encoder?

@nicemanis

> Hey, thanks for replying!
>
> I did take most of the code from that repo. I am trying to debug why my alignment curve looks like this: image
>
> align_loss is being added to duration loss. Inputs:
>
> aln_hard, aln_soft, aln_log, aln_mask = self.aligner(
>     m_p.transpose(1,2), x_mask, y, y_mask
> )
> attn = aln_mask.transpose(1,2).unsqueeze(1)
> align_loss = self.aligner_loss(aln_log, x_lengths, y_lengths)
> m_p is returned by TextEncoder.
>
> Appreciate any insights!

Is padding included in the encoder timesteps (it seems to me that it is)? You can remove the padded part from the plot.
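
One way to act on this suggestion is to crop each example's alignment to its true lengths before plotting. This is a minimal sketch, assuming the alignment for a single example is laid out as [text, frames] and that the unpadded lengths are known (for instance from `x_lengths` and `y_lengths`); `crop_alignment` is a hypothetical helper, not something from the repo.

```python
def crop_alignment(attn, text_len, mel_len):
    """Keep only the unpadded [text_len, mel_len] corner of one alignment matrix.

    attn is assumed to be a single example laid out as [text, frames].
    Nested lists are used here; for a tensor or array the same crop
    would be attn[:text_len, :mel_len].
    """
    return [row[:mel_len] for row in attn[:text_len]]


padded = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],  # padded text position
]
cropped = crop_alignment(padded, text_len=2, mel_len=4)
print(cropped)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

If the plot still looks wrong after cropping, padding can be ruled out as the cause.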
