
Can we train with this yet? #10

Open
EmElleE opened this issue Aug 5, 2021 · 8 comments

EmElleE commented Aug 5, 2021

Just wondering if we can train with LJS on this implementation thanks!

keonlee9420 (Owner) commented

Hi @EmElleE, yes you could, but you would need to tune the hyperparameters for the residual encoder; other than that it is really close.

ArEnSc commented Aug 10, 2021

@keonlee9420 Quick question: do you have the LJS model? I would like to fine-tune on it. Do you know how much data is required for fine-tuning? Also, is the quality close to Tacotron 2? These days people seem to use Tacotron 2 because it works well for cloning voices. Do you think Parallel-Tacotron2 is similarly capable?

keonlee9420 (Owner) commented

Hi @ArEnSc, I don't have it yet, but I'll share it when I do. Please note that the result will likely be much worse than expected, since the maximum batch size is much smaller than in the original paper.
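
For context, one common workaround for a small per-step batch is gradient accumulation. The sketch below is generic PyTorch, not code from this repo; the tiny model and dummy batches are placeholders used only to show the pattern.

import torch
from torch import nn

# Generic gradient-accumulation sketch (not from this repo): accumulate
# gradients over several small batches to approximate a larger effective
# batch size when GPU memory limits the per-step batch.
model = nn.Linear(10, 1)                      # dummy stand-in for the TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]  # dummy batches of size 4

accum_steps = 8                               # effective batch = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()           # scale so accumulated grads match one large batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Note that accumulation only approximates a larger batch: per-step statistics (e.g., for batch normalization, if any) are still computed on the small batch.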

huypl53 commented Aug 15, 2021

Take a look at this:

speaker_embedding_m = speaker_embedding.unsqueeze(1).expand(
    -1, max_mel_len, -1
)

position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)

enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)

speaker_embedding_m and mel both have max_mel_len in dimension 1, but position_enc has max_seq_len + 1 there, which is different. Therefore torch.cat will raise an exception.
Am I right?
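
A minimal repro of the mismatch with dummy shapes (the sizes below are illustrative, not the repo's config values):

import torch

batch_size, max_seq_len, max_mel_len = 2, 1000, 1200   # illustrative sizes only
d_model, n_mels = 256, 80

# The precomputed table has max_seq_len + 1 positions, so slicing to
# :max_mel_len returns at most max_seq_len + 1 rows.
position_enc_table = torch.randn(1, max_seq_len + 1, d_model)
position_enc = position_enc_table[:, :max_mel_len, :].expand(batch_size, -1, -1)

speaker_embedding = torch.randn(batch_size, d_model)
speaker_embedding_m = speaker_embedding.unsqueeze(1).expand(-1, max_mel_len, -1)
mel = torch.randn(batch_size, max_mel_len, n_mels)

print(position_enc.shape)         # torch.Size([2, 1001, 256])
print(speaker_embedding_m.shape)  # torch.Size([2, 1200, 256])

try:
    enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)
except RuntimeError as e:
    print("torch.cat fails:", e)  # sizes must match except in the concatenation dim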

keonlee9420 (Owner) commented Aug 17, 2021

Hi @phamlehuy53, position_enc also has max_seq_len in that dimension.

huypl53 commented Aug 17, 2021

> Hi @phamlehuy53, position_enc also has max_seq_len in that dimension.

But you notice that speaker_embedding_m and mel have max_mel_len instead, don't you?

keonlee9420 (Owner) commented Aug 17, 2021

Oh, sorry, I mistyped: position_enc has max_mel_len, not max_seq_len.

position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)

huypl53 commented Aug 17, 2021

> Oh, sorry, I mistyped: position_enc has max_mel_len, not max_seq_len.
>
> position_enc = self.position_enc[
>     :, :max_mel_len, :
> ].expand(batch_size, -1, -1)

Yep, but when max_mel_len is larger than max_seq_len, dimension 1 of position_enc is still only max_seq_len in length, which causes the dimension mismatch in torch.cat's arguments.
I'm sorry for leaving this detail out of the first question. Thanks!
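
One possible fix is to fall back to an on-the-fly sinusoid table whenever max_mel_len exceeds the precomputed length, as FastSpeech2-style decoders do. The sketch below is standalone and only illustrates the idea; the helper names and shapes are assumptions, not the repo's actual code.

import torch

def sinusoid_table(n_position, d_model):
    # Standard sinusoidal positional encoding, shape (1, n_position, d_model).
    # Assumes an even d_model for simplicity.
    position = torch.arange(n_position, dtype=torch.float32).unsqueeze(1)
    div_term = torch.pow(10000.0, -torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    table = torch.zeros(n_position, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table.unsqueeze(0)

def safe_position_enc(position_enc_table, batch_size, max_mel_len):
    # position_enc_table: precomputed buffer of shape (1, max_seq_len + 1, d_model).
    # If the batch's mel length fits, slice as before; otherwise build a longer
    # table on the fly so the torch.cat shapes agree.
    if max_mel_len <= position_enc_table.size(1):
        table = position_enc_table[:, :max_mel_len, :]
    else:
        table = sinusoid_table(max_mel_len, position_enc_table.size(2)).to(position_enc_table.device)
    return table.expand(batch_size, -1, -1)

# Quick check with dummy sizes:
buf = torch.randn(1, 1001, 256)
print(safe_position_enc(buf, 2, 800).shape)   # torch.Size([2, 800, 256])
print(safe_position_enc(buf, 2, 1200).shape)  # torch.Size([2, 1200, 256])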
