
Obtaining unusual alignment results while using the ESPnet2 Branchformer model. #30

teinhonglo opened this issue Aug 2, 2023 · 2 comments

teinhonglo commented Aug 2, 2023

Firstly, I want to express my admiration for the exceptional work accomplished here!

Recently, I've been running into an issue while using the ESPnet2 Branchformer model.
Despite following the instructions here, I obtained poor alignment results.
The problem occurs when I train the model with phone-level transcriptions.

To understand this issue further, I experimented with two different token types, the details of which are as follows:
Both models reach an accuracy above 95%.

BPE-level tokens:

[screenshot: alignment output for BPE-level tokens, with token scores]

Phone-level tokens:

[screenshot: alignment output for phone-level tokens, with token scores]

I would appreciate your guidance and insights to help me resolve these alignment issues.

Thank you in advance.
Tien-Hong

lumaku (Owner) commented Aug 2, 2023

Hey Tien-Hong,
thanks for writing this issue. Glad to see that this algorithm is useful for you!

I assume that your screenshots show the alignments with the corresponding token scores?
Inspecting these scores for the BPE-level tokens:

  • di55 has a score of 0.00, while nga55 has a score of -4.85, and xiag2 has a score of -8.3677. These probabilities are quite bad.

Inspection of the Phone-level tokens:

  • Here, the token probabilities are mostly -0.000, which is unusually good (though this can happen with Transformer models) and may indicate a numerical problem.
  • Timing seems to be shifted by ~300 ms.

I recommend re-checking the following parameters:

  • The durations of the tokens seem unusually long. The timing variables may need to be adapted (also, check that the sample rate is correct).
  • Subsampling: CTC accuracy depends on the ratio of tokens to CTC frames. I had good results with 3 frames per token on average (so that blank tokens can be classified in between). If you switched directly from BPE to phones, you may still need to adapt the subsampling ratio.
  • Check the performance of your CTC network: the alignments are only as good as the CTC output of the network itself. What was the CTC weight parameter during your training? If you decode CTC-only on your test set, how good or bad is the ASR performance compared to hybrid CTC/attention decoding?
  • Transformer models usually lose accuracy at the beginning and at the end of the aligned audio; adding suitable padding to the audio file may help.
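The timing and subsampling checks above can be sketched numerically. The hop size and subsampling factor below are illustrative assumptions (a 10 ms frontend hop with 4x encoder subsampling), not values read from this particular model:

```python
# Sanity-check CTC frame timing and the tokens-to-frames ratio.
# Hop size and subsampling factor here are illustrative assumptions.

def ctc_frame_duration(sample_rate: int, hop_samples: int, subsampling: int) -> float:
    """Duration of one CTC output frame in seconds."""
    return hop_samples * subsampling / sample_rate

def frames_per_token(audio_seconds: float, n_tokens: int, frame_duration: float) -> float:
    """Average CTC frames available per target token; ~3 leaves room for blanks."""
    return audio_seconds / frame_duration / n_tokens

# 16 kHz audio, 160-sample (10 ms) hop, 4x subsampling -> 40 ms per CTC frame
frame_dur = ctc_frame_duration(16000, 160, 4)
# A 5 s utterance aligned to 40 phone tokens -> about 3.1 frames per token
ratio = frames_per_token(5.0, 40, frame_dur)
print(frame_dur, ratio)
```

If the ratio drops well below 3 after switching from BPE to phone tokens (phone sequences are longer), reducing the encoder subsampling is one way to restore it.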

Other issues, possibly model-related, may also cause such misalignments; give me a few days to find the time to investigate Branchformer alignments with an English language model.

teinhonglo commented Aug 4, 2023

Thank you for your kind response.

I have evaluated the trained model's CTC performance and verified the audio's sampling rate, which is 16000 Hz.
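As a minimal sketch (assuming WAV input), the sample rate can be double-checked with the standard library before alignment; the file written here is just a throwaway demo:

```python
import os
import tempfile
import wave

def check_sample_rate(path: str, expected: int = 16000) -> int:
    """Raise if a WAV file's sample rate differs from the expected one."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if rate != expected:
        raise ValueError(f"{path}: sample rate {rate} != expected {expected}")
    return rate

# Demo: write a short silent 16 kHz mono WAV and verify it.
tmp_path = os.path.join(tempfile.gettempdir(), "sr_check_demo.wav")
with wave.open(tmp_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
rate = check_sample_rate(tmp_path)
os.remove(tmp_path)
```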

All configs I used are listed:

Both the hybrid CTC/attention and CTC-only decoding performance of the model are shown below (CTC-only results carry the ctc suffix). Additionally, I tried a Conformer-type encoder, but unfortunately the alignment results remained as poor as with the Branchformer-type encoder.

BPE-level tokens

exp/asr_train_asr_branchformer_raw_en_bpe735_sp

WER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_branchformer_asr_model_valid.acc.ave/test 2187 37456 94.77 4.97 0.27 0.07 5.30 43.35
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test 2187 37456 92.31 7.14 0.55 0.25 7.94 53.54

Phone-level tokens

exp/asr_train_asr_branchformer_raw_en_word_sp

WER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_branchformer_asr_model_valid.acc.ave/test 2187 37456 94.82 5.01 0.17 0.07 5.25 42.25
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test 2187 37456 94.52 5.29 0.18 0.08 5.56 44.67
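To summarize the tables above, a quick sketch computing the relative WER degradation when switching from hybrid CTC/attention to CTC-only decoding, using the reported error rates:

```python
# Relative WER increase of CTC-only decoding vs. hybrid CTC/attention,
# using the error rates reported in the tables above.

def rel_increase(hybrid_wer: float, ctc_wer: float) -> float:
    """Relative degradation in percent."""
    return (ctc_wer - hybrid_wer) / hybrid_wer * 100.0

bpe_gap = rel_increase(5.30, 7.94)    # BPE model: ~50% relative degradation
phone_gap = rel_increase(5.25, 5.56)  # phone model: ~6% relative degradation
print(round(bpe_gap, 1), round(phone_gap, 1))  # prints 49.8 5.9
```

Notably, the phone-level model's CTC branch holds up much better than the BPE model's, so a weak CTC network alone does not explain its poor alignments.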

Do you have any further suggestions?

Thank you in advance.
Tien-Hong
