Integrating Phone-Based lang (Lexicon) into Zipformer Model #1606

Closed
kerolos opened this issue Apr 24, 2024 · 1 comment


kerolos commented Apr 24, 2024

I'm seeking guidance on how to incorporate a phone-based lexicon (prepared in icefall/egs/librispeech/ASR/prepare.sh, Step 6) into the latest Zipformer model, a state-of-the-art architecture for speech recognition.

I'm unsure which parameters in the Zipformer architecture need adjustment to optimize performance for phone-level recognition, rather than the sub-word (sentence-piece) units typical of Byte Pair Encoding (BPE) models.

Description:
I understand the benefit of open-vocabulary systems like BPE: they eliminate the need for prior knowledge of word pronunciations. However, I'm unsure how BPE handles variations in word pronunciation found in the training material, or words in the training text that have not been normalized to all lower- or upper-case characters. Additionally, during decoding there is the possibility of encountering words with multiple pronunciation variants or specialized terminology (such as legal or medical terms, or foreign words) whose characters may not be covered by the BPE model's token list (tokens.txt)!
- How does BPE handle variations in word pronunciation during training and decoding? What strategies can I use to address the limitations of BPE models when encountering specialized terminology or words with multiple variants during decoding? This may be a drawback of a BPE-based lexicon system.
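
To make the token-coverage point concrete, here is a minimal sketch (the model path and example words are illustrative assumptions, not from this issue) of how a SentencePiece BPE model segments words it never saw during training:

```python
# Hedged sketch: how a SentencePiece BPE model handles unseen words.
# Assumes a model trained by the librispeech recipe exists at this path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/lang_bpe_500/bpe.model")

for word in ["HELLO", "ARRHYTHMIA", "hello"]:
    # out_type=str returns the subword pieces instead of integer IDs.
    print(word, "->", sp.encode(word, out_type=str))
```

An unseen word such as ARRHYTHMIA is split into smaller pieces that do exist in tokens.txt, so it is never out-of-vocabulary at the token level (though the model may still transcribe it poorly if those pieces were rare in training). By contrast, characters absent from the training text, e.g. lowercase letters when all transcripts were normalized to uppercase, map to the <unk> piece, which is exactly why text normalization matters for BPE models.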

I have a few questions:
1. How can I effectively use a phone-based lexicon with the Zipformer model, and which Zipformer recipe should be used?
2. Which parameters of the Zipformer architecture (whose layers run at different frame rates) should be adjusted or tuned so that it works well with phone-level units rather than the sub-word (sentence-piece / BPE) units the model was designed for?

Also, it would be useful to compare the older TDNN models in original Kaldi against the Zipformer model in Next-gen Kaldi (icefall), using a phone-based lexicon on the same dataset and also across different languages.

Any advice on these questions would be greatly appreciated. Thanks in advance.

wangtiance (Contributor) commented

You may refer to egs/librispeech/ASR/tiny_transducer_ctc for how to incorporate the phone lexicon. Basically, you use a UniqLexicon object to convert texts to phone tokens. Note that it doesn't handle multiple pronunciations.

Based on my experience, BPE models have better WER than phone models. I'm looking forward to your results.
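
A minimal sketch of the UniqLexicon approach mentioned above, assuming a phone lang dir at data/lang_phone prepared by the recipe (the path and the exact call signature should be checked against your icefall version):

```python
# Hedged sketch: convert transcripts to phone token IDs with icefall's
# UniqLexicon, as referenced in the tiny_transducer_ctc recipe. Assumes
# data/lang_phone holds a lexicon with one pronunciation per word.
from icefall.lexicon import UniqLexicon

lexicon = UniqLexicon("data/lang_phone")

texts = ["HELLO WORLD", "SPEECH RECOGNITION"]
# Returns a k2.RaggedTensor with one row of phone IDs per utterance.
token_ids = lexicon.texts_to_token_ids(texts)
print(token_ids)
```

Note that the network's output dimension (vocab size) must then match the number of phone tokens in tokens.txt rather than the BPE vocabulary size.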

JinZr closed this as completed on May 20, 2024