Transcription being concatenated oddly #5
Comments
hey, this looks like an OOV issue. since kenlm doesn't have a notion of partial tokens, it helps a lot to pass a list of known unigrams
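For context, the unigram list here is just the set of full words the decoder should treat as in-vocabulary. A minimal sketch of collecting such a list from the LM's training text (the helper name and toy corpus are made up for illustration):

```python
from collections import Counter

def build_unigram_list(corpus_lines, min_count=1):
    """Collect the unique words (unigrams) seen in a text corpus.

    The resulting list is what you would hand to the decoder so it
    knows which full words exist; anything else is out-of-vocabulary.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    return sorted(w for w, c in counts.items() if c >= min_count)

corpus = ["the cat sat", "the cat ran", "a dog ran"]
print(build_unigram_list(corpus))  # ['a', 'cat', 'dog', 'ran', 'sat', 'the']
```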
That just sounds like a hotfix to me, tbh. By your reasoning, the same concatenation should happen with the other implementation as well
it could be that you can reproduce the behavior of the other decoder by lowering some of the thresholds during prediction
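To illustrate what threshold-based pruning does (the parameter name `beam_prune_logp` below is purely illustrative of the idea, not a reference to any specific API): beams whose score falls too far below the current best are discarded, so loosening the threshold keeps more candidates alive:

```python
def prune_beams(beams, beam_prune_logp=-10.0):
    """Drop beams whose log-probability falls too far below the best beam.

    beams: list of (text, logp) tuples. A looser (more negative)
    threshold keeps more candidates alive, which can rescue hypotheses
    that a tight threshold would discard early in decoding.
    """
    best = max(logp for _, logp in beams)
    return [(t, lp) for t, lp in beams if lp >= best + beam_prune_logp]

beams = [("the cat", -1.0), ("the cap", -5.0), ("thecat", -14.0)]
print(prune_beams(beams, beam_prune_logp=-10.0))  # keeps the first two
print(prune_beams(beams, beam_prune_logp=-20.0))  # keeps all three
```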
if the 'no unigrams available' scenario is important for you, that's great to know. pretty sure the word concatenation can be fixed by adjusting how we score partial words (which is usually done via the trie built from the unigram list).
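A toy version of the trie-based partial-word check described above (a simplified sketch, not the library's actual implementation): a beam whose current partial word cannot grow into any known unigram can be penalized, which discourages gluing words together:

```python
def build_trie(unigrams):
    """Build a character trie from the known unigram list."""
    root = {}
    for word in unigrams:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def is_valid_prefix(trie, partial):
    """True if `partial` could still be completed into a known word.

    A decoder can use this to penalize beams whose current partial
    word has no possible completion in the vocabulary.
    """
    node = trie
    for ch in partial:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["cat", "cap", "dog"])
print(is_valid_prefix(trie, "ca"))    # True
print(is_valid_prefix(trie, "catd"))  # False: 'cat' + 'dog' glued together
```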
oh, looking at the other repo, i believe it reads the unigrams directly from the arpa kenlm file? then you don't have to provide the list separately. however, in that case you can't use a binary kenlm file for instantiation, since as far as i know binary files don't store the ngrams in an easily readable format
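For reference, pulling the unigrams out of a text-format ARPA file is straightforward, since the `\1-grams:` section lists one word per entry line; a minimal sketch of such a parser (binary kenlm files do not expose this section, hence the limitation above):

```python
def unigrams_from_arpa(lines):
    """Extract the word list from the \\1-grams: section of an ARPA file.

    Each entry line looks like: <log10 prob> <word> [<backoff weight>].
    The section ends at a blank line or the next \\-prefixed header.
    """
    words, in_unigrams = [], False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):
                break  # blank separator or next section ends the block
            parts = line.split()
            if len(parts) >= 2:
                words.append(parts[1])
    return words

arpa = """\\data\\
ngram 1=3

\\1-grams:
-1.2\t<s>
-0.8\tcat\t-0.3
-0.9\tdog\t-0.3

\\end\\""".splitlines()
print(unigrams_from_arpa(arpa))  # ['<s>', 'cat', 'dog']
```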
Okay, so I also tried the unigram option and it fixes the problem for now, but this probably won't work for, say, new names that we've never seen before. Let me see if I can send you the logits. As for the other implementation, I was using the plain ARPA file, so it might be the case as you suggested, but I don't think the binary one should be that big of a problem. Also, how about a simple extension to enable support for transformer-based LM scorers: https://github.com/simonepri/lm-scorer
great, i will put it on my todo list to provide better support around the unigram list.
I've tried it in a beam-rescoring manner, but that assumes you get good beam predictions to begin with. So I think it'd be wonderful if you could patch that in. Obviously one can use the insanely fast and light transformer models to get a better middle ground. Looking forward to the fix as well as the new feature. Thank you for the wonderful work!
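The beam-rescoring idea can be sketched as follows, with a toy stand-in for a transformer LM scorer (a real scorer, e.g. the lm-scorer package linked above, would return actual sentence log-probabilities; `toy_lm_score` here is purely illustrative):

```python
def rescore_beams(beams, lm_score, alpha=0.5):
    """Rerank decoder beams with an external LM.

    beams: list of (text, acoustic_logp) tuples.
    lm_score: any callable mapping text -> log-probability; a stand-in
    for a heavier transformer-based scorer.
    Combined score = acoustic + alpha * lm, sorted best-first.
    """
    scored = [(t, lp + alpha * lm_score(t)) for t, lp in beams]
    return sorted(scored, key=lambda x: x[1], reverse=True)

def toy_lm_score(text):
    # crude illustrative stand-in: penalize very long "words" that
    # look like several words glued together
    return -5.0 if any(len(w) > 6 for w in text.split()) else 0.0

beams = [("thecatsat", -2.0), ("the cat sat", -2.5)]
print(rescore_beams(beams, toy_lm_score))  # 'the cat sat' wins after rescoring
```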
sounds great, will have a look. thanks for the feedback, and let me know if other issues come up!
see PR #4 for some additional warnings around unigrams, as well as improved partial scoring without a trie, which should help with the word concatenation
I am trying to use the CTC decoding feature with kenlm on logits from HuggingFace's wav2vec2, which returns the following output with beam size 64:
whereas previously, decoding with https://github.com/ynop/py-ctc-decode using the same LM and parameters, I was getting:
I don't understand why the words are being concatenated together. Do you have any thoughts?
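One useful sanity check in cases like this is to compare against plain greedy CTC decoding (argmax per frame, merge repeats, drop blanks), which takes the LM and beam search out of the picture entirely; a minimal sketch assuming blank id 0:

```python
def greedy_ctc_decode(logits, id_to_token, blank_id=0):
    """Greedy CTC: argmax per frame, merge repeats, drop blanks.

    logits: list of per-frame score lists. Useful as a baseline to see
    whether odd concatenation comes from the LM / beam search or from
    the acoustic model itself.
    """
    prev, out = None, []
    for frame in logits:
        idx = max(range(len(frame)), key=frame.__getitem__)
        if idx != blank_id and idx != prev:
            out.append(id_to_token[idx])
        prev = idx
    return "".join(out)

# toy 3-token vocab: 0 = blank, 1 = 'a', 2 = 'b'
logits = [[0.1, 0.8, 0.1],    # 'a'
          [0.1, 0.8, 0.1],    # repeated 'a' -> merged
          [0.9, 0.05, 0.05],  # blank -> dropped
          [0.1, 0.1, 0.8]]    # 'b'
print(greedy_ctc_decode(logits, {1: "a", 2: "b"}))  # 'ab'
```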