
Transcription being concatenated oddly #5

Closed
usmanfarooq619 opened this issue Jun 18, 2021 · 10 comments

Comments

@usmanfarooq619

I am trying to use the CTC decoding feature with KenLM on the logits from Hugging Face's wav2vec2 model:

import kenlm
from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel

# character vocabulary of the wav2vec2 CTC head (the last few entries are special tokens, which render as '⁇')
vocab = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e', 'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm', '⁇', '', '⁇', '⁇']
alphabet = Alphabet.build_alphabet(vocab, ctc_token_idx=-3)

# language model (kenlm_model is a loaded kenlm.Model)
lm = LanguageModel(kenlm_model, alpha=0.169, beta=0.055)

# build the decoder and decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)
text = decoder.decode(logits, beam_width=64)

which returns the following output with beam size 64:

yeah jon okay i m calling from the clinic the family doctor clinessegryand this number six four five five one three o five

whereas previously, when decoding with https://github.com/ynop/py-ctc-decode using the same LM and parameters, I was getting:

yeah on okay i am calling from the clinic the family dot clinic try and this number six four five five one three o five

I don't understand why the words are being concatenated together. Do you have any thoughts?

usmanfarooq619 changed the title from "Transcription Issue with pyctcdecode" to "Transcription being concatenated oddly" on Jun 18, 2021
@gkucsko (Contributor) commented Jun 18, 2021

Hey, this looks like an OOV issue. Since KenLM has no notion of partial tokens, it helps a lot to pass a list of known unigrams (basically all the words you expect to appear). pyctcdecode can then build a character trie under the hood, which can be probed very efficiently during decoding, so OOV words are downweighted as soon as they appear (rather than only indirectly through the LM once a space appears).
Let me know if adding the unigrams when instantiating the LM fixes it, or whether we should look into it more carefully.
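Roughly something like this (just a sketch, not tested against your setup: "lm.binary" and "lm_vocab.txt" are placeholder paths, and alphabet is the one you already built above):

import kenlm
from pyctcdecode import BeamSearchDecoderCTC, LanguageModel

kenlm_model = kenlm.Model("lm.binary")  # or an .arpa file

# one word per line, covering the vocabulary the LM was trained on
with open("lm_vocab.txt") as f:
    unigrams = [line.strip() for line in f if line.strip()]

# passing unigrams lets pyctcdecode build the character trie for partial-word scoring
lm = LanguageModel(kenlm_model, unigrams=unigrams, alpha=0.169, beta=0.055)
decoder = BeamSearchDecoderCTC(alphabet, lm)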

@sarim-zafar

That just sounds like a hotfix to me, to be honest. By your reasoning, the same thing should happen with the other implementation as well.

@gkucsko (Contributor) commented Jun 18, 2021

It could be that you can reproduce the behavior of the other decoder by lowering some of the pruning thresholds during prediction, for example beam_prune_logp=-20 and token_min_logp=-8. The default parameters are tuned for high speed at similar accuracy when both an LM and unigrams are present (the decoder prunes hypotheses continuously during decoding to keep the number of beam proposals small, so that speed is competitive with C++ implementations).
Is there a way you can check this on public data, or share something with us, so that I can try to reproduce it? We usually have access to a unigram list since we train the LM ourselves, but I'm very interested to hear if that's different for you, and happy to work on optimizing the scenario where no unigrams are provided. It could of course also be some other sneaky bug that has nothing to do with unigrams, but it's hard for me to test without being able to reproduce it.
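For reference, both thresholds can be passed straight to decode (a sketch, reusing the decoder and logits from your snippet):

text = decoder.decode(
    logits,
    beam_width=64,
    beam_prune_logp=-20.0,  # keep beams within 20 log-prob of the current best beam
    token_min_logp=-8.0,    # consider characters down to a log-prob of -8
)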

@gkucsko (Contributor) commented Jun 18, 2021

If the 'no unigrams available' scenario is important for you, that's great to know. I'm pretty sure the word concatenation can be fixed by adjusting how we score partial words (which is usually done via the trie built from the unigram list).

@gkucsko (Contributor) commented Jun 18, 2021

Oh, looking at the other repo, I believe it reads the unigrams from the ARPA KenLM file? In that case you don't have to provide the list separately. However, you then can't instantiate from a binary KenLM file, since as far as I know binaries don't store the n-grams in an easily readable format.
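If you do have the ARPA file at hand, pulling the unigrams out of its \1-grams: section is only a few lines (a sketch; "lm.arpa" is a placeholder path and the special tokens filtered out may differ for your model):

# each entry in the \1-grams: section looks like "<log10 prob>\t<word>[\t<backoff>]"
unigrams = []
in_unigrams = False
with open("lm.arpa") as f:
    for line in f:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if not line or line.startswith("\\"):  # blank line or next section ends the block
                break
            word = line.split("\t")[1]
            if word not in ("<s>", "</s>", "<unk>"):
                unigrams.append(word)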

@sarim-zafar

Okay, so I also tried the unigram option and it fixes the problem for now, but this probably won't work for, say, new names that we've never seen before. Let me see if I can provide you with the logits.

As for the other implementation, I was using the plain ARPA file, so it might be the case as you suggested, but I don't think the binary one should be that big of a problem.

Also, how about a simple extension to enable support for transformer-based LM scorers, e.g. https://github.com/simonepri/lm-scorer?

@gkucsko (Contributor) commented Jun 18, 2021

Great, I will put better support around the unigram list on my to-do list.
New words are always tricky: if they are not in your KenLM model, they will be scored as OOV anyway, and how heavily that is penalized can be tuned with a separate parameter.
Another good way to deal with new words (that you know about but that are not yet in your language model) is to provide them as 'hotwords'. That way they get compiled into their own scorer, which increases the likelihood of their being transcribed, and you can tune their weight to decide how important they are.
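A rough sketch of what that looks like at decode time (hotwords and hotword_weight are the relevant arguments; the word list here is a made-up example):

new_names = ["kowalski", "nguyen"]  # hypothetical words the LM has never seen
text = decoder.decode(
    logits,
    beam_width=64,
    hotwords=new_names,
    hotword_weight=10.0,  # larger weight -> stronger boost for the hotwords
)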
Regarding neural LM models, that's definitely something we're curious about if people are interested. Is your main goal to get better results while accepting slower output? Have you tried applying it in a beam re-scoring manner after getting the full outputs, to see if you can improve your results?
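If you want to experiment with that, here is a sketch of re-scoring the n-best list with an external scorer (score_with_neural_lm is a hypothetical function returning a sentence log-probability; depending on the pyctcdecode version, each beam from decode_beams is a tuple whose first element is the text and whose last element is the combined lm_score):

beams = decoder.decode_beams(logits, beam_width=64)
rescored = sorted(
    beams,
    key=lambda beam: beam[-1] + score_with_neural_lm(beam[0]),  # decoder score + neural LM score
    reverse=True,
)
best_text = rescored[0][0]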

@sarim-zafar

I've tried it in a beam re-scoring manner, but that assumes you get good beam predictions to begin with. So I think it would be wonderful if you could patch that in. Obviously one can use the insanely fast and light transformer models to get a better middle ground. Looking forward to the fix as well as the new feature. Thank you for the wonderful work!

@gkucsko (Contributor) commented Jun 18, 2021

Sounds great, will have a look. Thanks for the feedback, and let me know if other issues come up!

@gkucsko (Contributor) commented Jun 18, 2021

See PR #4 for some additional warnings around unigrams, as well as improved partial-word scoring without a trie, which should help with the word concatenation.
