RemoveKaldiNonWords transformation not working as expected #30

Open
tomassykora opened this issue Jul 17, 2020 · 1 comment

Comments

@tomassykora

Hello, when I use the RemoveKaldiNonWords transformation, I get different results for these two text pairs:

  • <unk> xx to xx -> 0.5 WER (0.33 when not using SentencesToListOfWords)
  • <unk>xx to xx -> 0.0 WER

I'd expect both to be zero when using RemoveKaldiNonWords. Is this an actual bug, or am I misunderstanding the usage? I tried several orderings of the transformations in the code below, and the results were always the same.

import jiwer

# truth and hypothesis hold the text pairs listed above
transformation = jiwer.Compose([
    jiwer.RemoveKaldiNonWords(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.RemoveWhiteSpace(replace_by_space=True),
    jiwer.SentencesToListOfWords(word_delimiter=' '),
])
wer = jiwer.wer(
    truth,
    hypothesis,
    truth_transform=transformation, 
    hypothesis_transform=transformation,
)

@nikvaessen
Collaborator

nikvaessen commented Jul 18, 2020

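Your transformation should also include jiwer.RemoveEmptyStrings. Without it, removing <unk> leaves an empty string in the word list, and that empty string is counted as a word when computing the WER.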

import jiwer

truth = "hello"
hypothesis = "hello <unk>"

buggy_custom_transform = jiwer.Compose(
    [
        jiwer.RemoveKaldiNonWords(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.RemoveWhiteSpace(replace_by_space=True),
        jiwer.SentencesToListOfWords(word_delimiter=" "),
    ]
)
working_custom_transform = jiwer.Compose(
    [
        jiwer.RemoveKaldiNonWords(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.RemoveWhiteSpace(replace_by_space=True),
        jiwer.RemoveEmptyStrings(),
        jiwer.SentencesToListOfWords(word_delimiter=" "),
    ]
)

buggy_error_rate = jiwer.wer(
    truth,
    hypothesis,
    truth_transform=buggy_custom_transform,
    hypothesis_transform=buggy_custom_transform,
)
correct_error_rate = jiwer.wer(
    truth,
    hypothesis,
    truth_transform=working_custom_transform,
    hypothesis_transform=working_custom_transform,
)

print(f'after transform: truth={buggy_custom_transform(truth)}, hypothesis:{buggy_custom_transform(hypothesis)}')
print(f'after transform: truth={working_custom_transform(truth)}, hypothesis:{working_custom_transform(hypothesis)}')
print("buggy wer:", buggy_error_rate)
print("correct wer:", correct_error_rate)

Output:

after transform: truth=['hello'], hypothesis:['hello', '']
after transform: truth=['hello'], hypothesis:['hello']
buggy wer: 1.0
correct wer: 0.0
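
To see where the stray empty string comes from, here is a minimal sketch in plain Python that does not use jiwer at all. The regex, `to_words`, and `word_error_rate` are hypothetical stand-ins written for illustration, not jiwer's actual implementation; the sketch only assumes that the non-word token is deleted, whitespace is collapsed, and the sentence is then split on a single space.

```python
import re

# Assumption: a stand-in for RemoveKaldiNonWords that deletes <...> and [...] tokens.
KALDI_NON_WORD = re.compile(r"[<\[][^>\]]*[>\]]")


def to_words(sentence, drop_empty):
    """Delete Kaldi-style tokens, collapse whitespace, split on a single space."""
    cleaned = KALDI_NON_WORD.sub("", sentence)
    cleaned = re.sub(r"\s+", " ", cleaned)
    words = cleaned.split(" ")  # a leading or trailing space yields an empty string
    if drop_empty:
        words = [w for w in words if w != ""]
    return words


def word_error_rate(truth_words, hypothesis_words):
    """Word-level Levenshtein distance divided by the number of truth words."""
    if not truth_words:
        raise ValueError("truth must contain at least one word")
    previous = list(range(len(hypothesis_words) + 1))
    for i, t in enumerate(truth_words, start=1):
        current = [i]
        for j, h in enumerate(hypothesis_words, start=1):
            cost = 0 if t == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(truth_words)


for drop_empty in (False, True):
    truth = to_words("hello", drop_empty)
    hypothesis = to_words("hello <unk>", drop_empty)
    print(drop_empty, truth, hypothesis, word_error_rate(truth, hypothesis))
```

Under these assumptions, `"hello <unk>"` becomes `"hello "` and splits into `['hello', '']`, so the empty string counts as an inserted word (WER 1.0); filtering empty strings brings it back to 0.0. The same sketch turns `"<unk> xx"` into `['', 'xx']`, which gives 0.5 against `['xx']` if `"<unk> xx"` is taken as the truth, while `"<unk>xx"` has no leftover space and yields `['xx']`.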
