jiwer gives an error when passed a very long list of strings #83

Open
yashk2000 opened this issue Oct 18, 2023 · 6 comments

Comments

@yashk2000

Issue

When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:

chr() arg not in range(0x110000)

What's been tried:

  • Calculating wer on individual list elements - this works successfully with no error
  • Splitting the large lists into smaller chunks - this works successfully with no error
  • Passing the entire list to another library such as fastwer - this works successfully with no error
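The chunking workaround above can be sketched as follows. Averaging per-chunk WER values would weight short chunks too heavily, so this sketch sums error counts and reference word counts across chunks and divides once at the end. All names here (`corpus_wer`, `count_chunk_errors`, `word_edit_distance`) are hypothetical, and the pure-Python edit distance is only a stand-in so the example runs without jiwer; it is not jiwer's implementation and applies no text normalization beyond whitespace splitting.

```python
from typing import Callable, Iterator, Sequence, Tuple

# A counter takes one chunk of references and hypotheses and returns
# (number of word errors, number of reference words) for that chunk.
Counter = Callable[[Sequence[str], Sequence[str]], Tuple[int, int]]


def word_edit_distance(ref_words: Sequence[str], hyp_words: Sequence[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def count_chunk_errors(ref_chunk: Sequence[str], hyp_chunk: Sequence[str]) -> Tuple[int, int]:
    """Stand-in counter (assumption: whitespace tokenization, no normalization)."""
    errors = 0
    ref_word_count = 0
    for ref, hyp in zip(ref_chunk, hyp_chunk):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        ref_word_count += len(ref_words)
    return errors, ref_word_count


def chunk_pairs(references: Sequence[str], hypotheses: Sequence[str],
                size: int) -> Iterator[Tuple[Sequence[str], Sequence[str]]]:
    """Yield aligned slices of both lists, `size` sentence pairs at a time."""
    for start in range(0, len(references), size):
        yield references[start:start + size], hypotheses[start:start + size]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str],
               chunk_size: int = 10_000,
               count_errors: Counter = count_chunk_errors) -> float:
    """Corpus-level WER from per-chunk counts (not a mean of per-chunk WERs)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total_errors = 0
    total_ref_words = 0
    for ref_chunk, hyp_chunk in chunk_pairs(references, hypotheses, chunk_size):
        errors, ref_word_count = count_errors(ref_chunk, hyp_chunk)
        total_errors += errors
        total_ref_words += ref_word_count
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_ref_words
```

With jiwer installed, the `count_errors` argument could instead wrap a per-chunk jiwer call that reports error counts. I believe `jiwer.process_words` exposes substitutions, deletions, insertions and hits for this purpose, but that should be checked against the installed version.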

The error only seems to happen when the entire long list is passed into jiwer.

Additional Context

It seems like the vocabulary in the _word2char function isn't built properly. After words from the first N sentences in the list are added, words from the rest of the sentences do not seem to become part of the vocabulary. This results in the chr() arg not in range(0x110000) error when these lines are executed.
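For context, here is a simplified illustration of the general word-to-character encoding idea that a function named `_word2char` suggests: every distinct word gets an integer id, and each id is turned into a single character with `chr()`. This is a sketch under that assumption, not jiwer's actual source, and `words_to_chars` is a hypothetical name. It shows where the quoted error message comes from: Python's `chr()` only accepts values from 0 to 0x10FFFF, so any id of 0x110000 (1,114,112) or more raises `ValueError`.

```python
from itertools import chain
from typing import Dict, List, Tuple

MAX_CODE_POINT = 0x10FFFF  # chr() accepts 0 .. 0x10FFFF inclusive


def words_to_chars(
    reference: List[List[str]], hypothesis: List[List[str]]
) -> Tuple[List[str], List[str], Dict[str, int]]:
    """Encode tokenized sentences as strings with one character per word."""
    # Build the vocabulary from BOTH sides so every word has an id.
    vocabulary = sorted(set(chain(*reference, *hypothesis)))
    if len(vocabulary) > MAX_CODE_POINT + 1:
        raise ValueError("vocabulary too large to encode one word per code point")

    word2id = {word: idx for idx, word in enumerate(vocabulary)}

    # chr(idx) raises ValueError if idx falls outside range(0x110000).
    ref_chars = ["".join(chr(word2id[w]) for w in sent) for sent in reference]
    hyp_chars = ["".join(chr(word2id[w]) for w in sent) for sent in hypothesis]
    return ref_chars, hyp_chars, word2id


if __name__ == "__main__":
    ref = [["the", "cat", "sat"], ["hello"]]
    hyp = [["the", "cat", "sit"], ["hello", "there"]]
    ref_chars, hyp_chars, word2id = words_to_chars(ref, hyp)
    print(len(word2id), [len(s) for s in ref_chars], [len(s) for s in hyp_chars])
```

In a scheme like this, the error can only appear if some id passed to `chr()` reaches 0x110000, so comparing the largest id in use against that bound is a quick sanity check.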

jiwer version: v3.0.3

@nikvaessen
Collaborator

Thanks for reporting! Would you be able to share the length of the vocabulary object when generated from your input?

@yashk2000
Author

Yep, it's 4484.

@nikvaessen
Collaborator

nikvaessen commented Oct 19, 2023

I cannot reproduce it with the following toy data :(

import random
import string

from typing import List

import jiwer


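def random_word(low=2, high=10, rng=random.Random()) -> str:
    # Build a random lowercase word of between `low` and `high` letters.
    word = ""

    # random.randint is inclusive on both ends, so no `+ 1` is needed.
    for i in range(rng.randint(low, high)):
        word += rng.choice(string.ascii_lowercase)

    return word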


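def generate_sentence(vocabulary: List[str], low=1, high=12, rng=random.Random()) -> str:
    # Build a sentence of between `low` and `high` words drawn from the vocabulary.
    sentence = []

    # random.randint is inclusive on both ends, so no `+ 1` is needed.
    for i in range(rng.randint(low, high)):
        sentence.append(rng.choice(vocabulary))

    return " ".join(sentence)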


NUM_SENTENCE = 500_000
NUM_WORDS = 5000

print('generating vocab...')
vocabulary = list(set([random_word() for _ in range(NUM_WORDS)]))
print(len(vocabulary))

print("generating reference...")
ref = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("generating hypotheses...")
hyp = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("calculating wer...")
print(jiwer.wer(ref, hyp))

Can you share the word which fails to be included in the vocabulary?

@yashk2000
Author

The words that are not included are normal English words. Words from entire sentences are missing, such as "Australia", "he", "run", etc.

Some sentences in my list also include numbers like "1", "10", and so on, and can sometimes include non-English characters too. Could this be a potential cause of the issue?

@nikvaessen
Collaborator

I don't think the size is the issue; I think a specific sentence pairing fails. When you tested chunks of the dataset, did those chunks still span the entire range of reference/hypothesis pairs?

Also, do you use a custom transform, or do you use the default?

@yashk2000
Author

@nikvaessen when I test chunks, the chunks do span the entire range of the pairs. I have also tried computing WER by looping over one pair at a time, and that also works.

I'm using the default transform.
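Since individual pairs and chunks pass while the full list fails, one way to narrow this down might be to search for the smallest prefix of the lists that still triggers the error. The sketch below does a binary search over prefix lengths. It assumes the failure is monotonic (once a prefix of N pairs fails, every longer prefix fails too), which may not hold here. `smallest_failing_prefix` and its `compute` argument are hypothetical names; `compute` is any callable that takes the two lists and raises `ValueError` on failure.

```python
from typing import Callable, Optional, Sequence


def smallest_failing_prefix(
    references: Sequence[str],
    hypotheses: Sequence[str],
    compute: Callable[[Sequence[str], Sequence[str]], object],
) -> Optional[int]:
    """Return the smallest N such that compute(refs[:N], hyps[:N]) raises
    ValueError, or None if the full input does not fail.

    Assumption: failures are monotonic in the prefix length.
    """

    def fails(n: int) -> bool:
        try:
            compute(references[:n], hypotheses[:n])
        except ValueError:  # chr() out-of-range errors are ValueError
            return True
        return False

    total = len(references)
    if total == 0 or not fails(total):
        return None

    low, high = 1, total  # invariant: the prefix of length `high` fails
    while low < high:
        mid = (low + high) // 2
        if fails(mid):
            high = mid       # failure already present in the shorter prefix
        else:
            low = mid + 1    # need more pairs to trigger the failure
    return low
```

A call might look like `smallest_failing_prefix(ref, hyp, jiwer.wer)`. The pair at index N - 1 and the vocabulary built from the first N pairs would then be the place to look.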
