Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty line handling #183

Open
juliakreutzer opened this issue Jun 14, 2022 · 5 comments
Open

Empty line handling #183

juliakreutzer opened this issue Jun 14, 2022 · 5 comments
Labels
bug Something isn't working work in process We are now working on this issue.

Comments

@juliakreutzer
Copy link
Collaborator

In translate mode, when a file with empty lines is provided, JoeyNMT's error message is not very helpful:

File "/usr/local/lib/python3.7/dist-packages/joeynmt/tokenizers.py", line 81, in pre_process
    assert raw_input is not None and len(raw_input) > 0, raw_input
AssertionError

Perhaps one could simply skip the line, or output a warning or more informative error message.
I'm not sure if the other modes are ready to handle empty lines, haven't tested it yet.

Here an example:
image

@may-
Copy link
Collaborator

may- commented Jun 14, 2022

  • background:
    An empty line raises an error in sacrebleu. maybe need to skip empty lines before evaluation??

  • If we remove empty lines internally, input line numbers and output line numbers in test will be different. For instance, sentence in line 100 of src file will not be aligned to the sentence in line 100 in trg.

    cf. ) in plaintext dataset, an empty line will be skiped in data loading.

    joeynmt/joeynmt/datasets.py

    Lines 154 to 159 in 6c580f8

    def load_data(self, path: str, **kwargs) -> Any:
    def _pre_process(seq, lang):
    if self.tokenizer[lang] is not None:
    seq = [self.tokenizer[lang].pre_process(s) for s in seq if len(s) > 0]
    return seq

    Basically, we should do this in parallel both for src and trg if trg is given. Otherwise, the number of sequences becomes different between src and trg if only src/trg contains an empty line.
    (In file stream input, we only have src and no trg, so it doesn't matter, maybe...)

  • need this empty line handling for all dataset types, including file streams.

@may- may- mentioned this issue Jun 14, 2022
@may- may- added the bug Something isn't working label Jun 14, 2022
@juliakreutzer
Copy link
Collaborator Author

Yes, very good points. My concern was mostly about the translate mode, where we don't have targets, and also no sacrebleu computation.
We don't want to filter, but we want the user to know that there's an empty line problem. What do you think about just raising an assertion with a more informative error message?

@may-
Copy link
Collaborator

may- commented Aug 31, 2022

@juliakreutzer

What do you think about just raising an assertion with a more informative error message?

yes, that sounds reasonable. I'll write an error message, then.
Currently, the error occurs in tokenization after the training/prediction loop has started, but we can raise an error in data loading, before the minibatches are constructed.

@may- may- added the work in process We are now working on this issue. label Sep 2, 2022
@may-
Copy link
Collaborator

may- commented Sep 7, 2022

Note: the same assertion error can happen, when the model generate an empty string (i.e. special symbol only, such as <unk> + </s>). Need to handle this not only in the data loading but also in prediction.
(alternatively, set generate unk: False and min_output_length > 2 in testing config.)

@may-
Copy link
Collaborator

may- commented Sep 19, 2022

more informative error message in v2.1.0:

def pre_process(self, raw_input: str) -> str:
"""
Pre-process text
- ex.) Lowercase, Normalize, Remove emojis,
Pre-tokenize(add extra white space before punc) etc.
- applied for all inputs both in training and inference.
"""
assert isinstance(raw_input, str) and raw_input.strip() != "", \
"The input sentence is empty! Please make sure " \
"that you are feeding a valid input."

FYI @juliakreutzer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working work in process We are now working on this issue.
Projects
None yet
Development

No branches or pull requests

2 participants