Empty line handling #183

juliakreutzer · 2022-06-14T01:15:11Z

In translate mode, when a file with empty lines is provided, JoeyNMT's error message is not very helpful:

File "/usr/local/lib/python3.7/dist-packages/joeynmt/tokenizers.py", line 81, in pre_process
    assert raw_input is not None and len(raw_input) > 0, raw_input
AssertionError

Perhaps one could simply skip the line, or output a warning or more informative error message.
I'm not sure whether the other modes can handle empty lines; I haven't tested that yet.

Here an example:


may- · 2022-06-14T05:42:37Z

Background:
An empty line raises an error in sacrebleu. Maybe we need to skip empty lines before evaluation?

If we remove empty lines internally, the input line numbers and output line numbers in test will differ. For instance, the sentence in line 100 of the src file will no longer be aligned with the sentence in line 100 of the trg file.
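One way to avoid this misalignment is sketched below: translate only the non-empty lines and put empty outputs back at their original positions. This is a minimal sketch, not JoeyNMT code; `translate_keep_alignment` and `translate_fn` are hypothetical names, and `translate_fn` stands in for whatever batch translation call is actually used.

```python
from typing import Callable, List


def translate_keep_alignment(
    lines: List[str],
    translate_fn: Callable[[List[str]], List[str]],
) -> List[str]:
    """Translate non-empty lines only; empty lines map to empty outputs.

    `translate_fn` is a hypothetical stand-in for a batch translation call.
    """
    # Remember the positions of the lines that actually carry text.
    kept = [i for i, line in enumerate(lines) if line.strip() != ""]
    hyps = translate_fn([lines[i] for i in kept])
    if len(hyps) != len(kept):
        raise ValueError("translate_fn must return one output per input line")

    # Start with empty outputs, then fill in the translations at their
    # original positions, so output line i corresponds to input line i.
    outputs = [""] * len(lines)
    for i, hyp in zip(kept, hyps):
        outputs[i] = hyp
    return outputs


# Toy usage: upper-casing stands in for a real model.
result = translate_keep_alignment(
    ["hello", "", "world"], lambda batch: [s.upper() for s in batch]
)
```

With this approach the output file has as many lines as the input file, at the cost of silently emitting empty lines, so a warning may still be worth adding.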

cf.) In the plaintext dataset, empty lines are skipped during data loading:

joeynmt/joeynmt/datasets.py

Lines 154 to 159 in 6c580f8

    
    def load_data(self, path: str, **kwargs) -> Any:
        def _pre_process(seq, lang):
            if self.tokenizer[lang] is not None:
                seq = [self.tokenizer[lang].pre_process(s) for s in seq if len(s) > 0]
            return seq

Basically, we should do this in parallel both for src and trg if trg is given. Otherwise, the number of sequences becomes different between src and trg if only src/trg contains an empty line.
(In file stream input, we only have src and no trg, so it doesn't matter, maybe...)

We need this empty line handling for all dataset types, including file streams.
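A rough sketch of filtering src and trg in parallel, assuming both sides are plain lists of strings with one sentence per entry. The function name `drop_empty_pairs` is hypothetical and not part of JoeyNMT.

```python
from typing import List, Optional, Tuple


def drop_empty_pairs(
    src: List[str], trg: Optional[List[str]] = None
) -> Tuple[List[str], Optional[List[str]]]:
    """Drop a sentence pair if either side is empty, keeping both sides aligned."""
    if trg is None:
        # No targets (e.g. file stream input): filter the source side only.
        return [s for s in src if s.strip() != ""], None

    if len(src) != len(trg):
        raise ValueError(
            f"src has {len(src)} lines but trg has {len(trg)} lines"
        )

    # Keep a pair only when both sides contain text, so the two lists
    # always end up with the same length.
    pairs = [
        (s, t) for s, t in zip(src, trg)
        if s.strip() != "" and t.strip() != ""
    ]
    return [s for s, _ in pairs], [t for _, t in pairs]


new_src, new_trg = drop_empty_pairs(["a", "", "c"], ["x", "y", ""])
```

Here only the first pair survives, because the second pair has an empty src and the third has an empty trg.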

juliakreutzer · 2022-08-31T13:52:40Z

Yes, very good points. My concern was mostly about the translate mode, where we don't have targets, and also no sacrebleu computation.
We don't want to filter, but we want the user to know that there's an empty line problem. What do you think about just raising an assertion with a more informative error message?

may- · 2022-08-31T15:01:22Z

@juliakreutzer

> What do you think about just raising an assertion with a more informative error message?

Yes, that sounds reasonable. I'll write an error message, then.
Currently, the error occurs in tokenization, after the training/prediction loop has started, but we could raise it during data loading instead, before the minibatches are constructed.
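A minimal sketch of such an early check, assuming the input has already been read into a list of lines. `find_empty_lines` and `check_no_empty_lines` are hypothetical helpers, not existing JoeyNMT functions, and the message wording is only a suggestion.

```python
from typing import List


def find_empty_lines(lines: List[str]) -> List[int]:
    """Return the 1-based line numbers of empty or whitespace-only lines."""
    return [n for n, line in enumerate(lines, start=1) if line.strip() == ""]


def check_no_empty_lines(lines: List[str], name: str = "input") -> None:
    """Raise a descriptive error before any batches are built."""
    empty = find_empty_lines(lines)
    if empty:
        # Show at most ten line numbers to keep the message readable.
        shown = ", ".join(str(n) for n in empty[:10])
        more = "" if len(empty) <= 10 else f" (and {len(empty) - 10} more)"
        raise ValueError(
            f"{name} contains {len(empty)} empty line(s) at line number(s) "
            f"{shown}{more}. Please remove or fill these lines."
        )


empty_line_numbers = find_empty_lines(["a", "", "  ", "b"])
```

Reporting line numbers lets the user fix the file directly, which fits the goal of not filtering silently.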

may- · 2022-09-07T05:35:22Z

Note: the same assertion error can occur when the model generates an empty string (i.e. special symbols only, such as <unk> + </s>). We need to handle this not only in data loading but also in prediction.
(Alternatively, set generate_unk: False and min_output_length > 2 in the testing config.)
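For the prediction side, one possible guard is sketched below: replace empty hypotheses with a placeholder and log a warning, so that later steps which assert on non-empty strings do not crash. The function name `replace_empty_hypotheses` and the placeholder value are assumptions for illustration; whether a placeholder is appropriate depends on how the outputs are evaluated.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def replace_empty_hypotheses(
    hyps: List[str], placeholder: str = "<empty>"
) -> List[str]:
    """Replace empty or whitespace-only hypotheses with a placeholder."""
    cleaned = []
    for i, hyp in enumerate(hyps):
        if hyp.strip() == "":
            # Warn instead of failing, so one bad output does not abort the run.
            logger.warning(
                "Hypothesis %d is empty; using placeholder %r.", i, placeholder
            )
            cleaned.append(placeholder)
        else:
            cleaned.append(hyp)
    return cleaned


cleaned_hyps = replace_empty_hypotheses(["a b", " ", ""])
```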

may- · 2022-09-19T08:30:16Z

A more informative error message is included in v2.1.0:

joeynmt/joeynmt/tokenizers.py

Lines 64 to 73 in 32eef89

    
    def pre_process(self, raw_input: str) -> str:
        """
        Pre-process text
            - ex.) Lowercase, Normalize, Remove emojis,
                Pre-tokenize(add extra white space before punc) etc.
            - applied for all inputs both in training and inference.
        """
        assert isinstance(raw_input, str) and raw_input.strip() != "", \
            "The input sentence is empty! Please make sure " \
            "that you are feeding a valid input."

FYI @juliakreutzer

may- mentioned this issue Jun 14, 2022

n_best IndexError #182

Closed

may- added the bug Something isn't working label Jun 14, 2022

may- added the work in process We are now working on this issue. label Sep 2, 2022
