Wrong Tokenization in SquadQG Evaluation Scripts #50

hzhwcmhf · 2022-04-23T06:25:59Z

Thanks for the great work.

I am trying to reproduce the results reported in GLGE, but the SquadQG evaluation script seems to use the wrong tokenization.

In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text is post-processed by fix_tokenization:

ProphetNet/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py

Lines 40 to 117 in 0a1b59c

    
    def fix_tokenization(text):
        input_tokens = text.split()
        output_tokens = []
        has_left_quote = False
        has_left_single_quote = False
        i = 0
        prev_dash = False
        while i < len(input_tokens):
            tok = input_tokens[i]
            flag_prev_dash = False
            if tok in _tok_dict.keys():
                output_tokens.append(_tok_dict[tok])
                i += 1
            elif tok == "\"":
                if has_left_quote:
                    output_tokens.append("''")
                else:
                    output_tokens.append("``")
                has_left_quote = not has_left_quote
                i += 1
            elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
                output_tokens[-1] = output_tokens[-1][:-1]
                output_tokens.append("n't")
                i += 2
            elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
                output_tokens.append("'"+input_tokens[i + 1])
                i += 2
            elif tok == "'":
                if has_left_single_quote:
                    output_tokens.append("'")
                else:
                    output_tokens.append("`")
                has_left_single_quote = not has_left_single_quote
                i += 1
            elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
                output_tokens.append("...")
                i += 3
            elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
                # $ 3 , 000 -> $ 3,000
                output_tokens[-1] += ','+input_tokens[i + 1]
                i += 2
            elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
                # 3 . 03 -> $ 3.03
                output_tokens[-1] += '.'+input_tokens[i + 1]
                i += 2
            elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
                # U . N . -> U.N.
                k = i+3
                while k+2 < len(input_tokens):
                    if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
                        k += 2
                    else:
                        break
                output_tokens[-1] += ''.join(input_tokens[i:k])
                i += 2
            elif tok == "-":
                if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
                    output_tokens.append("--")
                    i += 2
                elif i == len(input_tokens) - 1 or i == 0:
                    output_tokens.append("-")
                    i += 1
                elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
                    output_tokens[-1] += "-"
                    i += 1
                    flag_prev_dash = True
                else:
                    output_tokens.append("-")
                    i += 1
            elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
                output_tokens[-1] += tok
                i += 1
            else:
                output_tokens.append(tok)
                i += 1
            prev_dash = flag_prev_dash
        return " ".join(output_tokens)

For example, it turns . . . into ..., " into `` or '', and 1 , 000 into 1,000.

However, the original data do not look like the sentences produced by fix_tokenization. Here are some samples from the test set:

What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?
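To make the mismatch concrete, here is a minimal sketch of how this kind of post-processing can lower n-gram overlap even when the model output matches the reference exactly. Note that simplified_fix is a hypothetical stand-in that covers only three of the rules (double quotes, ellipses, digit grouping) and uses str.isdigit instead of the script's _is_digit helper; unigram_precision is likewise only an illustrative clipped 1-gram precision, not the metric code used by the evaluation script.

```python
from collections import Counter


def simplified_fix(text):
    """Hypothetical stand-in: only three rules of fix_tokenization."""
    tokens = text.split()
    out = []
    open_quote = False
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '"':
            # " -> `` (opening) or '' (closing)
            out.append("''" if open_quote else "``")
            open_quote = not open_quote
            i += 1
        elif tokens[i:i + 3] == [".", ".", "."]:
            # . . . -> ...
            out.append("...")
            i += 3
        elif (tok == "," and out and out[-1].isdigit()
              and i + 1 < len(tokens) and tokens[i + 1].isdigit()):
            # 100 , 000 -> 100,000 (assumption: str.isdigit replaces _is_digit)
            out[-1] += "," + tokens[i + 1]
            i += 2
        else:
            out.append(tok)
            i += 1
    return " ".join(out)


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision over whitespace tokens."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    matched = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matched / max(sum(hyp.values()), 1)


reference = "What city in Montana has over 100 , 000 people ?"
perfect_output = reference  # pretend the model reproduced the reference
fixed_output = simplified_fix(perfect_output)

print(fixed_output)
print(unigram_precision(perfect_output, reference))
print(unigram_precision(fixed_output, reference))
```

In this sketch the untouched output matches every reference token, while the post-processed output merges 100 , 000 into a single token that never appears in the reference, so the overlap drops.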

Moreover, I reproduced MASS-base and found that the scores are higher when fix_tokenization is disabled:

| Setting | BLEU | METEOR | ROUGE-L |
|---|---|---|---|
| MASS-base (reported in GLGE) | 20.1 | 24.4 | 49.4 |
| MASS-base (reproduced, with fix_tokenization) | 20.69 | 24.92 | 49.21 |
| MASS-base (reproduced, without fix_tokenization) | 22.54 | 25.03 | 50.27 |
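
For reference, the gaps between these rows can be computed with a few lines of Python. This is only arithmetic on the numbers in the table; the dictionary keys are labels chosen here for illustration.

```python
# Scores copied from the table above, in the order (BLEU, METEOR, ROUGE-L).
scores = {
    "reported": (20.1, 24.4, 49.4),
    "with_fix": (20.69, 24.92, 49.21),
    "without_fix": (22.54, 25.03, 50.27),
}
metrics = ("BLEU", "METEOR", "ROUGE-L")


def delta(a, b):
    """Per-metric difference scores[a] - scores[b], rounded to 2 decimals."""
    return tuple(round(x - y, 2) for x, y in zip(scores[a], scores[b]))


for baseline in ("with_fix", "reported"):
    diffs = delta("without_fix", baseline)
    summary = ", ".join(f"{m} {d:+.2f}" for m, d in zip(metrics, diffs))
    print(f"without_fix vs {baseline}: {summary}")
```

Disabling fix_tokenization gains 1.85 BLEU over my run with it enabled, and 2.44 BLEU over the number reported in GLGE.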

I wonder whether I missed something, or whether the reported results were computed with the wrong tokenization.
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply.
