Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong Tokenization in SquadQG Evaluation Scripts #50

Open
hzhwcmhf opened this issue Apr 23, 2022 · 0 comments
Open

Wrong Tokenization in SquadQG Evaluation Scripts #50

hzhwcmhf opened this issue Apr 23, 2022 · 0 comments

Comments

@hzhwcmhf
Copy link

Thanks for the great work.

I am reproducing the result reported in GLGE but find that the SquadQG evaluation script seem to use wrong tokenization.

In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text are post-processed by fix_tokenization:

def fix_tokenization(text):
input_tokens = text.split()
output_tokens = []
has_left_quote = False
has_left_single_quote = False
i = 0
prev_dash = False
while i < len(input_tokens):
tok = input_tokens[i]
flag_prev_dash = False
if tok in _tok_dict.keys():
output_tokens.append(_tok_dict[tok])
i += 1
elif tok == "\"":
if has_left_quote:
output_tokens.append("''")
else:
output_tokens.append("``")
has_left_quote = not has_left_quote
i += 1
elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
output_tokens[-1] = output_tokens[-1][:-1]
output_tokens.append("n't")
i += 2
elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
output_tokens.append("'"+input_tokens[i + 1])
i += 2
elif tok == "'":
if has_left_single_quote:
output_tokens.append("'")
else:
output_tokens.append("`")
has_left_single_quote = not has_left_single_quote
i += 1
elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
output_tokens.append("...")
i += 3
elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
# $ 3 , 000 -> $ 3,000
output_tokens[-1] += ','+input_tokens[i + 1]
i += 2
elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
# 3 . 03 -> $ 3.03
output_tokens[-1] += '.'+input_tokens[i + 1]
i += 2
elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
# U . N . -> U.N.
k = i+3
while k+2 < len(input_tokens):
if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
k += 2
else:
break
output_tokens[-1] += ''.join(input_tokens[i:k])
i += 2
elif tok == "-":
if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
output_tokens.append("--")
i += 2
elif i == len(input_tokens) - 1 or i == 0:
output_tokens.append("-")
i += 1
elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
output_tokens[-1] += "-"
i += 1
flag_prev_dash = True
else:
output_tokens.append("-")
i += 1
elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
output_tokens[-1] += tok
i += 1
else:
output_tokens.append(tok)
i += 1
prev_dash = flag_prev_dash
return " ".join(output_tokens)

For example, it turns . . . to ..., " to '', 1 , 000 to 1,000.

However, the original data do not like the sentence after fix_tokenization. Here are some samples from the test set:

What did Harff define as " short - lived outbursts by mobs . . . ? "
Who sang " Girls Love Beyoncé " in 2013 ?
What city in Montana has over 100 , 000 people ?

Moreover, I reproduce MASS-base and find the results are higher if we disable fix_tokenization:

BLEU METEOR ROUGE-L
MASS-base reported in GLGE 20.1 24.4 49.4
MASS-base reproduce with fix_tokenization 20.69 24.92 49.21
MASS-base reproduce without fix_tokenization 22.54 25.03 50.27

I wonder whether I miss somthing or the reported results use a wrong tokenization?
I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

Looking forward to your reply

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant