
XQuAD can have a better evaluation #28

Closed
Liangtaiwan opened this issue May 12, 2020 · 4 comments
Comments

@Liangtaiwan
Contributor

Liangtaiwan commented May 12, 2020

When evaluating XQuAD, the original SQuAD evaluation script is used.
However, that script does not handle tokenization for Chinese (the MLQA script does).
For example, "天氣很好" (The weather is good.) should be tokenized as "天", "氣", "很", "好" or "天氣", "很好".

I'm confident this would give a more convincing score for Chinese, but I'm not sure whether it is better for the other languages.

I modified the original code as follows:

# Counter comes from the standard library; normalize_answer is the one already
# defined in the original SQuAD evaluation script.
from collections import Counter

def is_english_num(s):
    # True if the string contains only ASCII characters (English text/numbers).
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

def get_tokens(s):
    if not s:
        return []
    if is_english_num(s):
        # ASCII-only answers: keep the original whitespace tokenization.
        return normalize_answer(s).split()
    # Non-ASCII answers (e.g. Chinese): fall back to character-level tokens.
    return [w for w in normalize_answer(s)]

def f1_score(prediction, ground_truth):
    prediction_tokens = get_tokens(prediction)
    ground_truth_tokens = get_tokens(ground_truth)
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
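
As a quick sanity check on the Chinese example above, something like the snippet below shows the difference; normalize_answer here is only a minimal stand-in for the SQuAD script's version (lowercase, strip ASCII punctuation, collapse whitespace) so that the snippet runs on its own.

import string
from collections import Counter

def normalize_answer(s):
    # Minimal stand-in for the SQuAD script's normalize_answer.
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in set(string.punctuation))
    return ' '.join(s.split())

def f1(pred_tokens, gold_tokens):
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

prediction, gold = "天氣很好", "今天天氣很好"

# Original whitespace tokenization: each string is a single token, so any
# partial overlap scores 0.
print(f1(normalize_answer(prediction).split(), normalize_answer(gold).split()))  # 0.0

# Character-level tokens for non-ASCII text: partial overlap gets partial credit.
print(f1(list(normalize_answer(prediction)), list(normalize_answer(gold))))      # 0.8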

Maybe you could consider this change?

@Liangtaiwan
Contributor Author

Scores before changing the script

XQuAD
  en {"exact_match": 69.15966386554622, "f1": 81.326960111499}
  es {"exact_match": 50.84033613445378, "f1": 71.9177204984632}
  de {"exact_match": 49.15966386554622, "f1": 66.40521293685734}
  el {"exact_match": 31.428571428571427, "f1": 47.04197672432378}
  ru {"exact_match": 51.34453781512605, "f1": 68.7757929521295}
  tr {"exact_match": 29.831932773109244, "f1": 45.22649443514344}
  ar {"exact_match": 43.94957983193277, "f1": 60.508341054177116}
  vi {"exact_match": 13.193277310924369, "f1": 31.339180641908623}
  th {"exact_match": 18.65546218487395, "f1": 27.48267185662145}
  zh {"exact_match": 48.90756302521008, "f1": 58.35407496331861}
  hi {"exact_match": 26.386554621848738, "f1": 41.85833342624981}

Scores after changing the script

XQuAD
  en {"exact_match": 69.15966386554622, "f1": 81.12936130528962}
  es {"exact_match": 50.84033613445378, "f1": 70.39142998082033}
  de {"exact_match": 49.15966386554622, "f1": 65.50942887082724}
  el {"exact_match": 31.428571428571427, "f1": 56.924266644045474}
  ru {"exact_match": 51.34453781512605, "f1": 73.58736598799115}
  tr {"exact_match": 29.831932773109244, "f1": 47.694480985740995}
  ar {"exact_match": 43.94957983193277, "f1": 69.68323791997037}
  vi {"exact_match": 13.193277310924369, "f1": 38.22218477944847}
  th {"exact_match": 18.65546218487395, "f1": 41.452670791585874}
  zh {"exact_match": 48.90756302521008, "f1": 66.19815379584597}
  hi {"exact_match": 26.386554621848738, "f1": 52.00823001166374}

@sebastianruder
Collaborator

Thanks for the note. I had experimented with using the MLQA evaluation script for XQuAD but only observed marginal differences in some experiments (as mentioned here). If the differences are indeed larger, we might consider updating the evaluation script. What model did you use to obtain the scores?
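
For context, the MLQA script's handling of Chinese is essentially a mixed segmentation: CJK characters become individual tokens while the rest is still split on whitespace. A rough sketch of that idea (not the actual MLQA code) would be:

import re

# Basic CJK ranges; the real MLQA script covers more scripts and punctuation.
_CJK = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf]')

def mixed_segmentation(text):
    tokens, buf = [], ''
    for ch in text:
        if _CJK.match(ch):
            if buf:
                tokens.extend(buf.split())
                buf = ''
            tokens.append(ch)  # each CJK character is its own token
        else:
            buf += ch
    if buf:
        tokens.extend(buf.split())
    return tokens

print(mixed_segmentation("the weather 天氣很好 today"))
# ['the', 'weather', '天', '氣', '很', '好', 'today']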

@Liangtaiwan
Contributor Author

Hi @sebastianruder, I used bert-base-multilingual-cased.

However, as I mentioned in #8 (comment), there is a bug in scripts/*_qa.sh:
in the results above, I did not remove the --do_lower_case argument.

@Liangtaiwan
Contributor Author

Liangtaiwan commented May 13, 2020

I'm sure that the difference is large for Chinese.
Before the multilingual question answering datasets were released, I ran some zero-shot reading comprehension experiments on DRCD (Chinese) and KorQuAD (Korean).

The results are reported in https://arxiv.org/pdf/1909.09587.pdf.
With the evaluation script modified as here, I get "exact": 66.71396140749148, "f1": 78.41471541616556 on DRCD. Without the modification, I get "exact": 66.71396140749148, "f1": 66.71396140749148 on DRCD: since an unsegmented Chinese answer is a single whitespace token, per-example F1 is all-or-nothing and collapses to exact match.
