
XQuAD can have a better evaluation #28

Closed
Liangtaiwan opened this issue May 12, 2020 · 4 comments
Comments

@Liangtaiwan
Contributor

Liangtaiwan commented May 12, 2020

When evaluating XQuAD, the original SQuAD evaluation script is used.
However, that script does not handle tokenization for Chinese (the MLQA script does).
For example, "天氣很好" (The weather is good.) should be tokenized as "天", "氣", "很", "好" or "天氣", "很好".

I'm confident this would give a more convincing score for Chinese, but I'm not sure whether it is better for the other languages.

I modified the original code as follows:

# Counter comes from the standard library; normalize_answer is the one already
# defined in the original SQuAD evaluation script.
from collections import Counter

def is_english_num(s):
    # True if the string contains only ASCII characters (English text/numbers).
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

def get_tokens(s):
    if not s:
        return []
    if is_english_num(s):
        # ASCII-only answers: keep the original whitespace tokenization.
        return normalize_answer(s).split()
    # Non-ASCII answers (e.g. Chinese): fall back to character-level tokens.
    return [w for w in normalize_answer(s)]

def f1_score(prediction, ground_truth):
    prediction_tokens = get_tokens(prediction)
    ground_truth_tokens = get_tokens(ground_truth)
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
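
As a quick sanity check on the Chinese example above, something like the snippet below shows the difference; normalize_answer here is only a minimal stand-in for the SQuAD script's version (lowercase, strip ASCII punctuation, collapse whitespace) so that the snippet runs on its own.

import string
from collections import Counter

def normalize_answer(s):
    # Minimal stand-in for the SQuAD script's normalize_answer.
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in set(string.punctuation))
    return ' '.join(s.split())

def f1(pred_tokens, gold_tokens):
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

prediction, gold = "天氣很好", "今天天氣很好"

# Original whitespace tokenization: each string is a single token, so any
# partial overlap scores 0.
print(f1(normalize_answer(prediction).split(), normalize_answer(gold).split()))  # 0.0

# Character-level tokens for non-ASCII text: partial overlap gets partial credit.
print(f1(list(normalize_answer(prediction)), list(normalize_answer(gold))))      # 0.8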

Maybe you could consider this change?

@Liangtaiwan
Contributor Author

Scores before changing the script

XQuAD
  en {"exact_match": 69.15966386554622, "f1": 81.326960111499}
  es {"exact_match": 50.84033613445378, "f1": 71.9177204984632}
  de {"exact_match": 49.15966386554622, "f1": 66.40521293685734}
  el {"exact_match": 31.428571428571427, "f1": 47.04197672432378}
  ru {"exact_match": 51.34453781512605, "f1": 68.7757929521295}
  tr {"exact_match": 29.831932773109244, "f1": 45.22649443514344}
  ar {"exact_match": 43.94957983193277, "f1": 60.508341054177116}
  vi {"exact_match": 13.193277310924369, "f1": 31.339180641908623}
  th {"exact_match": 18.65546218487395, "f1": 27.48267185662145}
  zh {"exact_match": 48.90756302521008, "f1": 58.35407496331861}
  hi {"exact_match": 26.386554621848738, "f1": 41.85833342624981}

Scores after changing the script

XQuAD
  en {"exact_match": 69.15966386554622, "f1": 81.12936130528962}
  es {"exact_match": 50.84033613445378, "f1": 70.39142998082033}
  de {"exact_match": 49.15966386554622, "f1": 65.50942887082724}
  el {"exact_match": 31.428571428571427, "f1": 56.924266644045474}
  ru {"exact_match": 51.34453781512605, "f1": 73.58736598799115}
  tr {"exact_match": 29.831932773109244, "f1": 47.694480985740995}
  ar {"exact_match": 43.94957983193277, "f1": 69.68323791997037}
  vi {"exact_match": 13.193277310924369, "f1": 38.22218477944847}
  th {"exact_match": 18.65546218487395, "f1": 41.452670791585874}
  zh {"exact_match": 48.90756302521008, "f1": 66.19815379584597}
  hi {"exact_match": 26.386554621848738, "f1": 52.00823001166374}

@sebastianruder
Collaborator

Thanks for the note. I had experimented with using the MLQA evaluation script for XQuAD but only observed marginal differences in some experiments (as mentioned here). If the differences are indeed larger, we might consider updating the evaluation script. What model did you use to obtain the scores?
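
For context, the MLQA script's handling of Chinese is essentially a mixed segmentation: CJK characters become individual tokens while the rest is still split on whitespace. A rough sketch of that idea (not the actual MLQA code) would be:

import re

# Basic CJK ranges; the real MLQA script covers more scripts and punctuation.
_CJK = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf]')

def mixed_segmentation(text):
    tokens, buf = [], ''
    for ch in text:
        if _CJK.match(ch):
            if buf:
                tokens.extend(buf.split())
                buf = ''
            tokens.append(ch)  # each CJK character is its own token
        else:
            buf += ch
    if buf:
        tokens.extend(buf.split())
    return tokens

print(mixed_segmentation("the weather 天氣很好 today"))
# ['the', 'weather', '天', '氣', '很', '好', 'today']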

@Liangtaiwan
Contributor Author

Hi @sebastianruder, I used bert-base-multilingual-cased.

However, as I mentioned in #8 (comment), there is a bug in scripts/*_qa.sh:
in the results above, I did not remove the --do_lower_case argument.

@Liangtaiwan
Contributor Author

Liangtaiwan commented May 13, 2020

I'm sure that the difference is large for Chinese.
Before the multilingual question answering datasets were released, I ran some zero-shot reading comprehension experiments on DRCD (Chinese) and KorQuAD (Korean).

The results are reported in https://arxiv.org/pdf/1909.09587.pdf.
With the evaluation script modified as here, I get "exact": 66.71396140749148, "f1": 78.41471541616556 on DRCD. Without the modification, I get "exact": 66.71396140749148, "f1": 66.71396140749148 on DRCD: since an unsegmented Chinese answer is a single whitespace token, per-example F1 is all-or-nothing and collapses to exact match.
