XQUAD results reproducibility for mBERT #8

Closed
MaksymDel opened this issue Apr 16, 2020 · 10 comments

@MaksymDel (Contributor) commented Apr 16, 2020

Hi, thanks for the benchmark and the accompanying code!

I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code.
I ran the cased mBERT model with default parameters and strictly followed the instructions in the README file.

However, the results for some languages are much lower than the scores from the paper.
In particular, for vi and th the gap is roughly two-fold, and there is also a significant drop for hi and el.
The results for e.g. en, es, and de, on the other hand, are comparable.

Below is a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look?

paper: {"f1", "exact_match"}

XQuAD 

  en {"exact_match": 71.76470588235294, "f1": 83.86480699632085} paper: 83.5 / 72.2 
  es {"exact_match": 53.94957983193277, "f1": 73.27239623706365} paper: 75.5 / 56.9 
  de {"exact_match": 52.35294117647059, "f1": 69.47398743963343} paper: 70.6 / 54.0 
  el {"exact_match": 33.61344537815126, "f1": 48.94642083187724} paper: 62.6 / 44.9 
  ru {"exact_match": 52.10084033613445, "f1": 69.82661430981189} paper: 71.3 / 53.3
  tr {"exact_match": 32.35294117647059, "f1": 46.14441800236999} paper: 55.4 / 40.1
  ar {"exact_match": 42.52100840336134, "f1": 59.72583892569921} paper: 61.5 / 45.1 
  vi {"exact_match": 15.210084033613445, "f1": 33.112047090752164} paper: 69.5 / 49.6 
  th {"exact_match": 15.294117647058824, "f1": 24.87707204093759} paper: 42.7 / 33.5 
  zh {"exact_match": 48.99159663865546, "f1": 58.654625486558196} paper: 58.0 / 48.3 
  hi {"exact_match": 22.436974789915965, "f1": 38.31058195464005} paper: 59.2 / 46.0 
@sebastianruder (Collaborator) commented:

Hi Max,
Thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase as that was faster with Google infrastructure (see Appendix B in the paper). I'll check that the same results can be obtained with Transformers and will get back to you.

MaksymDel changed the title from "XQUAD results reproducibility" to "XQUAD results reproducibility for mBERT" on Apr 18, 2020
@MaksymDel (Contributor, Author) commented Apr 18, 2020

Thanks, Sebastian!

It would be interesting to see whether differences in hyperparameters caused such a gap. I can immediately see several choices hardcoded in Google's codebase that differ from what is passed in the Transformers version (a sketch of the differing settings follows this list):

  1. linear learning rate decay in Transformers vs. polynomial LR decay in Google's script
  2. weight_decay=0.0001 in Transformers vs. weight_decay_rate=0.01 in Google's script
  3. adam epsilon=1e-8 in Transformers vs. 1e-6 in Google's script

So unless you manually changed these values in Google's script, these are some of the notable differences.
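For concreteness, here is a minimal sketch of the two optimizer/schedule configurations side by side. This is an illustration only, not the XTREME training code: torch.optim.AdamW stands in for BERT's own weight-decay optimizer, the step counts and learning rate are placeholders, and the polynomial-decay helper assumes a reasonably recent Transformers release.

```python
import torch
from transformers import (
    get_linear_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)

# Dummy parameters stand in for the mBERT weights; steps and LR are placeholders.
params = [torch.nn.Parameter(torch.zeros(1))]
num_training_steps, num_warmup_steps, lr = 10_000, 500, 3e-5

# This repo's run_squad.py settings: eps=1e-8, weight_decay=0.0001,
# linear decay of the learning rate after warmup.
opt_hf = torch.optim.AdamW(params, lr=lr, eps=1e-8, weight_decay=0.0001)
sched_hf = get_linear_schedule_with_warmup(
    opt_hf, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)

# Google's BERT script hard-codes eps=1e-6, weight_decay_rate=0.01,
# and a polynomial decay (power=1.0) to zero after warmup.
opt_google = torch.optim.AdamW(params, lr=lr, eps=1e-6, weight_decay=0.01)
sched_google = get_polynomial_decay_schedule_with_warmup(
    opt_google,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
    lr_end=0.0,
    power=1.0,
)
```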

Meanwhile, I can also confirm that the issue only affects mBERT: for XLM-R I got the following average numbers, 76.7 / 61.0, which is on par with 76.6 / 60.8 from the paper.

@sebastianruder (Collaborator) commented:

Thanks for the note, Max. Yes, these are some of the settings that probably explain the difference in performance.
For XLM-R, we went with the implementation (and the default hyper-parameters) in Transformers, so that should work out of the box as expected.

@Liangtaiwan (Contributor) commented:

@maksym-del, @sebastianruder
If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing the argument, I get results almost the same as the performance reported in the paper (a short illustration of why the flag hurts follows the quoted commands below).

The flag appears on line 53 and line 63:

CUDA_VISIBLE_DEVICES=$GPU python third_party/run_squad.py \
--model_type ${MODEL_TYPE} \
--model_name_or_path ${MODEL} \
--do_lower_case \
--do_train \
--do_eval \
--train_file ${TRAIN_FILE} \
--predict_file ${PREDICT_FILE} \
--per_gpu_train_batch_size 4 \
--learning_rate ${LR} \
--num_train_epochs ${NUM_EPOCHS} \
--max_seq_length $MAXL \
--doc_stride 128 \
--save_steps -1 \
--overwrite_output_dir \
--gradient_accumulation_steps 4 \
--warmup_steps 500 \
--output_dir ${MODEL_PATH} \
--weight_decay 0.0001 \
--threads 8 \
--train_lang en \
--eval_lang en

CUDA_VISIBLE_DEVICES=${CUDA} python third_party/run_squad.py \
--model_type ${MODEL_TYPE} \
--model_name_or_path ${MODEL_PATH} \
--do_eval \
--do_lower_case \
--eval_lang ${lang} \
--predict_file "${TEST_FILE}" \
--output_dir "${PRED_DIR}" &> /dev/null

@Liangtaiwan (Contributor) commented:

Here are the results I got on XQuAD

XQuAD
  en {"exact_match": 72.18487394957984, "f1": 84.05491660467752}
  es {"exact_match": 56.63865546218487, "f1": 75.50683844229154}
  de {"exact_match": 58.23529411764706, "f1": 73.97330302393942}
  el {"exact_match": 47.73109243697479, "f1": 64.71526367876008}
  ru {"exact_match": 54.285714285714285, "f1": 70.85210687094488}
  tr {"exact_match": 39.15966386554622, "f1": 54.04959679389641}
  ar {"exact_match": 47.39495798319328, "f1": 63.42460795613208}
  vi {"exact_match": 50.33613445378151, "f1": 69.39497841433942}
  th {"exact_match": 32.94117647058823, "f1": 42.04649738683358}
  zh {"exact_match": 48.99159663865546, "f1": 58.25216753368008}
  hi {"exact_match": 44.95798319327731, "f1": 58.764676794694026}

@hit-computer commented:

@Liangtaiwan Hi, I only find test data in the download/xquad folder, and this data is just for testing and does not have labels. How did you get the above results on XQuAD? Thanks :)

@Liangtaiwan (Contributor) commented:

@hit-computer You can find the labels here: https://github.com/deepmind/xquad
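If it helps, here is a minimal, self-contained sketch of scoring a run_squad.py-style predictions file (a {question_id: answer_text} JSON) against those gold XQuAD files, using the plain SQuAD v1.1 exact-match / F1 metric; the file names at the bottom are only examples, and no language-specific tokenization is applied.

```python
import json
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(pred, gold):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def evaluate(gold_path, pred_path):
    with open(gold_path) as f:
        gold = json.load(f)["data"]
    with open(pred_path) as f:
        preds = json.load(f)  # {question_id: predicted answer text}
    em, f1 = [], []
    for article in gold:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                pred = preds.get(qa["id"], "")
                em.append(max(float(normalize(pred) == normalize(a)) for a in answers))
                f1.append(max(f1_score(pred, a) for a in answers))
    return {"exact_match": 100 * sum(em) / len(em), "f1": 100 * sum(f1) / len(f1)}

# Example file names only -- adjust to wherever your gold data and predictions live.
print(evaluate("xquad.vi.json", "predictions_vi.json"))
```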

@hit-computer commented:

@Liangtaiwan Thank you very much!

@sebastianruder (Collaborator) commented:

Hi @hit-computer, I've answered in the corresponding issue. Please don't post in other unrelated issues but instead tag people in your issue.

@melvinjosej (Collaborator) commented:

Closing this issue. Please re-open if needed.
