XQUAD results reproducibility for mBERT #8

Closed
MaksymDel opened this issue Apr 16, 2020 · 10 comments

@MaksymDel (Contributor) commented Apr 16, 2020

Hi, thanks for the benchmark and the accompanying code!

I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code.
I ran the cased mBERT model with default parameters and strictly followed the instructions in the README file.

However, the results for some languages are much lower than the scores from the paper.
In particular, for vi and th the gap is roughly two-fold, and there is also a significant drop for hi and el.
The results for e.g. en, es, and de, on the other hand, are comparable.

Below is a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look?

paper: {"f1", "exact_match"}

XQuAD 

  en {"exact_match": 71.76470588235294, "f1": 83.86480699632085} paper: 83.5 / 72.2 
  es {"exact_match": 53.94957983193277, "f1": 73.27239623706365} paper: 75.5 / 56.9 
  de {"exact_match": 52.35294117647059, "f1": 69.47398743963343} paper: 70.6 / 54.0 
  el {"exact_match": 33.61344537815126, "f1": 48.94642083187724} paper: 62.6 / 44.9 
  ru {"exact_match": 52.10084033613445, "f1": 69.82661430981189} paper: 71.3 / 53.3
  tr {"exact_match": 32.35294117647059, "f1": 46.14441800236999} paper: 55.4 / 40.1
  ar {"exact_match": 42.52100840336134, "f1": 59.72583892569921} paper: 61.5 / 45.1 
  vi {"exact_match": 15.210084033613445, "f1": 33.112047090752164} paper: 69.5 / 49.6 
  th {"exact_match": 15.294117647058824, "f1": 24.87707204093759} paper: 42.7 / 33.5 
  zh {"exact_match": 48.99159663865546, "f1": 58.654625486558196} paper: 58.0 / 48.3 
  hi {"exact_match": 22.436974789915965, "f1": 38.31058195464005} paper: 59.2 / 46.0 
@sebastianruder (Collaborator) commented:

Hi Max,
Thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase as that was faster with Google infrastructure (see Appendix B in the paper). I'll check that the same results can be obtained with Transformers and will get back to you.

MaksymDel changed the title from "XQUAD results reproducibility" to "XQUAD results reproducibility for mBERT" on Apr 18, 2020
@MaksymDel (Contributor, Author) commented Apr 18, 2020

Thanks, Sebastian!

It would be interesting to see whether differences in hyperparameters caused such a gap. I can immediately see several choices hardcoded in Google's codebase that differ from what is passed in the Transformers version (a sketch of the differing settings follows this list):

  1. linear learning rate decay in Transformers vs. polynomial LR decay in Google's script
  2. weight_decay=0.0001 in Transformers vs. weight_decay_rate=0.01 in Google's script
  3. adam epsilon=1e-8 in Transformers vs. 1e-6 in Google's script

So unless you manually changed these values in Google's script, these are some of the notable differences.
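For concreteness, here is a minimal sketch of the two optimizer/schedule configurations side by side. This is an illustration only, not the XTREME training code: torch.optim.AdamW stands in for BERT's own weight-decay optimizer, the step counts and learning rate are placeholders, and the polynomial-decay helper assumes a reasonably recent Transformers release.

```python
import torch
from transformers import (
    get_linear_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)

# Dummy parameters stand in for the mBERT weights; steps and LR are placeholders.
params = [torch.nn.Parameter(torch.zeros(1))]
num_training_steps, num_warmup_steps, lr = 10_000, 500, 3e-5

# This repo's run_squad.py settings: eps=1e-8, weight_decay=0.0001,
# linear decay of the learning rate after warmup.
opt_hf = torch.optim.AdamW(params, lr=lr, eps=1e-8, weight_decay=0.0001)
sched_hf = get_linear_schedule_with_warmup(
    opt_hf, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)

# Google's BERT script hard-codes eps=1e-6, weight_decay_rate=0.01,
# and a polynomial decay (power=1.0) to zero after warmup.
opt_google = torch.optim.AdamW(params, lr=lr, eps=1e-6, weight_decay=0.01)
sched_google = get_polynomial_decay_schedule_with_warmup(
    opt_google,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
    lr_end=0.0,
    power=1.0,
)
```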

Meanwhile, I can also confirm that the issue only affects mBERT: for XLM-R I got the following average numbers, 76.7 / 61.0, which is on par with 76.6 / 60.8 from the paper.

@sebastianruder (Collaborator) commented:

Thanks for the note, Max. Yes, these are some of the settings that probably explain the difference in performance.
For XLM-R, we went with the implementation (and the default hyper-parameters) in Transformers, so that should work out of the box as expected.

@Liangtaiwan (Contributor) commented:

@maksym-del, @sebastianruder
If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing the argument, I get results almost the same as the performance reported in the paper (a short illustration of why the flag hurts follows the quoted commands below).

The flag appears on line 53 and line 63:

CUDA_VISIBLE_DEVICES=$GPU python third_party/run_squad.py \
--model_type ${MODEL_TYPE} \
--model_name_or_path ${MODEL} \
--do_lower_case \
--do_train \
--do_eval \
--train_file ${TRAIN_FILE} \
--predict_file ${PREDICT_FILE} \
--per_gpu_train_batch_size 4 \
--learning_rate ${LR} \
--num_train_epochs ${NUM_EPOCHS} \
--max_seq_length $MAXL \
--doc_stride 128 \
--save_steps -1 \
--overwrite_output_dir \
--gradient_accumulation_steps 4 \
--warmup_steps 500 \
--output_dir ${MODEL_PATH} \
--weight_decay 0.0001 \
--threads 8 \
--train_lang en \
--eval_lang en

CUDA_VISIBLE_DEVICES=${CUDA} python third_party/run_squad.py \
--model_type ${MODEL_TYPE} \
--model_name_or_path ${MODEL_PATH} \
--do_eval \
--do_lower_case \
--eval_lang ${lang} \
--predict_file "${TEST_FILE}" \
--output_dir "${PRED_DIR}" &> /dev/null

@Liangtaiwan (Contributor) commented:

Here are the results I got on XQuAD

XQuAD
  en {"exact_match": 72.18487394957984, "f1": 84.05491660467752}
  es {"exact_match": 56.63865546218487, "f1": 75.50683844229154}
  de {"exact_match": 58.23529411764706, "f1": 73.97330302393942}
  el {"exact_match": 47.73109243697479, "f1": 64.71526367876008}
  ru {"exact_match": 54.285714285714285, "f1": 70.85210687094488}
  tr {"exact_match": 39.15966386554622, "f1": 54.04959679389641}
  ar {"exact_match": 47.39495798319328, "f1": 63.42460795613208}
  vi {"exact_match": 50.33613445378151, "f1": 69.39497841433942}
  th {"exact_match": 32.94117647058823, "f1": 42.04649738683358}
  zh {"exact_match": 48.99159663865546, "f1": 58.25216753368008}
  hi {"exact_match": 44.95798319327731, "f1": 58.764676794694026}

@hit-computer commented:

@Liangtaiwan Hi, I only find test data in the download/xquad folder, and this data is just for testing and does not have labels. How did you get the above results on XQuAD? Thanks :)

@Liangtaiwan (Contributor) commented:

@hit-computer You can find the labels here: https://github.com/deepmind/xquad
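If it helps, here is a minimal, self-contained sketch of scoring a run_squad.py-style predictions file (a {question_id: answer_text} JSON) against those gold XQuAD files, using the plain SQuAD v1.1 exact-match / F1 metric; the file names at the bottom are only examples, and no language-specific tokenization is applied.

```python
import json
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(pred, gold):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def evaluate(gold_path, pred_path):
    with open(gold_path) as f:
        gold = json.load(f)["data"]
    with open(pred_path) as f:
        preds = json.load(f)  # {question_id: predicted answer text}
    em, f1 = [], []
    for article in gold:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                pred = preds.get(qa["id"], "")
                em.append(max(float(normalize(pred) == normalize(a)) for a in answers))
                f1.append(max(f1_score(pred, a) for a in answers))
    return {"exact_match": 100 * sum(em) / len(em), "f1": 100 * sum(f1) / len(f1)}

# Example file names only -- adjust to wherever your gold data and predictions live.
print(evaluate("xquad.vi.json", "predictions_vi.json"))
```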

@hit-computer commented:

@Liangtaiwan Thank you very much!

@sebastianruder (Collaborator) commented:

Hi @hit-computer, I've answered in the corresponding issue. Please don't post in other unrelated issues but instead tag people in your issue.

@melvinjosej (Collaborator) commented:

Closing this issue. Please re-open if needed.
