Skip to content

hsing-wang/WMT2020_BioMedical

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 

WMT2020_BioMedical_MT

http://www.statmt.org/wmt20/biomedical-translation-task.html

The Tencent AI Lab participated in the WMT20 Shared Task on Biomedical Translation in four language directions: German<->English, Chinese<->English. Our systems in German->Enlgish and English-> German were ranked 1st and 3rd respectively according to the automatic evaluation.

Data and Pre-trained models

Data provided by organizers

No. Corpus File En-Zh En-De En
1 Biomedical Translation wmt18/Medline/training/es-en n/a n/a 287811
wmt18/Medline/training/fr-en n/a n/a 627576
wmt18/Medline/training/pt-en n/a n/a 74645
wmt18/Medline/test - - -
wmt19/Medline/training/de-en n/a 40398 40398
wmt19/Medline/training/fr-en n/a n/a 75049
wmt19/Medline/training/es-en n/a n/a 100257
wmt19/Medline/training/pt-en n/a n/a 49918
wmt19/Medline/test/Medline - - -
wmt20/Medline/training/it-en n/a n/a 14756
wmt20/Medline/training/ru-en n/a n/a 46782
2 UFAL Medical Corpus shuffled.de-en n/a 37814533 n/a
3 HimL test sets khresmoi-summary-dev n/a 500 n/a
khresmoi-summary-test n/a 1000 n/a
4 Khresmoi development data himl-test-2015/cochrane n/a 1433 n/a
himl-test-2015/himl n/a 3892 n/a
himl-test-2015/nhs24 n/a 2459 n/a
himl-test-2017/cochrane n/a 467 n/a

Model hyperparameters

Deep Transformer Hybrid Transformer Big Transformer Large Transformer
Encoder Layer 40 40 6 20
Decoder Layer 6 6 6 6
Attention Heads 8 8 16 16
Embedding Size 512 512 1024 1024
FFN Size 2048 2048 4096 4096

Pre-trained models

Deep Transformer Hybrid Transformer Big Transformer Large Transformer
De->En download download download download
En->De download download download download

BPE Model: De-En BPE models

Synthetic Chinese-English bilingual data

Chinese-English Biomedical bilingual data

Training Details

Data Preprocessing

Moses scripts

$lang=en(de)

#Step1. normalize-punctuation

./mosesdecoder-master/scripts/tokenizer/normalize-punctuation.perl -l $lang < data.$lang > data.$lang.norm

#Step2. remove-non-printing-char

./mosesdecoder-master/scripts/tokenizer/remove-non-printing-char.perl < data.$lang.norm > data.$lang.norm.remv

#Step3. tokenize

./mosesdecoder-master/scripts/tokenizer/tokenizer.perl -l $lang -threads 10 < data.$lang.norm.remv > data.$lang.norm.remv.tok

Model: Transformer

Toolkit: Fairseq

Citation

Please kindly cite our paper if you find it helpful:

@inproceedings{wang2020tencent,
  title={Tencent AI Lab machine translation systems for the WMT20 biomedical translation task},
  author={Wang, Xing and Tu, Zhaopeng and Wang, Longyue and Shi, Shuming},
  booktitle={Proceedings of the Fifth Conference on Machine Translation},
  pages={881--886},
  year={2020}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages