High Quality EN-DE/EN-FR Translators #5419

Closed
sshleifer opened this issue Jun 30, 2020 · 8 comments · Fixed by #6940
Labels
Help wanted (extra attention is needed, help appreciated), New model, translation, machine translation, utilities and models

Comments

@sshleifer
Contributor

sshleifer commented Jun 30, 2020

Download instructions from torch.hub / fairseq: here
The BART conversion script should be reusable.

Open source status

  • [x] the model implementation is available: (give details)
  • [x] the model weights are available: (give details)
  • [x] who are the authors: (mention them, if possible by @gh-username)
    Sergey Edunov, @myleott Michael Auli, David Grangier

Paper: https://arxiv.org/pdf/1808.09381.pdf

Spec

Desired API:

mname = 'facebook/wmt-en-de'

model = FairseqTranslator.from_pretrained(mname)
tokenizer = FairseqBPETokenizer.from_pretrained(mname)  # AutoTokenizer should also work
batch = tokenizer.prepare_seq2seq_batch(['Maschinelles Lernen ist großartig!'])
translated = model.generate(**batch)
assert tokenizer.batch_decode(translated)[0] == 'Machine Learning is great'
  • add .rst docs (see the "adding a new model" instructions, but don't follow them too religiously if something seems suboptimal).
  • check timing and memory vs. fairseq.
  • if lots of modeling code is added, common tests should pass.

Steps

  1. Get tokenizer equivalence (the fairseq object should have an encode method, and there should be wget-able links in fairseq to get the relevant tokenizer files).
    1b. Upload the tokenizer to S3 so your tokenizer tests work on CI. You can work out of the stas/fairseq-en-de namespace on your model hub account and then move everything over (or not) at the end.
  2. Get model.forward / "logits" equivalence (ignore differences smaller than 1e-6). This usually doesn't work the first time, so you have to go line by line with two ipdb sessions (one fairseq, one HF) until you find the line that differs. At this stage, worry very little about code quality and just try to get integration tests passing.
  3. Get model.generate / "translation" equivalence. There may be small beam search discrepancies. For this you will need to figure out decoder_start_token_id, num_beams, and other config settings.
  4. Upload Everything to S3.
  5. Go through template
    and make sure most of the reasonable things are done.
    At this point a full integration test (as above) should pass.
  6. Check memory, time, and BLEU against fairseq (ideally in Colab). Improve/document the results in the PR description.
  7. Test the scary parts: special tokens, padding insensitivity.
  8. Docs/AutoConfig Etc.
    Helpful: https://huggingface.co/transformers/model_sharing.html
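The tolerance check from step 2 can be sketched with a tiny stand-alone helper (a hedged illustration only; `max_abs_diff` and `logits_equivalent` are made-up names, not part of transformers or fairseq, and in practice you'd compare real tensors with `torch.allclose`):

```python
# Toy version of the step-2 logits comparison: report the largest
# absolute difference between two equal-length lists of floats and
# decide whether it is below the 1e-6 threshold mentioned above.
def max_abs_diff(a, b):
    assert len(a) == len(b), "logit tensors must have the same shape"
    return max(abs(x - y) for x, y in zip(a, b))

def logits_equivalent(fairseq_logits, hf_logits, atol=1e-6):
    return max_abs_diff(fairseq_logits, hf_logits) <= atol

# Differences around 1e-7 count as equivalent; 1e-3 does not.
print(logits_equivalent([0.1, 0.2], [0.1 + 1e-7, 0.2]))  # True
print(logits_equivalent([0.1, 0.2], [0.1 + 1e-3, 0.2]))  # False
```

On real models the same idea is one line, `torch.allclose(fairseq_logits, hf_logits, atol=1e-6)`, run after feeding both implementations the identical input ids.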

Assigned to: @stas00

@sshleifer sshleifer added the New model, translation, machine translation, and utilities and models labels Jun 30, 2020
@sshleifer sshleifer added this to To do in Examples/seq2seq via automation Jun 30, 2020
@sshleifer sshleifer added the Help wanted (extra attention is needed, help appreciated) label Jun 30, 2020
@cp-pc

cp-pc commented Jul 4, 2020

Excuse me. Will this model be added in the future, and how long will it take?
Are T5 and BART currently the only models that can do machine translation?

@sshleifer
Contributor Author

I would guess I'll get around to this by the end of July, but I can't be sure.

We also have MarianMTModel, with 1000+ pretrained translation weights from Helsinki-NLP. Here is the list:
https://huggingface.co/Helsinki-NLP

@stas00
Contributor

stas00 commented Aug 15, 2020

I will work on this one.

@stas00
Contributor

stas00 commented Aug 18, 2020

Here is a lazy man's implementation that uses a simple proxy to the fairseq implementation and makes the spec test pass:

import torch

class FairseqProxy:
    """Thin wrapper that delegates everything to a fairseq torch.hub module."""

    def __init__(self, module):
        self.module = module

    @classmethod
    def from_pretrained(cls, mname):
        return cls(module=torch.hub.load(
            'pytorch/fairseq', mname,
            checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
            tokenizer='moses', bpe='fastbpe'))

class FairseqTranslator(FairseqProxy):

    def generate(self, **tokenized_sentences):
        return self.module.generate(tokenized_sentences['data'])

class FairseqBPETokenizer(FairseqProxy):

    def prepare_seq2seq_batch(self, sentences):  # encode
        return {'data': [self.module.encode(sentence) for sentence in sentences]}

    def batch_decode(self, batched_hypos):
        return [self.module.decode(hypos[0]['tokens']) for hypos in batched_hypos]
# Look ma, I cheated and the test passes ;)
mname = 'transformer.wmt19.ru-en'
model = FairseqTranslator.from_pretrained(mname)
tokenizer = FairseqBPETokenizer.from_pretrained(mname)
batch = tokenizer.prepare_seq2seq_batch(["Машинное обучение - это здорово!"])
translated = model.generate(**batch)
assert tokenizer.batch_decode(translated)[0] == 'Machine learning is great!'

Now to the real work of porting...
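The "padding insensitivity" part of step 7 can be illustrated with a toy harness (hedged: `fake_forward` and `check_padding_insensitive` are invented stand-ins, not the real model or test suite; a real check would compare model logits with and without right-padding plus an attention mask):

```python
# Toy illustration of the padding-insensitivity check: logits at real
# (non-pad) positions should be identical whether or not the input is
# right-padded. Pad positions are marked with token id 0 here.
PAD = 0

def fake_forward(token_ids):
    # A padding-insensitive toy "model": each position's logit depends
    # only on the token at that position.
    return [float(t) * 0.5 for t in token_ids]

def check_padding_insensitive(forward, tokens, pad_len):
    unpadded = forward(tokens)
    padded = forward(tokens + [PAD] * pad_len)
    # Compare only the positions that hold real tokens.
    return all(abs(u - p) < 1e-6
               for u, p in zip(unpadded, padded[:len(tokens)]))

print(check_padding_insensitive(fake_forward, [5, 7, 9], pad_len=3))  # True
```

A model whose per-position outputs leak information from pad positions (e.g. via an unmasked mean over the sequence) would fail this check, which is exactly the kind of bug the step-7 tests are meant to catch.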

@stas00
Contributor

stas00 commented Sep 4, 2020

mostly done: #6940

@sshleifer sshleifer linked a pull request Sep 4, 2020 that will close this issue
@stas00
Contributor

stas00 commented Sep 15, 2020

Once #6940 is merged, this issue is to be closed.

@sshleifer
Contributor Author

FYI, linked pull requests automatically close the linked issue.

@stas00
Contributor

stas00 commented Sep 15, 2020

I noticed that you had already done the linking after I left my comment, but decided to leave it, since my previous comment wasn't certain ;)

Examples/seq2seq automation moved this from In progress to Done Sep 17, 2020