
[XLM-R] by Facebook AI Research #1769

Closed
TheEdoardo93 opened this issue Nov 8, 2019 · 18 comments

@TheEdoardo93 commented Nov 8, 2019

🌟 New model addition

Model description

Yesterday, Facebook AI Research open-sourced its new cross-lingual language model, XLM-R (XLM-RoBERTa), announced on arXiv. The model uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data. According to Facebook, the model improves upon previous multilingual approaches by incorporating more training data and languages, including so-called low-resource languages, which lack extensive labeled and unlabeled data sets.

Open Source status

  • the model implementation is available: here, under the XLMRModel Python class (line 198)
  • the model weights are available: yes, more details here
  • who are the authors: Facebook AI Research (Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov)

Additional context

Facebook says the following about this new model in their blog post:

> XLM-R represents an important step toward our vision of providing the best possible experience on our platforms for everyone, regardless of what language they speak.

> We hope to improve the performance of multilingual models created by the research community, particularly systems that use self-supervised training methods to better understand low-resource languages.

XLM-R has been trained on 2.5 TB of data across 100 languages, filtered from Common Crawl.

@julien-c (Member) commented Nov 8, 2019

cc @aconneau 😬

@ricardorei

Is there any update on the XLM-R model?

@ngoyal2707

Let me know if you need some help porting the XLM-R models to HF.

@stefan-it (Collaborator) commented Dec 4, 2019

This is maybe not the correct way, but I adjusted the convert_roberta_original_pytorch_checkpoint_to_pytorch.py script to convert the fairseq model into a Transformers-compatible model file. I used the sentencepiece BPE loader and adjusted the vocab size.

Then I used the CamemBERT model class to perform some evaluations on NER, but the results are not really good (I tried to replicate the CoNLL-2003 results for English).

So I guess it is not as simple as this first attempt 😅

Gist for the conversion script is here.

The CamemBERT model configuration looks pretty much the same as XLM-R large?!
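A rough sketch of the sanity check behind such a conversion attempt (all paths and the converted checkpoint directory are hypothetical): run the same sentence through the original fairseq checkpoint and through the converted model loaded with the CamemBERT classes, and compare the activations.

```python
# Minimal sketch, assuming the fairseq XLM-R archive has been unpacked to
# /path/to/xlmr.large and a converted checkpoint exists (both hypothetical).
import torch
from fairseq.models.roberta import XLMRModel
from transformers import CamembertModel

SENTENCE = 'Hello world!'

# Reference: the original fairseq checkpoint.
fairseq_xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
fairseq_xlmr.eval()
tokens = fairseq_xlmr.encode(SENTENCE)                      # 1-D tensor of subword ids
with torch.no_grad():
    reference = fairseq_xlmr.extract_features(tokens)       # (1, seq_len, hidden)

# Candidate: the checkpoint produced by the adjusted conversion script,
# loaded with the CamemBERT model class as described above.
converted = CamembertModel.from_pretrained('/path/to/converted-xlmr').eval()
with torch.no_grad():
    candidate = converted(tokens.unsqueeze(0))[0]            # feed the same input ids

print('max abs diff:', (reference - candidate).abs().max().item())
```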

@CZWin32768 commented Dec 10, 2019

> This is maybe not the correct way, but I adjusted the convert_roberta_original_pytorch_checkpoint_to_pytorch.py script to convert the fairseq model into a Transformers-compatible model file. I used the sentencepiece BPE loader and adjusted the vocab size.
>
> Then I used the CamemBERT model class to perform some evaluations on NER, but the results are not really good (I tried to replicate the CoNLL-2003 results for English).
>
> So I guess it is not as simple as this first attempt 😅
>
> Gist for the conversion script is here.
>
> The CamemBERT model configuration looks pretty much the same as XLM-R large?!

Hi @stefan-it, do you have any update on your attempt?

@stefan-it (Collaborator) commented Dec 11, 2019

The final models have been released today 😍

https://github.com/pytorch/fairseq/tree/master/examples/xlmr

So I'm going to try the conversion with these models tomorrow/in the next days :)
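For context, a minimal sketch of loading the newly released checkpoints straight through the fairseq hub interface (following the fairseq XLM-R README), before any conversion to Transformers:

```python
# Pull the released XLM-R large checkpoint via torch.hub and extract features.
import torch

xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

tokens = xlmr.encode('Hello world!')        # sentencepiece pieces mapped to dictionary ids
features = xlmr.extract_features(tokens)    # (1, seq_len, 1024) for the large model
print(tokens)
print(features.shape)
```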

@stefan-it (Collaborator) commented Dec 11, 2019

I think the model conversion is done correctly. But the CamembertTokenizer implementation can't be used as-is, because it adds some special tokens; I had to modify the tokenizer to match the output of the fairseq tokenization/.encode() method :) I'll report back some results on NER later.

Update: I could achieve 90.41% on CoNLL-2003 (English); the paper reports 92.74 (using Flair).
Update 2: Using the run_ner.py example (incl. some hours of tokenization debugging...): 96.22 (dev) and 91.91 (test).
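A sketch of the tokenizer check described above (paths are hypothetical): a converted model is only usable if the Transformers-side tokenizer reproduces fairseq's .encode() output id-for-id, and the unmodified CamembertTokenizer is expected to fail this check because of its extra special tokens.

```python
# Compare fairseq's reference encoding against a candidate transformers tokenizer
# built on the same sentencepiece model that ships with the checkpoint.
from fairseq.models.roberta import XLMRModel
from transformers import CamembertTokenizer

SENTENCE = 'Hello world!'

# Reference ids from the fairseq checkpoint (hypothetical path).
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
reference = xlmr.encode(SENTENCE).tolist()

# Candidate ids from the transformers tokenizer under test.
tokenizer = CamembertTokenizer('/path/to/xlmr.large/sentencepiece.bpe.model')
candidate = tokenizer.encode(SENTENCE)

print('fairseq     :', reference)
print('transformers:', candidate)
print('match       :', candidate == reference)
```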

@ricardorei

Btw I was using the XLM-R v0 checkpoints in a project I'm working on and the v0 checkpoints worked slightly better than the checkpoints added today. Is it possible to also add the older checkpoints?

@TheEdoardo93 (Author)

I think the best solution is to offer both checkpoint versions! In my opinion, the ideal case is that, as with other models in Transformers, you can select which version of the XLM-R checkpoints to use, e.g.

```python
from transformers import XLMRModel

base_model = XLMRModel.from_pretrained('xlmr-base')    # 250M parameters
large_model = XLMRModel.from_pretrained('xlmr-large')  # 560M parameters
```

> Btw I was using the XLM-R v0 checkpoints in a project I'm working on and the v0 checkpoints worked slightly better than the checkpoints added today. Is it possible to also add the older checkpoints?
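For reference, a hedged sketch of the interface that eventually shipped in Transformers; the released classes and checkpoints use XLM-RoBERTa naming rather than the names proposed above, but both sizes are selectable exactly as suggested.

```python
# The class and checkpoint names below are from the released library,
# not from this thread's proposal.
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
base_model = XLMRobertaModel.from_pretrained('xlm-roberta-base')
large_model = XLMRobertaModel.from_pretrained('xlm-roberta-large')
```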

@ricardorei

Btw, using XLM-R I encountered this issue:
Batch size affecting output. #2401

This is really annoying and makes the model hard to use.

stale bot commented Mar 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Mar 9, 2020
stale bot closed this as completed Mar 16, 2020
@mohammedayub44

@ricardorei Did you happen to successfully use the XLM-R model?

I'm trying to see how this model can be used as a pretraining step for NMT tasks. I tried the raw version from the Facebook XLM repo and ran into multiple OOM issues.

The best suggestion I've gotten so far is to try the smaller fairseq XLM-R (base) on a p3dn.24xlarge instance, or the Google TPU PyTorch route.

Thanks!

@ricardorei commented May 6, 2020

@mohammedayub44

I am using the base model, which runs well on a 12GB GPU with a batch size of 8. Depending on your implementation and task you can run even bigger batches (16 or 24, for example).

I am also using the version directly from fairseq, because you can load the v0 checkpoint there.

I could never figure out the variability in my predictions with different batch sizes. It's probably some floating-point precision issue under the hood. It doesn't change overall performance, but it is annoying...
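A quick way to observe the effect described here, sketched with the Transformers XLM-R classes rather than the fairseq v0 checkpoint in use above: run the same sentence alone and inside a padded batch, then look at how far the activations drift. Any difference should be on the order of float32 rounding noise.

```python
# Compare single-sentence output against the same sentence inside a padded batch.
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = XLMRobertaModel.from_pretrained('xlm-roberta-base').eval()

sentence = 'Hello world!'
filler = 'A much longer sentence that forces padding on the short one in the batch.'

with torch.no_grad():
    single = tokenizer(sentence, return_tensors='pt')
    out_single = model(**single)[0][0]                        # (seq_len, hidden)

    batch = tokenizer([sentence, filler], padding=True, return_tensors='pt')
    out_batch = model(**batch)[0][0][: out_single.size(0)]    # same sentence, batched

print('max abs diff:', (out_single - out_batch).abs().max().item())
```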

@foxik commented May 6, 2020

BTW, I am using the TF variant from https://huggingface.co/jplu/tf-xlm-roberta-base and https://huggingface.co/jplu/tf-xlm-roberta-large . I have successfully finetuned even the large model on a 16GB GPU and it was performing substantially better than the base model (on Czech Q&A).
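A hedged sketch of how the community TF checkpoints mentioned above can be loaded through Transformers; TFXLMRobertaModel is the TF2/Keras counterpart of the PyTorch class, and the tokenizer here is taken from the official checkpoint since it is the same sentencepiece model.

```python
# Load the community TF weights and run a Czech sentence through the encoder.
from transformers import TFXLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = TFXLMRobertaModel.from_pretrained('jplu/tf-xlm-roberta-base')

inputs = tokenizer('Ahoj světe!', return_tensors='tf')
outputs = model(inputs)       # last hidden states in outputs[0]
print(outputs[0].shape)       # (1, seq_len, 768) for the base model
```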

@mohammedayub44

@ricardorei
Thanks for the confirmation. I'm okay with the v0 checkpoints; I just need to check whether the model can be fine-tuned for NMT. I'm guessing you're fine-tuning for classification tasks.

If you could share the preprocessing and training commands you are using, it would be easier than digging into every fairseq hyperparameter.

Thanks!

@mohammedayub44 commented May 6, 2020

@foxik Is the TF variant more suitable for fine-tuning? Are there any particular preprocessing steps you carried out for fine-tuning? If you can share them, I can map the same to the NMT task.

Thanks!

@ricardorei commented May 7, 2020

@mohammedayub44 Yes, I was using it for classification/regression. In your case, you need both the encoder and the decoder part, which would take a lot more space. I would suggest that you share parameters between your encoder and decoder.

I know that, with the right hyperparameters, you can achieve good results by sharing the parameters between your encoder and decoder: see "A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning".

In terms of hyperparameters, mine are very simple. I freeze the encoder for 1 epoch while fine-tuning the classification head, and then I fine-tune the entire model. My classification head has a learning rate of 0.00003 while XLM-R has 0.00001. The optimizer is standard Adam. This combination of gradual unfreezing with discriminative learning rates works well on my task.
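A minimal sketch of that recipe (names and the head are illustrative, not the actual training code): freeze the XLM-R encoder while the classification head warms up for one epoch, then unfreeze and train everything, with per-group learning rates under Adam.

```python
# Gradual unfreezing with discriminative learning rates, as described above.
import torch
from torch import nn
from transformers import XLMRobertaModel

encoder = XLMRobertaModel.from_pretrained('xlm-roberta-base')
head = nn.Linear(encoder.config.hidden_size, 1)   # e.g. a regression head

optimizer = torch.optim.Adam([
    {'params': head.parameters(), 'lr': 3e-5},     # classification head
    {'params': encoder.parameters(), 'lr': 1e-5},  # XLM-R encoder
])

def set_encoder_trainable(trainable: bool) -> None:
    for p in encoder.parameters():
        p.requires_grad = trainable

# Epoch 1: only the head is updated.
set_encoder_trainable(False)
# ... train for one epoch ...

# Remaining epochs: unfreeze; each group keeps its own learning rate.
set_encoder_trainable(True)
# ... continue training ...
```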

@mohammedayub44 commented May 7, 2020

@ricardorei Thanks for sharing the paper, some interesting results there.
Any hints on how I can set up both the encoder and decoder of XLM-R and share their parameters using the HuggingFace library? I could only find LM fine-tuning examples and a notebook, nothing on NMT-based fine-tuning.
