GPT and BERT pretrained models in French #1356

Closed
mauceri opened this issue Sep 27, 2019 · 33 comments

@mauceri

mauceri commented Sep 27, 2019

🚀 Need for GPT and BERT pretrained models in French

All models are in English only and the multilingual models are quite poor

Motivation

Applications like tools for writers and linguists need fully dedicated language support

Additional context

The computation cost to pretrain models in French is still high, and it's difficult for individuals to afford it. I would be glad to take on part of the burden.

@BramVanroy
Collaborator

BramVanroy commented Sep 28, 2019

Pre-training is indeed a tough pill to swallow. First of all, you need a good dataset (does such a dataset exist for French?); second, you need a lot of processing power. A lot. If a dataset is available (preprocessed, ready to train), then I'd be willing to look into training the model on hardware that I have available.

@nestordemeure

Do you have an example of a good dataset prepared for the English language (my experience with such things is limited to training GloVe on a cleaned dump of the French Wikipedia)?

@BramVanroy
Collaborator

English BERT was trained on Wikipedia and BookCorpus for 1M steps.

After reading through the BERT readme, I have to retract my previous statement, though. I do not have the resources to pretrain such a model. I thought it would take at most one week on a V100, but they speak of four days on 4 to 16 cloud TPUs. I do not possess such power!

@mauceri
Author

mauceri commented Sep 28, 2019 via email

@BramVanroy
Collaborator

According to this comparison post, 16 TPUv2s are about twice as fast as the 8x V100s in the EC2 instances. I would then guess that you'd have to run training for about a week.

@julien-c
Member

julien-c commented Sep 28, 2019

Order of magnitude for the compute cost (on cloud platforms) of pre-training a large model is anywhere between $10k and $100k. That's for one pre-training, and you usually at least start multiple ones to search the hyperparameter space.

RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s.
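
As a rough sanity check on that range (assuming roughly $3 per V100-hour of on-demand cloud time, which is a ballpark assumption, not a quoted price):

```python
# Back-of-the-envelope check of the pre-training cost order of magnitude,
# using the RoBERTa numbers above and an assumed ~$3 per V100-hour.
gpus = 1024
hours = 24
usd_per_gpu_hour = 3.0  # assumption: ballpark on-demand cloud price
print(f"~${gpus * hours * usd_per_gpu_hour:,.0f}")  # ~$73,728
```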

@BramVanroy
Collaborator

Order of magnitude for the compute cost (on cloud platforms) of pre-training a large model is anywhere between $10k and $100k. That's for one pre-training, and you usually at least start multiple ones to search the hyperparameter space.

RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s.

Pretty sure that this is applicable for everyone here.

@cedspam
Contributor

cedspam commented Sep 30, 2019

I made a dataset by converting books from the bibebook package to text files.
It's a package of 1,700 Creative Commons BY-SA and public-domain books in French.

livres en français Kaggle dataset (books in French)

@mauceri
Author

mauceri commented Sep 30, 2019 via email

@stefan-it
Collaborator

Hi all,

I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text).

I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished.

Evaluation tasks for that model are a bit limited, so I would evaluate the model for PoS tagging and NER (Universal Dependencies and WikiANN) and compare the model with mBERT.
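
For reference, the .tfrecords step looks roughly like this with the official BERT repo's create_pretraining_data.py (a sketch only: the corpus and vocab paths, casing and duplication factor below are placeholders, not the exact setup used here):

```python
# Sketch: building BERT pre-training .tfrecords with google-research/bert's
# create_pretraining_data.py. All paths are placeholders.
import subprocess

subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus/fr_wiki_opus.txt",      # one sentence per line, blank line between documents
    "--output_file=tfrecords/fr_wiki_opus.tfrecord",
    "--vocab_file=vocab.txt",
    "--do_lower_case=False",                     # cased variant
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)
```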

@mauceri
Author

mauceri commented Oct 5, 2019 via email

@julien-c
Member

julien-c commented Oct 7, 2019

That's awesome @stefan-it. Let us know if we can help.

@mgrankin
Contributor

mgrankin commented Oct 7, 2019

I'm training GPT-2 on a corpus of Russian classical literature. I've modified the training script to make it more robust and useful. You can find it here.
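
For anyone who wants to try something similar with this library, here is a minimal sketch of fine-tuning GPT-2 on a plain-text corpus (not the script linked above; the file path, block size and hyperparameters are placeholders):

```python
# Minimal GPT-2 language-model fine-tuning loop with transformers + PyTorch.
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Tokenize the corpus and cut it into fixed-size blocks.
text = open("corpus.txt", encoding="utf-8").read()
ids = tokenizer.encode(text)
block_size = 512
examples = [torch.tensor(ids[i:i + block_size])
            for i in range(0, len(ids) - block_size, block_size)]
loader = DataLoader(examples, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in loader:
    outputs = model(batch, labels=batch)  # labels are shifted internally
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```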

@mauceri
Author

mauceri commented Oct 7, 2019 via email

@calusbr

calusbr commented Oct 17, 2019

Hi all,

I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text).

I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished.

Evaluation tasks for that model are a bit limited, so I would evaluate the model for PoS tagging and NER (Universal Dependencies and WikiANN) and compare the model with mBERT.

@stefan-it Could you explain how you trained your model from scratch without using multilingual BERT?

I would like to train BERT from scratch on a Portuguese (PT-BR) corpus (8GB of data). Is it possible to use the run_lm_finetuning.py code to do this without using the multilingual BERT model?

I already have a vocab.txt for the PT-BR corpus, and I don't want to load initial weights.

Is there any script or tutorial to perform this process step by step?
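
(For context: initializing a BERT masked-LM from scratch with an existing vocab.txt, without loading any pretrained weights, looks roughly like this in a recent version of the library. The sizes below are just the bert-base defaults and the paths are placeholders.)

```python
# Sketch: a randomly initialized BERT masked-LM built from a custom vocab.txt.
from transformers import BertConfig, BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer("vocab.txt", do_lower_case=False)
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertForMaskedLM(config)  # random weights, nothing pretrained is loaded
print(sum(p.numel() for p in model.parameters()), "parameters")
```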

@mauceri
Author

mauceri commented Oct 17, 2019 via email

@stefan-it
Collaborator

Hi @calusbr,

I'm using the official Google BERT implementation from this repository on a TPU. The trained TensorFlow model can then easily be converted into a Transformers-compatible one (so it can be used with this library).

Regarding your question: if you don't want to use and fine-tune the multilingual BERT model, you could try to train a model with the official BERT implementation for a few steps (Google Colab has TPU support). Then you can fine-tune this model with Transformers (or you can try to use the Colab instance) :)
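
A minimal sketch of that conversion step, assuming the official BERT repo produced a directory containing bert_config.json, vocab.txt and model.ckpt* files (all paths here are placeholders):

```python
# Sketch: load a TensorFlow BERT checkpoint and save it in the Transformers
# format. Requires both TensorFlow and PyTorch to be installed.
from transformers import BertConfig, BertForPreTraining, BertTokenizer

config = BertConfig.from_json_file("french-bert-tf/bert_config.json")
model = BertForPreTraining.from_pretrained(
    "french-bert-tf/model.ckpt.index", from_tf=True, config=config
)
tokenizer = BertTokenizer("french-bert-tf/vocab.txt", do_lower_case=False)

# Save so that from_pretrained("french-bert") works afterwards.
model.save_pretrained("french-bert")
tokenizer.save_pretrained("french-bert")
```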

@yedide

yedide commented Nov 4, 2019

Hi all,

I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text).

I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished.

Evaluation tasks for that model are a bit limited, so I would evaluate the model for PoS tagging and NER (Universal Dependencies and WikiANN) and compare the model with mBERT.

Hi @stefan-it !
Very happy to know that you will possibly be able to share this model with us!
Do you have any update on it?
Many thanks!! :)

@stefan-it
Collaborator

Sure, no problem :)

I did some experiments with a training corpus size from 16 to 40 GB. I used the same fine-tuning parameters as used in the SciBERT paper/repository. That means training with a sequence length of 128, then fine-tuning with a sequence length of 512.

Unfortunately, the model trained from scratch is ~0.5% worse than the multilingual model on a WikiNER split (80/10/10). In another experiment, I started from the TensorFlow checkpoint of the multilingual cased model and continued training with a sequence length of 128. This resulted in a +0.2% "boost" on WikiNER.

However, for PoS tagging the model (trained from scratch) is always better (~0.3%) than the BERT multilingual cased model (I used 4 PoS tagging datasets).

I'm currently doing more experiments (mainly focussing on training corpus cleaning...) and will report back here :)
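
Roughly, the two-phase schedule mentioned above (sequence length 128, then 512) looks like this with the official repo's run_pretraining.py. The step counts, batch sizes and paths below are placeholders, not the exact settings used here:

```python
# Sketch: two-phase BERT pre-training with google-research/bert's run_pretraining.py.
import subprocess

def pretrain(seq_len, num_steps, max_preds, init_checkpoint=None):
    cmd = [
        "python", "run_pretraining.py",
        f"--input_file=tfrecords/seq{seq_len}/*.tfrecord",
        f"--output_dir=checkpoints/seq{seq_len}",
        "--bert_config_file=bert_config.json",
        "--do_train=True",
        f"--max_seq_length={seq_len}",
        f"--max_predictions_per_seq={max_preds}",
        f"--num_train_steps={num_steps}",
        "--train_batch_size=256",
        "--learning_rate=1e-4",
    ]
    if init_checkpoint:
        cmd.append(f"--init_checkpoint={init_checkpoint}")
    subprocess.run(cmd, check=True)

pretrain(128, 900_000, 20)    # phase 1: short sequences
pretrain(512, 100_000, 77,    # phase 2: long sequences, warm-started from phase 1
         init_checkpoint="checkpoints/seq128/model.ckpt-900000")
```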

@mauceri
Author

mauceri commented Nov 4, 2019 via email

@BramVanroy
Collaborator

Thanks for your work @stefan-it. It's nice, but perhaps disappointing, to see that the multilingual models aren't that bad after all. From what I read, the multilingual models were said to perform poorly, but from your tests it seems that is not (always?) the case.

@stefan-it
Collaborator

I think we should wait for CamemBERT then 😅

https://camembert-model.fr/

@julien-c
Member

Coming soon! cc @louismartin @LysandreJik

@TheEdoardo93

Two days ago they released the CamemBERT paper on arXiv: https://128.84.21.199/pdf/1911.03894.pdf

I think we should wait for CamemBERT then

https://camembert-model.fr/

@julien-c
Member

CamemBERT was merged into master: #1822

I'll keep this issue open for GPT.
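
For anyone who wants to try it, loading the merged model looks like this (a quick fill-mask sanity check with the "camembert-base" identifier, not a benchmark):

```python
# Quick fill-mask check with CamemBERT.
import torch
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

inputs = tokenizer.encode("Le camembert est <mask> !", return_tensors="pt")
with torch.no_grad():
    logits = model(inputs)[0]

mask_pos = (inputs == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))
```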

@piegu
Contributor

piegu commented Nov 26, 2019

Hello, this thread is what I was looking for but I'm not sure I found the answer to my questions:

  • how long does it take to pretrain GPT-2 and BERT models in French?
  • what GPU configuration is needed?
  • what corpus size?

Thanks a lot in advance.

@louismartin
Contributor

louismartin commented Dec 4, 2019

We trained CamemBERT on 138GB of raw text on 256 GPUs (32 GB Tesla V100) for 1 day.

@mauceri
Author

mauceri commented Dec 4, 2019 via email

@piegu
Contributor

piegu commented Dec 4, 2019

We trained CamemBERT on 138GB of raw text on 256 GPUs (32 GB Tesla V100) for 1 day.

Thanks @louismartin. I think what you did and published with CamemBERT is great (I'm French :-) ), as is the fact that you share this kind of information.

About your answer: 256 Tesla V100 GPUs... wow!!!
Where did you find that much compute power? At Facebook AI?

I read in the Download section of the CamemBERT site that the model has only 110 million parameters. Was it worth training it on 138 GB of data?

@zlinao

zlinao commented Jan 31, 2020

Hi all,

I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text).

I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished.

Evaluation tasks for that model are a bit limited, so I would evaluate the model for PoS tagging and NER (Universal Dependencies and WikiANN) and compare the model with mBERT.

Hi @stefan-it, do you mind uploading your French BERT checkpoint? I am interested in your model for a generation task. Thanks!

@stale

stale bot commented Mar 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 31, 2020
@stale stale bot closed this as completed Apr 7, 2020
@louisabraham

Hi, any news about a French GPT?

@BramVanroy
Collaborator

Hi, any news about a French GPT?

You can use the model hub to search for this. One such model is belgpt2.
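
For example, generation with it looks roughly like this (the hub identifier "antoiloui/belgpt2" is an assumption here; search the model hub for the exact name):

```python
# Sketch: text generation with a French GPT-2 from the model hub.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "antoiloui/belgpt2"  # assumed hub id, check the model hub
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

input_ids = tokenizer.encode("Hier, j'ai visité", return_tensors="pt")
output = model.generate(input_ids, max_length=50, do_sample=True,
                        top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```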
