GPT and BERT pretrained models in French #1356
Pre-training is indeed a tough pill to swallow. First of all, you need a good dataset (does such a dataset exist for French?); second, you need a lot of processing power. A lot. If a dataset is available (preprocessed, ready to train), then I'd be willing to look into training the model on hardware that I have available. |
Do you have an example of a good dataset prepared for the English language? (My experience with such things is limited to training GloVe on a cleaned dump of the French Wikipedia.) |
English BERT was trained on Wikipedia and BookCorpus for 1M steps. After reading through the BERT readme, I have to retract my previous statement, though. I do not have the resources to pretrain such a model. I thought it would be at most one week on a V100, but they speak of four days on 4 to 16 Cloud TPUs. I do not possess such power! |
Hi Bram,
For a start, I planned to use the French Wikipedia and some famous French works from Gutenberg, like La comédie humaine; I'll let you know when I finish preprocessing them. Concerning the hardware, I would like to use GPU EC2 spot instances, but I do not know how long I'll have to run them and whether that would exceed my meagre financial resources.
|
Reading this comparison post, 16 TPUv2s are about twice as fast as the 8x V100s in the EC2 instances. I would then guess that you'd have to run training for about a week. |
The order of magnitude for the compute cost (on cloud platforms) of pre-training a large model is anywhere between $10k and $100k. That's for one pre-training run, and you usually start at least a few of them to search the hyperparameter space. RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s. |
Pretty sure that this is applicable to everyone here. |
I made a dataset by converting books from the bibebook package (http://www.bibebook.com/) to text files. It's a package of 1,700 Creative Commons BY-SA and public domain books in French: French books Kaggle dataset (https://www.kaggle.com/cedriclacrambe/livres-en-francais). |
Wonderful! Thank you very much!
|
Hi all, I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (the corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text). I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished. Evaluation tasks for that model are a bit limited, so I would evaluate the model on PoS tagging and NER (Universal Dependencies and WikiANN) and compare it with mBERT. |
Great news!
|
That's awesome @stefan-it. Let us know if we can help. |
I'm training GPT-2 on a corpus of Russian classical literature. I've modified the training script to make it more robust and useful. You can find it here. |
Thanks for sharing Mikhail :)
|
@stefan-it Could you explain to me how you trained your model from scratch without using multilingual BERT? I would like to train BERT from scratch on a PT-BR text corpus (8GB of data). Is it possible to use the run_lm_finetuning.py code to perform this process without using the multilingual BERT model? I already have a vocab.txt for the PT-BR corpus and I don't want to load initial weights. Is there any script or tutorial to perform this process step by step? |
I don’t know if this link https://github.com/facebookresearch/XLM can answer your question.
|
Hi @calusbr, I'm using the official Google BERT implementation from this repository on a TPU. The trained TensorFlow model can then easily be converted into a Transformers-compatible one (so it can be used with this library). Regarding your question: if you don't want to use and fine-tune the multilingual BERT model, you could try to train a model with the official BERT implementation for a few steps (Google Colab has TPU support). Then you can fine-tune this model with the run_lm_finetuning.py script. |
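For reference, the conversion step can be sketched roughly like this in Python (a minimal sketch: the file paths are placeholders, TensorFlow must be installed to read the original checkpoint, and it mirrors what the bundled convert_bert_original_tf_checkpoint_to_pytorch.py script does):

```python
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# bert_config.json and model.ckpt.* come from the pre-training run with the
# official BERT code; the paths below are placeholders.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)

# Copy the weights from the original TensorFlow checkpoint into the PyTorch model.
load_tf_weights_in_bert(model, config, "model.ckpt")

# Save in the Transformers format (config.json + pytorch_model.bin).
model.save_pretrained("french-bert-base/")
```

The vocab.txt used for pre-training also has to be copied into the output directory so that BertTokenizer.from_pretrained can find it.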
Hi @stefan-it ! |
Sure, no problem :) I did some experiments with a training corpus size from 16 to 40 GB. I used the same fine-tuning parameters as used in the SciBERT paper/repository. That means training with a sequence length of 128, then fine-tuning with a sequence length of 512. Unfortunately, the model trained from scratch is ~ 0.5% worse than the multilingual model on a WikiNER split (80/10/10). In another experiment I used the TensorFlow checkpoint from the multilingual cased model and did training with a sequence length of 128. This results in a +0.2% "boost" on WikiNER. However, for PoS tagging the model (trained from scratch) is always better (~0.3%) than the BERT multilingual cased model (I used 4 PoS tagging datasets). I'm currently doing more experiments (mainly focussing on training corpus cleaning...) and will report back here :) |
Thanks Stefan!
|
Thanks for your work @stefan-it. It's nice, but perhaps disappointing, to see that the multilingual models aren't that bad after all. From what I read, the multilingual models were said to perform poorly, but from your tests it seems that is not (always?) the case. |
I think we should wait for CamemBERT then 😅 |
Coming soon! cc @louismartin @LysandreJik |
Two days ago they released the CamemBERT paper on arXiv: https://128.84.21.199/pdf/1911.03894.pdf
|
CamemBERT was merged into master: #1822. I'll keep this issue open for GPT. |
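For anyone landing here later, loading the merged model through this library looks roughly like this (a minimal sketch, assuming the camembert-base identifier it was published under):

```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
model.eval()

# Encode a French sentence and get contextual embeddings from the encoder.
input_ids = tokenizer.encode("J'aime le camembert !", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)
print(last_hidden_state.shape)
```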
Hello, this thread is what I was looking for but I'm not sure I found the answer to my questions:
Thanks a lot in advance. |
We trained CamemBERT on 138GB of raw text on 256 GPUs (32 GB Tesla V100) for 1 day. |
Thank you very much for this valuable information!
|
Thanks @louismartin. I think what you did and published with CamemBERT is great (I'm French :-) ), as is the fact that you also share this kind of information. About your answer: 256 Tesla V100 GPUs (32 GB)... waoooooo!!!!! I read in the Download section of the CamemBERT site that the model has only 110 million parameters. Was it worth training it on 138 GB of data? |
Hi @stefan-it, would you mind uploading your French BERT checkpoint? I am interested in your model for a generation task. Thanks |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi, any news about a French GPT? |
You can use the model hub to search for this. One such model is belgpt2. |
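A minimal sketch of trying such a model from the hub (the exact hub identifier below is an assumption; search the model hub for "belgpt2" to confirm the published name):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "antoiloui/belgpt2"  # assumed identifier, check the model hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short French continuation with sampling.
input_ids = tokenizer.encode("Hier, j'ai mangé", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=30, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```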
🚀 Need for GPT and BERT pretrained models in French
All models are in English only and the multilingual models are quite poor
Motivation
Applications like tools for writers and linguists need fully dedicated language support
Additional context
The computation cost to pretrain models in French is still high, and it's difficult for individuals to afford it; I would be glad to take on a part of the burden.