GPT and BERT pretrained models in French #1356
Pre-training is indeed a tough pill to swallow. First of all, you need a good dataset (does such a dataset exist for French?); second, you need a lot of processing power. A lot. If a dataset is available (preprocessed, ready to train), then I'd be willing to look into training the model on hardware that I have available. |
Do you have an example of a good dataset prepared for the English language? (My experience with such things is limited to training GloVe on a cleaned dump of the French Wikipedia.) |
English BERT was trained on Wikipedia and BookCorpus for 1M steps. After reading through the BERT readme, I have to retract my previous statement, though. I do not have the resources to pretrain such a model. I thought it would be at most one week on a V100, but they speak of four days on 4 to 16 Cloud TPUs. I do not possess such power! |
Hi Bram,
For a start, I planned to use the French Wikipedia and some famous French works from Gutenberg, like La comédie humaine; I'll let you know when I finish preprocessing them. Concerning the hardware, I would like to use GPU EC2 spot instances, but I do not know how long I'll have to run them and whether that would exceed my meagre financial resources.
|
Reading this comparison post, 16 TPUv2s are about twice as fast as the 8x V100s in the EC2 instances. I would then guess that you'd have to run training for about a week. |
The order of magnitude for the compute cost (on cloud platforms) of pre-training a large model is anywhere between $10k and $100k. That's for one pre-training run, and you usually start at least a few of them to search the hyperparameter space. RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s. |
Pretty sure that this is applicable to everyone here. |
I made a dataset by converting books from the bibebook package (http://www.bibebook.com/) to text files. It's a package of 1,700 Creative Commons BY-SA and public domain books in French: French books Kaggle dataset (https://www.kaggle.com/cedriclacrambe/livres-en-francais). |
Wonderful! Thank you very much!
|
Hi all, I'm currently preparing the .tfrecords (both cased and uncased) for a French BERT model (the corpus is mainly taken from Wikipedia + OPUS corpora, resulting in ~20GB of text). I'll share the results (TF checkpoints + Transformers weights) whenever the training on TPU has finished. Evaluation tasks for that model are a bit limited, so I would evaluate the model on PoS tagging and NER (Universal Dependencies and WikiANN) and compare it with mBERT. |
Great news!
|
That's awesome @stefan-it. Let us know if we can help. |
I'm training GPT-2 on a corpus of Russian classical literature. I've modified the training script to make it more robust and useful. You can find it here. |
Thanks for sharing Mikhail :)
|
@stefan-it Could you explain to me how you trained your model from scratch without using multilingual BERT? I would like to train BERT from scratch on a PT-BR text corpus (8GB of data). Is it possible to use the run_lm_finetuning.py code to perform this process without using the multilingual BERT model? I already have a vocab.txt for the PT-BR corpus and I don't want to load initial weights. Is there any script or tutorial to perform this process step by step? |
I don’t know if this link https://github.com/facebookresearch/XLM can answer your question.
|
Hi @calusbr, I'm using the official Google BERT implementation from this repository on a TPU. The trained TensorFlow model can then easily be converted into a Transformers-compatible one (so it can be used with this library). Regarding your question: if you don't want to use and fine-tune the multilingual BERT model, you could try to train a model with the official BERT implementation for a few steps (Google Colab has TPU support). Then you can fine-tune this model with the run_lm_finetuning.py script. |
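For reference, the conversion step can be sketched roughly like this in Python (a minimal sketch: the file paths are placeholders, TensorFlow must be installed to read the original checkpoint, and it mirrors what the bundled convert_bert_original_tf_checkpoint_to_pytorch.py script does):

```python
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# bert_config.json and model.ckpt.* come from the pre-training run with the
# official BERT code; the paths below are placeholders.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)

# Copy the weights from the original TensorFlow checkpoint into the PyTorch model.
load_tf_weights_in_bert(model, config, "model.ckpt")

# Save in the Transformers format (config.json + pytorch_model.bin).
model.save_pretrained("french-bert-base/")
```

The vocab.txt used for pre-training also has to be copied into the output directory so that BertTokenizer.from_pretrained can find it.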
Hi @stefan-it ! |
Sure, no problem :) I did some experiments with a training corpus size from 16 to 40 GB. I used the same fine-tuning parameters as used in the SciBERT paper/repository. That means training with a sequence length of 128, then fine-tuning with a sequence length of 512. Unfortunately, the model trained from scratch is ~ 0.5% worse than the multilingual model on a WikiNER split (80/10/10). In another experiment I used the TensorFlow checkpoint from the multilingual cased model and did training with a sequence length of 128. This results in a +0.2% "boost" on WikiNER. However, for PoS tagging the model (trained from scratch) is always better (~0.3%) than the BERT multilingual cased model (I used 4 PoS tagging datasets). I'm currently doing more experiments (mainly focussing on training corpus cleaning...) and will report back here :) |
Thanks Stefan!
|
Thanks for your work @stefan-it. It's nice, but perhaps disappointing, to see that the multilingual models aren't that bad after all. From what I read, the multilingual models were said to perform poorly, but from your tests it seems that is not (always?) the case. |
I think we should wait for CamemBERT then 😅 |
Coming soon! cc @louismartin @LysandreJik |
Two days ago they released the CamemBERT paper on arXiv: https://128.84.21.199/pdf/1911.03894.pdf
|
CamemBERT was merged into master: #1822. I'll keep this issue open for GPT. |
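For anyone landing here later, loading the merged model through this library looks roughly like this (a minimal sketch, assuming the camembert-base identifier it was published under):

```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
model.eval()

# Encode a French sentence and get contextual embeddings from the encoder.
input_ids = tokenizer.encode("J'aime le camembert !", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)
print(last_hidden_state.shape)
```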
Hello, this thread is what I was looking for but I'm not sure I found the answer to my questions:
Thanks a lot in advance. |
We trained CamemBERT on 138GB of raw text on 256 GPUs (32 GB Tesla V100) for 1 day. |
Thank you very much for this valuable information!
|
Thanks @louismartin. I think what you did and published with CamemBERT is great (I'm French :-) ), as is the fact that you also share this kind of information. About your answer: 256 Tesla V100 GPUs (32 GB)... waoooooo!!!!! I read in the Download section of the CamemBERT site that the model has only 110 million parameters. Was it worth training it on 138 GB of data? |
Hi @stefan-it, would you mind uploading your French BERT checkpoint? I am interested in your model for a generation task. Thanks |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi, any news about a French GPT? |
You can use the model hub to search for this. One such model is belgpt2. |
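A minimal sketch of trying such a model from the hub (the exact hub identifier below is an assumption; search the model hub for "belgpt2" to confirm the published name):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "antoiloui/belgpt2"  # assumed identifier, check the model hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short French continuation with sampling.
input_ids = tokenizer.encode("Hier, j'ai mangé", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=30, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```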
🚀 Need for GPT and BERT pretrained models in French
All models are in English only and the multilingual models are quite poor
Motivation
Applications like tools for writers and linguists need fully dedicated language support
Additional context
The computation cost to pretrain models in French is still high, and it's difficult for individuals to afford it; I would be glad to take on a part of the burden.