
BERT pre-training using only domain specific text #615

Open
nightowlcity opened this issue May 2, 2019 · 34 comments

@nightowlcity

BERT is pre-trained on Wikipedia and other sources of ordinary text, but my problem domain has a very specific vocabulary and grammar. Is there an easy way to train BERT entirely from domain-specific data (preferably using Keras)?

The amount of pre-training data is not an issue, and we are not looking for SOTA results. We would do fine with a smaller-scale model, but it has to be trained on our data.

@hsm207
Contributor

hsm207 commented May 4, 2019

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.
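
For reference, a rough sketch of that pipeline (the flags come from this repo's README; all paths and hyperparameter values below are placeholders you will need to adapt):

```python
# Rough sketch of pre-training from scratch on a domain corpus.
# All paths and hyperparameters are placeholders.
import subprocess

# Step 1: convert raw text (one sentence per line, blank lines between
# documents) into masked-LM / next-sentence TFRecords.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=./domain_corpus.txt",
    "--output_file=./tf_examples.tfrecord",
    "--vocab_file=./vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: pre-train. Note there is no --init_checkpoint, since we are not
# starting from a released BERT checkpoint.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=./tf_examples.tfrecord",
    "--output_dir=./pretraining_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=./bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--num_train_steps=100000",
    "--num_warmup_steps=10000",
    "--learning_rate=2e-5",
], check=True)
```

Since you only need a smaller-scale model, you can also shrink bert_config.json (fewer layers, smaller hidden size) before training; the scripts themselves don't change.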

@PetreanuAndi

What if we want to leverage the already pre-trained model (and its language knowledge) and fine-tune it on a specific closed-domain dataset? It seems like we would need exactly the same word embeddings for it to leverage the existing knowledge during fine-tuning. How do we know whether our word embeddings match those used by Google in their vocab? What if our vocab has new words that were not present in the original trained vocab?

Thank you, much appreciated!

@hsm207
Contributor

hsm207 commented May 18, 2019

The word embeddings are stored in the checkpoint files too. Also, the 'words' are actually WordPiece tokens, and this tokenization is handled by the create_pretraining_data.py script, so you don't have to worry about whether your word embeddings match those used by Google in their vocab.

If you still want to add new words, there are a few issues in this repo discussing this. You can start by reading #396.
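
As a quick sanity check, you can run the repo's tokenizer directly and look at how your domain terms get split (a minimal sketch, assuming this repo is on your PYTHONPATH and vocab.txt is a released BERT vocab file):

```python
# Minimal sketch: see how domain terms are split into WordPiece tokens.
import tokenization  # from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# In-vocab words stay whole; unseen words get broken into sub-word pieces
# prefixed with "##" (the exact split depends on the vocab), and only words
# that cannot be pieced together at all become [UNK].
print(tokenizer.tokenize("the electroencephalogram was normal"))
```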

@PetreanuAndi

@hsm207 , I'm not worried about whether I can train a model with the correct input "format".
I worry that the "fine-tuning" process is not actually language fine-tuning, but rather learning new knowledge from scratch (given that the multilingual vocabs are quite small, at least for my native language).

Is it not safe to assume that if I am "fine-tuning" on a corpus that has a lot of new (out-of-original-vocab) words, then I'm not actually fine-tuning the model but in fact re-training it from scratch? (given that it has not seen those words before, or has labeled them as unknown)

From the tokenization script (tokenization.py):

    if is_bad:  # out-of-vocab word
        output_tokens.append(self.unk_token)

@hsm207
Contributor

hsm207 commented May 22, 2019

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot and those words are important to your domain, then it is a problem, and you will need to add these words to the vocab.
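
A rough way to measure that, assuming the repo's tokenization module and a plain-text domain corpus with one sentence per line:

```python
# Rough sketch: estimate how often tokenization falls back to [UNK] on a
# domain corpus. Paths are placeholders.
import tokenization  # from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

total, unk = 0, 0
with open("domain_corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens = tokenizer.tokenize(line)
        total += len(tokens)
        unk += sum(1 for t in tokens if t == "[UNK]")

print("%d of %d tokens are [UNK] (%.2f%%)" % (unk, total, 100.0 * unk / max(total, 1)))
```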

@PetreanuAndi

@hsm207 , "add these words to the vocab" -> does that not imply re-training from scratch on a specific-language corpus? Otherwise, the unknown tokens will just be assigned random values, i guess? (random values => no correlation or relationship between words, bad :( right? )

@hsm207
Contributor

hsm207 commented May 23, 2019

@PetreanuAndi yes, you are right. Adding words to the vocab and then fine-tuning on your corpus essentially means training those words' embeddings from scratch on your domain-specific corpus.
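
One workaround that comes up in the issues linked above is to reuse the [unusedN] placeholder slots in vocab.txt for the new domain words, so the vocab size (and therefore the embedding matrix shape) stays compatible with the released checkpoint. A rough sketch, where new_words.txt is a hypothetical file with one new word per line:

```python
# Sketch: overwrite [unusedN] placeholder entries in vocab.txt with domain
# words. new_words.txt and the output path are placeholders.
with open("new_words.txt", encoding="utf-8") as f:
    new_words = [w.strip() for w in f if w.strip()]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

words = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(words)
        except StopIteration:
            break

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The embeddings for those slots still start out effectively untrained, which is exactly why the further pre-training on the domain corpus is needed.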

@PetreanuAndi

@hsm207 thank you. I have already begun doing just that, but needed some peer-review confirmation of my approach :) Such a shame though. The famous ImageNet moment for NLP, such praise, much awe, and it is really only proficient in English and Chinese :)

@Scagin

Scagin commented May 29, 2019

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

I tried training my own pre-trained model using run_pretraining.py, but I found that it only runs on the CPU, not the GPU. Is this a problem with TPUEstimator? How can I run the code on a GPU?

@hsm207
Contributor

hsm207 commented May 29, 2019

How did you figure out it was running on your CPU and not GPU? Was it based on the logs or nvidia-smi?

Anyway, you can try the implementation from Hugging Face. It looks like they have figured out how to run the pretraining using GPUs.
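
For anyone debugging the same thing, a quick check (TF 1.x, which this repo targets) of whether TensorFlow can see a GPU at all:

```python
# Quick check: does this TensorFlow install see a GPU? If nothing is listed,
# the issue is the environment (tensorflow-gpu / CUDA / driver), not the script.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())       # True if a usable GPU was found
print(device_lib.list_local_devices())  # lists all visible CPU/GPU devices
```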

@008karan

@Scagin I'm facing the same issue. How did you solve it?

@gsasikiran

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

After pre-training with run_pretraining.py, the model has produced checkpoints, but I need the word embeddings. Can I derive word embeddings from the checkpoints?

@hsm207
Contributor

hsm207 commented Aug 16, 2019

@gsasikiran yes, you can. See here for details and adapt as necessary.
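
If all you need is the context-independent word embedding table, you can also read it straight out of the checkpoint; a sketch, assuming the variable name this repo uses and a placeholder checkpoint path:

```python
# Sketch: read the word embedding table directly from a BERT checkpoint.
# "bert/embeddings/word_embeddings" is the variable name used by this repo.
import tensorflow as tf

ckpt = "./pretraining_output/model.ckpt-100000"  # placeholder path
emb = tf.train.load_variable(ckpt, "bert/embeddings/word_embeddings")

# Rows line up with vocab.txt: row i is the static embedding of vocab entry i.
print(emb.shape)  # (vocab_size, hidden_size)
```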

@gsasikiran

@hsm207 Thank you. It worked and resulted in a JSON file, but that file contains only 16 dictionaries, with tokens for just [CLS] and [SEP]. I wonder what happened to the remaining words in the vocabulary.
json_file

@hsm207
Contributor

hsm207 commented Aug 19, 2019

@gsasikiran can you share a minimal and fully reproducible example? I'd like to run your code myself.

@gsasikiran

https://colab.research.google.com/drive/1ZXn2cVpyvfUscN_-FD1Z_h3HW9xMi_nd

Here I provide the link to the Colab notebook with my program. Let me know if I also need to provide my training data and vocab.txt.

@hsm207
Contributor

hsm207 commented Aug 19, 2019

@gsasikiran you need to provide everything so that I can reproduce your results.

@gsasikiran

gsasikiran commented Aug 20, 2019

data_files.zip
bert_config_file: bert_config.json
input_file: training_data.txt
vocab_file: deep_vocab.txt

EDIT: I have removed the training data, as it may be subject to copyright.

@gsasikiran

@hsm207 Did you get the embeddings?

@hsm207
Contributor

hsm207 commented Sep 29, 2019

@gsasikiran I have trouble running your notebook.

Specifically, I am getting this error:

[screenshot of the error message]

Can you insert into the notebook all the code needed to download the data for your use case too?

@gsasikiran

@hsm207
Contributor

hsm207 commented Oct 1, 2019

@gsasikiran I don't see a problem with the results. I can view the embeddings for all the tokens in my input.

See the last cell in this notebook: https://gist.github.com/hsm207/143b6349ed1c92960be0dc1c6165d551

@gsasikiran

@hsm207 Thank you for your time. I see the problem now: it is my input text. The input I provided has many empty lines at the beginning, and those lines returned only [CLS] and [SEP] token embeddings.
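
In case it helps anyone else, stripping blank lines from the input before extracting features avoids those trivial [CLS]/[SEP]-only examples; a small sketch with placeholder file names:

```python
# Sketch: drop blank lines so feature extraction does not produce examples
# containing only [CLS] and [SEP]. File names are placeholders.
with open("input.txt", encoding="utf-8") as src, \
        open("input_clean.txt", "w", encoding="utf-8") as dst:
    dst.writelines(line for line in src if line.strip())
```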

@vr25

vr25 commented Oct 17, 2019

Hi,

I am trying to use a domain-specific BERT (FinBERT) for my task, but it looks like files such as config.json, pytorch_model.bin, and vocab.txt are missing from the repository. On the other hand, there are two vocabulary files that were created as part of pre-training the domain-specific FinBERT.

I was wondering if I can use the above BERT_pretraining_share.ipynb to create config.json and pytorch_model.bin and then use the result as a domain-specific bert-base-uncased model.

Thanks.
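
Not a FinBERT author, but for what it's worth: if you end up with a TensorFlow checkpoint plus bert_config.json and vocab.txt, the Hugging Face library can convert it into a pytorch_model.bin. A rough sketch along the lines of their conversion script (all paths are placeholders):

```python
# Rough sketch: convert a TF BERT checkpoint into pytorch_model.bin for use
# with the Hugging Face library. Requires both torch and tensorflow installed.
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt-100000")  # placeholder ckpt

torch.save(model.state_dict(), "pytorch_model.bin")
config.to_json_file("config.json")
# Copy vocab.txt alongside these two files to get a loadable model directory.
```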

@gsasikiran

@vr25 I have no problem.

@ldb-46

ldb-46 commented Oct 18, 2019 via email

@imayachita

Hi @PetreanuAndi and @nightowlcity,
Did you manage to pre-train/fine-tune your model on your domain-specific text? Does pre-training from scratch and adding words to the vocab have a significant impact compared to just fine-tuning? Thanks!

@viva2202

@PetreanuAndi, @nightowlcity, @imayachita I would also be very interested in your experiences. I am also facing the decision of whether to fine-tune a model or train a new one from scratch.

Thank you in advance for sharing your experiences!


@nagads

nagads commented Feb 1, 2021

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot and those words are important to your domain, then it is a problem, and you will need to add these words to the vocab.

@hsm207 Do you have any suggestions on how much domain corpus data is needed to learn the new vocab embeddings, assuming I leverage the already-learnt weights for the existing words in the current vocab? Thanks.

@hsm207
Contributor

hsm207 commented Feb 1, 2021

@nagads
I would look to the ULMFiT model for guidance.

They tested the general pre-training -> domain-specific pre-training -> task-specific fine-tuning approach on several datasets:

[screenshot of results from the ULMFiT paper]

@nagads

nagads commented Feb 2, 2021

@hsm207 Thanks a ton. This is helpful.

@nagads

nagads commented Feb 19, 2021

@pkrishnavamshi could you clarify the rationale for re-training rather than fine-tuning again on a smaller dataset? Thanks.
Re-training in BERT has various connotations:

  1. you want to learn a new vocabulary, as in the case of BioBERT
  2. you want to fine-tune the language model with the same vocab to better fit the tone and tenor of the domain (as in the case of ULMFiT)
