
BERT pre-training using only domain specific text #615

Open
nightowlcity opened this issue May 2, 2019 · 34 comments

@nightowlcity

BERT is pre-trained on Wikipedia and other sources of ordinary text, but my problem domain has a very specific vocabulary and grammar. Is there an easy way to train BERT entirely from domain-specific data (preferably using Keras)?

The amount of pre-training data is not an issue, and we are not looking for SOTA results. We would do fine with a smaller-scale model, but it has to be trained on our data.

@hsm207
Contributor

hsm207 commented May 4, 2019

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.
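
For reference, a rough sketch of that pipeline (the flags come from this repo's README; all paths and hyperparameter values below are placeholders you will need to adapt):

```python
# Rough sketch of pre-training from scratch on a domain corpus.
# All paths and hyperparameters are placeholders.
import subprocess

# Step 1: convert raw text (one sentence per line, blank lines between
# documents) into masked-LM / next-sentence TFRecords.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=./domain_corpus.txt",
    "--output_file=./tf_examples.tfrecord",
    "--vocab_file=./vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: pre-train. Note there is no --init_checkpoint, since we are not
# starting from a released BERT checkpoint.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=./tf_examples.tfrecord",
    "--output_dir=./pretraining_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=./bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--num_train_steps=100000",
    "--num_warmup_steps=10000",
    "--learning_rate=2e-5",
], check=True)
```

Since you only need a smaller-scale model, you can also shrink bert_config.json (fewer layers, smaller hidden size) before training; the scripts themselves don't change.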

@PetreanuAndi

What if we want to leverage the already pre-trained model (and its language knowledge) and fine-tune it on a specific closed-domain dataset? It seems like we would need exactly the same word embeddings for it to leverage the existing knowledge during fine-tuning. How do we know whether our word embeddings match those used by Google in their vocab? What if our vocab has new words that were not present in the original trained vocab?

Thank you, much appreciated!

@hsm207
Contributor

hsm207 commented May 18, 2019

The word embeddings are stored in the checkpoint files too. Also, the 'words' are actually WordPiece tokens, and this tokenization is handled by the create_pretraining_data.py script, so you don't have to worry about whether your word embeddings match those used by Google in their vocab.

If you still want to add new words, there are a few issues in this repo discussing this. You can start by reading #396.
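
As a quick sanity check, you can run the repo's tokenizer directly and look at how your domain terms get split (a minimal sketch, assuming this repo is on your PYTHONPATH and vocab.txt is a released BERT vocab file):

```python
# Minimal sketch: see how domain terms are split into WordPiece tokens.
import tokenization  # from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# In-vocab words stay whole; unseen words get broken into sub-word pieces
# prefixed with "##" (the exact split depends on the vocab), and only words
# that cannot be pieced together at all become [UNK].
print(tokenizer.tokenize("the electroencephalogram was normal"))
```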

@PetreanuAndi

@hsm207 , I'm not worried about whether I can train a model with the correct input "format".
I worry that the "fine-tuning" process is not actually language fine-tuning, but rather learning new knowledge from scratch (given that the multilingual vocabs are quite small, at least for my native language).

Is it not safe to assume that if I am "fine-tuning" on a corpus that has a lot of new (out-of-original-vocab) words, then I'm not actually fine-tuning the model but in fact re-training it from scratch? (given that it has not seen those words before, or has labeled them as unknown)

From the tokenization script (tokenization.py):

    if is_bad:  # out-of-vocab word
        output_tokens.append(self.unk_token)

@hsm207
Contributor

hsm207 commented May 22, 2019

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot and those words are important to your domain, then it is a problem, and you will need to add these words to the vocab.
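
A rough way to measure that, assuming the repo's tokenization module and a plain-text domain corpus with one sentence per line:

```python
# Rough sketch: estimate how often tokenization falls back to [UNK] on a
# domain corpus. Paths are placeholders.
import tokenization  # from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

total, unk = 0, 0
with open("domain_corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens = tokenizer.tokenize(line)
        total += len(tokens)
        unk += sum(1 for t in tokens if t == "[UNK]")

print("%d of %d tokens are [UNK] (%.2f%%)" % (unk, total, 100.0 * unk / max(total, 1)))
```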

@PetreanuAndi

@hsm207 , "add these words to the vocab" -> does that not imply re-training from scratch on a specific-language corpus? Otherwise, the unknown tokens will just be assigned random values, i guess? (random values => no correlation or relationship between words, bad :( right? )

@hsm207
Contributor

hsm207 commented May 23, 2019

@PetreanuAndi yes, you are right. Adding words to the vocab and then fine-tuning on your corpus essentially means training those words' embeddings from scratch on your domain-specific corpus.
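
One workaround that comes up in the issues linked above is to reuse the [unusedN] placeholder slots in vocab.txt for the new domain words, so the vocab size (and therefore the embedding matrix shape) stays compatible with the released checkpoint. A rough sketch, where new_words.txt is a hypothetical file with one new word per line:

```python
# Sketch: overwrite [unusedN] placeholder entries in vocab.txt with domain
# words. new_words.txt and the output path are placeholders.
with open("new_words.txt", encoding="utf-8") as f:
    new_words = [w.strip() for w in f if w.strip()]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

words = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(words)
        except StopIteration:
            break

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The embeddings for those slots still start out effectively untrained, which is exactly why the further pre-training on the domain corpus is needed.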

@PetreanuAndi

@hsm207 thank you. I have already begun doing just that, but needed some peer-review confirmation of my approach :) Such a shame though. The famous ImageNet moment for NLP, such praise, much awe, and it is really only proficient in English and Chinese :)

@Scagin

Scagin commented May 29, 2019

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

I tried training my own pre-trained model using run_pretraining.py, but I found that it only runs on the CPU, not the GPU. Is this a problem with TPUEstimator? How can I run the code on a GPU?

@hsm207
Contributor

hsm207 commented May 29, 2019

How did you figure out it was running on your CPU and not GPU? Was it based on the logs or nvidia-smi?

Anyway, you can try the implementation from Hugging Face. It looks like they have figured out how to run the pretraining using GPUs.
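
For anyone debugging the same thing, a quick check (TF 1.x, which this repo targets) of whether TensorFlow can see a GPU at all:

```python
# Quick check: does this TensorFlow install see a GPU? If nothing is listed,
# the issue is the environment (tensorflow-gpu / CUDA / driver), not the script.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())       # True if a usable GPU was found
print(device_lib.list_local_devices())  # lists all visible CPU/GPU devices
```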

@008karan

@Scagin I'm facing the same issue. How did you solve it?

@gsasikiran

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

After pre-training with run_pretraining.py, the model has produced checkpoints, but I need the word embeddings. Can I derive word embeddings from the checkpoints?

@hsm207
Contributor

hsm207 commented Aug 16, 2019

@gsasikiran yes, you can. See here for details and adapt as necessary.
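
If all you need is the context-independent word embedding table, you can also read it straight out of the checkpoint; a sketch, assuming the variable name this repo uses and a placeholder checkpoint path:

```python
# Sketch: read the word embedding table directly from a BERT checkpoint.
# "bert/embeddings/word_embeddings" is the variable name used by this repo.
import tensorflow as tf

ckpt = "./pretraining_output/model.ckpt-100000"  # placeholder path
emb = tf.train.load_variable(ckpt, "bert/embeddings/word_embeddings")

# Rows line up with vocab.txt: row i is the static embedding of vocab entry i.
print(emb.shape)  # (vocab_size, hidden_size)
```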

@gsasikiran

@hsm207 Thank you. It worked and resulted in a JSON file, but that file contains only 16 dictionaries, with tokens for just [CLS] and [SEP]. I wonder what happened to the remaining words in the vocabulary.
json_file

@hsm207
Contributor

hsm207 commented Aug 19, 2019

@gsasikiran can you share a minimal and fully reproducible example? I'd like to run your code myself.

@gsasikiran

https://colab.research.google.com/drive/1ZXn2cVpyvfUscN_-FD1Z_h3HW9xMi_nd

Here I provide the link to the Colab notebook with my program. Let me know if I also need to provide my training data and vocab.txt.

@hsm207
Contributor

hsm207 commented Aug 19, 2019

@gsasikiran you need to provide everything so that I can reproduce your results.

@gsasikiran

gsasikiran commented Aug 20, 2019

data_files.zip
bert_config_file: bert_config.json
input_file: training_data.txt
vocab_file: deep_vocab.txt

EDIT: I have removed the training data, as it may be subject to copyright.

@gsasikiran

@hsm207 Did you get the embeddings?

@hsm207
Contributor

hsm207 commented Sep 29, 2019

@gsasikiran I have trouble running your notebook.

Specifically, I am getting this error:

[screenshot of the error message]

Can you insert into the notebook all the code needed to download the data for your use case too?

@gsasikiran

@hsm207
Contributor

hsm207 commented Oct 1, 2019

@gsasikiran I don't see a problem with the results. I can view the embeddings for all the tokens in my input.

See the last cell in this notebook: https://gist.github.com/hsm207/143b6349ed1c92960be0dc1c6165d551

@gsasikiran

@hsm207 Thank you for your time. I see the problem now: it is my input text. The input I provided has many empty lines at the beginning, and those lines returned only [CLS] and [SEP] token embeddings.
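
In case it helps anyone else, stripping blank lines from the input before extracting features avoids those trivial [CLS]/[SEP]-only examples; a small sketch with placeholder file names:

```python
# Sketch: drop blank lines so feature extraction does not produce examples
# containing only [CLS] and [SEP]. File names are placeholders.
with open("input.txt", encoding="utf-8") as src, \
        open("input_clean.txt", "w", encoding="utf-8") as dst:
    dst.writelines(line for line in src if line.strip())
```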

@vr25

vr25 commented Oct 17, 2019

Hi,

I am trying to use a domain-specific BERT (FinBERT) for my task, but it looks like files such as config.json, pytorch_model.bin, and vocab.txt are missing from the repository. On the other hand, there are two vocabulary files that were created as part of pre-training the domain-specific FinBERT.

I was wondering if I can use the above BERT_pretraining_share.ipynb to create config.json and pytorch_model.bin and then use the result as a domain-specific bert-base-uncased model.

Thanks.
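
Not a FinBERT author, but for what it's worth: if you end up with a TensorFlow checkpoint plus bert_config.json and vocab.txt, the Hugging Face library can convert it into a pytorch_model.bin. A rough sketch along the lines of their conversion script (all paths are placeholders):

```python
# Rough sketch: convert a TF BERT checkpoint into pytorch_model.bin for use
# with the Hugging Face library. Requires both torch and tensorflow installed.
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt-100000")  # placeholder ckpt

torch.save(model.state_dict(), "pytorch_model.bin")
config.to_json_file("config.json")
# Copy vocab.txt alongside these two files to get a loadable model directory.
```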

@gsasikiran

@vr25 I have no problem.

@ldb-46

ldb-46 commented Oct 18, 2019 via email

@imayachita

Hi @PetreanuAndi and @nightowlcity,
Did you manage to pre-train/fine-tune your model on your domain-specific text? Does pre-training from scratch and adding words to the vocab have a significant impact compared to just fine-tuning? Thanks!

@viva2202

@PetreanuAndi, @nightowlcity, @imayachita I would also be very interested in your experiences. I am also facing the decision of whether to fine-tune a model or train a new one from scratch.

Thank you in advance for sharing your experiences!


@nagads

nagads commented Feb 1, 2021

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot and those words are important to your domain, then it is a problem, and you will need to add these words to the vocab.

@hsm207 Do you have any suggestions on how much domain corpus data is needed to learn the new vocab embeddings, assuming I leverage the already-learnt weights for the existing words in the current vocab? Thanks.

@hsm207
Contributor

hsm207 commented Feb 1, 2021

@nagads
I would look to the ULMFiT model for guidance.

They tested the general pre-training -> domain-specific pre-training -> task-specific fine-tuning approach on several datasets:

[screenshot of results from the ULMFiT paper]

@nagads

nagads commented Feb 2, 2021

@hsm207 Thanks a ton. This is helpful.

@nagads

nagads commented Feb 19, 2021

@pkrishnavamshi could you clarify the rationale for re-training rather than fine-tuning again on a smaller dataset? Thanks.
Re-training in BERT has various connotations:

  1. you want to learn a new vocabulary, as in the case of BioBERT
  2. you want to fine-tune the language model with the same vocab to better fit the tone and tenor of the domain (as in the case of ULMFiT)
