Add example for fine tuning BERT language model #124

Merged
4 commits merged into huggingface:master from deepset-ai:master on Jan 7, 2019

Conversation

@deepset-ai (Contributor) commented Dec 18, 2018

We are currently working on fine-tuning the language model on a new target corpus. This should improve the model if the language style in your target corpus differs significantly from the one BERT was initially trained on (Wiki + BookCorpus) but the corpus is still too small for training BERT from scratch. In our case, we apply this to a rather technical English corpus.

The sample script loads a pre-trained BERT model and fine-tunes it as a language model (masked tokens & nextSentence) on your target corpus. The samples from the target corpus can either be fed to the model directly from memory or read from disk one by one.

Training the language model from scratch, without loading a pre-trained BERT model, is also not very difficult to do from here. In contrast to the original TF repo, you can do the training with multiple GPUs instead of a TPU.

We thought this might also be helpful for others.
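
To make the masking step concrete, here is a minimal sketch of BERT-style masked-token sampling; the 80%/10%/10% replacement probabilities follow the original BERT recipe that this script reuses, while the function name and tokenizer handling are purely illustrative and not the PR's actual code:

```python
import random


def mask_tokens(tokens, vocab_words, mask_prob=0.15):
    """Illustrative BERT-style masking: each token is selected with probability
    mask_prob; a selected token is replaced by [MASK] 80% of the time, by a
    random vocab word 10% of the time, and left unchanged 10% of the time."""
    output_tokens = list(tokens)
    labels = [-1] * len(tokens)  # -1 = not a prediction target, else the original token
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = token  # the model must predict the original token here
            dice = random.random()
            if dice < 0.8:
                output_tokens[i] = "[MASK]"
            elif dice < 0.9:
                output_tokens[i] = random.choice(vocab_words)
            # else: keep the original token unchanged
    return output_tokens, labels
```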

Add example for fine tuning BERT language model (#1)
Adds an example for loading a pre-trained BERT model and fine-tuning it as a language model (masked tokens & nextSentence) on your target corpus.
@thomwolf (Member) commented Dec 18, 2018

This looks like a great addition!

Is it a full re-implementation of the pre-training script?

@deepset-ai (Contributor, Author) commented Dec 18, 2018

The implementation uses the same sampling parameters and logic, but it's not a one-to-one re-implementation of the original pre-training script.

Main differences:

  • In the original repo they first create a training set of TFRecords from the raw corpus (create_pretraining_data.py) and then train the model using run_pretraining.py. We decided against this two-step procedure and do the conversion from raw text to samples "on the fly" (more similar to codertimo's repo). This way we can generate new samples every epoch.
  • We currently feed in pairs of lines (= sentences) as one sample, while the original repo fills 90% of samples up with additional sentences until max_seq_length is reached (for our use case this did not make sense).

Main similarities:

  • All sampling / masking probabilities and parameters
  • Format of the raw corpus (one sentence per line & an empty line as document delimiter; see the toy example below)
  • Sampling strategy: a random nextSentence must come from another document
  • codertimo's data reader is similar to our code, but it didn't quite match the original sampling method.

Happy to clarify further details!
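
For reference, a toy corpus in the format described above (one sentence per line, an empty line between documents) could be created like this; the file name and sentences are made up purely for illustration:

```python
# Toy corpus in the expected format: one sentence per line,
# an empty line separates documents.
toy_corpus = (
    "The first sentence of document one.\n"
    "The second sentence of document one.\n"
    "\n"
    "Document two starts here.\n"
    "It also has a second sentence.\n"
)
with open("sample_corpus.txt", "w", encoding="utf-8") as f:
    f.write(toy_corpus)
```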

@davidefiocco (Contributor) commented Dec 18, 2018

Hi @deepset-ai, this is great! Just a suggestion: if this makes it into the repo, it would be good to also add something about this functionality to the README as part of this pull request.

@deepset-ai (Contributor, Author) commented Dec 19, 2018

Just added some basic documentation to the README. Happy to include more if @thomwolf thinks it makes sense.

@thomwolf (Member) commented Dec 19, 2018

Yes, I was going to ask you to add some information to the README, so that's great. The more the better. If you can also add instructions on how to download a dataset for training, as in the other examples, it would be perfect. If your dataset is private, do you have another dataset in mind that would let users try your script easily? If not, that's OK, don't worry.

Another thing is that the fp16 logic has now been switched to NVIDIA's apex module and we have gotten rid of the optimize_on_cpu option (see the relevant PR for more details). You can see the changes in the current examples like run_squad.py; it's actually a lot simpler since we don't have to manage parameter copies in the example, and it's also faster. Do you think you could adapt the fp16 parts of your script similarly?
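
For context, the apex-based setup in examples like run_squad.py at the time looked roughly like the sketch below; the exact apex import paths and the BertAdam signature are assumptions about the library versions in use, and args refers to the script's parsed command-line arguments:

```python
from pytorch_pretrained_bert.optimization import BertAdam


def build_optimizer(optimizer_grouped_parameters, args, num_train_optimization_steps):
    if args.fp16:
        try:
            # apex import paths are an assumption about the apex version in use
            from apex.optimizers import FP16_Optimizer, FusedAdam
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex "
                              "to use fp16 training.")
        optimizer = FusedAdam(optimizer_grouped_parameters,
                              lr=args.learning_rate,
                              bias_correction=False,
                              max_grad_norm=1.0)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
    else:
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=args.learning_rate,
                             warmup=args.warmup_proportion,
                             t_total=num_train_optimization_steps)
    return optimizer
```

With FP16_Optimizer, the training loop then calls optimizer.backward(loss) instead of loss.backward().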

logger = logging.getLogger(__name__)


class BERTDataset(Dataset):

@thomwolf (Member), Dec 19, 2018

I like this class. I think we should actually create a data.py module in the main package that would gather a few utilities for working more easily with BERT, which could be imported from the package instead of copied from script to script. I'm thinking of this dataset class, but also utilities like convert_example_to_features and maybe even your random_word function.

Maybe we should add some abstract classes/low-level functions from which the data manipulation logic of the other examples (run_classifier, run_squad and extract_features) could also be built.

What do you think? I haven't looked at the details yet, so maybe it doesn't make sense. If you don't have time to look at it, I will work it through when I start on the next release.

@tholor (Contributor), Dec 20, 2018

I also think this would add value, since there are probably quite a few things that could be shared between the examples. In addition, such a module would be helpful for people developing new, more specific downstream tasks. Unfortunately, I probably won't have time to work on this in the next few weeks. It would be great if you could take over when working on the next release.
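
Purely as a hypothetical illustration of the idea (none of these names are actual repo API), such a shared module might expose something like:

```python
from torch.utils.data import Dataset


class InputExample:
    """A single raw example: text_a, optional text_b, optional label."""
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class BertExampleDataset(Dataset):
    """Hypothetical base class: subclasses only implement how an example is
    converted to features, so run_classifier / run_squad / the LM fine-tuning
    example could share the remaining boilerplate."""
    def __init__(self, examples, tokenizer, max_seq_length):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.convert_example_to_features(self.examples[index])

    def convert_example_to_features(self, example):
        raise NotImplementedError
```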

while item > doc_end:
    doc_id += 1
    doc_start = doc_end + 1
    doc_end += len(self.all_docs[doc_id]) - 1

@thomwolf (Member), Dec 19, 2018

Is there a specific reason you iterate over the dataset every time rather than constructing an index->doc mapping when you read the file?

@tholor (Contributor), Dec 20, 2018

You are totally right. This is a left-over from another approach. Creating an initial mapping makes way more sense. I have now added a mapping for index -> {doc_id, line}.
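
A minimal sketch of that idea, with illustrative names rather than the exact code from this PR:

```python
# Build the index -> (doc_id, line) mapping once while reading the corpus,
# instead of scanning over documents for every requested item.
def load_corpus(corpus_path):
    all_docs = []       # all_docs[doc_id] is a list of lines (sentences)
    sample_to_doc = []  # sample index -> {"doc_id": ..., "line": ...}
    doc = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "":
                # an empty line marks the end of a document
                if doc:
                    all_docs.append(doc)
                    doc = []
            else:
                sample_to_doc.append({"doc_id": len(all_docs), "line": len(doc)})
                doc.append(line)
        if doc:
            all_docs.append(doc)
    return all_docs, sample_to_doc
```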

try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab)

@thomwolf (Member), Dec 19, 2018

Should we log this e.g. with a warning?

@tholor (Contributor), Dec 20, 2018

Done
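
For illustration, the handling could look roughly like this (the [UNK] fallback and the message wording are assumptions, not necessarily the exact merged code):

```python
try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab):
    # fall back to [UNK] and warn so the user notices.
    output_label.append(tokenizer.vocab["[UNK]"])
    logger.warning("Cannot find token '{}' in vocab. Using [UNK] instead.".format(token))
```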

                    type=int,
                    default=1,
                    help="Number of update steps to accumulate before performing a backward/update pass.")
parser.add_argument('--optimize_on_cpu',

@thomwolf (Member), Dec 19, 2018

We can remove this now (see #116)

@tholor (Contributor), Dec 20, 2018

Done

                       for n, param in model.named_parameters()]
else:
    param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']

@thomwolf (Member), Dec 19, 2018

This part has also changed; see the new names in the current examples.

@tholor (Contributor), Dec 20, 2018

Done. I have tried to replicate the apex usage from the other examples. Since I don't have much experience with apex yet, you might want to briefly check whether there's something I missed.

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},

@thomwolf (Member), Dec 19, 2018

And this was wrong and is now fixed; for this group it should be something like p for name, p in param_optimizer if not any(n in no_decay for n in name) (see the corrected grouping sketch below).

@tholor (Contributor), Dec 20, 2018

Done
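
For context, the corrected grouping in the examples at the time looked roughly like the sketch below; the no_decay substrings and the weight_decay_rate key follow the BertAdam convention of that era, and model is the BERT model being trained:

```python
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    # weight decay for every parameter whose name contains none of the
    # no-decay substrings ...
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    # ... and no weight decay for biases and LayerNorm (gamma/beta) parameters.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0},
]
```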

@Rocketknight1 commented Dec 19, 2018

This is something I'd been working on as well, congrats on a nice implementation!

One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.

add exemplary training data. update to nvidia apex. refactor 'item -> line in doc' mapping. add warning for unknown word.

@deepset-ai changed the title from "Add example for fine tuning BERT language model (#1)" to "Add example for fine tuning BERT language model" on Dec 20, 2018

@tholor (Contributor) commented Dec 20, 2018

> This is something I'd been working on as well, congrats on a nice implementation!
>
> One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.

@Rocketknight1, you are right that we will probably need some better evaluation here. Currently, though, I have the feeling that evaluation on downstream tasks is more meaningful (see also Jacob Devlin's comment here). In addition, some better monitoring of the loss during and after training would be nice.

Do you already have something in place that you would like to contribute? Otherwise, I will try to find some time during the upcoming holidays to add this.

@julien-c force-pushed the huggingface:master branch 3 times, most recently from 4a8c950 to 8da280e, on Dec 20, 2018

@Rocketknight1 commented Dec 22, 2018

> > This is something I'd been working on as well, congrats on a nice implementation!
> > One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.
>
> @Rocketknight1, you are right that we will probably need some better evaluation here. Currently, though, I have the feeling that evaluation on downstream tasks is more meaningful (see also Jacob Devlin's comment here). In addition, some better monitoring of the loss during and after training would be nice.
>
> Do you already have something in place that you would like to contribute? Otherwise, I will try to find some time during the upcoming holidays to add this.

I don't have any evaluation code either, unfortunately! It might be easier to just evaluate on the final classification task, so it's not really urgent. I'll experiment with LM fine-tuning when I'm back at work in January. If I get good benefits on classification tasks I'll see what effect early stopping based on validation loss has, and if that turns out to be useful too I can submit a PR for it?

@kaushaltrivedi commented Dec 31, 2018

Have you thought about extending the vocabulary after fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.

@tholor (Contributor) commented Jan 2, 2019

> Have you thought about extending the vocabulary after fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.

Adjusting the vocabulary before fine-tuning could be interesting, but you would need some smart approach to exchange "less important" tokens from the original byte pair vocab with "important" ones from your custom corpus (while keeping the pre-trained embeddings for the rest of the vocab meaningful).
We are not working on this at the moment. Looking forward to a PR if you have time to work on it.

@kaushaltrivedi commented Jan 3, 2019

> > Have you thought about extending the vocabulary after fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.
>
> Adjusting the vocabulary before fine-tuning could be interesting, but you would need some smart approach to exchange "less important" tokens from the original byte pair vocab with "important" ones from your custom corpus (while keeping the pre-trained embeddings for the rest of the vocab meaningful). We are not working on this at the moment. Looking forward to a PR if you have time to work on it.

Yes, I am working on it. The idea is to add more items to the pretrained vocabulary. I will also adjust the model layers: extend bert.embeddings.word_embeddings.weight and cls.predictions.decoder.weight with the mean weights, and update cls.predictions.bias with the mean bias for the additional vocabulary words.

Will send out a PR once I test it.
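
A rough sketch of that idea (not the actual PR code; the attribute paths follow BertForPreTraining in this repo, and the mean-initialization is the approach described above):

```python
import torch


def extend_vocab_with_mean_init(model, num_new_tokens):
    """Illustrative sketch: append num_new_tokens rows to the word embeddings,
    the MLM decoder weights, and the MLM bias, initializing the new entries
    with the mean of the pre-trained values."""
    def append_mean_rows(weight):
        mean_row = weight.data.mean(dim=0, keepdim=True)
        weight.data = torch.cat([weight.data, mean_row.repeat(num_new_tokens, 1)], dim=0)

    emb = model.bert.embeddings.word_embeddings.weight
    dec = model.cls.predictions.decoder.weight
    append_mean_rows(emb)
    if dec is not emb:
        # Only extend separately if the decoder weight is not tied to the embeddings.
        append_mean_rows(dec)

    bias = model.cls.predictions.bias
    new_bias = torch.full((num_new_tokens,), bias.data.mean().item())
    bias.data = torch.cat([bias.data, new_bias], dim=0)
```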

@thomwolf (Member) commented Jan 7, 2019

Ok this looks very good, I am merging, thanks a lot @tholor!

@thomwolf thomwolf merged commit c18bdb4 into huggingface:master Jan 7, 2019

qwang70 pushed a commit to DRL36/pytorch-pretrained-BERT that referenced this pull request Mar 2, 2019

Merge pull request huggingface#124 from deepset-ai/master
Add example for fine tuning BERT language model