Plans to support longer sequences? #27

Closed
cbockman opened this issue Nov 2, 2018 · 14 comments

Comments

@cbockman
Contributor

cbockman commented Nov 2, 2018

Right now, the model (correct me if I'm wrong) appears to be locked to a maximum sequence length of 512, based on running and playing with the code (and this makes sense in the context of the paper).

Are there any near-term plans to support longer sequences?

Offhand, this would potentially require multiple issues to be addressed, including 1) allowing positional embeddings that extend to longer or perhaps arbitrary lengths (with some degradation beyond the lengths trained on, of course), possibly using something like multiple sinusoidal embeddings, as in the original Transformer paper, and 2) containing/limiting the Transformer's quadratic memory explosion (my first instinct would be to try something like the techniques in "Generating Wikipedia by Summarizing Long Sequences", https://arxiv.org/abs/1801.10198).

Right now, from a first pass, it seems like the way to use this over longer sequences is to chunk the docs into sequences (either inline with fixed lengths, or possibly as pre-processing on boundaries like sentences or paragraphs), apply BERT in a feature-input mode, and then feed the results into something else downstream (like a Universal Transformer).

All of this seems doable, but it 1) is more complicated from an engineering perspective and 2) loses the ability to fine-tune (at least in any way that is obvious to me).

(Of course, a model adapted to longer sequences as in https://arxiv.org/abs/1801.10198 has model-power trade-offs, so it is plausible that the feature-based approach could still be superior.)
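
To make the feature-based idea concrete, here is a rough sketch (assuming a frozen BERT encoder from huggingface transformers as the per-chunk feature extractor; the downstream LSTM is just a placeholder for whatever model consumes the chunk features):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def chunk_features(text, chunk_len=510):
    """Encode an arbitrarily long text as a sequence of per-chunk [CLS] vectors."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    feats = []
    with torch.no_grad():  # feature mode: BERT itself is not fine-tuned
        for start in range(0, len(ids), chunk_len):
            chunk = ids[start:start + chunk_len]
            chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
            out = encoder(torch.tensor([chunk]))
            feats.append(out.last_hidden_state[:, 0, :])  # [CLS] vector per chunk
    return torch.cat(feats, dim=0)  # (num_chunks, hidden)

# Feed the chunk features into some downstream sequence model.
downstream = nn.LSTM(input_size=768, hidden_size=256, batch_first=True)
doc_feats = chunk_features("some very long document ...").unsqueeze(0)
outputs, _ = downstream(doc_feats)
```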

@jacobdevlin-google
Contributor

We don't plan to make major changes to this library, so anything like that would be part of a separate project.

Our recommended recipe is exactly what you describe (it's what we do for SQuAD), but you can actually fine-tune on it normally (we just don't do that for SQuAD because only a few percent of SQuAD documents are longer than 384 tokens, so it didn't matter. But we should have.)

Let's say you have:

the man went to the store and bought a gallon of milk

And you had max_seq_length = 6 and stride = 3, then you could split it up like this:

the man went to the store
to the store and bought a
and bought a gallon of milk

So from BertModel's perspective this is a 3x6 minibatch, but crucially you can reshape it after you get it back from BertModel.get_sequence_output() and softmax over all the tokens when you compute the loss (with some masking to make sure you don't double count the boundary words like "to the store" and "and bought a"). So you will be fine-tuning over the whole document end-to-end. The exact implementation is task-specific, of course.
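
A minimal, framework-agnostic sketch of that splitting and masking (plain Python, with token strings standing in for WordPiece ids; this is not the actual SQuAD code):

```python
def split_with_stride(tokens, max_seq_length=6, stride=3):
    """Split a long token list into overlapping windows and build a mask that
    is 1 only for the first occurrence of each original token, so overlapping
    positions are not double counted in the loss."""
    windows, masks = [], []
    covered = 0  # number of original tokens already attributed to a window
    start = 0
    while True:
        window = tokens[start:start + max_seq_length]
        mask = [1 if start + i >= covered else 0 for i in range(len(window))]
        covered = start + len(window)
        windows.append(window)
        masks.append(mask)
        if start + max_seq_length >= len(tokens):
            break
        start += stride
    return windows, masks

tokens = "the man went to the store and bought a gallon of milk".split()
windows, masks = split_with_stride(tokens)
# windows -> the 3x6 "minibatch" above
# masks   -> [1,1,1,1,1,1], [0,0,0,1,1,1], [0,0,0,1,1,1]
# After BertModel.get_sequence_output() you reshape (3, 6, hidden) back to
# (18, hidden), apply the mask, and softmax over the surviving positions.
```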

@vr25

vr25 commented Oct 21, 2019


Hi,
It looks like a good solution wherein a longer sequence is broken down into shorter sequences. I was wondering if it is feasible to apply the same technique to sequences of ~100,000 tokens.
Also, could you elaborate more on the reshaping from an implementation point of view?
Thanks.

@oakkas

oakkas commented Jan 30, 2020

Hi @vr25, did you find any good solution to this?
I am working on classifying really long documents, and it does not seem to work well because of the max_seq_length limit of 512. I just found this issue and wanted to give it a try.

@vr25

vr25 commented Mar 22, 2020

Hi @oakkas
No, I haven't tried the above solution yet, but I will resume this soon. Do you have any updates on this that you would like to share here?

Thanks!

@oakkas

oakkas commented Mar 22, 2020

Hi again @vr25.
Not yet either. I was pulled into another project for now, but I'm hoping to start experimenting soon too.

@dbsousa01

Hey @vr25 @oakkas

Did any of you try it already?

@vmaryasin

vmaryasin commented Nov 24, 2020

Hi, I had to classify long texts, most of which had 500-1000 tokens, but some of which could contain up to 500k tokens.

So I built a system heavily inspired by Jacob Devlin's comment above: split my 1024-token text into a minibatch of 2x512 tokens, concatenated the 2 outputs of the [CLS] tokens (2 x 768 -> 1536), and put a regular classification head on top of it. Then I fine-tuned the whole system end-to-end.
(The particular implementation was on FlauBERT from huggingface transformers, trained in PyTorch.)
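
Roughly, the model looked like the sketch below (a simplified PyTorch sketch using a plain bert-base encoder from huggingface transformers; the real system used FlauBERT, plus the usual tokenization and padding around it):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TwoChunkClassifier(nn.Module):
    """Split each (num_chunks * max_len)-token document into num_chunks BERT
    inputs, concatenate the per-chunk [CLS] vectors, and classify."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2,
                 num_chunks=2, max_len=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.num_chunks, self.max_len = num_chunks, max_len
        hidden = self.encoder.config.hidden_size  # 768 for base models
        self.classifier = nn.Linear(hidden * num_chunks, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, num_chunks * max_len), zero-padded
        b = input_ids.size(0)
        ids = input_ids.view(b * self.num_chunks, self.max_len)
        mask = attention_mask.view(b * self.num_chunks, self.max_len)
        cls = self.encoder(input_ids=ids,
                           attention_mask=mask).last_hidden_state[:, 0, :]
        cls = cls.view(b, -1)        # (batch, num_chunks * hidden), e.g. 1536
        return self.classifier(cls)  # trained end-to-end with the encoder
```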

Due to the very specific nature of my texts (lots of numbers, tables and other structures), I didn't do any striding. So basically, there was no attention span between the 2 individual parts of the text.

For my problem this trick gave a meaningful performance gain, but it didn't change the world. The classifier was already doing quite well on truncated 512-token texts; I just managed to push it a bit further. I also tried minibatches of 4x512 tokens, but that didn't give a meaningful improvement, since only ~5% of my texts were longer than 1024 tokens. These conclusions are task-specific, of course.

We also tried to implement a BERT with the attention mechanism of the Longformer (https://arxiv.org/abs/2004.05150) and compare against it. But for reasons unknown, the latter took significantly more time to train in our configuration, so we abandoned the idea. Note, however, that this long training time is almost certainly not due to the Longformer itself. I'm very curious to see such a comparison if someone makes it. :)

@donglinz

@vmaryasin Can you please elaborate on why you believe the long training process is certainly not due to the Longformer itself?

Thanks

@vmaryasin

@donglinz It'll get a bit messy and historical here. :)

There are two French-language BERTs, CamemBERT and FlauBERT, with slightly different implementations. At the time, we hadn't yet made a final choice of model for the project. It turned out that, due to the way the attention layers are coded and accessed in the two models, it was much easier to implement the Longformer attention on CamemBERT. On the other hand, for historical reasons, the trick above was implemented on FlauBERT.

What we observed was that the Longformer took significantly more time to train. But later on we noticed that a large part of this delay was actually due to plain CamemBERT being slower than plain FlauBERT in our particular setup. This was the strangest thing and, unfortunately, we didn't find the reason why. And as we already had a system that was working alright, we abandoned CamemBERT and did not try to reimplement the Longformer on FlauBERT either.

So it's best to say that I don't have any conclusion about the Longformer. But I wanted to mention it above, as it seems to be a good option and I'm curious to see a comparison of the two methods.

@anubhav562

Hi, I am working on an NER task and I have a lot of text sequences where the sequence length goes above 512. Does anyone have an idea of how I can tackle this?

One solution I have thought of is to simply divide the input sequence into multiple input sequences so that each fits the 512 limit, apply the NER classification to these sequences, and then stitch the NER labels back together.
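
A rough sketch of that splitting and stitching (plain Python; `predict_chunk` is a placeholder for whatever call labels a single <=512-token chunk, not a real API):

```python
def ner_on_long_sequence(tokens, predict_chunk, max_len=510, overlap=128):
    """Label an arbitrarily long token list by predicting on overlapping chunks
    and keeping, for each token, the label from the chunk in which the token is
    furthest from a chunk boundary (i.e. has the most context)."""
    labels = [None] * len(tokens)
    best_margin = [-1] * len(tokens)
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + max_len]
        chunk_labels = predict_chunk(chunk)  # placeholder: one label per token
        for offset, label in enumerate(chunk_labels):
            margin = min(offset, len(chunk) - 1 - offset)  # distance to boundary
            if margin > best_margin[start + offset]:
                best_margin[start + offset] = margin
                labels[start + offset] = label
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap
    return labels
```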

Has anyone faced a similar issue or found a better solution to this? Please let me know.

@lucaguarro

lucaguarro commented Sep 15, 2021

So I built a system heavily inspired by Jacob Devlin's comment above: split my 1024-token text into a minibatch of 2x512 tokens, concatenated the 2 outputs of the [CLS] tokens (2 x 768 -> 1536), and put a regular classification head on top of it. Then I fine-tuned the whole system end-to-end.

Hi @vmaryasin, what did you do for the samples that only had 1 output (i.e. fit into a single 512-token chunk)? Did you duplicate the first output or pad it somehow? And similarly, what did you do for the samples that exceeded 1024 tokens, or were they so few that you could just truncate?

@vmaryasin

Hi @lucaguarro,

  1. I padded the input text with zeros to a length of 1024, the same way a text shorter than 512 tokens is padded to fit into a single BERT input. This way I always had 2 BERT outputs.
  2. I truncated the text.
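
In code, points 1 and 2 amount to something like this (a tiny sketch on token ids, assuming 0 is the padding id, as in standard BERT vocabularies):

```python
def pad_or_truncate(input_ids, num_chunks=2, max_len=512, pad_id=0):
    """Make the id list exactly num_chunks * max_len long: truncate longer
    texts (point 2) and zero-pad shorter ones (point 1), so every document
    yields the same number of BERT outputs."""
    target = num_chunks * max_len
    ids = input_ids[:target] + [pad_id] * max(0, target - len(input_ids))
    attention_mask = [0 if t == pad_id else 1 for t in ids]
    return ids, attention_mask
```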

I believe those are task-specific design choices, and I would suggest you test them on your task. To be honest, I hadn't even asked myself your Q1. As for the longer texts, I tried training a 4x512 model, but it wasn't any better, to say the least.
Below is my text-length histogram; I should probably have tried 3x512 instead...
[Image: histogram of input text lengths]

@MichalBrzozowski91

MichalBrzozowski91 commented Feb 8, 2022

Hi, we have implemented a classification model for longer texts, following Jacob Devlin's comment:

  • Repo is available here.
  • The length of the input text is arbitrary; however, longer texts need proportionally more GPU RAM during fine-tuning, because all text chunks are fed to BERT as a single minibatch.
  • By default, the standard English bert-base-uncased model is used as the pretrained model; however, it is possible to use any BERT or RoBERTa model.
  • More technical details can be found here.

@MichalBrzozowski91

Hi, I described more details about the above open source implementation in the Medium articles:

Repo is available here.
