Plans to support longer sequences? #27

Closed
cbockman opened this issue Nov 2, 2018 · 14 comments

Comments

@cbockman
Contributor

cbockman commented Nov 2, 2018

Right now, the model (correct me if I'm wrong) appears to be locked to a maximum sequence length of 512, based on running and playing with the code (and this makes sense in the context of the paper).

Are there any near-term plans to support longer sequences?

Offhand, this would potentially require multiple issues to be addressed, including 1) allowing positional embeddings that extend to longer or perhaps arbitrary lengths (with some degradation beyond the lengths trained on, of course), possibly using something like multiple sinusoidal embeddings, as in the original Transformer paper, and 2) containing/limiting the Transformer's quadratic memory explosion (my first instinct would be to try something like the techniques in "Generating Wikipedia by Summarizing Long Sequences", https://arxiv.org/abs/1801.10198).

Right now, from a first pass, it seems like the way to use this over longer sequences is to chunk the docs into sequences (either inline with fixed lengths, or possibly as pre-processing on boundaries like sentences or paragraphs), apply BERT in a feature-input mode, and then feed the results into something else downstream (like a Universal Transformer).

All of this seems doable, but it 1) is more complicated from an engineering perspective and 2) loses the ability to fine-tune (at least in any way that is obvious to me).

(Of course, a model adapted to longer sequences as in https://arxiv.org/abs/1801.10198 has model-power trade-offs, so it is plausible that the feature-based approach could still be superior.)
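
To make the feature-based idea concrete, here is a rough sketch (assuming a frozen BERT encoder from huggingface transformers as the per-chunk feature extractor; the downstream LSTM is just a placeholder for whatever model consumes the chunk features):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def chunk_features(text, chunk_len=510):
    """Encode an arbitrarily long text as a sequence of per-chunk [CLS] vectors."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    feats = []
    with torch.no_grad():  # feature mode: BERT itself is not fine-tuned
        for start in range(0, len(ids), chunk_len):
            chunk = ids[start:start + chunk_len]
            chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
            out = encoder(torch.tensor([chunk]))
            feats.append(out.last_hidden_state[:, 0, :])  # [CLS] vector per chunk
    return torch.cat(feats, dim=0)  # (num_chunks, hidden)

# Feed the chunk features into some downstream sequence model.
downstream = nn.LSTM(input_size=768, hidden_size=256, batch_first=True)
doc_feats = chunk_features("some very long document ...").unsqueeze(0)
outputs, _ = downstream(doc_feats)
```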

@jacobdevlin-google
Contributor

We don't plan to make major changes to this library, so anything like that would be part of a separate project.

Our recommended recipe is exactly what you describe (it's what we do for SQuAD), but you can actually fine-tune on it normally (we just don't do that for SQuAD because only a few percent of SQuAD documents are longer than 384 tokens, so it didn't matter. But we should have.)

Let's say you have:

the man went to the store and bought a gallon of milk

And you had max_seq_length = 6 and stride = 3, then you could split it up like this:

the man went to the store
to the store and bought a
and bought a gallon of milk

So from BertModel's perspective this is a 3x6 minibatch, but crucially you can reshape it after you get it back from BertModel.get_sequence_output() and softmax over all the tokens when you compute the loss (with some masking to make sure you don't double count the boundary words like "to the store" and "and bought a"). So you will be fine-tuning over the whole document end-to-end. The exact implementation is task-specific, of course.
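
A minimal, framework-agnostic sketch of that splitting and masking (plain Python, with token strings standing in for WordPiece ids; this is not the actual SQuAD code):

```python
def split_with_stride(tokens, max_seq_length=6, stride=3):
    """Split a long token list into overlapping windows and build a mask that
    is 1 only for the first occurrence of each original token, so overlapping
    positions are not double counted in the loss."""
    windows, masks = [], []
    covered = 0  # number of original tokens already attributed to a window
    start = 0
    while True:
        window = tokens[start:start + max_seq_length]
        mask = [1 if start + i >= covered else 0 for i in range(len(window))]
        covered = start + len(window)
        windows.append(window)
        masks.append(mask)
        if start + max_seq_length >= len(tokens):
            break
        start += stride
    return windows, masks

tokens = "the man went to the store and bought a gallon of milk".split()
windows, masks = split_with_stride(tokens)
# windows -> the 3x6 "minibatch" above
# masks   -> [1,1,1,1,1,1], [0,0,0,1,1,1], [0,0,0,1,1,1]
# After BertModel.get_sequence_output() you reshape (3, 6, hidden) back to
# (18, hidden), apply the mask, and softmax over the surviving positions.
```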

@vr25

vr25 commented Oct 21, 2019


Hi,
It looks like a good solution wherein a longer sequence is broken down into shorter sequences. I was wondering if it is feasible to apply the same technique to sequences of ~100,000 tokens.
Also, could you elaborate more on the reshaping from an implementation point of view?
Thanks.

@oakkas

oakkas commented Jan 30, 2020

Hi @vr25, did you find any good solution to this?
I am working on classifying really long documents, and it does not seem to work well because of the max_seq_length limit of 512. I just found this issue and wanted to give it a try.

@vr25

vr25 commented Mar 22, 2020

Hi @oakkas
No, I haven't tried the above solution yet, but I will resume this soon. Do you have any updates on this that you would like to share here?

Thanks!

@oakkas

oakkas commented Mar 22, 2020

Hi again @vr25.
Not yet either. I was pulled into another project for now, but I'm hoping to start experimenting soon too.

@dbsousa01

Hey @vr25 @oakkas

Did any of you try it already?

@vmaryasin

vmaryasin commented Nov 24, 2020

Hi, I had to classify long texts, most of which had 500-1000 tokens, but some of which could contain up to 500k tokens.

So I built a system heavily inspired by Jacob Devlin's comment above: split my 1024-token text into a minibatch of 2x512 tokens, concatenated the 2 outputs of the [CLS] tokens (2 x 768 -> 1536), and put a regular classification head on top of it. Then I fine-tuned the whole system end-to-end.
(The particular implementation was on FlauBERT from huggingface transformers, trained in PyTorch.)
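
Roughly, the model looked like the sketch below (a simplified PyTorch sketch using a plain bert-base encoder from huggingface transformers; the real system used FlauBERT, plus the usual tokenization and padding around it):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TwoChunkClassifier(nn.Module):
    """Split each (num_chunks * max_len)-token document into num_chunks BERT
    inputs, concatenate the per-chunk [CLS] vectors, and classify."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2,
                 num_chunks=2, max_len=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.num_chunks, self.max_len = num_chunks, max_len
        hidden = self.encoder.config.hidden_size  # 768 for base models
        self.classifier = nn.Linear(hidden * num_chunks, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, num_chunks * max_len), zero-padded
        b = input_ids.size(0)
        ids = input_ids.view(b * self.num_chunks, self.max_len)
        mask = attention_mask.view(b * self.num_chunks, self.max_len)
        cls = self.encoder(input_ids=ids,
                           attention_mask=mask).last_hidden_state[:, 0, :]
        cls = cls.view(b, -1)        # (batch, num_chunks * hidden), e.g. 1536
        return self.classifier(cls)  # trained end-to-end with the encoder
```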

Due to the very specific nature of my texts (lots of numbers, tables and other structures), I didn't do any striding. So basically, there was no attention span between the 2 individual parts of the text.

For my problem this trick gave a meaningful performance gain, but it didn't change the world. The classifier was already doing quite well on truncated 512-token texts; I just managed to push it a bit further. I also tried minibatches of 4x512 tokens, but that didn't give a meaningful improvement, since only ~5% of my texts were longer than 1024 tokens. These conclusions are task-specific, of course.

We also tried to implement a BERT with the attention mechanism of the Longformer (https://arxiv.org/abs/2004.05150) and compare against it. But for reasons unknown, the latter took significantly more time to train in our configuration, so we abandoned the idea. Note, however, that this long training time is almost certainly not due to the Longformer itself. I'm very curious to see such a comparison if someone makes it. :)

@donglinz

@vmaryasin Can you please elaborate on why you believe the long training process is certainly not due to the Longformer itself?

Thanks

@vmaryasin

@donglinz It'll get a bit messy and historical here. :)

There are two French-language BERTs, CamemBERT and FlauBERT, with slightly different implementations. At the time, we hadn't yet made a final choice of model for the project. It turned out that, due to the way the attention layers are coded and accessed in the two models, it was much easier to implement the Longformer attention on CamemBERT. On the other hand, for historical reasons, the trick above was implemented on FlauBERT.

What we observed was that the Longformer took significantly more time to train. But later on we noticed that a large part of this delay was actually due to plain CamemBERT being slower than plain FlauBERT in our particular setup. This was the strangest thing and, unfortunately, we didn't find the reason why. And as we already had a system that was working alright, we abandoned CamemBERT and did not try to reimplement the Longformer on FlauBERT either.

So it's best to say that I don't have any conclusion about the Longformer. But I wanted to mention it above, as it seems to be a good option and I'm curious to see a comparison of the two methods.

@anubhav562

Hi, I am working on an NER task and I have a lot of text sequences where the sequence length goes above 512. Does anyone have an idea of how I can tackle this?

One solution I have thought of is to simply divide the input sequence into multiple input sequences so that each fits the 512 limit, apply the NER classification to these sequences, and then stitch the NER labels back together.
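
A rough sketch of that splitting and stitching (plain Python; `predict_chunk` is a placeholder for whatever call labels a single <=512-token chunk, not a real API):

```python
def ner_on_long_sequence(tokens, predict_chunk, max_len=510, overlap=128):
    """Label an arbitrarily long token list by predicting on overlapping chunks
    and keeping, for each token, the label from the chunk in which the token is
    furthest from a chunk boundary (i.e. has the most context)."""
    labels = [None] * len(tokens)
    best_margin = [-1] * len(tokens)
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + max_len]
        chunk_labels = predict_chunk(chunk)  # placeholder: one label per token
        for offset, label in enumerate(chunk_labels):
            margin = min(offset, len(chunk) - 1 - offset)  # distance to boundary
            if margin > best_margin[start + offset]:
                best_margin[start + offset] = margin
                labels[start + offset] = label
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap
    return labels
```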

Has anyone faced a similar issue or found a better solution to this? Please let me know.

@lucaguarro

lucaguarro commented Sep 15, 2021

So I built a system heavily inspired by Jacob Devlin's comment above: split my 1024-token text into a minibatch of 2x512 tokens, concatenated the 2 outputs of the [CLS] tokens (2 x 768 -> 1536), and put a regular classification head on top of it. Then I fine-tuned the whole system end-to-end.

Hi @vmaryasin, what did you do for the samples that only had 1 output (i.e. fit into a single 512-token chunk)? Did you duplicate the first output or pad it somehow? And similarly, what did you do for the samples that exceeded 1024 tokens, or were they so few that you could just truncate?

@vmaryasin

Hi @lucaguarro,

  1. I padded the input text with zeros to a length of 1024, the same way a text shorter than 512 tokens is padded to fit into a single BERT input. This way I always had 2 BERT outputs.
  2. I truncated the text.
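
In code, points 1 and 2 amount to something like this (a tiny sketch on token ids, assuming 0 is the padding id, as in standard BERT vocabularies):

```python
def pad_or_truncate(input_ids, num_chunks=2, max_len=512, pad_id=0):
    """Make the id list exactly num_chunks * max_len long: truncate longer
    texts (point 2) and zero-pad shorter ones (point 1), so every document
    yields the same number of BERT outputs."""
    target = num_chunks * max_len
    ids = input_ids[:target] + [pad_id] * max(0, target - len(input_ids))
    attention_mask = [0 if t == pad_id else 1 for t in ids]
    return ids, attention_mask
```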

I believe those are task-specific design choices, and I would suggest you test them on your task. To be honest, I hadn't even asked myself your Q1. As for the longer texts, I tried training a 4x512 model, but it wasn't any better, to say the least.
Below is my text-length histogram; I should probably have tried 3x512 instead...
[Image: histogram of input text lengths]

@MichalBrzozowski91

MichalBrzozowski91 commented Feb 8, 2022

Hi, we have implemented a classification model for longer texts, following Jacob Devlin's comment:

  • Repo is available here.
  • The length of the input text is arbitrary; however, longer texts need proportionally more GPU RAM during fine-tuning, because all text chunks are fed to BERT as a single minibatch.
  • By default, the standard English bert-base-uncased model is used as the pretrained model; however, it is possible to use any BERT or RoBERTa model.
  • More technical details can be found here.

@MichalBrzozowski91

Hi, I described more details about the above open source implementation in the Medium articles:

Repo is available here.
