Plans to support longer sequences? #27
We don't plan to make major changes to this library, so anything like that would be part of a separate project. Our recommended recipe is exactly what you describe (it's what we do for SQuAD), but you can actually fine-tune on it normally (we just don't do it for SQuAD because only a few percent of SQuAD documents are longer than 384, so it didn't matter. But we should have). Let's say you have a document longer than the maximum sequence length: split it into overlapping windows with a fixed stride (this is the `doc_stride` used for SQuAD), run the model on each window, and combine the window outputs.
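A minimal sketch of that sliding-window recipe (the function and the 384/128 defaults are illustrative assumptions mirroring the SQuAD preprocessing, not code from this repo):

```python
# Illustrative sliding-window chunking: overlapping windows of
# max_seq_length tokens, advancing by doc_stride, as in the SQuAD recipe.
def sliding_windows(tokens, max_seq_length=384, doc_stride=128):
    """Yield (start_offset, window) pairs that cover the whole document."""
    start = 0
    while True:
        yield start, tokens[start:start + max_seq_length]
        if start + max_seq_length >= len(tokens):
            break
        start += doc_stride

# A 1,000-token document becomes overlapping 384-token windows.
doc = ["tok%d" % i for i in range(1000)]
for offset, window in sliding_windows(doc):
    print(offset, len(window))
```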
Hi,
Hi @vr25 Did you find any good solution to your problem?
Hi @oakkas Thanks!
Hi again @vr25.
Hi, I had to do classification of long texts, most of which had 500-1000 tokens, but some could contain up to 500k tokens. So I built a system heavily inspired by Jacob Devlin's comment above: split my 1024-token texts into a minibatch of 2x512 tokens, then concatenated the two CLS outputs (2 x 768 -> 1536) and put a regular classification head on top, then fine-tuned the whole system end-to-end.

Due to the very specific nature of my texts (lots of numbers, tables and other structures), I didn't do any striding, so there was no attention span between the two individual parts of the text. For my problem this trick gave a meaningful performance gain, but it didn't change the world: the classifier was already doing quite well on truncated texts of 512 tokens, I just managed to push it a bit further. I also tried minibatches of 4x512 tokens, but this didn't give a meaningful improvement, as only ~5% of my texts were longer than 1024 tokens. These conclusions are task-specific, of course.

We also tried to implement a BERT with the attention mechanism of the Longformer (https://arxiv.org/abs/2004.05150) and compare against it.
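A minimal sketch of that setup, using the Hugging Face `transformers` API as a stand-in (the class name and shapes are my assumptions, not the author's code):

```python
# Encode two 512-token chunks with a shared BERT, concatenate the CLS
# embeddings (2 x 768 -> 1536), and classify on top. Fine-tuned end-to-end.
import torch
import torch.nn as nn
from transformers import BertModel

class TwoChunkClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size          # 768 for bert-base
        self.head = nn.Linear(2 * hidden, num_labels)  # 1536 -> num_labels

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, 2, 512) -- two chunks per text
        b, n_chunks, seq_len = input_ids.shape
        flat_ids = input_ids.view(b * n_chunks, seq_len)
        flat_mask = attention_mask.view(b * n_chunks, seq_len)
        out = self.bert(input_ids=flat_ids, attention_mask=flat_mask)
        cls = out.last_hidden_state[:, 0, :]           # (b * 2, 768)
        cls = cls.view(b, n_chunks * cls.size(-1))     # (b, 1536)
        return self.head(cls)
```

The same shared encoder processes both chunks, so the whole thing fine-tunes end-to-end exactly like a single-chunk classifier.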
@vmaryasin Can you please elaborate on why you believe the long training process is certainly not due to the Longformer itself? Thanks |
@donglinz It'll get a bit messy and historical here. :) There exist two French-language BERTs, CamemBERT and FlauBERT, with slightly different implementations. At that time, we hadn't yet made a final choice of model for the project. It turned out that, due to the way the attention layers are coded and accessed in the two models, it was much easier to implement the Longformer attention on CamemBERT. Conversely, for historical reasons, the trick above was implemented on FlauBERT. What we observed was that the Longformer took significantly more time to train. But later on we noticed that a large part of this delay was actually due to plain CamemBERT being slower than plain FlauBERT in our particular setup. This was the strangest thing, and unfortunately we didn't find the reason why. And as we already had a system that was working all right, we abandoned CamemBERT and did not try to reimplement the Longformer on FlauBERT either. So it's best to say that I don't have any conclusion about the Longformer. But I wanted to mention it above, as it seems to be a good option, and I'm curious to see a comparison of the two methods. |
Hi, I am working on an NER task and I have a lot of text sequences whose length goes above 512. Does anyone have an idea of how I can tackle this? One solution I have thought of is to simply divide the input sequence into multiple sequences so that each fits the 512 limit, apply the NER classification to these sequences, and then bind the NER labels back together (sketched below). Has anyone faced a similar issue or found a better solution? Please let me know. |
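A rough sketch of that chunk-and-rejoin idea (the `tag_fn` here is hypothetical, standing in for any per-chunk NER tagger):

```python
# Split a long example into sub-sequences that fit the 512 limit, tag each
# chunk independently, then concatenate the predicted labels.
def chunk(tokens, max_len=510):  # leave room for [CLS] and [SEP]
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def tag_long_sequence(tokens, tag_fn):
    """tag_fn maps a token list to a label list of the same length."""
    labels = []
    for piece in chunk(tokens):
        labels.extend(tag_fn(piece))
    return labels  # one label per original token
```

One caveat: an entity that straddles a chunk boundary gets cut in two; overlapping chunks in the spirit of the sliding-window recipe above can mitigate that.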
Hi @vmaryasin what did you do for the samples that only had 1 output (aka fit into a 512 chunk)? Did you duplicate the first output or pad it somehow? And similarly what did you do for the samples that exceeded 1024 tokens or were they so few that you could just truncate? |
Hi @lucaguarro,
I believe those are specific design choices, and I would suggest you test them on your task. To be honest, I didn't even ask myself your Q1. As for the longer texts, I tried training a 4x512 model, but it wasn't any better, to say the least.
Hi, we have implemented the classification model for longer texts following Jacob Devlin's comment.
Right now, the model (correct me if I'm wrong) appears to be locked to sequences of at most 512 tokens, based on running and playing with the code (and this makes sense in the context of the paper).
Are there any near-term plans to support longer sequences?
Offhand, this would potentially require multiple issues to be addressed, including 1) allowing positional embeddings that can extend to longer or perhaps arbitrary lengths (with some degradation beyond the lengths trained on, of course; possibly using something like the sinusoidal embeddings of the original Transformer paper?) and 2) containing/limiting the quadratic memory cost of Transformer self-attention (my first instinct would be to try something like the techniques in "Generating Wikipedia by Summarizing Long Sequences", https://arxiv.org/abs/1801.10198).
Right now, from a first pass, it seems like the way to use this over longer sequences is to chunk the docs into sequences (either inline with fixed lengths, or possibly as pre-processing on boundaries like sentences or paragraphs), apply BERT in a feature-based mode, and then feed the resulting features into something else downstream (like a Universal Transformer); see the sketch after this post.
All of this seems doable, but it 1) is more complicated from an engineering perspective and 2) loses the ability to fine-tune (at least in any way that is obvious to me).
(Of course, a model adept at longer sequences, as in https://arxiv.org/abs/1801.10198, has model-power trade-offs, so it is plausible that the feature-based approach could still be superior?)
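For what it's worth, a minimal sketch of that feature-based chunking approach (the Hugging Face `transformers` API and the CLS pooling here are my assumptions, not part of this repository):

```python
# Run a frozen BERT over fixed-length chunks of a long document and hand
# the per-chunk features to a downstream model.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def chunk_features(text, chunk_len=512):
    # return_overflowing_tokens splits the text into chunk_len-sized pieces.
    enc = tokenizer(text, return_overflowing_tokens=True,
                    max_length=chunk_len, stride=0, truncation=True,
                    padding="max_length", return_tensors="pt")
    with torch.no_grad():  # feature mode: BERT is frozen, no fine-tuning
        out = bert(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"])
    return out.last_hidden_state[:, 0, :]  # one CLS vector per chunk

features = chunk_features("a very long document ... " * 200)
print(features.shape)  # (num_chunks, 768) -> feed to a downstream model
```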