How do you handle large documents? #2295

Closed · zbloss opened this issue Dec 24, 2019 · 5 comments

zbloss commented Dec 24, 2019

❓ Questions & Help

I have been a huge fan of this library for a while now. I've used it to accomplish things like sentence classification, a chatbot, and even stock market price prediction; this is truly a fantastic library. But I have not yet learned how to tackle large documents (e.g., documents 10x the size of the model's max length).

An example: a task I would love to accomplish is abstractive document summarization; however, the documents I am dealing with are upwards of 3,000 words long, and I'm afraid that taking only the first 512 or 768 tokens will not yield a quality summary.

One idea I have been kicking around, but have not put code to yet, is to take a 512-token window, produce a model output, then shift the window by 512 tokens and repeat until I have covered my entire corpus. I would then repeat the whole process on the concatenated outputs until the input fits into my model (see the sketch at the end of this comment).

There must be a better way. I have heard of developers using these NLP models to summarize large legal documents and legislation, which can run to hundreds of pages, let alone thousands of words. Am I missing something, or am I overthinking this problem?
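
A rough sketch of the sliding-window idea above (the summarization pipeline, its default model, and the word-based chunking here are illustrative assumptions, nothing I have validated):

```python
# Hypothetical sketch: summarize fixed-size windows of the document, then
# re-summarize the concatenated partial summaries until the text fits in
# a single model input.
from transformers import pipeline

summarizer = pipeline("summarization")  # loads the library's default summarization model

def summarize_long(text, chunk_words=400):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [summarizer(chunk, max_length=120, min_length=30)[0]["summary_text"]
                for chunk in chunks]
    combined = " ".join(partials)
    # If the combined partial summaries are still too long, recurse on them.
    if len(combined.split()) > chunk_words:
        return summarize_long(combined, chunk_words)
    return combined
```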

cedspam (Contributor) commented Dec 24, 2019

Apparently the answer may be to feed smaller sequences of tokens and use the `past` input keyword in the PyTorch models, or the hidden states in the TensorFlow models. Both this `past` input and the stateful nature of the models aren't documented. It would be interesting to have methods to manage big inputs.
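
A minimal sketch of that stateful approach, using Transformer-XL's `mems` as the carried-over state (the checkpoint name, the 512-token chunking loop, and the output unpacking are assumptions and may differ between transformers versions):

```python
# Hypothetical sketch: feed a long document in 512-token chunks and carry the
# model's memory ("mems") across chunks so earlier text can influence later chunks.
import torch
from transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLModel.from_pretrained("transfo-xl-wt103")
model.eval()

text = "..."  # the long document
input_ids = tokenizer.encode(text, return_tensors="pt")

chunk_size = 512
mems = None
with torch.no_grad():
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        outputs = model(chunk, mems=mems)
        # First output is the hidden states for this chunk, second is the updated memory.
        hidden_states, mems = outputs[0], outputs[1]
```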

malteos (Contributor) commented Jan 6, 2020

Recent models like Transformer-XL and XLNet already support longer sequences, although the available pretrained models are, imho, only using 512 tokens.

Some additional pointers:

stale bot commented Mar 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Mar 6, 2020
stale bot closed this as completed Mar 13, 2020
lethienhoa commented Nov 19, 2020

There are two main methods:

  • Concatenating the outputs of several 'short' BERT passes (each limited to 512 tokens)
  • Constructing a genuinely long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)

I summarized some typical papers on BERT for long text in this post: Paper Dissected and Recap #4: Which BERT for long text?
You can get an overview of all the methods there.
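
For the second family, a minimal sketch with Longformer (the allenai/longformer-base-4096 checkpoint and its 4096-token limit are my assumptions here, not details from the post; the exact output format can vary by transformers version):

```python
# Hypothetical sketch: encode a document of up to 4096 tokens in a single pass
# with Longformer, instead of BERT's 512-token limit.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

text = "..."  # a document of several thousand words
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```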

@pratikchhapolika

@lethienhoa are all of the long-BERT variants capable of handling text of any length?
