How do you handle large documents? #2295

Closed · zbloss opened this issue Dec 24, 2019 · 5 comments

zbloss commented Dec 24, 2019

❓ Questions & Help

I have been a huge fan of this library for a while now. I've used it to accomplish things like sentence classification, a chatbot, and even stock market price prediction; this is truly a fantastic library. But I have not yet learned how to tackle large documents (e.g., documents 10x the size of the model's max length).

An example: a task I would love to accomplish is abstractive document summarization; however, the documents I am dealing with are upwards of 3,000 words long, and I'm afraid that taking only the first 512 or 768 tokens will not yield a quality summary.

One idea I have been kicking around, but have not put code to yet, is to take a 512-token window, produce a model output, then shift the window by 512 tokens and repeat until I have covered my entire corpus. I would then repeat the whole process on the concatenated outputs until the input fits into my model (see the sketch at the end of this comment).

There must be a better way. I have heard of developers using these NLP models to summarize large legal documents and legislation, which can run to hundreds of pages, let alone thousands of words. Am I missing something, or am I overthinking this problem?
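
A rough sketch of the sliding-window idea above (the summarization pipeline, its default model, and the word-based chunking here are illustrative assumptions, nothing I have validated):

```python
# Hypothetical sketch: summarize fixed-size windows of the document, then
# re-summarize the concatenated partial summaries until the text fits in
# a single model input.
from transformers import pipeline

summarizer = pipeline("summarization")  # loads the library's default summarization model

def summarize_long(text, chunk_words=400):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [summarizer(chunk, max_length=120, min_length=30)[0]["summary_text"]
                for chunk in chunks]
    combined = " ".join(partials)
    # If the combined partial summaries are still too long, recurse on them.
    if len(combined.split()) > chunk_words:
        return summarize_long(combined, chunk_words)
    return combined
```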

cedspam (Contributor) commented Dec 24, 2019

Apparently the answer may be to feed smaller sequences of tokens and use the `past` input keyword in the PyTorch models, or the hidden states in the TensorFlow models. Both this `past` input and the stateful nature of the models aren't documented. It would be interesting to have methods to manage big inputs.
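
A minimal sketch of that stateful approach, using Transformer-XL's `mems` as the carried-over state (the checkpoint name, the 512-token chunking loop, and the output unpacking are assumptions and may differ between transformers versions):

```python
# Hypothetical sketch: feed a long document in 512-token chunks and carry the
# model's memory ("mems") across chunks so earlier text can influence later chunks.
import torch
from transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLModel.from_pretrained("transfo-xl-wt103")
model.eval()

text = "..."  # the long document
input_ids = tokenizer.encode(text, return_tensors="pt")

chunk_size = 512
mems = None
with torch.no_grad():
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        outputs = model(chunk, mems=mems)
        # First output is the hidden states for this chunk, second is the updated memory.
        hidden_states, mems = outputs[0], outputs[1]
```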

malteos (Contributor) commented Jan 6, 2020

Recent models like Transformer-XL and XLNet already support longer sequences, although the available pretrained models are, imho, only using 512 tokens.

Some additional pointers:

stale bot commented Mar 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Mar 6, 2020
stale bot closed this as completed Mar 13, 2020
lethienhoa commented Nov 19, 2020

There are two main methods:

  • Concatenating the outputs of several 'short' BERT passes (each limited to 512 tokens)
  • Constructing a genuinely long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)

I summarized some typical papers on BERT for long text in this post: Paper Dissected and Recap #4: Which BERT for long text?
You can get an overview of all the methods there.
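
For the second family, a minimal sketch with Longformer (the allenai/longformer-base-4096 checkpoint and its 4096-token limit are my assumptions here, not details from the post; the exact output format can vary by transformers version):

```python
# Hypothetical sketch: encode a document of up to 4096 tokens in a single pass
# with Longformer, instead of BERT's 512-token limit.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

text = "..."  # a document of several thousand words
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```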

@pratikchhapolika

@lethienhoa are all of the long-BERT variants capable of handling text of any length?
