How do you handle large documents? #2295
Comments
Apparently the answer may be to feed smaller sequences of tokens and use the `past` input keyword in the PyTorch models, or the hidden states in the TensorFlow models. Both this `past` input and the stateful nature of the models aren't documented. It would be interesting to have methods for managing big inputs.
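The chunked, stateful pattern described above can be sketched framework-agnostically. Here `model_step` is a hypothetical stand-in for a forward pass that accepts and returns a recurrent state (analogous to the `past` keyword in the PyTorch models or the hidden states in TensorFlow); it is not an actual library API.

```python
def run_in_chunks(tokens, model_step, chunk_size=512):
    """Feed a long token sequence to a stateful model in fixed-size chunks.

    `model_step(chunk, state)` is a hypothetical callable standing in for a
    model forward pass that takes the previous state (None on the first call)
    and returns (outputs_for_chunk, new_state).
    Returns the concatenated outputs and the final state.
    """
    outputs, state = [], None
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        out, state = model_step(chunk, state)
        outputs.extend(out)
    return outputs, state
```

With a real model, `state` would be the tuple of cached key/value tensors carried between forward passes; the point is only that the loop never feeds more than `chunk_size` tokens at once.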
Recent models like Transformer-XL and XLNet already support longer sequences, although the available pretrained models are, IMHO, limited to 512 tokens. Some additional pointers:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
There are two main methods:
I summarized some typical papers on BERT for long text in this post: Paper Dissected and Recap #4: Which BERT for Long Text?
@lethienhoa does
❓ Questions & Help
I have been a huge fan of this library for a while now. I've used it to accomplish things like sentence classification, a chat bot, and even stock market price prediction; this is truly a fantastic library. But I have not yet learned how to tackle large documents (e.g. documents 10x the size of the model's max length).
An example: a task I would love to accomplish is document abstraction, but the documents I am dealing with are upwards of 3,000 words long, and I'm afraid that taking only the first 512 or 768 tokens will not yield a quality summary.
One idea that I was kicking around, but have not yet put code to, involves taking a 512-token window to produce a model output, then shifting the window and repeating until I have covered the entire document. I would then repeat the whole process on the outputs until the input fits into my model.
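The windowing step described above can be sketched as follows (a minimal illustration, not code from the thread; the window and stride sizes are placeholders, and `stride < window` gives consecutive windows a shared overlap so context is not cut at hard boundaries):

```python
def sliding_windows(tokens, window=512, stride=256):
    """Yield overlapping fixed-size windows over a token sequence.

    Each window has at most `window` tokens; successive windows start
    `stride` tokens apart, so with stride < window they overlap.
    """
    if not tokens:
        return
    for start in range(0, len(tokens), stride):
        yield tokens[start:start + window]
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
```

Summarizing each window, concatenating the per-window summaries, and feeding that back through `sliding_windows` gives the recursive reduction described above: repeat until the remaining text fits within the model's max length.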
There must be a better way. I have heard of developers using these NLP models to summarize large legal documents and legislation, which can be hundreds of pages, let alone thousands of words. Am I missing something, am I overthinking this problem?