Token limit often exceeded with PDF files #353
Hi @TaylorN15, we ran the file you linked to. The problem appears to relate to table processing: if a table is larger than our target token count for a chunk, the limit is not respected. We have added a task to our board to split tables by chunk size and repeat the table header rows in each chunk. When we switched to using unstructured.io for non-PDF documents, we were aware of the same issue there; they were planning to add this feature. So we need to make the change in our code, then follow up with Unstructured to confirm whether this has been fixed and update that path as well. This issue has been updated to an enhancement.
Thanks @georearl. I actually wrote a function that chunks a table while keeping the header rows intact. I was using it in the previous version of the app to chunk Excel files, before you introduced the Unstructured library. I can share the code if you'd like; it may help.
That would be great. Please share the code, or feel free to create a PR.
Here's what I had previously. I updated the build_chunks function and added a new function for chunking tables. I haven't done a lot of testing, but it seemed to work well. One issue I can foresee: if the table header rows alone exceed the token limit, this approach would not work.
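The approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual build_chunks code from the repo: the function and variable names are hypothetical, and the whitespace-based token counter is a stand-in for whatever tokenizer the app actually uses. It also fails loudly on the edge case mentioned, where the headers alone blow the budget.

```python
def count_tokens(text: str) -> int:
    # Placeholder tokenizer: whitespace split. Swap in the app's real
    # tokenizer (e.g. a tiktoken encoding) for accurate counts.
    return len(text.split())


def chunk_table(header_rows: list[str], data_rows: list[str],
                max_tokens: int) -> list[str]:
    """Split a table into chunks, repeating header_rows in every chunk.

    Hypothetical sketch of the technique discussed in this thread; it is
    not the repository's actual implementation.
    """
    header = "\n".join(header_rows)
    header_tokens = count_tokens(header)
    if header_tokens >= max_tokens:
        # Edge case raised above: header repetition cannot work if the
        # headers alone exceed the token limit.
        raise ValueError("table header rows alone exceed the token limit")

    chunks: list[str] = []
    current: list[str] = []
    current_tokens = header_tokens
    for row in data_rows:
        row_tokens = count_tokens(row)
        if current and current_tokens + row_tokens > max_tokens:
            # Flush the current chunk, prefixed with the header rows.
            chunks.append("\n".join([header] + current))
            current, current_tokens = [], header_tokens
        current.append(row)
        current_tokens += row_tokens
    if current:
        chunks.append("\n".join([header] + current))
    return chunks
```

Each emitted chunk starts with the full header block, so a downstream model always sees column names alongside the data rows it receives.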
@TaylorN15 checking in to see if you've made progress on this and whether you will be submitting a PR?
I had only made a start with the code above. It works in some cases but not others, and it won't work with anything that gets run through the Unstructured library. I feel like it's a key decision outside of my purview :)
Thank you for the feedback. We'll keep this open for review. |
Resolved and included in the code base. Thank you @TaylorN15.
@georearl - what if the table headers are larger than the chunk size, as I mentioned earlier?
We have some large PDF files, and during the chunking process it often creates chunks that well exceed the target size. For example, one document (which can be downloaded here) produces one chunk over 80,000 tokens in length.
Several other chunks created from the same file are smaller but still exceed the target size by a substantial amount.
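A quick way to quantify how far chunks exceed the target is to scan them with a token estimate. A minimal sketch: the names (`oversize_chunks`, `target_size`) are illustrative, and the chars/4 heuristic is a rough English-text approximation; swap in an exact tokenizer such as tiktoken to reproduce the 80,000-token figure precisely.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with an exact tokenizer (e.g. tiktoken) for real counts.
    return max(1, len(text) // 4)


def oversize_chunks(chunks: list[str], target_size: int) -> list[tuple[int, int]]:
    """Return (chunk_index, estimated_tokens) for every chunk whose
    estimated token count exceeds target_size."""
    report = []
    for i, chunk in enumerate(chunks):
        n = estimate_tokens(chunk)
        if n > target_size:
            report.append((i, n))
    return report
```

Running this over the output of the chunking pipeline makes it easy to spot which documents, like the PDF above, are producing chunks far past the budget.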