
Token limit often exceeded with PDF files #353

Closed
TaylorN15 opened this issue Nov 21, 2023 · 9 comments
Labels: enhancement (New feature or request)

@TaylorN15 (Collaborator)

We have some large PDF files, and the chunking process often creates chunks that far exceed the "target size". For example, one document (which can be downloaded here) produced a chunk over 80,000 tokens in length.

There are several other chunks created from this same file that are smaller but still exceed the target size by a substantial amount.

@dayland added the enhancement (New feature or request) label on Nov 21, 2023
@georearl (Contributor)

Hi @TaylorN15, we ran the file you linked to. The problem seems to relate to table processing: if a single table is larger than our target token count for a chunk, the target is not respected. We have added a task to our board to split tables by chunk size and repeat the table header rows in each chunk.

When we switched to unstructured.io for non-PDF documents, we were aware of the same issue there; they were planning to add this feature. So we need to make the change in our code, then follow up with unstructured to confirm whether it has been fixed on their side and update that path as well.

This issue has been updated to an enhancement.

@TaylorN15 (Collaborator, Author) commented Nov 21, 2023

Thanks @georearl. I actually wrote a function that chunks a table whilst keeping the header rows intact. I was using it in the previous version of the app to chunk Excel files, before you introduced the Unstructured library. I can share the code if you'd like; it may help.

@georearl (Contributor)

That would be great. Please share the code, or feel free to create a PR.

@TaylorN15 (Collaborator, Author)

Here's what I had previously. I updated the build_chunks function and added a new function for chunking tables. I haven't done a lot of testing on this, but it seemed to work well. One issue I can foresee: if the table headers alone exceed the token limit, it would not work.

    def build_chunks(self, document_map, myblob_name, myblob_uri, chunk_target_size):
        """ Function to build chunk outputs based on the document map """

        chunk_text = ''
        chunk_size = 0
        file_number = 0
        page_number = 0
        previous_section_name = document_map['structure'][0]['section']
        previous_title_name = document_map['structure'][0]["title"]
        previous_subtitle_name = document_map['structure'][0]["subtitle"]
        page_list = []
        chunk_count = 0

        def finalize_chunk():
            nonlocal chunk_text, chunk_count, chunk_size, file_number, page_list, page_number
            if chunk_text:  # Only write out if there is text to write
                self.write_chunk(myblob_name, myblob_uri, file_number,
                                 chunk_size, chunk_text, page_list,
                                 previous_section_name, previous_title_name, previous_subtitle_name)
                chunk_count += 1
                file_number += 1  # Increment the file/chunk number
            # Reset the chunk variables
            chunk_text = ''
            chunk_size = 0
            page_list = []
            page_number = 0  # Reset the page_number for the new chunk

        for paragraph_element in document_map['structure']:
            paragraph_size = self.token_count(paragraph_element["text"])
            paragraph_text = paragraph_element["text"]
            section_name = paragraph_element["section"]
            title_name = paragraph_element["title"]
            subtitle_name = paragraph_element["subtitle"]

            # Handle table paragraphs separately
            if paragraph_element["type"] == "table":
                # Check if the table needs to be split into multiple chunks
                if paragraph_size > chunk_target_size:
                    # Split the table into chunks with headers
                    table_chunks = self.chunk_table_with_headers(paragraph_text, chunk_target_size)
                    for table_chunk in table_chunks:
                        finalize_chunk()  # Finalize the previous chunk before starting a new one
                        chunk_text = minify_html.minify(table_chunk)  # Set the current chunk to the table chunk
                        chunk_size = self.token_count(chunk_text)  # Update the chunk size
                        finalize_chunk()  # Finalize the current table chunk
                    continue  # Skip to the next paragraph element

            # Check if a new chunk should be started
            if (chunk_size + paragraph_size >= chunk_target_size) or \
               (section_name != previous_section_name) or \
               (title_name != previous_title_name) or \
               (subtitle_name != previous_subtitle_name):
                finalize_chunk()

            # Add paragraph to the chunk
            chunk_text += "\n" + paragraph_text
            chunk_size += paragraph_size
            if page_number != paragraph_element["page_number"]:
                page_list.append(paragraph_element["page_number"])
                page_number = paragraph_element["page_number"]

            # Update previous section, title, and subtitle
            previous_section_name = section_name
            previous_title_name = title_name
            previous_subtitle_name = subtitle_name

        # Finalize the last chunk after the loop
        finalize_chunk()

        logging.info("Chunking is complete")
        return chunk_count
    
    def chunk_table_with_headers(self, table_html, chunk_target_size):
        soup = BeautifulSoup(table_html, 'html.parser')

        # Check for and extract the thead and tbody, or default to the entire table
        thead = soup.find('thead')
        body = soup.find('tbody') or soup.find('table') or soup
        # Exclude header rows so they are not duplicated as data in the first chunk
        rows = [row for row in body.find_all('tr') if row.find_parent('thead') is None]

        header_html = f"<table>{minify_html.minify(str(thead))}" if thead else "<table>"
        
        # Initialize chunks list and current_chunk with the header
        current_chunk = header_html
        chunks = []

        def add_current_chunk():
            nonlocal current_chunk
            # Close the table tag and emit the chunk, unless it holds only the header
            if current_chunk.strip() and current_chunk != header_html:
                current_chunk += '</table>'
                chunks.append(current_chunk)
                # Start a new chunk with header if it exists
                current_chunk = header_html

        for row in rows:
            # If adding this row to the current chunk exceeds the target size, start a new chunk
            row_html = minify_html.minify(str(row))
            if self.token_count(current_chunk + row_html) > chunk_target_size:
                add_current_chunk()

            # Add the current row to the chunk
            current_chunk += row_html

        # Add the final chunk if there's any content left
        add_current_chunk()
        
        return chunks
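
For the header-size edge case I mentioned, a possible guard would be to fall back to splitting the table without repeating the headers whenever the header block alone exceeds the budget. This is an untested sketch; the chunk_table_with_headers_safe wrapper and the tiktoken-backed token_count are hypothetical, just one way to fill in the helper the code above assumes:

    def chunk_table_with_headers_safe(self, table_html, chunk_target_size):
        """Sketch: skip header repetition when the header alone exceeds the budget."""
        soup = BeautifulSoup(table_html, 'html.parser')
        thead = soup.find('thead')
        if thead and self.token_count(f"<table>{thead}</table>") >= chunk_target_size:
            # The header alone is too large to repeat in every chunk, so drop
            # it and chunk the bare rows; each chunk can then fit the budget.
            thead.decompose()
            return self.chunk_table_with_headers(str(soup), chunk_target_size)
        return self.chunk_table_with_headers(table_html, chunk_target_size)

    def token_count(self, text):
        """Sketch of the token counter assumed above (here backed by tiktoken)."""
        encoding = tiktoken.get_encoding("cl100k_base")  # assumes `import tiktoken`
        return len(encoding.encode(text))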

@lon-tierney (Contributor)

@TaylorN15 checking in to see if you've made progress on this and if you will be submitting a PR?

@TaylorN15 (Collaborator, Author)

I had only made a start with the above code. It works in some cases but not others, and it won't work with anything that gets run through the Unstructured library. I feel like it's a key decision outside of my purview :)

@lon-tierney (Contributor)

Thank you for the feedback. We'll keep this open for review.

@georearl (Contributor)

Resolved and included in the code base. Thank you, @TaylorN15.

@TaylorN15 (Collaborator, Author)

> Resolved and included in the code base. Thank you, @TaylorN15.

@georearl - what if the table headers alone are larger than the chunk size, as I mentioned earlier?
