Geearl/6323 large tables #429

Merged
merged 134 commits into from
Jan 20, 2024

@georearl (Contributor) commented Jan 5, 2024

This PR relates to GitHub issues #410 and #353.

The root cause is that we add paragraphs to a chunk sequentially until we hit the max token limit, 750 by default. With regular text, when we reach a paragraph that would push us over the limit, we break it down sentence by sentence and add the sentences to the chunk until we hit the limit, then add the remaining sentences to the next chunk.

This doesn't work for tables.

When a table spans two pages, the code that breaks a paragraph down by sentence doesn't work, because it sees the whole table as a single sentence. The outcome is a single chunk holding the tables from each page. For the user, this shows up as a failure caused by chunks being too large when they are returned as part of a RAG request.
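To make the failure mode concrete, here is a minimal sketch of the chunking behaviour described above. It is not the repository's code: `count_tokens`, `split_sentences` and `chunk_paragraphs` are hypothetical names, and whitespace-separated words stand in for real tokens.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_paragraphs(paragraphs: list[str], max_tokens: int = 750) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    used = 0

    def flush() -> None:
        nonlocal current, used
        if current:
            chunks.append(" ".join(current))
        current, used = [], 0

    for para in paragraphs:
        size = count_tokens(para)
        if used + size <= max_tokens:
            # The whole paragraph still fits in the current chunk.
            current.append(para)
            used += size
            continue
        # The paragraph would overflow: fall back to sentence-by-sentence.
        for sentence in split_sentences(para):
            s_size = count_tokens(sentence)
            if current and used + s_size > max_tokens:
                flush()
            current.append(sentence)
            used += s_size
    flush()
    return chunks
```

With ordinary prose the fallback keeps every chunk within the limit. A table rendered as text usually has no sentence-ending punctuation, so `split_sentences` returns it as one piece and the resulting chunk exceeds `max_tokens`.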

The change includes the following:

  • We now split tables by rows when they cross a max token count. We considered keeping a table as a single chunk using a page boundary, but Daylan requested we keep the max chunk size for tables at 1024 tokens.
  • We repeat top rows that FR identifies as heading cells. We tag these rows as
  • If two tables are identified as a single table (signified by consecutive tables with no text-based paragraphs between them), then we include the section from the first table in all subsequent tables. This is useful when a table spans multiple pages but the header rows appear only on the first page; without it, this context would be lost as the table is chunked.
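The row-based splitting and header repetition listed above could look roughly like the sketch below. It is illustrative only and not the code in this PR: `chunk_table` and `chunk_table_run` are hypothetical names, rows are plain strings, whitespace-separated words stand in for tokens, and the number of heading rows is assumed to be supplied by the caller.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def chunk_table(rows: list[str], header_row_count: int,
                max_tokens: int = 1024) -> list[str]:
    """Split a table by rows, repeating the heading rows in every chunk."""
    header = rows[:header_row_count]
    body = rows[header_row_count:]
    header_size = sum(count_tokens(r) for r in header)

    chunks: list[str] = []
    current: list[str] = []
    used = header_size
    for row in body:
        size = count_tokens(row)
        if current and used + size > max_tokens:
            # Close the chunk and start a new one that begins with the header.
            chunks.append("\n".join(header + current))
            current, used = [], header_size
        current.append(row)
        used += size
    if current:
        chunks.append("\n".join(header + current))
    return chunks


def chunk_table_run(tables: list[tuple[list[str], int]],
                    max_tokens: int = 1024) -> list[str]:
    """Chunk consecutive tables (no paragraphs between them) as one table.

    Each item is (rows, header_row_count). A table without heading rows
    inherits the heading rows of the most recent table that had them.
    """
    chunks: list[str] = []
    carried_header: list[str] = []
    for rows, header_row_count in tables:
        if header_row_count > 0:
            carried_header = rows[:header_row_count]
        else:
            rows = carried_header + rows
            header_row_count = len(carried_header)
        chunks.extend(chunk_table(rows, header_row_count, max_tokens))
    return chunks
```

In this sketch a single row larger than `max_tokens` still produces an oversized chunk; how the real implementation treats that case is not stated here.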

ryonsteele and others added 30 commits December 6, 2023 22:30
This reverts commit c5ac17a, reversing
changes made to 6641eed.
Update the process flow architecture with latest product naming
…ployment-hf

Add check for existing CUA deployment object and remove
dayland-ms and others added 23 commits January 11, 2024 12:41
…-1.0-release

Azure estimation 1.0 release
…rt-and-ux-docs

Update deployment troubleshooting and improve UX analysis panel docs
Updating hard link to redirect link for YouTube
Fixed typo and broken image links
…inks

Update sample data links in user_experience.md
Remove functions_flow.md and update related documentation
…low-docs

Merge pull request #454 from microsoft/geearl/function-flow-doc
Update bug_report.md template with additional instructions and details
Changed
Azure Services
The following list of Azure Services will be deployed for IA Accelerator, version 0.4 delta:

to 
Azure Services
The following list of Azure Services will be deployed for IA Accelerator, version 1.0:
…/microsoft/PubSec-Info-Assistant into geearl/6323-large-tables"

This reverts commit c6792d7, reversing
changes made to 4fdc07d.
@georearl georearl merged commit aa712e3 into vNext-Dev Jan 20, 2024
6 checks passed
@georearl georearl deleted the geearl/6323-large-tables branch January 20, 2024 01:03
dayland pushed a commit that referenced this pull request Jan 24, 2024
lukasvalach added a commit to lukasvalach/PubSec-Info-Assistant that referenced this pull request Apr 8, 2024
* Merge pull request microsoft#429 from microsoft/geearl/6323-large-tables

Geearl/6323 large tables

* Resolve function debug issue and add logic for multiple table spans

* Update deployment.md

Updates on setting the right tenant if you're part of multiple tenants. Otherwise deployment will fail.

* Bump fastapi from 0.103.2 to 0.109.1 in /app/enrichment

Bumps [fastapi](https://github.com/tiangolo/fastapi) from 0.103.2 to 0.109.1.
- [Release notes](https://github.com/tiangolo/fastapi/releases)
- [Commits](tiangolo/fastapi@0.103.2...0.109.1)

---
updated-dependencies:
- dependency-name: fastapi
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* gov changes

* gov changes

* gov changes

* new app service app setting and readme update

* url update

* Pushing out temp fixes until python sdk is resolved.

* Remove media_service and avam modules

* script fix

* updated link

* Remove base_url breaking param

* Resolve issue with new gov logic on aoai endpoint resolution

* Merge pull request microsoft#523 and microsoft#530 from vNext-Dev for large table fixes

* Update app.py

Added `.lower()` to ensure the str read is correctly converted to a bool.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dayland <dayland@microsoft.com>
Co-authored-by: dayland <48474707+dayland@users.noreply.github.com>
Co-authored-by: avidunixuser <avidunixuser@users.noreply.github.com>
Co-authored-by: ryonsteele <ryonsteele@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Danimal <dbiscup@microsoft.com>
Co-authored-by: Brandon Rohrer <brandon.rohrer@outlook.com>
Co-authored-by: Nehemiah Kuhns <85817913+nhwkuhns@users.noreply.github.com>

9 participants