Motivation and Context
Our chunking logic does not respect URL boundaries, so it can split a URL across two chunks. One of our major customers, YW, recently encountered this (although in their case the main cause was Form Recognizer failing to recognize URLs that span multiple lines).
Irrespective of Form Recognizer, we can make sure our chunking step does not break URLs. This PR ensures this for PDF documents that are processed with Form Recognizer.
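To illustrate the idea, here is a minimal sketch of URL-aware chunking. It is not the code in this PR. The function name `chunk_text_preserving_urls`, the `max_chunk_size` parameter, and the simple `https?://` regex are all assumptions made for illustration, and the sketch splits on character counts rather than tokens.

```python
import re

# Hypothetical, simplified URL matcher: an http(s) scheme followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def chunk_text_preserving_urls(text: str, max_chunk_size: int) -> list[str]:
    """Split text into chunks of roughly max_chunk_size characters,
    moving any cut point that would fall inside a URL."""
    url_spans = [m.span() for m in URL_PATTERN.finditer(text)]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        for url_start, url_end in url_spans:
            if url_start < end < url_end:
                # The cut would land inside this URL. Prefer cutting just
                # before it; if the URL starts at or before the chunk start
                # (it is longer than one chunk), keep it whole instead.
                end = url_start if url_start > start else url_end
                break
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "See https://example.com/a/very/long/path for details."
for chunk in chunk_text_preserving_urls(sample, max_chunk_size=20):
    print(repr(chunk))
```

In this sketch a chunk may exceed `max_chunk_size` when a single URL is longer than the limit. That trade-off favors keeping the URL intact over a strict size bound.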
What scenario does it contribute to?
Same as 1 and 2
If it fixes an open issue, please link to the issue here.
The issue is being tracked outside of this repository.
It would benefit any user who uses Form Recognizer for PDF document ingestion.
We want to make this code available to any customer facing the URL-breaking issue who may want to adapt it into their own solution.
Description
Contribution Checklist