protect urls during chunking #635

Merged (1 commit) on Feb 22, 2024

Conversation

satarupaguha11
Contributor

Motivation and Context

  1. Why is this change required?

Our chunking logic does not respect URL boundaries, so it can split a URL across two chunks. One of our major customers, YW, recently encountered broken URLs (although in their case the main cause was Form Recognizer failing to recognize URLs that span multiple lines).

  2. What problem does it solve?

Regardless of how Form Recognizer behaves, we can make sure our own chunking step does not break URLs. This PR ensures this for PDF documents that are processed with Form Recognizer.

  3. What scenario does it contribute to?
    Same as 1 and 2.

  4. If it fixes an open issue, please link to the issue here.

The issue is being tracked outside of this repository.

  5. Does this solve an issue or add a feature that all users of this sample app can benefit from?
    It benefits any user who uses Form Recognizer for PDF document ingestion.
    We also want to make this code available to any customer facing URL-breaking issues who may want to adapt it into their own solution.

Description

  • Identify URLs before chunking, using a regex
  • Replace each URL with a placeholder (and record the mapping in a dictionary)
  • Chunk as usual
  • Replace each placeholder in the chunks with its URL by looking it up in the dictionary
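
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the URL regex, the `__URL_n__` placeholder format, and the names `mask_urls`, `restore_urls`, `chunk_with_protected_urls`, and `naive_chunk` are all assumptions made for the example. `naive_chunk` is a deliberately crude stand-in for the real chunker that splits after every period, which is exactly the kind of splitting that breaks URLs.

```python
import re
from typing import Callable, Dict, List, Tuple

# Deliberately simple URL pattern (an assumption, not the regex used in the PR).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def mask_urls(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace every URL with a placeholder and return the placeholder -> URL mapping."""
    mapping: Dict[str, str] = {}

    def _substitute(match: "re.Match[str]") -> str:
        # Hypothetical placeholder format; it contains no "." or whitespace,
        # so the stand-in chunker below never splits inside it.
        placeholder = f"__URL_{len(mapping)}__"
        mapping[placeholder] = match.group(0)
        return placeholder

    return URL_PATTERN.sub(_substitute, text), mapping


def restore_urls(chunk: str, mapping: Dict[str, str]) -> str:
    """Put the original URLs back into a single chunk."""
    for placeholder, url in mapping.items():
        chunk = chunk.replace(placeholder, url)
    return chunk


def chunk_with_protected_urls(
    text: str, chunker: Callable[[str], List[str]]
) -> List[str]:
    """Mask URLs, chunk the masked text as usual, then restore URLs per chunk."""
    masked, mapping = mask_urls(text)
    return [restore_urls(chunk, mapping) for chunk in chunker(masked)]


def naive_chunk(text: str) -> List[str]:
    """Stand-in chunker: start a new chunk after every period."""
    return [piece for piece in re.split(r"(?<=\.)", text) if piece]


if __name__ == "__main__":
    sample = "Docs live at https://example.com/docs/v1.2/index.html today. More soon."
    for chunk in chunk_with_protected_urls(sample, naive_chunk):
        print(repr(chunk))
```

The approach relies on the placeholder being something the chunker treats as an unbreakable unit, so the placeholder format would need to be chosen with the actual splitter's separators and token limits in mind. The sketch also assumes the input text never already contains a string that looks like a placeholder.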

Contribution Checklist

  • [x] I have built and tested the code locally
  • [ ] For frontend changes, I have pulled the latest code from main, built the frontend, and committed all static files.
  • [x] This is a change for all users of this app. No code or asset is specific to my use case or my organization.
  • [x] I didn't break any existing functionality 😄

@satarupaguha11 satarupaguha11 merged commit 8b7815e into main Feb 22, 2024
1 check passed
@satarupaguha11 satarupaguha11 deleted the saguha/protect_urls_pdf branch February 22, 2024 02:16
sudo-init pushed a commit to sudo-init/sample-app-chatGPT that referenced this pull request Sep 20, 2024
nikhilnagaraj pushed a commit to Admin-bh-Edge/Edge-Comp-Policies that referenced this pull request Oct 2, 2024