Enhance textractor to better support RAG use cases #603

Closed
davidmezzetti opened this issue Nov 28, 2023 · 0 comments


Currently, the textractor pipeline is rather basic. It simply calls Apache Tika (if available) and returns the raw text. If Tika isn't available, the textractor converts HTML to raw text.

Tika is a mature and stable project that supports a large number of file formats. It can also extract content as XHTML. The following improvements should be made to better support downstream retrieval-augmented generation (RAG) use cases.

  • Support section parsing. This will add a new flag called sections. When enabled, it will split the text by section or page breaks. This will better organize content into related sections.
  • Improve paragraph parsing. Add better paragraph detection.
  • Preserve whitespace formatting. Currently the textractor strips a lot of formatting that would be useful to a RAG pipeline.
  • Export all formats to XHTML. Add an XHTML parser that can cleanly convert content to raw text.
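
To make the items above concrete, here is a minimal sketch of an XHTML-to-text parser that splits content into sections, detects paragraphs, and keeps whitespace inside preformatted blocks. It is not the textractor implementation. The names `XHTMLToText`, `xhtml_to_sections` and `BLOCK_TAGS` are hypothetical, and treating `<div class="page">` as a section boundary is an assumption about the XHTML being parsed. It only uses Python's standard `html.parser` module.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph-level blocks (an assumption for this sketch)
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "pre", "tr"}

# Tags whose text content is ignored
SKIP_TAGS = {"head", "script", "style"}


class XHTMLToText(HTMLParser):
    """Collects text from XHTML as sections, each a list of paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = [[]]  # list of sections, each a list of paragraphs
        self.buffer = []      # text fragments for the current paragraph
        self.pre = 0          # depth inside <pre>, where whitespace is kept
        self.skip = 0         # depth inside ignored tags

    def flush(self):
        """Ends the current paragraph and stores it if it has content."""
        text = "".join(self.buffer)
        # Keep indentation for preformatted text, trim everything else
        text = text.strip("\n") if self.pre else text.strip()
        if text.strip():
            self.sections[-1].append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "div" and dict(attrs).get("class") == "page":
            # Assumed section boundary: start a new section per page
            self.flush()
            if self.sections[-1]:
                self.sections.append([])
        elif tag in BLOCK_TAGS:
            self.flush()
            if tag == "pre":
                self.pre += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
            if tag == "pre":
                self.pre = max(0, self.pre - 1)

    def handle_data(self, data):
        if self.skip:
            return
        # Collapse runs of whitespace unless inside <pre>
        self.buffer.append(data if self.pre else re.sub(r"\s+", " ", data))


def xhtml_to_sections(xhtml):
    """Returns one text string per section, paragraphs separated by blank lines."""
    parser = XHTMLToText()
    parser.feed(xhtml)
    parser.close()
    parser.flush()
    return ["\n\n".join(section) for section in parser.sections if section]
```

Calling `xhtml_to_sections` on a document with two page-level `div` elements returns a list of two strings, one per page, which could then be chunked and indexed separately in a RAG pipeline. Input without any page markers comes back as a single section.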