Enhance textractor to better support RAG use cases #603

davidmezzetti · 2023-11-28T20:28:34Z

Currently, the textractor pipeline is rather basic. It simply calls Apache Tika (if available) and returns the raw text. If Tika isn't available, the textractor converts HTML to raw text.

Tika is a mature and stable project that supports a large number of file formats. It can also extract content to XHTML. The following improvements should be made to better support downstream retrieval augmented generation (RAG) use cases.

Support section parsing. Add a new flag called sections. When enabled, the text is split at section or page breaks, which groups related content together.
Improve paragraph parsing. Add better paragraph detection.
Preserve whitespace formatting. Currently the textractor strips a lot of formatting that would be useful to a RAG pipeline.
Export all formats to XHTML. Add an XHTML parser that can cleanly convert content to raw text.
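
To illustrate the direction of these items, here is a minimal sketch of an XHTML-to-text step with section and paragraph handling, using only the Python standard library. This is not the textractor implementation: the names `XHTMLToText`, `extract` and `BLOCKS` are hypothetical, and it assumes page breaks appear in the XHTML as `<div class="page">` elements.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumption for this sketch)
BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "pre"}


class XHTMLToText(HTMLParser):
    """Hypothetical parser: groups XHTML text into sections of paragraphs."""

    def __init__(self):
        super().__init__()
        self.sections = [[]]  # each section is a list of paragraph strings
        self.buffer = []      # text fragments of the paragraph being read
        self.pre = 0          # depth inside <pre>, where whitespace is preserved
        self.skip = 0         # depth inside <head>, whose text is ignored

    def flush(self):
        # Close the current paragraph and store it in the current section
        text = "".join(self.buffer)
        self.buffer = []
        # Keep whitespace inside <pre>, collapse it everywhere else
        text = text.strip("\n") if self.pre else " ".join(text.split())
        if text.strip():
            self.sections[-1].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.skip += 1
        elif tag == "div" and dict(attrs).get("class") == "page":
            # Assumed page-break marker: start a new section
            self.flush()
            if self.sections[-1]:
                self.sections.append([])
        elif tag in BLOCKS:
            self.flush()
            if tag == "pre":
                self.pre += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.skip = max(0, self.skip - 1)
        elif tag in BLOCKS:
            # Flush before leaving <pre> so its whitespace is kept
            self.flush()
            if tag == "pre":
                self.pre = max(0, self.pre - 1)

    def handle_data(self, data):
        if not self.skip:
            self.buffer.append(data)


def extract(xhtml, sections=False):
    """Return a list of section strings if sections is True, else one string."""
    parser = XHTMLToText()
    parser.feed(xhtml)
    parser.close()
    parser.flush()
    pages = ["\n\n".join(page) for page in parser.sections if page]
    return pages if sections else "\n\n".join(pages)


sample = (
    '<html><head><title>Doc</title></head><body>'
    '<div class="page"><p>First   paragraph\ncontinues here.</p>'
    '<p>Second paragraph.</p></div>'
    '<div class="page"><pre>  indented\n    text</pre></div>'
    '</body></html>'
)

# One string per page, with paragraphs separated by blank lines
for section in extract(sample, sections=True):
    print(repr(section))
```

With `sections=False` the same content comes back as a single string, so a caller can choose between one document-level chunk and per-section chunks for a RAG index.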

davidmezzetti added this to the v6.3.0 milestone Nov 28, 2023

davidmezzetti self-assigned this Nov 28, 2023

davidmezzetti closed this as completed in 591e730 Nov 28, 2023

davidmezzetti mentioned this issue Nov 28, 2023

Update text extraction notebook #604

Closed
