Textractor is producing garbled output #661

scott-vsi · 2024-02-07T17:41:35Z

I was trying to run 52_Build_RAG_pipelines_with_txtai.ipynb and was getting garbled output from the Textractor.

from txtai.pipeline import Textractor
textractor = Textractor()
text = textractor("txtai/article.pdf")

(I have also found that the path to article.pdf must be an absolute path)

Here is a sample of the output:

%PDF-1.5
%äüöß
2 0 obj
<>
stream
x��ZɎ�6��W���*����
���Cu�_|Ԛ�_g�=��t|�����������!��_QZCxB��������%A�A�=V��7�N��i_��<��{4�ʢ/0W�.$DZϣ��S��NC>�&����Z�'��E��q�����,�z��V@:i���'>�H�����Ƨ s�]]k�l#����8�Z4���jN�j�Jb�SR��z��d��d]'��+���I�����x� ��u��e�!0Pe�*�F$yXI4'�M�研FV�b8���K��Y�����CN�$�u)��g��ث��z�ߗ���6�'l�O�w��YmQ��M�8&�đ�4�C�?����ꈇz��P0b�L�M���9������"

According to this comment, if Tika is not working, Textractor falls back to BeautifulSoup, which is the case here (textractor.checkjava() is False). Is the output from BeautifulSoup expected to be unusable like this?
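For anyone hitting the same symptom, here is a minimal sketch of a sanity check on the extracted text. The sample above starts with the `%PDF-` file header, which suggests the raw file content came back rather than parsed text. The helper name `looks_like_raw_pdf` is hypothetical and not part of txtai.

```python
def looks_like_raw_pdf(text: str) -> bool:
    """Return True if the text still begins with a PDF file header.

    Hypothetical helper for illustration; it is not a txtai function.
    """
    # A PDF file begins with the marker "%PDF-" followed by a version number,
    # so seeing it at the start of "extracted" text means nothing was parsed.
    return text.lstrip().startswith("%PDF-")


if __name__ == "__main__":
    sample = "%PDF-1.5\n2 0 obj\nstream"
    print(looks_like_raw_pdf(sample))
    print(looks_like_raw_pdf("Plain extracted article text"))
```

In the notebook, this could be called on the result of `textractor("txtai/article.pdf")` to fail early instead of indexing garbage.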

scott-vsi · 2024-02-07T17:42:55Z

I found a brief note in 10_Extract_text_from_documents.ipynb that I have to install Java (openjdk-8-jdk) for Tika to work.

This note should be added to 52_Build_RAG_pipelines_with_txtai.ipynb and perhaps noted somewhere in the README.
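Until such a note exists, a notebook could check for Java up front. This is a rough sketch that only uses the standard library: it tests whether a `java` executable is on the PATH, which is an assumption about how Tika finds Java and is not txtai's own check. The function name `java_on_path` is hypothetical.

```python
import shutil
from typing import Callable, Optional


def java_on_path(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Return True if a `java` executable can be found on the PATH.

    Hypothetical helper; the lookup function is injectable so it can be tested.
    """
    # shutil.which returns the full path to the executable, or None if missing.
    return which("java") is not None


if __name__ == "__main__":
    if not java_on_path():
        print("Java not found on PATH; install a JDK (e.g. openjdk-8-jdk) for Tika.")
```

Running this before constructing `Textractor()` would surface the missing dependency directly rather than through garbled output.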

davidmezzetti · 2024-02-07T22:16:03Z

I have a note in #646 to add this to the FAQ/documentation and error message.

I've seen a number of people run into this issue, and when Java isn't installed it's hard to debug (see here and here).

scott-vsi closed this as completed Feb 7, 2024
