Textractor is producing garbled output #661

Closed
scott-vsi opened this issue Feb 7, 2024 · 2 comments

@scott-vsi

I was trying to run 52_Build_RAG_pipelines_with_txtai.ipynb and was getting garbled output from the Textractor.

from txtai.pipeline import Textractor
textractor = Textractor()
text = textractor("txtai/article.pdf")

(I also found that the path to article.pdf must be an absolute path.)

Here is a sample of the output:

%PDF-1.5
%äüöß
2 0 obj
<>
stream
x��ZɎ�6��W���*����
���Cu�_|Ԛ�_g�=��t|�����������!��_QZCxB��������%A�A�=V��7�N��i_��<��{4�ʢ/0W�.$DZϣ��S��NC>�&����Z�'��E��q�����,�z��V@:i���'>�H�����Ƨ s�]]k�l#����8�Z4���jN�j�Jb�SR��z��d��d]'��+���I�����x� ��u��e�!0Pe�*�F$yXI4'�M�研FV�b8���K��Y�����CN�$�u)��g��ث��z�ߗ���6�'l�O�w��YmQ��M�8&�đ�4�C�?����ꈇz��P0b�L�M���9������"

Judging from this comment, if Tika is not working, Textractor falls back to BeautifulSoup, which is the case here (textractor.checkjava() returns False). Would you expect the BeautifulSoup output to be unusable like this?
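For anyone hitting the same thing, a quick way to confirm whether Java is visible before constructing the Textractor is to look for the executable on PATH. This is only a rough sketch using the standard library. The helper names java_available and describe_backend are hypothetical, and this is not necessarily the same check that textractor.checkjava() performs internally.

```python
import shutil


def java_available(executable="java"):
    """Return True if `executable` resolves to a program on PATH."""
    # shutil.which returns None when nothing matching is found on PATH.
    return shutil.which(executable) is not None


def describe_backend(executable="java"):
    """Build a human-readable hint about whether Tika is likely to be usable."""
    if java_available(executable):
        return f"'{executable}' found on PATH; Tika should be usable."
    return (
        f"'{executable}' not found on PATH; install a JDK (for example "
        "openjdk-8-jdk) before extracting text from PDFs."
    )


if __name__ == "__main__":
    # Hypothetical usage: print the hint before running the notebook cell.
    print(describe_backend())
```

If this reports that Java is missing, the garbled output above is the expected symptom rather than a problem with the PDF itself.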

@scott-vsi
Author

I found a brief note in 10_Extract_text_from_documents.ipynb that I have to install Java (openjdk-8-jdk) for Tika to work.

This note should be added to 52_Build_RAG_pipelines_with_txtai.ipynb and perhaps noted somewhere in the README.

@davidmezzetti
Member

I have a note in #646 to add this to the FAQ/documentation and error message.

I've seen a number of people run into this issue, and when Java isn't installed it's hard to debug (see here and here).
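Until the documentation and error message are updated, one possible user-side guard is to check whether the "extracted" text still begins with the PDF header, as the sample in this issue does (%PDF-1.5). The sketch below is an assumption about how such a guard could look, not part of txtai. The names looks_like_raw_pdf and check_extracted are hypothetical.

```python
def looks_like_raw_pdf(text):
    """Return True if `text` starts with the PDF header marker."""
    # Unparsed PDF content begins with a header such as "%PDF-1.5".
    return text.lstrip().startswith("%PDF-")


def check_extracted(text):
    """Return `text` unchanged, or raise if it looks like unparsed PDF data."""
    if looks_like_raw_pdf(text):
        raise ValueError(
            "Extracted text starts with '%PDF-', so the file was probably "
            "not parsed. Check that Java is installed so that Tika can run."
        )
    return text
```

Wrapping the Textractor result in check_extracted(...) would turn the silent garbled output into an explicit error that points at the missing Java install.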
