java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content] with large pdf files #393

Open
tomthecat opened this issue Oct 8, 2018 · 17 comments

Comments

@tomthecat

php occ fulltextsearch:index stops indexing at PDF files with:

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

I deleted the first PDF where indexing stopped and started the indexing command again. Indexing stalled again on another PDF file, and again after I deleted that one too.

Common pattern: all of these PDF files were larger than 70 MB.
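
To check whether this size pattern holds across the whole data directory, the large files can be listed before starting the index. This is only a sketch: the 70 MB threshold comes from the observation above, while `find_large_files` and its parameters are hypothetical names that are not part of fulltextsearch.

```python
import os
from typing import Iterator, Tuple

# Threshold taken from the observation in this thread (about 70 MB).
# It is an assumption, not a documented limit of fulltextsearch.
THRESHOLD_BYTES = 70 * 1024 * 1024


def find_large_files(root: str,
                     threshold: int = THRESHOLD_BYTES,
                     suffixes: Tuple[str, ...] = (".pdf",)) -> Iterator[Tuple[str, int]]:
    """Yield (path, size) for files under root that are larger than threshold."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match on the file extension.
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Unreadable or vanished file: skip it instead of aborting.
                continue
            if size > threshold:
                yield path, size
```

Calling `list(find_large_files("/path/to/nextcloud/data"))` (the path is a placeholder) returns the candidates, which can then be compared with the files on which the index stops.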

Elasticsearch is running with 8 GB of RAM:

Active: active (running) since Mon 2018-10-08 16:14:13 CEST; 1h 10min ago
Docs: http://www.elastic.co
Main PID: 504 (java)
CGroup: /system.slice/elasticsearch.service
|-504 /bin/java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava...
`-807 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

Latest apps installed (1.01) and configured.

Any hints on this? I would love to use fulltextsearch on my files...

@tomthecat
Author

Just for the record: I am running NC 14

$ php occ status

  • installed: true
  • version: 14.0.1.1
  • versionstring: 14.0.1
  • edition:

@ArtificialOwl
Member

Have you run the tests before the first index?

@tomthecat
Author

Yes I did:

$ php occ fulltextsearch:test

.Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. (Elasticsearch) ok
Testing search platform. ok
Locking process ok
Removing test. ok
Pausing 3 seconds 1 2 3 ok
Initializing index mapping. ok
Indexing generated documents. ok
Pausing 3 seconds 1 2 3 ok
Retreiving content from a big index (license). (size: 32386) ok
Comparing document with source. ok
Searching basic keywords:

  • 'test' (result: 1, expected: ["simple"]) ok
  • 'document is a simple test' (result: 2, expected: ["simple","license"]) ok
  • '"document is a test"' (result: 0, expected: []) ok
  • '"document is a simple test"' (result: 1, expected: ["simple"]) ok
  • 'document is a simple -test' (result: 1, expected: ["license"]) ok
  • 'document is a simple +test' (result: 1, expected: ["simple"]) ok
  • '-document is a simple test' (result: 0, expected: []) ok
Updating documents access. ok
Pausing 3 seconds 1 2 3 ok
Searching with group access rights:
  • 'license' - [] - (result: 0, expected: []) ok
  • 'license' - ["group_1"] - (result: 1, expected: ["license"]) ok
  • 'license' - ["group_1","group_2"] - (result: 1, expected: ["license"]) ok
  • 'license' - ["group_3","group_2"] - (result: 1, expected: ["license"]) ok
  • 'license' - ["group_3"] - (result: 0, expected: []) ok
Searching with share rights:
  • 'license' - notuser - (result: 0, expected: []) ok
  • 'license' - user2 - (result: 1, expected: ["license"]) ok
  • 'license' - user3 - (result: 1, expected: ["license"]) ok
Removing test. ok
Unlocking process ok

@ArtificialOwl
Member

Can you reset, test, and re-index?

./occ fulltextsearch:reset
./occ fulltextsearch:test
./occ fulltextsearch:index

@ArtificialOwl
Member

ArtificialOwl commented Oct 11, 2018

I also ran some tests on my side, and Elasticsearch returns a 'Request Entity Too Large' error. I would say that PDF files bigger than ~70 MB will not be indexed by Elasticsearch.

Could you send me the PDF that crashes your index?
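
One possible explanation for the ~70 MB figure, offered as an assumption rather than a confirmed diagnosis: Elasticsearch rejects HTTP request bodies larger than `http.max_content_length` (100mb by default), and file content sent to an ingest attachment pipeline is base64-encoded, which grows it by about a third. The sketch below only does that arithmetic. `base64_size` and `fits_in_request` are hypothetical helpers, and the calculation ignores the JSON that surrounds the payload.

```python
# Default value of Elasticsearch's http.max_content_length setting (100mb).
# Assumption: the server in question still uses the default.
DEFAULT_MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def base64_size(raw_bytes: int) -> int:
    """Size in bytes of raw_bytes of data after padded base64 encoding."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def fits_in_request(raw_bytes: int, limit: int = DEFAULT_MAX_CONTENT_LENGTH) -> bool:
    """True if the base64-encoded file alone stays within the request limit."""
    return base64_size(raw_bytes) <= limit


for size_mib in (50, 70, 80, 170):
    raw = size_mib * 1024 * 1024
    print(f"{size_mib:4d} MiB raw -> {base64_size(raw) / 2**20:6.1f} MiB encoded, "
          f"fits: {fits_in_request(raw)}")
```

Under these assumptions the cut-off is about 75 MiB of raw file content, which is close to the ~70 MB seen here. It would not explain a failure on a much smaller file.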

@tomthecat
Author

@reset | test | index: I already did this twice, no luck.

@pdf: Sure, please check your private mail.

@pdf

pdf commented Oct 11, 2018

@tomthecat using '@'s is how you contact people on GitHub, please be a little careful with them.

@ArtificialOwl
Member

@tomthecat I haven't received any email. If you host the file, can you send the link to maxence@nextcloud.com?

@tomthecat
Author

daita: Did you receive my mail?

@ArtificialOwl
Member

Yup, if we're talking about a 170 MB PDF? :-)

@tomthecat
Author

What do you think: is there a way to make fulltextsearch skip these files instead of aborting?
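
The behaviour asked for here, skipping a file that fails and carrying on with the rest, can be sketched generically. This is not how fulltextsearch is implemented. `index_all` and `index_one` are hypothetical names, and `index_one` stands for whatever call sends a single file to the search platform.

```python
from typing import Callable, Iterable, List, Tuple


def index_all(paths: Iterable[str],
              index_one: Callable[[str], None]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Index every path, collecting failures instead of aborting on the first one."""
    indexed: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        try:
            index_one(path)
        except Exception as exc:
            # Deliberately broad: any failure skips this file only
            # and is recorded so it can be reported at the end.
            skipped.append((path, f"{type(exc).__name__}: {exc}"))
            continue
        indexed.append(path)
    return indexed, skipped
```

Recording the skipped paths together with the error message keeps the run going while still leaving a list of files to investigate afterwards.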

@ArtificialOwl
Member

should be fixed in 1.0.2

@tomthecat
Author

Updated to 1.0.2. It seems a bit better now, but I still receive the following error on very large PDF files and also on a large PPT:

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

See PM for a link to these files.

@ArtificialOwl
Member

The error I get is a BadRequest400Exception from Elasticsearch, which is typical for PDF files bigger than 70 MB.

Have you changed anything in the configuration of your ES?

@tomthecat
Author

Nope. I followed the instructions given here: https://fribeiro.org/tech/2018/02/07/nextcloud-full-text-elasticsearch/ and here: https://decatec.de/home-server/volltextsuche-in-nextcloud-mit-ocr/ (for tesseract OCR) without any additional tweaking.

@Agraphie

Agraphie commented Aug 24, 2019

I get the same error with a PDF of ~3 MB but around 130 pages. I can also send you the PDF if needed.

@aiveras

aiveras commented Feb 19, 2020

I have the same error. Is it possible to configure fulltextsearch to skip these PDFs? We have a lot of PDFs bigger than 70 MB...
