java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content] with large pdf files #393

tomthecat · 2018-10-08T15:26:35Z

php occ fulltextsearch:index stops indexing at pdf files with

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

I deleted the first PDF where indexing stopped and started the indexing command again. Indexing stalled again on another PDF file, and again after I deleted that one too.

Common pattern: all of these PDF files were larger than 70 MB.

Elasticsearch is running with 8 GB of RAM:

Active: active (running) since Mon 2018-10-08 16:14:13 CEST; 1h 10min ago
Docs: http://www.elastic.co
Main PID: 504 (java)
CGroup: /system.slice/elasticsearch.service
|-504 /bin/java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava...
`-807 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

Latest apps installed (1.01) and configured.

Any hints on this? I would love to use fulltextsearch on my files...


tomthecat · 2018-10-11T07:54:20Z

Just for the record: I am running NC 14

$ php occ status

installed: true
version: 14.0.1.1
versionstring: 14.0.1
edition:

ArtificialOwl · 2018-10-11T08:20:30Z

Have you run the test before the first index?

tomthecat · 2018-10-11T08:43:39Z

Yes I did:

$ php occ fulltextsearch:test

.Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. (Elasticsearch) ok
Testing search platform. ok
Locking process ok
Removing test. ok
Pausing 3 seconds 1 2 3 ok
Initializing index mapping. ok
Indexing generated documents. ok
Pausing 3 seconds 1 2 3 ok
Retreiving content from a big index (license). (size: 32386) ok
Comparing document with source. ok
Searching basic keywords:

'test' (result: 1, expected: ["simple"]) ok
'document is a simple test' (result: 2, expected: ["simple","license"]) ok
'"document is a test"' (result: 0, expected: []) ok
'"document is a simple test"' (result: 1, expected: ["simple"]) ok
'document is a simple -test' (result: 1, expected: ["license"]) ok
'document is a simple +test' (result: 1, expected: ["simple"]) ok
'-document is a simple test' (result: 0, expected: []) ok
Updating documents access. ok
Pausing 3 seconds 1 2 3 ok
Searching with group access rights:
'license' - [] - (result: 0, expected: []) ok
'license' - ["group_1"] - (result: 1, expected: ["license"]) ok
'license' - ["group_1","group_2"] - (result: 1, expected: ["license"]) ok
'license' - ["group_3","group_2"] - (result: 1, expected: ["license"]) ok
'license' - ["group_3"] - (result: 0, expected: []) ok
Searching with share rights:
'license' - notuser - (result: 0, expected: []) ok
'license' - user2 - (result: 1, expected: ["license"]) ok
'license' - user3 - (result: 1, expected: ["license"]) ok
Removing test. ok
Unlocking process ok

ArtificialOwl · 2018-10-11T12:08:43Z

Can you reset, test and re-index?

./occ fulltextsearch:reset
./occ fulltextsearch:test
./occ fulltextsearch:index

ArtificialOwl · 2018-10-11T13:24:08Z

I also ran some tests on my side, and Elasticsearch returns a 'Request Entity Too Large' error. I would say that PDF files bigger than ~70 MB will not be indexed by Elasticsearch.

Would you send me the PDF that crashes your index?
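
One plausible reading of the ~70 MB threshold, not confirmed anywhere in this thread: if the file is sent to Elasticsearch base64-encoded inside the request body, the payload grows by about a third and can exceed the HTTP request size limit. The sketch below estimates that. The 100 MiB limit is an assumption based on the default value of Elasticsearch's `http.max_content_length` setting, and `envelope_bytes` is a hypothetical allowance for the rest of the request.

```python
# Assumption: the default http.max_content_length of 100mb, read here as
# 100 * 1024 * 1024 bytes. Check elasticsearch.yml for the real value.
ASSUMED_LIMIT_BYTES = 100 * 1024 * 1024


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_bytes of data."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def fits_request_limit(raw_bytes: int,
                       limit_bytes: int = ASSUMED_LIMIT_BYTES,
                       envelope_bytes: int = 0) -> bool:
    """True if the encoded file plus the rest of the request stays within the limit.

    envelope_bytes is a hypothetical allowance for the JSON around the file.
    """
    return base64_size(raw_bytes) + envelope_bytes <= limit_bytes


if __name__ == "__main__":
    for size_mib in (50, 70, 80, 170):
        raw = size_mib * 1024 * 1024
        encoded_mib = base64_size(raw) / (1024 * 1024)
        print(f"{size_mib} MiB file -> {encoded_mib:.1f} MiB encoded, "
              f"fits: {fits_request_limit(raw)}")
```

Under these assumptions the cut-off for the raw file lands in the 70 to 75 MB range, which would be consistent with the sizes reported above. It does not explain the `field [content] not present` message itself.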

tomthecat · 2018-10-11T15:39:45Z

@reset | test | index: I did this before twice, no luck.

@pdf: Sure, please check your private mail.

pdf · 2018-10-11T21:10:40Z

@tomthecat using '@'s is how you contact people on Github, please be a little careful with them.

ArtificialOwl · 2018-10-12T06:04:02Z

@tomthecat I haven't received any email. If you host the file, can you send the link to maxence@nextcloud.com?

tomthecat · 2018-10-15T19:47:05Z

daita: Did you receive my mail?

ArtificialOwl · 2018-10-16T11:03:11Z

yup, if we're talking about a 170MB pdf ? :-)

tomthecat · 2018-10-16T12:13:49Z

What do you think: is there a chance to make fulltextsearch skip these files instead of aborting?

ArtificialOwl · 2018-10-19T06:56:18Z

should be fixed in 1.0.2

tomthecat · 2018-10-22T08:55:17Z

Updated to 1.0.2. Seems a bit better now, but:
I still receive the

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException │ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

error on very large PDF files and also on a large PPT.
See PM for a link to these files.

ArtificialOwl · 2018-10-23T12:26:30Z

The error I get is a BadRequest400Exception from Elasticsearch, which is typical for PDF files bigger than 70 MB.

Have you changed anything in the configuration of your ES?

tomthecat · 2018-10-24T07:15:15Z

Nope. I followed the instructions given here: https://fribeiro.org/tech/2018/02/07/nextcloud-full-text-elasticsearch/ and here: https://decatec.de/home-server/volltextsuche-in-nextcloud-mit-ocr/ (for tesseract OCR) without any additional tweaking.

Agraphie · 2019-08-24T07:54:46Z

I get the same error with a PDF of ~3 MB but around 130 pages. I can also send you the pdf if needed.

aiveras · 2020-02-19T07:30:54Z

I have the same error. Is it possible to configure fulltextsearch to skip these PDFs? We have a lot of PDFs bigger than 70 MB...
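
Until the app can skip such files on its own, one workaround is to find the oversized files up front so they can be moved or excluded by hand. The following is a minimal sketch and not part of fulltextsearch: the function name `find_large_files`, the 70 MB threshold, and the extension list are illustrative assumptions, and the data directory path must be supplied by the caller.

```python
import os
import sys


def find_large_files(root, limit_bytes, extensions=(".pdf", ".ppt")):
    """Return (path, size) pairs under root whose size exceeds limit_bytes.

    Only files whose name ends in one of the given extensions are
    considered. The result is sorted with the largest file first.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; ignore it.
                continue
            if size > limit_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Usage: python find_large_files.py /path/to/data/directory
    limit = 70 * 1000 * 1000  # assumed threshold taken from this thread
    for path, size in find_large_files(sys.argv[1], limit):
        print(f"{size / 1_000_000:.1f} MB  {path}")
```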
