
Lucene exception while adding file: Document contains at least one immense term in field="full" #2130

Open
wizwin opened this issue May 29, 2018 · 15 comments


@wizwin

wizwin commented May 29, 2018

May 29, 2018 10:02:41 AM org.opensolaris.opengrok.index.IndexDatabase lambda$null$1
WARNING: ERROR addFile(): /external/icu/icu4c/source/data/coll/zh.txt
java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:240)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:496)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1729)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
	at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:732)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1049)
	at java.util.stream.Collectors.lambda$groupingByConcurrent$51(Collectors.java:1070)
	at java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:496)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
	at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$indexParallel$2(IndexDatabase.java:1038)
	at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1424)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
	... 33 more
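For context, Lucene hard-caps a single indexed term at 32766 UTF-8 bytes (the `BytesRefHash` limit visible in the trace). A minimal sketch of what an analyzer-side fix has to do: truncate the term's UTF-8 encoding to the cap without splitting a multi-byte character. `TermCap` and `capUtf8` are illustrative names, not OpenGrok or Lucene API:

```java
import java.nio.charset.StandardCharsets;

public class TermCap {
    // Lucene's hard per-term ceiling, in UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    // Truncate a term so its UTF-8 encoding fits the cap, backing up
    // past continuation bytes (10xxxxxx) so no character is split.
    static String capUtf8(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= MAX_TERM_BYTES) {
            return term;
        }
        int end = MAX_TERM_BYTES;
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;  // this byte is mid-character; move the cut left
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A synthetic immense term: 13060 CJK characters at 3 UTF-8
        // bytes each gives exactly the 39180 bytes seen in the log.
        String immense = "\u5159".repeat(13060);
        String capped = capUtf8(immense);
        System.out.println(capped.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Lucene rejects the whole document's oversized term rather than silently trimming it, which is why the analyzer (or a token filter in front of it) is the right place to enforce this.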
@vladak
Member

vladak commented May 29, 2018 via email

@wizwin
Author

wizwin commented May 30, 2018

@idodeclare
Contributor

This is fixed by PR #2104, which caps the maximum length of an indexed token (or else skips it entirely) while allowing other (eligible) tokens in a file to be handled.
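The two policies described above, cap an oversized token or skip it while still indexing the file's other tokens, can be sketched roughly like this. This is a hypothetical helper over plain strings, not the actual PR #2104 code:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ImmenseTermPolicy {
    // Lucene's per-term ceiling, in UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    // Keep every eligible token; either cap or drop any token whose
    // UTF-8 encoding exceeds the limit, so one bad token no longer
    // prevents the rest of the file from being indexed.
    static List<String> apply(List<String> tokens, boolean capInsteadOfSkip) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES) {
                kept.add(token);
            } else if (capInsteadOfSkip) {
                // Conservative character cut: a UTF-8 sequence is at most
                // 4 bytes, so this length always fits under the byte limit.
                int cut = Math.min(token.length(), MAX_TERM_BYTES / 4);
                if (Character.isHighSurrogate(token.charAt(cut - 1))) {
                    cut--;  // do not split a surrogate pair
                }
                kept.add(token.substring(0, cut));
            }
            // else: skip the immense token entirely
        }
        return kept;
    }
}
```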

@tarzanek
Contributor

tarzanek commented Jun 1, 2018

There is also another fix for this, but it is only enabled at the JFlex layer for a few analyzers; I guess we should enable it for the plain analyzer, too.

@tarzanek tarzanek added the bug label Jun 1, 2018
@xiaopao2014

Is this issue fixed? I tried with version 1.5.12 and still got this issue.

@xiaopao2014

command:
opengrok-indexer -J=-Djava.util.logging.config.file=/home/llbeing/opengrok/etc/logging.properties -J=-Xmx8g -a /home/llbeing/opengrok/dist/lib/opengrok.jar -- -c /usr/local/bin/ctags -s /home/llbeing/opengrok_source -d /home/llbeing/opengrok/data -H -P -S -G -W /home/llbeing/opengrok/etc/configuration.xml -U http://localhost:8080/source > ./logout.log

logFile: https://drive.google.com/file/d/171_XDJg0etm7eRDVnF2PBEzAw0EcY4Aw/view?usp=sharing

problem file: https://drive.google.com/file/d/1FlJocecYxNBmMoXF-v9T7oQgMZ83Tzx4/view?usp=sharing

@vladak
Member

vladak commented Mar 4, 2021

Attaching the files here.
valid_utf16.txt
opengrok_index_fail_log.log

@vladak
Member

vladak commented Mar 4, 2021

If the file really contains UTF-16, I wonder whether this conflicts with the UTF-8 used internally in the indexer.
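The limit applies to the UTF-8 encoding Lucene uses for terms, regardless of the file's on-disk encoding, so text that looks modest in UTF-16 grows once re-encoded. A quick illustration; the character count is chosen only to match the 39180 bytes from the log, not taken from the actual file:

```java
import java.nio.charset.StandardCharsets;

public class EncodingGrowth {
    public static void main(String[] args) {
        // CJK characters take 2 bytes each in UTF-16 but 3 in UTF-8,
        // so the same term is half again as large after re-encoding.
        String term = "\u5159".repeat(13060);
        System.out.println(term.getBytes(StandardCharsets.UTF_16BE).length); // 26120
        System.out.println(term.getBytes(StandardCharsets.UTF_8).length);    // 39180
    }
}
```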

@xiaopao2014

But from the log it looks like a term length issue.
It would be acceptable to me if opengrok-indexer succeeded while skipping these files.

Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180

@GeoffreyLu

Exactly the same issue when creating an index on Android-11.0.0_r8.

Source file: external/icu/icu4c/source/data/coll/zh.txt
Opengrok Rel: 1.5.11
OS: Ubuntu 16.04.7 LTS

BTW, issue #2211 and #2826 are also observed in the log.

@vladak vladak changed the title Lucene exception while adding file Lucene exception while adding file: Document contains at least one immense term in field="full" Jun 6, 2022
@vladak vladak added the indexer label Jun 6, 2022
@hhhaiai

hhhaiai commented Jun 6, 2022

How do I fix it?

@vladak
Member

vladak commented Jun 6, 2022

> How do I fix it?

Someone needs to resurrect the PR mentioned in #2130 (comment) and get it agreed upon.

@hhhaiai

hhhaiai commented Jun 8, 2022

oho~~~

@oliver-ap

Hi,
I have the exact same issue trying to reindex the same Android version, running OpenGrok 1.7.2.

Has someone found a solution?

@vladak
Member

vladak commented Jul 19, 2022

The solution is to settle on an agreeable fix in OpenGrok and implement it - see #2130 (comment)

8 participants