
Lucene exception while adding file: Document contains at least one immense term in field="full" #2130

Open
wizwin opened this issue May 29, 2018 · 15 comments


@wizwin

wizwin commented May 29, 2018

May 29, 2018 10:02:41 AM org.opensolaris.opengrok.index.IndexDatabase lambda$null$1
WARNING: ERROR addFile(): /external/icu/icu4c/source/data/coll/zh.txt
java.lang.IllegalArgumentException: Document contains at least one immense term in field="full" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[-27, -123, -103, -27, -123, -101, -27, -123, -98, -27, -123, -99, -27, -123, -95, -27, -123, -93, -27, -105, -89, -25, -109, -87, -25, -77, -114, -28, -72, -128]...', original message: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:240)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:496)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1729)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
	at org.opensolaris.opengrok.index.IndexDatabase.addFile(IndexDatabase.java:732)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1049)
	at java.util.stream.Collectors.lambda$groupingByConcurrent$51(Collectors.java:1070)
	at java.util.stream.ReferencePipeline.lambda$collect$1(ReferencePipeline.java:496)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
	at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
	at org.opensolaris.opengrok.index.IndexDatabase.lambda$indexParallel$2(IndexDatabase.java:1038)
	at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1424)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180
	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:263)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
	... 33 more
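For context, Lucene hard-caps a single indexed term at 32766 UTF-8 bytes (the `BytesRefHash` limit visible in the trace). A minimal sketch of what an analyzer-side fix has to do: truncate the term's UTF-8 encoding to the cap without splitting a multi-byte character. `TermCap` and `capUtf8` are illustrative names, not OpenGrok or Lucene API:

```java
import java.nio.charset.StandardCharsets;

public class TermCap {
    // Lucene's hard per-term ceiling, in UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    // Truncate a term so its UTF-8 encoding fits the cap, backing up
    // past continuation bytes (10xxxxxx) so no character is split.
    static String capUtf8(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= MAX_TERM_BYTES) {
            return term;
        }
        int end = MAX_TERM_BYTES;
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;  // this byte is mid-character; move the cut left
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A synthetic immense term: 13060 CJK characters at 3 UTF-8
        // bytes each gives exactly the 39180 bytes seen in the log.
        String immense = "\u5159".repeat(13060);
        String capped = capUtf8(immense);
        System.out.println(capped.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Lucene rejects the whole document's oversized term rather than silently trimming it, which is why the analyzer (or a token filter in front of it) is the right place to enforce this.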
@vladak
Member

vladak commented May 29, 2018 via email

@wizwin
Author

wizwin commented May 30, 2018

@idodeclare
Contributor

This is fixed by PR #2104, which caps the maximum length of an indexed token (or else skips it entirely) while allowing other (eligible) tokens in a file to be handled.
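The two policies described above, cap an oversized token or skip it while still indexing the file's other tokens, can be sketched roughly like this. This is a hypothetical helper over plain strings, not the actual PR #2104 code:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ImmenseTermPolicy {
    // Lucene's per-term ceiling, in UTF-8 bytes.
    static final int MAX_TERM_BYTES = 32766;

    // Keep every eligible token; either cap or drop any token whose
    // UTF-8 encoding exceeds the limit, so one bad token no longer
    // prevents the rest of the file from being indexed.
    static List<String> apply(List<String> tokens, boolean capInsteadOfSkip) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES) {
                kept.add(token);
            } else if (capInsteadOfSkip) {
                // Conservative character cut: a UTF-8 sequence is at most
                // 4 bytes, so this length always fits under the byte limit.
                int cut = Math.min(token.length(), MAX_TERM_BYTES / 4);
                if (Character.isHighSurrogate(token.charAt(cut - 1))) {
                    cut--;  // do not split a surrogate pair
                }
                kept.add(token.substring(0, cut));
            }
            // else: skip the immense token entirely
        }
        return kept;
    }
}
```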

@tarzanek
Contributor

tarzanek commented Jun 1, 2018

There is also another fix for this, but it is only enabled at the JFlex layer for a few analyzers; I guess we should enable it for the plain analyzer, too.

@tarzanek tarzanek added the bug label Jun 1, 2018
@xiaopao2014

Is this issue fixed? I tried with version 1.5.12 and still got this issue.

@xiaopao2014

command:
opengrok-indexer -J=-Djava.util.logging.config.file=/home/llbeing/opengrok/etc/logging.properties -J=-Xmx8g -a /home/llbeing/opengrok/dist/lib/opengrok.jar -- -c /usr/local/bin/ctags -s /home/llbeing/opengrok_source -d /home/llbeing/opengrok/data -H -P -S -G -W /home/llbeing/opengrok/etc/configuration.xml -U http://localhost:8080/source > ./logout.log

logFile: https://drive.google.com/file/d/171_XDJg0etm7eRDVnF2PBEzAw0EcY4Aw/view?usp=sharing

problem file: https://drive.google.com/file/d/1FlJocecYxNBmMoXF-v9T7oQgMZ83Tzx4/view?usp=sharing

@vladak
Member

vladak commented Mar 4, 2021

Attaching the files here.
valid_utf16.txt
opengrok_index_fail_log.log

@vladak
Member

vladak commented Mar 4, 2021

If the file really contains UTF-16, I wonder whether this conflicts with the UTF-8 used internally in the indexer.
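The limit applies to the UTF-8 encoding Lucene uses for terms, regardless of the file's on-disk encoding, so text that looks modest in UTF-16 grows once re-encoded. A quick illustration; the character count is chosen only to match the 39180 bytes from the log, not taken from the actual file:

```java
import java.nio.charset.StandardCharsets;

public class EncodingGrowth {
    public static void main(String[] args) {
        // CJK characters take 2 bytes each in UTF-16 but 3 in UTF-8,
        // so the same term is half again as large after re-encoding.
        String term = "\u5159".repeat(13060);
        System.out.println(term.getBytes(StandardCharsets.UTF_16BE).length); // 26120
        System.out.println(term.getBytes(StandardCharsets.UTF_8).length);    // 39180
    }
}
```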

@xiaopao2014

But from the log it looks like a term length issue.
It would be acceptable to me if opengrok-indexer succeeded while skipping these files.

Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 39180

@GeoffreyLu

Exactly the same issue when creating an index on Android-11.0.0_r8.

Source file: external/icu/icu4c/source/data/coll/zh.txt
Opengrok Rel: 1.5.11
OS: Ubuntu 16.04.7 LTS

BTW, issue #2211 and #2826 are also observed in the log.

@vladak vladak changed the title Lucene exception while adding file Lucene exception while adding file: Document contains at least one immense term in field="full" Jun 6, 2022
@vladak vladak added the indexer label Jun 6, 2022
@hhhaiai

hhhaiai commented Jun 6, 2022

How do I fix it?

@vladak
Member

vladak commented Jun 6, 2022

> How do I fix it?

Someone needs to resurrect the PR mentioned in #2130 (comment) and get it agreed upon.

@hhhaiai

hhhaiai commented Jun 8, 2022

oho~~~

@oliver-ap

Hi,
I have the exact same issue trying to reindex the same Android version, running OpenGrok 1.7.2.

Has someone found a solution?

@vladak
Member

vladak commented Jul 19, 2022

The solution is to settle on an agreeable fix in OpenGrok and implement it - see #2130 (comment)

8 participants