IllegalArgumentException when adding a file longer than 2 GiB #2560

Open
Shooter3k opened this issue Nov 28, 2018 · 13 comments
Comments


Shooter3k commented Nov 28, 2018

We've been using OpenGrok for many years, and I've come to accept the "warning" messages that show up in the logs. That said, at what point do they become bugs, or things you want to know about and that I should report?

Examples:

2018-11-28 11:40:16.756  WARNING [org.opengrok.indexer.index] - ERROR addFile(): /source/fakefolder Use Data Extract/trunk/docs/etg_mbr.txt.gz
java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=2147483647,endOffset=-2147483611

2018-11-28 11:36:15.474  WARNING [org.opengrok.indexer.util] - Non-zero exit status -1 from command [/usr/bin/svn, log, --non-interactive, --xml, -v, /source/fakefolder] in directory /n/source/fakefolder

2018-11-28 11:36:15.475  WARNING [org.opengrok.indexer.history] - An error occurred while creating cache for /source/fakefolder(SubversionRepository)
org.opengrok.indexer.history.HistoryException: Failed to get history for: "/source/fakefolder" Exit code: -1

2018-11-28 11:36:15.474   SEVERE [org.opengrok.indexer.util] - Failed to read from process: /usr/bin/svn
java.io.IOException: An error occurred while parsing the xml output
	at org.opengrok.indexer.history.SubversionHistoryParser.processStream(SubversionHistoryParser.java:195)

2018-11-28 11:22:59.605  WARNING [org.opengrok.indexer.util] - Non-zero exit status 1 from command [/usr/bin/svn, log, --non-interactive, --xml, -v, -l1, /source/fakefolder@] in directory /source/fakefolder

vladak commented Nov 29, 2018

In general, it never hurts to submit an issue for a problem; however, be prepared to do your homework with respect to the investigation.

The Subversion errors might be a local problem.

The IllegalArgumentException could be a bug. What version are you running at the moment? Could you share the contents of the file?

@tulinkry (Contributor) commented

For the svn errors, you can go to the directory /n/source/fakefolder and run the command /usr/bin/svn log --non-interactive --xml -v /source/fakefolder to see what went wrong.

@Shooter3k (Author) commented

Thanks for the suggestion. It seems we're getting authentication errors, most likely because I'm running the OpenGrok indexer under an account that does not have access to the SVN repository.

> For the svn errors, you can go to the directory /n/source/fakefolder and run the command /usr/bin/svn log --non-interactive --xml -v /source/fakefolder to see what went wrong.


vladak commented Dec 5, 2018

There is a (clunky) way to pass a username/password to the Subversion process: the OPENGROK_SUBVERSION_USERNAME / OPENGROK_SUBVERSION_PASSWORD environment variables.


vladak commented Dec 5, 2018

Can you try to get more info about the IllegalArgumentException problem? (i.e. share the contents of the file that seems to cause it)

vladak changed the title from "When should someone report issues?" to "IllegalArgumentException when adding a file" Dec 5, 2018
Shooter3k reopened this Dec 7, 2018

Shooter3k commented Dec 7, 2018

(FYI: I accidentally closed and reopened this issue)

The file contains confidential company information, but it's a 3 GB text file inside a 300 MB .gz file. My assumption is that the size of the file is causing the issue.

Is there anything I could check without sharing the actual file itself?

> Can you try to get more info about the IllegalArgumentException problem? (i.e. share the contents of the file that seems to cause it)


vladak commented Dec 12, 2018

Is there a stack trace in the log associated with the IllegalArgumentException?

The logger should record one, I believe, since it is invoked like this from IndexDatabase#indexParallel():

try {
    if (alreadyClosedCounter.get() > 0) {
        ret = false;
    } else {
        pctags = ctagsPool.get();
        addFile(x.file, x.path, pctags);
        successCounter.incrementAndGet();
        ret = true;
    }
...
} catch (RuntimeException|IOException e) {
    String errmsg = String.format("ERROR addFile(): %s",
        x.file);
    LOGGER.log(Level.WARNING, errmsg, e);
    x.exception = e;
    ret = false;
} finally {

This is because IllegalArgumentException extends RuntimeException.
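That catch behavior can be sketched in isolation. This is a minimal, hypothetical stand-in (indexOne() is not a real OpenGrok method) showing why the RuntimeException clause also captures an IllegalArgumentException:

```java
public class CatchDemo {
    // Hypothetical stand-in for the addFile() call path:
    // IllegalArgumentException extends RuntimeException, so the
    // RuntimeException catch clause captures it.
    static String indexOne(boolean failLikeLucene) {
        try {
            if (failLikeLucene) {
                // Simulates Lucene rejecting an overflowed token offset.
                throw new IllegalArgumentException("startOffset must be non-negative");
            }
            return "indexed";
        } catch (RuntimeException e) {
            // Mirrors the WARNING logged by indexParallel().
            return "ERROR addFile(): " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOne(true));
    }
}
```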

The exception likely comes from one of the analyzers - addFile() calls AnalyzerGuru#populateDocument(), which performs:

if (fa != null) {
    Genre g = fa.getGenre();
    if (g == Genre.PLAIN || g == Genre.XREFABLE || g == Genre.HTML) {
        doc.add(new Field(QueryBuilder.T, g.typeName(), string_ft_stored_nanalyzed_norms));
    }
    fa.analyze(doc, StreamSource.fromFile(file), xrefOut);

In your case it could be the GZIPAnalyzer, or the analyzer for the contents therein.


vladak commented Dec 12, 2018

Also, it may be worth trying to bisect the original file (assuming the exception is caused by the contents rather than the compressed image) to see if you can find the spot that triggers the problem.


Shooter3k commented Dec 12, 2018

Unfortunately, someone checked a 300 MB compressed (3 GB uncompressed) text file like this into our repo, which should never have happened. I have no desire to get OpenGrok to index the file, but if you need me to debug it for future development, I will. I was planning to either ignore the file or delete it.

Here is the stack trace.

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=2147483647,endOffset=-2147483611
	at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:113)
	at org.opengrok.indexer.analysis.JFlexTokenizer.setAttribs(JFlexTokenizer.java:133)
	at org.opengrok.indexer.analysis.JFlexTokenizer.symbolMatched(JFlexTokenizer.java:108)
	at org.opengrok.indexer.analysis.JFlexSymbolMatcher.onSymbolMatched(JFlexSymbolMatcher.java:102)
	at org.opengrok.indexer.analysis.plain.PlainFullTokenizer.yylex(PlainFullTokenizer.java:726)
	at org.opengrok.indexer.analysis.JFlexTokenizer.incrementToken(JFlexTokenizer.java:98)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:787)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1609)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1228)
	at org.opengrok.indexer.index.IndexDatabase.addFile(IndexDatabase.java:785)
	at org.opengrok.indexer.index.IndexDatabase.lambda$null$1(IndexDatabase.java:1186)
	at java.base/java.util.stream.Collectors.lambda$groupingByConcurrent$59(Unknown Source)
	at java.base/java.util.stream.ReferencePipeline.lambda$collect$1(Unknown Source)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(Unknown Source)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown Source)
	at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)
	at java.base/java.util.stream.ForEachOps$ForEachTask.compute(Unknown Source)
	at java.base/java.util.concurrent.CountedCompleter.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.localPopAndExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)


vladak commented Dec 14, 2018

2147483647 is 2^31 - 1, i.e. one byte short of 2 GiB. The endOffset of -2147483611 is a wrapped signed 32-bit value: adding 2^32 gives 2147483685, which is 2^31 + 37, i.e. 37 bytes past 2 GiB. So this is probably an overflow of a signed 32-bit offset by 37 bytes, possibly caused by a huge token processed by PlainFullTokenizer.
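The arithmetic can be checked directly. A minimal sketch showing how a true end offset of 2^31 + 37 wraps to the negative value seen in the log when narrowed to int:

```java
public class OffsetOverflow {
    public static void main(String[] args) {
        long realEnd = 2147483648L + 37; // the true end offset: 2 GiB + 37 bytes
        int wrapped = (int) realEnd;     // narrowing to a signed 32-bit int wraps
        System.out.println(wrapped);     // prints -2147483611, matching the log
    }
}
```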

If you run an OpenGrok version before 1.1-rc80, chances are you are hitting the issue fixed in changeset 3e49081 - normally the Java lexer classes generated from the .lex descriptions should be changed so they do not accept overly long tokens.

vladak added the bug label and removed the question label Dec 14, 2018

vladak commented Dec 14, 2018

Or perhaps this is actually a bug triggered by a file size greater than 2 GiB.

src/main/java/org/opengrok/indexer/analysis/JFlexTokenizer.java uses int for the token offsets within the file:

/**
 * Clears, and then resets the instances attributes per the specified
 * arguments.
 * @param str the matched symbol
 * @param start the match start position
 * @param end the match end position
 */
protected void setAttribs(String str, int start, int end) {
    clearAttributes();
    //FIXME increasing below by one(default) might be tricky, need more analysis
    // after lucene upgrade to 3.5 below is most probably not even needed
    this.posIncrAtt.setPositionIncrement(1);
    this.termAtt.setEmpty();
    this.termAtt.append(str);
    this.offsetAtt.setOffset(start, end);
}

The trouble starts in src/main/java/org/opengrok/indexer/analysis/SymbolMatchedEvent:

private final int start;
private final int end;

and bubbles up to src/main/java/org/opengrok/indexer/analysis/JFlexSymbolMatcher.java.

The trouble is that JFlex's yychar is int.

vladak changed the title from "IllegalArgumentException when adding a file" to "IllegalArgumentException when adding a file longer than 2 GiB" Dec 14, 2018

vladak commented Dec 14, 2018

In the meantime we could limit the maximum file size to 2 GiB. Maybe it's time to revisit #534.


vladak commented Dec 14, 2018

Actually, limiting on input file size cannot work, given how the GZip analyzer works - it is based on streams, so the uncompressed size is not known before the content is read.
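The stream-based point can be illustrated with java.util.zip: a gzip stream does not reliably expose its uncompressed size up front, so the only dependable way to learn it is to read the whole stream (a minimal sketch; the class and method names are illustrative, not OpenGrok code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSize {
    // Counts uncompressed bytes by reading the stream to the end; there is
    // no reliable way to know this size before decompressing.
    static long uncompressedSize(byte[] gz) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            long n = 0;
            for (int r; (r = in.read(buf)) != -1; ) {
                n += r;
            }
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(new byte[100_000]); // 100 kB of zeros compresses very small
        }
        byte[] gz = bos.toByteArray();
        System.out.println(gz.length + " bytes compressed, "
                + uncompressedSize(gz) + " bytes uncompressed");
    }
}
```

This is exactly the shape of the problem here: a 300 MB .gz member can silently expand to 3 GB, so any size check on the input file alone cannot prevent the analyzer from seeing more than 2 GiB of content.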

idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Apr 18, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 10, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Aug 20, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 7, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 9, 2020
vladak added the indexer label Jun 2, 2022