Alter handling of huge text files #3097

Open
idodeclare opened this issue Mar 28, 2020 · 2 comments

Comments

@idodeclare
Contributor

Issue #3090 asks whether matches might be excerpted in results from the search API to avoid a performance-killing situation such as returning a line that is a gigabyte in length. There is also the open PR #2732 to convert SearchEngine to use the modern Lucene unified highlighter. With that PR's new HitFormatter, it would be fairly straightforward to refactor to use the same excerpting as applied by LineHighlight for UI search.

Huge text files present additional problems, however, for OpenGrok.

The Lucene uhighlight API ultimately makes it impossible to avoid loading full, indexed source content into memory. In some places the API permits content to be represented as CharSequence, which would allow (with a bit of work) lazy loading of source content into memory. The final formatting via Lucene PassageFormatter, however, is done with a method, format(Passage[] passages, String content), where a String is demanded.

Also keep in mind that Lucene postings have an offset datatype of int, so content past an offset of 2,147,483,647 cannot be indexed for OpenGrok to present context, since OpenGrok chooses to store postings with offsets so that later context presentation does not re-analyze files. (Currently OpenGrok does not limit the number of characters read, which results in issues like #2560. The latest JFlex 1.8.x has revised its yychar to be a long, but Lucene would still have an int limit for offsets.)
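To make the int limit concrete, here is a minimal stdlib-only sketch (not OpenGrok or Lucene code). The class and method names (OffsetGuard, fitsInIntOffset, toPostingOffset) are hypothetical; it only illustrates what happens when a long-valued scanner position is narrowed to an int offset.

```java
final class OffsetGuard {
    /** Largest character offset representable as an int: 2,147,483,647. */
    static final long MAX_OFFSET = Integer.MAX_VALUE;

    /** Returns true if the given position can be stored as an int offset. */
    static boolean fitsInIntOffset(long position) {
        return position >= 0 && position <= MAX_OFFSET;
    }

    /** Narrows a long position to int, failing loudly instead of wrapping. */
    static int toPostingOffset(long position) {
        // Math.toIntExact throws ArithmeticException on overflow.
        return Math.toIntExact(position);
    }

    public static void main(String[] args) {
        long pastLimit = MAX_OFFSET + 1; // 2,147,483,648 as a long
        System.out.println(fitsInIntOffset(MAX_OFFSET)); // true
        System.out.println(fitsInIntOffset(pastLimit));  // false
        // A plain cast wraps around silently, which is why a guard is useful.
        System.out.println((int) pastLimit);             // -2147483648
    }
}
```

A guard of this kind could, under these assumptions, be the point where an indexer decides to stop recording offsets for a file.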

For huge text files, then, I can think of a few possible choices:

  • Allow setting an upper limit of characters to be read from files so that "full, indexed source content" is capped, and continue to use PassageFormatter. This means however that some content from very large files would be missing from the index. (Currently all content from >2GB files is missing from the index.)

or

  • Index the content fully, but do not store postings with offsets, and do not enable any showing of context. OpenGrok would merely be able to report yes or no whether a huge text file was matched by a particular query.

or

  • Break up very large documents into virtual, partial documents (fitting within int, and likely within, say, short, to make the pieces very manageable), fully index the pieces, and allow presenting context for each piece separately.
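As a rough illustration of the third choice, the following stdlib-only sketch computes the character ranges of such virtual pieces and maps a piece-local offset back to a file offset. All names (VirtualPieces, Piece, split, toFileOffset) are hypothetical, and splitting at fixed character counts is an assumption made for brevity; a real implementation would presumably need to split at line or token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class VirtualPieces {
    /** A half-open range [start, end) of character offsets in the original file. */
    static final class Piece {
        final long start;
        final long end;

        Piece(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Piece length; always fits in int because pieceSize is an int. */
        int length() {
            return (int) (end - start);
        }
    }

    /** Splits totalChars characters into consecutive pieces of at most pieceSize. */
    static List<Piece> split(long totalChars, int pieceSize) {
        if (totalChars < 0 || pieceSize <= 0) {
            throw new IllegalArgumentException("totalChars >= 0 and pieceSize > 0 required");
        }
        List<Piece> pieces = new ArrayList<>();
        for (long start = 0; start < totalChars; start += pieceSize) {
            pieces.add(new Piece(start, Math.min(start + pieceSize, totalChars)));
        }
        return pieces;
    }

    /** Maps an offset within a piece back to an offset in the whole file. */
    static long toFileOffset(Piece piece, int localOffset) {
        return piece.start + localOffset;
    }

    public static void main(String[] args) {
        // 100,000 characters in pieces of at most Short.MAX_VALUE (32,767).
        List<Piece> pieces = split(100_000L, Short.MAX_VALUE);
        System.out.println(pieces.size());          // 4
        System.out.println(pieces.get(3).length()); // 1699
    }
}
```

Each piece keeps offsets small enough for int (or even short) storage, while toFileOffset recovers the position in the original file when context is presented.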

I generally think the second option might be satisfactory. Is there truly much utility to excerpting from a 1GB JSON file? What does "context" mean within such a file? I don't expect realizing that option would be too difficult. I suppose it could be done by reclassifying huge Genre.PLAIN files as Genre.DATA, while still using the plain-text analyzer and, where applicable, a language-specific symbol tokenizer, and also avoiding XREF generation (by virtue of being Genre.DATA).
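The reclassification idea might be sketched as follows. This is not OpenGrok's code: the Genre enum here is a two-value stand-in, and the threshold constant and method names are hypothetical (the issue does not specify a size limit).

```java
final class GenreDowngrade {
    /** Stand-in enum with only the two genres discussed above. */
    enum Genre { PLAIN, DATA }

    /** Hypothetical threshold for "huge"; 1 GiB is chosen only for illustration. */
    static final long HUGE_TEXT_THRESHOLD_BYTES = 1L << 30;

    /** Downgrades huge PLAIN files to DATA; all other inputs are unchanged. */
    static Genre effectiveGenre(Genre detected, long fileSizeBytes) {
        if (detected == Genre.PLAIN && fileSizeBytes >= HUGE_TEXT_THRESHOLD_BYTES) {
            return Genre.DATA;
        }
        return detected;
    }

    /** In this sketch, XREF generation is skipped for DATA files. */
    static boolean generatesXref(Genre genre) {
        return genre != Genre.DATA;
    }

    public static void main(String[] args) {
        Genre small = effectiveGenre(Genre.PLAIN, 4_096L);
        Genre huge = effectiveGenre(Genre.PLAIN, 2L << 30);
        System.out.println(small + " xref=" + generatesXref(small)); // PLAIN xref=true
        System.out.println(huge + " xref=" + generatesXref(huge));   // DATA xref=false
    }
}
```

The choice of analyzer and tokenizer is deliberately left out of the sketch, since the proposal is that those stay the same and only the genre changes.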

@tarzanek
Contributor

I vote for the second option too. Such huge files are no good for humans anyway (humans would filter them anyhow), so why bother with them? There is no gain and the use case is very narrow; such files should generally just get out of the way (exactly as stated in #1646 (comment)).
This is a source code engine, not heuristics, so we need to keep our focus in mind.
Big files are data, not source code (so another option is to have a size limit on source code analyzers and degrade the analyzer if the limit is hit).
So I agree with the PLAIN -> DATA option too.

@idodeclare
Contributor Author

OK, I'm glad to get agreement.

idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Apr 17, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Apr 18, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Apr 18, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 10, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Aug 20, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 7, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 9, 2020
@vladak vladak added the indexer label Mar 29, 2021