Is there a way to skip large files while indexing ? #1646

Closed
ChristopheBordieu opened this issue Jun 30, 2017 · 12 comments
@ChristopheBordieu

ChristopheBordieu commented Jun 30, 2017

Hello,

Not a bug. Just a question.
When indexing, is there a way to skip large files whose size is greater than X kB (or MB or GB)?

@vladak
Member

vladak commented Jun 30, 2017

There is an enhancement #534 filed to track this. As a workaround, specify the files/pattern as ignored (-i).
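One way to make the `-i` workaround less manual is to generate the ignore list from file sizes before each indexer run. The following is only a sketch under stated assumptions: `find_large_files` and `ignore_args` are hypothetical helper names, and it assumes the indexer accepts one `-i <file name>` pair per ignored file. How the indexer actually matches the value (by file name or by path) is not verified here.

```python
import os
from pathlib import Path


def find_large_files(root, max_bytes):
    """Return paths under root (relative, sorted) whose size exceeds max_bytes."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so that a dangling link does not raise on stat().
            if path.is_symlink():
                continue
            if path.stat().st_size > max_bytes:
                large.append(path.relative_to(root))
    return sorted(large)


def ignore_args(paths):
    """Build '-i <file name>' pairs, one per distinct file name.

    Assumption: the indexer takes a plain file name as an ignore pattern.
    """
    args = []
    for name in sorted({p.name for p in paths}):
        args += ["-i", name]
    return args
```

Because this sketch ignores by file name only, a small file that shares its name with a large one elsewhere in the tree would be ignored too.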

@vladak vladak closed this as completed Jun 30, 2017
@ChristopheBordieu
Author

OK. Thanks for the feedback.
I cannot use the -i workaround. I need something completely automatic, not set via configuration.

@tarzanek
Contributor

tarzanek commented Jul 2, 2017

@ChristopheBordieu FWIW, the 1.0 release fixes most of the big-file problems (ctags parsing is completely fixed, and for some languages we limit the length of tokens for the parser: https://github.com/OpenGrok/OpenGrok/blob/master/build.xml#L270).
I had hoped to add more languages to the token-size limit if needed, but I need to know which other languages have problems with long tokens (and overflow). If you have a list of problematic files, or can possibly generate some stats (which file extension/analyzer fails), it would help tremendously. Please file a new bug with this list, and I will make sure the respective analyzers get fixed ASAP.
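The per-extension stats requested above could be produced along these lines. This is a minimal sketch: `extension_stats` is a hypothetical helper, and it assumes the problematic file paths have already been collected by hand (for example from the indexer logs); no particular log format is assumed.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_stats(paths):
    """Count problematic files per lower-cased extension, most frequent first."""
    counts = Counter()
    for p in paths:
        # Files without an extension (e.g. Makefile) are grouped together.
        suffix = PurePosixPath(p).suffix.lower() or "(none)"
        counts[suffix] += 1
    return counts.most_common()
```

The resulting list of `(extension, count)` pairs could then be pasted into the new bug report.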

@ChristopheBordieu
Author

Hi @tarzanek
I am following all the issues you manage. Actually, the problem is not OpenGrok but my users :-D ! They put some very big files in their git repos. Use cases can be discussed forever... So, like Bitbucket does when indexing, I would be interested in a way to skip any file, whatever its type, larger than X kB.
Although the homepage displays poorly on our instance (cf. my ticket 1478), I will put 1.1-rc5 in prod this week because of the newer Lucene, the file parser fix and plenty of other useful fixes... Then, if I encounter parsing issues in my logfiles, I will report them for sure.
Thanks!

@tarzanek
Contributor

tarzanek commented Jul 3, 2017

So... it could be a good improvement.
We could perhaps add a simple option to the indexer that sets this limit... care to write up a patch?
I am willing to give pointers and guidance.

@vladak
Member

vladak commented Jul 4, 2017

As I wrote in #534 this is not as simple as it looks.

@ChristopheBordieu
Author

I do not know Java... so do not wait on me for a patch :-) And it is not simple!

@jhaber

jhaber commented Apr 14, 2020

Just wanted to chime in that we're seeing slow search performance after upgrading from a very old version of OpenGrok. We believe that it's caused by a few large JSON files (5-10MB). For certain terms, we're seeing search take a very long time or time out. However, when we filter out JSON files from the search it returns very quickly. We'll try to find a workaround in the meantime.

EDIT: We're on OpenGrok 1.3.8

@vladak
Member

vladak commented Apr 14, 2020

Incidentally, file processing times could be part of the statistics (#579).

@jhaber

jhaber commented Apr 14, 2020

In our case we don't care much about processing times since it's offline (up to a certain point of course 😄 ). However we care a lot about search speed since our engineers rely on interactive searching of OpenGrok and want it to be fast.

It seems like these big JSON files are causing searches to be at least an order of magnitude slower. Using a file path or file type filter to exclude JSON files makes searches snappy again.

We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?

In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.

@idodeclare
Contributor

> We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?
>
> In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.

My guess is the Lucene Unified Highlighter, which forces the full source to be read into memory. That will still be active even if you use PlainAnalyzer for JSON. See my write-up in #3097.

@jhaber

jhaber commented Apr 15, 2020

> > We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?
> > In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.
>
> My guess is the Lucene Unified Highlighter, which forces the full source to be read into memory. That will still be active even if you use PlainAnalyzer for JSON. See my write-up in #3097.

Seems like you were exactly right. Switching to PlainAnalyzer made no discernible improvement, but updating our indexing pipeline to skip JSON files >100KB took searches from 8 seconds to 30 milliseconds.
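The thread does not say how the pipeline skips these files. One possible approach, sketched here purely as an assumption, is to stage a copy of the source tree that leaves out oversized JSON files and point the indexer at the staged copy. `should_index` and `stage_tree` are hypothetical names, and 100KB is taken to mean 100 * 1024 bytes.

```python
import shutil
from pathlib import Path


def should_index(path, max_bytes=100 * 1024, suffixes=(".json",)):
    """Return False for files with a listed suffix that exceed max_bytes."""
    if path.suffix.lower() not in suffixes:
        return True
    return path.stat().st_size <= max_bytes


def stage_tree(src, dst, **limits):
    """Copy src into dst, leaving out files rejected by should_index.

    Returns the relative paths that were skipped.
    """
    src, dst = Path(src), Path(dst)
    skipped = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        if should_index(path, **limits):
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
        else:
            skipped.append(rel)
    return skipped
```

Copying doubles the disk usage of the source tree; the same `should_index` check could instead feed an ignore list if the indexer configuration allows it.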

idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Apr 18, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 10, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue May 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Aug 20, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Sep 27, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 6, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 7, 2020
idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 9, 2020