Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make index database "optimization" explicit operation #3983

Merged
merged 17 commits into from
Jul 25, 2022

Conversation

vladak
Copy link
Member

@vladak vladak commented Jul 5, 2022

This change removes the automatic index "optimization" (a.k.a. reduction of segment count for given index to 1) at the end of the indexing. The outcome is reduced indexing time - indexing linux kernel repo with history off takes 06:49 minutes, out of which 50 seconds is the "optimization" step; with the step off, the indexing time is 6:03 minutes, which means 12 % improvement. On the other hand, it might theoretically slow down the search a bit and also the index size might increase - in multi segment index, each segment typically contains deleted documents. These are expunged on segment merge which happens during indexing. Comparing the size of the index of the linux kernel repository without and with the changes, there is no significant difference - in both cases the index directory had 1.5 GB (the real difference was some 45 MB) and there were 16 segments at the end of the indexing in the changed case. The number of the segments is not the same for each reindex of the same data due to the parallel nature of the indexing.

While this might seem as a overturn to the less optimal handling, it just normalizes the situation - Lucene documents consider the reduction to single segment to be extreme and useful only for long term index archival. The other alternative would be to reduce the segments to some other number however that would be highly dependent on given index. To address the search latency, I plan to submit a PR that will create IndexSearcher objects with a dedicated executors so that index segments can be searched in parallel.

I chose a bit of a more aggressive approach - instead of setting the default to false, this is now explicit operation invoked via the --reduceSegmentCount indexer option. Still softer approach than ripping this functionality out altogether. The limitation of the programming of this option's handling is that when used together with index database update, it reduces the segment count for all index databases, not just those that changed. It can be run as a standalone operation when combined with the -n / --noIndex option.

That said, I think the reduceSegmentCount functionality should be eventually replaced by tunables to control the Lucene segment merge policy type and its parameters.

@vladak vladak added the indexer label Jul 5, 2022
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 5, 2022
@vladak vladak changed the title Optimize off make index database "optimization" explicit operation Jul 5, 2022
@vladak vladak marked this pull request as draft July 5, 2022 20:35
@vladak vladak marked this pull request as ready for review July 5, 2022 20:51
@vladak vladak marked this pull request as draft July 6, 2022 20:20
@vladak vladak marked this pull request as ready for review July 7, 2022 20:18
@vladak
Copy link
Member Author

vladak commented Jul 7, 2022

After running the tests without the segment count reduction, ProjectsControllerTest#testListFiles() started failing. The cause was IndexDatabase#getFiles() used a weird way how to interact with iterator, in this case TermsEnum object. Normally, one uses iterator in Java like this:

Iterator<T> iterator = SomeType.iterator();
 
while (iterator.hasNext()) {
        //  Moving cursor to next element
        T data = iterator.next();
        ... // process data
}

or something like:

Iterator<T> iterator = SomeType.iterator();
 
T data;
while ((data = iterator.next()) != null) {
        ...  // process data
}

however in this case the pattern was:

ireader = DirectoryReader.open(indexDirectory); // open existing index
if (ireader.numDocs() > 0) {
terms = MultiTerms.getTerms(ireader, QueryBuilder.U);
iter = terms.iterator(); // init uid iterator
}
while (iter != null && iter.term() != null) {
String value = iter.term().utf8ToString();
if (value.isEmpty()) {
iter.next();
continue;
}
files.add(Util.uid2url(value));
BytesRef next = iter.next();
if (next == null) {
iter = null;
}
}

For some reason, when an index has multiple segments, the first call to iter.term() is null, even though iter.next() returns valid object. Rewriting the getFiles() method to use the iterator properly, fixes the test case failure. Same goes for IndexDatabase#listTokens().

Now, IndexDatabase contains various methods used via update() that also follow this pattern, notably processFile(), processFileIncremental and processTrailingTerms(). It works there because of this code:

uidIter = terms.iterator();
TermsEnum.SeekStatus stat = uidIter.seekCeil(new BytesRef(startUid)); //init uid

After the seekCeil() method is called, uidIter.term() becomes non-null. It seems that all the above listed indexer methods rely on this fact. This is the reason the indexing still works after this change. Chaning the uidIter interactions blindly to call uidIter.next() first thing would break indexing I think, because the seekCeil() likely leads to calling next(), so the first term would be skipped.

@vladak
Copy link
Member Author

vladak commented Jul 7, 2022

Looking at the listTokens() functionality, it seems there is no way it can be used. Also, it just emitted a list of tokens to log via prepareIndexer(). Further, the selection of tokens seems to be somewhat arbitrary. So I removed it. If there is ever need for similar feature, it should be implemented as API.

@vladak
Copy link
Member Author

vladak commented Jul 7, 2022

Also, I noticed that the dirty and interrupted flags in IndexDatabase are not used as they should so tried to fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
indexer OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant