make index database "optimization" explicit operation #3983

vladak · 2022-07-05T18:11:11Z

This change removes the automatic index "optimization" (a.k.a. reduction of segment count for given index to 1) at the end of the indexing. The outcome is reduced indexing time - indexing linux kernel repo with history off takes 06:49 minutes, out of which 50 seconds is the "optimization" step; with the step off, the indexing time is 6:03 minutes, which means 12 % improvement. On the other hand, it might theoretically slow down the search a bit and also the index size might increase - in multi segment index, each segment typically contains deleted documents. These are expunged on segment merge which happens during indexing. Comparing the size of the index of the linux kernel repository without and with the changes, there is no significant difference - in both cases the index directory had 1.5 GB (the real difference was some 45 MB) and there were 16 segments at the end of the indexing in the changed case. The number of the segments is not the same for each reindex of the same data due to the parallel nature of the indexing.

While this might seem as a overturn to the less optimal handling, it just normalizes the situation - Lucene documents consider the reduction to single segment to be extreme and useful only for long term index archival. The other alternative would be to reduce the segments to some other number however that would be highly dependent on given index. To address the search latency, I plan to submit a PR that will create IndexSearcher objects with a dedicated executors so that index segments can be searched in parallel.

I chose a bit of a more aggressive approach - instead of setting the default to false, this is now explicit operation invoked via the --reduceSegmentCount indexer option. Still softer approach than ripping this functionality out altogether. The limitation of the programming of this option's handling is that when used together with index database update, it reduces the segment count for all index databases, not just those that changed. It can be run as a standalone operation when combined with the -n / --noIndex option.

That said, I think the reduceSegmentCount functionality should be eventually replaced by tunables to control the Lucene segment merge policy type and its parameters.

vladak · 2022-07-07T20:41:37Z

After running the tests without the segment count reduction, ProjectsControllerTest#testListFiles() started failing. The cause was IndexDatabase#getFiles() used a weird way how to interact with iterator, in this case TermsEnum object. Normally, one uses iterator in Java like this:

Iterator<T> iterator = SomeType.iterator();
 
while (iterator.hasNext()) {
        //  Moving cursor to next element
        T data = iterator.next();
        ... // process data
}

or something like:

Iterator<T> iterator = SomeType.iterator();
 
T data;
while ((data = iterator.next()) != null) {
        ...  // process data
}

however in this case the pattern was:

opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java

Lines 1791 to 1808 in 9bbba82

    
           ireader = DirectoryReader.open(indexDirectory); // open existing index 
        
           if (ireader.numDocs() > 0) { 
        
               terms = MultiTerms.getTerms(ireader, QueryBuilder.U); 
        
               iter = terms.iterator(); // init uid iterator 
        
           } 
        
           while (iter != null && iter.term() != null) { 
        
               String value = iter.term().utf8ToString(); 
        
               if (value.isEmpty()) { 
        
                   iter.next(); 
        
                   continue; 
        
               } 
        
               files.add(Util.uid2url(value)); 
        
               BytesRef next = iter.next(); 
        
               if (next == null) { 
        
                   iter = null; 
        
               } 
        
           }

For some reason, when an index has multiple segments, the first call to iter.term() is null, even though iter.next() returns valid object. Rewriting the getFiles() method to use the iterator properly, fixes the test case failure. Same goes for IndexDatabase#listTokens().

Now, IndexDatabase contains various methods used via update() that also follow this pattern, notably processFile(), processFileIncremental and processTrailingTerms(). It works there because of this code:

opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java

Lines 634 to 635 in 9bbba82

    
           uidIter = terms.iterator(); 
        
           TermsEnum.SeekStatus stat = uidIter.seekCeil(new BytesRef(startUid)); //init uid

After the seekCeil() method is called, uidIter.term() becomes non-null. It seems that all the above listed indexer methods rely on this fact. This is the reason the indexing still works after this change. Chaning the uidIter interactions blindly to call uidIter.next() first thing would break indexing I think, because the seekCeil() likely leads to calling next(), so the first term would be skipped.

vladak · 2022-07-07T22:17:09Z

Looking at the listTokens() functionality, it seems there is no way it can be used. Also, it just emitted a list of tokens to log via prepareIndexer(). Further, the selection of tokens seems to be somewhat arbitrary. So I removed it. If there is ever need for similar feature, it should be implemented as API.

vladak · 2022-07-07T23:03:53Z

Also, I noticed that the dirty and interrupted flags in IndexDatabase are not used as they should so tried to fix that.

fixes oracle#3982

it does not seem to be used

Also, actually use the dirty flag to emit a warning message.

vladak added the indexer label Jul 5, 2022

oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 5, 2022

vladak changed the title ~~Optimize off~~ make index database "optimization" explicit operation Jul 5, 2022

vladak force-pushed the optimize_off branch from fcf3954 to 3782bab Compare July 5, 2022 20:24

vladak marked this pull request as draft July 5, 2022 20:35

vladak marked this pull request as ready for review July 5, 2022 20:51

vladak force-pushed the optimize_off branch from 60de82f to 047de44 Compare July 6, 2022 20:13

vladak marked this pull request as draft July 6, 2022 20:20

vladak marked this pull request as ready for review July 7, 2022 20:18

vladak added 16 commits July 25, 2022 14:23

make index "optimization" explicit action

6e6f7ca

fixes oracle#3982

finish the rename

8c50368

fix style

840f203

refactor dirty file removal

e3326f3

fix handling of the dirty flag

3bb9b72

make sure IndexDatabase.updateAll() clears the dirty file

e73563a

no need to check whether the index is dirty

8f1a895

use iterator correctly

0fc9e1c

revert unnecessary year bumps

8f2325e

revert year bump that is unnecessary

c62fab9

improve the comment

93bbb3d

remove the list tokens (create dictionary) functionality

4ef0b13

it does not seem to be used

do not use the dirty flag in reduce segment count code

f97c058

Also, actually use the dirty flag to emit a warning message.

actualy set interrupted to true

393fb24

rename

33140d0

add comment

483202e

vladak force-pushed the optimize_off branch from 2f419fd to 483202e Compare July 25, 2022 12:23

fix rebase

e9d5484

vladak merged commit c3b0c92 into oracle:master Jul 25, 2022

vladak deleted the optimize_off branch July 25, 2022 13:06

vladak mentioned this pull request Jul 25, 2022

make sure all IndexSearcher instances use an executor #4008

Merged

vladak mentioned this pull request Jan 20, 2023

the indexer must skip deleted documents in multi-segment indexes #4030

Closed

vladak mentioned this pull request Jul 11, 2023

Error "IndexDeletionPolicy changed, normally only one commit should be stored" #4385

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make index database "optimization" explicit operation #3983

make index database "optimization" explicit operation #3983

vladak commented Jul 5, 2022 •

edited

vladak commented Jul 7, 2022 •

edited

vladak commented Jul 7, 2022

vladak commented Jul 7, 2022

make index database "optimization" explicit operation #3983

make index database "optimization" explicit operation #3983

Conversation

vladak commented Jul 5, 2022 • edited

vladak commented Jul 7, 2022 • edited

vladak commented Jul 7, 2022

vladak commented Jul 7, 2022

vladak commented Jul 5, 2022 •

edited

vladak commented Jul 7, 2022 •

edited