
Feature/tokenizer mode #2104

Closed
wants to merge 10 commits into from

Conversation

@idodeclare (Contributor)

Hello,

Please consider this patch for integration; it adds a mode switch, OPENGROK_ALL_NONWHITESPACE (--allNonWhitespace [on|off], default off), to index all non-whitespace for FULL queries.

With the mode activated, the set of tokens indexed by the current pre-release 1.1-rc26 will still return results, but additionally every string of contiguous non-whitespace will be multiply tokenized to allow flexible but precise matching of text fragments that can include punctuation (any non-whitespace, really).

As an example, the source text (the "license") in release 1.1-rc26 tokenizes to only the following:

the
license

so the user cannot home in on queries that include the punctuation parentheses or quotes.

With --allNonWhitespace on, the same source text (the "license") will be indexed with the following tokens (and position offsets):

(the	|1
the	|0
"license	|1
"license"	|0
"license")	|0
license	|0
license"	|0
license")	|0

to support the previous tokens while additionally enabling a number of very precise queries over the true full text.

Runs of non-whitespace are broken at the places most commonly significant in programming languages: word/non-word boundaries, opening and closing punctuation boundaries, and quoting-character boundaries (with new support for recognizing English contractions, so that words such as that'll are not broken).
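The boundary rules above can be approximated in a short Python sketch (the patch itself implements this in JFlex; the rules here are inferred from the example token list, and the sketch omits the contraction handling, so it is an illustration rather than the patch's actual logic):

```python
QUOTES = set('"\'`')

def _is_word(c):
    return c.isalnum() or c == '_'

def boundaries(run):
    """Indices where a non-whitespace run may be split: word/non-word
    transitions, plus either side of a quoting character."""
    cuts = {0, len(run)}
    for i in range(1, len(run)):
        a, b = run[i - 1], run[i]
        if _is_word(a) != _is_word(b) or a in QUOTES or b in QUOTES:
            cuts.add(i)
    return sorted(cuts)

def tokenize(text):
    """Emit (token, position_increment) pairs: every boundary-to-boundary
    substring of each whitespace-delimited run that contains at least one
    word character.  The first token of a run advances the position by 1;
    the rest stack at the same position (increment 0)."""
    out = []
    for run in text.split():
        cuts = boundaries(run)
        first = True
        for s in cuts:
            for e in cuts:
                if s < e and any(_is_word(c) for c in run[s:e]):
                    out.append((run[s:e], 1 if first else 0))
                    first = False
    return out
```

Running tokenize('(the "license")') reproduces the token and position-increment table shown above.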

The cost, of course, is an index that consumes more space. In release 1.1-rc26 I find that indexes are approximately 75% of the size of source code. With --allNonWhitespace on, indexes are at least 100% of source code size and often higher. When I index a sample of ten large, common open source projects, the total index size is 125% of source code size.

I find, however, that some repos (depending on runs of non-whitespace) can even have indexes 200% the size of source code. Despite this, I find that the utility of producing matches on a multitude of fragments of non-whitespace is invaluable.

Since this patch supports queries containing punctuation, it also updates QueryBuilder to avoid re-escaping punctuation when the user includes explicit Lucene escape codes. For example, QueryBuilder will continue to accept a DEFS query method1:forValue: and escape the colons for Lucene, but OpenGrok will now also accept method1\:forValue\:, where the user supplies the Lucene escapes explicitly, instead of throwing an error.
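The idea of escaping only the punctuation the user has not already escaped can be sketched as follows (a Python illustration with a hypothetical helper name, not the actual QueryBuilder code):

```python
import re

def escape_unescaped(query, chars=':'):
    """Backslash-escape each character of `chars` in `query`, but only
    where it is not already preceded by a backslash (i.e. the user has
    not escaped it explicitly)."""
    pattern = r'(?<!\\)([' + re.escape(chars) + r'])'
    return re.sub(pattern, r'\\\1', query)
```

So escape_unescaped('method1:forValue:') yields method1\:forValue\:, while a query already written as method1\:forValue\: passes through unchanged.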

Thank you.

@ahornace (Contributor)

Hello,

Maybe I am missing something, but why not just add (, ", etc. as terms and then use PhraseQuery (e.g. "(the")? Just a handful of new terms would be added to the index, which is negligible. Regarding performance, I'm not completely sure, but I think your way will be faster for these lookups, while some other queries (e.g. wildcard or regexp queries) will be slowed down in proportion to the increased Lucene index size.

Also, how does this perform on minified code, e.g. *.min.js files?

Regardless, to me this feels like twisting Lucene. If we truly want to provide full-text search then I believe that it should be done differently: e.g. https://swtch.com/~rsc/regexp/regexp4.html or https://livegrep.com/search/linux

Regards,
Adam

@tarzanek tarzanek self-assigned this May 13, 2018
@tarzanek tarzanek added this to the 1.1 milestone May 13, 2018
@tarzanek (Contributor)

I like that this is "experimental" :)
I am also curious what it will do on minified code, but then I hope the jflex token length limiter will kick in there.
Let's finish the review (I did about 20%, so I need more time, especially on how you preserve backwards compatibility, since the analyzers deliberately ignore special characters),
and I'd be happy for other reviews too. (Thanks Adam, this is an interesting idea with PhraseQuery; I think OpenGrok has a huge debt in using queries properly and in the most effective way, for the user of course.)

@idodeclare (Contributor, Author)

> maybe I am missing something but why not just add (, ", etc. as terms and then just use PhraseQuery (e.g. "(the")?

@Orviss, you're not quite understanding Lucene as an inverted index. Firstly, being inverted allows Lucene to support multiple terms to index the same text location. That is the basis for Lucene's synonym handling. This patch leverages the inverted index to allow multiple text fragments to index the same or nearly the same source code location. E.g. for a C syntax fragment parent->data a user could easily query either parent->data or parent-> or ->data and get good answers from OpenGrok — without requiring the quoted syntax of a phrase query.

Secondly, Lucene PhraseQuery allows querying sequences of terms — but unless the query is for multiple terms, then it is no different from a regular term query. Your example, "(the" is a query for a single term, (the, so it would not benefit from adding punctuation as additional terms and actually would not return results if punctuation were indexed as a separate term. A phrase query like "( the" — i.e., containing whitespace — would be required to query source code (the (without whitespace) that is indexed with two consecutive terms for the parenthesis and the English word.
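The stacking of multiple terms at one text position, the same mechanism behind Lucene's synonym handling, can be illustrated with a toy posting map (a Python sketch, not Lucene's actual data structures):

```python
def index_tokens(tokens):
    """Build a toy posting map from (term, position_increment) pairs.
    An increment of 0 stacks the term at the same position as the
    previous token, just as Lucene does for synonyms."""
    postings, pos = {}, -1
    for term, incr in tokens:
        pos += incr
        postings.setdefault(term, set()).add(pos)
    return postings

# 'parent->data' indexed with overlapping fragments, all at position 0,
# so a plain term query for any fragment matches without a phrase query
postings = index_tokens([
    ('parent->data', 1), ('parent->', 0), ('parent', 0),
    ('->data', 0), ('data', 0),
])
```

Every fragment maps to the same position, so a single-term query for parent-> or ->data finds the same occurrence as parent->data.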

@idodeclare (Contributor, Author)

> curious what it will do on minified code

@tarzanek, I pushed a test of minified CSS and a revision to ensure that the default plain-full tokenization is used when the new word-breaking operations reach their constant limits, which will happen for minified code.

@tarzanek (Contributor)

looking ...
we need to fix the javadoc problem so travis behaves again ...

@tarzanek (Contributor)

ah ...
I see ...
```
[ERROR] /home/travis/build/oracle/opengrok/test/org/opensolaris/opengrok/util/CustomAssertions.java:112: error: reference not found
[ERROR]  * {@link #assertSymbolStream(java.lang.Class, java.io.InputStream, java.util.List, boolean)}
[ERROR]    ^
```

@ahornace (Contributor)

Actually, PhraseQuery's text is tokenized, so queries like "class.member" and "class . member" are equivalent. However, you can easily override this behaviour in QueryParserBase#newFieldQuery; just override this method in OpenGrok's CustomQueryParser. Then there would be no problem with the query "(the" and treating ( and the as separate terms.

@idodeclare (Contributor, Author)

@tarzanek, thank you for spotting that. I was baffled because NetBeans "Generate Javadoc (opengrok)" was running without error. I guess it does not generate for test/ classes, unlike the travis build.

@coveralls commented May 18, 2018

Pull Request Test Coverage Report for Build 2584

  • 500 of 566 (88.34%) changed or added relevant lines in 17 files are covered.
  • 410 unchanged lines in 9 files lost coverage.
  • Overall coverage increased (+0.2%) to 73.008%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|---|
| src/org/opensolaris/opengrok/search/TermEscaperBase.java | 6 | 8 | 75.0% |
| src/org/opensolaris/opengrok/search/SearchEngine.java | 0 | 2 | 0.0% |
| src/org/opensolaris/opengrok/search/context/Context.java | 30 | 33 | 90.91% |
| src/org/opensolaris/opengrok/analysis/JFlexSymbolMatcher.java | 4 | 14 | 28.57% |
| src/org/opensolaris/opengrok/util/TextTrieMap.java | 139 | 153 | 90.85% |
| src/org/opensolaris/opengrok/analysis/JFlexTokenizer.java | 260 | 295 | 88.14% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
|------|------------------|---|
| src/org/opensolaris/opengrok/analysis/JFlexTokenizer.java | 1 | 87.86% |
| src/org/opensolaris/opengrok/history/SCCSRepository.java | 1 | 79.27% |
| src/org/opensolaris/opengrok/history/HistoryGuru.java | 2 | 75.4% |
| src/org/opensolaris/opengrok/history/BitKeeperRepository.java | 3 | 79.72% |
| src/org/opensolaris/opengrok/history/RepositoryFactory.java | 3 | 86.27% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/search/context/PlainLineTokenizer.java | 64 | 75.0% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/uue/UuencodeFullTokenizer.java | 72 | 66.29% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/plain/PlainSymbolTokenizer.java | 115 | 0.0% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/plain/PlainFullTokenizer.java | 149 | 0.0% |

Totals Coverage Status:

  • Change from base Build 2580: 0.2%
  • Covered Lines: 32041
  • Relevant Lines: 43887

💛 - Coveralls

@idodeclare (Contributor, Author)

@Orviss, that would break support for languages with punctuation in their symbol names, such as Objective-C with names like sendAction:toObject:forAllCells:, or Forth with names such as ram!16.

@ahornace (Contributor)

Your examples still have to be escaped in a normal query, so why not escape them in a PhraseQuery as well? However, suppose there is a token which contains punctuation that the PhraseQuery analyzer would split but the file analyzer treats as a single token, and which is not a reserved query symbol (not !, :, *, etc.), something like t#t (just an example for illustration; no idea whether that applies to our case).

A few different approaches come to mind:

  • we could tokenize PhraseQuery's text based on the selected type field (e.g. if Objective-C is selected then : would not be a separate token); however, this is quite constraining
  • we could use your approach, but tailored specifically to tokens containing punctuation that the PhraseQuery analyzer treats as separate tokens (I think there would not be as many new terms as with your approach)
  • do nothing and let the users escape them; with the suggester implementation there will be no problem, because the suggester can escape it for them (it knows it is a single term, and the suggester works with PhraseQuery as well)

I, personally, like the third option.

Why I prefer using PhraseQuery to your approach:

  • whitespace is ignored; therefore, there is no difference between methodName (param) and methodName(param), whereas your approach will find documents with only the latter
  • nearly no additional tokens; therefore, almost no storage impact and negligible performance impact on some queries and the upcoming suggester (whose performance for some queries depends on the term count)

- Make getContext() strictly require a defined
  Reader, and remove unused path where Reader and
  Writer are null. Extract getContextHits() to
  support uses where formerly null was passed as
  Reader.
- Use TagDesc tuple class instead of String[].
- Recognize a base set of English contractions
  using an ICU-import of a text trie.
Also, allow full non-whitespace for DEFS and REFS
queries.
@vladak (Member) commented Aug 28, 2018

There does not seem to be universal agreement on this one so closing.
