
Feature/tokenizer mode #2104

Closed
wants to merge 10 commits into from

Conversation

@idodeclare (Contributor)

Hello,

Please consider this patch for integration; it adds a mode switch, OPENGROK_ALL_NONWHITESPACE (--allNonWhitespace [on|off], default off), to index all non-whitespace for FULL queries.

With the mode activated, the set of tokens indexed by the current pre-release 1.1-rc26 will still return results, but additionally every string of contiguous non-whitespace will be multiply tokenized to allow flexible but precise matching of text fragments that can include punctuation (any non-whitespace, really).

As an example, the source text (the "license") in release 1.1-rc26 tokenizes to only the following:

the
license

so the user cannot home in on queries that include the punctuation parentheses or quotes.

With --allNonWhitespace on, the same source text (the "license") will be indexed with the following tokens (and position offsets):

(the	|1
the	|0
"license	|1
"license"	|0
"license")	|0
license	|0
license"	|0
license")	|0

to support the previous tokens while additionally enabling a number of very precise queries over the true full text.

Runs of non-whitespace are broken at the places most commonly significant in programming languages: word/non-word boundaries, opening and closing punctuation boundaries, and quoting-character boundaries (with new support for recognizing English contractions, so that words such as that'll are not broken).
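The boundary rules above can be approximated in a short Python sketch (the patch itself implements this in JFlex; the rules here are inferred from the example token list, and the sketch omits the contraction handling, so it is an illustration rather than the patch's actual logic):

```python
QUOTES = set('"\'`')

def _is_word(c):
    return c.isalnum() or c == '_'

def boundaries(run):
    """Indices where a non-whitespace run may be split: word/non-word
    transitions, plus either side of a quoting character."""
    cuts = {0, len(run)}
    for i in range(1, len(run)):
        a, b = run[i - 1], run[i]
        if _is_word(a) != _is_word(b) or a in QUOTES or b in QUOTES:
            cuts.add(i)
    return sorted(cuts)

def tokenize(text):
    """Emit (token, position_increment) pairs: every boundary-to-boundary
    substring of each whitespace-delimited run that contains at least one
    word character.  The first token of a run advances the position by 1;
    the rest stack at the same position (increment 0)."""
    out = []
    for run in text.split():
        cuts = boundaries(run)
        first = True
        for s in cuts:
            for e in cuts:
                if s < e and any(_is_word(c) for c in run[s:e]):
                    out.append((run[s:e], 1 if first else 0))
                    first = False
    return out
```

Running tokenize('(the "license")') reproduces the token and position-increment table shown above.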

The cost, of course, is an index that consumes more space. In release 1.1-rc26 I find that indexes are approximately 75% of the size of source code. With --allNonWhitespace on, indexes are at least 100% of source code size and often higher. When I index a sample of ten large, common open source projects, the total index size is 125% of source code size.

I find, however, that some repos (depending on runs of non-whitespace) can even have indexes 200% the size of source code. Despite this, I find that the utility of producing matches on a multitude of fragments of non-whitespace is invaluable.

Since this patch supports queries containing punctuation, it also updates QueryBuilder to avoid re-escaping punctuation when the user includes explicit Lucene escape codes. For example, QueryBuilder will continue to accept a DEFS query method1:forValue: and escape the colons for Lucene, but OpenGrok will now also accept method1\:forValue\:, where the user supplies the Lucene escapes explicitly, instead of throwing an error.
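The idea of escaping only the punctuation the user has not already escaped can be sketched as follows (a Python illustration with a hypothetical helper name, not the actual QueryBuilder code):

```python
import re

def escape_unescaped(query, chars=':'):
    """Backslash-escape each character of `chars` in `query`, but only
    where it is not already preceded by a backslash (i.e. the user has
    not escaped it explicitly)."""
    pattern = r'(?<!\\)([' + re.escape(chars) + r'])'
    return re.sub(pattern, r'\\\1', query)
```

So escape_unescaped('method1:forValue:') yields method1\:forValue\:, while a query already written as method1\:forValue\: passes through unchanged.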

Thank you.

@ahornace (Contributor)

Hello,

Maybe I am missing something, but why not just add (, ", etc. as terms and then use PhraseQuery (e.g. "(the")? Just a handful of new terms would be added to the index, which is negligible. Regarding performance, I'm not completely sure, but I think your way will be faster for these lookups, while some other queries (e.g. wildcard or regexp queries) will be slowed down in proportion to the increased Lucene index size.

Also, how does this perform on minified code, e.g. *.min.js files?

Regardless, to me this feels like twisting Lucene. If we truly want to provide full-text search then I believe that it should be done differently: e.g. https://swtch.com/~rsc/regexp/regexp4.html or https://livegrep.com/search/linux

Regards,
Adam

@tarzanek tarzanek self-assigned this May 13, 2018
@tarzanek tarzanek added this to the 1.1 milestone May 13, 2018
@tarzanek (Contributor)

I like that this is "experimental" :)
I am also curious what it will do on minified code, but then I hope the jflex token length limiter will kick in there.
Let's finish the review (I did about 20%, so I need more time, especially on how you preserve backwards compatibility, since the analyzers deliberately ignore special characters),
and I'd be happy for other reviews too. (Thanks Adam, this is an interesting idea with PhraseQuery; I think OpenGrok has a huge debt in using queries properly and in the most effective way, for the user of course.)

@idodeclare (Contributor, Author)

> maybe I am missing something but why not just add (, ", etc. as terms and then just use PhraseQuery (e.g. "(the")?

@Orviss, you're not quite understanding Lucene as an inverted index. Firstly, being inverted allows Lucene to support multiple terms to index the same text location. That is the basis for Lucene's synonym handling. This patch leverages the inverted index to allow multiple text fragments to index the same or nearly the same source code location. E.g. for a C syntax fragment parent->data a user could easily query either parent->data or parent-> or ->data and get good answers from OpenGrok — without requiring the quoted syntax of a phrase query.

Secondly, Lucene PhraseQuery allows querying sequences of terms — but unless the query is for multiple terms, then it is no different from a regular term query. Your example, "(the" is a query for a single term, (the, so it would not benefit from adding punctuation as additional terms and actually would not return results if punctuation were indexed as a separate term. A phrase query like "( the" — i.e., containing whitespace — would be required to query source code (the (without whitespace) that is indexed with two consecutive terms for the parenthesis and the English word.
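The stacking of multiple terms at one text position, the same mechanism behind Lucene's synonym handling, can be illustrated with a toy posting map (a Python sketch, not Lucene's actual data structures):

```python
def index_tokens(tokens):
    """Build a toy posting map from (term, position_increment) pairs.
    An increment of 0 stacks the term at the same position as the
    previous token, just as Lucene does for synonyms."""
    postings, pos = {}, -1
    for term, incr in tokens:
        pos += incr
        postings.setdefault(term, set()).add(pos)
    return postings

# 'parent->data' indexed with overlapping fragments, all at position 0,
# so a plain term query for any fragment matches without a phrase query
postings = index_tokens([
    ('parent->data', 1), ('parent->', 0), ('parent', 0),
    ('->data', 0), ('data', 0),
])
```

Every fragment maps to the same position, so a single-term query for parent-> or ->data finds the same occurrence as parent->data.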

@idodeclare (Contributor, Author)

> curious what it will do on minified code

@tarzanek, I pushed a test of minified CSS and a revision to ensure that the default plain-full tokenization is used when the new word-breaking operations reach their constant limits, which will happen for minified code.

@tarzanek (Contributor)

looking ...
we need to fix the javadoc problem so travis behaves again ...

@tarzanek (Contributor)

ah ...
I see ...
```
[ERROR] /home/travis/build/oracle/opengrok/test/org/opensolaris/opengrok/util/CustomAssertions.java:112: error: reference not found
[ERROR]  * {@link #assertSymbolStream(java.lang.Class, java.io.InputStream, java.util.List, boolean)}
[ERROR]    ^
```

@ahornace (Contributor)

Actually, PhraseQuery's text is tokenized, so queries like "class.member" and "class . member" are equivalent. However, you can easily override this behaviour in QueryParserBase#newFieldQuery; just override this method in OpenGrok's CustomQueryParser. Then there would be no problem with the query "(the" and treating ( and the as separate terms.

@idodeclare (Contributor, Author)

@tarzanek, thank you for spotting that. I was baffled because NetBeans "Generate Javadoc (opengrok)" was running without error. I guess it does not generate for test/ classes, unlike the travis build.

@coveralls commented May 18, 2018

Pull Request Test Coverage Report for Build 2584

  • 500 of 566 (88.34%) changed or added relevant lines in 17 files are covered.
  • 410 unchanged lines in 9 files lost coverage.
  • Overall coverage increased (+0.2%) to 73.008%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|---|
| src/org/opensolaris/opengrok/search/TermEscaperBase.java | 6 | 8 | 75.0% |
| src/org/opensolaris/opengrok/search/SearchEngine.java | 0 | 2 | 0.0% |
| src/org/opensolaris/opengrok/search/context/Context.java | 30 | 33 | 90.91% |
| src/org/opensolaris/opengrok/analysis/JFlexSymbolMatcher.java | 4 | 14 | 28.57% |
| src/org/opensolaris/opengrok/util/TextTrieMap.java | 139 | 153 | 90.85% |
| src/org/opensolaris/opengrok/analysis/JFlexTokenizer.java | 260 | 295 | 88.14% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
|------|------------------|---|
| src/org/opensolaris/opengrok/analysis/JFlexTokenizer.java | 1 | 87.86% |
| src/org/opensolaris/opengrok/history/SCCSRepository.java | 1 | 79.27% |
| src/org/opensolaris/opengrok/history/HistoryGuru.java | 2 | 75.4% |
| src/org/opensolaris/opengrok/history/BitKeeperRepository.java | 3 | 79.72% |
| src/org/opensolaris/opengrok/history/RepositoryFactory.java | 3 | 86.27% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/search/context/PlainLineTokenizer.java | 64 | 75.0% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/uue/UuencodeFullTokenizer.java | 72 | 66.29% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/plain/PlainSymbolTokenizer.java | 115 | 0.0% |
| opengrok-indexer/target/generated-sources/jflex/org/opensolaris/opengrok/analysis/plain/PlainFullTokenizer.java | 149 | 0.0% |

Totals Coverage Status:

  • Change from base Build 2580: 0.2%
  • Covered Lines: 32041
  • Relevant Lines: 43887

💛 - Coveralls

@idodeclare (Contributor, Author)

@Orviss, that would break support for languages with punctuation in their symbol names, such as Objective-C with names like sendAction:toObject:forAllCells:, or Forth with names such as ram!16.

@ahornace (Contributor)

Your examples still have to be escaped in a normal query, so why not escape them in a PhraseQuery as well? However, suppose there is a token which contains punctuation that the PhraseQuery analyzer would split but the file analyzer treats as a single token, and which is not a reserved query symbol (not !, :, *, etc.), something like t#t (just an example for illustration; no idea whether that applies to our case).

A few different approaches come to mind:

  • we could tokenize PhraseQuery's text based on the selected type field (e.g. if Objective-C is selected then : would not be a separate token); however, this is quite constraining
  • we could use your approach, but tailored specifically to tokens containing punctuation that the PhraseQuery analyzer treats as separate tokens (I think there would not be as many new terms as with your approach)
  • do nothing and let the users escape them; with the suggester implementation there will be no problem, because the suggester can escape it for them (it knows it is a single term, and the suggester works with PhraseQuery as well)

I, personally, like the third option.

Why I prefer using PhraseQuery to your approach:

  • whitespace is ignored; therefore, there is no difference between methodName (param) and methodName(param), whereas your approach will find documents with only the latter
  • nearly no additional tokens; therefore, almost no storage impact and negligible performance impact on some queries and the upcoming suggester (whose performance for some queries depends on the term count)

- Make getContext() strictly require a defined
  Reader, and remove unused path where Reader and
  Writer are null. Extract getContextHits() to
  support uses where formerly null was passed as
  Reader.
- Use TagDesc tuple class instead of String[].
- Recognize a base set of English contractions
  using an ICU-import of a text trie.
Also, allow full non-whitespace for DEFS and REFS
queries.
@vladak (Member) commented Aug 28, 2018

There does not seem to be universal agreement on this one so closing.
