Feature/tokenizer mode #2104
Hello, maybe I am missing something but why not just add Also, how does this perform on minified code, e.g. Regardless, to me this feels like twisting Lucene. If we truly want to provide full-text search then I believe that it should be done differently: e.g. https://swtch.com/~rsc/regexp/regexp4.html or https://livegrep.com/search/linux Regards,
I like that this is "experimental" :)
@Orviss, you're not quite understanding Lucene as an inverted index. Firstly, being inverted allows Lucene to support multiple terms to index the same text location. That is the basis for Lucene's synonym handling. This patch leverages the inverted index to allow multiple text fragments to index the same or nearly the same source code location. E.g. for a C syntax fragment Secondly, Lucene
@tarzanek, I pushed a test of minified css and a revision to ensure that default plain-full tokenization is used when new word-breaking operations reach the const limits, which will happen for minified code.
looking ...
ah ...
Actually,
@tarzanek, thank you for spotting that. I was baffled because NetBeans "Generate Javadoc (opengrok)" was running without error. I guess it does not generate for
@Orviss, that would break support for languages with punctuation in their symbol names, such as Objective-C with names like
Your provided examples still have to be escaped in a normal query, so why not escape them in Different approaches that come to my mind:
I, personally, like the third option. Why I prefer using
- Make getContext() strictly require a defined Reader, and remove unused path where Reader and Writer are null. Extract getContextHits() to support uses where formerly null was passed as Reader.
- Use TagDesc tuple class instead of String[].
- Recognize a base set of English contractions using an ICU-import of a text trie.
Also, allow full non-whitespace for DEFS and REFS queries.
Also:
- add Turkish tests
0b032b5 to 4b75383
There does not seem to be universal agreement on this one so closing.
Hello,
Please consider for integration this patch to add a mode switch `OPENGROK_ALL_NONWHITESPACE` (`--allNonWhitespace [on|off]` [default off]) to index all non-whitespace for `FULL` queries.

With the mode activated, the set of tokens currently indexed by the current pre-release 1.1-rc26 will still return results, but additionally all strings of contiguous non-whitespace will be multiply tokenized to allow flexible but precise matching of text fragments that can include e.g. punctuation (any non-whitespace really).
As an example, the source text `(the "license")` in release 1.1-rc26 will only tokenize the following:

so the user cannot home in on queries that include the punctuation parentheses or quotes.

With `--allNonWhitespace on`, the same source text `(the "license")` will be indexed with the following tokens (and position offsets):

to support the same previous tokens but additionally a number of very precise queries indexing true full-text.
The non-whitespace runs are broken on specific, most-commonly found places in programming languages: e.g. word/non-word boundaries, opening- and closing-punctuation boundaries, and quoting character boundaries (with new support for recognizing English contractions so that words such as `that'll` will not be broken).

The cost, of course, is an index that consumes more space. In release 1.1-rc26 I find that indexes are approximately 75% of the size of source code. With `--allNonWhitespace on`, indexes are at least 100% of source code size and often higher. When I index a sample of ten large, common open source projects, the total index size is 125% of source code size.

I find, however, that some repos (depending on runs of non-whitespace) can even have indexes 200% the size of source code. Despite this, I find that the utility of producing matches on a multitude of fragments of non-whitespace is invaluable.
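To make the idea of multiply tokenizing a non-whitespace run more concrete, here is a minimal sketch that emits a whole run plus the pieces obtained by splitting it at word/non-word boundaries only. It is not the patch's tokenizer: the class `FragmentSketch` and its methods are hypothetical names, the definition of a "word" character is an assumption, and the opening/closing-punctuation, quoting and contraction rules described above are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from the OpenGrok patch.
class FragmentSketch {

    /** Assumed notion of a "word" character: letters, digits and underscore. */
    static boolean isWord(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    /**
     * Returns the whole non-whitespace run first, followed by each maximal
     * word or non-word piece of it (omitted when a piece equals the run).
     */
    static List<String> fragments(String run) {
        List<String> out = new ArrayList<>();
        if (run.isEmpty()) {
            return out;
        }
        out.add(run);
        int start = 0;
        for (int i = 1; i <= run.length(); i++) {
            // A boundary is the end of the run or a word/non-word transition.
            boolean boundary = i == run.length()
                    || isWord(run.charAt(i)) != isWord(run.charAt(i - 1));
            if (boundary) {
                String piece = run.substring(start, i);
                if (!piece.equals(run)) {
                    out.add(piece);
                }
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Split the sample text on whitespace, then fragment each run.
        for (String run : "(the \"license\")".split("\\s+")) {
            System.out.println(fragments(run));
        }
    }
}
```

Under these assumptions the run `(the` yields the fragments `(the`, `(` and `the`, so both the plain word and the punctuated form stay searchable. Every additional fragment is one more indexed term, which is consistent with the index growth reported above.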
Since this patch will support queries with punctuation, it also updates `QueryBuilder` to avoid re-escaping punctuation when the user includes explicit Lucene escape codes. E.g., `QueryBuilder` will continue to support a user submitting a `DEFS` query `method1:forValue:` and escape the colons for Lucene. OpenGrok will also now support a query `method1\:forValue\:` where the user includes explicitly the Lucene escapes and no longer throw an error.

Thank you.
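The "escape only what is not already escaped" behavior described for `QueryBuilder` can be sketched as follows. This is an assumption-laden illustration, not the actual `QueryBuilder` change: the class `EscapeSketch` and the method `escapeColons` are hypothetical, and only the colon is handled, whereas Lucene's query syntax reserves more characters.

```java
// Hypothetical illustration only; not code from OpenGrok's QueryBuilder.
class EscapeSketch {

    /**
     * Escapes each colon that is not already preceded by a backslash,
     * and copies existing backslash escapes through unchanged.
     */
    static String escapeColons(String query) {
        StringBuilder out = new StringBuilder(query.length() + 8);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\' && i + 1 < query.length()) {
                // Already-escaped pair: keep the backslash and the next char.
                out.append(c).append(query.charAt(++i));
            } else if (c == ':') {
                // Bare colon: add the escape for the query parser.
                out.append('\\').append(c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeColons("method1:forValue:"));
        System.out.println(escapeColons("method1\\:forValue\\:"));
    }
}
```

Both calls in `main` print `method1\:forValue\:`, so a user-supplied escape is passed through rather than escaped a second time.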