HSEARCH-2678 Add a simple query string entry point to the DSL #1318

Closed
wants to merge 4 commits into
from

Conversation

Projects
None yet
3 participants
@gsmet
Member

gsmet commented Feb 17, 2017

So, this has a very long history as I started to work on this in 2011 :). This patch is very different from what I envisioned at the start.

Why revisit it now? Got kicked out last week of $oldJob GitHub organization and this branch was hosted in the hibernate-search repository of this organization so it was high time to host it in a safe place. While at it, I decided to rebase it, polish it and well... here we are...

Some history

The idea of this patch came from the impossibility to execute a plain text query with a AND operator with the keyword() DSL.

Basically, we had our users entering "word1 word2" as a search query and they wanted only the results containing word1 and word2. We also wanted this to work with multiple fields. Meaning I would want to look for results containing word1 and word2 but either in name or description.

It looks like a simple issue but it's not that simple, especially when analyzers come into play.

Our users might also want to use simple "advanced" query features like phrase queries or negative queries.

I tried several times to arrive to something working well, thought about using DisMax...

The current version of the patch became possible when Lucene introduced the SimpleQueryParser (which is also available on Elasticsearch) which is a very simple lenient parser that can be used exactly for this purpose.

How it works

You can see a couple of examples in the doc added at the start of this patch. It should give you a good overview of how it works.

I also added a couple of tests that demonstrates the features.

While quite simple by itself, I think it's a great feature for the end user.

@gsmet

This comment has been minimized.

Show comment
Hide comment
@gsmet

gsmet Feb 23, 2017

Member

Jenkins, retest this please

Member

gsmet commented Feb 23, 2017

Jenkins, retest this please

@yrodiere

Here are my two cents. It's mostly about naming, I'm afraid. But given we're talking about introducing a new API, I'm not sure it's really "bikeshedding"... It's not like naming is a detail when building new APIs.

@@ -289,6 +289,74 @@ Query luceneQuery = mythQB.keyword()
In the previous example, only field name is boosted to 5.
+===== Plain text queries

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

Maybe we could call this "native" queries instead of "plain text" queries? Granted, it would be less obvious that these queries are not "type safe", but the "native" keyword seems to fit quite well, especially when you consider native queries in JPA.

Also, we may not allow the same syntax for every indexing service. I see you plugged this to Elasticsearch's simple query strings, whose syntax is similar, but it probably has some differences. So, the word "native" would warn users about that issue.

@yrodiere

yrodiere Feb 28, 2017

Member

Maybe we could call this "native" queries instead of "plain text" queries? Granted, it would be less obvious that these queries are not "type safe", but the "native" keyword seems to fit quite well, especially when you consider native queries in JPA.

Also, we may not allow the same syntax for every indexing service. I see you plugged this to Elasticsearch's simple query strings, whose syntax is similar, but it probably has some differences. So, the word "native" would warn users about that issue.

+[source, JAVA]
+----
+Query luceneQuery = mythQB
+ .text()

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

Following my suggestion above, "text()" would become "nativeQuery" or something like that. "native" is a Java keyword, unfortunately. I guess "native$" or "native_" are not something we can consider.

In any case, I think "text" is prone to error; it gives the impression this is the only full-text enabled query, which is wrong. I understand what we're trying to say is that the query definition will be textual, but it's ambiguous.

Some suggestions (if you don't like nativeQuery):

  • parsed
  • queryString
  • fromString
@yrodiere

yrodiere Feb 28, 2017

Member

Following my suggestion above, "text()" would become "nativeQuery" or something like that. "native" is a Java keyword, unfortunately. I guess "native$" or "native_" are not something we can consider.

In any case, I think "text" is prone to error; it gives the impression this is the only full-text enabled query, which is wrong. I understand what we're trying to say is that the query definition will be textual, but it's ambiguous.

Some suggestions (if you don't like nativeQuery):

  • parsed
  • queryString
  • fromString
+Query luceneQuery = mythQB
+ .text()
+ .onField("history")
+ .defaultOperatorIsAnd()

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

This is a bit clumsy.

So...:

  1. defaultOperator(Operator.AND) isn't much better
  2. or()/and() lack some context
  3. I guess .conjunctive/.disjunctive are a tad too obscure

If nobody finds a better option, I think I'd say we go for 1... Even though it's far from ideal.

@yrodiere

yrodiere Feb 28, 2017

Member

This is a bit clumsy.

So...:

  1. defaultOperator(Operator.AND) isn't much better
  2. or()/and() lack some context
  3. I guess .conjunctive/.disjunctive are a tad too obscure

If nobody finds a better option, I think I'd say we go for 1... Even though it's far from ideal.

+
+ JsonBuilder.Object queryBuilder = JsonBuilder.object()
+ .addProperty( "query", query.getQuery() )
+ .addProperty( "defaultOperator", query.isMatchAll() ? "and" : "or" );

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

It's default_operator, not defaultOperator. It may work in ES2, but I've seen cases where ES5 dropped such undocumented syntax, so we should stay away from such syntax.

@yrodiere

yrodiere Feb 28, 2017

Member

It's default_operator, not defaultOperator. It may work in ES2, but I've seen cases where ES5 dropped such undocumented syntax, so we should stay away from such syntax.

+
+ /**
+ * Value searched in the fields.
+ * The value is passed to the respective fields analyzers.

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

This javadoc should be rephrased: the value is parsed, it's not analyzed by the field analyzers. Only part of it is actually analyzed.

@yrodiere

yrodiere Feb 28, 2017

Member

This javadoc should be rephrased: the value is parsed, it's not analyzed by the field analyzers. Only part of it is actually analyzed.

+
+ /**
+ * @param fields The field names the term query is executed on. The underlying properties for the specified
+ * fields need to be of the same type. For example, it is not possible to use this method with a mixture of

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

This javadoc seems wrong; we can target any field, regardless of its type. We may want to warn users that queries apply on the bridged data, though (so dates may have a weird format, for instance).

@yrodiere

yrodiere Feb 28, 2017

Member

This javadoc seems wrong; we can target any field, regardless of its type. We may want to warn users that queries apply on the bridged data, though (so dates may have a weird format, for instance).

+ // we use the analyzers defined in the schema as it is not possible to define per field schema here
+ RemoteSimpleQueryStringQuery.Builder builder = new RemoteSimpleQueryStringQuery.Builder()
+ .query( searchText )
+ .defaultOperatorIsAnd( defaultOperatorIsAnd );

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

We don't pass the reference to the analyzer here... If I understand correctly, this is because we can't define per-field analyzers in Elasticsearch's simple_query_string. We only have one global analyzer parameter.

Maybe this should be documented? Users may be surprised when they write:

searchFactory.buildQueryBuilder().forEntity( MyEntity.class ).overridesForField( "foo", "someQueryAnalyzer" ).get()
    .text().forField( "foo" ).matching( "blah blah" ).createQuery();

... and the resulting query doesn't use the overridden analyzer.

Maybe we should also support the case where there's only one field, and throw an error if there are multiple fields with different analyzers?

@yrodiere

yrodiere Feb 28, 2017

Member

We don't pass the reference to the analyzer here... If I understand correctly, this is because we can't define per-field analyzers in Elasticsearch's simple_query_string. We only have one global analyzer parameter.

Maybe this should be documented? Users may be surprised when they write:

searchFactory.buildQueryBuilder().forEntity( MyEntity.class ).overridesForField( "foo", "someQueryAnalyzer" ).get()
    .text().forField( "foo" ).matching( "blah blah" ).createQuery();

... and the resulting query doesn't use the overridden analyzer.

Maybe we should also support the case where there's only one field, and throw an error if there are multiple fields with different analyzers?

@@ -289,6 +289,74 @@ Query luceneQuery = mythQB.keyword()
In the previous example, only field name is boosted to 5.
+===== Plain text queries
+
+Plain text queries use the Lucene SimpleQueryParser. The query text is parsed and it exposes more

This comment has been minimized.

@yrodiere

yrodiere Feb 28, 2017

Member

Maybe we should add something to the Elasticsearch integration doc, to mention that the syntax there is the one for simple_query_string (which may be slightly different, at least in the future).

@yrodiere

yrodiere Feb 28, 2017

Member

Maybe we should add something to the Elasticsearch integration doc, to mention that the syntax there is the one for simple_query_string (which may be slightly different, at least in the future).

@gsmet gsmet changed the title from HSEARCH-917 Add a plain text feature based on the Lucene SimpleQueryParser to HSEARCH-2678 Add a simple query string entry point to the DSL Apr 7, 2017

@gsmet

This comment has been minimized.

Show comment
Hide comment
@gsmet

gsmet Apr 7, 2017

Member

@yrodiere I think I addressed all your comments and we should be on the same page regarding this feature. The first 2 commits should probably be squashed after review.

As for the build status, the failure looks unrelated.

Member

gsmet commented Apr 7, 2017

@yrodiere I think I addressed all your comments and we should be on the same page regarding this feature. The first 2 commits should probably be squashed after review.

As for the build status, the failure looks unrelated.

@yrodiere

This comment has been minimized.

Show comment
Hide comment
@yrodiere

yrodiere Apr 10, 2017

Member

Jenkins, retest this please

Member

yrodiere commented Apr 10, 2017

Jenkins, retest this please

@yrodiere

Here are my comments; it looks good except for a few typos.
Maybe @Sanne wants to have a look before we merge, since we're adding new APIs?

+ else {
+ List<String> overridingAnalyzers = new ArrayList<>();
+ overridingAnalyzers.addAll( analyzers );
+ overridingAnalyzers.add( overridingRemoteAnalyzerName );

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

From what I can see above, overridingRemoteAnalyzerName is already in the analyzers set, so we could just pass the set to the logger method (if you made the parameter a Collection, that is 😁 )

			analyzers.add( queryRemoteAnalyzerName ); // Value added to the set here
 			if ( !queryRemoteAnalyzerName.equals( originalRemoteAnalyzerName ) ) {
 				if ( overridingRemoteAnalyzerName == null ) {
 					overridingRemoteAnalyzerName = queryRemoteAnalyzerName; // Same value put into overridingRemoteAnalyzerName here
 				}
@yrodiere

yrodiere Apr 10, 2017

Member

From what I can see above, overridingRemoteAnalyzerName is already in the analyzers set, so we could just pass the set to the logger method (if you made the parameter a Collection, that is 😁 )

			analyzers.add( queryRemoteAnalyzerName ); // Value added to the set here
 			if ( !queryRemoteAnalyzerName.equals( originalRemoteAnalyzerName ) ) {
 				if ( overridingRemoteAnalyzerName == null ) {
 					overridingRemoteAnalyzerName = queryRemoteAnalyzerName; // Same value put into overridingRemoteAnalyzerName here
 				}

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Well spotted, fixed.

@gsmet

gsmet Apr 10, 2017

Member

Well spotted, fixed.

@@ -44,6 +44,13 @@
PhraseContext phrase();
+ /**
+ * Build a plain text query.

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

Should probably be "Build a query from a simple query string"... ? Same for the @return below.

@yrodiere

yrodiere Apr 10, 2017

Member

Should probably be "Build a query from a simple query string"... ? Same for the @return below.

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Yup, fixed.

@gsmet

gsmet Apr 10, 2017

Member

Yup, fixed.

+ */
+public interface SimpleQueryStringContext extends QueryCustomization<SimpleQueryStringContext> {
+ /**
+ * @param field The field name the plain text query is executed on

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

Same here, should be about a "simple query string" rather than a "plain text query", since we decided to use that terminology.

@yrodiere

yrodiere Apr 10, 2017

Member

Same here, should be about a "simple query string" rather than a "plain text query", since we decided to use that terminology.

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Fixed.

@gsmet

gsmet Apr 10, 2017

Member

Fixed.

+ /**
+ * @param field The field name the plain text query is executed on
+ *
+ * @return {@code SimpleQueryStringMatchingContext} to continue the term query

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

Copy/paste error here, it's not a term query

@yrodiere

yrodiere Apr 10, 2017

Member

Copy/paste error here, it's not a term query

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Fixed, sorry about that.

@gsmet

gsmet Apr 10, 2017

Member

Fixed, sorry about that.

+
+ /**
+ * @param field The first field added to the list of fields (follows the same rules described below for fields)
+ * @param fields The field names the term query is executed on

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

Copy/paste error here, it's not a term query

@yrodiere

yrodiere Apr 10, 2017

Member

Copy/paste error here, it's not a term query

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Fixed, sorry about that.

@gsmet

gsmet Apr 10, 2017

Member

Fixed, sorry about that.

+ /**
+ * Define the default operator as AND.
+ */
+ SimpleQueryStringDefinitionTermination useAndAsDefaultOperator();

This comment has been minimized.

@yrodiere

yrodiere Apr 10, 2017

Member

withAndAsDefaultOperator, maybe? This would make the DSL feel a bit more natural: "simple query string, on field 'foo', with 'and' as default operator, matching 'bar'"
It's just a suggestion, I don't mean to argue about this. Ignore it if you don't like it.

@yrodiere

yrodiere Apr 10, 2017

Member

withAndAsDefaultOperator, maybe? This would make the DSL feel a bit more natural: "simple query string, on field 'foo', with 'and' as default operator, matching 'bar'"
It's just a suggestion, I don't mean to argue about this. Ignore it if you don't like it.

This comment has been minimized.

@gsmet

gsmet Apr 10, 2017

Member

Done.

@gsmet

gsmet Apr 10, 2017

Member

Done.

@gsmet

This comment has been minimized.

Show comment
Hide comment
@gsmet

gsmet Apr 10, 2017

Member

@yrodiere I pushed a new version:

  • squashed the 2 first commits
  • added one commit to rename useAnd...
  • added one commit to add lucene-queryparser to OSGi and distribution. It was already present in the WildFly modules.
Member

gsmet commented Apr 10, 2017

@yrodiere I pushed a new version:

  • squashed the 2 first commits
  • added one commit to rename useAnd...
  • added one commit to add lucene-queryparser to OSGi and distribution. It was already present in the WildFly modules.
@yrodiere

Looks good to me!
@Sanne, do you want to check out the new API? Otherwise I'll merge it.

@Sanne

This comment has been minimized.

Show comment
Hide comment
@Sanne

Sanne Apr 13, 2017

Member

Great work!

Looks very good. The only detail I'm not convinced of is the starting point of the DSL:

qb.simpleQueryString()

It feels a bit at odds with the other methods, but not having a better idea I merged it.

Proposal: let's post a dedicated blog to present / explain the feature and see if someone has a better name to offer.

Merged!

Member

Sanne commented Apr 13, 2017

Great work!

Looks very good. The only detail I'm not convinced of is the starting point of the DSL:

qb.simpleQueryString()

It feels a bit at odds with the other methods, but not having a better idea I merged it.

Proposal: let's post a dedicated blog to present / explain the feature and see if someone has a better name to offer.

Merged!

@Sanne Sanne closed this Apr 13, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment