Integration with Elasticsearch

Status

Caution

This feature is a work in progress. Make sure to read the Limitations section!

The integration with Elasticsearch is in development and should be considered experimental. We do think we have the basics covered and we are looking for feedback.

Patches can be sent as pull requests to the GitHub repository; general feedback, suggestions and questions are also very welcome. To get in touch or find other interesting links for contributors, see the Community page of the Hibernate website.

Goal of the Elasticsearch integration

The goal of integrating with Elasticsearch is to allow Hibernate Search users to benefit from the full-text capabilities integrated with Hibernate ORM while replacing the local Lucene-based index with a remote Elasticsearch service.

There could be various reasons to prefer this over an "embedded Lucene" approach:

  • wish to separate the service running your application from the Search service

  • integrate with an existing Elasticsearch instance

  • benefit from Elasticsearch’s out of the box horizontal scalability features

  • explore the data updated by a Hibernate-powered application using Elasticsearch dashboard integrations such as Kibana

There are a couple of drawbacks compared to the embedded Lucene approach though:

  • incur a performance penalty of remote RPCs both for index updates and to run queries

  • need to manage an additional service

  • possibly need to buy and manage additional servers

Which solution is best will depend on the specific needs of your system and your organization.

Note
Why not use Elasticsearch directly

The #1 reason is that Hibernate Search integrates perfectly with Hibernate ORM. All changes done to your objects will trigger the necessary index changes transparently.

  • it will honor the transaction boundary - i.e. not do the indexing work if the transaction ends up in rollback

  • changes to cascaded objects are handled

  • changes to nested object embedded in a root indexed entity are handled

  • changes will be sent in bulk - i.e. optimized systematically for you

  • etc.

There is no more paradigm shift in your code. You are working on Hibernate ORM managed objects, doing your queries on object properties with a nice DSL.

Getting started and configuration

To experiment with the Elasticsearch integration you will have to download Elasticsearch and run it: Hibernate Search connects to an Elasticsearch node but does not provide one.

One option is to use the Elasticsearch Docker image (see here for Elasticsearch 2).

Elasticsearch version

Hibernate Search expects an Elasticsearch cluster running version 2.x or 5.x. The version running on your cluster will be automatically detected on startup, and Hibernate Search will adapt its behavior based on the detected version.

When upgrading your Elasticsearch cluster though, some administrative tasks are still required on your cluster: Hibernate Search will not take care of those.

Warning

Hibernate Search does not support the string datatype on Elasticsearch 5.x. Thus, if you upgrade your cluster from 2.x to 5.x, you will need to delete your indexes manually and reindex your data.

The targeted version is largely transparent to Hibernate Search users, but there are a few differences in how Hibernate Search behaves depending on the Elasticsearch version that may affect you. Those differences are detailed below for Elasticsearch 2.x and 5.x.

Configuration required for purges

  • 2.x: enable the delete-by-query plugin

  • 5.x: none

Datatype used for text fields in Elasticsearch

  • 2.x: string

  • 5.x: text (if analyzed) or keyword (if not). The string datatype has been deprecated in Elasticsearch 5.0.

Norms

  • 2.x: not implemented

  • 5.x: implemented

Implementation of @Field.indexNullAs for analyzed text fields

  • 2.x: null_value is added to the mapping, null values are indexed as such.

  • 5.x: null_value is not available for text fields, null values are replaced with the indexNullAs value explicitly when indexing.

Note

Hibernate Search internal tests run against Elasticsearch {testElasticsearchVersion} by default.

Dependencies in your Java application

In addition to the usual dependencies like Hibernate ORM and Hibernate Search, you will need the new hibernate-search-elasticsearch jar.

Example 1. Maven dependencies for Hibernate Search with Elasticsearch
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-search-elasticsearch</artifactId>
   <version>{hibernateSearchVersion}</version>
</dependency>

Elasticsearch configuration

Hibernate Search can work with an Elasticsearch server without altering its configuration.

However some features offered by Hibernate Search require specific configuration:

  • on Elasticsearch 2.x only (not necessary on 5.x): if you want to use the Hibernate Search MassIndexer with purgeAllOnStart enabled (which is the default), or to use FullTextSession.purge() or FullTextSession.purgeAll(), install the delete-by-query plugin

  • if you want to retrieve the distance in a geolocation query, enable the lang-groovy plugin, see Elasticsearch Spatial queries

  • if you want to use paging (as opposed to scrolling) on result sets larger than 10000 elements (for instance access the 10001st result), you may increase the value of the index.max_result_window property (default is 10000).

Hibernate Search configuration

Configuration is minimal. Add the configuration properties to your persistence.xml or where you put the rest of your Hibernate Search configuration.

Select Elasticsearch as the backend

hibernate.search.default.indexmanager elasticsearch

Hostname and port for Elasticsearch

hibernate.search.default.elasticsearch.host http://127.0.0.1:9200 (default)

You may also select multiple hosts (separated by whitespace characters), so that they are assigned requests in turns (load balancing):

hibernate.search.default.elasticsearch.host http://es1.mycompany.com:9200 http://es2.mycompany.com:9200

In the example above, the first request will go to es1, the second to es2, the third to es1, and so on.

Also note that having multiple hosts enables failover: if one node fails to serve a request (timeout, server error, invalid HTTP response, …), the same request will be sent to the next one; if the second request is served without error, the failure will be blamed on the node and no error will be reported to the application.

The failover feature will also be enabled when you only have one configured host but other hosts have been added thanks to automatic discovery (see below).
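The load balancing and failover behavior described above can be pictured with a small, self-contained sketch. It is only an illustration of the idea, not the client that Hibernate Search uses internally; the RoundRobinHosts class and the convention that a request returning null means "the node failed" are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Illustrative only: mimics round-robin host selection with failover. */
final class RoundRobinHosts {

    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    /** Splits the property value on whitespace, like the host property. */
    RoundRobinHosts(String hostProperty) {
        this.hosts = Arrays.asList(hostProperty.trim().split("\\s+"));
    }

    /** Returns the host that should receive the next request. */
    String nextHost() {
        return hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
    }

    /**
     * Tries each host at most once, starting with the next one in turn.
     * A null response stands for a failed request (hypothetical convention).
     */
    <T> T execute(Function<String, T> request) {
        for (int attempt = 0; attempt < hosts.size(); attempt++) {
            T response = request.apply(nextHost());
            if (response != null) {
                return response; // an earlier failure stays invisible to the caller
            }
        }
        throw new IllegalStateException("All hosts failed: " + hosts);
    }

    public static void main(String[] args) {
        RoundRobinHosts hosts = new RoundRobinHosts(
                "http://es1.mycompany.com:9200 http://es2.mycompany.com:9200");
        for (int i = 0; i < 3; i++) {
            System.out.println(hosts.nextHost());
        }
    }
}
```

In this sketch an error only reaches the caller when every configured host has failed for the same request.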

Username for Elasticsearch connection

hibernate.search.default.elasticsearch.username ironman (default is empty, meaning anonymous access)

Password for Elasticsearch connection

hibernate.search.default.elasticsearch.password j@rV1s (default is empty)

Caution

If you use HTTP instead of HTTPS in any of the Elasticsearch host URLs (see above), your password will be transmitted in clear text over the network.
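As a minimal sketch of the settings introduced so far, the following helper collects the backend, host and credential properties in a map and reports hosts that would expose the password over plain HTTP. The ElasticsearchSettings class and its method names are hypothetical; only the property keys come from this section.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: only the property keys are taken from the documentation. */
final class ElasticsearchSettings {

    static final String PREFIX = "hibernate.search.default.";

    /** Builds the backend, host and (optional) credential properties. */
    static Map<String, String> connectionSettings(String hosts, String username, String password) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(PREFIX + "indexmanager", "elasticsearch");
        settings.put(PREFIX + "elasticsearch.host", hosts);
        if (username != null) {
            settings.put(PREFIX + "elasticsearch.username", username);
        }
        if (password != null) {
            settings.put(PREFIX + "elasticsearch.password", password);
        }
        return settings;
    }

    /** Returns the configured hosts that would receive the password in clear text. */
    static List<String> plainHttpHosts(Map<String, String> settings) {
        List<String> result = new ArrayList<>();
        String password = settings.getOrDefault(PREFIX + "elasticsearch.password", "");
        if (password.isEmpty()) {
            return result; // anonymous access: there is no password to leak
        }
        String hosts = settings.getOrDefault(PREFIX + "elasticsearch.host", "http://127.0.0.1:9200");
        for (String host : hosts.trim().split("\\s+")) {
            if (host.startsWith("http://")) {
                result.add(host);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> settings = connectionSettings(
                "http://es1.mycompany.com:9200 https://es2.mycompany.com:9200", "ironman", "j@rV1s");
        System.out.println("Hosts using plain HTTP: " + plainHttpHosts(settings));
    }
}
```

Such a map could be written to persistence.xml by hand, or passed wherever the rest of your Hibernate Search configuration is supplied.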

Select the index creation strategy

hibernate.search.default.elasticsearch.index_schema_management_strategy CREATE (default)

Let’s see the options for the index_schema_management_strategy property:

Value Definition

none

The index, its mappings and the analyzer definitions will not be created, deleted nor altered. Hibernate Search will not even check that the index already exists.

validate

The index, its mappings and analyzer definitions will be checked for conflicts with Hibernate Search’s metamodel. The index, its mappings and analyzer definitions will not be created, deleted nor altered.

update

The index, its mappings and analyzer definitions will be created, existing mappings will be updated if there are no conflicts. Caution: if analyzer definitions have to be updated, the index will be closed automatically during the update.

create

The default: an existing index will not be altered, a missing index will be created along with its mappings and analyzer definitions.

drop-and-create

Indexes will be deleted if existing and then created along with their mappings and analyzer definitions. This will delete all content from the indexes!

drop-and-create-and-drop

Similar to drop-and-create but will also delete the index at shutdown. Commonly used for tests.

Caution
Strategies in production environments

It is strongly recommended to use either none or validate in a production environment. drop-and-create and drop-and-create-and-drop are obviously unsuitable in this context (unless you want to reindex everything upon every startup), and update may leave your mapping half-updated in case of conflict.

To be precise, if your mapping changed in an incompatible way, such as a field having its type changed, updating the mapping may be impossible without manual intervention. In this case, the update strategy will prevent Hibernate Search from starting, but it may already have successfully updated the mappings for another index, making a rollback difficult at best.

Also, when updating analyzer definitions, Hibernate Search will close the affected indexes during the update. This means the update strategy should be used with caution when multiple clients use Elasticsearch indexes managed by Hibernate Search: those clients should be synchronized in such a way that while Hibernate Search is starting, no other client tries to use the index.

For these reasons, migrating your mapping should be considered a part of your deployment process and be planned cautiously.
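One way to apply the recommendation above is to pick the strategy per deployment environment. The sketch below assumes three environments and one possible mapping between them and the strategies; the Environment enum, the class name and the choice of strategy for each environment are illustrative assumptions, not something prescribed by Hibernate Search.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper choosing a schema management strategy per environment. */
final class SchemaStrategyByEnvironment {

    /** Assumed set of environments, for illustration only. */
    enum Environment { TEST, DEVELOPMENT, PRODUCTION }

    static final String KEY =
            "hibernate.search.default.elasticsearch.index_schema_management_strategy";

    static String strategyFor(Environment environment) {
        switch (environment) {
            case TEST:
                return "drop-and-create-and-drop"; // clean index for each test run
            case DEVELOPMENT:
                return "update"; // convenient, but may leave mappings half-updated
            default:
                return "validate"; // never alters the index, fails fast on conflicts
        }
    }

    static Map<String, String> settingsFor(Environment environment) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(KEY, strategyFor(environment));
        return settings;
    }

    public static void main(String[] args) {
        for (Environment environment : Environment.values()) {
            System.out.println(environment + " -> " + settingsFor(environment));
        }
    }
}
```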

Note

Mapping validation is as permissive as possible. Fields or mappings that are unknown to Hibernate Search will be ignored, and settings that are more powerful than required (e.g. a field annotated with @Field(index = Index.NO) in Search but marked as "index": analyzed in Elasticsearch) will be deemed valid.

One exception should be noted, though: date formats must match exactly the formats specified by Hibernate Search, due to implementation constraints.

Maximum time to wait for the successful execution of a request to the Elasticsearch server before failing (in ms)

hibernate.search.default.elasticsearch.request_timeout 60000 (default)

The execution time of a request includes the time needed to establish a connection, to send the request, and to receive the whole response, optionally retrying in case of node failures.

Maximum time to wait for a connection to the Elasticsearch server before failing (in ms)

hibernate.search.default.elasticsearch.connection_timeout 3000 (default)

Maximum time to wait for a response from the Elasticsearch server before failing (in ms)

hibernate.search.default.elasticsearch.read_timeout 60000 (default)

Maximum number of simultaneous connections to the Elasticsearch cluster

hibernate.search.default.elasticsearch.max_total_connection 20 (default)

Maximum number of simultaneous connections to a single Elasticsearch server

hibernate.search.default.elasticsearch.max_total_connection_per_route 2 (default)

Whether to enable automatic discovery of servers in the Elasticsearch cluster (true or false)

hibernate.search.default.elasticsearch.discovery.enabled false (default)

When using automatic discovery, the Elasticsearch client will periodically probe for new nodes in the cluster, and will add those to the server list (see host above). Similarly, the client will periodically check whether registered servers still respond, and will remove them from the server list if they don’t.

Time interval between two executions of the automatic discovery (in seconds)

hibernate.search.default.elasticsearch.discovery.refresh_interval 10 (default)

This setting will only be taken into account if automatic discovery is enabled (see above).

Scheme to use when connecting to automatically discovered nodes (http or https)

hibernate.search.default.elasticsearch.discovery.default_scheme http (default)

This setting will only be taken into account if automatic discovery is enabled (see above).

Maximum time to wait for the indexes to become available before failing (in ms)

hibernate.search.default.elasticsearch.index_management_wait_timeout 10000 (default)

This setting is ignored when the NONE strategy is selected, since the index will not be checked on startup (see above).

This value must be lower than the read timeout (see above).

Status an index must at least have in order for Hibernate Search to work with it (one of "green", "yellow" or "red")

hibernate.search.default.elasticsearch.required_index_status green (default)

Only operate if the index is at this level or safer. In development, set this value to yellow if the number of nodes started is below the number of expected replicas.

Whether to perform an explicit refresh after a set of operations has been executed against a specific index (true or false)

hibernate.search.default.elasticsearch.refresh_after_write false (default)

This is useful in unit tests to ensure that a write is immediately visible to queries, which keeps unit tests simpler and faster. Do not rely on this synchronous behaviour in production code, though: leave the property at false for optimal performance of your Elasticsearch cluster.

When scrolling, the minimum number of previous results kept in memory at any time

hibernate.search.elasticsearch.scroll_backtracking_window_size 10000 (default)

When scrolling, the number of results fetched by each Elasticsearch call

hibernate.search.elasticsearch.scroll_fetch_size 1000 (default)

When scrolling, the maximum duration ScrollableResults will be usable if no other results are fetched from Elasticsearch, in seconds

hibernate.search.elasticsearch.scroll_timeout 60 (default)

Note

Properties prefixed with hibernate.search.default can be given globally as shown above and/or be given for specific indexes:

hibernate.search.someindex.elasticsearch.index_schema_management_strategy update

This excludes properties related to the internal Elasticsearch client, which at the moment is common to every index manager (but this will change in a future version). Excluded properties are host, username, password, read_timeout, connection_timeout, max_total_connection, max_total_connection_per_route, discovery.enabled, discovery.refresh_interval and discovery.default_scheme.
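The scoping rule can be sketched as a small helper that builds an index-specific property key and rejects the client-level properties listed above. The IndexScopedProperties class is hypothetical; the key pattern and the excluded names are taken from this section.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper building per-index property keys. */
final class IndexScopedProperties {

    /** Properties tied to the shared client, which cannot be set per index. */
    private static final Set<String> CLIENT_LEVEL = new HashSet<>(Arrays.asList(
            "host", "username", "password",
            "read_timeout", "connection_timeout",
            "max_total_connection", "max_total_connection_per_route",
            "discovery.enabled", "discovery.refresh_interval",
            // named discovery.default_scheme in the property reference above
            "discovery.default_scheme"));

    static boolean isClientLevel(String property) {
        return CLIENT_LEVEL.contains(property);
    }

    /** Returns the key to use for an index-specific override. */
    static String key(String indexName, String property) {
        if (isClientLevel(property)) {
            throw new IllegalArgumentException(
                    "'" + property + "' can only be set under hibernate.search.default");
        }
        return "hibernate.search." + indexName + ".elasticsearch." + property;
    }

    public static void main(String[] args) {
        System.out.println(key("someindex", "index_schema_management_strategy"));
    }
}
```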

Mapping and indexing

Like in Lucene embedded mode, indexes are transparently updated when you create or update entities mapped to Hibernate Search. Simply use familiar annotations from [search-mapping].

The name of the index will be the lowercased name provided to @Indexed (non qualified class name by default). Hibernate Search will map the fully qualified class name to the Elasticsearch type.

Annotation specificities

Field.indexNullAs

The org.hibernate.search.annotations.Field annotation allows you to provide a replacement value for null properties through the indexNullAs attribute (see [field-annotation]), but this value must be provided as a string.

In order for your value to be understood by Hibernate Search (and Elasticsearch), the provided string must follow one of those formats:

  • For string values, no particular format is required.

  • For numeric values, use formats accepted by Double.parseDouble, Integer.parseInt, etc., depending on the actual type of your field.

  • For booleans, use either true or false.

  • For dates (java.util.Calendar, java.util.Date, java.time.*), use the ISO-8601 format.

    The full format is yyyy-MM-dd'T'HH:mm:ss.nZ[ZZZ] (for instance 2016-11-26T16:41:00.006+01:00[CET]). Please keep in mind that part of this format must be left out depending on the type of your field, though. For a java.time.LocalDateTime field, for instance, the provided string must not include the zone offset (+01:00) or the zone ID ([CET]), because those don't make sense.

    Even when they make sense for the type of your field, the time and time zone may be omitted (if omitted, the time zone will be interpreted as the default JVM time zone).
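To get a feel for these formats, the following sketch checks candidate indexNullAs strings with the standard JDK parsers. It is an approximation under stated assumptions: Hibernate Search performs its own parsing and is more lenient (for instance, it allows the time and time zone to be omitted), and the IndexNullAsFormats class is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeParseException;

/** Hypothetical pre-check of indexNullAs strings using JDK parsers only. */
final class IndexNullAsFormats {

    /** Returns true if the JDK parser for the given field type accepts the string. */
    static boolean looksValid(Class<?> fieldType, String indexNullAs) {
        try {
            if (fieldType == Integer.class) {
                Integer.parseInt(indexNullAs);
            } else if (fieldType == Double.class) {
                Double.parseDouble(indexNullAs);
            } else if (fieldType == Boolean.class) {
                return "true".equals(indexNullAs) || "false".equals(indexNullAs);
            } else if (fieldType == LocalDateTime.class) {
                LocalDateTime.parse(indexNullAs); // no offset, no zone ID
            } else if (fieldType == ZonedDateTime.class) {
                ZonedDateTime.parse(indexNullAs); // offset and optional [zone ID]
            }
            return true; // strings and other types are not checked in this sketch
        } catch (NumberFormatException | DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksValid(ZonedDateTime.class, "2016-11-26T16:41:00.006+01:00[CET]"));
        System.out.println(looksValid(LocalDateTime.class, "2016-11-26T16:41:00.006+01:00"));
    }
}
```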

Dynamic boosting

The org.hibernate.search.annotations.DynamicBoost annotation is not (and cannot be) supported with Elasticsearch, because the platform lacks per-document, index-time boosting capabilities. Static boosts (@Boost) are, however, supported.

Analyzers

Warning

Analyzers are treated differently than in Lucene embedded mode.

Built-in or server-defined analyzers

Using the definition attribute in the @Analyzer annotation, you can refer to the name of a built-in Elasticsearch analyzer, or of a custom analyzer already registered on your Elasticsearch instances.

More information on analyzers, in particular those already built in Elasticsearch, can be found in the Elasticsearch documentation.

Example of custom analyzers defined in the elasticsearch.yml
# Custom analyzer
index.analysis:
  analyzer.custom-analyzer:
    type: custom
    tokenizer: standard
    filter: [custom-filter, lowercase]
  filter.custom-filter:
    type : stop
    stopwords : [test1, close]

From there, you can use the custom analyzers by name in your entity mappings.

Example of mapping that refers to custom and built-in analyzers on Elasticsearch
@Entity
@Indexed(index = "tweet")
public class Tweet {

    @Id
    @GeneratedValue
    private Integer id;

    @Field
    @Analyzer(definition = "english") // Elasticsearch built-in analyzer
    private String englishTweet;

    @Field
    @Analyzer(definition = "whitespace") // Elasticsearch built-in analyzer
    private String whitespaceTweet;

    @Field(name = "tweetNotAnalyzed", analyze = Analyze.NO, store = Store.YES)
    // Custom analyzer:
    @Field(
        name = "tweetWithCustom",
        analyzer = @Analyzer(definition = "custom-analyzer")
    )
    private String multipleTweets;
}

You may also reference a built-in Lucene analyzer implementation using the @Analyzer.impl attribute: Hibernate Search will translate the implementation to an equivalent Elasticsearch built-in type, if possible.

Warning

Using the @Analyzer.impl attribute is not recommended with Elasticsearch because it will never allow you to take full advantage of Elasticsearch analysis capabilities. You cannot, for instance, use custom analyzer implementations: only built-in Lucene implementations are supported.

It should only be used when migrating an application that already used Hibernate Search, moving from an embedded Lucene instance to an Elasticsearch cluster.

Example of mapping that refers to a built-in analyzer on Elasticsearch using a Lucene implementation class
@Entity
@Indexed(index = "tweet")
public class Tweet {

    @Id
    @GeneratedValue
    private Integer id;

    @Field
    @Analyzer(impl = EnglishAnalyzer.class) // Elasticsearch built-in "english" analyzer
    private String englishTweet;

    @Field
    @Analyzer(impl = WhitespaceAnalyzer.class) // Elasticsearch built-in "whitespace" analyzer
    private String whitespaceTweet;

}

Custom analyzers

You can also define analyzers within your Hibernate Search mapping using the @AnalyzerDef annotation, like you would do with an embedded Lucene instance. When Hibernate Search creates the Elasticsearch indexes, the relevant definitions will then be automatically added as a custom analyzer in the index settings.

Two different approaches allow you to define your analyzers with Elasticsearch.

The first, recommended approach is to use the factories provided by the hibernate-search-elasticsearch module:

  • org.hibernate.search.elasticsearch.analyzer.ElasticsearchCharFilterFactory

  • org.hibernate.search.elasticsearch.analyzer.ElasticsearchTokenFilterFactory

  • org.hibernate.search.elasticsearch.analyzer.ElasticsearchTokenizerFactory

Those classes can be passed to the factory attribute of the @CharFilterDef, @TokenFilterDef and @TokenizerDef annotations.

The params attribute may be used to define the type parameter and any other parameter accepted by Elasticsearch for this type.

The parameter values will be interpreted as JSON. The parser is not strict, though:

  • quotes around strings may be left out in some cases, as when a string only contains letters.

  • when quotes are required (e.g. your string may be interpreted as a number, and you don’t want that), you may use single quotes instead of double quotes (which are painful to write in Java).

Note

You may use the name attribute of the @CharFilterDef, @TokenFilterDef and @TokenizerDef annotations to define the exact name to give to that definition in the Elasticsearch settings.

Example of mapping that defines analyzers on Elasticsearch using the Elasticsearch*Factory types
@Entity
@Indexed(index = "tweet")
@AnalyzerDef(
	name = "tweet_analyzer",
	charFilters = {
		@CharFilterDef(
			name = "custom_html_strip",
			factory = ElasticsearchCharFilterFactory.class,
			params = {
				@Parameter(name = "type", value = "'html_strip'"),
				// One can use Json arrays
				@Parameter(name = "escaped_tags", value = "['br', 'p']")
			}
		),
		@CharFilterDef(
			name = "p_br_as_space",
			factory = ElasticsearchCharFilterFactory.class,
			params = {
				@Parameter(name = "type", value = "'pattern_replace'"),
				@Parameter(name = "pattern", value = "'<p/?>|<br/?>'"),
				@Parameter(name = "replacement", value = "' '"),
				@Parameter(name = "flags", value = "'CASE_INSENSITIVE'")
			}
		)
	},
	tokenizer = @TokenizerDef(
		factory = ElasticsearchTokenizerFactory.class,
		params = {
			@Parameter(name = "type", value = "'whitespace'"),
		}
	)
)
public class Tweet {

    @Id
    @GeneratedValue
    private Integer id;

    @Field
    @Analyzer(definition = "tweet_analyzer")
    private String content;
}

The second approach is to configure everything as if you were using Lucene: use the Lucene factories, their parameter names, and format the parameter values as required in Lucene. Hibernate Search will automatically convert these definitions to the Elasticsearch equivalent.

Warning

Referencing Lucene factories is not recommended with Elasticsearch because it will never allow you to take full advantage of Elasticsearch analysis capabilities.

Here are the known limitations of the automatic translation:

  • a few factories have unsupported parameters, because those have no equivalent in Elasticsearch. An exception will be raised on startup if a parameter is not supported.

  • the hyphenator parameter for HyphenatedWordsFilterFactory must refer to a file on the Elasticsearch servers, unlike other factories where the files are accessed by Hibernate Search directly. This is due to an Elasticsearch limitation (there is no way to forward the content of a local hyphenation pattern file).

  • some built-in Lucene factories are not (and cannot be) translated, because of incompatible parameters between the Lucene factory and the Elasticsearch equivalent. This is in particular the case for HunspellStemFilterFactory.

Therefore, Lucene factories should only be referenced within analyzer definitions when migrating an application that already used Hibernate Search, moving from an embedded Lucene instance to an Elasticsearch cluster.

Example of mapping that defines analyzers on Elasticsearch using Lucene factories
@Entity
@Indexed(index = "tweet")
@AnalyzerDef(
	name = "tweet_analyzer",
	charFilters = {
		@CharFilterDef(
			name = "custom_html_strip",
			factory = HTMLStripCharFilterFactory.class,
			params = {
				@Parameter(name = "escapedTags", value = "br,p")
			}
		),
		@CharFilterDef(
			name = "p_br_as_space",
			factory = PatternReplaceCharFilterFactory.class,
			params = {
				@Parameter(name = "pattern", value = "<p/?>|<br/?>"),
				@Parameter(name = "replacement", value = " ")
			}
		)
	},
	tokenizer = @TokenizerDef(
		factory = WhitespaceTokenizerFactory.class
	)
)
public class Tweet {

    @Id
    @GeneratedValue
    private Integer id;

    @Field
    @Analyzer(definition = "tweet_analyzer")
    private String content;
}

Custom field bridges

You can write custom field bridges and class bridges. For class bridges and field bridges creating multiple fields, make sure your bridge implementation also implements the MetadataProvidingFieldBridge contract.

Caution

Creating sub-fields in custom field bridges is not supported.

You create a sub-field when your MetadataProvidingFieldBridge registers a field whose name is the name of an existing field, with a dot and another string appended, like name + ".mySubField".

This lack of support is due to Elasticsearch not allowing a field to have multiple types. In the example above, the field would have both the object datatype and whatever datatype the original field has (string in the most common case).

As an alternative, you may append a suffix to the original field name in order to create a sibling field, e.g. use name + "_mySubField" or name + "_more.mySubField" instead of name + ".mySubField".

This limitation is true in particular for field bridges applied to the @DocumentId: fields added to the document must not be in the form name + ".mySubField", in order to avoid mapping conflicts with the ID field.
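The naming rule above can be expressed as two small helpers: one deriving a sibling field name, one detecting a name that would create a sub-field. This is only a sketch of the convention; the BridgeFieldNames class and its methods are hypothetical and not part of Hibernate Search.

```java
/** Hypothetical helpers for naming fields added by a custom bridge. */
final class BridgeFieldNames {

    /** Builds a sibling field name such as name_mySubField. */
    static String sibling(String fieldName, String suffix) {
        return fieldName + "_" + suffix;
    }

    /** Returns true if candidate would be a sub-field of fieldName (unsupported). */
    static boolean isSubFieldOf(String candidate, String fieldName) {
        return candidate.startsWith(fieldName + ".");
    }

    public static void main(String[] args) {
        System.out.println(sibling("name", "mySubField"));
        System.out.println(isSubFieldOf("name.mySubField", "name"));
        System.out.println(isSubFieldOf("name_more.mySubField", "name"));
    }
}
```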

/**
 * Used as class-level bridge for creating the "firstName" and "middleName" document and doc value fields.
 */
public static class FirstAndMiddleNamesFieldBridge implements MetadataProvidingFieldBridge {

    @Override
    public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
        Explorer explorer = (Explorer) value;

        String firstName = explorer.getNameParts().get( "firstName" );
        luceneOptions.addFieldToDocument( name + "_firstName", firstName, document );
        document.add( new SortedDocValuesField( name + "_firstName", new BytesRef( firstName ) ) );

        String middleName = explorer.getNameParts().get( "middleName" );
        luceneOptions.addFieldToDocument( name + "_middleName", middleName, document );
        document.add( new SortedDocValuesField( name + "_middleName", new BytesRef( middleName ) ) );
    }

    @Override
    public void configureFieldMetadata(String name, FieldMetadataBuilder builder) {
        builder
            .field( name + "_firstName", FieldType.STRING )
                .sortable( true )
            .field( name + "_middleName", FieldType.STRING )
                .sortable( true );
    }
}
Note

This interface and FieldBridge in general are likely going to evolve in the next major version of Hibernate Search to remove its adherence to Lucene specific classes like Document.

Tika bridges

If your metadata processors create fields with a different name from the one passed as a parameter, make sure your processor also implements the MetadataProvidingTikaMetadataProcessor contract.

Queries

You can write queries like you usually do in Hibernate Search: native Lucene queries and DSL queries (see [search-query]). Hibernate Search automatically translates the most common types of Apache Lucene queries and all queries generated by the Hibernate Search DSL, except more like this queries (see below).

Note
Unsupported Query DSL features

Queries written via the DSL are expected to work; if one does not, please open a JIRA issue.

The notable exception is more like this queries. Hibernate Search has a more advanced algorithm than Lucene (or Elasticsearch/Solr) which is not easily portable with what Elasticsearch exposes.

If you need this feature, contact us.

On top of translating Lucene queries, you can directly create Elasticsearch queries by using either its String format or a JSON format:

Example 2. Creating an Elasticsearch native query from a string
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromQueryString("title:tales");
List<?> result = fullTextSession.createFullTextQuery(query, ComicBook.class).list();

Example 3. Creating an Elasticsearch native query from JSON
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromJson(
      "{ 'query': { 'match' : { 'lastName' : 'Brand' } } }");
List<?> result = fullTextSession.createFullTextQuery(query, GolfPlayer.class).list();

Caution
Date/time in native Elasticsearch queries

By default Elasticsearch interprets the date/time strings lacking the time zone as if they were represented using the UTC time zone. If overlooked, this can cause your native Elasticsearch queries to be completely off.

The simplest way to avoid issues is to always explicitly provide time zones IDs or offsets when building native Elasticsearch queries. This may be achieved either by directly adding the time zone ID or offset in date strings, or by using the time_zone parameter (range queries only). See Elasticsearch documentation for more information.
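As an illustration of the first option, the sketch below formats the bounds of a range query with an explicit offset before embedding them in the JSON string. The NativeDateRangeQuery class and the publicationDate field are assumptions made for this example; the resulting string is meant to be passed to ElasticsearchQueries.fromJson as in Example 3.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical builder for a native range query with explicit offsets. */
final class NativeDateRangeQuery {

    /** Builds the JSON of a range query whose bounds always carry their offset. */
    static String rangeQuery(String field, OffsetDateTime from, OffsetDateTime to) {
        DateTimeFormatter format = DateTimeFormatter.ISO_OFFSET_DATE_TIME;
        return "{ 'query': { 'range': { '" + field + "': { "
                + "'gte': '" + format.format(from) + "', "
                + "'lt': '" + format.format(to) + "' } } } }";
    }

    public static void main(String[] args) {
        OffsetDateTime from = OffsetDateTime.of(2016, 11, 26, 16, 41, 0, 0, ZoneOffset.ofHours(1));
        // "publicationDate" is a hypothetical field name
        System.out.println(rangeQuery("publicationDate", from, from.plusDays(1)));
    }
}
```

Because the offset is part of each date string, the query no longer depends on the UTC default applied by Elasticsearch.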

Spatial queries

The Elasticsearch integration supports spatial queries by using either the DSL or native Elasticsearch queries.

For regular usage, there are no particular requirements for spatial support.

However, if you want to calculate the distance from your entities to a point without sorting by the distance to this point, you need to enable the Groovy plugin by adding the following snippet to your Elasticsearch configuration:

Enabling Groovy support in your elasticsearch.yml
script.engine.groovy.inline.search: on

Paging and scrolling

You may handle large result sets in two different ways, with different limitations.

For (relatively) smaller result sets, you may use the traditional offset/limit querying provided by the FullTextQuery interfaces: setFirstResult(int) and setMaxResults(int). Limitations:

  • This will only get you as far as the 10000 first documents, i.e. when requesting a window that includes documents beyond the 10000th result, Elasticsearch will return an error. If you want to raise this limit, see the index.max_result_window property in Elasticsearch’s settings.

If your result set is bigger, you may take advantage of scrolling by using the scroll method on org.hibernate.search.FullTextQuery. Limitations:

  • This method is not available in org.hibernate.search.jpa.FullTextQuery.

  • The Elasticsearch implementation has poor performance when an offset has been defined (i.e. setFirstResult(int) has been called on the query before calling scroll()). This is because Elasticsearch does not provide such feature, thus Hibernate Search has to scroll through every previous result under the hood.

  • The Elasticsearch implementation allows only limited backtracking. Calling scrollableResults.setRowNumber(4) when currently positioned at index 1006, for example, may result in a SearchException being thrown if only the 1000 previous elements have been kept in memory. You may work around this by tweaking the property hibernate.search.elasticsearch.scroll_backtracking_window_size (see Elasticsearch integration configuration).

  • The ScrollableResults will become stale and unusable after a given period of time spent without fetching results from Elasticsearch. You may work around this by tweaking two properties: hibernate.search.elasticsearch.scroll_timeout and hibernate.search.elasticsearch.scroll_fetch_size (see Elasticsearch integration configuration). Typically, you will solve timeout issues by reducing the fetch size and/or increasing the timeout limit, but this will also increase the performance hit on Elasticsearch.
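The two numeric limits described above can be summarized in a short sketch: whether an offset/limit request stays within the result window, and whether a backward jump while scrolling stays within the backtracking window. The ResultWindow class is hypothetical, and the checks only restate the limits described in this section; they are not how Hibernate Search or Elasticsearch enforce them.

```java
/** Hypothetical checks mirroring the paging and backtracking limits. */
final class ResultWindow {

    /** Default value of index.max_result_window, as stated in this section. */
    static final int DEFAULT_MAX_RESULT_WINDOW = 10_000;

    /** True if the page [firstResult, firstResult + maxResults) can be served by paging. */
    static boolean fitsInPagingWindow(int firstResult, int maxResults, int maxResultWindow) {
        return (long) firstResult + maxResults <= maxResultWindow;
    }

    /** True if targetRow is among the previous results guaranteed to be kept in memory. */
    static boolean isBacktrackingGuaranteed(int currentRow, int targetRow, int backtrackingWindowSize) {
        return targetRow >= currentRow - backtrackingWindowSize;
    }

    public static void main(String[] args) {
        System.out.println(fitsInPagingWindow(9_990, 10, DEFAULT_MAX_RESULT_WINDOW));
        System.out.println(fitsInPagingWindow(9_991, 10, DEFAULT_MAX_RESULT_WINDOW));
        System.out.println(isBacktrackingGuaranteed(1_006, 4, 1_000));
    }
}
```

When the first check fails, either raise index.max_result_window or switch to scrolling.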

Sorting

Sorting is performed the same way as with the Lucene backend.

If you happen to need an advanced Elasticsearch sorting feature that is not natively supported in SortField or in Hibernate Search sort DSL, you may still create a sort from JSON, and even mix it with DSL-defined sorts:

Example 4. Mixing DSL-defined sorts with native Elasticsearch JSON sorts
QueryBuilder qb = fullTextSession.getSearchFactory()
    .buildQueryBuilder().forEntity(Book.class).get();
Query luceneQuery = /* ... */;
FullTextQuery query = fullTextSession.createFullTextQuery( luceneQuery, Book.class );
Sort sort = qb.sort()
        .byNative( "authors.name", "{'order':'asc', 'mode': 'min'}" )
        .andByField("title")
        .createSort();
query.setSort(sort);
List results = query.list();

Projections

All fields are stored by Elasticsearch in the JSON document it indexes, so there is no specific need to mark fields as stored when you want to project them. The downside is that to project a field, Elasticsearch needs to read the whole JSON document. If you want to avoid that, use the Store.YES marker.

You can also retrieve the full JSON document by using org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.SOURCE.

FullTextQuery query = ftem.createFullTextQuery(
                    qb.keyword()
                    .onField( "tags" )
                    .matching( "round-based" )
                    .createQuery(),
                    VideoGame.class
            )
            .setProjection( ElasticsearchProjectionConstants.SCORE, ElasticsearchProjectionConstants.SOURCE );

Object[] projection = (Object[]) query.getSingleResult();

If you’re looking for information about execution time, you may also use org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TOOK and org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TIMED_OUT:

FullTextQuery query = ftem.createFullTextQuery(
                    qb.keyword()
                    .onField( "tags" )
                    .matching( "round-based" )
                    .createQuery(),
                    VideoGame.class
            )
            .setProjection(
                    ElasticsearchProjectionConstants.SOURCE,
                    ElasticsearchProjectionConstants.TOOK,
                    ElasticsearchProjectionConstants.TIMED_OUT
            );

Object[] projection = (Object[]) query.getSingleResult();
Integer took = (Integer) projection[1]; // Execution time (milliseconds)
Boolean timedOut = (Boolean) projection[2]; // Whether the query timed out

Filters

The Elasticsearch integration supports the definition of full text filters.

Your filters need to implement the ElasticsearchFilter interface.

public class DriversMatchingNameElasticsearchFilter implements ElasticsearchFilter {

    private String name;

    public DriversMatchingNameElasticsearchFilter() {
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String getJsonFilter() {
        return "{ 'term': { 'name': '" + name + "' } }";
    }

}

You can then declare the filter in your entity.

@Entity
@Indexed
@FullTextFilterDef(name = "namedDriver",
  impl = DriversMatchingNameElasticsearchFilter.class)
public class Driver {
    @Id
    @DocumentId
    private int id;

    @Field(analyze = Analyze.YES)
    private String name;

    ...
}

From then on, you can use it as usual.

ftQuery.enableFullTextFilter( "namedDriver" ).setParameter( "name", "liz" );

For static filters, you can simply extend SimpleElasticsearchFilter and provide an Elasticsearch filter in JSON form.
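The parameterized filter shown earlier in this section builds its JSON by concatenating the parameter value directly, so a value containing a quote or a backslash would produce malformed JSON. One possible precaution is to escape the value first. The sketch below is a standalone illustration, not part of the Hibernate Search API: the class JsonFilterSketch and its methods are hypothetical, and it emits strict double-quoted JSON rather than the single-quoted form used in the examples.

```java
// Hypothetical helper class, used for illustration only.
final class JsonFilterSketch {

    // Escapes a value so that it can be placed inside a double-quoted JSON string:
    // quotes and backslashes are backslash-escaped, control characters are
    // written as four-digit hexadecimal escapes.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for ( char c : value.toCharArray() ) {
            if ( c == '"' || c == '\\' ) {
                sb.append( '\\' ).append( c );
            }
            else if ( c < 0x20 ) {
                sb.append( String.format( "\\u%04x", (int) c ) );
            }
            else {
                sb.append( c );
            }
        }
        return sb.toString();
    }

    // Builds a term filter for the given field and value, escaping both.
    static String termFilter(String field, String value) {
        return "{ \"term\": { \"" + escape( field ) + "\": \"" + escape( value ) + "\" } }";
    }
}
```

A getJsonFilter() implementation could then return JsonFilterSketch.termFilter( "name", name ) instead of concatenating the raw value.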

Index optimization

The optimization features documented in [search-optimize] are only partially implemented. That kind of optimization is rarely needed with recent versions of Lucene (on which Elasticsearch is based), but some of it is still provided for the very specific case of indexes meant to stay read-only for a long period of time:

  • The automatic optimization is not implemented and most probably never will be.

  • The manual optimization (searchFactory.optimize()) is implemented.

Logging executed requests

Search queries are logged to the org.hibernate.search.fulltext_query category at DEBUG level, as when using an embedded Lucene instance (the query format is Elasticsearch’s, though).

In addition, you can enable the logging of every single request sent to the Elasticsearch cluster by enabling TRACE logging for the log category org.hibernate.search.elasticsearch.request.
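How these categories are enabled depends on the logging framework in use. As a minimal sketch, assuming that Hibernate Search logging ends up in java.util.logging and that DEBUG and TRACE correspond to the FINE and FINER levels there (both are assumptions to verify for your setup), the two categories could be enabled programmatically as follows. The class name RequestLoggingSketch is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class, used for illustration only.
final class RequestLoggingSketch {

    // Strong references are kept on purpose: java.util.logging only holds weak
    // references to loggers, so a configured logger could otherwise be
    // garbage-collected and lose its level.
    static final Logger QUERY_LOGGER =
            Logger.getLogger( "org.hibernate.search.fulltext_query" );
    static final Logger REQUEST_LOGGER =
            Logger.getLogger( "org.hibernate.search.elasticsearch.request" );

    static void enable() {
        // Assumed java.util.logging counterpart of DEBUG.
        QUERY_LOGGER.setLevel( Level.FINE );
        // Assumed java.util.logging counterpart of TRACE.
        REQUEST_LOGGER.setLevel( Level.FINER );
    }
}
```

Note that java.util.logging handlers have their own level (the console handler defaults to INFO), so a handler accepting these finer levels is also needed before the messages show up. With Log4j or Logback, the equivalent is a logger entry for the same category names in the framework's own configuration file.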

Limitations

Not everything is implemented yet. Here is a list of known limitations.

Please check JIRA and the mailing lists for updates. At the time of writing, at least the following features are known not to work yet:

  • Query timeouts: HSEARCH-2399

  • MoreLikeThis queries: HSEARCH-2395

  • @IndexedEmbedded.indexNullAs: HSEARCH-2389

  • Statistics: HSEARCH-2421

  • @AnalyzerDiscriminator: HSEARCH-2428

  • Mixing Lucene-based indexes and Elasticsearch-based indexes (partial support is available, though)

  • Hibernate Search does not make use of nested objects or parent-child relationship mapping: HSEARCH-2263. This is largely mitigated by the fact that Hibernate Search does the denormalization itself and maintains data consistency when nested objects are updated.

  • There is room for improvement in the performance of the MassIndexer implementation

  • Our new Elasticsearch integration module does not work in OSGi environments. If you need this, please vote for: HSEARCH-2524.

Known bugs in Elasticsearch

Depending on the Elasticsearch version you use, you may encounter bugs that are specific to that version. Here is a list of known Elasticsearch bugs, and what to do about them.

Acknowledgment

More information about Elasticsearch can be found on the Elasticsearch website and its reference documentation.