Skip to content

Latest commit

 

History

History
493 lines (374 loc) · 21 KB

elasticsearch-integration.asciidoc

File metadata and controls

493 lines (374 loc) · 21 KB

Integration with Elasticsearch

Status

Caution

This feature is a work in progress. Read this section carefully!

The integration with Elasticsearch is in development and should be considered experimental. We do think we have the basics covered and we are looking for feedback.

Patches can be sent as pull requests to the Github repository, but also general feedback, suggestions and questions are very welcome. To get in touch or find other interesting links for contributors, see the Hibernate Community.

Goal of the Elasticsearch integration

The goal of integrating with Elasticsearch is to allow Hibernate Search users to benefit from the full-text capabilities integrated with Hibernate ORM but replacing the local Lucene based index with a remote Elasticsearch service.

There could be various reasons to prefer this over an "embedded Lucene" approach:

  • wish to separate the service running your application from the Search service

  • integrate with an existing Elasticsearch instance

  • benefit from Elasticsearch’s out of the box horizontal scalability features

  • explore the data updated by an Hibernate powered application using the Elasticsearch dashboard integrations such as Kibana

There are a couple drawbacks compared to the embedded Lucene approach though:

  • incur a performance penalty of remote RPCs both for index updates and to run queries

  • need to buy and manage additional servers

Which solution is best will depend on the specific needs of your system.

Note
Why not use Elasticsearch directly

The #1 reason is that Hibernate Search integrates perfectly with Hibernate ORM. All changes done to your objects will trigger the necessary index changes transparently.

  • it will honor the transaction boundary - i.e. not do the indexing work if the transaction ends up in rollback

  • changes to cascaded objects are handled

  • changes to nested object embedded in a root indexed entity are handled

  • changes will be sent in bulk - i.e. optimized systematically for you

  • etc.

There is no more paradigm shift in your code. You are working on Hibernate ORM managed objects, doing your queries on object properties with a nice DSL,

Getting started and configuration

To experiment with the Elasticsearch integration you will have to download Elasticsearch and run it: Hibernate Search connects to an Elasticsearch node but does not provide one.

One option is to use the Elasticsearch Docker image.

Start an Elasticsearch node via Docker
docker pull elasticsearch
docker run -p 9200:9200 -d -v "$PWD/plugin_dir":/usr/share/elasticsearch/plugins \
    -v "$PWD/config/elasticsearch.yml":/usr/share/elasticsearch/config/elasticsearch.yml \
    elasticsearch
Note
Elasticsearch version

Hibernate Search expects an Elasticsearch node version 2.0 at least. Hibernate Search internal tests run against Elasticsearch 2.3.

Dependencies in your Java application

In addition to the usual dependencies like Hibernate ORM and Hibernate Search, you will need the new hibernate-search-elasticsearch jar.

Example 1. Maven dependencies for Hibernate Search with Elasticsearch
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-search-elasticsearch</artifactId>
   <version>{hibernateSearchVersion}</version>
</dependency>

Configuration

Configuration is minimal. Add them to your persistence.xml or where you put the rest of your Hibernate Search configuration.

Select Elasticsearch as the backend

hibernate.search.default.indexmanager elasticsearch

Hostname and port for Elasticsearch

hibernate.search.default.elasticsearch.host http://127.0.0.1:9200

Selects the index creation strategy

hibernate.search.default.elasticsearch.index_schema_management_strategy CREATE

Maximum time to wait for the indexes to become available before failing (in ms)

hibernate.search.default.elasticsearch.index_management_wait_timeout 10000

Status an index must at least have in order for Hibernate Search to work with it (one of "green", "yellow" or "red")

hibernate.search.default.elasticsearch.required_index_status green

Whether to perform an explicit refresh after a set of operations has been executed against a specific index (true or false)

hibernate.search.default.elasticsearch.refresh_after_write false

When scrolling, the minimum number of previous results kept in memory at any time

hibernate.search.elasticsearch.scroll_backtracking_window_size

When scrolling, the number of results fetched by each Elasticsearch call

hibernate.search.elasticsearch.scroll_fetch_size

When scrolling, the maximum duration ScrollableResults will be usable if no other results are fetched from Elasticsearch, in seconds

hibernate.search.elasticsearch.scroll_timeout

Let’s see the options for the index_schema_management_strategy property:

Value Definition

NONE

Indexes will not be created nor deleted.

MERGE

Missing indexes and mappings will be created, mappings will be updated if there are no conflicts.

CREATE

The default: existing indexes will not be altered, missing indexes will be created.

RECREATE

Indexed will be deleted if existing and then created. This will delete all content from the index!

RECREATE_DELETE

Similarly to RECREATE but will also delete the index at shutdown. Commonly used for tests.

Note that all properties prefixed with hibernate.search.default (besides host) can be given globally as shown above and/or be given for specific indexes: hibernate.search.someindex.elasticsearch.index_schema_management_strategy MERGE.

Elasticsearch configuration

There is no specific configuration required on the Elasticsearch side.

However there are a few features that would benefit from a few changes:

  • you can only refer to analyzers that are already registered on Elasticsearch, see [elasticsearch-mapping-analyzer]

  • if you want to retrieve the distance in a geolocation query, install and enable the lang-groovy plugin, see [elasticsearch-query-spatial]

  • if you want to be able to use the purge all Hibernate Search command, install the delete-by-query plugin

  • if you want to use paging (as opposed to scrolling) on result sets larger than 10000 elements (for instance access the 10001st result), you may increase the value of the index.max_result_window property (default is 10000).

Mapping and indexing

Like in Lucene embedded mode, indexes are transparently updated when you create or update entities mapped to Hibernate Search. Simply use familiar annotations from [search-mapping].

The name of the index will be the lowercased name provided to @Indexed (non qualified class name by default). Hibernate Search will map the fully qualified class name to the Elasticsearch type.

Annotation specificities

Field.indexNullAs

The org.hibernate.search.annotations.Field annotation allows you to provide a replacement value for null properties through the indexNullAs attribute (see [field-annotation]), but this value must be provided as a string.

In order for your value to be understood by Hibernate Search (and Elasticsearch), the provided string must follow one of those formats:

  • For string values, no particular format is required.

  • For numeric values, use formats accepted by Double.parseDouble, Integer.parseInteger, etc., depending on the actual type of your field.

  • For booleans, use either true or false.

  • For dates, use the ISO-8601 format (yyyy-MM-dd’T’HH:mm:ssZ, for instance 2016-08-26T16:41:00+01:00). The time and time zone may be omitted (if omitted, the time zone will be interpreted as the default JVM time zone).

Analyzers

Caution
Analyzers are treated differently than in Lucene embedded mode.

Using the definition attribute in the @Analyzer annotation, you can refer to the name of the built-in or custom analyzers registered on your Elasticsearch instances.

This will work as long as there is not a local analyzer defined with the same name. If that’s the case, Hibernate Search will ignore the analyzer when Elasticsearch is used as backend and log a warning. This might sound complicated but we are looking for ways to ease the experience.

More information on analyzers, in particular the already defined ones, can be found in the Elasticsearch documentation.

Example of custom analyzers defined in the elasticsearch.yml
# Custom analyzer
index.analysis:
  analyzer.custom-analyzer:
    type: custom
    tokenizer: standard
    filter: [custom-filter, lowercase]
  filter.custom-filter:
    type : stop
    stopwords : [test1, close]

From there, you can use the custom analyzers by name in your entity mappings.

Example of mapping that refers to custom and built-in analyzers on Elasticsearch
@Entity
@Indexed(index = "tweet")
public static class Tweet {

    @Id
    @GeneratedValue
    private Integer id;

    @Field
    @Analyzer(definition = "english") // Elasticsearch built-in analyzer
    private String englishTweet;

    @Field
    @Analyzer(definition = "whitespace") // Elasticsearch built-in analyzer
    private String whitespaceTweet;

    @Fields({
        @Field(name = "tweetNotAnalyzed", analyzer = Analyze.NO, store = Store.YES),

        // Custom analyzer
        @Field(
            name = "tweetWithCustom",
            analyzer = @Analyzer(definition = "custom-analyzer"))})
    private String multipleTweets;
}

Custom field bridges

You can write custom field bridges and class bridges. For class bridges and field bridges creating multiple fields, make sure to make your bridge implementation also implement the MetadataProvidingFieldBridge contract.

/**
 * Used as class-level bridge for creating the "firstName" and "middleName" document and doc value fields.
 */
public static class FirstAndMiddleNamesFieldBridge implements MetadataProvidingFieldBridge {

    @Override
    public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
        Explorer explorer = (Explorer) value;

        String firstName = explorer.getNameParts().get( "firstName" );
        luceneOptions.addFieldToDocument( name + "_firstName", firstName, document );
        document.add( new SortedDocValuesField( name + "_firstName", new BytesRef( firstName ) ) );

        String middleName = explorer.getNameParts().get( "middleName" );
        luceneOptions.addFieldToDocument( name + "_middleName", middleName, document );
        document.add( new SortedDocValuesField( name + "_middleName", new BytesRef( middleName ) ) );
    }

    @Override
    public void configureFieldMetadata(String name, FieldMetadataBuilder builder) {
        builder
            .field( name + "_firstName", FieldType.STRING )
                .sortable( true )
            .field( name + "_middleName", FieldType.STRING )
                .sortable( true );
    }
}
Note

This interface and FieldBridge in general are likely going to evolve in the next major version of Hibernate Search to remove its adherence to Lucene specific classes like Document.

Queries

You can write queries like you usually do in Hibernate Search: native Lucene queries and DSL queries (see [search-query]). We do automatically translate the most common types of Apache Lucene queries and many of the queries generated by the Hibernate Search DSL.

Note
Unsupported Query DSL features

Queries written via the DSL work. Open a JIRA otherwise.

Notable exceptions are:

  • more like this queries (the advanced algorithm used by Hibernate Search is not ported yet)

These are temporary limitations, if you need these features, contact us.

On top of translating Lucene queries, you can directly create Elasticsearch queries by using either its String format or a JSON format:

Example 2. Creating an Elasticsearch native query from a string
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromQueryString("title:tales");
List<?> result = fullTextSession.createFullTextQuery(query, ComicBook.class).list();
Example 3. Creating an Elasticsearch native query from JSON
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromJson(
      "{ 'query': { 'match' : { 'lastName' : 'Brand' } } }");
List<?> result = session.createFullTextQuery(query, GolfPlayer.class).list();
Caution
Date/time in native Elasticsearch queries

By default Elasticsearch interprets the date/time strings lacking the time zone as if they were represented using the UTC time zone. If overlooked, this can cause your native Elasticsearch queries to be completely off.

The simplest way to avoid issues is to always explicitly provide time zones IDs or offsets when building native Elasticsearch queries. This may be achieved either by directly adding the time zone ID or offset in date strings, or by using the time_zone parameter (range queries only). See Elasticsearch documentation for more information.

Spatial queries

The Elasticsearch integration supports spatial queries by using either the DSL or native Elasticsearch queries.

For regular usage, there are no particular requirements for spatial support.

However, if you want to calculate the distance from your entities to a point without sorting by the distance to this point, you need to enable the Groovy plugin by adding the following snippet to your Elasticsearch configuration:

Enabling Groovy support in your elasticsearch.yml
script.engine.groovy.inline.search: on

Paging and scrolling

You may handle large result sets in two different ways, with different limitations.

For (relatively) smaller result sets, you may use the traditional offset/limit querying provided by the FullTextQuery interfaces: setFirstResult(int) and setMaxResults(int). Limitations:

  • This will only get you as far as the 10000 first documents, i.e. when requesting a window that includes documents beyond the 10000th result, Elasticsearch will return an error. If you want to raise this limit, see the index.max_result_window property in Elasticsearch’s settings.

If your result set is bigger, you may take advantage of scrolling by using the scroll method on org.hibernate.search.FullTextQuery. Limitations:

  • This method is not available in org.hibernate.search.jpa.FullTextQuery.

  • The Elasticsearch implementation has poor performance when an offset has been defined (i.e. setFirstResult(int) has been called on the query before calling scroll()). This is because Elasticsearch does not provide such feature, thus Hibernate Search has to scroll through every previous result under the hood.

  • The Elasticsearch implementation allows only limited backtracking. Calling scrollableResylts.setRowNumber(4) when currently positioned at index 1006, for example, may result in a SearchException being thrown, because only 1000 previous elements had been kept in memory. You may work this around by tweaking the property: hibernate.search.elasticsearch.scroll_backtracking_window_size (see [elasticsearch-integration-configuration]).

  • The ScrollableResults will become stale and unusable after a given period of time spent without fetching results from Elasticsearch. You may work this around by tweaking two properties: hibernate.search.elasticsearch.scroll_timeout and hibernate.search.elasticsearch.scroll_fetch_size (see [elasticsearch-integration-configuration]). Typically, you will solve timeout issues by reducing the fetch size and/or increasing the timeout limit, but this will also increase the performance hit on Elasticsearch.

Sorting

Sorting is performed the same way as with the Lucene backend.

If you happen to need an advanced Elasticsearch sorting feature that is not natively supported in SortField or in Hibernate Search sort DSL, you may still create a sort from JSON, and even mix it with DSL-defined sorts:

Example 4. Mixing DSL-defined sorts with native Elasticsearch JSON sorts
QueryBuilder qb = fullTextSession.getSearchFactory()
    .buildQueryBuilder().forEntity(Book.class).get();
Query luceneQuery = /* ... */;
FullTextQuery query = s.createFullTextQuery( luceneQuery, Book.class );
Sort sort = qb.sort()
        .byNative( "authors.name", "{'order':'asc', 'mode': 'min'}" )
        .andByField("title")
        .createSort();
query.setSort(sort);
List results = query.list();

Projections

All fields are stored by Elasticsearch in the JSON document it indexes, there is no specific need to mark fields as stored when you want to project them. The downside is that to project a field, Elasticsearch needs to read the whole JSON document. If you want to avoid that, use the Store.YES marker.

You can also retrieve the full JSON document by using org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.SOURCE.

query = ftem.createFullTextQuery(
                    qb.keyword()
                    .onField( "tags" )
                    .matching( "round-based" )
                    .createQuery(),
                    VideoGame.class
            )
            .setProjection( ElasticsearchProjectionConstants.SCORE, ElasticsearchProjectionConstants.SOURCE );

projection = (Object[]) query.getSingleResult();

If you’re looking for information about execution time, you may also use org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TOOK and org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TIMED_OUT:

query = ftem.createFullTextQuery(
                    qb.keyword()
                    .onField( "tags" )
                    .matching( "round-based" )
                    .createQuery(),
                    VideoGame.class
            )
            .setProjection(
                    ElasticsearchProjectionConstants.SOURCE,
                    ElasticsearchProjectionConstants.TOOK,
                    ElasticsearchProjectionConstants.TIMED_OUT
            );

projection = (Object[]) query.getSingleResult();
Integer took = (Integer) projection[1]; // Execution time (milliseconds)
Boolean timedOut = (Boolean) projection[2]; // Whether the query timed out

Filters

The Elasticsearch integration supports the definition of full text filters.

Your filters need to implement the ElasticsearchFilter interface.

public class DriversMatchingNameElasticsearchFilter implements ElasticsearchFilter {

    private String name;

    public DriversMatchingNameElasticsearchFilter() {
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String getJsonFilter() {
        return "{ 'term': { 'name': '" + name + "' } }";
    }

}

You can then declare the filter in your entity.

@Entity
@Indexed
@FullTextFilterDefs({
        @FullTextFilterDef(name = "namedDriver",
                impl = DriversMatchingNameElasticsearchFilter.class)
})
public class Driver {
    @Id
    @DocumentId
    private int id;

    @Field(analyze = Analyze.YES)
    private String name;

    ...
}

From then you can use it as usual.

ftQuery.enableFullTextFilter( "namedDriver" ).setParameter( "name", "liz" );

For static filters, you can simply extend the SimpleElasticsearchFilter and provide an Elasticsearch filter in JSON form.

Index optimization

The optimization features documented in [search-optimize] are only partially implemented. That kind of optimization is rarely needed with recent versions of Lucene (on which Elasticsearch is based), but some of it is still provided for the very specific case of indexes meant to stay read-only for a long period of time:

  • The automatic optimization is not implemented and most probably never will be.

  • The manual optimization (searchFactory.optimize()) is implemented.

Limitations

Not everything is implemented yet. Here is a list of known limitations.

Please check with JIRA and the mailing lists for updates, but at the time of writing this at least the following features are known to not work yet:

  • Defining analyzers with @AnalyzerDef, analyzers have to be defined in the Elasticsearch configuration

  • Query timeouts

  • MoreLikeThis queries

  • Mixing Lucene based indexes and Elasticsearch based indexes (partial support is here though)

  • There is room for improvements in the performances of the MassIndexer implementation