Caution
|
This feature is a work in progress. Read this section carefully! |
The integration with Elasticsearch is in development and should be considered experimental. We do think we have the basics covered and we are looking for feedback.
Patches can be sent as pull requests to the Github repository, but also general feedback, suggestions and questions are very welcome. To get in touch or find other interesting links for contributors, see the Hibernate Community.
The goal of integrating with Elasticsearch is to allow Hibernate Search users to benefit from the full-text capabilities integrated with Hibernate ORM but replacing the local Lucene based index with a remote Elasticsearch service.
There could be various reasons to prefer this over an "embedded Lucene" approach:
-
wish to separate the service running your application from the Search service
-
integrate with an existing Elasticsearch instance
-
benefit from Elasticsearch’s out of the box horizontal scalability features
-
explore the data updated by an Hibernate powered application using the Elasticsearch dashboard integrations such as Kibana
There are a couple drawbacks compared to the embedded Lucene approach though:
-
incur a performance penalty of remote RPCs both for index updates and to run queries
-
need to buy and manage additional servers
Which solution is best will depend on the specific needs of your system.
Note
|
Why not use Elasticsearch directly
The #1 reason is that Hibernate Search integrates perfectly with Hibernate ORM. All changes done to your objects will trigger the necessary index changes transparently.
There is no more paradigm shift in your code. You are working on Hibernate ORM managed objects, doing your queries on object properties with a nice DSL, |
To experiment with the Elasticsearch integration you will have to download Elasticsearch and run it: Hibernate Search connects to an Elasticsearch node but does not provide one.
One option is to use the Elasticsearch Docker image.
docker pull elasticsearch
docker run -p 9200:9200 -d -v "$PWD/plugin_dir":/usr/share/elasticsearch/plugins \
-v "$PWD/config/elasticsearch.yml":/usr/share/elasticsearch/config/elasticsearch.yml \
elasticsearch
Note
|
Elasticsearch version
Hibernate Search expects an Elasticsearch node version 2.0 at least. Hibernate Search internal tests run against Elasticsearch 2.3. |
In addition to the usual dependencies like Hibernate ORM and Hibernate Search,
you will need the new hibernate-search-elasticsearch
jar.
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-search-elasticsearch</artifactId>
<version>{hibernateSearchVersion}</version>
</dependency>
Configuration is minimal.
Add them to your persistence.xml
or where you put the rest of your Hibernate Search configuration.
- Select Elasticsearch as the backend
-
hibernate.search.default.indexmanager elasticsearch
- Hostname and port for Elasticsearch
-
hibernate.search.default.elasticsearch.host http://127.0.0.1:9200
- Selects the index creation strategy
-
hibernate.search.default.elasticsearch.index_schema_management_strategy CREATE
- Maximum time to wait for the indexes to become available before failing (in ms)
-
hibernate.search.default.elasticsearch.index_management_wait_timeout 10000
- Status an index must at least have in order for Hibernate Search to work with it (one of "green", "yellow" or "red")
-
hibernate.search.default.elasticsearch.required_index_status green
- Whether to perform an explicit refresh after a set of operations has been executed against a specific index (
true
orfalse
) -
hibernate.search.default.elasticsearch.refresh_after_write false
- When scrolling, the minimum number of previous results kept in memory at any time
-
hibernate.search.elasticsearch.scroll_backtracking_window_size
- When scrolling, the number of results fetched by each Elasticsearch call
-
hibernate.search.elasticsearch.scroll_fetch_size
- When scrolling, the maximum duration
ScrollableResults
will be usable if no other results are fetched from Elasticsearch, in seconds -
hibernate.search.elasticsearch.scroll_timeout
Let’s see the options for the index_schema_management_strategy
property:
Value | Definition |
---|---|
|
Indexes will not be created nor deleted. |
|
Missing indexes and mappings will be created, mappings will be updated if there are no conflicts. |
|
The default: existing indexes will not be altered, missing indexes will be created. |
|
Indexed will be deleted if existing and then created. This will delete all content from the index! |
|
Similarly to |
Note that all properties prefixed with hibernate.search.default
(besides host
) can be given globally as shown above and/or be given for specific indexes:
hibernate.search.someindex.elasticsearch.index_schema_management_strategy MERGE
.
There is no specific configuration required on the Elasticsearch side.
However there are a few features that would benefit from a few changes:
-
you can only refer to analyzers that are already registered on Elasticsearch, see [elasticsearch-mapping-analyzer]
-
if you want to retrieve the distance in a geolocation query, install and enable the
lang-groovy
plugin, see [elasticsearch-query-spatial] -
if you want to be able to use the purge all Hibernate Search command, install the
delete-by-query
plugin -
if you want to use paging (as opposed to scrolling) on result sets larger than 10000 elements (for instance access the 10001st result), you may increase the value of the
index.max_result_window
property (default is 10000).
Like in Lucene embedded mode, indexes are transparently updated when you create or update entities mapped to Hibernate Search. Simply use familiar annotations from [search-mapping].
The name of the index will be the lowercased name provided to @Indexed
(non qualified class name by default).
Hibernate Search will map the fully qualified class name to the Elasticsearch type.
The org.hibernate.search.annotations.Field
annotation allows you to provide a replacement value for null properties through the indexNullAs
attribute (see [field-annotation]), but this value must be provided as a string.
In order for your value to be understood by Hibernate Search (and Elasticsearch), the provided string must follow one of those formats:
-
For string values, no particular format is required.
-
For numeric values, use formats accepted by
Double.parseDouble
,Integer.parseInteger
, etc., depending on the actual type of your field. -
For booleans, use either
true
orfalse
. -
For dates, use the ISO-8601 format (
yyyy-MM-dd’T’HH:mm:ssZ
, for instance2016-08-26T16:41:00+01:00
). The time and time zone may be omitted (if omitted, the time zone will be interpreted as the default JVM time zone).
Caution
|
Analyzers are treated differently than in Lucene embedded mode. |
Using the definition
attribute in the @Analyzer
annotation, you can refer to the name of the
built-in or custom analyzers registered on your Elasticsearch instances.
This will work as long as there is not a local analyzer defined with the same name. If that’s the case, Hibernate Search will ignore the analyzer when Elasticsearch is used as backend and log a warning. This might sound complicated but we are looking for ways to ease the experience.
More information on analyzers, in particular the already defined ones, can be found in the Elasticsearch documentation.
# Custom analyzer
index.analysis:
analyzer.custom-analyzer:
type: custom
tokenizer: standard
filter: [custom-filter, lowercase]
filter.custom-filter:
type : stop
stopwords : [test1, close]
From there, you can use the custom analyzers by name in your entity mappings.
@Entity
@Indexed(index = "tweet")
public static class Tweet {
@Id
@GeneratedValue
private Integer id;
@Field
@Analyzer(definition = "english") // Elasticsearch built-in analyzer
private String englishTweet;
@Field
@Analyzer(definition = "whitespace") // Elasticsearch built-in analyzer
private String whitespaceTweet;
@Fields({
@Field(name = "tweetNotAnalyzed", analyzer = Analyze.NO, store = Store.YES),
// Custom analyzer
@Field(
name = "tweetWithCustom",
analyzer = @Analyzer(definition = "custom-analyzer"))})
private String multipleTweets;
}
You can write custom field bridges and class bridges.
For class bridges and field bridges creating multiple fields,
make sure to make your bridge implementation also implement the MetadataProvidingFieldBridge
contract.
/**
* Used as class-level bridge for creating the "firstName" and "middleName" document and doc value fields.
*/
public static class FirstAndMiddleNamesFieldBridge implements MetadataProvidingFieldBridge {
@Override
public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
Explorer explorer = (Explorer) value;
String firstName = explorer.getNameParts().get( "firstName" );
luceneOptions.addFieldToDocument( name + "_firstName", firstName, document );
document.add( new SortedDocValuesField( name + "_firstName", new BytesRef( firstName ) ) );
String middleName = explorer.getNameParts().get( "middleName" );
luceneOptions.addFieldToDocument( name + "_middleName", middleName, document );
document.add( new SortedDocValuesField( name + "_middleName", new BytesRef( middleName ) ) );
}
@Override
public void configureFieldMetadata(String name, FieldMetadataBuilder builder) {
builder
.field( name + "_firstName", FieldType.STRING )
.sortable( true )
.field( name + "_middleName", FieldType.STRING )
.sortable( true );
}
}
Note
|
This interface and |
You can write queries like you usually do in Hibernate Search: native Lucene queries and DSL queries (see [search-query]). We do automatically translate the most common types of Apache Lucene queries and many of the queries generated by the Hibernate Search DSL.
Note
|
Unsupported Query DSL features
Queries written via the DSL work. Open a JIRA otherwise. Notable exceptions are:
These are temporary limitations, if you need these features, contact us. |
On top of translating Lucene queries, you can directly create Elasticsearch queries by using either its String format or a JSON format:
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromQueryString("title:tales");
List<?> result = fullTextSession.createFullTextQuery(query, ComicBook.class).list();
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromJson(
"{ 'query': { 'match' : { 'lastName' : 'Brand' } } }");
List<?> result = session.createFullTextQuery(query, GolfPlayer.class).list();
Caution
|
Date/time in native Elasticsearch queries
By default Elasticsearch interprets the date/time strings lacking the time zone as if they were represented using the UTC time zone. If overlooked, this can cause your native Elasticsearch queries to be completely off. The simplest way to avoid issues is to always explicitly provide time zones IDs or offsets when building native Elasticsearch queries. This may be achieved either by directly adding the time zone ID or offset in date strings, or by using the |
The Elasticsearch integration supports spatial queries by using either the DSL or native Elasticsearch queries.
For regular usage, there are no particular requirements for spatial support.
However, if you want to calculate the distance from your entities to a point without sorting by the distance to this point, you need to enable the Groovy plugin by adding the following snippet to your Elasticsearch configuration:
script.engine.groovy.inline.search: on
You may handle large result sets in two different ways, with different limitations.
For (relatively) smaller result sets, you may use the traditional offset/limit querying provided by the FullTextQuery
interfaces: setFirstResult(int)
and setMaxResults(int)
.
Limitations:
-
This will only get you as far as the 10000 first documents, i.e. when requesting a window that includes documents beyond the 10000th result, Elasticsearch will return an error. If you want to raise this limit, see the
index.max_result_window
property in Elasticsearch’s settings.
If your result set is bigger, you may take advantage of scrolling by using the scroll
method on org.hibernate.search.FullTextQuery
.
Limitations:
-
This method is not available in
org.hibernate.search.jpa.FullTextQuery
. -
The Elasticsearch implementation has poor performance when an offset has been defined (i.e.
setFirstResult(int)
has been called on the query before callingscroll()
). This is because Elasticsearch does not provide such feature, thus Hibernate Search has to scroll through every previous result under the hood. -
The Elasticsearch implementation allows only limited backtracking. Calling
scrollableResylts.setRowNumber(4)
when currently positioned at index1006
, for example, may result in aSearchException
being thrown, because only 1000 previous elements had been kept in memory. You may work this around by tweaking the property:hibernate.search.elasticsearch.scroll_backtracking_window_size
(see [elasticsearch-integration-configuration]). -
The
ScrollableResults
will become stale and unusable after a given period of time spent without fetching results from Elasticsearch. You may work this around by tweaking two properties:hibernate.search.elasticsearch.scroll_timeout
andhibernate.search.elasticsearch.scroll_fetch_size
(see [elasticsearch-integration-configuration]). Typically, you will solve timeout issues by reducing the fetch size and/or increasing the timeout limit, but this will also increase the performance hit on Elasticsearch.
Sorting is performed the same way as with the Lucene backend.
If you happen to need an advanced Elasticsearch sorting feature that is not natively supported in SortField
or in Hibernate Search sort DSL, you may still create a sort from JSON, and even mix it with DSL-defined sorts:
QueryBuilder qb = fullTextSession.getSearchFactory()
.buildQueryBuilder().forEntity(Book.class).get();
Query luceneQuery = /* ... */;
FullTextQuery query = s.createFullTextQuery( luceneQuery, Book.class );
Sort sort = qb.sort()
.byNative( "authors.name", "{'order':'asc', 'mode': 'min'}" )
.andByField("title")
.createSort();
query.setSort(sort);
List results = query.list();
All fields are stored by Elasticsearch in the JSON document it indexes,
there is no specific need to mark fields as stored when you want to project them.
The downside is that to project a field, Elasticsearch needs to read the whole JSON document.
If you want to avoid that, use the Store.YES
marker.
You can also retrieve the full JSON document by using org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.SOURCE
.
query = ftem.createFullTextQuery(
qb.keyword()
.onField( "tags" )
.matching( "round-based" )
.createQuery(),
VideoGame.class
)
.setProjection( ElasticsearchProjectionConstants.SCORE, ElasticsearchProjectionConstants.SOURCE );
projection = (Object[]) query.getSingleResult();
If you’re looking for information about execution time, you may also use org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TOOK
and org.hibernate.search.elasticsearch.ElasticsearchProjectionConstants.TIMED_OUT
:
query = ftem.createFullTextQuery(
qb.keyword()
.onField( "tags" )
.matching( "round-based" )
.createQuery(),
VideoGame.class
)
.setProjection(
ElasticsearchProjectionConstants.SOURCE,
ElasticsearchProjectionConstants.TOOK,
ElasticsearchProjectionConstants.TIMED_OUT
);
projection = (Object[]) query.getSingleResult();
Integer took = (Integer) projection[1]; // Execution time (milliseconds)
Boolean timedOut = (Boolean) projection[2]; // Whether the query timed out
The Elasticsearch integration supports the definition of full text filters.
Your filters need to implement the ElasticsearchFilter
interface.
public class DriversMatchingNameElasticsearchFilter implements ElasticsearchFilter {
private String name;
public DriversMatchingNameElasticsearchFilter() {
}
public void setName(String name) {
this.name = name;
}
@Override
public String getJsonFilter() {
return "{ 'term': { 'name': '" + name + "' } }";
}
}
You can then declare the filter in your entity.
@Entity
@Indexed
@FullTextFilterDefs({
@FullTextFilterDef(name = "namedDriver",
impl = DriversMatchingNameElasticsearchFilter.class)
})
public class Driver {
@Id
@DocumentId
private int id;
@Field(analyze = Analyze.YES)
private String name;
...
}
From then you can use it as usual.
ftQuery.enableFullTextFilter( "namedDriver" ).setParameter( "name", "liz" );
For static filters, you can simply extend the SimpleElasticsearchFilter
and provide an Elasticsearch filter in JSON form.
The optimization features documented in [search-optimize] are only partially implemented. That kind of optimization is rarely needed with recent versions of Lucene (on which Elasticsearch is based), but some of it is still provided for the very specific case of indexes meant to stay read-only for a long period of time:
-
The automatic optimization is not implemented and most probably never will be.
-
The manual optimization (
searchFactory.optimize()
) is implemented.
Not everything is implemented yet. Here is a list of known limitations.
Please check with JIRA and the mailing lists for updates, but at the time of writing this at least the following features are known to not work yet:
-
Defining analyzers with
@AnalyzerDef
, analyzers have to be defined in the Elasticsearch configuration -
Query timeouts
-
MoreLikeThis queries
-
Mixing Lucene based indexes and Elasticsearch based indexes (partial support is here though)
-
There is room for improvements in the performances of the MassIndexer implementation