In order to define a Lucene backend, the hibernate.search.backends.<backend name>.type property must be set to lucene.
All other configuration properties are optional, but the defaults might not suit everyone. In particular, you might want to set the location of your indexes in the filesystem.
Other configuration properties are mentioned in the relevant parts of this documentation. You can find a full reference of available properties in the Hibernate Search javadoc:
The component responsible for index storage in Lucene is the org.apache.lucene.store.Directory.
The implementation of the directory determines where the index will be stored: on the filesystem, in the JVM’s heap, …
By default, the Lucene backend stores the indexes on the filesystem, in the JVM’s working directory.
The type of directory is set at the backend level:
hibernate.search.backends.<backend-name>.directory.type = local-filesystem
The following directory types are available:
- local-filesystem: Store the index on the local filesystem. See Local filesystem storage for details and configuration options.
- local-heap: Store the index in the local JVM heap. Local heap directories and all contained indexes are lost when the JVM shuts down. See Local heap storage for details and configuration options.
The local-filesystem directory type will store each index in a subdirectory of a configured filesystem directory.
Note
|
Local filesystem directories really are designed to be local to one server and one application. In particular, they should not be shared between multiple Hibernate Search instances.
Even if network shares allow sharing the raw content of indexes, using the same index files from multiple Hibernate Search instances would require more than that: non-exclusive locking, routing of write requests from one node to another, …
These additional features are simply not available on local-filesystem directories.
If you need to share indexes between multiple Hibernate Search instances, the Elasticsearch backend will be a better choice. Refer to [architecture] for more information. |
Each index is assigned a subdirectory under a root directory.
By default, the root directory is the JVM’s working directory. It can be configured at the backend level:
hibernate.search.backends.<backend-name>.directory.root = /path/to/my/root
For example, with the configuration above, an entity of type com.mycompany.Order will be indexed in directory /path/to/my/root/com.mycompany.Order/.
If that entity is explicitly assigned the index name "orders" (see @Indexed(index = …) in [mapper-orm-entityindexmapping]), it will instead be indexed in directory /path/to/my/root/orders/.
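The mapping from index name to directory described above can be sketched with plain java.nio.file calls. This is only an illustration of the layout; the class and method names (IndexDirectories, resolveIndexDirectory) are hypothetical and not part of Hibernate Search.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only: not a Hibernate Search API.
final class IndexDirectories {

    private IndexDirectories() {
    }

    // Each index gets a subdirectory named after the index, directly under the root.
    static Path resolveIndexDirectory(String root, String indexName) {
        return Paths.get(root).resolve(indexName);
    }
}
```

With a root of /path/to/my/root, calling IndexDirectories.resolveIndexDirectory(root, "orders") yields a path whose last element is orders and whose parent is the root.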
The default strategy for accessing the filesystem is determined automatically based on the operating system and architecture. It should work well in most situations.
For situations where a different filesystem access strategy is needed, Hibernate Search exposes a configuration property at the backend level:
hibernate.search.backends.<backend-name>.directory.filesystem_access.strategy = auto (default)
Allowed values are:
- auto (default): lets Lucene select the most appropriate implementation based on the operating system and architecture.
- simple: a straightforward strategy based on Files.newByteChannel. See org.apache.lucene.store.SimpleFSDirectory.
- mmap: uses mmap for reading, and FSDirectory.FSIndexOutput for writing. See org.apache.lucene.store.MMapDirectory.
- nio: uses java.nio.channels.FileChannel's positional read for concurrent reading, and FSDirectory.FSIndexOutput for writing. See org.apache.lucene.store.NIOFSDirectory.
Note
|
Make sure to refer to Javadocs of these |
The local-filesystem directory also allows configuring a locking strategy.
The local-heap directory type will store indexes in the local JVM’s heap.
As a result, indexes contained in a local-heap directory are lost when the JVM shuts down.
This directory type is only provided for use in testing configurations with small indexes and low concurrency, where it could slightly improve performance. In setups requiring larger indexes and/or high concurrency, a filesystem-based directory will achieve better performance.
The local-heap directory does not offer any specific option beyond the locking strategy.
In order to write to an index, Lucene needs to acquire a lock to ensure no other application instance writes to the same index concurrently. Each directory type comes with a default locking strategy that should work well enough in most situations.
For those (very) rare situations where a different locking strategy is needed, Hibernate Search exposes a configuration property at the backend level:
hibernate.search.backends.<backend-name>.directory.locking.strategy = native-filesystem
The following strategies are available:
- simple-filesystem: Locks the index by creating a marker file and checking it before write operations. This implementation is very simple and based on Java’s File API. If for some reason an application ends abruptly, the marker file will stay on the filesystem and will need to be removed manually. This strategy is only available for filesystem-based directories. See org.apache.lucene.store.SimpleFSLockFactory.
- native-filesystem: Similarly to simple-filesystem, locks the index by creating a marker file, but using native OS file locks instead of Java’s File API, so that locks will be cleaned up if the application ends abruptly. This is the default strategy for the local-filesystem directory type. This implementation has known problems with NFS: it should be avoided on network shares. This strategy is only available for filesystem-based directories. See org.apache.lucene.store.NativeFSLockFactory.
- single-instance: Locks using a Java object held in the JVM’s heap. Since the lock is only accessible by the same JVM, this strategy will only work properly when it is known that only a single application will ever try to access the indexes. This is the default strategy for the local-heap directory type. See org.apache.lucene.store.SingleInstanceLockFactory.
- none: Does not use any lock. Concurrent writes from another application will result in index corruption. Test your application carefully and make sure you know what it means. See org.apache.lucene.store.NoLockFactory.
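To make the marker-file idea behind simple-filesystem more concrete, here is a minimal sketch using only the Java standard library. It is a simplified illustration, not Lucene's SimpleFSLockFactory: the class name MarkerFileLock and the marker file name are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified illustration of marker-file locking; not Lucene's SimpleFSLockFactory.
final class MarkerFileLock {

    private final Path marker;

    MarkerFileLock(Path indexDirectory) {
        // "write.lock" is an assumed marker file name for this sketch.
        this.marker = indexDirectory.resolve("write.lock");
    }

    // Returns true if the lock was acquired, false if the marker file already exists.
    boolean tryAcquire() throws IOException {
        try {
            Files.createFile(marker); // atomic: fails if the file already exists
            return true;
        }
        catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // If the application ends abruptly before this is called, the marker file stays behind.
    void release() throws IOException {
        Files.deleteIfExists(marker);
    }
}
```

The sketch also shows the weakness mentioned above: nothing removes the marker file if the process dies before release() runs, which is the problem native OS file locks avoid.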
In the Lucene backend, sharding is disabled by default, but can be enabled by selecting a sharding strategy at the index level. Multiple strategies are available:
hash
hibernate.search.backends.<backend name>.indexes.<index name>.sharding.strategy = hash
hibernate.search.backends.<backend name>.indexes.<index name>.sharding.number_of_shards = 2 (no default)
# OR
hibernate.search.backends.<backend name>.index_defaults.sharding.strategy = hash
hibernate.search.backends.<backend name>.index_defaults.sharding.number_of_shards = 2 (no default)
The hash strategy requires setting a number of shards through the number_of_shards property.
This strategy will set up an explicitly configured number of shards, numbered from 0 to the chosen number minus one (e.g. for 2 shards, there will be shard "0" and shard "1").
When routing, the routing key will be hashed to assign it to a shard. If the routing key is null, the document ID will be used instead.
This strategy is suitable when there is no explicit routing key configured in the mapping, or when the routing key has a large number of possible values that need to be brought down to a smaller number (e.g. "all integers").
explicit
hibernate.search.backends.<backend name>.indexes.<index name>.sharding.strategy = explicit
hibernate.search.backends.<backend name>.indexes.<index name>.sharding.shard_identifiers = fr,en,de (no default)
# OR
hibernate.search.backends.<backend name>.index_defaults.sharding.strategy = explicit
hibernate.search.backends.<backend name>.index_defaults.sharding.shard_identifiers = fr,en,de (no default)
The explicit strategy requires setting a list of shard identifiers through the shard_identifiers property. The identifiers must be provided as a String containing multiple shard identifiers separated by commas, or a Collection<String> containing shard identifiers. A shard identifier can be any string.
This strategy will set up one shard per configured shard identifier.
When routing, the routing key will be validated to make sure it matches a shard identifier exactly. If it does, the document will be routed to that shard. If it does not, an exception will be thrown. The routing key cannot be null, and the document ID will be ignored.
This strategy is suitable when there is an explicit routing key configured in the mapping, and that routing key has a limited number of possible values that are known before starting the application.
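The routing rules of the two strategies can be modeled in a few lines of Java. This is a sketch under stated assumptions, not Hibernate Search's implementation: the class ShardRouting is hypothetical, String.hashCode() stands in for whatever hash function is actually used, and the exception type is chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative model of the two sharding strategies; not Hibernate Search's implementation.
final class ShardRouting {

    private ShardRouting() {
    }

    // "hash" strategy: hash the routing key (or the document ID when the key is null),
    // then map the hash to a shard numbered from 0 to numberOfShards - 1.
    // String.hashCode() is an assumption; the actual hash function may differ.
    static int hashShard(String routingKey, String documentId, int numberOfShards) {
        String hashed = routingKey != null ? routingKey : documentId;
        return Math.floorMod(hashed.hashCode(), numberOfShards);
    }

    // "explicit" strategy: the routing key must match a configured shard identifier exactly.
    // The document ID plays no role here.
    static String explicitShard(String routingKey, String shardIdentifiers) {
        Set<String> identifiers = new LinkedHashSet<>(Arrays.asList(shardIdentifiers.split(",")));
        if (routingKey == null || !identifiers.contains(routingKey)) {
            throw new IllegalArgumentException("Unknown shard identifier: " + routingKey);
        }
        return routingKey;
    }
}
```

For instance, ShardRouting.explicitShard("fr", "fr,en,de") returns "fr", while a routing key of "es" or null is rejected with an exception.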
While Hibernate Search strives to offer a backwards compatible API, making it easy to port your application to newer versions, it still delegates to Apache Lucene to handle the index writing and searching. This creates a dependency on the Lucene index format. The Lucene developers of course attempt to keep a stable index format, but sometimes a change in the format cannot be avoided. In those cases you either have to re-index all your data or use an index upgrade tool. Sometimes, Lucene is also able to read the old format so you don’t need to take specific actions (besides making a backup of your index).
While an index format incompatibility is a rare event, it is more common for Lucene’s Analyzer implementations to slightly change their behavior between versions. This can lead to some documents not matching anymore, even though they used to.
To avoid this analyzer incompatibility, Hibernate Search lets you configure which version of Lucene the analyzers and other Lucene classes should conform to.
This configuration property is set at the backend level:
hibernate.search.backends.<backend-name>.lucene_version = LUCENE_8_1_1
Depending on the specific version of Lucene you’re using, you might have different options available: see org.apache.lucene.util.Version contained in lucene-core.jar for a list of allowed values.
When this option is not set, Hibernate Search will instruct Lucene to use the latest version, which is usually the best option for new projects. Still, it’s recommended to define the version you’re using explicitly in the configuration, so that when you happen to upgrade Lucene, the analyzers will not change behavior. You can then choose to update this value at a later time, for example when you have the chance to rebuild the index from scratch.
Note
|
The setting will be applied consistently when using Hibernate Search APIs, but if you are also making use of Lucene bypassing Hibernate Search (for example when instantiating an Analyzer yourself), make sure to use the same value. |
Lucene does not really have a concept of centralized schema to specify the data type and capabilities of each field, but Hibernate Search maintains such a schema in memory, in order to remember which predicates/projections/sorts can be applied to each field.
For the most part, the schema is inferred from the mapping configured through Hibernate Search’s mapping APIs, which are generic and independent from Lucene.
Aspects that are specific to the Lucene backend are explained in this section.
Note
|
Some types are not supported directly by the Lucene backend,
but will work anyway because they are "bridged" by the mapper.
For example a |
Note
|
Field types that are not in this list can still be used with a little bit more work:
|
Field type | Limitations |
---|---|
java.lang.String | - |
java.lang.Byte | - |
java.lang.Short | - |
java.lang.Integer | - |
java.lang.Long | - |
java.lang.Double | - |
java.lang.Float | - |
java.lang.Boolean | - |
java.math.BigDecimal | - |
java.math.BigInteger | - |
java.time.Instant | |
java.time.LocalDate | |
java.time.LocalTime | |
java.time.LocalDateTime | |
java.time.ZonedDateTime | |
java.time.OffsetDateTime | |
java.time.OffsetTime | |
java.time.Year | |
java.time.YearMonth | |
java.time.MonthDay | - |
org.hibernate.search.engine.spatial.GeoPoint | - |
Note
|
Range and resolution of date/time fields
Date/time types do not support the whole range of years that can be represented in
Values that are out of range will trigger indexing failures. Resolution for time types is also lower:
Precision beyond the millisecond will be lost when indexing. |
Not all Lucene field types have built-in support in Hibernate Search.
Unsupported field types can still be used, however, by taking advantage of the "native" field type.
Using this field type, Lucene IndexableField instances can be created directly, giving access to everything Lucene can offer.
Below is an example of how to use the Lucene "native" type.
link:{sourcedir}/org/hibernate/search/documentation/backend/lucene/type/asnative/PageRankValueBinder.java[role=include]
- Define a custom binder and its bridge. The "native" type can only be used from a binder; it cannot be used directly with annotation mapping. Here we’re defining a value binder, but a type binder or a property binder would work as well.
- Get the context’s type factory.
- Apply the Lucene extension to the type factory.
- Call asNative to start defining a native type.
- Define the field value type.
- Define the LuceneFieldContributor. The contributor will be called upon indexing to add as many fields as necessary to the document. All fields must be named after the absoluteFieldPath passed to the contributor.
- Optionally, if projections are necessary, define the LuceneFieldValueExtractor. The extractor will be called upon projecting to extract the projected value from a single stored field.
- The value bridge is free to apply a preliminary conversion before passing the value to Hibernate Search, which will pass it along to the LuceneFieldContributor.
link:{sourcedir}/org/hibernate/search/documentation/backend/lucene/type/asnative/WebPage.java[role=include]
- Map the property to an index field. Note that value bridges using a non-standard field type (such as Lucene’s "native" type) must be mapped using the @NonStandardField annotation: other annotations such as @GenericField will fail.
- Instruct Hibernate Search to use our custom value binder.
Multi-tenancy is supported and handled transparently, according to the tenant ID defined in the current session:
- documents will be indexed with the appropriate values, allowing later filtering;
- queries will filter results appropriately.
However, a strategy must be selected in order to enable multi-tenancy.
The multi-tenancy strategy is set at the backend level:
hibernate.search.backends.<backend name>.multi_tenancy.strategy = none (default)
See the following subsections for details about available strategies.
The none strategy (the default) disables multi-tenancy completely.
Attempting to set a tenant ID will lead to a failure when indexing.
With the discriminator strategy, all documents from all tenants are stored in the same index.
When indexing, a discriminator field holding the tenant ID is populated transparently for each document.
When searching, a filter targeting the tenant ID field is added transparently to the search query to only return search hits for the current tenant.
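The discriminator approach can be pictured with a toy in-memory index. The sketch below is purely illustrative and does not use Lucene or Hibernate Search: the class DiscriminatorIndex and its methods are hypothetical, and substring matching stands in for a real query.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of the discriminator strategy; not Hibernate Search's implementation.
final class DiscriminatorIndex {

    // A document plus the discriminator field holding its tenant ID.
    private static final class Entry {
        final String tenantId;
        final String text;

        Entry(String tenantId, String text) {
            this.tenantId = tenantId;
            this.text = text;
        }
    }

    // Documents from all tenants share the same index.
    private final List<Entry> entries = new ArrayList<>();

    // Indexing: the tenant ID is stored alongside the document.
    void index(String tenantId, String text) {
        entries.add(new Entry(tenantId, text));
    }

    // Searching: a tenant filter is combined with the caller's query.
    List<String> search(String tenantId, String term) {
        List<String> hits = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.tenantId.equals(tenantId) && entry.text.contains(term)) {
                hits.add(entry.text);
            }
        }
        return hits;
    }
}
```

Two tenants indexing documents that both contain the same term will each only see their own document when searching for that term.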
Analysis is the text processing performed by analyzers, both when indexing (document processing) and when searching (query processing).
To configure analysis in a Lucene backend, you will need to:
- Define a class that implements the org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer interface.
- Configure the backend to use that implementation by setting the configuration property hibernate.search.backends.<backend name>.analysis.configurer to a bean reference pointing to the implementation.
Hibernate Search will call the configure method of this implementation on startup, and the configurer will be able to take advantage of a DSL to define analyzers:
link:{sourcedir}/org/hibernate/search/documentation/analysis/MyLuceneAnalysisConfigurer.java[role=include]
- Define a custom analyzer named "english", because it will be used to analyze English text such as book titles.
- Set the tokenizer to a standard tokenizer: components are referenced by their factory class.
- Set the char filters. Char filters are applied in the order they are given, before the tokenizer.
- Set the token filters. Token filters are applied in the order they are given, after the tokenizer.
- Set the value of a parameter for the last added char filter/tokenizer/token filter.
- Normalizers are defined in a similar way, the only difference being that they cannot use a tokenizer.
- Multiple analyzers/normalizers can be defined in the same configurer.
link:{resourcesdir}/analysis/lucene-simple.properties[role=include]
- Assign the configurer to the backend myBackend using a Hibernate Search configuration property.
It is also possible to assign a name to a built-in analyzer, or a custom analyzer implementation:
link:{sourcedir}/org/hibernate/search/documentation/analysis/AdvancedLuceneAnalysisConfigurer.java[role=include]
Tip
|
To know which analyzers, character filters, tokenizers and token filters are available, either browse the Lucene Javadoc or read the corresponding section on the Solr Wiki (you don’t need Solr to use these analyzers, it’s just that there is no documentation page for Lucene proper). |
The Lucene backend relies on an internal thread pool to execute write operations on the index.
By default, the pool contains exactly as many threads as the number of processors available to the JVM on bootstrap. That can be changed using a configuration property:
hibernate.search.backends.<backend-name>.thread_pool.size = 4
Note
|
This number is per backend, not per index. Adding more indexes will not add more threads. |
Tip
|
Operations happening in this thread-pool include blocking I/O, so raising its size above the number of processor cores available to the JVM can make sense and may improve performance. |
Among all the write operations performed by Hibernate Search on the indexes, it is expected that there will be a lot of "indexing" operations to create/update/delete a specific document. We generally want to preserve the relative order of these requests when they are about the same documents.
For this reason, Hibernate Search pushes these operations to internal queues and applies them in batches. Each index maintains 10 queues holding at most 1000 elements each. Queues operate independently (in parallel), but each queue applies one operation after the other, so at any given time there can be at most 10 batches of indexing requests being applied for each index.
Note
|
Indexing operations relative to the same document ID are always pushed to the same queue. |
It is possible to customize the queues in order to reduce resource consumption, or on the contrary to improve throughput. This is done through the following configuration properties, at the index level:
hibernate.search.backends.<backend name>.indexes.<index name>.indexing.queue_count = 10 (default)
hibernate.search.backends.<backend name>.indexes.<index name>.indexing.queue_size = 1000 (default)
# OR
hibernate.search.backends.<backend name>.index_defaults.indexing.queue_count = 10 (default)
hibernate.search.backends.<backend name>.index_defaults.indexing.queue_size = 1000 (default)
- indexing.queue_count defines the number of queues. Expects a strictly positive integer value. Higher values will lead to more indexing operations being performed in parallel, which may lead to higher indexing throughput if CPU power is the bottleneck when indexing. Note that raising this number above the number of threads is never useful, as the number of threads limits how many queues can be processed in parallel.
- indexing.queue_size defines the maximum number of elements each queue can hold. Expects a strictly positive integer value. Lower values may lead to lower memory usage, especially if there are many queues, but values that are too low will increase the likelihood of application threads blocking because the queue is full, which may lead to lower indexing throughput.
Tip
|
When a queue is full, any attempt to request indexing will block until the request can be put into the queue. In order to achieve a reasonable level of performance, be sure to set the size of queues to a high enough number that this kind of blocking only happens when the application is under very high load. |
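The interplay between queue count, queue size and per-document ordering can be modeled with standard bounded queues. This is a simplified sketch, not Hibernate Search's implementation: the class IndexingQueues is hypothetical, String.hashCode() is an assumed hash, and the submission is non-blocking so that a full queue is easy to observe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified model of per-index indexing queues; not Hibernate Search's implementation.
final class IndexingQueues {

    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    IndexingQueues(int queueCount, int queueSize) {
        for (int i = 0; i < queueCount; i++) {
            queues.add(new ArrayBlockingQueue<>(queueSize));
        }
    }

    // Operations on the same document ID always map to the same queue,
    // which preserves their relative order. String.hashCode() is an assumption.
    int queueIndexFor(String documentId) {
        return Math.floorMod(documentId.hashCode(), queues.size());
    }

    // Non-blocking variant for demonstration: returns false when the queue is full.
    // A blocking submission would call put() instead and wait for free space.
    boolean trySubmit(String documentId, String operation) {
        return queues.get(queueIndexFor(documentId)).offer(operation);
    }
}
```

With a single queue of size 2, the third submission is refused until something drains the queue; in a blocking setup that is the point where the application thread would wait.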
Tip
|
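When sharding is enabled, each shard will be assigned its own set of queues.
If you use the hash sharding strategy based on the document ID, the number of queues should not share a common divisor with the number of shards, otherwise some queues may be used much less than others. For example, if you set the number of shards to 8 and the number of queues to 4, documents ending up in the shard #0 will always end up in queue #0 of that shard.
That’s because both the routing to a shard and the routing to a queue take the hash of the document ID then apply a modulo operation to that hash, and a hash that is a multiple of 8 (routed to shard #0) is also a multiple of 4 (routed to queue #0 of that shard). |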
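The arithmetic behind this tip can be checked with a short experiment. The sketch below assumes that both routings are a plain modulo on the same hash value, as described above; the class ShardQueueCorrelation is hypothetical and uses consecutive integers as stand-ins for document ID hashes.

```java
// Illustrates why shard count and queue count should not share a common divisor
// when both are derived from the same document ID hash. Assumes a plain modulo on the hash.
final class ShardQueueCorrelation {

    private ShardQueueCorrelation() {
    }

    // Counts how many distinct queues are used by documents routed to shard #0.
    static int queuesUsedInShardZero(int numberOfShards, int numberOfQueues, int sampleSize) {
        boolean[] used = new boolean[numberOfQueues];
        int distinct = 0;
        for (int hash = 0; hash < sampleSize; hash++) {
            if (Math.floorMod(hash, numberOfShards) != 0) {
                continue; // not routed to shard #0
            }
            int queue = Math.floorMod(hash, numberOfQueues);
            if (!used[queue]) {
                used[queue] = true;
                distinct++;
            }
        }
        return distinct;
    }
}
```

With 8 shards and 4 queues, queuesUsedInShardZero(8, 4, 1000) returns 1: only queue #0 ever receives work in shard #0. With 8 shards and 3 queues, queuesUsedInShardZero(8, 3, 1000) returns 3, so all queues are used.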
In Lucene terminology, a commit is when changes buffered in an index writer are pushed to the index itself, so that a crash or power loss will no longer result in data loss.
Some operations are critical and are always committed before they are considered complete. This is the case for changes triggered by automatic indexing (unless configured otherwise), and also for large-scale operations such as a purge. When such an operation is encountered, a commit will be performed immediately, guaranteeing that the operation is only considered complete after all changes are safely stored on disk.
Other operations, however, are not expected to be committed immediately.
This is the case for changes contributed by the mass indexer, or by automatic indexing when using the async synchronization strategy.
Performance-wise, committing may be an expensive operation, which is why Hibernate Search tries not to commit too often. By default, when changes that do not require an immediate commit are applied to the index, Hibernate Search will delay the commit by one second. If other changes are applied during that second, they will be included in the same commit. This dramatically reduces the number of commits in write-intensive scenarios (e.g. mass indexing), leading to much better performance.
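The effect of delaying commits can be estimated with a small model. This is a simplified sketch, not Hibernate Search's scheduling code: the class CommitScheduler is hypothetical, and it assumes that a commit is scheduled one interval after the first uncommitted change and covers every change applied before it runs.

```java
// Simplified model of delayed commits; not Hibernate Search's implementation.
final class CommitScheduler {

    private CommitScheduler() {
    }

    // Given the times (in ms, ascending) at which changes are applied, returns how many
    // commits occur when each commit is delayed by commitInterval ms after the first
    // uncommitted change.
    static int countCommits(long[] changeTimes, long commitInterval) {
        int commits = 0;
        boolean pending = false;
        long scheduledCommit = 0;
        for (long time : changeTimes) {
            if (pending && time >= scheduledCommit) {
                commits++; // the scheduled commit ran before this change
                pending = false;
            }
            if (!pending) {
                pending = true;
                scheduledCommit = time + commitInterval;
            }
        }
        return pending ? commits + 1 : commits;
    }
}
```

For changes applied at 0, 100, 200 and 1500 ms, an interval of 1000 ms gives 2 commits in this model, whereas an interval of 0 gives 4, one per change.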
It is possible to control exactly how often Hibernate Search will commit by setting the commit interval (in milliseconds) at the index level:
hibernate.search.backends.<backend name>.indexes.<index name>.io.commit_interval = 1000 (default)
# OR
hibernate.search.backends.<backend name>.index_defaults.io.commit_interval = 1000 (default)
Warning
|
Setting the commit interval to 0 will force Hibernate Search to commit after every batch of changes, which may result in a much lower throughput, both for automatic indexing and mass indexing. |
Note
|
Remember that individual write operations may force a commit, which may cancel out the potential performance gains from setting a higher commit interval. By default, the commit interval may only improve throughput of the mass indexer. If you want changes triggered by automatic indexing to benefit from it too, you will need to select a non-default synchronization strategy, so as not to require a commit after each change. |
In Lucene terminology, a refresh is when a new index reader is opened, so that the next search queries will take into account the latest changes to the index.
Performance-wise, refreshing may be an expensive operation, which is why Hibernate Search tries not to refresh too often. The index reader is refreshed upon every search query, but only if writes have occurred since the last refresh.
In write-intensive scenarios where refreshing after each write is still too frequent, it is possible to refresh less frequently and thus improve read throughput by setting a refresh interval in milliseconds. When set to a value higher than 0, the index reader will no longer be refreshed upon every search query: if, when a search query starts, the refresh occurred less than X milliseconds ago, then the index reader will not be refreshed, even though it may be out-of-date.
The refresh interval is set at the index level:
hibernate.search.backends.<backend name>.indexes.<index name>.io.refresh_interval = 0 (default)
# OR
hibernate.search.backends.<backend name>.index_defaults.io.refresh_interval = 0 (default)
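The refresh rules described in this section boil down to a small decision made when a search query starts. The sketch below is a simplified model under stated assumptions, not Hibernate Search's implementation: the class RefreshPolicy is hypothetical and takes the current time as a parameter to keep the logic easy to follow.

```java
// Simplified model of the refresh decision; not Hibernate Search's implementation.
final class RefreshPolicy {

    private final long refreshIntervalMillis;
    private long lastRefreshMillis;
    private boolean writesSinceLastRefresh;

    RefreshPolicy(long refreshIntervalMillis, long nowMillis) {
        this.refreshIntervalMillis = refreshIntervalMillis;
        this.lastRefreshMillis = nowMillis;
    }

    void onWrite() {
        writesSinceLastRefresh = true;
    }

    // Called when a search query starts; returns true if the index reader is refreshed.
    boolean onSearch(long nowMillis) {
        if (!writesSinceLastRefresh) {
            return false; // nothing new to see
        }
        if (refreshIntervalMillis > 0 && nowMillis - lastRefreshMillis < refreshIntervalMillis) {
            return false; // refreshed recently: accept a possibly out-of-date reader
        }
        lastRefreshMillis = nowMillis;
        writesSinceLastRefresh = false;
        return true;
    }
}
```

With an interval of 0, every search that follows a write triggers a refresh. With an interval of 1000, a search 500 ms after the last refresh reuses the existing reader even if writes occurred in between.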
Lucene’s IndexWriter, used by Hibernate Search to write to indexes, exposes several settings that can be tweaked to better fit your application, and ultimately get better performance.
Hibernate Search exposes these settings through configuration properties prefixed with io.writer., at the index level.
Below is a list of all index writer settings.
They can be set through configuration properties, at the index level.
For example, io.writer.ram_buffer_size can be set like this:
hibernate.search.backends.<backend name>.indexes.<index name>.io.writer.ram_buffer_size = 32
# OR
hibernate.search.backends.<backend name>.index_defaults.io.writer.ram_buffer_size = 32
IndexWriter
Property | Description |
---|---|
|
The maximum number of documents that can be buffered in-memory before they are flushed to the Directory. Large values mean faster indexing, but more RAM usage. When used together with |
|
The maximum amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Large values mean faster indexing, but more RAM usage. Generally for faster indexing performance it’s best to use this setting rather than When used together with |
|
Enables low level trace information about Lucene’s internal components; Logs will be appended to the logger This may cause significant performance degradation, even if the logger ignores the TRACE level, so this should only be used for troubleshooting purposes. Disabled by default. |
Tip
|
Refer to Lucene’s documentation, in particular the javadoc and source code of |
A Lucene index is not stored in a single, continuous file. Instead, each flush to the index will generate a small file containing all the documents added to the index. This file is called a "segment". Search can be slower on an index with too many segments, so Lucene regularly merges small segments to create fewer, larger segments.
Lucene’s merge behavior is controlled through a MergePolicy.
Hibernate Search uses the LogByteSizeMergePolicy, which exposes several settings that can be tweaked to better fit your application, and ultimately get better performance.
Below is a list of all merge settings.
They can be set through configuration properties, at the index level.
For example, io.merge.factor can be set like this:
hibernate.search.backends.<backend name>.indexes.<index name>.io.merge.factor = 10
# OR
hibernate.search.backends.<backend name>.index_defaults.io.merge.factor = 10
Property | Description |
---|---|
|
The maximum number of documents that a segment can have before merging. Segments with more than this number of documents will not be merged. Smaller values perform better on frequently changing indexes, larger values provide better search performance if the index does not change often. |
|
The number of segments that are merged at once. With smaller values, merging happens more often and thus uses more resources,
but the total number of segments will be lower on average, increasing read performance.
Thus, larger values ( The value must not be lower than |
|
The minimum target size of segments, in MB, for background merges. Segments smaller than this size are merged more aggressively. Setting this too large might result in expensive merge operations, even though they are less frequent. |
|
The maximum size of segments, in MB, for background merges. Segments larger than this size are never merged in the background. Setting this to a lower value helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. When forcefully merging an index, this value is ignored and |
|
The maximum size of segments, in MB, for forced merges. This is the equivalent of |
|
Whether the number of deleted documents in an index should be taken into account; When enabled, Lucene will consider that a segment with 100 documents, 50 of which are deleted, actually contains 50 documents. When disabled, Lucene will consider that such a segment contains 100 documents. Setting |
Note
|
Refer to Lucene’s documentation, in particular the javadoc and source code of |
Tip
|
The options First, consider that merging a segment is about adding it together with another existing segment to form a larger one.
Second, merge options do not affect the size of segments initially created by the index writer, before they are merged.
This size can be limited with the setting So, for example, to be fairly confident no file grows larger than 15MB, use something like this:
|