Hibernate Search provides a JSR-352 job to perform mass indexing. It not only covers the existing functionality of the mass indexer described above, but also benefits from some powerful standard features of the Java Batch Platform (JSR-352), such as failure recovery using checkpoints, chunk-oriented processing, and parallel execution. This batch job accepts one or more entity types as input, loads the relevant entities from the database, then rebuilds the full-text index from these.
However, it requires a batch runtime to execute. Note that Hibernate Search does not provide a batch runtime: you are free to choose one that fits your needs, e.g. the default batch runtime embedded in your Java EE container. We provide full integration with the JBeret implementation (see how to configure it here). Other implementations can also be used, but will require a bit more configuration on your side.
If the runtime is JBeret, you need to add the following dependency:
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm-batch-jsr352-jberet</artifactId>
    <version>{hibernateSearchVersion}</version>
</dependency>
For any other runtime, you need to add the following dependency:
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm-batch-jsr352-core</artifactId>
    <version>{hibernateSearchVersion}</version>
</dependency>
Here is an example of how to run a batch instance of the JSR-352 mass-indexing job. The listing comes from `{sourcedir}/org/hibernate/search/documentation/mapper/orm/indexing/HibernateOrmBatchJsr352IT.java` and performs the following steps:
- Start building parameters for a mass-indexing job.
- Define some parameters. In this case, the list of the entity types to be indexed.
- Get the `JobOperator` from the framework.
- Start the job.
The following table contains all the job parameters you can use to customize the mass-indexing job.
| Parameter Name | Builder Method | Requirement | Default value | Description |
|---|---|---|---|---|
| | | Required | - | The entity types to index in this job execution, comma-separated. |
| | | Optional | True | Specify whether the existing index should be purged at the beginning of the job. This operation takes place before indexing. |
| | | Optional | True | Specify whether the mass indexer should merge segments at the beginning of the job. This operation takes place after the purge operation and before indexing. |
| | | Optional | True | Specify whether the mass indexer should merge segments at the end of the job. This operation takes place after indexing. |
| | | Optional | | Specify the Hibernate |
| | | Optional | 1000 | Specifies the fetch size to be used when loading primary keys. Some databases accept special values, for example MySQL might benefit from using |
| | | Optional | The value of | Specifies the fetch size to be used when loading entities from database. Some databases accept special values, for example MySQL might benefit from using |
| | | Optional | - | Use HQL / JPQL to index entities of a target entity type. Your query should contain only one entity type. Mixing this approach with the criteria restriction is not allowed. Note that there’s no query validation for your input. See Indexing mode for more detail and limitations. |
| | | Optional | - | The maximum number of results to load per entity type. This parameter lets you define a threshold value to avoid loading too many entities accidentally. The value defined must be greater than 0. The parameter is not used by default. It is equivalent to keyword |
| `rowsPerPartition` | `rowsPerPartition(int)` | Optional | 20,000 | The maximum number of rows to process per partition. The value defined must be greater than 0, and equal to or greater than the value of `checkpointInterval`. |
| `maxThreads` | `maxThreads(int)` | Optional | The number of partitions | The maximum number of threads to use for processing the job. Note that the batch runtime cannot guarantee that the requested number of threads is available; it will use as many as it can, up to the requested maximum. |
| `checkpointInterval` | `checkpointInterval(int)` | Optional | 2,000, or the value of | The number of entities to process before triggering a checkpoint. The value defined must be greater than 0, and equal to or less than the value of `rowsPerPartition`. |
| `sessionClearInterval` | | Optional | 200, or the value of | The number of entities to process before clearing the session. The value defined must be greater than 0, and equal to or less than the value of `checkpointInterval`. |
| `entityManagerFactoryReference` | | Required if there’s more than one persistence unit | - | The string that will identify the `EntityManagerFactory`. |
| | | - | - | |
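The numeric constraints listed in the table can be checked before a job is submitted. The following is a minimal sketch, not part of Hibernate Search: `JobParameterCheck` is a hypothetical helper, and it assumes that the parameters are available as `java.util.Properties` entries keyed by the parameter names used in this section (`rowsPerPartition`, `checkpointInterval`, `sessionClearInterval`). The way an unset value falls back to a smaller default is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, for illustration only: it is not part of Hibernate Search.
final class JobParameterCheck {

    /** Returns a list of problems; an empty list means the values are consistent. */
    static List<String> check(Properties params) {
        List<String> problems = new ArrayList<>();
        int rowsPerPartition = intValue(params, "rowsPerPartition", 20_000);
        // Assumption: an unset interval uses its documented default,
        // capped by the value of the parameter that must enclose it.
        int checkpointInterval = intValue(params, "checkpointInterval",
                Math.min(2_000, rowsPerPartition));
        int sessionClearInterval = intValue(params, "sessionClearInterval",
                Math.min(200, checkpointInterval));

        if (rowsPerPartition <= 0) {
            problems.add("rowsPerPartition must be greater than 0");
        }
        if (checkpointInterval <= 0) {
            problems.add("checkpointInterval must be greater than 0");
        }
        if (sessionClearInterval <= 0) {
            problems.add("sessionClearInterval must be greater than 0");
        }
        if (checkpointInterval > rowsPerPartition) {
            problems.add("checkpointInterval must not exceed rowsPerPartition");
        }
        if (sessionClearInterval > checkpointInterval) {
            problems.add("sessionClearInterval must not exceed checkpointInterval");
        }
        return problems;
    }

    private static int intValue(Properties params, String key, int defaultValue) {
        String raw = params.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties params = new Properties();
        params.setProperty("rowsPerPartition", "500");
        params.setProperty("checkpointInterval", "1000"); // larger than the partition size
        System.out.println(check(params));
    }
}
```

Running such a check up front is only a convenience: it reports inconsistent values before the job is handed to the batch runtime.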
The mass indexing job lets you choose which entities are indexed. You can start a full indexing or a partial indexing through two different methods: selecting the desired entity types, or using HQL.
The following example shows the use of the `restrictedBy` HQL parameter. The listing comes from `{sourcedir}/org/hibernate/search/documentation/mapper/orm/indexing/HibernateOrmBatchJsr352IT.java` and performs the following steps:
- Start building parameters for a mass-indexing job.
- Define the entity type to be indexed.
- Restrict the scope of the job using a HQL restriction.
- Get the `JobOperator` from the framework.
- Start the job.
While full indexing is useful when you perform the very first indexing, or after extensive changes to your whole database, it may also be time consuming. If you want to reindex only part of your data, you need to add restrictions using HQL: they help you define a customized selection, and only the entities inside that selection will be indexed. A typical use case is to index the entities that were created since yesterday.
Note that, as detailed below, some features may not be supported depending on the indexing mode.
| Indexing mode | Scope | Parallel Indexing |
|---|---|---|
| Full Indexation | All entities | Supported |
| HQL | Some entities | Not supported |
Warning: When using the HQL mode, the query is not validated before the job starts. If the query is invalid, the job will start and then fail. Also, parallel indexing is disabled in HQL mode, because our current parallelism implementation relies on selection order, which might not be provided by the HQL given by the user. Because of those limitations, we suggest you use this approach only for indexing small numbers of entities, and only if you know that no entities matching the query will be created during indexing.
For better performance, indexing is performed in parallel using multiple threads. The set of entities to index is split into multiple partitions. Each thread processes one partition at a time.
The following section will explain how to tune the parallel execution.
Tip: The "sweet spot" of number of threads, fetch size, partition size, etc. to achieve best performance is highly dependent on your overall architecture, database design and even data values. You should experiment with these settings to find out what’s best in your particular case.
The maximum number of threads used by the job execution is defined through the `maxThreads()` method. Of the N threads given, 1 thread is reserved for the core, so only N - 1 threads are available for the different partitions. If N = 1, the program will still work, and all batch elements will run in the same thread. The default number of threads used in Hibernate Search is 10. You can override it with your preferred number.
MassIndexingJob.parameters()
        .maxThreads( 5 )
        ...
Note: The batch runtime cannot guarantee that the requested number of threads is available; it will use as many as possible, up to the requested maximum (JSR352 v1.0 Final Release, page 34). Also, all batch jobs share the same thread pool, so it’s not always a good idea to execute jobs concurrently.
Each partition consists of a fixed number of elements to index. You may tune exactly how many elements a partition will hold with `rowsPerPartition`.
MassIndexingJob.parameters()
        .rowsPerPartition( 5000 )
        ...
Note: This property has nothing to do with "chunk size", which is how many elements are processed together between each write. That aspect of processing is addressed by chunking; see the Chunking section for how to tune it.
When `rowsPerPartition` is low, there will be many small partitions, so processing threads will be less likely to starve (stay idle because there’s no more partition to process), but on the other hand you will only be able to take advantage of a small fetch size, which will increase the number of database accesses. Also, due to the failure recovery mechanisms, there is some overhead in starting a new partition, so with an unnecessarily large number of partitions, this overhead will add up.
When `rowsPerPartition` is high, there will be a few big partitions, so you will be able to take advantage of a higher chunk size, and thus a higher fetch size, which will reduce the number of database accesses, and the overhead of starting a new partition will be less noticeable, but on the other hand you may not use all the threads available.
Note: Each partition deals with one root entity type, so two different entity types will never run under the same partition.
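The trade-off between partition size and thread usage can be explored with simple arithmetic. The sketch below is a simplified model, not Hibernate Search code: `PartitionPlan`, the entity type names and the row counts are made up for illustration. It assumes that each entity type is split into as many full partitions as needed plus possibly a last, smaller one, and that one of the `maxThreads` threads is reserved, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model for illustration only: it is not part of Hibernate Search.
final class PartitionPlan {

    /** Partitions needed for one entity type: rowCount / rowsPerPartition, rounded up. */
    static long partitionsFor(long rowCount, int rowsPerPartition) {
        if (rowsPerPartition <= 0) {
            throw new IllegalArgumentException("rowsPerPartition must be greater than 0");
        }
        return (rowCount + rowsPerPartition - 1) / rowsPerPartition;
    }

    /** Total number of partitions; a partition never mixes entity types. */
    static long totalPartitions(Map<String, Long> rowsPerEntityType, int rowsPerPartition) {
        long total = 0;
        for (long rows : rowsPerEntityType.values()) {
            total += partitionsFor(rows, rowsPerPartition);
        }
        return total;
    }

    /** Threads left for partitions: N - 1, or the single shared thread when N = 1. */
    static int partitionThreads(int maxThreads) {
        if (maxThreads <= 0) {
            throw new IllegalArgumentException("maxThreads must be greater than 0");
        }
        return maxThreads == 1 ? 1 : maxThreads - 1;
    }

    public static void main(String[] args) {
        Map<String, Long> rows = new LinkedHashMap<>();
        rows.put("Book", 12_000L);  // hypothetical entity types and row counts
        rows.put("Author", 3_000L);
        int threads = partitionThreads(5);
        for (int size : new int[] { 500, 5_000, 20_000 }) {
            System.out.printf("rowsPerPartition=%d: %d partitions for %d threads%n",
                    size, totalPartitions(rows, size), threads);
        }
    }
}
```

With these made-up numbers, 20,000 rows per partition yields only two partitions, fewer than the four threads available for partitions, whereas 500 rows per partition yields thirty small ones.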
The mass indexing job supports restarting a suspended or failed job more or less from where it stopped.
This is made possible by splitting each partition into several consecutive chunks of entities, and saving progress information in a checkpoint at the end of each chunk. When a job is restarted, it will resume from the last checkpoint.
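The checkpoint mechanism can be pictured with a toy model. The sketch below is neither Hibernate Search nor JSR-352 runtime code: `ChunkedRun` is a hypothetical class in which a partition is a plain list of identifiers, the checkpoint is the position of the next unprocessed element, and a failure is simulated by passing the position at which processing should stop.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of checkpoint and restart, for illustration only.
final class ChunkedRun {

    private int checkpoint = 0;                              // saved at the end of each completed chunk
    private final List<Integer> indexed = new ArrayList<>(); // stands in for the index contents

    /**
     * Processes ids starting from the last checkpoint, one chunk at a time.
     * Pass a negative failAt to run without failure.
     * Returns true when the whole partition has been processed.
     */
    boolean run(List<Integer> ids, int checkpointInterval, int failAt) {
        if (checkpointInterval <= 0) {
            throw new IllegalArgumentException("checkpointInterval must be greater than 0");
        }
        int position = checkpoint;
        while (position < ids.size()) {
            int end = Math.min(position + checkpointInterval, ids.size());
            List<Integer> chunk = new ArrayList<>();
            for (int i = position; i < end; i++) {
                if (i == failAt) {
                    return false; // simulated failure: the unfinished chunk is lost
                }
                chunk.add(ids.get(i));
            }
            indexed.addAll(chunk); // work becomes visible at the end of the chunk
            checkpoint = end;      // progress is saved
            position = end;
        }
        return true;
    }

    int checkpoint() {
        return checkpoint;
    }

    List<Integer> indexed() {
        return indexed;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ids.add(i);
        }
        ChunkedRun job = new ChunkedRun();
        boolean finished = job.run(ids, 4, 6); // fails inside the second chunk
        System.out.println("finished=" + finished + ", checkpoint=" + job.checkpoint());
        finished = job.run(ids, 4, -1);        // restart: resumes from position 4
        System.out.println("finished=" + finished + ", indexed=" + job.indexed().size());
    }
}
```

In this model the elements at positions 4 and 5 are processed twice, once before the failure and once after the restart, which is the sense in which a job resumes "more or less" from where it stopped.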
The size of each chunk is determined by the `checkpointInterval` parameter.
MassIndexingJob.parameters()
        .checkpointInterval( 1000 )
        ...
But the size of a chunk is not only about saving progress, it is also about performance:
- a new Hibernate session is opened for each chunk;
- a new transaction is started for each chunk;
- inside a chunk, the session is cleared periodically according to the `sessionClearInterval` parameter, which must therefore be smaller than (or equal to) the chunk size;
- documents are flushed to the index at the end of each chunk.
Tip: In general the checkpoint interval should be small compared to the number of rows per partition. Indeed, due to the failure recovery mechanism, the elements before the first checkpoint of each partition will take longer to process than the others, so in a 1000-element partition, having a 100-element checkpoint interval will be faster than having a 1000-element checkpoint interval. On the other hand, chunks shouldn’t be too small in absolute terms. Performing a checkpoint means your JSR-352 runtime will write information about the progress of the job execution to its persistent storage, which also has a cost. Also, a new transaction and session are created for each chunk, which doesn’t come for free, and implies that setting the fetch size to a value higher than the chunk size is pointless. Finally, the index flush performed at the end of each chunk is an expensive operation that involves a global lock, which essentially means that the less often you do it, the faster indexing will be. Thus having a 1-element checkpoint interval is definitely not a good idea.
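The two sides of this trade-off can be put into numbers. The sketch below is only a back-of-the-envelope model: `CheckpointCost` and its methods are hypothetical names, and it assumes that a failure loses at most the chunk that was in progress and that a fetch never spans more than one chunk.

```java
// Back-of-the-envelope model, for illustration only.
final class CheckpointCost {

    /** Checkpoints (and thus sessions, transactions and index flushes) per partition. */
    static int checkpointsPerPartition(int rowsInPartition, int checkpointInterval) {
        requirePositive(checkpointInterval);
        return (rowsInPartition + checkpointInterval - 1) / checkpointInterval;
    }

    /** Upper bound on the rows processed again after a failure: one chunk. */
    static int maxRowsRedoneAfterFailure(int rowsInPartition, int checkpointInterval) {
        requirePositive(checkpointInterval);
        return Math.min(rowsInPartition, checkpointInterval);
    }

    /** A fetch size above the chunk size brings nothing, since each chunk uses a new session. */
    static int effectiveFetchSize(int fetchSize, int checkpointInterval) {
        requirePositive(checkpointInterval);
        return Math.min(fetchSize, checkpointInterval);
    }

    private static void requirePositive(int checkpointInterval) {
        if (checkpointInterval <= 0) {
            throw new IllegalArgumentException("checkpointInterval must be greater than 0");
        }
    }

    public static void main(String[] args) {
        int rowsInPartition = 1_000;
        for (int interval : new int[] { 1, 100, 1_000 }) {
            System.out.printf("interval=%d: %d checkpoints, at most %d rows redone after a failure%n",
                    interval,
                    checkpointsPerPartition(rowsInPartition, interval),
                    maxRowsRedoneAfterFailure(rowsInPartition, interval));
        }
    }
}
```

For a 1000-element partition, a 1-element interval means 1000 checkpoints and flushes, while a 1000-element interval means a single checkpoint but up to 1000 rows to process again after a failure.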
Caution: Regardless of how the entity manager factory is retrieved, you must make sure that the entity manager factory used by the mass indexer will stay open during the whole mass indexing process.
If your JSR-352 runtime is JBeret (used in WildFly in particular), you can use CDI to retrieve the `EntityManagerFactory`.
If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration.
If you want to use multiple persistence units, you will have to register the `EntityManagerFactory` instances as beans in the CDI context. Note that entity manager factories will probably not be considered as beans by default, in which case you will have to register them yourself. You may use an application-scoped bean to do so:
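import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.inject.Produces;
import javax.inject.Named;
import javax.inject.Singleton;
import javax.persistence.EntityManagerFactory;
import javax.persistence.PersistenceUnit;

@ApplicationScoped
public class EntityManagerFactoriesProducer {

    // Injected by the container, one per persistence unit
    @PersistenceUnit(unitName = "db1")
    private EntityManagerFactory db1Factory;

    @PersistenceUnit(unitName = "db2")
    private EntityManagerFactory db2Factory;

    @Produces
    @Singleton
    @Named("db1") // The name to use when referencing the bean
    public EntityManagerFactory createEntityManagerFactoryForDb1() {
        return db1Factory;
    }

    @Produces
    @Singleton
    @Named("db2") // The name to use when referencing the bean
    public EntityManagerFactory createEntityManagerFactoryForDb2() {
        return db2Factory;
    }
}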
Once the entity manager factories are registered in the CDI context, you can instruct the mass indexer to use one in particular by naming it using the `entityManagerFactoryReference` parameter.
Note: Due to limitations of the CDI APIs, it is not currently possible to reference an entity manager factory by its persistence unit name when using the mass indexer with CDI.
If you want to use a different JSR-352 implementation that happens to allow dependency injection:
- You must map the following two scope annotations to the relevant scope in the dependency injection mechanism:
  - `org.hibernate.search.batch.jsr352.core.inject.scope.spi.HibernateSearchJobScoped`
  - `org.hibernate.search.batch.jsr352.core.inject.scope.spi.HibernateSearchPartitionScoped`
- You must make sure that the dependency injection mechanism will register all injection-annotated classes (`@Named`, …) from the `hibernate-search-mapper-orm-batch-jsr352-core` module in the dependency injection context. For instance this can be achieved in Spring DI using the `@ComponentScan` annotation.
- You must register a single bean in the dependency injection context that will implement the `EntityManagerFactoryRegistry` interface.
The following will work only if your JSR-352 runtime does not support dependency injection at all, i.e. it ignores `@Inject` annotations in batch artifacts. This is the case for JBatch in Java SE mode, for instance.
If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration: you only have to make sure to create the `EntityManagerFactory` (or `SessionFactory`) in your application before launching the mass indexer.
If you want to use multiple persistence units, you will have to add two parameters when launching the mass indexer:
- `entityManagerFactoryReference`: this is the string that will identify the `EntityManagerFactory`.
- `entityManagerFactoryNamespace`: this lets you select how you want to reference the `EntityManagerFactory`. Possible values are:
  - `persistence-unit-name` (the default): use the persistence unit name defined in `persistence.xml`.
  - `session-factory-name`: use the session factory name defined in the Hibernate configuration by the `hibernate.session_factory_name` configuration property.
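As a sketch of how these two parameters could be added to a job's parameters, the snippet below assumes that the parameters are handed to the batch runtime as `java.util.Properties` keyed by parameter name, which is the form in which the JSR-352 `JobOperator.start` method receives job parameters. `PersistenceUnitSelection` is a hypothetical helper, `db1` is the persistence unit name borrowed from the CDI example above, and all other job parameters are omitted.

```java
import java.util.Properties;

// Hypothetical helper, for illustration only: it is not part of Hibernate Search.
final class PersistenceUnitSelection {

    static final String BY_PERSISTENCE_UNIT_NAME = "persistence-unit-name";
    static final String BY_SESSION_FACTORY_NAME = "session-factory-name";

    /** Returns a copy of the given job parameters with the two selection parameters added. */
    static Properties select(Properties jobParameters, String namespace, String reference) {
        if (!BY_PERSISTENCE_UNIT_NAME.equals(namespace)
                && !BY_SESSION_FACTORY_NAME.equals(namespace)) {
            throw new IllegalArgumentException("Unknown namespace: " + namespace);
        }
        Properties result = new Properties();
        result.putAll(jobParameters);
        result.setProperty("entityManagerFactoryNamespace", namespace);
        result.setProperty("entityManagerFactoryReference", reference);
        return result;
    }

    public static void main(String[] args) {
        Properties base = new Properties(); // other job parameters would go here
        Properties params = select(base, BY_PERSISTENCE_UNIT_NAME, "db1");
        params.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Since `persistence-unit-name` is the default namespace, setting only `entityManagerFactoryReference` should be enough when referencing a persistence unit by name.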
Caution: If you set the