Reindexing large volumes of data with the JSR-352 integration

Hibernate Search provides a JSR-352 job to perform mass indexing. It covers not only the existing functionality of the mass indexer described above, but also benefits from some powerful standard features of the Java Batch Platform (JSR-352), such as failure recovery using checkpoints, chunk oriented processing, and parallel execution. This batch job accepts different entity type(s) as input, loads the relevant entities from the database, then rebuilds the full-text index from these.

However, it requires a batch runtime for the execution. Note that we don’t provide any batch runtime: you are free to choose one that fits your needs, e.g. the default batch runtime embedded in your Java EE container. We provide full integration with the JBeret implementation (see how to configure it here). Other implementations can also be used, but will require a bit more configuration on your side.

If the runtime is JBeret, you need to add the following dependency:

<dependency>
   <groupId>org.hibernate.search</groupId>
   <artifactId>hibernate-search-mapper-orm-batch-jsr352-jberet</artifactId>
   <version>{hibernateSearchVersion}</version>
</dependency>

For any other runtime, you need to add the following dependency:

<dependency>
   <groupId>org.hibernate.search</groupId>
   <artifactId>hibernate-search-mapper-orm-batch-jsr352-core</artifactId>
   <version>{hibernateSearchVersion}</version>
</dependency>

Here is an example of how to run a batch instance:

Example 1. Reindexing everything using a JSR-352 mass-indexing job

link:{sourcedir}/org/hibernate/search/documentation/mapper/orm/indexing/HibernateOrmBatchJsr352IT.java[role=include]

Start building parameters for a mass-indexing job.
Define some parameters. In this case, the list of the entity types to be indexed.
Get the JobOperator from the framework.
Start the job.

Job Parameters

The following table contains all the job parameters you can use to customize the mass-indexing job.

Table 1. Job Parameters in JSR 352 Integration

Parameter Name	Builder Method	Requirement	Default value	Description
`entityTypes`	`forEntity(Class<?>)`, `forEntities(Class<?>, Class<?>…)`	Required	-	The entity types to index in this job execution, comma-separated.
`purgeAllOnStart`	`purgeAllOnStart(boolean)`	Optional	True	Specify whether the existing index should be purged at the beginning of the job. This operation takes place before indexing.
`mergeSegmentsAfterPurge`	`mergeSegmentsAfterPurge(boolean)`	Optional	True	Specify whether the mass indexer should merge segments at the beginning of the job. This operation takes place after the purge operation and before indexing.
`mergeSegmentsOnFinish`	`mergeSegmentsOnFinish(boolean)`	Optional	True	Specify whether the mass indexer should merge segments at the end of the job. This operation takes place after indexing.
`cacheMode`	`cacheMode(CacheMode)`	Optional	`IGNORE`	Specify the Hibernate `CacheMode` when loading entities. The default is `IGNORE`, and it will be the most efficient choice in most cases, but using another mode such as `GET` may be more efficient if many of the entities being indexed refer to a small set of other entities.
`idFetchSize`	`idFetchSize(int)`	Optional	1000	Specifies the fetch size to be used when loading primary keys. Some databases accept special values, for example MySQL might benefit from using `Integer#MIN_VALUE`, otherwise it will attempt to preload everything in memory.
`entityFetchSize`	`entityFetchSize(int)`	Optional	The value of `sessionClearInterval`	Specifies the fetch size to be used when loading entities from database. Some databases accept special values, for example MySQL might benefit from using `Integer#MIN_VALUE`, otherwise it will attempt to preload everything in memory.
`customQueryHQL`	`restrictedBy(String)`	Optional	-	Use HQL / JPQL to index entities of a target entity type. Your query should contain only one entity type. Mixing this approach with the criteria restriction is not allowed. Note that there is no query validation for your input. See Indexing mode for more details and limitations.
`maxResultsPerEntity`	`maxResultsPerEntity(int)`	Optional	-	The maximum number of results to load per entity type. This parameter lets you define a threshold value to avoid loading too many entities accidentally. The value defined must be greater than 0. The parameter is not used by default. It is equivalent to the `LIMIT` keyword in SQL.
`rowsPerPartition`	`rowsPerPartition(int)`	Optional	20,000	The maximum number of rows to process per partition. The value defined must be greater than 0, and equal to or greater than the value of `checkpointInterval`.
`maxThreads`	`maxThreads(int)`	Optional	The number of partitions	The maximum number of threads to use for processing the job. Note that the batch runtime cannot guarantee that the requested number of threads is available; it will use as many as it can, up to the requested maximum.
`checkpointInterval`	`checkpointInterval(int)`	Optional	2,000, or the value of `rowsPerPartition` if it is smaller	The number of entities to process before triggering a checkpoint. The value defined must be greater than 0, and equal to or less than the value of `rowsPerPartition`.
`sessionClearInterval`	`sessionClearInterval(int)`	Optional	200, or the value of `checkpointInterval` if it is smaller	The number of entities to process before clearing the session. The value defined must be greater than 0, and equal to or less than the value of `checkpointInterval`.
`entityManagerFactoryReference`	`entityManagerFactoryReference(String)`	Required if there’s more than one persistence unit	-	The string that will identify the `EntityManagerFactory`.
`entityManagerFactoryNamespace`	`entityManagerFactoryNamespace(String)`	-	-	See Selecting the persistence unit (EntityManagerFactory)
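
The defaults and constraints of the sizing parameters in the table above depend on each other, which is easier to see in code. The following is only a minimal sketch: `JobSizing` is a hypothetical helper, not a Hibernate Search class, and it simply restates the "Default value" and "Description" columns for `rowsPerPartition`, `checkpointInterval`, `sessionClearInterval` and `entityFetchSize`. How the actual job resolves and validates these values internally is not shown here.

```java
/** Hypothetical helper (not part of Hibernate Search) restating the defaults and constraints of Table 1. */
final class JobSizing {
    final int rowsPerPartition;
    final int checkpointInterval;
    final int sessionClearInterval;
    final int entityFetchSize;

    /** A null argument means "parameter not set", so the documented default applies. */
    JobSizing(Integer rowsPerPartition, Integer checkpointInterval,
            Integer sessionClearInterval, Integer entityFetchSize) {
        this.rowsPerPartition = rowsPerPartition != null ? rowsPerPartition : 20_000;
        // 2,000, or the value of rowsPerPartition if it is smaller
        this.checkpointInterval = checkpointInterval != null
                ? checkpointInterval : Math.min(2_000, this.rowsPerPartition);
        // 200, or the value of checkpointInterval if it is smaller
        this.sessionClearInterval = sessionClearInterval != null
                ? sessionClearInterval : Math.min(200, this.checkpointInterval);
        // Defaults to the value of sessionClearInterval; no constraint is documented for it
        this.entityFetchSize = entityFetchSize != null
                ? entityFetchSize : this.sessionClearInterval;
        validate();
    }

    private void validate() {
        requirePositive("rowsPerPartition", rowsPerPartition);
        requirePositive("checkpointInterval", checkpointInterval);
        requirePositive("sessionClearInterval", sessionClearInterval);
        if (checkpointInterval > rowsPerPartition) {
            throw new IllegalArgumentException(
                    "checkpointInterval must be equal to or less than rowsPerPartition");
        }
        if (sessionClearInterval > checkpointInterval) {
            throw new IllegalArgumentException(
                    "sessionClearInterval must be equal to or less than checkpointInterval");
        }
    }

    private static void requirePositive(String name, int value) {
        if (value <= 0) {
            throw new IllegalArgumentException(name + " must be greater than 0");
        }
    }
}
```

With no parameter set, `new JobSizing(null, null, null, null)` resolves to 20,000 rows per partition, a checkpoint every 2,000 entities, and a session clear and entity fetch size of 200. Setting only `rowsPerPartition` to 150 pulls the other three values down to 150.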

Indexing mode

The mass indexing job lets you choose which entities are indexed: you can run a full indexing by selecting the desired entity types, or a partial indexing by adding an HQL restriction.

Example 2. Partial reindexing using a restrictedBy HQL parameter

link:{sourcedir}/org/hibernate/search/documentation/mapper/orm/indexing/HibernateOrmBatchJsr352IT.java[role=include]

Start building parameters for a mass-indexing job.
Define the entity type to be indexed.
Restrict the scope of the job using an HQL restriction.
Get the JobOperator from the framework.
Start the job.

While full indexing is useful when you perform the very first indexing, or after extensive changes to your whole database, it may also be time-consuming. If you want to reindex only part of your data, you need to add restrictions using HQL: they let you define a customized selection, and only the entities inside that selection will be indexed. A typical use case is to index the new entities that have appeared since yesterday.

Note that, as detailed below, some features may not be supported depending on the indexing mode.

Table 2. Comparison of each indexing mode

Indexing mode	Scope	Parallel Indexing
Full indexing	All entities	Supported
HQL	Some entities	Not supported

Warning

When using the HQL mode, there isn’t any query validation before the job’s start. If the query is invalid, the job will start and fail.

Also, parallel indexing is disabled in HQL mode, because the current parallelism implementation relies on a selection order that the HQL provided by the user might not guarantee.

Because of those limitations, we suggest you use this approach only for indexing small numbers of entities, and only if you know that no entities matching the query will be created during indexing.

Parallel indexing

For better performance, indexing is performed in parallel using multiple threads. The set of entities to index is split into multiple partitions. Each thread processes one partition at a time.

The following section will explain how to tune the parallel execution.

Tip	The "sweet spot" of number of threads, fetch size, partition size, etc. to achieve best performance is highly dependent on your overall architecture, database design and even data values. You should experiment with these settings to find out what’s best in your particular case.

Threads

The maximum number of threads used by the job execution is defined through the maxThreads() method. Of the N threads given, 1 thread is reserved for the core, so only N - 1 threads are available for processing partitions. If N = 1, the job will still work, and all batch elements will run in the same thread. The default number of threads used in Hibernate Search is 10. You can override it with your preferred number.

MassIndexingJob.parameters()
        .maxThreads( 5 )
        ...

Note

Note that the batch runtime cannot guarantee that the requested number of threads is available; it will use as many as possible up to the requested maximum (JSR352 v1.0 Final Release, page 34). Note also that all batch jobs share the same thread pool, so it’s not always a good idea to execute jobs concurrently.
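
To make the thread arithmetic above concrete, here is a small sketch. `ThreadBudget` is a hypothetical helper, not part of Hibernate Search. It assumes `maxThreads` is at least 1 and that all partitions cost roughly the same, which is rarely exactly true in practice, so treat the number of rounds as a rough estimate only.

```java
/** Hypothetical helper (not part of Hibernate Search) modelling the N - 1 rule described above. */
final class ThreadBudget {

    /** Threads left for partitions once one of the N threads is reserved for the core. */
    static int partitionThreads(int maxThreads) {
        if (maxThreads < 1) {
            // Assumption of this sketch: a thread count below 1 makes no sense
            throw new IllegalArgumentException("maxThreads must be at least 1");
        }
        // With N = 1 everything runs in the same thread, so one thread still processes partitions
        return Math.max(1, maxThreads - 1);
    }

    /**
     * Estimated number of rounds needed when each thread processes one partition at a time.
     * Assumes equally expensive partitions and that the runtime grants all requested threads.
     */
    static int rounds(int partitionCount, int maxThreads) {
        int threads = partitionThreads(maxThreads);
        return (partitionCount + threads - 1) / threads; // ceiling division
    }
}
```

For instance, with `maxThreads( 5 )` four threads process partitions, so 10 partitions need about three rounds, and any job with four partitions or fewer finishes in a single round.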

Rows per partition

Each partition consists of a fixed number of elements to index. You may tune exactly how many elements a partition will hold with rowsPerPartition.

MassIndexingJob.parameters()
        .rowsPerPartition( 5000 )
        ...

Note

This property has nothing to do with "chunk size", which is how many elements are processed together between each write. That aspect of processing is addressed by chunking.

Instead, rowsPerPartition is more about how parallel your mass indexing job will be.

Please see the Chunking section to see how to tune chunking.

When rowsPerPartition is low, there will be many small partitions, so processing threads will be less likely to starve (stay idle because there are no more partitions to process). On the other hand, you will only be able to take advantage of a small fetch size, which will increase the number of database accesses. Also, due to the failure recovery mechanisms, there is some overhead in starting a new partition, so with an unnecessarily large number of partitions, this overhead will add up.

When rowsPerPartition is high, there will be a few big partitions, so you will be able to take advantage of a larger chunk size, and thus a larger fetch size, which will reduce the number of database accesses. The overhead of starting a new partition will also be less noticeable. On the other hand, you may not use all the threads available.

Note	Each partition deals with one root entity type, so two different entity types will never run under the same partition.
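
The effect of rowsPerPartition on the number of partitions can be sketched as follows. `PartitionPlanner` and the entity names used with it are hypothetical and only model the two rules stated above: a partition holds at most rowsPerPartition elements, and it never mixes entity types. The sketch assumes the last partition of each type is simply smaller. The real job derives its partitions from the database, which is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model (not part of Hibernate Search) of how rows are split into partitions. */
final class PartitionPlanner {

    /** One partition: a contiguous range of rows belonging to a single entity type. */
    static final class Partition {
        final String entityType;
        final long firstRow;
        final long rowCount;

        Partition(String entityType, long firstRow, long rowCount) {
            this.entityType = entityType;
            this.firstRow = firstRow;
            this.rowCount = rowCount;
        }
    }

    static List<Partition> plan(Map<String, Long> rowCountsByEntityType, int rowsPerPartition) {
        if (rowsPerPartition <= 0) {
            throw new IllegalArgumentException("rowsPerPartition must be greater than 0");
        }
        List<Partition> partitions = new ArrayList<>();
        // Entity types are handled one by one: a partition never spans two types
        for (Map.Entry<String, Long> entry : rowCountsByEntityType.entrySet()) {
            long total = entry.getValue();
            for (long first = 0; first < total; first += rowsPerPartition) {
                // Assumption: the last partition of a type holds whatever rows remain
                long count = Math.min(rowsPerPartition, total - first);
                partitions.add(new Partition(entry.getKey(), first, count));
            }
        }
        return partitions;
    }
}
```

With 12,000 rows of a hypothetical `Book` entity, 3,000 rows of `Author` and `rowsPerPartition( 5000 )`, this yields four partitions: three for `Book` (5,000, 5,000 and 2,000 rows) and one for `Author`.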

Chunking and session clearing

The mass indexing job supports restarting a suspended or failed job more or less from where it stopped.

This is made possible by splitting each partition in several consecutive chunks of entities, and saving process information in a checkpoint at the end of each chunk. When a job is restarted, it will resume from the last checkpoint.

The size of each chunk is determined by the checkpointInterval parameter.

MassIndexingJob.parameters()
        .checkpointInterval( 1000 )
        ...

But the size of a chunk is not only about saving progress, it is also about performance:

a new Hibernate session is opened for each chunk;
a new transaction is started for each chunk;
inside a chunk, the session is cleared periodically according to the sessionClearInterval parameter, which must therefore be smaller than (or equal to) the chunk size;
documents are flushed to the index at the end of each chunk.
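
The chunk lifecycle listed above can be modelled with a simple loop. This is only an illustration: `ChunkingModel` is a hypothetical class, the session, transaction and index operations are reduced to comments and counters, and the exact moment at which the real job clears the session within a chunk is an assumption of this sketch.

```java
/** Hypothetical model (not part of Hibernate Search) of chunking inside one partition. */
final class ChunkingModel {

    /** Counters collected while walking through one partition. */
    static final class Stats {
        final int checkpoints;
        final int sessionClears;

        Stats(int checkpoints, int sessionClears) {
            this.checkpoints = checkpoints;
            this.sessionClears = sessionClears;
        }
    }

    static Stats simulate(int partitionRows, int checkpointInterval, int sessionClearInterval) {
        if (checkpointInterval <= 0 || sessionClearInterval <= 0
                || sessionClearInterval > checkpointInterval) {
            throw new IllegalArgumentException(
                    "Expected 0 < sessionClearInterval <= checkpointInterval");
        }
        int checkpoints = 0;
        int sessionClears = 0;
        for (int chunkStart = 0; chunkStart < partitionRows; chunkStart += checkpointInterval) {
            int chunkSize = Math.min(checkpointInterval, partitionRows - chunkStart);
            // A new session is opened and a new transaction is started here
            for (int i = 1; i <= chunkSize; i++) {
                // One entity is loaded and indexed here
                if (i % sessionClearInterval == 0) {
                    sessionClears++; // periodic session clear inside the chunk
                }
            }
            // Documents are flushed to the index and the checkpoint is saved here
            checkpoints++;
        }
        return new Stats(checkpoints, sessionClears);
    }

    /** Position a restarted job resumes from: the last checkpoint reached before the failure. */
    static int resumeOffset(int processedBeforeFailure, int checkpointInterval) {
        return (processedBeforeFailure / checkpointInterval) * checkpointInterval;
    }
}
```

In this model, a 5,000-row partition with `checkpointInterval( 1000 )` and a session clear interval of 200 goes through 5 checkpoints and 25 session clears. If the job fails after processing 2,350 rows of that partition, it resumes from row 2,000, so the 350 rows processed since the last checkpoint are processed again.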

Tip

In general the checkpoint interval should be small compared to the number of rows per partition.

Indeed, due to the failure recovery mechanism, the elements before the first checkpoint of each partition will take longer to process than the others, so in a 1000-element partition, having a 100-element checkpoint interval will be faster than having a 1000-element checkpoint interval.

On the other hand, chunks shouldn’t be too small in absolute terms. Performing a checkpoint means your JSR-352 runtime will write information about the progress of the job execution to its persistent storage, which also has a cost. Also, a new transaction and session are created for each chunk, which doesn’t come for free, and implies that setting the fetch size to a value higher than the chunk size is pointless. Finally, the index flush performed at the end of each chunk is an expensive operation that involves a global lock, which essentially means that the less often you do it, the faster indexing will be. Thus having a 1-element checkpoint interval is definitely not a good idea.

Selecting the persistence unit (EntityManagerFactory)

Caution

Regardless of how the entity manager factory is retrieved, you must make sure that the entity manager factory used by the mass indexer will stay open during the whole mass indexing process.

JBeret

If your JSR-352 runtime is JBeret (used in WildFly in particular), you can use CDI to retrieve the EntityManagerFactory.

If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration.

If you want to use multiple persistence units, you will have to register the EntityManagerFactories as beans in the CDI context. Note that entity manager factories will probably not be considered as beans by default, in which case you will have to register them yourself. You may use an application-scoped bean to do so:

@ApplicationScoped
public class EntityManagerFactoriesProducer {

    @PersistenceUnit(unitName = "db1")
    private EntityManagerFactory db1Factory;

    @PersistenceUnit(unitName = "db2")
    private EntityManagerFactory db2Factory;

    @Produces
    @Singleton
    @Named("db1") // The name to use when referencing the bean
    public EntityManagerFactory createEntityManagerFactoryForDb1() {
        return db1Factory;
    }

    @Produces
    @Singleton
    @Named("db2") // The name to use when referencing the bean
    public EntityManagerFactory createEntityManagerFactoryForDb2() {
        return db2Factory;
    }
}

Once the entity manager factories are registered in the CDI context, you can instruct the mass indexer to use one in particular by naming it using the entityManagerFactoryReference parameter.

Note	Due to limitations of the CDI APIs, it is not currently possible to reference an entity manager factory by its persistence unit name when using the mass indexer with CDI.

Other DI-enabled JSR-352 implementations

If you want to use a different JSR-352 implementation that happens to allow dependency injection:

You must map the following two scope annotations to the relevant scope in the dependency injection mechanism:
- org.hibernate.search.batch.jsr352.core.inject.scope.spi.HibernateSearchJobScoped
- org.hibernate.search.batch.jsr352.core.inject.scope.spi.HibernateSearchPartitionScoped
You must make sure that the dependency injection mechanism will register all injection-annotated classes (@Named, …) from the hibernate-search-mapper-orm-batch-jsr352-core module in the dependency injection context. For instance this can be achieved in Spring DI using the @ComponentScan annotation.
You must register a single bean in the dependency injection context that will implement the EntityManagerFactoryRegistry interface.

Plain Java environment (no dependency injection at all)

The following will work only if your JSR-352 runtime does not support dependency injection at all, i.e. it ignores @Inject annotations in batch artifacts. This is the case for JBatch in Java SE mode, for instance.

If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration: you only have to make sure to create the EntityManagerFactory (or SessionFactory) in your application before launching the mass indexer.

If you want to use multiple persistence units, you will have to add two parameters when launching the mass indexer:

entityManagerFactoryReference: this is the string that will identify the EntityManagerFactory.
entityManagerFactoryNamespace: this allows you to select how you want to reference the EntityManagerFactory. Possible values are:
- persistence-unit-name (the default): use the persistence unit name defined in persistence.xml.
- session-factory-name: use the session factory name defined in the Hibernate configuration by the hibernate.session_factory_name configuration property.

Caution

If you set the hibernate.session_factory_name property in the Hibernate configuration and you don’t use JNDI, you will also have to set hibernate.session_factory_name_is_jndi to false.
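
As an illustration of the two parameters above, the sketch below assembles them as plain job parameters. In practice you would normally set them through the builder methods listed in Table 1, entityManagerFactoryReference(String) and entityManagerFactoryNamespace(String). Passing them as raw `java.util.Properties` keyed by parameter name is an assumption of this sketch, `PersistenceUnitSelection` is a hypothetical helper, and the exact format expected for the `entityTypes` value is not specified here.

```java
import java.util.Properties;

/** Hypothetical helper (not part of Hibernate Search) assembling the parameters described above. */
final class PersistenceUnitSelection {

    static Properties jobParameters(String entityTypes, String reference, String namespace) {
        // Only the two namespace values documented above are accepted by this sketch
        if (!"persistence-unit-name".equals(namespace)
                && !"session-factory-name".equals(namespace)) {
            throw new IllegalArgumentException("Unknown namespace: " + namespace);
        }
        Properties parameters = new Properties();
        // Comma-separated entity types; the exact value format is an assumption here
        parameters.setProperty("entityTypes", entityTypes);
        // The string that identifies the EntityManagerFactory
        parameters.setProperty("entityManagerFactoryReference", reference);
        // How the reference above is to be interpreted
        parameters.setProperty("entityManagerFactoryNamespace", namespace);
        return parameters;
    }
}
```

For example, `jobParameters( "Book", "db2", "persistence-unit-name" )` targets the persistence unit named `db2` in persistence.xml, where both `Book` and `db2` are placeholder names.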
