HSEARCH-3808 Document the concepts of commit and refresh
... and update the rest of the documentation accordingly.
yrodiere committed Feb 21, 2020
1 parent 95a86a6 commit 1091bb0
Showing 9 changed files with 170 additions and 96 deletions.
18 changes: 18 additions & 0 deletions documentation/src/main/asciidoc/backend-elasticsearch.asciidoc
@@ -780,3 +780,21 @@ refer to the documentation:
{elasticsearchDocUrl}/analysis-tokenizers.html[tokenizers],
{elasticsearchDocUrl}/analysis-tokenfilters.html[token filters].
====

[[backend-elasticsearch-io]]
== Writing and reading

include::components/writing-reading-intro-note.asciidoc[]

When writing to indexes, Elasticsearch relies on a link:{elasticsearchDocUrl}/index-modules-translog.html[transaction log]
to make sure that changes, even uncommitted, are always safe as soon as the REST API call returns.
For that reason, the concept of "commit" is not as important to the Elasticsearch backend,
and commit requirements are largely irrelevant.

When reading from indexes, Elasticsearch relies on a periodically refreshed index reader,
meaning that search queries will return slightly out-of-date results,
unless a refresh was forced:
this is called link:{elasticsearchDocUrl}/getting-started-concepts.html#_near_realtime_nrt[near-real-time] behavior.
By default, the index reader is refreshed every second,
but this can be customized on the Elasticsearch side through index settings:
see the `refresh_interval` setting on link:{elasticsearchDocUrl}/index-modules.html[this page].
32 changes: 4 additions & 28 deletions documentation/src/main/asciidoc/backend-lucene.asciidoc
@@ -548,41 +548,17 @@ it's just that there is no documentation page for Lucene proper).

=== Basics

At any given time, the Lucene backend holds for each index:
include::components/writing-reading-intro-note.asciidoc[]

* One `IndexWriter` instance that allows for writes to the index,
e.g. adding/deleting a document.
+
The index writer buffers writes,
and "pushes" the changes to the index when it is <<backend-lucene-io-commit,_committed_>>.
Not committing the index writer means that,
in the event of a server crash or power loss, uncommitted writes will be lost.
* One `IndexReader` instance that allows for reads from the index,
e.g. executing a search query.
+
The index reader exposes a view of the index as it was when the reader was opened,
and updates that view when it is <<backend-lucene-io-refresh,_refreshed_>>.
Not refreshing the index reader means that search queries will return potentially outdated results
that do not take into account the latest changes to the index.
The default configuration of the Lucene backend focuses on safety and freshness:

Hibernate Search chooses when to commit or refresh.
The default configuration focuses on safety (making sure that writes are committed as soon as possible)
and on providing an always up-to-date view of the index.
* Changes are committed as soon as possible (at the end of each batch of changes).
* Indexes are refreshed as soon as a change happens.

Custom configuration, explained in the following sections,
can provide performance boosts in some situations at the cost of lower write safety
and/or occasional out-of-date reads.

[NOTE]
====
After a refresh, *all* changes to the index are taken into account:
those committed to the index, but *also* those that are still buffered in the index writer.
For that reason, commits and refreshes can be treated as completely orthogonal concepts:
certain configurations will occasionally lead to committed changes not being visible in search queries,
while other configurations will allow even uncommitted changes to be visible in search queries.
====

[[backend-lucene-io-commit]]
=== Commit

@@ -0,0 +1,6 @@
[NOTE]
====
For a preliminary introduction to writing to and reading from indexes in Hibernate Search,
including in particular the concepts of _commit_ and _refresh_,
see <<concepts-commit-refresh>>.
====
58 changes: 58 additions & 0 deletions documentation/src/main/asciidoc/concepts.asciidoc
@@ -157,6 +157,64 @@ See the documentation of each backend for more information:
* <<backend-lucene-analysis,Analysis for the Lucene backend>>
* <<backend-elasticsearch-analysis,Analysis for the Elasticsearch backend>>

[[concepts-commit-refresh]]
== Commit and refresh

In order to get the best throughput when indexing and when searching,
both Elasticsearch and Lucene rely on "buffers" when writing to and reading from the index:

* When writing, changes are not _directly_ written to the index,
but to an "index writer" that buffers changes in-memory or in temporary files.
+
The changes are "pushed" to the actual index when the writer is _committed_.
Until the commit happens, uncommitted changes are in an "unsafe" state:
if the application crashes or if the server suffers from a power loss,
uncommitted changes will be lost.
* When reading, e.g. when executing a search query,
data is not read _directly_ from the index,
but from an "index reader" that exposes a view of the index as it was at some point in the past.
+
The view is updated when the reader is _refreshed_.
Until the refresh happens, results of search queries might be slightly out of date:
documents added since the last refresh will be missing,
documents deleted since the last refresh will still be there, etc.

Unsafe changes and out-of-date indexes are obviously undesirable,
but they are a trade-off that improves performance.

Different factors influence when refreshes and commits happen:

* <<mapper-orm-indexing-automatic,Automatic indexing>> will, by default,
require that a commit of the index writer is performed after each set of changes,
meaning the changes are safe after the Hibernate ORM transaction commit returns.
However, no refresh is requested by default, meaning the changes may only be visible at a later time,
when the backend decides to refresh the index reader.
This behavior can be customized by setting a different <<mapper-orm-indexing-automatic-synchronization,synchronization strategy>>.
* The <<mapper-orm-indexing-massindexer,mass indexer>>
will not require any commit or refresh until the very end of mass indexing,
so as to maximize indexing throughput.
* Whenever there are no particular commit or refresh requirements,
backend defaults will apply:
** See <<backend-elasticsearch-io,here for Elasticsearch>>.
** See <<backend-lucene-io,here for Lucene>>.
* A commit may be forced explicitly through the <<mapper-orm-indexing-manual-flush,`flush()` API>>.
* A refresh may be forced explicitly through the <<mapper-orm-indexing-manual-flush,`refresh()` API>>.

[NOTE]
====
Even though we use the word "commit",
this is not the same concept as a commit in relational database transactions:
there is no transaction and no "rollback" is possible.
There is no concept of isolation, either.
After a refresh, *all* changes to the index are taken into account:
those committed to the index, but also those that are still buffered in the index writer.
For this reason, commits and refreshes can be treated as completely orthogonal concepts:
certain setups will occasionally lead to committed changes not being visible in search queries,
while others will allow even uncommitted changes to be visible in search queries.
====
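
To make the independence of commits and refreshes concrete, here is a minimal, self-contained sketch of a toy index.
It is an illustration only: `CommitRefreshSketch` and `ToyIndex` are hypothetical names, not Hibernate Search, Lucene or Elasticsearch APIs,
and the model assumes a single writer, a single reader and in-memory lists standing in for the index.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not a real API) of an index with a write buffer and a point-in-time reader. */
final class CommitRefreshSketch {

    static final class ToyIndex {
        private final List<String> committed = new ArrayList<>(); // survives a crash
        private final List<String> buffered = new ArrayList<>();  // lost on crash
        private List<String> readerView = new ArrayList<>();      // what search queries see

        /** Writes go to the buffer, not directly to the index. */
        void add(String document) {
            buffered.add(document);
        }

        /** Commit: push buffered changes to the index. The reader view is left untouched. */
        void commit() {
            committed.addAll(buffered);
            buffered.clear();
        }

        /** Refresh: the reader now sees all changes, committed or still buffered. */
        void refresh() {
            List<String> view = new ArrayList<>(committed);
            view.addAll(buffered);
            readerView = view;
        }

        /** Simulated crash: buffered changes are lost and a new reader opens on committed data only. */
        void crash() {
            buffered.clear();
            readerView = new ArrayList<>(committed);
        }

        boolean isSafe(String document) {
            return committed.contains(document);
        }

        boolean isSearchable(String document) {
            return readerView.contains(document);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();

        index.add("a");
        index.commit(); // "a" is committed, but no refresh happened yet
        System.out.println("a: safe=" + index.isSafe("a") + ", searchable=" + index.isSearchable("a"));

        index.add("b");
        index.refresh(); // "b" is visible, but still uncommitted
        System.out.println("b: safe=" + index.isSafe("b") + ", searchable=" + index.isSearchable("b"));

        index.crash(); // the uncommitted "b" disappears
        System.out.println("b after crash: searchable=" + index.isSearchable("b"));
    }
}
```

In this model, a committed document can be invisible to searches, and a visible document can still be lost in a crash, which is the sense in which the two concepts are orthogonal.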

[[concepts-sharding-routing]]
== Sharding and routing

2 changes: 1 addition & 1 deletion documentation/src/main/asciidoc/configuration.asciidoc
@@ -247,7 +247,7 @@ can take advantage of injection features of this framework.

Hibernate Search generally propagates exceptions occurring in background threads to the user thread,
but in some cases, such as Lucene segment merging failures,
or when indexing in <<mapper-orm-indexing-automatic-synchronization-queued,fully asynchronous mode>>,
or <<mapper-orm-indexing-automatic-synchronization-failures,some failures during automatic indexing>>,
the exception in background threads cannot be propagated.
By default, when that happens, the failure is logged at the `ERROR` level.

127 changes: 72 additions & 55 deletions documentation/src/main/asciidoc/mapper-orm-indexing-automatic.asciidoc
@@ -109,67 +109,84 @@ and link:{hibernateDocUrl}#fetching-batch[the `@BatchSize` annotation].
[[mapper-orm-indexing-automatic-synchronization]]
== Synchronization with the indexes

Hibernate Search offers multiple strategies to control synchronization with the indexes
during automatic indexing,
i.e. to control the minimum progress of indexing before the application thread is resumed.
include::components/writing-reading-intro-note.asciidoc[]

You can define a default strategy for all sessions by setting the configuration property
`hibernate.search.automatic_indexing.synchronization.strategy`:
When a transaction is committed,
automatic indexing can (and, by default, will) block the application thread
until indexing reaches a certain level of completion.

* [[mapper-orm-indexing-automatic-synchronization-queued]] when set to `queued`, the application thread will be resumed as soon as
the index changes are queued in the backend.
+
This strategy offers no guarantee as to whether indexing will be performed successfully,
or even whether indexing will be performed at all:
the local JVM may crash before the works are executed, in which case the indexing requests will be forgotten,
or indexing may simply fail.
+
With this strategy, failures to extract data from entities will lead to an exception being thrown in the application thread,
but failures to perform indexing in the backend (i.e. I/O operations on the index)
will be forwarded to the <<configuration-background-failure-handling,failure handler>>,
which by default will simply log them.
* by default or when set to `committed`, the application thread will be resumed as soon as
the index changes are committed to disk.
+
This generally means, at the very least,
that the backend validated the index changes,
took appropriate measures to be able to recover the changes in the event of a crash,
and confirmed to Hibernate Search it did so
(e.g. for Elasticsearch, Hibernate Search received a successful response to the HTTP request).
+
This strategy offers no guarantee as to whether indexed documents are searchable,
meaning a search query executed immediately after the application thread is resumed
may return outdated information.
Documents are searchable as soon as a refresh is performed,
either automatically after every change (default for the Lucene backend),
periodically (link:{elasticsearchDocUrl}/getting-started-concepts.html#_near_realtime_nrt[default for the Elasticsearch backend],
opt-in for the Lucene backend by <<backend-lucene-io-refresh,setting a refresh interval>>),
or <<mapper-orm-indexing-manual-refresh,explicitly>>.
+
With this strategy, indexing failures will lead to an exception being thrown in the application thread.
* when set to `searchable`, the application thread will be resumed as soon as
the index changes are committed to disk
*and* the relevant documents are searchable.
The backend will be told to make the documents searchable as soon as possible.
+
There are two main reasons for blocking the thread:

1. *Indexed data safety*:
if, once the database transaction completes,
index data must be safely stored to disk,
an <<concepts-commit-refresh,index commit>> is necessary.
Without it, index changes may only be safe after a few seconds,
when a periodic index commit happens in the background.
2. *Real-time search queries*:
if, once the database transaction completes,
any search query must immediately take the index changes into account,
an <<concepts-commit-refresh,index refresh>> is necessary.
Without it, index changes may only be visible after a few seconds,
when a periodic index refresh happens in the background.

These two requirements are controlled by the _synchronization strategy_.
The default strategy is defined by the configuration property
`hibernate.search.automatic_indexing.synchronization.strategy`.
Below is a reference of all available strategies and their guarantees.

|====
.2+h|Strategy 3+h| Guarantees when the application thread resumes .2+h|Throughput
h|Changes applied (with or without <<concepts-commit-refresh,commit>>)
h|Changes safe from crash/power loss (<<concepts-commit-refresh,commit>>)
h|Changes visible on search (<<concepts-commit-refresh,refresh>>)
|`queued`|No guarantee|No guarantee|No guarantee|Best
|`committed` (**default**)|Guaranteed|Guaranteed|No guarantee|Medium
|`searchable`|Guaranteed|Guaranteed|Guaranteed|<<mapper-orm-indexing-automatic-synchronization-refresh-throughput,Worst>>
|====
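
As a reading aid, the guarantees listed in the table can be encoded in a few lines of Java.
This is only a sketch: `SyncStrategySketch`, `SyncStrategy` and `weakestSatisfying` are hypothetical names introduced for illustration,
not part of Hibernate Search; only the three configuration values and their guarantees come from the table.

```java
/** Hypothetical encoding of the strategy table; not a Hibernate Search API. */
final class SyncStrategySketch {

    /** Strategies ordered from the weakest guarantees (best throughput) to the strongest. */
    enum SyncStrategy {
        QUEUED("queued", false, false, false),
        COMMITTED("committed", true, true, false),
        SEARCHABLE("searchable", true, true, true);

        final String configValue; // value of the configuration property
        final boolean applied;    // changes applied when the application thread resumes
        final boolean safe;       // changes committed, i.e. safe from crash/power loss
        final boolean searchable; // changes visible on search, i.e. index refreshed

        SyncStrategy(String configValue, boolean applied, boolean safe, boolean searchable) {
            this.configValue = configValue;
            this.applied = applied;
            this.safe = safe;
            this.searchable = searchable;
        }
    }

    /** Returns the least demanding strategy that still meets the given requirements. */
    static SyncStrategy weakestSatisfying(boolean needSafe, boolean needSearchable) {
        for (SyncStrategy strategy : SyncStrategy.values()) {
            boolean safeOk = !needSafe || strategy.safe;
            boolean searchableOk = !needSearchable || strategy.searchable;
            if (safeOk && searchableOk) {
                return strategy;
            }
        }
        // Unreachable: SEARCHABLE satisfies every combination of requirements.
        throw new IllegalStateException("No strategy satisfies the requirements");
    }

    public static void main(String[] args) {
        System.out.println(weakestSatisfying(false, false).configValue);
        System.out.println(weakestSatisfying(true, false).configValue);
        System.out.println(weakestSatisfying(true, true).configValue);
    }
}
```

Choosing the weakest strategy that meets the requirements mirrors the throughput column: stronger guarantees cost more waiting in the application thread.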

[[mapper-orm-indexing-automatic-synchronization-refresh-throughput]]
[WARNING]
====
Depending on the backend and its configuration,
this strategy may lead to poor indexing throughput,
because the backend may not be optimized for frequent, on-demand index refreshes.
+
That is why this strategy is only recommended if you know your backend is optimized for it
(for example this is true for the default configuration of the Lucene backend, but not for the Elasticsearch backend),
or for integration tests.
+
With this strategy, indexing failures will lead to an exception being thrown in the application thread.
the `searchable` strategy may lead to poor indexing throughput,
because the backend may not be designed for frequent, on-demand index refreshes.
This is why this strategy is only recommended if you know your backend is designed for it, or for integration tests.
In particular, the `searchable` strategy will work fine with the default configuration of the Lucene backend,
but will perform poorly with the Elasticsearch backend.
====

[[mapper-orm-indexing-automatic-synchronization-failures]]
[NOTE]
====
Indexing failures may be reported differently depending on the chosen strategy:
While the above configuration property defines a default,
* Failure to extract data from entities:
** Regardless of the strategy, throws an exception in the application thread.
* Failure to apply index changes (i.e. I/O operations on the index):
** For strategies that apply changes immediately: throws an exception in the application thread.
** For strategies that do *not* apply changes immediately:
forwards the failure to the <<configuration-background-failure-handling,failure handler>>,
which by default will simply log the failure.
* Failure to commit index changes:
** For strategies that guarantee an index commit: throws an exception in the application thread.
** For strategies that do *not* guarantee an index commit:
forwards the failure to the <<configuration-background-failure-handling,failure handler>>,
which by default will simply log the failure.
====
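
The reporting rules above can be summarized as a small decision function.
The sketch below is an assumption-laden illustration, not the actual implementation:
all type and method names are hypothetical, and the two boolean parameters stand for
"the strategy applies changes immediately" and "the strategy guarantees an index commit".

```java
/** Hypothetical model of how indexing failures are reported; not a Hibernate Search API. */
final class FailureReportingSketch {

    enum FailureKind { ENTITY_DATA_EXTRACTION, APPLY_INDEX_CHANGES, COMMIT_INDEX_CHANGES }

    enum Reporting { EXCEPTION_IN_APPLICATION_THREAD, FAILURE_HANDLER }

    /**
     * @param kind What failed.
     * @param appliesImmediately Whether the strategy applies changes before resuming the application thread.
     * @param guaranteesCommit Whether the strategy guarantees an index commit before resuming the application thread.
     * @return Where the failure ends up being reported.
     */
    static Reporting reportingFor(FailureKind kind, boolean appliesImmediately, boolean guaranteesCommit) {
        switch (kind) {
            case ENTITY_DATA_EXTRACTION:
                // Regardless of the strategy, extraction failures reach the application thread.
                return Reporting.EXCEPTION_IN_APPLICATION_THREAD;
            case APPLY_INDEX_CHANGES:
                return appliesImmediately
                        ? Reporting.EXCEPTION_IN_APPLICATION_THREAD
                        : Reporting.FAILURE_HANDLER;
            case COMMIT_INDEX_CHANGES:
                return guaranteesCommit
                        ? Reporting.EXCEPTION_IN_APPLICATION_THREAD
                        : Reporting.FAILURE_HANDLER;
            default:
                throw new IllegalArgumentException("Unknown failure kind: " + kind);
        }
    }

    public static void main(String[] args) {
        // A strategy with no apply/commit guarantee: backend failures go to the failure handler.
        System.out.println(reportingFor(FailureKind.APPLY_INDEX_CHANGES, false, false));
        // A strategy that guarantees a commit: commit failures surface in the application thread.
        System.out.println(reportingFor(FailureKind.COMMIT_INDEX_CHANGES, true, true));
    }
}
```

Failures routed to the failure handler are, by default, only logged, so they are easy to miss unless logs are monitored.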

While the configuration property mentioned above defines a default,
it is possible to override this default on a particular session
by calling `SearchSession#setAutomaticIndexingSynchronizationStrategy` and passing a different strategy.
The built-in strategies can be retrieved by calling
`AutomaticIndexingSynchronizationStrategy.queued()`,
`AutomaticIndexingSynchronizationStrategy.committed()`
or `AutomaticIndexingSynchronizationStrategy.searchable()`,
but you can also define a custom strategy.

The built-in strategies can be retrieved by calling:

* `AutomaticIndexingSynchronizationStrategy.queued()`
* `AutomaticIndexingSynchronizationStrategy.committed()`
* or `AutomaticIndexingSynchronizationStrategy.searchable()`

Alternatively, you can implement a custom strategy.

.Overriding the automatic indexing synchronization strategy
====
@@ -267,23 +267,20 @@ When using multi-tenancy, only documents of one tenant will be removed:
the tenant of the session from which this workspace originated.
`purgeAsync()`::
Asynchronous version of `purge()` returning a `CompletableFuture`.
`flush()`::
[[mapper-orm-indexing-manual-flush]]`flush()`::
Flush to disk the changes to indexes that have not been committed yet.
In the case of backends with a transaction log (Elasticsearch),
also apply operations from the transaction log that were not applied yet.
+
This is generally not useful as Hibernate Search commits changes automatically.
Only to be used by experts fully aware of the implications.
See <<concepts-commit-refresh>> for more information.
`flushAsync()`::
Asynchronous version of `flush()` returning a `CompletableFuture`.
[[mapper-orm-indexing-manual-refresh]]`refresh()`::
Refresh the indexes so that all changes executed so far will be visible in search queries.
+
This is generally not useful as indexes are refreshed automatically.
either after every change (default for the Lucene backend)
or periodically (link:{elasticsearchDocUrl}/getting-started-concepts.html#_near_realtime_nrt[default for the Elasticsearch backend],
opt-in for the Lucene backend by <<backend-lucene-io-refresh,setting a refresh interval>>).
Only to be used by experts fully aware of the implications.
See <<concepts-commit-refresh>> for more information.
`refreshAsync()`::
Asynchronous version of `refresh()` returning a `CompletableFuture`.
`mergeSegments()`::
@@ -15,19 +15,20 @@
public enum AutomaticIndexingSynchronizationStrategyName {

/**
* A strategy that only waits for indexing requests to be queued in the backend.
* A strategy that only waits for index changes to be queued in the backend.
* <p>
* See the reference documentation for details.
*/
QUEUED( "queued" ),
/**
* A strategy that waits for indexing requests to be committed.
* A strategy that waits for index changes to be queued and applied, forces a commit, and waits for the commit to complete.
* <p>
* See the reference documentation for details.
*/
COMMITTED( "committed" ),
/**
* A strategy that waits for indexing requests to be committed and forces index refreshes.
* A strategy that waits for index changes to be queued and applied, forces a commit and a refresh,
* and waits for the commit and refresh to complete.
* <p>
* See the reference documentation for details.
*/
@@ -21,23 +21,24 @@ public interface AutomaticIndexingSynchronizationStrategy {
void apply(AutomaticIndexingSynchronizationConfigurationContext context);

/**
* @return A strategy that only waits for indexing requests to be queued in the backend.
* @return A strategy that only waits for index changes to be queued in the backend.
* See the reference documentation for details.
*/
static AutomaticIndexingSynchronizationStrategy queued() {
return QueuedAutomaticIndexingSynchronizationStrategy.INSTANCE;
}

/**
* @return A strategy that waits for indexing requests to be committed.
* @return A strategy that waits for index changes to be queued and applied, forces a commit, and waits for the commit to complete.
* See the reference documentation for details.
*/
static AutomaticIndexingSynchronizationStrategy committed() {
return CommittedAutomaticIndexingSynchronizationStrategy.INSTANCE;
}

/**
* @return A strategy that waits for indexing requests to be committed and forces index refreshes.
* @return A strategy that waits for index changes to be queued and applied, forces a commit and a refresh,
* and waits for the commit and refresh to complete.
* See the reference documentation for details.
*/
static AutomaticIndexingSynchronizationStrategy searchable() {