HSEARCH-3776 Document writer and merge settings for the Lucene backend
yrodiere committed Apr 20, 2020
1 parent 80e2877 commit c5e6dab
Showing 5 changed files with 199 additions and 51 deletions.
199 changes: 185 additions & 14 deletions documentation/src/main/asciidoc/backend-lucene.asciidoc
@@ -647,23 +647,11 @@ Again, this is only true if you rely on the document ID and not on a provided ro
[[backend-lucene-io]]
== Writing and reading

=== Basics

include::components/writing-reading-intro-note.asciidoc[]

The default configuration of the Lucene backend focuses on safety and freshness:

* Critical changes (automatic indexing) are only considered completed
when they are committed.
* Indexes are refreshed as soon as a change happens.

Some techniques and custom configuration, explained in the following sections,
can provide performance boosts in some situations at the cost of lower write safety
and/or occasional out-of-date reads.

[[backend-lucene-io-commit]]
=== Commit

include::components/writing-reading-intro-note.asciidoc[]

In Lucene terminology, a _commit_ is when changes buffered in an index writer
are pushed to the index itself,
so that a crash or power loss will no longer result in data loss.
@@ -724,6 +712,8 @@ so as not to require a commit after each change.
[[backend-lucene-io-refresh]]
=== Refresh

include::components/writing-reading-intro-note.asciidoc[]

In Lucene terminology, a _refresh_ is when a new index reader is opened,
so that the next search queries will take into account the latest changes to the index.

@@ -747,3 +737,184 @@ hibernate.search.backends.<backend name>.indexes.<index name>.io.refresh_interva
# OR
hibernate.search.backends.<backend name>.index_defaults.io.refresh_interval = 0 (default)
----

[[backend-lucene-io-writer]]
=== `IndexWriter` settings
// Search 5 anchors backward compatibility
[[lucene-indexing-performance]]

Lucene's `IndexWriter`, used by Hibernate Search to write to indexes,
exposes several settings that can be tweaked to better fit your application,
and ultimately get better performance.

Hibernate Search exposes these settings through configuration properties prefixed with `io.writer.`,
at the index level.

Below is a list of all index writer settings.
For example, `io.writer.ram_buffer_size` can be set like this:

[source]
----
hibernate.search.backends.<backend name>.indexes.<index name>.io.writer.ram_buffer_size = 32
# OR
hibernate.search.backends.<backend name>.index_defaults.io.writer.ram_buffer_size = 32
----

[[table-performance-parameters]]
.Configuration properties for the `IndexWriter`
[cols="1,2a", options="header"]
|===============
|Property
|Description

|`[...].io.writer.max_buffered_docs`
|The maximum number of documents that can be buffered in memory
before they are flushed to the Directory.

Large values mean faster indexing, but more RAM usage.

When used together with `ram_buffer_size` a flush occurs for whichever event happens first.

|`[...].io.writer.ram_buffer_size`
|The maximum amount of RAM that may be used for buffering added documents and deletions
before they are flushed to the Directory.

Large values mean faster indexing, but more RAM usage.

Generally, for faster indexing performance, it's best to use this setting rather than `max_buffered_docs`.

When used together with `max_buffered_docs` a flush occurs for whichever event happens first.

|`[...].io.writer.infostream`
|Enables low-level trace information about Lucene's internal components; `true` or `false`.

Logs will be appended to the logger `org.hibernate.search.backend.lucene.infostream` at the TRACE level.

This may cause significant performance degradation, even if the logger ignores the TRACE level,
so this should only be used for troubleshooting purposes.

Disabled by default.
|===============

[TIP]
====
Refer to Lucene's documentation, in particular the javadoc and source code of `IndexWriterConfig`,
for more information about the settings and their defaults.
====
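The two buffering limits can be combined; as noted above, a flush happens as soon as either threshold is reached. For instance, the following sketch (the values are illustrative, not recommendations) trades extra RAM for faster indexing:

[source]
----
hibernate.search.backends.<backend name>.index_defaults.io.writer.ram_buffer_size = 64
hibernate.search.backends.<backend name>.index_defaults.io.writer.max_buffered_docs = 1000
----

With these settings, a flush occurs as soon as the buffer holds 1000 documents or grows past 64MB, whichever happens first.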

[[backend-lucene-io-merge]]
=== Merge settings

A Lucene index is not stored in a single, continuous file.
Instead, each flush to the index will generate a small file containing the documents added since the previous flush.
This file is called a "segment".
Search can be slower on an index with too many segments,
so Lucene regularly merges small segments to create fewer, larger segments.

Lucene's merge behavior is controlled through a `MergePolicy`.
Hibernate Search uses the `LogByteSizeMergePolicy`,
which exposes several settings that can be tweaked to better fit your application,
and ultimately get better performance.

Below is a list of all merge settings.
They can be set through configuration properties, at the index level.
For example, `io.merge.factor` can be set like this:

[source]
----
hibernate.search.backends.<backend name>.indexes.<index name>.io.merge.factor = 10
# OR
hibernate.search.backends.<backend name>.index_defaults.io.merge.factor = 10
----

[cols="1,2a", options="header"]
.Configuration properties related to merges
|===============
|Property
|Description

|`[...].io.merge.max_docs`
|The maximum number of documents that a segment can have before merging.
Segments with more than this number of documents will not be merged.

Smaller values perform better on frequently changing indexes,
larger values provide better search performance if the index does not change often.

|`[...].io.merge.factor`
|The number of segments that are merged at once.

With smaller values, merging happens more often and thus uses more resources,
but the total number of segments will be lower on average, increasing read performance.
Thus, larger values (`> 10`) are best for <<mapper-orm-indexing-massindexer,mass indexing>>,
and smaller values (`< 10`) are best for <<mapper-orm-indexing-automatic,automatic indexing>>.

The value must not be lower than `2`.

|`[...].io.merge.min_size`
|The minimum target size of segments, in MB, for background merges.

Segments smaller than this size are merged more aggressively.

Setting this too large might result in expensive merge operations, even though they are less frequent.

|`[...].io.merge.max_size`
|The maximum size of segments, in MB, for background merges.

Segments larger than this size are never merged in the background.

Setting this to a lower value helps reduce memory requirements and avoids some merging operations, at the
cost of optimal search speed.

When <<mapper-orm-indexing-manual-merge,forcefully merging>> an index, this value is ignored and `max_forced_size` is used instead (see below).

|`[...].io.merge.max_forced_size`
|The maximum size of segments, in MB, for forced merges.

This is the equivalent of `io.merge.max_size` for <<mapper-orm-indexing-manual-merge,forceful merges>>.
You will generally want to set this to the same value as `max_size` or lower,
but setting it too low will <<mapper-orm-indexing-merge-segments,degrade search performance as documents are deleted>>.

|`[...].io.merge.calibrate_by_deletes`
|Whether the number of deleted documents in an index should be taken into account; `true` or `false`.

When enabled, Lucene will consider that a segment with 100 documents, 50 of which are deleted,
actually contains 50 documents.
When disabled, Lucene will consider that such a segment contains 100 documents.

Setting `calibrate_by_deletes` to `true` will lead to more frequent merges caused by `io.merge.max_docs`,
but will more aggressively merge segments with many deleted documents, improving search performance.
|===============

[NOTE]
====
Refer to Lucene's documentation, in particular the javadoc and source code of `LogByteSizeMergePolicy`,
for more information about the settings and their defaults.
====
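To illustrate how these settings combine, here is a hedged sketch (the values are illustrative only, not recommendations) for an index that is mostly rebuilt in large batches, favoring indexing throughput over merge frequency:

[source]
----
hibernate.search.backends.<backend name>.indexes.<index name>.io.merge.factor = 20
hibernate.search.backends.<backend name>.indexes.<index name>.io.merge.min_size = 16
hibernate.search.backends.<backend name>.indexes.<index name>.io.merge.max_size = 256
----

A frequently updated index serving latency-sensitive searches would instead use a smaller `factor`, as discussed above.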

[TIP]
[[lucene-segment-size]]
========
The options `io.merge.max_size` and `io.merge.max_forced_size`
do not *directly* define the maximum size of all segment files.
First, consider that merging a segment means combining it with one or more other segments to form a larger one.
`io.merge.max_size` is the maximum size of segments *before* merging,
so newly merged segments can be up to twice that size.
Second, merge options do not affect the size of segments initially created by the index writer, before they are merged.
This size can be limited with the setting `io.writer.ram_buffer_size`,
but Lucene relies on estimates to implement this limit;
when these estimates are off,
it is possible for newly created segments to be slightly larger than `io.writer.ram_buffer_size`.
So, for example, to be fairly confident no file grows larger than 15MB,
use something like this:
[source]
----
hibernate.search.backends.<backend name>.index_defaults.io.writer.ram_buffer_size = 10
hibernate.search.backends.<backend name>.index_defaults.io.merge.max_size = 7
hibernate.search.backends.<backend name>.index_defaults.io.merge.max_forced_size = 7
----
========
@@ -291,7 +291,7 @@ This is generally not useful as indexes are refreshed automatically.
See <<concepts-commit-refresh>> for more information.
`refreshAsync()`::
Asynchronous version of `refresh()` returning a `CompletableFuture`.
`mergeSegments()`::
[[mapper-orm-indexing-manual-merge]]`mergeSegments()`::
Merge each index targeted by this workspace into a single segment.
This operation does not always improve performance: see <<mapper-orm-indexing-merge-segments>>.
`mergeSegmentsAsync()`::
@@ -54,6 +54,16 @@ so there is no need to begin a database transaction before the `MassIndexer` is
or to commit a transaction after it is done.
====

[WARNING]
====
A note to MySQL users: the `MassIndexer` uses forward-only scrollable results
to iterate over the primary keys to be loaded,
but MySQL's JDBC driver will pre-load all values in memory.
To avoid this "optimization", set the <<mapper-orm-indexing-massindexer-parameters-idfetchsize,`idFetchSize` parameter>>
to `Integer.MIN_VALUE`.
====
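For illustration, this workaround can be applied when building the mass indexer. In the following sketch, the `Book` entity class and the previously obtained `searchSession` are assumptions for this example, not part of the original text:

[source, java]
----
searchSession.massIndexer( Book.class )
        // With Integer.MIN_VALUE, MySQL's JDBC driver streams rows
        // instead of pre-loading all primary keys in memory.
        .idFetchSize( Integer.MIN_VALUE )
        .startAndWait();
----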

You can also select entity types when creating a mass indexer,
so as to reindex only these types (and their indexed subtypes, if any):

@@ -113,6 +123,8 @@ as explained in <<mapper-orm-indexing-massindexer-tuning-threads>>.

[[mapper-orm-indexing-massindexer-parameters]]
== `MassIndexer` parameters
// Search 5 anchors backward compatibility
[[_useful_parameters_for_batch_indexing]]

.`MassIndexer` parameters
|===
@@ -129,7 +141,7 @@ That is to say, the number of threads spawned for entity loading
will be `typesToIndexInParallel * threadsToLoadObjects`
(+ 1 thread per type to retrieve the IDs of entities to load).

|`idFetchSize(int)`
|[[mapper-orm-indexing-massindexer-parameters-idfetchsize]]`idFetchSize(int)`
|`100`
|The fetch size to be used when loading primary keys. Some databases
accept special values, for example MySQL might benefit from using `Integer#MIN_VALUE`, otherwise it

This file was deleted.

2 changes: 0 additions & 2 deletions documentation/src/main/asciidoc/mapper-orm-indexing.asciidoc
@@ -11,6 +11,4 @@ include::mapper-orm-indexing-jsr352.asciidoc[]

include::mapper-orm-indexing-manual.asciidoc[]

include::mapper-orm-indexing-optimizing.asciidoc[]

:leveloffset: -1
