From c5e6dabc9da9de5512a23389f77a3cc3f62fc0af Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Yoann=20Rodi=C3=A8re?= Date: Fri, 17 Apr 2020 16:07:25 +0200 Subject: [PATCH] HSEARCH-3776 Document writer and merge settings for the Lucene backend --- .../src/main/asciidoc/backend-lucene.asciidoc | 199 ++++++++++++++++-- .../mapper-orm-indexing-manual.asciidoc | 2 +- .../mapper-orm-indexing-massindexer.asciidoc | 14 +- .../mapper-orm-indexing-optimizing.asciidoc | 33 --- .../asciidoc/mapper-orm-indexing.asciidoc | 2 - 5 files changed, 199 insertions(+), 51 deletions(-) delete mode 100644 documentation/src/main/asciidoc/mapper-orm-indexing-optimizing.asciidoc diff --git a/documentation/src/main/asciidoc/backend-lucene.asciidoc b/documentation/src/main/asciidoc/backend-lucene.asciidoc index dd95813b7b7..cc290b877e8 100644 --- a/documentation/src/main/asciidoc/backend-lucene.asciidoc +++ b/documentation/src/main/asciidoc/backend-lucene.asciidoc @@ -647,23 +647,11 @@ Again, this is only true if you rely on the document ID and not on a provided ro [[backend-lucene-io]] == Writing and reading -=== Basics - -include::components/writing-reading-intro-note.asciidoc[] - -The default configuration of the Lucene backend focuses on safety and freshness: - -* Critical changes (automatic indexing) are only considered completed -when they are committed. -* Indexes are refreshed as soon as a change happens. - -Some techniques and custom configuration, explained in the following sections, -can provide performance boosts in some situations at the cost of lower write safety -and/or occasional out-of-date reads. - [[backend-lucene-io-commit]] === Commit +include::components/writing-reading-intro-note.asciidoc[] + In Lucene terminology, a _commit_ is when changes buffered in an index writer are pushed to the index itself, so that a crash or power loss will no longer result in data loss. @@ -724,6 +712,8 @@ so as not to require a commit after each change. 
[[backend-lucene-io-refresh]] === Refresh +include::components/writing-reading-intro-note.asciidoc[] + In Lucene terminology, a _refresh_ is when a new index reader is opened, so that the next search queries will take into account the latest changes to the index. @@ -747,3 +737,184 @@ hibernate.search.backends..indexes..io.refresh_interva # OR hibernate.search.backends..index_defaults.io.refresh_interval = 0 (default) ---- + +[[backend-lucene-io-writer]] +=== `IndexWriter` settings +// Search 5 anchors backward compatibility +[[lucene-indexing-performance]] + +Lucene's `IndexWriter`, used by Hibernate Search to write to indexes, +exposes several settings that can be tweaked to better fit your application, +and ultimately get better performance. + +Hibernate Search exposes these settings through configuration properties prefixed with `io.writer.`, +at the index level. + +Below is a list of all index writer settings. +They can be set through configuration properties, at the index level. +For example, `io.writer.ram_buffer_size` can be set like this: + +[source] +---- +hibernate.search.backends..indexes..io.writer.ram_buffer_size = 32 +# OR +hibernate.search.backends..index_defaults.io.writer.ram_buffer_size = 32 +---- + +[[table-performance-parameters]] +.Configuration properties for the `IndexWriter` +[cols="1,2a", options="header"] +|=============== +|Property +|Description + +|`[...].io.writer.max_buffered_docs` +|The maximum number of documents that can be buffered in-memory +before they are flushed to the Directory. + +Large values mean faster indexing, but more RAM usage. + +When used together with `ram_buffer_size` a flush occurs for whichever event happens first. + +|`[...].io.writer.ram_buffer_size` +|The maximum amount of RAM that may be used for buffering added documents and deletions +before they are flushed to the Directory. + +Large values mean faster indexing, but more RAM usage. 
+ +Generally for faster indexing performance it's best to use this setting rather than `max_buffered_docs`. + +When used together with `max_buffered_docs` a flush occurs for whichever event happens first. + +|`[...].io.writer.infostream` +|Enables low level trace information about Lucene's internal components; `true` or `false`. + +Logs will be appended to the logger `org.hibernate.search.backend.lucene.infostream` at the TRACE level. + +This may cause significant performance degradation, even if the logger ignores the TRACE level, +so this should only be used for troubleshooting purposes. + +Disabled by default. +|=============== + +[TIP] +==== +Refer to Lucene's documentation, in particular the javadoc and source code of `IndexWriterConfig`, +for more information about the settings and their defaults. +==== + +[[backend-lucene-io-merge]] +=== Merge settings + +A Lucene index is not stored in a single, continuous file. +Instead, each flush to the index will generate a small file containing all the documents added to the index. +This file is called a "segment". +Search can be slower on an index with too many segments, +so Lucene regularly merges small segments to create fewer, larger segments. + +Lucene's merge behavior is controlled through a `MergePolicy`. +Hibernate Search uses the `LogByteSizeMergePolicy`, +which exposes several settings that can be tweaked to better fit your application, +and ultimately get better performance. + +Below is a list of all merge settings. +They can be set through configuration properties, at the index level. 
+For example, `io.merge.factor` can be set like this: + +[source] +---- +hibernate.search.backends..indexes..io.merge.factor = 10 +# OR +hibernate.search.backends..index_defaults.io.merge.factor = 10 +---- + +[cols="1,2a", options="header"] +.Configuration properties related to merges +|=============== +|Property +|Description + +|`[...].io.merge.max_docs` +|The maximum number of documents that a segment can have before merging. +Segments with more than this number of documents will not be merged. + +Smaller values perform better on frequently changing indexes, +larger values provide better search performance if the index does not change often. + +|`[...].io.merge.factor` +|The number of segments that are merged at once. + +With smaller values, merging happens more often and thus uses more resources, +but the total number of segments will be lower on average, increasing read performance. +Thus, larger values (`> 10`) are best for <>, +and smaller values (`< 10`) are best for <>. + +The value must not be lower than `2`. + +|`[...].io.merge.min_size` +|The minimum target size of segments, in MB, for background merges. + +Segments smaller than this size are merged more aggressively. + +Setting this too large might result in expensive merge operations, even though they are less frequent. + +|`[...].io.merge.max_size` +|The maximum size of segments, in MB, for background merges. + +Segments larger than this size are never merged in the background. + +Setting this to a lower value helps reduce memory requirements and avoids some merging operations at the +cost of optimal search speed. + +When <> an index, this value is ignored and `max_forced_size` is used instead (see below). + +|`[...].io.merge.max_forced_size` +|The maximum size of segments, in MB, for forced merges. + +This is the equivalent of `io.merge.max_size` for <>. +You will generally want to set this to the same value as `max_size` or lower, +but setting it too low will <>.
+ +|`[...].io.merge.calibrate_by_deletes` +|Whether the number of deleted documents in an index should be taken into account; `true` or `false`. + +When enabled, Lucene will consider that a segment with 100 documents, 50 of which are deleted, +actually contains 50 documents. +When disabled, Lucene will consider that such a segment contains 100 documents. + +Setting `calibrate_by_deletes` to `false` will lead to more frequent merges caused by `io.merge.max_docs`, +but will more aggressively merge segments with many deleted documents, improving search performance. +|=============== + +[NOTE] +==== +Refer to Lucene's documentation, in particular the javadoc and source code of `LogByteSizeMergePolicy`, +for more information about the settings and their defaults. +==== + +[TIP] +[[lucene-segment-size]] +======== +The options `io.merge.max_size` and `io.merge.max_forced_size` +do not *directly* define the maximum size of all segment files. + +First, consider that merging a segment is about adding it together with another existing segment to form a larger one. +`io.merge.max_size` is the maximum size of segments *before* merging, +so newly merged segments can be up to twice that size. + +Second, merge options do not affect the size of segments initially created by the index writer, before they are merged. +This size can be limited with the setting `io.writer.ram_buffer_size`, +but Lucene relies on estimates to implement this limit; +when these estimates are off, +it is possible for newly created segments to be slightly larger than `io.writer.ram_buffer_size`. 
+ +So, for example, to be fairly confident no file grows larger than 15MB, +use something like this: + +[source] +---- +hibernate.search.backends..index_defaults.io.writer.ram_buffer_size = 10 +hibernate.search.backends..index_defaults.io.merge.max_size = 7 +hibernate.search.backends..index_defaults.io.merge.max_forced_size = 7 +---- +======== diff --git a/documentation/src/main/asciidoc/mapper-orm-indexing-manual.asciidoc b/documentation/src/main/asciidoc/mapper-orm-indexing-manual.asciidoc index 22cad3a7223..d5c876ac03c 100644 --- a/documentation/src/main/asciidoc/mapper-orm-indexing-manual.asciidoc +++ b/documentation/src/main/asciidoc/mapper-orm-indexing-manual.asciidoc @@ -291,7 +291,7 @@ This is generally not useful as indexes are refreshed automatically. See <> for more information. `refreshAsync()`:: Asynchronous version of `refresh()` returning a `CompletableFuture`. -`mergeSegments()`:: +[[mapper-orm-indexing-manual-merge]]`mergeSegments()`:: Merge each index targeted by this workspace into a single segment. This operation does not always improve performance: see <>. `mergeSegmentsAsync()`:: diff --git a/documentation/src/main/asciidoc/mapper-orm-indexing-massindexer.asciidoc b/documentation/src/main/asciidoc/mapper-orm-indexing-massindexer.asciidoc index 12cdfd1b9b0..99ee3ba0630 100644 --- a/documentation/src/main/asciidoc/mapper-orm-indexing-massindexer.asciidoc +++ b/documentation/src/main/asciidoc/mapper-orm-indexing-massindexer.asciidoc @@ -54,6 +54,16 @@ so there is no need to begin a database transaction before the `MassIndexer` is or to commit a transaction after it is done. ==== +[WARNING] +==== +A note to MySQL users: the `MassIndexer` uses forward only scrollable results +to iterate on the primary keys to be loaded, +but MySQL's JDBC driver will pre-load all values in memory. + +To avoid this "optimization" set the <> +to `Integer.MIN_VALUE`. 
+==== + You can also select entity types when creating a mass indexer, so as to reindex only these types (and their indexed subtypes, if any): @@ -113,6 +123,8 @@ as explained in <>. [[mapper-orm-indexing-massindexer-parameters]] == `MassIndexer` parameters +// Search 5 anchors backward compatibility +[[_useful_parameters_for_batch_indexing]] .`MassIndexer` parameters |=== @@ -129,7 +141,7 @@ That is to say, the number of threads spawned for entity loading will be `typesToIndexInParallel * threadsToLoadObjects` (+ 1 thread per type to retrieve the IDs of entities to load). -|`idFetchSize(int)` +|[[mapper-orm-indexing-massindexer-parameters-idfetchsize]]`idFetchSize(int)` |`100` |The fetch size to be used when loading primary keys. Some databases accept special values, for example MySQL might benefit from using `Integer#MIN_VALUE`, otherwise it diff --git a/documentation/src/main/asciidoc/mapper-orm-indexing-optimizing.asciidoc b/documentation/src/main/asciidoc/mapper-orm-indexing-optimizing.asciidoc deleted file mode 100644 index 82991f56e7e..00000000000 --- a/documentation/src/main/asciidoc/mapper-orm-indexing-optimizing.asciidoc +++ /dev/null @@ -1,33 +0,0 @@ -[[mapper-orm-indexing-optimizing]] -= Optimizing Hibernate Search indexing -// Search 5 anchors backward compatibility -[[_useful_parameters_for_batch_indexing]] - -include::todo-placeholder.asciidoc[] - -//// -TODO HSEARCH-3776 restore this documentation (maybe in the Lucene section) - and link to it when we restore the configuration properties - -Other parameters which affect indexing time and memory consumption are: - -* `hibernate.search.[default|].exclusive_index_use` -* `hibernate.search.[default|].indexwriter.max_buffered_docs` -* `hibernate.search.[default|].indexwriter.max_merge_docs` -* `hibernate.search.[default|].indexwriter.merge_factor` -* `hibernate.search.[default|].indexwriter.merge_min_size` -* `hibernate.search.[default|].indexwriter.merge_max_size` -* 
`hibernate.search.[default|].indexwriter.merge_max_optimize_size` -* `hibernate.search.[default|].indexwriter.merge_calibrate_by_deletes` -* `hibernate.search.[default|].indexwriter.ram_buffer_size` - -Previous versions also had a `max_field_length` but this was removed from Lucene, it's possible -to obtain a similar effect by using a `LimitTokenCountAnalyzer`. - -All `.indexwriter` parameters are Lucene specific and Hibernate Search is just passing these -parameters through - see <> for more details. - -The MassIndexer uses a forward only scrollable result to iterate on the primary keys to be loaded, -but MySQL's JDBC driver will load all values in memory; to avoid this "optimization" set -`idFetchSize` to `Integer.MIN_VALUE`. -//// diff --git a/documentation/src/main/asciidoc/mapper-orm-indexing.asciidoc b/documentation/src/main/asciidoc/mapper-orm-indexing.asciidoc index 1b34b7725df..8a28af9200c 100644 --- a/documentation/src/main/asciidoc/mapper-orm-indexing.asciidoc +++ b/documentation/src/main/asciidoc/mapper-orm-indexing.asciidoc @@ -11,6 +11,4 @@ include::mapper-orm-indexing-jsr352.asciidoc[] include::mapper-orm-indexing-manual.asciidoc[] -include::mapper-orm-indexing-optimizing.asciidoc[] - :leveloffset: -1