69 changes: 59 additions & 10 deletions documentation/src/main/asciidoc/backend-elasticsearch.asciidoc
@@ -421,21 +421,72 @@ The Elasticsearch `date` type does not support the whole range of years that can be
// Search 5 anchors backward compatibility
[[elasticsearch-mapping-analyzer]]

<<concepts-analysis,Analysis>> is the text processing performed by analyzers,
both when indexing (document processing)
and when searching (query processing).

All built-in Elasticsearch analyzers can be used transparently,
without any configuration in Hibernate Search:
just use their name wherever Hibernate Search expects an analyzer name.
However, in order to define custom analyzers,
analysis must be configured explicitly.

[CAUTION]
====
Elasticsearch analysis configuration is not applied immediately on startup:
it needs to be pushed to the Elasticsearch cluster.
Hibernate Search will only push the configuration to the cluster if specific conditions are met,
and only if instructed to do so
through the <<backend-elasticsearch-index-lifecycle,lifecycle configuration>>.
====

To configure analysis in an Elasticsearch backend, you will need to:

* Define a class that implements the `org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer` interface.
* Configure the backend to use that implementation by setting the configuration property
`hibernate.search.backends.<backend name>.analysis.configurer`
to a <<configuration-property-types,bean reference>> pointing to the implementation.

Hibernate Search will call the `configure` method of this implementation on startup,
and the configurer will be able to take advantage of a DSL to define analyzers:

.Implementing and using an analysis configurer with the Elasticsearch backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/MyElasticsearchAnalysisConfigurer.java[tags=include]
----
<1> Define a custom analyzer named "english", because it will be used to analyze English text such as book titles.
<2> Set the tokenizer to a standard tokenizer.
<3> Set the char filters. Char filters are applied in the order they are given, before the tokenizer.
<4> Set the token filters. Token filters are applied in the order they are given, after the tokenizer.
<5> Note that, for Elasticsearch, any parameterized char filter, tokenizer or token filter
must be defined separately and assigned a name.
<6> Set the value of a parameter for the char filter/tokenizer/token filter being defined.
<7> Normalizers are defined in a similar way, the only difference being that they cannot use a tokenizer.
<8> Multiple analyzers/normalizers can be defined in the same configurer.

[source, PROPERTIES, indent=0, subs="+callouts"]
----
include::{resourcesdir}/analysis/elasticsearch-simple.properties[]
----
<1> Assign the configurer to the backend `myBackend` using a Hibernate Search configuration property.
====
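
For reference, the included configurer is, in substance, a class along the following lines.
This is a hedged sketch rather than the exact file from the source tree:
the context type and DSL method names (`analyzer(..).custom()`, `tokenizer`, `charFilters`, `tokenFilters`, `param`)
are assumed and may differ slightly between versions.

[source, JAVA]
----
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // A custom analyzer named "english", meant for English text such as book titles.
        context.analyzer( "english" ).custom()
                // Standard tokenizer.
                .tokenizer( "standard" )
                // Char filters, applied in order before the tokenizer.
                .charFilters( "html_strip" )
                // Token filters, applied in order after the tokenizer.
                .tokenFilters( "lowercase", "snowball_english", "asciifolding" );

        // Parameterized components must be defined separately and given a name...
        context.tokenFilter( "snowball_english" )
                .type( "snowball" )
                // ... and parameters are set on that named definition.
                .param( "language", "English" );

        // A normalizer is defined the same way, minus the tokenizer.
        context.normalizer( "lowercase" ).custom()
                .tokenFilters( "lowercase", "asciifolding" );
    }
}
----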

It is also possible to assign a name to a parameterized built-in analyzer:

.Naming a parameterized built-in analyzer in the Elasticsearch backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/AdvancedElasticsearchAnalysisConfigurer.java[tags=type]
----
<1> Define an analyzer with the given name and type.
<2> Set the value of a parameter for the analyzer being defined.
====
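
As a hedged sketch of what that looks like in a configurer
(the class and analyzer names below are made up for illustration,
and the `type`/`param` method names are assumed):

[source, JAVA]
----
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class StopwordsAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // An analyzer named "english_stopwords", based on the built-in "standard" type,
        // with the "stopwords" parameter overridden.
        context.analyzer( "english_stopwords" )
                .type( "standard" )
                .param( "stopwords", "_english_" );
    }
}
----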

[TIP]
====
To know which character filters, tokenizers and token filters are available,
refer to the documentation:

@@ -445,9 +496,7 @@ refer to the documentation:
{elasticsearchDocUrl}/analysis-charfilters.html[character filters],
{elasticsearchDocUrl}/analysis-tokenizers.html[tokenizers],
{elasticsearchDocUrl}/analysis-tokenfilters.html[token filters].


====

[[backend-elasticsearch-multi-tenancy]]
== Multi-tenancy
65 changes: 44 additions & 21 deletions documentation/src/main/asciidoc/backend-lucene.asciidoc
@@ -367,37 +367,60 @@ Date/time types do not support the whole range of years that can be represented
[[backend-lucene-analysis]]
== Analysis

<<concepts-analysis,Analysis>> is the text processing performed by analyzers,
both when indexing (document processing)
and when searching (query processing).

To configure analysis in a Lucene backend, you will need to:

* Define a class that implements the `org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer` interface.
* Configure the backend to use that implementation by setting the configuration property
`hibernate.search.backends.<backend name>.analysis.configurer`
to a <<configuration-property-types,bean reference>> pointing to the implementation.

Hibernate Search will call the `configure` method of this implementation on startup,
and the configurer will be able to take advantage of a DSL to define analyzers:


.Implementing and using an analysis configurer with the Lucene backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/MyLuceneAnalysisConfigurer.java[tags=include]
----
<1> Define a custom analyzer named "english", because it will be used to analyze English text such as book titles.
<2> Set the tokenizer to a standard tokenizer: components are referenced by their factory class.
<3> Set the char filters. Char filters are applied in the order they are given, before the tokenizer.
<4> Set the token filters. Token filters are applied in the order they are given, after the tokenizer.
<5> Set the value of a parameter for the last added char filter/tokenizer/token filter.
<6> Normalizers are defined in a similar way, the only difference being that they cannot use a tokenizer.
<7> Multiple analyzers/normalizers can be defined in the same configurer.

[source, PROPERTIES, indent=0, subs="+callouts"]
----
include::{resourcesdir}/analysis/lucene-simple.properties[]
----
<1> Assign the configurer to the backend `myBackend` using a Hibernate Search configuration property.
====
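
In substance, the included configurer is a class along these lines.
This is a hedged sketch, not the exact file from the source tree:
the context type and DSL method names are assumed and may differ between versions,
but the factory classes are plain Lucene.

[source, JAVA]
----
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // A custom analyzer named "english"; components are referenced by their factory class.
        context.analyzer( "english" ).custom()
                .tokenizer( StandardTokenizerFactory.class )
                // Char filters are applied before the tokenizer, in order.
                .charFilter( HTMLStripCharFilterFactory.class )
                // Token filters are applied after the tokenizer, in order.
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( SnowballPorterFilterFactory.class )
                        // Parameters apply to the last added component.
                        .param( "language", "English" )
                .tokenFilter( ASCIIFoldingFilterFactory.class );

        // A normalizer is defined the same way, minus the tokenizer.
        context.normalizer( "lowercase" ).custom()
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( ASCIIFoldingFilterFactory.class );
    }
}
----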

It is also possible to assign a name to a built-in analyzer
or to a custom analyzer implementation:

.Naming an analyzer instance in the Lucene backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/AdvancedLuceneAnalysisConfigurer.java[tags=instance]
----
====
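
As a hedged sketch (the `instance(...)` method name is assumed;
`EnglishAnalyzer` is a stock Lucene analyzer):

[source, JAVA]
----
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class EnglishInstanceAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // Assign the name "english" to a ready-made Lucene analyzer instance.
        context.analyzer( "english" ).instance( new EnglishAnalyzer() );
    }
}
----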

[TIP]
====
To know which analyzers, character filters, tokenizers and token filters are available,
either browse the Lucene Javadoc or read the corresponding section on the
link:http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters[Solr Wiki]
(you don't need Solr to use these analyzers,
it's just that there is no documentation page for Lucene proper).
====

[[backend-lucene-multi-tenancy]]
== Multi-tenancy
149 changes: 134 additions & 15 deletions documentation/src/main/asciidoc/concepts.asciidoc
@@ -3,37 +3,156 @@
[[concepts-full-text]]
== Full-text search

Full-text search is a set of techniques for searching a corpus of text documents
for the documents that best match a given query.

The main difference with traditional search -- for example in an SQL database --
is that the stored text is not considered as a single block of text,
but as a collection of tokens (words).

Hibernate Search relies on either http://lucene.apache.org/[Apache Lucene]
or https://www.elastic.co/products/elasticsearch[Elasticsearch]
to implement full-text search.
Since Elasticsearch uses Lucene internally,
they share many characteristics, including their general approach to full-text search.

To simplify, these search engines are based on the concept of inverted indexes:
a dictionary where the key is a token (word) found in a document,
and the value is the list of identifiers of every document containing this token.

Still simplifying, once all documents are indexed,
searching for documents involves three steps:

. extracting tokens (words) from the query;
. looking up these tokens in the index to find matching documents;
. aggregating the results of the lookups to produce a list of matching documents.
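
As a toy illustration of this structure
(nothing like Lucene's actual on-disk format, and with deliberately naive text processing),
an inverted index can be pictured as a map from token to the identifiers of the documents containing it:

[source, JAVA]
----
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ToyInvertedIndex {

    // token -> identifiers of the documents containing that token
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int documentId, String text) {
        // Deliberately naive "analysis": lowercase, then split on whitespace.
        for ( String token : text.toLowerCase( Locale.ROOT ).split( "\\s+" ) ) {
            index.computeIfAbsent( token, t -> new HashSet<>() ).add( documentId );
        }
    }

    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        // Same processing on the query, then intersect the per-token document sets.
        for ( String token : query.toLowerCase( Locale.ROOT ).split( "\\s+" ) ) {
            Set<Integer> matches = index.getOrDefault( token, Collections.emptySet() );
            if ( result == null ) {
                result = new HashSet<>( matches );
            }
            else {
                result.retainAll( matches );
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add( 1, "refactoring improving the design of existing code" );
        index.add( 2, "java persistence with hibernate" );
        System.out.println( index.search( "hibernate java" ) ); // prints [2]
    }
}
----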

[NOTE]
====
Lucene and Elasticsearch are not limited to just text search: numeric data is also supported,
including integers, doubles, longs, dates, etc.
These types are indexed and queried using a slightly different approach,
which obviously does not involve text processing.
====

[[concepts-mapping]]
== Mapping

Applications targeted by Hibernate Search generally use an entity-based model to represent data.
In this model, each entity is a single object with a few properties of atomic type
(`String`, `Integer`, `LocalDate`, ...).
Each entity can have multiple associations to one or even many other entities.

Entities are thus organized as a graph,
where each node is an entity and each association is an edge.

By contrast, Lucene and Elasticsearch work with documents.
Each document is a collection of "fields",
each field being assigned a name -- a unique string --
and a value -- which can be text, but also numeric data such as an integer or a date.
Fields also have a type, which not only determines the type of values (text/numeric),
but more importantly the way this value will be stored: indexed, stored, with doc values, etc.
It is possible to introduce nested documents, but not real associations.

Documents are thus organized, at best, as a collection of trees,
where each tree is a document, optionally with nested documents.

There are multiple mismatches between the entity model and the document model:
properties vs. fields, associations vs. nested documents, graph vs. collection of trees.

The goal of _mapping_, in Hibernate Search, is to resolve these mismatches
by defining how to transform one or more entities into a document,
and how to resolve a search hit back into the original entity.
This is the main added value of Hibernate Search,
the basis for everything else from automatic indexing to the various search DSLs.

Mapping is usually configured using annotations in the entity model,
but this can also be achieved using a programmatic API.
To learn more about how to configure mapping, see <<mapper-orm-mapping>>.
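
For instance, with annotations, a book entity could be mapped along the following lines.
This is a hedged sketch: the annotations shown (`@Indexed`, `@FullTextField`, `@GenericField`, `@IndexedEmbedded`)
are from the Hibernate Search 6 mapper,
and the "english" analyzer is assumed to be defined in the backend's analysis configuration.

[source, JAVA]
----
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.GenericField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;

@Entity
@Indexed // Each Book instance is mapped to one document in a dedicated index.
public class Book {

    @Id
    private Long id; // Used as the document identifier.

    @FullTextField(analyzer = "english") // A full-text field, analyzed with the "english" analyzer.
    private String title;

    @GenericField // A non-text field holding numeric data.
    private Integer pageCount;

    @ManyToOne
    @IndexedEmbedded // The association is flattened into fields of the Book document.
    private Author author;

    // Constructors, getters and setters omitted.
}

@Entity
class Author {

    @Id
    private Long id;

    @FullTextField(analyzer = "english")
    private String name;

    // Constructors, getters and setters omitted.
}
----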

To learn how to index the resulting documents, see <<mapper-orm-indexing>>
(hint: it's automatic).

To learn how to search with an API
that takes advantage of the mapping to be closer to the entity model,
in particular by returning hits as entities instead of just document identifiers,
see <<search-dsl>>.

[[concepts-analysis]]
== Analysis
// Search 5 anchors backward compatibility
[[analyzer]]

As mentioned in <<concepts-full-text>>,
the full-text engine works on tokens,
which means text has to be processed
both when indexing (document processing, to build the token -> document index)
and when searching (query processing, to generate a list of tokens to look up).

However, the processing is not *just* about "tokenizing".
Index lookups are *exact* lookups,
which means that looking up `Great` (capitalized) will not return documents containing only `great` (all lowercase).
An extra step is performed when processing text to address this caveat:
token filtering, which normalizes tokens.
Thanks to that "normalization",
`Great` will be indexed as `great`,
so that an index lookup for the query `great` will match as expected.

In the Lucene world (Lucene, Elasticsearch, Solr, ...),
text processing during both the indexing and searching phases
is called "analysis" and is performed by an "analyzer".

The analyzer is made up of three types of components,
which will each process the text successively in the following order:

. Character filter: transforms the input characters. Replaces, adds or removes characters.
. Tokenizer: splits the text into several words, called "tokens".
. Token filter: transforms the tokens. Replaces, adds or removes characters in a token,
derives new tokens from the existing ones, removes tokens based on some condition, ...

The tokenizer usually splits on whitespace (though there are other options).
Token filters are usually where customization takes place.
They can remove accented characters,
remove meaningless suffixes (`-ing`, `-s`, ...)
or tokens (`a`, `the`, ...),
replace tokens with a chosen spelling (`wi-fi` => `wifi`),
etc.
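
To make this concrete, the following standalone sketch
(plain Lucene APIs, independent of Hibernate Search)
assembles a standard tokenizer with a lowercasing token filter and prints the resulting tokens;
`Great` does come out as `great`:

[source, JAVA]
----
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {

    public static void main(String[] args) throws IOException {
        // A standard tokenizer followed by a lowercasing token filter.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer( "standard" )
                .addTokenFilter( "lowercase" )
                .build();

        try ( TokenStream stream = analyzer.tokenStream( "title", "Great Wi-Fi Routers" ) ) {
            CharTermAttribute term = stream.addAttribute( CharTermAttribute.class );
            stream.reset();
            while ( stream.incrementToken() ) {
                System.out.println( term.toString() ); // great, wi, fi, routers
            }
            stream.end();
        }
    }
}
----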

[TIP]
====
Character filters, though useful, are rarely used,
because they have no knowledge of token boundaries.

Unless you know what you are doing,
you should generally favor token filters.
====

In some cases, it is necessary to index text in one block,
without any tokenization:

* For some types of text, such as SKUs or other business codes,
tokenization simply does not make sense: the text is a single "keyword".
* For sorts by field value, tokenization is not necessary.
It is also forbidden in Hibernate Search due to performance issues;
only non-tokenized fields can be sorted on.

To address these use cases,
a special type of analyzer, called "normalizer", is available.
Normalizers are simply analyzers that are guaranteed not to use a tokenizer:
they can only use character filters and token filters.

In Hibernate Search, analyzers and normalizers are referenced by their name,
for example <<mapper-orm-directfieldmapping-analyzer,when defining a full-text field>>.
Analyzers and normalizers have two separate namespaces.
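
For illustration, a hedged sketch of such references, using the Hibernate Search 6 field annotations
(the "english" analyzer and "lowercase" normalizer names are assumed to be defined in your analysis configuration):

[source, JAVA]
----
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.engine.backend.types.Sortable;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.KeywordField;

@Entity
@Indexed
public class Product {

    @Id
    private Long id;

    // Tokenized, analyzed text: good for full-text matching, but cannot be sorted on.
    @FullTextField(analyzer = "english")
    private String description;

    // Single-token ("keyword") field: normalized but not tokenized, so it can be sorted on.
    @KeywordField(normalizer = "lowercase", sortable = Sortable.YES)
    private String sku;

    // Constructors, getters and setters omitted.
}
----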

Some names are already assigned to built-in analyzers (in Elasticsearch in particular),
but it is possible (and recommended) to assign names to custom analyzers and normalizers,
assembled using built-in components (tokenizers, filters) to address your specific needs.

Each backend exposes its own APIs to define analyzers and normalizers,
and generally to configure analysis.
See the documentation of each backend for more information:

* <<backend-lucene-analysis,Analysis for the Lucene backend>>
* <<backend-elasticsearch-analysis,Analysis for the Elasticsearch backend>>