69 changes: 59 additions & 10 deletions documentation/src/main/asciidoc/backend-elasticsearch.asciidoc
@@ -421,21 +421,72 @@ The Elasticsearch `date` type does not support the whole range of years that can be
// Search 5 anchors backward compatibility
[[elasticsearch-mapping-analyzer]]

<<concepts-analysis,Analysis>> is the text processing performed by analyzers,
both when indexing (document processing)
and when searching (query processing).

All built-in Elasticsearch analyzers can be used transparently,
without any configuration in Hibernate Search:
just use their name wherever Hibernate Search expects an analyzer name.
However, in order to define custom analyzers,
analysis must be configured explicitly.

[CAUTION]
====
Elasticsearch analysis configuration is not applied immediately on startup:
it needs to be pushed to the Elasticsearch cluster.
Hibernate Search will only push the configuration to the cluster if specific conditions are met,
and only if instructed to do so
through the <<backend-elasticsearch-index-lifecycle,lifecycle configuration>>.
====

To configure analysis in an Elasticsearch backend, you will need to:

* Define a class that implements the `org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer` interface.
* Configure the backend to use that implementation by setting the configuration property
`hibernate.search.backends.<backend name>.analysis.configurer`
to a <<configuration-property-types,bean reference>> pointing to the implementation.

Hibernate Search will call the `configure` method of this implementation on startup,
and the configurer will be able to take advantage of a DSL to define analyzers:

.Implementing and using an analysis configurer with the Elasticsearch backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/MyElasticsearchAnalysisConfigurer.java[tags=include]
----
<1> Define a custom analyzer named "english", because it will be used to analyze English text such as book titles.
<2> Set the tokenizer to a standard tokenizer.
<3> Set the char filters. Char filters are applied in the order they are given, before the tokenizer.
<4> Set the token filters. Token filters are applied in the order they are given, after the tokenizer.
<5> Note that, for Elasticsearch, any parameterized char filter, tokenizer or token filter
must be defined separately and assigned a name.
<6> Set the value of a parameter for the char filter/tokenizer/token filter being defined.
<7> Normalizers are defined in a similar way, the only difference being that they cannot use a tokenizer.
<8> Multiple analyzers/normalizers can be defined in the same configurer.

[source, PROPERTIES, indent=0, subs="+callouts"]
----
include::{resourcesdir}/analysis/elasticsearch-simple.properties[]
----
<1> Assign the configurer to the backend `myBackend` using a Hibernate Search configuration property.
====
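
For reference, the included configurer is, in substance, a class along the following lines.
This is a hedged sketch rather than the exact file from the source tree:
the context type and DSL method names (`analyzer(..).custom()`, `tokenizer`, `charFilters`, `tokenFilters`, `param`)
are assumed and may differ slightly between versions.

[source, JAVA]
----
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // A custom analyzer named "english", meant for English text such as book titles.
        context.analyzer( "english" ).custom()
                // Standard tokenizer.
                .tokenizer( "standard" )
                // Char filters, applied in order before the tokenizer.
                .charFilters( "html_strip" )
                // Token filters, applied in order after the tokenizer.
                .tokenFilters( "lowercase", "snowball_english", "asciifolding" );

        // Parameterized components must be defined separately and given a name...
        context.tokenFilter( "snowball_english" )
                .type( "snowball" )
                // ... and parameters are set on that named definition.
                .param( "language", "English" );

        // A normalizer is defined the same way, minus the tokenizer.
        context.normalizer( "lowercase" ).custom()
                .tokenFilters( "lowercase", "asciifolding" );
    }
}
----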

It is also possible to assign a name to a parameterized built-in analyzer:

.Naming a parameterized built-in analyzer in the Elasticsearch backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/AdvancedElasticsearchAnalysisConfigurer.java[tags=type]
----
<1> Define an analyzer with the given name and type.
<2> Set the value of a parameter for the analyzer being defined.
====
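
As a hedged sketch of what that looks like in a configurer
(the class and analyzer names below are made up for illustration,
and the `type`/`param` method names are assumed):

[source, JAVA]
----
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class StopwordsAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // An analyzer named "english_stopwords", based on the built-in "standard" type,
        // with the "stopwords" parameter overridden.
        context.analyzer( "english_stopwords" )
                .type( "standard" )
                .param( "stopwords", "_english_" );
    }
}
----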

[TIP]
====
To know which character filters, tokenizers and token filters are available,
refer to the documentation:

@@ -445,9 +496,7 @@ refer to the documentation:
{elasticsearchDocUrl}/analysis-charfilters.html[character filters],
{elasticsearchDocUrl}/analysis-tokenizers.html[tokenizers],
{elasticsearchDocUrl}/analysis-tokenfilters.html[token filters].


====

[[backend-elasticsearch-multi-tenancy]]
== Multi-tenancy
65 changes: 44 additions & 21 deletions documentation/src/main/asciidoc/backend-lucene.asciidoc
@@ -367,37 +367,60 @@ Date/time types do not support the whole range of years that can be represented
[[backend-lucene-analysis]]
== Analysis

<<concepts-analysis,Analysis>> is the text processing performed by analyzers,
both when indexing (document processing)
and when searching (query processing).

To configure analysis in a Lucene backend, you will need to:

* Define a class that implements the `org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer` interface.
* Configure the backend to use that implementation by setting the configuration property
`hibernate.search.backends.<backend name>.analysis.configurer`
to a <<configuration-property-types,bean reference>> pointing to the implementation.

Hibernate Search will call the `configure` method of this implementation on startup,
and the configurer will be able to take advantage of a DSL to define analyzers:


.Implementing and using an analysis configurer with the Lucene backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/MyLuceneAnalysisConfigurer.java[tags=include]
----
<1> Define a custom analyzer named "english", because it will be used to analyze English text such as book titles.
<2> Set the tokenizer to a standard tokenizer: components are referenced by their factory class.
<3> Set the char filters. Char filters are applied in the order they are given, before the tokenizer.
<4> Set the token filters. Token filters are applied in the order they are given, after the tokenizer.
<5> Set the value of a parameter for the last added char filter/tokenizer/token filter.
<6> Normalizers are defined in a similar way, the only difference being that they cannot use a tokenizer.
<7> Multiple analyzers/normalizers can be defined in the same configurer.

[source, PROPERTIES, indent=0, subs="+callouts"]
----
include::{resourcesdir}/analysis/lucene-simple.properties[]
----
<1> Assign the configurer to the backend `myBackend` using a Hibernate Search configuration property.
====
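
In substance, the included configurer is a class along these lines.
This is a hedged sketch, not the exact file from the source tree:
the context type and DSL method names are assumed and may differ between versions,
but the factory classes are plain Lucene.

[source, JAVA]
----
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // A custom analyzer named "english"; components are referenced by their factory class.
        context.analyzer( "english" ).custom()
                .tokenizer( StandardTokenizerFactory.class )
                // Char filters are applied before the tokenizer, in order.
                .charFilter( HTMLStripCharFilterFactory.class )
                // Token filters are applied after the tokenizer, in order.
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( SnowballPorterFilterFactory.class )
                        // Parameters apply to the last added component.
                        .param( "language", "English" )
                .tokenFilter( ASCIIFoldingFilterFactory.class );

        // A normalizer is defined the same way, minus the tokenizer.
        context.normalizer( "lowercase" ).custom()
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( ASCIIFoldingFilterFactory.class );
    }
}
----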

It is also possible to assign a name to a built-in analyzer
or to a custom analyzer implementation:

.Naming an analyzer instance in the Lucene backend
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/analysis/AdvancedLuceneAnalysisConfigurer.java[tags=instance]
----
====
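
As a hedged sketch (the `instance(...)` method name is assumed;
`EnglishAnalyzer` is a stock Lucene analyzer):

[source, JAVA]
----
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class EnglishInstanceAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // Assign the name "english" to a ready-made Lucene analyzer instance.
        context.analyzer( "english" ).instance( new EnglishAnalyzer() );
    }
}
----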

[TIP]
====
To know which analyzers, character filters, tokenizers and token filters are available,
either browse the Lucene Javadoc or read the corresponding section on the
link:http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters[Solr Wiki]
(you don't need Solr to use these analyzers,
it's just that there is no documentation page for Lucene proper).
====

[[backend-lucene-multi-tenancy]]
== Multi-tenancy
149 changes: 134 additions & 15 deletions documentation/src/main/asciidoc/concepts.asciidoc
@@ -3,37 +3,156 @@
[[concepts-full-text]]
== Full-text search

Full-text search is a set of techniques for searching a corpus of text documents
for the documents that best match a given query.

The main difference with traditional search -- for example in an SQL database --
is that the stored text is not considered as a single block of text,
but as a collection of tokens (words).

Hibernate Search relies on either http://lucene.apache.org/[Apache Lucene]
or https://www.elastic.co/products/elasticsearch[Elasticsearch]
to implement full-text search.
Since Elasticsearch uses Lucene internally,
they share many characteristics, including their general approach to full-text search.

To simplify, these search engines are based on the concept of inverted indexes:
a dictionary where the key is a token (word) found in a document,
and the value is the list of identifiers of every document containing this token.

Still simplifying, once all documents are indexed,
searching for documents involves three steps:

. extracting tokens (words) from the query;
. looking up these tokens in the index to find matching documents;
. aggregating the results of the lookups to produce a list of matching documents.
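
As a toy illustration of this structure
(nothing like Lucene's actual on-disk format, and with deliberately naive text processing),
an inverted index can be pictured as a map from token to the identifiers of the documents containing it:

[source, JAVA]
----
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ToyInvertedIndex {

    // token -> identifiers of the documents containing that token
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int documentId, String text) {
        // Deliberately naive "analysis": lowercase, then split on whitespace.
        for ( String token : text.toLowerCase( Locale.ROOT ).split( "\\s+" ) ) {
            index.computeIfAbsent( token, t -> new HashSet<>() ).add( documentId );
        }
    }

    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        // Same processing on the query, then intersect the per-token document sets.
        for ( String token : query.toLowerCase( Locale.ROOT ).split( "\\s+" ) ) {
            Set<Integer> matches = index.getOrDefault( token, Collections.emptySet() );
            if ( result == null ) {
                result = new HashSet<>( matches );
            }
            else {
                result.retainAll( matches );
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add( 1, "refactoring improving the design of existing code" );
        index.add( 2, "java persistence with hibernate" );
        System.out.println( index.search( "hibernate java" ) ); // prints [2]
    }
}
----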

[NOTE]
====
Lucene and Elasticsearch are not limited to just text search: numeric data is also supported,
including integers, doubles, longs, dates, etc.
These types are indexed and queried using a slightly different approach,
which obviously does not involve text processing.
====

[[concepts-mapping]]
== Mapping

Applications targeted by Hibernate Search generally use an entity-based model to represent data.
In this model, each entity is a single object with a few properties of atomic type
(`String`, `Integer`, `LocalDate`, ...).
Each entity can have multiple associations to one or even many other entities.

Entities are thus organized as a graph,
where each node is an entity and each association is an edge.

By contrast, Lucene and Elasticsearch work with documents.
Each document is a collection of "fields",
each field being assigned a name -- a unique string --
and a value -- which can be text, but also numeric data such as an integer or a date.
Fields also have a type, which not only determines the type of values (text/numeric),
but more importantly the way this value will be stored: indexed, stored, with doc values, etc.
It is possible to introduce nested documents, but not real associations.

Documents are thus organized, at best, as a collection of trees,
where each tree is a document, optionally with nested documents.

There are multiple mismatches between the entity model and the document model:
properties vs. fields, associations vs. nested documents, graph vs. collection of trees.

The goal of _mapping_, in Hibernate Search, is to resolve these mismatches
by defining how to transform one or more entities into a document,
and how to resolve a search hit back into the original entity.
This is the main added value of Hibernate Search,
the basis for everything else from automatic indexing to the various search DSLs.

Mapping is usually configured using annotations in the entity model,
but this can also be achieved using a programmatic API.
To learn more about how to configure mapping, see <<mapper-orm-mapping>>.
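
For instance, with annotations, a book entity could be mapped along the following lines.
This is a hedged sketch: the annotations shown (`@Indexed`, `@FullTextField`, `@GenericField`, `@IndexedEmbedded`)
are from the Hibernate Search 6 mapper,
and the "english" analyzer is assumed to be defined in the backend's analysis configuration.

[source, JAVA]
----
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.GenericField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;

@Entity
@Indexed // Each Book instance is mapped to one document in a dedicated index.
public class Book {

    @Id
    private Long id; // Used as the document identifier.

    @FullTextField(analyzer = "english") // A full-text field, analyzed with the "english" analyzer.
    private String title;

    @GenericField // A non-text field holding numeric data.
    private Integer pageCount;

    @ManyToOne
    @IndexedEmbedded // The association is flattened into fields of the Book document.
    private Author author;

    // Constructors, getters and setters omitted.
}

@Entity
class Author {

    @Id
    private Long id;

    @FullTextField(analyzer = "english")
    private String name;

    // Constructors, getters and setters omitted.
}
----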

To learn how to index the resulting documents, see <<mapper-orm-indexing>>
(hint: it's automatic).

To learn how to search with an API
that takes advantage of the mapping to be closer to the entity model,
in particular by returning hits as entities instead of just document identifiers,
see <<search-dsl>>.

[[concepts-analysis]]
== Analysis
// Search 5 anchors backward compatibility
[[analyzer]]

As mentioned in <<concepts-full-text>>,
the full-text engine works on tokens,
which means text has to be processed
both when indexing (document processing, to build the token -> document index)
and when searching (query processing, to generate a list of tokens to look up).

However, the processing is not *just* about "tokenizing".
Index lookups are *exact* lookups,
which means that looking up `Great` (capitalized) will not return documents containing only `great` (all lowercase).
An extra step is performed when processing text to address this caveat:
token filtering, which normalizes tokens.
Thanks to that "normalization",
`Great` will be indexed as `great`,
so that an index lookup for the query `great` will match as expected.

In the Lucene world (Lucene, Elasticsearch, Solr, ...),
text processing during both the indexing and searching phases
is called "analysis" and is performed by an "analyzer".

The analyzer is made up of three types of components,
which will each process the text successively in the following order:

. Character filter: transforms the input characters. Replaces, adds or removes characters.
. Tokenizer: splits the text into several words, called "tokens".
. Token filter: transforms the tokens. Replaces, adds or removes characters in a token,
derives new tokens from the existing ones, removes tokens based on some condition, ...

The tokenizer usually splits on whitespace (though there are other options).
Token filters are usually where customization takes place.
They can remove accented characters,
remove meaningless suffixes (`-ing`, `-s`, ...)
or tokens (`a`, `the`, ...),
replace tokens with a chosen spelling (`wi-fi` => `wifi`),
etc.
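
To make this concrete, the following standalone sketch
(plain Lucene APIs, independent of Hibernate Search)
assembles a standard tokenizer with a lowercasing token filter and prints the resulting tokens;
`Great` does come out as `great`:

[source, JAVA]
----
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {

    public static void main(String[] args) throws IOException {
        // A standard tokenizer followed by a lowercasing token filter.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer( "standard" )
                .addTokenFilter( "lowercase" )
                .build();

        try ( TokenStream stream = analyzer.tokenStream( "title", "Great Wi-Fi Routers" ) ) {
            CharTermAttribute term = stream.addAttribute( CharTermAttribute.class );
            stream.reset();
            while ( stream.incrementToken() ) {
                System.out.println( term.toString() ); // great, wi, fi, routers
            }
            stream.end();
        }
    }
}
----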

[TIP]
====
Character filters, though useful, are rarely used,
because they have no knowledge of token boundaries.

Unless you know what you are doing,
you should generally favor token filters.
====

In some cases, it is necessary to index text in one block,
without any tokenization:

* For some types of text, such as SKUs or other business codes,
tokenization simply does not make sense: the text is a single "keyword".
* For sorts by field value, tokenization is not necessary.
It is also forbidden in Hibernate Search due to performance issues;
only non-tokenized fields can be sorted on.

To address these use cases,
a special type of analyzer, called "normalizer", is available.
Normalizers are simply analyzers that are guaranteed not to use a tokenizer:
they can only use character filters and token filters.

In Hibernate Search, analyzers and normalizers are referenced by their name,
for example <<mapper-orm-directfieldmapping-analyzer,when defining a full-text field>>.
Analyzers and normalizers have two separate namespaces.
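
For illustration, a hedged sketch of such references, using the Hibernate Search 6 field annotations
(the "english" analyzer and "lowercase" normalizer names are assumed to be defined in your analysis configuration):

[source, JAVA]
----
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.engine.backend.types.Sortable;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.KeywordField;

@Entity
@Indexed
public class Product {

    @Id
    private Long id;

    // Tokenized, analyzed text: good for full-text matching, but cannot be sorted on.
    @FullTextField(analyzer = "english")
    private String description;

    // Single-token ("keyword") field: normalized but not tokenized, so it can be sorted on.
    @KeywordField(normalizer = "lowercase", sortable = Sortable.YES)
    private String sku;

    // Constructors, getters and setters omitted.
}
----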

Some names are already assigned to built-in analyzers (in Elasticsearch in particular),
but it is possible (and recommended) to assign names to custom analyzers and normalizers,
assembled using built-in components (tokenizers, filters) to address your specific needs.

Each backend exposes its own APIs to define analyzers and normalizers,
and generally to configure analysis.
See the documentation of each backend for more information:

* <<backend-lucene-analysis,Analysis for the Lucene backend>>
* <<backend-elasticsearch-analysis,Analysis for the Elasticsearch backend>>