documentation/src/main/docbook/en-US/modules/advanced-features.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  ~ Hibernate, Relational Persistence for Idiomatic Java
  ~
  ~  Copyright (c) 2010, Red Hat, Inc. and/or its affiliates or third-party contributors as
  ~  indicated by the @author tags or express copyright attribution
  ~  statements applied by the authors.  All third-party contributions are
  ~  distributed under license by Red Hat, Inc.
  ~
  ~  This copyrighted material is made available to anyone wishing to use, modify,
  ~  copy, or redistribute it subject to the terms and conditions of the GNU
  ~  Lesser General Public License, as published by the Free Software Foundation.
  ~
  ~  This program is distributed in the hope that it will be useful,
  ~  but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  ~  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
  ~  for more details.
  ~
  ~  You should have received a copy of the GNU Lesser General Public License
  ~  along with this distribution; if not, write to:
  ~  Free Software Foundation, Inc.
  ~  51 Franklin Street, Fifth Floor
  ~  Boston, MA  02110-1301  USA
  -->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % BOOK_ENTITIES SYSTEM "../hsearch.ent">
%BOOK_ENTITIES;
]>
<chapter id="search-lucene-native">
  <title>Advanced features</title>

  <para>In this final chapter we are offering a smorgasbord of tips and tricks
  which might become useful as you dive deeper and deeper into Hibernate
  Search.</para>

  <section>
    <title>Accessing the SearchFactory</title>

    <para>The <classname>SearchFactory</classname> object keeps track of the
    underlying Lucene resources for Hibernate Search. It is a convenient way
    to access Lucene natively. The <literal>SearchFactory</literal> can be
    accessed from a <classname>FullTextSession</classname>:</para>

    <example>
      <title>Accessing the <classname>SearchFactory</classname></title>

      <programlisting language="JAVA" role="JAVA">FullTextSession fullTextSession = Search.getFullTextSession(regularSession);
SearchFactory searchFactory = fullTextSession.getSearchFactory();</programlisting>
    </example>
  </section>

  <section id="IndexReaders">
    <title>Using an IndexReader</title>

    <para>Queries in Lucene are executed on an
    <classname>IndexReader</classname>. Hibernate Search caches index readers
    to maximize performance and implements other strategies to retrieve
    updated <classname>IndexReader</classname>s in order to minimize IO
    operations. Your code can access these cached resources, but you have to
    follow some "good citizen" rules.</para>

    <example>
      <title>Accessing an <classname>IndexReader</classname></title>

      <programlisting language="JAVA" role="JAVA">IndexReader reader = searchFactory.getIndexReaderAccessor().open(Order.class);
try {
   //perform read-only operations on the reader
}
finally {
   searchFactory.getIndexReaderAccessor().close(reader);
}          </programlisting>
    </example>

    <para>In this example the <classname>SearchFactory</classname> figures out
    which indexes are needed to query this entity. Using the configured
    <classname>ReaderProvider</classname> (described in <xref
    linkend="search-architecture-readerstrategy"/>) on each index, it returns
    a compound <literal>IndexReader</literal> on top of all involved indexes.
    Because this <classname>IndexReader</classname> is shared amongst several
    clients, you must adhere to the following rules:</para>

    <itemizedlist>
      <listitem>
        <para>Never call indexReader.close(), but always call
        readerProvider.closeReader(reader), using a finally block.</para>
      </listitem>

      <listitem>
        <para>Don't use this <classname>IndexReader</classname> for
        modification operations: it's a readonly
        <classname>IndexReader</classname>, you would get an
        exception).</para>
      </listitem>
    </itemizedlist>

    <para>Aside from those rules, you can use the
    <classname>IndexReader</classname> freely, especially to do native Lucene
    queries. Using this shared <classname>IndexReader</classname>s will be
    more efficient than by opening one directly from - for example - the
    filesystem.</para>

    <para>As an alternative to the method <methodname>open(Class...
    types)</methodname> you can use <methodname>open(String...
    indexNames)</methodname>; in this case you pass in one or more index
    names; using this strategy you can also select a subset of the indexes for
    any indexed type if sharding is used.</para>

    <example>
      <title>Accessing an <classname>IndexReader by index
      names</classname></title>

      <programlisting language="JAVA" role="JAVA">IndexReader reader = searchFactory
      .getIndexReaderAccessor()
      .open("Products.1", "Products.3");</programlisting>
    </example>
  </section>

  <section>
    <title>Accessing a Lucene Directory</title>

    <para>A <classname>Directory</classname> is the most common abstraction
    used by Lucene to represent the index storage; Hibernate Search doesn't
    interact directly with a Lucene <classname>Directory</classname> but
    abstracts these interactions via an <classname>IndexManager</classname>:
    an index does not necessarily need to be implemented by a
    <classname>Directory</classname>.</para>

    <para>If you are certain that your index is represented as a
    <classname>Directory</classname> and need to access it, you can get a
    reference to the <classname>Directory</classname> via the
    <classname>IndexManager</classname>. You will have to cast the
    <classname>IndexManager</classname> instance to a
    <classname>DirectoryBasedIndexManager</classname> and then use
    <literal>getDirectoryProvider().getDirectory()</literal> to get a
    reference to the underlying <classname>Directory</classname>. This is not
    recommended, if you need low level access to the index using Lucene APIs
    we suggest to see <xref linkend="IndexReaders"/> instead.</para>
  </section>

  <section id="advanced-features-sharding">
    <title>Sharding indexes</title>

    <para>In some cases it can be useful to split (shard) the data into
    several Lucene indexes. There are two main use use cases:</para>

    <itemizedlist>
      <listitem>
        <para>A single index is so big that index update times are slowing the
        application down. In this case static sharding can be used to split
        the data into a pre-defined number of shards.</para>
      </listitem>

      <listitem>
        <para>Data is naturally segmented by customer, region, language or
        other application parameter and the index should be split according to
        these segments. This is a use case for dynamic sharding.</para>
      </listitem>
    </itemizedlist>

    <tip>
      <para>By default sharding is not enabled.</para>
    </tip>

    <section>
      <title id="advanced-features-static-sharding">Static sharding</title>

      <para>To enable static sharding set the
      <constant>hibernate.search.&lt;indexName&gt;.sharding_strategy.nbr_of_shards</constant>
      property as seen in <xref linkend="example-index-sharding"/>.</para>

      <example id="example-index-sharding">
        <title>Enabling index sharding</title>

        <programlisting>hibernate.search.[default|&lt;indexName&gt;].sharding_strategy.nbr_of_shards = 5</programlisting>
      </example>

      <para>The default sharding strategy which gets enabled by setting this
      property, splits the data according to the hash value of the document id
      (generated by the <classname>FieldBridge</classname>). This ensures a
      fairly balanced sharding. You can replace the default strategy by
      implementing a custom <classname>IndexShardingStrategy</classname>. To
      use your custom strategy you have to set the
      <constant>hibernate.search.[default|&lt;indexName&gt;].sharding_strategy</constant>
      property to the fully qualified class name of your custom
      <classname>IndexShardingStrategy</classname>.</para>

      <example id="example-index-sharding-strategy">
        <title>Registering a custom IndexShardingStrategy</title>

        <programlisting>hibernate.search.[default|&lt;indexName&gt;].sharding_strategy = my.custom.RandomShardingStrategy</programlisting>
      </example>
    </section>

    <section id="advanced-features-dynamic-sharding">
      <title>Dynamic sharding</title>

      <para>Dynamic sharding allows you to manage the shards yourself and even
      create new shards on the fly. To do so you need to implement the
      interface <classname>ShardIdentifierProvider</classname> and set the
      <constant>hibernate.search.[default|&lt;indexName&gt;].sharding_strategy</constant>
      property to the fully qualified name of this class. Note that instead of
      implementing the interface directly, you should rather derive your
      implementation from
      <classname>org.hibernate.search.store.ShardIdentifierProviderTemplate</classname>
      which provides a basic implementation. Let's look at <xref
      linkend="example-custom-shard-identifier-provider"/> for an
      example.<example id="example-custom-shard-identifier-provider">
          <title>Custom <classname>ShardidentiferProvider</classname></title>

          <programlisting>public static class AnimalShardIdentifierProvider extends ShardIdentifierProviderTemplate {

 @Override
 public String getShardIdentifier(Class&lt;?&gt; entityType, Serializable id,
         String idAsString, Document document) {
    if ( entityType.equals(Animal.class) ) {
       String type = document.getFieldable("type").stringValue();
       addShard(type);
       return type;
    }
    throw new RuntimeException("Animal expected but found " + entityType);
 }

 @Override
 protected Set&lt;String&gt; loadInitialShardNames(Properties properties, BuildContext buildContext) {
    ServiceManager serviceManager = buildContext.getServiceManager();
    SessionFactory sessionFactory = serviceManager.requestService(
        HibernateSessionFactoryServiceProvider.class, buildContext);
    Session session = sessionFactory.openSession();
    try {
       Criteria initialShardsCriteria = session.createCriteria(Animal.class);
       initialShardsCriteria.setProjection( Projections.distinct(Property.forName("type")));

       @SuppressWarnings("unchecked")
       List&lt;String&gt; initialTypes = initialShardsCriteria.list();
       return new HashSet&lt;String&gt;(initialTypes);
    }
    finally {
       session.close();
    }
 }
}</programlisting>
        </example>The are several things happening in
      <classname>AnimalShardIdentifierProvider</classname>. First off its
      purpose is to create one shard per animal type (e.g. mammal, insect,
      etc.). It does so by inspecting the class type and the Lucene document
      passed to the <methodname>getShardIdentifier()</methodname> method. It
      extracts the <constant>type</constant> field from the document and uses
      it as shard name. <methodname>getShardIdentifier()</methodname> is
      called for every addition to the index and a new shard will be created
      with every new animal type encountered. The base class
      <classname>ShardIdentifierProviderTemplate</classname> maintains a set
      with all known shards to which any identifier must be added by calling
      <methodname>addShard()</methodname>.</para>

      <para>It is important to understand that Hibernate Search cannot know which
      shards already exist when the application starts. When using
      <classname>ShardIdentifierProviderTemplate</classname> as base class of
      a <classname>ShardIdentifierProvider</classname> implementation, the
      initial set of shard identifiers must be returned by the
      <methodname>loadInitialShardNames()</methodname> method. How this is
      done will depend on the use case. However, a common case in combination
      with Hibernate ORM is that the initial shard set is defined by the the
      distinct values of a given database column. <xref
      linkend="example-custom-shard-identifier-provider"/> shows how to handle
      such a case. <classname>AnimalShardIdentifierProvider</classname> makes
      in its <methodname>loadInitialShardNames()</methodname> implementation
      use of a service called
      <classname>HibernateSessionFactoryServiceProvider</classname> (see also
      <xref linkend="section-services"/>) which is available within an ORM
      environment. It allows to request a Hibernate
      <classname>SessionFactory</classname> instance which can be used to run
      a <classname>Criteria</classname> query in order to determine the
      initial set of shard identifers.</para>

      <para>Last but not least, the
      <classname>ShardIdentifierProvider</classname> also allows for
      optimizing searches by selecting which shard to run a query against. By
      activating a filter (see <xref linkend="query-filter-shard"/>), a
      sharding strategy can select a subset of the shards used to answer a
      query (<classname>getShardIdentifiersForQuery()</classname>, not shown
      in the example) and thus speed up the query execution.</para>
    </section>
    <important>
        <para>This <classname>ShardIdentifierProvider</classname> is considered
        experimental. We might need to apply some changes to the defined method
        signatures to accomodate for unforeseen use cases. Please provide
        feedback if you have ideas, or just to let us know how you're using
        this API.</para>
    </important>
  </section>

  <section id="section-sharing-indexes">
    <title>Sharing indexes</title>

    <para>It is technically possible to store the information of more than one
    entity into a single Lucene index. There are two ways to accomplish
    this:</para>

    <itemizedlist>
      <listitem>
        <para>Configuring the underlying directory providers to point to the
        same physical index directory. In practice, you set the property
        <literal>hibernate.search.[fully qualified entity
        name].indexName</literal> to the same value. As an example let’s use
        the same index (directory) for the <classname>Furniture</classname>
        and <classname>Animal</classname> entity. We just set
        <literal>indexName</literal> for both entities to for example
        “Animal”. Both entities will then be stored in the Animal
        directory.</para>

        <para><programlisting>hibernate.search.org.hibernate.search.test.shards.Furniture.indexName = Animal
hibernate.search.org.hibernate.search.test.shards.Animal.indexName = Animal</programlisting></para>
      </listitem>

      <listitem>
        <para>Setting the <code>@Indexed</code> annotation’s
        <methodname>index</methodname> attribute of the entities you want to
        merge to the same value. If we again wanted all
        <classname>Furniture</classname> instances to be indexed in the
        <classname>Animal</classname> index along with all instances of
        <classname>Animal</classname> we would specify
        <code>@Indexed(index="Animal")</code> on both
        <classname>Animal</classname> and <classname>Furniture</classname>
        classes.<note>
            <para>This is only presented here so that you know the option is
            available. There is really not much benefit in sharing
            indexes.</para>
          </note></para>
      </listitem>
    </itemizedlist>
  </section>

  <section id="section-services">
    <title>Using external services</title>

    <para>Any of the pluggable contracts we have seen so far allows for the
    injection of a service. The most notable example being the
    <classname>DirectoryProvider</classname>. The full list is:</para>

    <itemizedlist>
      <listitem>
        <para><classname>DirectoryProvider</classname></para>
      </listitem>

      <listitem>
        <para><classname>ReaderProvider</classname></para>
      </listitem>

      <listitem>
        <para><classname>OptimizerStrategy</classname></para>
      </listitem>

      <listitem>
        <para><classname>BackendQueueProcessor</classname></para>
      </listitem>

      <listitem>
        <para><classname>Worker</classname></para>
      </listitem>

      <listitem>
        <para><classname>ErrorHandler</classname></para>
      </listitem>

      <listitem>
        <para><classname>MassIndexerProgressMonitor</classname></para>
      </listitem>
    </itemizedlist>

    <para>Some of these components need to access a service which is either
    available in the environment or whose lifecycle is bound to the
    <classname>SearchFactory</classname>. Sometimes, you even want the same
    service to be shared amongst several instances of these contract. One
    example is the ability the share an Infinispan cache instance between
    several directory providers running in different <literal>JVM</literal>s
    to store the various indexes using the same underlying infrastructure;
    this provides real-time replication of indexes across nodes.</para>

    <section>
      <title>Exposing a service</title>

      <para>To expose a service, you need to implement
      <classname>org.hibernate.search.spi.ServiceProvider&lt;T&gt;</classname>.
      <classname>T</classname> is the type of the service you want to use.
      Services are retrieved by components via their
      <classname>ServiceProvider</classname> class implementation.</para>

      <section>
        <title>Managed services</title>

        <para>If your service ought to be started when Hibernate Search starts
        and stopped when Hibernate Search stops, you can use a managed
        service. Make sure to properly implement the
        <methodname>start</methodname> and <methodname>stop</methodname>
        methods of <classname>ServiceProvider</classname>. When the service is
        requested, the <methodname>getService</methodname> method is
        called.</para>

        <example>
          <title>Example of ServiceProvider implementation</title>

          <programlisting language="JAVA" role="JAVA">public class CacheServiceProvider implements ServiceProvider&lt;Cache&gt; {
    private CacheManager manager;

    public void start(Properties properties) {
        //read configuration
        manager = new CacheManager(properties);
    }

    public Cache getService() {
        return manager.getCache(DEFAULT);
    }

    void stop() {
        manager.close();
    }
}</programlisting>
        </example>

        <note>
          <para>The <classname>ServiceProvider</classname> implementation must
          have a no-arg constructor.</para>
        </note>

        <para>To be transparently discoverable, such service should have an
        accompanying
        <filename>META-INF/services/org.hibernate.search.spi.ServiceProvider</filename>
        whose content list the (various) service provider
        implementation(s).</para>

        <example>
          <title>Content of
          META-INF/services/org.hibernate.search.spi.ServiceProvider</title>

          <programlisting>com.acme.infra.hibernate.CacheServiceProvider</programlisting>
        </example>
      </section>

      <section>
        <title>Provided services</title>

        <para>Alternatively, the service can be provided by the environment
        bootstrapping Hibernate Search. For example, Infinispan which uses
        Hibernate Search as its internal search engine can pass the
        <classname>CacheContainer</classname> to Hibernate Search. In this
        case, the <classname>CacheContainer</classname> instance is not
        managed by Hibernate Search and the
        <methodname>start</methodname>/<methodname>stop</methodname> methods
        of its corresponding service provider will not be used.</para>

        <note>
          <para>Provided services have priority over managed services. If a
          provider service is registered with the same
          <classname>ServiceProvider</classname> class as a managed service,
          the provided service will be used.</para>
        </note>

        <para>The provided services are passed to Hibernate Search via the
        <classname>SearchConfiguration</classname> interface
        (<methodname>getProvidedServices</methodname>).</para>

        <important>
          <para>Provided services are used by frameworks controlling the
          lifecycle of Hibernate Search and not by traditional users.</para>
        </important>

        <para>If, as a user, you want to retrieve a service instance from the
        environment, use registry services like JNDI and look the service up
        in the provider.</para>
      </section>
    </section>

    <section>
      <title>Using a service</title>

      <para>Many of of the pluggable contracts of Hibernate Search can use
      services. Services are accessible via the
      <classname>BuildContext</classname> interface.</para>

      <example>
        <title>Example of a directory provider using a cache service</title>

        <programlisting language="JAVA" role="JAVA">public CustomDirectoryProvider implements DirectoryProvider&lt;RAMDirectory&gt; {
    private BuildContext context;

    public void initialize(
        String directoryProviderName, 
        Properties properties, 
        BuildContext context) {
        //initialize
        this.context = context;
    }

    public void start() {
        Cache cache = context.requestService(CacheServiceProvider.class);
        //use cache
    }

    public RAMDirectory getDirectory() {
        // use cache
    }

    public stop() {
        //stop services
        context.releaseService(CacheServiceProvider.class);
    } 
}</programlisting>
      </example>

      <para>When you request a service, an instance of the service is served
      to you. Make sure to then release the service. This is fundamental. Note
      that the service can be released in the
      <methodname>DirectoryProvider.stop</methodname> method if the
      <classname>DirectoryProvider</classname> uses the service during its
      lifetime or could be released right away of the service is simply used
      at initialization time.</para>
    </section>
  </section>

  <section>
    <title>Customizing Lucene's scoring formula</title>

    <para>Lucene allows the user to customize its scoring formula by extending
    <classname>org.apache.lucene.search.Similarity</classname>. The abstract
    methods defined in this class match the factors of the following formula
    calculating the score of query q for document d:</para>

    <para><emphasis role="bold">score(q,d) = coord(q,d) · queryNorm(q) · ∑
    <subscript>t in q</subscript> ( tf(t in d) · idf(t)
    <superscript>2</superscript> · t.getBoost() · norm(t,d) )
    </emphasis></para>

    <para><informaltable align="left" width="">
        <tgroup cols="2">
          <thead>
            <row>
              <entry align="center">Factor</entry>

              <entry align="center">Description</entry>
            </row>
          </thead>

          <tbody>
            <row>
              <entry align="left">tf(t ind)</entry>

              <entry>Term frequency factor for the term (t) in the document
              (d).</entry>
            </row>

            <row>
              <entry align="left">idf(t)</entry>

              <entry>Inverse document frequency of the term.</entry>
            </row>

            <row>
              <entry align="left">coord(q,d)</entry>

              <entry>Score factor based on how many of the query terms are
              found in the specified document.</entry>
            </row>

            <row>
              <entry align="left">queryNorm(q)</entry>

              <entry>Normalizing factor used to make scores between queries
              comparable.</entry>
            </row>

            <row>
              <entry align="left">t.getBoost()</entry>

              <entry>Field boost.</entry>
            </row>

            <row>
              <entry align="left">norm(t,d)</entry>

              <entry>Encapsulates a few (indexing time) boost and length
              factors.</entry>
            </row>
          </tbody>
        </tgroup>
      </informaltable>It is beyond the scope of this manual to explain this
    formula in more detail. Please refer to
    <classname>Similarity</classname>'s Javadocs for more information.</para>

    <para>Hibernate Search provides two ways to modify Lucene's similarity
    calculation.</para>

    <para>First you can set the default similarity by specifying the fully
    specified classname of your <classname>Similarity</classname>
    implementation using the property
    <constant>hibernate.search.similarity</constant>. The default value is
    <classname>org.apache.lucene.search.DefaultSimilarity</classname>.</para>

    <para>Secondly, you can override the similarity used for a specific index
    by setting the <literal>similarity</literal> property for this index (see
    <xref linkend="search-configuration-directory"/> for more information
    about index configuration):</para>

    <programlisting>hibernate.search.[default|&lt;indexname&gt;].similarity = my.custom.Similarity</programlisting>

    <para>As an example, let's assume it is not important how often a term
    appears in a document. Documents with a single occurrence of the term
    should be scored the same as documents with multiple occurrences. In this
    case your custom implementation of the method <methodname>tf(float freq)
    </methodname> should return 1.0.</para>

    <note>
      <para>When two entities share the same index they must declare the same
      <classname>Similarity</classname> implementation.</para>
    </note>

    <note>
      <para>The use of <classname>@Similarity</classname> which was used to
      configure the similarity on a class level is deprecated since Hibernate
      Search 4.4. Instead of using the annotation use the configuration
      property.</para>
    </note>
  </section>
</chapter>