Lucene Facets Code Tour

jabrah edited this page Apr 12, 2017 · 11 revisions

GitHub: 'search-facets-integration' branch (https://github.com/jhu-digital-manuscripts/rosa2/tree/search-facets-integration). Proof-of-concept implementation of Lucene facets in the Rosa search service, showing simple two-level facets with only a category and a value.

Current Model

  • RosaFacet - has information about facet dimension, path, count
  • SearchFacets - container for a set of RosaFacets, has some utility methods for getting facets
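To make the model concrete, here is a minimal sketch of what the two classes might look like. Only the class names and the three pieces of information (dimension, path, count) come from this page; every field name, constructor, and method below is an assumption for illustration and may not match the code in the branch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: member names are assumptions, not the actual Rosa code.
final class RosaFacet {
    private final String dim;     // facet dimension (category), e.g. "facet_author"
    private final String[] path;  // path within the dimension; empty means no value selected
    private final int count;      // number of documents under this facet

    RosaFacet(String dim, String[] path, int count) {
        this.dim = dim;
        this.path = path.clone();
        this.count = count;
    }

    String getDim() { return dim; }
    String[] getPath() { return path.clone(); }
    int getCount() { return count; }
}

// Hypothetical container with one example utility method.
final class SearchFacets {
    private final List<RosaFacet> facets;

    SearchFacets(List<RosaFacet> facets) {
        this.facets = Collections.unmodifiableList(new ArrayList<>(facets));
    }

    List<RosaFacet> getFacets() { return facets; }

    // Returns only the facets that belong to the given dimension.
    List<RosaFacet> getFacets(String dim) {
        List<RosaFacet> result = new ArrayList<>();
        for (RosaFacet f : facets) {
            if (f.getDim().equals(dim)) {
                result.add(f);
            }
        }
        return result;
    }
}
```

With two-level facets, the path would hold at most one element (the value), and a facet with an empty path would stand for the dimension as a whole.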

Indexing

Faceting starts with the Lucene index. In the current implementation, facets are stored in the same index as the search data and are indexed in the same documents as the normal search fields. This approach needs no second index, but it limits facets to two levels (category + value).

The indexing process is kicked off at build time in the LuceneSearchService classes. During the update phase, Lucene documents are created with both search fields and facet fields. Nothing more has to be added here for indexing, unless a different faceting scheme is used.
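The idea of one index carrying both search fields and facet fields can be illustrated without Lucene. The sketch below is a plain-Java stand-in, not the LuceneSearchService code and not a description of how Lucene lays out data on disk; the class and all of its methods are hypothetical.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a single index holding search terms and facets.
final class CombinedIndexSketch {
    private final Map<String, Set<Integer>> termPostings = new TreeMap<>();   // term -> doc ids
    private final Map<String, Set<Integer>> facetPostings = new TreeMap<>();  // "dim/value" -> doc ids
    private int nextDocId = 0;

    // Adds one document: its text feeds the search side, its facets feed the facet side.
    int addDocument(String text, Map<String, String> facets) {
        int id = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                termPostings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        for (Map.Entry<String, String> e : facets.entrySet()) {
            String key = e.getKey() + "/" + e.getValue();  // two levels only: category + value
            facetPostings.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Ids of all documents containing the given term.
    Set<Integer> docsWithTerm(String term) {
        return termPostings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }

    // Number of documents indexed under the given category and value.
    int facetCount(String dim, String value) {
        return facetPostings.getOrDefault(dim + "/" + value, Collections.<Integer>emptySet()).size();
    }
}
```

Because each document is added once and feeds both maps, search hits and facet counts always refer to the same document ids, which is the property the single-index approach relies on.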

Lucene also offers an alternative approach to faceted search that keeps the facets in a separate index, apart from the normal search index. Unlike the current implementation, that approach supports deeply nested facets.
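What "deeply nested" means for counting can be shown with a small plain-Java sketch. It is only an illustration of hierarchical counts, not the Lucene API for the separate-index approach; the class is hypothetical, and it assumes each document carries at most one path per category.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hierarchical facet counting.
final class NestedFacetCounts {
    // Key: full path joined with " > ", e.g. "Author > American > Mark Twain".
    private final Map<String, Integer> counts = new TreeMap<>();

    // Registers one document under a facet path. Every prefix of the path is
    // counted, so a parent aggregates the documents of all its children.
    // Assumes a document is added under only one path per category.
    void add(String... path) {
        StringBuilder prefix = new StringBuilder();
        for (String part : path) {
            if (prefix.length() > 0) {
                prefix.append(" > ");
            }
            prefix.append(part);
            counts.merge(prefix.toString(), 1, Integer::sum);
        }
    }

    // Number of documents at or below the given path; 0 if the path is unknown.
    int count(String... path) {
        return counts.getOrDefault(String.join(" > ", path), 0);
    }
}
```

Calling add("Author", "American", "Mark Twain") increments the counts for Author, Author > American, and Author > American > Mark Twain. With the current two-level scheme only the first and last of those levels can exist.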

Searching

A new method has been added: SearchService#search(Query, SearchOptions, RosaFacet[]). {Impl}

  • runs a search for the Query and returns facets alongside the results
  • handles Rosa search Queries and search options the same way as the original search method, but also lets the caller specify selected facets
  • if no facets are specified, top-level facets are returned with the search results
  • if facets are specified, the search is run for the query and the results are drilled down to the selected facets
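The behaviour listed above can be sketched in plain Java. This is a simplified illustration of the semantics, not the implementation behind SearchService#search: the query is reduced to a substring match, selected facets are passed as a map instead of RosaFacet[], each document has one value per category, and all type and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical document: 'text' stands in for the normal search fields.
final class Doc {
    final String text;
    final Map<String, String> facets;  // category -> value (two-level facets)

    Doc(String text, Map<String, String> facets) {
        this.text = text;
        this.facets = facets;
    }
}

// Hypothetical result: hits plus facet counts computed over those hits only.
final class FacetedResult {
    final List<Doc> hits = new ArrayList<>();
    final Map<String, Map<String, Integer>> counts = new TreeMap<>();  // category -> (value -> count)
}

final class FacetedSearchSketch {
    // 'selected' maps a category to the chosen value; an empty map means no drill-down.
    static FacetedResult search(List<Doc> index, String term, Map<String, String> selected) {
        FacetedResult result = new FacetedResult();
        for (Doc doc : index) {
            if (!doc.text.contains(term)) continue;    // base query
            if (!matchesAll(doc, selected)) continue;  // drill down to the selected facets
            result.hits.add(doc);
            for (Map.Entry<String, String> e : doc.facets.entrySet()) {
                result.counts
                    .computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                    .merge(e.getValue(), 1, Integer::sum);
            }
        }
        return result;
    }

    // True if the document carries every selected category/value pair.
    private static boolean matchesAll(Doc doc, Map<String, String> selected) {
        for (Map.Entry<String, String> e : selected.entrySet()) {
            if (!e.getValue().equals(doc.facets.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

With an empty selection the counts cover every category found in the hits, which corresponds to returning the top-level facets. In this sketch an empty term matches every document, so a call with an empty term and an empty selection returns counts for the whole index.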

Tests

HTTP Endpoint

This implementation is demonstrated in the sample UI using JSTree: https://github.com/jhu-digital-manuscripts/rosa2/wiki/JSTree-Example

Questions

  • How do we let a search UI discover initial or top level facets? Should this information be included in the 'info.json' data or perhaps have some special request that will return this information? In the sample code, a request is made with no search query and a facet request of facet_author, one of the facet dimensions, but no path. This will do a faceted search and narrow down on author but since there is no path to narrow down on, it will essentially return all facets. It more or less works, but is pretty ugly.
  • How do we want to index these facets? Do we want the facets only to be indexed down to the manifest level? This would work well for browsing, since each facet count would represent only the number of books in a category. I am not sure how this would work out in actual search.
  • Should we always return facets? In other words, should all searches be able to take advantage of facets? A UI can disallow the use of facets, but should facets always be an option in the API? This would change some internals in the Lucene search service.
  • Do we want "drill sideways" functionality, which keeps the counts for the other values in a category available after the user has selected one, so that alternative or additional values can still be offered? (Selecting multiple values within a category is already allowed by default, without drill sideways.)
  • Do we want to support more deeply nested facets, going further than one level? Currently the facets in this implementation are: category > value. Do we want to support something more complex, such as category > sub-category > value? For example, when looking for a book by author, one could select Author > American > Mark Twain instead of Author > Mark Twain.
    • Supporting this would require a different implementation of facets that uses a separate taxonomy index.
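For the drill sideways question, the sketch below shows in plain Java what sideways counts are: the counts for one category are computed while ignoring the selection made in that same category, so its other values do not disappear after a drill-down. It illustrates the semantics only and is not Lucene code; the types and methods are hypothetical, and each book has one value per category.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical document with one value per category.
final class Book {
    final Map<String, String> facets;  // category -> value

    Book(Map<String, String> facets) {
        this.facets = facets;
    }
}

final class DrillSidewaysSketch {
    // True if the book satisfies every selected category except 'skipDim'
    // (pass null to skip nothing, which gives plain drill-down matching).
    static boolean matches(Book book, Map<String, Set<String>> selected, String skipDim) {
        for (Map.Entry<String, Set<String>> e : selected.entrySet()) {
            if (e.getKey().equals(skipDim)) continue;
            String value = book.facets.get(e.getKey());
            if (value == null || !e.getValue().contains(value)) {
                return false;
            }
        }
        return true;
    }

    // Counts the values of 'dim' over books that match all selections in the other categories.
    static Map<String, Integer> sidewaysCounts(List<Book> books,
                                               Map<String, Set<String>> selected,
                                               String dim) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Book book : books) {
            String value = book.facets.get(dim);
            if (value != null && matches(book, selected, dim)) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

If the user has selected one author, sidewaysCounts for facet_author still reports every author with its count, while the counts for any other category reflect only the books by the selected author.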

Resources