A simple Clojure wrapper for Apache Lucene (version 8.9.0).
Key usage scenarios
- Search: the core use-case of Lucene.
- Suggest: prefix-queries for content in any field.
Both in-memory and on-disk indexes can be used, depending on the dataset size.
Note: UNSTABLE API. No releases yet.
Inspired by other example wrappers I’ve come across. Notably
[org.msync/lucene-clj "0.2.0-SNAPSHOT"]
Available via clojars.
The primary use-case is for in-process text search needs for read-only data-sets that can be managed on single-instance deployments. For multi-instance deployments, keeping modifications of data in sync is an effort.
Use this library when you need light-weight text-search support without the hassle of setting up something like Solr. You may update the index if you wish, but you have to take care of any race conditions yourself. Since the index is in-process, you also need to take care of updating all instances in a multi-instance deployment.
The objectives are loosely as follows.
- Stick to core Lucene. No script/language-specific dependencies are part of the core library, but users can add them as needed.
- Support for prefix based suggestions - a feature of Lucene I found quite undocumented, as well as lacking good examples for.
- Track the latest Lucene versions.
I am thankful to the above library authors for their liberal licensing. I’ve used their ideas/code in places.
There’s sample data in the repository that we use in our examples. A hand-created sample with fictional and non-fictional characters is here and one from Kaggle on music albums is here. These are also used in the tests.
A complete scenario from index creation to search actions is described below.
- Albums - Kaggle - [local]
- Hand-created, real + fictional characters here
When dealing with Lucene and data it processes, key terms to note are
- Document: A unit of related text. It has possibly many fields, and is both the unit of consumption and the unit of each search result. A Document is a collection of Fields.
- Field: Every field is a container of indexable content. Fields can range across many types, from simple text to latitude and longitude.
- Analyzer: Analyzes the input documents and preprocesses terms appropriately. Decisions on tokenizing, stemming, stop-word removal, or treating input as-is are controlled by the use of appropriate analyzers.
This is a pretty hand-wavy description, but useful enough for our purpose.
Lucene consumes documents, each of which is made up of fields having values. As is natural in Clojure, we represent all such things as maps.
{
:title-field "This is a title"
:abstract-field "This is an abstract of what is to follow"
:author-field "Lekhak Sampaadak"
:body-field "And here's the crux of the article with all the gory details"
}
To prepare our content for ingestion and indexing, we do some straightforward CSV parsing and conversion of each row into a map. Each column has a name, which is used as the key for the field in the document-map. All the preparation code is in the msync.lucene.tests-common test namespace, which we’ll refer to as the common namespace where required for clarity. We use two CSV data-sets as our sources of documents to create two indexes, to demonstrate some distinct use-cases. All data files are in the `test-resources` subdirectory.
We use two simple datasets, stored as CSV. Loading is straightforward CSV parsing and converting to maps – the first rows in each file are the header rows, holding names of respective columns.
- Sample, hand-coded documents. Plain, simple data.
;; In the common namespace
(take 5 (read-csv-resource-file sample-data-file))
first-name | last-name | age | real | gender | bio |
Suppandi | Varadarajan | 16 | false | m | A wonderful, innocent soul. You’ll enjoy his antics. |
Shikari | Shambhu | 32 | False | m | Carries a gun. But no bullets. Animals love him. |
Chacha | Chaudhary | 64 | FalSe | m | The supercomputer. And then some more! |
Sabu | Jupiterwala | 2 | false | m | Yes, of legal age. Just a different age-scale because of the planet he comes from. Strong, powerful, but kind. Because, not an earthling. Children love him. |
- Albums data. From Kaggle.
  - The columns Genre and Subgenre are comma-separated values themselves.
  - They are to be pre-processed before feeding to lucene-clj.
  - These are multi-valued fields.
;; In the common namespace
(take 5 (read-csv-resource-file albums-file))
Number | Year | Album | Artist | Genre | Subgenre |
1 | 1967 | Sgt. Pepper’s Lonely Hearts Club Band | The Beatles | Rock | Rock & Roll, Psychedelic Rock |
2 | 1966 | Pet Sounds | The Beach Boys | Rock | Pop Rock, Psychedelic Rock |
3 | 1966 | Revolver | The Beatles | Rock | Psychedelic Rock, Pop Rock |
4 | 1965 | Highway 61 Revisited | Bob Dylan | Rock | Folk Rock, Blues Rock |
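The preparation described above can be sketched in a few lines. This is a minimal illustration and not the code from the common namespace: the names `rows->maps`, `split-multi` and `prepare-album` are hypothetical, and the rows are typed in by hand from the sample shown above instead of being read from the CSV file.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: the first row is the header, every other row
;; becomes a map keyed by the column names.
(defn rows->maps [[header & rows]]
  (let [ks (map keyword header)]
    (map #(zipmap ks %) rows)))

;; Hypothetical helper: split a comma-separated cell into trimmed values.
(defn split-multi [s]
  (->> (str/split s #",")
       (map str/trim)
       (remove str/blank?)
       vec))

;; Genre and Subgenre are the multi-valued columns of the albums data.
(defn prepare-album [album]
  (-> album
      (update :Genre split-multi)
      (update :Subgenre split-multi)))

;; Two rows copied by hand from the albums sample above.
(def sample-rows
  [["Number" "Year" "Album" "Artist" "Genre" "Subgenre"]
   ["1" "1967" "Sgt. Pepper’s Lonely Hearts Club Band" "The Beatles"
    "Rock" "Rock & Roll, Psychedelic Rock"]
   ["3" "1966" "Revolver" "The Beatles"
    "Rock" "Psychedelic Rock, Pop Rock"]])

(def albums (map prepare-album (rows->maps sample-rows)))

(:Subgenre (first albums))
;; => ["Rock & Roll" "Psychedelic Rock"]
```

Splitting on a bare comma is only safe because these cells contain no quoted commas; a real CSV parser should do the row-level parsing.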
Analyzers process each field’s content in a manner that is apt - according to what the programmer/domain-expert decides.
Some fields need to be tokenized and stemmed, while others are to be treated verbatim: natural-language text on the one hand, versus proper nouns like a company name or a music genre on the other.
In the albums dataset, the Year, Genre and Subgenre fields’ texts are not to be tokenized and stemmed, or filtered for stop-words. Hence, they are configured to be analyzed with the keyword analyzer. Other fields can be treated like normal text. So, in this case, we use a composed analyzer that can treat each field in its special way.
Note that the same analyzers used while creating an index should also be used when querying it for search and suggest, to avoid surprises.
Here’s how we create analyzers.
;; In the common namespace
;; This is the default analyzer, an instance of the StandardAnalyzer
;; of Lucene
(defonce default-analyzer (analyzers/standard-analyzer))
;; This analyzer considers field values verbatim
;; Will not tokenize and stem
(defonce keyword-analyzer (analyzers/keyword-analyzer))
;; A per-field analyzer, which composes other kinds of analyzers
;; For album data, we have marked some fields as verbatim
;; Takes a default analyzer, and then a map of field to field-specific analyzer
(defonce album-data-analyzer
  (analyzers/per-field-analyzer default-analyzer
                                {:Year     keyword-analyzer
                                 :Genre    keyword-analyzer
                                 :Subgenre keyword-analyzer}))
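To see why the per-field split matters, here is a toy model in plain Clojure. It does not use Lucene at all: `toy-standard`, `toy-keyword` and `toy-per-field` are hypothetical stand-ins that only imitate the idea of tokenizing versus keeping a value verbatim, and the real analyzers handle many more cases.

```clojure
(require '[clojure.string :as str])

;; Toy stand-in for a tokenizing analyzer: lower-case the text and
;; split it on anything that is not a letter or a digit.
(defn toy-standard [text]
  (->> (str/split (str/lower-case text) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       vec))

;; Toy stand-in for a keyword analyzer: the whole value is one term.
(defn toy-keyword [text]
  [text])

;; Toy stand-in for a per-field analyzer: look up the field-specific
;; function, falling back to the default one.
(defn toy-per-field [default overrides]
  (fn [field text]
    ((get overrides field default) text)))

(def toy-album-analyzer
  (toy-per-field toy-standard
                 {:Year     toy-keyword
                  :Genre    toy-keyword
                  :Subgenre toy-keyword}))

(toy-album-analyzer :Album "The Sun Sessions")
;; => ["the" "sun" "sessions"]

(toy-album-analyzer :Subgenre "Psychedelic Rock")
;; => ["Psychedelic Rock"]
```

In this model a query for "rock" could find a term in an :Album value but never in a :Subgenre value, which is indexed as one verbatim term. That is the reason for using the same analyzer at index time and at query time.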
With the background setup done and explained, let us move ahead to demonstrating indexing and searching. You may want to try the following in a REPL by requiring the namespace the prior code is in and then playing along. I’ve used the dev namespace below, the code for which can be found here.
(ns dev
  (:require [msync.lucene :as lucene]
            [msync.lucene
             [document :as ld]
             [tests-common :as common]]))
In memory
(defonce album-index (lucene/create-index! :type :memory
                                           :analyzer common/album-data-analyzer))
Or, on disk
(defonce album-index (lucene/create-index! :type :disk
                                           :path "/path/to/index/directory"
                                           :analyzer common/album-data-analyzer))
A sample of the album data for reference.
The Genre and Subgenre columns are pre-processed, as mentioned above, and split further.
(drop 2 (take 5 common/album-data))
({:Number "3",
:Year "1966",
:Album "Revolver",
:Artist "The Beatles",
:Genre ("Rock"),
:Subgenre ("Psychedelic Rock" "Pop Rock")}
{:Number "4",
:Year "1965",
:Album "Highway 61 Revisited",
:Artist "Bob Dylan",
:Genre ("Rock"),
:Subgenre ("Folk Rock" "Blues Rock")}
{:Number "5",
:Year "1965",
:Album "Rubber Soul",
:Artist "The Beatles",
:Genre ("Rock" "Pop"),
:Subgenre ("Pop Rock")})
Documents are Clojure maps. Each key-value in the map represents one org.apache.lucene.document.Field. The options passed to the `index!` function control behavior in various ways.
- :stored-fields: Lucene can index for efficient searching, but to save space, it need not store all the field values. If you want Lucene to also store the contents, pass the field names as a collection to this argument. The alternative is to have Lucene index large fields without storing them.
- :suggest-fields: Fields that are treated specially during indexing, allowing Lucene to create internal structures for quick prefix matching.
- :context-fn: Lucene allows a list of contexts to be associated with the suggest fields, which lets us filter on them while querying for suggestions.
In the following, we instruct the `index!` function to
- Store the mentioned fields
- Use the :Album and :Artist fields to index for suggestions - this uses some special processing and storage in the index.
- Use the :Genre field as context. Note that :Genre can itself hold multiple values for each document, and that works fine.
(lucene/index! album-index common/album-data
               {:stored-fields  [:Number :Year :Album :Artist :Genre :Subgenre]
                :suggest-fields [:Album :Artist]
                :context-fn     :Genre})
A simple search example, in which we pass a map specifying the field, and the value we are looking for. The result includes the :hit, a :score for that :hit, and the :doc-id which is an identifier that Lucene manages. Notice that the result - :hit - is a Lucene Document object.
(lucene/search album-index {:Year "1977"}
{:results-per-page 2})
[{:doc-id 25,
:score 1.4994705,
:hit
#object[org.apache.lucene.document.Document 0x24750f97 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:26> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Rumours> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Fleetwood Mac> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Pop Rock>>"]}
{:doc-id 40,
:score 1.4994705,
:hit
#object[org.apache.lucene.document.Document 0x6d6a6fe4 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:41> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Never Mind the Bollocks Here's the Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Punk>>"]}]
For convenience, lucene-clj has a function that converts the Lucene Document into a Clojure map. For anything beyond basic use-cases, supply your own.
(lucene/search album-index {:Year "1977"}
{:results-per-page 2
:hit->doc ld/document->map})
[{:doc-id 25,
:score 1.4994705,
:hit
{:Number "26",
:Year "1977",
:Album "Rumours",
:Artist "Fleetwood Mac",
:Genre "Rock",
:Subgenre "Pop Rock"}}
{:doc-id 40,
:score 1.4994705,
:hit
{:Number "41",
:Year "1977",
:Album "Never Mind the Bollocks Here's the Sex Pistols",
:Artist "Sex Pistols",
:Genre "Rock",
:Subgenre "Punk"}}]
Notice, though, that the :Genre and :Subgenre fields did not come back as collections. The document->map function isn’t smart enough to identify that, and needs a hint to make it happen. With the modified hit->doc argument, the two fields come back as vectors with possibly multiple values.
(lucene/search album-index
{:Year "1977"}
{:results-per-page 2
:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])})
[{:doc-id 25,
:score 1.4994705,
:hit
{:Number "26",
:Year "1977",
:Album "Rumours",
:Artist "Fleetwood Mac",
:Genre ["Rock"],
:Subgenre ["Pop Rock"]}}
{:doc-id 40,
:score 1.4994705,
:hit
{:Number "41",
:Year "1977",
:Album "Never Mind the Bollocks Here's the Sex Pistols",
:Artist "Sex Pistols",
:Genre ["Rock"],
:Subgenre ["Punk"]}}]
Paginated query results are supported via the :page option. Also, the following example projects a subset of the document fields by passing a modified function as the :hit->doc argument.
(lucene/search album-index
               {:Year "1968"}          ;; Map of field-values to search with
               {:results-per-page 5    ;; Control the number of results returned
                :page 2                ;; Page number; numbering starts at 0, which is also the default
                :hit->doc #(-> %
                               ld/document->map
                               (select-keys [:Year :Album]))})
[{:doc-id 160,
:score 1.4311604,
:hit {:Year "1968", :Album "The Dock of the Bay"}}
{:doc-id 170,
:score 1.4311604,
:hit {:Year "1968", :Album "The Notorious Byrd Brothers"}}
{:doc-id 204,
:score 1.4311604,
:hit {:Year "1968", :Album "Wheels of Fire"}}
{:doc-id 233,
:score 1.4311604,
:hit {:Year "1968", :Album "Bookends"}}
{:doc-id 257,
:score 1.4311604,
:hit
{:Year "1968",
:Album "The Kinks Are The Village Green Preservation Society"}}]
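Since :hit->doc accepts any function, the post-processing step can be written and tested on plain maps before it is plugged into a search. The sketch below assumes the input looks like the maps that document->map returns in the examples above; `hit->summary` is a hypothetical name, not part of lucene-clj.

```clojure
;; Hypothetical post-processing step for a hit that has already been
;; converted to a map: keep a few fields and parse the year as a number.
(defn hit->summary [doc-map]
  (-> doc-map
      (select-keys [:Year :Album :Artist])
      (update :Year (fn [^String y] (Long/parseLong y)))))

(hit->summary {:Number   "26"
               :Year     "1977"
               :Album    "Rumours"
               :Artist   "Fleetwood Mac"
               :Genre    "Rock"
               :Subgenre "Pop Rock"})
;; => {:Year 1977, :Album "Rumours", :Artist "Fleetwood Mac"}

;; Plugged into a search, following the pattern used above:
;;   :hit->doc #(-> % ld/document->map hit->summary)
```

This sketch expects :Year to be present and numeric; a field that was not stored would need a guard before parsing.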
Searching in a single field, for a single value
(lucene/search album-index {:Year "1967"} {:results-per-page 2 :hit->doc ld/document->map})
Searching in a single field, where any of the values in the set are allowed
(lucene/search album-index {:Year #{"1960" "1965"}}
{:results-per-page 5
:hit->doc #(-> % ld/document->map (select-keys [:Year :Album]))})
[{:doc-id 118,
:score 2.2562923,
:hit {:Year "1960", :Album "At Last!"}}
{:doc-id 347,
:score 2.2562923,
:hit {:Year "1960", :Album "Muddy Waters at Newport 1960"}}
{:doc-id 357,
:score 2.2562923,
:hit {:Year "1960", :Album "Sketches of Spain"}}
{:doc-id 3,
:score 1.6102078,
:hit {:Year "1965", :Album "Highway 61 Revisited"}}
{:doc-id 4,
:score 1.6102078,
:hit {:Year "1965", :Album "Rubber Soul"}}]
When looking for multiple terms in a single field, pass a vector. All the terms must match.
(lucene/search album-index {:Album ["complete" "unbelievable"]} {:hit->doc ld/document->map})
[{:doc-id 253,
:score 3.0571077,
:hit
{:Number "254",
:Year "1966",
:Album
"Complete & Unbelievable: The Otis Redding Dictionary of Soul",
:Artist "Otis Redding",
:Genre "Funk / Soul",
:Subgenre "Soul"}}]
Be sure that your queries are semantically right for the data-set. For example, AND-ing over two different years will lead to an empty result-set, obviously.
(lucene/search album-index {:Year ["1964" "1965"]})
[]
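The way the query map reads in these examples (a set means any of the values, a vector means all of them, and several fields together must all match) can be modelled over plain Clojure collections. This is a toy model for reasoning about queries, not how lucene-clj or Lucene evaluates them: `matches?` and `doc-matches?` are hypothetical names, each document is reduced to a set of already-analyzed terms per field, and phrase queries are left out.

```clojure
;; terms: the set of terms a document has for one field.
;; q:     a single value, a set (any of) or a vector (all of).
(defn matches? [terms q]
  (cond
    (set? q)        (boolean (some terms q))
    (sequential? q) (every? #(contains? terms %) q)
    :else           (contains? terms q)))

;; Every field mentioned in the query map has to match.
(defn doc-matches? [doc-terms query]
  (every? (fn [[field q]]
            (matches? (get doc-terms field #{}) q))
          query))

;; Two documents, reduced by hand to per-field term sets.
(def highway {:Year  #{"1965"}
              :Album #{"highway" "61" "revisited"}})
(def otis    {:Year  #{"1966"}
              :Album #{"complete" "unbelievable" "the" "otis"
                       "redding" "dictionary" "of" "soul"}})

(doc-matches? highway {:Year #{"1960" "1965"}})            ;; => true
(doc-matches? highway {:Year ["1964" "1965"]})             ;; => false
(doc-matches? otis {:Album ["complete" "unbelievable"]})   ;; => true
```

The second call shows why the query above returns nothing: a verbatim :Year field holds a single term per document, so no document can contain both years.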
Spaces in the query string are inferred to mean a phrase search operation.
(lucene/search album-index {:Album "the sun"} {:hit->doc ld/document->map})
[{:doc-id 10,
:score 2.8861985,
:hit
{:Number "11",
:Year "1976",
:Album "The Sun Sessions",
:Artist "Elvis Presley",
:Genre "Rock",
:Subgenre "Rock & Roll"}}
{:doc-id 287,
:score 2.544825,
:hit
{:Number "288",
:Year "1968",
:Album "Anthem of the Sun",
:Artist "Grateful Dead",
:Genre "Rock",
:Subgenre "Psychedelic Rock"}}
{:doc-id 310,
:score 2.544825,
:hit
{:Number "311",
:Year "1994",
:Album "The Sun Records Collection",
:Artist "Various",
:Genre "& Country",
:Subgenre "Rockabilly"}}]
Searching over multiple fields in the same query map is an AND operation.
(lucene/search album-index {:Album "the sun" :Year "1976"} {:hit->doc ld/document->map})
[{:doc-id 10,
:score 4.56387,
:hit
{:Number "11",
:Year "1976",
:Album "The Sun Sessions",
:Artist "Elvis Presley",
:Genre "Rock",
:Subgenre "Rock & Roll"}}]
Notice that in the suggest function call, the field and suggestion-prefix are not passed as a map. Unlike search, suggest calls are only supported over a single field.
From above, the fields Album and Artist have been marked to be indexed in a way so that we can ask for prefix-based suggestions.
(lucene/suggest album-index :Album "par"
{:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
:contexts ["Electronic"]})
[{:hit
{:Number "140",
:Year "1978",
:Album "Parallel Lines",
:Artist "Blondie",
:Genre ["Electronic" "Rock"],
:Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
:score 1.0,
:doc-id 139}]
We can ask for fuzzy matching when querying for suggestions.
(lucene/suggest album-index :Album "per"
{:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
:fuzzy? true
:contexts ["Electronic"]})
[{:hit
{:Number "140",
:Year "1978",
:Album "Parallel Lines",
:Artist "Blondie",
:Genre ["Electronic" "Rock"],
:Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
:score 2.0,
:doc-id 139}
{:hit
{:Number "76",
:Year "1984",
:Album "Purple Rain",
:Artist "Prince and the Revolution",
:Genre ["Electronic" "Rock" "Funk / Soul" "Stage & Screen"],
:Subgenre ["Pop Rock" "Funk" "Soundtrack" "Synth-pop"]},
:score 2.0,
:doc-id 75}]
Fuzzy matching is also available for search. Notice how forever matches fever too.
(lucene/search album-index {:Album "forever"}
{:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
:fuzzy? true})
[{:doc-id 39,
:score 3.0850303,
:hit
{:Number "40",
:Year "1967",
:Album "Forever Changes",
:Artist "Love",
:Genre ["Rock"],
:Subgenre ["Folk Rock" "Psychedelic Rock"]}}
{:doc-id 131,
:score 0.9592955,
:hit
{:Number "132",
:Year "1977",
:Album "Saturday Night Fever: The Original Movie Sound Track",
:Artist "Various Artists",
:Genre ["Electronic" "Stage & Screen"],
:Subgenre ["Soundtrack" "Disco"]}}]
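Fuzzy matching is usually explained through edit distance: the number of single-character insertions, deletions and substitutions needed to turn one term into another. The sketch below is a plain Levenshtein distance in Clojure, written only to illustrate the idea. `edit-distance` is a hypothetical helper, and it is an assumption that the wrapper relies on Lucene's FuzzyQuery, which accepts terms within at most 2 edits.

```clojure
;; Row-by-row dynamic programming: `prev` holds the distances between
;; the prefix of `a` seen so far and every prefix of `b`.
(defn edit-distance [a b]
  (last
   (reduce
    (fn [prev [i ca]]
      (reduce
       (fn [row [j cb]]
         (conj row (min (inc (peek row))           ; insert cb
                        (inc (nth prev (inc j)))   ; delete ca
                        (+ (nth prev j)            ; substitute, or keep if equal
                           (if (= ca cb) 0 1)))))
       [(inc i)]
       (map-indexed vector b)))
    (vec (range (inc (count b))))
    (map-indexed vector a))))

(edit-distance "forever" "fever") ;; => 2
(edit-distance "per" "par")       ;; => 1
(edit-distance "per" "pur")       ;; => 1
```

Two deletions turn forever into fever, which is consistent with the lower-scored second hit above. Similarly, per is one substitution away from both par and pur, the prefixes of the two suggestions returned for the fuzzy suggest call.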
- Some minimal technical overview of Lucene internals for this project can be found here.
Copyright © 2018-2020 Ravindra R. Jaju
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.