Atlas/Lucene Search Analysis

Introduction

Atlas Search uses Lucene analyzers to control how text is broken into indexed search terms, e.g., where to split word groupings and whether to consider punctuation. However, without detailed knowledge of the various Lucene analyzers, it can be difficult to select the appropriate one for a given field when creating a search index. Inspired by the Analysis Screen in Apache Solr, this utility provides two simple ways, a CLI and a UI, to test how various analyzers process a given text.

Note: see this Atlas feature request

Build

The UI for this tool is implemented as a Vaadin web application. The CLI is implemented as a simple POJO. To build these tools, execute the command `mvn clean package` from the directory containing `pom.xml`.

Run

Use the command `mvnw -Dspring-boot.run.jvmArguments="-Dspring.devtools.restart.enabled=false"` from the directory containing `pom.xml` to launch the Web UI on port 8080.

Use the command `mvn -f pom-cli.xml exec:java -Dexec.args="<options>"` with the appropriate options to run the CLI. Use the `-h` or `--help` option to see usage instructions.

usage: mvn exec:java -Dexec.args="<options>"
 -h, --help               Prints this message
 -a, --analyzer <arg>     Lucene analyzer to use (defaults to 'Standard').
                         Use 'list' for supported analyzer names.
 -d, --definition <arg>   Index definition file containing custom analyzer
 -t, --text <arg>         Input text to analyze
 -f, --file <arg>         Input text file to analyze
 -l, --language <arg>     Language code (used with '--analyzer=Language'
                         only.  Use 'list' for supported language codes.
 -n, --name <arg>         Custom analyzer name
 -o, --operator <arg>     Query operator to use (defaults to 'text'). Use
                         'list' for supported operator names.
 -k, --tokenizer <arg>    Tokeniser to use with autocomplete operator
                         (defaults to 'edgeGram'). Use 'list' for
                         supported tokenizer names.
 -m, --minGrams <arg>     Minimum number of characters per indexed sequence
                         to use with autocomplete operator (defaults to
                         '2').
 -x, --maxGrams <arg>     Maximum number of characters per indexed sequence
                         to use with autocomplete operator (defaults to
                         '3').

You can also run the CLI directly with `java -cp lib/ -jar <path-to>/atlas-search-analysis-0.0.1.jar <options>` (requires Java 11 or later).

Examples

Analyze text using the `lucene.simple` analyzer

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a simple -t 'hello my-name.is Roy/Kiesler'"
Using org.apache.lucene.analysis.core.SimpleAnalyzer
[hello] [my] [name] [is] [roy] [kiesler]
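
The output above follows from how the simple analyzer works: it splits text at every non-letter character and lowercases each token. The plain-Java sketch below imitates that behavior without depending on Lucene. The class and method names (`SimpleAnalyzerSketch`, `tokenize`) are hypothetical and exist only for illustration; this is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: imitates "split on non-letters, then lowercase".
// It does not use Lucene and is not how the tool itself is implemented.
class SimpleAnalyzerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token, lowercased.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, my, name, is, roy, kiesler]
        System.out.println(tokenize("hello my-name.is Roy/Kiesler"));
    }
}
```

The hyphen, period, and slash all act as token boundaries here, which is why `my-name.is` becomes three separate terms.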

Analyze text using the `lucene.standard` analyzer

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer standard --text 'hello my-name.is Roy/Kiesler'"
Using org.apache.lucene.analysis.standard.StandardAnalyzer
[hello] [my] [name.is] [roy] [kiesler]

Analyze text using the `lucene.whitespace` analyzer

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer whitespace -t 'hello my-name.is Roy/Kiesler'"
Using org.apache.lucene.analysis.core.WhitespaceAnalyzer
[hello] [my-name.is] [Roy/Kiesler]
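
By contrast, the whitespace analyzer only splits on whitespace and leaves case and punctuation untouched. A minimal plain-Java sketch of that rule follows; the names `WhitespaceAnalyzerSketch` and `tokenize` are hypothetical, and the code does not use Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: imitates "split on whitespace, change nothing else".
class WhitespaceAnalyzerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            // split() can yield an empty first element when the text starts with whitespace.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, my-name.is, Roy/Kiesler]
        System.out.println(tokenize("hello my-name.is Roy/Kiesler"));
    }
}
```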

Analyze text using the `lucene.language` English analyzer

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer language --language en --text 'running a race'"
Using org.apache.lucene.analysis.en.EnglishAnalyzer
[run] [race]

Analyze a text file using the `lucene.language` French analyzer

cat <<EOF > french.txt
bonjour je m'appelle Roy Kiesler
EOF

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a language -l fr -f french.txt"
Using org.apache.lucene.analysis.fr.FrenchAnalyzer
[bonjou] [apel] [roy] [kiesl]

Analyze text using a custom analyzer

Sample 1

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a custom -t 'ROCKY II is better than Rocky V' -d index_roman.json -n romanAnalyzer"
Using org.apache.lucene.analysis.custom.CustomAnalyzer
[rocki] [2] [better] [rocki] [5]

Sample 2

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a custom -t '<div><p>This is an <a href="foo.com">HTML</a> test</p></div>' -d index_html.json -n htmlStrippingAnalyzer"
Using org.apache.lucene.analysis.custom.CustomAnalyzer
[p] [This] [is] [an] [a] [href] [foo.com] [HTML] [a] [test] [p]

Analyze text using autocomplete

Sample 1

mvn -f pom-cli.xml exec:java -Dexec.args="-t 'Ribeira Charming Duplex' -o autocomplete"
Using org.apache.lucene.analysis.custom.CustomAnalyzer
Autocomplete - nGram, minGram(2), maxGram(3)
[Ri] [Rib] [ib] [ibe] [be] [bei] [ei] [eir] [ir] [ira] [ra] [ra ] [a ] [a C] [ C] [ Ch] [Ch] [Cha] [ha] [har] [ar] [arm] [rm] [rmi] [mi] [min] [in] [ing] [ng] [ng ] [g ] [g D] [ D] [ Du] [Du] [Dup] [up] [upl] [pl] [ple] [le] [lex] [ex]
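
The token list above can be reproduced by sliding a window across the whole input and emitting every substring whose length lies between the minimum and maximum gram sizes. The sketch below is a plain-Java approximation of that pattern, assuming the input is treated as one string with spaces included (as the output suggests). `NGramSketch` and `nGrams` are hypothetical names; the tool itself relies on Lucene rather than this code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: emits all substrings of length minGram..maxGram,
// ordered by start position and then by length. Not the tool's implementation.
class NGramSketch {

    static List<String> nGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [Ri, Rib, ib, ibe, be, bei, ei, eir, ir, ira, ra]
        System.out.println(nGrams("Ribeira", 2, 3));
    }
}
```

For the full 23-character input `Ribeira Charming Duplex` with gram sizes 2 and 3, this yields 22 two-character and 21 three-character tokens, 43 in total, matching the listing above.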

Sample 2

mvn -f pom-cli.xml exec:java -Dexec.args="-t 'Ribeira Charming Duplex' -o autocomplete -k edgeGram -m 2 -x 15"
Using org.apache.lucene.analysis.custom.CustomAnalyzer
Autocomplete - edgeNGram, minGram(2), maxGram(15)
[Ri] [Rib] [Ribe] [Ribei] [Ribeir] [Ribeira] [Ribeira ] [Ribeira C] [Ribeira Ch] [Ribeira Cha] [Ribeira Char] [Ribeira Charm] [Ribeira Charmi] [Ribeira Charmin]
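
Edge grams differ from the sliding-window n-grams in that every token is anchored at the start of the input, so only prefixes are produced. The plain-Java sketch below approximates the output above under the same assumption that the input is handled as a single string. `EdgeGramSketch` and `edgeGrams` are hypothetical names, and the code does not use Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: emits the prefixes of the input whose length
// lies between minGram and maxGram. Not the tool's implementation.
class EdgeGramSketch {

    static List<String> edgeGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // Never read past the end of the input when maxGram exceeds its length.
        int longest = Math.min(maxGram, text.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [Ri, Rib, Ribe, Ribei, Ribeir, Ribeira]
        System.out.println(edgeGrams("Ribeira", 2, 15));
    }
}
```

With a minimum of 2 and a maximum of 15, the full input produces 14 prefixes, the longest being `Ribeira Charmin`.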

mongodb-developer/lucene-search-analysis
