Collection of Apache Lucene tokenizers, filters and analyzers

This library is a collection of my Lucene components that I use in projects.

IdentifierNGramFilter tokenizes the input into n-grams delimited by punctuation. The n-grams are units of varying length, which differs from Lucene's NGramTokenFilter, where n-grams are fixed-length tokens. Punctuation is defined in IdentifierTokenizer's JFlex grammar (PunctationTokenizerImpl.jflex) and can be included in or excluded from the n-grams. You can also define a minimum and maximum n-gram length.
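A minimal sketch of wiring the filter into a custom indexing Analyzer, targeting the Lucene 5+ Analyzer API. The use of IdentifierTokenizer as the source and the constructor arguments (minimum length, maximum length, include punctuation) are assumptions derived from the description above, not the library's documented signatures.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Sketch only: the IdentifierNGramFilter constructor arguments
// (minGram, maxGram, includePunctuation) are assumed from the
// description above, not taken from the library's documented API.
public class IdentifierNGramAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new IdentifierTokenizer();          // splits on punctuation (assumed)
        TokenStream sink = new IdentifierNGramFilter(source,   // builds variable-length n-grams
                                                     1,        // assumed minimum n-gram length
                                                     20,       // assumed maximum n-gram length
                                                     false);   // assumed: exclude punctuation tokens
        return new TokenStreamComponents(source, sink);
    }
}
```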

This filter is mostly used at index time.

You can use it in highlighting because it adjusts offsets and sorts the n-grams first by their offset in the original token, then by increasing length (so "192.168.1" yields "192", "192.168", "192.168.1", "168", "168.1", "1").
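To see that ordering for yourself, you can run a string through the analyzer and print each term with its offsets using the standard Lucene TokenStream API. The analyzer here is the hypothetical one sketched above, and the expected output in the comment assumes offsets point back into the original token.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Prints each emitted n-gram with its start/end offsets. For "192.168.1" the
// expected output (assuming offsets track the original token) would be:
// 192 [0,3], 192.168 [0,7], 192.168.1 [0,9], 168 [4,7], 168.1 [4,9], 1 [8,9]
public static void dumpTokens(Analyzer analyzer, String text) throws Exception {
    try (TokenStream ts = analyzer.tokenStream("id", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.printf("%s [%d,%d]%n", term, offset.startOffset(), offset.endOffset());
        }
        ts.end();
    }
}
```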

For more examples, see IdentifierNGramFilterTest.

Use this filter at query time on fields that were indexed with IdentifierNGramFilter.
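One possible setup, assuming the usual Lucene pattern of passing one analyzer to IndexWriterConfig and a different one to the query parser for the same field. The query-time analyzer name is a placeholder, since this README does not spell out the query-time class.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NGramFieldExample {

    public static void main(String[] args) throws Exception {
        // Index-time analyzer applies IdentifierNGramFilter (sketched earlier);
        // the query-time analyzer is a hypothetical stand-in for whichever
        // query-time chain this section refers to.
        Analyzer indexAnalyzer = new IdentifierNGramAnalyzer();
        Analyzer queryAnalyzer = new IdentifierQueryAnalyzer();   // hypothetical

        Directory directory = FSDirectory.open(Paths.get("ngram-index"));
        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(indexAnalyzer))) {
            // ... add documents with an "id" field analyzed by indexAnalyzer ...
        }

        // The same field is parsed at query time with the query-time analyzer.
        QueryParser parser = new QueryParser("id", queryAnalyzer);
        Query query = parser.parse("192.168");
        System.out.println(query);
    }
}
```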
