
# lucene-stanford-lemmatizer

This is a library that adds NLP capabilities to Lucene-based search engines: lemmatization and filtering on part-of-speech (POS) tags. It uses the Stanford POS Tagger for the underlying NLP support.

Lemmatization is similar to stemming, but smarter: it takes the context of a word into account to determine the correct lemma. For example, "saw" is lemmatized to "see" when used as a verb, but kept as "saw" when used as a noun. POS filtering is a smarter replacement for stop lists: it allows filtering out all pronouns, adverbs, etc.

## Getting started

Download this package along with its dependencies: Lucene and the Stanford POS Tagger.

Set your CLASSPATH to include these, then issue `ant jar`.

In your search code, construct an EnglishLemmaAnalyzer instead of a StandardAnalyzer (or whatever you normally use). Pass the filename of a Stanford POS Tagger model file to the constructor; model files can be found in the models/ directory of the Stanford POS Tagger distribution.
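For example, indexing with Lucene might look roughly like the sketch below. It assumes a Lucene 3.x-style IndexWriter setup and an example model filename; adjust both for your Lucene version and tagger release, and import EnglishLemmaAnalyzer from wherever your build of this library places it.

```java
import java.io.File;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LemmaIndexingExample {
    public static void main(String[] args) throws Exception {
        // Example model filename; use whichever model ships with your
        // copy of the Stanford POS Tagger (see its models/ directory).
        String modelFile = "models/english-left3words-distsim.tagger";

        // Drop-in replacement for StandardAnalyzer; the import for
        // EnglishLemmaAnalyzer depends on how the jar is packaged.
        EnglishLemmaAnalyzer analyzer = new EnglishLemmaAnalyzer(modelFile);

        Directory dir = FSDirectory.open(new File("index"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        IndexWriter writer = new IndexWriter(dir, config);

        // ... add documents as usual ...

        writer.close();
    }
}
```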

## Going further

It is possible to determine which parts-of-speech should be indexed by subclassing the tokenizer. See the API docs for details.
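The library's own extension point for this is the tokenizer subclass described in the API docs. Purely as an illustration of the idea, the sketch below uses plain Lucene APIs instead: a TokenFilter that drops tokens by Penn Treebank tag, assuming the POS tag is exposed through the token's type attribute. This library may expose tags differently, so treat the class and attribute choices here as hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Illustrative POS filter: discards tokens whose type attribute matches
 * one of the given Penn Treebank tags. Assumes the upstream tokenizer
 * records the POS tag as the token type.
 */
public final class POSRejectFilter extends TokenFilter {
    private final Set<String> rejectTags;
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    public POSRejectFilter(TokenStream input, String... tags) {
        super(input);
        this.rejectTags = new HashSet<String>(Arrays.asList(tags));
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (!rejectTags.contains(typeAtt.type())) {
                return true;   // keep this token
            }
        }
        return false;          // stream exhausted
    }
}
```

Such a filter would be wrapped around the token stream, e.g. `new POSRejectFilter(stream, "PRP", "PRP$", "RB")` to drop pronouns and adverbs.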

## Bugs

The implementation is limited to English, because the Stanford lemmatizer only handles that language. The POS tagger itself also supports Chinese and German, so it should be possible to add support for those languages.
