Ruby bindings to the OpenNLP Java toolkit.
Ruby Java
Latest commit 758c0f3 Oct 1, 2014 @louismullie Merge pull request #9 from pgwillia/probabilities
Expose probabilities when finding name entities
Permalink
Failed to load latest commit information.
bin Remove JARs from gem and add them to downloadable package. Dec 26, 2012
lib Expose probabilities when finding name entities Sep 30, 2014
spec Expose probabilities when finding name entities Sep 30, 2014
.gitignore Ignore JARs Dec 18, 2013
.travis.yml Force re-build of Rjb after selecting new Java version. Dec 26, 2012
Gemfile First working commit. Dec 19, 2012
LICENSE First working commit. Dec 19, 2012
README.md Expose probabilities when finding name entities Sep 30, 2014
Rakefile First working commit. Dec 19, 2012
open-nlp.gemspec

README.md

Build Status

###About

This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP). This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.

###Installing

First, install the gem: gem install open-nlp. Then, download the JARs and English language models in one package (80 MB).

Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).

Alternatively, from a terminal window, cd to the gem's folder and run:

wget http://www.louismullie.com/treat/open-nlp-english.zip
unzip -o open-nlp-english.zip -d bin/

Afterwards, you may individually download the appropriate models for other languages from the open-nlp website.

###Configuring

After installing and requiring the gem (require 'open-nlp'), you may want to set some of the following configuration options.

# Set an alternative path to look for the JAR files.
# Default is gem's bin folder.
OpenNLP.jar_path = '/path_to_jars/'

# Set an alternative path to look for the model files.
# Default is gem's bin folder.
OpenNLP.model_path = '/path_to_models/'

# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'].
OpenNLP.jvm_args = ['-option1', '-option2']

# Redirect VM output to log.txt
OpenNLP.log_file = 'log.txt'

# Set default models for a language.
OpenNLP.use :language

###Examples

Simple tokenizer

OpenNLP.load

sent = "The death of the poet was kept from his poems."
tokenizer = OpenNLP::SimpleTokenizer.new

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

Maximum entropy tokenizer, chunker and POS tagger

OpenNLP.load

chunker   = OpenNLP::ChunkerME.new
tokenizer = OpenNLP::TokenizerME.new
tagger    = OpenNLP::POSTaggerME.new

sent   = "The death of the poet was kept from his poems."

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

tags   = tagger.tag(tokens).to_a
# => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]

chunks = chunker.chunk(tokens, tags).to_a
# => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]

Abstract Bottom-Up Parser

OpenNLP.load

sent      = "The death of the poet was kept from his poems."
parser = OpenNLP::Parser.new
parse = parser.parse(sent)

parse.get_text.should eql sent

parse.get_span.get_start.should eql 0
parse.get_span.get_end.should eql 46
parse.get_child_count.should eql 1

child = parse.get_children[0]

child.text # => "The death of the poet was kept from his poems."
child.get_child_count # => 3
child.get_head_index #=> 5
child.get_type # => "S"

Maximum Entropy Name Finder*

OpenNLP.load

text = File.read('./spec/sample.txt').gsub!("\n", "")

tokenizer   = OpenNLP::TokenizerME.new
segmenter   = OpenNLP::SentenceDetectorME.new
ner_models  = ['person', 'time', 'money']

ner_finders = ner_models.map do |model|
  OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
end

sentences = segmenter.sent_detect(text)
named_entities = []

sentences.each do |sentence|

  tokens = tokenizer.tokenize(sentence)
  
  ner_models.each_with_index do |model,i|
    finder = ner_finders[i]
    name_spans = finder.find(tokens)
    name_probs = finder.probs()
    name_spans.each_with_index do |name_span,j|
      start = name_span.get_start
      stop  = name_span.get_end-1
      slice = tokens[start..stop].to_a
      prob  = name_probs[j]
      named_entities << [slice, model, prob]
    end
  end

end

Loading specific models

Just pass the name of the model file to the constructor. The gem will search for the file in the OpenNLP.model_path folder.

OpenNLP.load

tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
# etc.

Loading specific classes

You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:

# Default base class is opennlp.tools.
OpenNLP.load_class('SomeClassName')  
# => OpenNLP::SomeClassName

# Here, we specify another base class.
OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
# => OpenNLP::SomeOtherClass

Contributing

Fork the project and send me a pull request! Config updates for other languages are welcome.