Accurate Bayesian sentence tokenizer in Ruby.

TactfulTokenizer

TactfulTokenizer is a Ruby library for high-quality sentence tokenization. It uses a Naive Bayes statistical model and is based on Splitta, but adds support for '?' and '!' as sentence terminators as well as primitive handling of XHTML markup. Better support for XHTML parsing is coming shortly.

Usage

require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]

The input text is expected to consist of paragraphs delimited by line breaks.
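
If the text comes from a source that hard-wraps lines and separates paragraphs with blank lines, it may help to reshape it first so that each paragraph sits on its own line. The following is a minimal sketch of such a preprocessing step; normalize_paragraphs is a hypothetical helper and not part of TactfulTokenizer, and it assumes that collapsing runs of whitespace inside a paragraph is acceptable for your data.

```ruby
# Hypothetical helper, not part of TactfulTokenizer.
# Converts hard-wrapped text with blank-line paragraph breaks into text
# with exactly one paragraph per line.
def normalize_paragraphs(text)
  text
    .gsub(/\r\n?/, "\n")                        # normalize CRLF and CR line endings to LF
    .split(/\n\s*\n/)                           # blank line(s) mark a paragraph break
    .map { |paragraph| paragraph.split.join(" ") } # unwrap lines, collapse whitespace
    .reject(&:empty?)                           # drop empty paragraphs
    .join("\n")                                 # one paragraph per line
end

# Possible usage with the gem (requires tactful_tokenizer to be installed):
#   require "tactful_tokenizer"
#   m = TactfulTokenizer::Model.new
#   m.tokenize_text(normalize_paragraphs(raw_text))
```

Text that already has one paragraph per line and no blank lines would be merged into a single paragraph by this helper, so it should only be applied to input that uses blank lines as paragraph separators.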

Installation

gem install tactful_tokenizer

Author

Copyright © 2010 Matthew Bunday. All rights reserved. Released under the GNU GPL v3.