ankusa

Ankusa is a text classifier in Ruby that uses Hadoop’s HBase for storage. Because it uses HBase as a backend, the training corpus can be many terabytes in size.

Ankusa currently uses a Naive Bayes classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it applies Laplacian smoothing during classification.
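
To make the approach concrete, here is a minimal in-memory sketch of a Naive Bayes classifier with stop-word filtering and add-one smoothing. It is not Ankusa's implementation: the names TinyBayes, tokenize and STOP_WORDS are hypothetical, the stop-word list is a tiny illustrative one, stemming is omitted, and add-one is only assumed to be the smoothing variant.

```ruby
# Hypothetical sketch, not Ankusa's code: everything lives in memory
# and the stop-word list is a toy one chosen for this example.
STOP_WORDS = %w[a an the is this some not].freeze

# Lowercase the text, keep alphabetic tokens, drop stop words.
# (A real pipeline would also stem each remaining word.)
def tokenize(text)
  text.downcase.scan(/[a-z]+/).reject { |w| STOP_WORDS.include?(w) }
end

class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => word => count
    @doc_counts = Hash.new(0)                              # class => documents seen
    @vocab = {}                                            # every word seen in training
  end

  def train(klass, text)
    @doc_counts[klass] += 1
    tokenize(text).each do |w|
      @word_counts[klass][w] += 1
      @vocab[w] = true
    end
  end

  # For each class: log P(class) + sum of log P(word | class),
  # where P(word | class) uses add-one smoothing so that unseen
  # words never produce a zero probability.
  def log_likelihoods(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.each_with_object({}) do |klass, scores|
      class_total = @word_counts[klass].values.sum
      score = Math.log(@doc_counts[klass] / total_docs)
      words.each do |w|
        score += Math.log((@word_counts[klass][w] + 1.0) / (class_total + @vocab.size))
      end
      scores[klass] = score
    end
  end

  # The most likely class is the one with the highest log likelihood.
  def classify(text)
    log_likelihoods(text).max_by { |_, score| score }.first
  end
end

c = TinyBayes.new
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"
puts c.classify("This is some spammy text") # prints "spam"
```

Working in log space keeps the scores usable even when many small probabilities are combined, which is the same reason the usage section points to log likelihoods for large corpora.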

Installation

First, install HBase / Hadoop. Make sure the HBase Thrift interface has been started as well. Then:

gem install ankusa

Basic Usage

require 'rubygems'
require 'ankusa'

# connect to HBase
storage = Ankusa::HBaseStorage.new 'localhost'
c = Ankusa::Classifier.new storage

# Each of these calls will return a bag-of-words
# hash with stemmed words as keys and counts as values
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"

# This will return the most likely class (as a symbol)
puts c.classify "This is some spammy text"

# This will return a Hash with classes as keys and
# membership probabilities as values
puts c.classifications "This is some spammy text"

# If you have a large corpus, the probabilities will
# likely all be 0 (floating-point underflow). In that
# case, you must use log likelihood values
puts c.log_likelihoods "This is some spammy text"

# get a list of all classes
puts c.classes

# close connection
storage.close
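
The note above about probabilities collapsing to 0 comes down to floating-point underflow. The following standalone sketch illustrates it with made-up numbers (400 per-word probabilities of 0.01 each, purely an assumption for the example); it does not call Ankusa.

```ruby
# Hypothetical per-word probabilities; the values are made up.
probs = Array.new(400, 0.01)

# Multiplying them directly underflows: the true value (1e-800)
# is far below the smallest positive Float, so the result is 0.0.
product = probs.inject(1.0) { |acc, p| acc * p }

# Summing logarithms instead stays comfortably within Float range.
log_sum = probs.sum { |p| Math.log(p) }

puts product
puts log_sum
```

Because the logarithm is monotonic, comparing log likelihoods ranks the classes in the same order as comparing the raw probabilities would, without the loss of precision.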