Point at actively maintained version

commit 4586533d82adc5a132bcdb6de2ac439e636f5764 1 parent 7070bb3
Bryce Nyeggen authored
4 Gemfile
@@ -1,4 +0,0 @@
-source "http://rubygems.org"
-
-# Specify your gem's dependencies in bandit.gemspec
-gemspec
16 Gemfile.lock
@@ -1,16 +0,0 @@
-PATH
- remote: .
- specs:
- ankusa (0.0.14)
- fast-stemmer (>= 1.0.0)
-
-GEM
- remote: http://rubygems.org/
- specs:
- fast-stemmer (1.0.1)
-
-PLATFORMS
- ruby
-
-DEPENDENCIES
- ankusa!
137 README.rdoc
@@ -1,137 +1,6 @@
= ankusa
-Ankusa is a text classifier in Ruby that can use Hadoop's HBase, Mongo, or Cassandra for storage. Because it uses HBase/Mongo/Cassandra as a backend, the training corpus can be many terabytes in size (though in-memory and single-file storage backends also exist for smaller corpora).
-
-Ankusa currently provides both a Naive Bayes and a Kullback-Leibler divergence classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it uses Laplacian smoothing in both classification methods.
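The Laplacian (add-one) smoothing mentioned above can be sketched in plain Ruby. Note that `smoothed_prob` is an illustrative helper for this sketch, not part of the ankusa API:

```ruby
# Add-one (Laplacian) smoothing: an unseen word gets a small nonzero
# probability instead of zeroing out the whole product of likelihoods.
# counts:      word => count for one class
# total_words: total word count for that class
# vocab_size:  number of distinct words seen for that class
def smoothed_prob(word, counts, total_words, vocab_size)
  (counts.fetch(word, 0) + 1).to_f / (total_words + vocab_size)
end

counts = { "spam" => 2, "great" => 1 }
puts smoothed_prob("spam", counts, 3, 2)   # (2 + 1) / (3 + 2) = 0.6
puts smoothed_prob("never", counts, 3, 2)  # unseen word: (0 + 1) / 5 = 0.2
```

This mirrors the `(probs[cn] + 1) / (total_word_count + vocab_size)` computation the classifier performs against its storage backend.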
-
-== Installation
-First, install HBase/Hadoop, Mongo, or Cassandra (>= 0.7.0-rc2). Then, install the appropriate gem:
- gem install hbaserb
- # or
- gem install cassandra
- # or
- gem install mongo
-
-If you're using HBase, make sure the HBase Thrift interface has been started as well. Then:
- gem install ankusa
-
-== Basic Usage
-Using the naive Bayes classifier:
-
- require 'rubygems'
- require 'ankusa'
- require 'ankusa/hbase_storage'
-
- # connect to HBase. Alternatively, just for this test, use in memory storage with
- # storage = Ankusa::MemoryStorage.new
- storage = Ankusa::HBaseStorage.new 'localhost'
- c = Ankusa::NaiveBayesClassifier.new storage
-
- # Each of these calls will return a bag-of-words
- # hash with stemmed words as keys and counts as values
- c.train :spam, "This is some spammy text"
- c.train :good, "This is not the bad stuff"
-
- # This will return the most likely class (as symbol)
- puts c.classify "This is some spammy text"
-
- # This will return a Hash with classes as keys and
- # membership probabilities as values
- puts c.classifications "This is some spammy text"
-
- # If you have a large corpus, the probabilities will
- # likely all be 0. In that case, you must use log
- # likelihood values
- puts c.log_likelihoods "This is some spammy text"
-
- # get a list of all classes
- puts c.classnames
-
- # close connection
- storage.close
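The comment above about probabilities all coming out as 0 for a large corpus is a floating-point underflow issue; a standalone sketch of why log likelihoods are needed:

```ruby
# Multiplying many small per-word probabilities underflows Float to 0.0,
# so classifiers sum logs instead of multiplying raw probabilities.
probs = [1.0e-5] * 100                 # 100 words, each with probability 1e-5
puts probs.inject(:*)                  # 0.0 -- the product underflows
puts probs.sum { |pr| Math.log(pr) }   # about -1151.3, still comparable
```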
-
-
-== KL Divergence Classifier
-There is a Kullback–Leibler divergence classifier as well. KL divergence is a distance measure (though not a true metric, because it does not satisfy the triangle inequality). The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes. The class with the shortest "distance" is the best class. You may find that for an especially large corpus this classifier is slightly faster (since prior probabilities are never calculated, only likelihoods).
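As a standalone illustration of the distance computation (using a hypothetical helper, not the gem's API), the relative entropy between two word distributions can be written as:

```ruby
# D(P || Q): relative entropy from a text's word distribution P to a
# class's word distribution Q. Smaller distance means a better match.
# Words missing from Q are floored at a tiny probability to avoid log(p/0).
def kl_divergence(p, q, floor = 1e-9)
  p.sum { |word, prob| prob * Math.log(prob / [q.fetch(word, 0.0), floor].max) }
end

text = { "spam" => 0.5, "great" => 0.5 }
spam = { "spam" => 0.6, "great" => 0.4 }
good = { "word" => 0.5, "good" => 0.5 }

puts kl_divergence(text, spam)  # small: the distributions overlap
puts kl_divergence(text, good)  # large: no shared words
```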
-
-The API is the same as NaiveBayesClassifier's, except that when you want actual numbers you call "distances" rather than "classifications".
-
- require 'rubygems'
- require 'ankusa'
- require 'ankusa/hbase_storage'
-
- # connect to HBase
- storage = Ankusa::HBaseStorage.new 'localhost'
- c = Ankusa::KLDivergenceClassifier.new storage
-
- # Each of these calls will return a bag-of-words
- # hash with stemmed words as keys and counts as values
- c.train :spam, "This is some spammy text"
- c.train :good, "This is not the bad stuff"
-
- # This will return the most likely class (as symbol)
- puts c.classify "This is some spammy text"
-
- # This will return a Hash with classes as keys and
- # distances >= 0 as values
- puts c.distances "This is some spammy text"
-
- # get a list of all classes
- puts c.classnames
-
- # close connection
- storage.close
-
-== Storage Methods
-Ankusa has a generalized storage interface that has been implemented for HBase, Cassandra, Mongo, single file, and in-memory storage.
-
-Memory storage can be used when you have a very small corpus:
- require 'ankusa/memory_storage'
- storage = Ankusa::MemoryStorage.new
-
-FileSystem storage can be used when you have a very small corpus and want to persist the classification results:
- require 'ankusa/file_system_storage'
- storage = Ankusa::FileSystemStorage.new '/path/to/file'
- # Do classification ...
- storage.save
-
-The FileSystem storage does NOT save to the filesystem automatically; the #save method must be invoked to persist the results.
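The save-on-demand behavior follows the common Ruby pattern of serializing a state Hash with Marshal; a minimal standalone sketch of that approach (not using the gem itself):

```ruby
require 'tempfile'

# Classifier state is a plain Hash; nothing touches disk until an
# explicit dump, mirroring FileSystemStorage's #save.
state = {
  :freqs            => { "spam" => { :spam => 2 } },
  :total_doc_counts => { :spam => 1 }
}

file = Tempfile.new('ankusa_demo')
File.open(file.path, 'wb') { |f| Marshal.dump(state, f) }    # the "save" step
restored = File.open(file.path, 'rb') { |f| Marshal.load(f) }
puts restored == state  # true: full round trip
```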
-
-HBase storage:
- require 'ankusa/hbase_storage'
- # defaults: host='localhost', port=9090, frequency_tablename="ankusa_word_frequencies", summary_tablename="ankusa_summary"
- storage = Ankusa::HBaseStorage.new host, port, frequency_tablename, summary_tablename
-
-For Cassandra storage:
-* You will need Cassandra version 0.7.0-rc2 or greater.
-* You will need to set a max number of classes, since the current implementation of the Ruby Cassandra client doesn't support table scans.
-* Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: "create keyspace ankusa with replication_factor = 1". This should be fixed with a new release candidate for Cassandra.
-
-To use the Cassandra storage class:
- require 'ankusa/cassandra_storage'
- # defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
- storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes
-
-For MongoDB storage:
- require 'ankusa/mongo_db_storage'
- storage = Ankusa::MongoDbStorage.new :host => "localhost", :port => 27017, :db => "ankusa"
- # defaults: :host => "localhost", :port => 27017, :db => "ankusa"
- # no default username or password
- # you can also use the :frequency_tablename and :summary_tablename options
-
-
-== Running Tests
-You can run the tests for any of the five storage methods. For instance, for memory storage:
- rake test_memory
-
-For the other methods you will need to edit the file test/config.yml and set the configuration params. Then:
- rake test_hbase
- # or
- rake test_cassandra
- # or
- rake test_filesystem
- # or
- rake test_mongo_db
-
-
+ATTENTION: THIS REPO IS DEPRECATED.
+PLEASE USE https://github.com/bmuller/ankusa FOR THE ACTIVELY MAINTAINED VERSION
+Ankusa is a text classifier in Ruby that can use either Hadoop's HBase, Mongo, or Cassandra for storage.
49 Rakefile
@@ -1,49 +0,0 @@
-require 'rubygems'
-require 'bundler'
-require 'rake/testtask'
-require 'rdoc/task'
-
-Bundler::GemHelper.install_tasks
-
-desc "Create documentation"
-RDoc::Task.new("doc") { |rdoc|
- rdoc.title = "Ankusa - Naive Bayes classifier with big data storage"
- rdoc.rdoc_dir = 'docs'
- rdoc.rdoc_files.include('README.rdoc')
- rdoc.rdoc_files.include('lib/**/*.rb')
-}
-
-desc "Run all unit tests with memory storage"
-Rake::TestTask.new("test_memory") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/memory_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with HBase storage"
-Rake::TestTask.new("test_hbase") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with Cassandra storage"
-Rake::TestTask.new("test_cassandra") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/cassandra_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with FileSystem storage"
-Rake::TestTask.new("test_filesystem") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/file_system_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with MongoDb storage"
-Rake::TestTask.new("test_mongo_db") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/mongo_db_classifier_test.rb']
- t.verbose = true
-}
20 ankusa.gemspec
@@ -1,20 +0,0 @@
-$:.push File.expand_path("../lib", __FILE__)
-require "ankusa/version"
-require "rake"
-require "date"
-
-Gem::Specification.new do |s|
- s.name = "ankusa"
- s.version = Ankusa::VERSION
- s.authors = ["Brian Muller"]
- s.date = Date.today.to_s
- s.description = "Text classifier with HBase, Cassandra, or Mongo storage"
- s.summary = "Text classifier in Ruby that uses Hadoop's HBase, Cassandra, or Mongo for storage"
- s.email = "brian.muller@livingsocial.com"
- s.files = FileList["lib/**/*", "[A-Z]*", "Rakefile", "docs/**/*"]
- s.homepage = "https://github.com/livingsocial/ankusa"
- s.require_paths = ["lib"]
- s.add_dependency('fast-stemmer', '>= 1.0.0')
- s.requirements << "Either hbaserb >= 0.0.3 or cassandra >= 0.7"
- s.rubyforge_project = "ankusa"
-end
6 lib/ankusa.rb
@@ -1,6 +0,0 @@
-require 'ankusa/version'
-require 'ankusa/extensions'
-require 'ankusa/classifier'
-require 'ankusa/naive_bayes'
-require 'ankusa/kl_divergence'
-require 'ankusa/hasher'
194 lib/ankusa/cassandra_storage.rb
@@ -1,194 +0,0 @@
-require 'cassandra/0.7'
-
-#
-# At the moment you'll have to do:
-#
-# create keyspace ankusa with replication_factor = 1
-#
-# from the cassandra-cli. This should be fixed with new release candidate for
-# cassandra
-#
-module Ankusa
-
- class CassandraStorage
- attr_reader :cassandra
-
- #
- # Necessary to set max classes since current implementation of ruby
- # cassandra client doesn't support table scans. Using crufty get_range
- # method at the moment.
- #
- def initialize(host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100)
- @cassandra = Cassandra.new('system', "#{host}:#{port}")
- @klass_word_counts = {}
- @klass_doc_counts = {}
- @keyspace = keyspace
- @max_classes = max_classes
- init_tables
- end
-
- #
- # Fetch the names of the distinct classes for classification:
- # eg. :spam, :good, etc
- #
- def classnames
- @cassandra.get_range(:totals, {:start => '', :finish => '', :count => @max_classes}).inject([]) do |cs, key_slice|
- cs << key_slice.key.to_sym
- end
- end
-
- def reset
- drop_tables
- init_tables
- end
-
- #
- # Drop ankusa keyspace, reset internal caches
- #
- # FIXME: truncate doesn't work with cassandra-beta2
- #
- def drop_tables
- @cassandra.truncate!('classes')
- @cassandra.truncate!('totals')
- @cassandra.drop_keyspace(@keyspace)
- @klass_word_counts = {}
- @klass_doc_counts = {}
- end
-
-
- #
- # Create required keyspace and column families
- #
- def init_tables
- # Do nothing if keyspace already exists
- if @cassandra.keyspaces.include?(@keyspace)
- @cassandra.keyspace = @keyspace
- else
- freq_table = Cassandra::ColumnFamily.new({:keyspace => @keyspace, :name => "classes"}) # word => {classname => count}
- summary_table = Cassandra::ColumnFamily.new({:keyspace => @keyspace, :name => "totals"}) # class => {wordcount => count}
- ks_def = Cassandra::Keyspace.new({
- :name => @keyspace,
- :strategy_class => 'org.apache.cassandra.locator.SimpleStrategy',
- :replication_factor => 1,
- :cf_defs => [freq_table, summary_table]
- })
- @cassandra.add_keyspace ks_def
- @cassandra.keyspace = @keyspace
- end
- end
-
- #
- # Fetch hash of word counts as a single row from cassandra.
- # Here column_name is the class and column value is the count
- #
- def get_word_counts(word)
- # fetch all (class,count) pairs for a given word
- row = @cassandra.get(:classes, word.to_s)
- return row.to_hash if row.empty?
- row.inject({}){|counts, col| counts[col.first.to_sym] = [col.last.to_f,0].max; counts}
- end
-
- #
- # Does a table 'scan' of summary table pulling out the 'vocabsize' column
- # from each row. Generates a hash of (class, vocab_size) key value pairs
- #
- def get_vocabulary_sizes
- get_summary "vocabsize"
- end
-
- #
- # Fetch total word count for a given class and cache it
- #
- def get_total_word_count(klass)
- @klass_word_counts[klass] = @cassandra.get(:totals, klass.to_s, "wordcount").values.last.to_f
- end
-
- #
- # Fetch total documents for a given class and cache it
- #
- def get_doc_count(klass)
- @klass_doc_counts[klass] = @cassandra.get(:totals, klass.to_s, "doc_count").values.last.to_f
- end
-
- #
- # Increment the count for a given (word,class) pair. Evidently, cassandra
- # does not support atomic increment/decrement. Psh. HBase uses ZooKeeper to
- # implement atomic operations, ain't it special?
- #
- def incr_word_count(klass, word, count)
- # Only wants strings
- klass = klass.to_s
- word = word.to_s
-
- prior_count = @cassandra.get(:classes, word, klass).values.last.to_i
- new_count = prior_count + count
- @cassandra.insert(:classes, word, {klass => new_count.to_s})
-
- if (prior_count == 0 && count > 0)
- #
- # we've never seen this word before and we're not trying to unlearn it
- #
- vocab_size = @cassandra.get(:totals, klass, "vocabsize").values.last.to_i
- vocab_size += 1
- @cassandra.insert(:totals, klass, {"vocabsize" => vocab_size.to_s})
- elsif new_count == 0
- #
- # we've seen this word before but we're trying to unlearn it
- #
- vocab_size = @cassandra.get(:totals, klass, "vocabsize").values.last.to_i
- vocab_size -= 1
- @cassandra.insert(:totals, klass, {"vocabsize" => vocab_size.to_s})
- end
- new_count
- end
-
- #
- # Increment total word count for a given class by 'count'
- #
- def incr_total_word_count(klass, count)
- klass = klass.to_s
- wordcount = @cassandra.get(:totals, klass, "wordcount").values.last.to_i
- wordcount += count
- @cassandra.insert(:totals, klass, {"wordcount" => wordcount.to_s})
- @klass_word_counts[klass.to_sym] = wordcount
- end
-
- #
- # Increment total document count for a given class by 'count'
- #
- def incr_doc_count(klass, count)
- klass = klass.to_s
- doc_count = @cassandra.get(:totals, klass, "doc_count").values.last.to_i
- doc_count += count
- @cassandra.insert(:totals, klass, {"doc_count" => doc_count.to_s})
- @klass_doc_counts[klass.to_sym] = doc_count
- end
-
- def doc_count_totals
- get_summary "doc_count"
- end
-
- #
- # Doesn't do anything
- #
- def close
- end
-
- protected
-
- #
- # Fetch up to @max_classes rows from the summary table; increase if necessary
- #
- def get_summary(name)
- counts = {}
- @cassandra.get_range(:totals, {:start => '', :finish => '', :count => @max_classes}).each do |key_slice|
- # keyslice is a clunky thrift object, map into a ruby hash
- row = key_slice.columns.inject({}){|hsh, c| hsh[c.column.name] = c.column.value; hsh}
- counts[key_slice.key.to_sym] = row[name].to_f
- end
- counts
- end
-
- end
-
-end
72 lib/ankusa/classifier.rb
@@ -1,72 +0,0 @@
-module Ankusa
-
- module Classifier
- attr_reader :classnames
-
- def initialize(storage)
- @storage = storage
- @storage.init_tables
- @classnames = @storage.classnames
- end
-
- # text can be either an array of strings or a string
- # klass is a symbol
- def train(klass, text)
- th = TextHash.new(text)
- th.each { |word, count|
- @storage.incr_word_count klass, word, count
- yield word, count if block_given?
- }
- @storage.incr_total_word_count klass, th.word_count
- doccount = (text.kind_of? Array) ? text.length : 1
- @storage.incr_doc_count klass, doccount
- @classnames << klass unless @classnames.include? klass
- # cache is now dirty of these vars
- @doc_count_totals = nil
- @vocab_sizes = nil
- th
- end
-
- # text can be either an array of strings or a string
- # klass is a symbol
- def untrain(klass, text)
- th = TextHash.new(text)
- th.each { |word, count|
- @storage.incr_word_count klass, word, -count
- yield word, count if block_given?
- }
- @storage.incr_total_word_count klass, -th.word_count
- doccount = (text.kind_of? Array) ? text.length : 1
- @storage.incr_doc_count klass, -doccount
- # cache is now dirty of these vars
- @doc_count_totals = nil
- @vocab_sizes = nil
- th
- end
-
- protected
- def get_word_probs(word, classnames)
- probs = Hash.new 0
- @storage.get_word_counts(word).each { |k,v| probs[k] = v if classnames.include? k }
- vs = vocab_sizes
- classnames.each { |cn|
- # if we've never seen the class, the word prob is 0
- next unless vs.has_key? cn
-
- # use a laplacian smoother
- probs[cn] = (probs[cn] + 1).to_f / (@storage.get_total_word_count(cn) + vs[cn]).to_f
- }
- probs
- end
-
- def doc_count_totals
- @doc_count_totals ||= @storage.doc_count_totals
- end
-
- def vocab_sizes
- @vocab_sizes ||= @storage.get_vocabulary_sizes
- end
-
- end
-
-end
13 lib/ankusa/extensions.rb
@@ -1,13 +0,0 @@
-require 'iconv'
-
-class String
- def numeric?
- true if Float(self) rescue false
- end
-
- def to_ascii
- # from http://www.jroller.com/obie/tags/unicode
- converter = Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8')
- converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*') rescue ""
- end
-end
55 lib/ankusa/file_system_storage.rb
@@ -1,55 +0,0 @@
-require 'ankusa/memory_storage'
-
-module Ankusa
-
- class FileSystemStorage < MemoryStorage
-
- def initialize(file)
- @file = file
- init_tables
- end
-
- def reset
- @freqs = {}
- @total_word_counts = Hash.new(0)
- @total_doc_counts = Hash.new(0)
- @klass_word_counts = {}
- @klass_doc_counts = {}
- end
-
- def drop_tables
- File.delete(@file) rescue Errno::ENOENT
- reset
- end
-
- def init_tables
- data = {}
- begin
- File.open(@file) do |f|
- data = Marshal.load(f)
- end
- @freqs = data[:freqs]
- @total_word_counts = data[:total_word_counts]
- @total_doc_counts = data[:total_doc_counts]
- @klass_word_counts = data[:klass_word_counts]
- @klass_doc_counts = data[:klass_doc_counts]
- rescue Errno::ENOENT
- reset
- end
- end
-
- def save(file = nil)
- file ||= @file
- data = { :freqs => @freqs,
- :total_word_counts => @total_word_counts,
- :total_doc_counts => @total_doc_counts,
- :klass_word_counts => @klass_word_counts,
- :klass_doc_counts => @klass_doc_counts }
- File.open(file, 'w+') do |f|
- Marshal.dump(data, f)
- end
- end
-
- end
-
-end
47 lib/ankusa/hasher.rb
@@ -1,47 +0,0 @@
-require 'fast_stemmer'
-require 'ankusa/stopwords'
-
-module Ankusa
-
- class TextHash < Hash
- attr_reader :word_count
-
- def initialize(text=nil, stem=true)
- super 0
- @word_count = 0
- @stem = stem
- add_text(text) unless text.nil?
- end
-
- def self.atomize(text)
- text.downcase.to_ascii.tr('-', ' ').gsub(/[^\w\s]/," ").split
- end
-
- # word should be only alphanum chars at this point
- def self.valid_word?(word)
- not (Ankusa::STOPWORDS.include?(word) || word.length < 3 || word.numeric?)
- end
-
- def add_text(text)
- if text.instance_of? Array
- text.each { |t| add_text t }
- else
- # replace dashes with spaces, then get rid of non-word/non-space characters,
- # then split by space to get words
- words = TextHash.atomize text
- words.each { |word| add_word(word) if TextHash.valid_word?(word) }
- end
- self
- end
-
- protected
-
- def add_word(word)
- @word_count += 1
- word = word.stem if @stem
- key = word.intern
- store key, fetch(key, 0)+1
- end
- end
-
-end
126 lib/ankusa/hbase_storage.rb
@@ -1,126 +0,0 @@
-require 'hbaserb'
-
-module Ankusa
-
- class HBaseStorage
- attr_reader :hbase
-
- def initialize(host='localhost', port=9090, frequency_tablename="ankusa_word_frequencies", summary_tablename="ankusa_summary")
- @hbase = HBaseRb::Client.new host, port
- @ftablename = frequency_tablename
- @stablename = summary_tablename
- @klass_word_counts = {}
- @klass_doc_counts = {}
- init_tables
- end
-
- def classnames
- cs = []
- summary_table.create_scanner("", "totals") { |row|
- cs << row.row.intern
- }
- cs
- end
-
- def reset
- drop_tables
- init_tables
- end
-
- def drop_tables
- freq_table.delete
- summary_table.delete
- @stable = nil
- @ftable = nil
- @klass_word_counts = {}
- @klass_doc_counts = {}
- end
-
- def init_tables
- unless @hbase.has_table? @ftablename
- @hbase.create_table @ftablename, "classes", "total"
- end
-
- unless @hbase.has_table? @stablename
- @hbase.create_table @stablename, "totals"
- end
- end
-
- def get_word_counts(word)
- counts = Hash.new(0)
- row = freq_table.get_row(word)
- return counts if row.length == 0
-
- row.first.columns.each { |colname, cell|
- classname = colname.split(':')[1].intern
- # in case untrain has been called too many times
- counts[classname] = [cell.to_i64.to_f, 0].max
- }
-
- counts
- end
-
- def get_vocabulary_sizes
- get_summary "totals:vocabsize"
- end
-
- def get_total_word_count(klass)
- @klass_word_counts.fetch(klass) {
- @klass_word_counts[klass] = summary_table.get(klass, "totals:wordcount").first.to_i64.to_f
- }
- end
-
- def get_doc_count(klass)
- @klass_doc_counts.fetch(klass) {
- @klass_doc_counts[klass] = summary_table.get(klass, "totals:doccount").first.to_i64.to_f
- }
- end
-
- def incr_word_count(klass, word, count)
- size = freq_table.atomic_increment word, "classes:#{klass.to_s}", count
- # if this is a new word, increase the klass's vocab size. If the new word
- # count is 0, then we need to decrement our vocab size
- if size == count
- summary_table.atomic_increment klass, "totals:vocabsize"
- elsif size == 0
- summary_table.atomic_increment klass, "totals:vocabsize", -1
- end
- size
- end
-
- def incr_total_word_count(klass, count)
- @klass_word_counts[klass] = summary_table.atomic_increment klass, "totals:wordcount", count
- end
-
- def incr_doc_count(klass, count)
- @klass_doc_counts[klass] = summary_table.atomic_increment klass, "totals:doccount", count
- end
-
- def doc_count_totals
- get_summary "totals:doccount"
- end
-
- def close
- @hbase.close
- end
-
- protected
- def get_summary(name)
- counts = Hash.new 0
- summary_table.create_scanner("", name) { |row|
- counts[row.row.intern] = row.columns[name].to_i64
- }
- counts
- end
-
- def summary_table
- @stable ||= @hbase.get_table @stablename
- end
-
- def freq_table
- @ftable ||= @hbase.get_table @ftablename
- end
-
- end
-
-end
31 lib/ankusa/kl_divergence.rb
@@ -1,31 +0,0 @@
-module Ankusa
-
- class KLDivergenceClassifier
- include Classifier
-
- def classify(text, classes=nil)
- # return the class with the least distance from the word
- # distribution of the given text
- distances(text, classes).sort_by { |c| c[1] }.first.first
- end
-
-
- # Classes is an array of classes to look at
- def distances(text, classnames=nil)
- classnames ||= @classnames
- distances = Hash.new 0
-
- th = TextHash.new(text)
- th.each { |word, count|
- thprob = count.to_f / th.length.to_f
- probs = get_word_probs(word, classnames)
- classnames.each { |k|
- distances[k] += (thprob * Math.log(thprob / probs[k]) * count)
- }
- }
-
- distances
- end
- end
-
-end
69 lib/ankusa/memory_storage.rb
@@ -1,69 +0,0 @@
-module Ankusa
-
- class MemoryStorage
- def initialize
- init_tables
- end
-
- def classnames
- @total_doc_counts.keys
- end
-
- def reset
- init_tables
- end
-
- def drop_tables
- end
-
- def init_tables
- @freqs = {}
- @total_word_counts = Hash.new(0)
- @total_doc_counts = Hash.new(0)
- @klass_word_counts = {}
- @klass_doc_counts = {}
- end
-
- def get_vocabulary_sizes
- count = Hash.new 0
- @freqs.each { |w, ks|
- ks.keys.each { |k| count[k] += 1 }
- }
- count
- end
-
- def get_word_counts(word)
- @freqs.fetch word, Hash.new(0)
- end
-
- def get_total_word_count(klass)
- @total_word_counts[klass]
- end
-
- def get_doc_count(klass)
- @total_doc_counts[klass]
- end
-
- def incr_word_count(klass, word, count)
- @freqs[word] ||= Hash.new(0)
- @freqs[word][klass] += count
- end
-
- def incr_total_word_count(klass, count)
- @total_word_counts[klass] += count
- end
-
- def incr_doc_count(klass, count)
- @total_doc_counts[klass] += count
- end
-
- def doc_count_totals
- @total_doc_counts
- end
-
- def close
- end
-
- end
-
-end
127 lib/ankusa/mongo_db_storage.rb
@@ -1,127 +0,0 @@
-require 'mongo'
-#require 'bson_ext'
-
-module Ankusa
- class MongoDbStorage
-
- def initialize(opts={})
- options = { :host => "localhost", :port => 27017, :db => "ankusa",
- :frequency_tablename => "word_frequencies", :summary_tablename => "summary"
- }.merge(opts)
-
- @db = Mongo::Connection.new(options[:host], options[:port]).db(options[:db])
- @db.authenticate(options[:username], options[:password]) if options[:password]
-
- @ftablename = options[:frequency_tablename]
- @stablename = options[:summary_tablename]
-
- @klass_word_counts = {}
- @klass_doc_counts = {}
-
- init_tables
- end
-
- def init_tables
- @db.create_collection(@ftablename) unless @db.collection_names.include?(@ftablename)
- freq_table.create_index('word')
- @db.create_collection(@stablename) unless @db.collection_names.include?(@stablename)
- summary_table.create_index('klass')
- end
-
- def drop_tables
- @db.drop_collection(@ftablename)
- @db.drop_collection(@stablename)
- end
-
- def classnames
- summary_table.distinct('klass')
- end
-
- def reset
- drop_tables
- init_tables
- end
-
- def incr_word_count(klass, word, count)
- freq_table.update({:word => word}, { '$inc' => {klass => count} }, :upsert => true)
-
- #update vocabulary size
- word_doc = freq_table.find_one({:word => word})
- if word_doc[klass.to_s] == count
- increment_summary_klass(klass, 'vocabulary_size', 1)
- elsif word_doc[klass.to_s] == 0
- increment_summary_klass(klass, 'vocabulary_size', -1)
- end
- word_doc[klass.to_s]
- end
-
- def incr_total_word_count(klass, count)
- increment_summary_klass(klass, 'word_count', count)
- end
-
- def incr_doc_count(klass, count)
- increment_summary_klass(klass, 'doc_count', count)
- end
-
- def get_word_counts(word)
- counts = Hash.new(0)
-
- word_doc = freq_table.find_one({:word => word})
- if word_doc
- word_doc.delete("_id")
- word_doc.delete("word")
- #convert keys to symbols
- counts.merge!(word_doc.inject({}){|h, (k, v)| h[(k.to_sym rescue k) || k] = v; h})
- end
-
- counts
- end
-
- def get_total_word_count(klass)
- klass_doc = summary_table.find_one(:klass => klass)
- klass_doc ? klass_doc['word_count'].to_f : 0.0
- end
-
- def doc_count_totals
- count = Hash.new(0)
-
- summary_table.find.each do |doc|
- count[ doc['klass'] ] = doc['doc_count']
- end
-
- count
- end
-
- def get_vocabulary_sizes
- count = Hash.new(0)
-
- summary_table.find.each do |doc|
- count[ doc['klass'] ] = doc['vocabulary_size']
- end
-
- count
- end
-
- def get_doc_count(klass)
- klass_doc = summary_table.find_one(:klass => klass)
- klass_doc ? klass_doc['doc_count'].to_f : 0.0
- end
-
- def close
- end
-
- private
- def summary_table
- @stable ||= @db[@stablename]
- end
-
- def freq_table
- @ftable ||= @db[@ftablename]
- end
-
- def increment_summary_klass(klass, field, count)
- summary_table.update({:klass => klass}, { '$inc' => {field => count} }, :upsert => true)
- end
-
- end
-end
50 lib/ankusa/naive_bayes.rb
@@ -1,50 +0,0 @@
-module Ankusa
- INFTY = 1.0 / 0.0
-
- class NaiveBayesClassifier
- include Classifier
-
- def classify(text, classes=nil)
- # return the most probable class
- log_likelihoods(text, classes).sort_by { |c| -c[1] }.first.first
- end
-
- # Classes is an array of classes to look at
- def classifications(text, classnames=nil)
- result = log_likelihoods text, classnames
- result.keys.each { |k|
- result[k] = (result[k] == -INFTY) ? 0 : Math.exp(result[k])
- }
-
- # normalize to get probs
- sum = result.values.inject { |x,y| x+y }
- result.keys.each { |k| result[k] = result[k] / sum }
- result
- end
-
- # Classes is an array of classes to look at
- def log_likelihoods(text, classnames=nil)
- classnames ||= @classnames
- result = Hash.new 0
-
- TextHash.new(text).each { |word, count|
- probs = get_word_probs(word, classnames)
- classnames.each { |k|
- # log likelihood should be negative infinity if we've never seen the klass
- result[k] += probs[k] > 0 ? (Math.log(probs[k]) * count) : -INFTY
- }
- }
-
- # add the prior
- doc_counts = doc_count_totals.select { |k,v| classnames.include? k }.map { |k,v| v }
- doc_count_total = (doc_counts.inject { |x,y| x+y } + classnames.length).to_f
- classnames.each { |k|
- result[k] += Math.log((@storage.get_doc_count(k) + 1).to_f / doc_count_total)
- }
-
- result
- end
-
- end
-
-end
4 lib/ankusa/stopwords.rb
@@ -1,4 +0,0 @@
-module Ankusa
- # These are taken from MySQL - http://dev.mysql.com/tech-resources/articles/full-text-revealed.html
- STOPWORDS = %W(a able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently definitely described despite did didn't different do does doesn't doing don't done down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself just keep keeps kept know knows known last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on 
once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would would wouldn't yes yet you you'd you'll you're you've your yours yourself yourselves zero)
-end
3  lib/ankusa/version.rb
@@ -1,3 +0,0 @@
-module Ankusa
- VERSION = "0.0.14"
-end
19 test/cassandra_classifier_test.rb
@@ -1,19 +0,0 @@
-require File.join File.dirname(__FILE__), 'classifier_base'
-require 'ankusa/cassandra_storage'
-
-module CassandraClassifierBase
- def initialize(name)
- @storage = Ankusa::CassandraStorage.new CONFIG['cassandra_host'], CONFIG['cassandra_port'], "ankusa_test"
- super(name)
- end
-end
-
-class NBClassifierTest < Test::Unit::TestCase
- include CassandraClassifierBase
- include NBClassifierBase
-end
-
-class KLClassifierTest < Test::Unit::TestCase
- include CassandraClassifierBase
- include KLClassifierBase
-end
119 test/classifier_base.rb
@@ -1,119 +0,0 @@
-require File.join File.dirname(__FILE__), 'helper'
-
-module ClassifierBase
- def train
- @classifier.train :spam, "spam and great spam" # spam:2 great:1
- @classifier.train :good, "words for processing" # word:1 process:1
- @classifier.train :good, "good word" # word:1 good:1
- end
-
- def test_train
- counts = @storage.get_word_counts(:spam)
- assert_equal counts[:spam], 2
- counts = @storage.get_word_counts(:word)
- assert_equal counts[:good], 2
- assert_equal @storage.get_total_word_count(:good), 4
- assert_equal @storage.get_doc_count(:good), 2
- assert_equal @storage.get_total_word_count(:spam), 3
- assert_equal @storage.get_doc_count(:spam), 1
- totals = @storage.doc_count_totals
- assert_equal totals.values.inject { |x,y| x+y }, 3
- assert_equal totals[:spam], 1
- assert_equal totals[:good], 2
-
- vocab = @storage.get_vocabulary_sizes
- assert_equal vocab[:spam], 2
- assert_equal vocab[:good], 3
- end
-
- def teardown
- @storage.drop_tables
- @storage.close
- end
-end
-
-
-module NBClassifierBase
- include ClassifierBase
-
- def setup
- @classifier = Ankusa::NaiveBayesClassifier.new @storage
- train
- end
-
- def test_probs
- spamlog = Math.log(3.0 / 5.0) + Math.log(1.0 / 5.0) + Math.log(2.0 / 5.0)
- goodlog = Math.log(1.0 / 7.0) + Math.log(1.0 / 7.0) + Math.log(3.0 / 5.0)
-
- # exponentiate
- spamex = Math.exp(spamlog)
- goodex = Math.exp(goodlog)
-
- # normalize
- spam = spamex / (spamex + goodex)
- good = goodex / (spamex + goodex)
-
- cs = @classifier.classifications("spam is tastey")
- assert_equal cs[:spam], spam
- assert_equal cs[:good], good
-
- cs = @classifier.log_likelihoods("spam is tastey")
- assert_equal cs[:spam], spamlog
- assert_equal cs[:good], goodlog
-
- @classifier.train :somethingelse, "this is something else entirely spam"
- cs = @classifier.classifications("spam is tastey", [:spam, :good])
- assert_equal cs[:spam], spam
- assert_equal cs[:good], good
-
- # test for class we didn't train on
- cs = @classifier.classifications("spam is super tastey if you are a zombie", [:spam, :nothing])
- assert_equal cs[:nothing], 0
- end
-
- def test_prob_result
- cs = @classifier.classifications("spam is tastey").sort_by { |c| -c[1] }.first.first
- klass = @classifier.classify("spam is tastey")
- assert_equal cs, klass
- assert_equal klass, :spam
- end
-end
-
-
-module KLClassifierBase
- include ClassifierBase
-
- def setup
- @classifier = Ankusa::KLDivergenceClassifier.new @storage
- train
- end
-
- def test_distances
- ds = @classifier.distances("spam is tastey")
- thprob_spam = 1.0 / 2.0
- thprob_tastey = 1.0 / 2.0
-
- train_prob_spam = (2 + 1).to_f / (3 + 2).to_f
- train_prob_tastey = (0 + 1).to_f / (3 + 2).to_f
- dist = thprob_spam * Math.log(thprob_spam / train_prob_spam)
- dist += thprob_tastey * Math.log(thprob_tastey / train_prob_tastey)
- assert_equal ds[:spam], dist
-
- train_prob_spam = 1.0 / (4 + 3).to_f
- train_prob_tastey = 1.0 / (4 + 3).to_f
- dist = thprob_spam * Math.log(thprob_spam / train_prob_spam)
- dist += thprob_tastey * Math.log(thprob_tastey / train_prob_tastey)
- assert_equal ds[:good], dist
- end
-
- def test_distances_result
- cs = @classifier.distances("spam is tastey").sort_by { |c| c[1] }.first.first
- klass = @classifier.classify("spam is tastey")
- assert_equal cs, klass
- assert_equal klass, :spam
-
- # assert distance from class we didn't train with is Infinity (1.0/0.0 is a way to get at Infinity)
- cs = @classifier.distances("spam is tastey", [:spam, :nothing])
- assert_equal cs[:nothing], (1.0/0.0)
- end
-end
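The expected values asserted in test_probs and test_distances above come straight from Laplace-smoothed counts. As a standalone sanity check, here is a minimal Ruby sketch of that arithmetic — not ankusa's implementation, just the same math, assuming smoothing of the form (count + 1) / (total + vocabulary size) and a similarly smoothed document-count prior:

```ruby
# Re-derivation of the numbers asserted in test_probs/test_distances.
# This is NOT ankusa's code; it just redoes the Laplace-smoothed arithmetic.

# Training counts after stemming/stop-word removal, as in ClassifierBase#train:
counts     = { spam: { spam: 2, great: 1 }, good: { word: 2, process: 1, good: 1 } }
doc_counts = { spam: 1, good: 2 }
query      = [:spam, :tastey]  # "spam is tastey" with the stop word "is" dropped

total_docs  = doc_counts.values.sum
num_classes = doc_counts.size

# Naive Bayes: smoothed log-likelihood of each query word plus a smoothed prior.
log_likelihoods = counts.keys.map do |klass|
  total = counts[klass].values.sum
  vocab = counts[klass].size
  ll  = query.sum { |w| Math.log((counts[klass].fetch(w, 0) + 1.0) / (total + vocab)) }
  ll += Math.log((doc_counts[klass] + 1.0) / (total_docs + num_classes))
  [klass, ll]
end.to_h
# log_likelihoods[:spam] is log(3/5) + log(1/5) + log(2/5), as in test_probs

# Exponentiate and normalize into posterior probabilities, as in test_probs.
exps  = log_likelihoods.transform_values { |ll| Math.exp(ll) }
z     = exps.values.sum
probs = exps.transform_values { |e| e / z }

# KL divergence, as in test_distances: D(query || class) = sum of q * log(q / p).
query_dist = { spam: 0.5, tastey: 0.5 }  # empirical distribution of the query
kl = counts.keys.map do |klass|
  total = counts[klass].values.sum
  vocab = counts[klass].size
  d = query_dist.sum do |w, q|
    p = (counts[klass].fetch(w, 0) + 1.0) / (total + vocab)
    q * Math.log(q / p)
  end
  [klass, d]
end.to_h
# The lowest-divergence class wins, so :spam is the classification here too.
```

Note the asymmetry between the two classifiers: naive Bayes picks the class with the highest posterior, while the KL classifier picks the class whose smoothed word distribution is closest to the query's empirical distribution.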
7 test/config.yml
@@ -1,7 +0,0 @@
-hbase_host: 127.0.0.1
-hbase_port: 9090
-cassandra_host: 127.0.0.1
-cassandra_port: 9160
-mongo_db_host: 127.0.0.1
-mongo_db_port: 27017
-file_system_storage_file: training.anuska
27 test/file_system_classifier_test.rb
@@ -1,27 +0,0 @@
-require File.join File.dirname(__FILE__), 'classifier_base'
-require 'ankusa/file_system_storage'
-
-module FileSystemClassifierBase
- def initialize(name)
- @storage = Ankusa::FileSystemStorage.new CONFIG['file_system_storage_file']
- super name
- end
-
- def test_storage
- # train will be called in setup method, now reload storage and test training
- @storage.save
- @storage = Ankusa::FileSystemStorage.new CONFIG['file_system_storage_file']
- test_train
- end
-end
-
-class NBMemoryClassifierTest < Test::Unit::TestCase
- include FileSystemClassifierBase
- include NBClassifierBase
-end
-
-
-class KLMemoryClassifierTest < Test::Unit::TestCase
- include FileSystemClassifierBase
- include KLClassifierBase
-end
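The test_storage method above is really checking a save/reload round trip: train, persist, construct a fresh storage object from the same file, and assert that the counts survived. A generic sketch of that pattern, using a bare Hash with Marshal as a stand-in (this is not ankusa's actual FileSystemStorage, whose internals are not shown in this diff):

```ruby
require 'tmpdir'

# Sketch of the persistence round trip test_storage exercises.
path = File.join(Dir.tmpdir, "ankusa_sketch.dump")

# "Train": accumulate word counts, then persist them (the save step).
counts = Hash.new(0)
%w(spam spam great).each { |w| counts[w] += 1 }
File.binwrite(path, Marshal.dump(counts))

# "Reload": a fresh object restored from disk must report identical counts,
# which is what re-running test_train after the reload verifies.
reloaded = Marshal.load(File.binread(path))
# reloaded["spam"] == 2, reloaded["great"] == 1
```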
25 test/hasher_test.rb
@@ -1,25 +0,0 @@
-require File.join File.dirname(__FILE__), 'helper'
-
-class HasherTest < Test::Unit::TestCase
- def setup
- string = "Words word a the at fish fishing fishes? /^/ The at a of! @#$!"
- @text_hash = Ankusa::TextHash.new string
- @array = Ankusa::TextHash.new [string]
- end
-
- def test_stemming
- assert_equal @text_hash.length, 2
- assert_equal @text_hash.word_count, 5
-
- assert_equal @array.length, 2
- assert_equal @array.word_count, 5
- end
-
- def test_valid_word
- assert (not Ankusa::TextHash.valid_word? "accordingly")
- assert (not Ankusa::TextHash.valid_word? "appropriate")
- assert Ankusa::TextHash.valid_word? "^*&@"
- assert Ankusa::TextHash.valid_word? "mother"
- assert (not Ankusa::TextHash.valid_word? "21675")
- end
-end
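To make concrete what the length and word_count assertions above are checking, here is a toy re-creation of the pipeline in plain Ruby. The stop-word subset and the suffix-stripping "stemmer" are stand-ins invented for this sketch; ankusa actually uses its full STOPWORDS list and the fast-stemmer gem (a Porter stemmer):

```ruby
# Toy version of what TextHash does in the test above: drop stop words and
# non-word tokens, stem what remains, and count the stems.

STOP = %w(a the at of is).freeze  # tiny illustrative subset of ankusa's stop-word list

def toy_stem(word)
  word.sub(/(ing|es|s)\z/, '')  # crude; fast-stemmer implements the real Porter algorithm
end

def text_hash(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]+/) do |w|
    next if STOP.include?(w) || w.length <= 2
    counts[toy_stem(w)] += 1
  end
  counts
end

h = text_hash("Words word a the at fish fishing fishes? /^/ The at a of! @#$!")
# Two unique stems ("word" and "fish") across five surviving tokens,
# matching the length == 2 and word_count == 5 assertions in test_stemming.
```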
23 test/hbase_classifier_test.rb
@@ -1,23 +0,0 @@
-require File.join File.dirname(__FILE__), 'classifier_base'
-require 'ankusa/hbase_storage'
-
-module HBaseClassifierBase
- def initialize(name)
- @freq_tablename = "ankusa_word_frequencies_test"
- @sum_tablename = "ankusa_summary_test"
- @storage = Ankusa::HBaseStorage.new CONFIG['hbase_host'], CONFIG['hbase_port'], @freq_tablename, @sum_tablename
- @freq_table = @storage.hbase.get_table(@freq_tablename)
- @sum_table = @storage.hbase.get_table(@sum_tablename)
- super(name)
- end
-end
-
-class NBClassifierTest < Test::Unit::TestCase
- include HBaseClassifierBase
- include NBClassifierBase
-end
-
-class KLClassifierTest < Test::Unit::TestCase
- include HBaseClassifierBase
- include KLClassifierBase
-end
8 test/helper.rb
@@ -1,8 +0,0 @@
-require 'rubygems'
-require 'test/unit'
-require 'yaml'
-
-$:.unshift(File.join File.dirname(__FILE__), '..', 'lib')
-require 'ankusa'
-
-CONFIG = YAML.load_file File.join(File.dirname(__FILE__), "config.yml")
20 test/memory_classifier_test.rb
@@ -1,20 +0,0 @@
-require File.join File.dirname(__FILE__), 'classifier_base'
-require 'ankusa/memory_storage'
-
-module MemoryClassifierBase
- def initialize(name)
- @storage = Ankusa::MemoryStorage.new
- super name
- end
-end
-
-class NBMemoryClassifierTest < Test::Unit::TestCase
- include MemoryClassifierBase
- include NBClassifierBase
-end
-
-
-class KLMemoryClassifierTest < Test::Unit::TestCase
- include MemoryClassifierBase
- include KLClassifierBase
-end
21 test/mongo_db_classifier_test.rb
@@ -1,21 +0,0 @@
-require File.join File.dirname(__FILE__), 'classifier_base'
-require 'ankusa/mongo_db_storage'
-
-module MongoDbClassifierBase
- def initialize(name)
- @storage = Ankusa::MongoDbStorage.new :host => CONFIG['mongo_db_host'], :port => CONFIG['mongo_db_port'],
- :username => CONFIG['mongo_db_username'], :password => CONFIG['mongo_db_password'],
- :db => 'ankusa-test'
- super(name)
- end
-end
-
-class NBClassifierTest < Test::Unit::TestCase
- include MongoDbClassifierBase
- include NBClassifierBase
-end
-
-class KLClassifierTest < Test::Unit::TestCase
- include MongoDbClassifierBase
- include KLClassifierBase
-end