Point at actively maintained version
1 parent 7070bb3 commit 4586533d82adc5a132bcdb6de2ac439e636f5764 Bryce Nyeggen committed Nov 30, 2012
@@ -1,4 +0,0 @@
-source "http://rubygems.org"
-
-# Specify your gem's dependencies in bandit.gemspec
-gemspec
@@ -1,16 +0,0 @@
-PATH
- remote: .
- specs:
- ankusa (0.0.14)
- fast-stemmer (>= 1.0.0)
-
-GEM
- remote: http://rubygems.org/
- specs:
- fast-stemmer (1.0.1)
-
-PLATFORMS
- ruby
-
-DEPENDENCIES
- ankusa!
@@ -1,137 +1,6 @@
= ankusa
-Ankusa is a text classifier in Ruby that can use either Hadoop's HBase, Mongo, or Cassandra for storage. Because it uses HBase/Mongo/Cassandra as a backend, the training corpus can be many terabytes in size (though additional memory and single file storage abilities also exist for smaller corpora).
-
-Ankusa currently provides both a Naive Bayes and a Kullback-Leibler divergence classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it uses Laplacian smoothing in both classification methods.
-
-== Installation
-First, install HBase/Hadoop, Mongo, or Cassandra (>= 0.7.0-rc2). Then, install the appropriate gem:
- gem install hbaserb
- # or
- gem install cassandra
- # or
- gem install mongo
-
-If you're using HBase, make sure the HBase Thrift interface has been started as well. Then:
- gem install ankusa
-
-== Basic Usage
-Using the naive Bayes classifier:
-
- require 'rubygems'
- require 'ankusa'
- require 'ankusa/hbase_storage'
-
- # connect to HBase. Alternatively, just for this test, use in memory storage with
- # storage = Ankusa::MemoryStorage.new
- storage = Ankusa::HBaseStorage.new 'localhost'
- c = Ankusa::NaiveBayesClassifier.new storage
-
- # Each of these calls will return a bag-of-words
- # hash with stemmed words as keys and counts as values
- c.train :spam, "This is some spammy text"
- c.train :good, "This is not the bad stuff"
-
- # This will return the most likely class (as symbol)
- puts c.classify "This is some spammy text"
-
- # This will return Hash with classes as keys and
- # membership probability as values
- puts c.classifications "This is some spammy text"
-
- # If you have a large corpus, the probabilities will
- # likely all be 0. In that case, you must use log
- # likelihood values
- puts c.log_likelihoods "This is some spammy text"
-
- # get a list of all classes
- puts c.classnames
-
- # close connection
- storage.close
-
-
-== KL Divergence Classifier
-There is a Kullback–Leibler divergence classifier as well. KL divergence is a distance measure (though not a true metric because it does not satisfy the triangle inequality). The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes. The class with the shortest "distance" is the best class. You may find that for an especially large corpus it may be slightly faster to use this classifier (since prior probabilities are never calculated, only likelihoods).
-
-The API is the same as that of the NaiveBayesClassifier, except that to get actual numbers you call "distances" rather than "classifications".
-
- require 'rubygems'
- require 'ankusa'
- require 'ankusa/hbase_storage'
-
- # connect to HBase
- storage = Ankusa::HBaseStorage.new 'localhost'
- c = Ankusa::KLDivergenceClassifier.new storage
-
- # Each of these calls will return a bag-of-words
- # hash with stemmed words as keys and counts as values
- c.train :spam, "This is some spammy text"
- c.train :good, "This is not the bad stuff"
-
- # This will return the most likely class (as symbol)
- puts c.classify "This is some spammy text"
-
- # This will return Hash with classes as keys and
- # distances >= 0 as values
- puts c.distances "This is some spammy text"
-
- # get a list of all classes
- puts c.classnames
-
- # close connection
- storage.close
-
-== Storage Methods
-Ankusa has a generalized storage interface that has been implemented for HBase, Cassandra, Mongo, single file, and in-memory storage.
-
-Memory storage can be used when you have a very small corpus:
- require 'ankusa/memory_storage'
- storage = Ankusa::MemoryStorage.new
-
-FileSystem storage can be used when you have a very small corpus and want to persist the classification results:
- require 'ankusa/file_system_storage'
- storage = Ankusa::FileSystemStorage.new '/path/to/file'
- # Do classification ...
- storage.save
-
-The FileSystem storage does NOT save to the filesystem automatically; the #save method must be invoked to persist the results.
-
-HBase storage:
- require 'ankusa/hbase_storage'
- # defaults: host='localhost', port=9090, frequency_tablename="ankusa_word_frequencies", summary_tablename="ankusa_summary"
- storage = Ankusa::HBaseStorage.new host, port, frequency_tablename, summary_tablename
-
-For Cassandra storage:
-* You will need Cassandra version 0.7.0-rc2 or greater.
-* You will need to set a maximum number of classes, since the current implementation of the Ruby Cassandra client doesn't support table scans.
-* Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: "create keyspace ankusa with replication_factor = 1". This should be fixed with a new release candidate for Cassandra.
-
-To use the Cassandra storage class:
- require 'ankusa/cassandra_storage'
- # defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
- storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes
-
-For MongoDB storage:
- require 'ankusa/mongo_db_storage'
- storage = Ankusa::MongoDbStorage.new :host => "localhost", :port => 27017, :db => "ankusa"
- # defaults: :host => "localhost", :port => 27017, :db => "ankusa"
- # no default username or password
- # you can also use the frequency_tablename and summary_tablename options
-
-
-== Running Tests
-You can run the tests for any of the five storage methods. For instance, for memory storage:
- rake test_memory
-
-For the other methods you will need to edit the file test/config.yml and set the configuration params. Then:
- rake test_hbase
- # or
- rake test_cassandra
- # or
- rake test_filesystem
- # or
- rake test_mongo_db
-
-
+ATTENTION: THIS REPO IS DEPRECATED.
+PLEASE USE https://github.com/bmuller/ankusa FOR THE ACTIVELY MAINTAINED VERSION
+Ankusa is a text classifier in Ruby that can use either Hadoop's HBase, Mongo, or Cassandra for storage.
@@ -1,49 +0,0 @@
-require 'rubygems'
-require 'bundler'
-require 'rake/testtask'
-require 'rdoc/task'
-
-Bundler::GemHelper.install_tasks
-
-desc "Create documentation"
-RDoc::Task.new("doc") { |rdoc|
- rdoc.title = "Ankusa - Naive Bayes classifier with big data storage"
- rdoc.rdoc_dir = 'docs'
- rdoc.rdoc_files.include('README.rdoc')
- rdoc.rdoc_files.include('lib/**/*.rb')
-}
-
-desc "Run all unit tests with memory storage"
-Rake::TestTask.new("test_memory") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/memory_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with HBase storage"
-Rake::TestTask.new("test_hbase") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with Cassandra storage"
-Rake::TestTask.new("test_cassandra") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/cassandra_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with FileSystem storage"
-Rake::TestTask.new("test_filesystem") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/file_system_classifier_test.rb']
- t.verbose = true
-}
-
-desc "Run all unit tests with MongoDb storage"
-Rake::TestTask.new("test_mongo_db") { |t|
- t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb', 'test/mongo_db_classifier_test.rb']
- t.verbose = true
-}
@@ -1,20 +0,0 @@
-$:.push File.expand_path("../lib", __FILE__)
-require "ankusa/version"
-require "rake"
-require "date"
-
-Gem::Specification.new do |s|
- s.name = "ankusa"
- s.version = Ankusa::VERSION
- s.authors = ["Brian Muller"]
- s.date = Date.today.to_s
- s.description = "Text classifier with HBase, Cassandra, or Mongo storage"
- s.summary = "Text classifier in Ruby that uses Hadoop's HBase, Cassandra, or Mongo for storage"
- s.email = "brian.muller@livingsocial.com"
- s.files = FileList["lib/**/*", "[A-Z]*", "Rakefile", "docs/**/*"]
- s.homepage = "https://github.com/livingsocial/ankusa"
- s.require_paths = ["lib"]
- s.add_dependency('fast-stemmer', '>= 1.0.0')
- s.requirements << "Either hbaserb >= 0.0.3 or cassandra >= 0.7"
- s.rubyforge_project = "ankusa"
-end
@@ -1,6 +0,0 @@
-require 'ankusa/version'
-require 'ankusa/extensions'
-require 'ankusa/classifier'
-require 'ankusa/naive_bayes'
-require 'ankusa/kl_divergence'
-require 'ankusa/hasher'
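
For readers who no longer have the removed README to hand, the KL divergence classification it describes (relative entropy between the text and each class, with Laplacian smoothing, smallest "distance" wins) can be sketched roughly as follows. This is a minimal in-memory illustration, not Ankusa's actual implementation: the class ToyKLClassifier, its tokenizer, and the choice of add-one smoothing over the combined vocabulary are all assumptions made for the example, and it skips the stemming and stop-word removal that Ankusa performs.

```ruby
# Illustrative only: ToyKLClassifier is a hypothetical name, not part of Ankusa.
class ToyKLClassifier
  def initialize
    # class name => { word => count }
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  end

  # Naive tokenizer: lowercase, keep runs of letters (no stemming, no stop words).
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(klass, text)
    tokenize(text).each { |w| @counts[klass][w] += 1 }
  end

  # Add-one (Laplace) smoothed probability of a word under a class.
  def word_prob(klass, word, vocab_size)
    total = @counts[klass].values.sum
    (@counts[klass][word] + 1.0) / (total + vocab_size)
  end

  # KL divergence D(P_text || P_class) for every trained class.
  def distances(text)
    words = tokenize(text)
    doc = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
    # Vocabulary covers trained words plus the words of the text itself.
    vocab_size = (@counts.values.flat_map(&:keys) + words).uniq.size
    @counts.keys.each_with_object({}) do |klass, out|
      out[klass] = doc.sum do |w, n|
        p_doc = n.to_f / words.size
        p_doc * Math.log(p_doc / word_prob(klass, w, vocab_size))
      end
    end
  end

  # The class with the smallest divergence is the best match.
  def classify(text)
    distances(text).min_by { |_, d| d }.first
  end
end

c = ToyKLClassifier.new
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"
puts c.classify("This is some spammy text")   # prints "spam"
```

Because no class priors appear anywhere in `distances`, the sketch also shows why the README notes that only likelihoods are needed for this classifier.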