GitHub - industria/erl_classifier: Naive Bayes classifier in Erlang.

Naive Bayes classifier in Erlang

The classifier can be used to solve any-of classification problems using N
two-class classifiers, one per class.

Note: All files are expected to be UTF-8 encoded.

Sections:

1) Configuring the classifier
2) Stop word lists
3) Setting up the mnesia datastore
4) Document file format
5) Dialyzer on Ubuntu

=============================
1. Configuring the classifier
=============================
The classifier is configured via the env part of the app term
in erl_classifier.app. The following items must be set up:

1) Language: {language, language_atom}
   The supported language atoms are danish and swedish.

2) Classes: {classes, [class1, ..., classn]}
   Lists the classes of the classifier. Each class corresponds to one of the
   two-class classifiers the application will contain.

3) Term length ranges: {min_term_length, integer} and {max_term_length, integer}
   The minimum and maximum length of terms that should be processed. The range
   filtering is done before the stemming, so it is possible for terms
   shorter than min_term_length to appear in the terms table.
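
As an illustration, the three items above could appear in the env part of
erl_classifier.app roughly as follows. This is only a sketch: the class names
and the two length values are made up for the example and are not taken from
the project.

```erlang
%% Excerpt of the env entry in erl_classifier.app (illustrative values only).
{env, [
    {language, danish},                        % danish or swedish
    {classes, [football, handball, cycling]},  % hypothetical class names
    {min_term_length, 3},                      % assumed value
    {max_term_length, 25}                      % assumed value
]}
```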


==================
2. Stop word lists
==================

1) Stop word lists are kept in priv/stopwords/.
2) Each file must be named lang.stopwords, where lang is the language of the stop words in the file.
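
The naming rule can be sketched as a small helper that builds the expected
path. It assumes that lang is the same atom as the one used in the language
setting (for example danish), which this README does not state explicitly.
The module name is hypothetical and not part of the project.

```erlang
-module(stopword_path_sketch).
-export([path/1]).

%% Build the expected stop word file path for a language atom,
%% assuming the file name is the atom followed by ".stopwords".
path(Language) when is_atom(Language) ->
    filename:join(["priv", "stopwords", atom_to_list(Language) ++ ".stopwords"]).
```

With that assumption, stopword_path_sketch:path(danish) gives
"priv/stopwords/danish.stopwords".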


==================================
3. Setting up the mnesia datastore
==================================

1) Setting up the schema
   make console
   > mnesia:stop().
   > mnesia:create_schema([node()]).
   > ctrl-g q -- to stop the shell

2) Setting up tables with the sport demo data
   make console
   > ec_store:init_tables().
   > ec_store:stopword_update().
   > ec_docfile:train("test/documents/sport_demo").
   > ec_store:change_to_disc_copies().
 
3) Classifying text (you can run ec_any_of:classify as many times as you like)
   make console
   > ec_any_of:classify(<<"your text in danish"/utf8>>).
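
To check that step 2 left the tables stored on disc, the storage type of every
table can be listed with the standard mnesia functions. This is a minimal
sketch: the module name is hypothetical, and since this README does not list
the tables that ec_store:init_tables() creates, it simply reports all tables
known to the node.

```erlang
-module(storage_check_sketch).
-export([storage_types/0]).

%% Return [{Table, StorageType}] for every table mnesia knows about on this node.
%% StorageType is ram_copies, disc_copies, disc_only_copies or unknown.
%% mnesia must be running when this is called.
storage_types() ->
    [{T, mnesia:table_info(T, storage_type)} || T <- mnesia:system_info(tables)].
```

The list comprehension can also be pasted directly into the console started
with make console.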


=======================
4. Document file format
=======================

Training a classifier can be done using the ec_coach API or the utility module ec_docfile, which
reads a file in a predefined format and makes calls to the ec_coach API per document. This is
an easy way to train the classifier with large numbers of documents. The precondition for using
ec_docfile to train the classifier is of course that the documents fit the document
file format.

The document file format is a really simple text format containing one document and its 
classes per line. The format is as follows:

Document text,class-1,class-...,class-n 

The main requirements for the format are:

- No commas in the document text
- No LF or CR in the document text
- File must be encoded in UTF-8

Note: Before calling ec_docfile:train you have to initialize the classifier with the classes
in the file. The classes can be extracted from the file using ec_docfile:classes("filename").
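
To make the line format concrete, here is a minimal sketch of how one line
could be split into the document text and its classes. It is not the code used
by ec_docfile. The module name is hypothetical, the classes are kept as
binaries purely for simplicity, and the sample line in the note below is made up.

```erlang
-module(docline_sketch).
-export([parse/1]).

%% Split one line of a document file into {Text, Classes}.
%% The first comma-separated field is the document text, the rest are classes.
%% This relies on the rule that the document text contains no commas.
parse(Line) when is_binary(Line) ->
    Trimmed = string:trim(Line),                       % drop trailing LF/CR and spaces
    [Text | Classes] = binary:split(Trimmed, <<",">>, [global]),
    {Text, [string:trim(C) || C <- Classes]}.
```

For example, docline_sketch:parse(<<"FCK vandt kampen,fodbold,sport\n">>)
returns {<<"FCK vandt kampen">>, [<<"fodbold">>, <<"sport">>]}.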


=====================
5. Dialyzer on Ubuntu
=====================

The dialyzer target in the Makefile will not work on a package-based Ubuntu installation,
because the packages are built without +debug_info, which strips the debug info from the release.

This effectively breaks Erlang development on Ubuntu, and as far as I have been able to gather there
is no supported package with that option set. Without it dialyzer cannot run, so the
Ubuntu Erlang distribution is broken for anything needing dialyzer.

The solution is to uninstall all Erlang packages and build a new install from the source distribution
on erlang.org.