industria/erl_classifier
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Naive Bayes classifier in Erlang The classifier can be used to solve any-of classifications using the approach of N two-class classifiers. Note: All files are expected to be UTF8 encoded. Section: 1) Configuring the classifier 2) Stop word lists 3) Setting up the mnesia datastore 4) Document file format 5) Dialyzer on Ubuntu ============================= 1. Configuring the classifier ============================= The classifier is configured via the env part of the app term in erl_classifier.app. The following items most be setup: 1) Language: {language, language_atom} language atoms supported is danish and swedish. 2) Classes: {classes, [class1, ..., classn]} List the classes of the classifier corresponding to two-class classifiers the application will contain. 3) Term length ranges: {min_term_length, integer} and {max_term_length, integer} The minimum and maximum length of terms that should be processes. The range filtering is done before the stemming, therefore its possible for terms shorter that min_term_length to appear in the terms table. ================== 2. Stop word lists ================== 1) Are kept in priv/stopwords/ 2) Must be named lang.stopwords, where lang is the language of the stop words in the file. ================================== 3. Setting up the mnesia datastore ================================== 1) Setting up the schema make console > mnesia:stop(). > mnesia:create_schema([node()]). > ctrl-g q -- to stop the shell 2) Setting up tables with the sport demo data make console > ec_store:init_tables(). > ec_store:stopword_update(). > ec_docfile:train("test/documents/sport_demo"). > ec_store:change_to_disc_copies(). 3) Classifying text (you can run the ec_any_of:classify as many times as you like) make console > ec_any_of:classify(<<"your text in danish"/utf8>>). ======================= 4. Document file format ======================= Training a classifier can be done using the ec_coach API or the utility module ec_docfile which reads a file in a predefined format and make calls to the ec_coach API per document. This is an easy way to train the classifier with large amounts of document. The precondition for using ec_docfile to train the classifier is ofcause that the documents can be fit into the document file format. The document file format is a really simple text format containing one document and its classes per line. The format is as follows: Document text,class-1,class-...,class-n The main requirement for the format is: - No commas in the document text - No LF or CR in the document text - File must be encoded in UTF-8 Note: Before calling ec_docfile:train you have to initialize the classifier with the classes in the file. The classes can be extracted from the file using ec_docfile:classes("filename"). ===================== 5. Dialyzer on Ubuntu ===================== The dialyzer target in the Makefile will not work if you are on a package based Ubuntu installation, because they decided to strip the debug info from the release by not building with +debug_info. This effectivly breaks Erlang development on Ubuntu and as fare as I have been able to gather there is no supported package with that option set. Without it dialyzer will not be able to run so the Ubuntu Erlang distribution is broken for anything needing dialyzer. The solution is to uninstall all Erlang packages and build a new install from the source distribution on erlang.org.
About
Naive Bayes classifier in Erlang.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published