paramsearch 1.3

copyright (c) 2003-2011, Antal van den Bosch

    paramsearch is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 3 of the License, or
    (at your option) any later version.
    
    paramsearch is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    
    You should have received a copy of the GNU General Public License
    along with this program; if not, see <http://www.gnu.org/licenses/>.
    
Author: Antal van den Bosch / antalb@uvt.nl / http://ilk.uvt.nl


GENERAL INFORMATION

Given an instance base (a data file containing examples of some
classification task, where each example is represented by a list of
feature values and a class label), paramsearch produces a combination
of algorithmic parameter settings for a machine learning algorithm of
choice, estimated to do well on unseen material from the same source
as the input instance base. Paramsearch implements two heuristics for
searching multi-dimensional algorithmic parameter spaces:

1. cross-validated classifier wrapping, recombining parameter settings
   pseudo-exhaustively, for small data sets (less than 1000 instances);
2. wrapped progressive sampling for larger data sets (>=1000 instances).

The main goal of paramsearch is to provide a fast approximation of
utopian exhaustive parameter optimisation-by-validation. While small
data sets do allow for some pseudo-exhaustive classifier wrapping, for
larger data sets paramsearch uses wrapped progressive sampling, a
combination of classifier wrapping and progressive sampling.

A reasonably apt metaphor for wrapped progressive sampling is that of
a competition among mountaineers climbing a mountain. At the start of
the competition, some thousand mountaineers run up the foot of the
mountain and start to climb. After some time, the group of
mountaineers that has been least successful in terms of the height
reached (established by a one-dimensional clustering along the height
dimension) is dropped from the competition, and the race continues
with the remaining contestants, higher up the mountain, where
climbing becomes increasingly difficult. This competition-in-stages
is repeated until only one mountaineer is left, or until the top of
the mountain is reached by a group of mountaineers, in which case one
random contestant from this group is selected.

Wrapped progressive sampling always starts with a 500-instance
training set and a 100-instance test set. During the progressive
sampling steps, the training set is grown exponentially to 80% of the
full training data, while the test set is grown in step, its size
kept at 20% of the training set's.
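
To illustrate the schedule, here is a small shell sketch. The
100,000-instance data set and the doubling factor are assumptions
made purely for the example (paramsearch computes its own growth
factor); the 500/100 start, the 80% ceiling, and the fixed 1:5
test-to-train ratio follow the description above:

> awk 'BEGIN {
    full = 100000                          # hypothetical full data set size
    for (train = 500; train < 0.8 * full; train *= 2)  # doubling is an assumed factor
      printf "train=%d  test=%d\n", train, train / 5
    printf "train=%d  test=%d\n", 0.8 * full, 0.8 * full / 5
  }'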

Paramsearch currently supports the following machine learning
algorithms: IB1 (plus several TiMBL variants), Fambl, SVM-Light,
Ripper, Zhang Le's Maxent, C4.5, Winnow, and Perceptron (the latter
two via SNoW). More details below.


INSTALLATION

Unpack the paramsearch.1.3.tar.gz tarball, go to the paramsearch
directory, and type "make":

> tar zxf paramsearch.1.3.tar.gz
> cd paramsearch.1.3
> make

This generates a number of executables. Install the "paramsearch"
executable in a directory in your $PATH, e.g. /usr/local/bin if you
have write permission there:

> cp paramsearch /usr/local/bin

Then, set the environment variable PARAMSEARCH_DIR to the paramsearch
directory.

In (t)csh: > setenv PARAMSEARCH_DIR /the/path/to/paramsearch
In bash:   > export PARAMSEARCH_DIR=/the/path/to/paramsearch


USAGE

> paramsearch <algorithm> <trainingfile> [extra]

Where <algorithm> is ib1, ib1-bin, ibn, ib1-sparse, igtree, tribl2,
fambl, svmlight, ripper, maxent, c4.5, winnow, or perceptron. The
chosen <algorithm> must be installed and present in your $PATH.
[extra] contains either an optional metric modifier for IB1, or the
(obligatory!) number of classes for Winnow and Perceptron (SNoW, the
software underlying both, requires the number of classes to be
specified).
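
For example (file names and the class count are hypothetical):

> paramsearch ib1 mydata.train
> paramsearch winnow mydata.train 5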

An optional metric modifier for IB1 is attached to -mO, -mM, or -mJ,
and can be used to ignore features or to overrule similarity-function
settings for certain features (e.g. to declare them numeric). An
example metric setting that ignores features 1 and 2 and declares
feature 4 numeric is :I1,2:N4 (note the leading colon).
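
A hedged example (the file name is hypothetical, and we assume the
bare modifier is passed as the [extra] argument, to be attached by
paramsearch to whichever metric flag it tries):

> paramsearch ib1 mydata.train :I1,2:N4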

More details on the algorithms:

- ib1        IB1, from the TiMBL software package. Required
             version: 6.4 or higher. Currently tested
             version: 6.4. See http://ilk.uvt.nl/timbl
- ib1-bin    An emulation of IB1 inside the Fambl package,
             with automatically binarized multi-valued features.
             Required version: 2.1.10 or higher.
             See http://ilk.uvt.nl/~antalb
- ibn        IB1 from TiMBL with numeric feature metrics only.
             Tested version: 6.4. See http://ilk.uvt.nl/timbl
- igtree     IGTree, from the TiMBL software package. Tested
             version: 6.4. See http://ilk.uvt.nl/timbl
- tribl2     TRIBL2, from the TiMBL software package. Tested
             version: 6.4. See http://ilk.uvt.nl/timbl
- ib1-sparse IB1 with sparse feature coding (-F Sparse). Tested
             version: 6.4. See http://ilk.uvt.nl/timbl
- fambl      Fambl. Required version: 2.1.10 or higher.
             See http://ilk.uvt.nl/~antalb
- svmlight   SVM-Light by Thorsten Joachims. Tested version: 5.00.
             See http://svmlight.joachims.org
- ripper     Ripper by William Cohen. Tested version: V1 release
             2.5 (patch 1). See http://www.wcohen.com
- maxent     Maximum entropy classifier by Zhang Le. Tested
             version: 20041229. See
             http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
- c4.5       C4.5 by J. Ross Quinlan. Tested version: release 8.
             See http://www.cse.unsw.edu.au/~quinlan/
- winnow     Sparse Network of Winnow as implemented in SNoW by
             Carlson, Cumby, Rosen, & Roth. Tested version: 3.1.3.
             See http://l2r.cs.uiuc.edu/~danr/snow.html
- perceptron Perceptron (sparse implementation) as implemented
             in SNoW by Carlson, Cumby, Rosen, & Roth. Tested
             version: 3.1.3. See http://l2r.cs.uiuc.edu/~danr/snow.html

Note that

(1) <trainingfile> must adhere to the data formatting requirements
    of <algorithm> (these differ considerably among the currently
    supported algorithms; paramsearch does not do any conversion).

(2) C4.5 expects a "names" file declaring all feature values and
    class names present in the training and test material. This file
    must be named <filestem>.data.names for paramsearch to operate
    correctly. (TiMBL and Ripper can generate names files
    automatically.) See the sketch after these notes.

(3) C4.5 and Ripper report error rates, not accuracies.

(4) Winnow and Perceptron demand a third command-line argument
    stating the number of classes.
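
For illustration, a minimal sketch of a names file in C4.5's format
(all class and feature names below are hypothetical): the first line
lists the class labels, each subsequent line declares one feature
with its discrete values or as continuous, and "|" starts a comment.

    yes, no.                      | class labels
    colour: red, green, blue.    | discrete feature
    length: continuous.          | numeric feature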


ADDITIONAL TOOLS

The paramsearch distribution contains the following tools:

* runfull-<algorithm>: runs a full experiment with a <trainingfile>
and a <testfile>, using the settings logged in
<trainingfile>.<algorithm>.bestsetting. This saves you from having to
type in the best settings found in the *.bestsetting file; see the
example after this list.

* make_binary.simpel.pl: written by Iris Hendrickx. Rewrites a TiMBL
data file into a "binary" file suited for SVM-Light, Maxent, or SNoW
(winnow and perceptron).

Usage: perl make_binary.simpel.pl [svm | snow]  trainfile ( testfile )

See the file for more information.
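
For example (file names are hypothetical, and the training-file,
test-file argument order is assumed from the description above),
after paramsearch has produced mydata.train.ib1.bestsetting:

> runfull-ib1 mydata.train mydata.test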


ACKNOWLEDGEMENTS

Thanks to Iris Hendrickx, Walter Daelemans, Dan Roth, William Cohen,
Zhang Le, Erwin Marsi, Piroska Lendvai, Erik Tjong Kim Sang, Bertjan
Busser, Ko van der Sloot, Jakub Zavrel, Sabine Buchholz, Jorn
Veenstra, Sander Canisius, Menno van Zaanen, Anders Noklestad, Pei-yun
Hsueh, Guy De Pauw, Gunn Inger Lyse, Svetoslav Marinov, Handre
Groenewald, Kim Luyckx, and Maarten van Gompel for merciless testing
and for passing along pieces of the puzzle.


BUGS

This software is in development and may contain errors. Please send
all bug reports to Antal van den Bosch <Antal.vdnBosch@uvt.nl>.


DISCLAIMER

Paramsearch comes WITHOUT ANY WARRANTY. Neither the author nor the
distributor accepts responsibility to anyone for the consequences of
using it, or for whether it serves any particular purpose or works at
all.
