This is the SHOGUN machine learning toolbox.
(see INSTALL for first steps on installation and running shogun)
(see README.data for how to download example data sets accompanying shogun)
INTRODUCTION
The machine learning toolbox's focus is on large scale kernel methods and
especially on Support Vector Machines (SVM)[1]. It provides a generic SVM
object interfacing to several different SVM implementations, among them the
state of the art LibSVM[3] and SVMlight[4]. Each of the SVMs can be
combined with a variety of kernels. The toolbox not only provides efficient
implementations of the most common kernels, like the Linear, Polynomial,
Gaussian and Sigmoid Kernel, but also comes with a number of recent string
kernels, e.g. the Locality Improved[5], Fisher[6], TOP[7], Spectrum[8] and
Weighted Degree Kernel (with shifts)[9][10][11]. For the latter, the efficient
LINADD[11] optimizations are implemented. SHOGUN also offers the freedom of
working with custom pre-computed kernels. One of its key features is the
``combined kernel'', which can be constructed as a weighted linear combination
of a number of sub-kernels, which need not all work on the same
domain. An optimal sub-kernel weighting can be learned using Multiple Kernel
Learning[12][13][17].
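
As a rough illustration of the combined kernel idea, the following plain-Python
sketch builds a weighted sum of two sub-kernels that work on different domains:
a Gaussian kernel on real-valued vectors and a spectrum kernel on strings. It
does not use the SHOGUN API; all function names, the kernel width
parameterization and the example data are assumptions made up for this sketch.
The weights are fixed by hand here, whereas Multiple Kernel Learning would
learn them from data.

```python
import math


def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel on real-valued vectors (hypothetical parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width)


def spectrum_kernel(s, t, k=2):
    """Spectrum kernel on strings: dot product of the k-mer count vectors."""
    def kmer_counts(seq):
        counts = {}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
        return counts

    counts_s, counts_t = kmer_counts(s), kmer_counts(t)
    return sum(n * counts_t.get(kmer, 0) for kmer, n in counts_s.items())


def combined_kernel(a, b, sub_kernels, weights):
    """Weighted linear combination of sub-kernels.

    a and b are tuples with one entry per sub-kernel, so each sub-kernel
    only sees the part of the example that lives in its own domain.
    """
    return sum(w * kern(part_a, part_b)
               for w, kern, part_a, part_b in zip(weights, sub_kernels, a, b))


# Each example consists of a real-valued vector and a string (illustrative data).
example_1 = ([0.0, 1.0], "ACGT")
example_2 = ([0.0, 1.0], "CGTA")

value = combined_kernel(example_1, example_2,
                        sub_kernels=[gaussian_kernel, spectrum_kernel],
                        weights=[0.5, 2.0])
print(value)  # 0.5 * 1.0 + 2.0 * 2 = 4.5
```

The two strings share the 2-mers CG and GT, so the spectrum kernel contributes
2, and the identical vectors give a Gaussian kernel value of 1.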
Currently, SVM 2-class classification and regression problems can be dealt
with. SHOGUN also implements a number of linear methods like Linear
Discriminant Analysis (LDA), Linear Programming Machine (LPM) and (Kernel)
Perceptrons, and features algorithms to train hidden Markov models.
The input feature objects can be dense, sparse or strings, can be
of type int/short/double/char, and can be converted into different feature types.
Chains of ``preprocessors'' (e.g. subtracting the mean) can be attached to
each feature object, allowing for on-the-fly pre-processing.
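
The preprocessor chain can be pictured with the following plain-Python sketch.
It is not SHOGUN code: the class and method names (Features, SubtractMean,
NormalizeLength, add_preprocessor, get_vector) are hypothetical and only meant
to show how a chain attached to a feature object can transform vectors
on the fly while the stored data stays untouched.

```python
import math


class SubtractMean:
    """Preprocessor that removes the per-dimension mean."""

    def fit(self, rows):
        n = len(rows)
        self.mean = [sum(col) / n for col in zip(*rows)]
        return self

    def apply(self, row):
        return [v - m for v, m in zip(row, self.mean)]


class NormalizeLength:
    """Preprocessor that scales each vector to unit Euclidean length."""

    def fit(self, rows):
        return self  # nothing to estimate

    def apply(self, row):
        norm = math.sqrt(sum(v * v for v in row))
        return [v / norm for v in row] if norm > 0 else list(row)


class Features:
    """Dense feature object with an attached chain of preprocessors."""

    def __init__(self, rows):
        self.rows = rows
        self.preprocessors = []

    def add_preprocessor(self, preproc):
        # Fit on the data as seen through the chain built so far.
        preproc.fit([self.get_vector(i) for i in range(len(self.rows))])
        self.preprocessors.append(preproc)

    def get_vector(self, i):
        # The chain is applied on access; self.rows is never modified.
        row = list(self.rows[i])
        for preproc in self.preprocessors:
            row = preproc.apply(row)
        return row


feats = Features([[1.0, 2.0], [3.0, 6.0]])
feats.add_preprocessor(SubtractMean())
centered = feats.get_vector(0)   # [-1.0, -2.0]
feats.add_preprocessor(NormalizeLength())
unit = feats.get_vector(0)       # centered vector scaled to length 1
```

Applying the chain at access time means the raw data is stored only once and
the same feature object can be reused with a different chain.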
INTERFACES
SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave,
Java, C#, Ruby, Lua and Python.
PLATFORMS
Debian GNU/Linux, Mac OS X and WIN32/CYGWIN are supported platforms (see
the INSTALL file for generic and platform specific installation instructions).
DIRECTORY CONTENT
README       - this file
Makefile     - to create release archives
src          - shogun source code
data         - shogun data sets (required for some examples / applications;
               these need to be downloaded separately via the download site or
               git submodule update --init from the root of the git checkout)
doc          - documentation (to be built using doxygen)
examples     - example files for all interfaces
applications - applications of shogun
benchmarks   - speed benchmarks
tests        - unit and integration tests
Current build status of master: http://shogun-toolbox.org/buildbot/waterfall
Travis CI checks: https://travis-ci.org/shogun-toolbox/shogun
The following table depicts the status of each interface available in shogun:
+==================+===========================================================+
| interface        | status                                                    |
+==================+===========================================================+
|python_modular    | mature (no known problems)                                |
|octave_modular    | mature (no known problems)                                |
|java_modular      | stable (no known problems; not all examples are ported)   |
|ruby_modular      | stable (no known problems; only a few examples ported)    |
|csharp_modular    | stable (no known problems; not all examples are ported)   |
|lua_modular       | alpha (some examples work, string typemaps are unstable)  |
|perl_modular      | pre-alpha (work in progress)                              |
|r_modular         | pre-alpha quality (swig does not properly handle reference|
|                  | counting, so it is only for the brave: use                |
|                  | --disable-reference-counting to get it to work, but beware|
|                  | that it will leak memory; disabled by default.)           |
+------------------+-----------------------------------------------------------+
|octave_static     | mature (no known problems)                                |
|matlab_static     | mature (no known problems)                                |
|python_static     | mature (no known problems)                                |
|r_static          | mature (no known problems)                                |
|libshogun_static  | mature (no known problems)                                |
|cmdline_static    | stable, but some data types are incomplete                |
|                  |                                                           |
|elwms_static      | this is the eierlegendewollmilchsau interface, a chimera  |
|                  | that in one file interfaces with python, octave, r, matlab|
|                  | and provides the run_python command to run code in python |
|                  | using the variables available in octave, r, matlab, etc.  |
+==================+===========================================================+
Visit src/README and http://www.shogun-toolbox.org/doc/en/current for further information.
APPLICATIONS
We have successfully used this toolbox to tackle the following sequence
analysis problems: Protein Super Family classification[7],
Splice Site Prediction[9][14][15], Interpreting the SVM Classifier[12][13],
Splice Form Prediction[9], Alternative Splicing[10] and Promoter
Prediction[16]. Some of them come with no fewer than 10
million training examples, others with 7 billion test examples.
LICENSE
Except for the files classifier/svm/Optimizer.{cpp,h},
classifier/svm/SVM_light.{cpp,h}, regression/svr/SVR_light.{cpp,h}
and the kernel caching functions in kernel/Kernel.{cpp,h}
which are (C) Torsten Joachims and follow a different
licensing scheme (cf. LICENSE.SVMLight), SHOGUN is licensed under the GPL
version 3 or any later version (cf. LICENSE).
AVAILABILITY
SHOGUN can be downloaded at
http://www.shogun-toolbox.org
REFERENCES
[1] C. Cortes and V.N. Vapnik. Support-vector networks.
    Machine Learning, 20(3):273-297, 1995.
[2] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient
    Projections. Arizona State University, 2009.
    http://www.public.asu.edu/~jye02/Software/SLEP.
[3] C.-C. Chang and C.-J. Lin. Libsvm: Introduction and benchmarks.
    Technical report, Department of Computer Science and Information
    Engineering, National Taiwan University, Taipei, 2000.
[4] T. Joachims. Making large-scale SVM learning practical. In B. Schoelkopf,
    C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods -
    Support Vector Learning, pages 169-184, Cambridge, MA, 1999. MIT Press.
[5] A. Zien, G. Raetsch, S. Mika, B. Schoelkopf, T. Lengauer, and K.-R.
    Mueller. Engineering Support Vector Machine Kernels That Recognize
    Translation Initiation Sites. Bioinformatics, 16(9):799-807, September 2000.
[6] T.S. Jaakkola and D. Haussler. Exploiting generative models in
    discriminative classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn,
    editors, Advances in Neural Information Processing Systems, volume 11,
    pages 487-493, 1999.
[7] K. Tsuda, M. Kawanabe, G. Raetsch, S. Sonnenburg, and K.R. Mueller.
    A new discriminative kernel from probabilistic models.
    Neural Computation, 14:2397-2414, 2002.
[8] C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel
    for SVM protein classification. In R.B. Altman, A.K. Dunker, L. Hunter,
    K. Lauderdale, and T.E. Klein, editors, Proceedings of the Pacific
    Symposium on Biocomputing, pages 564-575, Kaua'i, Hawaii, 2002.
[9] G. Raetsch and S. Sonnenburg. Accurate Splice Site Prediction for
    Caenorhabditis Elegans, pages 277-298. MIT Press series on Computational
    Molecular Biology. MIT Press, 2004.
[10] G. Raetsch, S. Sonnenburg, and B. Schoelkopf. RASE: recognition of
     alternatively spliced exons in C. elegans. Bioinformatics,
     21:i369-i377, June 2005.
[11] S. Sonnenburg, G. Raetsch, and B. Schoelkopf. Large scale genomic sequence
     SVM classifiers. In Proceedings of the 22nd International Machine Learning
     Conference. ACM Press, 2005.
[12] S. Sonnenburg, G. Raetsch, and C. Schaefer. Learning interpretable SVMs
     for biological sequence classification. In RECOMB 2005, LNBI 3500,
     pages 389-407. Springer-Verlag Berlin Heidelberg, 2005.
[13] G. Raetsch, S. Sonnenburg, and C. Schaefer. Learning Interpretable SVMs
     for Biological Sequence Classification. BMC Bioinformatics, Special Issue
     from NIPS workshop on New Problems and Methods in Computational Biology,
     Whistler, Canada, 18 December 2004, 7(Suppl. 1):S9, March 2006.
[14] S. Sonnenburg. New methods for splice site recognition. Master's thesis,
     Humboldt University, 2002. Supervised by K.-R. Mueller, H.-D. Burkhard and
     G. Raetsch.
[15] S. Sonnenburg, G. Raetsch, A. Jagota, and K.-R. Mueller. New methods for
     splice-site recognition. In Proceedings of the International Conference on
     Artificial Neural Networks, 2002. Copyright by Springer.
[16] S. Sonnenburg, A. Zien, and G. Raetsch. ARTS: Accurate Recognition of
     Transcription Starts in Human. 2006.
[17] S. Sonnenburg, G. Raetsch, C. Schaefer, and B. Schoelkopf. Large Scale
     Multiple Kernel Learning. Journal of Machine Learning Research, 2006.
     K. Bennett and E.P.-Hernandez, editors.