Copyright 2017- Tatsuhiro Aoshima (hiro4bbh@gmail.com).
Package sticker provides a framework for multi-label classification.
sticker is written in Go, so anyone can easily modify and compile it in almost any environment. You can read sticker's documentation on GoDoc.
First, download and install Go. Next, get and install sticker as follows:
go get github.com/hiro4bbh/sticker
go install github.com/hiro4bbh/sticker/sticker-util
Once everything is installed, you can try sticker's command-line utility sticker-util now!
First of all, you should prepare datasets. sticker assumes the following directory structure for a dataset:
+ dataset-root
|-- train.txt: training dataset
|-- test.txt: test dataset
|-- feature_map.txt: feature map (optional)
|-- label_map.txt: label map (optional)
Training and test datasets must be formatted so that ReadTextDataset can handle them (see GoDoc for the data format).
Feature and label maps should enumerate the name of each feature and label, one per line, in order of identifier.
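The authoritative description of the format accepted by ReadTextDataset is on GoDoc. As a rough illustration only (assuming the layout common to the XMLC repository datasets: each example line holds comma-separated label IDs followed by space-separated feature:value pairs — this layout is an assumption here, not a specification of sticker's parser), parsing one such line might look like:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseExample parses one data line of the assumed XMLC text format:
// "label,label,... feature:value feature:value ...".
func parseExample(line string) (labels []int, features map[int]float64, err error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return nil, nil, fmt.Errorf("empty line")
	}
	// The first field holds the comma-separated label identifiers.
	for _, s := range strings.Split(fields[0], ",") {
		l, err := strconv.Atoi(s)
		if err != nil {
			return nil, nil, err
		}
		labels = append(labels, l)
	}
	// The remaining fields are sparse "featureID:value" pairs.
	features = make(map[int]float64)
	for _, pair := range fields[1:] {
		kv := strings.SplitN(pair, ":", 2)
		if len(kv) != 2 {
			return nil, nil, fmt.Errorf("malformed feature pair %q", pair)
		}
		k, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, nil, err
		}
		v, err := strconv.ParseFloat(kv[1], 64)
		if err != nil {
			return nil, nil, err
		}
		features[k] = v
	}
	return labels, features, nil
}

func main() {
	labels, features, err := parseExample("3,7 0:1.5 42:0.25")
	fmt.Println(labels, features, err)
}
```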
You can check the summary of the dataset at localhost:8080/summary as follows (you can change the port number with the option addr):
sticker-util -verbose -debug <dataset-root> @summarize -table=<table-filename-relative-to-root>
If featureMap and labelMap are empty strings, then the feature and label maps are ignored, respectively.
LabelNearest is the Sparse Weighted Nearest-Neighbor Method (Aoshima+ 2018), which achieved state-of-the-art performance on several XMLC datasets (Bhatia+ 2016).
Recently, the model has become faster, processing each data entry in 15.1 ms (AmazonCat-13K), 1.14 ms (Wiki10-31K), 4.88 ms (Delicious-200K), 15.1 ms (WikiLSHTC-325K), 4.19 ms (Amazon-670K), and 15.5 ms (Amazon-3M) on average, under the same settings as in the paper (compare with the original results).
For example, you can test this method on Amazon-3M dataset (Bhatia+ 2016) as follows:
sticker-util -verbose -debug ./data/Amazon-3M/ @trainNearest @testNearest -S=75 -alpha=2.0 -beta=1
See the help of @trainNearest and @testNearest for the sub-command options.
LabelNear is a faster implementation of LabelNearest which uses optimal Densified One Permutation Hashing (DOPH) and reservoir sampling.
This method can process each data entry in about 1 ms with little performance degradation.
You can see the results on several XMLC datasets (Bhatia+ 2016) on Dropbox.
Most parameters and options are the same as those of LabelNearest.
See the help of @trainNear and @testNear for details.
- LabelConst: Multi-label constant model (see GoDoc)
- LabelOne: One-versus-rest classifier for multi-label ranking (see GoDoc)
- LabelBoost: Multi-label Boosting model (see GoDoc)
- LabelForest: Variously-modified FastXML model (see GoDoc)
- LabelNext: Your next-generation model (you can add your own train and test commands; see plugin/next/init.go)
- L1Logistic_PrimalSGD: L1-logistic regression with stochastic gradient descent (SGD) solving the primal problem (see GoDoc)
- L1SVC_PrimalSGD: L1-Support Vector Classifier with SGD solving the primal problem (see GoDoc)
- L1SVC_DualCD: L1-Support Vector Classifier with coordinate descent (CD) solving the dual problem (see GoDoc)
- L2SVC_PrimalCD: L2-Support Vector Classifier with CD solving the primal problem (see GoDoc)
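To make the kind of problem these binary classifiers solve concrete, here is a toy sketch of L1-regularized logistic regression trained by plain SGD on the primal objective. This is not sticker's implementation: it uses a naive subgradient step for the L1 term, whereas practical solvers use more careful truncation schemes; see GoDoc for the real L1Logistic_PrimalSGD.

```go
package main

import (
	"fmt"
	"math"
)

func sign(v float64) float64 {
	switch {
	case v > 0:
		return 1
	case v < 0:
		return -1
	}
	return 0
}

// trainL1Logistic runs SGD on the L1-regularized logistic loss
//   (1/n) sum_i log(1 + exp(-y_i w.x_i)) + lambda*||w||_1,  y_i in {-1,+1},
// using a simple subgradient step for the L1 penalty.
func trainL1Logistic(X [][]float64, y []float64, lambda, eta float64, epochs int) []float64 {
	w := make([]float64, len(X[0]))
	for e := 0; e < epochs; e++ {
		for i, x := range X {
			z := 0.0
			for j, v := range x {
				z += w[j] * v
			}
			g := -y[i] / (1 + math.Exp(y[i]*z)) // d(loss)/dz for this example
			for j, v := range x {
				w[j] -= eta * (g*v + lambda*sign(w[j]))
			}
		}
	}
	return w
}

func main() {
	// Toy data: positives load on feature 0, negatives on feature 1.
	X := [][]float64{{1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9}}
	y := []float64{1, 1, -1, -1}
	w := trainL1Logistic(X, y, 0.001, 0.5, 200)
	fmt.Println(w)
}
```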
- (Aoshima+ 2018) T. Aoshima, K. Kobayashi, and M. Minami. "Revisiting the Vector Space Model: Sparse Weighted Nearest-Neighbor Method for Extreme Multi-Label Classification." arXiv:1802.03938, 2018.
- (Bhatia+ 2016) K. Bhatia, H. Jain, Y. Prabhu, and M. Varma. The Extreme Classification Repository. 2016. Retrieved January 4, 2018 from http://manikvarma.org/downloads/XC/XMLRepository.html