QuickRank: A C++ suite of Learning-to-Rank algorithms
QuickRank is an efficient Learning-to-Rank toolkit providing several C++ implementation of LtR algorithms. QuickRank was designed and developed with efficiency in mind.
The LtR algorithms currently implemented are:
- GBRT: J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
- LamdaMART: Q. Wu, C. Burges, K. Svore, and J. Gao. Adapting boosting for information retrieval measures. Information Retrieval, 2010.
- Oblivious GBRT / LamdaMART: Inspired to I. Segalovich. Machine learning in search quality at yandex. Invited Talk, ACM SIGIR, 2010.
- CoordinateAscent: Metzler, D., Croft, W.B.. Linear feature-based models for information retrieval. Information Retrieval 10(3), pages 257–274, 2007.
- LineSearch: D. G. Luenberger. Linear and nonlinear programming. Addison Wesley, 1984.
- RankBoost: Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. An efficient boosting algorithm for combining preferences. JMLR, 4, 933-969 (2003).
- DART: K.V. Rashmi and R. Gilad-Bachrach. Dart: Dropouts meet multiple additive regression trees. JMLR, 38 (2015).
- Selective: C. Lucchese, F. M. Nardini, S. Orlando, R. Perego and S. Trani. Selective Gradient Boosting for Effective Learning to Rank. ACM SIGIR, 2018. README
QuickRank also provides novel learning optimizations. Currently implemented optimizers are:
- CLEAVER: C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, F. Silvestri, S. Trani. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. In Proc. ACM SIGIR, 2016. README
- X-CLEAVER: C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, F. Silvestri, S. Trani. X-CLEaVER: Learning Ranking Ensembles by Growing and Pruning Trees. Paper under revision.
- X-DART: C. Lucchese, F. M. Nardini, S. Orlando, R. Perego and S. Trani. X-DART: Blending Dropout and Pruning for Efficient Learning to Rank. In Proc. ACM SIGIR, 2017. README
How to build
On Mac OS X (HomeBrew and XCode command line tools are required):
xcode-select --install; brew install gcc cmake git;
On Ubuntu Linux:
sudo apt-get update; sudo apt-get install gcc-5 cmake git;
Once the compiler and the build system are installed, you can get the latest stable source packages (including the dependencies) using git:
git clone --recursive https://github.com/hpclab/quickrank.git
QuickRank relies on the CMake build system. The root directory contains CMake files that can be used to build the project from the command line or using an IDE which supports CMake. The instructions which follows describe how to build the project from the command line, but the same actions can be performed using appropriate GUI tools.
Create a temporary build folder and change your working directory to it:
cd quickrank mkdir build_ cd build_
The Makefile generator can build the project in only one configuration, so you need to build a separate folder for each configuration. As an example, to use the Release configuration (which is by default, Debug is similar) you need to execute (take care of the compiler paths):
cmake .. \ -DCMAKE_CXX_COMPILER=/usr/local/bin/g++-5 \ -DCMAKE_BUILD_TYPE=Release
Finally to compile Quickrank:
And wait for the compilation to finish. The result will be the QuickRank executable placed in the bin directory of the project root.
If you would like to execute the unit-tests, you need to run in the root directory:
How to use
Running the QuickRank binary with the
-h option, it shows the help:
./bin/quicklearn -h _____ _____ / / /____/ /____\ / \ QuickRank has been developed by hpc.isti.cnr.it ::Quick:Rank:: firstname.lastname@example.org Training phase - general options: --algo <arg> (LAMBDAMART) LtR algorithm: [MART|LAMBDAMART|OBVMART|OBVLAMBDAMART|DART| RANKBOOST|COORDASC|LINESEARCH|CUSTOM]. --train-metric <arg> (NDCG) set train metric: [DCG|NDCG|TNDCG|MAP]. --train-cutoff <arg> (10) set train metric cutoff. --partial <arg> (100) set partial file save frequency. --train <arg> set training file. --valid <arg> set validation file. --features <arg> set features file. --model-in <arg> set input model file (for testing, re-training or optimization) --model-out <arg> set output model file --skip-train skip training phase. --restart-train restart training phase from a previous trained model. Training phase - specific options for tree-based models: --num-trees <arg> (1000) set number of trees. --shrinkage <arg> (0.1) set shrinkage. --num-thresholds <arg> (0) set number of thresholds. --min-leaf-support <arg> (1) set minimum number of leaf support. --end-after-rounds <arg> (100) set num. rounds with no gain in validation before ending (if 0 disabled). --num-leaves <arg> (10) set number of leaves [applies only to MART/LambdaMART]. --tree-depth <arg> (3) set tree depth [applies only to ObliviousMART/ObliviousLambdaMART]. Training phase - specific options for Meta LtR models: --meta-algo <arg> Meta LtR algorithm: [METACLEAVER]. --final-num-trees <arg> set number of final trees. --opt-last-only optimization executed only on trees learned in last iteration. --meta-end-after-rounds <arg> set num. rounds with no gain in validation before ending (if 0 disabled) on meta LtR models. --meta-verbose Increase verbosity of Meta Algorithm, showing each step in detail. Training phase - specific options for Dart: --sample-type <arg> (UNIFORM) sampling type of trees. [UNIFORM|WEIGHTED|WEIGHTED_INV|COUNT2|COUNT3|COUNT2N|COUNT3N|TOP_FIFTY|CONTR|CONTR_INV|WCONTR|WCONTR_INV|TOP_WCONTR|LESS_WCONTR]. --normalize-type <arg> (TREE) normalization type of trees. [TREE|NONE|WEIGHTED|FOREST|TREE_ADAPTIVE|LINESEARCH|TREE_BOOST3|CONTR|WCONTR|LMART_ADAPTIVE|LMART_ADAPTIVE_SIZE]. --adaptive-type <arg> (FIXED) adaptive type for choosing number of trees to dropout: [FIXED|PLUS1_DIV2|PLUSHALF_DIV2|PLUSONETHIRD_DIV2|PLUSHALF_RESET|PLUSHALF_RESET_LB1_UB5|PLUSHALF_RESET_LB1_UB10|PLUSHALF_RESET_LB1_UBRD]. --rate-drop <arg> (0.1) set dropout rate --skip-drop <arg> (0) set probability of skipping dropout --keep-drop keep the dropped trees out of the ensembleif the performance of the model improved --best-on-train Calculate the best performance on training (o/w valid) --random-keep <arg> (0) keep the dropped trees out of the ensemble for every drop --drop-on-best Perform the drop-out based on best perfomance (o/w last) Training phase - specific options for Coordinate Ascent and Line Search: --num-samples <arg> (21) set number of samples in search window. --window-size <arg> (10) set search window size. --reduction-factor <arg> (0.95) set window reduction factor. --max-iterations <arg> (100) set number of max iterations. --max-failed-valid <arg> (20) set number of fails on validation before exit. Training phase - specific options for Line Search: --adaptive enable adaptive reduction factor (based on last iteration metric gain). --train-partial <arg> set training file with partial scores (input for loading or output for saving). --valid-partial <arg> set validation file with partial scores (input for loading or output for saving). Optimization phase - general options: --opt-algo <arg> Optimization algorithm: [CLEAVER]. --opt-method <arg> Optimization method: CLEAVER [RANDOM|RANDOM_ADV|LOW_WEIGHTS|SKIP|LAST|QUALITY_LOSS|QUALITY_LOSS_ADV|SCORE_LOSS]. --opt-model <arg> set output model file for optimization or input model file for testing. --opt-algo-model <arg> set output algorithm model file post optimization. Optimization phase - specific options for ensemble pruning: --pruning-rate <arg> ensemble to prune (either as a ratio with respect to ensemble size or as an absolute number of estimators to prune). --with-line-search ensemble pruning is made in conjunction with line search [related parameters accepted]. --line-search-model <arg> set line search XML file path for loading line search model (options and already trained weights). Test phase - general options: --test-metric <arg> (NDCG) set test metric: [DCG|NDCG|TNDCG|RMSE|MAP]. --test-cutoff <arg> (10) set test metric cutoff. --test <arg> set testing file. --scores <arg> set output scores file. --detailed enable detailed testing [applies only to ensemble models]. Code generation - general options: --model-file <arg> set XML model file path. --code-file <arg> set C code file path. --generator <arg> (condop) set C code generation strategy. Allowed options are: - "condop" (conditional operators), - "oblivious" (optimized code for oblivious trees), - "vpred" (intermediate code used by VPRED). Help options: -h,--help print help message.
Some parameters have a default value (described in round brackets), which means you can skip their assignation if the default value is ok for you.
Training a model with QuickRank is straightforward. Looking at the training options, you need to specify the learning algorithm, the training dataset (optionally also the validation dataset) and the metric to use. If you would like to save the trained model on a file, you need to pass also the
./bin/quicklearn \ --algo LAMBDAMART \ --train quickranktestdata/msn1/msn1.fold1.train.5k.txt \ --valid quickranktestdata/msn1/msn1.fold1.vali.5k.txt \ --train-metric NDCG \ --train-cutoff 10 \ --model-out lambdamart-model.xml
Depending from the learning algorithm adopted, you have to pass additional parameters, e.g., the number of trees for ensemble-based models (and many others). Some parameters have a default value, which means if you can skip them they will use that value for training.
To test a model, you could specify the test option in the previous command, or load a previously saved model. The predicted scores can be saved on a file (one score per row, preserving the order of the test dataset).
./bin/quicklearn \ --model-in lambdamart-model.xml \ --test quickranktestdata/msn1/msn1.fold1.test.5k.txt \ --test-metric NDCG \ --test-cutoff 10 \ --scores scores.txt
--detailed option, valid only for ensemble-based algorithms, QuickRank will save in a SVM-light format (which consequently can be used as input dataset for other learning algorithms) the partial scores given by each weak ranker to the prediction of the documents (one row per document, a feature for each ensemble, preserving the order of the ensembles in the model and of the documents in the dataset).
QuickRank can translate learnt tree-based models into efficient C++ source code that can be used to score documents efficiently. See a more detailed description here.
QuickRank introduces the concept of optimizers, i.e., algorithms than are executed before or after the training phase is executed. An optimizer could process either the dataset or the model, depending from its definition. Currently in QuickRank there is a single optimizer which acts in post learning by pruning an ensemble model, improving consequently its efficiency, without hindering its effectiveness.
The optimizer can be executed in pipeline with the training phase by setting the corresponding options, or as a standalone process which works on an previously trained model (or dataset).
./bin/quicklearn \ --model-in lambdamart-model.xml \ --train quickranktestdata/msn1/msn1.fold1.train.5k.txt \ --valid quickranktestdata/msn1/msn1.fold1.vali.5k.txt \ --opt-algo CLEAVER \ --opt-method QUALITY_LOSS \ --opt-model optmization-model.xml \ --opt-algo-model optimized-model.xml \ --pruning-rate 0.5 \ --with-line-search \ --num-samples 10 \ --window-size 1 \ --reduction-factor 0.95 \ --max-iterations 100 \ --max-failed-valid 20
See a more detailed description here.
If you need a small dataset o which to test QuickRank, all you need to do is to run the following command from the
_build directory mentioned the in "How to Build" section.
This command will clone a repository with a small sample (5k rows for each split) of the MSN1 LETOR dataset. The sample will be placed inside the directory
./quickranktestdata/msn1/ already in the train/test/validation split.
File Format (train/test/validation)
The file format of the training, test and validation files is the same as for SVM-Light (http://svmlight.joachims.org/). This is also the format used in the LETOR datasets. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info> <target> .=. <float> <qid> .=. <positive integer> <feature> .=. <positive integer> <value> .=. <float> <info> .=. <string>
The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs MUST be ordered by increasing feature number. Features with value zero can be skipped. The string can be used to pass additional information to the kernel (e.g. non feature vector data).
Here's an example: (taken from the SVM-Rank website). Note that everything after "#" are discarded.
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A 2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B 1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C 1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D 1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A 2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B 1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C 1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
If you use QuickRank, please acknowledge the following paper:
- Capannini, G., Lucchese, C., Nardini, F. M., Orlando, S., Perego, R., and Tonellotto, N. Quality versus efficiency in document scoring with learning-to-rank models. Information Processing & Management (2016). LINK.
If you use CLEAVER, please acknowledge the following paper:
- C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, F. Silvestri, S. Trani. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. ACM SIGIR Conference on Research and Development in Information Retrieval, (2016). LINK.
If you use X-DART, please acknowledge the following paper:
- C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, and S. Trani. X-DART: Blending Dropout and Pruning for Efficient Learning to Rank. ACM SIGIR Conference on Research and Development in Information Retrieval, (2017). LINK.
We will be happy to know that you are using QuickRank and to acknowledge you. Check the list of works using QuickRank
Tools and Libraries
We thank the developers of the following tools and libraries:
- Catch: C++ Automated Test Cases in Headers
- PugiXML: Lightweight C++ XML processing library
- ParamsMap: Simple command line parser adopting easily accessible program options
© Contributors, 2016. Licensed under an Reciprocal Public License (RPL-1.5).