# Efficiency/Effectiveness Trade-offs in Learning to Rank
### Tutorial @ ECML-PKDD 2018, HandsOn Session N. 1

##### Claudio Lucchese (UniVe), Franco Maria Nardini (ISTI-CNR)
##### High Performance Computing Lab. http://hpc.isti.cnr.it/

<img src="images/hpc.png" width="250">

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os
import pandas as pd

from rankeval.dataset import Dataset
from rankeval.model import RTEnsemble
from rankeval.analysis.effectiveness import tree_wise_performance

### Agenda

Given a trained LambdaMART model (stored in QuickRank format):

0. Set up an experimental environment for testing different scoring methods: Conditional Operators (CondOp), VPred, QuickScorer (QS), Vectorized QuickScorer (v-QS).
0. Execute the different methods and compare them.
0. Low-level analysis with ``perf`` for two of them: VPred vs. QuickScorer.
0. Comparison with previously published results [QS-TOIS16, QS-TPDS18].

**Bonus Track**

0. Multi-threaded implementation of Vectorized QuickScorer.
0. GPU implementation of QuickScorer.
0. Multi-threaded scoring with RankEval.

### Step 0

Clone and compile QuickRank. Detailed instructions on how to do it can be found at: http://quickrank.isti.cnr.it

### Step 0.1

Clone and install RankEval. Detailed instructions on how to do it can be found at: http://rankeval.isti.cnr.it

### Step 0.2

Download the Istella-S LETOR dataset (http://blog.istella.it/istella-learning-to-rank-dataset/)

In [3]:
# Global Options

# paths to executable files
QUICKRANK      = "./quickrank/bin/quicklearn"
SCORER         = "./quickrank/bin/quickscore"

QUICKSCORER    = "./QuickScorer/bin/quickscorer"
QUICKSCORER_NS = "./QuickScorer-noscoring/bin/quickscorer"
VPRED          = "./asadi_tkde13/out/VPred"
VPRED_NS       = "./asadi_tkde13-noscoring/out/VPred"

PERF           = "perf"
QUICKSCORER_GPU= "./QuickScorer-GPU/GPUQS/bin/quickscorer"

# paths to Istella-S dataset
train_dataset_file       = "/data/letor-datasets/tiscali/sample/ramfs/train.txt"
valid_dataset_file       = "/data/letor-datasets/tiscali/sample/ramfs/vali.txt"
test_dataset_file        = "/data/letor-datasets/tiscali/sample/ramfs/test.txt"

dataset_size = 681250

# The first row of the test file used by VPred should be: "<# rows of the file> <# features>\n".
vpred_test_dataset_file  = "/data/letor-datasets/tiscali/sample/ramfs/test.vpred"

# paths to model file
models_folder            = "models"
baseline_model_file      = os.path.join(models_folder, "istella-small.lamdamart.xml")

# setting floating point precision of Pandas
pd.set_option('display.precision', 1)

### Step 1

Load an existing LambdaMART model with RankEval or train it with QuickRank.

In [4]:
# load a QuickRank model
# if no model is available, use the box below to train one!

baseline_model = RTEnsemble(baseline_model_file, name="Baseline", format="QuickRank")

In [4]:
# The code below trains a LambdaMART of 50 trees.

!{QUICKRANK} \
  --algo LAMBDAMART \
  --num-trees 50 \
  --shrinkage 0.05 \
  --num-thresholds 0 \
  --num-leaves 64 \
  --min-leaf-support 1 \
  --end-after-rounds 0 \
  --partial 1000 \
  --train {train_dataset_file} \
  --valid {valid_dataset_file} \
  --train-metric NDCG \
  --train-cutoff 10 \
  --model-out ~/quickrank.1000T.64L.xml

[1m[32m
      _____  _____
     /    / /____/
    /____\ /    \        QuickRank has been developed by hpc.isti.cnr.it
    ::Quick:Rank::                             mail: quickrank@isti.cnr.it

# Ranker: LAMBDAMART
# max no. of trees = 50
# no. of tree leaves = 64
# shrinkage = 0.050000
# min leaf support = 1
# no. of thresholds = unlimited

# Reading training dataset: /data/letor-datasets/tiscali/sample/ramfs/train.txt
#	 Reading time: 95.56 s. @ 31.52 MB/s  (post-proc.: 1.21 s.)
#	 Dataset size: 2043304 x 220 (instances x features)
#	 Num queries: 19245 | Avg. len: 106.173

# Reading validation dataset: /data/letor-datasets/tiscali/sample/ramfs/vali.txt
#	 Reading time: 32.34 s. @ 31.23 MB/s  (post-proc.: 0.27 s.)
#	 Dataset size: 684076 x 220 (instances x features)
#	 Num queries: 7211 | Avg. len: 94.866

#
# Ranker: LAMBDAMART
# max no. of trees = 50
# no. of tree leaves = 64
# shrinkage = 0.050
# min leaf support = 1
# no. of thresholds = unlimited
#
# training scorer: NDC

### Step 2

We now translate the LambdaMART model into C++ code that employs Conditional Operators to build the final document ranker [QS-TOIS16].

QuickRank provides a plugin to convert models stored in its native XML format to C++ source code. Each tree is translated into a nested block of Conditional Operators (https://www.tutorialspoint.com/cplusplus/cpp_conditional_operator.htm). The resulting C++ code can be compiled to produce a working ranker for the given model.

Here is a toy example of a tree:
~~~~
<feature>194</feature>
<threshold>140</threshold>
 <split pos="left">
   <feature>31</feature>
   <threshold>0.0120639997</threshold>
     <split pos="left">
       <output>-0.78920207999267233</output>
     </split>
     <split pos="right">
       <output>1.1050481952095461</output>
     </split>
 </split>
~~~~

The conditional operator (BOOLEAN CONDITION ? THEN : ELSE) translation produces the following expression (as in the XML above, the right branch of the root is omitted):

~~~~
v[194] <= 140.0f ? ( v[31] <= 0.0120639997f ? -0.78920207999267233 : 1.1050481952095461 )
~~~~
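
The translation can be sketched in a few lines of Python. This is only an illustration of the idea, not the QuickRank generator: the nested-dict tree layout and the function name `tree_to_condop` are assumptions made for this example, and the right subtree of the root is a placeholder because it is not shown above.

```python
def tree_to_condop(node):
    """Render a tree as a nested C/C++ conditional-operator expression.

    Hypothetical layout used only in this sketch:
    a leaf is {"output": float}; an internal node is
    {"feature": int, "threshold": float, "left": node, "right": node}.
    """
    if "output" in node:
        # Leaves become plain numeric literals.
        return repr(float(node["output"]))
    # Internal nodes become "test ? left : right".
    test = "v[{}] <= {}f".format(node["feature"], repr(float(node["threshold"])))
    return "( {} ? {} : {} )".format(
        test, tree_to_condop(node["left"]), tree_to_condop(node["right"]))


toy_tree = {
    "feature": 194, "threshold": 140,
    "left": {
        "feature": 31, "threshold": 0.0120639997,
        "left": {"output": -0.78920207999267233},
        "right": {"output": 1.1050481952095461},
    },
    # Placeholder leaf: the right subtree is not part of the toy example.
    "right": {"output": 0.0},
}

print(tree_to_condop(toy_tree))
```

Each recursive call emits one `? :` pair, so a tree with 64 leaves becomes a single expression with 63 nested conditional operators.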

In [5]:
# We use CondOp-based C code as a baseline for the scoring time evaluation

def run_condop(model_file, dataset_file, rounds=1):
    # create the C code
    print (" 1. Creating the C code for " + model_file)
    condop_source = model_file + ".c"
    condop_compiled = model_file + ".bin"
    
    _ = !{QUICKRANK} \
      --generator condop \
      --model-file {model_file} \
      --code-file {condop_source}
    
    # Compile an executable ranker. The resulting ranker is SCORER=./quickrank/bin/quickscore
    print (" 2. Compiling the model")

    # actually compile only if the binary is missing or older than the model
    if ( not os.path.exists(condop_compiled) or
          os.path.getmtime(condop_compiled) < os.path.getmtime(model_file) ):
        
        # replace empty scorer
        !cp {condop_source} ./quickrank/src/scoring/ranker.cc
        # compile
        _ = !make -j -C ./quickrank/build_ quickscore 
        # copy compiled scorer
        !cp {SCORER} {condop_compiled}

    # Now running the Conditional Operators scorer by executing the previously compiled C code.
    # QuickScore options:
    #  -h,--help                             print help message
    #  -d,--dataset <arg>                    Input dataset in SVML format
    #  -r,--rounds <arg> (10)                Number of test repetitions
    #  -s,--scores <arg>                     File where scores are saved (Optional).
    print (" 3. Running the compiled model")
    cond_op_scorer_out = !{condop_compiled} \
      -d {dataset_file} \
      -r {rounds}
    
    print (cond_op_scorer_out.n)
    
    # takes the per-document scoring time (in seconds) and converts it to microseconds
    cond_op_scoring_time = float(cond_op_scorer_out.l[-1].split()[-2])* 10**6
    
    return cond_op_scoring_time

In [6]:
condop_efficiency = run_condop(baseline_model_file, test_dataset_file) 

 1. Creating the C code for models/istella-small.lamdamart.xml
 2. Compiling the model
 3. Running the compiled model

      _____  _____
     /    / /____/
    /____\ /    \          QuickRank has been developed by hpc.isti.cnr.it
    ::Quick:Rank::                                   quickrank@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 74.6 s.
Avg. Dataset scoring time: 74.6 s.
Avg.    Doc. scoring time: 0.00011 s.
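
The cells in this notebook extract the per-document time by position (last line, second-to-last token). A slightly more defensive alternative is to search for the `Avg. Doc. scoring time` line explicitly. The sketch below assumes the output format shown above; the helper name `doc_scoring_time_us` is hypothetical.

```python
import re

# Matches e.g. "Avg.    Doc. scoring time: 2.31208e-05 s."
_DOC_TIME = re.compile(r"Avg\.\s+Doc\.\s+scoring time:\s*([0-9.eE+-]+)\s*s\.")


def doc_scoring_time_us(output_lines):
    """Return the average per-document scoring time in microseconds."""
    for line in output_lines:
        match = _DOC_TIME.search(line)
        if match:
            # The scorer reports seconds; 1 s = 10**6 microseconds.
            return float(match.group(1)) * 1e6
    raise ValueError("no 'Avg. Doc. scoring time' line found")


sample = [
    "#\t Dataset size: 681250 x 220 (instances x features)",
    "       Total scoring time: 15.751 s.",
    "Avg. Dataset scoring time: 15.751 s.",
    "Avg.    Doc. scoring time: 2.31208e-05 s.",
    "",
]
print(round(doc_scoring_time_us(sample), 1))
```

With IPython's `!` capture, the list of lines would be available as the `.l` attribute of the captured output.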


In [7]:
# Store current results
results = pd.DataFrame(columns=['Model', '# Trees', 'Scoring Time µs.'])

results.loc[len(results)] = ['CondOp', baseline_model.n_trees, condop_efficiency]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |


### Step 3

Now scoring with VPred [VPRED].

First, we need to convert the QuickRank XML model into the VPred format. Then, we use the converted model to score the test file.

QuickRank provides a plugin to convert models stored in its native XML format to the textual representation employed by the original VPred code by Nima Asadi et al. [VPRED]. The plugin outputs a text file.
 
The original VPred code and instructions on how to compile, install and use it are available here: https://github.com/lintool/OptTrees

In [8]:
vpred_source = baseline_model_file + ".vpred"

!{QUICKRANK} \
  --generator vpred \
  --model-file {baseline_model_file} \
  --code-file {vpred_source}

[1m[32m
      _____  _____
     /    / /____/
    /____\ /    \        QuickRank has been developed by hpc.isti.cnr.it
    ::Quick:Rank::                             mail: quickrank@isti.cnr.it
generating VPred input file from: models/istella-small.lamdamart.xml


In [9]:
# Now running the VPred scorer by using the previously converted code.
# note that we are using the original VPred code by Asadi et al. [VPRED].
# The code is available here: https://github.com/lintool/OptTrees

vpred_scorer_out = !{VPRED} \
  -ensemble {vpred_source} \
  -instances {vpred_test_dataset_file} \
  -maxLeaves 64
    
print (vpred_scorer_out.n)

# takes the per-document scoring time (in seconds) and converts it to microseconds
vpred_scoring_time = float(vpred_scorer_out.l[0].split('\t')[1])* 10**6

$	0.000102983805
Ignore this number: -2779350


In [10]:
# Store current results
results.loc[len(results)] = ['VPred', baseline_model.n_trees, vpred_scoring_time]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |
| 1 | VPred | 1492 | 103.0 |


### Step 4

QuickScorer uses a novel traversal method and a cache-friendly data layout that dramatically reduce the traversal time [QS-SIGIR15, QS-TOIS16].

In [11]:
# Now running QuickScorer.
# note that we are using the original QuickScorer code by Lucchese et al. [QS-SIGIR15,QS-TOIS16].
# The code is available under NDA.
#
# Options:
#  -h [ --help ]                     Print help messages.
#  -d [ --dataset ] arg              Path of the dataset to score (SVML format).
#  -r [ --rounds ] arg (=10)         Number of test repetitions.
#  -s [ --scores ] arg               Path of the file where final scores are
#                                    saved.
#  -t [ --tree_type ] arg (=0)       Specify the type of the tree in the
#                                    ensemble:
#                                     - 0 for normal trees,
#                                     - 1 for oblivious trees,
#                                     - 2 for normal trees (reversed blocked),
#                                     - 3 for normal trees (SIMD: SSE/AVX).
#  -m [ --model ] arg                Path of the XML file storing the model.
#  -l [ --nleaves ] arg              Maximum number of leaves in a tree (<= 64).
#  --avx                             Use AVX 256 instructions (at least 8 doc
#                                    blocking).
#  --omp                             Use OpenMP multi-threading document scoring
#                                    (only SIMD: SSE/AVX).

qs_scorer_out = !{QUICKSCORER} \
  -d {test_dataset_file} \
  -m {baseline_model_file} \
  -l 64 \
  -r 1 \
  -t 0
    
print (qs_scorer_out.n)
    
# takes the per-document scoring time (in seconds) and converts it to microseconds
qs_scoring_time = float(qs_scorer_out.l[-1].split()[-2])* 10**6


      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by hpc.isti.cnr.it
    :Quick:Scorer:                                   quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 15.751 s.
Avg. Dataset scoring time: 15.751 s.
Avg.    Doc. scoring time: 2.31208e-05 s.


In [12]:
# Store current results
results.loc[len(results)] = ['QS', baseline_model.n_trees, qs_scoring_time]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |
| 1 | VPred | 1492 | 103.0 |
| 2 | QS | 1492 | 23.1 |


### Step 5

Vectorized QuickScorer improves over QuickScorer by exploiting 256-bit wide CPU registers [QS-SIGIR16].


In [13]:
# Now running Vectorized QuickScorer (AVX2)
# note that we are using the original QuickScorer code by Lucchese et al. [QS-SIGIR16].
# The code is available under NDA.
#
# Options:
#  -h [ --help ]                     Print help messages.
#  -d [ --dataset ] arg              Path of the dataset to score (SVML format).
#  -r [ --rounds ] arg (=10)         Number of test repetitions.
#  -s [ --scores ] arg               Path of the file where final scores are
#                                    saved.
#  -t [ --tree_type ] arg (=0)       Specify the type of the tree in the
#                                    ensemble:
#                                     - 0 for normal trees,
#                                     - 1 for oblivious trees,
#                                     - 2 for normal trees (reversed blocked),
#                                     - 3 for normal trees (SIMD: SSE/AVX).
#  -m [ --model ] arg                Path of the XML file storing the model.
#  -l [ --nleaves ] arg              Maximum number of leaves in a tree (<= 64).
#  -v [ --doc_block_size ] arg (=1)  Document block size (allowed values:
#                                    1,2,4,8,16; 1 means no blocking).
#  --avx                             Use AVX 256 instructions (at least 8 doc
#                                    blocking).
#  --omp                             Use OpenMP multi-threading document scoring
#                                    (only SIMD: SSE/AVX).

vqs_scorer_out = !{QUICKSCORER} \
  -d {test_dataset_file} \
  -m {baseline_model_file} \
  -l 64 \
  -r 1 \
  -t 3 \
  -v 8 \
  --avx
    
print (vqs_scorer_out.n)
    
# takes the per-document scoring time (in seconds) and converts it to microseconds
vqs_scoring_time = float(vqs_scorer_out.l[-1].split()[-2])* 10**6


      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by hpc.isti.cnr.it
    :Quick:Scorer:                                   quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 10.3144 s.
Avg. Dataset scoring time: 10.3144 s.
Avg.    Doc. scoring time: 1.51404e-05 s.


In [14]:
# Store current results
results.loc[len(results)] = ['v-QS', baseline_model.n_trees, vqs_scoring_time]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |
| 1 | VPred | 1492 | 103.0 |
| 2 | QS | 1492 | 23.1 |
| 3 | v-QS | 1492 | 15.1 |


from [QS-TOIS16]

Some considerations:

0. We reproduce the evaluation methodology presented in [QS-SIGIR15, QS-TOIS16] on a different LambdaMART model. The model used here is composed of 1,492 trees, while the results in the two papers are for 1,000 or 5,000 trees.
0. The scoring time is 1.5x higher than the results reported in [QS-TOIS16] for all methods. This is because we are running these experiments on a slower machine than the one used to produce the experimental results presented in [QS-SIGIR15, QS-TOIS16].

![caption](images/HO1-scoring.png)

### Step 6

Low-level statistics of the scorer with ```perf```

**perf** (https://perf.wiki.kernel.org/index.php/Tutorial) is a profiling tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface.

### Step 6.1

```perf``` on QuickScorer

In [15]:
# Below, perf is used to monitor several behaviours of the scorer:
# - L1 cache performance (references and misses): L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses
# - L3 cache performance (references and misses): cache-references,cache-misses
# - number of instructions and cycles: instructions,cycles
# - total number of branches and branch misprediction: branches,branch-misses

perf_out = !{PERF} stat -e \
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses,\
instructions,cycles,cache-references,cache-misses,branches,branch-misses\
    {QUICKSCORER} \
      -d {test_dataset_file} \
      -m {baseline_model_file} \
      -l 64 \
      -r 1 \
      -t 0

print (perf_out.n)


      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by hpc.isti.cnr.it
    :Quick:Scorer:                                   quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 15.8972 s.
Avg. Dataset scoring time: 15.8972 s.
Avg.    Doc. scoring time: 2.33353e-05 s.

 Performance counter stats for './QuickScorer/bin/quickscorer -d /data/letor-datasets/tiscali/sample/ramfs/test.txt -m models/istella-small.lamdamart.xml -l 64 -r 1 -t 0':

   107,610,037,898 L1-dcache-loads                                              [36.37%]
     5,800,942,942 L1-dcache-load-misses     #    5.39% of all L1-dcache hits   [36.37%]
    55,355,326,714 L1-dcache-stores                                             [36.37%]
       651,532,279 L1-dcache-store-misses                                       [36.37%]
   <not supported> L1-icache-loads         
        42,052,272 L1-ica

In [16]:
# parsing perf output
num_istructions = int(perf_out[20].strip().split(' ')[0].replace(',', ''))
num_cache_ref = int(perf_out[22].strip().split(' ')[0].replace(',', ''))
num_cache_miss = int(perf_out[23].strip().split(' ')[0].replace(',', ''))
num_branches = int(perf_out[24].strip().split(' ')[0].replace(',', ''))
num_branch_misses = int(perf_out[25].strip().split(' ')[0].replace(',', ''))
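
The parsing above relies on fixed line positions in the `perf` output, which shift whenever the scorer prints a different number of lines (compare the indices used for QuickScorer and for VPred). A possible alternative, sketched here under the assumption that counter lines have the `<count> <event-name>` shape shown in the outputs of this notebook, is to collect the counters by event name. The helper name `parse_perf_stat` is hypothetical.

```python
import re

# A counter line looks like: "   625,502,247,782 instructions   #  2.08  insns per cycle"
_COUNTER = re.compile(r"^\s*([\d,]+)\s+([A-Za-z0-9_\-]+)")


def parse_perf_stat(lines):
    """Map each perf event name to its integer count.

    Lines that do not start with a number (banners, '<not supported>'
    entries, scorer output) are skipped.
    """
    counters = {}
    for line in lines:
        match = _COUNTER.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


sample = [
    "$\t0.000105318109",
    " Performance counter stats for './asadi_tkde13/out/VPred ...':",
    "   265,521,807,106 L1-dcache-loads                                              [36.37%]",
    "   <not supported> L1-icache-loads         ",
    "   625,502,247,782 instructions              #    2.08  insns per cycle         [45.45%]",
    "   300,488,893,895 cycles                    [45.45%]",
    "        11,733,166 cache-misses            ",
]
counters = parse_perf_stat(sample)
print(counters["instructions"], counters["cache-misses"])
```

Looking counters up by name (for example `counters["branch-misses"]`) keeps the same parsing code valid for both scorers.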

### Step 6.2

```perf``` on QuickScorer (no scoring).

In [17]:
# Below, perf is used to monitor several behaviours of the scorer:
# - L1 cache performance (references and misses): L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses
# - L3 cache performance (references and misses): cache-references,cache-misses
# - number of instructions and cycles: instructions,cycles
# - total number of branches and branch misprediction: branches,branch-misses

perf_noscoring_out = !{PERF} stat -e \
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses,\
instructions,cycles,cache-references,cache-misses,branches,branch-misses\
    {QUICKSCORER_NS} \
      -d {test_dataset_file} \
      -m {baseline_model_file} \
      -l 64 \
      -r 1 \
      -t 0
        
print (perf_noscoring_out.n)


      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by hpc.isti.cnr.it
    :Quick:Scorer:                                   quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 6.4e-08 s.
Avg. Dataset scoring time: 6.4e-08 s.
Avg.    Doc. scoring time: 9.3945e-14 s.

 Performance counter stats for './QuickScorer-noscoring/bin/quickscorer -d /data/letor-datasets/tiscali/sample/ramfs/test.txt -m models/istella-small.lamdamart.xml -l 64 -r 1 -t 0':

    54,729,913,494 L1-dcache-loads                                              [36.37%]
       185,505,538 L1-dcache-load-misses     #    0.34% of all L1-dcache hits   [36.37%]
    39,031,706,408 L1-dcache-stores                                             [36.37%]
       106,537,106 L1-dcache-store-misses                                       [36.36%]
   <not supported> L1-icache-loads         
        36,491,6

In [18]:
# parsing perf output
num_istructions_ns = int(perf_noscoring_out[20].strip().split(' ')[0].replace(',', ''))
num_cache_ref_ns = int(perf_noscoring_out[22].strip().split(' ')[0].replace(',', ''))
num_cache_miss_ns = int(perf_noscoring_out[23].strip().split(' ')[0].replace(',', ''))
num_branches_ns = int(perf_noscoring_out[24].strip().split(' ')[0].replace(',', ''))
num_branch_misses_ns = int(perf_noscoring_out[25].strip().split(' ')[0].replace(',', ''))

### Step 6.3

We now compute the differences between the two runs to obtain the low-level statistics for the scoring part of QS.

In [19]:
# Store current results
perf_results = pd.DataFrame(columns=['Method', 'Instructions', 'Cache Misses', 'Branch Misprediction'])

normalized_instruction_count = (num_istructions - num_istructions_ns) / float(dataset_size * baseline_model.n_trees)
normalized_cache_miss = (num_cache_miss - num_cache_miss_ns) / float(dataset_size * baseline_model.n_trees)
normalized_branch_miss = (num_branch_misses - num_branch_misses_ns) / float(dataset_size * baseline_model.n_trees)

perf_results.loc[len(perf_results)] = ['QS',
                                  normalized_instruction_count,
                                  normalized_cache_miss,
                                  normalized_branch_miss]
perf_results

| | Method | Instructions | Cache Misses | Branch Misprediction |
|---|---|---|---|---|
| 0 | QS | 77.5 | 0.001 | 0.1 |


### Step 6.4

We now apply the same methodology to VPred.

In [20]:
# Below, perf is used to monitor several behaviours of the scorer:
# - L1 cache performance (references and misses): L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses
# - L3 cache performance (references and misses): cache-references,cache-misses
# - number of instructions and cycles: instructions,cycles
# - total number of branches and branch misprediction: branches,branch-misses

vpred_perf_out = !{PERF} stat -e \
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses,\
instructions,cycles,cache-references,cache-misses,branches,branch-misses\
    {VPRED} \
      -ensemble {vpred_source} \
      -instances {vpred_test_dataset_file} \
      -maxLeaves 64
        
print (vpred_perf_out.n)

$	0.000105318109
Ignore this number: -2779350

 Performance counter stats for './asadi_tkde13/out/VPred -ensemble models/istella-small.lamdamart.xml.vpred -instances /data/letor-datasets/tiscali/sample/ramfs/test.vpred -maxLeaves 64':

   265,521,807,106 L1-dcache-loads                                              [36.37%]
     1,461,322,337 L1-dcache-load-misses     #    0.55% of all L1-dcache hits   [36.37%]
    90,022,515,465 L1-dcache-stores                                             [36.37%]
        46,465,323 L1-dcache-store-misses                                       [36.37%]
   <not supported> L1-icache-loads         
     9,661,742,673 L1-icache-load-misses     #    0.00% of all L1-icache hits   [36.36%]
   625,502,247,782 instructions              #    2.08  insns per cycle         [45.45%]
   300,488,893,895 cycles                    [45.45%]
    16,004,807,642 cache-references                                             [45.45%]
        11,733,166 cache-misses            

In [21]:
# parsing perf output
num_istructions = int(vpred_perf_out[11].strip().split(' ')[0].replace(',', ''))
num_cache_ref = int(vpred_perf_out[13].strip().split(' ')[0].replace(',', ''))
num_cache_miss = int(vpred_perf_out[14].strip().split(' ')[0].replace(',', ''))
num_branches = int(vpred_perf_out[15].strip().split(' ')[0].replace(',', ''))
num_branch_misses = int(vpred_perf_out[16].strip().split(' ')[0].replace(',', ''))

### Step 6.5

``perf`` on VPred (no scoring).

In [22]:
# Below, perf is used to monitor several behaviours of the scorer:
# - L1 cache performance (references and misses): L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses
# - L3 cache performance (references and misses): cache-references,cache-misses
# - number of instructions and cycles: instructions,cycles
# - total number of branches and branch misprediction: branches,branch-misses

vpred_perf_noscoring_out = !{PERF} stat -e \
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
L1-dcache-store-misses,L1-icache-loads,L1-icache-load-misses,\
instructions,cycles,cache-references,cache-misses,branches,branch-misses\
    {VPRED_NS} \
      -ensemble {vpred_source} \
      -instances {vpred_test_dataset_file} \
      -maxLeaves 64
        
print (vpred_perf_noscoring_out.n)

$	1.48256881e-13
Ignore this number: 0

 Performance counter stats for './asadi_tkde13-noscoring/out/VPred -ensemble models/istella-small.lamdamart.xml.vpred -instances /data/letor-datasets/tiscali/sample/ramfs/test.vpred -maxLeaves 64':

    43,941,623,126 L1-dcache-loads                                              [36.37%]
        49,346,422 L1-dcache-load-misses     #    0.11% of all L1-dcache hits   [36.37%]
    31,270,056,480 L1-dcache-stores                                             [36.37%]
        24,291,207 L1-dcache-store-misses                                       [36.36%]
   <not supported> L1-icache-loads         
         8,301,655 L1-icache-load-misses     #    0.00% of all L1-icache hits   [36.35%]
   175,562,118,101 instructions              #    2.37  insns per cycle         [45.46%]
    74,077,992,866 cycles                    [45.46%]
        23,344,652 cache-references                                             [45.46%]
         7,768,084 cache-misses         

In [23]:
# parsing perf output
num_istructions_ns = int(vpred_perf_noscoring_out[11].strip().split(' ')[0].replace(',', ''))
num_cache_ref_ns = int(vpred_perf_noscoring_out[13].strip().split(' ')[0].replace(',', ''))
num_cache_miss_ns = int(vpred_perf_noscoring_out[14].strip().split(' ')[0].replace(',', ''))
num_branches_ns = int(vpred_perf_noscoring_out[15].strip().split(' ')[0].replace(',', ''))
num_branch_misses_ns = int(vpred_perf_noscoring_out[16].strip().split(' ')[0].replace(',', ''))

### Step 6.6

We now compute the differences between the two runs to obtain the low-level statistics for the scoring part of VPred.

In [24]:
# Store current results
normalized_instruction_count = (num_istructions - num_istructions_ns) / float(dataset_size * baseline_model.n_trees)
normalized_cache_miss = (num_cache_miss - num_cache_miss_ns) / float(dataset_size * baseline_model.n_trees)
normalized_branch_miss = (num_branch_misses - num_branch_misses_ns) / float(dataset_size * baseline_model.n_trees)

perf_results.loc[len(perf_results)] = ['VPred',
                                  normalized_instruction_count,
                                  normalized_cache_miss,
                                  normalized_branch_miss]
perf_results

| | Method | Instructions | Cache Misses | Branch Misprediction |
|---|---|---|---|---|
| 0 | QS | 77.5 | 0.001 | 0.12 |
| 1 | VPred | 442.7 | 0.0039 | 0.043 |
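
As a sanity check, the VPred row can be recomputed by hand from the two `perf` outputs above. The normalization subtracts the counters of the no-scoring run (data loading and setup only) and divides by the number of document-tree pairs. The function name `per_doc_per_tree` is introduced only for this sketch; the counter values are copied from the outputs of Steps 6.4 and 6.5.

```python
def per_doc_per_tree(count, count_noscoring, n_docs, n_trees):
    """Counter value attributable to scoring, per document and per tree."""
    return (count - count_noscoring) / float(n_docs * n_trees)


N_DOCS = 681250    # instances in the test set
N_TREES = 1492     # trees in the model

# VPred: instructions with and without scoring
vpred_instr = per_doc_per_tree(625502247782, 175562118101, N_DOCS, N_TREES)
# VPred: cache misses with and without scoring
vpred_cache = per_doc_per_tree(11733166, 7768084, N_DOCS, N_TREES)

print(round(vpred_instr, 1), round(vpred_cache, 4))
```

Both values agree with the `Instructions` and `Cache Misses` columns reported for VPred.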


### Step 6.7

from [QS-TOIS16]

Some considerations:

0. We reproduce the methodology presented in [QS-SIGIR15, QS-TOIS16] on a different LambdaMART model. The model used here is composed of 1,492 trees, while the results in the two papers are for 1,000 or 5,000 trees. That said, the low-level behavior of the two methods is confirmed.
0. VPred executes the largest number of instructions. This is because VPred always runs ``d`` steps, where ``d`` is the depth of a tree, even if a document might reach an exit leaf earlier. QS, on the other hand, executes the smallest number of instructions. This is due to its different traversal strategy of the ensemble, as QS needs to process the false nodes only.
0. In terms of branches, QS has a larger number of branch mispredictions than VPred, whose scoring functions are branch-free.
0. In terms of cache misses, QS has fewer of them. This is mostly due to the data layout of QS, which performs document scoring by means of linear scans of arrays.
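
To make the "false nodes only" remark concrete, here is a toy, single-tree sketch of the bitvector idea behind QuickScorer [QS-SIGIR15]. It is an illustration under simplifying assumptions, not the actual implementation: the tree, the `NODES` layout, and the function names are invented for this example, and the real algorithm processes the whole ensemble feature by feature over arrays.

```python
# Toy tree with three leaves, numbered left to right (0, 1, 2):
#
#        x[0] <= 0.5
#        /         \
#    leaf 0     x[1] <= 2.0
#               /        \
#           leaf 1      leaf 2
#
# Each internal node is (feature, threshold, mask). Bit i of the mask refers
# to leaf i and is 0 for the leaves of the node's LEFT subtree: those leaves
# become unreachable when the test x[feature] <= threshold is false.
NODES = [
    (0, 0.5, 0b110),  # root: left subtree = {leaf 0}
    (1, 2.0, 0b101),  # right child of the root: left subtree = {leaf 1}
]
N_LEAVES = 3


def classic_exit_leaf(x):
    """Reference root-to-leaf traversal of the toy tree."""
    if x[0] <= 0.5:
        return 0
    return 1 if x[1] <= 2.0 else 2


def bitvector_exit_leaf(x):
    """Find the exit leaf by AND-ing the masks of the false nodes."""
    leaves = (1 << N_LEAVES) - 1          # all leaves are candidates
    for feature, threshold, mask in NODES:
        if not x[feature] <= threshold:   # false node: drop its left leaves
            leaves &= mask
    lowest = leaves & -leaves             # isolate the leftmost surviving leaf
    return lowest.bit_length() - 1


docs = [(0.2, 5.0), (0.9, 1.0), (0.9, 5.0), (0.5, 2.0)]
print([bitvector_exit_leaf(x) for x in docs])
```

Nodes whose test is true never touch the bitvector, and the nodes can be visited in any order, which is what allows a linear scan over arrays instead of pointer-chasing down each tree.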

![caption](images/HO1-lowlevelperf.png)

### Step 7

We developed a multi-threaded implementation of Vectorized QuickScorer that uses OpenMP to distribute blocks of documents to threads, which score them in parallel [QS-TPDS18].
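
The work partitioning can be pictured with a small Python sketch. It only mirrors the structure (split the documents into blocks, score blocks independently, keep the original order) and is not the OpenMP code: `score_block` is a stand-in scorer, and Python threads will not reproduce the speed-up of native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def score_block(block):
    # Stand-in scorer (an assumption of this sketch): the real one would
    # traverse the tree ensemble for every document in the block.
    return [sum(doc) for doc in block]


def parallel_score(docs, block_size=8, n_threads=4):
    """Score documents block by block on a pool of threads."""
    blocks = [docs[i:i + block_size] for i in range(0, len(docs), block_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in submission order, so scores stay aligned
        # with the input documents.
        scored = list(pool.map(score_block, blocks))
    return [score for block in scored for score in block]


docs = [[i, 2 * i] for i in range(100)]
scores = parallel_score(docs)
print(len(scores), scores[:3])
```

Blocks are independent because each document's score depends only on its own features, so no synchronization is needed beyond collecting the results.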

In [25]:
# setting up environment variables for OpenMP
os.environ['OMP_NUM_THREADS']='32'
os.environ['OMP_DISPLAY_ENV']='VERBOSE'
os.environ['OMP_SCHEDULE']='auto'
os.environ['GOMP_CPU_AFFINITY']='0-7,8-15'

In [26]:
# Now running Multi-threaded Vectorized QuickScorer.
# Options:
#  -h [ --help ]                     Print help messages.
#  -d [ --dataset ] arg              Path of the dataset to score (SVML format).
#  -r [ --rounds ] arg (=10)         Number of test repetitions.
#  -s [ --scores ] arg               Path of the file where final scores are
#                                    saved.
#  -t [ --tree_type ] arg (=0)       Specify the type of the tree in the
#                                    ensemble:
#                                     - 0 for normal trees,
#                                     - 1 for oblivious trees,
#                                     - 2 for normal trees (reversed blocked),
#                                     - 3 for normal trees (SIMD: SSE/AVX).
#  -m [ --model ] arg                Path of the XML file storing the model.
#  -l [ --nleaves ] arg              Maximum number of leaves in a tree (<= 64).
#  -v [ --doc_block_size ] arg (=1)  Document block size (allowed values:
#                                    1,2,4,8,16; 1 means no blocking).
#  --avx                             Use AVX 256 instructions (at least 8 doc
#                                    blocking).
#  --omp                             Use OpenMP multi-threading document scoring
#                                    (only SIMD: SSE/AVX).

scorer_out = !{QUICKSCORER} \
  -d {test_dataset_file} \
  -m {baseline_model_file} \
  -l 64 \
  -r 1 \
  -t 3 \
  -v 8 \
  --avx \
  --omp
    
print (scorer_out.n)
    
# takes the per-document scoring time (in seconds) and converts it to microseconds
scoring_time = float(scorer_out.l[-1].split()[-2])* 10**6


OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '32'
  OMP_SCHEDULE = 'AUTO'
  OMP_PROC_BIND = 'TRUE'
  OMP_PLACES = '{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}'
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END

      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by hpc.isti.cnr.it
    :Quick:Scorer:                                   quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
       Total scoring time: 0.848875 s.
Avg. Dataset scoring time: 0.848875 s.
Avg.    Doc. scoring time: 1.24606e-06 s.


In [27]:
# Store current results
results.loc[len(results)] = ['vQS-OMP', baseline_model.n_trees, scoring_time]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |
| 1 | VPred | 1492 | 103.0 |
| 2 | QS | 1492 | 23.1 |
| 3 | v-QS | 1492 | 15.1 |
| 4 | vQS-OMP | 1492 | 1.2 |


### Step 8

We developed a GPU implementation of QuickScorer [QS-TPDS18]. The test below runs on an NVIDIA Titan Xp card.

In [28]:
# Now running GPU QuickScorer.
# Options:
#   -h [ --help ]                     Print help messages
#   -d [ --dataset ] arg              Input dataset in SVML format
#   -r [ --rounds ] arg (=10)         Number of test repetitions
#   -s [ --scores ] arg               File where scores are saved
#   -w [ --warmup ] arg               Warmp dataset in SVML format used for
#                                     reversed trees
#   -m [ --model ] arg                File storing the model
#   -l [ --nleaves ] arg              Maximum number of leaves in a tree (<= 64)
#   -b [ --tree_block_size ] arg (=1) Tree block size (1 means no blocking)
#   -v [ --doc_block_size ] arg (=1)  Documents block size (allowed: 1,2,4,8,16;
#                                     1 means no blocking)
#   -y [ --cuda_threads ] arg (=256)  Number of threads per CUDA block (allowed:
#                                     96,128,192,256,384,512,768,1024; default:
#                                     256)
#   -z [ --cuda_blocks ] arg (=32768) Number CUDA blocks used by the scoring
#                                     kernel (default: 1024 * 32)

scorer_out = !{QUICKSCORER_GPU} \
    -m {baseline_model_file} \
    -t 1 -l 64 -r 10 -b 4000 -y 384 -z 16384 \
    -d {test_dataset_file}

print (scorer_out.n)

# takes the per-document scoring time (in seconds) and converts it to microseconds
scoring_time = float(scorer_out.l[-2].split()[-2])* 10**6


      _____  _____
     /    / /____
    /____\ _____/          QuickScorer has been developed by Tiscali SpA, CNR, Univ. of Pisa, Univ. of Venezia
    :Quick:Scorer:                                                                      quickscorer@isti.cnr.it

#	 Dataset size: 681250 x 220 (instances x features)
#	 Num queries: 6562 | Avg. len: 104
Using a GPU-based strategy!
GPU INFO: Mem free at START: 12600672256 (total: 12782075904)
Building the model...
Scoring...
GPU => # of trees 1492 (max per block: 4000)
GPU => size treeIDs + masks + thresholds: 1503936 bytes
GPU => size vec temporary offsets: 884 bytes
GPU => size tree score table: 763904 bytes
GPU => size doc result scores: 5450000 bytes
GPU => Size of a block of instances processed at once: 305040
GPU => Global memory reserved for the block of instances: 1073740800 bytes
GPU => Executing 16384 blocks, 384 threads per block
GPU => 32000 bytes shared mem allocated x CUDA-block
GPU => Processing docs [0-305040) (305040 docs)


In [29]:
# Store current results
results.loc[len(results)] = ['QS-GPU', baseline_model.n_trees, scoring_time]
results

| | Model | # Trees | Scoring Time µs. |
|---|---|---|---|
| 0 | CondOp | 1492 | 110.0 |
| 1 | VPred | 1492 | 103.0 |
| 2 | QS | 1492 | 23.1 |
| 3 | v-QS | 1492 | 15.1 |
| 4 | vQS-OMP | 1492 | 1.2 |
| 5 | QS-GPU | 1492 | 0.4 |
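
The per-document times are easier to compare as speed-ups over the CondOp baseline. The sketch below simply divides the baseline time by each method's time, using the rounded values from the table above; the dictionary is re-typed by hand for this example rather than read from the `results` DataFrame.

```python
# Per-document scoring time in microseconds, copied from the results table.
scoring_time_us = {
    "CondOp": 110.0,
    "VPred": 103.0,
    "QS": 23.1,
    "v-QS": 15.1,
    "vQS-OMP": 1.2,
    "QS-GPU": 0.4,
}

baseline = scoring_time_us["CondOp"]
speedup = {method: baseline / t for method, t in scoring_time_us.items()}

for method, factor in speedup.items():
    print("{:8s} {:6.1f}x".format(method, factor))
```

Since the table values are rounded to one decimal, the ratios for the fastest methods (vQS-OMP, QS-GPU) are only approximate.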


### Step 9

RankEval (http://rankeval.isti.cnr.it) provides multi-threaded scoring written in Cython.

In [30]:
from rankeval.dataset.datasets_fetcher import load_dataset

dataset_container = load_dataset(dataset_name='istella-sample',
                                download_if_missing=True, 
                                force_download=False, 
                                with_models=False)

Loading files. This may take a few minutes.
done loading dataset!


In [31]:
# We now use RankEval to score the test file.
scorer_out = %timeit -o baseline_model.score(dataset_container.test_dataset, False)

The slowest run took 39719.63 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 256 µs per loop


### References

[VPRED] Asadi et al. Runtime Optimizations for Tree-Based Machine Learning Models. IEEE Trans. Knowl. Data Eng. 26(9): 2281-2292 (2014).

[QS-SIGIR15] Lucchese et al. QuickScorer: A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees. ACM SIGIR 2015. Best Paper Award.

[QS-TOIS16] Dato et al. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM TOIS, Vol. 35, No. 2. Dec. 2016.

[QS-SIGIR16] Lucchese et al. Exploiting CPU SIMD Extensions to Speed-up Document Scoring with Tree Ensembles. ACM SIGIR 2016.

[QS-TPDS18] Lettich et al. Parallel Traversal of Large Ensembles of Decision Trees. IEEE TPDS. 2018.