# Statistical Analysis

**RankEval** provides the following statistical analysis tools: *i)* Fisher's randomization test for statistical significance, and *ii)* bias/variance decomposition of the error.

In [1]:
# import common libraries
%load_ext autoreload
%autoreload 2

import numpy as np
import math

## Statistical Significance

According to the work by *M.D. Smucker, J. Allan, B. Carterette, "A Comparison of Statistical Significance Tests for Information Retrieval Evaluation", CIKM 2007*, **Fisher's randomization test** is the most appropriate statistical test to evaluate whether two rankers differ significantly.

We first briefly describe the test. The null hypothesis is that the two given rankers A and B are identical: an underlying ranker R is asked to produce two rankings for each given query, and these two rankings are randomly labeled as ranker A or ranker B. The goal of the test is to measure the probability that the observed performance gap between rankers A and B is due to a random labeling.

Under the null hypothesis, every permutation of the labeling is equally probable. If we enumerate all the possible A-B labelings and measure the corresponding quality gap, we have that:
 - the *one-sided p-value* is the fraction of labelings for which the quality difference is larger than the originally observed difference;
 - the *two-sided p-value* is the fraction of labelings for which the *absolute* quality difference is larger than the originally observed absolute difference.

Since the number of possible labelings is exponential in the number of queries, enumerating them all is infeasible; in practice the p-values are estimated on a large number of random permutations.
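The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not RankEval's implementation: the function name `randomization_test`, its parameters, and the synthetic per-query scores are all assumptions made for the example. It relies on the fact that swapping the A/B labels of a query simply flips the sign of that query's score difference.

```python
import numpy as np

def randomization_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Monte Carlo approximation of Fisher's randomization test.

    scores_a, scores_b: per-query metric values of rankers A and B.
    Returns (one_sided_p, two_sided_p).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()  # observed mean gap between A and B

    # Each random labeling keeps or swaps the A/B labels of every query,
    # which is equivalent to multiplying the per-query gap by +1 or -1.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = (signs * diffs).mean(axis=1)

    # Fractions of labelings with a (strictly) larger gap, as defined above.
    one_sided = np.mean(permuted > observed)
    two_sided = np.mean(np.abs(permuted) > abs(observed))
    return one_sided, two_sided

# Synthetic per-query scores (hypothetical data, for illustration only).
rng = np.random.default_rng(42)
a = rng.uniform(0.3, 0.8, size=200)
b = a - 0.01 + rng.normal(0, 0.05, size=200)
p_one, p_two = randomization_test(a, b, n_perm=10000)
print("one-sided:", p_one, "two-sided:", p_two)
```

Because the labelings are sampled at random, the estimated p-values vary slightly from run to run unless the seed is fixed.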

##### Import RankEval statistical significance tools

In [2]:
from rankeval.model import RTEnsemble
from rankeval.dataset import Dataset
from rankeval.metrics import NDCG
from rankeval.metrics import Precision
from rankeval.analysis.statistical import statistical_significance

##### Load models and data from file

In [6]:
# files
dataset_file = "/home/rankeval/rankeval_data/msn/dataset/Fold1/test.txt"

qr_1K_file  = "/home/rankeval/rankeval_data/msn/models/Fold1/msn1.quickrank.LAMBDAMART.20000.32.T1000.xml"
qr_10K_file = "/home/rankeval/rankeval_data/msn/models/Fold1/msn1.quickrank.LAMBDAMART.20000.32.T10000.xml"
lgbm_1K_file   = "/home/rankeval/rankeval_data/msn/models/Fold1/msn1.lightgbm.LAMBDAMART.1000.32.T1000.model"

# load
qr_1K   = RTEnsemble(qr_1K_file, name="QuickRank.1k", format="QuickRank")
qr_10K  = RTEnsemble(qr_10K_file, name="QuickRank.10k", format="QuickRank")
lgbm_1K = RTEnsemble(lgbm_1K_file, name="LGBM.1k", format="LightGBM")

msn1 = Dataset.load(dataset_file, name="Msn - Fold 1")

##### Run the Fisher's Randomization test

The `statistical_significance` test between two rankers can be run on a list of datasets and for a list of IR quality metrics. The function returns both the one-sided and two-sided p-values.

We first compare the three models we loaded above. We can observe below that the QuickRank model with 10k trees performs worse than the QuickRank model with 1k trees: this is due to the overfitting of such a large model.

In [7]:
from rankeval.analysis.effectiveness import model_performance

ndcg_10 = NDCG(cutoff=10)

perf = model_performance(datasets=[msn1], 
                         models=[qr_1K, qr_10K, lgbm_1K], 
                         metrics=[ndcg_10])
perf.to_dataframe()

| dataset | model | metric | Model Performance |
|---|---|---|---|
| Msn - Fold 1 | QuickRank.1k | NDCG@10 | 0.52957 |
| Msn - Fold 1 | QuickRank.10k | NDCG@10 | 0.510248 |
| Msn - Fold 1 | LGBM.1k | NDCG@10 | 0.524908 |


We also observe that the QuickRank.1k model outperforms LGBM.1k by only a small margin (0.52957 vs. 0.524908 NDCG@10). We therefore measure whether this difference is statistically significant as follows.

In [8]:
stat_sig = statistical_significance(datasets=[msn1],
                                    model_a=qr_1K, model_b=lgbm_1K, 
                                    metrics=[ndcg_10],
                                    n_perm=100000 )
stat_sig.to_dataframe()

| dataset | metric | p-value | Statistical Significance |
|---|---|---|---|
| Msn - Fold 1 | NDCG@10 | one-sided | 0.01988 |
| Msn - Fold 1 | NDCG@10 | two-sided | 0.04066 |


We conclude that the difference is statistically significant at $p<0.05$. To conclude the analysis, we evaluate the performance of the two algorithms also with NDCG@50 and Precision@10.

In [12]:
ndcg_50 = NDCG(cutoff=50)
prec_10 = Precision(cutoff=10)

perf = model_performance(datasets=[msn1], 
                         models=[qr_1K, lgbm_1K], 
                         metrics=[ndcg_10, ndcg_50, prec_10])
perf.to_dataframe()

| dataset | model | metric | Model Performance |
|---|---|---|---|
| Msn - Fold 1 | QuickRank.1k | NDCG@10 | 0.52957 |
| Msn - Fold 1 | QuickRank.1k | NDCG@50 | 0.605428 |
| Msn - Fold 1 | QuickRank.1k | Precision@10[>=1] | 0.657644 |
| Msn - Fold 1 | LGBM.1k | NDCG@10 | 0.524908 |
| Msn - Fold 1 | LGBM.1k | NDCG@50 | 0.60048 |
| Msn - Fold 1 | LGBM.1k | Precision@10[>=1] | 0.655794 |


In [13]:
stat_sig = statistical_significance(datasets=[msn1],
                                    model_a=qr_1K, model_b=lgbm_1K, 
                                    metrics=[ndcg_10, ndcg_50, prec_10],
                                    n_perm=100000 )
stat_sig.to_dataframe()

| dataset | metric | p-value | Statistical Significance |
|---|---|---|---|
| Msn - Fold 1 | NDCG@10 | one-sided | 0.02118 |
| Msn - Fold 1 | NDCG@10 | two-sided | 0.04257 |
| Msn - Fold 1 | NDCG@50 | one-sided | 0.00014 |
| Msn - Fold 1 | NDCG@50 | two-sided | 0.00022 |
| Msn - Fold 1 | Precision@10[>=1] | one-sided | 0.23908 |
| Msn - Fold 1 | Precision@10[>=1] | two-sided | 0.47749 |


## Bias-Variance

The error of an algorithm $A$ can be decomposed as:
$$E(A) = Bias(A) + Variance(A) + Noise(A)$$
where:
 - Bias measures how far the average prediction of the model is from the true value
 - Variance measures how sensitive the prediction is to changes in the training set (high variance is a symptom of overfitting)
 - Noise is the irreducible error in the dataset (independent of the learner)

**RankEval** supports the computation of the bias vs. variance decomposition of the error.
The approach used is based on the works of [Webb05] and [Dom05]. As in other works, we hereinafter assume noise is absent.

RankEval allows decomposing the error according to a user-provided (IR) quality metric as follows.

Each instance of the dataset is scored *L* times.
A single scoring is achieved by splitting the dataset at random into
*k* folds. Each fold is scored by the model *M* trained with the algorithm $A$ on the remaining folds.
[Webb05] recommends the use of 2 folds.
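The repeated scoring scheme can be sketched as follows. This is an illustration under stated assumptions rather than RankEval's actual code: the helper `repeated_kfold_scores` and the toy `mean_label_algo` are hypothetical names, folds are assumed to be formed from whole queries, and `algo` is assumed to take `(train_X, train_Y, train_q, test_X)` and return one score per test document.

```python
import numpy as np

def repeated_kfold_scores(X, y, query_lens, algo, L=5, k=2, seed=0):
    """Score every document L times; returns an array of shape (num_docs, L)."""
    rng = np.random.default_rng(seed)
    query_lens = np.asarray(query_lens, dtype=int)
    n_queries = query_lens.size
    # Query q owns the rows starts[q]:starts[q + 1].
    starts = np.concatenate(([0], np.cumsum(query_lens)))
    scores = np.empty((X.shape[0], L))

    for rep in range(L):
        # Randomly assign whole queries to k folds (assumes n_queries >= k).
        folds = np.array_split(rng.permutation(n_queries), k)
        for test_queries in folds:
            held_out = set(test_queries.tolist())
            train_queries = [q for q in range(n_queries) if q not in held_out]
            train_rows = np.concatenate(
                [np.arange(starts[q], starts[q + 1]) for q in train_queries])
            test_rows = np.concatenate(
                [np.arange(starts[q], starts[q + 1]) for q in test_queries])
            # Train on the remaining folds, score the held-out fold.
            scores[test_rows, rep] = algo(
                X[train_rows], y[train_rows], query_lens[train_queries],
                X[test_rows])
    return scores

def mean_label_algo(train_X, train_Y, train_q, test_X):
    # Toy "learner": predicts the mean training label for every document.
    return np.full(test_X.shape[0], train_Y.mean())

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
S = repeated_kfold_scores(X, y, [3, 2, 4, 1], mean_label_algo, L=3, k=2)
print(S.shape)
```

Each column of the returned matrix holds one complete scoring of the dataset, so every instance ends up with *L* predictions produced by models that never saw it during training.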

If the metric used is Mean Squared Error, the standard decomposition is used.
The Bias for an instance *x* is defined as the squared error of the average prediction of the *L* trained models
w.r.t. the true label *y*, denoted with $[{\sf E}_{L} M(x) - y]^2$.
The Variance for an instance *x* is measured across the *L* trained models:
${\sf E}_{L} [M(x) - {\sf E}_{L} M(x)]^2$.
Both are averaged over all instances in the dataset.
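As a sketch of the MSE case, the snippet below computes the three quantities from a matrix holding the *L* predictions of each instance. It reads the bias as the squared error of the average prediction, $[{\sf E}_{L} M(x) - y]^2$, an interpretation under which error = bias + variance when noise is absent. The function name `mse_bias_variance` and the toy data are assumptions made for illustration, not part of RankEval.

```python
import numpy as np

def mse_bias_variance(scores, y):
    """scores: (num_instances, L) predictions; y: (num_instances,) true labels.

    Returns (error, bias, variance), each averaged over all instances.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_pred = scores.mean(axis=1)                          # E_L M(x)
    error = np.mean((scores - y[:, None]) ** 2)              # E_L [M(x) - y]^2
    bias = np.mean((mean_pred - y) ** 2)                     # [E_L M(x) - y]^2
    variance = np.mean((scores - mean_pred[:, None]) ** 2)   # E_L [M(x) - E_L M(x)]^2
    return error, bias, variance

# Two instances, each scored L=2 times (toy numbers).
error, bias, variance = mse_bias_variance([[1.0, 3.0], [2.0, 2.0]], [1.0, 2.0])
print(error, bias, variance)
```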

If the metric is any of the IR quality measures, we resort to the bias/variance
decomposition of the mean squared error of the given metric w.r.t. its ideal value,
e.g., for the case of NDCG, ${\sf E}_{L} [1 - {\sf NDCG}]^2$.
Note that a formal bias/variance decomposition for IR quality metrics has not been proposed yet.
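For the NDCG case, the squared-error decomposition can be written out explicitly. The notation here is introduced only for this illustration: $m_\ell$ is the metric value obtained in the $\ell$-th scoring and $\bar m = {\sf E}_{L}[m_\ell]$ is its average. Whether RankEval applies it per query in exactly this form is an assumption.

$$
{\sf E}_{L}\big[(1 - m_\ell)^2\big]
= \underbrace{(1 - \bar m)^2}_{\text{bias}}
+ \underbrace{{\sf E}_{L}\big[(m_\ell - \bar m)^2\big]}_{\text{variance}}
$$

The identity follows by writing $1 - m_\ell = (1 - \bar m) + (\bar m - m_\ell)$ and expanding the square: the cross term $2(1 - \bar m)\,{\sf E}_{L}[\bar m - m_\ell]$ vanishes because ${\sf E}_{L}[\bar m - m_\ell] = 0$.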

##### References
 - [Webb05] Webb, Geoffrey I., and Paul Conilione. "Estimating bias and variance from data." Pre-publication manuscript (2005).
 - [Dom05] Domingos, Pedro. "A unified bias-variance decomposition." In Proceedings of the 17th International Conference on Machine Learning (2000), pp. 231-238.

##### Load dataset and define metrics of interest

In [3]:
from rankeval.analysis.statistical import bias_variance

from rankeval.dataset import Dataset
from rankeval.metrics import NDCG

msn1 = Dataset.load("/home/rankeval/rankeval_data/msn/dataset/Fold1/test.txt", name="MSN - Fold 1")

ndcg_10 = NDCG(cutoff=10)

##### Define the algorithm whose bias/variance decomposition we want to measure

The Bias/Variance decomposition is a measure of a given algorithm with given parameters. Recall that RankEval needs to repeatedly train and evaluate models learnt by the given algorithm. To do so, we define a wrapper function to be used by RankEval with the following parameters:
 - `train_X`: numpy.ndarray storing a 2-D matrix of size num_docs x num_features
 - `train_Y`: numpy.ndarray storing a vector of document's relevance labels
 - `train_q`: numpy.ndarray storing a vector of query lengths
 - `test_X`: numpy.ndarray as for `train_X`
The wrapper function trains a new model on `train_X`, `train_Y`, `train_q`, which is then used to score `test_X`.
A `numpy.ndarray` with these scores is returned.
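To make the expected signature concrete, here is a toy wrapper that fits a pointwise least-squares linear model with NumPy. It is only a stand-in for a real learning-to-rank algorithm: the name `least_squares_wrapper` is hypothetical, and the query lengths in `train_q` are accepted but ignored because this baseline does not use query structure.

```python
import numpy as np

def least_squares_wrapper(train_X, train_Y, train_q, test_X):
    """Toy algorithm: linear regression on the relevance labels.

    train_q is part of the expected signature but is unused here.
    Returns a numpy.ndarray with one score per row of test_X.
    """
    # Append a constant column so the model has an intercept.
    A = np.hstack([train_X, np.ones((train_X.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(A, train_Y, rcond=None)
    B = np.hstack([test_X, np.ones((test_X.shape[0], 1))])
    return B @ weights

# Toy data: labels follow y = 2 * x + 1 exactly.
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_Y = np.array([1.0, 3.0, 5.0, 7.0])
train_q = np.array([2, 2])  # two queries with two documents each
test_scores = least_squares_wrapper(train_X, train_Y, train_q,
                                    np.array([[4.0], [5.0]]))
print(test_scores)
```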

In the example below we use LightGBM, for which we define two wrapper functions that train forests of 100 trees with either 16 (`lgbm_small_wrapper`) or 128 (`lgbm_large_wrapper`) leaves each.

In [4]:
import lightgbm

def lgbm_algo(trees, leaves, train_X, train_Y, train_q, test_X):
    # LambdaRank objective with the requested number of leaves per tree
    params = {'num_leaves': leaves, 'objective': 'lambdarank',
              'learning_rate': 0.01, 'max_bin': 1024}

    # train_q holds the query lengths used to group documents by query
    training = lightgbm.Dataset(data=train_X, label=train_Y, group=train_q)

    bst = lightgbm.train(params, training, num_boost_round=trees)

    # one score per test document
    return bst.predict(test_X)

def lgbm_small_wrapper(train_X, train_Y, train_q, test_X):
    # 100 trees with 16 leaves each
    return lgbm_algo(100, 16, train_X, train_Y, train_q, test_X)

def lgbm_large_wrapper(train_X, train_Y, train_q, test_X):
    # 100 trees with 128 leaves each
    return lgbm_algo(100, 128, train_X, train_Y, train_q, test_X)

##### Run the bias/variance decomposition

The function `bias_variance` returns a 3-tuple with: 
 - the average loss according to the given metric
 - the average bias
 - the average variance

Below we compute the bias/variance decomposition for the MSE and for NDCG@10.

In [5]:
small_mse = bias_variance(msn1, algo=lgbm_small_wrapper, metric="mse", L=5, k=2)
print("Small model - MSE decomposition:")
print("Error   :", small_mse[0])
print("Bias    :", small_mse[1])
print("Variance:", small_mse[2])

AttributeError: 'NoneType' object has no attribute 'value'

In [6]:
# NOTE: `progress_bar` is not defined in this section; it is assumed to be created in an earlier cell
large_mse = bias_variance(msn1, algo=lgbm_large_wrapper, metric="mse", L=5, k=2, progress_bar=progress_bar)
print("Large model - MSE decomposition:")
print("Error   :", large_mse[0])
print("Bias    :", large_mse[1])
print("Variance:", large_mse[2])

Large model - MSE decomposition:
Error   : 1.22659
Bias    : 1.22559
Variance: 0.00100088


In [19]:
small_ndcg = bias_variance(msn1, algo=lgbm_small_wrapper, metric=ndcg_10, L=5, k=2, progress_bar=progress_bar)
print("Small model - NDCG decomposition:")
print("Error   :", small_ndcg[0])
print("Bias    :", small_ndcg[1])
print("Variance:", small_ndcg[2])

Small model - NDCG decomposition:
Error   : 0.321833
Bias    : 0.318143
Variance: 0.00368976


In [21]:
large_ndcg = bias_variance(msn1, algo=lgbm_large_wrapper, metric=ndcg_10, L=5, k=2, progress_bar=progress_bar)
print("Large model - NDCG decomposition:")
print("Error   :", large_ndcg[0])
print("Bias    :", large_ndcg[1])
print("Variance:", large_ndcg[2])

Large model - NDCG decomposition:
Error   : 0.311195
Bias    : 0.305739
Variance: 0.00545653
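
Since noise is assumed to be absent, the reported error should equal bias plus variance up to the rounding of the printed values. The short check below uses only the numbers recorded in the outputs above; the dictionary layout and variable names are illustrative.

```python
# Values copied from the recorded outputs above: (error, bias, variance).
results = {
    "Large model - MSE":  (1.22659, 1.22559, 0.00100088),
    "Small model - NDCG": (0.321833, 0.318143, 0.00368976),
    "Large model - NDCG": (0.311195, 0.305739, 0.00545653),
}

gaps = {}
for name, (error, bias, variance) in results.items():
    # Residual of the decomposition and the share of error due to variance.
    gaps[name] = abs(error - (bias + variance))
    share = variance / error
    print(f"{name}: |error - (bias + variance)| = {gaps[name]:.1e}, "
          f"variance share = {share:.2%}")
```

On NDCG@10, the recorded numbers show the large model with a lower bias and a higher variance than the small one.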
