This homework is made by:
* Masoumeh Bakhtiariziabari (11813105)
* Marianne de Heer Kloots (11138351)
* Tharangni Sivaji (XXXXXXXX)

# Theoretical Part [15 pts]

## 1. Hypothesis Testing – The problem of multiple comparisons [5 points]
Experimentation in AI often happens like this: 
1. Modify/Build an algorithm
2. Compare the algorithm to a baseline by running a hypothesis test.
3. If not significant, go back to step 1
4. If significant, start writing a paper. 

Suppose this loop is run for m hypothesis tests, each with Type I error α. What are the following probabilities?
* P(m<sup>th</sup> experiment gives significant result | m experiments lacking power to reject H<sub>0</sub>)?
* P(at least one significant result | m experiments lacking power to reject H<sub>0</sub>)?

<div style="background-color: lightyellow">
<ol>
<li><ul>
        <li> $$
        P(m^{\text{th}}\text{ experiment gives significant result} \mid m \text{ experiments lacking power to reject } H_0) \\
        \approx P(m^{\text{th}}\text{ experiment gives significant result} \mid H_0 \text{ is true in all m experiments}) \\
        \text{(i.e. only the } m^{\text{th}} \text{ result is significant whereas } (m-1) \text{ results are not significant)} \\
        = \boldsymbol{((1 - \alpha)^{m-1})\cdot\alpha}
        $$<br><br>
        <li> $$
        P(\text{at least one significant result} \mid m \text{ experiments lacking power to reject } H_0)\\
        = 1 - P(\text{no significant result})\\
        = \boldsymbol{1 - (1 - \alpha)^m}
        $$
    </ul>
</ol>
</div>
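As a rough numerical illustration of the two expressions above, the sketch below evaluates them for a few values of m. It assumes the m tests are independent and uses α = 0.05, an illustrative value that the assignment does not fix; the function names are hypothetical.

```python
def p_first_significant_at(m, alpha):
    """P(tests 1..m-1 are not significant and test m is), with H0 true in every test."""
    return (1 - alpha) ** (m - 1) * alpha


def p_at_least_one_significant(m, alpha):
    """P(at least one of m independent tests is significant), with H0 true in every test."""
    return 1 - (1 - alpha) ** m


alpha = 0.05  # illustrative value, not given in the assignment
for n_tests in (1, 5, 10, 20):
    print(n_tests,
          round(p_first_significant_at(n_tests, alpha), 4),
          round(p_at_least_one_significant(n_tests, alpha), 4))

# Smallest m for which a false positive is more likely than not.
m = 1
while p_at_least_one_significant(m, alpha) <= 0.5:
    m += 1
print('first m with P(at least one significant) > 0.5:', m)
```

Summing the first expression over experiments 1 to m gives the second one, which is a quick consistency check on both formulas.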

## 2. Bias and unfairness in Interleaving experiments [10 points]
Balance interleaving has been shown to be biased in a number of corner cases. An example was given during the lecture with two ranked lists of length 3 being interleaved, and a randomly clicking population of users that resulted in algorithm A winning ⅔ of the time, even though in theory the percentage of wins should be 50% for both algorithms. Can you come up with a situation of two ranked lists of length 3 and a distribution of clicks over them for which Team-draft interleaving is unfair to the better algorithm?

<div style="background-color: lightyellow">
*answer here*
</div>

# Experimental Part [85 pts]
Commercial search engines use both offline and online approaches in evaluating a new search algorithm: they first use an offline test collection to compare the production algorithm (*P*) with the new experimental algorithm (*E*); if *E* statistically significantly outperforms *P* with respect to the evaluation measure of their interest, the two algorithms are then compared online through an interleaving experiment.

For the purpose of this homework we will assume that the evaluation measures of interest are:
1. Binary evaluation measures
    1. Precision at rank k,
    2. Recall at rank k,
    3. Average Precision,
2. Multi-graded evaluation measures
    1. Normalized Discounted Cumulative Gain at rank k (nDCG@k),
    2. Expected Reciprocal Rank (ERR).

Further, for the purpose of this homework we will assume that the interleaving algorithms of interest are:
1. Team-Draft Interleaving (Joachims. "Evaluating retrieval performance using clickthrough data". Text Mining 2003.),
2. Probabilistic Interleaving (Hofmann, Whiteson, and de Rijke. "A probabilistic method for inferring preferences from clicks." CIKM 2011.).
 
In an interleaving experiment the ranked results of *P* and *E* (against a user query) are interleaved into a single ranked list, which is presented to a user. The user then clicks on the results, and the algorithm that receives most of the clicks wins the comparison. The experiment is repeated a number of times (impressions) and the total wins for *P* and *E* are computed.

A Sign/Binomial Test is then run to examine whether the difference in wins between the two algorithms is statistically significant (or due to chance). Alternatively, one can calculate the proportion of times *E* wins and test whether this proportion, *p*, is greater than *p<sub>0</sub>* = 0.5. This is called a 1-sample, 1-sided proportion test.
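The one-sided test described above can be sketched with an exact binomial tail probability. This is only an illustration: the function name is hypothetical, and dropping tied impressions before testing is an assumption rather than something the assignment prescribes.

```python
from math import comb


def one_sided_sign_test(wins_e, wins_p, p0=0.5):
    """Exact p-value for H1: p > p0, i.e. P(X >= wins_e) with X ~ Binomial(n, p0).

    Assumption: tied impressions were dropped, so n = wins_e + wins_p.
    """
    n = wins_e + wins_p
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(wins_e, n + 1))


# Made-up win counts, purely for illustration.
print(one_sided_sign_test(60, 40))
print(one_sided_sign_test(50, 50))
```

A small p-value suggests that *E* wins more often than chance would explain; for a large number of impressions a normal approximation to the same test is also common.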

One of the key questions however is **whether offline evaluation and online evaluation outcomes agree with each other**. In this homework you will determine the degree of agreement between offline evaluation measures and interleaving outcomes by means of simulations. A similar analysis using actual online data can be found in Chapelle et al. “Large-Scale Validation and Analysis of Interleaved Search Evaluation”.

## <font color='purple'>[Based on Lecture 1]</font>
### Step 1: <font color='darkred'>Simulate Rankings of Relevance for *E* and *P* *(5 points)*</font>

In the first step you will generate pairs of rankings of relevance, for the production *P* and experimental *E*, respectively, for a hypothetical query **q**. Assume a 3-graded relevance, i.e. `{N, R, HR}`. Construct all possible *P* and *E* ranking pairs of length 5. This step should give you about.

Example:

    P: {N N N N N}
    E: {N N N N R}
    …
    P: {HR HR HR HR R}
    E: {HR HR HR HR HR}

(Note 1: If you do not have enough computational power, sample 5000 pairs uniformly at random to show your work.)

In [2]:
from itertools import product
from pprint import pprint
import random

# define collections of algorithms and relevance grades
algorithms = ['P', 'E']
relevance_grades = ['N', 'HR', 'R']

# all possible ranking sequences of length 5:
# [('N', 'N', 'N', 'N', 'N'), ..., ('R', 'R', 'R', 'R', 'R')]
rankings = list(product(relevance_grades, repeat=5))

# all algorithms paired with all rankings
# (list of lists with elements e.g. ('P', ('HR', 'HR', 'HR', 'HR', 'HR')));
# built explicitly so that multi-character algorithm names would also work
algorithm_rankings = [[(alg, ranking) for ranking in rankings] for alg in algorithms]

# all possible pairs of P and E with their rankings
all_ranking_pairs = [pair for pair in product(*algorithm_rankings)]

# all ranking pairs except equals
ranking_pairs = [pair for pair in product(*algorithm_rankings) if pair[0][1] != pair[1][1]]

# pretty print
print('number of combinations:', len(all_ranking_pairs))
print('number of non-equal combinations:', len(ranking_pairs))
pprint(random.sample(ranking_pairs, 10))

number of combinations: 59049
number of non-equal combinations: 58806
[(('P', ('HR', 'N', 'HR', 'R', 'R')), ('E', ('HR', 'N', 'R', 'HR', 'R'))),
 (('P', ('N', 'R', 'HR', 'N', 'N')), ('E', ('N', 'HR', 'HR', 'N', 'N'))),
 (('P', ('N', 'HR', 'N', 'N', 'R')), ('E', ('HR', 'HR', 'R', 'HR', 'HR'))),
 (('P', ('N', 'HR', 'HR', 'R', 'HR')), ('E', ('HR', 'HR', 'N', 'HR', 'HR'))),
 (('P', ('HR', 'R', 'N', 'R', 'R')), ('E', ('R', 'N', 'R', 'HR', 'N'))),
 (('P', ('R', 'R', 'R', 'HR', 'HR')), ('E', ('R', 'N', 'HR', 'HR', 'R'))),
 (('P', ('R', 'HR', 'N', 'N', 'HR')), ('E', ('R', 'N', 'R', 'N', 'N'))),
 (('P', ('HR', 'R', 'N', 'N', 'R')), ('E', ('HR', 'R', 'R', 'N', 'N'))),
 (('P', ('R', 'R', 'R', 'N', 'R')), ('E', ('HR', 'R', 'HR', 'R', 'HR'))),
 (('P', ('R', 'N', 'R', 'R', 'HR')), ('E', ('R', 'R', 'HR', 'HR', 'N')))]


<div style="background-color: lightyellow">
<p>We find 59049 different ranking pairs. This makes sense: each ranking pair consists of 10 relevance values (5 produced by each algorithm), all of which can take on any of the 3 grades (N, R, HR). So there should be 3<sup>10</sup> = 59049 different combinations, which matches our finding.</p>

<p>We then exclude all ranking pairs which have exactly the same relevance grade sequences, which leaves us with 58806 different combinations.</p>

<p>We have printed a randomly selected sample of 10 ranking pairs as an example.</p>
</div>

### Step 2: <font color='darkred'>Implement Evaluation Measures *(10 points)*</font>
Implement 1 binary and 2 multi-graded evaluation measures out of the measures mentioned above.

(Note 2: Some of the aforementioned measures require the total number of relevant and highly relevant documents in the entire collection. Pay extra attention to how you find this.)
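For reference, the textbook forms of two of the binary measures can be sketched as follows. This is a minimal sketch that treats both `R` and `HR` as relevant; the function names are hypothetical, and `total_relevant` (the number of relevant documents in the whole collection, see Note 2) is an input that has to be assumed because the simulated rankings do not determine it.

```python
RELEVANT = {'R', 'HR'}  # binary view of the 3-graded scale


def precision_at_k(ranking, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(grade in RELEVANT for grade in ranking[:k]) / k


def average_precision(ranking, total_relevant):
    """Sum of P@rank over the ranks of relevant documents, divided by total_relevant.

    total_relevant is an assumed collection-level count, not derived from the ranking.
    """
    hits = 0
    score = 0.0
    for rank, grade in enumerate(ranking, start=1):
        if grade in RELEVANT:
            hits += 1
            score += hits / rank
    return score / total_relevant if total_relevant else 0.0


example = ('HR', 'N', 'R', 'N', 'N')
print(precision_at_k(example, 3))
print(average_precision(example, total_relevant=2))
```

Dividing by `total_relevant` rather than by the number of relevant documents retrieved is what penalizes a ranking for relevant documents it never returns.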

In [3]:
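# "precision at rank k" as computed in this notebook: for every relevant (R or HR)
# document in the top k, add the running precision at its rank, then divide the sum by k.
# (Standard P@k is the number of relevant documents in the top k, divided by k.)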
def precision(k,_ranking_pairs):
    
    def calc_precision(i,x):
        rel_counter = 0.0
        prec = 0.0
        for j in range(k):
            if _ranking_pairs[i][x][1][j] == 'HR' or _ranking_pairs[i][x][1][j] == 'R':
                rel_counter += 1
                prec += rel_counter/(1.0+j)
        return prec
    
    prec_list = []
    for i in range(len(_ranking_pairs)):
        p_prec = calc_precision(i,0)
        e_prec = calc_precision(i,1)
        prec_list.append((p_prec/k,e_prec/k))
        #print(p_prec,e_prec)
    
    return prec_list

# results for k = 3
prec = precision(3,ranking_pairs)

random_query_index_list = random.sample(range(len(ranking_pairs)), 10)
for i in random_query_index_list:
    print("query", i, ": \t", prec[i])

query 796 : 	 (0.0, 0.38888888888888884)
query 48467 : 	 (1.0, 0.38888888888888884)
query 55823 : 	 (1.0, 0.3333333333333333)
query 33520 : 	 (0.6666666666666666, 1.0)
query 4130 : 	 (0.1111111111111111, 0.1111111111111111)
query 31832 : 	 (1.0, 1.0)
query 50725 : 	 (1.0, 1.0)
query 7356 : 	 (0.16666666666666666, 0.5555555555555555)
query 28468 : 	 (1.0, 1.0)
query 3511 : 	 (0.1111111111111111, 1.0)


In [4]:
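# "average precision" as computed in this notebook: the mean of precision(k, ...) over k = 1..k_max.
# (Standard AP averages P@k over the ranks of the relevant documents only.)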
import numpy as np

def average_prec(k_max,_ranking_pairs):
    temp_list = []
    ap_list = [[0,0] for i in range(len(_ranking_pairs))]
    for k in range(1,k_max+1):
        temp_list = precision(k,_ranking_pairs)
        #print(k,temp_list)
        for index,item in enumerate(temp_list):
            ap_list[index] = (np.array(ap_list[index])+np.array(item)).tolist()
    #print(ap_list)
    for item in ap_list:
        item[:] = [i / k_max for i in item]
    return ap_list 

k_max = 5
avg_prec = average_prec(k_max,ranking_pairs)

for i in random_query_index_list:
    print("query ",i,": \t",ranking_pairs[i][0], "\t",'%.3f'%avg_prec[i][0],'\t',ranking_pairs[i][1],'\t','%.3f'%avg_prec[i][1])

query  37361 : 	 ('P', ('HR', 'R', 'R', 'N', 'HR')) 	 0.902 	 ('E', ('HR', 'N', 'HR', 'HR', 'N')) 	 0.629
query  34992 : 	 ('P', ('HR', 'R', 'HR', 'N', 'N')) 	 0.870 	 ('E', ('HR', 'R', 'HR', 'N', 'HR')) 	 0.902
query  42315 : 	 ('P', ('R', 'N', 'HR', 'HR', 'N')) 	 0.629 	 ('E', ('R', 'HR', 'R', 'N', 'HR')) 	 0.902
query  38297 : 	 ('P', ('HR', 'R', 'R', 'HR', 'R')) 	 1.000 	 ('E', ('N', 'R', 'N', 'R', 'HR')) 	 0.197
query  58617 : 	 ('P', ('R', 'R', 'R', 'R', 'R')) 	 1.000 	 ('E', ('N', 'HR', 'R', 'R', 'R')) 	 0.332
query  27042 : 	 ('P', ('HR', 'HR', 'N', 'HR', 'N')) 	 0.781 	 ('E', ('R', 'N', 'R', 'N', 'HR')) 	 0.585
query  57705 : 	 ('P', ('R', 'R', 'R', 'HR', 'HR')) 	 1.000 	 ('E', ('HR', 'HR', 'N', 'N', 'HR')) 	 0.737
query  10752 : 	 ('P', ('N', 'HR', 'HR', 'R', 'R')) 	 0.332 	 ('E', ('HR', 'N', 'R', 'R', 'N')) 	 0.629
query  6957 : 	 ('P', ('N', 'HR', 'N', 'N', 'HR')) 	 0.144 	 ('E', ('R', 'N', 'R', 'N', 'R')) 	 0.585
query  27007 : 	 ('P', ('HR', 'HR', 'N', 'HR', 'N')) 	 0.781

In [16]:
# nDCG
import math

# Ideal DCG at rank k, assuming a ground truth of [10 HR, 10 R, 10 N] documents.
def calc_ideal_dcg(k):
    idcg_rel = [2]*10 + [1]*10 + [0]*10
    idcg = 0
    for index in range(min(k, len(idcg_rel))):
        idcg += ((2**idcg_rel[index]) - 1)/(math.log2(2+index))
    return idcg

# DCG uses the exponential-gain form, sum over ranks i of (2^rel_i - 1) / log2(i + 1), i.e. the
# second formula on slide 6 of http://www.cs.cornell.edu/courses/cs4300/2013fa/lectures/evaluation-1-4pp.pdf
# nDCG normalizes the DCG at rank n by the DCG at rank n of the ideal ranking, which makes the
# measure comparable across independent queries
# (http://curtis.ml.cmu.edu/w/courses/index.php/Normalized_discounted_cumulative_gain).

def ndcg(max_k, query_index, method):
    """
    Return the cumulative DCG@1 ... DCG@max_k of one ranking, each divided by the
    ideal DCG@max_k. Only the last entry is therefore nDCG@max_k.
    method: 0 for P, 1 for E.
    """
    all_dcg_k = []
    dcg = 0.0

    #--------------------make dcg list for all k-----------------------------
    for r in range(0, max_k):
        grade = ranking_pairs[query_index][method][1][r]
        rel_num = 2 if grade == 'HR' else 1 if grade == 'R' else 0
        dcg += ((2**rel_num) - 1)/(math.log2(2+r))
        all_dcg_k.append(dcg)

    #----------------calculate ideal dcg------------------------------------
    idcg = calc_ideal_dcg(max_k)

    #--------------------------convert dcg to ndcg--------------------------
    if idcg == 0:
        all_dcg_k[:] = [0 for x in all_dcg_k]
    else:
        all_dcg_k[:] = [x / idcg for x in all_dcg_k]

    return all_dcg_k

#-----------------------call ndcg function for all 59000 queries------------------
ndcg_list = []
max_k = 5
for i in range(len(ranking_pairs)):
    p_dcg = ndcg(max_k,i,0)
    e_dcg = ndcg(max_k,i,1)
    ndcg_list.append((p_dcg[-1],e_dcg[-1]))
    
for index in random_query_index_list:
    print("query %d"%index,":", ranking_pairs[index][0], "\t", '%.3f'%ndcg_list[index][0], ranking_pairs[index][1], "\t", '%.3f'%ndcg_list[index][1])

query 37361 : ('P', ('HR', 'R', 'R', 'N', 'HR')) 	 0.598 ('E', ('HR', 'N', 'HR', 'HR', 'N')) 	 0.655
query 34992 : ('P', ('HR', 'R', 'HR', 'N', 'N')) 	 0.580 ('E', ('HR', 'R', 'HR', 'N', 'HR')) 	 0.711
query 42315 : ('P', ('R', 'N', 'HR', 'HR', 'N')) 	 0.429 ('E', ('R', 'HR', 'R', 'N', 'HR')) 	 0.515
query 38297 : ('P', ('HR', 'R', 'R', 'HR', 'R')) 	 0.657 ('E', ('N', 'R', 'N', 'R', 'HR')) 	 0.251
query 58617 : ('P', ('R', 'R', 'R', 'R', 'R')) 	 0.333 ('E', ('N', 'HR', 'R', 'R', 'R')) 	 0.363
query 27042 : ('P', ('HR', 'HR', 'N', 'HR', 'N')) 	 0.699 ('E', ('R', 'N', 'R', 'N', 'HR')) 	 0.301
query 57705 : ('P', ('R', 'R', 'R', 'HR', 'HR')) 	 0.518 ('E', ('HR', 'HR', 'N', 'N', 'HR')) 	 0.684
query 10752 : ('P', ('N', 'HR', 'HR', 'R', 'R')) 	 0.476 ('E', ('HR', 'N', 'R', 'R', 'N')) 	 0.444
query 6957 : ('P', ('N', 'HR', 'N', 'N', 'HR')) 	 0.345 ('E', ('R', 'N', 'R', 'N', 'R')) 	 0.213
query 27007 : ('P', ('HR', 'HR', 'N', 'HR', 'N')) 	 0.699 ('E', ('HR', 'R', 'HR', 'N', 'R')) 	 0.624


In [5]:
# ERR
import numpy as np

def numerical(relevance_sequence):
    """
    Convert a relevance grade sequence to a sequence of numerical 
    values, based on the relevance grades given.
    E.g. ['HR', 'HR', 'HR', 'R', 'N'] returns [2, 2, 2, 1, 0]
    """
    numerical_relevance_sequence = [1 if grade == 'R' \
                                    else 2 if grade == 'HR' else 0 \
                                    for grade in relevance_sequence]
    return numerical_relevance_sequence

def R_function(g, g_max):
    return (2**g - 1)/(2**g_max)

def ERR(relevance_sequence):
    """
    Compute the ERR based on Algorithm 2 in 
    https://pdfs.semanticscholar.org/7e3c/f6492128f915112ca01dcb77c766129e65cb.pdf
    """
    p = 1
    ERR = 0
    n = len(relevance_sequence)
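    # g_max is taken from this ranking alone; the ERR definition uses the highest grade
    # on the relevance scale (2 for {N, R, HR}), so rankings without an HR score differently.
    g_max = max(relevance_sequence)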
    
    for r in range(1, n + 1):
        g = relevance_sequence[r - 1]
        R = R_function(g, g_max)
        ERR = ERR + p * (R/r)
        p = p * (1 - R)
    return ERR

ERR_list = []
for P, E in ranking_pairs:
    ERR_P = ERR(numerical(P[1]))
    ERR_E = ERR(numerical(E[1]))
    ERR_list.append((ERR_P, ERR_E))
    
print('mean of all ERR values for P & E (should be equal):\n', 
      np.mean(ERR_list, axis=0))
print('mean of a random sample of 100 pairs:\n', 
      np.mean(random.sample(ERR_list, 100), axis=0))
for i in random_query_index_list:
    print("query ",i,": \t",ranking_pairs[i][0], "\t",'%.3f'%ERR_list[i][0],'\t',ranking_pairs[i][1],'\t','%.3f'%ERR_list[i][1])

mean of all ERR values for P & E (should be equal):
 [ 0.55613292  0.55613292]
mean of a random sample of 100 pairs:
 [ 0.5534362   0.55497852]
query  37361 : 	 ('P', ('HR', 'R', 'R', 'N', 'HR')) 	 0.818 	 ('E', ('HR', 'N', 'HR', 'HR', 'N')) 	 0.824
query  34992 : 	 ('P', ('HR', 'R', 'HR', 'N', 'N')) 	 0.828 	 ('E', ('HR', 'R', 'HR', 'N', 'HR')) 	 0.835
query  42315 : 	 ('P', ('R', 'N', 'HR', 'HR', 'N')) 	 0.473 	 ('E', ('R', 'HR', 'R', 'N', 'HR')) 	 0.568
query  38297 : 	 ('P', ('HR', 'R', 'R', 'HR', 'R')) 	 0.825 	 ('E', ('N', 'R', 'N', 'R', 'HR')) 	 0.256
query  58617 : 	 ('P', ('R', 'R', 'R', 'R', 'R')) 	 0.689 	 ('E', ('N', 'HR', 'R', 'R', 'R')) 	 0.415
query  27042 : 	 ('P', ('HR', 'HR', 'N', 'HR', 'N')) 	 0.855 	 ('E', ('R', 'N', 'R', 'N', 'HR')) 	 0.397
query  57705 : 	 ('P', ('R', 'R', 'R', 'HR', 'HR')) 	 0.486 	 ('E', ('HR', 'HR', 'N', 'N', 'HR')) 	 0.853
query  10752 : 	 ('P', ('N', 'HR', 'HR', 'R', 'R')) 	 0.444 	 ('E', ('HR', 'N', 'R', 'R', 'N')) 	 0.783
query  6957 : 	 ('

<div style="background-color: lightyellow">
explanation/analysis?
</div>

### Step 3: <font color='darkred'>Calculate the 𝛥measure *(0 points)*</font>
For the three measures and all *P* and *E* ranking pairs constructed above calculate the difference: 𝛥measure = measure<sub>E</sub>-measure<sub>P</sub>. Consider only those pairs for which *E* outperforms *P*.


In [21]:
# code
diff_precision = [precision_E - precision_P for precision_P, precision_E in prec if precision_E > precision_P]
diff_avgprec = [avgprec_E - avgprec_P for avgprec_P, avgprec_E in avg_prec if avgprec_E > avgprec_P]
diff_nDCG = [nDCG_E - nDCG_P for nDCG_P, nDCG_E in ndcg_list if nDCG_E > nDCG_P]
diff_ERR = [ERR_E - ERR_P for ERR_P, ERR_E in ERR_list if ERR_E > ERR_P]
print(len(diff_precision), len(diff_avgprec), len(diff_nDCG), len(diff_ERR))

24462 27962 29376 29374


<div style="background-color: lightyellow">
explanation/analysis?
</div>

## <font color='purple'>[Based on Lecture 2]</font>
### Step 4: <font color='darkred'>Implement Interleaving *(15 points)*</font>
Implement 2 interleaving algorithms: (1) Team-Draft Interleaving OR Balanced Interleaving, AND (2) Probabilistic Interleaving. The interleaving algorithms should (a) given two rankings of relevance, interleave them into a single ranking, and (b) given the user's clicks on the interleaved ranking, assign credit to the algorithms that produced the rankings.

(Note 4: As opposed to a normal interleaving experiment, where rankings consist of URLs or docids, in our case the rankings consist of relevance labels. Hence in this case (a) you will assume that E and P return different documents, and (b) the interleaved ranking will also be a ranking of labels.)
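One possible shape for the team-draft variant is sketched below. It relies on the assumption from Note 4 that *E* and *P* return different documents, so each team simply contributes its next unused label. The function names, the `length` cut-off, and the `'tie'` outcome are illustrative choices, not part of the assignment.

```python
import random


def team_draft_interleave(ranking_p, ranking_e, length, rng):
    """Interleave two label rankings; returns a list of (label, team) tuples."""
    pools = {'P': list(ranking_p), 'E': list(ranking_e)}
    picks = {'P': 0, 'E': 0}  # how many documents each team has contributed
    interleaved = []
    while len(interleaved) < length and any(picks[t] < len(pools[t]) for t in pools):
        if picks['P'] == picks['E']:
            team = rng.choice(['P', 'E'])  # coin flip decides who picks first this round
        else:
            team = 'P' if picks['P'] < picks['E'] else 'E'  # the other team catches up
        if picks[team] >= len(pools[team]):  # chosen team has no documents left
            team = 'E' if team == 'P' else 'P'
        interleaved.append((pools[team][picks[team]], team))
        picks[team] += 1
    return interleaved


def assign_credit(interleaved, clicked_positions):
    """Return 'E', 'P' or 'tie', depending on which team received more clicks."""
    clicks = {'P': 0, 'E': 0}
    for position in clicked_positions:
        clicks[interleaved[position][1]] += 1
    if clicks['E'] == clicks['P']:
        return 'tie'
    return 'E' if clicks['E'] > clicks['P'] else 'P'


rng = random.Random(0)
result = team_draft_interleave(('N', 'R', 'N', 'N', 'N'),
                               ('HR', 'HR', 'R', 'N', 'N'), length=6, rng=rng)
print(result)
print(assign_credit(result, clicked_positions=[0, 1]))
```

Keeping the team next to each label is what makes credit assignment possible later, since the labels alone do not identify which algorithm contributed a document.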


In [None]:
# code

<div style="background-color: lightyellow">
explanation/analysis?
</div>

## <font color='purple'>[Based on Lecture 3]</font>
### Step 5: <font color='darkred'>Implement User Clicks Simulation *(15 points)*</font>
Having interleaved all the ranking pairs, an online experiment could be run. However, given that we do not have any users (and the entire homework is a big simulation), we will simulate user clicks.

We have considered a number of click models including:
1. Random Click Model (RCM)
2. Position-Based Model (PBM)
3. Simple Dependent Click Model (SDCM)
4. Simple Dynamic Bayesian Network (SDBN)

Consider two different click models, (a) the Random Click Model (RCM), and (b) one out of the remaining 3 aforementioned models. The parameters of some of these models can be estimated using the Maximum Likelihood Estimation (MLE) method, while others require using the Expectation-Maximization (EM) method. Implement the two models so that (a) there is a method that learns the parameters of the model given a set of training data, (b) there is a method that predicts the click probability given a ranked list of relevance labels, (c) there is a method that decides - stochastically - whether a document is clicked based on these probabilities.

Having implemented the two click models, estimate the model parameters using the Yandex Click Log [[file]](https://drive.google.com/file/d/1tqMptjHvAisN1CJ35oCEZ9_lb0cEJwV0/view).

(Note 6: Do not learn the attractiveness parameter *a*<sub>uq</sub>)
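The three required methods can be illustrated on the Random Click Model, whose single parameter ρ has a closed-form maximum likelihood estimate (total clicks divided by total documents shown). The sketch below is only an outline: the class and method names are hypothetical, and the training data is assumed to have already been converted into one 0/1 click vector per result page.

```python
import random


class RandomClickModel:
    """RCM: every document is clicked with the same probability rho."""

    def __init__(self):
        self.rho = 0.0

    def fit(self, click_vectors):
        """MLE of rho from a list of 0/1 click vectors (assumed input format)."""
        shown = sum(len(vector) for vector in click_vectors)
        clicked = sum(sum(vector) for vector in click_vectors)
        self.rho = clicked / shown if shown else 0.0
        return self

    def click_probabilities(self, ranking):
        """Click probability per position; RCM ignores both label and rank."""
        return [self.rho] * len(ranking)

    def simulate_clicks(self, ranking, rng):
        """Stochastically decide, per position, whether the document is clicked."""
        return [rng.random() < p for p in self.click_probabilities(ranking)]


# Made-up training data, purely for illustration.
model = RandomClickModel().fit([[1, 0, 0, 0], [0, 0, 1, 1]])
print(model.rho)
print(model.simulate_clicks(('HR', 'N', 'R', 'N', 'N'), random.Random(0)))
```

A position-based or cascade-style model would keep the same three-method interface and differ only in how the per-position probabilities are estimated and combined.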

<div style="background-color: lightyellow">
explanation/analysis?
</div>

### Step 6: <font color='darkred'>Simulate Interleaving Experiment *(10 points)*</font>
Having implemented the click models, it is time to run the simulated experiment.

For each interleaved ranking, run N simulations for each of the implemented click models and measure the proportion *p* of wins for *E*.

(Note 7: Some of the models above include an attractiveness parameter *a*<sub>uq</sub>. Use the relevance label to assign this parameter by setting *a*<sub>uq</sub> for a document u in the ranked list accordingly. See [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf).)
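The simulation loop itself can be kept independent of the interleaving method and click model. In the sketch below, `simulate_impression` is a hypothetical callable that interleaves one ranking pair, samples clicks, and returns `'E'`, `'P'` or `'tie'`; dropping ties from the denominator and returning 0.5 when every impression is tied are assumptions, not requirements of the assignment.

```python
import random


def proportion_of_e_wins(simulate_impression, n_impressions, rng):
    """Run n_impressions simulated impressions and return E's share of decided ones."""
    wins = {'E': 0, 'P': 0, 'tie': 0}
    for _ in range(n_impressions):
        wins[simulate_impression(rng)] += 1
    decided = wins['E'] + wins['P']  # assumption: ties are ignored
    return wins['E'] / decided if decided else 0.5


def toy_impression(rng):
    """Hypothetical stand-in for 'interleave, simulate clicks, assign credit'."""
    return rng.choice(['E', 'E', 'P', 'tie'])


p = proportion_of_e_wins(toy_impression, 1000, random.Random(42))
print(p)
```

The resulting *p* per ranking pair is the quantity that can later be compared against the corresponding 𝛥measure.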


<div style="background-color: lightyellow">
explanation/analysis?
</div>

### Step 7: <font color='darkred'>Results and Analysis *(30 points)*</font>
Compare the results of the offline experiments (i.e. the values of the 𝛥measure) with the results of the online experiment (i.e. proportion of wins), analyze them and reach your conclusions regarding their agreement.
* Use easy to read and comprehend visuals to demonstrate the results;
* Analyze the results on the basis of
    * the evaluation measure used,
    * the interleaving method used,
    * the click model used.
* Report and ground your conclusions.

(Note 8: This is the place where you need to demonstrate your deeper understanding of what you have implemented so far; hence the large number of points assigned. Make sure you clearly do that so that the examiner of your work can grade it accordingly.)

In [34]:
from scipy import stats
from itertools import combinations
np.random.seed(123)

diff_measure_lists = {'precision': diff_precision, 'average precision': diff_avgprec, 
                      'nDCG': diff_nDCG, 'ERR': diff_ERR}

for measure_1, measure_2 in combinations(diff_measure_lists.keys(), 2):
    print(measure_1, 'vs.', measure_2)
    measure_1 = diff_measure_lists[measure_1]
    measure_2 = diff_measure_lists[measure_2]
    print(np.mean(measure_1), np.mean(measure_2), '\t', stats.ttest_ind(measure_1, measure_2), '\n')

average precision vs. precision
0.385633681425 0.440397350993 	 Ttest_indResult(statistic=-24.810331399370998, pvalue=4.1906255886023189e-135) 

average precision vs. ERR
0.385633681425 0.252210726855 	 Ttest_indResult(statistic=71.148169032468587, pvalue=0.0) 

average precision vs. nDCG
0.385633681425 0.227499954393 	 Ttest_indResult(statistic=87.264614427180462, pvalue=0.0) 

precision vs. ERR
0.440397350993 0.252210726855 	 Ttest_indResult(statistic=102.07668528704501, pvalue=0.0) 

precision vs. nDCG
0.440397350993 0.227499954393 	 Ttest_indResult(statistic=120.28697419447488, pvalue=0.0) 

ERR vs. nDCG
0.252210726855 0.227499954393 	 Ttest_indResult(statistic=17.076224869103331, pvalue=3.2076576398410232e-65) 



<div style="background-color: lightyellow">
explanation/analysis?
</div>

<u>Yandex Click Log File</u>:

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts. The logs are a set of rows, where each row represents one of the possible user actions: query or click.

In the case of a Query:

    SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs


In the case of a Click:

    SessionID TimePassed TypeOfAction URLID


* `SessionID` - the unique identifier of the user session.
* `TimePassed` - the time elapsed since the beginning of the current session in standard time units.
* `TypeOfAction` - type of user action. This may be either a query (Q), or a click (C).
* `QueryID` - the unique identifier of the request.
* `RegionID` - the unique identifier of the country from which a given query was issued. This identifier may take four values.
* `URLID` - the unique identifier of the document.
* `ListOfURLs` - the list of documents, ordered from left to right as they were shown to users on the Yandex results page (top to bottom).
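The row formats above could be read along the following lines. This is a sketch under stated assumptions: fields are taken to be whitespace-separated, all IDs and `TimePassed` are taken to be integers, and a click is attributed to the most recent query of the same session. The function names and the two sample rows are made up for illustration.

```python
def parse_log_line(line):
    """Parse one log row into a dict; the 'action' key is 'Q' or 'C'."""
    fields = line.split()  # assumption: whitespace-separated fields
    record = {'session_id': int(fields[0]),
              'time_passed': int(fields[1]),
              'action': fields[2]}
    if record['action'] == 'Q':
        record['query_id'] = int(fields[3])
        record['region_id'] = int(fields[4])
        record['urls'] = [int(url) for url in fields[5:]]
    elif record['action'] == 'C':
        record['url_id'] = int(fields[3])
    else:
        raise ValueError('unknown action type: ' + record['action'])
    return record


def click_vectors(lines):
    """Return one (urls, clicks) pair per query row, with clicks as a 0/1 list."""
    queries = []     # (urls, clicks) in log order
    last_query = {}  # session_id -> index of that session's latest query
    for line in lines:
        record = parse_log_line(line)
        if record['action'] == 'Q':
            queries.append((record['urls'], [0] * len(record['urls'])))
            last_query[record['session_id']] = len(queries) - 1
        else:
            index = last_query.get(record['session_id'])
            if index is not None and record['url_id'] in queries[index][0]:
                urls, clicks = queries[index]
                clicks[urls.index(record['url_id'])] = 1
    return queries


sample_rows = ['0 0 Q 10 1 7 8 9',  # made-up query row
               '0 12 C 8']          # made-up click row
print(click_vectors(sample_rows))
```

Per-query click vectors of this kind are a convenient input for estimating click model parameters.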