This homework was made by:
* Masoumeh Bakhtiariziabari (11813105)
* Marianne de Heer Kloots (11138351)
* Tharangni Sivaji (11611065)

# Theoretical Part [15 pts]

## 1. Hypothesis Testing – The problem of multiple comparisons [5 points]
Experimentation in AI often happens like this: 
1. Modify/Build an algorithm
2. Compare the algorithm to a baseline by running a hypothesis test.
3. If not significant, go back to step 1
4. If significant, start writing a paper. 

How many hypothesis tests, m, does it take to get to (with Type I error for each test = α):
* P(m<sup>th</sup> experiment gives significant result | m experiments lacking power to reject H<sub>0</sub>)?
* P(at least one significant result | m experiments lacking power to reject H<sub>0</sub>)?

## 2. Bias and unfairness in Interleaving experiments [10 points]
Balanced interleaving has been shown to be biased in a number of corner cases. An example was given during the lecture with two ranked lists of length 3 being interleaved, and a randomly clicking population of users that resulted in algorithm A winning ⅔ of the time, even though in theory the percentage of wins should be 50% for both algorithms. Can you come up with a situation of two ranked lists of length 3 and a distribution of clicks over them for which Team-draft interleaving is unfair to the better algorithm?

<div style="background-color: lightyellow">
*answer here*
</div>

# Experimental Part [85 pts]
Commercial search engines use both offline and online approaches in evaluating a new search algorithm: they first use an offline test collection to compare the production algorithm (P) with the new experimental algorithm (E); if *E* statistically significantly outperforms *P* with respect to the evaluation measure of their interest, the two algorithms are then compared online through an interleaving experiment.

For the purpose of this homework we will assume that the evaluation measures of interest are:
1. Binary evaluation measures
    1. Precision at rank k,
    2. Recall at rank k,
    3. Average Precision,
2. Multi-graded evaluation measures
    1. Normalized Discounted Cumulative Gain at rank k (nDCG@k),
    2. Expected Reciprocal Rank (ERR).

Further, for the purpose of this homework we will assume that the interleaving algorithms of interest are:
1. Team-Draft Interleaving (Joachims. "Evaluating retrieval performance using clickthrough data". Text Mining 2003.),
2. Probabilistic Interleaving (Hofmann, Whiteson, and de Rijke. "A probabilistic method for inferring preferences from clicks." CIKM 2011.).
 
In an interleaving experiment the ranked results of *P* and *E* (against a user query) are interleaved into a single ranked list, which is presented to a user. The user then clicks on the results and the algorithm that receives most of the clicks wins the comparison. The experiment is repeated a number of times (impressions) and the total wins for *P* and *E* are computed.

A Sign/Binomial Test is then run to examine whether the difference in wins between the two algorithms is statistically significant (or due to chance). Alternatively, one can calculate the proportion of times that *E* wins and test whether this proportion, *p*, is greater than *p<sub>0</sub>* = 0.5. This is called a 1-sample, 1-sided proportion test.
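
As an illustration, the exact one-sided binomial test can be written with the standard library alone. This is a minimal sketch: the function name is hypothetical, and it assumes that tied impressions are discarded beforehand, so that `n` counts only impressions with a winner.

```python
from math import comb

def binomial_test_greater(wins, n, p0=0.5):
    """Exact one-sided p-value P(X >= wins) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(wins, n + 1))

if __name__ == "__main__":
    # E wins 60 out of 100 decided impressions; H0: p = 0.5, H1: p > 0.5
    p_value = binomial_test_greater(wins=60, n=100)
    print("p-value:", p_value)
    print("significant at alpha = 0.05:", p_value < 0.05)
```

A small p-value suggests that *E* wins more often than chance alone would explain.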

One of the key questions, however, is **whether offline evaluation and online evaluation outcomes agree with each other**. In this homework you will determine the degree of agreement between offline evaluation measures and interleaving outcomes by means of simulations. A similar analysis using actual online data can be found in Chapelle et al., “Large-Scale Validation and Analysis of Interleaved Search Evaluation”.

## <font color='purple'>[Based on Lecture 1]</font>
### Step 1: <font color='darkred'>Simulate Rankings of Relevance for *E* and *P* *(5 points)*</font>

In the first step you will generate pairs of rankings of relevance, for the production *P* and experimental *E*, respectively, for a hypothetical query **q**. Assume a 3-graded relevance, i.e. `{N, R, HR}`. Construct all possible *P* and *E* ranking pairs of length 5. This step should give you about.

Example:

    P: {N N N N N}
    E: {N N N N R}
    …
    P: {HR HR HR HR R}
    E: {HR HR HR HR HR}

(Note 1: If you do not have enough computational power, sample 5000 pairs uniformly at random to show your work.)
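
If full enumeration is too expensive, the sampling in Note 1 could be done without building all pairs first. The sketch below draws distinct pairs by rejection sampling, which yields a uniform sample without replacement. The function name and the choice to skip pairs with identical rankings are assumptions made for illustration.

```python
import random

GRADES = ("N", "R", "HR")

def sample_ranking_pairs(n_pairs, length=5, seed=None):
    """Draw n_pairs distinct (P, E) ranking pairs uniformly at random,
    skipping pairs in which both rankings are identical."""
    max_pairs = 3 ** (2 * length) - 3 ** length
    if n_pairs > max_pairs:
        raise ValueError("only %d distinct non-equal pairs exist" % max_pairs)
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        p = tuple(rng.choice(GRADES) for _ in range(length))
        e = tuple(rng.choice(GRADES) for _ in range(length))
        if p != e:
            pairs.add((p, e))  # duplicates are ignored by the set
    return sorted(pairs)

if __name__ == "__main__":
    sample = sample_ranking_pairs(5000, seed=42)
    print(len(sample))
    print(sample[0])
```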

In [1]:

from itertools import product
from pprint import pprint
import random

# define collections of algorithms and relevance grades
algorithms = ['P', 'E']
relevance_grades = ['N', 'HR', 'R']

# all possible ranking sequences, in the order produced by product():
# [('N', 'N', 'N', 'N', 'N'), ..., ('R', 'R', 'R', 'R', 'R')]
rankings = list(product(relevance_grades, repeat=5))

# each algorithm paired with every ranking
# (list of lists with elements e.g. ('P', ('HR', 'HR', 'HR', 'HR', 'HR')))
algorithm_rankings = [[(alg, ranking) for ranking in rankings] for alg in algorithms]

# all possible pairs of P and E with their rankings
all_ranking_pairs = list(product(*algorithm_rankings))

# all ranking pairs except those where P and E produce the same ranking
ranking_pairs = [pair for pair in all_ranking_pairs if pair[0][1] != pair[1][1]]

# pretty print
print('number of combinations:', len(all_ranking_pairs))
print('number of non-equal combinations:', len(ranking_pairs))
pprint(random.sample(ranking_pairs, 10))

number of combinations: 59049
number of non-equal combinations: 58806
[(('P', ('N', 'HR', 'R', 'N', 'N')), ('E', ('N', 'N', 'HR', 'HR', 'N'))),
 (('P', ('HR', 'R', 'HR', 'N', 'HR')), ('E', ('HR', 'HR', 'R', 'N', 'HR'))),
 (('P', ('R', 'R', 'R', 'R', 'N')), ('E', ('HR', 'N', 'N', 'R', 'R'))),
 (('P', ('HR', 'R', 'N', 'R', 'HR')), ('E', ('N', 'N', 'R', 'HR', 'R'))),
 (('P', ('HR', 'N', 'HR', 'HR', 'R')), ('E', ('HR', 'HR', 'N', 'N', 'R'))),
 (('P', ('HR', 'N', 'N', 'R', 'N')), ('E', ('R', 'R', 'HR', 'N', 'R'))),
 (('P', ('N', 'N', 'N', 'R', 'N')), ('E', ('HR', 'R', 'HR', 'N', 'HR'))),
 (('P', ('N', 'HR', 'HR', 'N', 'N')), ('E', ('N', 'N', 'R', 'R', 'N'))),
 (('P', ('HR', 'R', 'N', 'HR', 'N')), ('E', ('N', 'HR', 'R', 'R', 'HR'))),
 (('P', ('R', 'N', 'N', 'HR', 'R')), ('E', ('HR', 'HR', 'R', 'R', 'N')))]


<div style="background-color: lightyellow">
**explanation/analysis:**
We find 59049 different ranking pairs. This makes sense: each ranking pair consists of 10 relevance values (5 produced by each algorithm), each of which can take any of the 3 grades (N, R, HR). So there should be 3<sup>10</sup> = 59049 different combinations, which matches our finding.
We then exclude all ranking pairs in which both algorithms produce exactly the same relevance grade sequence. There are 3<sup>5</sup> = 243 such pairs, which leaves us with 59049 − 243 = 58806 different combinations.
We have printed a randomly selected sample of 10 ranking pairs as an example.
</div>

### Step 2: <font color='darkred'>Implement Evaluation Measures *(10 points)*</font>
Implement 1 binary and 2 multi-graded evaluation measures out of the 5 measures mentioned above.

(Note 2: Some of the aforementioned measures require the total number of relevant and highly relevant documents in the entire collection; pay extra attention to how to find this.)

In [2]:
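# code
# precision-style score over the top k of every ranking pair

#import numpy as np
def precision(k,_ranking_pairs):
    """
    Return one (score_P, score_E) tuple per ranking pair.
    The score sums the precision at each relevant rank within the top k and
    divides by k. It equals plain P@k = (#relevant in top k)/k only when all
    relevant documents in the top k precede the non-relevant ones.
    """
    def calc_precision(i,x):
        # x selects the algorithm: 0 for P, 1 for E
        rel_counter = 0.0
        prec = 0.0
        for j in range(k):
            if _ranking_pairs[i][x][1][j] in ('HR', 'R'):
                rel_counter += 1
                prec += rel_counter/(1.0+j)
        return prec
    
    prec_list = []
    for i in range(len(_ranking_pairs)):
        p_prec = calc_precision(i,0)
        e_prec = calc_precision(i,1)
        prec_list.append((p_prec/k,e_prec/k))
    
    return prec_list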


In [3]:
prec = precision(3,ranking_pairs)

random_query_index_list = random.sample(range(len(ranking_pairs)), 10)
for i in random_query_index_list:
    print("query ",i,": \t",prec[i])

query  18575 : 	 (0.38888888888888884, 0.5555555555555555)
query  41395 : 	 (0.5555555555555555, 0.1111111111111111)
query  19821 : 	 (0.3333333333333333, 0.6666666666666666)
query  58575 : 	 (1.0, 0.1111111111111111)
query  28808 : 	 (1.0, 0.1111111111111111)
query  9892 : 	 (0.38888888888888884, 1.0)
query  57194 : 	 (1.0, 0.3333333333333333)
query  39351 : 	 (0.3333333333333333, 1.0)
query  14794 : 	 (0.16666666666666666, 0.16666666666666666)
query  42430 : 	 (0.5555555555555555, 0.38888888888888884)


In [4]:
import numpy as np

def average_prec(k_max,_ranking_pairs):
    """
    Average the scores returned by precision(k, ...) over k = 1..k_max.
    Returns one [score_P, score_E] list per ranking pair.
    """
    ap_list = [[0,0] for i in range(len(_ranking_pairs))]
    for k in range(1,k_max+1):
        temp_list = precision(k,_ranking_pairs)
        for index,item in enumerate(temp_list):
            ap_list[index] = (np.array(ap_list[index])+np.array(item)).tolist()
    for item in ap_list:
        item[:] = [i / k_max for i in item]
    return ap_list
    

In [5]:
k_max = 5
avg_prec = average_prec(k_max,ranking_pairs)

random_query_index_list = random.sample(range(len(ranking_pairs)), 10)
for i in random_query_index_list:
    print("query ",i,": \t",ranking_pairs[i][0], "\t",'%.3f'%avg_prec[i][0],'\t',ranking_pairs[i][1],'\t','%.3f'%avg_prec[i][1])

query  28613 : 	 ('P', ('HR', 'HR', 'HR', 'N', 'HR')) 	 0.902 	 ('E', ('N', 'R', 'N', 'HR', 'N')) 	 0.173
query  27209 : 	 ('P', ('HR', 'HR', 'N', 'HR', 'HR')) 	 0.813 	 ('E', ('HR', 'N', 'R', 'R', 'N')) 	 0.629
query  18276 : 	 ('P', ('N', 'R', 'R', 'HR', 'N')) 	 0.300 	 ('E', ('HR', 'HR', 'R', 'N', 'HR')) 	 0.902
query  33382 : 	 ('P', ('HR', 'R', 'N', 'N', 'R')) 	 0.737 	 ('E', ('R', 'R', 'HR', 'HR', 'HR')) 	 1.000
query  13289 : 	 ('P', ('N', 'R', 'N', 'N', 'N')) 	 0.128 	 ('E', ('R', 'R', 'N', 'R', 'N')) 	 0.781
query  35122 : 	 ('P', ('HR', 'R', 'HR', 'N', 'HR')) 	 0.902 	 ('E', ('N', 'HR', 'N', 'HR', 'R')) 	 0.197
query  15610 : 	 ('P', ('N', 'R', 'HR', 'N', 'HR')) 	 0.257 	 ('E', ('HR', 'HR', 'HR', 'R', 'N')) 	 0.960
query  10486 : 	 ('P', ('N', 'HR', 'HR', 'R', 'HR')) 	 0.332 	 ('E', ('HR', 'N', 'N', 'N', 'N')) 	 0.457
query  55175 : 	 ('P', ('R', 'R', 'HR', 'N', 'R')) 	 0.902 	 ('E', ('R', 'R', 'R', 'R', 'R')) 	 1.000
query  30044 : 	 ('P', ('HR', 'HR', 'HR', 'R', 'HR')) 	 1.

<div style="background-color: lightyellow">
explanation/analysis?
</div>
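
For reference, the textbook definitions of the binary measures can be sketched as follows. Precision at k is the fraction of the top k that is relevant, and Average Precision sums the precision at each relevant rank and divides by the total number of relevant documents for the query. That total is not known from a ranking of length 5, so `total_relevant` is an explicit assumption that the caller must supply (see Note 2). All names are illustrative.

```python
def is_relevant(grade):
    """Binary relevance: both R and HR count as relevant."""
    return grade in ("R", "HR")

def precision_at_k(ranking, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(is_relevant(g) for g in ranking[:k]) / k

def recall_at_k(ranking, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(is_relevant(g) for g in ranking[:k]) / total_relevant

def average_precision(ranking, total_relevant):
    """Sum of precision at every relevant rank, divided by total_relevant."""
    if total_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, grade in enumerate(ranking, start=1):
        if is_relevant(grade):
            hits += 1
            score += hits / rank
    return score / total_relevant

if __name__ == "__main__":
    ranking = ("HR", "N", "R", "N", "N")
    print(precision_at_k(ranking, 3))
    print(recall_at_k(ranking, 3, total_relevant=2))
    print(average_precision(ranking, total_relevant=2))
```

Choosing `total_relevant` from a fixed ground truth keeps the scores of *P* and *E* comparable, since both are then normalized by the same number.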

In [6]:
import math

#-------------calc ideal dcg based on ground truth [10 HR, 10 R, 10 N]
def calc_ideal_dgc(k):
    idcg_rel = [2]*10 + [1]*10 + [0]*10
    idcg = 0
    for index in range(min(k, len(idcg_rel))):
        idcg += ((2**idcg_rel[index]) - 1)/(math.log2(2+index))
    return idcg
        

In [7]:
# DCG uses the second formula from slide 6 of http://www.cs.cornell.edu/courses/cs4300/2013fa/lectures/evaluation-1-4pp.pdf,
# i.e. a gain of (2^rel - 1) discounted by log2(rank + 1)
# nDCG is a way to calculate this measure across many independent queries (http://curtis.ml.cmu.edu/w/courses/index.php/Normalized_discounted_cumulative_gain)
# Normalize DCG at rank n by the DCG value at rank n of the ideal ranking (Stanford slide)

import math
def ndcg(max_k,query_index,method):    
    all_dcg_k = []
    
    #--------------------make dcg list for all k-----------------------------
    for r in range(0,max_k):
        if ranking_pairs[query_index][method][1][r] == 'HR':
            rel_num = 2
        elif ranking_pairs[query_index][method][1][r] == 'R':
            rel_num = 1
        else:
            rel_num = 0
            
        if len(all_dcg_k) > 0:   
            all_dcg_k.append((((2**rel_num) - 1)/(math.log2(2+r))) + all_dcg_k[-1])
        else:
            all_dcg_k.append(((2**rel_num) - 1)/(math.log2(2+r)))
            
            
    #----------------calculate ideal dcg------------------------------------      
    idcg = calc_ideal_dgc(max_k)
        
    #--------------------------convert dcg to ndcg--------------------------   
    if(idcg == 0):
        all_dcg_k[:] = [0 for x in all_dcg_k]
    else:
        all_dcg_k[:] = [x / idcg for x in all_dcg_k]
     
    return all_dcg_k

#-----------------------call ndcg function for all 59000 queries------------------
ndcg_list = []
max_k = 5
for i in range(len(ranking_pairs)):
    p_dcg = ndcg(max_k,i,0)
    e_dcg = ndcg(max_k,i,1)
    ndcg_list.append((p_dcg[-1],e_dcg[-1]))

    
    

In [8]:
ndcg_list[:10]

[(0.0, 0.13120507751234178),
 (0.0, 0.04373502583744726),
 (0.0, 0.14606834984270645),
 (0.0, 0.27727342735504823),
 (0.0, 0.1898033756801537),
 (0.0, 0.04868944994756881),
 (0.0, 0.17989452745991058),
 (0.0, 0.09242447578501607),
 (0.0, 0.16958010263680803),
 (0.0, 0.30078518014914984)]

In [9]:
random_query_index_list = random.sample(range(len(ranking_pairs)), 10)
for index in random_query_index_list:
    print("query %d"%index,":")
    print(ranking_pairs[index][0], "\t", ndcg_list[index][0])
    print(ranking_pairs[index][1], "\t",ndcg_list[index][1],'\n')

query 33980 :
('P', ('HR', 'R', 'N', 'HR', 'R')) 	 0.600292335865279
('E', ('HR', 'N', 'R', 'N', 'HR')) 	 0.5268919836648939 

query 56662 :
('P', ('R', 'R', 'R', 'N', 'N')) 	 0.24090885754831726
('E', ('N', 'HR', 'N', 'R', 'HR')) 	 0.3938807921944381 

query 34987 :
('P', ('HR', 'R', 'HR', 'N', 'N')) 	 0.5800690628219334
('E', ('HR', 'R', 'N', 'HR', 'HR')) 	 0.6877623875401735 

query 43836 :
('P', ('R', 'N', 'R', 'N', 'HR')) 	 0.30078518014914984
('E', ('N', 'HR', 'N', 'R', 'HR')) 	 0.3938807921944381 

query 1377 :
('P', ('N', 'N', 'N', 'HR', 'R')) 	 0.1898033756801537
('E', ('R', 'N', 'N', 'R', 'N')) 	 0.16174285170544084 

query 23929 :
('P', ('HR', 'N', 'HR', 'R', 'R')) 	 0.6011647836954401
('E', ('R', 'HR', 'R', 'R', 'HR')) 	 0.5634608948312462 

query 38353 :
('P', ('HR', 'R', 'R', 'HR', 'R')) 	 0.656819036744215
('E', ('HR', 'HR', 'HR', 'N', 'N')) 	 0.7227265726449517 

query 51474 :
('P', ('R', 'HR', 'R', 'HR', 'R')) 	 0.5733697430514892
('E', ('R', 'N', 'N', 'R', 'R')) 	 0.2

In [10]:
# ERR

import numpy as np

def numerical(relevance_sequence):
    """
    Convert a relevance grade sequence to a sequence of numerical 
    values, based on the relevance grades given.
    E.g. ['HR', 'HR', 'HR', 'R', 'N'] returns [2, 2, 2, 1, 0]
    """
    numerical_relevance_sequence = [1 if grade == 'R' \
                                    else 2 if grade == 'HR' else 0 \
                                    for grade in relevance_sequence]
    return numerical_relevance_sequence

def R_function(g, g_max):
    return (2**g - 1)/(2**g_max)

def ERR(relevance_sequence):
    """
    Compute the ERR based on Algorithm 2 in 
    https://pdfs.semanticscholar.org/7e3c/f6492128f915112ca01dcb77c766129e65cb.pdf
    """
    p = 1
    ERR = 0
    n = len(relevance_sequence)
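    # note: g_max is the highest grade occurring in this sequence,
    # not the highest grade of the scale (2 for HR)
    g_max = max(relevance_sequence)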
    
    for r in range(1, n + 1):
        g = relevance_sequence[r - 1]
        R = R_function(g, g_max)
        ERR = ERR + p * (R/r)
        p = p * (1 - R)
    return ERR

ERR_list = []
for P, E in ranking_pairs:
    ERR_P = ERR(numerical(P[1]))
    ERR_E = ERR(numerical(E[1]))
    ERR_list.append((ERR_P, ERR_E))
    
print('mean of all ERR values for P & E (should be equal):\n', 
      np.mean(ERR_list, axis=0))
print('mean of a random sample of 100 pairs:\n', 
      np.mean(random.sample(ERR_list, 100), axis=0))
pprint(ERR_list[:10])

mean of all ERR values for P & E (should be equal):
 [ 0.55613292  0.55613292]
mean of a random sample of 100 pairs:
 [ 0.54105143  0.57543164]
[(0.0, 0.15),
 (0.0, 0.1),
 (0.0, 0.1875),
 (0.0, 0.225),
 (0.0, 0.2),
 (0.0, 0.125),
 (0.0, 0.175),
 (0.0, 0.175),
 (0.0, 0.25),
 (0.0, 0.2875)]
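
The ERR cell above takes `g_max` from the sequence being scored. In the ERR definition, the stopping probability is R(g) = (2<sup>g</sup> − 1)/2<sup>g_max</sup>, with g_max the highest grade of the scale. A variant with `g_max` fixed to 2 (the value for HR) could look like the sketch below. The names are illustrative, and its scores differ from the output printed above for rankings that contain no HR document.

```python
GRADE_VALUE = {"N": 0, "R": 1, "HR": 2}

def err(ranking, g_max=2):
    """Expected Reciprocal Rank with g_max fixed to the top grade of the scale."""
    p_reach = 1.0   # probability that the user reaches the current rank
    score = 0.0
    for rank, grade in enumerate(ranking, start=1):
        p_stop = (2 ** GRADE_VALUE[grade] - 1) / 2 ** g_max
        score += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return score

if __name__ == "__main__":
    print(err(("N", "N", "N", "N", "HR")))
    print(err(("N", "N", "N", "N", "R")))
    print(err(("HR", "HR", "HR", "HR", "HR")))
```

With a fixed `g_max`, a single R document contributes a stopping probability of 1/4 rather than 1/2, so R and HR stay on one common scale across rankings.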


### Step 3: <font color='darkred'>Calculate the 𝛥measure *(0 points)*</font>
For the three measures and all *P* and *E* ranking pairs constructed above calculate the difference: 𝛥measure = measure<sub>E</sub>-measure<sub>P</sub>. Consider only those pairs for which *E* outperforms *P*.


In [11]:
def measure_diff(measure_list, _ranking_pairs, query_index_list):
    for i in query_index_list:
        diff = measure_list[i][1] - measure_list[i][0]
        if diff > 0:
            print("query",i,": ",'\t',_ranking_pairs[i][0],'\t',_ranking_pairs[i][1],'\t','%.3f'%(diff))
        else:
            print("query",i,": ",'\t',"less than 0")

In [12]:
random_query_index_list = random.sample(range(len(ranking_pairs)), 10)

#for ap:
print(" 𝛥measure for AP: ")
measure_diff(avg_prec, ranking_pairs, random_query_index_list)
print("-------------------------------------------------------------------------------------------------------")
#for nDCG:
print(" 𝛥measure for nDCG: ")
measure_diff(ndcg_list, ranking_pairs, random_query_index_list)
print("-------------------------------------------------------------------------------------------------------")
#for ERR:
print(" 𝛥measure for ERR: ")
measure_diff(ERR_list, ranking_pairs, random_query_index_list)


 𝛥measure for AP: 
query 17386 :  	 ('P', ('N', 'R', 'HR', 'R', 'R')) 	 ('E', ('R', 'HR', 'HR', 'R', 'HR')) 	 0.668
query 58063 :  	 less than 0
query 46713 :  	 less than 0
query 10805 :  	 ('P', ('N', 'HR', 'HR', 'R', 'R')) 	 ('E', ('HR', 'R', 'R', 'HR', 'R')) 	 0.668
query 44267 :  	 ('P', ('R', 'N', 'R', 'N', 'R')) 	 ('E', ('R', 'R', 'N', 'R', 'R')) 	 0.228
query 42874 :  	 less than 0
query 3236 :  	 ('P', ('N', 'N', 'HR', 'HR', 'HR')) 	 ('E', ('HR', 'N', 'HR', 'N', 'HR')) 	 0.464
query 8651 :  	 ('P', ('N', 'HR', 'N', 'R', 'R')) 	 ('E', ('R', 'N', 'R', 'N', 'R')) 	 0.388
query 3633 :  	 less than 0
query 57932 :  	 less than 0
-------------------------------------------------------------------------------------------------------
 𝛥measure for nDCG: 
query 17386 :  	 ('P', ('N', 'R', 'HR', 'R', 'R')) 	 ('E', ('R', 'HR', 'HR', 'R', 'HR')) 	 0.343
query 58063 :  	 less than 0
query 46713 :  	 less than 0
query 10805 :  	 ('P', ('N', 'HR', 'HR', 'R', 'R')) 	 ('E', ('HR', 'R', 'R', 'H

In [13]:
# code
diff_precision = [precision_E - precision_P for precision_P, precision_E in prec if precision_E > precision_P]
diff_avgprec = [avgprec_E - avgprec_P for avgprec_P, avgprec_E in avg_prec if avgprec_E > avgprec_P]
diff_nDCG = [nDCG_E - nDCG_P for nDCG_P, nDCG_E in ndcg_list if nDCG_E > nDCG_P]
diff_ERR = [ERR_E - ERR_P for ERR_P, ERR_E in ERR_list if ERR_E > ERR_P]
print(len(diff_precision), len(diff_avgprec), len(diff_nDCG), len(diff_ERR))

24462 27962 29376 29374


<div style="background-color: lightyellow">
explanation/analysis?
</div>

## <font color='purple'>[Based on Lecture 2]</font>
### Step 4: <font color='darkred'>Implement Interleaving *(15 points)*</font>
Implement 2 interleaving algorithms: (1) Team-Draft Interleaving OR Balanced Interleaving, AND (2) Probabilistic Interleaving. The interleaving algorithms should (a) given two rankings of relevance, interleave them into a single ranking, and (b) given the user's clicks on the interleaved ranking, assign credit to the algorithms that produced the rankings.

(Note 4: Note here that as opposed to a normal interleaving experiment where rankings consist of urls or docids, in our case the rankings consist of relevance labels. Hence in this case (a) you will assume that E and P return different documents, (b) the interleaved ranking will also be a ranking of labels.)


In [14]:
#Team-Draft Interleaving
#P (team 0) and E (team 1) are the two rankings to interleave
#as output we will have, for every ranking pair, one interleaved list of
#(relevance label, team) tuples

def team_draft_interleaving(_ranking_pairs):
    all_I = []
    for i in range(len(_ranking_pairs)):
        A = _ranking_pairs[i][0][1]
        B = _ranking_pairs[i][1][1]
        team_A = []
        team_B = []
        I = []
        while len(team_A) < len(A) or len(team_B) < len(B):
            RandBit = random.getrandbits(1)
            #pick from A
            if len(team_A) < len(team_B) or (len(team_A) == len(team_B) and RandBit == 1):
                I.append((A[len(team_A)],0))
                team_A.append(I[-1])
            else:
                #pick from B
                I.append((B[len(team_B)],1))
                team_B.append(I[-1])
        all_I.append(I)
    return all_I


In [15]:
def find_winner(_click,_all_I):
    winner_list = []
    for query_index in range(len(_all_I)):
        h_a = 0
        h_b = 0
        for index, item in enumerate(_all_I[query_index]):
            if _click[query_index][index] > 0:
                if item[1] == 0:
                    h_a += 1
                if item[1] == 1:
                    h_b += 1
        if(h_a > h_b):
            winner_list.append('P')
        elif(h_a == h_b):
            winner_list.append('Equal')
        else:
            winner_list.append('E')
    return winner_list
    

<div style="background-color: lightyellow">
explanation/analysis?
</div>
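
Probabilistic Interleaving is not implemented above. A rough sketch of its interleaving step is given here under simplifying assumptions: each ranker's remaining documents are sampled with weight 1/rank<sup>τ</sup> (τ = 3 is taken as the default), the contributing ranker is chosen by a fair coin at every position, and, because *P* and *E* are assumed to return different documents (Note 4), every document can be attributed to exactly one ranker. Credit is then simply the number of clicks per ranker. The output uses the same `(label, team)` format as the Team-Draft cell, and all names are illustrative.

```python
import random

def probabilistic_interleave(p_ranking, e_ranking, tau=3.0, rng=random):
    """Interleave two label rankings; returns (label, team) tuples, 0 = P, 1 = E."""
    # remaining documents per ranker, stored as (original rank, label)
    remaining = {0: list(enumerate(p_ranking, start=1)),
                 1: list(enumerate(e_ranking, start=1))}
    interleaved = []
    while remaining[0] or remaining[1]:
        team = rng.getrandbits(1)
        if not remaining[team]:        # chosen ranker is exhausted
            team = 1 - team
        docs = remaining[team]
        # softmax-like weights over the documents that are still available
        weights = [1.0 / rank ** tau for rank, _ in docs]
        idx = rng.choices(range(len(docs)), weights=weights, k=1)[0]
        _, label = docs.pop(idx)
        interleaved.append((label, team))
    return interleaved

def assign_credit(interleaved, clicks):
    """Return 'P', 'E' or 'Equal' from the clicks on an interleaved list."""
    credit = [0, 0]
    for (_, team), clicked in zip(interleaved, clicks):
        if clicked:
            credit[team] += 1
    if credit[0] > credit[1]:
        return 'P'
    if credit[1] > credit[0]:
        return 'E'
    return 'Equal'

if __name__ == "__main__":
    rng = random.Random(1)
    mixed = probabilistic_interleave(('HR', 'R', 'N', 'N', 'R'),
                                     ('N', 'N', 'HR', 'R', 'N'), rng=rng)
    print(mixed)
    print(assign_credit(mixed, [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
```

When the two rankers can return the same document, the published method marginalizes over all possible assignments of documents to rankers; that step is left out of this sketch.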

## <font color='purple'>[Based on Lecture 3]</font>
### Step 5: <font color='darkred'>Implement User Clicks Simulation *(15 points)*</font>
Having interleaved all the ranking pairs an online experiment could be run. However, given that we do not have any users (and the entire homework is a big simulation) we will simulate user clicks.

We have considered a number of click models including:
1. Random Click Model (RCM)
2. Position-Based Model (PBM)
3. Simple Dependent Click Model (SDCM)
4. Simple Dynamic Bayesian Network (SDBN)

Consider two different click models, (a) the Random Click Model (RCM), and (b) one out of the remaining 3 aforementioned models. The parameters of some of these models can be estimated using the Maximum Likelihood Estimation (MLE) method, while others require using the Expectation-Maximization (EM) method. Implement the two models so that (a) there is a method that learns the parameters of the model given a set of training data, (b) there is a method that predicts the click probability given a ranked list of relevance labels, (c) there is a method that decides - stochastically - whether a document is clicked based on these probabilities.

Having implemented the two click models, estimate the model parameters using the Yandex Click Log [[file]](https://drive.google.com/file/d/1tqMptjHvAisN1CJ35oCEZ9_lb0cEJwV0/view).

(Note 6: Do not learn the attractiveness parameter *a*<sub>uq</sub>)

In [16]:
all_I = team_draft_interleaving(ranking_pairs)   

In [17]:
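def learn_rcm_param(_train_file):
    # RCM has a single parameter: the probability that any shown document is
    # clicked, estimated as (#clicks) / (#documents shown)
    query_num = 0.0
    click_num = 0.0
    for line in _train_file:
        info = line.split()
        if info[2] == 'Q':
            query_num += 1
        if info[2] == 'C':
            click_num += 1

    # every query in the log is assumed to show 10 documents
    doc_num = query_num*10
    _rcm_param = click_num/doc_num
    return _rcm_param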
 

In [None]:
#SDBN:
# TODO: the formula for finding the number of last clicks still has to be changed

def learn_sdbm_param(_train_file):
    for line in _train_file:
        info = line.split()
    

    return sigma


<div style="background-color: lightyellow">
explanation/analysis?
</div>
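
One possible way to estimate a single, query-independent satisfaction parameter for SDBN is the fraction of clicked results that are the lowest-ranked click of their query. The sketch below is an assumption-laden illustration, not the method used in this notebook: it assumes whitespace-separated rows in which a query row is `SessionID TimePassed Q QueryID RegionID URL...` and a click row is `SessionID TimePassed C URLID`, it assigns each click to the most recent query of the same session, and it counts repeated clicks on the same result once. The function name is hypothetical.

```python
def estimate_sigma(lines):
    """Share of clicked results that are the lowest-ranked click of their query."""
    clicked_per_query = []          # one set of clicked ranks per query row
    session_id, urls, clicked = None, [], None
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue                # skip malformed rows
        if fields[2] == "Q":
            session_id, urls = fields[0], fields[5:]
            clicked = set()
            clicked_per_query.append(clicked)
        elif fields[2] == "C" and clicked is not None and fields[0] == session_id:
            if fields[3] in urls:   # ignore clicks on documents that were not shown
                clicked.add(urls.index(fields[3]))
    total_clicks = sum(len(ranks) for ranks in clicked_per_query)
    # every query with at least one click has exactly one last click
    last_clicks = sum(1 for ranks in clicked_per_query if ranks)
    return last_clicks / total_clicks if total_clicks else 0.0

if __name__ == "__main__":
    toy_log = [
        "1 0 Q 10 2 a b c d",
        "1 5 C a",
        "1 9 C c",
        "1 20 Q 11 2 e f g",
        "1 25 C f",
        "2 0 Q 12 2 a b",
    ]
    print(estimate_sigma(toy_log))
```

In the toy log, two queries receive clicks and three results are clicked in total, so the estimate is 2/3.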

### Step 6: <font color='darkred'>Simulate Interleaving Experiment *(10 points)*</font>
Having implemented the click models, it is time to run the simulated experiment.

For each interleaved ranking, run N simulations for each of the click models implemented and measure the proportion *p* of wins for E.
(Note 7: Some of the models above include an attractiveness parameter *a*<sub>uq</sub>. Use the relevance label to assign this parameter by setting *a*<sub>uq</sub> for a document u in the ranked list accordingly. See [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf).)
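
The experiment described here can be sketched for a single ranking pair as follows. This is a self-contained illustration with hypothetical names. It assumes both rankings have the same length, re-interleaves at every impression, uses a random click model with a fixed click probability, and drops tied impressions before computing *p*.

```python
import random

def team_draft(p_ranking, e_ranking, rng):
    """Team-draft interleaving of two equally long label rankings."""
    assert len(p_ranking) == len(e_ranking)
    rankings = (p_ranking, e_ranking)
    taken = [0, 0]                      # documents contributed by P and by E
    interleaved = []
    while taken[0] < len(p_ranking) or taken[1] < len(e_ranking):
        if taken[0] != taken[1]:
            team = 0 if taken[0] < taken[1] else 1
        else:
            team = rng.getrandbits(1)   # coin flip when both teams are level
        interleaved.append((rankings[team][taken[team]], team))
        taken[team] += 1
    return interleaved

def random_clicks(interleaved, click_prob, rng):
    """Random click model: every position is clicked with the same probability."""
    return [1 if rng.random() < click_prob else 0 for _ in interleaved]

def proportion_e_wins(p_ranking, e_ranking, n_impressions, click_prob, seed=None):
    """Share of decided impressions (ties dropped) that E wins."""
    rng = random.Random(seed)
    wins = [0, 0]
    for _ in range(n_impressions):
        interleaved = team_draft(p_ranking, e_ranking, rng)
        clicks = random_clicks(interleaved, click_prob, rng)
        credit = [0, 0]
        for (_, team), clicked in zip(interleaved, clicks):
            credit[team] += clicked
        if credit[0] != credit[1]:
            wins[0 if credit[0] > credit[1] else 1] += 1
    decided = wins[0] + wins[1]
    return wins[1] / decided if decided else 0.5

if __name__ == "__main__":
    p = ('N', 'R', 'N', 'N', 'N')
    e = ('HR', 'R', 'N', 'HR', 'N')
    print(proportion_e_wins(p, e, n_impressions=10000, click_prob=0.13, seed=7))
```

Because the random click model ignores relevance, *p* is expected to stay close to 0.5 here; a relevance-aware click model is needed for *E* to win more often.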


In [19]:
def simulate_rcm_click(_I,_ru):
    simulated_click = []
    for i in range(len(_I)):
        if(random.random() < _ru):
            simulated_click.append(1)
        else:
            simulated_click.append(0)
    return simulated_click

In [20]:
with open("YandexRelPredChallenge.txt","r") as train_file:
    ru = learn_rcm_param(train_file)
print(ru)

rcm_click_list = [simulate_rcm_click(all_I[i],ru) for i in range(len(all_I))]

print(all_I[:10])
pprint(rcm_click_list[:10])

winner_list = find_winner(rcm_click_list, all_I)
pprint(winner_list[:20])

0.13445559411047547
[[('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 0), ('HR', 1), ('N', 1), ('N', 0)], [('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('HR', 1), ('N', 0), ('HR', 1)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1), ('R', 1), ('N', 0)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0), ('N', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('R', 1), ('N', 0), ('N', 0), ('HR', 1)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0), ('R', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('HR', 1), ('

In [21]:
def simulate_sdbm_click(_I,_sigma):
    simulated_click = [0]*len(_I)
    attraction = {'HR':0.9,'R':0.6,'N':0.1}
    for index,item in enumerate(_I):
        if random.random() < attraction[item[0]]:
            simulated_click[index]= 1
            if random.random() < _sigma:
                return simulated_click
    return simulated_click

In [22]:
with open("YandexRelPredChallenge.txt","r") as train_file:
    sigma = learn_sdbm_param(train_file)

sdbm_click_list = [simulate_sdbm_click(all_I[i],sigma) for i in range(len(all_I))]

print(all_I[:10])
pprint(sdbm_click_list[:10])

winner_list = find_winner(sdbm_click_list, all_I)
pprint(winner_list[:20])

[[('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 0), ('HR', 1), ('N', 1), ('N', 0)], [('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('HR', 1), ('N', 0), ('HR', 1)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1), ('R', 1), ('N', 0)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0), ('N', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('R', 1), ('N', 0), ('N', 0), ('HR', 1)], [('N', 1), ('N', 0), ('N', 1), ('N', 0), ('N', 1), ('N', 0), ('R', 1), ('N', 0), ('R', 1), ('N', 0)], [('N', 1), ('N', 0), ('N', 0), ('N', 1), ('N', 0), ('HR', 1), ('N', 1), ('N', 0), ('N', 0), ('N', 1)], [('N', 0), ('N', 1), ('N', 1), ('N', 0), ('HR', 1), ('N', 0), ('N', 0), ('

<div style="background-color: lightyellow">
explanation/analysis?
</div>

### Step 7: <font color='darkred'>Results and Analysis *(30 points)*</font>
Compare the results of the offline experiments (i.e. the values of the 𝛥measure) with the results of the online experiment (i.e. proportion of wins), analyze them and reach your conclusions regarding their agreement.
* Use easy to read and comprehend visuals to demonstrate the results;
* Analyze the results on the basis of
    * the evaluation measure used,
    * the interleaving method used,
    * the click model used.
* Report and ground your conclusions.

(Note 8: This is the place where you need to demonstrate your deeper understanding of what you have implemented so far; hence the large number of points assigned. Make sure you clearly do that so that the examiner of your work can grade it accordingly.)

<u>Yandex Click Log File</u>:

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts. The logs are a set of rows, where each row represents one of the possible user actions: query or click.

In the case of a Query:

    SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs


In the case of a Click:

    SessionID TimePassed TypeOfAction URLID


* `SessionID` - the unique identifier of the user session.
* `TimePassed` - the time elapsed since the beginning of the current session in standard time units.
* `TypeOfAction` - type of user action. This may be either a query (Q), or a click (C).
* `QueryID` - the unique identifier of the query.
* `RegionID` - the unique identifier of the country from which a given query was issued. This identifier may take four values.
* `URLID` - the unique identifier of the document.
* `ListOfURLs` - the list of documents, from left to right, in the order in which they were shown to users on the Yandex results page (top to bottom).
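
A row in this format could be parsed as sketched below. The function name and the dictionary keys are illustrative, and the sketch assumes whitespace-separated columns with an integer `TimePassed`. IDs are kept as strings.

```python
def parse_log_row(line):
    """Turn one log row into a dict; returns None for rows of unknown type."""
    fields = line.split()
    if len(fields) < 4:
        return None
    row = {"session_id": fields[0],
           "time_passed": int(fields[1]),
           "action": fields[2]}
    if fields[2] == "Q":
        row["query_id"] = fields[3]
        row["region_id"] = fields[4]
        row["urls"] = fields[5:]     # documents in the order they were shown
    elif fields[2] == "C":
        row["url_id"] = fields[3]
    else:
        return None
    return row

if __name__ == "__main__":
    print(parse_log_row("0 0 Q 8 0 7 103 51 92"))
    print(parse_log_row("0 103 C 51"))
```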


In [23]:
from scipy import stats
from itertools import combinations
np.random.seed(123)

diff_measure_lists = {'precision': diff_precision, 'average precision': diff_avgprec, 
                      'nDCG': diff_nDCG, 'ERR': diff_ERR}

for name_1, name_2 in combinations(diff_measure_lists.keys(), 2):
    print(name_1, 'vs.', name_2)
    diffs_1 = diff_measure_lists[name_1]
    diffs_2 = diff_measure_lists[name_2]
    # independent two-sample t-test on the positive 𝛥measure values of both measures
    print(np.mean(diffs_1), np.mean(diffs_2), '\t', stats.ttest_ind(diffs_1, diffs_2), '\n')

precision vs. average precision
0.440397350993 0.385633681425 	 Ttest_indResult(statistic=24.810331399370998, pvalue=4.1906255886023189e-135) 

precision vs. nDCG
0.440397350993 0.227499954393 	 Ttest_indResult(statistic=120.28697419447488, pvalue=0.0) 

precision vs. ERR
0.440397350993 0.252210726855 	 Ttest_indResult(statistic=102.07668528704501, pvalue=0.0) 

average precision vs. nDCG
0.385633681425 0.227499954393 	 Ttest_indResult(statistic=87.264614427180462, pvalue=0.0) 

average precision vs. ERR
0.385633681425 0.252210726855 	 Ttest_indResult(statistic=71.148169032468587, pvalue=0.0) 

nDCG vs. ERR
0.227499954393 0.252210726855 	 Ttest_indResult(statistic=-17.076224869103331, pvalue=3.2076576398410232e-65) 



<div style="background-color: lightyellow">
explanation/analysis?
</div>
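
To relate the offline and online outcomes, one option is to compare each pair's 𝛥measure with the proportion of wins for *E* on that pair. The sketch below shows two simple summaries under stated assumptions: `deltas` and `win_rates` are hypothetical parallel lists, one entry per ranking pair, and "agreement" is taken to mean that a positive 𝛥measure goes together with a win proportion above 0.5. The per-bin averages are intended as input for a bar or line plot.

```python
def agreement_rate(deltas, win_rates):
    """Among pairs with a positive delta, the share where E wins more than half."""
    positive = [(d, p) for d, p in zip(deltas, win_rates) if d > 0]
    if not positive:
        return 0.0
    return sum(1 for _, p in positive if p > 0.5) / len(positive)

def mean_win_rate_per_bin(deltas, win_rates, n_bins=5):
    """Average win proportion within equal-width bins of the delta values."""
    low, high = min(deltas), max(deltas)
    width = (high - low) / n_bins or 1.0   # avoid a zero width when all deltas are equal
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, p in zip(deltas, win_rates):
        b = min(int((d - low) / width), n_bins - 1)
        sums[b] += p
        counts[b] += 1
    return [(low + i * width, low + (i + 1) * width,
             sums[i] / counts[i] if counts[i] else None)
            for i in range(n_bins)]

if __name__ == "__main__":
    deltas = [0.1, 0.2, 0.3, 0.4]
    win_rates = [0.4, 0.6, 0.7, 0.8]
    print(agreement_rate(deltas, win_rates))
    for start, end, mean in mean_win_rate_per_bin(deltas, win_rates, n_bins=2):
        print(round(start, 2), round(end, 2), mean)
```

If an offline measure agrees with the interleaving outcome, the average win proportion should rise from bin to bin as 𝛥measure grows.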