# Retrieval Evaluation

Compare the effectiveness of System A and System B on a test collection consisting of three queries. The table below contains the rankings generated by the two systems as well as the ground truth. We assume that relevance is binary, i.e., the ground truth column lists the set of relevant documents for each query.

Document rankings produced by two systems and binary relevance judgements:

\begin{array}{|l|l|l|l|}
    \hline    
    \textbf{Query} & \textbf{System A ranking} & \textbf{System B ranking} & \textbf{Ground truth} \\
    \hline
    \hline
    Q1 & 1, 2, 4, 5, 3, 6, 9, 8, 10, 7 & 2, 4, 3, 10, 5, 6, 7, 8, 9, 1 & 1, 3 \\
    \hline
    Q2 & 1, 2, 4, 5, 3, 9, 8, 6, 10, 7 & 5, 6, 4, 1, 7, 8, 9, 10, 3, 2 & 2, 4, 5, 6 \\
    \hline
    Q3 & 1, 7, 4, 5, 3, 6, 9, 8, 10, 2 & 2, 4, 3, 7, 5, 6, 1, 8, 9, 10 & 7 \\
    \hline
  \end{array}

## Solution


Effectiveness measures for **System A**:

  \begin{array}{|l||c|c|c|c|}
    \hline    
    \textbf{Query} & P@5 & P@10 & (M)AP & (M)RR\\
    \hline
    \hline
    Q1 & 0.400 & 0.200 & 0.700 & 1.000 \\
    \hline
    Q2 & 0.600 & 0.400 & 0.604 & 0.500 \\
    \hline
    Q3 & 0.200 & 0.100 & 0.500 & 0.500 \\
    \hline
    \hline
    Average & 0.400 & 0.233 & 0.601 & 0.667 \\
    \hline
  \end{array}



Effectiveness measures for **System B**:

  \begin{array}{|l||c|c|c|c|}
    \hline    
    \textbf{Query} & P@5 & P@10 & (M)AP & (M)RR\\
    \hline
    \hline
    Q1 & 0.200 & 0.200 & 0.267 & 0.333 \\
    \hline
    Q2 & 0.600 & 0.400 & 0.850 & 1.000 \\
    \hline
    Q3 & 0.200 & 0.100 & 0.250 & 0.250 \\
    \hline
    \hline
    Average & 0.333 & 0.233 & 0.456 & 0.528 \\
    \hline
  \end{array}
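
For reference, the four measures in the tables can be written as follows. These are the usual textbook definitions and are intended to match what the functions in this notebook compute; the symbols $R_q$ (set of relevant documents for query $q$), $d_i$ (document at rank $i$), $n$ (length of the ranking) and $\mathrm{rel}_q(i)$ (1 if $d_i \in R_q$, else 0) are notation introduced here for illustration.

```latex
\begin{aligned}
P@k(q) &= \frac{|\{d_1, \dots, d_k\} \cap R_q|}{k} \\
AP(q)  &= \frac{1}{|R_q|} \sum_{i=1}^{n} \mathrm{rel}_q(i) \cdot P@i(q) \\
RR(q)  &= \frac{1}{\min\{\, i : d_i \in R_q \,\}} \\
MAP    &= \frac{1}{|Q|} \sum_{q \in Q} AP(q), \qquad
MRR     = \frac{1}{|Q|} \sum_{q \in Q} RR(q)
\end{aligned}
```

If a ranking contains no relevant document, $RR(q)$ is taken to be 0, which is also what `reciprocal_rank` returns in that case.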

In [39]:
from typing import Callable, Dict, List, Set


In [118]:
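def precision_k(A, B, k):
    ''' Precision @ k
    Params:
        A: Relevant documents
        B: Retrieved documents (ranked list)
        k: Rank cutoff
    '''
    # Fraction of the top-k retrieved documents that are relevant
    A, B = set(A), set(B[:k])
    return len(A.intersection(B))/len(B)

def recall_k(A, B, k):
    ''' Recall @ k
    Params:
        A: Relevant documents
        B: Retrieved documents (ranked list)
        k: Rank cutoff
    '''
    # Fraction of all relevant documents that appear in the top k
    A, B = set(A), set(B[:k])
    return len(A.intersection(B))/len(A)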

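def average_precision(A, B):
    ''' Average Precision
    Params:
        A: Relevant documents
        B: Retrieved documents (ranked list)
    '''
    ap = []

    # Collect P@k at every rank k where a relevant document appears
    for i, doc_id in enumerate(B):
        if doc_id in A:
            p_i = precision_k(A, B, k=i+1)
            ap.append(p_i)

    # Relevant documents that are never retrieved contribute 0
    return sum(ap)/len(A)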


def reciprocal_rank(A, B):
    ''' Reciprocal Rank
    Params:
        A: Relevant documents
        B: Retrieved documents (ranked list)
    '''
    # 1 / rank of the first relevant document; 0 if none is retrieved
    rr = 0
    for i, doc_id in enumerate(B):
        r = i + 1
        if doc_id in A:
            rr = 1/r
            break
    return rr




In [65]:
gt_q1 = [1, 3]
sys_A_rank_q1 = [1, 2, 4, 5, 3, 6, 9, 8, 10, 7]
sys_B_rank_q1 = [2, 4, 3, 10, 5, 6, 7, 8, 9, 1]

In [47]:
# Precision
k_vals = [5, 10]
for k in k_vals:
    print(f'sysA q1 k:{k}',precision_k(gt_q1, sys_A_rank_q1, k=k))
    print(f'sysB q1 k:{k}',precision_k(gt_q1, sys_B_rank_q1, k=k))


sysA q1 k:5 0.4
sysB q1 k:5 0.2
sysA q1 k:10 0.2
sysB q1 k:10 0.2


In [81]:
# Average precision
print('AP Sys A q1')
print(average_precision(gt_q1, sys_A_rank_q1))
print('AP Sys B q1')
print(average_precision(gt_q1, sys_B_rank_q1))

AP Sys A q1
0.7
AP Sys B q1
0.26666666666666666
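
These two values can be checked by hand. For Q1 the relevant documents are 1 and 3. System A returns them at ranks 1 and 5, System B at ranks 3 and 10, so each term below is the precision at the rank of a relevant document (the subscripts $A$ and $B$ are notation added here):

```latex
\begin{aligned}
AP_A(Q1) &= \frac{1}{2}\left(\frac{1}{1} + \frac{2}{5}\right) = 0.7 \\
AP_B(Q1) &= \frac{1}{2}\left(\frac{1}{3} + \frac{2}{10}\right) \approx 0.267
\end{aligned}
```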


In [85]:
# Query 2: rankings of both systems and ground truth
sys_A_q2 = [1, 2, 4, 5, 3, 9, 8, 6, 10, 7]
sys_B_q2 = [5, 6, 4, 1, 7, 8, 9, 10, 3, 2]
gt_q2 =  [2, 4, 5, 6]

# Precision
k_vals = [5, 10]
for k in k_vals:
    print(f'sysA q2 k:{k}',precision_k(gt_q2, sys_A_q2, k=k))
    print(f'sysB q2 k:{k}',precision_k(gt_q2, sys_B_q2, k=k))


sysA q2 k:5 0.6
sysB q2 k:5 0.6
sysA q2 k:10 0.4
sysB q2 k:10 0.4


In [115]:
# Average precision
print('AP Sys A q2')
print(average_precision(gt_q2, sys_A_q2))
print('AP Sys B q2')
print(average_precision(gt_q2, sys_B_q2))

AP Sys A q2
0.6041666666666666
AP Sys B q2
0.85


In [120]:
# Reciprocal rank for query 2: (System A, System B)
reciprocal_rank(gt_q2, sys_A_q2), reciprocal_rank(gt_q2, sys_B_q2)

(0.5, 1.0)
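
The cells above cover Q1 and Q2 only. As a minimal sketch of how the remaining values (Q3 and the per-system averages, i.e. MAP and MRR) could be computed in one pass, the block below re-implements the measures in a self-contained form. The names `RANKINGS`, `GROUND_TRUTH`, `METRICS`, `precision_at_k` and `evaluate` are hypothetical helpers introduced here, not part of the original notebook, and `precision_at_k` divides by `k` directly, which gives the same result as `precision_k` whenever the ranking holds at least `k` distinct documents.

```python
from typing import Dict, List, Set

# Rankings and ground truth copied from the table at the top of this section.
RANKINGS: Dict[str, Dict[str, List[int]]] = {
    "A": {
        "Q1": [1, 2, 4, 5, 3, 6, 9, 8, 10, 7],
        "Q2": [1, 2, 4, 5, 3, 9, 8, 6, 10, 7],
        "Q3": [1, 7, 4, 5, 3, 6, 9, 8, 10, 2],
    },
    "B": {
        "Q1": [2, 4, 3, 10, 5, 6, 7, 8, 9, 1],
        "Q2": [5, 6, 4, 1, 7, 8, 9, 10, 3, 2],
        "Q3": [2, 4, 3, 7, 5, 6, 1, 8, 9, 10],
    },
}
GROUND_TRUTH: Dict[str, Set[int]] = {
    "Q1": {1, 3},
    "Q2": {2, 4, 5, 6},
    "Q3": {7},
}
METRICS = ("P@5", "P@10", "AP", "RR")


def precision_at_k(relevant: Set[int], ranking: List[int], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(relevant: Set[int], ranking: List[int]) -> float:
    """Mean of P@rank over the ranks at which relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(relevant: Set[int], ranking: List[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0


def evaluate(system: str) -> Dict[str, Dict[str, float]]:
    """Per-query measures for one system plus an 'Average' row (MAP, MRR)."""
    rows: Dict[str, Dict[str, float]] = {}
    for query, ranking in RANKINGS[system].items():
        relevant = GROUND_TRUTH[query]
        rows[query] = {
            "P@5": precision_at_k(relevant, ranking, 5),
            "P@10": precision_at_k(relevant, ranking, 10),
            "AP": average_precision(relevant, ranking),
            "RR": reciprocal_rank(relevant, ranking),
        }
    n_queries = len(rows)
    rows["Average"] = {
        m: sum(row[m] for row in rows.values()) / n_queries for m in METRICS
    }
    return rows


for system in ("A", "B"):
    for query, row in evaluate(system).items():
        cells = "  ".join(f"{m}={row[m]:.3f}" for m in METRICS)
        print(f"System {system} {query:<8}{cells}")
```

Under these definitions the averages come out at roughly MAP 0.601 and MRR 0.667 for System A against MAP 0.456 and MRR 0.528 for System B, with identical average P@10, which suggests System A places relevant documents earlier on this small collection.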