# Week 3

## Metrics

A metric can be transformed in any way we like as long as the transformation does not bias it towards any particular system, i.e., the ordering of systems by performance is preserved.

### Precision

\begin{equation}
P = \frac{\text{TP}}{\text{TP} + \text{FP}}
\end{equation}


### Recall

\begin{equation}
R = \frac{\text{TP}}{\text{TP} + \text{FN}}
\end{equation}


### F1

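\begin{equation}
F_\beta = \frac{1}{ \frac{\beta^2}{\beta^2 + 1}\frac{1}{R} +  \frac{1}{\beta^2 + 1}\frac{1}{P}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}
\end{equation}

This is the weighted harmonic mean of $P$ and $R$ with weight parameter $\beta > 0$: $\beta > 1$ puts more weight on recall, and $\beta < 1$ puts more weight on precision.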

\begin{equation}
F_1 = \frac{2PR}{P + R}
\end{equation}

is the special case for $\beta = 1$.
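
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the counts in the usage lines are made up, and `f_beta` assumes the common convention $F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, under which $\beta > 1$ favors recall.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when nothing was retrieved."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there are no relevant items."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (assumed convention)."""
    if p == 0.0 or r == 0.0:
        return 0.0  # the harmonic mean is 0 if either component is 0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# Made-up counts: 4 true positives, 4 false positives, 6 false negatives.
p = precision(tp=4, fp=4)  # 0.5
r = recall(tp=4, fn=6)     # 0.4
print(f_beta(p, r))            # F1, roughly 0.444
print(f_beta(p, r, beta=2.0))  # roughly 0.417, pulled towards recall
```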

## Evaluating a ranked list

### Precision-Recall (PR) Curve

Given a ranked list, go from top to bottom and compute the corresponding precision and recall $(P_i, R_i)$ at each position $i$.

Plot these points with $P$ on the y-axis and $R$ on the x-axis. This is known as the PR curve.

Assume that the precision for relevant documents that do not appear in the list is 0.

The area under the PR curve is a measure of the quality of the algorithm that generated this list.
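
As a rough sketch of the procedure above, the points of the PR curve can be computed in one pass over the list. The function name `pr_points`, the binary encoding of the ranking (1 = relevant, 0 = not relevant), and the sample ranking are assumptions made for illustration only.

```python
def pr_points(ranked_rel, total_relevant):
    """Return the (precision, recall) pair at every rank of the list.

    ranked_rel: 0/1 relevance flags, top of the list first.
    total_relevant: number of relevant documents in the whole collection
                    (assumed to be greater than zero).
    """
    points = []
    hits = 0  # relevant documents seen so far
    for i, rel in enumerate(ranked_rel, start=1):
        hits += rel
        points.append((hits / i, hits / total_relevant))
    return points


# Hypothetical ranking of 10 documents, with 10 relevant documents overall.
ranking = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
for rank, (p, r) in enumerate(pr_points(ranking, total_relevant=10), start=1):
    print(f"rank {rank}: P = {p:.2f}, R = {r:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis from these pairs gives the curve.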

### Average Precision (AP)

The average of precision at every cutoff where a new relevant document is retrieved.

Normalizer = total number of relevant documents in the collection (not just those that were retrieved).

Sensitive to the rank of each relevant document.

Compute precision and recall $(P_i, R_i)$ at each position $i$.
From top rank to bottom of list, average the precision corresponding to a change in recall.

In this example, assume there are **10 relevant documents** in the collection.

| Doc  | Precision | Recall |
|------|-----------|--------|
| D1 + | **1/1**   | 1/10   |
| D2 + | **2/2**   | 2/10   |
| D3 - | 2/3       |        |
| D4 - | 2/4       |        |
| D5 + | **3/5**   | 3/10   |
| D6 - | 3/6       |        |
| D7 - | 3/7       |        |
| D8 + | **4/8**   | 4/10   |
| D9 - | 4/9       |        |
| D10 -| 4/10      |        |

\begin{equation}
\text{AP} = \frac{ \frac{1}{1} + \frac{2}{2} + \frac{3}{5} + \frac{4}{8} + \overbrace{0 + \ldots + 0}^{\text{6 relevant documents never retrieved}} }{ 
\underbrace{10}_{\text{number of relevant documents in collection}} } = 0.31
\end{equation}

For example, if two adjacent points have the same recall, they either coincide (when they have the same precision) or one lies directly below the other (when one precision is smaller than the other).
By only considering precision at points where recall changes, we ignore the points "below the curve".

Special case when there is only **one known** relevant document, retrieved at rank $r$:

\begin{equation}
\text{AP} = \text{Reciprocal Rank} = \frac{1}{r}
\end{equation}
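
The computation in the worked example above can be sketched as follows. The function name and the 0/1 encoding of the ranking (1 for the rows marked `+`, 0 for those marked `-`) are assumptions made for illustration.

```python
def average_precision(ranked_rel, total_relevant):
    """Average precision of a ranked list with binary relevance.

    Precision is accumulated only at ranks where a relevant document
    appears (i.e., where recall changes). Relevant documents that are
    never retrieved contribute 0 through the normalizer.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / total_relevant


# D1..D10 from the table, with 10 relevant documents in the collection.
ranking = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(average_precision(ranking, total_relevant=10))  # (1 + 1 + 0.6 + 0.5) / 10

# Single relevant document at rank 3: AP reduces to the reciprocal rank.
print(average_precision([0, 0, 1], total_relevant=1))  # 1/3
```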


### Mean Average Precision (MAP)

Mean of average precision over **a set of queries**.

\begin{equation}
\text{MAP} := \frac{\sum_{i=1}^N \text{AP}_i}{N}
\end{equation}


### Geometric mean average precision (gMAP)

Geometric mean of average precisions over a set of queries.

\begin{equation}
\text{gMAP} := \left( \prod_{i=1}^N \text{AP}_i \right)^{\frac{1}{N}}
\end{equation}
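
To see how the two aggregates differ, here is a small sketch that computes both from a list of per-query AP values. The function names and the AP values are hypothetical. The geometric mean is computed in log space, and it is taken to be 0 as soon as any single AP is 0, which follows directly from the product in the definition.

```python
import math


def mean_ap(aps):
    """Arithmetic mean of per-query average precision values."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps):
    """Geometric mean of per-query average precision values."""
    if any(ap <= 0.0 for ap in aps):
        return 0.0  # a single zero makes the whole product zero
    return math.exp(sum(math.log(ap) for ap in aps) / len(aps))


# Made-up AP values: one easy query and one hard query.
aps = [0.9, 0.1]
print(mean_ap(aps))            # 0.5
print(geometric_mean_ap(aps))  # roughly 0.3
```

In this made-up case the hard query pulls gMAP well below MAP, which illustrates why gMAP is more sensitive to poorly performing queries.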

## Multi-level relevance judgement

The previous measures only take into account binary relevance (i.e., relevant or not relevant).

They are not applicable if we want to assign a "score"/"weight" to represent the degree of relevance of each retrieved document.

E.g., relevance level: r = 1 (non-relevant), 2 (marginally-relevant), 3 (very relevant).

The gain/relevance is a measure of the "utility" of a document to the user.

### Discounted cumulative gain (DCG)

We need to **discount** the gain based on the position of the document in the list so that a relevant document **ranked higher** counts for more than one ranked lower.

\begin{equation}
\text{DCG@k} := \text{rel}_1 + \sum_{i = 2}^k \frac{ \overbrace{\text{rel}_i}^{\text{Relevance for ith doc}} }
{ \underbrace{\log_2 i}_{\text{Discount factor for position i}} }
\end{equation}

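**NOTE** that several variations of how DCG (and hence NDCG) is computed can be found online, e.g., using $\log_2(i + 1)$ as the discount or $2^{\text{rel}_i} - 1$ as the gain.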


### Normalized discounted cumulative gain (NDCG)

It is not possible to compare DCG across different queries and collections because DCG can have different scales.
By normalizing the DCG, we make all queries contribute equally to an aggregated score.

\begin{equation}
\text{NDCG@k} := \frac{\text{DCG@k}}{\text{Ideal DCG@k}}
\end{equation}

where Ideal DCG@k is the best possible DCG@k score for that collection and query, i.e., the DCG@k obtained by ranking the documents in decreasing order of relevance.
This maps the DCG score to the range $[0, 1]$.

Here is an example.

| Doc | Gain / Relevance | Cumulative Gain | DCG                                           |
|-----|------------------|-----------------|-----------------------------------------------|
| D1  | 3                | 3               | 3                                             |
| D2  | 2                | 5               | $3 + \frac{2}{\log_2 2}$                      |
| D3  | 1                | 6               | $3 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3}$ |
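
The table can be reproduced with a short sketch that follows the DCG@k definition used in these notes (no discount at rank 1, $\log_2 i$ from rank 2 onward). The function names are hypothetical, and the optional `all_gains` argument is an assumption: it stands for the gains of every judged document for the query, which is what the ideal ranking should be built from when the list does not contain all of them.

```python
import math


def dcg_at_k(gains, k):
    """DCG@k with rel_1 undiscounted and rel_i / log2(i) for i >= 2."""
    total = 0.0
    for i, gain in enumerate(gains[:k], start=1):
        total += gain if i == 1 else gain / math.log2(i)
    return total


def ndcg_at_k(gains, k, all_gains=None):
    """DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    pool = gains if all_gains is None else all_gains
    ideal = sorted(pool, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


gains = [3, 2, 1]                # D1, D2, D3 from the table
print(dcg_at_k(gains, 3))        # 3 + 2/log2(2) + 1/log2(3), roughly 5.63
print(ndcg_at_k(gains, 3))       # already in ideal order, so 1.0
print(ndcg_at_k([1, 2, 3], 3))   # reversed order scores below 1
```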

## Statistical significance testing

Given a set of queries and the scores of several systems on them, we want to
use statistical testing to judge whether an observed difference in scores is statistically significant or merely due to random chance
before we conclude that one system is better than another.

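E.g., a simple sign test, which only records for each query whether system A scores higher or lower than system B, yields a p-value for testing whether A is better than B.

Another is the Wilcoxon signed-rank test, which also takes into account the magnitude of the difference between the scores.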
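
As an illustration of the sign test, the sketch below computes an exact two-sided p-value from per-query scores of two systems using only the standard library. The function name and the scores are made up, and dropping tied queries is one common convention that is assumed here rather than taken from the notes.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Exact two-sided sign test on paired per-query scores.

    Under the null hypothesis, each non-tied query is equally likely
    to favor either system, so the number of wins is Binomial(n, 0.5).
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # ties are dropped (assumed convention)
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-query AP for two systems over 7 queries.
system_a = [0.35, 0.42, 0.51, 0.60, 0.72, 0.80, 0.91]
system_b = [0.30, 0.40, 0.50, 0.55, 0.70, 0.75, 0.90]
print(sign_test_p_value(system_a, system_b))  # A wins all 7: 2 / 2**7 = 0.015625
```

Note that the differences here are small on every query. The sign test ignores that, which is exactly the information the Wilcoxon test adds.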



## Pooling to create a test collection

Naturally we want to minimize the amount of work needed to create a test collection.

We want to select a subset of documents to judge.

The pooling strategy works by choosing a diverse set of systems and having each return its top-K documents
for human assessors to judge.

Documents that were not judged are assumed to be non-relevant (though they may in fact be relevant).

This is valid for comparing systems that contributed to the pool but problematic for evaluating those that did not.
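
A minimal sketch of the pooling step, assuming each system's output is available as a ranked list of document IDs. The system names, document IDs, and function names are all hypothetical.

```python
def build_pool(runs, k):
    """Union of the top-k documents returned by each contributing system."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def judged_relevance(doc_id, judgments):
    """Look up a judgment; unjudged documents are assumed non-relevant."""
    return judgments.get(doc_id, 0)


# Hypothetical rankings from two systems for a single query.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d5"],
}
pool = build_pool(runs, k=2)
print(sorted(pool))  # ['d1', 'd2', 'd4']

# Assessors judge only the pooled documents.
judgments = {"d1": 1, "d2": 0, "d4": 1}
print(judged_relevance("d5", judgments))  # 0, since d5 was never judged
```

A system that did not contribute to the pool may retrieve relevant documents such as `d5` that were never judged, so it is scored as if they were non-relevant.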

