# Evaluation of IR Systems

## Overall Evaluation of a Search Engine

### Criteria

#### Efficiency and Effectiveness

* How fast does it index?
    * Large scale ... Number of docs/hour
* How fast does it search?
    * Latency as a function of index size
* How good are the results?
    * Filling information need in top $N$ documents

#### Interface 

* Expressiveness of the query language
    * Boolean vs simple queries
    * Negation
* Error tolerance
* UI
* Cost

### Key Measure: User Happiness

* Speed of response/size of index are factors
* Blindingly fast, useless answers won't make a user happy
* Need a way to quantify happiness

#### Variation by Goal

* Web engine
    * User finds what they want and returns for more searches
    * Measure: returning user rate
* eCommerce
    * User finds product to buy
    * Measure: searches that result in purchase
* Enterprise (Gov, academic)
    * Speed of finding information

## Establishing a Standard 

### Defining a Benchmark

1. Establish information need
2. Establish relevant documents
3. Have multiple annotators perform the same evaluation

#### Relevance 

* Most common proxy of user happiness: relevance
* How to measure?
    1. A benchmark document collection
    2. A benchmark suite of queries
    3. A binary (usually) classification of either relevant or nonrelevant for each $(q, d)$ pair
* Assessed with respect to _information need_, not query

### Inter-annotator agreement

* Kappa measure ($\kappa$)
    * Agreement between judges on categorical judgments
    * Corrects for agreement by chance
* $P(A)$: proportion of time judges agree
* $P(E)$: What agreement would be by chance
* Value: $0$ for chance-level agreement, $1$ for total agreement (negative if agreement is worse than chance)

$$\kappa = \frac{[P(A)-P(E)]}{[1-P(E)]}$$

#### Example

|  _           | Relevant | Not Relevant | Total |
|--------------|----------|--------------|-------|
| Relevant     |    300   |           20 | 320   |
| Not Relevant |    10    |           70 | 80    |
| **Total**    |**310**   |  **90**      | **400**|

$$P(A) = \frac{300+70}{400}=0.925$$

$$P(J_1=Y \cap J_2 = Y)=\frac{320}{400}\cdot\frac{310}{400}=0.62$$

$$P(J_1=N \cap J_2 = N)=\frac{90}{400}\cdot\frac{80}{400}=0.045$$

$$P(E)=0.62+0.045=0.665$$

$$\kappa = \frac{P(A)-P(E)}{1-P(E)}=\frac{0.925-0.665}{1-0.665}=0.776$$

* $\kappa > 0.8$ = good agreement
* $0.67 < \kappa < 0.8 \rightarrow$ "tentative conclusions"
* For $>2$ annotators: average pairwise kappas
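The worked example can be checked with a few lines of Python. This is a minimal sketch: the function name `cohens_kappa` and its argument names are illustrative, and it assumes, as the example does, that chance agreement is computed from each judge's own marginals.

```python
def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges making binary relevance judgments.

    yes_no counts the pairs where judge 1 said relevant and
    judge 2 said not relevant (argument names are illustrative).
    """
    total = yes_yes + yes_no + no_yes + no_no
    # P(A): observed proportion of agreement.
    p_agree = (yes_yes + no_no) / total
    # Marginal probability that each judge says "relevant".
    p_j1_yes = (yes_yes + yes_no) / total
    p_j2_yes = (yes_yes + no_yes) / total
    # P(E): chance agreement, assuming the judges decide independently.
    p_chance = p_j1_yes * p_j2_yes + (1 - p_j1_yes) * (1 - p_j2_yes)
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the table above.
kappa = cohens_kappa(yes_yes=300, yes_no=20, no_yes=10, no_no=70)
print(round(kappa, 3))  # 0.776
```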

### Available test collections

* TREC conference

## Boolean Model Evaluation

### Precision

Fraction of retrieved docs that are relevant.

$$P(\text{Relevant}|\text{Retrieved}) = \frac{TP}{TP+FP}$$

### Recall

Fraction of relevant docs that are retrieved.

$$P(\text{Retrieved | Relevant}) = \frac{TP}{TP+FN}$$

### Accuracy

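Fraction of classifications that are correct. Commonly used in machine learning classification, but less useful in IR: nearly all documents are nonrelevant to any given query, so a system that retrieves nothing still scores very high accuracy.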

$$\frac{TP+TN}{TP+FP+FN+TN} = \frac{TP+TN}{N}$$
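The three measures can be sketched directly from the contingency counts. The counts below are hypothetical, chosen only to show how accuracy rewards retrieving nothing; returning `0.0` for an empty denominator is a convention assumed here, not something the definitions prescribe.

```python
def precision(tp, fp):
    # Share of retrieved docs that are relevant; 0.0 if nothing was retrieved.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Share of relevant docs that were retrieved; 0.0 if nothing is relevant.
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    # Share of all (query, doc) decisions that were correct.
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical collection of 1,000,000 docs, 80 of them relevant.
some_results = dict(tp=20, fp=40, fn=60, tn=999_880)
no_results = dict(tp=0, fp=0, fn=80, tn=999_920)

print(precision(20, 40), recall(20, 60))
print(accuracy(**some_results), accuracy(**no_results))
```

With these counts the empty result list has the higher accuracy, even though its recall is zero.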

### Tradeoff

* Can get high recall (but low precision) by just retrieving all docs
    * Recall is a non-decreasing function of # docs retrieved
* In a good system, precision decreases as # docs retrieved (or recall) increases

### Combined F Measure

* Weighted harmonic mean ($F$) assesses precision/recall tradeoff
* Balanced $F_1$
    * Generally used
    * $\beta=1$ or $\alpha=\frac{1}{2}$
* Conservative: the harmonic mean never exceeds the arithmetic mean and stays close to the smaller of $P$ and $R$

$$F=\frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}} = \frac{(\beta^2+1)PR}{\beta^2 P + R}$$
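A short sketch of the $\beta$ form of the formula, with $\beta^2 = \frac{1-\alpha}{\alpha}$ linking the two parameterizations. The function name `f_measure` and the zero-handling convention are assumptions made for illustration.

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0 or r == 0:
        # Assumed convention: F is 0 when either component is 0.
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)


# High precision cannot compensate for very low recall:
p, r = 0.9, 0.1
print(f_measure(p, r))   # balanced F1, close to the smaller value
print((p + r) / 2)       # arithmetic mean, for comparison
```

Values of $\beta > 1$ weight recall more heavily, and $\beta < 1$ weight precision more heavily.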

## Ranked Model Evaluation

### Precision@K

1. Set rank threshold $K$
2. Compute % relevant in top $K$
3. Ignore documents ranked lower than $K$
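The three steps above can be sketched as follows. The input format (a list of booleans, best rank first) is an assumption, as is counting missing results as nonrelevant when fewer than $K$ documents are returned.

```python
def precision_at_k(ranked_relevance, k):
    """ranked_relevance: list of booleans, best-ranked document first."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Everything ranked below k is ignored.
    top_k = ranked_relevance[:k]
    # Dividing by k (not len(top_k)) treats missing results as nonrelevant.
    return sum(top_k) / k


ranking = [True, False, True, False, False]
print(precision_at_k(ranking, 3))
print(precision_at_k(ranking, 5))
```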

#### Interpolated Average Precision

* Early TREC competitions used $11$-point interpolated average precision
* Take precision at 11 recall levels: $0.0, 0.1, \ldots, 1.0$
* Using interpolation _(value for $0$ always interpolated)_
* Average them
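A minimal sketch of the computation for a single query. It assumes the usual definition of interpolated precision (the highest precision observed at any recall greater than or equal to the given level); the function name and input format are illustrative.

```python
def eleven_point_interpolated_ap(ranked_relevance, num_relevant):
    """ranked_relevance: booleans, best rank first.
    num_relevant: total relevant docs for the query (must be > 0)."""
    # (recall, precision) measured after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / num_relevant, hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(interpolated)


print(eleven_point_interpolated_ap([True, False, True], num_relevant=2))
```

Because the value at recall $0$ has no measured point of its own, it always comes from interpolation, which is why the first level takes the best precision seen anywhere in the ranking.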

### Mean Average Precision (MAP)

* For each query, average the precision values obtained at each rank where a relevant doc is retrieved; MAP is the mean of these averages over all queries
* Avoids interpolation, fixed recall levels
* Most common measure in papers
* Good for web search?
* Assumes user is interested in finding many relevant documents per query
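A small sketch of MAP over a set of queries. The input format is an assumption: each query is a pair of a boolean ranking (best rank first) and the total number of relevant documents, so relevant documents that are never retrieved contribute a precision of zero.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision for one query."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision of the top `rank` documents at this relevant hit.
            precision_sum += hits / rank
    # Dividing by num_relevant penalizes relevant docs never retrieved.
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_relevance, num_relevant), one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


runs = [
    ([True, False, True], 2),  # AP = (1/1 + 2/3) / 2
    ([False, True], 2),        # AP = (1/2) / 2, one relevant doc missed
]
print(mean_average_precision(runs))
```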

#### Intra-/inter-system variance

* On a given test collection, a system generally does poorly on some information needs ($\mathit{MAP}=0.1$) and well on others ($\mathit{MAP}=0.7$)
* Variance of _same system across queries_ is **much greater** than variance of _different systems on same query_

### MRR

* Consider rank position $K$ of first relevant doc
* Reciprocal Rank score = $\frac{1}{K}$
* $\mathit{MRR}$ is the mean RR across queries
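The definition translates into a few lines of Python. Scoring a query with no relevant document in its ranking as $0$ is a convention assumed here; the input format (booleans, best rank first) is also illustrative.

```python
def reciprocal_rank(ranked_relevance):
    """1/K for the rank K of the first relevant doc."""
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    # Assumed convention: no relevant doc retrieved scores 0.
    return 0.0


def mean_reciprocal_rank(rankings):
    """rankings: one boolean ranking per query."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


rankings = [
    [False, True],          # first relevant doc at rank 2
    [True],                 # first relevant doc at rank 1
    [False, False, False],  # no relevant doc retrieved
]
print(mean_reciprocal_rank(rankings))  # 0.5
```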

## Other Issues

### Absolute vs. Marginal Relevance

* A document can be redundant even if highly relevant
* Same, redundant information from multiple sources

### A/B Testing

* Purpose: Test single change/innovation/idea
* Have most users use the old system
* Divert small subset of users (e.g., $1\%$) to new system
* Evaluate with automatic measure
    * Clickthrough on first result
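One possible way to set this up is sketched below. Hashing the user ID to pick the arm is an assumption made for illustration (the notes do not say how users are diverted), and the names `assign_arm` and `first_result_ctr` are hypothetical.

```python
import hashlib


def assign_arm(user_id, treatment_fraction=0.01):
    """Deterministically place a user in the "old" or "new" system.

    Hash-based bucketing is an illustrative choice: the same user
    always lands in the same arm without storing any state.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    threshold = round(treatment_fraction * 10_000)
    return "new" if bucket < threshold else "old"


def first_result_ctr(first_result_clicked):
    """first_result_clicked: one boolean per search,
    True if the user clicked the top-ranked result."""
    if not first_result_clicked:
        return 0.0
    return sum(first_result_clicked) / len(first_result_clicked)


print(assign_arm("user-42"))
print(first_result_ctr([True, False, False, True]))  # 0.5
```

The clickthrough rate would be computed separately for each arm and then compared; deciding whether a difference is meaningful needs a significance test, which this sketch leaves out.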
