# Term-at-a-time Query Processing

Implement term-at-a-time query processing using a simple scoring function.

In [2]:
import ipytest
import pytest

from typing import Dict, List, Tuple
from collections import Counter

ipytest.autoconfig()

### Inverted index

For simplicity, the inverted index for the document collection is given as a dictionary, with terms as keys and posting lists as values. Each posting is a (document ID, term frequency) tuple.

In [3]:
index = {
    "beijing": [        (1, 1),                 (4, 1)],
    "dish":    [        (1, 1),                 (4, 1)],
    "duck":    [(0, 3), (1, 2), (2, 2),         (4, 1)],
    "rabbit":  [                (2, 1), (3, 1)        ],
    "recipe":  [                (2, 1), (3, 1), (4, 1)]
}
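
To make the structure concrete, here is a minimal sketch of how an index of this shape could be built. The toy documents are an assumption: their term counts are reconstructed from the postings above, but the word order is made up, and the helper name `build_index` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical documents, reconstructed from the postings above.
# Only the term counts follow from the index; the word order is invented.
docs = [
    "duck duck duck",
    "beijing duck dish duck",
    "duck rabbit recipe duck",
    "rabbit recipe",
    "beijing duck dish recipe",
]


def build_index(docs: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Builds a term -> [(doc_id, term_frequency), ...] mapping (hypothetical helper)."""
    index = defaultdict(list)
    # Visiting documents in ID order keeps every posting list sorted by document ID.
    for doc_id, text in enumerate(docs):
        for term, freq in Counter(text.split()).items():
            index[term].append((doc_id, freq))
    return dict(index)


toy_index = build_index(docs)
print(toy_index["duck"])  # [(0, 3), (1, 2), (2, 2), (4, 1)]
```

Whitespace tokenization is used here only to keep the sketch short; a real indexer would typically normalize and filter terms first.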

### Document lengths

The length of each document is provided in a list. (Normally, this information would be stored in a document index.)

In [4]:
doc_len = [3, 4, 4, 2, 4]
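
As a sanity check, the lengths can be recomputed by summing the term frequencies of each document over all posting lists. This is only a sketch: it assumes the index covers every term in the collection (nothing was dropped at indexing time), and the helper name `doc_lengths_from_index` is hypothetical.

```python
from typing import Dict, List, Tuple

index = {
    "beijing": [(1, 1), (4, 1)],
    "dish": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "rabbit": [(2, 1), (3, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)],
}


def doc_lengths_from_index(
    index: Dict[str, List[Tuple[int, int]]], num_docs: int
) -> List[int]:
    """Sums term frequencies per document (hypothetical helper)."""
    lengths = [0] * num_docs
    for postings in index.values():
        for doc_id, tf in postings:
            lengths[doc_id] += tf
    return lengths


derived_doc_len = doc_lengths_from_index(index, num_docs=5)
print(derived_doc_len)  # [3, 4, 4, 2, 4]
```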

### Term-at-a-time scoring

The posting lists are ordered by document ID, and each query term's posting list needs to be traversed only once. We process one posting list completely before moving on to the next query term, and keep a score accumulator for each document to which every posting adds its contribution.

The retrieval function we use is the following:

$$score(q,d) = \sum_{t \in q} w_{t,d} \times w_{t,q}$$

where $w_{t,d}$ and $w_{t,q}$ are length-normalized term frequencies. That is, $w_{t,d} = \frac{c_{t,d}}{|d|}$, where $c_{t,d}$ is the number of occurrences of term $t$ in document $d$ and $|d|$ is the document length, i.e., the total number of terms in $d$. Similarly, $w_{t,q} = \frac{c_{t,q}}{|q|}$ for the query.
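
As a worked example using the index and document lengths above, take the query "beijing duck recipe". Each query term occurs once and $|q| = 3$, so $w_{t,q} = \frac{1}{3}$ for all three terms. Document 3 contains only "recipe" (once, with $|d| = 2$), while document 4 contains each of the three terms once (with $|d| = 4$):

$$score(q, d_3) = \frac{1}{2} \times \frac{1}{3} = \frac{1}{6}$$

$$score(q, d_4) = \frac{1}{4} \times \frac{1}{3} + \frac{1}{4} \times \frac{1}{3} + \frac{1}{4} \times \frac{1}{3} = \frac{1}{4}$$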

In [None]:
def score_collection(index: Dict[str, List[Tuple[int, int]]], 
                     doc_len: List[int], 
                     query: str) -> List[Tuple[int, float]]:
    """Scores all documents in the collection.
    
    Args:
        index: Dict holding the inverted index.
        doc_len: List with document lengths.
        query: Search query.
    
    Returns:
        List with (document_id, score) tuples, ordered by score desc.
    """
    
    # Turns the query string into a "term: freq" dictionary.
    query_freqs = dict(Counter(query.split()))
    # Computes query length (i.e., sum of all query term frequencies).
    query_len = sum(query_freqs.values())

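    # Score accumulator: one entry per document, initialized to zero.
    doc_scores = {docid: 0.0 for docid in range(len(doc_len))}

    # One possible TAAT implementation: process each query term's posting
    # list completely before moving on to the next term.
    for term, tf_q in query_freqs.items():
        w_tq = tf_q / query_len  # Length-normalized query term weight.
        # Query terms missing from the index contribute nothing.
        for docid, tf_d in index.get(term, []):
            w_td = tf_d / doc_len[docid]  # Length-normalized document term weight.
            doc_scores[docid] += w_td * w_tq

    # Sort by score (descending); ties are broken by ascending document ID.
    return sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))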

In [None]:
%%run_pytest[clean]

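def test_scoring():
    scores = score_collection(index, doc_len, "beijing duck recipe")
    # Document 0 ranks first with a score of 1/3.
    assert scores[0][0] == 0
    assert scores[0][1] == pytest.approx(1/3, rel=1e-2)
    # Documents 1, 2, and 4 tie at 1/4; document 2 sits at position 2
    # when ties are ordered by document ID.
    assert scores[2][0] == 2
    assert scores[2][1] == pytest.approx(1/4, rel=1e-2)
    # Document 3 ranks last with a score of 1/6.
    assert scores[4][0] == 3
    assert scores[4][1] == pytest.approx(1/6, rel=1e-2)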