# Query processing with document-at-a-time scoring

Implement document-at-a-time scoring using a simple retrieval function.

In [1]:
import ipytest
import pytest

from typing import Dict, List, Tuple
from collections import Counter

ipytest.autoconfig()


### Inverted index

For simplicity, the inverted index for the document collection is given as a dictionary, with terms as keys and posting lists as values. Each posting is a (document ID, term frequency) tuple.

In [2]:
index = {
    "beijing": [(1, 1), (4, 1)],
    "dish": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "rabbit": [(2, 1), (3, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)]
}

### Document lengths

The length of each document is provided in a list. (Normally, this information would be present in a document metadata store or in a forward index.)

In [3]:
doc_len = [3, 4, 4, 2, 4]
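
As a rough illustration of where these two structures come from, the sketch below builds both from raw text. The `docs` list is a hypothetical collection, reverse-engineered so that it reproduces the `index` and `doc_len` values above; the original documents are not given in this notebook. The helper name `build_index` is likewise an assumption made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical documents (an assumption), chosen to match the index above.
docs = [
    "duck duck duck",
    "beijing dish duck duck",
    "duck duck rabbit recipe",
    "rabbit recipe",
    "beijing dish duck recipe",
]


def build_index(docs: List[str]) -> Tuple[Dict[str, List[Tuple[int, int]]], List[int]]:
    """Builds an inverted index and a list of document lengths."""
    index = defaultdict(list)
    doc_len = []
    # Visiting documents in ID order keeps every posting list sorted by document ID.
    for doc_id, text in enumerate(docs):
        terms = text.split()
        doc_len.append(len(terms))
        for term, freq in Counter(terms).items():
            index[term].append((doc_id, freq))
    return dict(index), doc_len


built_index, built_doc_len = build_index(docs)
```

Because documents are processed in increasing ID order, the posting lists come out sorted by document ID without an explicit sort.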

### Document-at-a-time scoring

We utilize the fact that the posting lists are ordered by document ID. It is then enough to iterate through each query term's posting list only once, keeping a pointer for each query term.

Normally, document scores would be kept in a priority queue. Here, for simplicity, we will keep them in a dictionary.
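
To illustrate the priority-queue alternative, here is a minimal sketch that keeps only the `k` best documents in a min-heap, using Python's `heapq` module. The helper name `top_k` and the sample scores are assumptions made for this example.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k(scored_docs: Iterable[Tuple[int, float]], k: int) -> List[Tuple[int, float]]:
    """Returns the k highest-scoring (doc_id, score) pairs, best first."""
    if k <= 0:
        return []
    # Min-heap of (score, doc_id); the root is the weakest entry kept so far.
    heap: List[Tuple[float, int]] = []
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The new document beats the weakest kept one, so replace it.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


# Made-up scores, for illustration only.
sample = [(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.1), (4, 0.7)]
best = top_k(sample, 2)
print(best)  # [(1, 0.9), (4, 0.7)]
```

With this approach the memory needed for results grows with `k` rather than with the number of scored documents.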

The retrieval function we use is the following:

$$score(q,d) = \sum_{t \in q} w_{t,d} \times w_{t,q}$$

where $w_{t,d}$ and $w_{t,q}$ are length-normalized term frequencies. That is, $w_{t,d}=\frac{c_{t,d}}{|d|}$, where $c_{t,d}$ is the number of occurrences of term $t$ in document $d$ and $|d|$ is the document length (i.e., the total number of terms). The query weight is defined analogously: $w_{t,q}=\frac{c_{t,q}}{|q|}$.
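
As a worked example of the formula, the sketch below scores a single document by hand. The term counts are those of document 4 in the index above (one occurrence each of `beijing`, `dish`, `duck`, and `recipe`, so its length is 4), scored against the query `beijing duck recipe`. `Fraction` is used here only to keep the arithmetic exact; it is not part of the retrieval function.

```python
from collections import Counter
from fractions import Fraction

query = "beijing duck recipe"
c_td = {"beijing": 1, "dish": 1, "duck": 1, "recipe": 1}  # Term counts of document 4.
doc_length = sum(c_td.values())  # |d| = 4

c_tq = Counter(query.split())
query_length = sum(c_tq.values())  # |q| = 3

score = Fraction(0)
for term, count in c_tq.items():
    w_td = Fraction(c_td.get(term, 0), doc_length)  # 1/4 for each query term here
    w_tq = Fraction(count, query_length)  # 1/3 for each query term
    score += w_td * w_tq

print(score)  # 1/4
```

Each of the three query terms contributes $\frac{1}{4} \times \frac{1}{3} = \frac{1}{12}$, which sums to $\frac{1}{4}$.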

In [4]:
def score_collection(index: Dict[str, List[Tuple[int, int]]], 
                    doc_len: List[int], 
                    query: str) -> List[Tuple[int, float]]:
    """Scores all documents in the collection.
    
    Args:
        index: Dict holding the inverted index.
        doc_len: List with document lengths.
        query: Search query.
    
    Returns:
        List with (document_id, score) tuples, ordered by score desc.
    """
    
    # Turns the query string into a "term: freq" dictionary.
    query_freqs = dict(Counter(query.split()))
    # Computes query length (i.e., sum of all query term frequencies).
    query_len = sum(query_freqs.values())

    # Holds the final document scores (this should be a priority queue,
    # but for simplicity we use a dictionary here).
    doc_scores = {}

    # Holds a pointer for each query term's posting list.
    pos = {term: 0 for term in query_freqs}
        
    # Iterate through each document.
    for doc_id in range(len(doc_len)):            
        # First, we collect the document term frequencies from the index.
        # (Essentially, we just "recover" the document's contents from the index.)
        c_td = {}  # Holds the term frequencies in the document
        for term in query_freqs.keys():
            # Query terms missing from the index are treated as having no postings.
            postings = index.get(term, [])
            # Get the term frequency from the posting list.
            # Utilize the fact that the posting lists are ordered by document ID!
            if pos[term] == len(postings):  # The end of the posting list has been reached.
                continue
            (d, freq) = postings[pos[term]]
            if d == doc_id:
                c_td[term] = freq
                pos[term] += 1
            # Otherwise d > doc_id, i.e., the term is not present in this doc.
                    
        # Then, we score the document.
        score = 0  # Holds the document's retrieval score.
        for term, c_tq in query_freqs.items():
            # Increment the document's score according to the given query term.
            w_td = c_td.get(term, 0) / doc_len[doc_id]
            w_tq = c_tq / query_len
            score += w_td * w_tq
        # Record final document score.
        doc_scores[doc_id] = score
        
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
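
The loop above visits every document ID, including documents that contain none of the query terms. A common variant instead jumps straight to the smallest document ID found under any of the pointers. The following is a minimal sketch of that idea with the same scoring formula. The name `score_matching_docs` is an assumption made for this example, and unlike `score_collection` it leaves zero-score documents out of the result.

```python
from collections import Counter
from typing import Dict, List, Tuple


def score_matching_docs(index: Dict[str, List[Tuple[int, int]]],
                        doc_len: List[int],
                        query: str) -> List[Tuple[int, float]]:
    """Scores only the documents that contain at least one query term."""
    query_freqs = Counter(query.split())
    query_len = sum(query_freqs.values())
    # Terms missing from the index are treated as having an empty posting list.
    postings = {term: index.get(term, []) for term in query_freqs}
    pos = {term: 0 for term in query_freqs}
    doc_scores = {}

    while True:
        # Document IDs under each pointer that has not yet run off its list.
        heads = [postings[t][pos[t]][0] for t in query_freqs
                 if pos[t] < len(postings[t])]
        if not heads:
            break  # Every posting list is exhausted.
        doc_id = min(heads)  # Next document containing some query term.
        score = 0.0
        for term, c_tq in query_freqs.items():
            plist = postings[term]
            if pos[term] < len(plist) and plist[pos[term]][0] == doc_id:
                freq = plist[pos[term]][1]
                score += (freq / doc_len[doc_id]) * (c_tq / query_len)
                pos[term] += 1  # Advance only the pointers that matched.
        doc_scores[doc_id] = score

    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)


# Same toy data as above, repeated under new names so the sketch runs on its own.
toy_index = {
    "beijing": [(1, 1), (4, 1)],
    "dish": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "rabbit": [(2, 1), (3, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)],
}
toy_doc_len = [3, 4, 4, 2, 4]

ranking = score_matching_docs(toy_index, toy_doc_len, "rabbit")
print(ranking)  # [(3, 0.5), (2, 0.25)]
```

For the query `rabbit`, only documents 2 and 3 are touched; documents 0, 1, and 4 are never visited.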

Tests.

In [5]:
%%run_pytest[clean]

def test_scoring():
    scores = score_collection(index, doc_len, "beijing duck recipe")    
    assert scores[0][0] == 0
    assert scores[0][1] == pytest.approx(1/3, rel=1e-2)
    assert scores[2][0] == 2
    assert scores[2][1] == pytest.approx(1/4, rel=1e-2)
    assert scores[4][0] == 3
    assert scores[4][1] == pytest.approx(1/6, rel=1e-2)

.                                                                                  [100%]
1 passed in 0.01s
