# Document-at-a-time scoring

  - Implement document-at-a-time scoring using vector space retrieval with TF-IDF term weighting
  - Use the TF-IDF weighting schemes from the previous task

In [1]:
from pprint import pprint
from collections import Counter
import math

#### Term-document matrix

In [2]:
td_matrix = {
    "beijing": [0, 1, 0, 0, 1],
    "dish": [0, 1, 0, 0, 1],
    "duck": [3, 2, 2, 0, 1],
    "rabbit": [0, 0, 1, 1, 0],
    "recipe": [0, 0, 1, 1, 1],
    "roast": [0, 0, 0, 0, 0]
}

The number of documents is set manually for simplicity.

In [3]:
NUM_DOCS = 5

#### Creating the corresponding inverted index

The postings hold `(docID, freq)` pairs, sorted by docID. docID indices start from 0.

`doclen` stores the length of each document, i.e., the total number of term occurrences in it, keyed by docID.

In [4]:
inv_idx = {}
doclen = {}
for term, vec in td_matrix.items():
    inv_idx[term] = []
    for doc_id, freq in enumerate(vec):
        if freq > 0:
            inv_idx[term].append((doc_id, freq))
            doclen[doc_id] = doclen.get(doc_id, 0) + freq

pprint(inv_idx)

{'beijing': [(1, 1), (4, 1)],
 'dish': [(1, 1), (4, 1)],
 'duck': [(0, 3), (1, 2), (2, 2), (4, 1)],
 'rabbit': [(2, 1), (3, 1)],
 'recipe': [(2, 1), (3, 1), (4, 1)],
 'roast': []}


#### This class provides access to the inverted index

In [5]:
class InvIndex(object):
    def __init__(self, idx_contents):
        self.idx = idx_contents
    
    def postings(self, term):
        return self.idx.get(term, [])

Instantiate the InvIndex class

In [6]:
idx = InvIndex(inv_idx)

#### IDF calculation

In [7]:
def idf(term):
    # document frequency: the number of documents that contain the term
    df = len(idx.postings(term))
    return math.log(NUM_DOCS / df) if df > 0 else 0

### Document-at-a-time scoring

We exploit the fact that the posting lists are ordered by document ID.
The posting lists of the query terms are traversed in parallel: we always read from the head of each list and remove a posting once the document it refers to has been processed.
Each document is scored according to

$Score(q,d) = \sum_{t \in q} w_{t,q} \times w_{t,d}$

where $w_{t,d}=\frac{tfidf_{t,d}}{\sqrt{\sum_{t} tfidf_{t,d}^2}}$ and $w_{t,q}=\frac{tfidf_{t,q}}{\sqrt{\sum_{t} tfidf_{t,q}^2}}$

(which is the same as before).
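
The parallel traversal of the posting lists described earlier in this section can be sketched roughly as follows. This is only an illustration under assumptions: the helper name `iter_documents` and the small `plists` dictionary are hypothetical and not part of the notebook, and every posting list is assumed to be sorted by docID.

```python
def iter_documents(plists):
    """Yield (doc_id, {term: freq}) pairs in increasing doc_id order.

    plists maps each query term to its posting list of (doc_id, freq)
    pairs, sorted by doc_id. The lists are copied first, so the caller's
    data is left untouched.
    """
    # work on copies, because postings are removed from the front
    heads = {term: list(plist) for term, plist in plists.items()}
    while any(heads.values()):
        # the next document is the smallest doc_id at the head of any list
        doc_id = min(plist[0][0] for plist in heads.values() if plist)
        freqs = {}
        for term, plist in heads.items():
            if plist and plist[0][0] == doc_id:
                # consume the posting that belongs to the current document
                freqs[term] = plist.pop(0)[1]
        yield doc_id, freqs


# hypothetical example input: posting lists of three query terms
plists = {
    "beijing": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)],
}
docs = list(iter_documents(plists))
```

Each yielded dictionary contains only the query terms that actually occur in that document. Note that `list.pop(0)` takes time linear in the list length; for long posting lists, a per-list cursor or a `collections.deque` would likely be a better fit.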

Here, we cache the query TF-IDF scores ($tfidf_{t,q}$) and the query normalizer ($\sqrt{\sum_{t} tfidf_{t,q}^2}$), so that these are computed only once at the beginning, and not for each document.

Further, the document normalizers are also precomputed, since each one is constant for a given document; for simplicity, this computation is based on the term-document matrix rather than on the inverted index.
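
The document normalizers could also be derived from the inverted index alone. The following is a minimal sketch, assuming the index covers every term of the collection; the `inv_idx` contents printed earlier are repeated so that the snippet runs on its own, and the name `dnorm_from_index` is hypothetical.

```python
import math

NUM_DOCS = 5
inv_idx = {
    "beijing": [(1, 1), (4, 1)],
    "dish": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "rabbit": [(2, 1), (3, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)],
    "roast": [],
}

# document lengths: sum of the term frequencies found in the postings
doclen = {}
for postings in inv_idx.values():
    for doc_id, freq in postings:
        doclen[doc_id] = doclen.get(doc_id, 0) + freq

# accumulate the squared TF-IDF weights per document, then take the root
dnorm_from_index = {}
for term, postings in inv_idx.items():
    if not postings:
        continue  # a term that occurs in no document contributes nothing
    idf_t = math.log(NUM_DOCS / len(postings))
    for doc_id, freq in postings:
        tfidf = freq / doclen[doc_id] * idf_t
        dnorm_from_index[doc_id] = dnorm_from_index.get(doc_id, 0) + tfidf ** 2
dnorm_from_index = {d: math.sqrt(v) for d, v in dnorm_from_index.items()}
```

Because the postings only contain non-zero frequencies, this visits the same (term, document) pairs as the matrix-based loop and should give the same normalizers.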


In [8]:
dnorm = {}
for term, vec in td_matrix.items():
    for doc_id, freq in enumerate(vec):
        if freq > 0:
            tfidf = freq / doclen[doc_id] * idf(term)
            dnorm[doc_id] = dnorm.get(doc_id, 0) + tfidf**2
for doc_id, val in dnorm.items():
    dnorm[doc_id] = math.sqrt(val)

In [9]:
def score_dt(query, index):
    # change the sequence of query terms into a "term: freq" dictionary
    qry = dict(Counter(query))

    # precompute the query TF-IDF scores and the query normalizer
    tfidf_q = {}
    qnorm = 0
    for term, freq in qry.items():
        tf = freq / len(query)
        tfidf_q[term] = tf * idf(term)
        qnorm += tfidf_q[term]**2
    qnorm = math.sqrt(qnorm)
    
    # fetch the posting lists of each query term
    plists = {}  # holds a copy of the posting list of each query term
    for term in qry.keys():
        plists[term] = list(index.postings(term))  # need to copy the list!
        
    scores = {}  # holds the final document scores

    # iterate through each document
    for doc_id in range(NUM_DOCS):            
        # i) first, we collect the document term frequencies from the index
        # (Essentially, we just "recover" the document's contents from the index.)
        f_d = {}  # holds the term frequencies in the document
        for term in qry.keys(): 
            # TODO: get the frequency of the current query term from its posting list
            f_d[term] = 0
                    
        # ii) then, we score the document
        score = 0  # holds the document's retrieval score
        for term in qry.keys(): 
            # increment the document's score according to the given query term
            tfidf_d = 0  # TODO: compute the term's TFIDF score in the document
            score += tfidf_q[term] * tfidf_d
        # final document score, with the query and document normalizers        
        scores[doc_id] = score / (qnorm * dnorm[doc_id])
    return scores
                   

In [10]:
query = ["beijing", "duck", "recipe"]
scores = score_dt(query, idx)

In [11]:
for doc_id, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print("D" + str(doc_id + 1) + ":", round(score, 3))

D1: 0.0
D2: 0.0
D3: 0.0
D4: 0.0
D5: 0.0
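
The all-zero scores above are what the skeleton returns while its two TODOs are unfilled. One possible way to complete them is sketched below; it is not necessarily the intended solution. The snippet is self-contained, so it repeats the index contents, and the name `score_dt_sketch` and the zero-normalizer guard are assumptions added for illustration.

```python
import math
from collections import Counter

NUM_DOCS = 5
inv_idx = {
    "beijing": [(1, 1), (4, 1)],
    "dish": [(1, 1), (4, 1)],
    "duck": [(0, 3), (1, 2), (2, 2), (4, 1)],
    "rabbit": [(2, 1), (3, 1)],
    "recipe": [(2, 1), (3, 1), (4, 1)],
    "roast": [],
}


def idf(term):
    df = len(inv_idx.get(term, []))
    return math.log(NUM_DOCS / df) if df > 0 else 0


# document lengths and document normalizers, precomputed once
doclen, dnorm = {}, {}
for postings in inv_idx.values():
    for doc_id, freq in postings:
        doclen[doc_id] = doclen.get(doc_id, 0) + freq
for term, postings in inv_idx.items():
    for doc_id, freq in postings:
        tfidf = freq / doclen[doc_id] * idf(term)
        dnorm[doc_id] = dnorm.get(doc_id, 0) + tfidf ** 2
dnorm = {d: math.sqrt(v) for d, v in dnorm.items()}


def score_dt_sketch(query):
    qry = Counter(query)
    tfidf_q = {t: f / len(query) * idf(t) for t, f in qry.items()}
    qnorm = math.sqrt(sum(w ** 2 for w in tfidf_q.values()))
    # copy the posting lists, since postings are removed while scoring
    plists = {t: list(inv_idx.get(t, [])) for t in qry}
    scores = {}
    for doc_id in range(NUM_DOCS):
        # i) read the frequency of each query term from the head of its list
        f_d = {}
        for term, plist in plists.items():
            if plist and plist[0][0] == doc_id:
                f_d[term] = plist.pop(0)[1]  # posting consumed
            else:
                f_d[term] = 0
        # ii) accumulate the unnormalized dot product
        score = 0
        for term in qry:
            tfidf_d = f_d[term] / doclen[doc_id] * idf(term) if f_d[term] else 0
            score += tfidf_q[term] * tfidf_d
        # assumption: a zero query or document normalizer yields a score of 0
        denom = qnorm * dnorm.get(doc_id, 0)
        scores[doc_id] = score / denom if denom > 0 else 0.0
    return scores


scores = score_dt_sketch(["beijing", "duck", "recipe"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

For the query `["beijing", "duck", "recipe"]`, this sketch ranks D5 first, followed by D2, D3, D4 and D1 (using the 1-based document labels of the printout above).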
