# Exercise #2: Neural document (re-)ranking

Re-rank the documents retrieved for the queries in `data/queries.txt` and evaluate them in terms of NDCG@10.

**NOTE**: These are the first 10 queries from Assignment 2B. For computational efficiency, we operate on the `title` field (instead of `content`) and re-rank the top-20 documents retrieved.

In [34]:
import urllib.parse  # `import urllib` alone does not guarantee that the `parse` submodule is loaded
import requests
import json
import math

In [23]:
QUERIES_FILE = "data/queries.txt"
QRELS_FILE = "data/qrels.csv"

API = "http://gustav1.ux.uis.no:5002"
MAIN_INDEX = "clueweb12b"

## Loading queries

In [8]:
def load_queries(query_file):
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin.readlines():
            qid, query = line.strip().split(" ", 1)
            queries[qid] = query
    return queries

In [9]:
queries = load_queries(QUERIES_FILE)

## Loading relevance judgments

In [36]:
def load_qrels(qrels_file):
    qrels = {}  # maps each query ID to a {doc_id: relevance level} dict
    with open(qrels_file, "r") as fin:
        header = fin.readline().strip()
        if header != "QueryId,DocumentId,Relevance":
            raise Exception("Incorrect file format!")
        for line in fin.readlines():
            qid, docid, rel = line.strip().split(",")
            if qid not in qrels:
                qrels[qid] = {}
            qrels[qid][docid] = int(rel)
    return qrels 

In [37]:
qrels = load_qrels(QRELS_FILE)

## Baseline retrieval

In [26]:
def search(indexname, query, field, size=10):
    url = "/".join([API, indexname, "_search"]) + "?" \
          + urllib.parse.urlencode({"q": query, "df": field, "size": size})
    response = requests.get(url).text
    return json.loads(response)

In [45]:
def term_vectors(indexname, doc_id, term_statistics=False):
    ret = {}    
    url = "/".join([API, indexname, doc_id, "_termvectors"]) + "?" \
          + urllib.parse.urlencode({"term_statistics": str(term_statistics).lower()})
    response = requests.get(url).text
    try:
        ret = json.loads(response)
    except ValueError:  # raised by json.loads on malformed JSON
        print("Failed to json-decode this response:\n{}".format(response))
    return ret

In [54]:
rankings = {}
for qid, query in queries.items():
    rankings[qid] = []
    res = search(MAIN_INDEX, query, "title", size=20)
    for doc in res.get("hits", {}).get("hits", []):  # inner "hits" is a list of documents
        rankings[qid].append(doc.get("_id"))

## Evaluation

In [57]:
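def dcg(rel, p):
    if not rel:
        return 0.0  # an empty ranking contributes no gain
    score = rel[0]  # the first rank is not discounted
    for i in range(1, min(p, len(rel))):
        score += rel[i] / math.log(i + 1, 2)  # i is 0-based, so the rank position is i + 1
    return score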

def get_ndcg_rankings(rankings, qrels, k=10):
    sum_ndcg = 0
    for qid, ranking in sorted(rankings.items()):
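        gt = qrels.get(qid, {})  # a query without judgments has no relevant documents
        # relevance levels of the ranked docs (0 for unjudged documents)
        gains = [gt.get(doc_id, 0) for doc_id in ranking[:k]]

        # relevance levels of the idealized ranking
        gain_ideal = sorted(gt.values(), reverse=True)

        # NDCG is 0 when nothing was retrieved or the ideal DCG is 0
        ideal_dcg = dcg(gain_ideal, k) if gain_ideal else 0
        ndcg = dcg(gains, k) / ideal_dcg if gains and ideal_dcg > 0 else 0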
        sum_ndcg += ndcg

    return sum_ndcg / len(rankings)

In [58]:
print(get_ndcg_rankings(rankings, qrels))

0.06814317459580331
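To make the metric concrete, here is a small worked example on made-up relevance levels. It is a minimal sketch that mirrors the DCG variant used above (first rank undiscounted, rank $i \geq 2$ discounted by $\log_2 i$); the helper name `dcg_at_k` and the toy gains are hypothetical and not part of the assignment data. It also assumes that every judged document appears in the toy ranking, so the ideal ordering is just the sorted gains.

```python
import math

def dcg_at_k(gains, k):
    """DCG with rank 1 undiscounted and rank i >= 2 discounted by log2(i)."""
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total

# Hypothetical relevance levels of a ranked list (not real qrels).
gains = [0, 2, 1]
# Idealized ordering: highest relevance first.
ideal = sorted(gains, reverse=True)

dcg_value = dcg_at_k(gains, 3)   # 0 + 2 / log2(2) + 1 / log2(3)
idcg_value = dcg_at_k(ideal, 3)  # 2 + 1 / log2(2) + 0
ndcg_value = dcg_value / idcg_value if idcg_value > 0 else 0.0
print(round(ndcg_value, 4))
```

Placing the most relevant document at rank 2 instead of rank 1 is what pulls the score below 1 in this toy case.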


## Neural re-ranking

Re-rank the top-20 documents retrieved by the baseline by calculating the cosine similarity between the centroids of the document and query word embeddings:

  1. Represent both the query and the document as the centroid of their term embedding vectors. Compute $\vec{q}$ by setting the $i$th vector dimension as $$\vec{q}[i]=\frac{\sum_{j=1}^m \vec{q_j}[i]}{m}$$ where $\vec{q_j}$ is the (pre-trained) embedding vector of the $j$th query term and $m$ is the length of the query. The computation of $\vec{d}$ follows analogously.
  2. Then, score documents by computing the cosine similarity between $\vec{q}$ and $\vec{d}$.
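The two steps above can be sketched on toy data as follows. This is an illustrative sketch only, not the reference solution: `toy_embeddings`, `centroid`, and `cosine` are hypothetical names, and the 3-dimensional vectors stand in for the pre-trained embeddings. One assumption to note is that terms without an embedding are skipped, so $m$ effectively becomes the number of in-vocabulary terms rather than the full query length.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a pre-trained model.
toy_embeddings = {
    "neural": np.array([1.0, 0.0, 1.0]),
    "ranking": np.array([0.0, 1.0, 1.0]),
    "document": np.array([1.0, 1.0, 0.0]),
}

def centroid(terms, embeddings, dim):
    """Average the embedding vectors of the terms that have one."""
    vectors = [embeddings[t] for t in terms if t in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known terms: fall back to the zero vector
    return np.mean(vectors, axis=0)

def cosine(vec_a, vec_b):
    """Cosine similarity, defined as 0 when either vector has zero norm."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(np.dot(vec_a, vec_b) / denom) if denom > 0 else 0.0

# Step 1: centroids of the query terms and of the document (title) terms.
q_vec = centroid(["neural", "ranking"], toy_embeddings, dim=3)
d_vec = centroid(["document", "ranking", "unknownterm"], toy_embeddings, dim=3)

# Step 2: score the document by cosine similarity.
score = cosine(q_vec, d_vec)
print(q_vec, d_vec, score)
```

With real embeddings, the dictionary lookup would be replaced by a lookup in the loaded model, and the same out-of-vocabulary check would still be needed.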

### Loading pre-trained embeddings

In [17]:
import gensim.downloader as api

In [18]:
# Downloads the vectors on first use; the returned object maps words to vectors
model = api.load('word2vec-google-news-300')



In [19]:
# Example: getting the embedding vector of a given word
# (index the loaded vectors directly; the `.wv` attribute is deprecated on this object)
print(model['man'])

[ 0.32617188  0.13085938  0.03466797 -0.08300781  0.08984375 -0.04125977
 -0.19824219  0.00689697  0.14355469  0.0019455   0.02880859 -0.25
 -0.08398438 -0.15136719 -0.10205078  0.04077148 -0.09765625  0.05932617
  0.02978516 -0.10058594 -0.13085938  0.001297    0.02612305 -0.27148438
  0.06396484 -0.19140625 -0.078125    0.25976562  0.375      -0.04541016
  0.16210938  0.13671875 -0.06396484 -0.02062988 -0.09667969  0.25390625
  0.24804688 -0.12695312  0.07177734  0.3203125   0.03149414 -0.03857422
  0.21191406 -0.00811768  0.22265625 -0.13476562 -0.07617188  0.01049805
 -0.05175781  0.03808594 -0.13378906  0.125       0.0559082  -0.18261719
  0.08154297 -0.08447266 -0.07763672 -0.04345703  0.08105469 -0.01092529
  0.17480469  0.30664062 -0.04321289 -0.01416016  0.09082031 -0.00927734
 -0.03442383 -0.11523438  0.12451172 -0.0246582   0.08544922  0.14355469
 -0.27734375  0.03662109 -0.11035156  0.13085938 -0.01721191 -0.08056641
 -0.00708008 -0.02954102  0.30078125 -0.09033203  0.03149


### Neural re-ranking

In [51]:
import numpy as np

In [52]:
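def cosine_sim(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norma = np.linalg.norm(vec_a)
    normb = np.linalg.norm(vec_b)
    if norma == 0 or normb == 0:
        return 0.0  # cosine is undefined for a zero vector; treat it as no similarity
    return dot / (norma * normb)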

In [59]:
rankings_nn = {}
for qid, docs in rankings.items():
    print("Reranking for query {}".format(qid))
    # TODO: Compute the query embedding vector as the centroid of the query terms' embedding vectors
    query_vec = [1, 1, 0, 0]  # Note: these are just dummy values
    scores = {}
    for docid in docs:
        print(docid)
        # TODO: Compute the doc embedding vector as the centroid of embedding vectors of the terms in the doc's title
        doc_vec = [1, 0, 0, 0]  # Note: these are just dummy values
        # iterate through the terms in the title field
        tv = term_vectors(MAIN_INDEX, docid, term_statistics=True)['term_vectors']['title']
        for term in tv['terms']:
            # TODO: get the embedding vector of `term` and update `doc_vec`
            pass
        scores[docid] = cosine_sim(query_vec, doc_vec)

    # sort documents by score
    rankings_nn[qid] = []
    for docid in sorted(scores, key=scores.get, reverse=True):
        rankings_nn[qid].append(docid)

Reranking for query 201
clueweb12-0508wb-36-14116
clueweb12-1812wb-36-11474
clueweb12-0906wb-09-33744
clueweb12-0906wb-96-33932
clueweb12-0906wb-67-25261
clueweb12-0902wb-72-11855
clueweb12-1205wb-78-13462
clueweb12-0500tw-17-18276
clueweb12-1205wb-35-08540
clueweb12-1102wb-91-12621
clueweb12-0100tw-52-01034
clueweb12-0307wb-47-02869
clueweb12-1111wb-41-15778
clueweb12-0909wb-35-26187
clueweb12-1100tw-55-13200
clueweb12-0200tw-42-04809
clueweb12-1201tw-23-04915
clueweb12-1604wb-20-11054
clueweb12-0908wb-09-14789
clueweb12-1716wb-66-00027
Reranking for query 202
clueweb12-1705wb-22-13047
clueweb12-1116wb-12-30914
clueweb12-1707wb-98-07904
clueweb12-0002wb-14-02885
clueweb12-0305wb-35-13201
clueweb12-1013wb-15-21838
clueweb12-1116wb-59-18964
clueweb12-0203wb-91-06006
clueweb12-0601wb-18-11856
clueweb12-0610wb-53-03244
clueweb12-1900tw-53-20555
clueweb12-1212wb-02-13082
clueweb12-0814wb-20-13703
clueweb12-1215wb-33-08369
clueweb12-0503wb-31-31475
clueweb12-0503wb-68-11823
clueweb12-1311wb

## Evaluate your ranking

In [60]:
print(get_ndcg_rankings(rankings_nn, qrels))

0.06814317459580331


## (Optional) Combining rankings

Instead of using the cosine similarity alone for ranking, you may combine it linearly with the original (BM25) retrieval score.
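One possible way to do this is a linear interpolation of the two scores after min-max normalization, sketched below on made-up numbers. The function names `min_max` and `combine`, the weight `lam`, and the toy scores are all hypothetical. The sketch also assumes that the BM25 score of each document has been kept from the baseline run (in Elasticsearch-style responses each hit carries it in the `_score` field), which the baseline loop above does not currently store.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}  # all scores equal: no signal
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def combine(bm25_scores, cos_scores, lam=0.5):
    """Score = lam * cosine + (1 - lam) * BM25, both min-max normalized."""
    bm25_n, cos_n = min_max(bm25_scores), min_max(cos_scores)
    return {
        doc_id: lam * cos_n.get(doc_id, 0.0) + (1 - lam) * bm25_n[doc_id]
        for doc_id in bm25_n
    }

# Made-up scores for three documents of a single query.
bm25 = {"d1": 12.0, "d2": 8.0, "d3": 4.0}
cos = {"d1": 0.2, "d2": 0.9, "d3": 0.4}

combined = combine(bm25, cos, lam=0.5)
reranked = sorted(combined, key=combined.get, reverse=True)
print(reranked)
```

Normalizing first keeps the unbounded BM25 scores from dominating the cosine values, which lie in $[-1, 1]$; the weight `lam` would normally be tuned on held-out queries.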

In [43]:
# TODO