# Exercise #1: PRMS for entity ranking

## Build a local Elasticsearch index of a set of selected movies

Build a fielded Elasticsearch index. Fields should include title, description, categories, directors, actors.
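
One possible shape for the index definition is sketched below. The `build_index_body` helper and the `MOVIE_FIELDS` name are hypothetical, and storing term vectors through the `term_vector` mapping option is an assumption made so that per-field term statistics are readily available later. Analyzer and shard settings are left at their defaults.

```python
# Field names taken from the exercise text above.
MOVIE_FIELDS = ["title", "description", "categories", "directors", "actors"]


def build_index_body(fields):
    """Return an index body with one analyzed text field per entry in `fields`.

    Hypothetical helper: every field is a `text` field that stores term vectors.
    """
    properties = {
        field: {"type": "text", "term_vector": "yes"}
        for field in fields
    }
    return {"mappings": {"properties": properties}}


index_body = build_index_body(MOVIE_FIELDS)
```

With a running cluster, this dictionary could be passed to `es.indices.create(index=INDEX_NAME, body=index_body)`, assuming Elasticsearch 7 or later, where mappings are typeless.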

In [6]:
INDEX_NAME = "movies"

In [1]:
# TODO

## MLM for ranking movies

Implement the mixture of language models (MLM) for ranking movies.

In [12]:
# TODO update field names and weights
FIELDS = ["title", "content"]
FIELD_WEIGHTS = [0.2, 0.8]
LAMBDA = 0.1

Documents should be scored according to **query (log)likelihood**: 

$\log P(q|d) = \sum_{t \in q} f_{t,q} \log P(t|\theta_d)$, 

where
  - $f_{t,q}$ is the frequency of term $t$ in the query
  - $P(t|\theta_d)$ is the (smoothed) document language model.
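
As a small illustration of the formula, the sketch below computes the query log-likelihood from an already smoothed document language model. The function name `query_log_likelihood` and the toy probabilities are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def query_log_likelihood(qterms, p_t_theta_d):
    """log P(q|d) = sum over distinct query terms of f_{t,q} * log P(t|theta_d)."""
    score = 0.0
    for t, f_tq in Counter(qterms).items():
        score += f_tq * math.log(p_t_theta_d[t])
    return score


# Toy smoothed document language model (made-up values).
toy_model = {"star": 0.02, "wars": 0.01}
score = query_log_likelihood(["star", "wars", "star"], toy_model)
```

Iterating over the query terms one by one, repeats included, yields the same sum.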
  
Using multiple document fields, the **document language model** is taken to be a linear combination of the (smoothed) field language models:

$P(t|\theta_d) = \sum_i w_i P(t|\theta_{d_i})$ ,

where $w_i$ is the field weight for field $i$ (and $\sum_i w_i = 1$).

The **field language models** $P(t|\theta_{d_i})$ are computed as follows.

Using **Jelinek-Mercer smoothing**:

$P(t|\theta_{d_i}) = (1-\lambda_i) P(t|d_i) + \lambda_i P(t|C_i)$,

where 

  - $\lambda_i$ is a field-specific smoothing parameter
  - $P(t|d_i) = \frac{f_{t,d_i}}{|d_i|}$ is the empirical field language model (the term's relative frequency in the document field). $f_{t,d_i}$ is the raw frequency of $t$ in field $i$ of $d$. $|d_i|$ is the length (number of terms) of field $i$ of $d$.
  - $P(t|C_i) = \frac{\sum_{d'}f_{t,d'_i}}{\sum_{d'}|d'_i|}$ is the collection field language model (the term's relative frequency in that field across the entire collection).
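
A worked sketch of Jelinek-Mercer smoothing combined with the field mixture, on made-up counts. The helper names (`jm_field_lm`, `mixture_lm`) and all numbers are assumptions for illustration, and a single $\lambda$ is shared by both fields for simplicity.

```python
def jm_field_lm(f_t_di, len_di, p_t_ci, lam):
    """P(t|theta_{d_i}) = (1 - lambda_i) * P(t|d_i) + lambda_i * P(t|C_i)."""
    p_t_di = f_t_di / len_di if len_di > 0 else 0.0  # empty field: P(t|d_i) = 0
    return (1 - lam) * p_t_di + lam * p_t_ci


def mixture_lm(field_probs, weights):
    """P(t|theta_d) = sum_i w_i * P(t|theta_{d_i})."""
    return sum(w * p for w, p in zip(weights, field_probs))


# The term occurs once in a 4-term title and 3 times in a 100-term description.
p_title = jm_field_lm(f_t_di=1, len_di=4, p_t_ci=0.01, lam=0.1)
p_desc = jm_field_lm(f_t_di=3, len_di=100, p_t_ci=0.002, lam=0.1)
p_doc = mixture_lm([p_title, p_desc], weights=[0.2, 0.8])
```

For an empty field the empirical estimate is zero, so the smoothed probability reduces to $\lambda_i P(t|C_i)$.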
  
Using **Dirichlet smoothing**:

$P(t|\theta_{d_i}) = \frac{f_{t,d_i} + \mu_i P(t|C_i)}{|d_i| + \mu_i}$,

where $\mu_i$ is the field-specific smoothing parameter.
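
The effect of Dirichlet smoothing depends on field length, which the following sketch illustrates with made-up counts. The function name `dirichlet_field_lm` and the value of `mu` are assumptions for this example only.

```python
def dirichlet_field_lm(f_t_di, len_di, p_t_ci, mu):
    """P(t|theta_{d_i}) = (f_{t,d_i} + mu_i * P(t|C_i)) / (|d_i| + mu_i)."""
    return (f_t_di + mu * p_t_ci) / (len_di + mu)


# Same relative frequency (0.25) and collection probability,
# once in a short field and once in a long field.
p_short = dirichlet_field_lm(f_t_di=1, len_di=4, p_t_ci=0.01, mu=100)
p_long = dirichlet_field_lm(f_t_di=25, len_di=100, p_t_ci=0.01, mu=100)
```

The short field is pulled much closer to the collection probability than the long one, and an empty field falls back to $P(t|C_i)$ entirely.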

#### Collection Language Model class

This class is used for obtaining the collection language model probabilities $P(t|C_i)$.

This class is needed because `es.termvectors` does not return term statistics for terms that do not appear in the given document. Without it, scoring would break for documents that are partial matches (those that do not contain all query terms in all fields).

The idea is that for each query term, we need to find a document that contains that term. Then the collection term statistics are available from that document's term vector. To make sure we find a matching document, we issue a boolean (match) query.
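
The lookup step can be sketched on a plain dictionary, without a running cluster. The function below assumes the layout of the `term_vectors` part of a response requested with `term_statistics=True` and field statistics enabled, where each term carries a total term frequency `ttf` and each field a `sum_ttf`. The name `collection_prob_from_tv` and the sample response are hypothetical.

```python
def collection_prob_from_tv(term_vectors, field, term):
    """Estimate P(t|C_i) as ttf / sum_ttf from one document's term vector.

    Assumes the layout returned when term vectors are requested with
    term_statistics=True and field_statistics=True.
    """
    field_tv = term_vectors.get(field, {})
    sum_ttf = field_tv.get("field_statistics", {}).get("sum_ttf", 0)
    ttf = field_tv.get("terms", {}).get(term, {}).get("ttf", 0)
    return ttf / sum_ttf if sum_ttf > 0 else 0.0


# Hand-written stand-in for the "term_vectors" part of a response.
sample_tv = {
    "title": {
        "field_statistics": {"sum_doc_freq": 900, "doc_count": 300, "sum_ttf": 1000},
        "terms": {"wars": {"doc_freq": 4, "ttf": 5, "term_freq": 1}},
    }
}
p_wars = collection_prob_from_tv(sample_tv, "title", "wars")
```

The term passed in has to be in its analyzed form, since that is how terms are keyed in the term vector.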

In [9]:
class CollectionLM(object):
    def __init__(self, es, qterms):
        self._es = es
        self._probs = {}
        # computing P(t|C_i) for each field and for each query term
        for field in FIELDS:
            self._probs[field] = {}
            for t in qterms:
                self._probs[field][t] = self._get_prob(field, t)
        
    def _get_prob(self, field, term):
        # Use a boolean query to find a document that contains the term
        hits = self._es.search(index=INDEX_NAME, body={"query": {"match": {field: term}}},
                               _source=False, size=1).get("hits", {}).get("hits", {})
        doc_id = hits[0]["_id"] if len(hits) > 0 else None
        if doc_id is not None:
            # Ask for global term statistics when requesting the term vector of that doc (`term_statistics=True`)
            # TODO: complete this part            
            return 0

        return 0  # this only happens if none of the documents contain that term

    def prob(self, field, term):
        return self._probs.get(field, {}).get(term, 0)

#### Document scorer

In [10]:
import math


def score_mlm(es, clm, qterms, doc_id):
    score = 0  # log P(q|d)
    
    # Getting term frequency statistics for the given document field from Elasticsearch
    # Note that global term statistics are not needed (`term_statistics=False`)
    tv = es.termvectors(index=INDEX_NAME, id=doc_id, fields=FIELDS,
                        term_statistics=False).get("term_vectors", {})

    # compute field lengths $|d_i|$
    len_d_i = []  # document field length
    for field in FIELDS:
        if field in tv:
            len_d_i.append(sum(s["term_freq"] for s in tv[field]["terms"].values()))
        else:  # that document field may be empty
            len_d_i.append(0)
        
    # scoring the query
    for t in qterms:
        Pt_theta_d = 0  # P(t|\theta_d)
        for i, field in enumerate(FIELDS):
            if field in tv:
                Pt_di = tv[field]["terms"].get(t, {}).get("term_freq", 0) / len_d_i[i]  # $P(t|d_i)$
            else:  # that document field is empty
                Pt_di = 0
            Pt_Ci = clm.prob(field, t)  # $P(t|C_i)$
            Pt_theta_di = (1 - LAMBDA) * Pt_di + LAMBDA * Pt_Ci  # $P(t|\theta_{d_i})$ with J-M smoothing
            Pt_theta_d += FIELD_WEIGHTS[i] * Pt_theta_di
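        # Skip terms with zero probability in every field (unseen in the collection); log(0) is undefined
        if Pt_theta_d > 0:
            score += math.log(Pt_theta_d)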
    
    return score

#### Scoring queries

Perform an initial retrieval using the default ranking in Elasticsearch, then re-score each document using `score_mlm()`.
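
Once every candidate has been re-scored, the final ranking is just a sort by score. A minimal sketch, assuming the scores are kept in a dictionary keyed by document ID; the helper name `top_k` and the toy scores are hypothetical.

```python
def top_k(scores, k=5):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up log-likelihood scores for three documents.
toy_scores = {"doc1": -12.4, "doc2": -9.8, "doc3": -15.1}
ranking = top_k(toy_scores, k=2)
```

Log-likelihoods are negative, so the best document is the one whose score is closest to zero.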

In [13]:
query = "TODO"

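In [15]:
from elasticsearch import Elasticsearch

# Connects to a local node; newer client versions may require an explicit URL,
# e.g. Elasticsearch("http://localhost:9200")
es = Elasticsearch()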

In [14]:
# get top 200 docs using BM25
res = es.search(index=INDEX_NAME, q=query, df="content", _source=False, size=200).get('hits', {})

# re-score docs using MLM

# TODO: get analyzed query
qterms = []

# get collection LM 
# (this needs to be instantiated only once per query and can be used for scoring all documents)
clm = CollectionLM(es, qterms)        
scores = {}
for doc in res.get("hits", {}):
    doc_id = doc.get("_id")
    scores[doc_id] = score_mlm(es, clm, qterms, doc_id)

# TODO output top 5 documents

## PRMS

Implement field-specific term weighting using the Probabilistic Retrieval Model for Semistructured Data (PRMS).
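
In PRMS as commonly formulated, the fixed field weights $w_i$ are replaced by term-specific mapping probabilities $P(f_i|t)$, obtained from the collection field language models via Bayes' rule: $P(f_i|t) = \frac{P(t|C_i) P(f_i)}{\sum_j P(t|C_j) P(f_j)}$. A minimal sketch of that step follows; the helper name `prms_field_weights`, the uniform field prior, and the toy probabilities are assumptions for illustration.

```python
def prms_field_weights(p_t_c, priors=None):
    """Map {field: P(t|C_i)} to {field: P(f_i|t)} via Bayes' rule.

    `priors` maps each field to P(f_i); a uniform prior is assumed when omitted.
    """
    fields = list(p_t_c)
    if priors is None:
        priors = {f: 1 / len(fields) for f in fields}
    joint = {f: p_t_c[f] * priors[f] for f in fields}
    total = sum(joint.values())
    if total == 0:  # term unseen in every field: fall back to the prior
        return dict(priors)
    return {f: joint[f] / total for f in fields}


# Made-up collection probabilities for one term in two fields.
weights = prms_field_weights({"title": 0.003, "actors": 0.001})
```

The resulting weights sum to one for each term, so they can take the place of `FIELD_WEIGHTS` inside the per-term loop of the scorer.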

In [4]:
# TODO