# MLM ranking

## Scoring a given document

The Mixture of Language Models (MLM) approach scores a document by the query log-likelihood $\log P(q|d) = \sum_{i=1}^{|q|} \log P(q_i|\theta_d)$, where $q_i$ is the $i$-th query term.

The document language model is a mixture of field-level language models (substitute $t$ with $q_i$):

$P(t|\theta_d) = \sum_i \mu_i P(t|\theta_{d_i})$

where the field weights $\mu_i$ are given in `FIELD_WEIGHT`.

Each field-level language model is the empirical field LM smoothed with a field background model, using Jelinek-Mercer smoothing:

$P(t|\theta_{d_i}) = (1-\lambda) P(t|d_i) + \lambda P(t|C_i)$

where
  - the empirical field LM $P(t|d_i)$ is the relative frequency of $t$ in the field
  - the background field LM $P(t|C_i)$ can be computed using the provided `BackgroundLM` class (but it needs to be instantiated for each field)
  - the same smoothing parameter $\lambda$ is used for all fields (you could also use field-specific smoothing parameters $\lambda_i$)
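
The equations above can be checked by hand on a toy document. The following is a minimal sketch, not the notebook's implementation: the field names, weights, term counts, and background probabilities are all made-up values, and plain dictionaries stand in for the provided `BackgroundLM` class.

```python
import math

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
FIELD_WEIGHT = {"title": 0.2, "content": 0.8}  # mixture weights mu_i
LAMBDA = 0.1                                   # smoothing parameter lambda

# Term frequencies of a single toy document, per field.
doc_tf = {
    "title": {"language": 1, "model": 1},
    "content": {"language": 2, "model": 1, "retrieval": 1},
}
# Stand-in background probabilities P(t|C_i), per field.
bg_prob = {
    "title": {"language": 0.05, "model": 0.05, "retrieval": 0.02},
    "content": {"language": 0.03, "model": 0.02, "retrieval": 0.01},
}

def field_lm(term, field):
    """Smoothed field LM: (1 - lambda) * P(t|d_i) + lambda * P(t|C_i)."""
    length = sum(doc_tf[field].values())
    p_emp = doc_tf[field].get(term, 0) / length if length > 0 else 0.0
    p_bg = bg_prob[field].get(term, 0.0)
    return (1 - LAMBDA) * p_emp + LAMBDA * p_bg

def doc_lm(term):
    """Document LM: weighted mixture of the smoothed field LMs."""
    return sum(FIELD_WEIGHT[f] * field_lm(term, f) for f in FIELD_WEIGHT)

def log_query_likelihood(qterms):
    """log P(q|d): sum of log document-LM probabilities over the query terms."""
    return sum(math.log(doc_lm(t)) for t in qterms)

p_retrieval = doc_lm("retrieval")
```

Here "retrieval" does not occur in the title, so the title field contributes only its smoothed background mass ($0.1 \cdot 0.02 = 0.002$), while the content field contributes $0.9 \cdot 0.25 + 0.1 \cdot 0.01 = 0.226$. The mixture is $0.2 \cdot 0.002 + 0.8 \cdot 0.226 = 0.1812$.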

In [1]:
import math

def score_doc(doc_id, qterms):
    # Get document term vectors (hold the term frequencies for each field)
    tv = es.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc_id, fields=FIELDS,
                        term_statistics=False).get('term_vectors', {})
    tf = {}  # tf[field][t] holds the frequency of term `t` in a given document field
    fieldlen = {}  # fieldlen[field] holds the length of the field in tokens
    for field in FIELDS:
        terms = tv.get(field, {}).get('terms', {})  # empty if the document lacks this field
        tf[field] = {t: stats['term_freq'] for t, stats in terms.items()}
        fieldlen[field] = sum(tf[field].values())

    score = 0  # holds log P(q|d)
    for term in qterms:  # this is the main summation over query terms
        ptd = 0  # compute P(t|\theta_d) as a mixture of field LMs
        for field in FIELDS:  # field corresponds to i in the above equations
            # empirical field LM P(t|d_i); 0 for an empty or missing field
            ptdi = tf[field].get(term, 0) / fieldlen[field] if fieldlen[field] > 0 else 0
            ptci = BG_LM[field].get_prob(term)  # background field LM P(t|C_i)
            pttdi = (1 - LAMBDA) * ptdi + LAMBDA * ptci  # P(t|\theta_{d_i}) with JM smoothing
            ptd += FIELD_WEIGHT[field] * pttdi
        if ptd > 0:  # log(0) is undefined; this choice skips terms unseen in every field collection
            score += math.log(ptd)
    return score
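
The `tf` and `fieldlen` dictionaries in `score_doc` are filled from the `term_vectors` part of the Elasticsearch response, which maps each field to a `terms` dictionary whose entries carry a `term_freq` count. The sketch below illustrates that extraction on a hand-written response, so it runs without a live index. The helper name `extract_term_freqs`, the field names, and the sample counts are assumptions made for illustration.

```python
def extract_term_freqs(term_vectors, fields):
    """Build tf[field][term] and fieldlen[field] from a term vectors response."""
    tf, fieldlen = {}, {}
    for field in fields:
        # A field that the document does not contain is absent from the response.
        terms = term_vectors.get(field, {}).get("terms", {})
        tf[field] = {t: stats["term_freq"] for t, stats in terms.items()}
        fieldlen[field] = sum(tf[field].values())
    return tf, fieldlen

# Hand-written stand-in for es.termvectors(...)["term_vectors"] (made-up values).
sample_tv = {
    "title": {"terms": {"rank": {"term_freq": 1}, "model": {"term_freq": 2}}},
    "content": {"terms": {"rank": {"term_freq": 3},
                          "field": {"term_freq": 1},
                          "model": {"term_freq": 4}}},
}

# "anchor" is requested but missing from the sample document.
tf, fieldlen = extract_term_freqs(sample_tv, ["title", "content", "anchor"])
```

A missing field ends up with an empty term dictionary and a length of 0, so the empirical field LM has to guard against division by zero for such fields.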

## Scoring all queries

In [None]:
from elasticsearch import Elasticsearch

es = Elasticsearch()

for query in queries:
    # Top 200 documents retrieved using default Elasticsearch method
    res = es.search(index=INDEX_NAME, q=query, df=CONTENT_FIELD, _source=False, size=200).get('hits', {})
    
    # Extract preprocessed query terms
    tokens = es.indices.analyze(index=INDEX_NAME, body=query)['tokens']
    qterms = [t['token'] for t in tokens]  # each analyzed token is a dict; keep only the term string
    
    # Re-rank documents using MLM
    scores = {}
    for doc in res.get("hits"):
        doc_id = doc.get("_id")
        scores[doc_id] = score_doc(doc_id, qterms)
    
    # Output top-100 docs with highest scores to file
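
The final step, writing the top-ranked documents to a file, is left open above. One possible sketch follows; it assumes a TREC-style run format (`query_id Q0 doc_id rank score run_id`), which the notebook does not specify. The helper name `write_top_k`, the query ID, and the run ID are hypothetical.

```python
import io

def write_top_k(out, query_id, scores, k=100, run_id="mlm"):
    """Write the k highest-scoring documents for one query, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        out.write(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_id}\n")

# Toy usage with an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_top_k(buffer, "q1", {"d1": -7.2, "d2": -3.5, "d3": -9.1}, k=2)
lines = buffer.getvalue().splitlines()
```

Because the scores are log-probabilities, they are negative, and the highest-ranked document is the one whose score is closest to zero.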