# Assignment 1 - Part 2

This notebook scores documents using the Mixture of Language Models (MLM) approach, with two document fields: title and content.

In [1]:
from elasticsearch import Elasticsearch
import math

In [2]:
INDEX_NAME = "aquaint"
DOC_TYPE = "doc"

In [3]:
QUERY_FILE = "data/queries.txt"

In [4]:
OUTPUT_FILE = "data/mlm_default.txt"  # output the ranking

Document fields used for scoring.

In [5]:
FIELDS = ["title", "content"]

Field weights. You'll need to set these properly in Part 3 of the assignment. For now, you can use these values.

In [6]:
FIELD_WEIGHTS = [0.2, 0.8]

Smoothing: we use Jelinek-Mercer smoothing with the following lambda parameter, which is the weight given to the collection language model. The same value is used for all fields.

In [7]:
LAMBDA = 0.1

## Load the queries from the file

See the assignment description for the format of the query file [here](https://github.com/kbalog/uis-dat630-fall2017/tree/master/assignment-1#queries).

In [8]:
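def load_queries(query_file):
    """Load queries from a file with one '<qid> <query>' pair per line."""
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:  # skip blank lines, which would break the split below
                continue
            qid, query = line.split(" ", 1)
            queries[qid] = query
    return queries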

## Query analyzer

See [indices.analyze](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.analyze).

In [9]:
def analyze_query(es, query):
    tokens = es.indices.analyze(index=INDEX_NAME, body={"text": query})["tokens"]
    query_terms = []
    for t in sorted(tokens, key=lambda x: x["position"]):
        query_terms.append(t["token"])
    return query_terms
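
To see what the function does without a running Elasticsearch instance, the sketch below applies the same sort-and-extract step to a hand-built dictionary. The dictionary is a hypothetical stand-in that contains only the keys the function reads (`tokens`, `token`, `position`); the stemmed token strings are made up, and the real output depends on the analyzer configured for the index.

```python
# Hand-built stand-in for the analyze response (an assumption, not real output);
# only the keys read by analyze_query are included.
mock_response = {
    "tokens": [
        {"token": "achiev", "position": 2},
        {"token": "hubbl", "position": 0},
        {"token": "telescop", "position": 1},
    ]
}

# Order tokens by their position in the query, then keep just the token text
ordered = sorted(mock_response["tokens"], key=lambda x: x["position"])
query_terms = [t["token"] for t in ordered]
print(query_terms)
```

Sorting by position keeps the terms in query order even if the response lists them differently.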

## MLM scorer

Documents should be scored according to **query (log)likelihood**: 

$\log P(q|d) = \sum_{t \in q} f_{t,q} \log P(t|\theta_d)$, 

where
  - $f_{t,q}$ is the frequency of term $t$ in the query
  - $P(t|\theta_d)$ is the (smoothed) document language model.
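
The sum can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `query_log_likelihood` and `toy_lm` are hypothetical names, and the probabilities are made up rather than taken from the collection.

```python
import math
from collections import Counter

def query_log_likelihood(qterms, doc_lm):
    """Sum f_{t,q} * log P(t|theta_d) over the distinct query terms."""
    return sum(f * math.log(doc_lm[t]) for t, f in Counter(qterms).items())

# Toy smoothed document language model (made-up probabilities)
toy_lm = {"bear": 0.02, "attack": 0.005}

# "bear" occurs twice in the query, so its log-probability is counted twice
ll = query_log_likelihood(["bear", "bear", "attack"], toy_lm)
print(ll)
```

Iterating over the query term list directly, with repeated terms visited once per occurrence, gives the same sum without an explicit frequency count.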
  
Using multiple document fields, the **document language model** is taken to be a linear combination of the (smoothed) field language models:

$P(t|\theta_d) = \sum_i w_i P(t|\theta_{d_i})$ ,

where $w_i$ is the field weight for field $i$ (and $\sum_i w_i = 1$).
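
As a small worked example of the mixture, the sketch below combines two field probabilities with the weights 0.2 (title) and 0.8 (content). The helper `mix_fields` and the field probabilities are illustrative assumptions.

```python
def mix_fields(field_probs, weights):
    """Linear combination of per-field probabilities: sum_i w_i * P(t|theta_{d_i})."""
    # The weights are expected to sum to 1
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, field_probs))

# Made-up smoothed field probabilities for one term: title, then content
p_mix = mix_fields([0.10, 0.02], [0.2, 0.8])
print(p_mix)  # 0.2 * 0.10 + 0.8 * 0.02
```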

The **field language models** $P(t|\theta_{d_i})$ are computed as follows.

Using **Jelinek-Mercer smoothing**:

$P(t|\theta_{d_i}) = (1-\lambda_i) P(t|d_i) + \lambda_i P(t|C_i)$,

where 

  - $\lambda_i$ is a field-specific smoothing parameter
  - $P(t|d_i) = \frac{f_{t,d_i}}{|d_i|}$ is the empirical field language model (the term's relative frequency in the document field). $f_{t,d_i}$ is the raw frequency of $t$ in field $i$ of $d$. $|d_i|$ is the length (number of terms) of field $i$ of $d$.
  - $P(t|C_i) = \frac{\sum_{d'}f_{t,d'_i}}{\sum_{d'}|d'_i|}$ is the collection field language model (the term's relative frequency in that field across the entire collection)
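
The Jelinek-Mercer formula translates directly into code. The following is a minimal sketch with made-up counts; the function name `jm_smooth` and its treatment of an empty field (empirical probability 0) are assumptions for illustration.

```python
def jm_smooth(tf, field_len, p_coll, lam):
    """(1 - lambda) * P(t|d_i) + lambda * P(t|C_i)."""
    # An empty field contributes no empirical evidence
    p_ml = tf / field_len if field_len > 0 else 0.0
    return (1 - lam) * p_ml + lam * p_coll

# Term seen 2 times in a 10-term field, with collection probability 0.001
p_seen = jm_smooth(tf=2, field_len=10, p_coll=0.001, lam=0.1)

# Term absent from the field: only the collection model contributes
p_unseen = jm_smooth(tf=0, field_len=10, p_coll=0.001, lam=0.1)
print(p_seen, p_unseen)
```

The second call shows why smoothing matters: without the collection component, a single missing query term would drive the document probability to zero.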
  
Using **Dirichlet smoothing**:

$P(t|\theta_{d_i}) = \frac{f_{t,d_i} + \mu_i P(t|C_i)}{|d_i| + \mu_i}$,

where $\mu_i$ is the field-specific smoothing parameter.
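
The scorer in this notebook uses Jelinek-Mercer smoothing only, so the Dirichlet variant is sketched here for comparison. The function name `dirichlet_smooth`, the value `mu=100`, and the counts are illustrative assumptions, not settings from the assignment.

```python
def dirichlet_smooth(tf, field_len, p_coll, mu):
    """(f_{t,d_i} + mu * P(t|C_i)) / (|d_i| + mu)."""
    return (tf + mu * p_coll) / (field_len + mu)

# Term seen 2 times in a 10-term field
p_dir = dirichlet_smooth(tf=2, field_len=10, p_coll=0.001, mu=100)

# Empty field: the estimate falls back to the collection probability
p_empty = dirichlet_smooth(tf=0, field_len=0, p_coll=0.001, mu=100)
print(p_dir, p_empty)
```

Unlike Jelinek-Mercer, the amount of smoothing here depends on the field length: short fields lean more heavily on the collection model.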

### Collection Language Model class

This class is used for obtaining collection language modeling probabilities $P(t|C_i)$.

The reason this class is needed is that `es.termvectors` does not return term statistics for terms that do not appear in the given document. This would cause problems in scoring documents that are partial matches (do not contain all query terms in all fields). 

The idea is that for each query term, we need to find a document that contains that term. Then the collection term statistics are available from that document's term vector. To make sure we find a matching document, we issue a [boolean (match)](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-match-query.html) query.
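
The statistics lookup can be illustrated without a live index. The sketch below reads the collection probability from a hand-built dictionary that mirrors only the keys used in this notebook (`terms`, `ttf`, `field_statistics`, `sum_ttf`); the dictionary, the helper `collection_prob`, and all numbers are hypothetical.

```python
# Hand-built stand-in for one field of a term vector requested with
# term_statistics=True (an assumption, not real Elasticsearch output)
mock_tv = {
    "field_statistics": {"sum_ttf": 1000000},
    "terms": {"hubbl": {"term_freq": 3, "ttf": 250}},
}

def collection_prob(tv, term):
    """P(t|C_i) = total term count in the field / total field length in the collection."""
    ttf = tv["terms"].get(term, {}).get("ttf", 0)
    return ttf / tv["field_statistics"]["sum_ttf"]

p_c = collection_prob(mock_tv, "hubbl")        # term present in this document
p_missing = collection_prob(mock_tv, "zebra")  # term absent from this document
print(p_c, p_missing)
```

The second lookup returns 0 even if the term occurs elsewhere in the collection, which is why a document containing the term has to be found first.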

In [10]:
class CollectionLM(object):
    def __init__(self, es, qterms):
        self._es = es
        self._probs = {}
        # computing P(t|C_i) for each field and for each query term
        for field in FIELDS:
            self._probs[field] = {}
            for t in qterms:
                self._probs[field][t] = self.__get_prob(field, t)
        
    def __get_prob(self, field, term):
        # use a boolean query to find a document that contains the term
        hits = self._es.search(index=INDEX_NAME, body={"query": {"match": {field: term}}},
                               _source=False, size=1).get("hits", {}).get("hits", {})
        doc_id = hits[0]["_id"] if len(hits) > 0 else None
        if doc_id is not None:
            # ask for global term statistics when requesting the term vector of that doc (`term_statistics=True`)
            tv = self._es.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc_id, fields=field,
                                      term_statistics=True)["term_vectors"][field]
            ttf = tv["terms"].get(term, {}).get("ttf", 0)  # total term count in the collection (in that field)
            sum_ttf = tv["field_statistics"]["sum_ttf"]
            return ttf / sum_ttf

        return 0  # this only happens if none of the documents contain that term

    def prob(self, field, term):
        return self._probs.get(field, {}).get(term, 0)

### Document scorer

In [11]:
def score_mlm(es, clm, qterms, doc_id):
    score = 0  # log P(q|d)
    
    # Getting term frequency statistics for the given document field from Elasticsearch
    # Note that global term statistics are not needed (`term_statistics=False`)
    tv = es.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc_id, fields=FIELDS,
                              term_statistics=False).get("term_vectors", {})

    # compute field lengths $|d_i|$
    len_d_i = []  # document field length
    for i, field in enumerate(FIELDS):
        if field in tv: 
            len_d_i.append(sum([s["term_freq"] for t, s in tv[field]["terms"].items()]))
        else:  # that document field may be empty
            len_d_i.append(0)
        
    # scoring the query
    for t in qterms:
        Pt_theta_d = 0  # P(t|\theta_d)
        for i, field in enumerate(FIELDS):
            if field in tv:
                Pt_di = tv[field]["terms"].get(t, {}).get("term_freq", 0) / len_d_i[i]  # $P(t|d_i)$
            else:  # that document field is empty
                Pt_di = 0
            Pt_Ci = clm.prob(field, t)  # $P(t|C_i)$
            Pt_theta_di = (1 - LAMBDA) * Pt_di + LAMBDA * Pt_Ci  # $P(t|\theta_{d_i})$ with J-M smoothing
            Pt_theta_d += FIELD_WEIGHTS[i] * Pt_theta_di
        if Pt_theta_d > 0:  # skip terms unseen in every field of the collection; log(0) is undefined
            score += math.log(Pt_theta_d)
    
    return score

## Main

In [12]:
es = Elasticsearch()

In [13]:
queries = load_queries(QUERY_FILE)

In [14]:
with open(OUTPUT_FILE, "w") as fout:
    # write header
    fout.write("QueryId,DocumentId\n")
    for qid, query in queries.items():
        # get top 200 docs using BM25
        print("Get baseline ranking for [%s] '%s'" % (qid, query))
        res = es.search(index=INDEX_NAME, q=query, df="content", _source=False, size=200).get('hits', {})
        
        # re-score docs using MLM
        print("Re-scoring documents using MLM")
        # get analyzed query
        qterms = analyze_query(es, query)
        # get collection LM 
        # (this needs to be instantiated only once per query and can be used for scoring all documents)
        clm = CollectionLM(es, qterms)        
        scores = {}
        for doc in res.get("hits", {}):
            doc_id = doc.get("_id")
            scores[doc_id] = score_mlm(es, clm, qterms, doc_id)

        # write top 100 results to file
        for doc_id, score in sorted(scores.items(), key=lambda x: x[1], reverse=True)[:100]:            
            fout.write(qid + "," + doc_id + "\n")

Get baseline ranking for [303] 'Hubble Telescope Achievements'
Re-scoring documents using MLM
Get baseline ranking for [307] 'New Hydroelectric Projects'
Re-scoring documents using MLM
Get baseline ranking for [310] 'Radio Waves and Brain Cancer'
Re-scoring documents using MLM
Get baseline ranking for [314] 'Marine Vegetation'
Re-scoring documents using MLM
Get baseline ranking for [322] 'International Art Crime'
Re-scoring documents using MLM
Get baseline ranking for [325] 'Cult Lifestyles'
Re-scoring documents using MLM
Get baseline ranking for [330] 'Iran-Iraq Cooperation'
Re-scoring documents using MLM
Get baseline ranking for [336] 'Black Bear Attacks'
Re-scoring documents using MLM
Get baseline ranking for [341] 'Airport Security'
Re-scoring documents using MLM
Get baseline ranking for [344] 'Abuses of E-Mail'
Re-scoring documents using MLM
Get baseline ranking for [345] 'Overseas Tobacco Sales'
Re-scoring documents using MLM
Get baseline ranking for [347] 'Wildlife Extinction'
R