# Exercise #1: PRMS for entity ranking

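You are provided with a Mixture of Language Models (MLM) ranker for movies. Your task is to implement the Probabilistic Retrieval Model for Semistructured data (PRMS), which replaces the fixed field weights of MLM with dynamic, term-specific field weights.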

In [33]:
import requests
import urllib.parse
import wikipediaapi
from elasticsearch import Elasticsearch

In [2]:
INDEX_NAME = "movies"

## Build a local Elasticsearch index of a set of selected movies

Build a fielded Elasticsearch index of movies, with title, description, categories, directors, and actors fields.

In [16]:
wiki_wiki = wikipediaapi.Wikipedia("en")

We collect some movie titles from a few specific studios. Note that not all pages will be actual movies, but we'll filter those out later at indexing time.

In [10]:
def get_category_pages(cat_name):
    pages = []
    cat = wiki_wiki.page(cat_name)
    for c in cat.categorymembers.values():
        if c.ns != wikipediaapi.Namespace.CATEGORY:
            pages.append(c.title)
    return pages

In [11]:
categories = ["Category:20th_Century_Fox_films", "Category:Warner_Bros._films", "Category:Metro-Goldwyn-Mayer_films"]
pages = []

for cat_name in categories:
    pages += get_category_pages(cat_name)

print(pages[:10])

['List of 20th Century Fox films (1935–99)', 'List of 20th Century Fox films (2000–present)', 'The 3rd Voice', '4 Clowns', '5 Fingers', '9 to 5 (film)', '12 Rounds (film)', '13 Fighting Men', '13 Lead Soldiers', 'The 13th Letter']


Fetching movie details from DBpedia and indexing.

In [18]:
es = Elasticsearch()

if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
    
es.indices.create(index=INDEX_NAME)

{'acknowledged': True, 'index': 'movies', 'shards_acknowledged': True}

Creating a fielded document representation for a movie. Some predicates are single-valued while others are multi-valued.

In [28]:
TYPE_PREDICATE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TITLE_PREDICATE = "http://xmlns.com/foaf/0.1/name"
DESCRIPTION_PREDICATE = "http://www.w3.org/2000/01/rdf-schema#comment"
CATEGORIES_PREDICATE = "http://purl.org/dc/terms/subject"
DIRECTORS_PREDICATE = "http://dbpedia.org/ontology/director"
ACTORS_PREDICATE = "http://dbpedia.org/ontology/starring"

In [27]:
def has_type(properties, target_type):
    if TYPE_PREDICATE not in properties:
        return False
    for p in properties[TYPE_PREDICATE]:
        if p['value'] == target_type:
            return True
    return False

In [22]:
def resolve_uri(uri):
    return uri.split("/")[-1].replace("_", " ")

In [24]:
def get_predicate_value(properties, predicate, multi_valued=False, transformation=None):
    if predicate not in properties:
        return ""
    # Single-valued predicates use only the first value
    limit = len(properties[predicate]) if multi_valued else 1
    value = ""
    for i in range(limit):
        if i > 0:
            value += " "
        v = properties[predicate][i]
        # Literals are used as-is; URIs are reduced to a readable name
        v_str = str(v['value']) if v['type'] == "literal" else resolve_uri(v['value'])
        if transformation == "categories":
            v_str = v_str[9:]  # strip the leading "Category:" prefix
        value += v_str
    return value

In [25]:
def get_movie_doc(properties):
    doc = {
        'title': get_predicate_value(properties, TITLE_PREDICATE),
        'description': get_predicate_value(properties, DESCRIPTION_PREDICATE),
        'categories': get_predicate_value(properties, CATEGORIES_PREDICATE, multi_valued=True, transformation="categories"),
        'directors': get_predicate_value(properties, DIRECTORS_PREDICATE, multi_valued=True),
        'actors': get_predicate_value(properties, ACTORS_PREDICATE, multi_valued=True)
    }    
    return doc

In [21]:
from pprint import pprint

In [38]:
for page in pages:
    resource_name = page.replace(" ", "_")
    url_name = urllib.parse.quote(resource_name)
    print(url_name)
    data = requests.get("http://dbpedia.org/data/{}.json".format(url_name)).json()
    print(data)
    break  # debugging only: inspect the first response; remove to process all pages
    # Keys in the response use the unquoted resource name
    dict_key = "http://dbpedia.org/resource/{}".format(resource_name)
    if dict_key not in data:
        continue
    properties = data[dict_key]
    # Filter out non-movies (as well as entities without any type)
    if not has_type(properties, "http://dbpedia.org/ontology/Movie"):
        continue
    print("getting props")
    pprint(get_movie_doc(properties))
    break

List_of_20th_Century_Fox_films_%281935%E2%80%9399%29
{'http://dbpedia.org/resource/List_of_20th_Century_Fox_films_(1935–99)': {'http://www.w3.org/2000/01/rdf-schema#label': [{'type': 'literal', 'value': 'List of 20th Century Fox films (1935–99)', 'lang': 'en'}], 'http://www.w3.org/2000/01/rdf-schema#comment': [{'type': 'literal', 'value': 'This is a list of films produced by the U.S. film studio 20th Century Fox Film Corporation and released between its May 31, 1935 creation – as a merger between Fox Film Corporation (1915–1935) and 20th Century Pictures (1933–1936) – until 1999. For subsequent releases by 20th Century Fox, see List of 20th Century Fox films (2000–present).', 'lang': 'en'}], 'http://www.w3.org/2002/07/owl#sameAs': [{'type': 'uri', 'value': 'http://dbpedia.org/resource/List_of_20th_Century_Fox_films_(1935–99)'}, {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q6571123'}, {'type': 'uri', 'value': 'http://ko.dbpedia.org/resource/20세기_폭스의_영화_목록_(1935년-1999년)'}, {'

## MLM for ranking movies

Implement the mixture of language models (MLM) for ranking movies.

In [None]:
# TODO update field names and weights
FIELDS = ["title", "content"]
FIELD_WEIGHTS = [0.2, 0.8]
LAMBDA = 0.1

Documents should be scored according to **query (log)likelihood**: 

$\log P(q|d) = \sum_{t \in q} f_{t,q} \log P(t|\theta_d)$, 

where
  - $f_{t,q}$ is the frequency of term $t$ in the query
  - $P(t|\theta_d)$ is the (smoothed) document language model.
  
Using multiple document fields, the **document language model** is taken to be a linear combination of the (smoothed) field language models:

$P(t|\theta_d) = \sum_i w_i P(t|\theta_{d_i})$ ,

where $w_i$ is the field weight for field $i$ (and $\sum_i w_i = 1$).
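
As a minimal sketch of the two formulas above, the snippet below mixes already-smoothed field probabilities and sums the weighted log probabilities over the query terms. The function names (`mixture_prob`, `query_log_likelihood`) and the toy probabilities are assumptions made for illustration; they are not part of the exercise code.

```python
import math
from collections import Counter


def mixture_prob(field_probs, weights):
    """P(t|theta_d) = sum_i w_i * P(t|theta_{d_i})."""
    return sum(w * p for w, p in zip(weights, field_probs))


def query_log_likelihood(qterms, field_probs_by_term, weights):
    """log P(q|d) = sum_t f_{t,q} * log P(t|theta_d)."""
    score = 0.0
    for t, f_tq in Counter(qterms).items():
        score += f_tq * math.log(mixture_prob(field_probs_by_term[t], weights))
    return score


# Toy (made-up) smoothed field probabilities for two fields
weights = [0.2, 0.8]
field_probs_by_term = {"star": [0.5, 0.1], "wars": [0.25, 0.05]}
score = query_log_likelihood(["star", "wars"], field_probs_by_term, weights)
print(score)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.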

The **field language models** $P(t|\theta_{d_i})$ are computed as follows.

Using **Jelinek-Mercer smoothing**:

$P(t|\theta_{d_i}) = (1-\lambda_i) P(t|d_i) + \lambda_i P(t|C_i)$,

where 

  - $\lambda_i$ is a field-specific smoothing parameter
  - $P(t|d_i) = \frac{f_{t,d_i}}{|d_i|}$ is the empirical field language model (the term's relative frequency in the document field). $f_{t,d_i}$ is the raw frequency of $t$ in field $i$ of $d$. $|d_i|$ is the length (number of terms) of field $i$ of $d$.
  - $P(t|C_i) = \frac{\sum_{d'}f_{t,d'_i}}{\sum_{d'}|d'_i|}$ is the collection field language model (the term's relative frequency in that field across the entire collection).
  
Using **Dirichlet smoothing**:

$P(t|\theta_{d_i}) = \frac{f_{t,d_i} + \mu_i P(t|C_i)}{|d_i| + \mu_i}$,

where $\mu_i$ is the field-specific smoothing parameter.
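
The two smoothing variants can be compared on a small worked example. This is only a sketch: the helper names `jm_smoothing` and `dirichlet_smoothing` and the input values are hypothetical, and the guard for an empty field is an added assumption.

```python
def jm_smoothing(f_t_di, len_di, p_t_ci, lam):
    """Jelinek-Mercer: (1 - lambda) * P(t|d_i) + lambda * P(t|C_i)."""
    # Assumption: an empty field contributes only the collection probability
    p_t_di = f_t_di / len_di if len_di > 0 else 0.0
    return (1 - lam) * p_t_di + lam * p_t_ci


def dirichlet_smoothing(f_t_di, len_di, p_t_ci, mu):
    """Dirichlet: (f_{t,d_i} + mu * P(t|C_i)) / (|d_i| + mu)."""
    return (f_t_di + mu * p_t_ci) / (len_di + mu)


# Term occurs 2 times in a 10-term field; collection probability is 0.01
p_jm = jm_smoothing(2, 10, 0.01, lam=0.1)
p_dir = dirichlet_smoothing(2, 10, 0.01, mu=100)
print(p_jm, p_dir)
```

With Dirichlet smoothing the amount of smoothing depends on the field length: short fields are pulled more strongly toward the collection model than long ones.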

#### Collection Language Model class

This class is used for obtaining collection language modeling probabilities $P(t|C_i)$.

The reason this class is needed is that es.termvectors does not return term statistics for terms that do not appear in the given document. This would cause problems in scoring documents that are partial matches (do not contain all query terms in all fields).

The idea is that for each query term, we need to find a document that contains that term. Then the collection term statistics are available from that document's term vector. To make sure we find a matching document, we issue a boolean (match) query.
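
One possible way to turn such a term vector into $P(t|C_i)$ is sketched below. It assumes the response layout of the Elasticsearch term vectors API when `term_statistics=True` is requested, where each term carries a collection-wide `ttf` (total term frequency) and each field reports `sum_ttf` under `field_statistics`. The helper name `collection_prob_from_termvector` and the mock response are hypothetical; check the layout against the response of your own Elasticsearch version.

```python
def collection_prob_from_termvector(tv_response, field, term):
    """Estimate P(t|C_i) as ttf / sum_ttf from a term vectors response."""
    field_tv = tv_response.get("term_vectors", {}).get(field, {})
    # Total number of term occurrences in this field across the collection
    sum_ttf = field_tv.get("field_statistics", {}).get("sum_ttf", 0)
    # Total number of occurrences of this term in this field across the collection
    ttf = field_tv.get("terms", {}).get(term, {}).get("ttf", 0)
    return ttf / sum_ttf if sum_ttf > 0 else 0.0


# Mock response with made-up numbers, shaped like a term vectors result
mock_tv = {
    "term_vectors": {
        "title": {
            "field_statistics": {"sum_doc_freq": 900, "doc_count": 300, "sum_ttf": 1000},
            "terms": {"star": {"doc_freq": 4, "ttf": 5, "term_freq": 1}},
        }
    }
}
print(collection_prob_from_termvector(mock_tv, "title", "star"))
```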

In [None]:
class CollectionLM(object):
    def __init__(self, es, qterms):
        self._es = es
        self._probs = {}
        # computing P(t|C_i) for each field and for each query term
        for field in FIELDS:
            self._probs[field] = {}
            for t in qterms:
                self._probs[field][t] = self._get_prob(field, t)
        
    def _get_prob(self, field, term):
        # Use a boolean query to find a document that contains the term
        hits = self._es.search(index=INDEX_NAME, body={"query": {"match": {field: term}}},
                               _source=False, size=1).get("hits", {}).get("hits", {})
        doc_id = hits[0]["_id"] if len(hits) > 0 else None
        if doc_id is not None:
            # Ask for global term statistics when requesting the term vector of that doc (`term_statistics=True`)
            # TODO: complete this part            
            return 0

        return 0  # this only happens if none of the documents contain that term

    def prob(self, field, term):
        return self._probs.get(field, {}).get(term, 0)

#### Document scorer

In [None]:
import math


def score_mlm(es, clm, qterms, doc_id):
    score = 0  # log P(q|d)
    
    # Getting term frequency statistics for the given document field from Elasticsearch
    # Note that global term statistics are not needed (`term_statistics=False`)
    tv = es.termvectors(index=INDEX_NAME, id=doc_id, fields=FIELDS,
                              term_statistics=False).get("term_vectors", {})

    # compute field lengths $|d_i|$
    len_d_i = []  # document field length
    for i, field in enumerate(FIELDS):
        if field in tv: 
            len_d_i.append(sum([s["term_freq"] for t, s in tv[field]["terms"].items()]))
        else:  # that document field may be empty
            len_d_i.append(0)
        
    # scoring the query
    for t in qterms:
        Pt_theta_d = 0  # P(t|\theta_d)
        for i, field in enumerate(FIELDS):
            if field in tv:
                Pt_di = tv[field]["terms"].get(t, {}).get("term_freq", 0) / len_d_i[i]  # $P(t|d_i)$
            else:  # that document field is empty
                Pt_di = 0
            Pt_Ci = clm.prob(field, t)  # $P(t|C_i)$
            Pt_theta_di = (1 - LAMBDA) * Pt_di + LAMBDA * Pt_Ci  # $P(t|\theta_{d_i})$ with J-M smoothing
            Pt_theta_d += FIELD_WEIGHTS[i] * Pt_theta_di
        score += math.log(Pt_theta_d)    
    
    return score

#### Scoring queries

Perform an initial retrieval using the default ranking in Elasticsearch, then re-score each document using `score_mlm()`.

In [None]:
query = "TODO"

In [None]:
# get top 200 docs using BM25
res = es.search(index=INDEX_NAME, q=query, df="content", _source=False, size=200).get('hits', {})

# re-score docs using MLM

# TODO: get analyzed query
qterms = []

# get collection LM 
# (this needs to be instantiated only once per query and can be used for scoring all documents)
clm = CollectionLM(es, qterms)        
scores = {}
for doc in res.get("hits", {}):
    doc_id = doc.get("_id")
    scores[doc_id] = score_mlm(es, clm, qterms, doc_id)

# TODO output top 5 documents
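
Once the `scores` dictionary is filled, the final ranking is simply the document IDs sorted by descending score. A small sketch with made-up IDs and scores follows; the helper name `top_k` is hypothetical.

```python
def top_k(scores, k=5):
    """Return the k (doc_id, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up log-likelihood scores (higher, i.e. closer to zero, is better)
example_scores = {"d1": -12.4, "d2": -9.8, "d3": -15.1, "d4": -10.2, "d5": -11.0, "d6": -20.3}
ranking = top_k(example_scores)
for doc_id, score in ranking:
    print(doc_id, score)
```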

## PRMS

Implement field-specific term weighting using PRMS.
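
As PRMS is commonly formulated, the fixed field weight $w_i$ is replaced by a term-dependent mapping probability $P(f_i|t) = \frac{P(t|C_i) P(f_i)}{\sum_j P(t|C_j) P(f_j)}$, where $P(f_i)$ is a field prior. The sketch below computes these weights from given collection probabilities. The function name `prms_field_weights`, the uniform default prior, and the fallback to the priors for terms unseen in every field are assumptions made for illustration, not part of the exercise code.

```python
def prms_field_weights(term, fields, coll_probs, field_priors=None):
    """Compute P(f|t) for each field from P(t|C_f) and the field priors P(f)."""
    if field_priors is None:
        # Assumption: uniform field priors
        field_priors = {f: 1.0 / len(fields) for f in fields}
    joint = {f: coll_probs.get(f, {}).get(term, 0.0) * field_priors[f] for f in fields}
    norm = sum(joint.values())
    if norm == 0:
        # Assumption: fall back to the priors if the term occurs in no field
        return dict(field_priors)
    return {f: v / norm for f, v in joint.items()}


# Made-up collection probabilities: "hanks" is far more likely in the actors field
fields = ["title", "actors"]
coll_probs = {"title": {"hanks": 0.001}, "actors": {"hanks": 0.009}}
w = prms_field_weights("hanks", fields, coll_probs)
print(w)
```

In the scorer, these per-term weights would take the place of `FIELD_WEIGHTS[i]`, so a name-like term is matched mostly against the field where names usually occur.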

In [None]:
# TODO