# Exercise #0: Getting term probabilities from Elasticsearch

For this exercise, you may use any existing Elasticsearch index you have locally.

Your task is to implement the two methods below, which return the empirical (i.e., unsmoothed) probability of a term in a given document and in the collection, both with respect to a specific field. The empirical probability is simply the relative frequency of the term: its count in the document field (or in that field across the whole collection) divided by the total number of term occurrences there.
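
As a concrete illustration of the two quantities, the sketch below computes them on a tiny in-memory collection, without Elasticsearch. The toy documents and the helper names (`doc_term_prob`, `collection_term_prob`) are made up for this example, and the tokens are assumed to be already analyzed.

```python
from collections import Counter

# Hypothetical toy collection: doc ID -> already-analyzed tokens of one field
TOY_DOCS = {
    "d1": ["citi", "norwai", "citi", "oil"],
    "d2": ["oil", "sea"],
}

def doc_term_prob(term, doc_id, docs=TOY_DOCS):
    """Relative frequency of `term` in one document: tf / |d|."""
    tokens = docs[doc_id]
    return Counter(tokens)[term] / len(tokens)

def collection_term_prob(term, docs=TOY_DOCS):
    """Relative frequency of `term` in the collection: ttf / total token count."""
    ttf = sum(Counter(tokens)[term] for tokens in docs.values())
    total = sum(len(tokens) for tokens in docs.values())
    return ttf / total

print(doc_term_prob("citi", "d1"))   # 2 of the 4 tokens in d1
print(collection_term_prob("oil"))   # 2 of the 6 tokens in the collection
```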

Relevant documentation:
  * Elasticsearch API endpoint: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
  * Respective Python client call: https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=termvectors#elasticsearch.Elasticsearch.termvectors
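
For orientation, a term vectors response requested with `term_statistics=True` is nested roughly as in the sketch below. The numbers are invented, and `SAMPLE_RESPONSE` and `read_stats` are hypothetical names used only for illustration; the key names (`term_vectors`, `field_statistics`, `sum_ttf`, `terms`, `term_freq`, `ttf`) follow the API documentation linked above.

```python
# Hand-written stand-in for a termvectors response (all values are invented)
SAMPLE_RESPONSE = {
    "_id": "Stavanger",
    "found": True,
    "term_vectors": {
        "content": {
            "field_statistics": {"sum_doc_freq": 900, "doc_count": 10, "sum_ttf": 1000},
            "terms": {
                "citi": {"doc_freq": 4, "ttf": 20, "term_freq": 3},
                "norwai": {"doc_freq": 2, "ttf": 5, "term_freq": 1},
            },
        }
    },
}

def read_stats(response, field, term):
    """Pull out the counts needed for the document and collection probabilities."""
    tv = response["term_vectors"][field]
    stats = tv["terms"].get(term, {})
    return {
        "tf": stats.get("term_freq", 0),  # occurrences in this document field
        "len_d": sum(s["term_freq"] for s in tv["terms"].values()),  # field length
        "ttf": stats.get("ttf", 0),  # occurrences in the field, collection-wide
        "sum_ttf": tv["field_statistics"]["sum_ttf"],  # all term occurrences in the field
    }

stats = read_stats(SAMPLE_RESPONSE, "content", "citi")
print(stats["tf"] / stats["len_d"])     # document probability
print(stats["ttf"] / stats["sum_ttf"])  # collection probability
```

Note that `ttf` and `doc_freq` are only returned when term statistics are requested, whereas `term_freq` and `field_statistics` are part of the default response.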

In [1]:
from elasticsearch import Elasticsearch

es = Elasticsearch()

This solution uses the Wikipedia index from [Lecture 13](https://github.com/kbalog/uis-dat640-fall2019/tree/master/exercises/lecture_13) Exercise 1, which has the fields `title` and `content`.

In [9]:
INDEX_NAME = "wikipedia"
DOC_TYPE = "_doc"

In [39]:
from pprint import pprint

def get_document_term_prob(term, doc_id, field):
    # Get the term vector of the given field for this document
    tv = es.termvectors(index=INDEX_NAME, id=doc_id, fields=field)['term_vectors'][field]
    # Term frequency in the document field (0 if the term does not occur there)
    tf = tv['terms'].get(term, {}).get('term_freq', 0)

    # Document (field) length is the sum of the frequencies of all its terms
    len_d = sum(s['term_freq'] for s in tv['terms'].values())

    return tf / len_d

In [40]:
# Note the indexing applies stemming
print(get_document_term_prob("citi", "Stavanger", "content"))

0.045936395759717315


In [45]:
def get_collection_term_prob(term, field):
    # Use a query-string search (restricted to the given field via `df`) to find one document that contains the term
    hits = es.search(index=INDEX_NAME, q=term, df=field, size=1).get("hits", {}).get("hits", [])
    doc_id = hits[0]['_id'] if len(hits) > 0 else None
    if doc_id is not None:
        # Ask for global term statistics when requesting the term vector of that doc
        tv = es.termvectors(index=INDEX_NAME, id=doc_id, fields=field, term_statistics=True)['term_vectors'][field]
        ttf = tv['terms'].get(term, {}).get("ttf", 0)  # total term count in the collection (in that field)
        sum_ttf = tv['field_statistics']['sum_ttf']
        return ttf / sum_ttf

    return 0  # this only happens if none of the documents contain that term


In [46]:
# As above, the term must be given in its indexed (stemmed) form
get_collection_term_prob("citi", "content")
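
Because a query-string search analyzes its input, an alternative for locating one document that contains an already-analyzed term is a `term` query, which matches the indexed token exactly. The sketch below only builds the request body; `build_term_lookup` is a hypothetical helper, and passing the result as `body=` to `es.search` is an assumption about the installed client version.

```python
def build_term_lookup(term, field):
    """Request body that finds one document whose `field` contains the exact indexed `term`."""
    return {
        "query": {"term": {field: {"value": term}}},
        "size": 1,         # one matching document is enough
        "_source": False,  # only the document ID is needed
    }

body = build_term_lookup("citi", "content")
print(body)
# Assumed usage with the client defined earlier:
# hits = es.search(index=INDEX_NAME, body=body)["hits"]["hits"]
```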