# Access a Term Vector


A term vector holds information and statistics about the terms in the fields of a particular document. Elasticsearch can generate term vectors on the fly when they are requested.


 
## Getting Started

In this example, we will use the Elasticsearch Python API. First, we import and set up the Python modules and variables used later on. If you prefer to use `curl` instead of the Python API, the corresponding command is given in a comment above each API request.

In [None]:
from elasticsearch import Elasticsearch
import pandas as pd
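# Connect to a local Elasticsearch node; `hosts` is the client's parameter for node addresses.
es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])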

Let's examine what a document in this index looks like. (This operation may take a few seconds.)

In [None]:
# This query will retrieve every document in the index.
query = {
    'query': {
        'match_all': {}
    }
}

# Send a search request to Elasticsearch.
# curl -X GET localhost:9200/goma/_search -H 'Content-Type: application/json' -d @query.json
res = es.search(index='goma', body=query)

# The response is a JSON object; the list of hits is nested inside it.
# Here we are accessing the first hit in the listing.
res['hits']['hits'][0]
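
Each hit carries its metadata next to the stored document. The sketch below uses a hand-written stand-in for the search response (the id `example-id-1` and the `_source` content are made up for illustration) to show how the `_id` and the field names could be read from the first hit.

```python
# Hand-written stand-in for a search response; all values are made up.
sample_res = {
    'hits': {
        'hits': [
            {
                '_index': 'goma',
                '_type': '_doc',
                '_id': 'example-id-1',
                '_score': 1.0,
                '_source': {'description': 'an example description'},
            }
        ],
    }
}

first_hit = sample_res['hits']['hits'][0]
doc_id = first_hit['_id']                # the id a term vector request needs
fields = sorted(first_hit['_source'])    # names of the fields in the document
print(doc_id, fields)
```

With a live cluster, the same two lookups would be applied to `res` from the search request.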

Using the term vector API, let's investigate the term vector for the `description` field of a document from this index (id `2CmUSmgBOPedV1qM5TF2` in the call below). Note that the call to the `termvectors` method explicitly requests term statistics with `term_statistics=True`, since they are not returned by default.

In [None]:
# curl -X GET 'localhost:9200/goma/_doc/2CmUSmgBOPedV1qM5TF2/_termvectors?term_statistics=true&fields=description'
res = es.termvectors(index='goma', doc_type='_doc', id='2CmUSmgBOPedV1qM5TF2', 
                     fields=['description'], term_statistics=True)

# We don't really care that much about the additional info, let's get straight to the point.
tv = res['term_vectors']['description']
tv

That's a big JSON object, so let's break it down into some digestible tables. First, let's take a look at the field statistics.

 - `doc_count`: document count (how many documents contain this field)
 - `sum_doc_freq`: sum of document frequencies (the sum of document frequencies for all terms in this field)
 - `sum_ttf`: sum of total term frequencies (the sum of total term frequencies of each term in this field)

In [None]:
pd.DataFrame(tv['field_statistics'], index=['count'])
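
To make the three field statistics concrete, here is a minimal sketch that computes them by hand for a made-up, already tokenized three-document corpus (not the `goma` index). The figures Elasticsearch reports may differ from such a global count, since its documentation describes them as approximate and gathered from the shard that holds the requested document.

```python
from collections import Counter

# Three made-up documents, already split into terms.
docs = [
    ['red', 'apple', 'red'],
    ['green', 'apple'],
    ['red', 'pear'],
]

# doc_count: number of documents that have at least one term in the field.
doc_count = sum(1 for tokens in docs if tokens)

# Per-term document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in docs for term in set(tokens))

# Per-term total term frequency: how often each term appears overall.
ttf = Counter(term for tokens in docs for term in tokens)

# The field statistics are the sums of the per-term values.
sum_doc_freq = sum(doc_freq.values())
sum_ttf = sum(ttf.values())
print(doc_count, sum_doc_freq, sum_ttf)
```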

More importantly, we can also see the statistics for each term in the document. The tables below omit the `tokens` field; it can be used to find where each term occurs in the document.

 - `term_freq`: term frequency (how often the term occurs in this field of the document)
 - `doc_freq`: document frequency (the number of documents containing the current term)
 - `ttf`: total term frequency (how often the term occurs across all documents)

In [None]:
# Build one row per term, leaving out the bulky `tokens` list.
terms = []
for term, stats in tv['terms'].items():
    term_info = {key: value for key, value in stats.items() if key != 'tokens'}
    term_info['term'] = term
    terms.append(term_info)
df = pd.DataFrame(terms).set_index('term')
df[0:10]

In [None]:
# Sorted by doc_freq
df.sort_values(by='doc_freq', ascending=False)[0:10]

In [None]:
# Sorted by term_freq
df.sort_values(by='term_freq', ascending=False)[0:10]

In [None]:
# Sorted by ttf
df.sort_values(by='ttf', ascending=False)[0:10]
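
The `tokens` entries left out of the tables above hold, for every occurrence of a term, its position in the token stream and its character offsets. The sketch below reads them from a hand-written excerpt shaped like the `terms` part of a term vector response; the text and the numbers are made up for illustration.

```python
# Made-up source text and a hand-written excerpt of its term vector.
text = 'the quick fox and the quick dog'
sample_terms = {
    'quick': {
        'term_freq': 2,
        'tokens': [
            {'position': 1, 'start_offset': 4, 'end_offset': 9},
            {'position': 5, 'start_offset': 22, 'end_offset': 27},
        ],
    },
    'fox': {
        'term_freq': 1,
        'tokens': [{'position': 2, 'start_offset': 10, 'end_offset': 13}],
    },
}

# Token positions of each term (empty list if no token info is present).
positions = {
    term: [tok['position'] for tok in info.get('tokens', [])]
    for term, info in sample_terms.items()
}

# Character offsets point back into the original text.
surfaces = {
    term: [text[tok['start_offset']:tok['end_offset']]
           for tok in info.get('tokens', [])]
    for term, info in sample_terms.items()
}
print(positions)
print(surfaces)
```

On a real response, `sample_terms` would be replaced by `tv['terms']`, and the offsets would refer to the original content of the `description` field.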

#### Exercise 1

Repeat the exploration of the term vector for a document using the ClueWeb12 sample index you have built in previous activities.

#### Exercise 2 -- advanced

Using the ClueWeb12 sample index, identify two documents that contain a query term of your choice (suggestion: once you have chosen a term, query the index to retrieve the top 2 documents that match it). Then, compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the two term vectors.
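
As a possible starting point, the sketch below computes cosine similarity between two sparse vectors stored as `{term: weight}` dictionaries. The helper name `cosine_similarity`, the toy documents, and the choice of raw term frequencies as weights are assumptions made for illustration; other weightings are equally valid.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Made-up term frequencies for two tiny documents.
doc_a = {'cat': 2, 'dog': 1}
doc_b = {'cat': 1, 'fish': 1}
print(round(cosine_similarity(doc_a, doc_b), 3))
```

For a term vector response like `tv` above, such a dictionary could be built with `{term: info['term_freq'] for term, info in tv['terms'].items()}`.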