# Accessing a Term Vector

_Using Elasticsearch and the Python API_

A term vector holds information and statistics about the terms in the fields of a particular document. If term vectors are not stored in the index, Elasticsearch generates them on the fly.

## What you will need

 - Python 3
 
This example uses an index based on media releases by a gallery, available at: https://data.qld.gov.au/dataset/qagoma-media-releases/resource/a1e4dffa-edb1-4e6d-a4a0-353aca79e9a3.
 
## Getting Started

In this example, we will use the Elasticsearch Python API. First, we import and set up the Python modules and variables used later on. If you would rather use `curl` than the Python API, the equivalent command is given in a comment above each API request.

In [1]:
from elasticsearch import Elasticsearch
import pandas as pd
# Connect to a local Elasticsearch node on the default port.
es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])

Let's examine what a document in this index looks like.

In [2]:
# This query will retrieve every document in the index.
query = {
    'query': {
        'match_all': {}
    }
}

# Send a search request to Elasticsearch.
# curl -X GET localhost:9200/goma/_search -H 'Content-Type: application/json' -d @query.json
res = es.search(index='goma', body=query)

# The response is a JSON object; the list of hits is nested inside it.
# Here we access the first hit in that list.
res['hits']['hits'][0]

{'_id': 'AV19Sgi4jk6MoKTLfifp',
 '_index': 'goma',
 '_score': 1.0,
 '_source': {'available': 'Yes',
  'category': 'Exhibition',
  'description': "'Lucent' draws together works from the Aboriginal and Pacific collections, illuminating connections and differences between the cultures. Evocations of light and its absence are explored through works ranging from installations of great majesty to intimate adornments for the body. Multiple feathered Banumbirr (Morning star) poles associated with creation stories engage with a 22 metre black Ngatu Ta Uli (barkcloth) used within mourning rituals, as do decorative and ceremonial pearlshell pendants from both cultures and textiles.",
  'end_time': '2017-07-30 17:00:00',
  'entry': 'Free',
  'id': '49647',
  'link': 'https://www.qagoma.qld.gov.au/whats-on/exhibitions/lucent',
  'location': 'GOMA: Gallery 3.5',
  'start_time': '2016-11-26 10:00:00',
  'stop_date': '2017-07-30',
  'thumbnail': 'https://www.qagoma.qld.gov.au/__data/assets/image/0006/

Using the term vectors API, let's investigate the term vector for the `description` field of the document above (id _AV19Sgi4jk6MoKTLfifp_).

In [3]:
# curl -X GET 'localhost:9200/goma/event/AV19Sgi4jk6MoKTLfifp/_termvectors?term_statistics=true&fields=description'
res = es.termvectors(index='goma', doc_type='event', id='AV19Sgi4jk6MoKTLfifp', 
                     fields=['description'], term_statistics=True)

# The rest of the response is metadata; keep only the term vector for the description field.
tv = res['term_vectors']['description']
tv

{'field_statistics': {'doc_count': 32, 'sum_doc_freq': 1473, 'sum_ttf': 1854},
 'terms': {'22': {'doc_freq': 1,
   'term_freq': 1,
   'tokens': [{'end_offset': 381, 'position': 52, 'start_offset': 379}],
   'ttf': 1},
  'a': {'doc_freq': 28,
   'term_freq': 1,
   'tokens': [{'end_offset': 378, 'position': 51, 'start_offset': 377}],
   'ttf': 37},
  'aboriginal': {'doc_freq': 2,
   'term_freq': 1,
   'tokens': [{'end_offset': 49, 'position': 6, 'start_offset': 39}],
   'ttf': 3},
  'absence': {'doc_freq': 1,
   'term_freq': 1,
   'tokens': [{'end_offset': 173, 'position': 22, 'start_offset': 166}],
   'ttf': 1},
  'adornments': {'doc_freq': 1,
   'term_freq': 1,
   'tokens': [{'end_offset': 267, 'position': 35, 'start_offset': 257}],
   'ttf': 1},
  'and': {'doc_freq': 23,
   'term_freq': 5,
   'tokens': [{'end_offset': 53, 'position': 7, 'start_offset': 50},
    {'end_offset': 103, 'position': 12, 'start_offset': 100},
    {'end_offset': 161, 'position': 20, 'start_offset': 158},
    {

That's a big JSON object, so let's break it down into some digestible tables. First, let's take a look at the field statistics.

 - `doc_count`: document count (how many documents contain this field)
 - `sum_doc_freq`: sum of document frequencies (the sum of document frequencies for all terms in this field)
 - `sum_ttf`: sum of total term frequencies (the sum of total term frequencies of each term in this field)
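
To make these definitions concrete, here is a minimal sketch that computes the same three statistics for a tiny made-up corpus. The corpus and the `tokenize` helper (lowercasing and splitting on non-alphanumeric characters) are assumptions for illustration only; the analyzer configured for the `description` field may tokenize differently.

```python
import re

# Hypothetical three-document corpus for a single field (not from the goma index).
docs = [
    "works of light and works of shadow",
    "light and colour",
    "",  # this document has no terms in the field
]

def tokenize(text):
    # Assumed analyzer: lowercase, then split on anything non-alphanumeric.
    return re.findall(r"[a-z0-9]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# doc_count: documents with at least one term in this field.
doc_count = sum(1 for tokens in tokenized if tokens)
# sum_doc_freq: each distinct term counts once per document it appears in.
sum_doc_freq = sum(len(set(tokens)) for tokens in tokenized)
# sum_ttf: every occurrence of every term, across all documents.
sum_ttf = sum(len(tokens) for tokens in tokenized)

field_statistics = {
    "doc_count": doc_count,
    "sum_doc_freq": sum_doc_freq,
    "sum_ttf": sum_ttf,
}
print(field_statistics)
```

Because "works" and "of" each occur twice in the first document, `sum_ttf` is larger than `sum_doc_freq`; the two are only equal when no term repeats within a document.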

In [4]:
pd.DataFrame(tv['field_statistics'], index=['count'])

|       | doc_count | sum_doc_freq | sum_ttf |
|-------|-----------|--------------|---------|
| count | 32        | 1473         | 1854    |


More importantly, we can see the statistics for each term in the document. The tables below omit the `tokens` field, which records the position and character offsets of each occurrence of the term in the document.

 - `term_freq`: term frequency (how many times the term occurs in this field of the current document)
 - `doc_freq`: document frequency (the number of documents containing the current term)
 - `ttf`: total term frequency (how many times the term occurs in this field across all documents)
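
As a rough illustration of how these per-term numbers and the `tokens` entries relate to the text, the sketch below builds a term-vector-like dictionary for a made-up two-document corpus. The corpus, the `analyze` helper and the `term_vector` function are hypothetical and only mimic the shape of the response shown earlier; they are not how Elasticsearch computes it internally.

```python
import re

# Hypothetical corpus; we build the term vector for the first document.
docs = [
    "light and shadow and light",
    "shadow play",
]

def analyze(text):
    # Assumed analyzer: lowercase words with their position and character offsets.
    return [
        {"term": m.group().lower(), "position": i,
         "start_offset": m.start(), "end_offset": m.end()}
        for i, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text))
    ]

def term_vector(doc_index):
    analyzed = [analyze(d) for d in docs]
    terms = {}
    # term_freq and tokens come from the requested document only.
    for tok in analyzed[doc_index]:
        entry = terms.setdefault(tok["term"], {"term_freq": 0, "tokens": []})
        entry["term_freq"] += 1
        entry["tokens"].append(
            {k: tok[k] for k in ("position", "start_offset", "end_offset")}
        )
    # doc_freq and ttf are computed over the whole corpus.
    for term, entry in terms.items():
        per_doc = [sum(1 for t in a if t["term"] == term) for a in analyzed]
        entry["doc_freq"] = sum(1 for n in per_doc if n > 0)
        entry["ttf"] = sum(per_doc)
    return terms

tv_sketch = term_vector(0)

# The offsets slice the original text back to the matched term.
first = tv_sketch["shadow"]["tokens"][0]
print(docs[0][first["start_offset"]:first["end_offset"]])
```

Here "light" has a `term_freq` of 2 but a `doc_freq` of 1, while "shadow" has a `term_freq` of 1 and a `doc_freq` of 2, which shows that the two statistics measure different things.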

In [5]:
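terms = []
for term, info in tv['terms'].items():
    # Keep the statistics, drop the per-occurrence token positions.
    term_info = {key: value for key, value in info.items() if key != 'tokens'}
    # Record the term itself so it can become the index of the table.
    term_info['term'] = term
    terms.append(term_info)
df = pd.DataFrame(terms).set_index('term')
df[0:10]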

| term       | doc_freq | term_freq | ttf |
|------------|----------|-----------|-----|
| 22         | 1        | 1         | 1   |
| a          | 28       | 1         | 37  |
| aboriginal | 2        | 1         | 3   |
| absence    | 1        | 1         | 1   |
| adornments | 1        | 1         | 1   |
| and        | 23       | 5         | 73  |
| are        | 3        | 1         | 3   |
| as         | 9        | 1         | 12  |
| associated | 1        | 1         | 1   |
| banumbirr  | 1        | 1         | 1   |


In [6]:
# Sorted by doc_freq
df.sort_values(by='doc_freq', ascending=False)[0:10]

| term  | doc_freq | term_freq | ttf |
|-------|----------|-----------|-----|
| a     | 28       | 1         | 37  |
| the   | 25       | 3         | 85  |
| of    | 24       | 2         | 54  |
| and   | 23       | 5         | 73  |
| for   | 14       | 1         | 14  |
| to    | 14       | 1         | 24  |
| from  | 11       | 3         | 20  |
| as    | 9        | 1         | 12  |
| with  | 7        | 2         | 12  |
| works | 6        | 2         | 8   |


In [7]:
# Sorted by term_freq
df.sort_values(by='term_freq', ascending=False)[0:10]

| term     | doc_freq | term_freq | ttf |
|----------|----------|-----------|-----|
| and      | 23       | 5         | 73  |
| the      | 25       | 3         | 85  |
| from     | 11       | 3         | 20  |
| works    | 6        | 2         | 8   |
| of       | 24       | 2         | 54  |
| cultures | 1        | 2         | 2   |
| with     | 7        | 2         | 12  |
| lucent   | 1        | 1         | 1   |
| majesty  | 1        | 1         | 1   |
| metre    | 1        | 1         | 1   |


In [8]:
# Sorted by ttf
df.sort_values(by='ttf', ascending=False)[0:10]

| term  | doc_freq | term_freq | ttf |
|-------|----------|-----------|-----|
| the   | 25       | 3         | 85  |
| and   | 23       | 5         | 73  |
| of    | 24       | 2         | 54  |
| a     | 28       | 1         | 37  |
| to    | 14       | 1         | 24  |
| from  | 11       | 3         | 20  |
| for   | 14       | 1         | 14  |
| with  | 7        | 2         | 12  |
| as    | 9        | 1         | 12  |
| works | 6        | 2         | 8   |
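
If pandas is not available, the same kind of ranking can be sketched with the standard library alone. The `rows` dictionary below copies a handful of values from the tables above rather than the full term list, and `top_terms` is a hypothetical helper introduced only for this illustration.

```python
# A few rows copied from the tables above (not the complete term vector).
rows = {
    "a":     {"doc_freq": 28, "term_freq": 1, "ttf": 37},
    "the":   {"doc_freq": 25, "term_freq": 3, "ttf": 85},
    "of":    {"doc_freq": 24, "term_freq": 2, "ttf": 54},
    "and":   {"doc_freq": 23, "term_freq": 5, "ttf": 73},
    "works": {"doc_freq": 6,  "term_freq": 2, "ttf": 8},
}

def top_terms(stats, by, n=3):
    # Sort terms by the chosen statistic, largest first, and keep the first n.
    ranked = sorted(stats.items(), key=lambda item: item[1][by], reverse=True)
    return [term for term, _ in ranked[:n]]

print(top_terms(rows, by="ttf"))
print(top_terms(rows, by="doc_freq"))
```

Whichever statistic is used, very common words such as "the", "and" and "of" end up near the top of the ranking, just as they do in the tables above.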
