# Elasticsearch Term Extraction

In [None]:
ES_HOST = 'localhost'
ES_PORT = 9200

To get the term information, we must make a request using the document ids, so the first step is to collect the ids of all documents.

Paginating through the search results to do this would be too time-consuming. It is better to use [a script to create a custom aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html). The aggregation maps each document to its id and returns the ids as a list.

You can read [more about scripting](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html).

In [None]:
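ES_SEARCH_QUERY = {
    # Only the aggregation result is needed, so skip returning search hits.
    "size": 0,
    "aggs": {
        "pk": {
            "scripted_metric": {
                # Runs once per shard: start with an empty list of ids.
                "init_script": "params._agg.pks = []",
                # Runs once per document: append the document's id as a string.
                "map_script": "params._agg.pks.add(doc.id.value.toString())"
            }
        }
    }
}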
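
To make the two scripts in the query above more concrete, here is a pure-Python sketch of what they appear to do. It does not talk to Elasticsearch: the shard contents are made-up illustration data, and `run_scripted_metric` is a hypothetical helper. It assumes that, with no combine or reduce script, the state of each shard comes back as one list entry, which matches how the response is indexed later in this notebook.

```python
# Pure-Python sketch of the scripted metric phases; no Elasticsearch needed.
# The shard contents are invented purely for illustration.
shards = [
    [{"id": 1}, {"id": 2}],  # documents on a hypothetical first shard
    [{"id": 3}],             # documents on a hypothetical second shard
]


def run_scripted_metric(shards):
    """Mimic init_script and map_script, returning one state per shard."""
    results = []
    for shard_docs in shards:
        agg = {"pks": []}  # init_script: runs once per shard
        for doc in shard_docs:
            agg["pks"].append(str(doc["id"]))  # map_script: runs once per document
        results.append(agg)  # assumed: the state is returned unchanged
    return results


per_shard = run_scripted_metric(shards)
print(per_shard)
```

Under these assumptions the result is a list with one `{"pks": [...]}` entry per shard, so an index with several shards yields several id lists.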

In [None]:
import requests

aggregated_ids = requests.get(
    f'http://{ES_HOST}:{ES_PORT}/documents/document/_search',
    json=ES_SEARCH_QUERY
).json()

Now that we have the ids, we can ask for the terms.

You can read about the [term vectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html) and about the [batch API for it](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html).

In [None]:
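# The aggregation returns one entry per shard, so collect the ids from
# every shard rather than only the first one.
ES_MTERMVECTORS_QUERY = {
    "ids": [
        pk
        for shard in aggregated_ids["aggregations"]["pk"]["value"]
        for pk in shard["pks"]
    ],
    # Only term frequencies are needed, so switch off the other statistics.
    "parameters": {
        "fields": [ "fullText" ],
        "offsets": False,
        "positions": False,
        "field_statistics": False
    }
}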

In [None]:
document_terms = requests.get(
    f'http://{ES_HOST}:{ES_PORT}/documents/document/_mtermvectors',
    json=ES_MTERMVECTORS_QUERY
).json()
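
If the index holds many documents, a single request carrying every id may become large. One possible approach, sketched here under that assumption, is to split the ids into batches and build one request body per batch. The helper names `chunked` and `build_mtermvectors_bodies` and the default batch size of 500 are hypothetical choices, not something Elasticsearch prescribes.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_mtermvectors_bodies(ids, batch_size=500):
    """Build one request body per batch of ids (batch_size is an arbitrary choice)."""
    return [
        {
            "ids": batch,
            "parameters": {
                "fields": ["fullText"],
                "offsets": False,
                "positions": False,
                "field_statistics": False,
            },
        }
        for batch in chunked(ids, batch_size)
    ]


# Example with made-up ids and a tiny batch size.
bodies = build_mtermvectors_bodies(["a", "b", "c"], batch_size=2)
print([body["ids"] for body in bodies])
```

Each body could then be sent with its own `requests.get` call, and the `docs` lists of the responses concatenated before aggregating.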

These terms are reported per document (tweet), so to get the overall frequencies we need to sum them across all documents.

In [None]:
from collections import defaultdict

term_frequencies = defaultdict(int)

# Sum each term's frequency over all returned documents.
for doc in document_terms['docs']:
    terms = doc['term_vectors']['fullText']['terms']
    for term, stats in terms.items():
        term_frequencies[term] += stats['term_freq']

In [None]:
sorted(term_frequencies.items(), key=lambda term: term[1], reverse=True)[:10]
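
As an alternative sketch, `collections.Counter` can do both the summing and the top-10 selection. The response below is hand-made to mimic the shape used above, and its terms and counts are invented. The hypothetical helper `count_terms` also skips entries without term vectors, which is assumed to be possible for documents that lack the `fullText` field.

```python
from collections import Counter

# Hand-made response in the shape used above; terms and counts are invented.
sample_response = {
    "docs": [
        {"term_vectors": {"fullText": {"terms": {
            "hello": {"term_freq": 2},
            "world": {"term_freq": 1},
        }}}},
        {"term_vectors": {"fullText": {"terms": {
            "hello": {"term_freq": 1},
        }}}},
        {},  # an entry without term vectors (assumed possible)
    ]
}


def count_terms(response, field="fullText"):
    """Sum term frequencies over all documents, skipping entries without the field."""
    totals = Counter()
    for doc in response["docs"]:
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return totals


top_terms = count_terms(sample_response).most_common(10)
print(top_terms)
```

`most_common(10)` returns the same kind of `(term, frequency)` pairs as the `sorted(...)[:10]` expression, ordered from most to least frequent.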