# Elasticsearch

A toy-sized example for indexing and searching a collection of documents.

In [1]:
from elasticsearch import Elasticsearch

In [2]:
from pprint import pprint  # for pretty printing of JSON objects

In [3]:
INDEX_NAME = "toy_index"  # the name of the index

INDEX_SETTINGS = {  # single shard with a single replica
    'settings' : {
        'index' : {
            'number_of_shards' : 1,
            'number_of_replicas' : 1
        }
    }
}

The collection of documents is given here as a Python dictionary. Each document has two fields: title and content.

In [4]:
DOCS = {
    1: {'title': "Rap God",
        'content': "gonna, gonna, Look, I was gonna go easy on you and not to hurt your feelings"
        },
    2: {'title': "Lose Yourself",
        'content': "Yo, if you could just, for one minute Or one split second in time, forget everything Everything that bothers you, or your problems Everything, and follow me"
        },
    3: {'title': "Love The Way You Lie",
        'content': "Just gonna stand there and watch me burn But that's alright, because I like the way it hurts"
        },
    4: {'title': "The Monster",
        'content': ["gonna gonna I'm friends with the monster", "That's under my bed Get along with the voices inside of my head"]
        },
    5: {'title': "Beautiful",
        'content': "Lately I've been hard to reach I've been too long on my own Everybody has a private world Where they can be alone"
        }
}  # Eminem rulez ;)

### Create Elasticsearch object

In [5]:
es = Elasticsearch()

Check if service is running

In [6]:
es.info()

{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'LMlf8WX9RPC0aJ0eB5R69Q',
 'name': 'Krisztians-MacBook-Pro.local',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2019-09-27T08:36:48.569419Z',
  'build_flavor': 'default',
  'build_hash': '22e1767283e61a198cb4db791ea66e3f11ab9910',
  'build_snapshot': False,
  'build_type': 'tar',
  'lucene_version': '8.2.0',
  'minimum_index_compatibility_version': '6.0.0-beta1',
  'minimum_wire_compatibility_version': '6.8.0',
  'number': '7.4.0'}}

### Create index

If the index exists, we delete it (normally, you don't want to do this).

In [7]:
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)

We set the number of shards and replicas for the index when it is created. (We use a single shard. Note that the default was 5 shards before Elasticsearch 7.0 and is 1 since 7.0.)

In [8]:
es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)

{'acknowledged': True, 'index': 'toy_index', 'shards_acknowledged': True}

### Add documents to the index

In [9]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME, id=doc_id, body=doc)

### Check what has been indexed

Get the contents of doc #3

In [10]:
doc = es.get(index=INDEX_NAME, id=3)

In [11]:
pprint(doc)

{'_id': '3',
 '_index': 'toy_index',
 '_primary_term': 1,
 '_seq_no': 2,
 '_source': {'content': "Just gonna stand there and watch me burn But that's "
                        'alright, because I like the way it hurts',
             'title': 'Love The Way You Lie'},
 '_type': '_doc',
 '_version': 1,
 'found': True}


Get the term vector for doc #3.

`termvectors` returns information and statistics on terms in the fields of a particular document.

In [12]:
tv = es.termvectors(index=INDEX_NAME, id=3, fields="title,content", term_statistics=True)

In [13]:
pprint(tv)

{'_id': '3',
 '_index': 'toy_index',
 '_type': '_doc',
 '_version': 1,
 'found': True,
 'term_vectors': {'content': {'field_statistics': {'doc_count': 5,
                                                   'sum_doc_freq': 91,
                                                   'sum_ttf': 104},
                              'terms': {'alright': {'doc_freq': 1,
                                                    'term_freq': 1,
                                                    'tokens': [{'end_offset': 59,
                                                                'position': 10,
                                                                'start_offset': 52}],
                                                    'ttf': 1},
                                        'and': {'doc_freq': 3,
                                                'term_freq': 1,
                                                'tokens': [{'end_offset': 26,
                                                        

Interpretation of the returned values
  * `[{field}]['field_statistics']`: 
    - `doc_count`: how many documents contain this field
    - `sum_doc_freq`: the sum of the document frequencies of all terms in this field
    - `sum_ttf`: the sum of the total term frequencies of all terms in this field
  * `[{field}]['terms'][{term}]`:
    - `doc_freq`: how many documents contain this term
    - `term_freq`: frequency (number of occurrences) of the term in this document field
    - `ttf`: total term frequency, i.e., number of occurrences of the term in this field in all documents
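
To make these definitions concrete, here is a minimal pure-Python sketch that computes the same kinds of statistics for a single field. The three-document collection and the `tokenize` helper are hypothetical stand-ins (not the Elasticsearch analyzer), so the numbers are not meant to match the output above.

```python
import re
from collections import Counter

# Hypothetical three-document collection with a single field.
docs = {
    1: "gonna gonna look",
    2: "look at the monster",
    3: "the monster is gonna look",
}

def tokenize(text):
    """Lowercase and extract letter runs (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z']+", text.lower())

# term_freq: occurrences of each term within one document
term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

doc_freq = Counter()  # number of documents that contain each term
ttf = Counter()       # occurrences of each term over all documents
for counts in term_freqs.values():
    doc_freq.update(counts.keys())
    ttf.update(counts)

field_statistics = {
    "doc_count": sum(1 for counts in term_freqs.values() if counts),
    "sum_doc_freq": sum(doc_freq.values()),
    "sum_ttf": sum(ttf.values()),
}

print(field_statistics)
print("gonna in doc 1:", {"term_freq": term_freqs[1]["gonna"],
                          "doc_freq": doc_freq["gonna"],
                          "ttf": ttf["gonna"]})
```

Here `sum_ttf` equals the total number of tokens in the field across all documents, while `sum_doc_freq` counts each term only once per document.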

Note that Elasticsearch can split an index into multiple shards (the default was 5 before version 7.0 and is 1 since). This means that when you ask for term statistics, these are computed per shard. For a large collection, this is typically not an issue, as the statistics even out across the shards and the differences are negligible. For smaller collections that fit on a single disk, you may set the number of shards to 1 to avoid this issue altogether (as we have done in this example in `INDEX_SETTINGS`).
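
The effect of per-shard statistics can be illustrated without a running cluster. The sketch below is an assumption-laden toy: the placement of six tiny documents on two shards is made up, and the `idf` function is the formula commonly used with BM25, taken here as a reasonable proxy for what is computed per shard.

```python
import math

def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical placement of six tiny documents (as term sets) on two shards.
shards = {
    "shard_0": [{"monster"}, {"monster"}, {"rap"}],
    "shard_1": [{"rap"}, {"rap"}, {"rap", "monster"}],
}

def stats(docs, term):
    """Return (number of documents, number of documents containing term)."""
    return len(docs), sum(1 for terms in docs if term in terms)

# IDF of "monster" as seen by each shard in isolation
per_shard_idf = {name: idf(*stats(docs, "monster")) for name, docs in shards.items()}

# IDF of "monster" computed over the whole collection
all_docs = [terms for docs in shards.values() for terms in docs]
global_idf = idf(*stats(all_docs, "monster"))

for name, value in per_shard_idf.items():
    print("%s: %.3f" % (name, value))
print("global:  %.3f" % global_idf)
```

The same term gets a different weight depending on which shard a document lives on. With a single shard, the per-shard and global statistics coincide. The blog post linked below also discusses the `dfs_query_then_fetch` search type as a way to obtain global statistics at query time.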

Check the following documents for further information:
  - https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html
  - https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch

### Search

In [14]:
query = "rap monster"
res = es.search(index=INDEX_NAME, q=query, _source=False, size=10)

Print full response (`hits` holds the results)

In [15]:
pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [],
          'max_score': None,
          'total': {'relation': 'eq', 'value': 0}},
 'timed_out': False,
 'took': 1}
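
An empty result like the one above can occur when the search runs immediately after indexing: newly indexed documents only become searchable after the index has been refreshed (by default roughly once per second). Whether that explains this particular output is an assumption. A minimal sketch of a helper that forces a refresh before searching, where `search_after_refresh` is a hypothetical name and `es` is a client object like the one created earlier:

```python
def search_after_refresh(es, index, query, size=10):
    """Refresh the index so recent writes are visible, then run a query-string search.

    `es` is expected to be an Elasticsearch client (or any object with the
    same `indices.refresh` and `search` methods).
    """
    es.indices.refresh(index=index)
    return es.search(index=index, q=query, _source=False, size=size)

# Usage against a running service (not executed here):
# res = search_after_refresh(es, INDEX_NAME, "rap monster")
```

Forcing a refresh is fine for a toy example; for bulk indexing it is usually better to rely on the periodic refresh.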


Print only search results (ranked list of docs)

In [16]:
for hit in res['hits']['hits']:
    print("Doc ID: %3r  Score: %5.2f" % (hit['_id'], hit['_score']))

## Elasticsearch query language

Elasticsearch supports structured queries based on its own [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

Keep in mind that certain queries do not analyze the query text and therefore expect terms exactly as they are stored in the index (e.g., [term queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html)), while other query types (e.g., [match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html)) perform analysis as part of the processing. Make sure you check the respective documentation carefully.
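
The difference can be sketched by building the two request bodies side by side. The helper names `term_query` and `match_query` are hypothetical, and the manual lowercasing assumes that the field was indexed with an analyzer that lowercases tokens.

```python
def term_query(field, term):
    """Term query: `term` is looked up in the index as given, without analysis."""
    return {"query": {"term": {field: term}}}

def match_query(field, text):
    """Match query: `text` is analyzed by Elasticsearch before matching."""
    return {"query": {"match": {field: text}}}

raw = "Monster"

# Assumption: indexed tokens are lowercased, so we normalize the term ourselves.
term_body = term_query("content", raw.lower())

# The match query can take the raw text; analysis happens on the server side.
match_body = match_query("content", raw)

print(term_body)
print(match_body)
```

Either dictionary can be passed as `body` to `es.search`, in the same way as in the examples further below.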

### Building a second toy index with position information

In [17]:
INDEX_NAME2 = "toy_index2"  

INDEX_SETTINGS2 = {
    'settings' : {
        'index' : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        'analysis': {
            'analyzer': {
                'my_english_analyzer': {
                    'type': "custom",
                    'tokenizer': "standard",
                    'stopwords': "_english_",
                    'filter': [
                        "lowercase",
                        "english_stop",
                        "filter_english_minimal"
                    ]                
                }
            },
            'filter' : {
                'filter_english_minimal' : {
                    'type': "stemmer",
                    'name': "minimal_english"
                },
                'english_stop': {
                    'type': "stop",
                    'stopwords': "_english_"
                }
            },
        }
    },
    'mappings': {
        'properties': {
            'title': {
                'type': "text",
                'term_vector': "with_positions",
                'analyzer': "my_english_analyzer"
            },
            'content': {
                'type': "text",
                'term_vector': "with_positions",
                'analyzer': "my_english_analyzer"
            }
        }
    }
}
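
To get a feeling for what the analyzer chain above does (tokenize, lowercase, remove stop words, stem), here is a rough pure-Python imitation. It is only a sketch: `STOPWORDS` is a small assumed subset of the `_english_` list, and `minimal_stem` is a crude plural stripper, not the actual `minimal_english` rules.

```python
import re

# Assumption: a small subset of the `_english_` stop word list, for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def minimal_stem(token):
    """Very rough plural stripping; NOT the real `minimal_english` stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token

def analyze(text):
    """Return (position, term) pairs; removed stop words leave gaps in the positions."""
    tokens = re.findall(r"\w+", text.lower())
    return [(pos, minimal_stem(tok))
            for pos, tok in enumerate(tokens)
            if tok not in STOPWORDS]

print(analyze("Love The Way You Lie"))
# [(0, 'love'), (2, 'way'), (3, 'you'), (4, 'lie')]
```

Note that positions are assigned before stop words are dropped, which is why "way" ends up at position 2 rather than 1.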

In [18]:
if es.indices.exists(index=INDEX_NAME2):
    es.indices.delete(index=INDEX_NAME2)
    
es.indices.create(index=INDEX_NAME2, body=INDEX_SETTINGS2)

{'acknowledged': True, 'index': 'toy_index2', 'shards_acknowledged': True}

In [19]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME2, id=doc_id, body=doc)

Check that term position information has been added to the index

In [25]:
tv = es.termvectors(index=INDEX_NAME2, id=3, fields="title", term_statistics=True)

pprint(tv)

{'_id': '3',
 '_index': 'toy_index2',
 '_type': '_doc',
 '_version': 1,
 'found': True,
 'term_vectors': {'title': {'field_statistics': {'doc_count': 5,
                                                 'sum_doc_freq': 10,
                                                 'sum_ttf': 10},
                            'terms': {'lie': {'doc_freq': 1,
                                              'term_freq': 1,
                                              'tokens': [{'position': 4}],
                                              'ttf': 1},
                                      'love': {'doc_freq': 1,
                                               'term_freq': 1,
                                               'tokens': [{'position': 0}],
                                               'ttf': 1},
                                      'way': {'doc_freq': 1,
                                              'term_freq': 1,
                                              'tokens': [{'position': 2}],
 

### Examples

Searching for documents that must match a [boolean combination](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html) of multiple terms (in any order).  

In [23]:
query = {
    'bool': {
        'must': [
            {'match': {'content': "gonna"}}, 
            {'match': {'content': "monster"}}
        ]
    }
}

res = es.search(index=INDEX_NAME2, body={'query': query})

pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '4',
                    '_index': 'toy_index2',
                    '_score': 2.147757,
                    '_source': {'content': ["gonna gonna I'm friends with the "
                                            'monster',
                                            "That's under my bed Get along "
                                            'with the voices inside of my '
                                            'head'],
                                'title': 'The Monster'},
                    '_type': '_doc'}],
          'max_score': 2.147757,
          'total': {'relation': 'eq', 'value': 1}},
 'timed_out': False,
 'took': 1}


Searching for documents that match an [exact phrase](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) (terms in that exact order).

In [24]:
query = {'match_phrase': {'content': "split second"}}

res = es.search(index=INDEX_NAME2, body={'query': query})

pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '2',
                    '_index': 'toy_index2',
                    '_score': 2.4706814,
                    '_source': {'content': 'Yo, if you could just, for one '
                                           'minute Or one split second in '
                                           'time, forget everything Everything '
                                           'that bothers you, or your problems '
                                           'Everything, and follow me',
                                'title': 'Lose Yourself'},
                    '_type': '_doc'}],
          'max_score': 2.4706814,
          'total': {'relation': 'eq', 'value': 1}},
 'timed_out': False,
 'took': 1}
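
The two examples above differ in how term positions are used. The following pure-Python sketch contrasts them on a tokenized snippet; the helper names are hypothetical and the logic is a simplification of what a search engine does with positional information.

```python
def positions(tokens):
    """Map each term to the list of positions where it occurs."""
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    return index

def matches_all(tokens, terms):
    """Boolean AND: every term occurs somewhere, in any order."""
    return all(term in tokens for term in terms)

def matches_phrase(tokens, terms):
    """Phrase: the terms occur at consecutive positions, in the given order."""
    index = positions(tokens)
    return any(
        all(start + i in index.get(term, []) for i, term in enumerate(terms))
        for start in index.get(terms[0], [])
    )

doc = "or one split second in time".split()

print(matches_all(doc, ["second", "split"]))     # True: order does not matter
print(matches_phrase(doc, ["split", "second"]))  # True: adjacent and in order
print(matches_phrase(doc, ["second", "split"]))  # False: wrong order
```

This is why phrase matching depends on position information, whereas a boolean combination only needs to know which terms a document contains.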
