# Elasticsearch

Run this example to index and search a toy-sized collection of documents using Elasticsearch. There is nothing for you to add or complete here; it just makes sure you're all set for the next exercise.

Before starting, make sure that you've 

1. Downloaded and started Elasticsearch
1. Installed the `elasticsearch` Python package
  - It's part of the standard Anaconda distribution; otherwise, you can run `conda install elasticsearch`.

In [1]:
from elasticsearch import Elasticsearch
from pprint import pprint

In [2]:
INDEX_NAME = "toy_index"  # the name of the index

INDEX_SETTINGS = {  # single shard with a single replica
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}

The collection of documents is given here as a Python dictionary. Each document has two fields: title and content.

In [3]:
DOCS = {
    1: {"title": "Rap God",
        "content": "gonna, gonna, Look, I was gonna go easy on you and not to hurt your feelings"
        },
    2: {"title": "Lose Yourself",
        "content": "Yo, if you could just, for one minute Or one split second in time, forget everything Everything that bothers you, or your problems Everything, and follow me"
        },
    3: {"title": "Love The Way You Lie",
        "content": "Just gonna stand there and watch me burn But that's alright, because I like the way it hurts"
        },
    4: {"title": "The Monster",
        "content": ["gonna gonna I'm friends with the monster", "That's under my bed Get along with the voices inside of my head"]
        },
    5: {"title": "Beautiful",
        "content": "Lately I've been hard to reach I've been too long on my own Everybody has a private world Where they can be alone"
        }
}  # Eminem rulez ;)

### Create Elasticsearch object

In [4]:
es = Elasticsearch()

Check if service is running

In [5]:
es.info()

{'name': 'k122-129.ux.uis.no',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': '0LQmPetvQsacAJ8SLmPybA',
 'version': {'number': '7.9.2',
  'build_flavor': 'default',
  'build_type': 'tar',
  'build_hash': 'd34da0ea4a966c4e49417f2da2f244e3e97b4e6e',
  'build_date': '2020-09-23T00:45:33.626720Z',
  'build_snapshot': False,
  'lucene_version': '8.6.2',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

### Create index

If the index exists, we delete it (normally, you don't want to do this).

In [6]:
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)

We set the number of shards and replicas to be used for each index when it's created. (We use a single shard. This has been the default since Elasticsearch 7.0; earlier versions defaulted to 5.)

In [7]:
es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'toy_index'}

### Add documents to the index

In [8]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME, doc_type="_doc", id=doc_id, body=doc)

### Check what has been indexed

Get the contents of doc #3

In [9]:
doc = es.get(index=INDEX_NAME, id=3)

In [10]:
pprint(doc)

{'_id': '3',
 '_index': 'toy_index',
 '_primary_term': 1,
 '_seq_no': 2,
 '_source': {'content': "Just gonna stand there and watch me burn But that's "
                        'alright, because I like the way it hurts',
             'title': 'Love The Way You Lie'},
 '_type': '_doc',
 '_version': 1,
 'found': True}


Get the term vector for doc #3.

`termvectors` returns information and statistics on terms in the fields of a particular document.

In [11]:
tv = es.termvectors(index=INDEX_NAME, doc_type="_doc", id=3, fields="title,content", term_statistics=True)

In [12]:
pprint(tv)

{'_id': '3',
 '_index': 'toy_index',
 '_type': '_doc',
 '_version': 1,
 'found': True,
 'term_vectors': {'content': {'field_statistics': {'doc_count': 5,
                                                   'sum_doc_freq': 91,
                                                   'sum_ttf': 104},
                              'terms': {'alright': {'doc_freq': 1,
                                                    'term_freq': 1,
                                                    'tokens': [{'end_offset': 59,
                                                                'position': 10,
                                                                'start_offset': 52}],
                                                    'ttf': 1},
                                        'and': {'doc_freq': 3,
                                                'term_freq': 1,
                                                'tokens': [{'end_offset': 26,
                                                        

Interpretation of the returned values:
  * `[{field}]['field_statistics']`:
    - `doc_count`: how many documents contain this field
    - `sum_doc_freq`: the sum of the document frequencies of all terms in this field
    - `sum_ttf`: the sum of the total term frequencies of all terms in this field
  * `[{field}]['terms'][{term}]`:
    - `doc_freq`: how many documents contain this term
    - `term_freq`: frequency (number of occurrences) of the term in this document field
    - `ttf`: total term frequency, i.e., number of occurrences of the term in this field in all documents
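To make these statistics concrete, here is a minimal pure-Python sketch that computes them for a single field of a tiny collection. It does not call Elasticsearch; `TOY_DOCS`, `tokenize`, `field_statistics`, and `term_statistics` are hypothetical names, and the tokenizer is a naive stand-in for a real analyzer.

```python
import re
from collections import Counter

# Hypothetical mini-collection: one text field per document.
TOY_DOCS = {
    1: "gonna gonna go easy",
    2: "go easy on me",
    3: "easy does it",
}


def tokenize(text):
    """Naive stand-in for an analyzer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def field_statistics(docs):
    """Collection-level statistics for the field."""
    doc_freq = Counter()  # term -> number of documents containing it
    ttf = Counter()       # term -> number of occurrences in all documents
    for text in docs.values():
        tokens = tokenize(text)
        ttf.update(tokens)
        doc_freq.update(set(tokens))
    return {
        "doc_count": len(docs),  # every toy document has the field
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
        "doc_freq": doc_freq,
        "ttf": ttf,
    }


def term_statistics(docs, doc_id, term):
    """Statistics of one term, relative to one document."""
    stats = field_statistics(docs)
    return {
        "doc_freq": stats["doc_freq"][term],
        "term_freq": tokenize(docs[doc_id]).count(term),
        "ttf": stats["ttf"][term],
    }


stats = field_statistics(TOY_DOCS)
gonna = term_statistics(TOY_DOCS, 1, "gonna")
print(stats["doc_count"], stats["sum_doc_freq"], stats["sum_ttf"])
print(gonna)
```

Here "gonna" occurs twice in document 1 and nowhere else, so its `term_freq` and `ttf` are both 2 while its `doc_freq` is 1.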

Note that Elasticsearch can split an index into multiple shards (the default was 5 before version 7.0 and has been 1 since). This means that when you ask for term statistics, these are computed per shard. In case of a large collection, this is typically not an issue, as the statistics even out across the different shards and the differences are negligible. For smaller collections that fit on a single disk, you may set the number of shards to 1 to avoid this issue altogether (as we've done in this example in `INDEX_SETTINGS`).
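The effect of sharding on term statistics can be illustrated with a small simulation. This is only a sketch: `TOKENIZED`, `doc_freq`, and `split_into_shards` are hypothetical names, and the modulo routing is a simplification (Elasticsearch hashes a routing value to pick the shard).

```python
from collections import Counter

# Hypothetical, already tokenized documents (doc_id -> tokens).
TOKENIZED = {
    1: ["rap", "god"],
    2: ["lose", "yourself"],
    3: ["rap", "monster"],
    4: ["the", "monster"],
}


def doc_freq(doc_ids):
    """Document frequency of each term over the given documents."""
    df = Counter()
    for doc_id in doc_ids:
        df.update(set(TOKENIZED[doc_id]))
    return df


def split_into_shards(doc_ids, num_shards):
    """Toy routing: assign each document to shard doc_id % num_shards."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[doc_id % num_shards].append(doc_id)
    return shards


single = doc_freq(TOKENIZED)  # one shard sees all documents
per_shard = [doc_freq(shard) for shard in split_into_shards(TOKENIZED, 2)]

print(single["rap"], [df["rap"] for df in per_shard])
```

With two shards, one shard never sees "rap" at all, so a score computed there would use a different document frequency than the single-shard index does.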

Check the following documents for further information:
  - https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html
  - https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch

### Search

In [13]:
query = "rap monster"
res = es.search(index=INDEX_NAME, q=query, _source=False, size=10)

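Print full response (`hits` holds the results). If the hit list comes back empty, as in the output below, the search most likely ran before the newly indexed documents became searchable: Elasticsearch refreshes an index periodically (every second by default), and calling `es.indices.refresh(index=INDEX_NAME)` after indexing forces a refresh.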

In [14]:
pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [],
          'max_score': None,
          'total': {'relation': 'eq', 'value': 0}},
 'timed_out': False,
 'took': 2}


Print only search results (ranked list of docs)

In [15]:
for hit in res["hits"]["hits"]:
    print("Doc ID: %3r  Score: %5.2f" % (hit["_id"], hit["_score"]))

## Elasticsearch query language

Elasticsearch supports structured queries based on its own [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

Note that certain queries expect query terms that have already been analyzed (e.g., [term queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html), which do not analyze their input), while other query types (e.g., [match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html)) perform analysis as part of the processing. Make sure you check the respective documentation carefully.
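The difference can be sketched without a running server. The snippet below is a simulation, not Elasticsearch code: `analyze`, `term_lookup`, and `match_lookup` are hypothetical helpers, and the analyzer only splits on non-letters and lowercases.

```python
import re


def analyze(text):
    """Stand-in for an analyzer: split on non-letters and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z']+", text)]


# Inverted index built from analyzed titles (term -> set of doc IDs).
TITLES = {1: "Rap God", 4: "The Monster"}
inverted = {}
for doc_id, title in TITLES.items():
    for term in analyze(title):
        inverted.setdefault(term, set()).add(doc_id)


def term_lookup(term):
    """Like a term query: the input is looked up as-is, without analysis."""
    return inverted.get(term, set())


def match_lookup(text):
    """Like a match query with the default OR operator: analyze, then look up each term."""
    hits = set()
    for term in analyze(text):
        hits |= inverted.get(term, set())
    return hits


print(term_lookup("Monster"))   # no hit: the index only holds "monster"
print(term_lookup("monster"))
print(match_lookup("Monster"))
```

The capitalized input finds nothing through `term_lookup` because the indexed terms were lowercased at indexing time, whereas `match_lookup` applies the same analysis to the query first.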

### Building a second toy index with position information

In [16]:
INDEX_NAME2 = "toy_index2"  

INDEX_SETTINGS2 = {
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "stopwords": "_english_",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "filter_english_minimal"
                    ]                
                }
            },
            "filter" : {
                "filter_english_minimal" : {
                    "type": "stemmer",
                    "name": "minimal_english"
                },
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "with_positions",
                "analyzer": "my_english_analyzer"
            },
            "content": {
                "type": "text",
                "term_vector": "with_positions",
                "analyzer": "my_english_analyzer"
            }
        }
    }
}
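To get a feel for what the analyzer chain in these settings does, here is a rough pure-Python approximation. It is a sketch under several assumptions: the regular expression only imitates the `standard` tokenizer, `ENGLISH_STOP` is assumed to match Lucene's default English stop word set that `_english_` refers to, and `minimal_stem` is a simplified plural stripper standing in for the `minimal_english` stemmer. All function names are hypothetical.

```python
import re

# Assumed contents of the default English stop word set.
ENGLISH_STOP = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def minimal_stem(token):
    """Simplified plural stripping; an approximation, not the actual stemmer."""
    if len(token) < 3 or not token.endswith("s"):
        return token
    if token[-2] in "us":  # e.g. "focus", "class"
        return token
    if token.endswith("ies") and len(token) > 3 and token[-4] not in "ae":
        return token[:-3] + "y"
    if token.endswith("es") and token[-3] in "iaoe":
        return token
    return token[:-1]


def analyze(text):
    """Return (position, term) pairs; removed stop words leave position gaps."""
    result = []
    for position, token in enumerate(re.findall(r"\w+(?:'\w+)*", text)):
        token = token.lower()
        if token in ENGLISH_STOP:
            continue
        result.append((position, minimal_stem(token)))
    return result


print(analyze("Love The Way You Lie"))
```

Note that removing "The" leaves a gap at position 1, which is why the remaining terms keep their original positions.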

In [17]:
if es.indices.exists(index=INDEX_NAME2):
    es.indices.delete(index=INDEX_NAME2)
    
es.indices.create(index=INDEX_NAME2, body=INDEX_SETTINGS2)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'toy_index2'}

In [18]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME2, doc_type="_doc", id=doc_id, body=doc)

Check that term position information has been added to the index

In [19]:
tv = es.termvectors(index=INDEX_NAME2, doc_type="_doc", id=3, fields="title", term_statistics=True)

pprint(tv)

{'_id': '3',
 '_index': 'toy_index2',
 '_type': '_doc',
 '_version': 1,
 'found': True,
 'term_vectors': {'title': {'field_statistics': {'doc_count': 5,
                                                 'sum_doc_freq': 10,
                                                 'sum_ttf': 10},
                            'terms': {'lie': {'doc_freq': 1,
                                              'term_freq': 1,
                                              'tokens': [{'position': 4}],
                                              'ttf': 1},
                                      'love': {'doc_freq': 1,
                                               'term_freq': 1,
                                               'tokens': [{'position': 0}],
                                               'ttf': 1},
                                      'way': {'doc_freq': 1,
                                              'term_freq': 1,
                                              'tokens': [{'position': 2}],
 

### Examples

Searching for documents that must match a [boolean combination](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html) of multiple terms (in any order).  

In [20]:
query = {
    "bool": {
        "must": [
            {"match": {"content": "gonna"}}, 
            {"match": {"content": "monster"}}
        ]
    }
}

res = es.search(index=INDEX_NAME2, body={"query": query})

pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [],
          'max_score': None,
          'total': {'relation': 'eq', 'value': 0}},
 'timed_out': False,
 'took': 1}
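When the list of required terms is not fixed, a query of this shape can be assembled programmatically. The helper below is a hypothetical convenience function (`bool_must` is not part of the `elasticsearch` package); it only builds the dictionary and does not contact the server.

```python
def bool_must(field, terms):
    """Build a bool query in which every term must match in `field`."""
    return {"bool": {"must": [{"match": {field: term}} for term in terms]}}


built = bool_must("content", ["gonna", "monster"])
print(built)
```

The result has the same structure as the hand-written query and could be passed to the search call in the same way, as `body={"query": built}`.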


Searching for documents that match an [exact phrase](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) (terms in that exact order).

In [21]:
query = {"match_phrase": {"content": "split second"}}

res = es.search(index=INDEX_NAME2, body={'query': query})

pprint(res)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '2',
                    '_index': 'toy_index2',
                    '_score': 1.7076306,
                    '_source': {'content': 'Yo, if you could just, for one '
                                           'minute Or one split second in '
                                           'time, forget everything Everything '
                                           'that bothers you, or your problems '
                                           'Everything, and follow me',
                                'title': 'Lose Yourself'},
                    '_type': '_doc'}],
          'max_score': 1.7076306,
          'total': {'relation': 'eq', 'value': 1}},
 'timed_out': False,
 'took': 2}
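Phrase matching relies on the term positions stored in the index. The following pure-Python sketch shows the idea on a list of tokens; it illustrates the principle only and is not how Elasticsearch implements it. `positions` and `matches_phrase` are hypothetical names.

```python
def positions(tokens):
    """Map each term to the list of positions where it occurs."""
    index = {}
    for pos, term in enumerate(tokens):
        index.setdefault(term, []).append(pos)
    return index


def matches_phrase(tokens, phrase):
    """True if the phrase terms occur at consecutive positions, in order."""
    if not phrase:
        return False
    index = positions(tokens)
    first, *rest = phrase
    for start in index.get(first, []):
        # Each following term must sit exactly one position further along.
        if all(start + offset in index.get(term, [])
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


tokens = "yo if you could just for one minute or one split second in time".split()
print(matches_phrase(tokens, ["split", "second"]))
print(matches_phrase(tokens, ["second", "split"]))
```

Both terms appear in the text in each case, but only the first phrase has them at adjacent positions in the right order.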
