# Elasticsearch

Run this example to index and search a toy-sized collection of documents using Elasticsearch. There is nothing for you to add or complete here; it just makes sure you're all set for the next exercise.

Before starting, make sure that you have:

1. Downloaded and started Elasticsearch
1. Installed the `elasticsearch` Python package
  - It's part of the standard Anaconda distribution; otherwise, you can run `conda install elasticsearch`.

In [None]:
!pip install ipytest
!pip install elasticsearch==7.9.0

In [None]:
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512 

In [None]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

In [None]:
# Sleep for a few seconds to let the instance start.
import time
time.sleep(20)
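
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. As an alternative, here is a minimal sketch of a polling loop. The helper name `wait_until_ready` and its parameters are made up for this illustration; it takes any zero-argument callable that reports readiness.

```python
import time


def wait_until_ready(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # Treat errors raised during startup as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the Python client, the `ping` method of an `Elasticsearch` object should work as the readiness check, e.g. `wait_until_ready(Elasticsearch().ping)`.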

In [None]:
from elasticsearch import Elasticsearch
from pprint import pprint

In [None]:
INDEX_NAME = "toy_index"  # the name of the index

INDEX_SETTINGS = {  # single shard with a single replica
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}

The collection of documents is given here as a Python dictionary. Each document has two fields: title and content.

In [None]:
DOCS = {
    1: {"title": "Rap God",
        "content": "gonna, gonna, Look, I was gonna go easy on you and not to hurt your feelings"
        },
    2: {"title": "Lose Yourself",
        "content": "Yo, if you could just, for one minute Or one split second in time, forget everything Everything that bothers you, or your problems Everything, and follow me"
        },
    3: {"title": "Love The Way You Lie",
        "content": "Just gonna stand there and watch me burn But that's alright, because I like the way it hurts"
        },
    4: {"title": "The Monster",
        "content": ["gonna gonna I'm friends with the monster", "That's under my bed Get along with the voices inside of my head"]
        },
    5: {"title": "Beautiful",
        "content": "Lately I've been hard to reach I've been too long on my own Everybody has a private world Where they can be alone"
        }
}  # Eminem rulez ;)

### Create Elasticsearch object

In [None]:
es = Elasticsearch()

Check that the service is running

In [None]:
es.info()

### Create index

If the index exists, we delete it (normally, you don't want to do this).

In [None]:
if es.indices.exists(INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)

We set the number of shards and replicas to be used for the index when it's created. (We explicitly use a single shard; the default was 5 before Elasticsearch 7.0 and is 1 since then.)

In [None]:
es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)

### Add documents to the index

In [None]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME, doc_type="_doc", id=doc_id, body=doc)

### Check what has been indexed

Get the contents of doc #3

In [None]:
doc = es.get(index=INDEX_NAME, id=3)

In [None]:
pprint(doc)

Get the term vector for doc #3.

`termvectors` returns information and statistics on terms in the fields of a particular document.

In [None]:
tv = es.termvectors(index=INDEX_NAME, doc_type="_doc", id=3, fields="title,content", term_statistics=True)

In [None]:
pprint(tv)

Interpretation of the returned values:
  * `['term_vectors'][{field}]['field_statistics']`: 
    - `doc_count`: how many documents contain this field
    - `sum_ttf`: the sum of all term frequencies in this field
  * `['term_vectors'][{field}]['terms'][{term}]`:
    - `doc_freq`: how many documents contain this term
    - `term_freq`: frequency (number of occurrences) of the term in this document field
    - `ttf`: total term frequency, i.e., number of occurrences of the term in this field in all documents
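
To make these values easier to read than the raw `pprint` output, the response can be flattened into rows. The following is a sketch only: the helper `summarize_term_vector` is a made-up name, and `sample_tv` is a hand-written dictionary that mimics the response layout (fields under `term_vectors`, terms under `terms`) with invented numbers, so it runs without a live Elasticsearch instance.

```python
def summarize_term_vector(tv, field):
    """Return (field_statistics, rows) for one field of a termvectors response.

    rows is a list of (term, term_freq, doc_freq, ttf) tuples sorted by term.
    doc_freq and ttf are None if term statistics were not requested.
    """
    field_tv = tv["term_vectors"][field]
    rows = [
        (term, info["term_freq"], info.get("doc_freq"), info.get("ttf"))
        for term, info in sorted(field_tv["terms"].items())
    ]
    return field_tv["field_statistics"], rows


# Hand-written sample that only mimics the response layout; values are made up.
sample_tv = {
    "term_vectors": {
        "title": {
            "field_statistics": {"doc_count": 5, "sum_doc_freq": 12, "sum_ttf": 12},
            "terms": {
                "love": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
                "the": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            },
        }
    }
}

stats, rows = summarize_term_vector(sample_tv, "title")
print(stats)
for term, term_freq, doc_freq, ttf in rows:
    print(f"{term:10s} term_freq={term_freq} doc_freq={doc_freq} ttf={ttf}")
```

Calling `summarize_term_vector(tv, "content")` on the real response should give the same kind of table for the indexed document.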

Note that Elasticsearch can split an index into multiple shards (the default was 5 before version 7.0 and is 1 since then). When an index has more than one shard, term statistics are computed per shard. For a large collection, this is typically not an issue, as the statistics even out across the different shards and the differences are negligible. For smaller collections that fit on a single disk, you may set the number of shards to 1 to avoid this issue altogether (as we've done in this example in `INDEX_SETTINGS`).

Check the following documents for further information:
  - https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html
  - https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch
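
The per-shard effect described above can be illustrated without Elasticsearch. This is a toy simulation, not how Elasticsearch routes or scores documents: the tokenized collection and the assignment of documents to two shards are both assumptions made for the example.

```python
def doc_freq(term, docs):
    """Number of documents (given as token lists) that contain the term."""
    return sum(1 for tokens in docs if term in tokens)


# Hypothetical toy collection, already tokenized.
docs = [
    ["rap", "god"],
    ["lose", "yourself"],
    ["the", "monster"],
    ["rap", "monster"],
]

# Assumed routing: the first two documents on shard 0, the last two on shard 1.
shards = [docs[:2], docs[2:]]

global_df = doc_freq("monster", docs)
per_shard_df = [doc_freq("monster", shard) for shard in shards]

print("global doc_freq:   ", global_df)
print("per-shard doc_freq:", per_shard_df)
```

Because scoring functions such as BM25 depend on document frequency, the same term can look rare on one shard and common on another when the collection is this small.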

### Search

In [None]:
query = "rap monster"
res = es.search(index=INDEX_NAME, q=query, _source=False, size=10)

Print full response (`hits` holds the results)

In [None]:
pprint(res)

Print only search results (ranked list of docs)

In [None]:
for hit in res["hits"]["hits"]:
    print("Doc ID: %3r  Score: %5.2f" % (hit["_id"], hit["_score"]))

## Elasticsearch query language

Elasticsearch supports structured queries based on its own [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

Mind that certain queries expect analyzed query terms (e.g., [term queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html)), while other query types (e.g., [match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html)) perform analysis as part of the processing. Make sure you check the respective documentation carefully.
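
As a sketch of the difference, the two helpers below build request bodies for a term query and a match query. The helper names are made up for this illustration, and the field name `content` is taken from the toy documents. The example assumes the field's analyzer lowercases tokens, so the term query is given the already lowercased form.

```python
def term_query(field, term):
    """Term query: the term is looked up as-is, without analysis."""
    return {"query": {"term": {field: {"value": term}}}}


def match_query(field, text):
    """Match query: the text is analyzed before matching."""
    return {"query": {"match": {field: {"query": text}}}}


# The index holds lowercased tokens (under the assumption above), so the
# term query uses "monster" rather than "Monster".
term_body = term_query("content", "monster")

# The match query can take raw text; analysis happens on the server side.
match_body = match_query("content", "The Monster")

print(term_body)
print(match_body)
```

Either body can then be passed to `es.search(index=..., body=...)` against an index of your choice.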

### Building a second toy index with position information

In [None]:
INDEX_NAME2 = "toy_index2"  

INDEX_SETTINGS2 = {
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "stopwords": "_english_",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "filter_english_minimal"
                    ]                
                }
            },
            "filter" : {
                "filter_english_minimal" : {
                    "type": "stemmer",
                    "name": "minimal_english"
                },
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "with_positions",
                "analyzer": "my_english_analyzer"
            },
            "content": {
                "type": "text",
                "term_vector": "with_positions",
                "analyzer": "my_english_analyzer"
            }
        }
    }
}

In [None]:
if es.indices.exists(INDEX_NAME2):
    es.indices.delete(index=INDEX_NAME2)
    
es.indices.create(index=INDEX_NAME2, body=INDEX_SETTINGS2)

In [None]:
for doc_id, doc in DOCS.items():
    es.index(index=INDEX_NAME2, doc_type="_doc", id=doc_id, body=doc)

Check that term position information has been added to the index

In [None]:
tv = es.termvectors(index=INDEX_NAME2, doc_type="_doc", id=3, fields="title", term_statistics=True)

pprint(tv)
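
When positions are stored, each term in the response carries a `tokens` list with a `position` entry per occurrence. A small helper can pull these out. This is a sketch: `term_positions` is a made-up name, and `sample_tv` is a hand-written dictionary that only mimics the response layout rather than actual output.

```python
def term_positions(tv, field):
    """Map each term to the sorted list of its token positions in the field."""
    terms = tv["term_vectors"][field]["terms"]
    return {
        # Terms without a "tokens" list (no positions stored) map to [].
        term: sorted(token["position"] for token in info.get("tokens", []))
        for term, info in terms.items()
    }


# Hand-written sample that only mimics the response layout.
sample_tv = {
    "term_vectors": {
        "title": {
            "terms": {
                "love": {"term_freq": 1, "tokens": [{"position": 0}]},
                "way": {"term_freq": 1, "tokens": [{"position": 2}]},
            }
        }
    }
}

positions = term_positions(sample_tv, "title")
print(positions)
```

Running `term_positions(tv, "title")` on the real response should return an empty list for every term if positions were not indexed.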

### Examples

Searching for documents that must match a [boolean combination](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html) of multiple terms (in any order).  

In [None]:
query = {
    "bool": {
        "must": [
            {"match": {"content": "gonna"}}, 
            {"match": {"content": "monster"}}
        ]
    }
}

res = es.search(index=INDEX_NAME2, body={"query": query})

pprint(res)

Searching for documents that match an [exact phrase](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) (terms in that exact order).

In [None]:
query = {"match_phrase": {"content": "split second"}}

res = es.search(index=INDEX_NAME2, body={"query": query})

pprint(res)
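
Conceptually, a phrase match requires the query terms to occur at consecutive positions, which is why position information matters. The pure-Python sketch below illustrates only that idea. It is not how Elasticsearch implements phrase queries, it skips analysis (stopword removal, stemming), and the helper name `contains_phrase` is made up for the example.

```python
def contains_phrase(tokens, phrase):
    """True if the phrase tokens occur at consecutive positions in tokens."""
    n = len(phrase)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


# Whitespace tokenization of a fragment of document 2, lowercased by hand.
tokens = "or one split second in time".split()

in_order = contains_phrase(tokens, ["split", "second"])
reversed_order = contains_phrase(tokens, ["second", "split"])

print(in_order, reversed_order)
```

A bool query with two `match` clauses, by contrast, would accept both terms in any order and at any distance.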