# Elasticsearch: indexing and queries

### Resources

* https://marcobonzanini.com/2015/06/22/tuning-relevance-in-elasticsearch-with-custom-boosting/
* https://readthedocs.org/projects/elasticsearch-dsl/downloads/pdf/stable/
* https://www.elastic.co/blog/easier-relevance-tuning-elasticsearch-7-0

In [6]:
# Import dependencies
import os
import json
import sys
from elasticsearch import Elasticsearch
import wikipediaapi
from slugify import slugify
from pprint import pprint

### Connect to Elasticsearch

In [7]:
client = Elasticsearch("http://localhost:9200")

In [3]:
wiki_wiki = wikipediaapi.Wikipedia('en')

### Find wikipedia articles from category

In [None]:
def print_categorymembers(categorymembers, level=0, max_level=2):
    # Recursively print category members, descending into subcategories up to max_level
    for c in categorymembers.values():
        print("%s: %s (ns: %d)" % ("*" * (level + 1), c.title, c.ns))
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            print_categorymembers(c.categorymembers, level=level + 1, max_level=max_level)


# cat = wiki_wiki.page(f"Category:{category[2]}")
# print_categorymembers(cat.categorymembers)

## Default index mapping and simple search

In [4]:
# Check database for existing indices
client.indices.get_alias("_all")

{'marvel-comics': {'aliases': {}},
 'natural-languge-processing': {'aliases': {}},
 'coronaviridae': {'aliases': {}},
 'presidents-of-the-united-states': {'aliases': {}},
 'american-comics-writers': {'aliases': {}},
 'american-science-fiction-television-series': {'aliases': {}},
 'pandemics': {'aliases': {}},
 'american-comedy-webcomics': {'aliases': {}},
 'marvel-comics-editors-in-chief': {'aliases': {}}}

In [None]:
### Clear database and check if it is empty
# new_ind.delete(client, index="_all")
# new_ind.get_aliases(client)

### Populate database using default mapping

In [None]:
def simple_wiki_doc(category, client):
    """Index every article of the given category (or list of categories) using the default mapping."""
    if not isinstance(category, list):
        category = [category]

    wiki_wiki = wikipediaapi.Wikipedia('en')

    for c in category:
        cat = wiki_wiki.page(f"Category:{c}")

        for key in cat.categorymembers.keys():
            page = wiki_wiki.page(key)

            # Skip subcategories; only article pages are indexed
            if "Category:" not in page.title:
                # Create an Elasticsearch entry with the default index structure
                doc = {"title": page.title,
                       "page_id": page.pageid,
                       "source": page.fullurl,
                       "text": page.text
                }

                client.index(index=slugify(c), body=doc)

# simple_wiki_doc(category, client)

### Check default data structure

In [5]:
# Check an index mapping
pprint(client.indices.get_mapping("coronaviridae"))

{'coronaviridae': {'mappings': {'properties': {'page_id': {'type': 'long'},
                                               'source': {'fields': {'keyword': {'ignore_above': 256,
                                                                                 'type': 'keyword'}},
                                                          'type': 'text'},
                                               'text': {'fields': {'keyword': {'ignore_above': 256,
                                                                               'type': 'keyword'}},
                                                        'type': 'text'},
                                               'title': {'fields': {'keyword': {'ignore_above': 256,
                                                                                'type': 'keyword'}},
                                                         'type': 'text'}}}}}


## Experimenting with search queries

Elasticsearch query types: match, term and range

- **Match query is the standard query for full-text search. It is performed against analyzed text.**
- Term query looks for an exact match of the value as stored in the index (the query value is not analyzed)
- Range query finds values within given bounds, such as numbers or dates
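As a rough sketch of how the three query types differ, the request bodies below target fields from the default mapping shown earlier. The `title.keyword` subfield and the numeric `page_id` field come from that mapping; the page id bounds are arbitrary values chosen only for illustration.

```python
# Full-text query: the query string is analyzed before matching.
match_body = {"query": {"match": {"text": "barack obama"}}}

# Exact-value query: runs against the un-analyzed `keyword` subfield,
# so case and whitespace must match the stored title exactly.
term_body = {"query": {"term": {"title.keyword": "Barack Obama"}}}

# Range query: the bounds are illustrative, not taken from the data.
range_body = {"query": {"range": {"page_id": {"gte": 100000, "lte": 300000}}}}

for name, body in [("match", match_body), ("term", term_body), ("range", range_body)]:
    print(name, body)
```

Each body could be passed to `client.search` in the same way as the queries in the following cells.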

In [10]:

index = ['presidents-of-the-united-states', 
         'marvel-comics-editors-in-chief', 
         'marvel-comics',
         'coronaviridae']

In [11]:
question = "When was Barak Obama inaugurated?"

### Simple full text search

In [12]:
body = {"query": {"match": {"text": question}}}

The `_search` API response to a query includes several metadata fields, but we are mostly interested in the relevance score (`_score`).

In [9]:
docs = client.search(body, index=index)

print("Example data structure")
print("----------------------")
pprint(docs['hits']['hits'][6])

Example data structure
----------------------
{'_id': 'aFzHFHEB8JJg9As-SDdp',
 '_index': 'coronaviridae',
 '_score': 3.4372873,
 '_source': {'page_id': 201983,
             'source': 'https://en.wikipedia.org/wiki/Coronavirus',
             'text': 'Coronaviruses are a group of related viruses that cause '
                     'diseases in mammals and birds. In humans, coronaviruses '
                     'cause respiratory tract infections that can be mild, '
                     'such as some cases of the common cold (among other '
                     'possible causes, predominantly rhinoviruses), and others '
                     'that can be lethal, such as SARS, MERS, and COVID-19. '
                     'Symptoms in other species vary: in chickens, they cause '
                     'an upper respiratory tract disease, while in cows and '
                     'pigs they cause diarrhea. There are yet to be vaccines '
                     'or antiviral drugs to prevent or treat huma

                     'RNA viruses. The boundaries of cis-acting elements '
                     'essential to replication are fairly well-defined, and '
                     'the RNA secondary structures of these regions are '
                     'understood. However, how these cis-acting structures and '
                     'sequences interact with the viral replicase and host '
                     'cell components to allow RNA synthesis is not well '
                     'understood.\n'
                     '\n'
                     'Genome packaging\n'
                     'The assembly of infectious coronavirus particles '
                     'requires the selection of viral genomic RNA from a '
                     'cellular pool that contains an abundant excess of '
                     'non-viral and viral RNAs. Among the seven to ten '
                     'specific viral mRNAs synthesized in virus-infected '
                     'cells, only the full-length genomic RNA is 

In [13]:
print(f"Question: {question}")
print("")
print("Search results:")
print("----------------------")

for i, doc in enumerate(docs["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    idx = doc['_index']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score} | Index {idx}')
    print(f'    Link: {url}')

Question: When was Barak Obama inaugurated?

Search results:
----------------------
Result 0: Bill Clinton | Relevance score 7.305959 | Index presidents-of-the-united-states
    Link: https://en.wikipedia.org/wiki/Bill_Clinton
Result 1: Barack Obama | Relevance score 4.0938215 | Index presidents-of-the-united-states
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 2: Donald Trump | Relevance score 3.9640172 | Index presidents-of-the-united-states
    Link: https://en.wikipedia.org/wiki/Donald_Trump
Result 3: Jimmy Carter | Relevance score 3.5737386 | Index presidents-of-the-united-states
    Link: https://en.wikipedia.org/wiki/Jimmy_Carter
Result 4: President of the United States | Relevance score 3.5697677 | Index presidents-of-the-united-states
    Link: https://en.wikipedia.org/wiki/President_of_the_United_States
Result 5: Coronavirus | Relevance score 3.4372873 | Index coronaviridae
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 6: Coronavirinae | Relevance s
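
The result-printing loop recurs for every query in this notebook, so it may be convenient to factor it out. The helper below is a minimal sketch: `format_hits` and the stand-in `fake_response` are hypothetical names, and the response shape is assumed to match the output shown above.

```python
def format_hits(response, show_index=False):
    """Return printable lines (summary line + link line) for each search hit."""
    lines = []
    for i, hit in enumerate(response["hits"]["hits"]):
        source = hit["_source"]
        line = f"Result {i}: {source['title']} | Relevance score {hit['_score']}"
        if show_index:
            line += f" | Index {hit['_index']}"
        lines.append(line)
        lines.append(f"    Link: {source['source']}")
    return lines


# Stand-in response with the same shape as a real one (values copied from the output above).
fake_response = {"hits": {"hits": [
    {"_index": "presidents-of-the-united-states",
     "_score": 4.0938215,
     "_source": {"title": "Barack Obama",
                 "source": "https://en.wikipedia.org/wiki/Barack_Obama"}},
]}}

print("\n".join(format_hits(fake_response, show_index=True)))
```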

### Match phrase

`match_phrase` query:
* All the terms must appear in the field, in the same order as in the query
* A custom query analyzer can be specified

In [26]:
body0 = \
    {
      "query": {"match_phrase": 
                    {"title": "Barack Obama"}
               }
    }
docs0 = client.search(body0, index=index)

print(f"Question: {question}")
print("")
print("Search results:")
print("----------------------")
for i, doc in enumerate(docs0['hits']['hits']):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When was Barak Obama inaugurated?

Search results:
----------------------
Result 0: Barack Obama | Relevance score 7.8283415
    Link: https://en.wikipedia.org/wiki/Barack_Obama
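
The strict ordering of `match_phrase` can be relaxed with the `slop` parameter, and the `analyzer` parameter sets the query-time analyzer mentioned above. The builder below is a sketch under those assumptions; `phrase_query` is a hypothetical helper, not part of the Elasticsearch client.

```python
def phrase_query(field, phrase, slop=0, analyzer=None):
    """Build a match_phrase request body for one field."""
    params = {"query": phrase, "slop": slop}
    if analyzer is not None:
        # Overrides the analyzer applied to the query string only.
        params["analyzer"] = analyzer
    return {"query": {"match_phrase": {field: params}}}


# With slop=1 the terms may be one position apart, so a title such as
# "Barack Hussein Obama" would also be allowed to match.
phrase_body = phrase_query("title", "Barack Obama", slop=1)
print(phrase_body)
```

The resulting dictionary can be passed to `client.search` like `body0` above.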


### Multi-field search

In [16]:
body1 = \
{
  "query": {
    "multi_match" : {
      "query":    question, 
      "fields": [ "title", "text" ] 
    }
  }
}

docs1 = client.search(body1, index=index)

print(f"Question: {question}")
print("")
print("Search results:")
print("----------------------")
for i, doc in enumerate(docs1["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When was Barak Obama inaugurated?

Search results:
----------------------
Result 0: Bill Clinton | Relevance score 7.305959
    Link: https://en.wikipedia.org/wiki/Bill_Clinton
Result 1: Barack Obama | Relevance score 4.0938215
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 2: Donald Trump | Relevance score 3.9640172
    Link: https://en.wikipedia.org/wiki/Donald_Trump
Result 3: Jimmy Carter | Relevance score 3.5737386
    Link: https://en.wikipedia.org/wiki/Jimmy_Carter
Result 4: President of the United States | Relevance score 3.5697677
    Link: https://en.wikipedia.org/wiki/President_of_the_United_States
Result 5: Coronavirus | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 6: Coronavirinae | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 7: George H. W. Bush | Relevance score 3.091462
    Link: https://en.wikipedia.org/wiki/George_H._W._Bush
Result 8: George W. Bush | Relevance score

The Wikipedia article "Barack Obama" is still not at the top of the list. This is because other articles may also mention Barack Obama many times.
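
The scores above are identical to those of the text-only `match` query, which is probably explained by how `multi_match` combines fields. Its default type is `best_fields`: the document score is the score of the best-matching field plus `tie_breaker` (default 0.0) times the scores of the other matching fields. The sketch below reproduces that arithmetic in plain Python; the per-field title score is a made-up value used only for illustration.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields multi_match does."""
    scores = [s for s in field_scores.values() if s > 0]
    if not scores:
        return 0.0
    best = max(scores)
    # Other matching fields only contribute through the tie breaker.
    return best + tie_breaker * (sum(scores) - best)


# Hypothetical per-field scores: the text score is taken from the output
# above, the title score is invented.
scores = {"title": 1.2, "text": 4.0938215}

print(best_fields_score(scores))                   # only the best field counts
print(best_fields_score(scores, tie_breaker=0.3))  # other fields add a fraction
```

With the default tie breaker, a weak title match adds nothing to the score, so the ranking stays the same as for the text-only query.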

### Dynamic boosting

Dynamic boosting makes it possible to give some document fields more weight at query time.

In [17]:
body2 = \
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": {
                  "query": question,
                  "boost": 3
                }
              }
            },
            {
              "match": { 
                "text": question
              }
            }
          ]
        }
      }
    }

docs2 = client.search(body2, index=index)
for i, doc in enumerate(docs2["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Result 0: Barack Obama | Relevance score 15.836334
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 1: Bill Clinton | Relevance score 7.305959
    Link: https://en.wikipedia.org/wiki/Bill_Clinton
Result 2: Donald Trump | Relevance score 3.9640172
    Link: https://en.wikipedia.org/wiki/Donald_Trump
Result 3: Jimmy Carter | Relevance score 3.5737386
    Link: https://en.wikipedia.org/wiki/Jimmy_Carter
Result 4: President of the United States | Relevance score 3.5697677
    Link: https://en.wikipedia.org/wiki/President_of_the_United_States
Result 5: Coronavirus | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 6: Coronavirinae | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 7: George H. W. Bush | Relevance score 3.091462
    Link: https://en.wikipedia.org/wiki/George_H._W._Bush
Result 8: George W. Bush | Relevance score 3.053765
    Link: https://en.wikipedia.org/wiki/George_W._Bush
Result 9: Bat-borne
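
A `bool` query adds up the scores of its matching `should` clauses. For the "Barack Obama" article this suggests that the boosted title clause contributes roughly 15.836334 − 4.0938215 ≈ 11.74 on top of the text score seen earlier. The sketch below builds the same kind of query from a field-to-boost mapping; `boosted_should_query` is a hypothetical helper name.

```python
def boosted_should_query(question, boosts):
    """Build a bool/should query with one boosted match clause per field."""
    clauses = [
        {"match": {field: {"query": question, "boost": boost}}}
        for field, boost in boosts.items()
    ]
    return {"query": {"bool": {"should": clauses}}}


# A boost of 1 is the default weight, so the text clause behaves like a plain match.
boosted_body = boosted_should_query(
    "When was Barak Obama inaugurated?", {"title": 3, "text": 1}
)
print(boosted_body)

# Rough check of the title clause contribution, using scores printed above.
print(15.836334 - 4.0938215)
```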

### Multi-field search with boosting - short version

In [18]:
body3 = \
{
  "query": {
    "multi_match" : {
      "query":    question, 
      "fields": [ "title^3", "text" ] 
    }
  }
}

docs3 = client.search(body3, index=index)
for i, doc in enumerate(docs3["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Result 0: Barack Obama | Relevance score 11.742513
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 1: Bill Clinton | Relevance score 7.305959
    Link: https://en.wikipedia.org/wiki/Bill_Clinton
Result 2: Donald Trump | Relevance score 3.9640172
    Link: https://en.wikipedia.org/wiki/Donald_Trump
Result 3: Jimmy Carter | Relevance score 3.5737386
    Link: https://en.wikipedia.org/wiki/Jimmy_Carter
Result 4: President of the United States | Relevance score 3.5697677
    Link: https://en.wikipedia.org/wiki/President_of_the_United_States
Result 5: Coronavirus | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 6: Coronavirinae | Relevance score 3.4372873
    Link: https://en.wikipedia.org/wiki/Coronavirus
Result 7: George H. W. Bush | Relevance score 3.091462
    Link: https://en.wikipedia.org/wiki/George_H._W._Bush
Result 8: George W. Bush | Relevance score 3.053765
    Link: https://en.wikipedia.org/wiki/George_W._Bush
Result 9: Bat-borne

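A probable reason for the coronavirus articles being high on the list is the way the text is tokenized. Note that the N-gram tokenizer is not the default: `text` fields use the `standard` analyzer unless the mapping says otherwise, and that analyzer does not remove stop words by default, so common question words such as "when" and "was" can match almost any article. For comparison, the `ngram` tokenizer with its default settings analyzes the input text and produces letter N-grams with N of 1 and 2: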
```
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

```
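
To make the N-gram output above concrete, here is a minimal pure-Python sketch that mimics character N-grams of length 1 and 2. It is only an emulation for illustration, not the Elasticsearch implementation, and `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    """Emit character N-grams for every start position, shortest first."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


tokens = char_ngrams("Quick Fox")
print(tokens)
```

Because every single letter becomes a token, almost any two texts share some N-grams, which is why such a tokenizer tends to produce many weak matches.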

## Customizing indexing - mappings

The Qary chatbot uses a BERT or ALBERT model to generate answers from context.
For it to perform better, we want to give it only short, highly relevant context. Storing articles in sections may improve performance.

To store Wikipedia articles in sections, we define the text field as a nested datatype:

In [19]:
mapping = {
    "properties": {
        # Each article is stored as a list of section objects
        "text": {
            "type": "nested",
            "properties": {
                "section_num": {"type": "integer"},
                "section_title": {"type": "text"},
                "section_content": {"type": "text"}
            }
        },
        "title": {"type": "text"},
        "source": {"type": "text"},
        "page_id": {"type": "long"}
    }
}


In [20]:
pprint(mapping)

{'properties': {'page_id': {'type': 'long'},
                'source': {'type': 'text'},
                'text': {'properties': {'section_content': {'type': 'text'},
                                        'section_num': {'type': 'integer'},
                                        'section_title': {'type': 'text'}},
                         'type': 'nested'},
                'title': {'type': 'text'}}}


In [21]:
# Check indices existing in the database
client.indices.get_alias("")

{'natural-languge-processing': {'aliases': {}},
 'american-comedy-webcomics': {'aliases': {}},
 'coronaviridae': {'aliases': {}},
 'presidents-of-the-united-states': {'aliases': {}},
 'american-comics-writers': {'aliases': {}},
 'pandemics': {'aliases': {}},
 'american-science-fiction-television-series': {'aliases': {}},
 'marvel-comics': {'aliases': {}},
 'marvel-comics-editors-in-chief': {'aliases': {}}}

### Create some new indices

In [3]:
index1 = [
    'natural-languge-processing',
    'american-comedy-webcomics',
    'american-comics-writers',
    'american-science-fiction-television-series',
    'pandemics'
]

In [23]:
# Check the mapping of an index that was created with the custom mapping
client.indices.get_mapping('natural-languge-processing')

{'natural-languge-processing': {'mappings': {'properties': {'page_id': {'type': 'long'},
    'source': {'type': 'text'},
    'text': {'type': 'nested',
     'properties': {'section_content': {'type': 'text'},
      'section_num': {'type': 'integer'},
      'section_title': {'type': 'text'}}},
    'title': {'type': 'text'}}}}}

In [None]:
def build_entry(page):
    """Convert a Wikipedia page into a document that matches the nested mapping."""
    # Section 0 holds the page summary; the remaining sections follow in order
    sections = [{"section_num": 0,
                 "section_title": "Summary",
                 "section_content": page.summary}]
    for i, section in enumerate(page.sections, start=1):
        sections.append({"section_num": i,
                         "section_title": section.title,
                         "section_content": section.text})

    return {'text': sections,
            'title': page.title,
            'source': page.fullurl,
            'page_id': page.pageid,
            }
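
To see the document shape that `build_entry` produces without calling Wikipedia, the sketch below repeats the function in compact form and feeds it a stand-in page object. The `SimpleNamespace` objects and their contents are made up; only the attribute names (`summary`, `sections`, `title`, `text`, `fullurl`, `pageid`) mirror those used above.

```python
from pprint import pprint
from types import SimpleNamespace


def build_entry(page):
    """Compact copy of the function above, repeated so the example is self-contained."""
    sections = [{"section_num": 0,
                 "section_title": "Summary",
                 "section_content": page.summary}]
    for i, section in enumerate(page.sections, start=1):
        sections.append({"section_num": i,
                         "section_title": section.title,
                         "section_content": section.text})
    return {"text": sections, "title": page.title,
            "source": page.fullurl, "page_id": page.pageid}


# Made-up page with two sections, standing in for a wikipediaapi page object.
fake_page = SimpleNamespace(
    title="Example",
    summary="Short summary.",
    fullurl="https://example.org/wiki/Example",
    pageid=1,
    sections=[SimpleNamespace(title="History", text="Some history."),
              SimpleNamespace(title="Usage", text="Some usage notes.")],
)

entry = build_entry(fake_page)
pprint(entry)
```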

In [None]:
def search_insert_wiki(category, mapping):
    """Create one index per category with the given mapping and fill it with articles."""
    if not isinstance(category, list):
        category = [category]

    wiki_wiki = wikipediaapi.Wikipedia('en')

    for c in category:
        # Create an empty index with the predefined data structure
        client.indices.create(index=slugify(c), body={"mappings": mapping})
        print(f"Index {c} has been created")

        cat = wiki_wiki.page(f"Category:{c}")

        for key in cat.categorymembers.keys():
            page = wiki_wiki.page(key)

            # Skip subcategories; only article pages are indexed
            if "Category:" not in page.title:
                # Build a dictionary and add it to the index
                doc = build_entry(page)
                try:
                    client.index(index=slugify(c), body=doc)
                    print(f"Success! New document was added to {c} index")
                except Exception as err:
                    print(f"Something went wrong: {err}")

# search_insert_wiki('Pandemics', mapping)
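
Indexing one document per request can be slow for large categories. As a possible alternative, the sketch below prepares actions for the `bulk` helper from `elasticsearch.helpers`. The function names `bulk_actions` and `bulk_insert` are hypothetical, and the sample documents are invented.

```python
def bulk_actions(index_name, docs):
    """Yield one bulk action per document, targeting a single index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_insert(client, index_name, docs):
    """Send all documents in bulk requests instead of one request per document."""
    # Imported here so bulk_actions can be tried without the client library installed.
    from elasticsearch.helpers import bulk
    return bulk(client, bulk_actions(index_name, docs))


# Invented documents, just to show the shape of the generated actions.
sample_docs = [{"title": "A", "page_id": 1}, {"title": "B", "page_id": 2}]
actions = list(bulk_actions("pandemics", sample_docs))
print(actions)
```

In `search_insert_wiki`, the documents returned by `build_entry` could be collected into a list and passed to `bulk_insert` once per category.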

### Querying nested data

In [1]:
question1 = "stan lee"

In [12]:
body4={
    "query": {
        "nested":{
            "path":"text",
            "query":{
                "bool": {
                "must": [
                    {"match":{"text.section_content":question1}},
                    {"match":{"text.section_title":"Summary"}},
                    ]
                }
            }
        }
    }
}


docs4 = client.search(body4, index=index1)

for i, doc in enumerate(docs4["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    pageid =doc['_source']['page_id']
    ind = doc["_index"]
    print(f'Result {i}: {title} | Relevance score {score} | {pageid} | {ind}')
    print(f'    Link: {url}')

Result 0: Carl Wessler | Relevance score 11.118463 | 12294922 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Carl_Wessler
Result 1: Don Rico | Relevance score 10.556743 | 2780362 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Don_Rico
Result 2: Jim Salicrup | Relevance score 10.427067 | 12895497 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Jim_Salicrup
Result 3: Colleen Doran | Relevance score 10.426506 | 261414 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Colleen_Doran
Result 4: Suburban Tribe | Relevance score 9.495357 | 3928893 | american-comedy-webcomics
    Link: https://en.wikipedia.org/wiki/Suburban_Tribe
Result 5: Stan Lee | Relevance score 8.471656 | 18598186 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Stan_Lee
Result 6: Stan Lee | Relevance score 8.471656 | 18598186 | american-comics-writers
    Link: https://en.wikipedia.org/wiki/Stan_Lee
Result 7: Stan Sakai | Relevance sco
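
Since the goal is to hand the model a short context, it may be more useful to retrieve the matching section than the whole article. The nested query accepts an `inner_hits` option for this. The sketch below builds such a query and reads the sections from a response; `nested_section_query` and `best_sections` are hypothetical helpers, the stand-in response is invented, and it assumes that each inner hit's `_source` holds just the nested section object.

```python
def nested_section_query(text, path="text", size=1):
    """Nested query that also returns the best-matching sections per document."""
    return {"query": {"nested": {
        "path": path,
        "query": {"match": {f"{path}.section_content": text}},
        "inner_hits": {"size": size},
    }}}


def best_sections(response, path="text"):
    """Collect (article title, section title, section score) from inner hits."""
    results = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for section in inner:
            results.append((hit["_source"]["title"],
                            section["_source"]["section_title"],
                            section["_score"]))
    return results


# Invented response fragment with the assumed inner_hits layout.
fake_response = {"hits": {"hits": [{
    "_source": {"title": "Stan Lee"},
    "inner_hits": {"text": {"hits": {"hits": [
        {"_score": 8.4, "_source": {"section_title": "Summary"}},
    ]}}},
}]}}

print(nested_section_query("stan lee"))
print(best_sections(fake_response))
```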