# Improving search accuracy in Elasticsearch

Elasticsearch ranks the documents found in response to a query by a relevance score, which can be roughly described as a term frequency normalized by the field length. A detailed explanation of how relevancy scores are calculated in Elasticsearch can be found [here](https://qbox.io/blog/practical-guide-elasticsearch-scoring-relevancy).
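
To make the "term frequency normalized by field length" idea more concrete, here is a rough sketch of a BM25-style score for a single query term. BM25 is the default similarity in recent Elasticsearch versions, and k1 = 1.2 and b = 0.75 are its default parameters. The function name and the toy numbers are illustrative assumptions, and the real implementation differs in details such as constant factors and how field lengths are stored.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Simplified BM25 score of one query term in one field (illustrative only)."""
    # Rare terms get a higher inverse document frequency
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Fields longer than average are penalized, shorter ones are rewarded
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturates instead of growing without bound
    tf = freq / (freq + k1 * length_norm)
    return idf * tf


# Same term frequency, different field lengths (toy numbers)
short_field = bm25_term_score(freq=2, doc_len=50, avg_doc_len=100,
                              n_docs=1000, docs_with_term=10)
long_field = bm25_term_score(freq=2, doc_len=400, avg_doc_len=100,
                             n_docs=1000, docs_with_term=10)
print(short_field > long_field)  # True: the shorter field scores higher
```

This is why a short title that contains the query terms can outscore a long article body that mentions them several times.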

In this tutorial we experiment with different types of Elasticsearch queries and their performance with different document indexing approaches. We touch on indexing basics [here](https://ovbondarenko.github.io/elasticsearch/index.html).

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html

In [1]:
# Import dependencies
from elasticsearch import Elasticsearch
import wikipediaapi
from slugify import slugify
from pprint import pprint

Start a new Elasticsearch connection:

In [2]:
client = Elasticsearch("http://localhost:9200")

Our small Elasticsearch library contains several indices. One group of indices has a default data structure, and the second has a nested structure with predefined field mappings.

Here are all the indices in the database:

In [3]:
client.indices.get_alias("")

{'wikipedia-simple': {'aliases': {}},
 'natural-language-processing': {'aliases': {}},
 'pandemics': {'aliases': {}},
 'coronaviridae': {'aliases': {}},
 'science-fiction-television': {'aliases': {}},
 'american-comics-writers': {'aliases': {}},
 'machine-learning': {'aliases': {}},
 'american-science-fiction-television-series': {'aliases': {}},
 'marvel-comics': {'aliases': {}},
 'marvel-comics-editors-in-chief': {'aliases': {}},
 'presidents-of-the-united-states': {'aliases': {}}}

Indices with the default data structure, such as 'wikipedia-simple' and 'presidents-of-the-united-states', have the following structure:

In [4]:
client.indices.get_mapping('wikipedia-simple')

{'wikipedia-simple': {'mappings': {'properties': {'page_id': {'type': 'long'},
    'source': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'text': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'title': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}

In [None]:
client = Elasticsearch("http://localhost:9200") # or name of local network, e.g. client = Elasticsearch("es:9200")


In [6]:
pprint(client.indices.get_mapping('machine-learning'))

{'machine-learning': {'mappings': {'properties': {'page_id': {'type': 'long'},
                                                  'references': {'properties': {'section_content': {'type': 'text'},
                                                                                'section_num': {'type': 'integer'},
                                                                                'section_title': {'type': 'text'}},
                                                                 'type': 'nested'},
                                                  'source': {'type': 'text'},
                                                  'text': {'properties': {'section_content': {'type': 'text'},
                                                                          'section_num': {'type': 'integer'},
                                                                          'section_title': {'type': 'text'}},
                                                           'type': 'nested'},
 

In [5]:
client.indices.get_mapping('presidents-of-the-united-states')

{'presidents-of-the-united-states': {'mappings': {'properties': {'page_id': {'type': 'long'},
    'source': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'text': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'title': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}

And here is an example of an index with a nested structure:

In [6]:
client.indices.get_mapping('coronaviridae')

{'coronaviridae': {'mappings': {'properties': {'page_id': {'type': 'long'},
    'source': {'type': 'text'},
    'text': {'type': 'nested',
     'properties': {'section_content': {'type': 'text'},
      'section_num': {'type': 'integer'},
      'section_title': {'type': 'text'}}},
    'title': {'type': 'text'}}}}}

We will show how to build queries for both types of indices to get better text search results.
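
Since the two groups of indices need different query styles, it can help to check programmatically which fields are nested. The helper below is a small sketch that walks the 'properties' part of a mapping. The function name is hypothetical, and the two sample dictionaries are copied from the mappings shown above (without the keyword sub-fields).

```python
def nested_paths(properties, prefix=""):
    """Return the dotted paths of all fields mapped as 'nested'."""
    paths = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "nested":
            paths.append(path)
        # Recurse into sub-properties of object or nested fields, if any
        if "properties" in spec:
            paths.extend(nested_paths(spec["properties"], prefix=path + "."))
    return paths


# 'properties' of an index with the default structure
flat_properties = {
    "page_id": {"type": "long"},
    "source": {"type": "text"},
    "text": {"type": "text"},
    "title": {"type": "text"},
}

# 'properties' of the 'coronaviridae' index
nested_properties = {
    "page_id": {"type": "long"},
    "source": {"type": "text"},
    "text": {
        "type": "nested",
        "properties": {
            "section_content": {"type": "text"},
            "section_num": {"type": "integer"},
            "section_title": {"type": "text"},
        },
    },
    "title": {"type": "text"},
}

print(nested_paths(flat_properties))    # []
print(nested_paths(nested_properties))  # ['text']
```

With a live connection, the input would be the 'properties' dictionary found under the index name and 'mappings' in the get_mapping response.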

## General query types

Three commonly used types of Elasticsearch queries are match, term and range.

- **Match query is the standard query for full-text search and is performed against analyzed text.** We will focus on match queries because they are the most useful for free-text search.
- Term query looks for an exact match and does not analyze the query text
- Range query finds values within a given range, for example numbers or dates
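
As a minimal sketch, the three query types above could be written as request bodies like this. The helper names are hypothetical, the field names come from the default mapping shown earlier ('text', 'title.keyword', 'page_id'), and the page_id bounds are arbitrary illustration values.

```python
def match_query(field, text):
    """Full-text query: the text is analyzed before matching."""
    return {"query": {"match": {field: text}}}


def term_query(field, value):
    """Exact-match query: the value is not analyzed."""
    return {"query": {"term": {field: value}}}


def range_query(field, gte=None, lte=None):
    """Range query with optional lower and upper bounds."""
    bounds = {key: value for key, value in (("gte", gte), ("lte", lte))
              if value is not None}
    return {"query": {"range": {field: bounds}}}


full_text = match_query("text", "When Stan Lee was born?")
# The keyword sub-field stores the unanalyzed title, so it suits term queries
exact_title = term_query("title.keyword", "Stan Lee")
id_range = range_query("page_id", gte=1, lte=100000)

print(full_text)
print(exact_title)
print(id_range)
```

Each of these dictionaries can be passed as the request body to client.search.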

We will start by searching indices that have the default data structure, in which title and text are plain text fields.

## Query indices with single text fields

In [7]:
index1 = ['presidents-of-the-united-states', 
         'marvel-comics-editors-in-chief', 
         'marvel-comics',
         'american-comics-writers']

In [8]:
question1 = "When stan Lee was born?"

### Simple full text search

In [1]:
body1 = {"query": {"match": {"text": question1}}}

The _search API response to a query includes several metafields, but we are mostly interested in the relevance score ('_score').

In [11]:
docs = client.search(body1, index=index1)

print("Example data structure")
print("----------------------")
pprint(docs['hits']['hits'][0].keys())
pprint(docs['hits']['hits'][0]['_source'].keys())

Example data structure
----------------------
dict_keys(['_index', '_type', '_id', '_score', '_source'])
dict_keys(['title', 'page_id', 'source', 'text'])
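
The same few lines for printing hits are repeated throughout this tutorial, so they could be wrapped in a small helper. This is only a sketch: the function name is hypothetical, and sample_response is a hand-written dictionary that mimics the response structure shown above (the '_id', 'page_id' and 'text' values are placeholders, not real data).

```python
def format_hits(response):
    """Turn a search response into printable lines with title, score and link."""
    lines = []
    for i, hit in enumerate(response["hits"]["hits"]):
        source = hit["_source"]
        lines.append(
            f"Result {i}: {source['title']} | Relevance score {hit['_score']}"
            f" | Index {hit['_index']}"
        )
        lines.append(f"    Link: {source['source']}")
    return lines


# Hand-written stand-in for a real response, with one hit
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "american-comics-writers",
                "_type": "_doc",
                "_id": "placeholder-id",
                "_score": 9.7426195,
                "_source": {
                    "title": "Stan Lee",
                    "page_id": 0,
                    "source": "https://en.wikipedia.org/wiki/Stan_Lee",
                    "text": "placeholder text",
                },
            }
        ]
    }
}

for line in format_hits(sample_response):
    print(line)
```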


In [12]:
print(f"Question: {question1}")
print("")
print("Search results:")
print("----------------------")

docs1 = client.search(body1, index="")

for i, doc in enumerate(docs1["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    idx = doc['_index']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score} | Index {idx}')
    print(f'    Link: {url}')

Question: When stan Lee was born?

Search results:
----------------------
Result 0: Stan Lee | Relevance score 9.7426195 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Stan_Lee
Result 1: Leon Lazarus | Relevance score 9.079921 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Leon_Lazarus
Result 2: Roy Thomas | Relevance score 8.832645 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Roy_Thomas
Result 3: Jim Salicrup | Relevance score 8.592315 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Jim_Salicrup
Result 4: Danny Fingeroth | Relevance score 8.055832 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Danny_Fingeroth
Result 5: Jack Kirby | Relevance score 7.9232183 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Jack_Kirby
Result 6: Steve Ditko | Relevance score 7.883254 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Ste

In this case, the search returns the most relevant article first. However, it is good to run another test.

In [13]:
question2 = "When Barack Obama was inaugurated?"
body2 = {"query": {"match": {"text": question2}}}

print(f"Question: {question2}")
print("")
print("Search results:")
print("----------------------")

docs2 = client.search(body2, index=index1)

for i, doc in enumerate(docs2["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    idx = doc['_index']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score} | Index {idx}')
    print(f'    Link: {url}')

Question: When Barack Obama was inaugurated?

Search results:
----------------------
Result 0: Jeff Mariotte | Relevance score 12.4784 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Jeff_Mariotte
Result 1: Ta-Nehisi Coates | Relevance score 11.486142 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Ta-Nehisi_Coates
Result 2: Amber Benson | Relevance score 9.384043 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Amber_Benson
Result 3: Eric Millikin | Relevance score 8.75001 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Eric_Millikin
Result 4: Jason Rubin | Relevance score 8.663446 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Jason_Rubin
Result 5: Rashida Jones | Relevance score 7.785064 | Index american-comics-writers
    Link: https://en.wikipedia.org/wiki/Rashida_Jones
Result 6: Barack Obama | Relevance score 7.342956 | Index presidents-of-the-united-states
    

As you can see, the highest score calculated by Elasticsearch does not always correspond to the most relevant article. How can we improve the results?
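
To compare query variants, it helps to track the position of the article we expect to see first. The helper below is a small sketch with a hypothetical name. The sample ranking only reuses the titles from the output above, in the same order.

```python
def rank_of_title(response, title):
    """Return the 0-based position of the first hit with this title, or None."""
    for rank, hit in enumerate(response["hits"]["hits"]):
        if hit["_source"]["title"] == title:
            return rank
    return None


# Titles in the order returned for the Barack Obama question above
titles = [
    "Jeff Mariotte", "Ta-Nehisi Coates", "Amber Benson", "Eric Millikin",
    "Jason Rubin", "Rashida Jones", "Barack Obama",
]
ranking = {"hits": {"hits": [{"_source": {"title": t}} for t in titles]}}

print(rank_of_title(ranking, "Barack Obama"))  # 6
```

A lower number is better, and None means the article was not among the returned hits.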

### Match phrase query

* All the terms must appear in the field, and in the same order
* A custom query analyzer can be specified

It may be very useful if we are looking for a specific sequence of words, like "Barack Obama".
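
The two points above could be expressed as a request body builder like the one sketched here. The function name is hypothetical. 'analyzer' and 'slop' are optional match_phrase parameters, where slop is the number of positions by which the terms may be out of place. Whether they help depends on the data, so treat the values as assumptions.

```python
def match_phrase_body(field, phrase, slop=0, analyzer=None):
    """Build a match_phrase request body with optional slop and analyzer."""
    params = {"query": phrase}
    if slop:
        # Allow the terms to be up to `slop` positions apart
        params["slop"] = slop
    if analyzer is not None:
        # Analyze the query text with a specific analyzer
        params["analyzer"] = analyzer
    return {"query": {"match_phrase": {field: params}}}


strict = match_phrase_body("title", "Barack Obama")
relaxed = match_phrase_body("text", "Barack Obama", slop=1, analyzer="standard")

print(strict)
print(relaxed)
```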

In [14]:
phrase = "Barack Obama"

body3 = \
    {
      "query": {"match_phrase": 
                    {"title": phrase}
               }
    }
docs3 = client.search(body3, index=index1)

print(f"Question: {phrase}")
print("")
print("Search results:")
print("----------------------")
for i, doc in enumerate(docs3['hits']['hits']):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: Barack Obama

Search results:
----------------------
Result 0: Barack Obama | Relevance score 7.8283415
    Link: https://en.wikipedia.org/wiki/Barack_Obama


In [15]:
phrase = "When Barack Obama was born?"

body3 = \
    {
      "query": {"match_phrase": 
                    {"title": phrase}
               }
    }
docs3 = client.search(body3, index=index1)

print(f"Question: {phrase}")
print("")
print("Search results:")
print("----------------------")
for i, doc in enumerate(docs3['hits']['hits']):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When Barack Obama was born?

Search results:
----------------------


Clearly, this does not work when our question has a more natural, less precise form.

### Multi-field search

We can try to improve the scoring by searching in multiple fields. The indices in this example have two content fields, text and title, so we will search in both of them.

In [16]:
body4 = \
{
  "query": {
    "multi_match" : {
      "query":    question2, 
      "fields": [ "title", "text" ] 
    }
  }
}

docs4 = client.search(body4, index=index1)

print(f"Question: {question2}")
print("")
print("Search results:")
print("----------------------")
for i, doc in enumerate(docs4["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When Barack Obama was inaugurated?

Search results:
----------------------
Result 0: Jeff Mariotte | Relevance score 12.4784
    Link: https://en.wikipedia.org/wiki/Jeff_Mariotte
Result 1: Ta-Nehisi Coates | Relevance score 11.486142
    Link: https://en.wikipedia.org/wiki/Ta-Nehisi_Coates
Result 2: Amber Benson | Relevance score 9.384043
    Link: https://en.wikipedia.org/wiki/Amber_Benson
Result 3: Eric Millikin | Relevance score 8.75001
    Link: https://en.wikipedia.org/wiki/Eric_Millikin
Result 4: Jason Rubin | Relevance score 8.663446
    Link: https://en.wikipedia.org/wiki/Jason_Rubin
Result 5: Barack Obama | Relevance score 7.8283415
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 6: Rashida Jones | Relevance score 7.785064
    Link: https://en.wikipedia.org/wiki/Rashida_Jones
Result 7: Alex Ross | Relevance score 6.783415
    Link: https://en.wikipedia.org/wiki/Alex_Ross
Result 8: Bill Clinton | Relevance score 6.571047
    Link: https://en.wikipedia.org/

While the most relevant "Barack Obama" article is now ranked 6th instead of 7th, it is still not at the top of the list, so the improvement is minor. This is because other articles may also have multiple mentions of Barack Obama.

### Dynamic boosting

Dynamic boosting allows us to boost some document fields at query time. We can boost the score calculated from the title field, since we believe that an article that has the keywords from our query in its title is more relevant than one that doesn't. In our Barack Obama example this assumption is correct.

In [17]:
body5 = \
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": {
                  "query": question2,
                  "boost": 3
                }
              }
            },
            {
              "match": { 
                "text": question2
              }
            }
          ]
        }
      }
    }

docs5 = client.search(body5, index=index1)

print(f"Question: {question2}")
print("")
print("Search results:")
print("----------------------")

for i, doc in enumerate(docs5["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When Barack Obama was inaugurated?

Search results:
----------------------
Result 0: Barack Obama | Relevance score 30.827982
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 1: Jeff Mariotte | Relevance score 12.4784
    Link: https://en.wikipedia.org/wiki/Jeff_Mariotte
Result 2: Ta-Nehisi Coates | Relevance score 11.486142
    Link: https://en.wikipedia.org/wiki/Ta-Nehisi_Coates
Result 3: Amber Benson | Relevance score 9.384043
    Link: https://en.wikipedia.org/wiki/Amber_Benson
Result 4: Eric Millikin | Relevance score 8.75001
    Link: https://en.wikipedia.org/wiki/Eric_Millikin
Result 5: Jason Rubin | Relevance score 8.663446
    Link: https://en.wikipedia.org/wiki/Jason_Rubin
Result 6: Rashida Jones | Relevance score 7.785064
    Link: https://en.wikipedia.org/wiki/Rashida_Jones
Result 7: Alex Ross | Relevance score 6.783415
    Link: https://en.wikipedia.org/wiki/Alex_Ross
Result 8: Bill Clinton | Relevance score 6.571047
    Link: https://en.wikipedia.org/

Much better! Now the most useful article is at the top of the list.

### Multi-field search with boosting - short version

This is just another way to make a multi-field query with boosting, which is more concise than the first one.

In [18]:
body6 = \
{
  "query": {
    "multi_match" : {
      "query":    question2, 
      "fields": [ "title^3", "text" ] 
    }
  }
}

docs6 = client.search(body6, index="")

print(f"Question: {question2}")
print("")
print("Search results:")
print("----------------------")

for i, doc in enumerate(docs6["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    print(f'Result {i}: {title} | Relevance score {score}')
    print(f'    Link: {url}')

Question: When Barack Obama was inaugurated?

Search results:
----------------------
Result 0: Barack Obama | Relevance score 23.485025
    Link: https://en.wikipedia.org/wiki/Barack_Obama
Result 1: Jeff Mariotte | Relevance score 12.4784
    Link: https://en.wikipedia.org/wiki/Jeff_Mariotte
Result 2: Ta-Nehisi Coates | Relevance score 11.486142
    Link: https://en.wikipedia.org/wiki/Ta-Nehisi_Coates
Result 3: Amber Benson | Relevance score 9.384043
    Link: https://en.wikipedia.org/wiki/Amber_Benson
Result 4: Eric Millikin | Relevance score 8.75001
    Link: https://en.wikipedia.org/wiki/Eric_Millikin
Result 5: Jason Rubin | Relevance score 8.663446
    Link: https://en.wikipedia.org/wiki/Jason_Rubin
Result 6: Rashida Jones | Relevance score 7.785064
    Link: https://en.wikipedia.org/wiki/Rashida_Jones
Result 7: Alex Ross | Relevance score 6.783415
    Link: https://en.wikipedia.org/wiki/Alex_Ross
Result 8: Bill Clinton | Relevance score 6.571047
    Link: https://en.wikipedia.org/
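
The two boosted queries give the "Barack Obama" article different scores (30.827982 with the bool query, 23.485025 with multi_match). The numbers are consistent with the following reading, which is an inference from the outputs above rather than something verified with the explain API: a bool query adds up the scores of its matching should clauses, while multi_match with the default best_fields type keeps the score of the best field and adds the other fields only in proportion to tie_breaker, which is 0 by default. The sketch assumes that 7.8283415 is the article's title score and 7.342956 its text score, as reported in the earlier unboosted searches. The function names are hypothetical.

```python
import math

TITLE_SCORE = 7.8283415  # score of the title match reported above (assumed)
TEXT_SCORE = 7.342956    # score of the text match reported above (assumed)
TITLE_BOOST = 3


def bool_should_score(field_scores):
    """Bool query: scores of the matching should clauses are added up."""
    return sum(field_scores)


def best_fields_score(field_scores, tie_breaker=0.0):
    """multi_match best_fields: best field plus tie_breaker times the rest."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


boosted = [TITLE_BOOST * TITLE_SCORE, TEXT_SCORE]

print(round(bool_should_score(boosted), 4))  # 30.828
print(round(best_fields_score(boosted), 4))  # 23.485
print(math.isclose(best_fields_score(boosted, tie_breaker=1.0),
                   bool_should_score(boosted)))  # True
```

Under this reading, setting tie_breaker to 1.0 in the multi_match query would make both variants score the article the same way.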

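The default Elasticsearch analyzer is the standard analyzer, which splits text into word tokens and lowercases them. A different option is the N-gram tokenizer, which analyzes the input text and, with its default settings, produces character N-grams with N of 1 and 2: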
```
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

```
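
The N-gram output above can be reproduced with a few lines of plain Python, which makes it easier to see what ends up in the index. This is a sketch that only mimics the tokenizer's behavior for minimum and maximum gram lengths of 1 and 2. The function names are hypothetical. For contrast, word_tokens is a rough approximation of word-level analysis with lowercasing, which is closer to what the standard analyzer produces.

```python
import re


def char_ngrams(text, min_gram=1, max_gram=2):
    """Emit character N-grams position by position, shortest first."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def word_tokens(text):
    """Rough approximation of word-level tokenization with lowercasing."""
    return re.findall(r"\w+", text.lower())


print(char_ngrams("Quick Fox"))
print(word_tokens("Quick Fox"))  # ['quick', 'fox']
```

N-grams allow partial-word matches at the cost of a much larger index, while word tokens keep the index small but only match whole words.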

## Querying nested fields

Here are the indices in our library that have a nested structure:

In [14]:
index2 = [
    'natural-language-processing',
    'machine-learning',
    'coronaviridae',
    'pandemics'
]

In [15]:
question3 = "What is natural language processing?"

In [31]:
body7={
    "query": {
        "nested":{
            "path":"text",
            "query":{
                "bool": {
                "must": [
                    {"match":{"text.section_content":{'query': question3}}},
                    {"match":{"text.section_title":{'query': 'Summary', "boost": 3}}},
                    ]
                }
            }
        }
    }
}


docs7 = client.search(body7, index=index2)

print(f"Question: {question3}")
print("")
print("Search results:")
print("----------------------")

for i, doc in enumerate(docs7["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    pageid =doc['_source']['page_id']
    ind = doc["_index"]
    print(f'Result {i}: {title} | Relevance score {score} | {pageid} | {ind}')
    print(f'    Link: {url}')

Question: What is natural language processing?

Search results:
----------------------
Result 0: Conditional random field | Relevance score 21.414635 | 4118276 | machine-learning
    Link: https://en.wikipedia.org/wiki/Conditional_random_field
Result 1: Paraphrasing (computational linguistics) | Relevance score 17.06732 | 56142183 | machine-learning
    Link: https://en.wikipedia.org/wiki/Paraphrasing_(computational_linguistics)
Result 2: Documenting Hate | Relevance score 16.588968 | 54994687 | machine-learning
    Link: https://en.wikipedia.org/wiki/Documenting_Hate
Result 3: Bag-of-words model | Relevance score 16.155718 | 14003441 | machine-learning
    Link: https://en.wikipedia.org/wiki/Bag-of-words_model
Result 4: Constrained conditional model | Relevance score 15.544483 | 28255458 | machine-learning
    Link: https://en.wikipedia.org/wiki/Constrained_conditional_model
Result 5: Structured sparsity regularization | Relevance score 15.375017 | 48844125 | machine-learning
    Link

Unfortunately, this didn't quite work: the article titled "Natural language processing" came up only in 8th place. Let's try the multi-field search with title field boosting:

In [25]:
body7={
  "query": {
    "multi_match" : {
      "query": question3,
      "type": "best_fields",
      "fields": [ "title^3", "text.section_content"],
    }
  }
}

docs7 = client.search(body7, index=index2)

print(f"Question: {question3}")
print("")
print("Search results:")
print("----------------------")

for i, doc in enumerate(docs7["hits"]["hits"]):
    title = doc['_source']['title']
    score = doc['_score']
    url = doc['_source']['source']
    pageid =doc['_source']['page_id']
    ind = doc["_index"]
    print(f'Result {i}: {title} | Relevance score {score} | {pageid} | {ind}')
    print(f'    Link: {url}')

Question: What is natural language processing?

Search results:
----------------------
Result 0: Natural language processing | Relevance score 21.272564 | 21652 | natural-language-processing
    Link: https://en.wikipedia.org/wiki/Natural_language_processing
Result 1: Studies in Natural Language Processing | Relevance score 16.254246 | 6650456 | natural-language-processing
    Link: https://en.wikipedia.org/wiki/Studies_in_Natural_Language_Processing
Result 2: Semantic decomposition (natural language processing) | Relevance score 16.254246 | 57932194 | natural-language-processing
    Link: https://en.wikipedia.org/wiki/Semantic_decomposition_(natural_language_processing)
Result 3: Outline of natural language processing | Relevance score 16.254246 | 37764426 | natural-language-processing
    Link: https://en.wikipedia.org/wiki/Outline_of_natural_language_processing
Result 4: History of natural language processing | Relevance score 16.254246 | 27837170 | natural-language-processing
    L

Great, this approach helped to bring the most relevant document to the top of our list.
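
One caveat about the last query: 'text' is mapped as a nested field, and nested fields are normally matched only through a nested query, so the ranking above may be driven mostly by the boosted title. A possible way to score both the title and the section content is to combine them in a bool query, as sketched below. This body was not run against the indices above. The helper name, the boost of 3 and the score_mode of "max" are assumptions chosen for illustration.

```python
def title_and_sections_query(question, title_boost=3, score_mode="max"):
    """Combine a boosted title match with a nested match on section content."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Top-level field: a regular match query with a boost
                    {"match": {"title": {"query": question,
                                         "boost": title_boost}}},
                    # Nested field: must be wrapped in a nested query
                    {
                        "nested": {
                            "path": "text",
                            # How section scores combine into a document score
                            "score_mode": score_mode,
                            "query": {
                                "match": {"text.section_content": question}
                            },
                        }
                    },
                ]
            }
        }
    }


body8 = title_and_sections_query("What is natural language processing?")
print(body8["query"]["bool"]["should"][1]["nested"]["path"])  # text
```

The resulting dictionary can be passed to client.search in the same way as the other request bodies in this tutorial.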