8. Search

Madhusudhan Konda edited this page Dec 10, 2020 · 8 revisions

Search

We use the Books dataset for these queries and exercises.

Full Text vs Term Search

There are two variants of search - full-text and structured search.

Structured search queries return exact matches. That is, the search only cares about finding the documents that match the search criteria (not how well they match). The criteria are usually based on numerical values, dates, ranges, or boolean expressions. The basic idea is that the results are binary: a document either matches or it doesn't. No scores are attached to these results, hence they are not sorted by relevancy. And because the results are not scored, the server can cache them, gaining a performance benefit should the same query be rerun. Traditional database search is mostly of this sort.

For example, fetch all the developers who attended a tech talk, ratings of a book greater than 4 stars, flights from London Heathrow on 1st of May 2020, and so on.

On the other hand, full-text (unstructured) queries try to find results that are relevant to the query. That is, Elasticsearch finds the documents best suited to the query. Elasticsearch employs a similarity algorithm to generate a relevance score for full-text queries. The score is a positive floating-point number attached to each result, with the highest-scored document being the most relevant to the query criteria. Elasticsearch applies different relevance scoring algorithms for various queries and also allows us to customize the scoring algorithm.

Elasticsearch executes a structured search in a filter context and an unstructured (full-text) search in a query context. Let's see what these contexts are in the next section.

Search Execution Contexts

Elasticsearch internally uses an execution context when running searches: a filter context or a query context. Although we cannot explicitly ask Elasticsearch to apply a certain context, the way we write our query lets Elasticsearch decide which context is appropriate.

A structured search will result in a binary yes/no answer, hence Elasticsearch uses a filter context for this. Remember there are no relevance scores expected for these results, so filter context is the appropriate one.

Of course, queries on full-text fields run in a query context, as each matched document must have a score associated with it.

We will look at some examples to demonstrate these contexts in action, but in the meantime let’s find out how we can access the Elasticsearch search endpoints.

Search APIs

Elasticsearch exposes the Search API via its _search endpoint. There are two ways of accessing the search endpoint:

  • URI Search Request: In this method, we pass in the search query parameters alongside the endpoint as params to the query.
  • Query DSL: Elasticsearch has implemented a domain-specific language (DSL) for search. The criteria are passed in as a JSON object in the request payload when using Query DSL.

Request URI Search

Let's use the simple URI search to fetch books whose title contains Java:

// Search books whose title has Java word
GET books/_search?q=title:Java

// Only get me 2 results
GET books/_search?q=title:Java&size=2

// Skip the first 3 hits and return the next 2 (from is a zero-based document offset, not a page number)
GET books/_search?q=title:Java&size=2&from=3
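
Since from counts documents rather than pages, it can help to derive it from a page number. The following is a minimal Python sketch of that arithmetic; the helper names from_offset and paging_params are made up for illustration and are not part of any Elasticsearch client.

```python
def from_offset(page: int, size: int) -> int:
    """Return the zero-based `from` offset for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive integers")
    return (page - 1) * size


def paging_params(page: int, size: int) -> str:
    """Render the size/from pair as URI search parameters (hypothetical helper)."""
    return f"size={size}&from={from_offset(page, size)}"


# Third page with two records per page skips the first four hits
print("GET books/_search?q=title:Java&" + paging_params(page=3, size=2))
```

With two records per page, the third page therefore starts at from=4, whereas from=3 starts at the fourth hit.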

You can search on multiple fields, combining them with either an OR or an AND condition:

// Fetch documents whose title is Java OR whose edition is 1
GET books/_search?q=title:Java OR edition:1

// The default operator is OR, so you can omit it between the fields - it's implicit

// Fetch documents whose title is Java OR whose edition is 1 - implicit OR condition:
GET books/_search?q=title:Java edition:1

However, if you wish to do an AND, simply set default_operator=AND:

GET books/_search?q=title:Java edition:1&default_operator=AND

Elasticsearch can explain how it arrived at the results of a query. We add explain=true to the query and send it to the _search endpoint:

// Ask for an explanation of the query
GET books/_search?q=title:Java edition:1&default_operator=AND&explain=true

We can even sort the results:

// Sort descending
GET books/_search?q=title:Java&sort=release_date:desc

// Sort ascending by default or specify release_date:asc
GET books/_search?q=title:Java&sort=release_date

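NOTE: _score is null for these documents, because Elasticsearch does not compute relevance scores by default when the results are sorted on a field.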

While the URL method is simple and easy to code up, it becomes error prone as the complexity of the query criteria grows. We can use it for quick testing but relying on it for complicated and in-depth queries might be asking for trouble.
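
One reason the URI method becomes error prone is that the query text has to be URL-encoded once it leaves the Kibana console (spaces, colons and so on). The sketch below uses only Python's standard library to build such a path; the uri_search helper is a hypothetical name introduced here for illustration.

```python
from urllib.parse import parse_qs, urlencode


def uri_search(index: str, q: str, **params: str) -> str:
    """Build a URI search path with a safely encoded query string."""
    query = urlencode({"q": q, **params})
    return f"{index}/_search?{query}"


path = uri_search("books", "title:Java edition:1", default_operator="AND")
print(path)

# Decoding the query string recovers the original query text
decoded = parse_qs(path.split("?", 1)[1])
print(decoded["q"][0])
```

Every extra condition makes the encoded string harder to read, which is part of why a JSON request body scales better.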

We can alleviate the issue to some extent by using Query DSL’s query_string method. We can send the same URI query parameters as JSON request body as a query_string in a query object, as demonstrated in the following:

POST books/_search
{
  "query": {
    "query_string": {
      "default_field": "author",
      "query": "Joshua Goetz",
      "default_operator": "AND"
    }
  }
}

The query_string query is equivalent to the q parameter we've used in the URI search method. While it is much better than the URI method, query_string is strict about syntax and unforgiving of mistakes. Unless there's a strong reason to use it, prefer the other Query DSL queries over query_string.

So far we have mostly passed our parameters in the URI. There is another, and in fact preferred, mechanism to invoke search: Query DSL.

Query DSL

Elasticsearch developed a special-purpose language and syntax called Query DSL (domain-specific language) for querying the data.

The Query DSL is a sophisticated, powerful, and expressive language for creating a multitude of queries, from simple and basic to complex and nested. It is a JSON-based query language that supports deeply nested queries for both search and analytics. The format goes like this:

GET books/_search
{ 
  "query": { 
    "match": {
          
    }
  }
}

Now that we know the two methods of querying for search results, there is an important distinction you should know: full-text versus term-level queries. But before picking up these concepts, we need to understand what relevancy is.

Relevancy

Let's go over an important concept - Relevancy. Modern search engines not only return results based on your query’s criteria but also analyze them and return the most relevant results first. If you are searching for “Java” in the title of a book, a document containing more than one occurrence of the word “Java” in the title is more relevant than documents where the title has one or no occurrence.

Elasticsearch uses the Okapi BM25 relevancy algorithm by default for scoring the returned results so the client can expect relevant results. At a high level, the relevancy algorithm builds on TF / IDF (Term Frequency / Inverse Document Frequency).

Term Frequency (TF) is a measure of how frequent the word is in the field of the document. It is the number of times the search word appears in the search field. The higher the frequency, the higher the score.

The document frequency is the number of documents in the whole index that contain the word. The Inverse Document Frequency (IDF) is the inverse of that measure: the more documents a word appears in, the lower its relevance (hence inverse document frequency).

The field-length norm is another factor used in calculating the relevancy. An occurrence of the search word in a short field (say 20 characters in field1) is more relevant than the same occurrence in a long field (say 200 characters in field2).

Relevancy is a positive floating-point number that determines the ranking of the search results. In Elasticsearch, the relevancy is attached as _score to the results.
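
To see how term frequency, inverse document frequency and field length interact, here is a simplified Python sketch of a BM25-style score for a single term. It is an illustration under stated assumptions, not Lucene's implementation: k1=1.2 and b=0.75 are the commonly quoted defaults, field lengths are counted in tokens, and some formulations multiply by an extra constant factor that does not change the ranking.

```python
import math


def idf(doc_count: int, docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(term_freq: int, field_len: int, avg_field_len: float,
                    doc_count: int, docs_with_term: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score one term in one field using a simplified BM25 formula."""
    # Field-length normalisation: longer-than-average fields dampen the score
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    # Term frequency saturates instead of growing without bound
    tf = term_freq / (term_freq + norm)
    return idf(doc_count, docs_with_term) * tf


# Hypothetical index statistics: 100 documents, 5 of them contain the term
once = bm25_term_score(1, field_len=10, avg_field_len=10, doc_count=100, docs_with_term=5)
twice = bm25_term_score(2, field_len=10, avg_field_len=10, doc_count=100, docs_with_term=5)
long_field = bm25_term_score(1, field_len=50, avg_field_len=10, doc_count=100, docs_with_term=5)
print(once, twice, long_field)
```

In this sketch a second occurrence raises the score, while the same single occurrence in a longer field scores lower, mirroring the three factors described above.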

Term-level queries

POST books/_search
{
  "_source": ["amazon_rating","author"],
  "query": {
    "term": {
      "author": "Joshua"
    }
  }
}
// Doesn't return results. Why? (Clue: the input of a term query doesn't get analyzed!) Change the value to `joshua` and try again, or use a match query:
GET books/_search
{
  "query": {
    "match": {
      "author": "Joshua"
    }
  }
}
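
The difference between the two queries above can be mimicked in a few lines of Python. This is only a rough sketch: it assumes the author field is a text field indexed with an analyzer that splits on non-word characters and lowercases tokens, and the indexed value "Joshua Bloch" is a made-up sample.

```python
import re


def analyze(text: str) -> list[str]:
    """Rough stand-in for an analyzer: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Tokens stored in the (simulated) inverted index for one hypothetical author value
indexed_tokens = set(analyze("Joshua Bloch"))


def term_query(value: str) -> bool:
    """A term query looks the value up as-is, without analyzing it."""
    return value in indexed_tokens


def match_query(text: str) -> bool:
    """A match query analyzes its input first, then looks up each token."""
    return any(token in indexed_tokens for token in analyze(text))


print(term_query("Joshua"), term_query("joshua"), match_query("Joshua"))
```

The term lookup for "Joshua" fails because only the lowercased token was indexed, while the match query lowercases its input before the lookup.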

#IDs
GET books/_search
{
  "query": {
    "ids": {
      "values": [1,2]
    }
  }
}

#terms
GET books/_search
{
  "query": {
    "terms": {
      "author": ["joshua","joseph"]
    }
  }
}

#Range query
GET books/_search
{
  "_source": "amazon_rating", 
  "query": {
    "range": {
      "amazon_rating": {
        "gte": 4.5,
        "lte": 5
      }
    }
  }
}

#Prefix  - should return the Concurrency book!
GET books/_search
{
  "_source": "title", 
  "query": {
    "prefix": {
      "title": {
        "value": "con"
      }
    }
  }
}

# Wildcard with highlighting
GET books/_search
{
  "_source": false, 
  "query": {
    "wildcard": {
      "title": {
        "value": "*st"
      }
    }
  },"highlight": {
    "fields": {
      "title": {}
    }
  }
}
#Fuzzy
GET books/_search
{
  "_source": false, 
  "query": {
    "fuzzy": {
      "title": {
        "value": "kaava",
        "fuzziness": 2
      }
    }
  },"highlight": {
    "fields": {
      "title": {}
    }
  }
}

Full-text Queries

There are a handful of full-text queries that the Elasticsearch search API exposes:

  • Match all
  • Match query
  • Match Phrase
  • Match Phrase Prefix
  • Multi match and others

Let's see a few of them in action.

Match_all Query

Full-text queries work on fields that are unstructured. The match_all query, as the name suggests, fetches ALL the documents, as the examples below show:

#Matchall books index
GET books/_search
{
  "query": {
    "match_all": {}
  }
}

#Match-all wildcard indices
GET bo*/_search
{
  "query": {
    "match_all": {}
  }
}

#Match-all all indices - remember this brings ALL the documents across ALL the indices
GET _search
{
  "query": {
    "match_all": {}
  }
}
#Match-all multi index query
GET covid,books/_search
{
  "query": {
    "match_all": {}
  }
}
#Match
GET books/_search
{
  "explain": true, 
  "query": {
    "match": {
      "author": "Joshua"
    }
  }, "highlight": {
    "fields": {
      "author": {}
    }
  }
}

#Match-all on wildcard indices with a boost
GET bo*/_search
{
  "query": {
    "match_all": {"boost": 2.0}
  }
}

Try matching with the author as "Josh" - there won't be any results. Why? :)

Match Queries

Match queries find documents that satisfy the given search criteria, usually against the body of a text field.

#Match query - matches all books with given tags
GET books/_search
{
  "query": {
    "match": {
      "tags": "Java programming"
    }
  },
  "highlight": {"fields": {"tags": {}}}
}
#Match - run the same for explanation
GET books/_search
{
  "explain": true, 
   "query": {
    "match": {
      "tags": "Java programming"
    }
  },
  "highlight": {"fields": {"tags": {}}}
}

Match with Operators

The above query is translated to Java OR programming, as OR is the default operator (rerun the query with highlight to see the operator in action).

We can change the operator by adding an operator to the query clause. Do note there's a slight variation in defining the query clause, as demonstrated below:

GET books/_search
{
  "query": {
    "match": {
      "tags": {
        "query": "Computer Elasticsearch",
        "operator": "AND"
      }
    }
  }, "highlight": {"fields": {"tags": {}}}
}

// Try NOT as an operator. Does it work? (refer to the Operator spec: https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/5.0.0/org/elasticsearch/index/query/Operator.html)
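
The two shapes of the match clause (the short form and the expanded form that carries options such as operator) can be generated from one place. The sketch below builds the request body as a plain Python dictionary; match_clause is a hypothetical helper name, not an Elasticsearch client function.

```python
import json


def match_clause(field, text, operator=None, fuzziness=None):
    """Return a match query in short form, or expanded form when options are given."""
    options = {}
    if operator is not None:
        options["operator"] = operator
    if fuzziness is not None:
        options["fuzziness"] = fuzziness
    if not options:
        # Short form: the field maps directly to the query text
        return {"match": {field: text}}
    # Expanded form: the field maps to an object holding the text and its options
    return {"match": {field: {"query": text, **options}}}


body = {"query": match_clause("tags", "Computer Elasticsearch", operator="AND")}
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET books/_search request.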

Match with spelling mistakes (fuzziness)

We can match documents even when the query has spelling mistakes, using fuzziness. Elasticsearch applies fuzziness using the Levenshtein edit distance. If the fuzziness is set to 1, one spelling mistake is forgiven, as when we search for Compuuter. Follow the example below:

// With Fuzziness (see the spelling mistake in the query) - Levenshtein Edit Distance 
GET books/_search
{
  "query": {
    "match": {
      "tags": {
        "query": "Compuuter Elasticsearch",
        "operator": "OR",
        "fuzziness": 1
      }
    }
  }, "highlight": {"fields": {"tags": {}}}
}

// Exercise: try with two spelling mistakes: Compuutter for eg
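
To get a feel for what a fuzziness of 1 or 2 means, the edit distance can be computed by hand. Below is a small Python sketch of the classic Levenshtein distance (insertions, deletions and substitutions each cost 1). It is a textbook version written for illustration; Elasticsearch's own fuzzy matching may differ in details such as how transposed characters are counted.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


for typo in ("compuuter", "compuutter"):
    print(typo, "->", levenshtein(typo, "computer"))
```

A term whose distance from the indexed token exceeds the configured fuzziness is not expected to match.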

Match against multiple fields

We may need to match a text against multiple fields of a document. This is where the multi_match query comes in handy:

# Multimatch
GET books/_search
{
  "_source": false, 
  "query": {
    "multi_match": {
      "query": "Java",
      "fields": ["tags","synopsis"]
    }
  }, "highlight": {"fields": {"tags": {},"synopsis": {}}}
}

// Response
{
  "synopsis" : [
    "Core <em>Java</em> Volume I – Fundamentals is a <em>Java</em> reference book t..."
  ],
  "tags" : [
    "Programming Languages, <em>Java</em> Programming"
  ]
}

Match Phrase

Should we wish to search for a fixed phrase (with the words in the same order), the match_phrase query comes to the rescue:

# Match Phrase - looks for the exact phrase "and lambda expressions" in the synopsis field
GET books/_search
{
  "_source": false,
  "query": {
    "match_phrase": {
      "synopsis": "and lambda Expressions"
    }
  },"highlight": {"fields": {"synopsis": {}}}
}

// Try removing the lambda from the phrase and see what happens?

Match Phrase with slop

The slop setting makes the match phrase query more lenient. Instead of requiring the exact phrase, slop tells Elasticsearch how many positions the words of the phrase may be moved apart; a slop of 1, for instance, allows one other word to sit between the query words. Let's rerun the same example as above, this time removing lambda from the query and adding a slop of 1:

#Match phrase with slop
GET books/_search
{
  "_source": false,
  "query": {
    "match_phrase": {
      "synopsis": {
        "query": "and expressions",
        "slop": 1
      }
    }
  },"highlight": {"fields": {"synopsis": {}}}
}

#Match phrase with slop set to 2
GET books/_search
{
  "_source": false,
  "query": {
    "match_phrase": {
      "synopsis": {
        "query": "including interfaces",
        "slop": 2
      }
    }
  },"highlight": {"fields": {"synopsis": {}}}
}

// now try swapping the words - interfaces including. Does this work? Order is important when searching using phrase search.
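
The effect of slop on words that appear in the query's order can be sketched in Python. This is a deliberately simplified model: it only counts how many extra words sit between in-order phrase terms, and it leaves out reordered terms, which Elasticsearch can also match when the slop is large enough. The sample token list is made up.

```python
def min_in_order_slop(tokens, phrase):
    """Smallest slop at which phrase matches tokens in order, or None if it never does."""
    if not phrase:
        return None
    best = None
    for start, token in enumerate(tokens):
        if token != phrase[0]:
            continue
        position = start
        found = True
        for term in phrase[1:]:
            try:
                # Earliest occurrence of the next term after the previous one
                position = tokens.index(term, position + 1)
            except ValueError:
                found = False
                break
        if found:
            # Extra words squeezed in between the phrase terms
            gaps = (position - start) - (len(phrase) - 1)
            if best is None or gaps < best:
                best = gaps
    return best


def phrase_matches(tokens, phrase, slop=0):
    """True when the in-order phrase fits within the allowed slop."""
    needed = min_in_order_slop(tokens, phrase)
    return needed is not None and needed <= slop


# Hypothetical analyzed synopsis
synopsis = "covers interfaces and lambda expressions".split()
print(phrase_matches(synopsis, ["and", "expressions"], slop=0))
print(phrase_matches(synopsis, ["and", "expressions"], slop=1))
```

Dropping lambda from the phrase leaves a one-word gap, so the sketch only reports a match once the slop is at least 1.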

Match Phrase Prefix Query

This query matches documents using the prefix of the last word in the query.

##Match phrase prefix
GET books/_search
{
  "query": {
    "match_phrase_prefix": {
      "tags": "boo"
    }
  },"highlight": {"fields": {"tags": {}}}
}
// This will return
 "highlight" : {
   "tags" : [
     "Computer Science <em>Books</em>"
   ]
 }
The prefix "boo" matched "Books".

// Try setting the tags field to "Computer Sci"
// Try setting the tags field to "Compu Sci"

Compound Queries

Compound queries combine one or more leaf queries, as well as other compound queries. They are the most advanced part of Query DSL and the ones to use for querying data with complex criteria.

Elasticsearch provides five types of compound queries:

  • Boolean (by far the most useful)
  • Constant score
  • Function score
  • Disjunction max
  • Boosting

We will cover the Boolean and Boosting queries here.

Boolean Query

The Boolean query is the most popular and flexible compound query for creating a set of complex criteria for searching data. As the name indicates, it is a combination of boolean clauses, with each clause holding individual queries such as the term-level or full-text queries we've seen so far. Each of these clauses has a typed occurrence of must, must_not, should or filter.

  • The must clause is an AND query: a document must match every query in this clause
  • The must_not clause is a NOT query: a document must not match any query in this clause
  • The should clause is an OR query: a document should match at least one of the queries in this clause
  • The filter clause is a filter query: a document must match the query criteria (similar to the must clause), except that the filter clause does not contribute to the score

Note: the must and should clauses will contribute to the relevance scoring while must_not and filter will not.
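
The interplay of the four clauses can be modelled over an in-memory document. The Python sketch below is a toy model, not Elasticsearch's scoring: each clause is a function returning a score (0.0 meaning no match), the unit scores stand in for real relevance scores, and it assumes that a bool query with only should clauses needs at least one of them to match. All helper names and the sample document are made up.

```python
def bool_score(doc, must=(), must_not=(), should=(), filter_=()):
    """Return the combined score for doc, or None when the bool query excludes it."""
    must_scores = [clause(doc) for clause in must]
    if any(score == 0.0 for score in must_scores):
        return None  # every must clause has to match
    if any(clause(doc) == 0.0 for clause in filter_):
        return None  # every filter clause has to match, but adds no score
    if any(clause(doc) > 0.0 for clause in must_not):
        return None  # any must_not match excludes the document
    should_scores = [clause(doc) for clause in should]
    if should and not must and not filter_ and not any(should_scores):
        return None  # with only should clauses, at least one must match
    return sum(must_scores) + sum(should_scores)


def has_word(field, word):
    """Clause factory: unit score when the field contains the word."""
    return lambda doc: 1.0 if word in doc.get(field, "").lower().split() else 0.0


def at_least(field, minimum):
    """Clause factory: unit score when the numeric field is >= minimum."""
    return lambda doc: 1.0 if doc.get(field, 0) >= minimum else 0.0


book = {"author": "Joshua Bloch", "tags": "Java programming", "amazon_rating": 4.7}
print(bool_score(book, must=[has_word("author", "joshua")],
                 should=[has_word("tags", "java")],
                 filter_=[at_least("amazon_rating", 4.5)]))
```

Here the filter decides whether the book is returned at all, while only the must and should matches add to the printed score.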

Let's see the Boolean Query in action.

Bool query structure

The bool query with empty clauses will look like this:

GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {}
      ],
      "must_not": [
        {}
      ],
      "should": [
        {}
      ],
      "filter": [
        {}
      ]
    }
  }
}

Each of the clauses accepts an array of queries; for example, you can provide multiple term-level and full-text queries inside any of these clauses, as shown below:

GET books/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "FIELD": "TEXT"
          }
        },{
          "term": {
            "FIELD": {
              "value": "VALUE"
            }
          }
        }
      ],
      "should": [
        {
          "range": {
            "FIELD": {
              "gte": 10,
              "lte": 20
            }
          }
        },{
          "terms": {
            "FIELD": [
              "VALUE1",
              "VALUE2"
            ]
          }
        }
      ]
    }
  }  
}

Must clause

The must clause can be put to work by matching tags for 'computer':

GET books/_search
{
  "_source": false, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tags": "computer"
          }
        }
      ]
    }
  },"highlight": {
    "fields": {"tags": {}}
  }
}

If your query is throwing errors, you can use the _validate API to find out the issues with the query: GET books/_validate/query?explain

Let's add another query clause - this time the documents must match computer in the tags as well as the word java in the title (term query):

# Must query 
GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tags": "computer"
          }
        },
        {
          "term": {
            "title": "java"
          }
        }
      ]
    }
  },"highlight": {
    "fields": {"tags": {},"title": {}}
  }
}

Try changing java to Java and see whether the results are returned. (Clue: term vs match query.)

The must clause adds to the relevance score of the results.

Must-Not clause

As the name suggests, documents must not match the criteria specified in this clause. For example, fetch all documents authored by Joshua whose rating is no less than 4.5:

GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Joshua"
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "amazon_rating": {
              "lt": 4.5
            }
          }
        }
      ]
    }
  }
}

Should clause

The should clause is an OR clause: any document that matches any of the should queries gets a higher relevance score.

#Should
GET books/_search
{
  "_source": false, 
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "Elasticsearch"
          }
        },{
          "term": {
            "author": {
              "value": "joshua"
            }
          }
        }
        
      ]
    }
  },"highlight": {"fields": {"title": {},"author": {}}}
}

The above query tries to match all titles containing the word "Elasticsearch", and that clause finds nothing. But there's another clause attached to the should query: the term query, which fetches the documents whose author field contains joshua. As you can expect, the query returns results because one of the clauses matches the criteria (as opposed to must, where all conditions must be satisfied).

As an exercise, change the match query to search for 'Java' instead of Elasticsearch - do you see any difference in the relevance score? (The relevance is scored higher when all the should clauses are matched!)

Filter clause

Similar to the must clause, the filter clause fetches all the documents which match the criteria. The only difference is the filter clause runs in a filter context and hence as you'd expect no relevance scores are added to the document results.

Let's match all the documents with the first edition - this time we use filter clause

// Search for all 1st edition books
GET books/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "edition": 1
          }
        }
      ]
    }
  }
}

Scoring is ignored for filter queries. You can create multiple filters, as demonstrated in the snippet below:

// Fetch the 3rd edition books written by Joshua
GET books/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "edition": 3
          }
        },{
          "match":{
            "author":"Joshua"
          }
        }
      ]
    }
  }
}

We can combine must and filter as shown below:

GET books/_search
{
  "_source":false, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Joshua"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "edition": 1
          }
        }
      ]
    }
  },"highlight": {
    "fields": {"author": {}, "edition": {}}
  }
}
// Response - do keep a note of the score
"hits" : [
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "6",
    "_score" : 0.81000566,
    "highlight" : {
    "author" : [
    "Brian Goetz with Tim Peierls, <em>Joshua</em> Bloch, Joseph Bowbeer, David Holmes, and Doug Lea"
    ]
   }
 }
]

The response indicates that the score for the result is merely 0.81. As we discussed, filter is similar to the must query except that it doesn't run in a query context and therefore adds nothing to the score. To see the difference, let's move the filter criteria into the must clause as a second must query:

// We've moved the filter criteria to the must clause
GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Joshua"
          }
        },{
          "term": {
            "edition": {
              "value": 1
            }
          }
        }
      ]
    }
  }
}

//Response
"hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.8100057,
        "highlight" : {
          "author" : [
            "Brian Goetz with Tim Peierls, <em>Joshua</em> Bloch, Joseph Bowbeer, David Holmes, and Doug Lea"
          ]
        }
      }
    ]

// The score has increased from 0.81000566 to 1.8100057 because the term query now contributes to the score. Did you notice?!

All clauses combined

Now let's combine must, must_not, should and filter together. We will find all books written by Joshua (must) that are not rated below 4.5 (must_not), preferably tagged with Java (should), and released after 2015-01-01 (filter).

// Match books written by Joshua, must not have any rating less than 4.5, should have java in tags and filter by release_date
GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Joshua"
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "amazon_rating": {
              "lt": 4.5
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "tags": "Java"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "release_date": {
              "gt": "2015-01-01"
            }
          }
        }
      ]
    }
  }
}

Boosting Query

This query boosts the scoring for certain queries while intentionally lowering the scores for other matches. It is composed of two blocks: positive and negative. The positive block holds the query whose matches you want returned, while documents that also match the negative block's query have their score intentionally lowered using the negative_boost parameter. The negative_boost parameter is a floating-point number between 0 and 1.

Let's check out an example: we want all the documents that have the word 'Java' in the title, but the books written by Herbert Schildt should have a lower score (for no particular reason other than for demonstration purposes, Herbert!)

## Boosting
// Fetch all the Java titles but halve the score of Herbert's titles (negative_boost of 0.5)
GET books/_search
{
  "_source": ["title", "author"],
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "title": "Java"
        }
      },
      "negative": {
        "match": {
          "author":  "Herbert"
        }
      },
      "negative_boost": 0.5
    }
  },"highlight": {"fields": {"author": {}, "tags": {}}}
}

You can see that the score for Herbert's book has been reduced from 0.3027879 to 0.15139395, which is derived as original_score * negative_boost. (I ran the query with Cay as the author to obtain the original score.)

The negative_boost must be supplied between 0 and 1. Setting 0 as negative_boost will set the score of 0 for all matched documents while setting 1 will not alter the original scoring.
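
The arithmetic behind negative_boost is easy to reproduce. The following Python sketch applies the original_score * negative_boost rule described above to the scores quoted in this section; boosting_score is an illustrative helper name, not an Elasticsearch API.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float) -> float:
    """Scale the score down only for documents that also match the negative query."""
    if not 0 <= negative_boost <= 1:
        raise ValueError("negative_boost must be between 0 and 1")
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


original = 0.3027879  # score quoted above for the unpenalised match
print(boosting_score(original, matches_negative=True, negative_boost=0.5))
print(boosting_score(original, matches_negative=False, negative_boost=0.5))
```

Documents that match only the positive block keep their original score, which is why the other Java titles rank above the penalised ones.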