# Boolean Retrieval

_Using Elasticsearch and the Python API_

Boolean retrieval is a type of search that limits and refines a search request using Boolean logic. Operators such as `AND`, `OR`, and `NOT` define the relationship between the sets of documents that match each criterion in the query.

## What you will need

 - Python 3
 
This example uses an index based on media releases by a gallery, available at: https://data.qld.gov.au/dataset/qagoma-media-releases/resource/a1e4dffa-edb1-4e6d-a4a0-353aca79e9a3.

## Getting Started

In this example, we will use the Elasticsearch Python API. First, we will import and set up all of the Python modules and variables we will use later on. If you wish to use curl instead of the Python API, the complementary command line call is given in a comment above each API request.

In [1]:
from elasticsearch import Elasticsearch
import pandas as pd
# The client's keyword for the node list is `hosts`; `urls` is not a client parameter.
es = Elasticsearch(hosts=['localhost'], port=9200)

## Representing Boolean Expressions

Let's first investigate how to construct Boolean queries in Elasticsearch.

In traditional search engines, Boolean queries are expressed as a series of nested expressions, where keywords or subqueries are separated by an operator. As a trivial example, let's consider the following query:

`(a AND b AND (c OR d))`

One would expect this query to retrieve all of the documents that contain both `a` and `b`, and at least one of `c` or `d`. When operators are placed between operands, the expression is called an "infix" expression. You should be familiar with this representation, as it is how mathematical expressions are taught in school (BOMDAS).

Elasticsearch and its DSL, on the other hand, place the operators before the operands (a "prefix" expression). If we rearrange the previous expression into this representation, we get:

`(AND a b (OR c d))`
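
To make the prefix form concrete, here is a minimal sketch that represents the expression as nested Python tuples and evaluates it against a few toy documents. The `matches` helper and the `documents` data are hypothetical and only for illustration; this is not how Elasticsearch evaluates queries internally.

```python
# Prefix expressions as nested tuples: (operator, operand, operand, ...).
# An operand is either a term (a string) or a nested expression (a tuple).
expression = ('AND', 'a', 'b', ('OR', 'c', 'd'))


def matches(expr, terms):
    """Return True if the set of terms satisfies the prefix expression."""
    if isinstance(expr, str):
        return expr in terms
    operator, *operands = expr
    results = [matches(operand, terms) for operand in operands]
    if operator == 'AND':
        return all(results)
    if operator == 'OR':
        return any(results)
    if operator == 'NOT':
        return not any(results)
    raise ValueError('unknown operator: {}'.format(operator))


# Hypothetical documents, each reduced to the set of terms it contains.
documents = {
    'doc1': {'a', 'b', 'c'},
    'doc2': {'a', 'b'},
    'doc3': {'a', 'c', 'd'},
    'doc4': {'a', 'b', 'd', 'e'},
}

retrieved = sorted(name for name, terms in documents.items()
                   if matches(expression, terms))
print(retrieved)  # ['doc1', 'doc4']
```

Because the operator always comes first, the evaluator only needs to look at the head of each tuple to know how to combine the rest, which is what makes prefix expressions easy to nest.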

Let's explore how to construct the same query in Elasticsearch.

In [2]:
example_query = \
{
    'query': {
        'bool': {
            'must': [
                {
                    'match': {
                        'field': 'a'
                    }
                },
                {
                    'match': {
                        'field': 'b'
                    }
                },
                {
                    'bool': {
                        'should': [
                            {
                                'match': {
                                    'field': 'c'
                                }
                            },
                            {
                                'match': {
                                    'field': 'd'
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

The DSL of Elasticsearch is more verbose than our trivial example, and there are two things to take note of:

 1. Elasticsearch does not use `AND`, `OR`, etc. and instead uses `must`, `filter`, `should`, and `must_not`: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#query-dsl-bool-query. Roughly, `must` behaves like `AND`, `should` like `OR`, and `must_not` like `NOT`; `filter` behaves like `must` but does not contribute to the score.
 2. The Boolean query in Elasticsearch (`bool`) is a compound query. Any type of query can be nested inside a `bool` query, including other `bool` queries.
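
The correspondence between the prefix expression and the `bool` query can be sketched as a small recursive translation. This assumes every term is searched in a single field with a `match` query, and that `AND`, `OR`, and `NOT` map directly onto `must`, `should`, and `must_not`; the helper name `to_bool_query` is hypothetical and not part of the Elasticsearch client.

```python
def to_bool_query(expr, field):
    """Translate a prefix expression (nested tuples) into a query clause."""
    if isinstance(expr, str):
        # A bare term becomes a match query on the given field.
        return {'match': {field: expr}}
    operator, *operands = expr
    occurrence = {'AND': 'must', 'OR': 'should', 'NOT': 'must_not'}[operator]
    clauses = [to_bool_query(operand, field) for operand in operands]
    return {'bool': {occurrence: clauses}}


generated_query = {
    'query': to_bool_query(('AND', 'a', 'b', ('OR', 'c', 'd')), 'field')
}
```

For the expression `(AND a b (OR c d))`, `generated_query` has the same structure as the hand-written `example_query` above.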

## Searching with Boolean Queries

Let's explore the Boolean query API of Elasticsearch with some examples. First, let's see the effect that `must` and `should` have on the number of results returned.

In [3]:
must_query = \
{
    'query': {
        'bool': {
            'must': [
                {
                    'match': {
                        'description': 'art'                        
                    }
                },
                {
                    'match': {
                        'description': 'australian'
                    }
                }
            ]
        }
    }
}

should_query = \
{
    'query': {
        'bool': {
            'should': [
                {
                    'match': {
                        'description': 'art'                        
                    }
                },
                {
                    'match': {
                        'description': 'australian'
                    }
                }
            ]
        }
    }
}

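# curl -X GET 'localhost:9200/goma/_search' -H 'Content-Type: application/json' -d @must_query.json
must_res = es.search(index='goma', body=must_query)
# curl -X GET 'localhost:9200/goma/_search' -H 'Content-Type: application/json' -d @should_query.json
should_res = es.search(index='goma', body=should_query)

# Note: on Elasticsearch 7 and later, the count is under ['hits']['total']['value'].
pd.DataFrame([must_res['hits']['total'], should_res['hits']['total']],
             index=['must', 'should'],
             columns=['Number of results'])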

| | Number of results |
|---|---|
| must | 15 |
| should | 79 |


The `must` query is much more restrictive than the `should` query, as documents are required to contain both terms being matched, rather than one or both.
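
The gap between the two can be narrowed with the `minimum_should_match` parameter of the `bool` query, which sets how many `should` clauses a document has to satisfy. The following is a minimal sketch that reuses the shape of `should_query`; the helper name `require_should_matches` is hypothetical, and the resulting query has not been run against the `goma` index here.

```python
import copy

should_query = {
    'query': {
        'bool': {
            'should': [
                {'match': {'description': 'art'}},
                {'match': {'description': 'australian'}},
            ]
        }
    }
}


def require_should_matches(query, minimum):
    """Return a copy of a bool query that requires `minimum` should clauses."""
    strict = copy.deepcopy(query)  # leave the original query untouched
    strict['query']['bool']['minimum_should_match'] = minimum
    return strict


both_terms_query = require_should_matches(should_query, 2)
```

With both `should` clauses required, `es.search(index='goma', body=both_terms_query)` would be expected to match the same documents as `must_query`.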

## Gotchas

You may be tempted to use a single `multi_match` query in your `bool` query instead of several `match` queries that search for the same text in different fields. Be aware that the two are scored differently: for a `multi_match` query, Elasticsearch takes the maximum of the per-field scores, whereas for a `bool` query it sums the scores of the matching clauses.

This will produce two different rankings:

The `multi_match`:

```
{
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "a",
                        "fields": [
                            "abstract",
                            "title"
                        ],
                        "type": "phrase"
                    }
                }
            ]
        }
    }
}
```

orders the results as:
```
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1duSV_gZ47e2uMXwv2",
        "_score": 0.2876821,
        "_source": {
          "title": "a b c",
          "abstract": "a"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1durdPgZ47e2uMXwv3",
        "_score": 0.2876821,
        "_source": {
          "title": "a",
          "abstract": "a b c"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1dusyTgZ47e2uMXwv4",
        "_score": 0.2876821,
        "_source": {
          "title": "d c",
          "abstract": "a"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1duuKPgZ47e2uMXwv5",
        "_score": 0.25811607,
        "_source": {
          "title": "a b",
          "abstract": "a b"
        }
      }
    ]
  }
}
```

The two `match_phrase` queries:

```
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match_phrase": {
                                    "abstract": "a"
                                }
                            },
                            {
                                "match_phrase": {
                                    "title": "a"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
```

order the results as:
```
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.5408423,
    "hits": [
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1duSV_gZ47e2uMXwv2",
        "_score": 0.5408423,
        "_source": {
          "title": "a b c",
          "abstract": "a"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1durdPgZ47e2uMXwv3",
        "_score": 0.5408423,
        "_source": {
          "title": "a",
          "abstract": "a b c"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1duuKPgZ47e2uMXwv5",
        "_score": 0.51623213,
        "_source": {
          "title": "a b",
          "abstract": "a b"
        }
      },
      {
        "_index": "trec-test",
        "_type": "article",
        "_id": "AV1dusyTgZ47e2uMXwv4",
        "_score": 0.2876821,
        "_source": {
          "title": "d c",
          "abstract": "a"
        }
      }
    ]
  }
}
```

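The `multi_match` phrase query expands into a `dis_max` query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html), which scores each document by its best-matching field, whereas the `bool` query sums the scores of both `should` clauses. This is why the document that contains `a` in both fields with moderate scores ranks last under `multi_match` but third under the `bool` query.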
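
The two rankings can be reproduced with simple arithmetic. In the sketch below, the per-field scores are not reported by Elasticsearch in the responses above; they are inferred from them (for example, 0.5408423 - 0.2876821 = 0.2531602 for the longer field), so treat them as assumptions. The `rank` helper is hypothetical, and `dis_max` is modelled with its default `tie_breaker` of 0, i.e. a plain maximum.

```python
# Per-field phrase scores for the query "a", inferred from the two responses.
field_scores = {
    'AV1duSV_gZ47e2uMXwv2': {'title': 0.2531602, 'abstract': 0.2876821},
    'AV1durdPgZ47e2uMXwv3': {'title': 0.2876821, 'abstract': 0.2531602},
    'AV1dusyTgZ47e2uMXwv4': {'title': 0.0, 'abstract': 0.2876821},
    'AV1duuKPgZ47e2uMXwv5': {'title': 0.25811607, 'abstract': 0.25811607},
}


def rank(combine):
    """Order document ids by their combined score, highest first."""
    combined = {doc_id: combine(scores.values())
                for doc_id, scores in field_scores.items()}
    # sorted() is stable, so tied documents keep their original order.
    return sorted(combined, key=lambda doc_id: -combined[doc_id])


dis_max_ranking = rank(max)  # multi_match: best field only
bool_ranking = rank(sum)     # bool/should: sum of matching clauses

print([doc_id[-1] for doc_id in dis_max_ranking])  # ['2', '3', '4', '5']
print([doc_id[-1] for doc_id in bool_ranking])     # ['2', '3', '5', '4']
```

The last two documents swap places: summing rewards a document for matching in both fields, while taking the maximum only rewards its single best field.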