## Documentation

To read more about the search API, visit the [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

![query_dsl_docs](../images/query_dsl_docs.png)

## Connect to Elasticsearch

In [1]:
from pprint import pprint
from elasticsearch import Elasticsearch

HOST = "http://localhost:9200"

es = Elasticsearch(HOST)
client_info = es.info()
print("Connected to Elasticsearch!")
pprint(client_info.body)

Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': 'IzAz_bJfQnS_zfMDjIPmJA',
 'name': 'eb6cd056e782',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2025-01-09T14:09:01.578835424Z',
             'build_flavor': 'default',
             'build_hash': '0f88dde84795b30ca0d2c0c4796643ec5938aeb5',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '8.11.3',
             'minimum_index_compatibility_version': '6.0.0-beta1',
             'minimum_wire_compatibility_version': '6.8.0',
             'number': '7.17.27'}}



## Inserting documents

In [5]:
INDEX = "my_index"

settings = {
    "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

es.indices.delete(index=INDEX, ignore_unavailable=True)
es.indices.create(index=INDEX, settings=settings)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_index'})

Let's index the documents sequentially.

In [6]:
import json
from tqdm import tqdm


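# Load the sample documents, closing the file once it has been read.
with open("../data/dummy_data.json") as f:
    dummy_data = json.load(f)

# Index one document per request; no id is passed, so Elasticsearch generates each `_id`.
for document in tqdm(dummy_data, total=len(dummy_data)):
    response = es.index(index=INDEX, body=document)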

100%|██████████| 3/3 [00:00<00:00, 29.34it/s]
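
Indexing one document per request is fine for three documents, but each call is a separate round trip. As a minimal sketch of the alternative, the snippet below prepares actions for the Python client's `helpers.bulk` function. The names `build_bulk_actions` and `bulk_index` and the two sample documents are hypothetical, introduced only for illustration; `bulk_index` needs a running cluster and is not called here.

```python
from typing import Any, Dict, Iterable, Iterator


def build_bulk_actions(
    documents: Iterable[Dict[str, Any]], index: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per document (hypothetical helper)."""
    for document in documents:
        # "_index" names the target index; "_source" carries the document body.
        yield {"_index": index, "_source": document}


def bulk_index(es, documents, index):
    """Send all documents in one bulk request (requires a live cluster)."""
    # Imported lazily so the action builder above runs without the client installed.
    from elasticsearch import helpers

    # helpers.bulk returns a tuple: (number of successful actions, errors).
    return helpers.bulk(es, build_bulk_actions(documents, index))


# Illustrative documents, not the contents of dummy_data.json.
sample_docs = [
    {"title": "Sample Title 1", "created_on": "2024-09-22"},
    {"title": "Sample Title 2", "created_on": "2024-09-24"},
]
actions = list(build_bulk_actions(sample_docs, "my_index"))
print(actions[0])
```

With a connected client, `bulk_index(es, dummy_data, INDEX)` would replace the loop above.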


## Searching

### 1. Leaf clauses

#### 1.1. term query

Let's use the Query DSL to build a query that finds every document created on `2024-09-22`.

In [7]:
response = es.search(
    index=INDEX,
    body={
        "query": {
            "term": {
                "created_on": "2024-09-22"
            }
        }
    }
)

n_hits = response['hits']['total']['value']
print(f"Found {n_hits} documents in {INDEX}")

Found 1 documents in my_index



To retrieve the matching documents, read the `hits` list nested inside the response's `hits` dictionary:

In [8]:
retrieved_documents = response["hits"]["hits"]
retrieved_documents

[{'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RHjCJJUBpQvCJGK5hU3T',
  '_score': 1.0,
  '_source': {'title': 'Sample Title 1',
   'text': 'This is the first sample document text.',
   'created_on': '2024-09-22'}}]
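
Each entry in the list above wraps the stored document in metadata (`_index`, `_id`, `_score`), with the document itself under `_source`. The sketch below flattens that structure into plain rows. `summarize_hits` is a hypothetical helper, and `sample_response` is a hand-written dictionary modeled on the output above so the snippet runs without a cluster.

```python
from typing import Any, Dict, List, Mapping


def summarize_hits(response: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Flatten search hits into one dictionary per document (hypothetical helper)."""
    rows = []
    for hit in response["hits"]["hits"]:
        # Keep the id and relevance score next to the stored fields.
        # Note: a stored field named "id" or "score" would overwrite these keys.
        rows.append({"id": hit["_id"], "score": hit["_score"], **hit["_source"]})
    return rows


# Hand-written stand-in for a real search response.
sample_response = {
    "hits": {
        "total": {"value": 1},
        "hits": [
            {
                "_index": "my_index",
                "_id": "RHjCJJUBpQvCJGK5hU3T",
                "_score": 1.0,
                "_source": {
                    "title": "Sample Title 1",
                    "text": "This is the first sample document text.",
                    "created_on": "2024-09-22",
                },
            }
        ],
    }
}

rows = summarize_hits(sample_response)
print(rows[0]["title"], rows[0]["score"])
```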

#### 1.2. match query

Now, let's search for any document that contains the word `document` in the `text` field.

In [9]:
response = es.search(
    index=INDEX,
    body={
        "query": {
            "match": {
                "text": "document"
            }
        }
    }
)

n_hits = response["hits"]["total"]["value"]
print(f"Found {n_hits} documents in {INDEX}")

Found 3 documents in my_index



In [10]:
retrieved_documents = response["hits"]["hits"]
retrieved_documents

[{'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RHjCJJUBpQvCJGK5hU3T',
  '_score': 0.13606146,
  '_source': {'title': 'Sample Title 1',
   'text': 'This is the first sample document text.',
   'created_on': '2024-09-22'}},
 {'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RXjCJJUBpQvCJGK5hk0Z',
  '_score': 0.13606146,
  '_source': {'title': 'Sample Title 2',
   'text': 'Here is another example of a document.',
   'created_on': '2024-09-24'}},
 {'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RnjCJJUBpQvCJGK5hk0o',
  '_score': 0.12874341,
  '_source': {'title': 'Sample Title 3',
   'text': 'The content of the third document goes here.',
   'created_on': '2024-09-24'}}]

#### 1.3. range query

Let's find documents that were created on or before `2024-09-24` (`lte` is an inclusive upper bound).

In [11]:
response = es.search(
    index=INDEX,
    body={
        "query": {
            "range": {
                "created_on": {
                    "lte": "2024-09-24"
                }
            }
        }
    }
)

n_hits = response["hits"]["total"]["value"]
print(f"Found {n_hits} documents in {INDEX}")

Found 3 documents in my_index



In [12]:
retrieved_documents = response["hits"]["hits"]
retrieved_documents

[{'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RHjCJJUBpQvCJGK5hU3T',
  '_score': 1.0,
  '_source': {'title': 'Sample Title 1',
   'text': 'This is the first sample document text.',
   'created_on': '2024-09-22'}},
 {'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RXjCJJUBpQvCJGK5hk0Z',
  '_score': 1.0,
  '_source': {'title': 'Sample Title 2',
   'text': 'Here is another example of a document.',
   'created_on': '2024-09-24'}},
 {'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RnjCJJUBpQvCJGK5hk0o',
  '_score': 1.0,
  '_source': {'title': 'Sample Title 3',
   'text': 'The content of the third document goes here.',
   'created_on': '2024-09-24'}}]
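
The `lte` bound used in this query is inclusive, which is why the two documents dated `2024-09-24` are returned as well. The range query also accepts `lt`, `gt`, and `gte`. As a rough sketch, the hypothetical helper `range_query` below builds both variants so the difference is visible in the request body; it only constructs dictionaries and does not contact the cluster.

```python
from typing import Any, Dict

# Bound names accepted by the range query.
ALLOWED_BOUNDS = {"gt", "gte", "lt", "lte"}


def range_query(field: str, **bounds: str) -> Dict[str, Any]:
    """Build a search body with a single range clause (hypothetical helper)."""
    unknown = set(bounds) - ALLOWED_BOUNDS
    if unknown:
        raise ValueError(f"Unsupported range bounds: {sorted(unknown)}")
    return {"query": {"range": {field: dict(bounds)}}}


# Inclusive upper bound: the same body as the query above.
on_or_before = range_query("created_on", lte="2024-09-24")

# Exclusive upper bound: documents dated 2024-09-24 itself are left out.
strictly_before = range_query("created_on", lt="2024-09-24")

print(on_or_before)
print(strictly_before)
```

Passing `body=strictly_before` to `es.search` should, assuming `created_on` was mapped as a date field, return only the document from `2024-09-22`.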

That covers the leaf clauses. To combine several leaf clauses in one query, wrap them in a compound clause.

### 2. Compound clauses

Let's search for documents that meet the following criteria:
- Created on `2024-09-24`
- Have the word `third` in the text field.

In [15]:
response = es.search(
    index=INDEX,
    body={
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "text": "third"
                        }
                    },
                    {
                        "range": {
                            "created_on": {
                                "gte": "2024-09-24",
                                "lte": "2024-09-24"
                            }
                        }
                    }
                ]
            }
        }
    }
)

n_hits = response["hits"]["total"]["value"]
print(f"Found {n_hits} documents in {INDEX}")

Found 1 documents in my_index



In [16]:
retrieved_documents = response["hits"]["hits"]
retrieved_documents

[{'_index': 'my_index',
  '_type': '_doc',
  '_id': 'RnjCJJUBpQvCJGK5hk0o',
  '_score': 1.94566,
  '_source': {'title': 'Sample Title 3',
   'text': 'The content of the third document goes here.',
   'created_on': '2024-09-24'}}]

With the compound `bool` clause, we were able to combine two leaf clauses to find a specific document.
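
Clauses under `must` all contribute to the relevance score. A date restriction is a yes/no condition, so it can instead go under the `bool` query's `filter` key, which runs in filter context and does not affect `_score`. The sketch below shows one way to assemble such a body; `bool_query` is a hypothetical helper, not part of the Elasticsearch client, and it only builds the dictionary.

```python
from typing import Any, Dict, List, Optional

Clause = Dict[str, Any]


def bool_query(
    must: Optional[List[Clause]] = None,
    filters: Optional[List[Clause]] = None,
) -> Dict[str, Any]:
    """Build a search body with a bool compound clause (hypothetical helper)."""
    clause: Dict[str, Any] = {}
    if must:
        # Scored clauses: every one must match and each adds to _score.
        clause["must"] = list(must)
    if filters:
        # Unscored clauses: every one must match, with no effect on _score.
        clause["filter"] = list(filters)
    return {"query": {"bool": clause}}


body = bool_query(
    must=[{"match": {"text": "third"}}],
    filters=[
        {"range": {"created_on": {"gte": "2024-09-24", "lte": "2024-09-24"}}}
    ],
)
print(body)
```

Running `es.search(index=INDEX, body=body)` should select the same document as the `must`-only version, though the reported `_score` would reflect only the `match` clause.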