## Documentation

To read more about the `search_after` parameter, check out the docs [here](https://www.elastic.co/guide/en/elasticsearch/reference/8.15/paginate-search-results.html#search-after).



## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch


es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

## Preparing the index

The `timestamp` field gives the documents a sortable order, which the `search_after` parameter requires. The `id` keyword field can serve as an additional sort key, or as a tiebreaker when two documents share a timestamp.

In [None]:
index_name = 'my_index'
mapping = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
            "value": {"type": "float"},
            "category": {"type": "keyword"},
            "description": {"type": "text"},
            "id": {"type": "keyword"},
        }
    },
}

es.indices.delete(index=index_name, ignore_unavailable=True)
es.indices.create(index=index_name, body=mapping)

## Generating fake data

The base documents will be duplicated to create a total of `100,000` documents. This is done to compare the `from/size` method with the `search_after` method.

In [None]:
base_documents = [
    {
        "category": "A",
        "value": 100,
        "description": "First sample document"
    },
    {
        "category": "B",
        "value": 200,
        "description": "Second sample document"
    },
    {
        "category": "C",
        "value": 300,
        "description": "Third sample document"
    },
    {
        "category": "D",
        "value": 400,
        "description": "Fourth sample document"
    },
    {
        "category": "E",
        "value": 500,
        "description": "Fifth sample document"
    }
]

The `generate_bulk_data` function determines how many times to duplicate the base documents to reach a target of `100,000` documents. It also assigns a unique `id` field, shifts the `value` field by a random amount between -10 and 10, and adds a `timestamp` to each duplicated document, with each timestamp one minute earlier than the previous one.

In [None]:
import random

from datetime import datetime, timedelta


def generate_bulk_data(base_documents, target_size=100_000):
    documents = []
    base_count = len(base_documents)
    duplications_needed = target_size // base_count

    base_timestamp = datetime.now()

    for i in range(duplications_needed):
        for document in base_documents:
            new_doc = document.copy()
            new_doc['id'] = f"doc_{len(documents)}"
            new_doc['timestamp'] = (
                base_timestamp - timedelta(minutes=len(documents))).isoformat()
            new_doc['value'] = document['value'] + random.uniform(-10, 10)
            documents.append(new_doc)

    return documents


documents = generate_bulk_data(base_documents, target_size=100_000)
print(f"Generated {len(documents)} documents")

## Indexing

In [None]:
from tqdm import tqdm

operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({'index': {'_index': index_name}})
    operations.append(document)

response = es.bulk(operations=operations)
pprint(response.body["errors"])

In [None]:
es.indices.refresh(index=index_name)

count = es.count(index=index_name)["count"]
print(f"Indexed {count} documents")

## From / Size method

To use the `from/size` method, include two parameters in your query: `from`, which specifies the number of documents to skip, and `size`, which tells Elasticsearch how many documents to return.

In [None]:
response = es.search(
    index=index_name,
    body={
        "from": 0,
        "size": 10,
        "sort": [
            {"timestamp": "desc"},
            {"id": "desc"}
        ]
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(f"ID: {hit['_source']['id']}")

To retrieve the next batch of documents, adjust the `from` parameter from 0 to 10.

In [None]:
response = es.search(
    index=index_name,
    body={
        "from": 10,
        "size": 10,
        "sort": [
            {"timestamp": "desc"},
            {"id": "desc"}
        ]
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(f"ID: {hit['_source']['id']}")
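
The relationship between a page number and `from` can be captured in a small helper. This is a minimal sketch: `from_size_body` is a hypothetical name, not part of the Elasticsearch client, and it assumes 1-based page numbers and the same sort as the cells above.

```python
def from_size_body(page: int, page_size: int = 10) -> dict:
    """Build a from/size request body for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        # Skip every document that belongs to an earlier page.
        "from": (page - 1) * page_size,
        "size": page_size,
        "sort": [{"timestamp": "desc"}, {"id": "desc"}],
    }


print(from_size_body(page=2)["from"])  # 10, matching the second request above
```

The resulting dictionary could be passed as `body=` to `es.search`, as in the cells above.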

## Search after method

To use the `search_after` method, include the following parameters in your query:

1. **size**: Specifies the number of documents to retrieve in each batch, similar to the `size` parameter in `from/size`.

2. **sort**: The `search_after` method requires specifying one or more fields to sort the results, such as `timestamp` or `id`. Sorting ensures a consistent order for navigating through result pages.
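
To illustrate why a unique sort key matters, the following sketch models cursor-based paging over an in-memory list. It is a simplified model, not Elasticsearch's implementation: `paginate` is a hypothetical helper, the timestamps are plain integers, and the cursor keeps only documents that sort strictly after the last hit.

```python
def paginate(docs, sort_keys, page_size):
    """Walk `docs` in descending order with a search_after-style cursor."""
    def key(doc):
        return tuple(doc[k] for k in sort_keys)

    ordered = sorted(docs, key=key, reverse=True)
    seen, cursor = [], None
    while True:
        # Keep only documents that sort strictly after the cursor.
        remaining = [d for d in ordered if cursor is None or key(d) < cursor]
        page = remaining[:page_size]
        if not page:
            return seen
        seen.extend(d["id"] for d in page)
        cursor = key(page[-1])


docs = [
    {"id": "doc_0", "timestamp": 3},
    {"id": "doc_1", "timestamp": 2},  # shares a timestamp with doc_2
    {"id": "doc_2", "timestamp": 2},
    {"id": "doc_3", "timestamp": 1},
]

print(len(paginate(docs, ["timestamp"], page_size=2)))        # 3: one document is skipped
print(len(paginate(docs, ["timestamp", "id"], page_size=2)))  # 4: the id breaks the tie
```

In this notebook every generated document has a distinct timestamp, so the `id` sort key acts as a safeguard rather than a necessity.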

In [None]:
response = es.search(
    index=index_name,
    body={
        "size": 10,
        "sort": [
            {"timestamp": "desc"},
            {"id": "desc"}
        ]
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(f"ID: {hit['_source']['id']}")
    print(f"Sort values: {hit['sort']}")
    print()

To retrieve the next batch of documents using `search_after`, you'll pass the `sort` values from the last document of the previous batch to the `search_after` parameter in the subsequent query.

In [None]:
last_sort_values = hits[-1]["sort"]
response = es.search(
    index=index_name,
    body={
        "size": 10,
        "sort": [
            {"timestamp": "desc"},
            {"id": "desc"}
        ],
        "search_after": last_sort_values
    }
)

hits = response["hits"]["hits"]
for hit in hits:
    print(f"ID: {hit['_source']['id']}")
    print(f"Sort values: {hit['sort']}")
    print()
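
The two steps above generalize to a loop that keeps passing the last hit's `sort` values until no hits come back. The sketch below is one possible way to write it: `iter_search_after` and its `search` argument are hypothetical names, and `search` is assumed to be any callable that takes a request body and returns a response shaped like the ones above.

```python
from typing import Any, Callable, Iterator

SearchFn = Callable[[dict[str, Any]], dict[str, Any]]


def iter_search_after(
    search: SearchFn,
    sort: list[dict[str, str]],
    page_size: int = 10,
) -> Iterator[list[dict[str, Any]]]:
    """Yield pages of hits until the search returns no more documents."""
    search_after = None
    while True:
        body: dict[str, Any] = {"size": page_size, "sort": sort}
        if search_after is not None:
            body["search_after"] = search_after

        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield hits

        # The sort values of the last hit become the cursor for the next page.
        search_after = hits[-1]["sort"]


# Possible usage with the client from this notebook (requires a running cluster):
# sort = [{"timestamp": "desc"}, {"id": "desc"}]
# for hits in iter_search_after(lambda body: es.search(index=index_name, body=body), sort):
#     print(hits[-1]["_source"]["id"])
```

Taking the search call as a parameter keeps the loop independent of the client, so it can be exercised with a stub instead of a live index.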

## Benchmark

In this benchmark, we assess the performance of two pagination methods, `from/size` and `search_after`, by measuring and comparing their response times. We capture the response time of each method for multiple pages, plot the results, and calculate relevant statistics to provide insights.
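
The benchmark functions in this section time each request with `time.time()`. As an alternative, the sketch below wraps a call with `time.perf_counter()`, a monotonic clock intended for measuring short durations. `timed_ms` is a hypothetical helper and is not used by the cells that follow.

```python
import time
from typing import Any, Callable


def timed_ms(fn: Callable[[], Any]) -> tuple[Any, float]:
    """Call `fn` and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Stand-in workload; a search request could be wrapped in the same way.
result, elapsed = timed_ms(lambda: sum(range(1_000)))
print(f"{elapsed:.3f} ms")
```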

### 1. From / Size test

In [None]:
import time

from tqdm import tqdm


def test_from_size_pagination(es: Elasticsearch, index_name: str, page_size: int = 100, max_pages: int = 50) -> list[tuple[int, float]]:
    timings = []

    for page in tqdm(range(max_pages)):
        start_time = time.time()

        _ = es.search(
            index=index_name,
            body={
                "from": page * page_size,
                "size": page_size,
                "sort": [
                    {"timestamp": "desc"},
                    {"id": "desc"}
                ]
            }
        )

        end_time = time.time()
        final_time = (end_time - start_time) * 1000
        timings.append((page + 1, final_time))

    return timings

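The `from/size` method cannot page past 10,000 hits, the default value of the `index.max_result_window` setting. With `page_size=1000`, the run below fails on the 11th request (`from=10000`), because `from + size` would exceed that limit and Elasticsearch returns an error instead of a page.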

In [None]:
from_size_timings = test_from_size_pagination(
    es=es,
    index_name=index_name,
    page_size=1000,
    max_pages=50
)

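Let's reduce the `page_size` to avoid this problem. With `page_size=200` and 50 pages, the deepest request has `from + size` equal to 10,000, which stays within the limit.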

In [None]:
from_size_timings = test_from_size_pagination(
    es=es,
    index_name=index_name,
    page_size=200,
    max_pages=50
)
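
The difference between the two runs comes down to arithmetic on `from + size`. The sketch below checks both configurations, assuming the index uses the default `index.max_result_window` of 10,000; `deepest_window` is a hypothetical helper.

```python
MAX_RESULT_WINDOW = 10_000  # assumed default of index.max_result_window


def deepest_window(page_size: int, max_pages: int) -> int:
    """Return from + size for the last page of a from/size run."""
    last_from = (max_pages - 1) * page_size
    return last_from + page_size


for page_size in (1000, 200):
    window = deepest_window(page_size, max_pages=50)
    status = "ok" if window <= MAX_RESULT_WINDOW else "rejected"
    print(f"page_size={page_size}: from + size reaches {window} ({status})")
```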

### 2. Search after test

In [None]:
def test_search_after_pagination(es: Elasticsearch, index_name: str, page_size: int = 100, max_pages: int = 50) -> list[tuple[int, float]]:
    timings = []
    search_after = None

    for page in tqdm(range(max_pages)):
        start_time = time.time()

        body = {
            "size": page_size,
            "sort": [
                {"timestamp": "desc"},
                {"id": "desc"}
            ]
        }

        if search_after:
            body["search_after"] = search_after

        response = es.search(
            index=index_name,
            body=body
        )

        hits = response["hits"]["hits"]
        if hits:
            search_after = hits[-1]["sort"]

        end_time = time.time()
        final_time = (end_time - start_time) * 1000
        timings.append((page + 1, final_time))

    return timings

In [None]:
search_after_timings = test_search_after_pagination(
    es, index_name, page_size=200, max_pages=50)

### 3. Plotting & statistics

In [None]:
import matplotlib.pyplot as plt


def plot_comparison(from_size_timings: list[tuple[int, float]], search_after_timings: list[tuple[int, float]]) -> None:
    plt.figure(figsize=(12, 6))

    pages_from_size, times_from_size = zip(*from_size_timings)
    pages_search_after, times_search_after = zip(*search_after_timings)

    plt.plot(pages_from_size, times_from_size, 'b-', label='from/size')
    plt.plot(pages_search_after, times_search_after,
             'g-', label='search_after')

    plt.xlabel('Page number')
    plt.ylabel('Response time (milliseconds)')
    plt.title('Pagination performance comparison')
    plt.legend()
    plt.grid(True)
    plt.show()


plot_comparison(from_size_timings, search_after_timings)

The `search_after` method is more efficient for deep pagination because its response time stays roughly constant as the page number grows. In contrast, `from/size` is fine for shallow pagination but slows down with depth, since every request has to collect and then discard all the hits that come before `from`.

In [None]:
from typing import TypedDict

class TimeStats(TypedDict):
    avg_time: float
    max_time: float
    min_time: float

# Define the type for the outer dictionary
class PagingStats(TypedDict):
    from_size: TimeStats
    search_after: TimeStats


def calculate_stats(from_size_timings: list[tuple[int, float]], search_after_timings: list[tuple[int, float]]) -> PagingStats:
    _, times_from_size = zip(*from_size_timings)
    _, times_search_after = zip(*search_after_timings)

    stats = {
        'from_size': {
            'avg_time': sum(times_from_size) / len(times_from_size),
            'max_time': max(times_from_size),
            'min_time': min(times_from_size)
        },
        'search_after': {
            'avg_time': sum(times_search_after) / len(times_search_after),
            'max_time': max(times_search_after),
            'min_time': min(times_search_after)
        }
    }
    return stats


stats = calculate_stats(from_size_timings, search_after_timings)

print("\nPerformance statistics:")
print("\n- From/Size pagination:")
print(f"Average time: {stats['from_size']['avg_time']:.3f} milliseconds")
print(f"Maximum time: {stats['from_size']['max_time']:.3f} milliseconds")
print(f"Minimum time: {stats['from_size']['min_time']:.3f} milliseconds")

print("\n- Search after pagination:")
print(f"Average time: {stats['search_after']['avg_time']:.3f} milliseconds")
print(f"Maximum time: {stats['search_after']['max_time']:.3f} milliseconds")
print(f"Minimum time: {stats['search_after']['min_time']:.3f} milliseconds")

These statistics support the view that `search_after` is the preferable pagination method for consistent and scalable performance.
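
Averages, maxima and minima can be skewed by a single slow request, such as the first one after connecting. As a complement, the sketch below computes the median and 95th percentile with the standard `statistics` module. `robust_stats` is a hypothetical helper, and the sample timings are made up for illustration, not measured.

```python
import statistics


def robust_stats(timings: list[tuple[int, float]]) -> dict[str, float]:
    """Summarize (page, milliseconds) pairs with outlier-resistant statistics."""
    times = [elapsed for _, elapsed in timings]
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(times, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.fmean(times),
        "median": statistics.median(times),
        "p95": p95,
    }


# Made-up timings: one slow first request followed by steady ones.
sample = [(1, 50.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)]
print(robust_stats(sample))
```

The same function could be applied to `from_size_timings` and `search_after_timings`, since both are lists of `(page, milliseconds)` pairs.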

In [None]:
plt.figure(figsize=(12, 6))
_, times_from_size = zip(*from_size_timings)
_, times_search_after = zip(*search_after_timings)

plt.hist(times_from_size, alpha=0.5, label='from/size', bins=20)
plt.hist(times_search_after, alpha=0.5, label='search_after', bins=20)
plt.xlabel('Response time (milliseconds)')
plt.ylabel('Frequency')
plt.title('Distribution of response times')
plt.legend()
plt.grid(True)
plt.show()

In this histogram, the `search_after` response times (shown in orange) cluster around 2-5 milliseconds, while the `from/size` response times (in blue) are spread more widely, reaching up to 16 milliseconds. This suggests that `search_after` provides more predictable and generally better performance for pagination.

In [None]:
def calculate_degradation(timings: list[tuple[int, float]]) -> float:
    first_page_time = timings[0][1]
    last_page_time = timings[-1][1]
    degradation_factor = last_page_time / first_page_time
    return degradation_factor


from_size_degradation = calculate_degradation(from_size_timings)
search_after_degradation = calculate_degradation(search_after_timings)

print("\nPerformance degradation (Last page time / First page time):")
print(f"- From/Size degradation factor   : {from_size_degradation:.2f}x")
print(f"- Search after degradation factor: {search_after_degradation:.2f}x")

The `search_after` method is far superior in maintaining a stable response time, even for large page numbers. In contrast, `from/size` exhibits significant performance degradation, making it less suitable for deep pagination.
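
A ratio based on one first request and one last request is sensitive to noise in either measurement. One possible refinement, sketched here under the assumption that averaging a few pages at each end is representative, compares the mean of the first `window` pages with the mean of the last `window` pages. `windowed_degradation` is a hypothetical helper and the sample timings are made up.

```python
def windowed_degradation(timings: list[tuple[int, float]], window: int = 5) -> float:
    """Ratio of the mean time of the last `window` pages to the first `window` pages."""
    times = [elapsed for _, elapsed in timings]
    if window < 1 or len(times) < 2 * window:
        raise ValueError("need at least 2 * window timings")
    first_mean = sum(times[:window]) / window
    last_mean = sum(times[-window:]) / window
    return last_mean / first_mean


# Made-up timings: a slow first request, then a steady doubling at depth.
sample = [(page + 1, ms) for page, ms in enumerate([10.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0])]

print(sample[-1][1] / sample[0][1])             # 0.4: the slow first request hides the slowdown
print(windowed_degradation(sample, window=4))   # 1.0: averaged over four pages at each end
```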


## Conclusion

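For larger indexes, where users or jobs may page deep into the results, it's recommended to use the `search_after` method, since `from/size` slows down with depth and cannot go past 10,000 hits by default. For smaller indexes, both methods work well.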