
[BUG] Sometimes aggregations are empty with terminate_after #13288

Open · Rmaan opened this issue Apr 18, 2024 · 10 comments
Labels: bug (Something isn't working), Search:Aggregations

Rmaan commented Apr 18, 2024

Describe the bug

We found weird bugs in our search faceting after moving from Elasticsearch to OpenSearch 2.11. It seems that when terminate_after is passed, the returned buckets are sometimes completely empty (even though every processed doc should fall into a bucket), and sometimes the bucket doc counts are far less than terminate_after * primary_shard_count, even though the search terminated early and all items have a value for the aggregated field.

We couldn't reproduce this issue on OpenSearch 2.9, but 2.10 was affected.

Related component

Search:Aggregations

To Reproduce

An exact reproduction is hard; it seems we need a couple of segments to see the problem. The reported issues occur when we aggregate on a keyword field while filtering on some integer field. When we aggregate on that same integer field instead, the issue doesn't happen.

Sample request:

{
  "track_total_hits": true,
  "_source": ["materials.facet.en.lvl0"],
  "aggregations": {
    "materials_facet": {
      "terms": {
        "field": "materials.facet.en.lvl0"
      }
    }
  },
  "query": {
    "term": {
      "materials.ids": 1
    }
  },
  "size": 100,
  "terminate_after": 1
}

Sample response:

{
  "took": 11,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "products_01",
        "_id": "63033976",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      },
      {
        "_index": "products_01",
        "_id": "31224269",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      },
      {
        "_index": "products_01",
        "_id": "63080864",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "materials_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

Expected behavior

We should see a bucket in aggregations.materials_facet.buckets.

With terminate_after=1, each shard should process at least 1 document; we have 3 shards, so 3 docs in total should be processed. This is confirmed by hits.total.value and by the hits array. But the aggregations don't match the documents you can see in hits.
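
Given that all three hits above share materials.facet.en.lvl0 = "Cashmere#1", the aggregation we expected would look roughly like this (illustrative shape only):

  "aggregations": {
    "materials_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "Cashmere#1", "doc_count": 3 }
      ]
    }
  }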

The issue goes away if we remove terminate_after, but that hurts performance because we have a high number of documents; terminating after 100K items is enough for us.

Additional Details

Host/Environment:

  • Version 2.11
  • AWS managed OpenSearch
getsaurabh02 (Member) commented:

@sandeshkr419 do you want to take a stab at verifying this issue and confirming if it's a bug?

sandeshkr419 (Contributor) commented Apr 29, 2024

Hi @Rmaan,

Thanks for reporting this.

The usage of terminate_after is non-deterministic, but the field actually ensures that processing stops after terminate_after documents have been reached, not terminate_after * primary_shard_count. Basically, you specify how many documents you want to terminate after for the total response, not at the individual shard level.

Now, you may see that sometimes the number of documents processed is much larger than terminate_after, because documents may have already been processed, specifically with concurrent search.

Also, with some of the optimizations we made to terms aggregation, documents are not iterated at all, so terminate_after may never even be breached and all documents will be counted, probably in much less time.

Rmaan (Author) commented Apr 29, 2024

Hi @sandeshkr419

Thanks for taking the time to reply.

> The usage of terminate_after is non-deterministic, but the field actually ensures that processing stops after terminate_after

I set terminate_after=1, but zero documents were processed; there are no buckets. This happens with much higher numbers as well. Is this the intended behavior? Can fewer documents than terminate_after be processed?

> not terminate_after * primary_shard_count

I'm sure that was the behavior; it's clearly stated in the ES v7.10 docs. My team also tested this, and it was the behavior up to OpenSearch 2.9.

> Also, with some of the optimizations we made to terms aggregation, documents are not iterated at all, so terminate_after may never even be breached and all documents will be counted, probably in much less time.

Sorry, I didn't get this part. Does it mean that if I set terminate_after to 1000 docs, it might terminate before 1000 docs? Like, it might stop after 500 docs?

sandeshkr419 (Contributor) commented Apr 30, 2024

> I set terminate_after=1, but zero documents were processed; there are no buckets. This happens with much higher numbers as well. Is this the intended behavior? Can fewer documents than terminate_after be processed?

Ideally, we should not be processing fewer documents than terminate_after; I'm still checking whether I missed a change to the terminate_after logic in 2.10. I know we introduced concurrent search as an experimental feature in 2.10.

@Rmaan Also, I wanted to check whether concurrent search is enabled for the index the queries run on. I vaguely remember that some known bugs with concurrent search were reported and fixed in later iterations: https://github.com/opensearch-project/OpenSearch/pulls?q=is%3Apr+terminate_after+
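
If it helps, the setting can be checked and toggled roughly like this (the setting name below is from the 2.10 experimental concurrent segment search feature; please verify it against the docs for your version):

GET /_cluster/settings?include_defaults=true&filter_path=**.concurrent_segment_search*

PUT /_cluster/settings
{
  "persistent": {
    "search.concurrent_segment_search.enabled": false
  }
}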

> I'm sure that was the behavior; it's clearly stated in the ES v7.10 docs. My team also tested this, and it was the behavior up to OpenSearch 2.9.

I was mistaken; you are right. I'm striking out my responses above to eliminate further confusion.

> Sorry, I didn't get this part. Does it mean that if I set terminate_after to 1000 docs, it might terminate before 1000 docs? Like, it might stop after 500 docs?

I was talking about #11643, where in certain cases aggregation values are read directly from the Lucene index structures instead of iterating over documents. In those cases (segments with no deletes, no _doc_value field), the terminate_after field has no effect, and the counts will be much higher than terminate_after because documents are never iterated in the first place to compute the aggregations. This was introduced in 2.13, and that optimization clearly does not kick in for the response you shared.
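
For illustration, that shortcut can apply to a request shaped roughly like the one below (assumed conditions: an effectively match-all query over segments with no deleted docs); the per-term counts then come from Lucene's terms dictionary rather than from collecting hits, so they can far exceed terminate_after:

{
  "size": 0,
  "terminate_after": 1000,
  "query": { "match_all": {} },
  "aggregations": {
    "materials_facet": {
      "terms": { "field": "materials.facet.en.lvl0" }
    }
  }
}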

Rmaan (Author) commented May 1, 2024

I was also suspicious of concurrent search, but I checked and it was off in the settings; and as I understand it, concurrent search is disabled anyway when terminate_after is set.

So do you think this is a bug? For now we have downgraded to OpenSearch 2.9 and the issue is gone, but we'd like to have access to new features such as hybrid search 😅 If it helps with reproducing, our team can work on providing a sample document set and query.

sandeshkr419 (Contributor) commented:

Thanks for the details @Rmaan.

Yeah, please help me with a sample document set & test query. I can try reproducing it on 2.10 and maybe a later 2.x version. (If you have already run it on 2.13 with the same results, let me know as well; I won't spend time checking the differences then.)

Rmaan (Author) commented May 2, 2024

We reproduced it on 2.10, 2.11, and 2.13, but on 2.9 it works. I will try to provide a reproduction; as I understand it, a fair number of docs is needed, but I'll give it a go.

Rmaan (Author) commented May 16, 2024

Hello

We made a reproduction pack for it: Go code that generates 5K documents and then runs the problematic search that returns no buckets.

Download it here: opensearch_bug_proof.tar.gz

To run it, just do

docker-compose up

and then you will see the result:

opensearch-bug-proof-bug-proof-1  | {"took":20,"timed_out":false,"terminated_early":true,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":3,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"products_01","_id":"7","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}},{"_index":"products_01","_id":"4","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}},{"_index":"products_01","_id":"6","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}}]},"aggregations":{"materials_facet":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[]}}}

It has no buckets in it, even though some docs matched.

You can also connect it to your local OpenSearch by putting credentials at the top of the Go file.

Can you also please reopen the ticket?

Rmaan (Author) commented May 23, 2024

@sandeshkr419 Kind reminder.

BTW, we also reproduced it with fewer complications: simply indexing a couple of docs and aggregating on a keyword field with terminate_after=1 gives no buckets, i.e. it terminates before processing even 1 doc per shard. This doesn't happen with integer fields, for example. A minimal sketch of that simpler reproduction follows.
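
Something along these lines (index and field names here are just examples, not from our real dataset):

PUT /repro
{
  "mappings": {
    "properties": {
      "color": { "type": "keyword" }
    }
  }
}

POST /repro/_doc?refresh=true
{ "color": "red" }

GET /repro/_search
{
  "size": 0,
  "terminate_after": 1,
  "aggregations": {
    "colors": { "terms": { "field": "color" } }
  }
}

We would expect a single bucket { "key": "red", "doc_count": 1 }; on the affected versions, buckets comes back empty.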

sandeshkr419 reopened this May 24, 2024
sandeshkr419 (Contributor) commented:

Thanks @Rmaan for the reproduction SOP.
Let me try to root-cause this and get back.
