
[BUG] Sometimes aggregations are empty with terminate_after #13288

Open · Rmaan opened this issue Apr 18, 2024 · 10 comments
Labels: bug (Something isn't working), Search:Aggregations

Rmaan commented Apr 18, 2024

Describe the bug

We found weird bugs in our search faceting after moving from Elasticsearch to OpenSearch 2.11. It seems that when terminate_after is passed, the returned buckets are sometimes completely empty (even though every processed doc should fall into a bucket), and sometimes the bucket doc counts are far less than terminate_after * primary_shard_count, even though the search terminated early and all items have a value for the aggregated field.

We couldn't reproduce this issue on OpenSearch 2.9, but 2.10 was affected.

Related component

Search:Aggregations

To Reproduce

An exact reproduction is hard; it seems we need a couple of segments to see the problem. The reported issues occur when we aggregate on a keyword field while filtering on some integer field. When we aggregate on that same integer field instead, the issue doesn't happen.

Sample request:

{
  "track_total_hits": true,
  "_source": ["materials.facet.en.lvl0"],
  "aggregations": {
    "materials_facet": {
      "terms": {
        "field": "materials.facet.en.lvl0"
      }
    }
  },
  "query": {
    "term": {
      "materials.ids": 1
    }
  },
  "size": 100,
  "terminate_after": 1
}

Sample response:

{
  "took": 11,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "products_01",
        "_id": "63033976",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      },
      {
        "_index": "products_01",
        "_id": "31224269",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      },
      {
        "_index": "products_01",
        "_id": "63080864",
        "_score": 1,
        "_source": {
          "materials": {
            "facet": {
              "en": {
                "lvl0": "Cashmere#1"
              }
            }
          }
        }
      }
    ]
  },
  "aggregations": {
    "materials_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

Expected behavior

We should see a bucket in aggregations.materials_facet.buckets.

With terminate_after=1, each shard should process at least 1 document; we have 3 shards, so 3 docs in total should be processed. This is confirmed by hits.total.value and by the hits array. But the aggregations don't match the documents you can see in hits.
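
Given that all three hits above share materials.facet.en.lvl0 = "Cashmere#1", the aggregation we expected would look roughly like this (illustrative shape only):

  "aggregations": {
    "materials_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "Cashmere#1", "doc_count": 3 }
      ]
    }
  }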

The issue goes away if we remove terminate_after, but that hurts performance because we have a high number of documents; terminating after 100K items is enough for us.

Additional Details

Host/Environment:

  • Version 2.11
  • AWS managed OpenSearch
getsaurabh02 (Member) commented:

@sandeshkr419 do you want to take a stab at verifying this issue and confirming if it's a bug?

sandeshkr419 (Contributor) commented Apr 29, 2024

Hi @Rmaan,

Thanks for reporting this.

The usage of terminate_after is non-deterministic, but the field actually ensures that processing stops after terminate_after documents have been reached, not terminate_after * primary_shard_count. Basically, you specify how many documents you want to terminate after for the total response, not at the individual shard level.

Now, you may see that sometimes the number of documents processed is much larger than terminate_after, because documents may have already been processed, specifically with concurrent search.

Also, with some of the optimizations we made to terms aggregation, documents are not iterated at all, so terminate_after may never even be breached and all documents will be counted, probably in much less time.

Rmaan (Author) commented Apr 29, 2024

Hi @sandeshkr419

Thanks for taking the time to reply.

> The usage of terminate_after is non-deterministic, but the field actually ensures that processing stops after terminate_after

I set terminate_after=1, but zero documents were processed; there are no buckets. This happens with much higher numbers as well. Is this the intended behavior? Can fewer documents than terminate_after be processed?

> not terminate_after * primary_shard_count

I'm sure that was the behavior; it's clearly stated in the ES v7.10 docs. My team also tested this, and it was the behavior up to OpenSearch 2.9.

> Also, with some of the optimizations we made to terms aggregation, documents are not iterated at all, so terminate_after may never even be breached and all documents will be counted, probably in much less time.

Sorry, I didn't get this part. Does it mean that if I set terminate_after to 1000 docs, it might terminate before 1000 docs? Like, it might stop after 500 docs?

sandeshkr419 (Contributor) commented Apr 30, 2024

> I set terminate_after=1, but zero documents were processed; there are no buckets. This happens with much higher numbers as well. Is this the intended behavior? Can fewer documents than terminate_after be processed?

Ideally, we should not be processing fewer documents than terminate_after; I'm still checking whether I missed a change to the terminate_after logic in 2.10. I know we introduced concurrent search as an experimental feature in 2.10.

@Rmaan Also, I wanted to check whether concurrent search is enabled for the index the queries run on. I vaguely remember that some known bugs with concurrent search were reported and fixed in later iterations: https://github.com/opensearch-project/OpenSearch/pulls?q=is%3Apr+terminate_after+
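
If it helps, the setting can be checked and toggled roughly like this (the setting name below is from the 2.10 experimental concurrent segment search feature; please verify it against the docs for your version):

GET /_cluster/settings?include_defaults=true&filter_path=**.concurrent_segment_search*

PUT /_cluster/settings
{
  "persistent": {
    "search.concurrent_segment_search.enabled": false
  }
}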

> I'm sure that was the behavior; it's clearly stated in the ES v7.10 docs. My team also tested this, and it was the behavior up to OpenSearch 2.9.

I was mistaken; you are right. I'm striking out my responses above to eliminate further confusion.

> Sorry, I didn't get this part. Does it mean that if I set terminate_after to 1000 docs, it might terminate before 1000 docs? Like, it might stop after 500 docs?

I was talking about #11643, where in certain cases aggregation values are read directly from the Lucene index structures instead of iterating over documents. In those cases (segments with no deletes, no _doc_value field), the terminate_after field has no effect, and the counts will be much higher than terminate_after because documents are never iterated in the first place to compute the aggregations. This was introduced in 2.13, and that optimization clearly does not kick in for the response you shared.
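
For illustration, that shortcut can apply to a request shaped roughly like the one below (assumed conditions: an effectively match-all query over segments with no deleted docs); the per-term counts then come from Lucene's terms dictionary rather than from collecting hits, so they can far exceed terminate_after:

{
  "size": 0,
  "terminate_after": 1000,
  "query": { "match_all": {} },
  "aggregations": {
    "materials_facet": {
      "terms": { "field": "materials.facet.en.lvl0" }
    }
  }
}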

Rmaan (Author) commented May 1, 2024

I was also suspicious of concurrent search, but I checked and it was off in the settings; and as I understand it, concurrent search is disabled anyway when terminate_after is set.

So do you think this is a bug? For now we have downgraded to OpenSearch 2.9 and the issue is gone, but we'd like to have access to new features such as hybrid search 😅 If it helps with reproducing, our team can work on providing a sample document set and query.

sandeshkr419 (Contributor) commented:

Thanks for the details @Rmaan.

Yeah, please help me with a sample document set & test query. I can try reproducing it on 2.10 and maybe a later 2.x version. (If you have already run it on 2.13 with the same results, let me know as well; I won't spend time checking the differences then.)

Rmaan (Author) commented May 2, 2024

We reproduced it on 2.10, 2.11, and 2.13, but on 2.9 it works. I will try to provide a reproduction; as I understand it, a fair number of docs is needed, but I'll give it a go.

Rmaan (Author) commented May 16, 2024

Hello

We made a reproduction pack for it: Go code that generates 5K documents and then runs the problematic search that returns no buckets.

Download it here: opensearch_bug_proof.tar.gz

To run it, just do

docker-compose up

and then you will see the result:

opensearch-bug-proof-bug-proof-1  | {"took":20,"timed_out":false,"terminated_early":true,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":3,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"products_01","_id":"7","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}},{"_index":"products_01","_id":"4","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}},{"_index":"products_01","_id":"6","_score":1.0,"_source":{"materials":{"facet":{"en":{"lvl0":"iron"}}}}}]},"aggregations":{"materials_facet":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[]}}}

It has no buckets in it, even though some docs matched.

You can also connect it to your local OpenSearch by putting credentials at the top of the Go file.

Can you also please reopen the ticket?

Rmaan (Author) commented May 23, 2024

@sandeshkr419 Kind reminder.

BTW, we also reproduced it with fewer complications: simply indexing a couple of docs and aggregating on a keyword field with terminate_after=1 gives no buckets, i.e. it terminates before processing even 1 doc per shard. This doesn't happen with integer fields, for example. A minimal sketch of that simpler reproduction follows.
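
Something along these lines (index and field names here are just examples, not from our real dataset):

PUT /repro
{
  "mappings": {
    "properties": {
      "color": { "type": "keyword" }
    }
  }
}

POST /repro/_doc?refresh=true
{ "color": "red" }

GET /repro/_search
{
  "size": 0,
  "terminate_after": 1,
  "aggregations": {
    "colors": { "terms": { "field": "color" } }
  }
}

We would expect a single bucket { "key": "red", "doc_count": 1 }; on the affected versions, buckets comes back empty.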

sandeshkr419 reopened this May 24, 2024
sandeshkr419 (Contributor) commented:

Thanks @Rmaan for the reproduction SOP.
Let me try to root-cause this and get back.
