handling of invalid bulk requests with illegal_argument_exception #815

Open
colinsurprenant opened this issue Nov 20, 2018 · 2 comments

@colinsurprenant
Contributor

ES returns a status 400 illegal_argument_exception error at the bulk-request level for any malformed bulk request. Some examples:

The problem is that all 400 illegal_argument_exception errors are retried indefinitely, but these are not transient errors: the same request will produce the same error every time it is retried.

I am not really sure what the best action is here. If we decide to DLQ at the bulk level for these errors, then the output will become a pass-through into the DLQ, which I do not think is necessarily a good idea.

We basically have 3 choices: a) Retry indefinitely, b) DLQ and c) Stop.

a) Retry indefinitely: If using PQ, retrying indefinitely will result in a backlog growing in PQ, and once the problem is fixed then upon restarting LS, events will flow back correctly into ES and no data should be lost. Without PQ, backpressure will be applied to whichever input is being used and depending on that input, events might get lost upon restarting LS.

b) DLQ: If DLQing these, then all bulk requests will end up in DLQ. Upon restarting LS after applying a fix then all DLQed events will have to be reprocessed which involves a separate and manual process. Note that using PQ has no impact here.

c) Stop: Stopping implies completely stopping the pipeline, which might impact other inputs & filters and might result in losing events if PQ is not enabled.
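
To make the three choices concrete, here is a minimal Python sketch of how a bulk-level response could be mapped to an action. It is not the plugin's code: the names `Action`, `classify_bulk_response`, and the `on_invalid` setting are hypothetical and only illustrate where a configurable policy for 400 illegal_argument_exception would sit.

```python
from enum import Enum


class Action(Enum):
    PROCESS_ITEMS = "process_items"  # 200: inspect per-item results as usual
    RETRY = "retry"                  # choice a) retry indefinitely
    DLQ = "dlq"                      # choice b) write the events to the DLQ
    STOP = "stop"                    # choice c) stop the pipeline


def classify_bulk_response(status, error_type=None, on_invalid=Action.RETRY):
    """Pick an action for a bulk-level HTTP response.

    `on_invalid` is a hypothetical setting selecting choice a), b) or c)
    for non-transient 400 illegal_argument_exception errors.
    """
    if status == 200:
        return Action.PROCESS_ITEMS
    if status == 400 and error_type == "illegal_argument_exception":
        return on_invalid
    # Every other bulk-level status is treated as transient and retried.
    return Action.RETRY
```

With the default `on_invalid=Action.RETRY` this reproduces the current behavior; passing `Action.DLQ` or `Action.STOP` corresponds to choices b) and c).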

@nilskuhn

According to the documentation, since version 8.1.1 all requests to Elasticsearch are made via bulk requests:

The retry policy has changed significantly in the 8.1.1 release. This plugin uses the Elasticsearch bulk API to optimize its imports into Elasticsearch.

Further on, it says that bulk requests whose response status is anything other than 200 are retried indefinitely:

HTTP requests to the bulk API are expected to return a 200 response code. All other response codes are retried indefinitely.

Doesn't this mean the DLQ feature won't work at all with versions >= 8.1.1?

@nilskuhn

Concerning your choice b)
Why do you think all bulk requests would return status codes other than 200? And wouldn't it be possible to cut failed bulks into smaller chunks, retry these, and write failing chunks / bulk requests to the DLQ once they have reached a minimum size?
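
A rough Python sketch of that chunking idea, under stated assumptions: `send_bulk` is a hypothetical callable that returns True when a batch is accepted and False when it is rejected with a non-transient error, and `dlq` is just a list standing in for the dead letter queue. None of these names come from the plugin.

```python
def send_with_bisect(events, send_bulk, dlq, min_size=1):
    """Send a batch; on rejection, split it in half and retry each half.

    Batches that still fail once they are at or below `min_size`
    are appended to `dlq` instead of being retried forever.
    """
    if not events:
        return
    if send_bulk(events):
        return
    # Never split below one event, otherwise the recursion would not end.
    if len(events) <= max(min_size, 1):
        dlq.extend(events)
        return
    mid = len(events) // 2
    send_with_bisect(events[:mid], send_bulk, dlq, min_size)
    send_with_bisect(events[mid:], send_bulk, dlq, min_size)


# Usage with a fake sender that rejects any batch containing event 3.
sent = []

def fake_send(batch):
    if 3 in batch:
        return False
    sent.extend(batch)
    return True

dlq = []
send_with_bisect(list(range(8)), fake_send, dlq)
```

The trade-off is extra requests: isolating one bad event in a batch of n costs on the order of log2(n) additional rejected bulk requests, and a failure that affects every request (for example a bad setting on the output itself) would still send every event to the DLQ.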

@roaksoax added the jira-migrated (issues migrated from previous issue tracker), Team:Logstash, and int-shortlist labels and removed the jira-migrated label on Nov 28, 2022