The problem is that all 400 illegal_argument_exception errors are retried indefinitely, but these are not transient errors and will always produce the same error when retried.
I am not sure what the best action is here. If we decide to DLQ these errors at the bulk level, the output becomes a passthrough into the DLQ, which I do not think is necessarily a good idea.
We basically have 3 choices: a) Retry indefinitely, b) DLQ and c) Stop.
a) Retry indefinitely: With PQ enabled, retrying indefinitely results in a growing backlog in the PQ. Once the problem is fixed and LS is restarted, events flow back into ES correctly and no data should be lost. Without PQ, backpressure is applied to whichever input is being used and, depending on that input, events might get lost when LS is restarted.
b) DLQ: If these errors are DLQed, all bulk requests will end up in the DLQ. After applying a fix and restarting LS, all DLQed events will have to be reprocessed, which is a separate, manual process. Note that using PQ has no impact here.
c) Stop: Stopping means completely stopping the pipeline, which might impact other inputs and filters and might result in losing events if PQ is not enabled.
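To make the three choices concrete, here is a minimal sketch of how a bulk-level failure could be mapped to one of them. It is not the plugin's code: `Action`, `choose_action`, the `on_permanent` setting and the set of status codes treated as transient are all hypothetical names and assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"  # choice a) retry indefinitely
    DLQ = "dlq"      # choice b) write the events to the dead letter queue
    STOP = "stop"    # choice c) stop the pipeline


# Assumption: status codes treated as transient for this illustration only.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def choose_action(status, error_type=None, on_permanent=Action.DLQ):
    """Pick an action for a failure of the whole bulk request.

    `on_permanent` is a hypothetical setting that selects which of the
    three choices applies to errors that will never succeed on retry.
    """
    if status == 200:
        return None  # request accepted, nothing to decide
    if status in TRANSIENT_STATUSES:
        return Action.RETRY
    if status == 400 and error_type == "illegal_argument_exception":
        # Not transient: retrying yields the same error every time.
        return on_permanent
    # Anything else keeps the behaviour described in this issue: retry.
    return Action.RETRY
```

Passing `on_permanent=Action.RETRY` reproduces the behaviour described above, while `Action.DLQ` and `Action.STOP` correspond to choices b) and c).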
According to the documentation, since version 8.1.1 all requests to Elasticsearch are made via bulk requests:
The retry policy has changed significantly in the 8.1.1 release. This plugin uses the Elasticsearch bulk API to optimize its imports into Elasticsearch.
Further on, it says that bulk requests returning any status other than 200 are retried indefinitely:
HTTP requests to the bulk API are expected to return a 200 response code. All other response codes are retried indefinitely.
Doesn't this mean the DLQ feature won't work at all with versions >= 8.1.1?
Concerning your choice b):
Why do you think all bulk requests will return status codes other than 200? And wouldn't it be possible to cut failed bulks into smaller chunks, retry those, and write the failing chunks to the DLQ once they have reached a minimum size?
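The chunking idea could look roughly like the sketch below. It assumes the bulk-level 400 is caused by particular events in the batch (for example one event carrying a bad option value), so that halving the batch eventually isolates them. `send_with_bisect`, `send_bulk`, `dlq` and `min_size` are hypothetical names, not part of the plugin; `send_bulk` stands in for whatever actually issues the bulk request and returns its HTTP status.

```python
def send_with_bisect(events, send_bulk, dlq, min_size=1):
    """Send a batch; on a bulk-level 400, split it in half and retry each half.

    send_bulk(events) -> int  hypothetical callable returning the HTTP status
    dlq                       list standing in for the dead letter queue
    min_size                  chunk size at which a failing chunk is DLQed
    """
    min_size = max(1, min_size)  # a chunk of one event cannot be split further
    if not events:
        return
    status = send_bulk(events)
    if status == 200:
        return
    if status != 400:
        # Assumption: other codes are transient and belong to the normal
        # retry path, which this sketch does not model.
        raise RuntimeError(f"transient bulk failure, status {status}")
    if len(events) <= min_size:
        dlq.extend(events)  # smallest allowed chunk still fails: give up on it
        return
    mid = len(events) // 2
    send_with_bisect(events[:mid], send_bulk, dlq, min_size)
    send_with_bisect(events[mid:], send_bulk, dlq, min_size)
```

With a batch of n events and one bad event, this costs roughly 2·log2(n) extra bulk requests, and only the offending chunk reaches the DLQ instead of the whole batch. If the request is malformed for a reason unrelated to any single event, every chunk fails and the whole batch still ends up in the DLQ.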
ES will return a status 400 illegal_argument_exception error at the bulk request level for any malformed bulk request. Some examples: a version option value with a non-numeric value.