Kafka output stalling on small topic #444
Comments
Issue moved to the Hindsight repository.
DataOps (trink - 2019): automation moved this from In progress to Completed (Sprint 10 May 17), Jul 19, 2019
There may also be a data race in the sandbox Kafka module causing it to miss an ack. Normally this is not an issue, as the ack is treated as a high water mark and the next one will advance it. I was unable to reproduce this while testing, but we are still seeing an occasional stall in production, so more investigation is necessary.
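To make the high water mark point concrete, here is a minimal Python sketch. It is not the sandbox Kafka module's code; `AckTracker` and `demo` are hypothetical names, and the sequence ids are an assumption for illustration. It shows why a missed ack is harmless on a busy topic and, speculatively, why it could leave the checkpoint stuck on a sparse one where no later ack arrives.

```python
class AckTracker:
    """Tracks delivery acks as a high water mark over sequence ids.

    Hypothetical illustration only, not the Hindsight implementation.
    """

    def __init__(self):
        self.sent = 0   # highest sequence id handed to the producer
        self.acked = 0  # highest sequence id acknowledged so far

    def send(self):
        self.sent += 1
        return self.sent

    def ack(self, seq):
        # An ack for seq implicitly covers every earlier message,
        # so a missed ack is repaired by any later one.
        if seq > self.acked:
            self.acked = seq

    def pending(self):
        return self.sent - self.acked


def demo():
    # Busy topic: the ack for message 2 is lost, the ack for 3 covers it.
    busy = AckTracker()
    for _ in range(3):
        busy.send()
    busy.ack(1)
    busy.ack(3)

    # Sparse topic: the ack for message 2 is lost and nothing follows,
    # so the high water mark stays behind until more traffic arrives.
    sparse = AckTracker()
    sparse.send()
    sparse.ack(1)
    sparse.send()

    return busy.pending(), sparse.pending()
```

If this is roughly what happens, the stall would clear on its own once the next message on the topic is acked, which would match the behavior reported in the issue. That link is a guess, not something confirmed here.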
DataOps (trink - 2019): automation moved this from Completed (Sprint 13 July 26) to To do, Sep 4, 2019
The scenario we're seeing is that a Kafka output to a particular topic stalls, i.e. hindsight.cp shows that the output is not processing further data. This eventually leads to backpressure once free disk space runs low. In most cases it resolves itself before free disk space runs low, perhaps because some retry loop finally succeeds.
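For anyone wanting to alert on this before the disk fills, a rough Python sketch of the stall condition follows. It does not parse hindsight.cp; the `(time, input checkpoint, output checkpoint)` sample tuples, the `is_stalled` name, and the `max_idle` threshold are all assumptions, and checkpoints are assumed to be comparable integers that only grow.

```python
from typing import List, Tuple

# (unix time, input checkpoint, output checkpoint); hypothetical sample shape
Sample = Tuple[float, int, int]


def is_stalled(samples: List[Sample], max_idle: float) -> bool:
    """Return True if the output checkpoint has not moved for at least
    max_idle seconds while the input checkpoint kept advancing."""
    if len(samples) < 2:
        return False

    latest_time, latest_in, latest_out = samples[-1]
    idle_since = latest_time
    in_at_idle_start = latest_in

    # Walk backwards over the trailing run of samples that share the
    # latest output checkpoint to find when the output stopped moving.
    for t, cp_in, cp_out in reversed(samples):
        if cp_out != latest_out:
            break
        idle_since, in_at_idle_start = t, cp_in

    idle_long_enough = (latest_time - idle_since) >= max_idle
    input_advanced = latest_in > in_at_idle_start
    return idle_long_enough and input_advanced
```

Requiring the input to advance avoids flagging a topic that is merely quiet, which matters for a sparse topic like this one.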
The topic we're seeing this on appears to be healthy from a kafka perspective, in that it is fully replicated and all leaders are preferred:
Of note, the volume of output to this already sparse topic dropped at roughly the same time the issue arose, apparently by coincidence. Restarting the output appears to resolve the issue as a workaround, though a ping that is in flight at restart time may be lost.
Per IRC discussion, there is likely either an underlying HS async cp bug, a libkafka problem, or a Kafka issue. As far as I can tell, the Kafka cluster is fully operational. I'm guessing the volume drop on input is not coincidental and is related to the issue, but further investigation is required to confirm this.