Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BCN-458 handle specific decoding error when schema registry unavailable #133

Conversation

hchienjo
Copy link
Contributor

Return a specific error that can be used to detect when the schema registry is unavailable. This means that, for example, number one can retry indefinitely in case such an error is encountered so as not to skip messages when consuming but otherwise respect a limited retry mechanism that might be in place.

Instead of trying to check all the net.Error, url.Error or syscall.ECONNREFUSED, the logic has simply been reduced to "in case there is an error during performing the request and we have no more retries left then the registry is not available". This means that, even if we have consecutive timeouts that exceed the retry limit for operations will be considered as the schema registry is not available.

@hchienjo hchienjo self-assigned this Feb 20, 2024
@hchienjo hchienjo force-pushed the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch 4 times, most recently from d8d85ca to 267a7ba Compare February 20, 2024 12:36
@hchienjo hchienjo force-pushed the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch from 267a7ba to 6ecad9f Compare February 20, 2024 12:44
@hchienjo hchienjo marked this pull request as ready for review February 20, 2024 12:50
@hchienjo
Copy link
Contributor Author

Tested on a cluster by deliberately killing the schema registry:

message conversion error: cannot unmarshal: schema registry unavailability caused by: Get "http://10.43.7.86:10450/schemas/ids/2464": dial tcp 10.43.7.86:10450: connect: connection refused

@hchienjo hchienjo force-pushed the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch 2 times, most recently from f566360 to cd4a426 Compare March 4, 2024 11:07
Of course after exhausing all retries.
@hchienjo hchienjo force-pushed the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch from cd4a426 to c82aacb Compare March 4, 2024 11:26
@hchienjo hchienjo force-pushed the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch from 2a3c667 to 4f1eb1e Compare March 5, 2024 05:22
@Sharykhin
Copy link
Contributor

Another potential issue I would like to highlight, in case it was not discussed, is that if we go this way and return UnavailableError for every 5xx I think we should update consumers not to use infinite retry, otherwise consumer group will go into infinite loop, won't it?
I've found 31 places where kafka.InfiniteRetry was in use. So as part of gardening those places should be fixed
cc: @skateinmars

avroregistry/errors.go Show resolved Hide resolved
@skateinmars
Copy link
Member

we should update consumers not to use infinite retry, otherwise consumer group will go into infinite loop, won't it?

In that case the infinite loop is something we want, isn't it?
If decoding fails because of the registry availability for one message, it will fail for the next messages anyway, so there's no point in moving forward consuming the topic (assuming the infinite retry is there because every message is important and we don't want to miss any message)

@Sharykhin
Copy link
Contributor

In that case the infinite loop is something we want, isn't it?

If decoding fails because of the registry availability for one message, it will fail for the next messages anyway, so there's no point in moving forward consuming the topic (assuming the infinite retry is there because every message is important and we don't want to miss any message)

If we are taking about schema registry being unavailable then - yes. But I am afraid if there can be other errors, I don't know for example bad format or whatever, that we might what to skip, it came to my mind when I saw this test case?

@skateinmars
Copy link
Member

if there can be other errors, I don't know for example bad format or whatever, that we might what to skip

In those cases we won't return an UnavailableError so it should be okay?

@Sharykhin
Copy link
Contributor

if there can be other errors, I don't know for example bad format or whatever, that we might what to skip

In those cases we won't return an UnavailableError so it should be okay?

Please correct me If I am wrong, I might miss something, but doesn't this piece of code would return UnavailableError for any unexpected errors? Or such errors as bad format would have non-5xx status, if so then we should be good?

Screenshot 2024-04-11 at 12 23 20

@skateinmars
Copy link
Member

500 type error include database errors (which should be temporary)
400 type errors would be for things like non-existing schema or invalid query 🙇

@Sharykhin Sharykhin merged commit eaf42ab into master Apr 12, 2024
1 check passed
@Sharykhin Sharykhin deleted the BCN-458-handle-specific-decoding-error-when-schema-registry-unavailable branch April 12, 2024 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants