
Conversation

@dantswain (Collaborator)

I believe this should fix #247

In addition to #248 and #249, I think there were two lingering issues:

  1. Not stopping the previous consumer supervisor on rebalance. This accounts for the {:error, {:already_started, _}} errors (see the sketch just after this list).
  2. Not gracefully handling a change in consumer group coordinator node.
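For (1), the shape of the fix is roughly the following. This is a minimal, self-contained sketch with hypothetical module and field names, not the actual kafka_ex supervision tree:

```elixir
defmodule RebalanceSketch do
  # Illustrative only: stop the supervisor left over from the previous
  # generation before starting one for the new assignments. Without the
  # stop, the named start_link below would return
  # {:error, {:already_started, pid}} after a rebalance.
  def handle_rebalance(assignments, %{consumer_sup: old_sup} = state) do
    if is_pid(old_sup) and Process.alive?(old_sup) do
      :ok = Supervisor.stop(old_sup)
    end

    {:ok, new_sup} =
      Supervisor.start_link([], strategy: :one_for_one, name: __MODULE__.ConsumerSup)

    %{state | consumer_sup: new_sup, assignments: assignments}
  end
end
```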

I think (1) is pretty straightforward but I would appreciate some discussion on (2). The previous behavior could fall under the "let it fail" mantra but I think it was causing enough failures to run into restart policies. I don't think what I've done here is overly complicated - it just detects the failure and re-acquires the group metadata then recursively tries itself again.
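Condensed from the join_group changes in the diff below (response parsing and the normal, non-retry reply path are elided here, as they are in the diff excerpts), the shape of the retry for (2) is:

```elixir
true = consumer_group?(state)
{broker, state} = broker_for_consumer_group_with_update(state)

wire_request =
  JoinGroup.create_request(state.correlation_id, @client_id, join_group_request)

response =
  broker
  |> NetworkClient.send_sync_request(wire_request, config_sync_timeout(network_timeout))

if response.error_code == :not_coordinator_for_consumer do
  # The group coordinator moved to another node: refresh the consumer group
  # metadata and retry this same call against the newly discovered broker.
  {_, updated_state} = update_consumer_metadata(state)
  kafka_server_join_group(join_group_request, network_timeout, updated_state)
end
```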

I also had a concern about the right scope for this to happen. Part of me says it should happen in the server implementation because one could argue that the server implementation is intended to abstract away the notion of talking to a cluster, and this is dealing with figuring out which cluster node we need to talk to. Part of me says that following the consumer group coordinator node is part of the behavior of the consumer group itself.

The changes to the server implementation also feel a little un-DRY, but I didn't want to do any significant refactoring until the above concerns were addressed.

If this approach seems reasonable, we may actually want to implement something similar for other server transactions - e.g., update metadata whenever we get :not_leader_for_partition.
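For concreteness, the analogous handling for a leader change might look something like this hypothetical sketch. None of this is in the PR, and helper names such as `update_metadata_and_find_leader/2` are placeholders rather than kafka_ex functions:

```elixir
# Hypothetical sketch (not part of this PR): the same detect/refresh/retry
# shape applied to a partition-leadership change.
defp send_with_leader_retry(request, broker, timeout, state) do
  response = NetworkClient.send_sync_request(broker, request, timeout)

  if response.error_code == :not_leader_for_partition do
    # Leadership moved: refresh cluster metadata and retry against whichever
    # broker the fresh metadata reports as the leader for this partition.
    {new_broker, updated_state} = update_metadata_and_find_leader(request, state)
    send_with_leader_retry(request, new_broker, timeout, updated_state)
  else
    {response, state}
  end
end
```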

I'll update #247 with instructions on how to reproduce the original problem.

@bjhaid and @joshuawscott I know it's the holiday week but I'd really appreciate your input on this. @CJPoll this hopefully fixes our issue.

This gets rid of the `{:error, {:already_started, _}}` issues that were (I think) preventing consumer groups from gracefully handling rebalances.
Inline review comments on the diff:

```diff
  true = consumer_group?(state)
  {broker, state} = broker_for_consumer_group_with_update(state)
- request = JoinGroup.create_request(state.correlation_id, @client_id, join_group_request)
+ wire_request = JoinGroup.create_request(state.correlation_id, @client_id, join_group_request)
```

> Line is too long (max is 80, was 97).

```diff
  wire_request = JoinGroup.create_request(state.correlation_id, @client_id, join_group_request)
  response = broker
- |> NetworkClient.send_sync_request(request, config_sync_timeout(network_timeout))
+ |> NetworkClient.send_sync_request(wire_request, config_sync_timeout(network_timeout))
```

> Line is too long (max is 80, was 92).


```diff
+ if response.error_code == :not_coordinator_for_consumer do
+   {_, updated_state} = update_consumer_metadata(state)
+   kafka_server_join_group(join_group_request, network_timeout, updated_state)
```

> Line is too long (max is 80, was 81).

```diff
  true = consumer_group?(state)
  {broker, state} = broker_for_consumer_group_with_update(state)
- request = SyncGroup.create_request(state.correlation_id, @client_id, sync_group_request)
+ wire_request = SyncGroup.create_request(state.correlation_id, @client_id, sync_group_request)
```

> Line is too long (max is 80, was 97).

```diff
  wire_request = SyncGroup.create_request(state.correlation_id, @client_id, sync_group_request)
  response = broker
- |> NetworkClient.send_sync_request(request, config_sync_timeout(network_timeout))
+ |> NetworkClient.send_sync_request(wire_request, config_sync_timeout(network_timeout))
```

> Line is too long (max is 80, was 92).


```diff
+ if response.error_code == :not_coordinator_for_consumer do
+   {_, updated_state} = update_consumer_metadata(state)
+   kafka_server_sync_group(sync_group_request, network_timeout, updated_state)
```

> Line is too long (max is 80, was 81).

```diff
  true = consumer_group?(state)
  {broker, state} = broker_for_consumer_group_with_update(state)
- request = LeaveGroup.create_request(state.correlation_id, @client_id, request)
+ wire_request = LeaveGroup.create_request(state.correlation_id, @client_id, request)
```

> Line is too long (max is 80, was 87).


```diff
  response = broker
- |> NetworkClient.send_sync_request(request, config_sync_timeout(network_timeout))
+ |> NetworkClient.send_sync_request(wire_request, config_sync_timeout(network_timeout))
```

> Line is too long (max is 80, was 92).

```diff
  true = consumer_group?(state)
  {broker, state} = broker_for_consumer_group_with_update(state)
- request = Heartbeat.create_request(state.correlation_id, @client_id, request)
+ wire_request = Heartbeat.create_request(state.correlation_id, @client_id, request)
```

> Line is too long (max is 80, was 86).

```diff
  wire_request = Heartbeat.create_request(state.correlation_id, @client_id, request)
  response = broker
- |> NetworkClient.send_sync_request(request, config_sync_timeout(network_timeout))
+ |> NetworkClient.send_sync_request(wire_request, config_sync_timeout(network_timeout))
```

> Line is too long (max is 80, was 92).

@dantswain (Collaborator, Author)

I will address the Ebert issues before final merge, pending requested discussion.

@joshuawscott (Member) left a comment:

This seems fine to me. I guess it could be debated where to put the code, but there is a lot of refactoring that should be done, and it feels like it would be better to undertake that after the bugs are worked out sufficiently?

@sourcelevel-bot

Ebert has finished reviewing this Pull Request and has found:

  • 5 fixed issues! 🎉

But beware that this branch is 2 commits behind the kafkaex:master branch, and a review of an up to date branch would produce more accurate results.

You can see more details about this review at https://ebertapp.io/github/kafkaex/kafka_ex/pulls/251.

@dantswain (Collaborator, Author)

I'm going to merge this but holding off until Monday to do a release.

@dantswain dantswain merged commit 93facc0 into master Nov 22, 2017
@dantswain dantswain deleted the consumer_coordinator_fix branch November 22, 2017 16:09
robotarmy pushed a commit to RAM9/kafka_ex that referenced this pull request Apr 7, 2025

Successfully merging this pull request may close these issues:

A broker going down or coming up causes a never-ending loop of rebalances

4 participants