Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

produce_sync raises on timeout #468

Closed
GPrimola opened this issue Aug 20, 2021 · 8 comments
Closed

produce_sync raises on timeout #468

GPrimola opened this issue Aug 20, 2021 · 8 comments

Comments

@GPrimola
Copy link

Hey guys,

I've been using Kaffe and detected an undesired behaviour.

When brod tries to connect to Kafka but the connections times out, it raises when calling produce_sync.

I believe a proper behaviour would be to return {error, timeout}, like it does with {error, errconection}.

The exception is:

[error] GenServer :kaffe_producer_client terminating
** (stop) [{{'localhost', 29092}, {{{:kpro_req, #Reference<0.3351001794.2516844547.112162>, :api_versions, 0, false, []}, :timeout}, [{:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]}, {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]}, {:kpro_connection, :query_api_versions, 4, [file: 'src/kpro_connection.erl', line: 246]}, {:kpro_connection, :init_connection, 2, [file: 'src/kpro_connection.erl', line: 233]}, {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}]}}]
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:554: :brod_client.ensure_metadata_connection/1
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:300: :brod_client.handle_info/2
    (stdlib 3.13.2) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13.2) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :init
State: {:state, :kaffe_producer_client, [{'localhost', 29092}], :undefined, [], :undefined, :undefined, [auto_start_producers: true, allow_topic_auto_creation: false, default_producer_config: [required_acks: -1, ack_timeout: 1000, partition_buffer_limit: 512, partition_onwire_limit: 1, max_batch_size: 1048576, max_retries: 3, retry_backoff_ms: 500, compression: :no_compression, min_compression_batch_size: 1024]], :kaffe_producer_client}
** (exit) exited in: :gen_server.call(:kaffe_producer_client, {:get_metadata, "onboarding"}, :infinity)
    ** (EXIT) [{{'localhost', 29092}, {{{:kpro_req, #Reference<0.3351001794.2516844547.112162>, :api_versions, 0, false, []}, :timeout}, [{:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]}, {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]}, {:kpro_connection, :query_api_versions, 4, [file: 'src/kpro_connection.erl', line: 246]}, {:kpro_connection, :init_connection, 2, [file: 'src/kpro_connection.erl', line: 233]}, {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}]}}]
    (stdlib 3.13.2) gen_server.erl:246: :gen_server.call/3
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:839: :brod_client.safe_gen_call/3
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:743: :brod_client.get_partitions_count/3
    (kaffe 1.20.0) lib/kaffe/producer.ex:157: Kaffe.Producer.group_by_partition/3
    (kaffe 1.20.0) lib/kaffe/producer.ex:121: Kaffe.Producer.produce_list/3

Steps to reproduce

produce_sync :: {error, errconnection}

  1. Be sure your Kafka server is totally down
  2. Start your application
  3. brod will try to connect to Kafka server, but will warn (mind the reason closed):
[warn] :brod_client [#PID<0.314.0>] :kaffe_producer_client is terminating
reason: [
{{'localhost', 29092},
 {{{:kpro_req, #Reference<0.3351001794.2516844546.115076>, :api_versions, 0,
    false, []}, :closed},
  [
    {:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]},
    {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]},
    {:kpro_connection, :query_api_versions, 4,
     [file: 'src/kpro_connection.erl', line: 246]},
    {:kpro_connection, :init_connection, 2,
     [file: 'src/kpro_connection.erl', line: 233]},
    {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
  ]}}
]
  1. Run produce_sync will warn and return {error, client_down}.
[warn] Error while grouping by partition :client_down
{:error, :client_down}

There's no exception raising here, even when brod is trying to connect and you repeatedly try to produce_sync.

produce_sync :: raise

  1. Be sure your Kafka server is totally down
  2. Start a server on the same host and port your app would connect to Kafka. Example: localhost:9092
  3. Start your application
  4. brod will try to connect to Kafka server, but will warn (mind the reason timeout):
[warn] :brod_client [#PID<0.460.0>] :kaffe_producer_client is terminating
reason: [
  {{'localhost', 29092},
   {{{:kpro_req, #Reference<0.3351001794.2516844549.111672>, :api_versions, 0,
      false, []}, :timeout},
    [
      {:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]},
      {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]},
      {:kpro_connection, :query_api_versions, 4,
       [file: 'src/kpro_connection.erl', line: 246]},
      {:kpro_connection, :init_connection, 2,
       [file: 'src/kpro_connection.erl', line: 233]},
      {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]},
      {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
    ]}}
]
  1. Run produce_sync while brod is trying to connect (ie. the gen_server is not sleeping waiting to kafka server restablish). It will:
  • lock the execution for a little while (I assume it's waiting to connect to Kafka server before send the message)
  • log a warn message
  • raise (not returning anything):
[warn] :brod_client [#PID<0.473.0>] :kaffe_producer_client is terminating
reason: [
  {{'localhost', 29092},
   {{{:kpro_req, #Reference<0.3351001794.2516844547.112162>, :api_versions, 0,
      false, []}, :timeout},
    [
      {:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]},
      {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]},
      {:kpro_connection, :query_api_versions, 4,
       [file: 'src/kpro_connection.erl', line: 246]},
      {:kpro_connection, :init_connection, 2,
       [file: 'src/kpro_connection.erl', line: 233]},
      {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]},
      {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
    ]}}
]

[error] GenServer :kaffe_producer_client terminating
** (stop) [{{'localhost', 29092}, {{{:kpro_req, #Reference<0.3351001794.2516844547.112162>, :api_versions, 0, false, []}, :timeout}, [{:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]}, {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]}, {:kpro_connection, :query_api_versions, 4, [file: 'src/kpro_connection.erl', line: 246]}, {:kpro_connection, :init_connection, 2, [file: 'src/kpro_connection.erl', line: 233]}, {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}]}}]
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:554: :brod_client.ensure_metadata_connection/1
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:300: :brod_client.handle_info/2
    (stdlib 3.13.2) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13.2) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :init
State: {:state, :kaffe_producer_client, [{'localhost', 29092}], :undefined, [], :undefined, :undefined, [auto_start_producers: true, allow_topic_auto_creation: false, default_producer_config: [required_acks: -1, ack_timeout: 1000, partition_buffer_limit: 512, partition_onwire_limit: 1, max_batch_size: 1048576, max_retries: 3, retry_backoff_ms: 500, compression: :no_compression, min_compression_batch_size: 1024]], :kaffe_producer_client}
** (exit) exited in: :gen_server.call(:kaffe_producer_client, {:get_metadata, "onboarding"}, :infinity)
    ** (EXIT) [{{'localhost', 29092}, {{{:kpro_req, #Reference<0.3351001794.2516844547.112162>, :api_versions, 0, false, []}, :timeout}, [{:kpro_lib, :send_and_recv_raw, 4, [file: 'src/kpro_lib.erl', line: 70]}, {:kpro_lib, :send_and_recv, 5, [file: 'src/kpro_lib.erl', line: 81]}, {:kpro_connection, :query_api_versions, 4, [file: 'src/kpro_connection.erl', line: 246]}, {:kpro_connection, :init_connection, 2, [file: 'src/kpro_connection.erl', line: 233]}, {:kpro_connection, :init, 4, [file: 'src/kpro_connection.erl', line: 170]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}]}}]
    (stdlib 3.13.2) gen_server.erl:246: :gen_server.call/3
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:839: :brod_client.safe_gen_call/3
    (brod 3.15.6) /Users/giolucas/projects/jaya/swap/norbert/deps/brod/src/brod_client.erl:743: :brod_client.get_partitions_count/3
    (kaffe 1.20.0) lib/kaffe/producer.ex:157: Kaffe.Producer.group_by_partition/3
    (kaffe 1.20.0) lib/kaffe/producer.ex:121: Kaffe.Producer.produce_list/3

This error happens only when brod is trying to connect to kafka server and you run produce_sync. When the gen_server is sleeping and you run produce_sync, it returns {error, client_down}.

Expected behaviour

I'd expect that, even though the connection times out, produce_sync should return something like {error, timeout}, and not raise or exit for any reason.

@GPrimola
Copy link
Author

GPrimola commented Sep 3, 2021

@zmstone sorry to call you, but since you're the main contributor, I'm needing some attention with this.
I know you can be busy to give attention to this, but could you at least point me on the right direction to tackle this? 🙂
Thank you very much and sorry to bother you!

@zmstone
Copy link
Contributor

zmstone commented Sep 3, 2021

Hi @GPrimola , sorry :(

it seems to me that kafka did not reply the API version query request.
are you sure the remote end is a plaintext but not an SSL listener ?

which Kafka version is it ? and which brod version is it ?

@v0idpwn
Copy link
Contributor

v0idpwn commented Sep 3, 2021

@zmstone the issue here is that if the client call times out (instead of failing because client is already dead), safe_gen_call doesn't handle the timeout.

Do you think it should return an error tuple for this case or leave it as it is? I don't have a strong opinion on that.

I just wrote a working fix, I can submit it, but I'm not sure of the impact (possibility of being a breaking change, etc)

@GPrimola
Copy link
Author

GPrimola commented Sep 3, 2021

@zmstone No problem at all! 🙂
The version I'm using is brod 3.0 from kaffe 1.20.0, but I think the issue is related to what @v0idpwn said.

Regarding should that be done on brod, I'm in the same page of @v0idpwn, since I don't know whether brod should handle this, or maybe kaffe could do it too.
But in order to get more insights on this, I've opened the issue spreedly/kaffe#113 to Kaffe so the guys there could also jump on this.

@v0idpwn
Copy link
Contributor

v0idpwn commented Sep 3, 2021

you can easily handle this exception on your application side, though.

@GPrimola
Copy link
Author

GPrimola commented Sep 3, 2021

Exaclty, and that's what I'm doing. However I wonder should I be doing this or should the lib brod handle this better, as it does with {error, errconection}. 🤔

@zmstone
Copy link
Contributor

zmstone commented Sep 3, 2021

safe_gen_call catches noproc exception but not other errors.

brod/src/brod_client.erl

Lines 840 to 841 in f88130e

catch exit : {noproc, _} ->
{error, client_down}

I think it's safe and reasonable to capture other exit reasons too for a more consistent behaviour.

BTW, the reported stacktrace does not look like it's from brod 3.0.0

@zmstone
Copy link
Contributor

zmstone commented Feb 6, 2022

fixed in: #492
will cut a release soon.

@zmstone zmstone closed this as completed Feb 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants