
Timeout errors #389

Closed
andrielfn opened this issue Nov 28, 2019 · 18 comments · Fixed by #405

Comments

@andrielfn

Hi all,

It all started in our Heroku review environments when we upgraded our application from Erlang 21.3.7 / Elixir 1.8.2 to Erlang 22.0.7 / Elixir 1.9.1. We noticed that, for whatever reason, downgrading Erlang back to 21.3.7 made the timeouts go away.

At first, messages were not being consumed or produced, and thousands of timeout errors were flooding the logs. We then bumped to kafka_ex 0.10, which seemed to improve things a bit (fewer timeouts in the logs, and messages being consumed again).

However, in the last week we started to get the same timeouts in production, though at a lower frequency (~10 timeouts/day).

Some extra information:

  • In the Heroku review-envs we use Basic Kafka instances (Multi-Tenant);
  • In production, we use a Private Standard Kafka instance;
  • We currently have 28 topics created;

I also tried kafka_ex from the master branch, with the new kayrock-based API, but it didn't seem to help. Lots of timeouts keep flooding the logs.

The error on 0.10:

2019-11-22T22:25:24.679435+00:00 app[web.1]: 22:25:24.672 [error] GenServer #PID<0.2066.0> terminating
2019-11-22T22:25:24.679445+00:00 app[web.1]: ** (MatchError) no match of right hand side value: :timeout
2019-11-22T22:25:24.679448+00:00 app[web.1]: (kafka_ex) lib/kafka_ex/gen_consumer.ex:678: KafkaEx.GenConsumer.consume/1
2019-11-22T22:25:24.679450+00:00 app[web.1]: (kafka_ex) lib/kafka_ex/gen_consumer.ex:630: KafkaEx.GenConsumer.handle_info/2
2019-11-22T22:25:24.679452+00:00 app[web.1]: (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
2019-11-22T22:25:24.679454+00:00 app[web.1]: (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
2019-11-22T22:25:24.679458+00:00 app[web.1]: (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
2019-11-22T22:25:24.679461+00:00 app[web.1]: Last message: :timeout
2019-11-22T22:25:24.679465+00:00 app[web.1]: State: %KafkaEx.GenConsumer.State{acked_offset: 0, auto_offset_reset: :earliest, commit_interval: 5000, commit_threshold: 500, committed_offset: 0, consumer_module: FOO.Consumer, consumer_state: nil, current_offset: 0, fetch_options: [auto_commit: false, worker_name: #PID<0.2289.0>], group: "foo.xxx.yyy", last_commit: -576460703586, partition: 0, topic: "bar.baz.bla", worker_name: #PID<0.2289.0>}
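
For context, the MatchError indicates that the fetch call returned the bare atom :timeout where a list of fetch responses was expected, so the pattern match in consume/1 fails and the GenServer crashes. A minimal sketch of that failure mode (not the actual kafka_ex source; topic, partition, and offset are placeholders taken from the state dump above):

# KafkaEx.fetch/3 normally returns a list of %KafkaEx.Protocol.Fetch.Response{}
# structs; under this bug it can return the bare atom :timeout instead,
# which fails the `=` match below and takes the consumer process down.
[response | _] =
  KafkaEx.fetch("bar.baz.bla", 0,
    offset: 0,
    auto_commit: false
  )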

And the error now with master:

2019-11-28T19:30:08.763137+00:00 app[web.1]: 19:30:08.762 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.816252+00:00 app[web.1]: 19:30:08.815 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.863527+00:00 app[web.1]: 19:30:08.863 [error] Receiving data from broker "xxxxxxxxx.amazonaws.com":9096 failed with :timeout
2019-11-28T19:30:08.869546+00:00 app[web.1]: 19:30:08.869 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.879048+00:00 app[web.1]: 19:30:08.878 [error] GenServer #PID<0.9105.93> terminating
2019-11-28T19:30:08.879052+00:00 app[web.1]: ** (stop) {:error, :timeout}
2019-11-28T19:30:08.879053+00:00 app[web.1]: Last message: :timeout
2019-11-28T19:30:08.879063+00:00 app[web.1]: State: %KafkaEx.GenConsumer.State{acked_offset: 0, api_versions: %{fetch: 0, offset_commit: 0, offset_fetch: 0}, auto_offset_reset: :earliest, commit_interval: 5000, commit_threshold: 500, committed_offset: 0, consumer_module: FOO.Consumer, consumer_state: nil, current_offset: 0, fetch_options: [auto_commit: false, worker_name: #PID<0.19074.91>], generation_id: 136, group: "foo-51686.foo_service", last_commit: -576424651131, member_id: "kafka_ex-fe6cdc27-cf12-4c45-9beb-76fcc2a1cdf8", partition: 0, topic: "foo.zzz.yyy", worker_name: #PID<0.19074.91>}
2019-11-28T19:30:08.926883+00:00 app[web.1]: 19:30:08.926 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.971555+00:00 app[web.1]: 19:30:08.971 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096

The "Successfully connected..." and "Receiving data from broker..." messages are also logged indefinitely, many per second.

@joshuawscott
Member

I would try increasing the sync_timeout setting if you haven't already. It defaults to 5000 (ms), but I would recommend setting it higher, say 15_000.
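
For reference, a minimal sketch of where that setting lives, assuming the standard kafka_ex application config (the value is just the suggestion above):

# config/config.exs
import Config

# Raise kafka_ex's synchronous request timeout from the default to 15 s.
config :kafka_ex,
  sync_timeout: 15_000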

@andrielfn
Author

Hey @joshuawscott, I forgot to mention it, but increasing the sync_timeout was one of my attempts. I've just tried one more time with 15_000, and the timeouts are still there.

I also noticed this other error, which I believe was not happening on 0.10; I'm not sure whether it's related:

2019-11-28T21:35:27.891234+00:00 app[web.1]: 21:35:27.888 [debug] Shutting down worker #PID<0.28898.1>, reason: {:function_clause, [{KafkaEx.New.Client, :handle_info, [{:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.17549>, :tls_connection, :undefined}, [#PID<0.28902.1>, #PID<0.28901.1>]}}, %KafkaEx.New.Client.State{allow_auto_topic_creation: true, api_versions: %{11 => {0, 4}, 39 => {0, 1}, 34 => {0, 1}, 26 => {0, 1}, 15 => {0, 2}, 20 => {0, 3}, 17 => {0, 1}, 25 => {0, 1}, 13 => {0, 2}, 0 => {0, 7}, 8 => {0, 6}, 36 => {0, 1}, 7 => {0, 2}, 1 => {0, 10}, 32 => {0, 2}, 37 => {0, 1}, 35 => {0, 1}, 3 => {0, 7}, 6 => {0, 5}, 2 => {0, 5}, 41 => {0, 1}, 33 => {0, 1}, 42 => {0, 1}, 43 => {0, 0}, 10 => {0, 2}, 9 => {0, 5}, 19 => {0, 3}, 14 => {0, 2}, 5 => {0, 1}, 18 => {0, 2}, 31 => {0, 1}, 22 => {0, 1}, 29 => {0, 1}, 21 => {0, 1}, 27 => {0, 0}, 24 => {0, 1}, 40 => {0, 1}, 30 => {0, 1}, 23 => {0, ...}, 28 => {...}, ...}, bootstrap_uris: [{"xxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxxxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxx.amazonaws.com", 9096}, {"xxxxxxxxx.amazonaws.com", 9096}], cluster_metadata: %KafkaEx.New.ClusterMetadata{brokers: %{0 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 0, port: 9096, rack: "us-east-1e", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17550>, :tls_connection, :undefined}, [#PID<0.28904.1>, #PID<0.28903.1>]}, ssl: true}}, 1 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 1, port: 9096, rack: "us-east-1c", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17552>, :tls_connection, :undefined}, [#PID<0.28908.1>, #PID<0.28907.1>]}, ssl: true}}, 2 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 2, port: 9096, rack: "us-east-1e", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17549>, :tls_connection, :undefined}, [#PID<0.28902.1>, #PID<0.28901.1>]}, ssl: true}}, 3 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 3, port: 9096, rack: "us-east-1a", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17553>, :tls_connection, :undefined}, [#PID<0.28910.1>, #PID<0.28909.1>]}, ssl: true}}, 5 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 5, port: 9096, rack: "us-east-1d", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17548>, :tls_connection, :undefined}, [#PID<0.28900.1>, #PID<0.28899.1>]}, ssl: true}}, 6 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 6, port: 9096, rack: "us-east-1d", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17551>, :tls_connection, :undefined}, [#PID<0.28906.1>, #PID<0.28905.1>]}, ssl: true}}, 7 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 7, port: 9096, rack: "us-east-1c", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17554>, :tls_connection, :undefined}, [#PID<0.28912.1>, #PID<0.28911.1>]}, ssl: true}}, 8 => %KafkaEx.New.Broker{host: "xxxxxxxxx.amazonaws.com", node_id: 8, port: 9096, rack: "us-east-1a", socket: %KafkaEx.Socket{socket: {:sslsocket, {:gen_tcp, #Port<0.17555>, :tls_connection, :undefined}, [#PID<0.28914.1>, #PID<0.28913.1>]}, ssl: true}}}, consumer_group_coordinators: %{"xxxxx.yyyy" => 7}, controller_id: 7, topics: %{"xxxxx.yyyy.zzzzz" => %KafkaEx.New.Topic{is_internal: 0, name: "xxxxx.yyyy.zzzzz", partition_leaders: %{0 => 7}, partitions: 
[%KafkaEx.New.Partition{isr: [7, 5, 0], leader: 7, partition_id: 0, replicas: [7, 5, 0]}]}}}, consumer_group_for_auto_commit: "xxxxx.yyyy", consumer_group_update_interval: 30000, correlation_id: 160, metadata_update_interval: 30000, ssl_options: [cert: <<48, 130, 3, 67, 48, 130, 2, 43, 160, 3, 2, 1, 2, 2, 1, 0, 48, 13, 6, 9, 42, 134, 72, 134, 247, 13, 1, 1, 11, 5, 0, 48, ...>>, key: {:RSAPrivateKey, <<48, 130, 4, 163, 2, 1, 0, 2, 130, 1, 1, 0, 221, 216, 88, 132, 223, 93, 49, 100, 242, 20, 13, 106, 143, 177, 37, 95, 120, ...>>}], use_ssl: true, worker_name: #PID<0.28898.1>}], [file: 'lib/kafka_ex/new/client.ex', line: 179]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]}
2019-11-28T21:35:27.913314+00:00 app[web.1]: 21:35:27.899 [error] GenServer #PID<0.28898.1> terminating
2019-11-28T21:35:27.913335+00:00 app[web.1]: ** (FunctionClauseError) no function clause matching in KafkaEx.New.Client.handle_info/2

@dantswain
Collaborator

Is it possible that the ssl config changed?

Also, would it be possible to try different versions of Erlang? Can you try 22.1?

I vaguely remember running into some ssl-related issues with different Erlang versions, but I don't remember the details :(
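
(For anyone following along: the ssl config here means the kafka_ex use_ssl / ssl_options settings, which are handed to the underlying :ssl connection. A minimal sketch with hypothetical certificate paths:)

# config/config.exs — hypothetical TLS client setup; the paths are placeholders.
import Config

config :kafka_ex,
  use_ssl: true,
  ssl_options: [
    cacertfile: "/etc/kafka/ca-cert.pem",
    certfile: "/etc/kafka/client-cert.pem",
    keyfile: "/etc/kafka/client-key.pem"
  ]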

@andrielfn
Author

Is it possible that the ssl config changed?

Which config do you mean exactly? I'd say no, though; we haven't changed any config in the app or in the Kafka cluster.

Can you try 22.1?

Actually, we are on 22.1.6 already. I've just tried 22.1.8 as well, but it's still the same. And to confirm that the Erlang version is somehow related, I downgraded to the initial version and it works perfectly: no timeouts at all.

@@ elixir_buildpack.config:1 @@
-erlang_version=22.1.8
+erlang_version=21.3.7

@andrielfn
Author

While 21.3.7 is working fine, the next closest version available in the Heroku buildpack (22.0.3) already raises the timeouts, so it's potentially something related to the major upgrade.

@dmitry-semenov

Confirming the issue: we had to downgrade to the same version (21.3.7) to avoid timeout issues.
While researching the problem, we suspected this OTP change to be the root cause: https://github.com/erlang/otp/pull/2224/files

@Argonus
Contributor

Argonus commented Mar 12, 2020

Hi, any news on this?
@joshuawscott

@jbruggem
Collaborator

FYI, I created an issue on the OTP project.

https://bugs.erlang.org/browse/ERL-1213

I'm not sure what we can really do in the meantime; we could try silently re-creating a socket every time there's a timeout, but that sounds like a bad idea.
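
For illustration, a rough sketch of what that workaround might look like: retry a broker read once after reconnecting. This is a hypothetical module; the reconnect details and option lists are placeholders, not kafka_ex API.

defmodule SocketRetryWorkaround do
  # On {:error, :timeout}, close the TLS socket, reconnect to the broker,
  # and retry the receive once. Silently swallowing timeouts like this
  # hides the underlying OTP bug, which is why it's a bad idea.
  def recv_with_reconnect(socket, host, port, timeout) do
    case :ssl.recv(socket, 0, timeout) do
      {:error, :timeout} ->
        :ok = :ssl.close(socket)

        {:ok, new_socket} =
          :ssl.connect(String.to_charlist(host), port, [:binary, active: false], timeout)

        :ssl.recv(new_socket, 0, timeout)

      other ->
        other
    end
  end
end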

@jbruggem
Collaborator

jbruggem commented Apr 9, 2020

I tested many versions of OTP. The problem appears in OTP 21.3. With OTP 21.2 and lower, things are fine:

https://github.com/jbruggem/kafka_ex_ssl_bug#run-many-times

@dantswain
Collaborator

Thank you for following up on this so thoroughly!

@dantswain
Collaborator

@jbruggem what's the conclusion here? I'll get kayrock patched ASAP, but my understanding was that's not the bug itself but rather a compilation problem you discovered while hunting it.

@jbruggem
Collaborator

@jbruggem what's the conclusion here? I'll get kayrock patched ASAP, but my understanding was that's not the bug itself but rather a compilation problem you discovered while hunting it.

Nothing has changed here; we're still waiting for somebody to dig into this. It will take a few days at least.

I'm not sure you need to patch Kayrock, as your dependency definition allows crc32cer to resolve to more recent versions.

@dantswain
Collaborator

Should we re-open the issue then?

@jbruggem jbruggem reopened this Apr 12, 2020
@jbruggem
Collaborator

Should we re-open the issue then?

My bad. I closed this by mistake and didn't even realize.

@jbruggem
Collaborator

News from upstream: after much back-and-forth to help a very dedicated OTP dev figure out the issue, 🎉 she did 🎉!

From the issue:

We will be fixing this bug on 21, 22 and master of course.

There's nothing more to do on our side, except update the documentation to mention this bug and update the CI to use the latest OTP releases.

@bjhaid
Member

bjhaid commented Apr 22, 2020

Thanks @jbruggem !

@dantswain
Collaborator

Amazing! Thank you!

@TC-aLi
Contributor

TC-aLi commented Mar 9, 2022

If anyone's still having the same issue, you can try upgrading OTP to 22.3.4.12 or beyond. There were several fixes to ssl recv after the major fix in 22.3.3.
On 22.3.3 we observed receive timeouts several times a day; after the upgrade, almost no occurrences of receive timeouts.

https://www.erlang.org/patches/otp-22.3.3#ssl-9.6.2
https://www.erlang.org/patches/otp-22.3.4.2#ssl-9.6.2.1
https://www.erlang.org/patches/otp-22.3.4.4#ssl-9.6.2.2
https://www.erlang.org/patches/otp-22.3.4.12#ssl-9.6.2.3
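
For Heroku deployments like the one earlier in this thread, that upgrade is the same one-line buildpack change (the exact patch version shown is just an example):

@@ elixir_buildpack.config:1 @@
-erlang_version=21.3.7
+erlang_version=22.3.4.12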
