Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ratelimiter state machine errors #517

Closed
tspenov opened this issue Jun 29, 2023 · 10 comments
Closed

Ratelimiter state machine errors #517

tspenov opened this issue Jun 29, 2023 · 10 comments

Comments

@tspenov
Copy link

tspenov commented Jun 29, 2023

I got these Ratelimiter errors in the logs after upgrading to latest master.
Edit: these seems to be frequent in the logs

2023-06-29T10:54:24.916961392Z stdout F {"level":"error","message":"** State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating\n** Last event = {info,{gun_error,<0.5654.0>,\n                                 #Ref<0.331392565.3752329217.62190>,closed}}\n** When server state  = {connected,#{conn => <0.5654.0>,inflight => #{},\n                                     outstanding =>\n                                         #{<<\"/users/@me\">> =>\n                                               {initial,{[],[]}}},\n                                     remaining_in_window => 48,\n                                     running =>\n                                         #{#Ref<0.331392565.3752329217.62190> =>\n                                               {<<\"/users/@me\">>,\n                                                #{body => <<>>,\n                                                  headers =>\n                                                      [{<<\"content-type\">>,\n                                                        <<\"application/json\">>}],\n                                                  method => get,params => [],\n                                                  route => <<\"/users/@me\">>},\n                                                {<0.6921.0>,\n                                                 #Ref<0.331392565.3752329217.62189>}}}}}\n** Reason for termination = error:function_clause\n** Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']\n** Callback mode = state_functions\n** Stacktrace =\n**  [{'Elixir.Nostrum.Api.Ratelimiter',connected,\n         [info,\n          {gun_error,<0.5654.0>,#Ref<0.331392565.3752329217.62190>,closed},\n          #{conn => <0.5654.0>,inflight => #{},\n            outstanding => #{<<\"/users/@me\">> => {initial,{[],[]}}},\n            remaining_in_window => 48,\n            running =>\n                #{#Ref<0.331392565.3752329217.62190> =>\n                      {<<\"/users/@me\">>,\n                       #{body => <<>>,\n                         headers =>\n                             [{<<\"content-type\">>,<<\"application/json\">>}],\n                         method => get,params => [],route => <<\"/users/@me\">>},\n                       {<0.6921.0>,#Ref<0.331392565.3752329217.62189>}}}}],\n         [{file,\"lib/nostrum/api/ratelimiter.ex\"},{line,397}]},\n     {gen_statem,loop_state_callback,11,[{file,\"gen_statem.erl\"},{line,1426}]},\n     {proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,240}]}]\n** Time-outs: {1,[{{timeout,reset_bot_calls_window},expired}]}\n","timestamp":"2023-06-29T10:54:24.916"}
2023-06-29T10:54:24.921324593Z stdout F {"level":"error","message":"Task #PID<0.6921.0> started from #PID<0.5666.0> terminating\n** (stop) exited in: :gen_statem.call(Nostrum.Api.Ratelimiter, {:queue, %{body: \"\", headers: [{\"content-type\", \"application/json\"}], method: :get, params: [], route: \"/users/@me\"}}, :infinity)\n    ** (EXIT) an exception was raised:\n        ** (FunctionClauseError) no function clause matching in Nostrum.Api.Ratelimiter.connected/3\n            (nostrum 0.9.0-alpha3) lib/nostrum/api/ratelimiter.ex:397: Nostrum.Api.Ratelimiter.connected(:info, {:gun_error, #PID<0.5654.0>, #Reference<0.331392565.3752329217.62190>, :closed}, %{conn: #PID<0.5654.0>, inflight: %{}, outstanding: %{\"/users/@me\" => {:initial, {[], []}}}, remaining_in_window: 48, running: %{#Reference<0.331392565.3752329217.62190> => {\"/users/@me\", %{body: \"\", headers: [{\"content-type\", \"application/json\"}], method: :get, params: [], route: \"/users/@me\"}, {#PID<0.6921.0>, #Reference<0.331392565.3752329217.62189>}}}})\n            (stdlib 4.3) gen_statem.erl:1426: :gen_statem.loop_state_callback/11\n            (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3\n    (stdlib 4.3) gen.erl:243: :gen.do_call/4\n    (stdlib 4.3) gen_statem.erl:900: :gen_statem.call_dirty/4\n    (nostrum 0.9.0-alpha3) lib/nostrum/api.ex:2602: Nostrum.Api.get_current_user/0\n    (sanbase 0.0.1) lib/sanbase/discord/discord_consumer.ex:256: Sanbase.DiscordConsumer.warm_up/1\n    (sanbase 0.0.1) lib/sanbase/discord/discord_consumer.ex:70: Sanbase.DiscordConsumer.handle_event/1\n    (elixir 1.14.4) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2\n    (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3\nFunction: #Function<1.121426740/0 in Sanbase.DiscordConsumer.handle_info/2>\n    Args: []","timestamp":"2023-06-29T10:54:24.916"}
--

Nostrum version: "nostrum": {:git, "https://github.com/Kraigie/nostrum.git", "49a2944d7e1a953a4cafae18b97e9379da36f3bb", []},, tag version: nostrum 0.9.0-alpha3

Elixir and otp versions:

Erlang/OTP 25 [erts-13.2] [source] [64-bit] [smp:8:1] [ds:8:1:10] [async-threads:1] [jit:ns]

Interactive Elixir (1.14.4) - press Ctrl+C to exit (type h() ENTER for help)

Traceback

** State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating
** Last event = {info,{gun_error,<0.5654.0>,
                                 #Ref<0.331392565.3752329217.62190>,closed}}
** When server state  = {connected,#{conn => <0.5654.0>,inflight => #{},
                                     outstanding =>
                                         #{<<"/users/@me">> =>
                                               {initial,{[],[]}}},
                                     remaining_in_window => 48,
                                     running =>
                                         #{#Ref<0.331392565.3752329217.62190> =>
                                               {<<"/users/@me">>,
                                                #{body => <<>>,
                                                  headers =>
                                                      [{<<"content-type">>,
                                                        <<"application/json">>}],
                                                  method => get,params => [],
                                                  route => <<"/users/@me">>},
                                                {<0.6921.0>,
                                                 #Ref<0.331392565.3752329217.62189>}}}}}
** Reason for termination = error:function_clause
** Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']
** Callback mode = state_functions
** Stacktrace =
**  [{'Elixir.Nostrum.Api.Ratelimiter',connected,
         [info,
          {gun_error,<0.5654.0>,#Ref<0.331392565.3752329217.62190>,closed},
          #{conn => <0.5654.0>,inflight => #{},
            outstanding => #{<<"/users/@me">> => {initial,{[],[]}}},
            remaining_in_window => 48,
            running =>
                #{#Ref<0.331392565.3752329217.62190> =>
                      {<<"/users/@me">>,
                       #{body => <<>>,
                         headers =>
                             [{<<"content-type">>,<<"application/json">>}],
                         method => get,params => [],route => <<"/users/@me">>},
                       {<0.6921.0>,#Ref<0.331392565.3752329217.62189>}}}}],
         [{file,"lib/nostrum/api/ratelimiter.ex"},{line,397}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1426}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
** Time-outs: {1,[{{timeout,reset_bot_calls_window},expired}]}
jchristgit added a commit that referenced this issue Jun 29, 2023
@jchristgit
Copy link
Collaborator

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

@tspenov
Copy link
Author

tspenov commented Jun 29, 2023

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

Thanks for the quick reply! I'll test it out and let you know

@tspenov
Copy link
Author

tspenov commented Jun 29, 2023

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

@jchristgit I tested and in logs I see now this error:

Date: 2023-06-29T14:03:02.570
Level: error

State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating

Last event = {
  info, {
    gun_down, <0.5653.0>, http2, closed, [#Ref<0.2297931708.4051173377.228586>]
  }
}

Server state = {
  connected, {
    conn => <0.5653.0>,
    inflight => #{},
    outstanding => {
      <<"/users/@me">> => {
        initial, {
          [{
            #{body => <<>>,
              headers => [{<<content-type>>, <<application/json>>}],
              method => get,
              params => [],
              route => <<"/users/@me">>
            }, 
            {<0.6601.0>, #Ref<0.2297931708.4051173377.228585>}
          }], []
        }
      }
    },
    remaining_in_window => 48,
    running => #{}
  }
}

Reason for termination = error: {badkey, #Ref<0.2297931708.4051173377.228586>}

Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']

Callback mode = state_functions

Stacktrace = [
  {erlang, map_get, [#Ref<0.2297931708.4051173377.228586>, #{}], [{error_info, #{module => erl_erts_errors}}]},
  {'Elixir.Nostrum.Api.Ratelimiter', '-connected/3-fun-1-', 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 881}]},
  {'Elixir.Enum', '-map/2-lists^map/1-0-', 2, [{file, "lib/enum.ex"}, {line, 1658}]},
  {'Elixir.Nostrum.Api.Ratelimiter', connected, 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 878}]},
  {gen_statem, loop_state_callback, 11, [{file, "gen_statem.erl"}, {line, 1426}]},
  {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 240}]}
]

Time-outs: {1, [{{timeout, reset_bot_calls_window}, expired}]}

@jchristgit
Copy link
Collaborator

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@tspenov
Copy link
Author

tspenov commented Jun 29, 2023

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Thank you! I'll test it out tomorrow and let you know.

The elixir app is deployed to AWS, so there shouldn't be any problem with the network.
Not sure why the WS to discord gets closed so often.
Another thing the app is experiencing very often is {:stream_error, :closed} from gun in which case I just retry the command/interaction.

@tspenov
Copy link
Author

tspenov commented Jun 30, 2023

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Sometimes after Nostrum.Api.get_current_user() is executed I see in logs: "Request to \"/users/@me\" was closed abnormally, requeueing" but the call doesn't return anything, neither raises an exception. It seems like it is blocking indefinitely.

To work around this I have wrapped it in a Task like this:

  defp warm_up(msg_or_interaction, retries \\ 3) do
    t1 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "WARM UP STARTING...")

    task = Task.async(fn -> Nostrum.Api.get_current_user() end)

    case Task.yield(task, 5000) || Task.shutdown(task) do
      nil ->
        log(msg_or_interaction, "WARM UP ERROR: Timeout reached.", type: :error)

        if retries > 0 do
          log(msg_or_interaction, "WARM UP TIMEOUT: Retrying...", type: :error)

          warm_up(msg_or_interaction, retries - 1)
        else
          log(msg_or_interaction, "WARM UP TIMEOUT: No more retries.", type: :error)
        end

      {:ok, result} ->
        # handle the result
        handle_result(result, msg_or_interaction, retries)
    end

    t2 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "Time spent warming up #{t2 - t1}ms.")
  end

Here are some log excerpts:

{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... msg.content=<@1039814526708764742> hi metadata=%
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:32.178"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:32.178"}
{"level":"warning","message":"Request to \"/users/@me\" was closed abnormally, requeueing","timestamp":"2023-06-30T15:16:32.183"}
{"level":"info","message":"[id=1124357812546457661] WARM UP ERROR: Timeout reached. msg.content=<@1039814526708764742> {"level":"info","message":"[id=1124357812546457661] WARM UP TIMEOUT: Retrying...\"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... \"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:37.202"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:37.202"}
{"level":"info","message":"[id=1124357812546457661] WARM UP SUCCESS ….
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 190ms.
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 5191ms …

@jchristgit
Copy link
Collaborator

jchristgit commented Jul 1, 2023 via email

@tspenov
Copy link
Author

tspenov commented Jul 3, 2023

I cannot reproduce this here, not with spamming the endpoint nor with running it one by one like you do. Do you have a proxy sitting in front of nostrum? Some outgoing firewall?

@jchristgit The elixir app is deployed in kubernetes cluster in AWS.
The problem is only on the production cluster. Locally and on stage cluster it is fine.
On production I can reproduce every time if the WS connection has been idle for 5-10 mins (no other API call to discord has been made).
The only difference between stage and prod clusters is that on prod cluster there is AWS NAT Gateway for outgoing requests. The default timeout for tcp connections for the AWS NAT Gateway is 350 seconds ~ 6 mins
But I wonder why this happens since there are HEARTBEAT_ACK in the logs pretty much every 40 seconds.
They should be refreshing this connection so this timeout from AWS should not happen.
Do you have any insight on this ?

@jchristgit
Copy link
Collaborator

Sorry but I have no environment where I can reproduce this, at all.

Since you mentioned that the only difference between stage and prod is the AWS NAT Gateway I cannot help but assume that the AWS NAT Gateway is at fault, and nostrum works properly here. Perhaps the NAT Gateway is messing around with HTTP/2 connections or it is doing some other strange form of introspection on the request. In nostrum the bot gateway request does not do anything out of the ordinary in terms of HTTP requests.

@tspenov
Copy link
Author

tspenov commented Jul 4, 2023

@jchristgit My current workaround is having a job that makes an API call every couple of minutes so it makes sure that the connection doesn't get idle.

Thank you for quick responses and the fixes, really appreciated! 💙

I am closing the issue.

@tspenov tspenov closed this as completed Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants