Bot frequently goes offline with zombie connections #447

Closed
csuhta opened this Issue Jan 12, 2018 · 11 comments

csuhta commented Jan 12, 2018

Our Discord bot often enters a very slow loop with zombie connections.

Our bot runs in a Rake task; this is all of the code used:

task "discord:run" => :environment do

  require "discordrb"

  Thread.abort_on_exception = true
  STDOUT.sync = true
  STDERR.sync = true

  # Gracefully shut down the server when we get a system signal
  Signal.trap("SIGTERM") { logs "Shutting down Discord bot"; exit(0) }
  Signal.trap("SIGINT")  { logs "Shutting down Discord bot"; exit(0) }

  @bot = Discordrb::Commands::CommandBot.new(token: ENV["DISCORD_BOT_TOKEN"], client_id: ENV["DISCORD_CLIENT_ID"], prefix: "/")

  @bot.message do |event|
    # [...] Calls a class that replies to the message event
  end

  @bot.command(:rule, chain_usable: false, description: "Rules Lookup") do |event, rule_number|
    # [...] Calls a class that replies to the command
  end

  @bot.run :async

end

Here is a slice of what gets logged to the console. Sometimes this happens even before the bot receives any events from Discord:

[INFO : websocket @ 2018-01-11 23:48:31.281] Discord using gateway protocol version: 6, requested: 6
[WARN : heartbeat @ 2018-01-11 23:51:15.762] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-01-11 23:51:15.763] The websocket connection has closed: nil
[INFO : websocket @ 2018-01-11 23:51:15.763] Instant reconnection flag was set - reconnecting right away
[WARN : heartbeat @ 2018-01-11 23:52:39.268] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-01-11 23:53:15.876] The websocket connection has closed: nil
[INFO : websocket @ 2018-01-11 23:53:18.380] Discord using gateway protocol version: 6, requested: 6
[WARN : websocket @ 2018-01-11 23:53:18.380] Server streaming timed out with 245 servers remaining
[WARN : websocket @ 2018-01-11 23:53:18.380] This means some servers are unavailable due to an outage. Notifying ready now, we'll have to live without these servers
[WARN : heartbeat @ 2018-01-11 23:54:40.394] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-01-11 23:54:40.398] The websocket connection has closed: nil
[INFO : websocket @ 2018-01-11 23:54:40.399] Instant reconnection flag was set - reconnecting right away
[WARN : heartbeat @ 2018-01-11 23:56:04.086] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-01-11 23:56:04.101] The websocket connection has closed: nil
[INFO : websocket @ 2018-01-11 23:56:04.104] Instant reconnection flag was set - reconnecting right away
[INFO : websocket @ 2018-01-11 23:56:13.013] Discord using gateway protocol version: 6, requested: 6
[WARN : heartbeat @ 2018-01-11 23:57:27.606] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-01-11 23:57:27.683] The websocket connection has closed: nil
[INFO : websocket @ 2018-01-11 23:57:27.716] Instant reconnection flag was set - reconnecting right away
… (continues for a long time) …

This happens at least once a day, even when the official Discord client is working fine and Discord is not reporting any outages at http://status.discordapp.com

What can we do to debug this situation or make the bot more aggressive about fixing itself? We don't have very much information to work with.

Collaborator

z64 commented Jan 12, 2018

Hi @csuhta .

A few things:

[WARN : websocket @ 2018-01-11 23:53:18.380] Server streaming timed out with 245 servers remaining
[WARN : websocket @ 2018-01-11 23:53:18.380] This means some servers are unavailable due to an outage. Notifying ready now, we'll have to live without these servers

This comes from lib/discordrb/bot.rb:897 which is prefaced by:

# Check whether there are still unavailable servers and there have been more than 10 seconds since READY
if @unavailable_servers && @unavailable_servers > 0 && (Time.now - @unavailable_timeout_time) > 10

While it says "due to an outage", this can just as well be because discordrb is taking too long to load all of those servers into memory. It says 245 remaining; how many are you on exactly? Most likely Ruby / the library is simply slow at this task: it tries to ensure all the guilds are loaded into memory before calling the ready callback. So, if you're not doing anything with, for example, bot.ready, this can likely just be ignored.

As for the zombie connections, I don't know exactly what you can do to debug / remedy this. It would be worth taking a general look at the stability of your server's internet connection to Discord, as the problem could lie anywhere in between you and Discord (e.g. Cloudflare).

If you're unfamiliar with exactly what "zombie connection" means, it's the following process:

  • When you connect to Discord's websocket, you are given a heartbeat_interval which is (always) 41.25 seconds.
  • At each interval, you send a "ping" to Discord and the library waits for a "pong" back.
  • If we don't get a "pong" back anytime before the next heartbeat is due (the full 41.25 second window), this is a zombie connection, and somewhere the connection was lost.
  • Try to reconnect.
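
The bookkeeping behind those steps can be sketched without any network code. This is only an illustrative model, not discordrb's actual implementation: the HeartbeatMonitor class and its tick/ack methods are hypothetical names, and each call to tick stands in for one heartbeat_interval elapsing.

```ruby
# Minimal, network-free model of the heartbeat/ack bookkeeping described above.
# HeartbeatMonitor, tick and ack are hypothetical names, not discordrb API.
class HeartbeatMonitor
  attr_reader :reconnects

  def initialize
    @last_acked = true # nothing is outstanding before the first heartbeat
    @reconnects = 0
  end

  # Called once per heartbeat_interval. Returns :sent or :zombie.
  def tick
    unless @last_acked
      # The previous heartbeat never got its ack: treat the link as dead.
      @reconnects += 1
      @last_acked = true # a fresh connection starts with a clean slate
      return :zombie
    end
    @last_acked = false # heartbeat sent, ack now outstanding
    :sent
  end

  # Called when the gateway acknowledges the heartbeat.
  def ack
    @last_acked = true
  end
end

monitor = HeartbeatMonitor.new
monitor.tick          # heartbeat 1 sent
monitor.ack           # acked in time
monitor.tick          # heartbeat 2 sent
status = monitor.tick # no ack arrived in between
puts status           # prints "zombie"
```

Note that in this model anything that delays processing of the ack for a whole interval looks the same as a dropped connection, which is one reason a busy process can produce the log lines above.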

@meew0 may have some other ideas to debug, but I'm not sure what to suggest in the meantime. Hopefully this clarifies what's going on a little. That logging line should probably be rephrased, in any case.

csuhta commented Jan 12, 2018

It says 245 remaining, how many are you on exactly?

Our bot is authorized on 1,335 Discord guilds, and it has heard an event from about 800 of them in the last week.

Is there any way to skip this guild iteration step? We do not need to list or send messages to all of the guilds at once; we only use the @bot.message hook and @bot.command hook.

Our bot runs on Heroku, so we don't suspect that it lacks a constant internet connection. This zombie period often happens when the bot first boots.

Collaborator

z64 commented Jan 12, 2018

The state logic in discordrb is fully integrated into the websocket logic. There is no way to use WS events, or most of the library's abstractions, without state/cache.

If your bot is extremely simple, I might suggest using meew0/discordcr instead. Crystal is a language very similar in feel to Ruby, but statically typed with global type inference, and compiles to a native executable. The library is completely stateless by default. If your bot is as simple as this, it should be trivial to port.

Probably not what you want to hear lol, but discordrb is primarily meant to be extremely easy to use above all, and performant last.

csuhta commented Jan 18, 2018

We're looking into discordcr, but it does not seem ready to use. The WIP status concerns me.

Is there a way to federate a Ruby bot? Ex: Have two or more processes that split the workload and each handle a percentage of the servers?

but discordrb is primarily meant to be extremely easy to use above all, and performant last.

Which official library for building a Discord bot is the most performant one?
Are there examples of a popular Discord bot that has scaled past 2000 servers?

Collaborator

z64 commented Jan 18, 2018

We're looking into discordcr, but it does not seem ready to use. The WIP status concerns me.

I would almost contest that "WIP" status, and given your example, I'd wager all the functionality you need is already in the library. Note that it is intended to be a small, performant, low-level toolkit. If you need some of the higher-level abstractions offered by discordrb, my extension may be able to help with this. /shamelessplug

If you could be more specific about your needs I would be better able to answer you of course.

Is there a way to federate a Ruby bot?

Yes. This is called sharding. See shard_id in Bot#initialize. You would spawn multiple processes with their own shard_id that would distribute guild activity across them. See also the Discord docs on sharding for other misc details on Discord's end.
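
As a rough illustration of how guilds end up spread across processes: Discord's sharding docs give the routing rule shard_id = (guild_id >> 22) % num_shards. The sketch below only applies that formula in plain Ruby. The guild IDs and the shard count of 2 are made up for the example, and the commented-out constructor call should be checked against the Bot#initialize documentation before use.

```ruby
# Discord's documented routing rule: shard_id = (guild_id >> 22) % num_shards.
def shard_for(guild_id, num_shards)
  (guild_id >> 22) % num_shards
end

NUM_SHARDS = 2 # assumption: two worker processes

# Hypothetical guild IDs, for illustration only.
guild_ids = [81384788765712384, 41771983423143937, 290926798626357250]

# Show how many of the sample guilds each shard would be responsible for.
guild_ids.group_by { |id| shard_for(id, NUM_SHARDS) }.sort.each do |shard_id, ids|
  puts "shard #{shard_id}: #{ids.size} guild(s)"
end

# Each process would then be started with its own shard_id, along the lines of:
#   Discordrb::Bot.new(token: ENV["DISCORD_BOT_TOKEN"], shard_id: 0, num_shards: NUM_SHARDS)
```

Every process must use the same num_shards value, otherwise some guilds would be handled twice or not at all.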

Which official library for building a Discord bot is the most performant one?

There are so many variables to this that I don't think anyone has tackled such a benchmark. You'll find the largest bots often do not make use of a single library, but have components in different languages / frameworks. For example, Mee6 (on some 800K guilds IIRC) uses Elixir for the websocket system and Python for processing those events / REST calls.

I would wager Crystal, especially due to the native binary nature, would be way up there. It's not nearly as popular, but I bring it up as it would be a very easy transition for you to make since the language is very Ruby-inspired.

Contributor

tripl3dogdare commented Jan 18, 2018

The question of which library is the most performant is not an easy or even necessarily possible question to answer. In general, performance is rarely a concern when it comes to Discord bots, so most libraries don't worry too much about performance beyond basic usability levels.

Given the number of servers you're dealing with, your problem may be less with discordrb and more with overall process load. You might look into sharding as a solution: this splits your bot into multiple processes, each handling a different segment of the total guilds the bot is on, which means each individual process has less to deal with and performs better. I don't remember offhand if discordrb in particular has sharding support, however; that'd be a question for one of the more active contributors.

It's impossible for a Discord bot to scale over about 2.5k servers without sharding, due to an intentional API limitation; however, there are some Discord bots that are on tens of thousands of servers through use of sharding.

csuhta commented Jan 18, 2018

Ok, we're going to attempt to fix our issues with sharding and possibly by re-architecting the bot to offload the actual reply work to a different process.
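
A minimal sketch of that offloading idea, under loud assumptions: a Thread and an in-process Queue stand in for the separate process and whatever job queue would connect it, and handle_message is a hypothetical stand-in for the body of a @bot.message handler. Nothing here is discordrb API.

```ruby
# Assumption: a Thread plus an in-process Queue stand in for the separate
# worker process (and its job queue) mentioned above.
jobs = Queue.new
replies = []

# The worker does the slow part: building and sending the reply.
worker = Thread.new do
  while (job = jobs.pop) != :stop
    replies << "reply to #{job[:content]}"
  end
end

# What an event handler would do: enqueue the work and return immediately,
# so the thread that received the event is not tied up by the reply.
handle_message = ->(content) { jobs << { content: content } }

handle_message.call("/rule 101")
handle_message.call("/rule 202")

jobs << :stop # tell the worker to finish once the queue is drained
worker.join
puts replies.inspect
```

The point of the split is that the code receiving gateway events stays cheap, while the expensive reply logic can be scaled or restarted independently.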

csuhta closed this Jan 18, 2018

Collaborator

z64 commented Jan 18, 2018

Sure thing. Hope some of my rambling here helped :)

Any other questions, you know where to find us, here or on Discord. 👍

FreshWebCoder commented Oct 5, 2018

I have a similar issue.
This is the code used:

      bot = Discordrb::Bot.new token: ENV["discord_token"], client_id: ENV["discord_client"]
      bot.run :async
      bot.gateway
      discord_server = bot.servers[auth.guild_id.to_i]
      if discord_server.present?
        account = nil
        discord_server.users.each do|member|
          account = member if member.id.to_i == auth.uid.to_i
        end

        if discord_server.is_a?(Discordrb::Server)
          discord_server.text_channels.each do|channel|
            if channel.text?
              channels << channel.name
            end
          end
        end
        [channels, discord_server, account]

I often get this log message:

[WARN : heartbeat @ 2018-10-05 17:04:43.511] Last heartbeat was not acked, so this is a zombie connection! Reconnecting
[ERROR : heartbeat @ 2018-10-05 17:04:43.511] The websocket connection has closed: nil

and the connection is lost.
This happens very frequently in my Ruby app.

Collaborator

z64 commented Oct 5, 2018

Please see my explanation of zombie connections here: #447 (comment)

While the code you posted is not relevant, it is also unsafe and uses an API that doesn't exist in our library.

Please do not use bot.run(:async) unless you understand the implications.

Collaborator

z64 commented Oct 5, 2018

This is off topic, but in short, you should be using a bot.ready handler and not using bot.run(anything) at all.

We recently updated the documentation for bot.run to clarify when the argument should and should not be used, and what to do instead: https://meew0.github.io/discordrb/master/Discordrb/Bot.html#run-instance_method
