
Calling Presence.Track in a channel process changes the behavior of the channel when the server is stopped #68

Closed
hanrelan opened this issue Feb 16, 2017 · 16 comments


@hanrelan

hanrelan commented Feb 16, 2017

If you have two channels, call them AChannel and BChannel defined as:

defmodule AChannel do
  use Phoenix.Channel
  # `Presence` is the application's Phoenix.Presence module.

  def join("room:lobby", _params, socket) do
    # Defer tracking until after the join reply has been sent.
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    Presence.track(socket, "1234", %{})
    {:noreply, socket}
  end
end

defmodule BChannel do
  use Phoenix.Channel

  def join("room:lobby", _params, socket) do
    {:ok, socket}
  end
end

And you deploy this server in a release, then stopping or restarting the release (e.g. bin/myapp restart) has subtly different behavior:

In the AChannel case, when the server is restarted, the client will receive an onClose event.
In the BChannel case, when the server is restarted, the client will receive an onError event!

This results in the client for AChannel not attempting to reconnect to the channel when the server comes back up, but the client in BChannel will reconnect.

I believe this is due to this line:
https://github.com/phoenixframework/phoenix_pubsub/blob/master/lib/phoenix/tracker.ex#L402

I think that this causes the process in AChannel to get shut down correctly when the supervision trees are brought down, so it sends a close message, whereas people are probably expecting the behavior exhibited by BChannel.

Not sure what the fix is here - setting trap_exit on the channel process seems like overkill. Perhaps the Tracker should only set up a monitor?
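
For completeness, the AChannel example above assumes an application presence module along these lines; the module name, otp_app, and pubsub server are assumptions based on the standard Phoenix.Presence setup:

defmodule Presence do
  # Assumed presence module for the example; the names (:myapp, MyApp.PubSub)
  # are placeholders matching the usual Phoenix.Presence setup.
  use Phoenix.Presence,
    otp_app: :myapp,
    pubsub_server: MyApp.PubSub
end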

@chrismccord
Member

It's not clear from your description why the second channel is erroneously closing. If it closes for a non-normal reason, you will necessarily receive error logs about the reason. Do you have any relevant logs for BChannel? We link to the tracker because our presence is necessarily tied to the tracker. If it goes down, the channel should also fail, and when we recover and the client reconnects, the presence will be re-tracked.

@hanrelan
Author

The channels are closing because the Application is being stopped. The issue is that they're closing with different reasons - AChannel is exiting with reason :shutdown, whereas BChannel exits with reason :killed.
I believe this is because (please correct me if I'm wrong), when you create a channel using this guide, by default it isn't part of a supervision tree. So when you stop the application, the processes are just killed. However, when you link the channel process to a presence process, the presence process is part of a supervision tree. So when the supervisor stops the presence process with reason :shutdown, that causes the channel process to also get stopped with reason :shutdown.
By itself this isn't the worst thing, but it causes unintuitive behavior where tracking presence in a channel will cause the client not to reconnect if the Application is restarted.
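
To make that propagation concrete, here is a minimal standalone sketch (plain processes, not Phoenix code) showing that a linked process which is not trapping exits goes down with the same :shutdown reason as its peer:

parent = self()

channel =
  spawn(fn ->
    receive do
      _ -> :ok
    end
  end)

tracker =
  spawn(fn ->
    # Stand-in for the link Phoenix.Tracker sets up when tracking a channel pid.
    Process.link(channel)
    send(parent, :linked)
    receive do
      _ -> :ok
    end
  end)

receive do
  :linked -> :ok
end

ref = Process.monitor(channel)
# A supervisor stopping the tracker delivers an exit signal with reason :shutdown.
Process.exit(tracker, :shutdown)

receive do
  {:DOWN, ^ref, :process, ^channel, reason} ->
    # reason is :shutdown here: the "channel" goes down with the tracker's
    # reason, while an unlinked process would simply be killed on app stop.
    IO.inspect(reason)
end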

The way I repro-ed this was:

  • Create a client that connects to both AChannel and BChannel
  • Run the server with iex -S mix
  • Set up traces on the AChannel and BChannel processes (I just used the observer)
  • In iex, run Application.stop(:myapp)
  • The trace logs will show
(This is AChannel's process, i.e. linked to a presence tracker)
13:45:07:849561 (<0.747.0>) in_exiting 0
13:45:07:849589 (<0.747.0>) exit shutdown
13:45:07:849603 (<0.747.0>) out_exited 0
(This is BChannel's process - no presence)
13:45:12:854318 (<0.744.0>) in_exiting 0
13:45:12:854329 (<0.744.0>) exit killed
13:45:12:854332 (<0.744.0>) out_exited 0
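
For reference, roughly the same exit reasons can also be captured from iex without the observer by tracing the two channel pids directly; the pids and the :myapp name below are placeholders:

# Make the iex shell the tracer for both channel pids and include exit events.
a_channel = :erlang.list_to_pid('<0.747.0>')  # placeholder pid for AChannel
b_channel = :erlang.list_to_pid('<0.744.0>')  # placeholder pid for BChannel

for pid <- [a_channel, b_channel] do
  :erlang.trace(pid, true, [:procs, :timestamp])
end

Application.stop(:myapp)

# The shell mailbox now holds trace messages such as:
#   {:trace_ts, pid, :exit, :shutdown, ts}  # AChannel, linked to the tracker
#   {:trace_ts, pid, :exit, :killed, ts}    # BChannel, no presence
flush()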

@josevalim
Member

Thanks for following up, this is very well detailed.

If we are going to fix this, we will likely do it by making BChannel exit with a shutdown reason too. Therefore, could you let us know why your application is being restarted?

@hanrelan
Author

For deployment I build releases using distillery. We're not doing hot upgrades, so to deploy I just untar the new release and run:
bin/myapp restart

This cleanly stops the application and then starts it with the new code. My interim solution to this problem is to just kill the beam process and then run bin/myapp start

Incidentally - the behavior I'd prefer is for AChannel to exit with :killed :) That way after the restart the client will reconnect to the channel. I can see why that doesn't make sense in a clean shutdown, I just wonder how many people already rely on that reconnect behavior and would get thrown off by it not reconnecting.

Perhaps an alternate solution is to have the Presence process monitor the channel process instead of linking to it. Then if the channel process dies, the presence process can also kill itself, but if the presence process dies, the supervisor can restart it and it can monitor the channel process again?

@josevalim
Member

josevalim commented Feb 17, 2017

Maybe the answer is to force a client reconnect unless the channel exits with normal reason.

hanrelan added a commit to hanrelan/phoenix that referenced this issue Feb 17, 2017
@hanrelan
Author

hanrelan commented Feb 17, 2017

I put together what I think is the least invasive approach that makes sense. You can see the change here: hanrelan/phoenix@2a3b38e

Essentially I treat {:shutdown, :closed} and {:shutdown, :left} as special cases which don't send phx_error, whereas all other :shutdown types do. I think this makes sense because you want those shutdowns to propagate over links (e.g. to end the presence process), so changing those to :normal would break in other places. All tests still pass with this change.
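
The gist of that special-casing, sketched here with an illustrative helper rather than the actual transport code (see the linked commit for the real diff):

defmodule ExitClassifier do
  # Illustrative only, not the actual transport code: decide whether an exit
  # reason is reported to the client as a clean close (phx_close) or as an
  # error (phx_error).
  def clean_exit?(:normal), do: true
  def clean_exit?({:shutdown, :closed}), do: true   # server closed the channel
  def clean_exit?({:shutdown, :left}), do: true     # client called leave()
  def clean_exit?(_other), do: false                # supervisor :shutdown, crashes, :killed, ...
end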

Happy to submit this as a PR in a topic branch if you think this makes sense.

@chrismccord
Member

Maybe the answer is to force a client reconnect unless the channel exits with normal reason.

This should already happen today. If the channel exits ungracefully, the client will auto-rejoin with exponential backoff. @hanrelan are you using the phoenix.js channels client?

@hanrelan
Author

hanrelan commented Feb 17, 2017

I am using the phoenix.js channels client

I think Jose meant :normal in the sense of the signal. What's happening is that this code:
https://github.com/phoenixframework/phoenix/blob/master/lib/phoenix/socket/transport.ex#L288
causes the server to send a phx_close event to the client when the process receives the :shutdown signal. Once the client receives phx_close, it doesn't try to reconnect.

However, :shutdown is used by supervisors to stop processes, so in this case during an application restart it's stopping the Presence process which is propagating the :shutdown to the channel process. This results in the client not reconnecting if the server is restarted (in the sense of a release restart).

The diff I sent before changes it so only explicitly closing the channel (via the client's leave() function or the server's close) is considered a clean shutdown. All other shutdowns, including one enforced by a supervisor or via a link, are considered unclean and the client will try to reconnect.

This keeps the behavior between processes that are linked to a presence and processes that aren't linked to a presence the same with respect to shutdowns and reconnect attempts.

Happy to discuss on Slack as well if that's easier; I'm @hanrelan in the phoenix channel.

@chrismccord
Member

Ah I see. We could change the behavior of the phx_close event to differentiate :normal and :shutdown, but I'm unsure what makes the most sense. For example, in your channel code today, you could do:

{:stop, {:shutdown, :normal}, socket}

What should happen in that case? We won't be able to determine the difference between this graceful exit and a "graceful restart" where the client should re-establish a connection.

@josevalim
Member

@chrismccord I think we should consider it the same as a link, which means that we will break on everything that is not :normal. I will push the changes to Phoenix master today. :)

@josevalim
Member

So I attempted a fix for this, and we use {:shutdown, ...} for clean shutdowns in many cases. I am not sure what the alternative is. We could special-case :shutdown, but that may be confusing.

@chrismccord
Member

Yes, that is my concern: {:shutdown, :normal} is a clean shutdown, which we would need to propagate to the client as a phx_error, which is wrong. A pre-app-shutdown task might be the way to go: one that forcefully kills transport pids, shuts down the transport, then issues the app shutdown. That way restarts would leave clients in rejoin mode? I see no other way around it.
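
Purely as an illustration of that idea (not anything Phoenix ships): a child started last in the supervision tree is terminated first on shutdown, so its terminate/2 could forcefully kill transport pids before the orderly shutdown reaches them. MyApp.TransportKiller and lookup_transport_pids/0 below are hypothetical:

defmodule MyApp.TransportKiller do
  # Hypothetical sketch: start this child last in the supervision tree so it
  # is shut down first when the application stops.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok) do
    # Trap exits so terminate/2 runs when the supervisor sends :shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  def terminate(:shutdown, _state) do
    # Killing the transports abruptly makes clients see an error, so they keep
    # retrying and rejoin once the restarted release is back up.
    for pid <- lookup_transport_pids(), do: Process.exit(pid, :kill)
    :ok
  end

  def terminate(_reason, _state), do: :ok

  # Hypothetical: how transport pids are looked up depends on the application.
  defp lookup_transport_pids, do: []
end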

@hanrelan
Author

Another thought, but obviously you guys are much more familiar with this than I am, so feel free to ignore it if this makes no sense.
What if, instead of linking to the channel process, Phoenix.Tracker just set up a monitor on the channel process? Then if the channel process dies, the Tracker can just stop itself.

The other case to deal with is if the Tracker process errors out for some reason: you don't want presence to silently stop working for a channel process. My understanding is fuzzier here, but maybe this could be resolved by using a transient restart strategy in the presence process' supervisor so it gets restarted if it errors?
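
A toy sketch of that monitor-based idea; this is not how Phoenix.Tracker actually works (as the next reply explains, it links and traps exits), just an illustration of the monitor alternative:

defmodule ToyTracker do
  # Toy illustration of a monitor-based tracker; not the real Phoenix.Tracker.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  def track(tracker, pid, key, meta), do: GenServer.call(tracker, {:track, pid, key, meta})

  def init(:ok), do: {:ok, %{}}

  def handle_call({:track, pid, key, meta}, _from, presences) do
    # Monitor instead of link: the tracker learns when the channel dies,
    # but the channel is never taken down by the tracker.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(presences, ref, {pid, key, meta})}
  end

  def handle_info({:DOWN, ref, :process, _pid, _reason}, presences) do
    # The tracked channel died: drop its presence entry.
    {:noreply, Map.delete(presences, ref)}
  end
end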

@josevalim
Member

@hanrelan the tracker traps exits exactly because we don't want the tracker to crash if a "client" crashes. But we need the link to solve your last paragraph; otherwise there is no way for the "presence client" to be brought down if the presence server crashes.

The client could monitor and raise but links are more appropriate here. I believe @chrismccord is working on a solution though. :)

@hanrelan
Author

hanrelan commented Feb 20, 2017

Ah, got it, I misunderstood the purpose of the link. Thanks for the explanation and the quick response to this issue @chrismccord and @josevalim!

chrismccord added a commit to phoenixframework/phoenix that referenced this issue Feb 21, 2017
ref: phoenixframework/phoenix_pubsub#68

Previously a channel could gracefully terminate
using the stop semantics of a regular genserver;
however when restarting an application for deploys
the shutdown of the transport and channel processes
would be indistinguishable from an intentional channel
shutdown and would cause clients to incorrectly not
reconnect after server restart. This commit adds a
{:graceful_exit, channel_pid, %Phoenix.Socket.Message{}}
contract to distinguish an intentional channel exit from
what should be regarded as an error condition on the client.
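
As a rough, non-authoritative sketch of what that contract enables (the real handling lives in the Phoenix commit above; ToyTransport and its send_fun are made up for illustration):

defmodule ToyTransport do
  # Toy sketch, not the actual Phoenix transport: a socket-owning process
  # chooses which frame to emit based on whether the channel announced a
  # graceful exit or simply went down.
  use GenServer

  def start_link(send_fun), do: GenServer.start_link(__MODULE__, send_fun)

  def init(send_fun) do
    # Trap exits so channel exits arrive as messages instead of killing us.
    Process.flag(:trap_exit, true)
    {:ok, send_fun}
  end

  # Channel announced an intentional close: relay phx_close so the client
  # does not attempt to rejoin.
  def handle_info({:graceful_exit, _channel_pid, close_msg}, send_fun) do
    send_fun.({:phx_close, close_msg})
    {:noreply, send_fun}
  end

  # Any other channel exit (including :shutdown during an app restart) is
  # surfaced as phx_error, so the client keeps retrying and rejoins.
  def handle_info({:EXIT, _channel_pid, reason}, send_fun) when reason != :normal do
    send_fun.(:phx_error)
    {:noreply, send_fun}
  end

  def handle_info(_other, send_fun), do: {:noreply, send_fun}
end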
@chrismccord
Member

This has been fixed on phoenix master and will ship with the 1.3 release. Check the PR commit for details, but the tl;dr is: your app restarts will be seen as errors by the client and they will reconnect. Cheers!
