
Calling Presence.Track in a channel process changes the behavior of the channel when the server is stopped #68

Closed
hanrelan opened this issue Feb 16, 2017 · 16 comments


@hanrelan

hanrelan commented Feb 16, 2017

If you have two channels, call them AChannel and BChannel defined as:

defmodule AChannel do
  use Phoenix.Channel
  # `Presence` is the application's Phoenix.Presence module.

  def join("room:lobby", _params, socket) do
    # Defer tracking until after the join reply has been sent.
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    Presence.track(socket, "1234", %{})
    {:noreply, socket}
  end
end

defmodule BChannel do
  use Phoenix.Channel

  def join("room:lobby", _params, socket) do
    {:ok, socket}
  end
end

And you deploy this server in a release, then stopping or restarting the release (e.g. bin/myapp restart) has subtly different behavior:

In the AChannel case, when the server is restarted, the client will receive an onClose event.
In the BChannel case, when the server is restarted, the client will receive an onError event!

This results in the client for AChannel not attempting to reconnect to the channel when the server comes back up, but the client in BChannel will reconnect.

I believe this is due to this line:
https://github.com/phoenixframework/phoenix_pubsub/blob/master/lib/phoenix/tracker.ex#L402

I think that this causes the process in AChannel to get shut down correctly when the supervision trees are brought down, so it sends a close message, whereas people are probably expecting the behavior exhibited by BChannel.

Not sure what the fix is here - setting trap_exit on the channel process seems like overkill. Perhaps the Tracker should only set up a monitor?
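
For completeness, the AChannel example above assumes an application presence module along these lines; the module name, otp_app, and pubsub server are assumptions based on the standard Phoenix.Presence setup:

defmodule Presence do
  # Assumed presence module for the example; the names (:myapp, MyApp.PubSub)
  # are placeholders matching the usual Phoenix.Presence setup.
  use Phoenix.Presence,
    otp_app: :myapp,
    pubsub_server: MyApp.PubSub
end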

@chrismccord
Member

It's not clear from your description why the second channel is erroneously closing. If it closes for a non-normal reason, you will necessarily receive error logs about the reason. Do you have any relevant logs for BChannel? We link to the tracker because our presence is necessarily tied to the tracker. If it goes down, the channel should also fail, and when we recover and the client reconnects, the presence will be re-tracked.

@hanrelan
Author

The channels are closing because the Application is being stopped. The issue is that they're closing with different reasons - AChannel is exiting with reason :shutdown, whereas BChannel exits with reason :killed.
I believe this is because (please correct me if I'm wrong), when you create a channel using this guide, by default it isn't part of a supervision tree. So when you stop the application, the processes are just killed. However, when you link the channel process to a presence process, the presence process is part of a supervision tree. So when the supervisor stops the presence process with reason :shutdown, that causes the channel process to also get stopped with reason :shutdown.
By itself this isn't the worst thing, but it causes unintuitive behavior where tracking presence in a channel will cause the client not to reconnect if the Application is restarted.
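
To make that propagation concrete, here is a minimal standalone sketch (plain processes, not Phoenix code) showing that a linked process which is not trapping exits goes down with the same :shutdown reason as its peer:

parent = self()

channel =
  spawn(fn ->
    receive do
      _ -> :ok
    end
  end)

tracker =
  spawn(fn ->
    # Stand-in for the link Phoenix.Tracker sets up when tracking a channel pid.
    Process.link(channel)
    send(parent, :linked)
    receive do
      _ -> :ok
    end
  end)

receive do
  :linked -> :ok
end

ref = Process.monitor(channel)
# A supervisor stopping the tracker delivers an exit signal with reason :shutdown.
Process.exit(tracker, :shutdown)

receive do
  {:DOWN, ^ref, :process, ^channel, reason} ->
    # reason is :shutdown here: the "channel" goes down with the tracker's
    # reason, while an unlinked process would simply be killed on app stop.
    IO.inspect(reason)
end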

The way I repro-ed this was:

  • Create a client that connects to both AChannel and BChannel
  • Run the server with iex -S mix
  • Set up traces on the AChannel and BChannel processes (I just used the observer)
  • In iex, run Application.stop(:myapp)
  • The trace logs will show
(This is AChannel's process, i.e. linked to a presence tracker)
13:45:07:849561 (<0.747.0>) in_exiting 0
13:45:07:849589 (<0.747.0>) exit shutdown
13:45:07:849603 (<0.747.0>) out_exited 0
(This is BChannel's process - no presence)
13:45:12:854318 (<0.744.0>) in_exiting 0
13:45:12:854329 (<0.744.0>) exit killed
13:45:12:854332 (<0.744.0>) out_exited 0
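
For reference, roughly the same exit reasons can also be captured from iex without the observer by tracing the two channel pids directly; the pids and the :myapp name below are placeholders:

# Make the iex shell the tracer for both channel pids and include exit events.
a_channel = :erlang.list_to_pid('<0.747.0>')  # placeholder pid for AChannel
b_channel = :erlang.list_to_pid('<0.744.0>')  # placeholder pid for BChannel

for pid <- [a_channel, b_channel] do
  :erlang.trace(pid, true, [:procs, :timestamp])
end

Application.stop(:myapp)

# The shell mailbox now holds trace messages such as:
#   {:trace_ts, pid, :exit, :shutdown, ts}  # AChannel, linked to the tracker
#   {:trace_ts, pid, :exit, :killed, ts}    # BChannel, no presence
flush()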

@josevalim
Member

Thanks for following up, this is very well detailed.

If we are going to fix this, we will likely do it by making BChannel exit with a shutdown reason too. Therefore, could you let us know why your application is being restarted?

@hanrelan
Author

For deployment I build releases using distillery. We're not doing hot upgrades, so to deploy I just untar the new release and run:
bin/myapp restart

This cleanly stops the application and then starts it with the new code. My interim solution to this problem is to just kill the beam process and then run bin/myapp start

Incidentally - the behavior I'd prefer is for AChannel to exit with :killed :) That way after the restart the client will reconnect to the channel. I can see why that doesn't make sense in a clean shutdown, I just wonder how many people already rely on that reconnect behavior and would get thrown off by it not reconnecting.

Perhaps an alternate solution is to have the Presence process monitor the channel process instead of linking to it. Then if the channel process dies, the presence process can also kill itself, but if the presence process dies, the supervisor can restart it and it can monitor the channel process again?

@josevalim
Member

josevalim commented Feb 17, 2017

Maybe the answer is to force a client reconnect unless the channel exits with normal reason.

hanrelan added a commit to hanrelan/phoenix that referenced this issue Feb 17, 2017
@hanrelan
Author

hanrelan commented Feb 17, 2017

I put together what I think is the least invasive approach that makes sense. You can see the change here: hanrelan/phoenix@2a3b38e

Essentially I treat {:shutdown, :closed} and {:shutdown, :left} as special cases which don't send phx_error, whereas all other :shutdown types do. I think this makes sense because you want those shutdowns to propagate over links (e.g. to end the presence process), so changing those to :normal would break in other places. All tests still pass with this change.
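
The gist of that special-casing, sketched here with an illustrative helper rather than the actual transport code (see the linked commit for the real diff):

defmodule ExitClassifier do
  # Illustrative only, not the actual transport code: decide whether an exit
  # reason is reported to the client as a clean close (phx_close) or as an
  # error (phx_error).
  def clean_exit?(:normal), do: true
  def clean_exit?({:shutdown, :closed}), do: true   # server closed the channel
  def clean_exit?({:shutdown, :left}), do: true     # client called leave()
  def clean_exit?(_other), do: false                # supervisor :shutdown, crashes, :killed, ...
end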

Happy to submit this as a PR in a topic branch if you think this makes sense.

@chrismccord
Member

Maybe the answer is to force a client reconnect unless the channel exits with normal reason.

This should already happen today. If the channel exits ungracefully, the client will auto-rejoin with exponential backoff. @hanrelan are you using the phoenix.js channels client?

@hanrelan
Author

hanrelan commented Feb 17, 2017

I am using the phoenix.js channels client

I think Jose meant :normal in the sense of the signal. What's happening is that this code:
https://github.com/phoenixframework/phoenix/blob/master/lib/phoenix/socket/transport.ex#L288
causes the server to send a phx_close event to the client when the process receives the :shutdown signal. Once the client receives phx_close, it doesn't try to reconnect.

However, :shutdown is used by supervisors to stop processes, so in this case during an application restart it's stopping the Presence process which is propagating the :shutdown to the channel process. This results in the client not reconnecting if the server is restarted (in the sense of a release restart).

The diff I sent before changes it so only explicitly closing the channel (via the client's leave() function or the server's close) is considered a clean shutdown. All other shutdowns, including one enforced by a supervisor or via a link, are considered unclean and the client will try to reconnect.

This keeps the behavior between processes that are linked to a presence and processes that aren't linked to a presence the same with respect to shutdowns and reconnect attempts.

Happy to discuss on Slack as well if that's easier; I'm @hanrelan in the phoenix channel.

@chrismccord
Member

Ah I see. We could change the behavior of the phx_close event to differentiate :normal and :shutdown, but I'm unsure what makes the most sense. For example, in your channel code today, you could do:

{:stop, {:shutdown, :normal}, socket}

What should happen in that case? We won't be able to determine the difference between this graceful exit and a "graceful restart" where the client should re-establish a connection.

@josevalim
Member

@chrismccord I think we should consider it the same as a link, which means that we will break on everything that is not :normal. I will push the changes to Phoenix master today. :)

@josevalim
Member

So I attempted a fix for this, and we use {:shutdown, ...} for clean shutdowns in many cases. I am not sure what the alternative is. We could special-case :shutdown, but that may be confusing.

@chrismccord
Member

Yes, that is my concern: {:shutdown, :normal} is a clean shutdown, which we would need to propagate to the client as a phx_error, which is wrong. A pre-app-shutdown task might be the way to go: one that forcefully kills transport pids, shuts down the transport, then issues the app shutdown. That way restarts would leave clients in rejoin mode? I see no other way around it.
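
Purely as an illustration of that idea (not anything Phoenix ships): a child started last in the supervision tree is terminated first on shutdown, so its terminate/2 could forcefully kill transport pids before the orderly shutdown reaches them. MyApp.TransportKiller and lookup_transport_pids/0 below are hypothetical:

defmodule MyApp.TransportKiller do
  # Hypothetical sketch: start this child last in the supervision tree so it
  # is shut down first when the application stops.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok) do
    # Trap exits so terminate/2 runs when the supervisor sends :shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  def terminate(:shutdown, _state) do
    # Killing the transports abruptly makes clients see an error, so they keep
    # retrying and rejoin once the restarted release is back up.
    for pid <- lookup_transport_pids(), do: Process.exit(pid, :kill)
    :ok
  end

  def terminate(_reason, _state), do: :ok

  # Hypothetical: how transport pids are looked up depends on the application.
  defp lookup_transport_pids, do: []
end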

@hanrelan
Author

Another thought, but obviously you guys are much more familiar with this than I am, so feel free to ignore it if this makes no sense.
What if, instead of linking to the channel process, Phoenix.Tracker just set up a monitor on the channel process? Then if the channel process dies, the Tracker can just stop itself.

The other case to deal with is if the Tracker process errors out for some reason: you don't want presence to silently stop working for a channel process. My understanding is fuzzier here, but maybe this could be resolved by using a transient restart strategy in the presence process' supervisor so it gets restarted if it errors?
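
A toy sketch of that monitor-based idea; this is not how Phoenix.Tracker actually works (as the next reply explains, it links and traps exits), just an illustration of the monitor alternative:

defmodule ToyTracker do
  # Toy illustration of a monitor-based tracker; not the real Phoenix.Tracker.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  def track(tracker, pid, key, meta), do: GenServer.call(tracker, {:track, pid, key, meta})

  def init(:ok), do: {:ok, %{}}

  def handle_call({:track, pid, key, meta}, _from, presences) do
    # Monitor instead of link: the tracker learns when the channel dies,
    # but the channel is never taken down by the tracker.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(presences, ref, {pid, key, meta})}
  end

  def handle_info({:DOWN, ref, :process, _pid, _reason}, presences) do
    # The tracked channel died: drop its presence entry.
    {:noreply, Map.delete(presences, ref)}
  end
end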

@josevalim
Member

@hanrelan the tracker traps exits exactly because we don't want the tracker to crash if a "client" crashes. But we need the link to solve your last paragraph; otherwise there is no way for the "presence client" to be brought down if the presence server crashes.

The client could monitor and raise but links are more appropriate here. I believe @chrismccord is working on a solution though. :)

@hanrelan
Author

hanrelan commented Feb 20, 2017

Ah, got it, I misunderstood the purpose of the link. Thanks for the explanation and the quick response to this issue @chrismccord and @josevalim!

chrismccord added a commit to phoenixframework/phoenix that referenced this issue Feb 21, 2017
ref: phoenixframework/phoenix_pubsub#68

Previously a channel could gracefully terminate
using the stop semantics of a regular genserver;
however when restarting an application for deploys
the shutdown of the transport and channel processes
would be indistinguishable from an intentional channel
shutdown and would cause clients to incorrectly not
reconnect after server restart. This commit adds a
{:graceful_exit, channel_pid, %Phoenix.Socket.Message{}}
contract to distinguish an intentional channel exit from
what should be regarded as an error condition on the client.
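
As a rough, non-authoritative sketch of what that contract enables (the real handling lives in the Phoenix commit above; ToyTransport and its send_fun are made up for illustration):

defmodule ToyTransport do
  # Toy sketch, not the actual Phoenix transport: a socket-owning process
  # chooses which frame to emit based on whether the channel announced a
  # graceful exit or simply went down.
  use GenServer

  def start_link(send_fun), do: GenServer.start_link(__MODULE__, send_fun)

  def init(send_fun) do
    # Trap exits so channel exits arrive as messages instead of killing us.
    Process.flag(:trap_exit, true)
    {:ok, send_fun}
  end

  # Channel announced an intentional close: relay phx_close so the client
  # does not attempt to rejoin.
  def handle_info({:graceful_exit, _channel_pid, close_msg}, send_fun) do
    send_fun.({:phx_close, close_msg})
    {:noreply, send_fun}
  end

  # Any other channel exit (including :shutdown during an app restart) is
  # surfaced as phx_error, so the client keeps retrying and rejoins.
  def handle_info({:EXIT, _channel_pid, reason}, send_fun) when reason != :normal do
    send_fun.(:phx_error)
    {:noreply, send_fun}
  end

  def handle_info(_other, send_fun), do: {:noreply, send_fun}
end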
@chrismccord
Member

This has been fixed on phoenix master and will ship with the 1.3 release. Check the PR commit for details, but the tl;dr is: your app restarts will be seen as errors by the client and they will reconnect. Cheers!
