
How to do application-level ping timeout properly? #159

Closed
michalrus opened this issue Nov 6, 2017 · 17 comments · Fixed by #239

Comments

michalrus commented Nov 6, 2017

I have a working solution with the current API, but it feels sooooo hacky. 😺

  1. First, because connectionOnPong doesn’t get its Connection (why?), we’re binding onPong <- newTVarIO (pure ()).
  2. And as connectionOnPong, we’re passing join (readTVarIO onPong).
  3. After we get a Connection, we’re putting an action into this onPong, which:
    1. Starts a new Async, which:
      1. threadDelays (2 * tcpTimeout).
      2. Sends an error data message to the connection.
      3. Sends close to the connection.
      4. Kills the main WAI request thread with throwTo waiThread PingTimeout (this is awful).
    2. Cancels the previous Async.
  4. Apart from all that, we have to wrap everything in a finally that cancels that Async as well, so that after a clean disconnection we don’t run the code meant for the ping timeout.

This way, on average after 1.5 * tcpTimeout (*), a dropped connection is detected and killed, and the whole system can learn about it, run appropriate clean up code etc.

(*) Even if we were scheduling this killer Async right after sending a Ping (which seems more in line with the naming, ping timeout), it’d still happen after 1.5 * tcpTimeout on average, because the Ping message would be sent, on average, 0.5 * tcpTimeout after the connection got dropped.

When we don’t do that, dropping the connection with iptables does absolutely nothing: it just stays open, even after Warp’s setTimeout passes! I don’t understand why, though.


How to do this properly? My way surely doesn’t feel appropriate (although it works well).

@michalrus
Author

The extracted code (with ClassyPrelude as the implicit Prelude; connectionTimeout, threadDelay', sendJson, toServantErr, and NotAWebSocket are helpers from my own codebase):

import qualified Control.Exception                  as Unsafe
import qualified Data.Aeson                         as J
import           GHC.Conc                           (myThreadId)
import qualified Network.Wai                        as W
import qualified Network.Wai.Handler.WebSockets     as W
import qualified Network.WebSockets                 as W
import qualified Servant.Server.Internal.ServantErr as W (responseServantErr)

data KillWaiThread =
  KillWaiThread
  deriving (Show)

instance Exception KillWaiThread

-- |The whole point of this module is to have application-level pings for greater control in the application, i.e. to solve <https://github.com/jaspervdj/websockets/issues/159>.
pingingWsApp :: (W.Connection -> IO ()) -> W.Application
pingingWsApp action __unused1 __unused2 = do
  onTcpPong <- newTVarIO (pure ())
  W.websocketsOr
    W.defaultConnectionOptions
      { W.connectionOnPong = join (readTVarIO onTcpPong)
      }
    (wsApp onTcpPong)
    backup
    __unused1
    __unused2
  where
    backup _ respond' = respond' . W.responseServantErr $ toServantErr NotAWebSocket
    -----------------------------------------------
    wsApp onTcpPong pendingConn =
      ignoreKillWaiThread . ignoreClosedTCP $ do
        conn <- W.acceptRequest pendingConn
        waiThreadId <- myThreadId
        pingTimeoutKiller :: TVar (Maybe (Async ())) <- newTVarIO Nothing
        flip finally (traverse_ cancel =<< readTVarIO pingTimeoutKiller) $ do
          atomically . writeTVar onTcpPong $ do
            newKiller <-
              async $ do
                threadDelay' (2 * connectionTimeout)
                sendDisconnect conn ("Ping timeout." :: Text)
                -- This is ugly, but what else can we do with the API provided
                -- by 'Network.WebSockets'? Unsafe.throwTo, because we don’t
                -- want the exception wrapped — we’ll catch it later in
                -- 'ignoreKillWaiThread'.
                Unsafe.throwTo waiThreadId KillWaiThread
            previousKiller <- atomically $ swapTVar pingTimeoutKiller (Just newKiller)
            cancel `traverse_` previousKiller
          W.forkPingThread conn (round connectionTimeout)
        action conn
    -----------------------------------------------
    ignoreClosedTCP = handle (\(_ :: W.ConnectionException) -> pure ()) -- don’t log them, no point
    -----------------------------------------------
    ignoreKillWaiThread = Unsafe.handle (\(_ :: KillWaiThread) -> pure ()) -- don’t log them, no point
    -----------------------------------------------
    sendDisconnect conn why = do
      sendJson conn . J.object $ ["tag" J..= ("Error" :: Text), "description" J..= why]
      W.sendClose conn ("{}" :: ByteString)

@michalrus
Author

When adapting ↑, it’s important to get masked and unmasked exceptions right here…

@michalrus
Author

Also, because of the API of connectionOnPong, we can’t easily reuse that code for client connections. Hmm.

@thomasjm

@michalrus thanks for the code above! Do you happen to know if any progress has been made since this was posted, or is your solution still the best way to do this?

@jaspervdj
Owner

It looks solid -- I agree that the API is not ideal, and I'm happy to change parts of it if there's a way that offers existing users a clean migration path. I think it would be a bit cleaner to have a single killer thread that shares a reference to a timeout, where every pong that is received prolongs this timeout -- that way it's not necessary to cancel and restart the killer threads. That is the way it's done in the snap backend of websockets, but of course snap also provides a different API than warp.
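
A minimal sketch of that single-killer-thread idea (the overall shape, the kill action, and the names watchdog/prolong are my assumptions, not library code):

import Control.Concurrent (threadDelay)
import Control.Concurrent.STM (TVar, atomically, readTVarIO, writeTVar)
import Data.Time.Clock (UTCTime, addUTCTime, diffUTCTime, getCurrentTime)

-- One watchdog thread per connection; each pong pushes the deadline out,
-- so nothing ever needs to be cancelled and restarted.
watchdog :: TVar UTCTime -> IO () -> IO ()
watchdog deadlineVar kill = loop
  where
    loop = do
      deadline <- readTVarIO deadlineVar
      now      <- getCurrentTime
      if now >= deadline
        then kill -- no pong arrived in time: tear the connection down
        else do
          -- Sleep until the current deadline, then re-check; a pong may
          -- have pushed it further into the future in the meantime.
          threadDelay (ceiling (diffUTCTime deadline now * 1000000))
          loop

-- Meant to be installed as the connection's onPong action.
prolong :: TVar UTCTime -> Int -> IO ()
prolong deadlineVar seconds = do
  now <- getCurrentTime
  atomically $ writeTVar deadlineVar (addUTCTime (fromIntegral seconds) now)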

@thomasjm

@jaspervdj thanks!

Question: is there some reason why the ping thread itself doesn't wait for pongs and cause the connection to be closed when a pong doesn't return for a long time? This seems like a natural job for the ping thread (in addition to keeping the connection alive through proxies and such).

That was my expectation anyway for how the ping thread would behave -- it was an unpleasant discovery that this library doesn't actually have a mechanism to detect disconnects by itself, and that you have to build something crazy out of TVars like the above. I think it would cause less astonishment for users to let the ping thread do this out of the box (with an option to turn it on and off).

I'm currently working on another way to manage pongs, using a Chan and a single thread; I'll post again if I have any other ideas about the API...

@thomasjm

Okay, I have a couple thoughts about the current API (particularly when writing a server):

  1. It seems that connectionOnPong is actually a global setting -- the same action is invoked in response to a pong for every connection (at least when you run the server using something like runServerWith, or websocketsOr from WAI). This kind of makes it a non-starter to track ping timeouts on a per-connection basis using this callback.

  2. In order to work around this, I'm forced to stop using receiveDataMessage and instead use receive directly, so that I can handle the control messages myself. However, I can't quite replicate the behavior of receiveDataMessage, because it accesses the connectionSentClose field of Connection, which is not publicly exposed; see here. So I have to fork this repo and expose that constructor in order to do this. I think receiveDataMessage is aiming for convenience, but it's actually doing users a bit of a disservice, because it prevents us from seeing the control messages.

Suggestion: what if you provide a way to set the connectionOnPong callback on a per-connection basis, perhaps as part of the acceptRequest function?
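
To make the suggestion concrete, one hypothetical shape for it (the acceptOnPong field does not exist in the library today -- this is a design sketch, not working code):

-- Hypothetical: AcceptRequest grows a per-connection pong callback.
data AcceptRequest = AcceptRequest
  { acceptSubprotocol :: Maybe ByteString
  , acceptHeaders     :: Headers
  , acceptOnPong      :: IO () -- invoked on every pong for this connection
  }

-- A server app could then track pongs without sharing state across connections:
wsApp pending = do
  lastPong <- newIORef =<< getCurrentTime
  conn <- acceptRequestWith pending defaultAcceptRequest
    { acceptOnPong = writeIORef lastPong =<< getCurrentTime }
  ...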

@jaspervdj
Owner

Hmm, this is a bit surprising to me. connectionOnPong should absolutely be a per-connection thing; if not, it is quite useless, as you pointed out. I don't use/maintain the WAI backend of websockets in any large projects myself, and it seems like this could be a misunderstanding that arose when that library was written.

Once that is fixed, I think it should be doable to build an application-level ping timeout on top of the recently added withPingThread, which users can easily wrap their application in -- and I'd be happy to merge that into this library, or to help get it into the WAI backend of websockets (depending on what parts it uses).
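
For context, withPingThread wraps the application action and pings every n seconds; a minimal usage sketch (the echo loop is just filler for illustration):

import Control.Monad (forever)
import qualified Network.WebSockets as WS

-- Pings every 30 seconds while the wrapped action runs. On its own this
-- only keeps the connection alive; it does not detect a dead peer.
app :: WS.PendingConnection -> IO ()
app pending = do
  conn <- WS.acceptRequest pending
  WS.withPingThread conn 30 (pure ()) $
    forever $ do
      msg <- WS.receiveDataMessage conn
      WS.sendDataMessage conn msg -- echo, just to have a loop body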

@thomasjm

Maybe I'm misunderstanding, but the runServerWith function provided by this library seems to have the same problem -- it accepts a ConnectionOptions argument that seems to get used for every connection; see here. It seems like the websocketsOr function from the WAI integration is just following the lead of this function?

FWIW, having just been through the process of writing my own ping/pong system, I don't think withPingThread is quite the right building block to use. I think the ping thread needs to internally understand what a pong is, so that it can notice when one takes too long to arrive. In case it helps, here's a sketch of how I did it:

{-# LANGUAGE LambdaCase #-}

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (async, cancel, cancelWith, waitEitherCatch)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Exception (Exception)
import Control.Monad (forever)
import Data.Function (fix)
import System.Timeout (timeout)
import qualified Data.ByteString as B
import qualified Network.WebSockets as WS

data PongTimeout = PongTimeout deriving Show

instance Exception PongTimeout

pingWaitTime :: Int
pingWaitTime = 30 * 1000000 -- microseconds between pings (30 s is an arbitrary choice)

runApp pending = do
  conn <- WS.acceptRequest pending
  withPinger conn $ \pongChan -> do
    -- Run main application loop
    forever $ WS.receive conn >>= \case
      WS.ControlMessage (WS.Pong _) -> writeChan pongChan ()
      _ -> pure () -- ... other cases, including DataMessage, go here

withPinger conn action = do
  pongChan <- newChan
  mainAsync <- async $ action pongChan
  pingerAsync <- async $ runPinger conn pongChan

  waitEitherCatch mainAsync pingerAsync >>= \case
    -- If the application async died for any reason, kill the pinger async
    Left _ -> cancel pingerAsync
    -- The pinger thread should never throw an exception. If it does, kill the app thread
    Right (Left _) -> cancel mainAsync
    -- The pinger thread exited due to a pong timeout. Tell the app thread about it.
    Right (Right ()) -> cancelWith mainAsync PongTimeout

runPinger conn pongChan = fix $ \loop -> do
  WS.sendPing conn (mempty :: B.ByteString)
  threadDelay pingWaitTime
  -- See if we got a pong within that time
  timeout 1000000 (readChan pongChan) >>= \case
    Just () -> loop
    Nothing -> return ()

@jaspervdj
Owner

@thomasjm That makes sense. I think runServerWith needs to follow the lead of websockets-snap, basically cloning the options and updating connectionOnPong when a new connection comes in. I think we can use withPingThread in combination with a killer for that, but I'd need to take a stab at it to be sure.
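
That cloning step could look roughly like this (the helper name and where it would hook into the accept path are my assumptions, not the actual runServerWith internals):

import Data.IORef (IORef, newIORef, writeIORef)
import Data.Time.Clock (UTCTime, getCurrentTime)
import qualified Network.WebSockets as WS

-- For each new connection, derive fresh options from the shared ones so
-- that connectionOnPong can close over per-connection state (here, the
-- time of the last pong), while still running the user's global hook.
clonedOptionsFor :: WS.ConnectionOptions -> IO (WS.ConnectionOptions, IORef UTCTime)
clonedOptionsFor sharedOpts = do
  lastPong <- newIORef =<< getCurrentTime
  let perConnOpts = sharedOpts
        { WS.connectionOnPong = do
            WS.connectionOnPong sharedOpts
            writeIORef lastPong =<< getCurrentTime
        }
  pure (perConnOpts, lastPong)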

@nbouscal

@jaspervdj Please let me know if there are specific parts of this that my team could help with. We're using the library pretty heavily in production and have run into quite a few issues with connection state. (It's possible we could sponsor some of the work, if that would help.)

FWIW, we're just using runServer; we tried wai-websockets but couldn't figure out how to reasonably disable the slowloris timeout. We haven't tried websockets-snap (yet).

@jaspervdj
Owner

Hi @nbouscal -- sorry for dragging this out. I am currently vacationing in Mexico so I don't have time to look at this properly, but I did have some downtime on a bus this morning, so I put together #199 with a draft of what I think I want it to look like. Note that because of the circumstances, I have only checked that this compiles; I haven't tried to run it at all. Could you help me out by reviewing and sense-checking that PR? Thanks for your patience!

@nbouscal

No worries, hope you’re enjoying your vacation! I’ll take a look this weekend. Thanks!

@domenkozar
Collaborator

I've written an implementation that reuses withPingThread: cachix/cachix#414

robbert-vdh added a commit to channable/icepeak that referenced this issue May 11, 2023
There is no clean or built-in way to do this, see
jaspervdj/websockets#159. This implementation
works by keeping track of the last received pong, and then checking
whether the previous ping has been answered with a pong before sending a
new one. If that isn't the case, the connection is terminated early
through the use of `withInteruptablePingThread`.
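
The approach that commit describes, reconstructed as a sketch (the names pongReceived and onDead are mine; this is not icepeak's actual code):

import Control.Concurrent (threadDelay)
import Data.IORef (IORef, readIORef, writeIORef)
import qualified Data.ByteString as B
import qualified Network.WebSockets as WS

-- Before each ping, check whether the previous one was ever answered; the
-- connection's onPong handler is assumed to set the flag back to True.
pingLoop :: WS.Connection -> IORef Bool -> Int -> IO () -> IO ()
pingLoop conn pongReceived intervalSecs onDead = go
  where
    go = do
      threadDelay (intervalSecs * 1000000)
      answered <- readIORef pongReceived
      if answered
        then do
          writeIORef pongReceived False -- expect a fresh pong for this ping
          WS.sendPing conn B.empty
          go
        else onDead -- previous ping went unanswered: consider the peer dead
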
@domenkozar
Collaborator

Looking at the current implementation of ping/pong handling, I suggest the following changes:

  • requireServerPong: it should also install withPingThread and be enabled by default.
  • The client should also support ping/pong, as the connection can hang on both sides of the protocol.

@domenkozar
Collaborator

I've found a way to implement ping-pong generically for any connection, and it also simplifies the code!

Please take a look at #239
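
If #239 lands in the shape it sketches, usage could end up looking roughly like this (withPingPong and defaultPingPongOptions are assumed names -- check the PR for the final API):

import Control.Monad (forever)
import qualified Network.WebSockets as WS

-- Assumed API shape from #239: wrap the app so pings are sent and a
-- missing pong tears the connection down, for client and server alike.
app :: WS.PendingConnection -> IO ()
app pending = do
  conn <- WS.acceptRequest pending
  WS.withPingPong WS.defaultPingPongOptions conn $ \pingedConn ->
    forever $ do
      msg <- WS.receiveDataMessage pingedConn
      WS.sendDataMessage pingedConn msg -- echo, just to have a loop body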
