Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventing Websocket connection time if destination is not available #2734

Closed
JanFellner opened this issue Jul 13, 2021 · 26 comments
Closed

Eventing Websocket connection time if destination is not available #2734

JanFellner opened this issue Jul 13, 2021 · 26 comments

Comments

@JanFellner
Copy link
Contributor

JanFellner commented Jul 13, 2021

Hi @lminiero
hi MeetEcho team

as you were complaining about not receiving PRs (At least @fippo told me so ;] ) from our side i am finally stepping into the code.
Actually i wanted to improve something super trivial but in the end my lack of knowledge in libwebsocket hinders me to really make a PR out of it but i provide the code.

We are using Janus as SFU and have an own signalling server upfront.
In order to get everything up and running is thus start janus in advance the the signalling server.
The problem is that it then takes time until janus is able to setup the outbound eventing websocket connection to our signalling server as in the initial startup the connect fails. In the end leading that leads to a total delay of additional 30 seconds to get a new deployed server up an running.

Furthermore if the connection attempt fails janus is always doubling the time for a reconnect which makes it kind of ugly to develop. If the signalling server is not running for 10 minutes i need to wait several minutes for janus to reconnect or i restart janus.

Thus the code tries to solve two things:

  • A configurable parameter to define how long the janus should maximum delay a reconnect attempt (max_reconnect_retry_timeout)
  • Improve that the reconnect acts faster.

master...JanFellner:feature/CC-2166-make-eventing-websocket-reconnect-time-configurable

The issue i have is that lws_service always blocks for about 25 seconds. I added diagnostic code that shows the issue:

  • [Tue Jul 13 16:05:03 2021] woke up (return from lws_service)
  • [Tue Jul 13 16:05:03 2021] going to sleep (diving into lws_service)
  • [Tue Jul 13 16:05:03 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:729] Reconnecting failed (reason: Unable to connect), next retry in 1s...
  • [Tue Jul 13 16:05:28 2021] woke up (return from lws_service)
  • [Tue Jul 13 16:05:28 2021] [WARN] Reconnecting WebSocketsEventHandler event handler to 10.20.3.220:3000...
  • [Tue Jul 13 16:05:28 2021] going to sleep (diving into lws_service)
  • [Tue Jul 13 16:05:33 2021] woke up (return from lws_service)
  • [Tue Jul 13 16:05:33 2021] going to sleep (diving into lws_service)
  • [Tue Jul 13 16:05:33 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:729] Reconnecting failed (reason: Unable to connect), next retry in 2s...
@JanFellner
Copy link
Contributor Author

Using libwebsocket 4.1-stable branch just to have it mentioned

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

@JanFellner

The issue i have is that lws_service always blocks for about 25 seconds. I added diagnostic code that shows the issue:

Not strictly related to your changes, but the lws_service waking up after 25-30 seconds is an issue I've run into too.
A simple way to reproduce is starting janus (I don't use event handlers, only the transport), wait a couple of seconds, then close with CTRL-C. The process will halt unti the lws_service returns.

I noticed that the issue happens only on lws >= 3.2.
According to the changelog the lws team completely refactored the scheduler, indeed the timeout parameter of lws_service is being ignored.

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

As a side note I've tested the minimal lws server example and CTRL-C immediately stops the process.
I guess there is something we do in janus (ctx init, threading etc.) that triggers the issue.

@JanFellner
Copy link
Contributor Author

Thanks @atoppi for stepping into it.
So you will track that 25 second lws issue loosen from my issue here and i wait for that fix before i place the PR or shall i place the PR for the changes i made?

@lminiero
Copy link
Member

You may want to investigate this yourself as well, in the meanwhile. As Alessandro said, it doesn't happen with the example they provide, so looking at the differences from more POV would help.

@lminiero
Copy link
Member

On the PR, probably better to wait until the issue is solved, as otherwise from what I've understood it wouldn't be usable due to the long timeout anyway, is that correct?

@JanFellner
Copy link
Contributor Author

If i compare the websocket sample they do nothing in the lws_service loop and handle the connecting completely in dedicated scheduler callbacks (lws_sul_schedule)

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/ws-client/minimal-ws-client-tx/minimal-ws-client.c

I will check if thats a working approach cause i was already playing around with all the initialization parameters without any effect. Looks like if the websocket is not connected the lws_service is just internally ideling...

@lminiero
Copy link
Member

Yes, you need to "wake up" the loop using lws_cancel_service: we just found out that's what was causing the issue @atoppi was experiencing in the WS transport. Very likely we'll need to do the same in the WS EVH reconnect logic.

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

here is the commit
eb5b20a

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 14, 2021

Nice finding, but for the eventing websocket i dont see it solving the problem.
I added the cancel to the connecting failed. It wakes the loop up, the loop checks wether the duration for the reconnect has passed by, and as it has not, sleeps again.

I wonder as all the implementations i find concerning the libwebsocket do nothing in the lws_service loop so i would implement it based on the sample. handle the connect with dedicated callbacks from the scheduler and let the loop just do the lws_service.

I would adopt the PR in that direction if you are not currently investigate in that direction.

@lminiero
Copy link
Member

The examples use lws_sul_schedule to do scheduled callbacks, but that's only available since 3.2, and we support older versions too. I'm not going to make changes that break backwards compatibility.

@JanFellner
Copy link
Contributor Author

i will have that in mind...... so up to 3.2 connect attempt out of the service loop
starting with 3.2 and higher build something based on the scheduler...

@lminiero
Copy link
Member

No need, we fixed it already. Commit to come.

@JanFellner
Copy link
Contributor Author

k, standing by, if you touch it please check the maximum duration a reconnect may take.
That was the initial thing about my changes. In case you don´t touch that one i will make a new PR based on the new master.
(Idea was to be able to specify a maximum idle time of the eventing websocket to speed up connects/reconnects)

@atoppi atoppi closed this as completed in 5607013 Jul 14, 2021
@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

The above commit fixes the issue for us.
We have added a macro hardcoded to 8 (seconds).

@JanFellner
Copy link
Contributor Author

Will check if it fixes the startup issue of the two servers. Thanks.

@JanFellner
Copy link
Contributor Author

The lws_service still sleeps for 25 seconds. the janus shutdown is fixed, it returns from lws_service due to the lws_cancel.
I guess the key is not to do the reconnect in the lws_service loop but a dedicated method loosen from that thread.
(like in the samples)

  • [Wed Jul 14 16:43:03 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:724] Error connecting to backend
  • [Wed Jul 14 16:43:31 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 1s)
  • [Wed Jul 14 16:43:36 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:724] Error connecting to backend
  • [Wed Jul 14 16:44:01 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 2s)
  • [Wed Jul 14 16:44:06 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:724] Error connecting to backend
  • [Wed Jul 14 16:44:31 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 4s)
  • [Wed Jul 14 16:44:36 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:724] Error connecting to backend
  • [Wed Jul 14 16:45:01 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 8s)
  • [Wed Jul 14 16:45:06 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:724] Error connecting to backend
  • [Wed Jul 14 16:45:31 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 8s)

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

I can not reproduce the issue anymore. Please try again with a clean install.
If it still happens, please provide some steps.

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

these are the logs from a janus instance with enabled ws ev handler (and upstream ws server not started)

[Wed Jul 14 17:03:10 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 2s)
[Wed Jul 14 17:03:10 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:722] Error connecting to backend
[Wed Jul 14 17:03:12 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 4s)
[Wed Jul 14 17:03:12 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:722] Error connecting to backend
[Wed Jul 14 17:03:16 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 8s)
[Wed Jul 14 17:03:16 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:722] Error connecting to backend
[Wed Jul 14 17:03:24 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 8s)
[Wed Jul 14 17:03:24 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:722] Error connecting to backend
[Wed Jul 14 17:03:32 2021] [WARN] Reconnecting to WebSockets event handler backend... (next retry in 8s)
[Wed Jul 14 17:03:32 2021] [ERR] [events/janus_wsevh.c:janus_wsevh_callback:722] Error connecting to backend

@JanFellner
Copy link
Contributor Author

JanFellner commented Jul 14, 2021

strange, which libwebsocket is being used?
made clean one, same ... :\

@atoppi
Copy link
Member

atoppi commented Jul 14, 2021

lws 4.2

@JanFellner
Copy link
Contributor Author

Found it:
The connect itself takes 25 seconds if the target is not on the same machine.
If i specify a local port it works like expected ...

I will play a bit and see if the connect is established if the server starts while he is waiting for the connect.

@atoppi
Copy link
Member

atoppi commented Jul 15, 2021

Good catch.
I guess that is a whole different problem: checking lws client connection timeout.
Please submit a new PR in case you end up with a proper fix.

@atoppi
Copy link
Member

atoppi commented Jul 15, 2021

fyi there are a couple of references to timeout settings in a lws client example here

@JanFellner
Copy link
Contributor Author

I already played around with those timer values yesterday but did not get a proper result.
The funny thing is that the connection failed callback always comes after 5 seconds.
But the lws_service does not wake up afterwards.

I have it not working with the approach from the samples using the lws_sul_schedule and leaving the lws_service on it´s own. That looks promising.

@JanFellner
Copy link
Contributor Author

Looks promising. But due to the change with the scheduler a bit refactored.
Prepping PR, let´s follow up the discussion then in the PR.

[Thu Jul 15 13:32:10 2021] [WARN] WebSocketsEventHandler: Error attempting connection... (next retry in 1s)
[Thu Jul 15 13:32:11 2021] WebSocketsEventHandler: Connecting to upstream websocket server 10.20.3.220:3000...
[Thu Jul 15 13:32:16 2021] [WARN] WebSocketsEventHandler: Error attempting connection... (next retry in 2s)
[Thu Jul 15 13:32:18 2021] WebSocketsEventHandler: Connecting to upstream websocket server 10.20.3.220:3000...
[Thu Jul 15 13:32:23 2021] [WARN] WebSocketsEventHandler: Error attempting connection... (next retry in 4s)
[Thu Jul 15 13:32:27 2021] WebSocketsEventHandler: Connecting to upstream websocket server 10.20.3.220:3000...
[Thu Jul 15 13:32:32 2021] [WARN] WebSocketsEventHandler: Error attempting connection... (next retry in 8s)
[Thu Jul 15 13:32:40 2021] WebSocketsEventHandler: Connecting to upstream websocket server 10.20.3.220:3000...
[Thu Jul 15 13:32:45 2021] [WARN] WebSocketsEventHandler: Error attempting connection... (next retry in 8s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants