Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

websocket disconnects after pairing #11989

Open
nowords123 opened this issue Dec 3, 2022 · 12 comments
Open

websocket disconnects after pairing #11989

nowords123 opened this issue Dec 3, 2022 · 12 comments
Assignees
Labels

Comments

@nowords123
Copy link

https://lichess.org/forum/lichess-feedback/lag-spike-at-game-start-unable-to-move-and-game-is-aborted-after-time-limit-is-reached?

We are trying to see if there is a socket disconnection issue and which side it is on pls can we get some help

@ddugovic
Copy link
Contributor

ddugovic commented Dec 3, 2022

Indeed, I tried briefly looking at this; hopefully other developers looking at issues can help put it into perspective since at the moment it seems to be over my head. From that forum thread:

I play a bunch of 1/0 bullet, and every 3 to 8 games I'll start a game, but I can't make my first move.

I believe this is the result of a lag spike that usually occurs at the beginning of games, as this is the only time I ever have problems with connection.

Sometimes at the start of games, and no other times, there will be a massive disconnect. It will often say reconnecting right after I realize that although my piece has been moved, it doesn't yet say that it is my opponents turn. It will then do one of a few things.

  1. It may get back connection after 5 or so seconds and I can play as if nothing happened.
  2. It will go down to the wire and last second I'll get a connection
  3. It won't ever reconnect between game start and abort timer end. In other words, I can move and the game is aborted.

Of the three above, it seems like 3 is the most likely. 2 almost never happens, and 1 happens almost as much as 3.

When starting a game, the first step seems to be to connect two players together, once connected, I believe that is when the abort timer begins, in my mind, there is an additional network step between when I can move and when the timer begins, and I think it is this step where the problem arises. I imagine it would also have to occur in an operation which would be expected to take almost no time at all and would be very consistent and thus would be difficult to find. This is all my speculation, I don't really have a clue about the systems involved, but I still hope my ponderings can provide some inspiration into the problem.

It may be a me problem instead of a server problem as well, as when I play bullet on my phone with low net speed and on app, I don't have any problems, but when I play on my computer sometimes I get a major lag spike (a temporary total disconnection of computer or server) only at the start of games. This I imagine would make the problem much harder to find, so here is a bit of computer info for you.

Windows 10, lotsa ram, Brave browser (like chrome but with ad-blockers and data tracker blockers), I'm pretty far from France I think and I don't use a VPN.

... and finally:

OK, Toadofsky's post made me think, what else can i see from client-side.

Overall, today is bad diagnosis day, as this "lagging" happened really few compared to 3 previous days.
Only some tournaments were affected from 9-15 EET+2

Also turns out, that recording whole session with Mozilla developer tools performance analyzer requests too much memory (or overflows memory), to get result of one tournament saved as profile.

But, i was able to catch several interesting findings, while running Lichess in special "focus mode" with developer tools network analyzer turned on
First i hoped, that i found culprit, but not so sure now.

Findings so far. Quite often there is blockage on webpage loading request, reason i can not tell, but once it took 14,98 secs and this tournament game is was timed out.
get socket is sometimes quite slow, often 300ms but also reached 7-14 seconds in worst cases, but looks like it does not depend, witch of the sockets is free (from socket 1 -to 5)

But, there are some get TLS session problems, and those mach more with "disconnect feature", as they are quite long -from 5-7 secs following with message - reconnecting in 3500 ms.
One game i lost exactly while this setup TLS timed out - with no answer or reason to request.

All those things happened with Firefox, only one tab open. (maybe should try plenty of tab open too, to compare.

For fact-hungry persons i can provide screenshots from those timeouts too.

@schlawg schlawg changed the title Trying to solve lag websocket disconnects after pairing Dec 3, 2022
@schlawg schlawg added the bug label Dec 3, 2022
@schlawg
Copy link
Collaborator

schlawg commented Dec 3, 2022

possible ws connection status & lila onlineUserSet out of sync issue

@schlawg schlawg self-assigned this Dec 3, 2022
@ddugovic
Copy link
Contributor

ddugovic commented Dec 4, 2022

I am relaying comments from https://lichess.org/forum/lichess-feedback/lag-spike-at-game-start-unable-to-move-and-game-is-aborted-after-time-limit-is-reached?page=3#21 which may or may not apply in this context:

Next example is hard to read, not nesccesarily spam. Those requests make most of the problems, as they get often stalled:

You can also see timings on message reguests themselves.

WebSocketv5?sri=2EM8hafMx4G1&v=22 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=2EM8hafMx4G1&v=22 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 15.88 s upgrade nginx WebSocketv5?sri=3c7px7DBhxA0&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=3c7px7DBhxA0&v=23 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 16.97 s upgrade nginx WebSocketv5?sri=6jCyQ6EY4KIp&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=6jCyQ6EY4KIp&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 6.79 s upgrade nginx WebSocketv5?sri=A9Hits96plAP&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket5.lichess.org](https://socket5.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=A9Hits96plAP&v=23 GET 101 websocket socket5.lichess.org websocket 0 B Pending upgrade nginx WebSocketv5?sri=aaG81qGJMxsk&v=7 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=aaG81qGJMxsk&v=7 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 8.89 s upgrade nginx WebSocketv5?sri=COZFmLr8Ys0v&v=11 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=COZFmLr8Ys0v&v=11 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 5.52 s upgrade nginx WebSocketv5?sri=Ff8qrmAMhix_&v=7 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=Ff8qrmAMhix_&v=7 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 13.24 s upgrade nginx WebSocketv5?sri=kysHtX_zn7at&v=17 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=kysHtX_zn7at&v=17 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 5.63 s upgrade nginx WebSocketv5?sri=n19sdQv0TIIX&v=10 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=n19sdQv0TIIX&v=10 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 6.10 s upgrade nginx WebSocketv5?sri=P2nYXG54CrIe&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=P2nYXG54CrIe&v=23 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 11.50 s upgrade nginx WebSocketv5?sri=QC5odHRfYcHc&v=11 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=QC5odHRfYcHc&v=11 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 12.06 s upgrade nginx WebSocketv5?sri=u9Jc6MsUgth4&v=13 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=u9Jc6MsUgth4&v=13 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 1.27 s upgrade nginx WebSocketv5?sri=W6eoM283P65J&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=W6eoM283P65J&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 8.59 s upgrade nginx WebSocketv5?sri=Wjr4t3SbfG_7&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=Wjr4t3SbfG_7&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 5.75 s upgrade nginx WebSocketv5?sri=xD9_HLZBXdvd&flag=tournament /socket/v5 wss://socket1.lichess.org/socket/v5?sri=xD9_HLZBXdvd&flag=tournament GET 101 websocket socket1.lichess.org websocket 0 B 1.05 s upgrade nginx WebSocketv5?sri=yyn05E1aiLLJ&v=20 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=yyn05E1aiLLJ&v=20 GET 101 websocket socket1.lichess.org websocket 0 B 12.17 s upgrade nginx WebSocketv6?sri=68pkAmCEJvNf&v=0 /play/Q3Buj6WRPAyb/v6 wss://socket1.lichess.org/play/Q3Buj6WRPAyb/v6?sri=68pkAmCEJvNf&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 47.43 s upgrade nginx WebSocketv6?sri=6jtBCCIvZ_FA&v=0 /play/RDSuYehNV2Ql/v6 wss://socket1.lichess.org/play/RDSuYehNV2Ql/v6?sri=6jtBCCIvZ_FA&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 2.1 min upgrade nginx WebSocketv6?sri=9Ua6X_y8uf_n&v=0 /play/1mCYlL0zriwn/v6 wss://socket1.lichess.org/play/1mCYlL0zriwn/v6?sri=9Ua6X_y8uf_n&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 46.02 s upgrade nginx WebSocketv6?sri=_qeSficwWqsK&v=0 /play/oz9XiRFQq898/v6 wss://socket1.lichess.org/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0 Finished socket1.lichess.org websocket 0 B (unknown) WebSocketv6?sri=_qeSficwWqsK&v=0 /play/oz9XiRFQq898/v6 wss://[socket5.lichess.org](https://socket5.lichess.org/)/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0 GET 101 websocket socket5.lichess.org websocket 0 B 2.2 min upgrade nginx WebSocketv6?sri=_XLeRQvaRRNI&v=0 /play/Yh0S1s5jEc1M/v6 wss://[socket1.lichess.org](https://socket1.lichess.org/)/play/Yh0S1s5jEc1M/v6?sri=_XLeRQvaRRNI&v=0 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 2.0 min upgrade nginx WebSocketv6?sri=b7j1OzgxJ0lJ&v=0 /play/weKduVYiOIr4/v6 wss://[socket1.lichess.org](https://socket1.lichess.org/)/play/weKduVYiOIr4/v6?sri=b7j1OzgxJ0lJ&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 2.1 min upgrade nginx WebSocketv6?sri=EN4reiGByWp_&v=0 /play/f32V2syXU0oZ/v6 wss://socket1.lichess.org/play/f32V2syXU0oZ/v6?sri=EN4reiGByWp_&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 1.4 min upgrade nginx

looks like those previous messages contain mainly chat presence, public presense in chat or tournament and etc.
Next example message -got stalled for 8,5 seconds:

4
18:31:33.207
{"t":"startWatching","d":"aIfPXkp0"} 36
18:31:33.207
{"t":"reload"} 14
18:31:33.314
0 1
18:31:33.356
{"t":"fen","d":{"id":"aIfPXkp0","lm":"e7e6","fen":"r1bqkbnr/pp1p1ppp/2n1p3/2p5/4P3/1Q6/PPPP1PPP/RNB1KBNR w","wc":59,"bc":57}} 125
18:31:33.356
{"t":"fen","d":{"id":"aIfPXkp0","lm":"b3f3","fen":"r1bqkbnr/pp1p1ppp/2n1p3/2p5/4P3/5Q2/PPPP1PPP/RNB1KBNR b","wc":58,"bc":57}} 125
18:31:33.430
{"t":"crowd","d":{"nb":20,"users":["cadetblue","pirkarpenter","Shaurya1303","tom230","BuDDhInIAM","dimoss","ppermas_3_0","bozkurtfrkn","GenaBerezov","Diksha_Alluri","kudzu12","basile73","Basico2k","lestrand","Emmanuelv7","chumbichess","Manova-jeu","theindiansniper","Ishanth43","Gleb_6_let"],"anons":0}} 303
18:31:33.954
{"t":"fen","d":{"id":"aIfPXkp0","lm":"g7g6","fen":"r1bqkbnr/pp1p1p1p/2n1p1p1/2p5/4P3/5Q2/PPPP1PPP/RNB1KBNR w","wc":58,"bc":56}} 127
18:31:34.215
{"t":"reload"} 14
18:31:34.446
{"t":"fen","d":{

There are also some random errors like this:

lichess.min.js:36 WebSocket connection to 'wss://socket1.lichess.org/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0' failed: WebSocket is closed before the connection is established.
I.disconnect @ lichess.min.js:36
I.destroy @ lichess.min.js:36
I.connect @ lichess.min.js:36
(anonymous) @ lichess.min.js:36
setTimeout (async)
I.scheduleConnect @ lichess.min.js:36
I.connect @ lichess.min.js:36
I @ lichess.min.js:36
q @ round.min.js:1
(anonymous) @ oz9XiRFQ:9
Promise.then (async)
(anonymous) @ oz9XiRFQ:9

And some cosmetic bug -in every page load:

Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.

22lichess.min.js:33 Uncaught TypeError: Cannot read properties of undefined (reading 'cancel')
at HTMLDivElement.mouseenter (lichess.min.js:33:3639)
at HTMLDivElement.ya (lichess.min.js:21:306)
mouseenter @ lichess.min.js:33
ya @ lichess.min.js:21
22lichess.min.js:33 Uncaught TypeError: Cannot read properties of undefined (reading 'hide')
at HTMLDivElement.mouseleave (lichess.min.js:33:3706)
at HTMLDivElement.ya (lichess.min.js:21:306)
mouseleave @ lichess.min.js:33
ya @ lichess.min.js:21
Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.

@Kudzu12
Copy link

Kudzu12 commented Dec 7, 2022

Here i can provide several client side logs .har format, you can use them via browser developer tools feature -import to see all saved internals.
Logs are created by Firefox and Vivaldi (chromium based)
You can see dates by file creating time and usually there is also included time played in EET+2 time format in file name itself.

I try to provide those more problematic ones.
Allalaadimised.zip
more har files.zip

If looking inside logs, you can consider my ping very stable 48-49 milliseconds.
No packet-loss or whatsoever lagging against ISP and lichess.org itself while those tournaments are recorded

Everything else (media, fonts, board items, audio, java-script) looks like running very fast micro and milliseconds.
Fetches and WS (websocket), its connection upgrade process are most problematic areas.

Some ideas i currently have:

i do not think, it's software. Too random are those socket stalls, drops.
I can imagine something like that happening on overload or congestion of network, something that congests cache, CPU for short periods.

There is need to see what happens on and between lichess virtual servers those times.

I do not know, if they did change anything on server level yesterday or today, but today most tournaments i was present ran almost perfectly without hiccups whatsoever.

If any questions or need something more specific data, please ask.

@Kudzu12
Copy link

Kudzu12 commented Dec 7, 2022

Some thoughts, do not know if this is right place to post them or who may be interested if anyone at all.

For being successful in more complicated issue tracking, specially those like socket errors, happening possibly in server software level or on database interactions level and/or networking stack congestion's should be taken into account also on par with servers overall load monitoring, with timestamps if needed.

So, if servers are in located OVH hosted environment, do anybody from server administering persons or developing teems have access to virtual machinery's load monitoring?
Usually hosting providers are not too keen to make this kind access directly available.
If direct monitoring is possible, it would be easier to track down client-side events regarding to server internals.

If OVH is not allowing access to servers performance monitoring or graphs, then there are possibility to add something like Nagios system or equivalent monitoring on hosting system by admins.

Another way to achieve such task is to run some simple scripts like atop or similar, paying attention to CPU, Memory and filesystem IO wait and stats and create alerts by findings.

As for client-side, isn't it possible to make some small add-on script with on/off "knob"t o client interface, who will volunteer in client-side monitoring, activating performance analyzer or network-tools logging then by script in their browsers and automatic sending of logs after time intervals.
There may be also some trigger like if socket is lost closed or something, then log and send it with timestamp.

So, just ideas... maybe that wheel is allready invented and i just do not know it :)

@schlawg
Copy link
Collaborator

schlawg commented Dec 7, 2022

As for client-side, isn't it possible to make some small add-on script with on/off "knob"t o client interface, who will volunteer in client-side monitoring, activating performance analyzer or network-tools logging then by script in their browsers and automatic sending of logs after time intervals.
There may be also some trigger like if socket is lost closed or something, then log and send it with timestamp.

That is very much the goal of #11991. We would start small with lichess log statement persistence and collection but once that is working, data from any number of analysis tools can be added if they opt in. Regarding event triggers such as unexpected socket errors, the current plan is to give users the option to review everything they'd be sending at the point of submission. This is mainly to reinforce that we're not harvesting their grocery lists and christmas card labels. Implementations may always be revised, but this is my current plan.

@Kudzu12
Copy link

Kudzu12 commented Dec 7, 2022

.... Regarding event triggers such as unexpected socket errors, the current plan is to give users the option to review everything they'd be sending at the point of submission. This is mainly to reinforce that we're not harvesting their grocery lists and christmas card labels. Implementations may always be revised, but this is my current plan.

Last one shouldn't be too big deal. I just tried as experiment both with Mozilla and Vivaldi browser - network tools log runs by default only in boundaries of one tab it was initially started on, so browsing on other pages does not affect log collected as long network analyzer is not started on that tab too.

But, for security reasons or for paranoid people - overview of sent material may be safer if they commit log sending by themselves. That of cause limits sending process automation.

@schlawg
Copy link
Collaborator

schlawg commented Jan 21, 2024

Hey all, we've got the client log delivery mechanism in place (finally). Can anyone still reproduce this though?

@Kudzu12
Copy link

Kudzu12 commented Jan 22, 2024

I do not see implementation on lichess clients web interface. So how could one invoke this log delivery?

@schlawg
Copy link
Collaborator

schlawg commented Jan 22, 2024

https://lichess.org/#debug but we are very sparse with logging there. Log statements need to be tailored to each issue. Currently we have statements in place for general lila-ws troubles but nothing centered around pairing. That's why I wanted to reach out and see if you guys could still reproduce this specific issue. I could add some statements and try to gather us useful data after the next asset deployment.

@Kudzu12
Copy link

Kudzu12 commented Jan 22, 2024 via email

@schlawg
Copy link
Collaborator

schlawg commented Jan 22, 2024

So you experience poor websockets performance/uptime in general? There is an alternate IP with different routing we can try. PM me on lichess if you're interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants