there is something wrong with meteor when we have more than 4k users online (xhr flood) #11285

codeneno · 2021-01-13T01:16:50Z

We use Rocket.Chat (built on Meteor). It may be a SockJS bug: Presence Broadcast leads clients to flood the front end with long-polling XHR requests, even though WebSocket is correctly configured. Maybe socket.io would be better.

We enabled sticky sessions in Nginx with ip_hash, so each source IP sticks to one upstream server. In the Nginx logs we see about 4000 clients that each send a request like this every second:

POST /sockjs/464/4sqavoxf/xhr HTTP/1.1
Host: rocketchat.company.com
Connection: keep-alive
Content-Length: 0
Origin: https://rocketchat.company.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Rocket.Chat/2.17.7 Chrome/78.0.3904.130 Electron/7.1.10 Safari/537.36
Accept: */*
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: https://rocketchat.company.com/direct/iodE4TwMg4i729GoHy5RWQyYZLZBKRuhpt
Accept-Encoding: gzip, deflate, br
Accept-Language: ru
Cookie: rc_uid=y5RWQyYZLZBKRuhpt; rc_token=2O55h3bWfNex-_KiYgwsvcEanzyL-Qdr7bXptnKir6m

CPU load and the connection count then get very high, and the server crashes.

The response looks like this:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 07 May 2020 05:02:20 GMT
Content-Type: application/javascript; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Cache-Control: no-store, no-cache, no-transform, must-revalidate, max-age=0
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: https://rocketchat.company.com
Vary: Origin
Access-Control-Allow-Origin: *.company.com
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

So it is definitely not a WebSocket. But Rocket.Chat works fairly normally for these problem clients.
I don't know why!
What is this? Is it some kind of compatibility fallback for WebSocket, or what? A JavaScript socket?

RocketChat/Rocket.Chat#17559
Additional context
Pressing CTRL+R on the client fixed the problem completely. Right after the reload, the client successfully opens a WebSocket (response 101) and the request flood stops.

https://user-images.githubusercontent.com/4023037/81270072-de543500-9052-11ea-8bb5-ba2c421ee2fc.png

But I want to know exactly which mode the problem clients are working in, and how to fix it permanently.
And what do the developers think could be the reason for this behavior?
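
For reference, the request path shown above matches the SockJS URL layout `/sockjs/<server>/<session>/<transport>`, where `xhr` is the XHR-polling receive transport, so these clients appear to be running on SockJS's HTTP fallback rather than on WebSocket. Below is a minimal sketch for tallying transports from logged request paths. The function names `classifySockJsPath` and `tallyTransports` are hypothetical, and the `/sockjs` prefix is assumed from the request in this post.

```javascript
// Transport names defined by the SockJS protocol for session URLs.
const SOCKJS_PATH =
  /^\/sockjs\/([^\/.]+)\/([^\/.]+)\/(websocket|xhr|xhr_send|xhr_streaming|eventsource|htmlfile|jsonp|jsonp_send)$/;

// Hypothetical helper: split a logged request path into its SockJS parts.
function classifySockJsPath(path) {
  const pathname = path.split("?")[0]; // ignore any query string
  if (pathname === "/sockjs/info") return { kind: "info" };
  const match = SOCKJS_PATH.exec(pathname);
  if (!match) return { kind: "other" };
  const [, server, session, transport] = match;
  const kind = transport === "websocket" ? "websocket" : "http-fallback";
  return { kind, server, session, transport };
}

// Hypothetical helper: count how many requests used each transport.
function tallyTransports(paths) {
  const counts = {};
  for (const path of paths) {
    const info = classifySockJsPath(path);
    const key = info.transport || info.kind;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

Feeding it the request paths from an Nginx access log should show whether `xhr` polls dominate over `websocket` upgrades, and the `session` field can be used to see which sessions are doing the polling.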

evolross · 2021-01-13T23:46:27Z

Can you reproduce in a load test type environment?

codeneno · 2021-01-14T00:34:17Z

@evolross No, we can't create a test environment that matches the real one, but we found:
1. The flood of XHR requests took down our servers (too many connections and too much CPU load).
2. Everything is fine when the user count is only about 2-3k.
3. We don't know why clients use XHR instead of WebSocket.

dj-foxxy · 2021-01-17T11:37:28Z

We just encountered a similar (the same?) issue. The client-side fell back immediately to XHR despite web sockets functioning correctly. This only happened in production, which is strange because our dev environment includes Nginx with the same configuration. Most importantly it only happens when dynamic imports (e.g. await import('...')) are used (all of which we have removed to fix the issue).

Did a little poking around: line 1967 in socket-stream-client.js

    that.unload_ref = utils.unload_add(function () {
      that.ws.close();
    });

Could it be that something about dynamic imports is causing this unload handler to fire immediately? I noticed that the dynamic imports code has changed in recent versions.

codeneno · 2021-01-17T11:55:36Z

@dj-foxxy Hello, where is socket-stream-client.js?

dj-foxxy · 2021-01-17T11:57:46Z

@codeneno I'm not sure where the file comes from, but inside your project it's .meteor/local/build/programs/web.browser/packages/socket-stream-client.js

codeneno · 2021-01-17T12:05:25Z

@dj-foxxy Please share the file with us. And how many online users do you have?

filipenevola · 2021-01-17T13:10:56Z

Hi @codeneno @dj-foxxy, I have never seen this problem before, but I have always worked on apps running on Galaxy, and Galaxy has a custom proxy written in Go. We don't use Nginx, so it's a different environment.

I'm saying this because it may be a clue to where the root cause is, or it may be something custom in the Rocket.Chat application; I'm not sure.

We have many clients with more than 15k simultaneous connections every day and we have no reports like this.

dj-foxxy · 2021-01-17T13:39:53Z

@codeneno I'm not sure what you mean. Does the relative file path not work? As for users, very few: we use it as the back end for a Twitch stream (the issue occurs regardless of user count).

@filipenevola Does Meteor support running your own instance? If so, is there documentation describing what a proxy should provide?

filipenevola · 2021-01-20T11:49:26Z

> @filipenevola Does Meteor support running your own instance? If so, is there documentation describing what a proxy should provide?

No, we don't. But I'm not saying that you need to run a custom proxy to solve your issue (as Meteor is using a WebSocket lib and not a custom implementation it would make no sense) but what I'm saying is that we have clients running more connections than you without problem and that MAYBE your issue is in your Nginx setup.

codeneno · 2021-01-22T04:49:18Z

@filipenevola No, Meteor uses SockJS.

codeneno · 2021-01-22T05:05:26Z

@filipenevola You have more than 15k simultaneous connections; so do we.
But you don't have more than 15k online users, just connections.

@dj-foxxy

dj-foxxy · 2021-01-22T13:50:17Z

@filipenevola Chrome reported that the websocket was closed by the client before a connection to the server was established, so it's unlikely to be the server and more likely to be SockJS's fallback mechanisms. I believe it happens in production due to timing, e.g., the server is not on the same box as the client, so the timing between Meteor dynamically loading JS and SockJS starting up is different and the issue occurs.

a4xrbj1 · 2021-01-27T22:36:36Z

We sometimes (not regularly) get a bunch of XHR errors (see the DataDog log below), but we have a maximum of 5-6 concurrent users.

Could this be related (sorry, no expert on Sockjs)?

dj-foxxy · 2021-01-28T09:43:13Z

@a4xrbj1 Looks like what we got. Are you using dynamic imports? When a client falls back to XHR, do the browser dev tools say that the websocket (that SockJS attempted to use) was closed before a connection was established?
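
If the dev tools are hard to reach, one possible way to capture the same information is to wrap the WebSocket constructor so that every socket reports whether it closed before it ever opened. This is only a sketch: `instrumentWebSocket` and the `report` callback are hypothetical names, and it assumes the wrapper is installed before the SockJS client creates its socket.

```javascript
// Hypothetical helper: returns a subclass of the given WebSocket-like
// constructor that calls `report` once per socket when it closes.
function instrumentWebSocket(BaseWebSocket, report) {
  return class InstrumentedWebSocket extends BaseWebSocket {
    constructor(...args) {
      super(...args);
      let opened = false;
      // Remember whether the handshake ever completed.
      this.addEventListener("open", () => {
        opened = true;
      });
      // On close, report whether the socket had ever been open.
      this.addEventListener("close", (event) => {
        report({
          url: String(args[0]),
          closedBeforeOpen: !opened,
          code: event.code, // close code, if the event carries one
        });
      });
    }
  };
}

// Possible browser usage (an assumption, not tested against Meteor):
// window.WebSocket = instrumentWebSocket(window.WebSocket, console.warn);
```

A report with `closedBeforeOpen: true` would correspond to the "closed before a connection was established" message, and the `report` callback could forward it to whatever logging the app already uses.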

a4xrbj1 · 2021-01-28T12:27:04Z

We're using ElectronJS as the client and therefore cannot see what is happening in the dev tools (it's only happening in production).

The only way we can identify which customer is actually experiencing it is from the location data that we get in DataDog. I've attached two screenshots from one of the log entries, but we got 8 XHR errors in 1 minute from the same client in Canada. The URL is always the same.

On examining more of our log files, we can see that the user's last action was at 2:25 and the XHR errors happened at 2:42, so 17 minutes later. It looks like the user left the computer with the ElectronJS app running (3rd screenshot).

We do have an automated process at the Backend to kick out inactive users after 15 minutes but there's no trace of that in the log file. So that might have caused the XHR errors. Meaning that the Backend logged the user out after 15 minutes and the Frontend app tried to contact the Backend 2 minutes later but as it wasn't connected anymore it threw the XHR error. So that would mean the Backend couldn't inform/reach the Frontend app properly of the logout somehow.

anyway111 · 2021-02-26T08:20:15Z

I have the same issue. Is there a hotfix? Please consider helping.

markdowney · 2021-03-30T11:20:41Z

I believe I'm seeing the same issue. My app is being hammered with requests to /sockjs/info and the number of connections reported by Monti APM keeps accumulating without older connections being released.

I also suspect dynamic-imports but haven't confirmed yet.

technicalbirdVayuz · 2021-05-19T15:42:11Z

How can we solve this? We have built a system based entirely on REST APIs, but this still happens. We also tried DISABLE_WEBSOCKETS=1, but it does not work. Please help!

vitorflores · 2022-01-28T13:24:05Z

Hi, this issue was opened a year ago, and as we still don't have a way to reproduce this issue I'm closing it.

Of course, if we have a reproduction in the future we can re-open it. No problem.

filipenevola added the needs-reproduction label (We can't reproduce so it's blocked) Jan 17, 2021

emikolajczak mentioned this issue May 25, 2021

Rocket.Chat stability issue when many users disconnected and connected again RocketChat/Rocket.Chat#21182

Open

vitorflores closed this as completed Jan 28, 2022
