there is something wrong with meteor when we have more than 4k users online (xhr flood) #11285

Closed
codeneno opened this issue Jan 13, 2021 · 19 comments
Labels
needs-reproduction We can't reproduce so it's blocked

Comments

@codeneno

codeneno commented Jan 13, 2021

We use Rocket.Chat (built on Meteor). This may be a SockJS bug: Presence Broadcast leads clients to flood the front end with long-polling XHR requests even though WebSocket is correctly configured. Maybe socket.io would be better.

We enabled sticky sessions in Nginx with ip_hash, so each source IP sticks to one upstream server. Even so, the Nginx logs show about 4000 clients that each send a request like this every second:

POST /sockjs/464/4sqavoxf/xhr HTTP/1.1
Host: rocketchat.company.com
Connection: keep-alive
Content-Length: 0
Origin: https://rocketchat.company.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Rocket.Chat/2.17.7 Chrome/78.0.3904.130 Electron/7.1.10 Safari/537.36
Accept: */*
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: https://rocketchat.company.com/direct/iodE4TwMg4i729GoHy5RWQyYZLZBKRuhpt
Accept-Encoding: gzip, deflate, br
Accept-Language: ru
Cookie: rc_uid=y5RWQyYZLZBKRuhpt; rc_token=2O55h3bWfNex-_KiYgwsvcEanzyL-Qdr7bXptnKir6m

CPU load and the connection count then climb very high until the server crashes.

The response looks like this:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 07 May 2020 05:02:20 GMT
Content-Type: application/javascript; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Cache-Control: no-store, no-cache, no-transform, must-revalidate, max-age=0
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: https://rocketchat.company.com
Vary: Origin
Access-Control-Allow-Origin: *.company.com
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

So it is definitely not a WebSocket. Yet Rocket.Chat works pretty normally for these problem clients.
I don't know why!
What is this? Is it some kind of compatibility fallback for WebSocket, or what? A JavaScript socket?
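
For what it's worth, the `/sockjs/464/4sqavoxf/xhr` path in the request above matches the SockJS URL layout `/<prefix>/<server id>/<session id>/<transport>`, where `xhr` is the XHR-polling fallback transport rather than a WebSocket. Below is a minimal sketch for grouping request paths from an access log by transport. The function name `parseSockjsPath` is made up for illustration, and the `/sockjs` prefix is assumed from the request shown above.

```javascript
// Hypothetical helper (not part of Meteor or SockJS): split a SockJS request
// path into its parts so that log lines can be grouped by transport.
// Assumes the "/sockjs" prefix seen in the request above.
function parseSockjsPath(path) {
  const pathOnly = path.split('?')[0]; // drop any query string
  const match = /^\/sockjs\/([^/.]+)\/([^/.]+)\/([a-z_]+)$/.exec(pathOnly);
  if (!match) return null; // e.g. /sockjs/info or a non-SockJS URL
  const [, serverId, sessionId, transport] = match;
  return {
    serverId,
    sessionId,
    transport,
    // Anything other than "websocket" is one of the fallback transports.
    isFallback: transport !== 'websocket',
  };
}

console.log(parseSockjsPath('/sockjs/464/4sqavoxf/xhr'));
```

Counting how many sessions report `isFallback: true` would show how many clients are stuck on polling at any given time.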

RocketChat/Rocket.Chat#17559
Additional context
Pressing CTRL+R on the client fixes the problem completely. Right after the reload, the client successfully opens a WebSocket (response 101) and the request flood stops.

https://user-images.githubusercontent.com/4023037/81270072-de543500-9052-11ea-8bb5-ba2c421ee2fc.png

But I want to know exactly which mode the problem clients are working in, and how to fix it permanently.
And what do the developers think could be the reason for this behavior?

@evolross
Contributor

Can you reproduce in a load test type environment?

@codeneno
Author

codeneno commented Jan 14, 2021

@evolross No, we can't create a test environment that matches the real one, but here is what we found:
1. The flood of XHR requests took down our servers (too many connections and too much CPU load).
2. Everything is fine when the user count is only about 2-3k.
3. We don't understand why XHR is used instead of WebSocket.

@dj-foxxy

We just encountered a similar (the same?) issue. The client-side fell back immediately to XHR despite web sockets functioning correctly. This only happened in production, which is strange because our dev environment includes Nginx with the same configuration. Most importantly it only happens when dynamic imports (e.g. await import('...')) are used (all of which we have removed to fix the issue).

Did a little poking around: line 1967 in socket-stream-client.js

    that.unload_ref = utils.unload_add(function () {
      that.ws.close();
    });

Could it be that something about dynamic imports is causing this unload handler to fire immediately? I noticed that the dynamic imports code has changed in recent versions.
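
To make the hypothesis concrete, here is a toy model (not SockJS's actual code) of a client that tries transports in order and moves on when one closes before it has opened. The names `connectWithFallback` and `fakeTransport`, and the two-entry transport list, are assumptions for illustration only.

```javascript
// Toy model only: this is NOT the SockJS implementation. It illustrates how a
// close that arrives before "open" can push a client onto the next transport.
function connectWithFallback(transports) {
  for (const transport of transports) {
    const result = transport.connect();
    if (result === 'open') return transport.name; // first transport to open wins
    // Any other result means: give up on this transport and try the next one.
  }
  return null; // nothing could connect
}

// Hypothetical transport factory: `closeEarly` stands in for an unload
// handler that fires immediately and closes the socket before it opens.
function fakeTransport(name, closeEarly) {
  return { name, connect: () => (closeEarly ? 'closed-before-open' : 'open') };
}

const healthy = connectWithFallback([
  fakeTransport('websocket', false),
  fakeTransport('xhr-polling', false),
]);
const raced = connectWithFallback([
  fakeTransport('websocket', true), // closed by the client before it opened
  fakeTransport('xhr-polling', false),
]);
console.log(healthy, raced); // websocket xhr-polling
```

In this model the server never rejects anything; the client ends up polling purely because its own close fired first, which would be consistent with web sockets otherwise functioning correctly.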

@codeneno
Author

@dj-foxxy Hello, where is socket-stream-client.js?

@dj-foxxy

@codeneno I'm not sure where the file comes from, but inside your project it's .meteor/local/build/programs/web.browser/packages/socket-stream-client.js

@codeneno
Author

codeneno commented Jan 17, 2021

@dj-foxxy Please send us that file. And how many online users do you have?

@filipenevola
Collaborator

Hi @codeneno @dj-foxxy I have never seen this problem before, but I have always worked on apps running on Galaxy, and Galaxy has a custom proxy written in Go. We don't use Nginx, so it's a different environment.

I mention this because it may be a clue to where the root cause is, or maybe it's something custom in the Rocket.Chat application. I'm not sure.

We have many clients with more than 15k simultaneous connections every day and we have no reports like this.

@filipenevola added the needs-reproduction (We can't reproduce so it's blocked) label Jan 17, 2021
@dj-foxxy

@codeneno I'm not sure what you mean. Does the relative file path not work? As for users, very few: we use it as the back end for a Twitch stream (the issue occurs regardless of user count).

@filipenevola Does Meteor support running your own instance? If so, is there documentation describing what a proxy should provide?

@filipenevola
Collaborator

filipenevola commented Jan 20, 2021

> @filipenevola Does Meteor support running your own instance? If so, is there documentation describing what a proxy should provide?

No, we don't. But I'm not saying that you need to run a custom proxy to solve your issue (as Meteor is using a WebSocket lib and not a custom implementation it would make no sense) but what I'm saying is that we have clients running more connections than you without problem and that MAYBE your issue is in your Nginx setup.

@codeneno
Author

@filipenevola No, Meteor uses SockJS.

@codeneno
Author

codeneno commented Jan 22, 2021

@filipenevola You have more than 15k simultaneous connections; so do we.
But you don't have more than 15k online users, just connections.

@dj-foxxy

@dj-foxxy

@filipenevola Chrome reported that the websocket was closed by the client before a connection to the server was established, so the problem is unlikely to be the server and more likely lies with SockJS's fallback mechanisms. I believe it happens in production due to timing, e.g., the server is not on the same box as the client. So the timing between Meteor dynamically loading JS and SockJS starting up is different, and then the issue occurs.

@a4xrbj1

a4xrbj1 commented Jan 27, 2021

We sometimes (not regularly) get a bunch of XHR errors (see the DataDog log below), but we have a maximum of 5-6 concurrent users.

Could this be related (sorry, no expert on Sockjs)?

Screenshot 2021-01-28 at 06 34 06

@dj-foxxy

@a4xrbj1 Looks like what we got. Are you using dynamic imports? When a client falls back to XHR, do the browser dev tools say that the websocket (that SockJS attempted to use) was closed before a connection was established?

@a4xrbj1

a4xrbj1 commented Jan 28, 2021

We're using ElectronJS as a client and therefore cannot identify what is happening on the dev tools (it's only happening on Production).

The only way we can identify which customer is actually experiencing it is from the location data that we get in DataDog. I've attached two screenshots from one of the log entries, but we've got 8 XHR errors in 1 minute from the same client in Canada. The URL is always the same.

Upon examining more of our log files, we can see that the user's last action was at 2:25 and the XHR errors happened at 2:42, so 17 minutes later. It looks like the user left the computer with the ElectronJS app running (3rd screenshot).

We do have an automated process at the Backend to kick out inactive users after 15 minutes but there's no trace of that in the log file. So that might have caused the XHR errors. Meaning that the Backend logged the user out after 15 minutes and the Frontend app tried to contact the Backend 2 minutes later but as it wasn't connected anymore it threw the XHR error. So that would mean the Backend couldn't inform/reach the Frontend app properly of the logout somehow.
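
The timeline above can be checked with a little arithmetic. The sketch below uses only the figures already mentioned (last action at 2:25, XHR errors at 2:42, a 15-minute inactivity limit); the helper name `minutesSinceMidnight` is made up for this example.

```javascript
// Convert an "H:MM" clock time to minutes since midnight.
function minutesSinceMidnight(clock) {
  const [hours, minutes] = clock.split(':').map(Number);
  return hours * 60 + minutes;
}

const lastAction = minutesSinceMidnight('2:25');
const xhrErrors = minutesSinceMidnight('2:42');
const inactivityLimit = 15; // minutes, the kick-out threshold described above

const idleMinutes = xhrErrors - lastAction;            // 17
const minutesAfterKick = idleMinutes - inactivityLimit; // 2
console.log({ idleMinutes, minutesAfterKick });
```

So the errors start roughly 2 minutes after the point where the inactivity logout would have fired, which fits the explanation above but does not prove it.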

Screenshot 2021-01-28 at 20 18 13

Screenshot 2021-01-28 at 20 18 28

Screenshot 2021-01-28 at 20 23 33

@anyway111

I have the same issue. Is there a hotfix? Please consider helping.

@markdowney

I believe I'm seeing the same issue. My app is being hammered with requests to /sockjs/info and the number of connections reported by Monti APM keeps accumulating without older connections being released.

reqs

I also suspect dynamic-imports but haven't confirmed yet.
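
One way to put numbers on the `/sockjs/info` hammering is to count those requests per client in the access log. This is only a sketch: the function name `countRequestsByClient`, the `{ ip, path }` entry shape, and the sample data are all made up for illustration and would need adapting to the real log format.

```javascript
// Hypothetical log analysis helper: count requests per client for one path.
// Each entry is assumed to look like { ip, path }.
function countRequestsByClient(entries, targetPath) {
  const counts = new Map();
  for (const { ip, path } of entries) {
    // Ignore any query string so "/sockjs/info?..." still matches.
    if (path.split('?')[0] !== targetPath) continue;
    counts.set(ip, (counts.get(ip) || 0) + 1);
  }
  // Sort descending so the noisiest clients come first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample data for illustration only.
const sample = [
  { ip: '10.0.0.1', path: '/sockjs/info?cb=abc' },
  { ip: '10.0.0.2', path: '/sockjs/info?cb=def' },
  { ip: '10.0.0.1', path: '/sockjs/info?cb=ghi' },
  { ip: '10.0.0.1', path: '/sockjs/464/4sqavoxf/xhr' },
];
console.log(countRequestsByClient(sample, '/sockjs/info'));
```

A handful of clients dominating the top of that list would point at stuck clients retrying, rather than at overall load.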

@technicalbirdVayuz

How can this be solved? We have built a system based entirely on REST APIs, but this still occurs. We also tried DISABLE_WEBSOCKETS=1, but it does not help. Please help!

@vitorflores
Contributor

Hi, this issue was opened a year ago, and as we still don't have a way to reproduce this issue I'm closing it.

Of course, if we have a reproduction in the future we can re-open it. No problem.
