100% CPU usage after restarting one of the cluster nodes #1134

Open · yadvlz opened this issue Dec 24, 2020 · 29 comments

yadvlz commented Dec 24, 2020

Our cluster consists of three nodes:

  • 32x Xeon(R) CPU E7-8860 v4 @ 2.20GHz, 128 GB RAM, SSD, VMware VM.
  • nats-streaming-server version 0.19.0, nats-server: v2.1.9

Configuration file:

port: 4222
http_port: 8222
write_deadline: "15s"
debug: false
trace: false
logtime: true
logfile_size_limit: 10MB
log_file: "/storage/nats-streaming-server/logs/nats.log"

authorization {
    timeout: 2
    users: [
        { user: user1, password: 1 }
        	...
        { user: nss_cluster, password: clusterpass }
    ]
}

cluster: {
    listen: 0.0.0.0:6222
    authorization: {
        timeout: 2
        user: nats_route
        password: natsroutepass
    }
    routes: ["nats-route://nats_route:natsroutepass@10.0.0.2:6222", "nats-route://nats_route:natsroutepass@10.0.0.3:6222"]
}

streaming: {
    id: "stan-eas"
    sd: false
    sv: false
    store: "file"
    dir: "/storage/nats-streaming-server/data"
    hb_interval: "30s"
    hb_timeout: "10s"
    hb_fail_count: 10
    username: nss_cluster
    password: clusterpass

    store_limits: {
        max_channels: 0
        max_msgs: 0
        max_bytes: 0MB
        max_age: "144h"
        max_subs: 0
        max_inactivity: "144h"
        channels: {
            "LocalAction": { max_age: "144h" }
            "PME.Control": { max_age: "144h" }
            "PME.Rules": { max_age: "8760h", max_msgs: 5 }
            "PerfCounter.Metrics": { max_age: "4h" }
            "SQLExporter.Metrics": { max_age: "4h" }
            "WatchDog.Metrics": { max_age: "4h" }
            "PerfCounter.Log": { max_age: "4h" }
            "SQLExporter.Log": { max_age: "4h" }
            "WatchDog.Log": { max_age: "4h" }
        }
    }

    file: {
        compact_min_size: 100MB
        sync_on_flush: true
        file_descriptors_limit: 100
        parallel_recovery: 4
    }

    cluster: {
        node_id: "a"
        peers: ["b", "c"]
        log_path: "/storage/nats-streaming-server/nsslog"
        log_cache_size: 1024
        log_snapshots: 10
        trailing_logs: 256
        raft_logging: false
    }
}

Number of clients: ~40,000.
Average CPU consumption: 400/3200 (from "top").
If I restart one of the nodes, the utilization on all nodes jumps to 3200.

Log:

[64189] 2020/12/22 14:12:58.305625 [WRN] 10.121.75.67:52353 - cid:3100306 - Readloop processing time: 3.453755114s
[64189] 2020/12/22 14:12:58.305672 [WRN] 10.52.214.213:55416 - cid:3100375 - Readloop processing time: 3.607654096s
[64189] 2020/12/22 14:12:58.305736 [WRN] 10.28.100.161:56442 - cid:3100352 - Readloop processing time: 3.657804437s
[64189] 2020/12/22 14:12:58.305799 [WRN] 10.42.118.194:50449 - cid:3100231 - Readloop processing time: 3.519814355s
[64189] 2020/12/22 14:12:58.305843 [WRN] 10.52.214.213:55420 - cid:3100472 - Readloop processing time: 3.552920741s
[64189] 2020/12/22 14:12:58.305903 [WRN] 10.125.220.213:52240 - cid:3099976 - Readloop processing time: 3.991156933s
[64189] 2020/12/22 14:12:58.305970 [WRN] 10.25.81.205:49452 - cid:3099913 - Readloop processing time: 6.234881178s
[64189] 2020/12/22 14:12:58.305991 [WRN] 10.126.77.154:58490 - cid:3099999 - Readloop processing time: 3.49807845s
[64189] 2020/12/22 14:12:58.306033 [WRN] 10.32.7.145:63423 - cid:3100051 - Readloop processing time: 3.728901556s
[64189] 2020/12/22 14:12:58.306076 [WRN] 10.118.201.117:55069 - cid:3100431 - Readloop processing time: 3.650559115s
[64189] 2020/12/22 14:12:58.306110 [WRN] 10.121.75.67:53058 - cid:3100164 - Readloop processing time: 3.618370455s
[64189] 2020/12/22 14:12:58.306163 [WRN] 10.84.89.227:51199 - cid:3100208 - Readloop processing time: 3.640992134s
[64189] 2020/12/22 14:12:58.306206 [WRN] 10.66.110.225:58100 - cid:3100222 - Readloop processing time: 3.704752827s
[64189] 2020/12/22 14:12:58.306256 [WRN] 10.121.75.67:53157 - cid:3100299 - Readloop processing time: 3.487310526s
[64189] 2020/12/22 14:12:58.306304 [WRN] 10.96.8.16:56316 - cid:3100012 - Readloop processing time: 3.882298912s
[64189] 2020/12/22 14:12:58.306336 [WRN] 10.113.81.66:52301 - cid:3100159 - Readloop processing time: 3.729128858s
[64189] 2020/12/22 14:12:58.306374 [WRN] 10.42.93.98:54259 - cid:3100947 - Readloop processing time: 2.452962671s
[64189] 2020/12/22 14:12:58.306382 [WRN] 10.39.210.213:49808 - cid:3100371 - Readloop processing time: 3.542212148s
[64189] 2020/12/22 14:12:58.306416 [WRN] 10.118.88.4:52712 - cid:3100117 - Readloop processing time: 3.520391606s
[64189] 2020/12/22 14:12:58.306490 [WRN] 10.28.97.66:50986 - cid:3100471 - Readloop processing time: 3.542449103s
[64189] 2020/12/22 14:12:58.306587 [WRN] 10.93.217.5:51259 - cid:3100391 - Readloop processing time: 3.552975942s
[64189] 2020/12/22 14:12:58.306633 [WRN] 10.121.75.67:52246 - cid:3100373 - Readloop processing time: 3.608532296s
[64189] 2020/12/22 14:12:58.306677 [WRN] 10.121.75.67:53091 - cid:3100210 - Readloop processing time: 3.586355502s

@kozlovic (Member)

Some questions:

  • Does the problem resolve on its own after a while? If so, how long does it take? If not, how long did you wait?
  • Any idea how many connections were on the restarted node? You mentioned something like 40,000 clients total.
  • How long was the restarted node left down? Was there a lot of activity during that time?
  • Are the warnings you posted the whole extent of them, or just a sample? Were they in a single server's log or on all of them?

The fact that all 3 nodes were experiencing high CPU usage makes me believe that it may have been due to the number of clients reconnecting at the same time. If not, and if the stopped node had been down for a while and required a lot of catching up on restart, that catch-up would be with the leader, so I would then expect only 2 nodes to - possibly - be very active.

@yadvlz (Author) commented Dec 28, 2020

  • It doesn't. I waited for about 6-8 hours. To resolve the issue, I block all incoming packets with a firewall and run "for i in {0..255}; do echo "$i"; iptables -I INPUT 1 -s 10.$i.0.0/16 -j ACCEPT; sleep 30; done" on each node.
  • 12-15k.
  • I just run "systemctl restart nats-streaming-server".
  • Warnings were on all servers.

@yadvlz (Author) commented Dec 28, 2020

CPU jumps on all nodes after restarting one of them.

@kozlovic (Member)

To make sure I have correctly understood the description of the issue:

The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients. After the restart, all nodes were 100% CPU for 6 to 8 hours.
You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.

Is that about right?

@yadvlz (Author) commented Dec 28, 2020

The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.

Correct.

After the restart, all nodes were 100% CPU for 6 to 8 hours.

Maybe more, I didn't test.

You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.

Clients are located on the 10.0.0.0/8 network. I block all incoming packets for about 20 minutes, and after that the CPU load drops. If I unblock incoming packets, the load increases again. To prevent this, I am forced to unblock incoming packets partially, one /16 every 30 seconds.

@yadvlz (Author) commented Dec 28, 2020

The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.

12,000 - 15,000. Sorry.

@kozlovic (Member)

12,000 - 15,000. Sorry.

Yes, I understood that, no worries.

To prevent this, I am forced to unblock incoming packets partially, one /16 every 30 seconds.

So do you get to a point where all client traffic is resumed and the servers behave normally (no high CPU usage on any node)?

@yadvlz (Author) commented Dec 29, 2020

So do you get to a point where all client traffic is resumed and the servers behave normally (no high CPU usage on any node)?

Correct. Right now the cluster is in this state and the unblock script is running on all nodes :)

@kozlovic (Member)

Correct. Right now the cluster is in this state and the unblock script is running on all nodes :)

Oh so you mean that you are still filtering clients? So if you were to "open the gate" there would still be the issue?
And are we sure that the client traffic is the same as before, and that the issue really started with the restart of a node and not simply with some client usage that suddenly increased?

How many channels do you have, and what is the normal message rate? How many subscriptions exist, and are they mainly 1-to-1 or does one message go to many different subscriptions?

@yadvlz (Author) commented Dec 29, 2020

Oh so you mean that you are still filtering clients?

The problem has been observed for several months and every time it appears I have to solve it by blocking/unblocking clients.

And are we sure that the client traffic is the same as before, and that the issue really started with the restart of a node and not simply with some client usage that suddenly increased?

Sometimes the load increases without restarting a node. The cluster works normally, then the load suddenly jumps. The problem is solved by blocking/unblocking clients.

How many channels do you have, and what is the normal message rate?

~42,000 channels.

{
"num_subscriptions": 124494,
"num_cache": 10469384,
"num_inserts": 117474553,
"num_removes": 117350059,
"num_matches": 2124441735,
"cache_hit_rate": 0.0279807824430638,
"max_fanout": 0,
"avg_fanout": 0
}

@kozlovic (Member)

Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels. I see that there is no limit on these channels except for max_age or max_inactivity. Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?

When the event occurs (either on server restart or on its own), have you correlated it to an increase in message rate in some channels? Have you noticed some channels unexpectedly filling up with messages maybe?
Are you using streaming only or do you have NATS core usage from client applications on this cluster?

@yadvlz (Author) commented Dec 29, 2020

I need the help of my teammate. He'll be able to respond in a few hours.

@yadvlz (Author) commented Dec 29, 2020

Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels.

We have one channel with 40,000 subscribers and 40,000 1-to-1 channels.

Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?

Messages removed after 6 days.

When the event occurs (either on server restart or on its own), have you correlated it to an increase in message rate in some channels?

We need to check it out.

Have you noticed some channels unexpectedly filling up with messages maybe?

Noticed - where?

Are you using streaming only or do you have NATS core usage from client applications on this cluster?

Both of them.

@kozlovic (Member)

We have one channel with 40,000 subscribers

Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.

Noticed - where?

Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
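
For reference, a minimal sketch of polling one of those endpoints in Go (it uses the HTTP monitor port 8222 from the config above; the host address is a placeholder, not from this deployment):

// Sketch: dump /streaming/channelsz from the monitoring port to look for
// channels whose message counts grow unexpectedly. Adjust offset/limit to
// page through ~42,000 channels.
package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://10.0.0.1:8222/streaming/channelsz?subs=1&limit=1024")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
        log.Fatal(err)
    }
}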

Both of them.

I have a nagging suspicion that it is either a sudden influx of streaming messages or something NATS related (or some subscriptions being constantly created/destroyed). The reason I say that is that all 3 nodes are affected. That can be the case if streaming messages are added to the system, since they get replicated to all 3 nodes. That being said, it is the same for any other activity: connection create/close, subscription create/close, messages delivered to subscribers/ack'ed: all of that gets replicated. So an increase in activity would cause each node to have to work to process the replicated information.

If you have any way to measure normal activity in the cluster, you should then see if you notice any increased activity. You could even have a simple group of NATS (not streaming) consumers on ">" that simply measure the message rate/size and report every second. It does not have to be too sophisticated, just a way to see a jump. Or, when the situation occurs (100% CPU), start a NATS consumer on ">", capture traffic for a few seconds, and try to see what the bulk of the traffic is.
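
A minimal sketch of such a measuring consumer in Go (the URL and credentials are placeholders, not taken from this deployment):

// Sketch: subscribe to ">" with plain NATS (not streaming) and report the
// message and byte rates once per second to spot a sudden jump in traffic.
package main

import (
    "fmt"
    "log"
    "sync/atomic"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://user1:1@10.0.0.1:4222") // placeholder URL/credentials
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    var msgs, bytes int64
    if _, err := nc.Subscribe(">", func(m *nats.Msg) {
        atomic.AddInt64(&msgs, 1)
        atomic.AddInt64(&bytes, int64(len(m.Data)))
    }); err != nil {
        log.Fatal(err)
    }

    for range time.Tick(time.Second) {
        fmt.Printf("%d msgs/s, %d bytes/s\n",
            atomic.SwapInt64(&msgs, 0), atomic.SwapInt64(&bytes, 0))
    }
}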

As you can see, I am a bit at a loss to explain what is happening...

@yadvlz (Author) commented Dec 30, 2020

Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.

1-2 messages per month.

Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints

We use prometheus-metrics-exporter, but we haven't set up triggers yet.

Two questions:

  • at the moment I observe the following messages in the logs:

[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped

What is it? CPU load is 2900/3200.

  • logs also contain the following messages:

[Client:replicator_412025] Duplicate durable subscription registration

They appear dozens of times per second. Maybe this is the cause?

@kozlovic (Member)

[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped
What is it? CPU load is 2900/3200.

The Streaming server is not a server per se (https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats). It creates NATS connections to the NATS server and uses subscriptions to receive messages from clients. This particular subscription is for "subscription requests" coming from clients. By default, NATS subscriptions have a limit on pending messages (waiting to be dispatched/processed), after which the connection does not enqueue them but drops them and notifies the async error handler.
So the error above indicates that the server is receiving far more subscription requests than it can handle, and they are dropped.
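
To illustrate what "pending limit" and "async error handler" mean on a plain NATS subscription, here is a minimal client-side sketch (general nats.go behavior, not the streaming server's actual code):

// Sketch: a subscription with a pending-message/byte limit; once the limit
// is exceeded, new messages are dropped and reported to the async error
// handler as "nats: slow consumer, messages dropped".
package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL, // local server, for illustration only
        nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
            if sub != nil {
                log.Printf("async error on %q: %v", sub.Subject, err)
                return
            }
            log.Printf("async error: %v", err)
        }))
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    sub, err := nc.Subscribe("requests", func(m *nats.Msg) {
        // Slow processing here makes messages pile up in the pending queue.
    })
    if err != nil {
        log.Fatal(err)
    }
    // Defaults are 65536 messages / 64MB; shown explicitly for clarity.
    if err := sub.SetPendingLimits(65536, 64*1024*1024); err != nil {
        log.Fatal(err)
    }
    select {} // keep the subscriber running
}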

[Client:replicator_412025] Duplicate durable subscription registration
They appear dozens of times per second. Maybe this is the cause?

Unless this is a new issue and/or a side effect of something else, it could indeed be the reason for the issue. Could it be that you have an application that tries to create a durable subscription and for some reason loops trying to do so? It could be that it initially got a timeout waiting for the response, but the durable was actually created. In that case, your app may want to check the returned error, and in the case of "Duplicate durable subscription" it would have to close the connection, otherwise any other attempt will fail.
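
A minimal client-side sketch of that check with stan.go (the cluster ID and client ID are taken from this issue's config and logs, the channel and durable names are illustrative, and matching on the error text is an assumption based on the log message above):

// Sketch: if a durable subscribe fails with a "duplicate durable" error,
// close the connection instead of retrying in a loop, which would flood
// the server with subscription requests.
package main

import (
    "log"
    "strings"

    "github.com/nats-io/stan.go"
)

func main() {
    sc, err := stan.Connect("stan-eas", "replicator_412025")
    if err != nil {
        log.Fatal(err)
    }

    _, err = sc.Subscribe("LocalAction",
        func(m *stan.Msg) { /* process message */ },
        stan.DurableName("replicator"))
    if err != nil {
        if strings.Contains(strings.ToLower(err.Error()), "duplicate durable") {
            sc.Close() // give up rather than looping on resubscribe attempts
            log.Fatal("durable already registered, closed connection: ", err)
        }
        log.Fatal(err)
    }
    select {} // keep the subscription running
}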

@kozlovic (Member) commented Jan 5, 2021

@yadvlz Could it be that issue #1135 is related to this issue? If so, I have submitted a PR to add a new configuration option that would cause the server to close the current durable and accept the new durable subscription. That may help with the issue that you have seen.

@yadvlz (Author) commented Jan 6, 2021

@yadvlz Could it be that issue #1135 is related to this issue?

I think it is.

@kozlovic (Member) commented Jan 6, 2021

@yadvlz I was actually wondering if that issue came from a colleague of yours. If not, note that there have been 2 issues reported in the streaming server repo and 2 in stan.go for the same "duplicate durable" problem in a short amount of time, which is why I was wondering whether they come from the same deployment or from different users.
Nevertheless, I believe that PR #1136 will help.

@kozlovic (Member) commented Jan 7, 2021

I am closing this since PR #1136 has been merged. Thank you for your patience!

@kozlovic closed this as completed Jan 7, 2021

@yadvlz (Author) commented Jan 8, 2021

I deployed a new version of nats-streaming-server (with "replace_durable: true"), but my problem still exists.
Log from the leader at the time of restarting the follower:

[32749] 2021/01/08 12:33:27.428845 [ERR] STREAM: [Client:replicator_659469] Timed out on heartbeats
[32749] 2021/01/08 12:33:28.690700 [ERR] STREAM: [Client:replicator_606369] Timed out on heartbeats
[32749] 2021/01/08 12:33:38.565970 [ERR] STREAM: [Client:replicator_693023] Timed out on heartbeats
[32749] 2021/01/08 12:33:52.827200 [ERR] STREAM: [Client:replicator_157785] Timed out on heartbeats
[32749] 2021/01/08 12:33:54.833447 [INF] 10.193.68.251:38968 - rid:10961 - Router connection closed: Read Error
[32749] 2021/01/08 12:33:55.699539 [ERR] STREAM: [Client:replicator_462420] Timed out on heartbeats
[32749] 2021/01/08 12:33:55.887039 [INF] STREAM: server lost leadership, performing leader stepdown actions
[32749] 2021/01/08 12:33:55.908709 [INF] 10.193.68.251:40086 - rid:97807 - Route connection created
[32749] 2021/01/08 12:33:57.227611 [WRN] 10.112.24.9:63226 - cid:97429 - Readloop processing time: 2.263951272s
[32749] 2021/01/08 12:33:57.230102 [WRN] 10.92.47.72:57609 - cid:97354 - Readloop processing time: 2.017834321s
[32749] 2021/01/08 12:33:57.262937 [INF] STREAM: finished leader stepdown actions
[32749] 2021/01/08 12:33:57.276524 [WRN] 10.87.91.162:64614 - cid:97364 - Readloop processing time: 2.246782296s
[32749] 2021/01/08 12:33:57.334539 [WRN] 10.100.141.210:57416 - cid:97291 - Readloop processing time: 2.106756313s
[32749] 2021/01/08 12:33:57.340296 [WRN] 10.106.78.130:64324 - cid:97318 - Readloop processing time: 2.310444523s
[32749] 2021/01/08 12:33:57.586867 [WRN] 10.109.121.2:51399 - cid:97379 - Readloop processing time: 2.623306455s
[32749] 2021/01/08 12:33:57.589297 [WRN] 10.113.77.67:63701 - cid:97433 - Readloop processing time: 2.60364601s
[32749] 2021/01/08 12:33:57.589529 [WRN] 10.108.213.197:52596 - cid:97422 - Readloop processing time: 2.625908581s
[32749] 2021/01/08 12:33:57.589593 [WRN] 10.92.108.200:61466 - cid:97273 - Readloop processing time: 2.615766941s
[32749] 2021/01/08 12:33:57.589661 [WRN] 10.92.84.194:55216 - cid:97287 - Readloop processing time: 2.669059357s
[32749] 2021/01/08 12:33:57.589721 [WRN] 10.24.84.2:60485 - cid:97323 - Readloop processing time: 2.647962912s
[32749] 2021/01/08 12:33:57.589780 [WRN] 10.93.96.146:58826 - cid:97378 - Readloop processing time: 2.669010652s

@kozlovic (Member) commented Jan 8, 2021

I am re-opening this issue.
Seeing the timed-out errors and the route disconnection (and leader stepdown), I wonder if you didn't also have a network issue there. But otherwise, we are now back to the investigation I suggested in earlier comments.

@kozlovic reopened this Jan 8, 2021

@yadvlz (Author) commented Jan 11, 2021

[42312] 2021/01/11 11:55:42.398945 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42349] 2021/01/11 11:55:42.711058 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42349] 2021/01/11 11:55:42.711155 [INF] STREAM: ServerID: q1zdQ4ez66kGwQEFWz5stM
[42349] 2021/01/11 11:55:42.711163 [INF] STREAM: Go version: go1.15.6
[42349] 2021/01/11 11:55:42.711169 [INF] STREAM: Git commit: [not set]
[42349] 2021/01/11 11:55:42.712526 [INF] Starting nats-server version 2.1.9
[42349] 2021/01/11 11:55:42.712550 [INF] Git commit [not set]
[42349] 2021/01/11 11:55:42.712790 [INF] Starting http monitor on 0.0.0.0:8222
[42349] 2021/01/11 11:55:42.714242 [INF] Listening for client connections on 0.0.0.0:4222
[42349] 2021/01/11 11:55:42.714262 [INF] Server id is ND4V3Q4K6EOQX62KBQWBAPD4GUFBQGMC3X4PAISUG5XSHAK6JEMKK6TF
[42349] 2021/01/11 11:55:42.714268 [INF] Server is ready
[42349] 2021/01/11 11:55:42.714798 [INF] Listening for route connections on 0.0.0.0:6222
[42349] 2021/01/11 11:55:42.716096 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42349] 2021/01/11 11:55:42.716140 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42349] 2021/01/11 11:55:45.010386 [WRN] 10.77.28.252:63791 - cid:105 - Readloop processing time: 2.048394269s
[42349] 2021/01/11 11:55:45.021490 [WRN] 10.82.77.136:61951 - cid:64 - Readloop processing time: 2.014462329s
[42349] 2021/01/11 11:55:45.041510 [WRN] 10.80.91.131:53914 - cid:65 - Readloop processing time: 2.149629235s
[42349] 2021/01/11 11:55:45.049700 [WRN] 10.105.127.98:63396 - cid:120 - Readloop processing time: 2.087822186s
[42349] 2021/01/11 11:55:45.056968 [WRN] 10.41.89.2:63512 - cid:114 - Readloop processing time: 2.073160631s
[42349] 2021/01/11 11:55:45.060889 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42385] 2021/01/11 11:55:45.210725 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42385] 2021/01/11 11:55:45.210821 [INF] STREAM: ServerID: olFM1hi2L8w7kgv5B0pb6O
[42385] 2021/01/11 11:55:45.210832 [INF] STREAM: Go version: go1.15.6
[42385] 2021/01/11 11:55:45.210839 [INF] STREAM: Git commit: [not set]
[42385] 2021/01/11 11:55:45.212160 [INF] Starting nats-server version 2.1.9
[42385] 2021/01/11 11:55:45.212184 [INF] Git commit [not set]
[42385] 2021/01/11 11:55:45.212469 [INF] Starting http monitor on 0.0.0.0:8222
[42385] 2021/01/11 11:55:45.213904 [INF] Listening for client connections on 0.0.0.0:4222
[42385] 2021/01/11 11:55:45.213925 [INF] Server id is NAO7GBYFGAJ6OFZMRRRQ3I4TS6SIC2YD344ERBPRZNLTEUBBOTIOOR75
[42385] 2021/01/11 11:55:45.213930 [INF] Server is ready

These messages appear cyclically on the node after a restart.

@yadvlz (Author) commented Jan 11, 2021

If I block clients, the log looks like this:

[42965] 2021/01/11 11:57:19.464101 [INF] Starting nats-server version 2.1.9
[42965] 2021/01/11 11:57:19.464116 [INF] Git commit [not set]
[42965] 2021/01/11 11:57:19.464408 [INF] Starting http monitor on 0.0.0.0:8222
[42965] 2021/01/11 11:57:19.467430 [INF] Listening for client connections on 0.0.0.0:4222
[42965] 2021/01/11 11:57:19.467444 [INF] Server id is NCYRM2UX76PYBASHN3PO4IN7EA6RBMAW62EW7GDL7B7I42RYZKPBA3V4
[42965] 2021/01/11 11:57:19.467448 [INF] Server is ready
[42965] 2021/01/11 11:57:19.468038 [INF] Listening for route connections on 0.0.0.0:6222
[42965] 2021/01/11 11:57:19.469184 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42965] 2021/01/11 11:57:19.469228 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42965] 2021/01/11 11:57:20.535651 [INF] STREAM: Recovering the state...
[42965] 2021/01/11 12:00:34.815674 [INF] STREAM: Recovered 42537 channel(s)
[42965] 2021/01/11 12:00:34.923018 [INF] STREAM: Cluster Node ID : a
[42965] 2021/01/11 12:00:34.923112 [INF] STREAM: Cluster Log Path: /storage/nats-streaming-server/nsslog
[42965] 2021/01/11 12:00:34.923121 [INF] STREAM: Cluster known peers:
[42965] 2021/01/11 12:00:34.923132 [INF] STREAM: peer 1: "b"
[42965] 2021/01/11 12:00:34.923138 [INF] STREAM: peer 2: "c"
[42965] 2021/01/11 12:00:35.166678 [INF] STREAM: restoring from snapshot

@kozlovic (Member)

It looks to me that the server - as soon as routes are created - is overwhelmed with traffic, to the point that the internal connection requests to the NATS server fail! The fact that the server embeds its own NATS server makes it even more worrisome, because this is just loopback. The default connect timeout is 2 seconds, so that may explain why it fails, given those warnings.
If you can recompile the server, you could try bumping the connect timeout (this is not currently configurable through options). This would be where the server creates its internal NATS connections (based on the main branch); you could add:

ncOpts.Timeout = 10 * time.Second

before the Connect() call. But to be honest, I fear that this is not going to help much. I am thinking that the volume is too much for a single server to handle.
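
For context, a standalone sketch of that kind of change (this is not the actual nats-streaming-server source, just an illustration of raising the connect timeout on a NATS connection built from options):

// Sketch: build NATS client options, raise the connect timeout from the
// 2-second default to 10 seconds, then connect (loopback to the embedded
// NATS server in the streaming case).
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    ncOpts := nats.GetDefaultOptions()
    ncOpts.Url = "nats://127.0.0.1:4222" // placeholder loopback address
    ncOpts.Timeout = 10 * time.Second    // the suggested bump
    nc, err := ncOpts.Connect()
    if err != nil {
        log.Fatalf("connect failed: %v", err)
    }
    defer nc.Close()
    log.Println("connected")
}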

@yadvlz (Author) commented Jan 11, 2021

I am thinking that the volume is too much for a single server to handle.

So how can I deal with it?

@kozlovic (Member)

So how can I deal with it?

I described some of the ways we could try to debug this: by capturing some of the traffic and seeing whether that traffic is "expected". From your earlier description, it sounded to me like there are times where all is well, but then suddenly an event (server restart or other) causes this issue, and the only way to recover is to basically stop client traffic for a while.

If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic.
If the traffic shows an abnormal pattern (it could have been a constant attempt to create durable subscriptions, as we had suspected at one point), then that could help us fix the root cause of this traffic.

@yadvlz (Author) commented Jan 13, 2021

ncOpts.Timeout = 10 * time.Second

I tried this; it didn't help.

If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic.

I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connection) at the same time.

@kozlovic (Member)

I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connection) at the same time.

If this is following a disconnect/reconnect, and depending on which library (and version) you use, we have some ways to add jitter to the reconnect logic. This may help prevent too many connections from reconnecting at the same time.
https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L790
or https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L800 to implement your own back-off computation.
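
A minimal sketch of a client connection configured with reconnect jitter (the URL is a placeholder; the options shown are from recent nats.go releases):

// Sketch: add jitter and a bounded wait between reconnect attempts so that
// thousands of clients do not all reconnect at the same instant.
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://10.0.0.1:4222", // placeholder address
        nats.MaxReconnects(-1),            // keep retrying indefinitely
        nats.ReconnectWait(2*time.Second), // base delay between attempts
        nats.ReconnectJitter(500*time.Millisecond, 2*time.Second), // extra random delay (non-TLS, TLS)
        // Alternatively, nats.CustomReconnectDelay(func(attempts int) time.Duration {...})
        // lets you implement your own back-off computation, as linked above.
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()
    log.Println("connected:", nc.ConnectedUrl())
}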
