NATS reconnect not working #7056
Comments
Are we using [...]? The reconnect/error handling when the connection breaks is done by [...]. The idea was that services crash when they lose the NATS connection, triggering a restart. Should we change that logic or just fix it? |
Yes,
This is the case for the nats-js events implementation (https://github.com/go-micro/plugins/tree/main/v4/events/natsjs). But the nats-js store does not seem to make use of channels (https://github.com/go-micro/plugins/tree/main/v4/store/nats-js). It seems we do indeed have a closed connection: https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L92. In theory it should reconnect, but there is also an upper bound of 60 reconnects, which could be reached quickly under some circumstances (https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L53-L56).
Restarting a service might also be fine, but in this particular situation I would honestly prefer a proper reconnect, since NATS (2 of 3 nodes) is still available at all times. A reconnect should be an order of magnitude faster than relying on a supervisor to restart the service. If a reconnect fails constantly, dying / failing the health check seems like a good idea. |
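To make the trade-off above concrete, here is a minimal Go sketch of a bounded reconnect loop that only reports failure (so that a health check can fail or the process can exit) once the retry budget is used up. It does not use the nats.go client; `reconnect`, `dialFunc` and `errDown` are hypothetical names chosen for illustration, and the limit of 60 simply mirrors the upper bound mentioned in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDown is a hypothetical error used to simulate an unreachable server.
var errDown = errors.New("connection refused")

// dialFunc is a hypothetical stand-in for whatever establishes the connection.
type dialFunc func() error

// reconnect calls dial until it succeeds or maxAttempts is exhausted,
// sleeping for wait between failed attempts. A negative maxAttempts means
// "retry forever". It returns the number of attempts made and, if all of
// them failed, an error wrapping the last failure.
func reconnect(dial dialFunc, maxAttempts int, wait time.Duration) (int, error) {
	lastErr := errors.New("no attempts made")
	for attempt := 1; maxAttempts < 0 || attempt <= maxAttempts; attempt++ {
		if lastErr = dial(); lastErr == nil {
			return attempt, nil
		}
		if attempt != maxAttempts {
			time.Sleep(wait)
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Assumption for the demo: the server is unreachable for the first 3 dials.
	failures := 3
	dial := func() error {
		if failures > 0 {
			failures--
			return errDown
		}
		return nil
	}

	attempts, err := reconnect(dial, 60, 10*time.Millisecond)
	// Only a permanent failure should flip the health status.
	healthy := err == nil
	fmt.Println("attempts:", attempts, "healthy:", healthy)
}
```

With this shape, a short outage is absorbed by the retry loop, and the supervisor restart only becomes relevant when reconnecting keeps failing.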
|
ok. Thanks. |
Will be superseded if #7272 is implemented. |
This also seems to be a problem when using NATS only as a message bus. It happens when NATS is restarted and oCIS is not. |
We need to tackle this problem on two sides:
|
I cannot reproduce NATS reconnect errors when only using it as a message bus. The NATS clients reliably reconnect when killing any NATS node. But when using |
Ah, so nats-js has two ways of setting the default NATS options, and the second one completely overwrites the built-in defaults:
Stepping through this code reveals that we actually pass in NATS options via the config ... but they are not merged. They replace the default config ... and we are basically passing in an empty config:
I'll fix that to use proper defaults. |
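The difference between replacing and merging options, as described in the comment above, can be sketched in plain Go. This is not the nats-js store code or the actual fix; `Options`, `defaults`, `replace` and `merge` are hypothetical names, and the default values are illustrative only.

```go
package main

import "fmt"

// Options is a hypothetical, simplified set of connection options.
type Options struct {
	Name                 string
	MaxReconnect         int
	ReconnectWaitSeconds int
}

// defaults returns illustrative built-in defaults (assumed values).
func defaults() Options {
	return Options{Name: "client", MaxReconnect: 60, ReconnectWaitSeconds: 2}
}

// replace mimics the problematic behaviour: the user-supplied options win
// wholesale, so an empty config wipes out every default.
func replace(def, user Options) Options {
	return user
}

// merge starts from the defaults and only overrides fields the user set.
// Caveat: with plain value fields, "unset" and "explicitly zero" cannot be
// told apart; pointers or functional options avoid that ambiguity.
func merge(def, user Options) Options {
	out := def
	if user.Name != "" {
		out.Name = user.Name
	}
	if user.MaxReconnect != 0 {
		out.MaxReconnect = user.MaxReconnect
	}
	if user.ReconnectWaitSeconds != 0 {
		out.ReconnectWaitSeconds = user.ReconnectWaitSeconds
	}
	return out
}

func main() {
	empty := Options{} // what effectively gets passed in when the config is empty

	fmt.Printf("replace: %+v\n", replace(defaults(), empty))
	fmt.Printf("merge:   %+v\n", merge(defaults(), empty))
}
```

With an empty user config, `replace` yields all zero values (no reconnect attempts in this toy model), while `merge` keeps the defaults intact.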
@butonic please reference the bugfix and close this. |
Fixed. |
Describe the bug
I have a NATS Cluster with 3 nodes. I have set replication for all oCIS streams to 3:
When I restart all three NATS nodes one after another, there should be no service interruptions.
In practice, once I have restarted all nodes, oCIS does not recover. All files I upload afterwards are stuck in the postprocessing step.
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
The image uploads fine and I see a preview. I can also download it.
Actual behavior
The image uploads fine. I see no preview. I cannot download the file.
I see the following error logs:
After restarting the postprocessing, eventhistory and userlog services, I get a thumbnail after uploading a new file and can also download that file again.
The ocis postprocessing restart command does not actually fix the interrupted postprocessing. It creates this error log line:
postprocessing-896958dff-ktnvl postprocessing {"level":"error","service":"postprocessing","uploadID":"19b9cd12-8ca4-4b5a-9ced-f295faa18a3f","error":"expected only one result for '19b9cd12-8ca4-4b5a-9ced-f295faa18a3f', got 0","time":"2023-08-17T07:54:44.131766697Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:99","message":"cannot get upload"}
Setup
oCIS 3.1.0-rc.1
Additional context
About setting replicas for a stream: #7023