
NATS reconnect not working #7056

Closed
wkloucek opened this issue Aug 17, 2023 · 14 comments
Labels
Severity:sev2-high operations severely restricted, workaround available Type:Bug
Comments

@wkloucek
Contributor

wkloucek commented Aug 17, 2023

Describe the bug

I have a NATS Cluster with 3 nodes. I have set replication for all oCIS streams to 3:

~ # nats stream report --leaders
Obtaining Stream stats

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                    Stream Report                                                     │
├────────────────────┬─────────┬───────────┬───────────┬──────────┬─────────┬──────┬─────────┬─────────────────────────┤
│ Stream             │ Storage │ Placement │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                │
├────────────────────┼─────────┼───────────┼───────────┼──────────┼─────────┼──────┼─────────┼─────────────────────────┤
│ OBJ_ocis           │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-0, nats-1, nats-2* │
│ OBJ_userlog        │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-0*, nats-1, nats-2 │
│ OBJ_postprocessing │ File    │           │ 0         │ 27       │ 13 KiB  │ 0    │ 36      │ nats-0*, nats-1, nats-2 │
│ OBJ_eventhistory   │ File    │           │ 0         │ 242      │ 165 KiB │ 0    │ 0       │ nats-0*, nats-1, nats-2 │
│ main-queue         │ File    │           │ 7         │ 192      │ 191 KiB │ 0    │ 0       │ nats-0, nats-1, nats-2* │
╰────────────────────┴─────────┴───────────┴───────────┴──────────┴─────────┴──────┴─────────┴─────────────────────────╯
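
For reference, a minimal sketch of how the replica count could be raised programmatically through the nats.go JetStream management API (the connection URL and the single stream name are assumptions; the report above was produced with the nats CLI):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumption: NATS is reachable under this URL from inside the cluster.
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current stream config and bump the replica count to 3.
	info, err := js.StreamInfo("OBJ_postprocessing")
	if err != nil {
		log.Fatal(err)
	}
	cfg := info.Config
	cfg.Replicas = 3
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
}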

When I restart all three NATS nodes one after another, there should be no service interruptions.

Actually, after I'm done restarting all nodes, oCIS does not recover from it. All files I upload from then on are stuck in the postprocessing step.

Steps to reproduce

Steps to reproduce the behavior:

  1. run https://github.com/owncloud/ocis-charts/tree/master/deployments/ocis-nats
  2. set replication to 3 for all oCIS streams
  3. restart each NATS node / pod one after another (waiting 60 seconds between each restart)
  4. upload a file, in my case an image

Expected behavior

The image uploads fine and I see a preview. I can also download it.

Actual behavior

The image uploads fine. I see no preview. I cannot download the file.

I see the following error logs:

postprocessing-6f94854c48-ff7jx postprocessing {"level":"error","service":"postprocessing","uploadID":"19b9cd12-8ca4-4b5a-9ced-f295faa18a3f","error":"Failed to store data in bucket: nats: connection closed","time":"2023-08-17T07:33:15.887203798Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:107","message":"cannot store upload"}
eventhistory-5687b98bb9-dxxp5 eventhistory {"level":"error","service":"eventhistory","error":"Failed to store data in bucket: nats: connection closed","eventid":"2b1b0473-1f5d-4ca6-bbdf-cd1899ae2677","time":"2023-08-17T07:33:15.888093426Z","line":"github.com/owncloud/ocis/v2/services/eventhistory/pkg/service/service.go:70","message":"could not store event"}

eventhistory-5687b98bb9-vmz6t eventhistory {"level":"error","service":"eventhistory","error":"Failed to store data in bucket: nats: connection closed","eventid":"1c738a17-ddf5-4619-9dcd-a3b58cffee3c","time":"2023-08-17T07:33:15.897523875Z","line":"github.com/owncloud/ocis/v2/services/eventhistory/pkg/service/service.go:70","message":"could not store event"}

thumbnails-7469b846bb-lg9tk thumbnails {"level":"warn","service":"thumbnails","method":"Thumbnails.GetThumbnail","duration":4.07693,"error":"{\"id\":\"com.owncloud.api.thumbnails\",\"code\":425,\"detail\":\"File Processing\",\"status\":\"Too Early\"}","time":"2023-08-17T07:33:17.038606811Z","line":"github.com/owncloud/ocis/v2/services/thumbnails/pkg/service/grpc/v0/decorators/logging.go:46","message":"Failed to execute"}

After restarting the postprocessing, eventhistory and userlog services, I get a thumbnail after uploading a new file and can also download that file again.

The ocis postprocessing restart command does not actually fix the interrupted postprocessing. It produces this error log line:

postprocessing-896958dff-ktnvl postprocessing {"level":"error","service":"postprocessing","uploadID":"19b9cd12-8ca4-4b5a-9ced-f295faa18a3f","error":"expected only one result for '19b9cd12-8ca4-4b5a-9ced-f295faa18a3f', got 0","time":"2023-08-17T07:54:44.131766697Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:99","message":"cannot get upload"}

Setup

oCIS 3.1.0-rc.1

Additional context


About setting replicas for a stream: #7023

@wkloucek wkloucek added Type:Bug Severity:sev2-high operations severely restricted, workaround available p3-medium labels Aug 17, 2023
@kobergj
Collaborator

kobergj commented Aug 17, 2023

Are we using nats as the store for postprocessing and eventhistory? That is where the errors come from.

The reconnect/error handling when the connection breaks is done by go-micro or nats directly; I need to dig deeper to find out which. We are just listening to a channel. Maybe it is not closed correctly, or we don't handle the closed channel correctly.

The idea was that services crash when they lose the nats connection, triggering a restart. Should we change that logic or just fix it?

@wkloucek
Contributor Author

Are we using nats as the store for postprocessing and eventhistory? That is where the errors come from.

Yes, nats-js is used as the store for postprocessing, eventhistory and userlog: https://github.com/owncloud/ocis-charts/blob/ebc13a5369ff954e66266933d416e077e4b93c65/deployments/ocis-nats/helmfile.yaml#L79-L82

The reconnect/error handling when the connection breaks is done by go-micro or nats directly; I need to dig deeper to find out which. We are just listening to a channel. Maybe it is not closed correctly, or we don't handle the closed channel correctly.

This is the case for the nats-js events implementation (https://github.com/go-micro/plugins/tree/main/v4/events/natsjs). But the nats-js store does not seem to make use of channels (https://github.com/go-micro/plugins/tree/main/v4/store/nats-js).

Seems like we indeed have a closed connection: https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L92

It theoretically should reconnect, but there is also an upper bound of 60 reconnects, which could be reached quickly under some circumstances (https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L53-L56)
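
For illustration, a minimal sketch of how the client-side reconnect budget could be lifted when connecting with nats.go directly (the URLs, wait time and handlers are assumptions, not the oCIS wiring):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"nats://nats-0:4222,nats://nats-1:4222,nats://nats-2:4222",
		nats.MaxReconnects(-1),            // -1 removes the default cap of 60 attempts
		nats.ReconnectWait(2*time.Second), // pause between attempts
		nats.RetryOnFailedConnect(true),   // also retry the initial connect
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("nats disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("nats reconnected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}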

The idea was that services crash when they lose the nats connection, triggering a restart. Should we change that logic or just fix it?

Restarting a service might also be fine, but in this particular situation I honestly would prefer a proper reconnect, since NATS (2 of 3 nodes) is still available at any time. A reconnect should be an order of magnitude faster than relying on a supervisor to restart the service. If a reconnect fails constantly, dying / failing the health check seems like a good idea.
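
A minimal sketch of that last point, assuming the service holds a *nats.Conn and exposes a readiness endpoint (the path and port are made up for illustration; this is not the oCIS debug service):

package main

import (
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

// readinessHandler exposes the NATS connection state so a supervisor
// (e.g. a Kubernetes readinessProbe) can restart the service once
// reconnect attempts are exhausted.
func readinessHandler(nc *nats.Conn) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if nc.IsConnected() {
			w.WriteHeader(http.StatusOK)
			return
		}
		// Reconnecting or closed: report unhealthy.
		w.WriteHeader(http.StatusServiceUnavailable)
	}
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/readyz", readinessHandler(nc))
	log.Fatal(http.ListenAndServe(":9000", nil))
}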

@micbar
Contributor

micbar commented Sep 1, 2023

@kobergj @wkloucek We need to get that actionable.

Can you try to formulate what needs to be done?

@wkloucek
Contributor Author

wkloucek commented Sep 1, 2023

@kobergj @wkloucek We need to get that actionable.

Can you try to formulate what needs to be done?

  • Run a NATS cluster. I can help with that.
  • Configure stream replicas to 3.
  • Start oCIS with an active debugger to see what's going on.
  • Restart each NATS node one after another.
  • Check whether a proper reconnect happens, or in which state it gets stuck.

@micbar micbar removed the p3-medium label Sep 1, 2023
@micbar
Contributor

micbar commented Sep 1, 2023

ok. Thanks.

@wkloucek
Contributor Author

This will be superseded if #7272 is implemented.

@wkloucek wkloucek changed the title NATS store reconnect not working NATS reconnect not working Sep 21, 2023
@wkloucek
Contributor Author

This also seems to be a problem when using NATS only as a message bus.

It happens when NATS is restarted and oCIS is not.

@kobergj
Collaborator

kobergj commented Sep 25, 2023

We need to tackle this problem on two sides:

  • We need to check in the nats package whether we are handling reconnects correctly, or if there is a bug.
  • Independent of that, if we get an error on Publish calls we should fail hard (i.e. kill the service) so clients know that something is wrong @butonic (see the sketch below)
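
A minimal sketch of that second point, using a plain nats.go connection (mustPublish and the subject name are hypothetical, not the oCIS events API):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// mustPublish terminates the process when a publish fails, so the
// supervisor restarts the service in a clean state instead of it silently
// dropping events on a dead connection.
func mustPublish(nc *nats.Conn, subject string, data []byte) {
	if err := nc.Publish(subject, data); err != nil {
		log.Fatalf("publish on %q failed, exiting so the service gets restarted: %v", subject, err)
	}
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	mustPublish(nc, "ocis.events.test", []byte(`{"hello":"world"}`))
}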

@micbar
Contributor

micbar commented Sep 25, 2023

@kobergj We debugged that on Friday.

  • A single-service NATS does reconnect perfectly
  • NATS running in a cluster with 3 replicas doesn't reconnect

@butonic for more information

@kobergj
Collaborator

kobergj commented Sep 25, 2023

That's not my opinion, it's a team decision discussed with: @fschade @aduffeck @case0sh @2403905 @butonic

@butonic
Member

butonic commented Sep 25, 2023

I cannot reproduce nats reconnect errors when only using it as a message bus. The nats client reliably reconnects when killing any nats node.

But when using nats-js as a store (for postprocessing, userlog etc.), uploads do get stuck. So it is an issue with the nats-js store implementation. (I was using redis as the store before.)

@butonic
Member

butonic commented Sep 25, 2023

Ah, so nats-js has two ways of setting the default nats options, and the second one completely overwrites the built-in defaults:

func (n *natsStore) setOption(opts ...store.Option) {
	for _, o := range opts {
		o(&n.opts)
	}

	n.Once.Do(func() {
		n.nopts = nats.GetDefaultOptions()
	})

	// Extract options from context
	if nopts, ok := n.opts.Context.Value(natsOptionsKey{}).(nats.Options); ok {
		n.nopts = nopts
	}
	// ...
}

Stepping through this code reveals that we actually pass in nats options via the config ... but they are not merged. They replace the default config ... and we are basically passing in an empty config:

	case TypeNatsJS:
		ttl, _ := options.Context.Value(ttlContextKey{}).(time.Duration)
		// TODO nats needs a DefaultTTL option as it does not support per Write TTL ...
		// FIXME nats has restrictions on the key, we cannot use slashes AFAICT
		// host, port, clusterid
		return natsjs.NewStore(
			append(opts,
				natsjs.NatsOptions(nats.Options{Name: "TODO"}),
				natsjs.DefaultTTL(ttl))...,
		)

I'll fix that to use proper defaults.
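
A sketch of what the fix could look like (an assumption on my side, not necessarily the change that was merged): start from nats.GetDefaultOptions() so the reconnect defaults survive, and only override what oCIS actually needs:

	case TypeNatsJS:
		ttl, _ := options.Context.Value(ttlContextKey{}).(time.Duration)
		// Start from the library defaults instead of an empty nats.Options,
		// so MaxReconnect, ReconnectWait etc. keep their sane values.
		nopts := nats.GetDefaultOptions()
		nopts.Name = "ocis-store"       // connection name is an assumption
		nopts.MaxReconnect = -1         // keep reconnecting indefinitely
		nopts.ReconnectWait = 2 * time.Second
		return natsjs.NewStore(
			append(opts,
				natsjs.NatsOptions(nopts),
				natsjs.DefaultTTL(ttl))...,
		)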

@micbar
Contributor

micbar commented Sep 28, 2023

@butonic please reference the bugfix and close this.

@micbar
Contributor

micbar commented Sep 28, 2023

Fixed.
