NATS reconnect not working #7056

Closed
wkloucek opened this issue Aug 17, 2023 · 14 comments
Labels
Severity:sev2-high operations severely restricted, workaround available Type:Bug
Milestone

Comments

@wkloucek
Contributor

wkloucek commented Aug 17, 2023

Describe the bug

I have a NATS Cluster with 3 nodes. I have set replication for all oCIS streams to 3:

~ # nats stream report --leaders
Obtaining Stream stats

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                    Stream Report                                                     │
├────────────────────┬─────────┬───────────┬───────────┬──────────┬─────────┬──────┬─────────┬─────────────────────────┤
│ Stream             │ Storage │ Placement │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                │
├────────────────────┼─────────┼───────────┼───────────┼──────────┼─────────┼──────┼─────────┼─────────────────────────┤
│ OBJ_ocis           │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-0, nats-1, nats-2* │
│ OBJ_userlog        │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-0*, nats-1, nats-2 │
│ OBJ_postprocessing │ File    │           │ 0         │ 27       │ 13 KiB  │ 0    │ 36      │ nats-0*, nats-1, nats-2 │
│ OBJ_eventhistory   │ File    │           │ 0         │ 242      │ 165 KiB │ 0    │ 0       │ nats-0*, nats-1, nats-2 │
│ main-queue         │ File    │           │ 7         │ 192      │ 191 KiB │ 0    │ 0       │ nats-0, nats-1, nats-2* │
╰────────────────────┴─────────┴───────────┴───────────┴──────────┴─────────┴──────┴─────────┴─────────────────────────╯

When I restart all three NATS nodes one after another, there should be no service interruption.

Actually, when I'm done restarting all nodes, oCIS does not recover from it. All files I upload now are stuck in the postprocessing step.

Steps to reproduce

Steps to reproduce the behavior:

  1. run https://github.com/owncloud/ocis-charts/tree/master/deployments/ocis-nats
  2. set replication to 3 for all oCIS streams
  3. restart each NATS node / pod one after another (waiting 60 seconds between each restart)
  4. upload a file, in my case an image

Expected behavior

The image uploads fine and I see a preview. I can also download it.

Actual behavior

The image uploads fine. I see no preview. I cannot download the file.

I see the following error logs:

postprocessing-6f94854c48-ff7jx postprocessing {"level":"error","service":"postprocessing","uploadID":"19b9cd12-8ca4-4b5a-9ced-f295faa18a3f","error":"Failed to store data in bucket: nats: connection closed","time":"2023-08-17T07:33:15.887203798Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:107","message":"cannot store upload"}
eventhistory-5687b98bb9-dxxp5 eventhistory {"level":"error","service":"eventhistory","error":"Failed to store data in bucket: nats: connection closed","eventid":"2b1b0473-1f5d-4ca6-bbdf-cd1899ae2677","time":"2023-08-17T07:33:15.888093426Z","line":"github.com/owncloud/ocis/v2/services/eventhistory/pkg/service/service.go:70","message":"could not store event"}

eventhistory-5687b98bb9-vmz6t eventhistory {"level":"error","service":"eventhistory","error":"Failed to store data in bucket: nats: connection closed","eventid":"1c738a17-ddf5-4619-9dcd-a3b58cffee3c","time":"2023-08-17T07:33:15.897523875Z","line":"github.com/owncloud/ocis/v2/services/eventhistory/pkg/service/service.go:70","message":"could not store event"}

thumbnails-7469b846bb-lg9tk thumbnails {"level":"warn","service":"thumbnails","method":"Thumbnails.GetThumbnail","duration":4.07693,"error":"{\"id\":\"com.owncloud.api.thumbnails\",\"code\":425,\"detail\":\"File Processing\",\"status\":\"Too Early\"}","time":"2023-08-17T07:33:17.038606811Z","line":"github.com/owncloud/ocis/v2/services/thumbnails/pkg/service/grpc/v0/decorators/logging.go:46","message":"Failed to execute"}

After restarting the postprocessing, eventhistory and userlog services, I get a thumbnail after uploading a new file and can also download that file again.

The ocis postprocessing restart command actually does not fix the interrupted postprocessing. It creates this error log line: postprocessing-896958dff-ktnvl postprocessing {"level":"error","service":"postprocessing","uploadID":"19b9cd12-8ca4-4b5a-9ced-f295faa18a3f","error":"expected only one result for '19b9cd12-8ca4-4b5a-9ced-f295faa18a3f', got 0","time":"2023-08-17T07:54:44.131766697Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:99","message":"cannot get upload"}

Setup

oCIS 3.1.0-rc.1

Additional context


About setting replicas for a stream: #7023

@wkloucek wkloucek added Type:Bug Severity:sev2-high operations severely restricted, workaround available p3-medium labels Aug 17, 2023
@kobergj
Collaborator

kobergj commented Aug 17, 2023

Are we using nats as store for postprocessing and eventhistory? That is where the errors come from.

The reconnect/error handling when the connection breaks is done either by go-micro or by nats directly; I need to dig deeper to find out which. We are just listening to a channel. Maybe it is not closed correctly, or we don't handle the closed channel correctly.

The idea was that services crash when they lose the nats connection, triggering a restart. Should we change that logic or just fix it?

@wkloucek
Contributor Author

Are we using nats as store for postprocessing and eventhistory? That is where the errors come from.

Yes, nats-js is used as store for postprocessing, eventhistory and userlog https://github.com/owncloud/ocis-charts/blob/ebc13a5369ff954e66266933d416e077e4b93c65/deployments/ocis-nats/helmfile.yaml#L79-L82

The reconnect/error handling when the connection breaks is done either by go-micro or by nats directly; I need to dig deeper to find out which. We are just listening to a channel. Maybe it is not closed correctly, or we don't handle the closed channel correctly.

This is the case for the nats-js events implementation (https://github.com/go-micro/plugins/tree/main/v4/events/natsjs). But the nats-js store does not seem to make use of channels (https://github.com/go-micro/plugins/tree/main/v4/store/nats-js).

Seems like we indeed have a closed connection: https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L92

It theoretically should reconnect, but there is also an upper bound of 60 reconnects, which could be reached quickly under some circumstances (https://github.com/nats-io/nats.go/blob/e2ac73b92f5baae9aff730a65a8d8cee9d069e4c/nats.go#L53-L56).
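
For reference, a minimal sketch (not the oCIS code) of how a plain nats.go client can raise that bound; MaxReconnects, ReconnectWait and the handlers are standard nats.go connection options, the URLs are placeholders:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect with unlimited reconnect attempts instead of the default of 60,
	// so a rolling restart of the cluster cannot exhaust the retry budget.
	nc, err := nats.Connect("nats://nats-0:4222,nats://nats-1:4222,nats://nats-2:4222",
		nats.MaxReconnects(-1),            // negative value = retry forever
		nats.ReconnectWait(2*time.Second), // wait between attempts
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("nats disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("nats reconnected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}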

The idea was that services crash when they lose the nats connection, triggering a restart. Should we change that logic or just fix it?

Restarting a service might also be fine, but in this particular situation I would honestly prefer a proper reconnect, since NATS (2 of 3 nodes) is still available at any time. A reconnect should be an order of magnitude faster than relying on a supervisor to restart the service. If a reconnect fails constantly, dying / failing the health check seems like a good idea.

@micbar micbar moved this from Qualification to Prio 2 in Infinite Scale Team Board Sep 1, 2023
@micbar
Contributor

micbar commented Sep 1, 2023

@kobergj @wkloucek We need to get that actionable.

Can you try to formulate what needs to be done?

@wkloucek
Contributor Author

wkloucek commented Sep 1, 2023

@kobergj @wkloucek We need to get that actionable.

Can you try to formulate what needs to be done?

  • Run a NATS cluster. I can help with that.
  • Configure stream replicas to 3.
  • Start oCIS with an active debugger to see what's going on.
  • Restart the NATS nodes one after another.
  • Check whether a proper reconnect happens or in which state it gets stuck.

@micbar micbar removed the p3-medium label Sep 1, 2023
@micbar
Contributor

micbar commented Sep 1, 2023

ok. Thanks.

@wkloucek
Contributor Author

This will be superseded if #7272 is implemented.

@wkloucek wkloucek changed the title from "NATS store reconnect not working" to "NATS reconnect not working" on Sep 21, 2023
@wkloucek
Contributor Author

This also seems to be a problem when using NATS only as a message bus.

It happens when NATS is restarted and oCIS is not.

@kobergj
Collaborator

kobergj commented Sep 25, 2023

We need to tackle this problem on two sides:

  • We need to check in the nats package whether we are handling reconnects correctly or if there is a bug.
  • Independent of that, if we get an error on Publish calls we should fail hard (i.e. kill the service) so clients know that something is wrong (see the sketch after this list) @butonic
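
A rough sketch of the second bullet, under the assumption that we wrap our publish calls in a small helper (publishOrDie is hypothetical, not an existing oCIS function):

package events

import (
	"log"
	"os"
)

// publishOrDie is a hypothetical helper: if publishing still fails after the
// client's own retries, we terminate the process so the supervisor restarts
// the service and callers see a hard failure instead of silently lost events.
func publishOrDie(publish func() error) {
	if err := publish(); err != nil {
		log.Printf("cannot publish event, shutting down: %v", err)
		os.Exit(1)
	}
}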

@micbar
Contributor

micbar commented Sep 25, 2023

@kobergj We debugged that on Friday.

  • NATS running as a single service does reconnect perfectly
  • NATS running in a cluster with 3 replicas doesn't reconnect

@butonic for more information

@kobergj
Collaborator

kobergj commented Sep 25, 2023

That's not my opinion, it's a team decision discussed with: @fschade @aduffeck @case0sh @2403905 @butonic

@butonic
Member

butonic commented Sep 25, 2023

I cannot reproduce nats reconnect errors when only using it as a message bus. The nats client reliably reconnects when killing any nats node.

But when using nats-js as a store (for postprocessing, userlog etc), uploads do get stuck. So it is an issue with using nats-js as the store implementation. (I was using redis as the store before).

@butonic
Member

butonic commented Sep 25, 2023

Ah, so nats-js has two ways of setting the default nats options. And the second one completely overwrites the built-in defaults:

func (n *natsStore) setOption(opts ...store.Option) {
	for _, o := range opts {
		o(&n.opts)
	}

	n.Once.Do(func() {
		n.nopts = nats.GetDefaultOptions()
	})

	// Extract options from context
	if nopts, ok := n.opts.Context.Value(natsOptionsKey{}).(nats.Options); ok {
		n.nopts = nopts
	}

Stepping through this code reveals that we actually pass in nats options via the config ... but they are not merged. They replace the default config ... and we are basically passing in an empty config:

	case TypeNatsJS:
		ttl, _ := options.Context.Value(ttlContextKey{}).(time.Duration)
		// TODO nats needs a DefaultTTL option as it does not support per Write TTL ...
		// FIXME nats has restrictions on the key, we cannot use slashes AFAICT
		// host, port, clusterid
		return natsjs.NewStore(
			append(opts,
				natsjs.NatsOptions(nats.Options{Name: "TODO"}),
				natsjs.DefaultTTL(ttl))...,

I'll fix that to use proper defaults.
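
As a sketch of what that could look like (newNatsJSStore is a hypothetical helper, not the actual oCIS factory; the client name and MaxReconnect value are illustrative): start from nats.GetDefaultOptions() and only override the fields we actually configure, instead of handing over a mostly empty nats.Options struct.

package store

import (
	"time"

	natsjs "github.com/go-micro/plugins/v4/store/nats-js"
	"github.com/nats-io/nats.go"
	"go-micro.dev/v4/store"
)

// newNatsJSStore sketches the fix: keep the nats.go defaults (AllowReconnect,
// MaxReconnect, ReconnectWait, ...) and only override what we really need.
func newNatsJSStore(ttl time.Duration, opts ...store.Option) store.Store {
	nopts := nats.GetDefaultOptions()
	nopts.Name = "ocis kv store" // illustrative client name
	nopts.MaxReconnect = -1      // assumption: keep retrying forever
	return natsjs.NewStore(
		append(opts,
			natsjs.NatsOptions(nopts),
			natsjs.DefaultTTL(ttl))...,
	)
}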

@micbar
Contributor

micbar commented Sep 28, 2023

@butonic please reference the bugfix and close this.

@micbar
Contributor

micbar commented Sep 28, 2023

Fixed.
